Lemming - A flexible and accurate lemmatizer

(last update: 22/10/2015)



Lemming is a statistical lemmatizer, a tool that maps a word form to its canonical base form. Lemming needs part-of-speech information and can be run as part of a pipeline or jointly with MarMoT. On this page you can find links to the source code, binaries and pretrained models.

Usage

Application

To apply a pipeline Lemming model (lemming.srl) together with a MarMoT model (marmot.srl), run:

java -Xmx5g -cp marmot.jar:trove.jar marmot.morph.cmd.Annotator\
            -model-file marmot.srl\
            -lemmatizer-file lemming.srl\
            -test-file form-index=0,raw.tsv\
            -pred-file lemma.tsv

where 'raw.tsv' is in a one-token-per-line format, with sentence boundaries indicated by an empty line.
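
As a minimal sketch of the input format described above, the following creates a small 'raw.tsv' with two sentences. The example tokens are made up for illustration; the only things taken from this page are the one-token-per-line layout, the empty line between sentences, and the fact that form-index=0 points at the first column.

```shell
# Write two sentences, one token per line, separated by an empty line.
# The token is the only column, i.e. column 0 (form-index=0).
printf '%s\n' The lemmings are running . '' They jumped . > raw.tsv

# Sanity check: no line should contain more than one tab-separated column.
awk -F'\t' 'NF > 1 { bad++ } END { exit (bad > 0) }' raw.tsv \
  && echo "raw.tsv has one token per line"
```

Whether a trailing empty line after the last sentence is needed is not stated here, so the sketch leaves it out.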


Training

The command line interface for training new models is currently in a very bad state. Here we explain how to train pipeline models. Training can benefit from different lexical resources such as spelling lexicons, unigram statistics and (NEW) cluster indexes. The training described here differs from the experiments run in the paper in that we use certain inexact feature tables and SGD-based training. Accuracy should still be comparable (or even better when using the word clusters). Feel free to contact us if you run into any problems training new models.
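
The unigram statistics mentioned above can be produced from any tokenized corpus. The following is only a sketch: 'corpus.txt' is a hypothetical whitespace-tokenized text file, and the tab between form and count is an assumption, since the sample in the training script only shows a form followed by a count.

```shell
# Hypothetical input: a whitespace-tokenized plain-text corpus.
printf '%s\n' 'the film was only a trailer' 'the movie was up' > corpus.txt

# Put one token per line, count duplicates, and sort by frequency.
# Output columns: form, count (the tab separator is an assumption).
tr -s '[:space:]' '\n' < corpus.txt \
  | grep -v '^$' \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2,2 \
  | awk '{ print $2 "\t" $1 }' > counts.1.tsv
```

The resulting file can then be used as the unigram_file in the training script; for real use the counts would come from a much larger corpus such as Wikipedia.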

## OPTIONS
# Use POS only model and log-linear model
options=use-morph=false,use-perceptron=false
# Use SGD instead of Mallet LBFGS, extract features online (saves memory)
options=$options,use-mallet=false,offline-feature-extraction=false
# Extract POS tag-dependent Edit Trees (fewer candidates per token)
options=$options,tag-dependent=true
# Hash feature tables save memory at a small drop in accuracy
options=$options,use-hash-feature-table=true
# An optional file that maps word forms to cluster ids (have a look at MarLiN)
# is configured further down, together with the other lexical resources.

# OPTIONAL lexical resources
# unigram counts from e.g. Wikipedia:
#  ...
#  up      1644807
#  only    1639016
#  most    1602804
#  ...
unigram_file=counts.1.tsv

# a lexicon e.g. Aspell:
#  ...
#  ethology
#  ethological
#  ethology's
#  ethologist
#  ...
aspell_file=aspell.txt
options=$options,unigram-file=min-count=5"\,"${unigram_file}";"min-count=1"\,"${aspell_file}

# A MarLiN cluster file:
#  ...
#  film 100_99
#  movie 100_99
#  trailer 100_99
#  film 200_118
#  movie 200_118
#  trailer 200_159
#  ...
cluster_file=marlin.tsv
options=$options,cluster-file=$cluster_file

# Here train.conll is some treebank in CoNLL 2006 format.
# Other formats can be used if the indexes are set accordingly.
# The options string is mandatory, but can be "_" if you do not want
# to set any options.
time java -Xmx20g -cp marmot.jar:mallet.jar:trove.jar\
         lemming.lemma.cmd.Trainer\
         lemming.lemma.ranker.RankerTrainer\
         $options\
         lemming.srl\
         form-index=1,lemma-index=2,tag-index=3,morph-index=5,data/train.conll
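
Before training on a different format it can help to check which columns the index settings select. The sketch below assumes the standard tab-separated CoNLL 2006 column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, ...) and zero-based indexes, which is consistent with form-index=0 for the single-column raw.tsv above. The file 'data/sample.conll' and its two tokens are made up for illustration.

```shell
# Tiny hypothetical treebank fragment in CoNLL 2006 layout (tab-separated):
# ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
mkdir -p data
printf '1\tlemmings\tlemming\tN\tNNS\tnum=pl\t2\tSBJ\t_\t_\n2\tran\trun\tV\tVBD\ttense=past\t0\tROOT\t_\t_\n\n' > data/sample.conll

# Zero-based indexes 1, 2, 3 and 5 (form, lemma, tag, morph) correspond to
# awk fields $2, $3, $4 and $6. Empty lines (sentence breaks) are skipped.
awk -F'\t' 'NF { print $2 "\t" $3 "\t" $4 "\t" $6 }' data/sample.conll
```

If the printed columns are not form, lemma, tag and morphological features in that order, the index values passed to the trainer probably need adjusting.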

Downloads

Contact: Thomas Müller (CIS page)