ChipMunk - A morphological segmenter and stemmer



ChipMunk is a tool for labeled segmentation, morphological analysis and stemming. The implementation found here is not the one used in the paper, but a complete rewrite. On this page you can find links to the source code, binaries, pretrained models, datasets and more.


The script shows how ChipMunk can be trained and run. It is completely self-contained and will download all the needed JARs and datasets.
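The commands on this page refer to a classpath variable $LIBS. As a minimal sketch, and assuming the JARs have been collected in a single directory, such a variable could be built as follows. The helper name build_classpath and the directory name lib are hypothetical and not taken from the script.

```shell
# Hypothetical helper (not part of ChipMunk): joins every *.jar file in a
# directory into a colon-separated Java classpath string.
build_classpath() {
    dir=$1
    classpath=""
    for jar in "$dir"/*.jar; do
        [ -e "$jar" ] || continue          # skip the literal glob if no JARs exist
        classpath="${classpath:+$classpath:}$jar"
    done
    printf '%s\n' "$classpath"
}

# Assumption: the downloaded JARs were placed in ./lib
LIBS=$(build_classpath lib)
export LIBS
```

If the directory contains no JARs, the function prints an empty line, so it is worth checking that $LIBS is non-empty before calling java.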


Assuming an input file input.txt in a one-word-per-line format, words can be segmented by running:

java -cp $LIBS chipmunk.segmenter.cmd.Segment\
           --input-file input.txt\
           --output-file output.txt
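If the text to be segmented is not yet in one-word-per-line format, it can be converted with standard Unix tools. The following is a minimal sketch and not part of ChipMunk: the function name to_word_per_line and the sample sentence are made up for illustration, and splitting on whitespace is a simplifying assumption (punctuation is not handled).

```shell
# Hypothetical helper (not part of ChipMunk): reads text on stdin and
# writes one whitespace-separated token per line, dropping empty lines.
to_word_per_line() {
    tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Example: build input.txt from a short sample sentence.
printf 'the children were horrified\n' | to_word_per_line > input.txt
```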


As we point out in the paper, ChipMunk can also be used as a stemmer or 'root detector'. Words can be stemmed by running:

java -cp $LIBS chipmunk.segmenter.cmd.Stem\
        --input-file input.txt\
        --output-file output.txt\
        --mode stemming

The model-file can point to the same segmentation model as above. The model should use a tagset of level 2 or higher. The mode can also be set to 'root-detection'.

Here is the output of our English model for the word 'horrified':
$ java -cp $LIBS chipmunk.segmenter.cmd.Segment ...
$ java -cp $LIBS chipmunk.segmenter.cmd.Stem ... --mode stemming
> horrified	horrifi
$ java -cp $LIBS chipmunk.segmenter.cmd.Stem ... --mode root-detection
> horrified	horr
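The stemmer output shown above pairs each word with its stem. Assuming the output file uses the same tab-separated word/stem layout as the printed example (an assumption, not a documented guarantee), the stems can be pulled out with cut. The file name sample_output.txt is hypothetical and stands in for a real output file.

```shell
# Hypothetical sample that mirrors the stemming example (word<TAB>stem);
# in practice this file would be written by chipmunk.segmenter.cmd.Stem.
printf 'horrified\thorrifi\n' > sample_output.txt

# Keep only the second column, assuming tab-separated output.
stems=$(cut -f2 sample_output.txt)
echo "$stems"
```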


Assuming a training file in the format found in the dataset archives provided below, a new model can be trained by running:

java -cp $LIBS chipmunk.segmenter.cmd.Train\
	 --train-file supplement/seg/eng/trn\
	 --lang eng\
	 --verbose false\
	 --crf-mode true\
	 --tag-level 0\
	 --dictionary-paths "supplement/additional/eng/aspell.txt ..."

tag-level sets the tagset granularity as discussed in the paper. crf-mode chooses between structured averaged perceptron training and LBFGS-based CRF training (the latter uses the LBFGS implementation of Mallet).


Reference: Ryan Cotterell, Thomas Müller, Alex Fraser and Hinrich Schütze. 2015. Labeled Morphological Segmentation with Semi-Markov Models. CoNLL (to appear).
Contact: Thomas Müller (CIS page)