ChipMunk is a tool for labeled segmentation, morphological analysis and stemming. The implementation found here is not the one used in the paper, but a complete rewrite. On this page you can find links to the source code, binaries, pretrained models, datasets and more.
The chipmunk_example.sh script shows how ChipMunk can be trained and run. It is completely self-contained and will download all the needed JARs and datasets.
Assuming an input file input.txt in a one-word-per-line format, words can be segmented by running:
java -cp $LIBS chipmunk.segmenter.cmd.Segment \
    --model-file eng.chipmunk.srl \
    --input-file input.txt \
    --output-file output.txt
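The one-word-per-line input can be produced from running text with standard tools. The following is only a sketch of one possible preprocessing step and is not part of ChipMunk: raw.txt is a hypothetical file name, and punctuation stripping and case normalization are left out because this page does not say what the models expect.

```shell
# Hypothetical sample text; replace raw.txt with your own corpus.
printf 'The horrified crowd\nran  away\n' > raw.txt

# Turn every run of whitespace into a single newline, then drop any
# empty lines, so that input.txt holds exactly one word per line.
tr -s '[:space:]' '\n' < raw.txt | sed '/^$/d' > input.txt

cat input.txt
```

The resulting input.txt can then be passed to --input-file as in the command above.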
As we point out in the paper, ChipMunk can also be used as a stemmer or 'root detector'. Words can be stemmed by running:
java -cp $LIBS chipmunk.segmenter.cmd.Stem \
    --model-file eng.chipmunk.srl \
    --input-file input.txt \
    --output-file output.txt \
    --mode stemming
The model-file can point to the same segmentation model as above. The model should use a tagset of level 2 or higher. The mode can also be set to 'root-detection'.
$ java -cp $LIBS chipmunk.segmenter.cmd.Segment ...
> horrified
horr:ROOT|SEGMENT ifi:DERI|SUFFIX|SEGMENT ed:INFL|SUFFIX|SEGMENT
$ java -cp $LIBS chipmunk.segmenter.cmd.Stem ... --mode stemming
> horrified
horrifi
$ java -cp $LIBS chipmunk.segmenter.cmd.Stem ... --mode root-detection
> horrified
horr
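The labeled output shown for 'horrified' can be post-processed with standard text tools. The sketch below assumes that each output line consists of space-separated segment:TAG|TAG tokens, exactly as in that example; this format is inferred from the example and not from a specification, so treat the parsing as an assumption.

```shell
# Sample line copied from the 'horrified' example.
line='horr:ROOT|SEGMENT ifi:DERI|SUFFIX|SEGMENT ed:INFL|SUFFIX|SEGMENT'

# Split each token at its first colon: the part before it is the segment,
# the part after it is the tag path. Prints one "segment<TAB>tags" row each.
printf '%s\n' "$line" | awk '{
  for (i = 1; i <= NF; i++) {
    pos = index($i, ":")
    printf "%s\t%s\n", substr($i, 1, pos - 1), substr($i, pos + 1)
  }
}'

# Drop the tags and join the bare segments with "+".
segments=$(printf '%s\n' "$line" | sed 's/:[^ ]*//g; s/ /+/g')
echo "$segments"   # prints horr+ifi+ed
```

Splitting at the first colon keeps the full tag path intact, so the same loop works whatever tag level the model was trained with, as long as segments themselves contain no colon.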
Assuming a training file in the format found in the dataset archives provided below, a new model can be trained by running:
java -cp $LIBS chipmunk.segmenter.cmd.Train \
    --train-file supplement/seg/eng/trn \
    --lang eng \
    --verbose false \
    --crf-mode true \
    --tag-level 0 \
    --dictionary-paths "supplement/additional/eng/aspell.txt ..." \
    --model-file eng.chipmunk.srl
The tag-level option sets the tagset granularity, as discussed in the paper. The crf-mode option chooses between structured averaged perceptron training and L-BFGS-based CRF training (using the L-BFGS implementation of Mallet).
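Since stemming expects a model with a tagset of level 2 or higher, while the training example uses tag-level 0, a model intended for the Stem command needs a different tag-level. The following dry-run sketch only prints the training command for a chosen level instead of executing it. The model file name eng.level2.chipmunk.srl and the DICTS variable are hypothetical; DICTS stands in for the dictionary list, which is elided on this page and is therefore not filled in here.

```shell
# Build and print the training command for a given tag level.
# $LIBS and $DICTS are printed literally; set them in your shell and
# remove "echo" to actually run the training.
train_cmd() {
  level=$1
  echo java -cp "\$LIBS" chipmunk.segmenter.cmd.Train \
    --train-file supplement/seg/eng/trn \
    --lang eng \
    --verbose false \
    --crf-mode true \
    --tag-level "$level" \
    --dictionary-paths "\$DICTS" \
    --model-file "eng.level$level.chipmunk.srl"
}

# Tag level 2 is the lowest level stated to be suitable for stemming.
cmd=$(train_cmd 2)
printf '%s\n' "$cmd"
```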
Reference: Ryan Cotterell, Thomas Müller, Alex Fraser and Hinrich Schütze. 2015. Labeled Morphological Segmentation with Semi-Markov Models. CoNLL (to appear).
Contact: Thomas Müller (CIS page)