ChipMunk - A morphological segmenter and stemmer



ChipMunk is a tool for labeled segmentation, morphological analysis and stemming. The implementation found here is not the one used in the paper, but a complete rewrite. On this page you can find links to the source code, binaries, pretrained models, datasets and more.

Usage

The chipmunk_example.sh script shows how ChipMunk can be trained and run. It is completely self-contained and will download all the needed JARs and datasets.
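
The commands in the following sections refer to $LIBS, which this page does not define; it presumably holds the Java classpath of the JARs that the example script downloads. As a minimal sketch, assuming those JARs sit in a single directory, the classpath could be assembled like this (the function name build_classpath and the directory name lib are hypothetical, not part of ChipMunk):

```shell
# Join every .jar in a directory into a colon-separated classpath.
# build_classpath and the "lib" directory are illustrative assumptions.
build_classpath() {
    jar_dir=$1
    classpath=""
    for jar in "$jar_dir"/*.jar; do
        # If the directory has no JARs, the pattern stays unexpanded; skip it.
        [ -e "$jar" ] || continue
        classpath="${classpath:+$classpath:}$jar"
    done
    printf '%s\n' "$classpath"
}

LIBS=$(build_classpath lib)
```

A colon-joined list is used instead of Java's lib/* classpath wildcard because the commands on this page pass $LIBS unquoted, and the shell would expand an unquoted wildcard into several separate arguments.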


Segmentation

Assuming an input file input.txt in a one-word-per-line format, words can be segmented by running:

java -cp $LIBS chipmunk.segmenter.cmd.Segment\
           --model-file eng.chipmunk.srl\
           --input-file input.txt\
           --output-file output.txt
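
If the words are not yet in one-word-per-line format, a whitespace-tokenized text can be converted first. The following is only a sketch: the helper name to_word_list is hypothetical, punctuation stays attached to the words, and whether ChipMunk expects any further normalization (such as lowercasing) is not stated here.

```shell
# Split whitespace-separated text into one word per line.
# to_word_list is an illustrative helper, not part of ChipMunk.
to_word_list() {
    # Turn each run of whitespace into a single newline, then drop empty lines.
    tr -s '[:space:]' '\n' | sed '/^$/d'
}

printf 'the horrified chipmunks\nran away\n' | to_word_list > input.txt
```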

Stemming

As we point out in the paper, ChipMunk can also be used as a stemmer or 'root detector'. Words can be stemmed by running:

java -cp $LIBS chipmunk.segmenter.cmd.Stem\
        --model-file eng.chipmunk.srl\
        --input-file input.txt\
        --output-file output.txt\
        --mode stemming

The model-file can point to the same segmentation model as above, provided that model uses a tagset of level 2 or higher (see tag-level under Training). The mode can also be set to 'root-detection'.

Here is the output of our English model for the word 'horrified':
$ java -cp $LIBS chipmunk.segmenter.cmd.Segment ...
> horrified	horr:ROOT|SEGMENT ifi:DERI|SUFFIX|SEGMENT ed:INFL|SUFFIX|SEGMENT
$ java -cp $LIBS chipmunk.segmenter.cmd.Stem ... --mode stemming
> horrified	horrifi
$ java -cp $LIBS chipmunk.segmenter.cmd.Stem ... --mode root-detection
> horrified	horr
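
The segmenter output shown above can be post-processed with standard tools. The sketch below assumes that each line of the output file holds the word, a tab, and then space-separated segment:TAGS entries, without the leading '>' marker used in the listing; the helper name extract_segments is hypothetical.

```shell
# Reduce each labeled segmentation to a plain hyphenated segmentation.
# extract_segments is an illustrative helper, not part of ChipMunk.
extract_segments() {
    awk -F '\t' '{
        n = split($2, parts, " ")          # one entry per segment:TAGS pair
        out = ""
        for (i = 1; i <= n; i++) {
            seg = parts[i]
            sub(/:.*/, "", seg)            # drop everything from the first colon on
            out = out (i > 1 ? "-" : "") seg
        }
        print $1 "\t" out
    }'
}

printf 'horrified\thorr:ROOT|SEGMENT ifi:DERI|SUFFIX|SEGMENT ed:INFL|SUFFIX|SEGMENT\n' | extract_segments
```

For the example line this prints the word followed by horr-ifi-ed. A segment that itself contains a colon would be cut short by this sketch.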

Training

Assuming a training file in the format found in the dataset archives provided below, a new model can be trained by running:

java -cp $LIBS chipmunk.segmenter.cmd.Train\
	 --train-file supplement/seg/eng/trn\
	 --lang eng\
	 --verbose false\
	 --crf-mode true\
	 --tag-level 0\
	 --dictionary-paths "supplement/additional/eng/aspell.txt ..."\
	 --model-file eng.chipmunk.srl

The tag-level option sets the tagset granularity, as discussed in the paper. The crf-mode option chooses between structured averaged perceptron training and LBFGS-based CRF training (the latter uses the LBFGS implementation of Mallet).


Downloads

Reference: Ryan Cotterell, Thomas Müller, Alex Fraser and Hinrich Schütze. 2015. Labeled Morphological Segmentation with Semi-Markov Models. CoNLL (to appear).
Contact: Thomas Müller (CIS page)