MarLiN - A fast word clustering tool



Introduction

MarLiN is a program suite for inducing word clusters from text. It extracts word contexts from large amounts of text and uses them to group similar words, such as vehicles, cities and calendar days. It is based on the algorithm by Martin, Liermann and Ney (1998).
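To give an idea of what "grouping similar words by context" can mean, here is a minimal Python sketch of the kind of objective that class-based bigram clustering typically maximizes. It is an illustration under stated assumptions, not MarLiN's actual code: the function name, the toy sentences and the two hand-made groupings are all hypothetical, and the model assumed is p(v | u) = p(class(v) | class(u)) * p(v | class(v)) with relative-frequency estimates.

```python
import math
from collections import Counter


def class_bigram_log_likelihood(bigrams, word_to_class):
    """Log-likelihood of bigram counts under a class-based bigram model.

    Assumed model (illustrative, not taken from MarLiN's source):
    p(v | u) = p(class(v) | class(u)) * p(v | class(v)),
    with both factors estimated by relative frequencies.
    """
    class_pairs = Counter()  # count of (class(u), class(v)) pairs
    class_left = Counter()   # count of each class as predecessor
    class_right = Counter()  # count of each class as successor
    word_right = Counter()   # count of each word as successor
    for (u, v), n in bigrams.items():
        cu, cv = word_to_class[u], word_to_class[v]
        class_pairs[cu, cv] += n
        class_left[cu] += n
        class_right[cv] += n
        word_right[v] += n

    total = 0.0
    for (u, v), n in bigrams.items():
        cu, cv = word_to_class[u], word_to_class[v]
        transition = class_pairs[cu, cv] / class_left[cu]
        emission = word_right[v] / class_right[cv]
        total += n * (math.log(transition) + math.log(emission))
    return total


# Hypothetical toy corpus: determiners, vehicles and verbs share contexts.
sentences = ["the car drives", "the bus drives", "a car stops", "a bus stops"]
bigrams = Counter()
for sentence in sentences:
    tokens = sentence.split()
    bigrams.update(zip(tokens, tokens[1:]))

# Grouping that follows the shared contexts.
good = {"the": 0, "a": 0, "car": 1, "bus": 1, "drives": 2, "stops": 2}
# Grouping that mixes a determiner with a vehicle.
bad = {"the": 0, "car": 0, "a": 1, "bus": 1, "drives": 2, "stops": 2}

good_ll = class_bigram_log_likelihood(bigrams, good)
bad_ll = class_bigram_log_likelihood(bigrams, bad)
print(good_ll > bad_ll)
```

On this toy corpus the grouping that puts words with the same neighbours into the same class scores higher. A clustering tool of this family would search for such a grouping automatically, for example by repeatedly moving single words to the class that improves the score most.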

Usage

The following code will cluster the words in example.txt into 10 classes.

marlin_count --text example.txt --bigrams bigrams --words words
marlin_cluster --words words --bigrams bigrams --output --c 10

marlin_count extracts a word list and bigram statistics from the text, and marlin_cluster performs the actual clustering on them.
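As a rough picture of the statistics gathered in the counting step, the following Python sketch counts words and adjacent word pairs in whitespace-tokenized sentences. It is only an assumption about what such a step computes: the function name is hypothetical, and it does not reproduce marlin_count's tokenization, sentence-boundary handling or output file format.

```python
from collections import Counter


def count_words_and_bigrams(sentences):
    """Count words and adjacent-word bigrams in tokenized sentences."""
    words = Counter()
    bigrams = Counter()
    for tokens in sentences:
        words.update(tokens)
        # Pair each token with its right neighbour within the sentence.
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


# Hypothetical input: one sentence per line, tokens separated by spaces.
lines = ["the car drives", "the bus drives"]
sentences = [line.split() for line in lines]
words, bigrams = count_words_and_bigrams(sentences)
```

Here `words` maps each word to its frequency and `bigrams` maps each pair of neighbouring words to the number of times it occurs, which is the kind of context information a clustering step can work from.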

Download

Reference: Thomas Müller and Hinrich Schütze. 2015. Robust Morphological Tagging with Word Representations. NAACL.
Contact: Thomas Müller (CIS page)