Welcome to Cistern

Cistern is the principal repository of tools and resources released by the Center for Information and Language Processing (CIS) of the University of Munich (LMU).
The CIS conducts research on linguistically informed statistical natural language processing (NLP), including problems such as part-of-speech tagging, parsing and sentiment analysis.
On this site, we store and share tools and resources such as data sets, lexicons, binaries and models.

2022 projects

GLP Part-Of-Speech models
Bible Named Entities Resource
Low-Resource Contextualized Embeddings - Improving Low-Resource Languages in Pre-Trained Language Models

2021 projects

Preprocessed Europeana NER data - Data Centric Domain Adaptation for Historical Text with OCR Errors
Anchor-Embeddings - Anchor-based Bilingual Word Embeddings for Low-Resource Languages
StaticLAMA
Language Models for Lexical Inference in Context - Code and data splits for applying pretrained language models to the LIiC task
mLAMA - Code and data

2020 projects

reasoning-over-facts - Code and data to test pretrained language models' reasoning capabilities
COVID-QA - Extractive QA on biomedical publications about COVID-19, with inexpensive domain adaptation of BERT
E-BERT - Code and pretrained models: Efficient-Yet-Effective Entity Embeddings for BERT
LAMA misprimed & negated - Misprimed and negated knowledge probes for pretrained language models
LAMA-UHN - A harder version of the LAMA benchmark (integrated into the Facebook Research LAMA repo)
ContraCAT - Contrastive Coreference Analytical Templates (for Machine Translation)
Unsupervised KG-to-text and text-to-KG conversion - Code and benchmark for unsupervised text generation from knowledge graphs and semantic parsing
SimAlign is a high-quality word alignment tool that uses static and contextualized embeddings and does not require parallel training data.
PET is a framework for few-shot learning using task descriptions in natural language.
FewGLUE is a dataset for few-shot text classification derived from the SuperGLUE dataset.
DagoBERT is a BERT-based model for generating derivationally complex words.

2019 projects

Low Resource CQA - Evaluation data for unsupervised Duplicate Question Detection, based on 12 low-resource Stack Exchange forums
DensRay - interpretable dimension in word embedding spaces
SherLIiC - a hard NLI evaluation benchmark using lexical inference in context (LIiC)
BERTRAM is a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models.
WNLaMPro is a dataset that can be used to measure how well BERT and other masked language models understand rare words.

2018 projects

Input optimization for NLP - Creating interpretable representations for neurons in an RNN.
Comult - Embeddings for 1000+ languages.

Earlier projects

MED - Code of the LMU system for the SIGMORPHON 2016 shared task on morphological reinflection
corefResources - two corpora of automatically extracted coreference chains: (1) KBPchains, (2) English Gigaword data
noise-mitigation - Noise Mitigation for Neural Entity Typing and Relation Extraction
FIGMENT - a fine-grained embedding-based entity typer
FIGMENT2 - fine-grained entity typing using multi-level representation of entities
Lemming - a flexible and accurate lemmatizer
MarMoT - a fast and accurate morphological tagger
ChipMunk - a morphological segmenter and analyzer
LatMor - a Latin computational morphology
MarLiN - a fast word clustering tool
BitPar - a parser for highly ambiguous probabilistic context-free grammars
TreeTagger - a tool for annotating text with part-of-speech and lemma information
RFTagger - a tool for the annotation of text with fine-grained POS tags
Ocrocis - a project manager for the OCR toolkit Ocropy by Thomas Breuel
SFST - a finite state transducer toolkit
SMOR - a German computational morphology
AttentionUncertainty - attention methods for uncertainty detection
AutoExtend - extending word embeddings
CoSimRank - a fast and accurate graph based similarity measure
GlobalNormalization - The code, parameters and prepared dataset used for global normalization of convolutional neural networks for joint entity and relation classification
Open Relation Argument Extraction - Corpus and code for extracting relation arguments of non-standard type.
SFbenchmark - relation classification benchmark for Slot Filling
CIS_SlotFilling - the CIS slot filling system
semiCRF - a character-based neural network with semi-Markov CRF output layer for robust multilingual part-of-speech tagging