Welcome to Cistern
Cistern is the principal repository of tools and resources released by the Center for Information and Language Processing (CIS) of the University of Munich (LMU).
The CIS conducts research on linguistically informed statistical natural language processing (NLP), including problems such as part-of-speech tagging, parsing and sentiment analysis.
On this site, we store and share tools and resources such as data sets, lexicons, binaries and models.
2022 projects
2021 projects
2020 projects
- reasoning-over-facts Code and data to test pretrained language models' reasoning capabilities
- COVID-QA Extractive QA on biomedical publications about COVID-19, with inexpensive domain adaptation of BERT
- E-BERT Code and pretrained models: Efficient-Yet-Effective Entity Embeddings for BERT
- LAMA misprimed & negated Misprimed and negated knowledge probes for pretrained language models
- LAMA-UHN A harder version of the LAMA benchmark (integrated into the Facebook Research LAMA repo)
- ContraCAT Contrastive Coreference Analytical Templates (for Machine Translation)
- Unsupervised KG-to-text and text-to-KG conversion - Code and benchmark for unsupervised text generation from knowledge graphs and semantic parsing
- SimAlign is a high-quality word alignment tool that uses static and contextualized embeddings and does not require parallel training data.
- PET is a framework for few-shot learning using task descriptions in natural language.
- FewGLUE is a dataset for few-shot text classification derived from the SuperGLUE dataset.
- DagoBERT is a BERT-based model for generating derivationally complex words.
2019 projects
- Low Resource CQA Evaluation data for unsupervised Duplicate Question Detection, based on 12 low-resource Stack Exchange forums
- DensRay - interpretable dimension in word embedding spaces
- SherLIiC - a hard NLI evaluation benchmark using lexical inference in context (LIiC)
- BERTRAM is a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models.
- WNLaMPro is a dataset that can be used to measure how well BERT and other masked language models understand rare words.
2018 projects
Earlier projects
- MED - Code of the LMU system for the SIGMORPHON 2016 shared task on morphological reinflection
- corefResources - two corpora of automatically extracted coreference chains: (1) KBPchains, (2) English Gigaword data
- noise-mitigation - Noise Mitigation for Neural Entity Typing and Relation Extraction
- FIGMENT - a fine-grained embedding-based entity typer
- FIGMENT2 - fine-grained entity typing using multi-level representation of entities
- Lemming - a flexible and accurate lemmatizer
- MarMoT - a fast and accurate morphological tagger
- ChipMunk - a morphological segmenter and analyzer
- LatMor - a Latin computational morphology
- MarLiN - a fast word clustering tool
- BitPar - a parser for highly ambiguous probabilistic context-free grammars
- TreeTagger - a tool for annotating text with part-of-speech and lemma information
- RFTagger - a tool for the annotation of text with fine-grained POS tags
- Ocrocis - a project manager for the OCR toolkit Ocropy by Thomas Breuel
- SFST - a finite state transducer toolkit
- SMOR - a German computational morphology
- AttentionUncertainty - attention methods for uncertainty detection
- AutoExtend - extending word embeddings
- CoSimRank - a fast and accurate graph-based similarity measure
- GlobalNormalization - The code, parameters and prepared dataset used for global normalization of convolutional neural networks for joint entity and relation classification
- Open Relation Argument Extraction - Corpus and code for extracting relation arguments of non-standard type.
- SFbenchmark - relation classification benchmark for Slot Filling
- CIS_SlotFilling - the CIS slot filling system
- semiCRF - a character-based neural network with semi-Markov CRF output layer for robust multilingual part-of-speech tagging