Towards a Broad Coverage Named Entity Resource: a Data-efficient Approach for Many Diverse Languages

Authors: Silvia Severini, Ayyoob Imani, Philipp Dufter, Hinrich Schütze
Paper: https://aclanthology.org/2022.lrec-1.417

We release named entities (NEs) for 1340 languages, 1134 of which are lowest-resource. The resource mainly contains person and location NEs. It totals 674,493 NEs, or 503 per language on average, and 95% of the languages have at least 300 names. The three best-represented families are Austronesian, Niger-Congo, and Indo-European. However, our coverage broadly includes all major areas of linguistic diversity, including Amazonian (e.g., Kaingang), African (e.g., Sango), and Papua New Guinean (e.g., Saniyo-Hiyewe) languages.
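
The per-language average quoted above follows directly from the reported totals. As a minimal sketch, the arithmetic can be checked as below. The constant and function names are illustrative assumptions, not part of the released data or any accompanying tooling, and the lowest-resource share is derived here from the reported counts rather than stated in the text.

```python
# Figures reported in the description above.
TOTAL_NES = 674_493          # total named entities in the resource
NUM_LANGUAGES = 1340         # languages covered
NUM_LOWEST_RESOURCE = 1134   # languages classed as lowest-resource


def average_nes_per_language(total: int, languages: int) -> float:
    """Return the mean number of NEs per language (hypothetical helper)."""
    return total / languages


avg = average_nes_per_language(TOTAL_NES, NUM_LANGUAGES)
lowest_resource_share = NUM_LOWEST_RESOURCE / NUM_LANGUAGES

print(f"{avg:.0f} NEs per language on average")
print(f"{lowest_resource_share:.1%} of the languages are lowest-resource")
```

Rounding the mean to the nearest integer gives the 503 NEs per language reported above.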

Data

License: Creative Commons CC-BY

Contact: Silvia Severini (Homepage)

If you use the data in your work, please cite the following paper:

@inproceedings{severini-etal-2022-towards,
   title = "Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages",
   author = {Severini, Silvia  and ImaniGooghari, Ayyoob  and Dufter, Philipp  and Sch{\"u}tze, Hinrich},
   booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
   year = "2022",
   publisher = "European Language Resources Association",
   url = "https://aclanthology.org/2022.lrec-1.417",
   pages = "3923--3933",
}