We release our named entities (NEs) for 1340 languages, 1134 of which are lowest-resource. The resource mainly contains person and location NEs. The total number of NEs is 674,493, which gives an average of 503 NEs per language; 95% of the languages have at least 300 names. The three best-represented families are Austronesian, Niger-Congo, and Indo-European. However, our coverage broadly includes all major areas of linguistic diversity, including Amazonian (e.g., Kaingang), African (e.g., Sango), and Papua New Guinean (e.g., Saniyo-Hiyewe) languages.
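As a quick sanity check on the figures above, the per-language average and the share of lowest-resource languages can be recomputed from the quoted totals. This is only an arithmetic sketch: the variable names are illustrative and nothing here reads the released files.

```python
# Figures quoted in the description above (not read from the released data).
total_nes = 674_493        # total number of named entities
num_languages = 1340       # languages covered
lowest_resource = 1134     # languages classed as lowest-resource

# Average number of NEs per language.
avg_per_language = total_nes / num_languages

# Fraction of the covered languages that are lowest-resource.
lowest_resource_share = lowest_resource / num_languages

print(f"Average NEs per language: {avg_per_language:.0f}")
print(f"Share of lowest-resource languages: {lowest_resource_share:.1%}")
```

The average rounds to 503, matching the stated figure, and roughly 85% of the covered languages are lowest-resource.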
License: Creative Commons CC-BY
Contact: Silvia Severini (Homepage)
If you use the data in your work, please cite the following paper:
@misc{severini2022broad,
  title={Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages},
  author={Silvia Severini and Ayyoob Imani and Philipp Dufter and Hinrich Schütze},
  year={2022},
  eprint={2201.12219},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}