MultiCoNER is a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. The dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions.
Language(s)
Spanish
Dataset description link
Year
2023
Domain
General
Text types
Wiki sentences
Questions
Search queries
Data access
Public
Data link
Publication
"Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko (2022) MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition. Proceedings of the 29th International Conference on Computational Linguistics, pages 3798–3809, October 12–17, 2022."
Publication link
License
CC-BY-4.0
NLP Topic
Number of units
264207
Type of units
Sentences
Sentences
264207
Training set size
16453
Test set size
246900
Development set size
854
Size - additional information
named entities