MULTICONER is a large multilingual dataset for Named Entity Recognition (NER) that covers three domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixed subsets. The dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. The 26M-token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation.
Language(s)
Spanish
Dataset description link
Year
2022
Domain
Diverse
Text types
Wikipedia
Questions
Search queries
Annotations
Named entities: PERSON, LOCATION, GROUP, CORPORATION, PRODUCT, CREATIVE-WORK
Format
CoNLL
Data access
Public
Publication link
License
CC-BY-4.0
NLP Topic
Number of units
233987
Type of units
Sentences
Sentences
233987
Training set size
15300 sentences
Test set size
217887 sentences
Development set size
800 sentences
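
Since the data is distributed in CoNLL format, a reader for it can be sketched in a few lines of Python. The sketch below rests on assumptions not stated on this card: one token per line with the token in the first column and the BIO tag in the last, blank lines separating sentences, and optional comment lines starting with "# id". The function name read_conll, the sample sentences, their ids, and the abbreviated tag names (PER, LOC, CW) are hypothetical and serve only as illustration; check them against the actual files before relying on this.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumed layout (not confirmed by the dataset card): token in the
    first column, tag in the last column, blank line between sentences,
    optional "# id ..." comment line before each sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("# id"):
            # Skip sentence-level metadata comments.
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))
    if current:
        # Flush the last sentence if the input lacks a trailing blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed layout (not taken from the dataset).
sample = """\
# id hypothetical-0001\tdomain=es
gabriel _ _ B-PER
garcía _ _ I-PER
márquez _ _ I-PER
nació _ _ O
en _ _ O
aracataca _ _ B-LOC

# id hypothetical-0002\tdomain=es
cien _ _ B-CW
años _ _ I-CW
de _ _ I-CW
soledad _ _ I-CW
"""

sentences = read_conll(sample.splitlines())
for sentence in sentences:
    print(sentence)
```

Applied to a full split file, len(read_conll(open(path, encoding="utf-8"))) should match the sentence counts listed above, provided the layout assumptions hold; path here stands for wherever the file was downloaded to.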