MultiCoNER-ES

MultiCoNER is a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation.

Language(s)
Spanish
Year
2022
Domain
Diverse
Text types
Wikipedia
Questions
Search queries
Annotations
Named entities: PERSON, LOCATION, GROUP, CORPORATION, PRODUCT, CREATIVE-WORK
Format
CoNLL
Data access
Public

License
CC-BY-4.0
Number of units
233987
Type of units
Sentences
Training set size
15300 sentences
Test set size
217887 sentences
Development set size
800 sentences
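
Since the data is distributed in CoNLL format, a reader might load it roughly as follows. This is a minimal sketch, not an official loader. It assumes one token per line with the token in the first whitespace-separated column and the tag in the last, blank lines between sentences, and metadata lines starting with "# id". The function name read_conll, the sample sentence, its identifier, and the B-PER/B-LOC tag spellings are illustrative assumptions rather than values taken from the dataset.

```python
from typing import Iterable, Iterator, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-style lines (layout assumed, see lead-in)."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("# id"):
            # Assumed per-sentence metadata line; skipped here.
            continue
        fields = line.split()
        # First column is the token, last column is the tag.
        sentence.append((fields[0], fields[-1]))
    # Emit a trailing sentence when the input lacks a final blank line.
    if sentence:
        yield sentence


# Hypothetical sample in the assumed layout (not copied from the dataset).
sample = """\
# id hypothetical-0001\tdomain=es
gabriel _ _ B-PER
garcía _ _ I-PER
márquez _ _ I-PER
nació _ _ O
en _ _ O
aracataca _ _ B-LOC

"""

sentences = list(read_conll(sample.splitlines()))
for token, tag in sentences[0]:
    print(f"{token}\t{tag}")
```

To read a real split, pass an open file handle instead of the sample lines, for example read_conll(open(path, encoding="utf-8")), where path points to whichever file you downloaded.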

If you have published a result that improves on those in the list, send a message to odesia-comunicacion@lsi.uned.es stating the result and the DOI of the article, and attach a copy of the article if it is not openly available.