MultiCoNER-ES | Portal ODESIA

MultiCoNER is a large multilingual dataset for Named Entity Recognition (NER) that covers three domains (Wikipedia sentences, questions, and search queries) across 11 languages, plus multilingual and code-mixed subsets. It is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. The 26M-token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation.

Language(s)

Spanish

Dataset description link

https://aclanthology.org/2022.coling-1.334.pdf

Year

2022

Domain

Diverse

Text types

Wikipedia

Questions

Search queries

Annotations

Named entities: PERSON, LOCATION, GROUPS, CORPORATION, PRODUCT, CREATIVE-WORK

Format

CoNLL
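
For readers unfamiliar with CoNLL-style files, the following is a minimal reading sketch, not the dataset's official loader. It assumes one token per line with whitespace-separated columns (token first, tag last), blank lines between sentences, and metadata lines starting with "#". The function name `read_conll`, the sample text, and the tag strings in it are hypothetical and only illustrate the layout; check them against the downloaded files.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-style lines.

    Assumptions (not confirmed by the catalog entry): columns are
    whitespace-separated, the token is the first column, the tag is the
    last column, and lines starting with '#' carry metadata only.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Metadata line; a real token beginning with '#' would need
            # extra handling that this sketch does not attempt.
            continue
        parts = line.split()
        sentence.append((parts[0], parts[-1]))
    if sentence:
        # Flush the last sentence when the input has no trailing blank line.
        yield sentence


# Invented sample for illustration only; ids and tags are hypothetical.
SAMPLE = """\
# id hypothetical-0001
la _ _ O
casa _ _ B-CW
de _ _ I-CW
papel _ _ I-CW

madrid _ _ B-LOC
"""

sentences = list(read_conll(SAMPLE.splitlines()))
for sent in sentences:
    print(sent)
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points at a downloaded split.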

Data access

Public

Data link

https://registry.opendata.aws/multiconer/

Publication link

https://aclanthology.org/2022.coling-1.334.pdf

License

CC-BY-4.0

NLP Topic

(named) entity recognition

Number of units

233987

Type of units

Sentences

Training set size

15300 sentences

Test set size

217887 sentences

Development set size

800 sentences
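
The split sizes listed above can be checked against the reported total with a few lines of arithmetic. This is only a sanity-check sketch; the names `SPLITS`, `TOTAL`, and `split_shares` are hypothetical, and the figures are copied from this entry.

```python
# Split sizes as listed in this catalog entry (sentences).
SPLITS = {"train": 15300, "dev": 800, "test": 217887}
TOTAL = 233987  # "Number of units" reported in this entry


def split_shares(splits: dict) -> dict:
    """Return each split's fraction of the summed split sizes."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


# The three splits should add up to the reported number of units.
assert sum(SPLITS.values()) == TOTAL

shares = split_shares(SPLITS)
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The test split is much larger than the training split, which is worth keeping in mind when budgeting evaluation time.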
