(named) entity recognition

MultiCoNER-ES

MULTICONER is a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions.

DIANN-2018-ES

The corpus is a collection of 500 abstracts from Elsevier journal papers in the biomedical domain, collected between 2017 and 2018. It is divided into two disjoint parts: a training set (80%) and a test set (20%). It is annotated with disabilities, negations, and the scope of those negations.

MedProcNER/ProcTEMIST corpus 2023

Dataset of 1,000 clinical case reports in which multiple clinical experts manually annotated mentions of clinical procedures. The case reports were selected by clinical experts and belong to various medical specialties including, amongst others, oncology, odontology, urology, and psychiatry. They are the same text documents that were used for the DisTEMIST corpus and shared task on diseases, building towards a collection of fully annotated texts for clinical concept recognition and normalization.

MEDDOPLACE Corpus: Gold Standard annotations for Medical Documents Place-related Content Extraction

The MEDDOPLACE Gold Standard corpus is a collection of 1,000 clinical case reports in Spanish from various medical specialties such as psychiatry, neurology, travel medicine, infectious diseases, cardiology, occupational medicine and oncology. The corpus is annotated both with places and locations and with location classes of clinical relevance: (a) birthplace, (b) residence, (c) movement, and (d) healthcare attention.

MultiCoNER v2 ES

MULTICONER is a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions.