The corpus consists of (i) translated and manually curated documents with tmVar3 annotations (Wei et al., 2022), including PubMed abstracts, to which associated diseases and symptoms were added; and (ii) manually annotated PubMed abstracts in Spanish.
Language(s)
Spanish
Dataset description link
Year
2024
Domain
Biology
Annotations
Each annotation includes: pmid (PubMed article ID), start and end (positions of the mention in the text), term (the exact text of the mention), and entity, which is one of Disease, Gene, Transcript (transcript variant), DNAMutation (DNA mutation), or OtherMutation (other mutations, such as exon-level or missense mutations).
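As a rough illustration of the annotation fields listed above, the following Python sketch parses one annotation record and checks it against its source text. The card does not specify the file layout, so the tab-separated column order (pmid, start, end, term, entity), the 0-based offsets with an exclusive end, and the names `Annotation`, `parse_annotation`, and `matches_text` are all assumptions made for this example and should be checked against the released files.

```python
from dataclasses import dataclass

# Entity labels as listed in the Annotations field of this card.
ENTITY_TYPES = {"Disease", "Gene", "Transcript", "DNAMutation", "OtherMutation"}


@dataclass(frozen=True)
class Annotation:
    pmid: str    # PubMed article ID
    start: int   # assumed: 0-based offset of the first character
    end: int     # assumed: offset one past the last character
    term: str    # exact text of the mention
    entity: str  # one of ENTITY_TYPES


def parse_annotation(line: str) -> Annotation:
    """Parse one record, assuming tab-separated pmid, start, end, term, entity."""
    pmid, start, end, term, entity = line.rstrip("\n").split("\t")
    if entity not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity!r}")
    return Annotation(pmid, int(start), int(end), term, entity)


def matches_text(ann: Annotation, text: str) -> bool:
    """Return True if the offsets select exactly the annotated term."""
    return text[ann.start:ann.end] == ann.term
```

Running `matches_text` over every record is a cheap way to find out whether the offset convention assumed here (0-based, exclusive end) is the one the corpus actually uses.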
Format
txt
Publication
Agüero-Torales et al. (2024). Overview of GenoVarDis at IberLEF 2024: NER of Genomic Variants and Related Diseases in Spanish. Procesamiento del Lenguaje Natural, 73: 421-434.
NLP Topic
Number of units
633
Type of units
Documents
Training set size
427
Test set size
136
Development set size
70
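The three split sizes above add up to the 633 documents reported under Number of units, which works out to roughly 67% training, 21% test, and 11% development. A minimal sketch of that check, where the dictionary keys and the helper name `split_shares` are illustrative and not part of the dataset:

```python
# Split sizes and total as reported in this card.
SPLITS = {"train": 427, "test": 136, "dev": 70}
TOTAL_DOCUMENTS = 633


def split_shares(splits: dict[str, int], total: int) -> dict[str, float]:
    """Return each split's share of the corpus, after checking the sizes add up."""
    if sum(splits.values()) != total:
        raise ValueError("split sizes do not add up to the reported total")
    return {name: size / total for name, size in splits.items()}


shares = split_shares(SPLITS, TOTAL_DOCUMENTS)
```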