GUA-SPA: Guarani Spanish corpus

Corpus of texts extracted from Paraguayan tweets and news articles, which commonly feature the Jopara or Jehe’a mixed varieties of Guarani and Spanish, as well as Spanish sentences that include Guarani loanwords. The dataset is used for three tasks: language identification, named entity recognition (NER), and classification of the way Spanish spans are used in the code-switched context. The corpus consists of 1,500 texts and about 25 thousand tokens.

Language(s)
Spanish
Spanish (Paraguay)
Guarani
Year
2023
Domain
News
Text types
News

Publication
Luis Chiruzzo, Marvin Agüero-Torales, Gustavo Giménez-Lugo, Aldo Alvarez, Yliana Rodríguez, Santiago Góngora, Thamar Solorio, Roberto Zanoli, Goutham Karunakaran (2023). Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code Switching Analysis. Procesamiento del Lenguaje Natural, no. 71, September 2023, pp. 321-328.
Number of units
1500
Type of units
Documents
Tokens
24849
Documents
1500
Training set size
1140
Test set size
180
Development set size
180
Size - additional information

named entities, language, code switching
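
The size figures listed above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python that only uses the numbers stated on this card; the variable and function names are illustrative assumptions and are not part of any official GUA-SPA tooling.

```python
# Figures copied from the dataset card; all names below are hypothetical.
TOTAL_DOCUMENTS = 1500
TOTAL_TOKENS = 24849
SPLITS = {"train": 1140, "dev": 180, "test": 180}


def split_shares(splits, total):
    """Return each split's share of the corpus as a percentage."""
    return {name: 100 * size / total for name, size in splits.items()}


def mean_tokens_per_document(tokens, documents):
    """Average number of tokens per document."""
    return tokens / documents


# The three splits should account for every document in the corpus.
assert sum(SPLITS.values()) == TOTAL_DOCUMENTS

shares = split_shares(SPLITS, TOTAL_DOCUMENTS)
for name, share in shares.items():
    print(f"{name}: {SPLITS[name]} documents ({share:.0f}%)")

mean_len = mean_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
print(f"mean tokens per document: {mean_len:.1f}")
```

Under these figures the split works out to 76% training, 12% development, and 12% test, with an average of roughly 16.6 tokens per text, which is consistent with a corpus of short tweets and news sentences.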
