Code-switching detection

GUA-SPA: Guarani Spanish corpus

A corpus of texts extracted from Paraguayan tweets and news articles, where the mixed Guarani-Spanish varieties known as Jopara or Jehe’a are common, as are Spanish sentences that include Guarani loanwords. The dataset supports three tasks: language identification, named entity recognition (NER), and classification of how Spanish spans are used in the code-switched context. The corpus consists of 1,500 texts and about 25 thousand tokens.
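
To make the language identification task concrete, here is a minimal Python sketch of token-level language tags and of locating the points where a text switches language. The tag names (`gn`, `es`, `other`), the `Token` class, the `switch_points` helper, and the example sentence are all illustrative assumptions; they are not taken from the corpus or its annotation scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    lang: str  # hypothetical tag set: "gn" (Guarani), "es" (Spanish), "other"


def switch_points(tokens: List[Token]) -> List[int]:
    """Return the indices where the language differs from the previous gn/es token."""
    points = []
    prev_lang = None
    for i, tok in enumerate(tokens):
        if tok.lang not in ("gn", "es"):
            continue  # punctuation and other tokens do not count as a switch
        if prev_lang is not None and tok.lang != prev_lang:
            points.append(i)
        prev_lang = tok.lang
    return points


# Invented toy sentence for illustration only; it is not drawn from the corpus.
example = [
    Token("Che", "gn"),
    Token("aha", "gn"),
    Token("al", "es"),
    Token("mercado", "es"),
    Token(".", "other"),
]

print(switch_points(example))  # [2]
```

A flat per-token tag is enough to illustrate language identification. The NER and Spanish-span classification tasks would need span-level labels on top of it, which this sketch leaves out.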