IberAuTexTification

A dataset created for the IberAuTexTification shared task, which covers machine-generated text detection and model attribution in the six main languages of the Iberian Peninsula: Catalan, English, Spanish, Basque, Galician, and Portuguese. The dataset contains human-written and machine-generated texts from seven domains: Chat, How-to, News, Literary, Reviews, Tweets, and Wikipedia. The machine-generated texts were produced with six language models: BLOOM-1B1, BLOOM-3B, BLOOM-7B1, Babbage, Curie, and text-davinci-003.

Language(s)
Spanish
English
Portuguese
Year
2024
Domain
News
Social
Others
Annotations
Two labels per text: a binary label indicating whether the text was machine-generated and, for machine-generated texts, a label identifying the model that generated it.
Format
tsv
Data access
Public

Publication
Sarvazyan et al. (2024). Overview of IberAuTexTification at IberLEF 2024: Detection and Attribution of Machine-Generated Text on Languages of the Iberian Peninsula. Procesamiento del Lenguaje Natural, 73: 421-434.
License
CC-BY-4.0
NLP Topic
Number of units
168128
Type of units
Documents
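
The Format and Annotations fields above imply tab-separated files in which each document carries a detection label and, for generated texts, a model label. The following is a minimal sketch of reading such a file and separating the two tasks. The column names (id, text, label, model), the label values, and the sample rows are assumptions made for illustration; they are not the documented schema of the released files and should be checked against the actual headers.

```python
import csv
import io

# Hypothetical sample: column names and label values are assumptions,
# not the documented schema of the released TSV files.
SAMPLE_TSV = (
    "id\ttext\tlabel\tmodel\n"
    "1\tExample human text\thuman\t\n"
    "2\tExample generated text\tgenerated\tBLOOM-7B1\n"
    "3\tAnother generated text\tgenerated\tCurie\n"
)


def read_rows(handle):
    """Parse tab-separated rows into dictionaries keyed by header name."""
    # QUOTE_NONE keeps quote characters inside the text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def split_tasks(rows):
    """Return (text, label) pairs for detection and (text, model) pairs for attribution."""
    detection = [(r["text"], r["label"]) for r in rows]
    # Attribution only applies to texts marked as machine-generated.
    attribution = [(r["text"], r["model"]) for r in rows if r["label"] == "generated"]
    return detection, attribution


rows = read_rows(io.StringIO(SAMPLE_TSV))
detection, attribution = split_tasks(rows)
```

For a file on disk, the same functions could be called with a handle from `open(path, newline="", encoding="utf-8")`, where `path` stands for wherever the downloaded file is stored.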

If you have published a result that improves on those in the list, send a message to odesia-comunicacion@lsi.uned.es with the result and the DOI of the article, and attach a copy of the article if it is not openly available.