IberAuTexTification | Portal ODESIA

A dataset created for the IberAuTexTification shared task on machine-generated text detection and model attribution in the six main languages of the Iberian Peninsula: Catalan, English, Spanish, Basque, Galician, and Portuguese. The dataset includes human-written and machine-generated texts in seven domains: Chat, How-to, News, Literary, Reviews, Tweets, and Wikipedia. The generated texts were produced by six language models: BLOOM-1B1, BLOOM-3B, BLOOM-7B1, Babbage, Curie, and text-davinci-003.

Language(s)

Spanish

English

Portuguese

Dataset description link

https://huggingface.co/datasets/Genaios/iberautextification

Year

2024

Domain

News

Social

Others

Annotations

Two labels per text: a binary label indicating whether the text was machine-generated and, for machine-generated texts, a label identifying the model that generated it.

Format

tsv

Annotation guide link

https://huggingface.co/datasets/Genaios/iberautextification

Data access

Public

Data link

https://huggingface.co/datasets/Genaios/iberautextification

Publication

Sarvazyan et al. (2024). Overview of IberAuTexTification at IberLEF 2024: Detection and Attribution of Machine-Generated Text on Languages of the Iberian Peninsula. Procesamiento del Lenguaje Natural, 73: 421-434.

Publication link

http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6628/4020

License

CC-BY-4.0

NLP Topic

text generation

Number of units

168128

Type of units

Documents
