The Mu-SHROOM dataset consists of prompts, model outputs, logits, and identifiers for openly available LLMs. It covers 10 languages with validation and test data (Modern Standard Arabic, German, English, Spanish, Finnish, French, Hindi, Italian, Swedish, and Mandarin Chinese), 4 test-only (“surprise”) languages (Catalan, Czech, Basque, and Farsi), and unlabeled training data for English, Spanish, French, and Chinese. Supplementary metadata (raw annotations before post-processing and the Wikipedia URLs used as references), the scripts used to generate model outputs for all 14 languages, and the code for the annotation and submission interfaces are publicly available.
Language(s)
Spanish
English
Arabic
German
Farsi
French
Hindi
Italian
Swedish
Chinese
Dataset description link
Year
2025
Domain
Diverse
Text types
Wikipedia
Annotations
character-level hallucination spans with hard and soft labels, and annotator IDs
Annotation guide link
Data access
Public
Publication
Raul Vazquez, Timothee Mickus, Elaine Zosa, Teemu Vahtola, Jörg Tiedemann, Aman Sinha, Vincent Segonne, Fernando Sanchez-Vega, Alessandro Raganato, Jindřich Libovický, Jussi Karlgren, Shaoxiong Ji, Jindřich Helcl, Liane Guillou, Ona De Gibert, Jaione Bengoetxea, Joseph Attieh, and Marianna Apidianaki. 2025. SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 2472–2497, Vienna, Austria. Association for Computational Linguistics.
Publication link
License
CC-BY-4.0
NLP Topic
Number of units
200
Test set size
150
Development set size
50
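Usage note
The annotations listed above pair soft labels with hard labels over character-level spans. The sketch below shows one plausible way to derive hard spans from soft ones. It rests on assumptions not confirmed by this entry: that each soft label is a record with `start`, `end`, and `prob` fields, that `end` is exclusive, and that a span counts as a hard label when its probability exceeds 0.5. The function name `soft_to_hard` is hypothetical.

```python
from typing import List, Mapping, Tuple

Span = Tuple[int, int]


def soft_to_hard(soft_labels: List[Mapping[str, float]],
                 threshold: float = 0.5) -> List[Span]:
    """Keep spans whose probability exceeds `threshold`, then merge
    spans that touch or overlap into single character ranges.

    Assumed (hypothetical) record shape:
        {"start": int, "end": int, "prob": float}, with `end` exclusive.
    """
    kept = sorted(
        (int(s["start"]), int(s["end"]))
        for s in soft_labels
        if s["prob"] > threshold
    )
    merged: List[Span] = []
    for start, end in kept:
        if merged and start <= merged[-1][1]:
            # Touching or overlapping: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def span_texts(text: str, spans: List[Span]) -> List[str]:
    """Return the substrings of `text` covered by each character span."""
    return [text[start:end] for start, end in spans]
```

With these assumptions, `span_texts(output, soft_to_hard(labels))` returns the substrings of a model output that the thresholded labels mark as hallucinated. The threshold is a parameter so that a different agreement cut-off can be tried.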

