MultiParaDetox is a dataset for text detoxification, a style transfer task that converts toxic expressions into a neutral register. It extends the ParaDetox approach to multiple languages, enabling the automatic creation of parallel corpora for detoxification.
Language(s)
Spanish
Ukrainian
Dataset description link
Year
2024
Domain
Diverse
Annotations
Parallel corpus
Annotation guide link
Data access
Public
Publication
Daryna Dementieva, Nikolay Babakov, and Alexander Panchenko. 2024. MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 124–140, Mexico City, Mexico. Association for Computational Linguistics.
Publication link
License
OpenRAIL++
NLP Topic
Number of units
1720
Type of units
Tweets
Training set size
720
Test set size
600
Development set size
400

