textdetox/uk_paradetox

Source: Hugging Face. Updated 2024-06-25. Indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/textdetox/uk_paradetox
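Resource description:
A Ukrainian parallel text detoxification corpus for the text detoxification task in Ukrainian. The corpus is built on a Ukrainian tweets corpus. For more details, see the MultiParaDetox paper.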
Provider:
textdetox
Summary of original information

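Ukrainian Parallel Text Detoxification Dataset

Overview

  • Task category: text-to-text generation
  • Language: Ukrainian
  • Dataset size: 1K<n<10K

Data source

  • Based on a Ukrainian tweets corpus

Citation

  • To cite this dataset, please refer to the following publications: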

@inproceedings{dementieva-etal-2024-multiparadetox,
    title = "{M}ulti{P}ara{D}etox: Extending Text Detoxification with Parallel Data to New Languages",
    author = "Dementieva, Daryna and Babakov, Nikolay and Panchenko, Alexander",
    editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-short.12",
    pages = "124--140",
    abstract = "Text detoxification is a textual style transfer (TST) task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recently, text detoxification methods found their applications in various task such as detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and toxic speech combating in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important to ensure safe communication in modern digital worlds. However, the previous approaches for parallel text detoxification corpora collection{---}ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022){---}were explored only in monolingual setup. In this work, we aim to extend ParaDetox pipeline to multiple languages presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. Then, we experiment with different text detoxification models{---}from unsupervised baselines to LLMs and fine-tuned models on the presented parallel corpora{---}showing the great benefit of parallel corpus presence to obtain state-of-the-art text detoxification models for any language.",
}

@inproceedings{dementieva2024overview,
    title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
    author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
    booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
    editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
    year={2024},
    organization={CEUR-WS.org}
}
