K/DA
Source: arXiv · Updated: 2025-06-16 · Indexed: 2025-06-19
Download link:
https://github.com/minkyeongjeon/kda
Resource overview:
K/DA is a Korean dataset of approximately 7.5K neutral-toxic pairs, intended to help train text detoxification models. It was created jointly by Korea University, Seoul National University, and KAIST AI, and was generated by an automated pipeline. It covers several forms of offensive language, including explicit insults, implicit insults, and their variants. The pairs were generated with Retrieval-Augmented Generation (RAG) and then filtered for data quality and diversity. The dataset is described as applicable across multiple languages and model types, and as helping to improve the performance of detoxification models.
Provided by:
Korea University, Seoul National University, KAIST AI
Created:
2025-06-16



