trueorfalse441/korean_hate_speech_copy
Dataset Overview
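Basic Dataset Information
- Name: K-MHaS
- Language: Korean
- License: CC BY-SA 4.0
- Multilinguality: monolingual
- Size category: 100K<n<1M
- Source data: original
- Tags: K-MHaS, Korean NLP, Hate Speech Detection, Dataset, Coling2022
- Task category: text classification
- Task IDs: multi-label classification, hate speech detection
- PapersWithCode ID: korean-multi-label-hate-speech-dataset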
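Dataset Structure
Features
- text: string; the text of a Korean online news comment.
- label: sequence of class labels, drawn from the following categories:
  - 0: Origin (discrimination based on origin)
  - 1: Physical (discrimination based on appearance)
  - 2: Politics (discrimination based on political orientation)
  - 3: Profanity (hateful insults)
  - 4: Age (age discrimination)
  - 5: Gender (gender discrimination)
  - 6: Race (racial discrimination)
  - 7: Religion (religious discrimination)
  - 8: Not Hate Speech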
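Because label is a sequence of class IDs rather than a single value, most multi-label classifiers need it converted to a fixed-length 0/1 vector. The following is a minimal sketch of that conversion, assuming each record is a dict with text and label keys as described above. The helper names and the example record are hypothetical and are not part of the dataset or of any library.

```python
# Label IDs 0-8 and their English names, as listed in the feature description.
LABEL_NAMES = [
    "Origin", "Physical", "Politics", "Profanity", "Age",
    "Gender", "Race", "Religion", "Not Hate Speech",
]


def to_multi_hot(label_ids, num_labels=len(LABEL_NAMES)):
    """Convert a sequence of label IDs into a fixed-length 0/1 vector."""
    vector = [0] * num_labels
    for label_id in label_ids:
        if not 0 <= label_id < num_labels:
            raise ValueError(f"unknown label ID: {label_id}")
        vector[label_id] = 1
    return vector


def decode_labels(label_ids):
    """Map label IDs to their English names, in ascending ID order."""
    return [LABEL_NAMES[i] for i in sorted(set(label_ids))]


# Hypothetical record for illustration only; not taken from the dataset.
example = {"text": "<comment text>", "label": [2, 3]}
multi_hot = to_multi_hot(example["label"])
names = decode_labels(example["label"])
```

Here multi_hot has one position per category, so it can be used directly as the target for a sigmoid-output classifier trained with a binary cross-entropy loss.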
Data Splits
- Training set: 78,977 samples
- Validation set: 8,776 samples
- Test set: 21,939 samples
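As a quick arithmetic check, the three split sizes sum to 109,692 samples, which is consistent with the roughly 109k utterances reported in the paper abstract and with the 100K<n<1M size category. A minimal sketch of that check follows; the variable names are illustrative only.

```python
# Split sizes as stated in this card.
SPLIT_SIZES = {"train": 78_977, "validation": 8_776, "test": 21_939}

total = sum(SPLIT_SIZES.values())

# Share of the whole dataset held by each split.
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]:,} samples ({share:.1%})")
print(f"total: {total:,} samples")
```

The shares come out to approximately 72% training, 8% validation, and 20% test.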
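Dataset Creation
Data Collection and Normalization
- Source: Korean online news comments, obtained from Kaggle and GitHub.
- Collection period: January 2018 to June 2020.
Annotation Process
- Annotators: five native Korean speakers.
- Annotation guidelines: annotators distinguish hate speech from non-hate speech and assign the applicable hate speech categories.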
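The annotation scheme implies a few consistency rules that can be checked mechanically. The sketch below is one possible validator, not the authors' tooling. It relies on the 1 to 4 labels per comment reported in the paper abstract, and on the assumption (not stated explicitly in this card) that a comment labeled Not Hate Speech carries no hate category. The function name check_annotation is hypothetical.

```python
NOT_HATE = 8      # ID of "Not Hate Speech" in the label list
NUM_LABELS = 9    # label IDs run from 0 to 8
MAX_LABELS = 4    # the paper abstract reports 1 to 4 labels per comment


def check_annotation(label_ids):
    """Return a list of problems found in one annotation (empty if none)."""
    problems = []
    if not 1 <= len(label_ids) <= MAX_LABELS:
        problems.append("expected between 1 and 4 labels")
    if len(set(label_ids)) != len(label_ids):
        problems.append("duplicate label IDs")
    if any(not 0 <= i < NUM_LABELS for i in label_ids):
        problems.append("label ID out of range")
    # Assumption: "Not Hate Speech" is mutually exclusive with hate categories.
    if NOT_HATE in label_ids and len(label_ids) > 1:
        problems.append("'Not Hate Speech' combined with a hate category")
    return problems
```

Running this over every record before training is a cheap way to catch corrupted or mis-parsed label fields.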
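Personal and Sensitive Information
- The dataset contains examples of hate speech but does not contain sensitive personal information.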
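Considerations for Using the Data
Social Impact
- Provides a new multi-label Korean hate speech detection dataset and demonstrates its usefulness for detecting hate speech patterns in Korean.
Discussion of Biases
- Annotators were recruited from a crowdsourcing platform and were briefed on hate speech before working with the data.
Other Known Limitations
- [More Information Needed]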
Additional Information
Dataset Curators
- Taejun Lim, Heejun Lee, Bogeun Jo
License Information
- Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Citation Information
@inproceedings{lee-etal-2022-k,
    title = "K-{MH}a{S}: A Multi-label Hate Speech Detection Dataset in {K}orean Online News Comment",
    author = "Lee, Jean and Lim, Taejun and Lee, Heejun and Jo, Bogeun and Kim, Yangsok and Yoon, Heegeun and Han, Soyeon Caren",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.311",
    pages = "3530--3538",
    abstract = "Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides a multi-label classification using 1 to 4 labels, and handles subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms others, recognizing decomposed characters in each hate speech class.",
}
Contributors
- Jean Lee (The University of Sydney)
- Taejun Lim (The University of Sydney)
- Heejun Lee (BigWave AI)
- Bogeun Jo (BigWave AI)
- Yangsok Kim (Keimyung University)
- Heegeun Yoon (National Information Society Agency)
- Soyeon Caren Han (The University of Western Australia and The University of Sydney)



