trueorfalse441/korean_hate_speech_copy

Name: trueorfalse441/korean_hate_speech_copy
Creator: trueorfalse441
Published: 2023-11-21 12:15:43
License: No description available

Source: Hugging Face | Updated: 2023-11-21 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/trueorfalse441/korean_hate_speech_copy


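Overview:

K-MHaS (Korean Multi-label Hate Speech detection dataset) contains 109,692 utterances from Korean online news comments. Each utterance is annotated with one or more of 8 fine-grained hate speech categories (politics, origin, physical appearance, age, gender, religion, race, profanity) or with the non-hate-speech class. Each utterance carries one to four labels, and the dataset is designed to handle Korean language patterns effectively. Its multi-label annotation scheme is intended to reflect the subjectivity and intersectionality of hate speech, and it supports hate speech detection as either binary or multi-label classification. The data is split into training, validation, and test sets of 78,977, 8,776, and 21,939 samples respectively.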

Provided by:

trueorfalse441

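Summary of original information

Dataset overview

Basic dataset information

Name: K-MHaS
Language: Korean
License: CC BY-SA 4.0
Multilinguality: monolingual
Size category: 100K<n<1M
Source data: original
Tags: K-MHaS, Korean NLP, Hate Speech Detection, Dataset, Coling2022
Task category: text classification
Task IDs: multi-label classification, hate speech detection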
PapersWithCode ID: korean-multi-label-hate-speech-dataset

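Dataset structure

Features

text: string; the text of a Korean online news comment.
label: sequence; contains one or more of the following class labels:
- 0: Origin (discrimination based on origin)
- 1: Physical (discrimination based on physical appearance)
- 2: Politics (discrimination based on political orientation)
- 3: Profanity (abusive or insulting language)
- 4: Age (discrimination based on age)
- 5: Gender (discrimination based on gender)
- 6: Race (discrimination based on race)
- 7: Religion (discrimination based on religion)
- 8: Not Hate Speech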
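
Because label is a sequence of class ids, one common way to prepare it for a multi-label classifier is a 9-dimensional multi-hot vector, and a binary hate / not-hate target can be derived by checking whether any id other than 8 is present. The following is a minimal sketch under those assumptions. The helper names to_multi_hot, to_binary, and label_names are illustrative only and are not part of the dataset or of any library.

```python
from typing import List, Sequence

# Class ids in the order given in the feature description above.
LABEL_NAMES = [
    "Origin", "Physical", "Politics", "Profanity", "Age",
    "Gender", "Race", "Religion", "Not Hate Speech",
]
NOT_HATE_ID = 8


def to_multi_hot(label_ids: Sequence[int],
                 num_classes: int = len(LABEL_NAMES)) -> List[int]:
    """Turn a list of class ids into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for label_id in label_ids:
        if not 0 <= label_id < num_classes:
            raise ValueError(f"unknown label id: {label_id}")
        vector[label_id] = 1
    return vector


def to_binary(label_ids: Sequence[int]) -> int:
    """Return 1 if any hate speech category is present, otherwise 0."""
    return int(any(label_id != NOT_HATE_ID for label_id in label_ids))


def label_names(label_ids: Sequence[int]) -> List[str]:
    """Map class ids to their human-readable names."""
    return [LABEL_NAMES[label_id] for label_id in label_ids]


# Hypothetical example: a comment tagged with Politics and Profanity.
example = [2, 3]
print(label_names(example), to_multi_hot(example), to_binary(example))
```

The multi-hot form suits a sigmoid-per-class output layer, while the binary form collapses the eight hate categories into a single positive class.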

Data splits

Training set: 78,977 samples
Validation set: 8,776 samples
Test set: 21,939 samples
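
As a quick sanity check, the three split sizes can be summed and compared against the stated total of 109,692 utterances. The sketch below does that and prints each split's share of the whole, which comes out to roughly 72% / 8% / 20%. The split keys "train", "validation", and "test" are assumed names chosen for illustration.

```python
# Split sizes as reported above; the key names are assumptions.
SPLIT_SIZES = {"train": 78_977, "validation": 8_776, "test": 21_939}

total = sum(SPLIT_SIZES.values())
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

# Print one row per split with its size and share of the total.
for name, size in SPLIT_SIZES.items():
    print(f"{name:<10} {size:>7,} {fractions[name]:.1%}")
print(f"{'total':<10} {total:>7,}")
```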

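Dataset creation

Data collection and normalization

Data source: Korean online news comments, obtained from Kaggle and GitHub.
Collection period: January 2018 to June 2020.

Annotation process

Annotators: five native Korean speakers.
Annotation guidelines: distinguish hate speech from non-hate speech, and label the category or categories of each hate speech utterance.

Personal and sensitive information

The dataset contains examples of hate speech but does not contain sensitive personal information.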

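Considerations for using the dataset

Social impact

The work provides a new multi-label dataset for Korean hate speech detection and demonstrates its usefulness for detecting hate speech patterns in Korean.

Discussion of biases

The annotators were recruited from a crowdsourcing platform and were briefed on hate speech before working with the data.

Other known limitations

[More Information Needed]

Additional information

Dataset curators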

Taejun Lim, Heejun Lee, Bogeun Jo

Licensing information

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Citation information

@inproceedings{lee-etal-2022-k,
    title = "K-{MH}a{S}: A Multi-label Hate Speech Detection Dataset in {K}orean Online News Comment",
    author = "Lee, Jean and Lim, Taejun and Lee, Heejun and Jo, Bogeun and Kim, Yangsok and Yoon, Heegeun and Han, Soyeon Caren",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.311",
    pages = "3530--3538",
    abstract = "Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides a multi-label classification using 1 to 4 labels, and handles subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms others, recognizing decomposed characters in each hate speech class."
}

Contributors

Jean Lee (The University of Sydney)
Taejun Lim (The University of Sydney)
Heejun Lee (BigWave AI)
Bogeun Jo (BigWave AI)
Yangsok Kim (Keimyung University)
Heegeun Yoon (National Information Society Agency)
Soyeon Caren Han (The University of Western Australia and The University of Sydney)
