sayafadhil43/kmhas_korean_hate_speech

Name: sayafadhil43/kmhas_korean_hate_speech
Creator: sayafadhil43
Published: 2026-03-17 14:50:20
License: No description available

Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/sayafadhil43/kmhas_korean_hate_speech


Description:

---
annotations_creators:
- crowdsourced
language:
- ko
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: 'K-MHaS'
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- K-MHaS
- Korean NLP
- Hate Speech Detection
- Dataset
- Coling2022
task_categories:
- text-classification
task_ids:
- multi-label-classification
- hate-speech-detection
paperswithcode_id: korean-multi-label-hate-speech-dataset
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    sequence:
      class_label:
        names:
          0: origin
          1: physical
          2: politics
          3: profanity
          4: age
          5: gender
          6: race
          7: religion
          8: not_hate_speech
  splits:
  - name: train
    num_bytes: 6845463
    num_examples: 78977
  - name: validation
    num_bytes: 748899
    num_examples: 8776
  - name: test
    num_bytes: 1902352
    num_examples: 21939
  download_size: 9496714
  dataset_size: 109692
---

# Dataset Card for K-MHaS

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Sample Code

<a href="https://colab.research.google.com/drive/171KhS1_LVBtpAFd_kaT8lcrZmhcz5ehY?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="base"/></a>

## Dataset Description

- **Homepage:** [K-MHaS](https://github.com/adlnlp/K-MHaS)
- **Repository:** [Korean Multi-label Hate Speech Dataset](https://github.com/adlnlp/K-MHaS)
- **Paper:** [K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment](https://arxiv.org/abs/2208.10684)
- **Point of Contact:** [Caren Han](mailto:caren.han@sydney.edu.au)
- **Sample code:** [Colab](https://colab.research.google.com/drive/171KhS1_LVBtpAFd_kaT8lcrZmhcz5ehY?usp=sharing)

### Dataset Summary

The Korean Multi-label Hate Speech Dataset, **K-MHaS**, consists of 109,692 utterances from Korean online news comments, each labelled with one or more of 8 fine-grained hate speech classes (labels: `Politics`, `Origin`, `Physical`, `Age`, `Gender`, `Religion`, `Race`, `Profanity`) or with the `Not Hate Speech` class. Each utterance carries one to four labels, an annotation scheme intended to handle Korean language patterns effectively. For more details, please refer to our paper about [**K-MHaS**](https://aclanthology.org/2022.coling-1.311), published at COLING 2022.

### Supported Tasks and Leaderboards

Hate Speech Detection

* `binary classification` (labels: `Hate Speech`, `Not Hate Speech`)
* `multi-label classification` (labels: `Politics`, `Origin`, `Physical`, `Age`, `Gender`, `Religion`, `Race`, `Profanity`, `Not Hate Speech`)

For multi-label classification, the `Hate Speech` class from the binary classification is broken down into eight classes, each associated with a hate speech category. We selected the eight hate speech classes to reflect the social and historical context. For example, the `Politics` class was chosen because of its significant influence on the style of Korean hate speech.

### Languages

Korean

## Dataset Structure

### Data Instances

The dataset is provided as train, validation, and test sets in txt format.
Each instance is a news comment with one or more corresponding hate speech classes (labels: `Politics`, `Origin`, `Physical`, `Age`, `Gender`, `Religion`, `Race`, `Profanity`) or the `Not Hate Speech` class. The label numbers, with their English and Korean names, are listed in the Data Fields section.

```python
{'text': '수꼴틀딱시키들이 다 디져야 나라가 똑바로 될것같다..답이 없는 종자들ㅠ',
 'label': [2, 3, 4]}
```

### Data Fields

* `text`: utterance from a Korean online news comment.
* `label`: the label numbers corresponding to the 8 fine-grained hate speech classes and the `not hate speech` class are as follows.
  * `0`: `Origin`(`출신차별`) hate speech based on place of origin or identity;
  * `1`: `Physical`(`외모차별`) hate speech based on physical appearance (e.g. body, face) or disability;
  * `2`: `Politics`(`정치성향차별`) hate speech based on political stance;
  * `3`: `Profanity`(`혐오욕설`) hate speech in the form of swearing, cursing, cussing, obscene words, or expletives; or an unspecified hate speech category;
  * `4`: `Age`(`연령차별`) hate speech based on age;
  * `5`: `Gender`(`성차별`) hate speech based on gender or sexual orientation (e.g. woman, homosexual);
  * `6`: `Race`(`인종차별`) hate speech based on ethnicity;
  * `7`: `Religion`(`종교차별`) hate speech based on religion;
  * `8`: `Not Hate Speech`(`해당사항없음`).

### Data Splits

In our repository, we provide split datasets with 78,977 (train) / 8,776 (validation) / 21,939 (test) samples, preserving the class proportions.

## Dataset Creation

### Curation Rationale

We propose K-MHaS, a large-scale Korean multi-label hate speech detection dataset that represents Korean language patterns effectively. Most datasets in hate speech research are annotated with a single-label classification of particular aspects, even though the subjectivity of hate speech cannot be explained with a mutually exclusive annotation scheme. We propose a multi-label hate speech annotation scheme that allows overlapping labels associated with the subjectivity and the intersectionality of hate speech.
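
The overlapping-label scheme can be made concrete with a small sketch. It assumes the index-to-class mapping given in the Data Fields section and the instance shape shown under Data Instances. The helper names `decode_labels`, `to_multi_hot`, and `to_binary` are hypothetical and are not part of the K-MHaS repository or any library. The binary collapse additionally assumes that `not_hate_speech` never co-occurs with a hate speech class, which the card implies but does not state outright.

```python
# Illustrative helpers only; the names are hypothetical, not from the K-MHaS codebase.
# Class order follows the Data Fields section (index -> class name).
LABEL_NAMES = [
    "origin", "physical", "politics", "profanity", "age",
    "gender", "race", "religion", "not_hate_speech",
]
NOT_HATE = LABEL_NAMES.index("not_hate_speech")  # index 8


def decode_labels(label_ids):
    """Map integer label ids to their class names."""
    return [LABEL_NAMES[i] for i in label_ids]


def to_multi_hot(label_ids):
    """Encode a list of label ids as a 9-dimensional 0/1 vector."""
    vector = [0] * len(LABEL_NAMES)
    for i in label_ids:
        vector[i] = 1
    return vector


def to_binary(label_ids):
    """Collapse multi-label ids to the binary task: 1 = hate speech, 0 = not.

    Assumes not_hate_speech is never combined with another class.
    """
    return int(any(i != NOT_HATE for i in label_ids))


# The instance shown under Data Instances carries three overlapping labels.
example = {"label": [2, 3, 4]}
print(decode_labels(example["label"]))  # ['politics', 'profanity', 'age']
print(to_multi_hot(example["label"]))   # [0, 0, 1, 1, 1, 0, 0, 0, 0]
print(to_binary(example["label"]))      # 1
```

A multi-hot vector of this kind is a common target format for multi-label classifiers trained with a per-class sigmoid output, while the binary value corresponds to the `Hate Speech` / `Not Hate Speech` task.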
### Source Data

#### Initial Data Collection and Normalization

Our dataset is based on the Korean online news comments available on Kaggle and GitHub. The unlabeled raw data was collected between January 2018 and June 2020. Please see the details in our paper [K-MHaS](https://aclanthology.org/2022.coling-1.311), published at COLING 2022.

#### Who are the source language producers?

The language producers are users who left comments on the Korean online news platform between 2018 and 2020.

### Annotations

#### Annotation process

We begin with the common categories of hate speech found in the literature and match the keywords for each category. After the preliminary round, we investigate the results to merge or remove labels in order to provide the most representative subtype labels of hate speech for the cultural background. Our annotation instructions explain a two-layered annotation to (a) distinguish hate speech from non-hate speech, and (b) identify the categories of hate speech. Annotators are requested to consider the given keywords or alternatives for each category within social, cultural, and historical circumstances. For more details, please refer to the paper [K-MHaS](https://aclanthology.org/2022.coling-1.311).

#### Who are the annotators?

Five native speakers were recruited for manual annotation in both the preliminary and main rounds.

### Personal and Sensitive Information

This dataset contains examples of hateful language but no personal information.

## Considerations for Using the Data

### Social Impact of Dataset

We propose K-MHaS, a new large-scale dataset for Korean hate speech detection with a multi-label annotation scheme. We provide extensive baseline experiment results that demonstrate the usability of the dataset for detecting Korean language patterns in hate speech.

### Discussion of Biases

All annotators were recruited from a crowdsourcing platform. They were informed about hate speech before handling the data.
Our instructions allowed them to leave freely if they were uncomfortable with the content. With respect to the potential risks, we note that the subjectivity of human annotation may affect the quality of the dataset.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

This dataset is curated by Taejun Lim, Heejun Lee and Bogeun Jo.

### Licensing Information

Creative Commons Attribution-ShareAlike 4.0 International (cc-by-sa-4.0).

### Citation Information

```
@inproceedings{lee-etal-2022-k,
    title = "K-{MH}a{S}: A Multi-label Hate Speech Detection Dataset in {K}orean Online News Comment",
    author = "Lee, Jean and
      Lim, Taejun and
      Lee, Heejun and
      Jo, Bogeun and
      Kim, Yangsok and
      Yoon, Heegeun and
      Han, Soyeon Caren",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.311",
    pages = "3530--3538",
    abstract = "Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides a multi-label classification using 1 to 4 labels, and handles subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms others, recognizing decomposed characters in each hate speech class.",
}
```

### Contributions

The contributors of the work are:

- [Jean Lee](https://jeanlee-ai.github.io/) (The University of Sydney)
- [Taejun Lim](https://github.com/taezun) (The University of Sydney)
- [Heejun Lee](https://bigwaveai.com/) (BigWave AI)
- [Bogeun Jo](https://bigwaveai.com/) (BigWave AI)
- Yangsok Kim (Keimyung University)
- Heegeun Yoon (National Information Society Agency)
- [Soyeon Caren Han](https://drcarenhan.github.io/) (The University of Western Australia and The University of Sydney)

