TUKE-KEMT/hate_speech_slovak
Source: Hugging Face. Updated 2024-06-06, indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/TUKE-KEMT/hate_speech_slovak
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-classification
language:
- sk
size_categories:
- 10K<n<100K
---
# Slovak Hate Speech and Offensive Language Database
The dataset contains posts from a social network with human annotations.
## Annotations
A post is labeled 1 if it contains hateful or offensive language, and 0 otherwise.
## Dataset Creation
The source data were scraped from a social network, from a selection of public pages on sport, politics, or general discussion. The gathered data were cleaned of spam using text clustering.
The posts were annotated by a group of students of the Technical University of Košice, Slovakia.
We removed annotations from users who had a low level of agreement with the others.
## Data filtering
Items were annotated by multiple annotators, but some annotators were unreliable, so we had to identify them.
1. We removed annotations from users who clicked the same option most of the time (90%).
2. We calculated the level of agreement for each annotator. An annotator gets a positive point for each annotation that matches those of the other annotators, and a negative point for each annotation that differs. For each annotator, we calculated the ratio of positive and negative points.
3. We removed annotations from annotators with a low agreement ratio (less than 70%).
4. We counted the votes for the positive, neutral, and negative class for each annotated item from the remaining annotators. We removed items where the neutral class had the majority.
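The four steps above could be sketched roughly as follows. This is an illustrative reconstruction, not the code used to build the dataset. The function name `filter_annotations`, the `(annotator, item, label)` tuple layout, and the label strings are hypothetical. Several details are assumptions because the description leaves them open: agreement is counted pairwise against every other annotator of the same item, the agreement ratio is taken as the share of matching comparisons, "majority" for the neutral class means more than half of the votes, and items with a positive/negative tie are dropped.

```python
from collections import Counter, defaultdict


def filter_annotations(annotations, same_option_limit=0.9, min_agreement=0.7):
    """Sketch of the filtering steps (hypothetical helper, not the authors' code).

    annotations: list of (annotator, item, label) tuples, where label is one of
    "positive", "neutral", "negative" (assumed names for the three options).
    Returns a dict mapping each surviving item to its winning class.
    """
    # Step 1: drop annotators who chose one option in >= 90% of their annotations.
    labels_by_annotator = defaultdict(list)
    for annotator, _item, label in annotations:
        labels_by_annotator[annotator].append(label)
    monotone = {
        annotator
        for annotator, labels in labels_by_annotator.items()
        if Counter(labels).most_common(1)[0][1] / len(labels) >= same_option_limit
    }
    kept = [row for row in annotations if row[0] not in monotone]

    # Step 2: pairwise agreement (assumption): compare each annotation with
    # every other annotator's label on the same item.
    votes_by_item = defaultdict(list)
    for annotator, item, label in kept:
        votes_by_item[item].append((annotator, label))
    agree, total = Counter(), Counter()
    for votes in votes_by_item.values():
        for annotator, label in votes:
            for other, other_label in votes:
                if other == annotator:
                    continue
                total[annotator] += 1
                agree[annotator] += int(label == other_label)

    # Step 3: drop annotators whose share of matching comparisons is below 70%.
    # Annotators with no overlapping items cannot be scored and are kept (assumption).
    unreliable = {a for a in total if agree[a] / total[a] < min_agreement}
    kept = [row for row in kept if row[0] not in unreliable]

    # Step 4: tally votes per item and drop items where neutral has the majority.
    tallies = defaultdict(Counter)
    for _annotator, item, label in kept:
        tallies[item][label] += 1
    result = {}
    for item, tally in tallies.items():
        if tally["neutral"] * 2 > sum(tally.values()):
            continue  # neutral holds more than half of the votes
        if tally["positive"] != tally["negative"]:  # ties are dropped (assumption)
            result[item] = (
                "positive" if tally["positive"] > tally["negative"] else "negative"
            )
    return result
```

Note that under this sketch an item can end up with a single remaining vote once unreliable annotators are removed, which matches the situation described in the Bias section.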
## Bias
The annotations depend on the personal opinions of the annotators. The class of most items was determined by a vote among trustworthy annotators, but some items had only one vote available.
## Credits
- [NLP@ KEMT](https://nlp.kemt.fei.tuke.sk) Technical University of Košice, Slovakia
- Vladimír Ferko: annotation application and preliminary experiments
- Daniel Hládek: data filtering and export
Provider:
TUKE-KEMT
Summary of original information
Slovak Hate Speech and Offensive Language Database
Dataset overview
- Language: Slovak (sk)
- Size: 10,000 < n < 100,000
- Task category: text classification
- License: CC-BY-SA-4.0
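Data content
- Contains posts from a social network with human annotations.
- A post is labeled 1 if it contains hateful or offensive language, and 0 otherwise.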
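Data creation
- The source data were scraped from public pages on sport, politics, or general discussion.
- The data were cleaned of spam using text clustering.
- The posts were annotated by students of the Technical University of Košice, Slovakia.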
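Data filtering
- Each item was annotated by multiple annotators, but some annotators were unreliable.
- The filtering process included:
  - Removing annotations from users who clicked the same option most of the time (90%).
  - Calculating the level of agreement for each annotator.
  - Removing annotations from annotators with a level of agreement below 70%.
  - Counting the remaining annotators' votes for the positive, neutral, and negative classes, and removing annotations where the neutral class had the majority.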
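Bias
- The annotations are influenced by the personal opinions of the annotators.
- The class of most items was determined by a vote among trustworthy annotators, but some items had only one vote.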
Contributors
- NLP@ KEMT, Technical University of Košice, Slovakia
- Vladimír Ferko: annotation application and preliminary experiments
- Daniel Hládek: data filtering and export



