odegiber/hate_speech18
Source: Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/odegiber/hate_speech18
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- intent-classification
paperswithcode_id: hate-speech
pretty_name: Hate Speech
dataset_info:
  features:
  - name: text
    dtype: string
  - name: user_id
    dtype: int64
  - name: subforum_id
    dtype: int64
  - name: num_contexts
    dtype: int64
  - name: label
    dtype:
      class_label:
        names:
          '0': noHate
          '1': hate
          '2': idk/skip
          '3': relation
  splits:
  - name: train
    num_bytes: 1375340
    num_examples: 10944
  download_size: 3664530
  dataset_size: 1375340
train-eval-index:
- config: default
  task: text-classification
  task_id: multi_class_classification
  splits:
    train_split: train
  col_mapping:
    text: text
    label: target
  metrics:
  - type: accuracy
    name: Accuracy
  - type: f1
    name: F1 macro
    args:
      average: macro
  - type: f1
    name: F1 micro
    args:
      average: micro
  - type: f1
    name: F1 weighted
    args:
      average: weighted
  - type: precision
    name: Precision macro
    args:
      average: macro
  - type: precision
    name: Precision micro
    args:
      average: micro
  - type: precision
    name: Precision weighted
    args:
      average: weighted
  - type: recall
    name: Recall macro
    args:
      average: macro
  - type: recall
    name: Recall micro
    args:
      average: micro
  - type: recall
    name: Recall weighted
    args:
      average: weighted
---
# Dataset Card for Hate Speech
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/Vicomtech/hate-speech-dataset
- **Repository:** https://github.com/Vicomtech/hate-speech-dataset
- **Paper:** https://www.aclweb.org/anthology/W18-51.pdf
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
These files contain text extracted from Stormfront, a white supremacist forum. A random set of forum posts has been sampled from
several subforums and split into sentences. Those sentences have been manually labelled as containing hate speech or not, according
to a set of annotation guidelines.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
English
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
- text: the provided sentence
- user_id: identifier that makes it possible to rebuild the conversations these sentences belong to
- subforum_id: identifier that makes it possible to rebuild the conversations these sentences belong to
- num_contexts: number of previous posts the annotator had to read before deciding on the category of the sentence
- label: noHate (0), hate (1), idk/skip (2; sentences that are not written in English or that do not contain enough information to be classified as hate or noHate), or relation (3; the sentence does not contain hate speech on its own, but the combination of several sentences in the post does)
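
The label ids and names listed in the card's metadata can be turned into a small lookup table. The sketch below is only an illustration: the `to_binary` helper, the decision to drop `idk/skip` and `relation`, and the placeholder rows are assumptions, not part of the dataset or its guidelines.

```python
# Label ids and names as listed in the card's `class_label` metadata.
ID2LABEL = {0: "noHate", 1: "hate", 2: "idk/skip", 3: "relation"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def to_binary(records):
    """Keep only hate/noHate rows and attach a readable label name.

    Dropping `idk/skip` and `relation` is an illustrative choice,
    not something the card prescribes.
    """
    kept = []
    for row in records:
        name = ID2LABEL[row["label"]]
        if name in ("hate", "noHate"):
            kept.append({**row, "label_name": name})
    return kept


# Hypothetical rows with placeholder text, shaped like the fields above.
rows = [
    {"text": "placeholder sentence A", "user_id": 1, "subforum_id": 10,
     "num_contexts": 0, "label": 0},
    {"text": "placeholder sentence B", "user_id": 2, "subforum_id": 10,
     "num_contexts": 1, "label": 3},
    {"text": "placeholder sentence C", "user_id": 3, "subforum_id": 11,
     "num_contexts": 0, "label": 1},
]
binary_rows = to_binary(rows)
print([r["label_name"] for r in binary_rows])
```

Whether `relation` sentences should be dropped, merged into `hate`, or kept as a separate class depends on the task, so the filter above is best treated as one option among several.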
### Data Splits
The dataset provides a single `train` split with 10,944 examples.
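
Because the metadata lists only a `train` split, anyone who needs a held-out set has to carve one out. The following is a minimal sketch of a stratified split over a label column, using only the standard library. The function name, the 10% validation fraction, and the toy label counts are assumptions for illustration and do not reflect the real class distribution.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.1, seed=0):
    """Return (train_idx, valid_idx) with each label represented proportionally."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, valid_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Toy label column: 80 x noHate (0) and 20 x hate (1); the counts are made up.
labels = [0] * 80 + [1] * 20
train_idx, valid_idx = stratified_split(labels, valid_fraction=0.1)
print(len(train_idx), len(valid_idx))
```

Splitting per label keeps rare classes present in both parts, which matters when class frequencies are uneven.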
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```
@inproceedings{gibert2018hate,
title = "{Hate Speech Dataset from a White Supremacy Forum}",
author = "de Gibert, Ona and
Perez, Naiara and
Garc{\'\i}a-Pablos, Aitor and
Cuadros, Montse",
booktitle = "Proceedings of the 2nd Workshop on Abusive Language Online ({ALW}2)",
month = oct,
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W18-5102",
doi = "10.18653/v1/W18-5102",
pages = "11--20",
}
```
### Contributions
Thanks to [@czabo](https://github.com/czabo) for adding this dataset.
Provider:
odegiber
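Summary of Original Information
Dataset Description
Dataset Summary
This dataset contains text extracted from Stormfront, a white supremacist forum. Forum posts randomly sampled from several subforums were split into sentences and manually labelled as containing hate speech or not, according to specific annotation guidelines.
Supported Tasks and Leaderboards
- Task category: text classification
- Task ID: intent classification
- Papers with Code ID: hate-speech
Languages
English
Dataset Structure
Data Instances
- Features:
  - text: string; the provided sentence
  - user_id: integer; used to rebuild the conversations the sentences belong to
  - subforum_id: integer; used to rebuild the conversations the sentences belong to
  - num_contexts: integer; number of previous posts the annotator had to read before making a decision
  - label: class label; one of noHate, hate, idk/skip, relation
Data Splits
- Train split:
  - name: train
  - num_bytes: 1375340
  - num_examples: 10944
Dataset Size
- Download size: 3664530
- Dataset size: 1375340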
Train-Eval Index
- Config: default
- Task: text classification
- Task ID: multi-class classification
- Train split: train
- Column mapping:
  - text: text
  - label: target
- Evaluation metrics:
  - Accuracy
  - F1 macro
  - F1 micro
  - F1 weighted
  - Precision macro
  - Precision micro
  - Precision weighted
  - Recall macro
  - Recall micro
  - Recall weighted
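
The macro, micro, and weighted variants listed above differ only in how per-class scores are combined. A minimal sketch for F1 in plain Python follows, assuming single-label predictions. The function name and the toy predictions are made up for illustration, and this is not the evaluation code used by any leaderboard.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return macro, micro, and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro: unweighted mean of per-class F1.
    macro = sum(per_class.values()) / len(labels)
    # Micro: F1 computed from counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: per-class F1 averaged by each class's share of true examples.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"macro": macro, "micro": micro, "weighted": weighted}


# Made-up predictions over two label ids (0 = noHate, 1 = hate).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(f1_scores(y_true, y_pred))
```

Macro averaging treats every class equally, so it is the most sensitive of the three to performance on rare classes.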



