davanstrien/gahd
Hugging Face, updated 2024-04-10, indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/davanstrien/gahd
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
language:
- de
pretty_name: GAHD
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/gahd.csv"
- config_name: gahd_disaggregated
  data_files:
  - split: train
    path: "data/gahd_disaggregated.csv"
---
**NOTE** README copied from https://github.com/jagol/gahd
This repository contains the dataset from our NAACL 2024 paper "Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset".
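The `default` and `gahd_disaggregated` configurations declared in the metadata header can presumably be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, assuming the repository id `davanstrien/gahd` from the download link and network access to the Hub; the helper `load_gahd` and the `CONFIG_FILES` mapping are hypothetical names introduced for illustration.

```python
# Repository id and config names are copied from the dataset card;
# `load_gahd` and `CONFIG_FILES` are illustrative names, not part of GAHD.
REPO_ID = "davanstrien/gahd"
CONFIG_FILES = {
    "default": "data/gahd.csv",
    "gahd_disaggregated": "data/gahd_disaggregated.csv",
}


def load_gahd(config_name="default"):
    """Load one GAHD configuration from the Hugging Face Hub (needs network access)."""
    if config_name not in CONFIG_FILES:
        raise ValueError(
            f"unknown config {config_name!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    # Imported lazily so the constants above can be used without `datasets` installed.
    from datasets import load_dataset

    # Both configs declare a single Hub split named "train".
    return load_dataset(REPO_ID, config_name, split="train")
```

Calling `load_gahd("gahd_disaggregated")` would return the single `train` split declared for that configuration. Note that both configurations expose only this one split on the Hub, so the train/dev/test partition has to be recovered from the `split` column of each row.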
`gahd.csv` contains the following columns:
- `gahd_id`: unique identifier of the entry
- `text`: text of the entry
- `label`: `0` = "not-hate speech", `1` = "hate speech"
- `round`: round in which the entry was created
- `split`: "train", "dev", or "test"
- `contrastive_gahd_id`: `gahd_id` of its contrastive example
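The columns above can be used to recover the train/dev/test partition and to pair each entry with its contrastive example. The sketch below uses only the standard library. The rows in `SAMPLE_CSV` are invented placeholders that mimic the column layout and are not real GAHD entries, and the sketch assumes that `contrastive_gahd_id` is empty when an entry has no contrastive example.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows that only mimic the column layout of gahd.csv;
# ids, texts, and round numbers are invented and NOT real GAHD entries.
SAMPLE_CSV = """gahd_id,text,label,round,split,contrastive_gahd_id
ex1,Beispieltext A,1,1,train,ex2
ex2,Beispieltext B,0,1,train,ex1
ex3,Beispieltext C,0,2,test,
"""


def read_rows(handle):
    """Read CSV rows into dicts and convert `label` to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["label"] = int(row["label"])
        rows.append(row)
    return rows


def group_by_split(rows):
    """Group rows by the value of their `split` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["split"]].append(row)
    return dict(groups)


def contrastive_pairs(rows):
    """Return (entry, contrast) pairs for entries whose contrastive id resolves."""
    by_id = {row["gahd_id"]: row for row in rows}
    pairs = []
    for row in rows:
        # Assumption: an empty contrastive_gahd_id means "no contrastive example".
        partner = by_id.get(row["contrastive_gahd_id"])
        if partner is not None:
            pairs.append((row, partner))
    return pairs


rows = read_rows(io.StringIO(SAMPLE_CSV))
splits = group_by_split(rows)
pairs = contrastive_pairs(rows)
```

To run this on the real file, replace `io.StringIO(SAMPLE_CSV)` with an open handle on `data/gahd.csv`, for example `open("data/gahd.csv", newline="", encoding="utf-8")`.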
`gahd_disaggregated.csv` contains the following additional columns:
- `source`:
  - if annotators entered the entry via the Dynabench interface: `dynabench`
  - if the entry was translated from the Vidgen et al. 2021 dataset: `translation`
  - if the entry stems from the Leipzig news corpus: `news`
- `model_prediction`: label predicted by the target model, `0` or `1`
- `annotator_id`: unique identifier of the annotator that created the entry
- `annotator_labels`: a string containing a forward slash-separated list of all labels by annotators
- `expert_labels`: `0` or `1` if an expert annotator annotated the entry, otherwise empty
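The disaggregated columns are stored as strings in the CSV, so they need a small amount of parsing before use. The following is a minimal sketch, assuming that `annotator_labels` looks like `"1/0/1"` and that label fields are written as `0`/`1` (a float-style spelling such as `1.0` is also tolerated). The helper names and the `model_disagrees` flag are hypothetical and introduced only for illustration.

```python
def to_label(value):
    """Convert a label string such as "1" or "1.0" to an int."""
    return int(float(value))


def parse_annotator_labels(value):
    """Split a forward slash-separated label string into a list of ints."""
    return [to_label(part) for part in value.split("/") if part.strip()]


def parse_expert_label(value):
    """Return the expert label as an int, or None when the field is empty."""
    value = value.strip()
    return to_label(value) if value else None


def summarize_entry(row):
    """Parse the disaggregated fields of one row (a dict of strings)."""
    return {
        "annotator_labels": parse_annotator_labels(row["annotator_labels"]),
        "expert_label": parse_expert_label(row["expert_labels"]),
        # True when the target model's prediction differs from the final label.
        "model_disagrees": to_label(row["label"]) != to_label(row["model_prediction"]),
    }


# Placeholder row with invented values; NOT a real GAHD entry.
example = summarize_entry(
    {"label": "1", "model_prediction": "0", "annotator_labels": "1/0/1", "expert_labels": ""}
)
```

The `row` argument is the kind of dict that `csv.DictReader` yields for each line of `gahd_disaggregated.csv`.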
When using GAHD, please cite our preprint on arXiv:
```
@misc{goldzycher2024improving,
      title={Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset},
      author={Janis Goldzycher and Paul Röttger and Gerold Schneider},
      year={2024},
      eprint={2403.19559},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Provider:
davanstrien
Summary of the original information
Dataset overview
Basic information
- License: CC-BY-4.0
- Task category: text classification
- Language: German
- Dataset name: GAHD
Configuration details
- `default` configuration
  - Data file: `data/gahd.csv`, split: train
- `gahd_disaggregated` configuration
  - Data file: `data/gahd_disaggregated.csv`, split: train
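Dataset contents
- `gahd.csv` columns:
  - `gahd_id`: unique identifier
  - `text`: text of the entry
  - `label`: `0` = "not-hate speech", `1` = "hate speech"
  - `round`: round in which the entry was created
  - `split`: "train", "dev", or "test"
  - `contrastive_gahd_id`: `gahd_id` of the contrastive example
- `gahd_disaggregated.csv` additional columns:
  - `source`: origin of the entry (`dynabench`, `translation`, or `news`)
  - `model_prediction`: label predicted by the target model, `0` or `1`
  - `annotator_id`: unique identifier of the annotator
  - `annotator_labels`: forward slash-separated list of the annotators' labels
  - `expert_labels`: `0` or `1` if an expert annotator labeled the entry, otherwise empty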
Citation information
- Paper: "Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset"
- Authors: Janis Goldzycher, Paul Röttger, Gerold Schneider
- Year: 2024
- Preprint: arXiv:2403.19559
- Category: cs.CL



