davanstrien/gahd

Name: davanstrien/gahd
Creator: davanstrien
Published: 2024-04-10 15:17:00
License: CC-BY-4.0

Hugging Face | Updated 2024-04-10 | Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/davanstrien/gahd

Resource description:

---
license: cc-by-4.0
task_categories:
- text-classification
language:
- de
pretty_name: GAHD
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/gahd.csv"
- config_name: gahd_disaggregated
  data_files:
  - split: train
    path: "data/gahd_disaggregated.csv"
---

**NOTE** README copied from https://github.com/jagol/gahd

This repository contains the dataset from our NAACL 2024 paper "Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset".

`gahd.csv` contains the following columns:

- `gahd_id`: unique identifier of the entry
- `text`: text of the entry
- `label`: `0` = "not-hate speech", `1` = "hate speech"
- `round`: round in which the entry was created
- `split`: "train", "dev", or "test"
- `contrastive_gahd_id`: `gahd_id` of its contrastive example

`gahd_disaggregated.csv` contains the following additional columns:

- `source`:
  - if annotators entered the entry via the Dynabench interface: `dynabench`
  - if the entry was translated from the Vidgen et al. 2021 dataset: `translation`
  - if the entry stems from the Leipzig news corpus: `news`
- `model_prediction`: label predicted by the target model, `0` or `1`
- `annotator_id`: unique identifier of the annotator that created the entry
- `annotator_labels`: a string containing a forward slash-separated list of all labels by annotators
- `expert_labels`: `0` or `1` if an expert annotator annotated the entry, otherwise empty

When using GAHD, please cite our preprint on arXiv:

```
@misc{goldzycher2024improving,
      title={Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset},
      author={Janis Goldzycher and Paul Röttger and Gerold Schneider},
      year={2024},
      eprint={2403.19559},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Provided by:

davanstrien

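Summary of original information

Dataset overview

Basic information

License: CC-BY-4.0
Task category: text classification
Language: German
Dataset name: GAHD

Configuration details

Default configuration
- Data file: data/gahd.csv
- Split: train
gahd_disaggregated configuration
- Data file: data/gahd_disaggregated.csv
- Split: train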
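
The two configurations above can be selected by name when loading the dataset. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper name `load_gahd` and the `CONFIG_FILES` mapping are illustrative and not part of the dataset itself.

```python
# Config names and CSV paths as listed in the dataset card.
CONFIG_FILES = {
    "default": "data/gahd.csv",
    "gahd_disaggregated": "data/gahd_disaggregated.csv",
}


def load_gahd(config_name="default"):
    """Load one GAHD configuration (hypothetical helper, not an official API)."""
    if config_name not in CONFIG_FILES:
        raise ValueError(
            f"Unknown config {config_name!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    # Imported lazily so the mapping above is usable without the library.
    from datasets import load_dataset

    # Both configurations declare a single "train" split at the repository level.
    return load_dataset("davanstrien/gahd", config_name, split="train")


if __name__ == "__main__":
    dataset = load_gahd("gahd_disaggregated")  # requires network access
    print(dataset.column_names)
```

Note that each configuration exposes only one repository-level split named `train`; the train/dev/test partition is carried by the `split` column inside the CSV, so it has to be applied by filtering rows.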

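Dataset contents

gahd.csv
- Columns:
  - gahd_id: unique identifier of the entry
  - text: text of the entry
  - label: label (0: "not-hate speech", 1: "hate speech")
  - round: round in which the entry was created
  - split: split type ("train", "dev", "test")
  - contrastive_gahd_id: gahd_id of the entry's contrastive example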
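
The `split` and `contrastive_gahd_id` columns can be used to partition the data and to recover contrastive pairs. The sketch below uses only the standard library and a few made-up rows; the ids and texts are placeholders, and it assumes that an entry without a contrastive example has an empty `contrastive_gahd_id` field, which the source does not state.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows that follow the gahd.csv column names; values are placeholders.
SAMPLE_CSV = """gahd_id,text,label,round,split,contrastive_gahd_id
a1,placeholder text one,1,1,train,a2
a2,placeholder text two,0,1,train,a1
b1,placeholder text three,0,2,dev,
c1,placeholder text four,1,3,test,
"""


def read_rows(handle):
    """Read rows as dicts and convert the label to an integer."""
    rows = []
    for row in csv.DictReader(handle):
        row["label"] = int(row["label"])
        rows.append(row)
    return rows


def group_by_split(rows):
    """Group rows by the value of their `split` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["split"]].append(row)
    return dict(groups)


def contrastive_pairs(rows):
    """Return each (entry, contrastive entry) pair once, skipping unknown ids."""
    by_id = {row["gahd_id"]: row for row in rows}
    seen = set()
    pairs = []
    for row in rows:
        partner_id = row["contrastive_gahd_id"]
        if not partner_id or partner_id not in by_id:
            continue
        key = frozenset((row["gahd_id"], partner_id))
        if key in seen:
            continue
        seen.add(key)
        pairs.append((row, by_id[partner_id]))
    return pairs


rows = read_rows(io.StringIO(SAMPLE_CSV))
splits = group_by_split(rows)
pairs = contrastive_pairs(rows)
```

To run this on the real file, replace `io.StringIO(SAMPLE_CSV)` with an open handle on a local copy of `data/gahd.csv` (opened with `newline=""` and `encoding="utf-8"`).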
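gahd_disaggregated.csv
- Additional columns:
  - source: origin of the entry (dynabench, translation, news)
  - model_prediction: label predicted by the target model (0 or 1)
  - annotator_id: unique identifier of the annotator who created the entry
  - annotator_labels: forward slash-separated list of all labels given by annotators
  - expert_labels: label given by an expert annotator (0 or 1); empty if no expert annotated the entry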
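
Because `annotator_labels` is stored as a slash-separated string and `expert_labels` may be empty, both need a small parsing step before analysis. The following is a sketch under the assumption that individual labels are written as plain `0` or `1`; the helper names and the example records are hypothetical.

```python
from typing import List, Optional


def parse_annotator_labels(raw: str) -> List[int]:
    """Turn a string such as "1/0/1" into [1, 0, 1]; empty parts are ignored."""
    return [int(part) for part in raw.split("/") if part.strip()]


def parse_expert_label(raw: str) -> Optional[int]:
    """Return the expert label as an int, or None when the field is empty."""
    raw = raw.strip()
    return int(raw) if raw else None


def is_unanimous(labels: List[int]) -> bool:
    """True when at least one label exists and all labels agree."""
    return len(set(labels)) == 1


# Hypothetical records using the gahd_disaggregated.csv column names.
records = [
    {"source": "dynabench", "model_prediction": "0",
     "annotator_labels": "1/1/1", "expert_labels": ""},
    {"source": "translation", "model_prediction": "1",
     "annotator_labels": "1/0/1", "expert_labels": "1"},
]

parsed = [
    {
        "source": record["source"],
        "model_prediction": int(record["model_prediction"]),
        "annotator_labels": parse_annotator_labels(record["annotator_labels"]),
        "expert_label": parse_expert_label(record["expert_labels"]),
    }
    for record in records
]

# Entries where annotators disagreed are the ones an expert label helps resolve.
disputed = [p for p in parsed if not is_unanimous(p["annotator_labels"])]
```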

Citation

Paper: "Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset"
Authors: Janis Goldzycher, Paul Röttger, Gerold Schneider
Year: 2024
Preprint: arXiv:2403.19559
Category: cs.CL
