
davanstrien/gahd

Hugging Face, updated 2024-04-10, indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/davanstrien/gahd
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
language:
- de
pretty_name: GAHD
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/gahd.csv"
- config_name: gahd_disaggregated
  data_files:
  - split: train
    path: "data/gahd_disaggregated.csv"
---

**NOTE** README copied from https://github.com/jagol/gahd

This repository contains the dataset from our NAACL 2024 paper "Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset".

`gahd.csv` contains the following columns:
- `gahd_id`: unique identifier of the entry
- `text`: text of the entry
- `label`: `0` = "not-hate speech", `1` = "hate speech"
- `round`: round in which the entry was created
- `split`: "train", "dev", or "test"
- `contrastive_gahd_id`: `gahd_id` of its contrastive example

`gahd_disaggregated.csv` contains the following additional columns:
- `source`:
  - if annotators entered the entry via the Dynabench interface: `dynabench`
  - if the entry was translated from the Vidgen et al. 2021 dataset: `translation`
  - if the entry stems from the Leipzig news corpus: `news`
- `model_prediction`: label predicted by the target model, `0` or `1`
- `annotator_id`: unique identifier of the annotator who created the entry
- `annotator_labels`: a string containing a forward-slash-separated list of all labels assigned by annotators
- `expert_labels`: `0` or `1` if an expert annotator annotated the entry, otherwise empty

When using GAHD, please cite our preprint on arXiv:

```
@misc{goldzycher2024improving,
  title={Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset},
  author={Janis Goldzycher and Paul Röttger and Gerold Schneider},
  year={2024},
  eprint={2403.19559},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
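Note that the Hub configs expose a single `train` split, while the README's `split` column carries the "train", "dev", or "test" assignment per entry. The following is a minimal sketch of how one might group entries by that column and look up a contrastive partner via `contrastive_gahd_id`. The sample rows, ids, and texts are hypothetical placeholders that only mimic the documented header, and the sketch assumes that `contrastive_gahd_id` is empty when an entry has no contrastive example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows that only mimic the gahd.csv header; the ids,
# texts and values are placeholders, not real GAHD entries.
SAMPLE_CSV = """gahd_id,text,label,round,split,contrastive_gahd_id
id_1,Beispieltext A,1,1,train,id_2
id_2,Beispieltext B,0,1,train,id_1
id_3,Beispieltext C,0,2,dev,
id_4,Beispieltext D,1,3,test,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Group entries by the value of the `split` column.
by_split = defaultdict(list)
for row in rows:
    by_split[row["split"]].append(row)

# Index entries by id so a contrastive partner can be looked up directly.
by_id = {row["gahd_id"]: row for row in rows}


def contrastive_partner(row):
    """Return the entry referenced by `contrastive_gahd_id`, or None.

    Assumption: the field is empty when there is no contrastive example.
    """
    return by_id.get(row["contrastive_gahd_id"])
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an open handle on a local copy of `data/gahd.csv`.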
Provided by:
davanstrien
Summary of Original Information

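Dataset Overview

Basic Information

  • License: CC-BY-4.0
  • Task category: text classification
  • Language: German
  • Dataset name: GAHD

Configuration Details

  • default config

    • Data file: data/gahd.csv
    • Split: train
  • gahd_disaggregated config

    • Data file: data/gahd_disaggregated.csv
    • Split: train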
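
The two configs listed above can be selected by name when loading the dataset. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. `CONFIG_FILES` and `load_gahd` are hypothetical helper names introduced here for illustration; only the repository id, config names, and file paths come from the dataset card.

```python
# Mapping from config name to data file, as listed in the dataset card.
CONFIG_FILES = {
    "default": "data/gahd.csv",
    "gahd_disaggregated": "data/gahd_disaggregated.csv",
}


def load_gahd(config="default"):
    """Load one GAHD config from the Hugging Face Hub (needs network access)."""
    if config not in CONFIG_FILES:
        raise ValueError(f"unknown config: {config!r}")
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    # Both configs define only a `train` split.
    return load_dataset("davanstrien/gahd", config, split="train")
```

Calling `load_gahd("gahd_disaggregated")` would return the variant with the additional annotation columns.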

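Dataset Contents

  • gahd.csv

    • Columns:
      • gahd_id: unique identifier of the entry
      • text: text of the entry
      • label: label (0: "not-hate speech", 1: "hate speech")
      • round: round in which the entry was created
      • split: split type ("train", "dev", "test")
      • contrastive_gahd_id: gahd_id of the contrastive example
  • gahd_disaggregated.csv

    • Additional columns:
      • source: data source (dynabench, translation, news)
      • model_prediction: label predicted by the target model (0 or 1)
      • annotator_id: unique identifier of the annotator
      • annotator_labels: list of labels assigned by annotators, separated by forward slashes
      • expert_labels: label assigned by an expert annotator (0 or 1); empty otherwise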
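
Because `annotator_labels` is stored as a slash-separated string and `expert_labels` may be empty, both fields need parsing before use. The following is a minimal sketch, assuming the labels are written as plain `0` and `1` (for example `"0/1/1"`). The function names are hypothetical, and the unanimity check is an illustrative addition, not the aggregation method used by the dataset authors.

```python
from typing import List, Optional


def parse_annotator_labels(value: str) -> List[int]:
    """Split a slash-separated label string such as "0/1/1" into integers."""
    return [int(part) for part in value.split("/") if part.strip()]


def parse_expert_label(value: str) -> Optional[int]:
    """Return the expert label as an int, or None when the field is empty."""
    value = value.strip()
    return int(value) if value else None


def is_unanimous(labels: List[int]) -> bool:
    """True when all annotators assigned the same label."""
    return len(set(labels)) == 1
```

For instance, `parse_annotator_labels("0/1/1")` yields `[0, 1, 1]`, which `is_unanimous` reports as a disagreement.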

Citation Information

  • Paper: "Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset"
  • Authors: Janis Goldzycher, Paul Röttger, Gerold Schneider
  • Year: 2024
  • Preprint: arXiv:2403.19559
  • Category: cs.CL