
Paul/hatecheck-spanish

Source: Hugging Face | Updated: 2022-07-05 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/Paul/hatecheck-spanish
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- expert-generated
language:
- es
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: Spanish HateCheck
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- hate-speech-detection
---

# Dataset Card for Multilingual HateCheck

## Dataset Description

Multilingual HateCheck (MHC) is a suite of functional tests for hate speech detection models in 10 different languages: Arabic, Dutch, French, German, Hindi, Italian, Mandarin, Polish, Portuguese and Spanish. For each language, there are 25+ functional tests that correspond to distinct types of hate and challenging non-hate. This allows for targeted diagnostic insights into model performance.

For more details, please refer to our paper about MHC, published at the 2022 Workshop on Online Abuse and Harms (WOAH) at NAACL 2022. If you are using MHC, please cite our work!

- **Paper:** Röttger et al. (2022) - Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models. https://arxiv.org/abs/2206.09917
- **Repository:** https://github.com/rewire-online/multilingual-hatecheck
- **Point of Contact:** paul@rewire.online

## Dataset Structure

The csv format mostly matches the original HateCheck data, with some adjustments for specific languages.

**mhc_case_id**
The test case ID that is unique to each test case across languages (e.g., "mandarin-1305")

**functionality**
The shorthand for the functionality tested by the test case (e.g., "target_obj_nh"). The same functionalities are tested in all languages, except for Mandarin and Arabic, where non-Latin script required adapting the tests for spelling variations.

**test_case**
The test case text.

**label_gold**
The gold standard label ("hateful" or "non-hateful") of the test case. All test cases within a given functionality have the same gold standard label.

**target_ident**
Where applicable, the protected group that is targeted or referenced in the test case. All HateChecks cover seven target groups, but their composition varies across languages.

**ref_case_id**
For hateful cases, where applicable, the ID of the hateful case which was perturbed to generate this test case. For non-hateful cases, where applicable, the ID of the hateful case which is contrasted by this test case.

**ref_templ_id**
The equivalent to ref_case_id, but for template IDs.

**templ_id**
The ID of the template from which the test case was generated.

**case_templ**
The template from which the test case was generated (where applicable).

**gender_male** and **gender_female**
For gender-inflected languages (French, Spanish, Portuguese, Hindi, Arabic, Italian, Polish, German), only for cases where gender inflection is relevant, separate entries for gender_male and gender_female replace case_templ.

**label_annotated**
A list of labels given by the three annotators who reviewed the test case (e.g., "['hateful', 'hateful', 'hateful']").

**label_annotated_maj**
The majority vote of the three annotators (e.g., "hateful"). In some cases this differs from the gold label given by our language experts.

**disagreement_in_case**
True if label_annotated_maj does not match label_gold for the entry.

**disagreement_in_template**
True if the test case is generated from an IDENT template and there is at least one case with disagreement_in_case generated from the same template. This can be used to exclude entire templates from MHC.
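As an illustration of how the columns described in this card might be used together, the following is a minimal sketch that drops test cases flagged with disagreement_in_template and then scores a classifier per functionality. It is not taken from the MHC repository. The sample rows, the functionality name `example_hate_h`, the helper names `load_cases` and `accuracy_by_functionality`, and the `always_hateful` baseline are all hypothetical, and the sketch assumes the boolean column is serialised as the text `True` or `False`.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the documented column layout (NOT real MHC data).
# "example_hate_h" is a hypothetical functionality name.
SAMPLE_CSV = """mhc_case_id,functionality,test_case,label_gold,disagreement_in_template
spanish-1,example_hate_h,placeholder text a,hateful,False
spanish-2,example_hate_h,placeholder text b,hateful,True
spanish-3,target_obj_nh,placeholder text c,non-hateful,False
spanish-4,target_obj_nh,placeholder text d,non-hateful,False
"""


def load_cases(handle, drop_disputed_templates=True):
    """Read test cases, optionally skipping templates with annotator disagreement."""
    rows = []
    for row in csv.DictReader(handle):
        # Assumption: the flag is stored as the string "True" or "False".
        disputed = row["disagreement_in_template"].strip().lower() == "true"
        if drop_disputed_templates and disputed:
            continue
        rows.append(row)
    return rows


def accuracy_by_functionality(rows, predict):
    """Return {functionality: accuracy} for a predict(text) -> label callable."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row["functionality"]] += 1
        if predict(row["test_case"]) == row["label_gold"]:
            hits[row["functionality"]] += 1
    return {name: hits[name] / totals[name] for name in totals}


def always_hateful(text):
    """Trivial stand-in for a real model: labels every input as hateful."""
    return "hateful"


cases = load_cases(io.StringIO(SAMPLE_CSV))
scores = accuracy_by_functionality(cases, always_hateful)
print(scores)
```

Reporting accuracy per functionality, instead of one aggregate number, is what surfaces the kind of targeted weakness the card describes: a model that over-predicts "hateful" would score well on hateful functionalities and poorly on the challenging non-hate ones.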
Provider:
Paul
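Summary of original information

Dataset overview

Dataset name

  • Name: Spanish HateCheck

Dataset description

  • Purpose: functional tests for evaluating hate speech detection models on Spanish text.
  • Content: 25+ functional tests, each corresponding to a distinct type of hate or challenging non-hate.
  • Language: Spanish

Dataset structure

  • Format: CSV
  • Fields:
    • mhc_case_id: test case ID, unique across languages.
    • functionality: shorthand for the functionality tested by the test case.
    • test_case: the test case text.
    • label_gold: gold standard label of the test case ("hateful" or "non-hateful").
    • target_ident: where applicable, the protected group targeted or referenced.
    • ref_case_id: where applicable, the ID of the hateful case that this case perturbs (hateful cases) or contrasts (non-hateful cases).
    • ref_templ_id: the equivalent of ref_case_id for template IDs.
    • templ_id: ID of the template from which the test case was generated.
    • case_templ: the template from which the test case was generated (where applicable).
    • gender_male, gender_female: for gender-inflected languages, separate entries that replace case_templ where gender inflection is relevant.
    • label_annotated: list of labels given by the three annotators.
    • label_annotated_maj: majority vote of the three annotators.
    • disagreement_in_case: True if label_annotated_maj does not match label_gold.
    • disagreement_in_template: True if the test case comes from an IDENT template and at least one case generated from the same template has disagreement_in_case set.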
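
The relationship between label_annotated, label_annotated_maj and disagreement_in_case can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code. It assumes label_annotated is stored as a Python-style list literal, as the example "['hateful', 'hateful', 'hateful']" suggests, and the helper names `parse_labels`, `majority_label` and `disagreement_in_case` are hypothetical.

```python
import ast
from collections import Counter


def parse_labels(raw):
    """Parse a label_annotated cell such as "['hateful', 'hateful', 'hateful']"."""
    # Assumption: the cell is a Python-style list literal of strings.
    labels = ast.literal_eval(raw)
    if not isinstance(labels, list):
        raise ValueError(f"expected a list literal, got: {raw!r}")
    return labels


def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def disagreement_in_case(raw_labels, label_gold):
    """True when the annotators' majority vote differs from the gold label."""
    return majority_label(parse_labels(raw_labels)) != label_gold


example = "['hateful', 'non-hateful', 'hateful']"
print(majority_label(parse_labels(example)))
print(disagreement_in_case(example, "hateful"))
```

With three annotators and two possible labels a strict majority always exists, so the `None` branch only matters if the helper is reused on cells with a different number of labels.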

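Dataset characteristics

  • Multilinguality: monolingual (Spanish)
  • License: CC-BY-4.0
  • Data source: original
  • Task category: text classification
  • Task ID: hate speech detection
  • Dataset size: 1K<n<10K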