Paul/hatecheck-spanish
Source: Hugging Face · Updated 2022-07-05 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Paul/hatecheck-spanish
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- expert-generated
language:
- es
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: Spanish HateCheck
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- hate-speech-detection
---
# Dataset Card for Multilingual HateCheck
## Dataset Description
Multilingual HateCheck (MHC) is a suite of functional tests for hate speech detection models in 10 different languages: Arabic, Dutch, French, German, Hindi, Italian, Mandarin, Polish, Portuguese and Spanish.
For each language, there are 25+ functional tests that correspond to distinct types of hate and challenging non-hate.
This allows for targeted diagnostic insights into model performance.
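As a minimal sketch of what such a diagnostic could look like, the snippet below computes accuracy separately for each functional test. It assumes the rows carry the `functionality` and `label_gold` columns described under Dataset Structure and that model predictions use the same label strings. The function name, the `example_hateful_func` functionality, and the placeholder texts are hypothetical and not part of MHC.

```python
from collections import defaultdict


def accuracy_by_functionality(rows, predictions):
    """Return {functionality: accuracy} for rows paired with predicted labels."""
    if len(rows) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, pred in zip(rows, predictions):
        func = row["functionality"]
        total[func] += 1
        correct[func] += int(pred == row["label_gold"])
    return {func: correct[func] / total[func] for func in total}


# Hypothetical rows: the texts are placeholders, not real MHC test cases,
# and "example_hateful_func" is an invented functionality name.
rows = [
    {"functionality": "target_obj_nh", "test_case": "<placeholder 1>", "label_gold": "non-hateful"},
    {"functionality": "target_obj_nh", "test_case": "<placeholder 2>", "label_gold": "non-hateful"},
    {"functionality": "example_hateful_func", "test_case": "<placeholder 3>", "label_gold": "hateful"},
]
# Hypothetical model output, one label per row.
predictions = ["non-hateful", "hateful", "hateful"]

scores = accuracy_by_functionality(rows, predictions)
print(scores)
```

A low score on a single functionality points at a specific weakness (here, the invented model over-flags `target_obj_nh` cases) that an aggregate accuracy figure would hide.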
For more details, please refer to our paper about MHC, published at the 2022 Workshop on Online Abuse and Harms (WOAH) at NAACL 2022. If you are using MHC, please cite our work!
- **Paper:** Röttger et al. (2022) - Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models. https://arxiv.org/abs/2206.09917
- **Repository:** https://github.com/rewire-online/multilingual-hatecheck
- **Point of Contact:** paul@rewire.online
## Dataset Structure
The CSV format mostly matches the original HateCheck data, with some adjustments for specific languages.
**mhc_case_id**
The test case ID that is unique to each test case across languages (e.g., "mandarin-1305")
**functionality**
The shorthand for the functionality tested by the test case (e.g., "target_obj_nh"). The same functionalities are tested in all languages, except for Mandarin and Arabic, where non-Latin script required adapting the tests for spelling variations.
**test_case**
The test case text.
**label_gold**
The gold standard label ("hateful" or "non-hateful") of the test case. All test cases within a given functionality have the same gold standard label.
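The one-label-per-functionality property can be checked after loading the data. The following is a small sketch, assuming each row is a dict with `functionality` and `label_gold` keys; the helper name and the sample rows are hypothetical.

```python
from collections import defaultdict


def functionalities_with_mixed_labels(rows):
    """Return {functionality: sorted labels} for functionalities with more than one gold label."""
    labels = defaultdict(set)
    for row in rows:
        labels[row["functionality"]].add(row["label_gold"])
    return {func: sorted(found) for func, found in labels.items() if len(found) > 1}


# Hypothetical rows; "example_hateful_func" is an invented functionality name.
consistent_rows = [
    {"functionality": "target_obj_nh", "label_gold": "non-hateful"},
    {"functionality": "target_obj_nh", "label_gold": "non-hateful"},
    {"functionality": "example_hateful_func", "label_gold": "hateful"},
]
print(functionalities_with_mixed_labels(consistent_rows))
```

An empty result means the property holds for the loaded rows; any entry in the result suggests a loading or preprocessing problem.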
**target_ident**
Where applicable, the protected group that is targeted or referenced in the test case. All HateChecks cover seven target groups, but their composition varies across languages.
**ref_case_id**
For hateful cases, where applicable, the ID of the hateful case which was perturbed to generate this test case. For non-hateful cases, where applicable, the ID of the hateful case which is contrasted by this test case.
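To follow such a reference, one option is to index the cases by ID and look up the referenced row. This sketch assumes that `ref_case_id` stores the same kind of identifier as `mhc_case_id` and is empty where no reference applies; both points are assumptions to verify against the CSV. The IDs, functionality names, and helper name below are hypothetical.

```python
def referenced_case(row, cases_by_id):
    """Return the row referenced by row['ref_case_id'], or None if there is none."""
    ref_id = (row.get("ref_case_id") or "").strip()
    if not ref_id:
        return None
    return cases_by_id.get(ref_id)


# Hypothetical rows; IDs and functionality names are invented for illustration.
rows = [
    {"mhc_case_id": "spanish-1", "functionality": "example_hateful_func",
     "label_gold": "hateful", "ref_case_id": ""},
    {"mhc_case_id": "spanish-2", "functionality": "example_contrast_nh",
     "label_gold": "non-hateful", "ref_case_id": "spanish-1"},
]
cases_by_id = {row["mhc_case_id"]: row for row in rows}

contrast = rows[1]
source = referenced_case(contrast, cases_by_id)
print(contrast["mhc_case_id"], "references", source["mhc_case_id"])
```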
**ref_templ_id**
The equivalent to ref_case_id, but for template IDs.
**templ_id**
The ID of the template from which the test case was generated.
**case_templ**
The template from which the test case was generated (where applicable).
**gender_male** and **gender_female**
For gender-inflected languages (French, Spanish, Portuguese, Hindi, Arabic, Italian, Polish, German), only for cases where gender inflection is relevant, separate entries for gender_male and gender_female replace case_templ.
**label_annotated**
A list of labels given by the three annotators who reviewed the test case (e.g., "['hateful', 'hateful', 'hateful']").
**label_annotated_maj**
The majority vote of the three annotators (e.g., "hateful"). In some cases this differs from the gold label given by our language experts.
**disagreement_in_case**
True if label_annotated_maj does not match label_gold for the entry.
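Because `label_annotated` is stored as the string form of a list, it has to be parsed before use. The sketch below parses it with `ast.literal_eval`, takes the majority vote, and recomputes the disagreement flag. It assumes exactly the two label strings shown above, so that three annotators always yield a strict majority; the helper names and sample row are hypothetical.

```python
import ast
from collections import Counter


def majority_label(label_annotated):
    """Parse a stringified list of labels and return the most frequent one."""
    labels = ast.literal_eval(label_annotated)
    label, _count = Counter(labels).most_common(1)[0]
    return label


def disagrees_with_gold(row):
    """True if the annotators' majority label differs from the gold label."""
    return majority_label(row["label_annotated"]) != row["label_gold"]


# Hypothetical row for illustration.
row = {
    "label_gold": "hateful",
    "label_annotated": "['hateful', 'non-hateful', 'non-hateful']",
}
print(majority_label(row["label_annotated"]), disagrees_with_gold(row))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of field.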
**disagreement_in_template**
True if the test case is generated from an IDENT template and there is at least one case with disagreement_in_case generated from the same template. This can be used to exclude entire templates from MHC.
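A minimal sketch of that exclusion, using only the standard library: it reads CSV rows and drops every case whose template is flagged. It assumes the flag is serialized as the text `True` or `False`, which should be checked against the actual file. The inline CSV, its IDs, and the helper names are hypothetical.

```python
import csv
import io


def is_true(value):
    """Interpret a CSV cell (or a bool) as a boolean flag."""
    return str(value).strip().lower() == "true"


def drop_disputed_templates(rows):
    """Keep only rows whose template has no annotator disagreement."""
    return [row for row in rows if not is_true(row["disagreement_in_template"])]


# Hypothetical CSV excerpt with a reduced set of columns.
sample_csv = """mhc_case_id,templ_id,label_gold,disagreement_in_case,disagreement_in_template
spanish-1,1,hateful,False,False
spanish-2,2,hateful,True,True
spanish-3,2,hateful,False,True
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
kept = drop_disputed_templates(rows)
print([row["mhc_case_id"] for row in kept])
```

In this example the case `spanish-3` is dropped even though its own annotation agrees with the gold label, because it shares a template with a disputed case.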
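Provided by:
Paul
Summary of original information
Dataset overview
Dataset name
- Name: Spanish HateCheck
Dataset description
- Purpose: functional testing of hate speech detection models for Spanish.
- Content: 25+ functional tests covering distinct types of hate and challenging non-hate.
- Language: Spanish
Dataset structure
- Format: CSV
- Fields:
  - mhc_case_id: test case ID, unique across languages.
  - functionality: shorthand for the functionality tested by the test case.
  - test_case: the test case text.
  - label_gold: gold standard label of the test case ("hateful" or "non-hateful").
  - target_ident: the protected group targeted or referenced, where applicable.
  - ref_case_id: ID of the referenced case, where applicable.
  - ref_templ_id: ID of the referenced template.
  - templ_id: ID of the template from which the test case was generated.
  - case_templ: the template from which the test case was generated, where applicable.
  - gender_male and gender_female: gender-specific entries for gender-inflected languages.
  - label_annotated: list of labels given by the three annotators.
  - label_annotated_maj: majority vote of the three annotators.
  - disagreement_in_case: True if the majority vote does not match the gold standard label.
  - disagreement_in_template: True if at least one case generated from the same template has such a mismatch.
Dataset characteristics
- Multilinguality: monolingual (Spanish)
- License: CC-BY-4.0
- Source data: original
- Task category: text classification
- Task ID: hate speech detection
- Dataset size: 1K<n<10K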



