Paul/hatecheck-dutch

Name: Paul/hatecheck-dutch
Creator: Paul
Published: 2022-07-05 10:41:31
License: CC-BY-4.0 (per the dataset card below; the listing's own license field is empty)

Source: Hugging Face. Updated 2022-07-05; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Paul/hatecheck-dutch

Resource description:

---
annotations_creators:
- crowdsourced
language_creators:
- expert-generated
language:
- nl
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: Dutch HateCheck
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- hate-speech-detection
---

# Dataset Card for Multilingual HateCheck

## Dataset Description

Multilingual HateCheck (MHC) is a suite of functional tests for hate speech detection models in 10 different languages: Arabic, Dutch, French, German, Hindi, Italian, Mandarin, Polish, Portuguese and Spanish. For each language, there are 25+ functional tests that correspond to distinct types of hate and challenging non-hate. This allows for targeted diagnostic insights into model performance.

For more details, please refer to our paper about MHC, published at the 2022 Workshop on Online Abuse and Harms (WOAH) at NAACL 2022. If you are using MHC, please cite our work!

- **Paper:** Röttger et al. (2022) - Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models. https://arxiv.org/abs/2206.09917
- **Repository:** https://github.com/rewire-online/multilingual-hatecheck
- **Point of Contact:** paul@rewire.online

## Dataset Structure

The csv format mostly matches the original HateCheck data, with some adjustments for specific languages.

**mhc_case_id**
The test case ID that is unique to each test case across languages (e.g., "mandarin-1305")

**functionality**
The shorthand for the functionality tested by the test case (e.g., "target_obj_nh"). The same functionalities are tested in all languages, except for Mandarin and Arabic, where non-Latin script required adapting the tests for spelling variations.

**test_case**
The test case text.

**label_gold**
The gold standard label ("hateful" or "non-hateful") of the test case. All test cases within a given functionality have the same gold standard label.

**target_ident**
Where applicable, the protected group that is targeted or referenced in the test case. All HateChecks cover seven target groups, but their composition varies across languages.

**ref_case_id**
For hateful cases, where applicable, the ID of the hateful case which was perturbed to generate this test case. For non-hateful cases, where applicable, the ID of the hateful case which is contrasted by this test case.

**ref_templ_id**
The equivalent to ref_case_id, but for template IDs.

**templ_id**
The ID of the template from which the test case was generated.

**case_templ**
The template from which the test case was generated (where applicable).

**gender_male** and **gender_female**
For gender-inflected languages (French, Spanish, Portuguese, Hindi, Arabic, Italian, Polish, German), only for cases where gender inflection is relevant, separate entries for gender_male and gender_female replace case_templ.

**label_annotated**
A list of labels given by the three annotators who reviewed the test case (e.g., "['hateful', 'hateful', 'hateful']").

**label_annotated_maj**
The majority vote of the three annotators (e.g., "hateful"). In some cases this differs from the gold label given by our language experts.

**disagreement_in_case**
True if label_annotated_maj does not match label_gold for the entry.

**disagreement_in_template**
True if the test case is generated from an IDENT template and there is at least one case with disagreement_in_case generated from the same template. This can be used to exclude entire templates from MHC.
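The relationship between label_annotated, label_annotated_maj and disagreement_in_case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two rows are invented placeholders (the case IDs and the functionality name "hypothetical_func_h" are not taken from the dataset), and the helper names `majority_label` and `add_disagreement` are hypothetical. The one detail taken from the card is that label_annotated is stored as a string that looks like a Python list.

```python
import ast
from collections import Counter

# Invented placeholder rows shaped like the MHC csv columns described above.
# IDs and the second functionality name are hypothetical, not real dataset values.
rows = [
    {"mhc_case_id": "dutch-1", "functionality": "target_obj_nh",
     "label_gold": "non-hateful",
     "label_annotated": "['non-hateful', 'non-hateful', 'hateful']"},
    {"mhc_case_id": "dutch-2", "functionality": "hypothetical_func_h",
     "label_gold": "hateful",
     "label_annotated": "['non-hateful', 'non-hateful', 'hateful']"},
]


def majority_label(label_annotated: str) -> str:
    """Parse the stringified list of annotator labels and return the most common one."""
    labels = ast.literal_eval(label_annotated)  # safe parsing of the list literal
    return Counter(labels).most_common(1)[0][0]


def add_disagreement(row: dict) -> dict:
    """Return a copy of the row with the majority label and a disagreement flag."""
    maj = majority_label(row["label_annotated"])
    return {**row,
            "label_annotated_maj": maj,
            "disagreement_in_case": maj != row["label_gold"]}


checked = [add_disagreement(r) for r in rows]
for r in checked:
    print(r["mhc_case_id"], r["label_annotated_maj"], r["disagreement_in_case"])
```

With three annotators and two possible labels there is always a strict majority, so no tie-breaking rule is needed in this sketch.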

Provider:

Paul

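Summary of original information

Dataset overview

Basic information

Name: Dutch HateCheck
Language: Dutch (nl)
License: CC-BY-4.0
Multilinguality: monolingual
Dataset size: 1K<n<10K
Data source: original
Task category: text classification
Task ID: hate speech detection (hate-speech-detection)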

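Dataset structure

File format: CSV
Main fields:
- mhc_case_id: test case ID, unique across languages
- functionality: shorthand for the functionality tested by the test case
- test_case: the test case text
- label_gold: gold standard label of the test case ("hateful" or "non-hateful")
- target_ident: where applicable, the protected group targeted or referenced in the test case
- ref_case_id: for hateful cases, the ID of the hateful case that was perturbed to generate this case; for non-hateful cases, the ID of the hateful case that this case contrasts with
- ref_templ_id: the equivalent of ref_case_id, but for template IDs
- templ_id: ID of the template from which the test case was generated
- case_templ: the template from which the test case was generated (where applicable)
- gender_male and gender_female: for gender-inflected languages, separate male and female entries that replace case_templ where gender inflection is relevant
- label_annotated: list of labels given by the three annotators
- label_annotated_maj: majority vote of the three annotators
- disagreement_in_case: True if label_annotated_maj does not match label_gold
- disagreement_in_template: True if the test case comes from an IDENT template and at least one case generated from the same template has disagreement_in_case set to True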
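As a rough sketch of how these fields might be used together, the snippet below computes per-functionality accuracy for a model while optionally skipping cases flagged by disagreement_in_template. Everything beyond the column names is an assumption: the rows are invented, the functionality names "func_a_h" and "func_b_nh" are placeholders, `prediction` is a hypothetical column holding a model's output, and `accuracy_by_functionality` is an illustrative helper rather than part of any MHC tooling.

```python
from collections import defaultdict

# Invented rows; `prediction` is an assumed extra column with a model's output.
rows = [
    {"functionality": "func_a_h", "label_gold": "hateful",
     "prediction": "hateful", "disagreement_in_template": False},
    {"functionality": "func_a_h", "label_gold": "hateful",
     "prediction": "non-hateful", "disagreement_in_template": False},
    {"functionality": "func_b_nh", "label_gold": "non-hateful",
     "prediction": "non-hateful", "disagreement_in_template": False},
    {"functionality": "func_b_nh", "label_gold": "non-hateful",
     "prediction": "hateful", "disagreement_in_template": True},
]


def accuracy_by_functionality(rows, drop_disputed_templates=True):
    """Share of predictions matching label_gold, grouped by functionality.

    When drop_disputed_templates is True, cases whose template has any
    annotator/gold disagreement are left out before scoring.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        if drop_disputed_templates and row["disagreement_in_template"]:
            continue
        totals[row["functionality"]] += 1
        hits[row["functionality"]] += row["prediction"] == row["label_gold"]
    return {name: hits[name] / totals[name] for name in totals}


scores = accuracy_by_functionality(rows)
print(scores)
```

When the rows come from the csv file rather than from Python literals, the flag columns may arrive as the strings "True" and "False" and would need converting to booleans before this filter behaves as intended.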
