LanD-FBK/ML_MTCONAN_KN

Name: LanD-FBK/ML_MTCONAN_KN
Creator: LanD-FBK
Published: 2024-10-30 13:48:31
License: no description available

Source: Hugging Face. Updated 2024-10-30. Indexed 2025-04-19.

Download link:

https://hf-mirror.com/datasets/LanD-FBK/ML_MTCONAN_KN

Resource description:

---
language:
- en
- es
- it
- eu
---

# 1st Workshop on Multilingual Counterspeech generation: Shared Task

## Dataset description

The dataset consists of 596 Hate Speech (HS) - Counter Narrative (CN) pairs. The hate speech is taken from [MTCONAN](https://github.com/marcoguerini/CONAN/tree/master/Multitarget-CONAN), while the counter narratives are newly generated. Each pair comes with 5 background knowledge sentences, some of which are relevant for obtaining the counter narrative.

The dataset is available in 4 languages (Basque, English, Italian and Spanish). The knowledge sentences in the other languages are automatically translated from the English version, whereas the hate speech and counter narrative translations were manually checked.

The dataset is divided into the following splits:

- Development: 100 pairs [AVAILABLE NOW!]
- Train: 396 pairs [AVAILABLE NOW!]
- Test: 100 pairs [TBA]

To score the shared task participants, the CNs will be kept hidden during the shared task, while the HS and the knowledge sentences will be released so that participants can prepare their submissions.

The dataset covers multiple targets of hate: Jews, LGBT+, Migrants, Muslims, People of color and Women.

Participants can also use the manually curated CONAN data in the following languages: (1) [English, French, Italian](https://github.com/marcoguerini/CONAN/tree/master/CONAN), (2) [Basque and Spanish](https://github.com/ixa-ehu/conan-e).

## File description

We provide the data in JSON and CSV format. Each entry in the dataset has the following fields:

- ``MTCONAN_ID``: the index of the corresponding hate speech in the MTCONAN dataset
- ``HS``: the hate speech
- ``KN``: the background knowledge
- ``KN_CN``: the counter narrative
- ``TARGET``: the target of hate
- ``LANG``: the language
- ``SPLIT``: the dataset split
- ``PAIR_ID``: an identifier for each HS - KN_CN pair; versions of the same pair in different languages share the same ``PAIR_ID``
- ``ID``: a unique identifier for each pair in each language, obtained by concatenating the ``PAIR_ID`` and ``LANG`` (e.g. "IT01")
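
As an illustration of how the fields above fit together, here is a minimal Python sketch that reads CSV rows, filters them by ``LANG`` and ``SPLIT``, and groups the language versions of each pair by ``PAIR_ID``. Only the column names come from the description above. The sample rows, the value spellings (such as `EN` or `train`), and the helper names `read_rows`, `select` and `group_by_pair` are assumptions made for this example, and the `ID` values are modeled on the "IT01" example rather than on the actual files.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows: column names follow the dataset card, but every value
# (including how LANG, SPLIT and PAIR_ID are spelled) is made up here.
SAMPLE_CSV = """MTCONAN_ID,HS,KN,KN_CN,TARGET,LANG,SPLIT,PAIR_ID,ID
12,<hate speech EN>,<knowledge EN>,<counter narrative EN>,WOMEN,EN,train,01,EN01
12,<hate speech IT>,<knowledge IT>,<counter narrative IT>,WOMEN,IT,train,01,IT01
40,<hate speech EN>,<knowledge EN>,<counter narrative EN>,MIGRANTS,EN,dev,02,EN02
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def select(rows, lang=None, split=None):
    """Keep rows matching the given LANG and/or SPLIT (None means no filter)."""
    return [
        r for r in rows
        if (lang is None or r["LANG"] == lang)
        and (split is None or r["SPLIT"] == split)
    ]


def group_by_pair(rows):
    """Map each PAIR_ID to a dict of {LANG: row} with its language versions."""
    groups = defaultdict(dict)
    for r in rows:
        groups[r["PAIR_ID"]][r["LANG"]] = r
    return dict(groups)


rows = read_rows(SAMPLE_CSV)
english_train = select(rows, lang="EN", split="train")
pairs = group_by_pair(rows)

print([r["ID"] for r in english_train])  # IDs of English training rows
print(sorted(pairs["01"]))               # languages available for pair "01"
```

To try this on the released files, replace `SAMPLE_CSV` with the contents of the CSV file and adjust the filter values to the spellings actually used there.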

Provider:

LanD-FBK
