SINAI/CONAN-MT-SP
Source: Hugging Face. Updated 2024-05-20. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/SINAI/CONAN-MT-SP
Resource description:
---
license: cc-by-sa-4.0
language:
- es
tags:
- Counternarrative
- Counter-speech
pretty_name: CONAN-MT-SP
---
# CONAN-MT-SP
CONAN-MT-SP is a new dataset for Spanish counter-narratives. Each instance includes a hate-speech comment (HS) and the corresponding counter-narrative (CN).
# How is it constructed?
The English CONAN Multitarget (CONAN-MT) corpus ([Margherita Fanton et al., 2021](https://aclanthology.org/2021.acl-long.250.pdf)) is taken as the starting point and automatically translated with the DeepL API to obtain the CONAN-MT-SP (CONAN Multitarget in Spanish) corpus. CONAN-MT consists of 5003 HS-CN pairs covering multiple hate targets (DISABLED, JEWS, LGBT+, MIGRANTS, MUSLIMS, PEOPLE OF COLOR (POC), WOMEN).
GPT-4 is then applied to the HS part of this corpus: each HS is given to the model in a prompt together with 8 counter-narrative (CN) examples.
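The few-shot prompting described here can be sketched roughly as follows. This is a minimal illustration, not the authors' actual prompt: the function name `build_few_shot_prompt`, the `HS:`/`CN:` layout, and all placeholder strings are assumptions, since the real task description and example pairs are not reproduced in this card.

```python
from typing import List, Tuple


def build_few_shot_prompt(task_description: str,
                          examples: List[Tuple[str, str]],
                          hate_speech: str) -> str:
    """Assemble a prompt: task description, HS-CN examples, then the new HS."""
    parts = [task_description.strip(), ""]
    for hs, cn in examples:
        parts.append(f"HS: {hs}")
        parts.append(f"CN: {cn}")
        parts.append("")
    # The last block leaves the CN empty so that the model completes it.
    parts.append(f"HS: {hate_speech}")
    parts.append("CN:")
    return "\n".join(parts)


# Hypothetical placeholders standing in for the 8 HS-CN example pairs.
examples = [(f"<HS example {i}>", f"<CN example {i}>") for i in range(1, 9)]
prompt = build_few_shot_prompt("<task description>", examples, "<new HS>")
print(prompt)
```

The resulting string would then be sent to the model through whatever client is in use; that call is left out here because the card does not specify it.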
Each instance of the corpus consists of the HS and CN translated directly into Spanish with DeepL from the CONAN Multitarget corpus, plus the CN generated by GPT-4. Evaluations by human experts are also included as part of the CONAN-MT-SP corpus.
To construct CONAN-MT-SP, we remove the pairs that contain duplicate hate-speech texts and the examples used in the prompt given to the model to generate the counter-narrative. The prompting strategy used with GPT-4 consists of a task description and 8 examples of HS-CN pairs (one for each target).
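The filtering step described above might look like the sketch below. It rests on several assumptions that the card does not confirm: the field names `hs` and `cn` are hypothetical, duplicates are detected by exact match on whitespace-stripped text, and the first occurrence of a duplicated HS is kept rather than all copies being dropped.

```python
from typing import Dict, List, Set


def filter_pairs(pairs: List[Dict[str, str]],
                 prompt_hs: Set[str]) -> List[Dict[str, str]]:
    """Drop pairs whose HS was used in the prompt or was already seen."""
    seen: Set[str] = set()
    kept: List[Dict[str, str]] = []
    for pair in pairs:
        key = pair["hs"].strip()
        # Skip HS texts used as prompt examples and repeated HS texts.
        if key in prompt_hs or key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept


# Toy data with placeholder strings instead of real corpus text.
pairs = [
    {"hs": "A", "cn": "a1"},
    {"hs": "B", "cn": "b1"},
    {"hs": "A", "cn": "a2"},
    {"hs": "C", "cn": "c1"},
]
filtered = filter_pairs(pairs, prompt_hs={"B"})
print(filtered)
```

In this toy run, the pair with HS `B` is removed because it was a prompt example, and the second pair with HS `A` is removed as a duplicate.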
Each CONAN-MT-SP instance contains the hate speech and counter-narrative provided by CONAN-MT and the counter-narrative generated by GPT-4. We do not apply any filter to the CNs generated by GPT-4. In addition, each instance is annotated with the values of the metrics used in the manual evaluation carried out by humans.
The evaluation metrics are:
- Offensiveness:
  - 0 (not sure)
  - 1 (not offensive)
  - 2 (maybe offensive)
  - 3 (completely offensive)
- Stance:
  - 0 (irrelevant)
  - 1 (strongly agree)
  - 2 (slightly agree/disagree)
  - 3 (strongly disagree)
- Informativeness:
  - 0 (irrelevant)
  - 1 (not informative)
  - 2 (generic and uninformative statement)
  - 3 (specific and informative)
- Truthfulness:
  - 0 (not sure)
  - 1 (not true)
  - 2 (partially true)
  - 3 (completely true)
- Editing required:
  - 0 (no editing)
  - 1 (yes editing)
- Comparison between H-M:
  - 0 (both CN are equally valid)
  - 1 (human generates a better CN)
  - 2 (machine generates a better CN)
  - 3 (neither CN is good)
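When working with these annotations, it can help to map the numeric codes back to their labels. The sketch below encodes the scales listed above; the dictionary keys (`offensiveness`, `comparison_h_m`, and so on) and the `decode` helper are hypothetical names chosen for illustration and are not necessarily the column names used in the dataset files.

```python
from typing import Dict

# Label scales copied from the metric list above; the keys are hypothetical.
SCALES: Dict[str, Dict[int, str]] = {
    "offensiveness": {0: "not sure", 1: "not offensive",
                      2: "maybe offensive", 3: "completely offensive"},
    "stance": {0: "irrelevant", 1: "strongly agree",
               2: "slightly agree/disagree", 3: "strongly disagree"},
    "informativeness": {0: "irrelevant", 1: "not informative",
                        2: "generic and uninformative statement",
                        3: "specific and informative"},
    "truthfulness": {0: "not sure", 1: "not true",
                     2: "partially true", 3: "completely true"},
    "editing_required": {0: "no editing", 1: "yes editing"},
    "comparison_h_m": {0: "both CN are equally valid",
                       1: "human generates a better CN",
                       2: "machine generates a better CN",
                       3: "neither CN is good"},
}


def decode(metric: str, value: int) -> str:
    """Return the label for a metric code, rejecting unknown metrics or codes."""
    try:
        return SCALES[metric][value]
    except KeyError:
        raise ValueError(f"invalid code {value!r} for metric {metric!r}") from None


print(decode("stance", 3))
print(decode("editing_required", 1))
```

Raising `ValueError` on an out-of-range code is a design choice made here so that malformed annotations surface early instead of being silently mislabeled.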
# Citation
María Estrella Vallecillo Rodríguez, María Victoria Cantero Romero, Isabel Cabrera De Castro, Arturo Montejo Ráez and María Teresa Martín Valdivia (2024). CONAN-MT-SP: A Spanish Corpus for Counternarrative using GPT Models. In Proceedings of The Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italy, 20-25 May 2024.
```bibtex
@inproceedings{vallecillo-rodriguez-etal-2024-conan-mt,
title = "{CONAN}-{MT}-{SP}: A {S}panish Corpus for Counternarrative Using {GPT} Models",
author = "Vallecillo Rodr{\'\i}guez, Mar{\'\i}a Estrella and
Cantero Romero, Maria Victoria and
Cabrera De Castro, Isabel and
Montejo R{\'a}ez, Arturo and
Mart{\'\i}n Valdivia, Mar{\'\i}a Teresa",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italy",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.326",
pages = "3677--3688",
abstract = "This paper describes the automated generation of CounterNarratives (CNs) for Hate Speech (HS) in Spanish using GPT-based models. Our primary objective is to evaluate the performance of these models in comparison to human capabilities. For this purpose, the English CONAN Multitarget corpus is taken as a starting point and we use the DeepL API to automatically translate into Spanish. Two GPT-based models, GPT-3 and GPT-4, are applied to the HS segment through a few-shot prompting strategy to generate a new CN. As a consequence of our research, we have created a high quality corpus in Spanish that includes the original HS-CN pairs translated into Spanish, in addition to the CNs generated automatically with the GPT models and that have been evaluated manually. The resulting CONAN-MT-SP corpus and its evaluation will be made available to the research community, representing the most extensive linguistic resource of CNs in Spanish to date. The results demonstrate that, although the effectiveness of GPT-4 outperforms GPT-3, both models can be used as systems to automatically generate CNs to combat the HS. Moreover, these models consistently outperform human performance in most instances.",
}
```
Provided by:
SINAI
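Summary of original information
Dataset overview
Dataset name
- CONAN-MT-SP
Dataset description
- CONAN-MT-SP is a Spanish counter-narrative dataset containing hate speech (HS) and the corresponding counter-narratives (CN).
Dataset construction
- The dataset is based on the English CONAN Multitarget (CONAN-MT) corpus, automatically translated into Spanish with the DeepL API.
- CONAN-MT contains 5003 HS-CN pairs covering multiple hate targets (DISABLED, JEWS, LGBT+, MIGRANTS, MUSLIMS, PEOPLE OF COLOR (POC), WOMEN).
- GPT-4 is applied to the HS part to generate new CNs.
- Each instance includes the HS and CN translated from the CONAN Multitarget corpus, plus the CN generated by GPT-4.
- Pairs with duplicate hate-speech texts and the examples used in the model prompt are excluded.
- Human experts manually evaluated the dataset on offensiveness, stance, informativeness, truthfulness, editing required, and a comparison between the human and machine CNs.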
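Dataset evaluation metrics
- Offensiveness: 0 (not sure), 1 (not offensive), 2 (maybe offensive), 3 (completely offensive)
- Stance: 0 (irrelevant), 1 (strongly agree), 2 (slightly agree/disagree), 3 (strongly disagree)
- Informativeness: 0 (irrelevant), 1 (not informative), 2 (generic and uninformative statement), 3 (specific and informative)
- Truthfulness: 0 (not sure), 1 (not true), 2 (partially true), 3 (completely true)
- Editing required: 0 (no editing), 1 (yes editing)
- CN comparison: 0 (both CN are equally valid), 1 (human generates a better CN), 2 (machine generates a better CN), 3 (neither CN is good)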
Citation information
- Authors: María Estrella Vallecillo Rodríguez, María Victoria Cantero Romero, Isabel Cabrera De Castro, Arturo Montejo Ráez, María Teresa Martín Valdivia
- Venue: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Location: Torino, Italy
- Dates: 20-25 May 2024
- Title: CONAN-MT-SP: A Spanish Corpus for Counternarrative using GPT Models
- Publisher: ELRA and ICCL
- Pages: 3677-3688



