CHASM

GitHub, updated 2022-07-18, indexed 2024-05-31

Download link:

https://github.com/tmu-nlp/CHASM

Resource overview:

The CHASM dataset comprises 306 counter-narratives and 42 micro-intervention messages generated by GPT-2, GPT-Neo, and GPT-3, together with labels obtained from human evaluation on Amazon Mechanical Turk. The labels rate the offensiveness of each hate speech post or microaggression, and the offensiveness, stance, and informativeness of each model-generated message.

Created:

2022-05-19

Summary of original information

CHASM: A Corpus of Countering HAte Speech and Microaggressions

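About CHASM

The CHASM dataset contains:

- 306 counter-narratives and 42 micro-intervention messages, generated by prompting GPT-2, GPT-Neo, and GPT-3
- Labels from human evaluation carried out on Amazon Mechanical Turk:
  - the offensiveness of each hate speech post or microaggression
  - the offensiveness, stance, and informativeness of each model generation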

Datasets

counter_conan.json
counter_sbic.json

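Format

Each dataset has the following format:

id: ID of a set of four sentences (one post and three model generations)
- post
  - text: the hate speech post or microaggression
  - score: offensiveness scores assigned by crowdworkers, nine labels in total (three workers per model)
- GPT-3
  - text: the counter-narrative
  - score
    - off: offensiveness scores assigned by three crowdworkers
    - stance: stance scores assigned by three crowdworkers
    - info: informativeness scores assigned by three crowdworkers
- GPT-2 and GPT-Neo have the same text and score fields as GPT-3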
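
The format above can be illustrated with a minimal Python sketch. The key names (`post`, `GPT-3`, `text`, `score`, `off`, `stance`, `info`) come from the format description; the placeholder texts, the numeric rating values, and the helper names `generation` and `mean_scores` are assumptions made for illustration and are not taken from the dataset.

```python
from statistics import mean

MODELS = ("GPT-2", "GPT-Neo", "GPT-3")
DIMENSIONS = ("off", "stance", "info")


def generation(text, off, stance, info):
    """Build one model entry with the text/score fields listed in the format."""
    return {"text": text, "score": {"off": off, "stance": stance, "info": info}}


# Hypothetical record: texts and ratings are placeholders, not real CHASM data.
record = {
    "post": {
        "text": "<hate speech post or microaggression>",
        # Nine labels: three workers for each of the three models.
        "score": [1, 2, 2, 1, 2, 2, 1, 1, 2],
    },
    "GPT-2": generation("<generation>", [1, 1, 1], [0, 1, 2], [0, 0, 0]),
    "GPT-Neo": generation("<generation>", [0, 1, 2], [1, 1, 1], [2, 2, 2]),
    "GPT-3": generation("<generation>", [0, 0, 0], [1, 2, 3], [2, 2, 2]),
}


def mean_scores(rec):
    """Average the three crowdworker ratings per model and per dimension."""
    return {
        model: {dim: mean(rec[model]["score"][dim]) for dim in DIMENSIONS}
        for model in MODELS
    }


if __name__ == "__main__":
    for model, scores in mean_scores(record).items():
        print(model, scores)
```

Averaging is only one way to combine the three ratings; a majority vote or the median may suit ordinal labels better, depending on the rating scale the annotators used.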

Citation

@inproceedings{ashida-komachi-2022-towards,
    title = "Towards Automatic Generation of Messages Countering Online Hate Speech and Microaggressions",
    author = "Ashida, Mana and Komachi, Mamoru",
    booktitle = "Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)",
    month = jul,
    year = "2022",
    address = "Seattle, Washington (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.woah-1.2",
    pages = "11--23"
}

Please also cite the following datasets:

@inproceedings{chung-etal-2019-conan,
    title = "{CONAN} - {CO}unter {NA}rratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech",
    author = "Chung, Yi-Ling and Kuzmenko, Elizaveta and Tekiroglu, Serra Sinem and Guerini, Marco",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1271",
    doi = "10.18653/v1/P19-1271",
    pages = "2819--2829"
}

@inproceedings{fanton-2021-human,
    title = "{Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech}",
    author = "Fanton, Margherita and Bonaldi, Helena and Tekiroğlu, Serra Sinem and Guerini, Marco",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
    month = aug,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{chung-etal-2021-knowledge,
    title = "Towards Knowledge-Grounded Counter Narrative Generation for Hate Speech",
    author = "Chung, Yi-Ling and Tekiroğlu, Serra Sinem and Guerini, Marco",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}

as well as the SBIC and SelfMA datasets.

@inproceedings{sap2020socialbiasframes,
    title={Social Bias Frames: Reasoning about Social and Power Implications of Language},
    author={Sap, Maarten and Gabriel, Saadia and Qin, Lianhui and Jurafsky, Dan and Smith, Noah A and Choi, Yejin},
    year={2020},
    booktitle={ACL},
}

@inproceedings{breitfeller-etal-2019-finding,
    title = "Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts",
    author = "Breitfeller, Luke and Ahn, Emily and Jurgens, David and Tsvetkov, Yulia",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D19-1176",
    doi = "10.18653/v1/D19-1176",
    pages = "1664--1674",
}

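Collected summary

Dataset introduction

Construction

CHASM was built by prompting generative pre-trained language models (GPT-2, GPT-Neo, and GPT-3). Given designed prompts, the models generated 306 counter-narratives and 42 micro-intervention messages. The generated content was then evaluated by human annotators on Amazon Mechanical Turk, who rated the offensiveness of each hate speech post or microaggression and the offensiveness, stance, and informativeness of each model-generated message. These human ratings are what make the generated messages usable for quality assessment.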

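Features

A notable feature of CHASM is that it contains counter-narratives and micro-intervention messages from several generative models (GPT-2, GPT-Neo, and GPT-3), all of which were evaluated by humans. Each record also carries detailed scores for offensiveness, stance, and informativeness, each assigned independently by several crowdworkers, so the ratings can be compared both across models and across annotators.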

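Usage

Using CHASM is fairly straightforward. Researchers can access the data by loading the `counter_conan.json` and `counter_sbic.json` files. Each record contains the text of the original hate speech post or microaggression, the model-generated counter-narratives or micro-intervention messages, and the corresponding scores. The data can be used to train and evaluate counter-narrative generation models, or to study intervention strategies for online speech. Analyzing it can help researchers understand how to counter online hate speech and microaggressions effectively.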
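
As a rough sketch of the loading step, the Python below reads one of the JSON files and flattens it into one row per model generation. The file names and field names come from the dataset description; the top-level layout (a dict keyed by id, or a list of records each carrying an `id` field) is an assumption, and `load_chasm` and `iter_generations` are hypothetical helper names.

```python
import json
from pathlib import Path

MODELS = ("GPT-2", "GPT-Neo", "GPT-3")


def load_chasm(path):
    """Return {id: record} from a CHASM JSON file.

    Assumption: the file is either a dict keyed by id or a list of
    records that each carry an "id" field; adjust if the layout differs.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return data
    return {str(item["id"]): item for item in data}


def iter_generations(records):
    """Yield (id, model, generated_text, score_dict) for every model entry."""
    for rec_id, rec in records.items():
        for model in MODELS:
            if model in rec:
                yield rec_id, model, rec[model]["text"], rec[model]["score"]


if __name__ == "__main__":
    for name in ("counter_conan.json", "counter_sbic.json"):
        if Path(name).exists():
            rows = list(iter_generations(load_chasm(name)))
            print(name, len(rows), "generations")
```

Flattening to one row per generation makes it easy to hand the data to a table library or to filter by model before computing statistics.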

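Background and challenges

Background overview

CHASM was introduced by Mana Ashida and Mamoru Komachi in 2022 to address online hate speech and microaggressions. It collects 306 counter-narratives and 42 micro-intervention messages generated by GPT-2, GPT-Neo, and GPT-3, evaluated by humans on Amazon Mechanical Turk. The labels cover the offensiveness of each hate speech post or microaggression, and the offensiveness, stance, and informativeness of each generated message. CHASM is a step toward automatically generating messages that counter hate speech and microaggressions, and it gives related research a human-labeled resource to work from.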

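Current challenges

CHASM faces challenges on two fronts. First, on the task itself: although automatic generation of messages countering hate speech and microaggressions has made some progress, ensuring that generated content is low in offensiveness, clear in stance, and informative remains a hard problem. Second, on data construction: using human evaluation effectively to ensure data quality, while coping with the subjectivity and inconsistency of crowdsourced annotation, is central to building a high-quality dataset. These challenges affect the reliability of the dataset and raise the bar for the accuracy and validity of follow-up research.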
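
One simple way to look at annotator inconsistency is to measure how far the ratings for a single item spread apart. The Python sketch below is a diagnostic under assumptions, not the agreement measure used by the dataset authors: it assumes ratings are numeric, and the function names and example values are hypothetical.

```python
def rating_spread(ratings):
    """Max minus min of one item's ratings; 0 means every worker agreed."""
    return max(ratings) - min(ratings)


def full_agreement_rate(items):
    """Share of items on which all workers gave the same rating."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item")
    return sum(rating_spread(r) == 0 for r in items) / len(items)


if __name__ == "__main__":
    # Hypothetical ratings from three workers for four items.
    example = [[1, 1, 1], [0, 1, 2], [2, 2, 2], [0, 0, 1]]
    print([rating_spread(r) for r in example])
    print(full_agreement_rate(example))
```

For a chance-corrected figure, an established statistic such as Krippendorff's alpha or Fleiss' kappa would be the more standard choice.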

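Common scenarios

Typical use cases

In natural language processing, CHASM is used in research on countering online hate speech and microaggressions. By providing counter-narratives and micro-intervention messages generated by GPT-2, GPT-Neo, and GPT-3, the dataset gives researchers a resource for developing and evaluating models that generate counter-narratives automatically. The goal of such models is to reduce hate speech and microaggressions on online platforms and so foster a healthier environment for online communication.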

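Derived and related work

Work derived from or related to CHASM includes knowledge-graph-based counter-narrative generation models, multilingual counter-narrative datasets, and counter-narrative generation methods aimed at specific types of hate speech. These studies extend the range of applications for CHASM and offer new research directions and methods for natural language processing.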

Recent research on the dataset