copenlu/wiki-stance

Name: copenlu/wiki-stance
Creator: copenlu
Published: 2024-05-17 11:32:42
License: not specified

Hugging Face, updated 2024-05-17, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/copenlu/wiki-stance

Resource summary:

The Wiki-Stance dataset is a multilingual dataset based on Wikipedia article deletion discussions in English, German, and Turkish, covering 2005 to 2022. It aims to support content moderation on Wikipedia by detecting the stances expressed in discussions and predicting the relevant policies. The dataset contains comments from the discussions, the editors' stance labels (keep, delete, merge, comment), and the relevant Wikipedia policies. It was created by retrieving deletion discussions through the Wikipedia API, selecting comments that mention policies, merging similar policies, and removing policy mentions from the comments. The annotations are based on the stances editors expressed in the discussions and the policies they mentioned, so the editors are regarded as the annotators. The dataset card also includes recommendations for handling personal and sensitive information, and a discussion of biases, risks, and limitations in using the data.

Provider:

copenlu

Summary of original information

Dataset Card - Wiki-Stance

Dataset Details

Dataset Description

The Wiki-Stance dataset was introduced in the EMNLP 2023 paper "Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions".

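Dataset Sources

Repository: https://github.com/copenlu/wiki-stance
Paper: https://aclanthology.org/2023.emnlp-main.361/

Column Descriptions

title - Title of the Wikipedia page under consideration for deletion
username - Wikipedia username of the comment's author
timestamp - Timestamp of the comment
decision - Stance label of the comment in the original language
comment - Text of the Wikipedia editor's comment in the deletion discussion
topic - Topic of the stance task (usually "Deletion of [title]")
en_label - English translation of decision
policy - Code of the Wikipedia policy relevant to the comment
policy_title - Title of the Wikipedia policy relevant to the comment
policy_index - Index of the Wikipedia policy (specific to our dataset)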
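
As a minimal sketch of how the columns above might be checked before use, the following Python snippet lists the ten column names and reports any that are missing from a record. The helper name missing_columns and the sample record are illustrative assumptions; the record's values are placeholders, not real data.

```python
# Column names as listed in the column descriptions above.
COLUMNS = (
    "title", "username", "timestamp", "decision", "comment",
    "topic", "en_label", "policy", "policy_title", "policy_index",
)


def missing_columns(record: dict) -> set:
    """Return the expected column names that are absent from `record`."""
    return set(COLUMNS) - set(record)


# Placeholder record (hypothetical values, not taken from the dataset).
example = {name: "" for name in COLUMNS}
print(missing_columns(example))  # prints set()
```

A check like this is only a convenience for catching renamed or dropped columns early; it says nothing about the value types, which the card does not specify.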

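Uses

The dataset is intended to support content moderation through stance detection and to predict policies in Wikipedia deletion discussions in three languages.

Direct Use

The dataset can be used for stance detection in discussions to support content moderation, and for predicting policies in communities that refer to predefined standards and guidelines. It has not been tested outside Wikipedia, but it may contribute to content moderation at scale. It can also be used for transparent stance detection, i.e. stance detection that refers to a policy, with applications beyond Wikipedia.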

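Dataset Creation

Source Data

The dataset is based on Wikipedia deletion discussions in three languages (English, German, Turkish), from 2005 (2006 for Turkish) to 2022.

Data Collection and Processing

We identify and retrieve the archive pages of deletion discussions for the English, German, and Turkish Wikipedias through the respective MediaWiki APIs. From these pages we select comments that mention a Wikipedia page, which usually refers to a policy or a policy abbreviation. If a policy abbreviation links to a policy page, the Wikimedia API resolves it and returns the actual title of the policy or Wikipedia page. For each language, we retrieve the full policy pages through the Wikimedia API, manually select the actual policy pages, and discard other Wikipedia pages. We further discard policies that are mentioned infrequently in the deletion discussions of the respective language.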
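
The last filtering step, discarding rarely mentioned policies, could look roughly like the sketch below. The function name frequent_policies, the threshold min_mentions, and the sample shortcuts are assumptions made for illustration; the card does not state the cutoff that was actually used.

```python
from collections import Counter


def frequent_policies(mentions, min_mentions=2):
    """Keep policies mentioned at least `min_mentions` times.

    `mentions` holds one policy code per comment. The threshold is a
    hypothetical parameter, not the value used for the dataset.
    """
    counts = Counter(mentions)
    return {policy for policy, n in counts.items() if n >= min_mentions}


# Illustrative input only: shortcut codes for one language edition.
sample = ["WP:N", "WP:N", "WP:V", "WP:N", "WP:OR", "WP:OR"]
print(sorted(frequent_policies(sample)))  # prints ['WP:N', 'WP:OR']
```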

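To merge sub-policies, or sub-categories of a main policy, that have the same or a similar meaning into the main policy, we merge a sub-policy when it links to the main policy in the text of the policy page. Most comments refer to only one policy; we keep exactly one policy per comment by selecting the first one mentioned. We then remove all policy mentions from the comments with a regular expression. This often breaks the grammaticality of the sentence, but it is necessary to prevent leakage of label information.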
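
A rough sketch of the last two steps (keep the first mentioned policy, then strip every mention) is shown below. The pattern POLICY_RE only matches English-style "WP:XYZ" shortcuts and is an assumption for illustration; the regular expressions actually used for the three languages are not given in the card.

```python
import re

# Assumed pattern: matches shortcuts such as "WP:N" or "WP:OR" only.
POLICY_RE = re.compile(r"\bWP:[A-Z][A-Z0-9]*\b")


def split_policy(comment):
    """Return (first mentioned policy or None, comment without mentions)."""
    mentions = POLICY_RE.findall(comment)
    first = mentions[0] if mentions else None
    stripped = POLICY_RE.sub("", comment)
    # Collapse the whitespace left behind by the removed mentions.
    stripped = re.sub(r"\s{2,}", " ", stripped).strip()
    return first, stripped


print(split_policy("Delete per WP:N and WP:V, no sources."))
# prints ('WP:N', 'Delete per and , no sources.')
```

The output illustrates the trade-off described above: the remaining text is no longer grammatical, but it no longer reveals the policy label.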

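Stance labels (keep, delete, merge, and comment) can be expressed in different forms or spellings. We manually identify the different ways in which a label can be expressed and aggregate them into the four standard labels.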
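
The aggregation into four standard labels can be pictured as a lookup table over normalized strings, as in this sketch. The VARIANTS table is a small hypothetical sample (a few English phrasings plus the German words "Behalten" and "Löschen"), not the manually compiled list behind the dataset.

```python
# Hypothetical sample of label variants mapped to the four standard labels.
VARIANTS = {
    "keep": "keep", "strong keep": "keep", "behalten": "keep",
    "delete": "delete", "speedy delete": "delete", "löschen": "delete",
    "merge": "merge",
    "comment": "comment",
}


def normalize_label(raw):
    """Map a raw stance string to a standard label, or None if unknown."""
    # Lowercase, drop surrounding wiki bold quotes and punctuation,
    # then collapse internal whitespace.
    key = raw.casefold().strip(" '*:.!")
    key = " ".join(key.split())
    return VARIANTS.get(key)


print(normalize_label("'''Strong  Keep'''"))  # prints keep
print(normalize_label("Löschen"))  # prints delete
```

Returning None for unknown strings, rather than guessing, leaves the decision about unmatched variants to a manual review step.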

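We create the multilingual dataset by (semi-automatically) linking the policies of the three languages to the corresponding English policy, where a German or Turkish counterpart exists. We use interlanguage links for this.

The dataset is split into train/test/dev sets, with an 80%/15%/5% split for English and German. Because there are fewer Turkish comments, we adjusted the Turkish split so that it contains at least 200 test examples.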
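
The split arithmetic can be sketched as follows. The function split_sizes and its min_test parameter are assumptions: the card only states the 80%/15%/5% ratio and the 200-example floor for the Turkish test set, not how the remaining Turkish examples were divided, so the treatment of train and dev here is one possible choice.

```python
def split_sizes(n, fractions=(0.80, 0.15, 0.05), min_test=0):
    """Return (n_train, n_test, n_dev) for `n` examples.

    `fractions` is (train, test, dev). `min_test` is a hypothetical floor
    on the test set size; train absorbs whatever is left after test and dev.
    """
    _, test_f, dev_f = fractions
    n_test = max(round(n * test_f), min(min_test, n))
    n_dev = min(round(n * dev_f), n - n_test)
    n_train = n - n_test - n_dev
    return n_train, n_test, n_dev


print(split_sizes(1000))               # prints (800, 150, 50)
print(split_sizes(600, min_test=200))  # prints (370, 200, 30)
```

The second call uses a made-up total of 600 to show the floor taking effect; it is not the size of the Turkish subset.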

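Source Data Producers

The data creators are the Wikipedia editors who participate in the deletion discussions of the respective Wikipedia language edition.

Annotation Process

The annotations are derived from the discussion comments. The stance labels are based on the label the editor expresses in their comment in the discussion, and the same holds for the policy labels.

Annotators

The editors can therefore be regarded as the annotators.

Personal and Sensitive Information

All data collected from online communities should be treated as sensitive information, in particular to protect the privacy of the editors.

Bias, Risks, and Limitations

Community data should be treated with respect and handled carefully so as not to overstep the wishes of its creators. The dataset provides only a snapshot of the community's discussions, because it focuses on comments that mention a policy (about 20% of the English comments and about 2% of the German and Turkish comments).

Recommendations

We discourage any work that identifies editors or otherwise processes editor information at the individual level.

Citation

If you find our dataset helpful, please cite it in your work using the following citation: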

@inproceedings{kaffee-etal-2023-article,
    title = "Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual {W}ikipedia Editor Discussions",
    author = "Kaffee, Lucie-Aim{\'e}e and Arora, Arnav and Augenstein, Isabelle",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.361",
    doi = "10.18653/v1/2023.emnlp-main.361",
    pages = "5891--5909",
    abstract = "The moderation of content on online platforms is usually non-transparent. On Wikipedia, however, this discussion is carried out publicly and editors are encouraged to use the content moderation policies as explanations for making moderation decisions. Currently, only a few comments explicitly mention those policies {--} 20{\%} of the English ones, but as few as 2{\%} of the German and Turkish comments. To aid in this process of understanding how content is moderated, we construct a novel multilingual dataset of Wikipedia editor discussions along with their reasoning in three languages. The dataset contains the stances of the editors (keep, delete, merge, comment), along with the stated reason, and a content moderation policy, for each edit decision. We demonstrate that stance and corresponding reason (policy) can be predicted jointly with a high degree of accuracy, adding transparency to the decision-making process. We release both our joint prediction models and the multilingual content moderation dataset for further research on automated transparent content moderation.",
}

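Collected Summary

Dataset Introduction

Construction Method

The dataset is built on article deletion discussions on Wikipedia. Deletion discussion pages in English, German, and Turkish are collected, the comments are retrieved through the MediaWiki API, and policy mentions in the comments are labeled by manually filtering and linking policy pages, resulting in a multilingual dataset with stance and policy labels.

Usage

Researchers can use the dataset for stance detection in support of content moderation, or use its policy labels to predict the predefined standards and guidelines of a community. The dataset can also be used for research on transparent stance detection, i.e. stance detection that refers to a policy, whose applications may extend beyond Wikipedia to content moderation more broadly.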

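Background and Challenges

Background Overview

The Wiki-Stance dataset originates from an EMNLP 2023 paper. It aims to support content moderation on Wikipedia through stance detection and to predict the policy referred to in a given comment. It draws on the article deletion discussions of three Wikipedia language editions (English, German, Turkish) from 2005 to 2022, and focuses on the stances expressed and the policies mentioned in editor discussions. Its creation reflects respect for online community discussion data, privacy protection, and regard for the wishes of editors. The discussion comments contributed by Wikipedia editors are the data source, and the editors also serve as the annotators.

Current Challenges

Challenges in building the dataset include accurately extracting and labeling comments that express a stance with reference to a policy, and handling cross-lingual policy links and the corresponding multilingual policy pages. In addition, comments with a policy label make up only a small share of all comments, which may lead to imbalance in the data. As for research problems, the challenges are how to improve the transparency and accuracy of stance detection and how to predict the relevant policies effectively in a multilingual setting.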

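Common Scenarios

Classic Use Cases

In text classification, the Wiki-Stance dataset, with its stance and policy labels, is a resource for research on transparent stance detection. It is used to detect the stance of comments in Wikipedia deletion discussions and to predict the policy they refer to, with the aim of assisting the content moderation process.

Academic Problems Addressed

The Wiki-Stance dataset addresses the research question of how to make content moderation on online platforms more transparent. By providing comments with stance and policy labels, it lets researchers explore joint models of stance detection and policy prediction, which adds transparency to the decision-making process.

Practical Applications

In practice, the Wiki-Stance dataset can be used to improve content moderation mechanisms in online communities, especially on large collaborative platforms such as Wikipedia. Research based on it can help automate content moderation workflows and improve their efficiency and fairness.

Recent Research on the Dataset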