SemEvalWorkshop/sem_eval_2020_task_11

Name: SemEvalWorkshop/sem_eval_2020_task_11
Creator: SemEvalWorkshop
Published: 2024-01-18 11:15:40
License: No description available

Source: Hugging Face; updated 2024-01-18; indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/SemEvalWorkshop/sem_eval_2020_task_11


Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- text-classification
- token-classification
task_ids: []
pretty_name: SemEval-2020 Task 11
tags:
- propaganda-span-identification
- propaganda-technique-classification
dataset_info:
  features:
  - name: article_id
    dtype: string
  - name: text
    dtype: string
  - name: span_identification
    sequence:
    - name: start_char_offset
      dtype: int64
    - name: end_char_offset
      dtype: int64
  - name: technique_classification
    sequence:
    - name: start_char_offset
      dtype: int64
    - name: end_char_offset
      dtype: int64
    - name: technique
      dtype:
        class_label:
          names:
            '0': Appeal_to_Authority
            '1': Appeal_to_fear-prejudice
            '2': Bandwagon,Reductio_ad_hitlerum
            '3': Black-and-White_Fallacy
            '4': Causal_Oversimplification
            '5': Doubt
            '6': Exaggeration,Minimisation
            '7': Flag-Waving
            '8': Loaded_Language
            '9': Name_Calling,Labeling
            '10': Repetition
            '11': Slogans
            '12': Thought-terminating_Cliches
            '13': Whataboutism,Straw_Men,Red_Herring
  splits:
  - name: train
    num_bytes: 2358613
    num_examples: 371
  - name: test
    num_bytes: 454100
    num_examples: 90
  - name: validation
    num_bytes: 396410
    num_examples: 75
  download_size: 0
  dataset_size: 3209123
---

# Dataset Card for SemEval-2020 Task 11

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [PTC TASKS ON "DETECTION OF PROPAGANDA TECHNIQUES IN NEWS ARTICLES"](https://propaganda.qcri.org/ptc/index.html)
- **Paper:** [SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles](https://arxiv.org/abs/2009.02696)
- **Leaderboard:** [PTC Tasks Leaderboard](https://propaganda.qcri.org/ptc/leaderboard.php)
- **Point of Contact:** [Task organizers contact](mailto:semeval-2020-task-11-organizers@googlegroups.com)

### Dataset Summary

Propagandistic news articles use specific techniques to convey their message, such as whataboutism, red herring, and name calling, among many others. The Propaganda Techniques Corpus (PTC) makes it possible to study automatic algorithms for detecting them. We provide a permanent leaderboard so that researchers can both advertise their progress and keep up to speed with the state of the art on the tasks offered (see below for a definition).

### Supported Tasks and Leaderboards

More information on the scoring methodology can be found in the [propaganda tasks evaluation document](https://propaganda.qcri.org/ptc/data/propaganda_tasks_evaluation.pdf).

### Languages

This dataset consists of English news articles.

## Dataset Structure

### Data Instances

Each example is structured as follows:

```
{
  "span_identification": {
    "end_char_offset": [720, 6322, ...],
    "start_char_offset": [683, 6314, ...]
  },
  "technique_classification": {
    "end_char_offset": [720, 6322, ...],
    "start_char_offset": [683, 6314, ...],
    "technique": [7, 8, ...]
  },
  "text": "Newt Gingrich: The truth about Trump, Putin, and Obama\n\nPresident Trump..."
}
```

### Data Fields

- `text`: The full text of the news article.
- `span_identification`: a dictionary feature containing:
  - `start_char_offset`: The start character offset of the span for the SI task
  - `end_char_offset`: The end character offset of the span for the SI task
- `technique_classification`: a dictionary feature containing:
  - `start_char_offset`: The start character offset of the span for the TC task
  - `end_char_offset`: The end character offset of the span for the TC task
  - `technique`: the propaganda technique classification label, with possible values including `Appeal_to_Authority`, `Appeal_to_fear-prejudice`, `Bandwagon,Reductio_ad_hitlerum`, `Black-and-White_Fallacy`, `Causal_Oversimplification`.

### Data Splits

|                      | Train | Valid | Test |
| -------------------- | ----- | ----- | ---- |
| Input Articles       | 371   | 75    | 90   |
| Total Annotations SI | 5468  | 940   | 0    |
| Total Annotations TC | 6128  | 1063  | 0    |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

In order to build the PTC-SemEval20 corpus, we retrieved a sample of news articles from the period starting in mid-2017 and ending in early 2019. We selected 13 propaganda and 36 non-propaganda news media outlets, as labeled by Media Bias/Fact Check, and we retrieved articles from these sources. We deduplicated the articles on the basis of word n-gram matching (Barrón-Cedeño and Rosso, 2009) and we discarded faulty entries (e.g., empty entries from blocking websites).

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

The annotation job consisted of both spotting a propaganda snippet and, at the same time, labeling it with a specific propaganda technique. The annotation guidelines are shown in the appendix; they are also available online. We ran the annotation in two phases: (i) two annotators label an article independently and (ii) the same two annotators gather together with a consolidator to discuss dubious instances (e.g., spotted only by one annotator, boundary discrepancies, label mismatch, etc.). This protocol was designed after a pilot annotation stage, in which a relatively large number of snippets had been spotted by one annotator only.

The annotation team consisted of six professional annotators from A Data Pro trained to spot and label the propaganda snippets from free text. The job was carried out on an instance of the Anafora annotation platform (Chen and Styler, 2013), which we tailored for our propaganda annotation task.

We evaluated the annotation process in terms of γ agreement (Mathet et al., 2015) between each of the annotators and the final gold labels. The γ agreement on the annotated articles is on average 0.6; see (Da San Martino et al., 2019b) for a more detailed discussion of inter-annotator agreement.

The training and the development part of the PTC-SemEval20 corpus are the same as the training and the testing datasets described in (Da San Martino et al., 2019b). The test part of the PTC-SemEval20 corpus consists of 90 additional articles selected from the same sources as for training and development. For the test articles, we further extended the annotation process by adding one extra consolidation step: we revisited all the articles in that partition and we performed the necessary adjustments to the spans and to the labels, after a thorough discussion and convergence among at least three experts who were not involved in the initial annotations.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
@misc{martino2020semeval2020,
      title={SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles},
      author={G. Da San Martino and A. Barrón-Cedeño and H. Wachsmuth and R. Petrov and P. Nakov},
      year={2020},
      eprint={2009.02696},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@ZacharySBrown](https://github.com/ZacharySBrown) for adding this dataset.

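Provider:

SemEvalWorkshop

Summary of Original Information

Dataset Card: SemEval-2020 Task 11

Dataset Description

Dataset Summary

This dataset supports research on algorithms that automatically detect propaganda techniques in news articles, such as whataboutism, red herring, and name calling. A permanent leaderboard lets researchers report their progress and keep up with the state of the art.

Supported Tasks and Leaderboards

More information on the scoring methodology can be found in the propaganda tasks evaluation document.

Languages

The dataset consists of English news articles.

Dataset Structure

Data Instances

Each example is structured as follows: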

{
  "span_identification": {
    "end_char_offset": [720, 6322, ...],
    "start_char_offset": [683, 6314, ...]
  },
  "technique_classification": {
    "end_char_offset": [720, 6322, ...],
    "start_char_offset": [683, 6314, ...],
    "technique": [7, 8, ...]
  },
  "text": "Newt Gingrich: The truth about Trump, Putin, and Obama\n\nPresident Trump..."
}

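Data Fields

text: The full text of the news article.
span_identification: a dictionary feature containing:
- start_char_offset: The start character offset of the span for the SI task.
- end_char_offset: The end character offset of the span for the SI task.
technique_classification: a dictionary feature containing:
- start_char_offset: The start character offset of the span for the TC task.
- end_char_offset: The end character offset of the span for the TC task.
- technique: The propaganda technique classification label, with possible values including Appeal_to_Authority, Appeal_to_fear-prejudice, Bandwagon,Reductio_ad_hitlerum, Black-and-White_Fallacy, and Causal_Oversimplification, among others.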
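The fields above can be combined to recover each annotated snippet together with its technique name. The following is a minimal sketch rather than an official loader: the `TECHNIQUES` list is copied from the class-label names in the card metadata, while `extract_technique_spans` and `toy_example` are hypothetical names made up for illustration. It also assumes that `end_char_offset` is exclusive, as in Python slicing, which is worth checking against the real data.

```python
# Label names in index order, copied from the class_label metadata of the card.
TECHNIQUES = [
    "Appeal_to_Authority",
    "Appeal_to_fear-prejudice",
    "Bandwagon,Reductio_ad_hitlerum",
    "Black-and-White_Fallacy",
    "Causal_Oversimplification",
    "Doubt",
    "Exaggeration,Minimisation",
    "Flag-Waving",
    "Loaded_Language",
    "Name_Calling,Labeling",
    "Repetition",
    "Slogans",
    "Thought-terminating_Cliches",
    "Whataboutism,Straw_Men,Red_Herring",
]


def extract_technique_spans(example):
    """Return one dict per TC annotation: offsets, technique name, and snippet.

    Assumption: end_char_offset is exclusive, so text[start:end] is the span.
    """
    text = example["text"]
    tc = example["technique_classification"]
    spans = []
    for start, end, label_id in zip(
        tc["start_char_offset"], tc["end_char_offset"], tc["technique"]
    ):
        spans.append(
            {
                "start": start,
                "end": end,
                "technique": TECHNIQUES[label_id],
                "snippet": text[start:end],
            }
        )
    return spans


# Invented toy record that mimics the structure shown above (not real data).
toy_example = {
    "text": "Make the nation great again! Critics are just traitors.",
    "technique_classification": {
        "start_char_offset": [0, 46],
        "end_char_offset": [28, 54],
        "technique": [11, 9],
    },
}

for span in extract_technique_spans(toy_example):
    print(span["technique"], "->", repr(span["snippet"]))
```

Keeping the label list as a plain Python list keeps the sketch independent of any particular loading library; with a library that exposes the class-label feature, its own index-to-name mapping would be the safer choice.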

Data Splits

	Train	Valid	Test
Input Articles	371	75	90
Total Annotations SI	5468	940	0
Total Annotations TC	6128	1063	0

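Dataset Creation

Data Collection and Normalization

To build the PTC-SemEval20 corpus, we retrieved a sample of news articles from mid-2017 to early 2019. We selected 13 propaganda and 36 non-propaganda news media outlets and retrieved articles from these sources. We deduplicated the articles on the basis of word n-gram matching and discarded faulty entries.

Annotation Process

The annotation job consisted of spotting a propaganda snippet and, at the same time, labeling it with a specific propaganda technique. The annotation guidelines are shown in the appendix and are also available online. The annotation ran in two phases: (i) two annotators label an article independently, and (ii) the same two annotators meet with a consolidator to discuss dubious instances. The annotation team consisted of six professional annotators from A Data Pro, trained to spot and label propaganda snippets in free text. The job was carried out on a customized instance of the Anafora annotation platform. The annotation process was evaluated in terms of γ agreement, which averaged 0.6.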

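Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information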

@misc{martino2020semeval2020, title={SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles}, author={G. Da San Martino and A. Barrón-Cedeño and H. Wachsmuth and R. Petrov and P. Nakov}, year={2020}, eprint={2009.02696}, archivePrefix={arXiv}, primaryClass={cs.CL} }

Contributions

Thanks to @ZacharySBrown for adding this dataset.

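Collected Summary

Dataset Introduction

Construction

The dataset originates from SemEval-2020 Task 11 and aims to advance research on detecting propaganda techniques in news text. It is built from news articles collected between 2017 and 2019 from 13 propaganda outlets and 36 non-propaganda outlets; the raw corpus was formed after word n-gram deduplication and removal of faulty entries. Annotation followed a two-phase collaborative protocol: two professional annotators first independently identified propaganda snippets and classified their techniques, and then met with a consolidator to resolve disagreements, including snippets spotted by only one annotator, boundary discrepancies, and label mismatches. The test set went through an additional round of expert review to safeguard annotation quality. The final corpus contains 371 training articles, 75 validation articles, and 90 test articles, with fine-grained annotations covering 14 propaganda techniques.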

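Features

The defining feature of the dataset is its two-task, layered annotation scheme: it provides precise character-level localization of propaganda snippets (span identification) and assigns each snippet one of 14 technique labels (such as appeal to authority, black-and-white fallacy, and slogans), forming a complete pipeline from detection to classification. The data comes from both propaganda and non-propaganda outlets, which is intended to improve generalization to real-world settings. The annotation process was evaluated with γ agreement, averaging 0.6, and the test set was further calibrated by multiple rounds of expert review to strengthen label reliability. Although the dataset is small (536 articles in total), each article contains dense propaganda annotations (6128 technique annotations in the training set), which makes it suitable for fine-grained training of deep learning models.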
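Character-level spans like these are often converted into token-level tags before training a sequence labeller. The following is a minimal sketch of one such conversion to B/I/O tags. It assumes plain whitespace tokenization, exclusive end offsets, and non-overlapping spans; the function name `char_spans_to_bio` and the toy sentence are hypothetical, and a real pipeline would more likely use the offsets produced by a subword tokenizer.

```python
import re


def char_spans_to_bio(text, starts, ends):
    """Tag whitespace tokens with B/I/O based on character-level spans.

    Assumptions: end offsets are exclusive and spans do not overlap.
    A token is inside a span if the two character ranges intersect.
    """
    spans = sorted(zip(starts, ends))
    tokens, labels = [], []
    previous_hit = None  # index of the span that covered the previous token
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        hit = None
        for idx, (span_start, span_end) in enumerate(spans):
            if tok_start < span_end and tok_end > span_start:
                hit = idx
                break
        if hit is None:
            labels.append("O")
        elif hit == previous_hit:
            labels.append("I")  # continues the span of the previous token
        else:
            labels.append("B")  # first token of a new span
        previous_hit = hit
        tokens.append(match.group())
    return tokens, labels


# Invented toy sentence with two spans (not taken from the corpus).
toy_text = "Make the nation great again! Critics are just traitors."
toy_tokens, toy_labels = char_spans_to_bio(toy_text, [0, 46], [28, 54])
for token, label in zip(toy_tokens, toy_labels):
    print(f"{label}\t{token}")
```

Note that the last token, `traitors.`, is tagged even though the span stops before the full stop; whether such partially covered tokens count as inside a span is a design choice this sketch resolves in favour of inclusion.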

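Usage

The dataset supports two core tasks: propaganda span identification (SI) and propaganda technique classification (TC). To use it, load the JSON-structured data, in which the 'text' field stores the full article and 'span_identification' and 'technique_classification' provide the character offsets and labels, respectively. Researchers can extract subsequences by character offset to build sequence labeling models, or use pretrained language models for end-to-end classification. A layered evaluation strategy is recommended: first evaluate precision and recall for span localization, then evaluate macro-averaged F1 for technique classification. An official leaderboard and evaluation scripts are provided for comparison with baseline models. Because the dataset is small, combining it with data augmentation or transfer learning is suggested to improve model robustness.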
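As a rough illustration of the first evaluation step, the sketch below computes character-level precision, recall, and F1 between gold and predicted spans. It is a simplified stand-in, not the official scorer described in the propaganda tasks evaluation document, and results should be confirmed with the official scripts. The function names are hypothetical, and end offsets are assumed to be exclusive.

```python
def span_chars(starts, ends):
    """Return the set of character positions covered by the spans.

    Assumption: end offsets are exclusive.
    """
    covered = set()
    for start, end in zip(starts, ends):
        covered.update(range(start, end))
    return covered


def char_level_prf(gold_starts, gold_ends, pred_starts, pred_ends):
    """Simplified character-overlap precision, recall, and F1 for one article."""
    gold = span_chars(gold_starts, gold_ends)
    pred = span_chars(pred_starts, pred_ends)
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the prediction covers half of the gold span plus five extra characters.
precision, recall, f1 = char_level_prf([0], [10], [5], [15])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Working on sets of character positions means overlapping predictions are not double counted, which keeps the sketch short at the cost of ignoring how many separate spans were predicted.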

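Background and Challenges

Background Overview

In an age of information overload, propaganda techniques hidden in news text increasingly threaten rational public discourse. In response, the SemEval-2020 Task 11 dataset was released in 2020 by G. Da San Martino, A. Barrón-Cedeño, and other researchers from the Qatar Computing Research Institute, with the aim of advancing research on automatically detecting propaganda techniques in news articles. The dataset is based on the Propaganda Techniques Corpus (PTC) and focuses on identifying 14 specific propaganda strategies, such as appeal to authority, appeal to fear, and labeling, split into two subtasks: span identification and technique classification. It is built from news samples published between 2017 and 2019 by 13 propaganda outlets and 36 non-propaganda outlets, annotated by a professional team on the Anafora platform through multiple rounds of collaboration and agreement checking. Its release gave the natural language processing field a systematic benchmark for propaganda detection and has helped advance techniques for identifying textual manipulation.

Current Challenges

The core challenges for this dataset lie in how subtle and varied propaganda techniques are. First, propaganda is often interwoven with ordinary argumentation, which makes it hard for models to separate natural rhetoric from deliberate manipulation; blurred boundaries between techniques (for example, bandwagon and appeal to fear can co-occur) add further ambiguity to classification. Second, construction ran into annotation consistency problems: in the pilot annotation, a large number of snippets were spotted by only one annotator, so a more complex protocol of two independent annotators plus a third-party consolidator was needed to improve reliability, and the final γ agreement reached only 0.6, which highlights how much subjective judgment varies. In addition, the dataset is small (only 371 training articles) and limited to English news, which restricts how well models generalize to propaganda patterns in other languages and cultures and poses a serious challenge for real-world deployment.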

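Common Scenarios

Classic Use Cases

The SemEval-2020 Task 11 dataset (PTC-SemEval20 for short) was built for detecting propaganda techniques in news text. Its core use cases cover two classic tasks: propaganda span identification (Span Identification) and propaganda technique classification (Technique Classification). The former aims to precisely locate text snippets in news articles that contain propagandistic wording; the latter assigns those snippets to one of fourteen predefined technique categories, such as appeal to authority, appeal to fear, black-and-white fallacy, and causal oversimplification. The corpus consists of English news articles sampled from outlets labeled as propaganda or non-propaganda by a media bias rating organization, and a careful expert annotation process supports annotation quality. It provides computational linguistics and natural language processing with a standardized benchmark for studying the automatic detection of propagandistic discourse.

Practical Applications

In practice, the PTC-SemEval20 dataset supports several key scenarios, including news content moderation, social media opinion monitoring, and political propaganda analysis. With models trained on the dataset, news organizations can automatically flag potentially propagandistic wording in reports and help editorial teams maintain objectivity; social media platforms can identify manipulative strategies such as bandwagon and straw man arguments in user-generated content in real time, helping to curb the spread of misinformation. Governments and think tanks can also use such models to analyze the propaganda techniques in political speeches or campaign advertisements and to raise public awareness of media manipulation. The dataset further supports the development of educational tools that illustrate different propaganda techniques with concrete examples, helping readers build critical thinking and strengthening the transparency and resilience of the information ecosystem more broadly.

Derived and Related Work

The PTC-SemEval20 dataset has led to a series of influential derived resources and follow-up studies. The permanent PTC leaderboard, maintained by the original team, continues to track progress on propaganda detection tasks and has encouraged methods such as BERT-based sequence labeling models, joint learning frameworks that share knowledge across tasks, and attention-enhanced multi-label classifiers. Researchers have also extended the dataset to multilingual propaganda detection, transferring the English annotation scheme to languages such as Arabic and Spanish, and have explored semi-supervised learning and prompt learning in low-resource settings. In addition, linking the dataset with external resources such as Media Bias/Fact Check has driven quantitative analysis of the relationship between propaganda techniques and media bias, offering new theoretical perspectives and empirical evidence for understanding how information manipulation spreads.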

The content above was collected and summarized by 遇见数据集.
