Download link:

https://modelscope.cn/datasets/QCRI/ArmPro

Resource overview:

# ArMPro

This repo contains the *Arabic* propaganda dataset (**ArMPro**).

![License](https://img.shields.io/badge/license-CC--BY--NC--SA-blue) [![Paper](https://img.shields.io/badge/Paper-Download%20PDF-green)](https://aclanthology.org/2024.lrec-main.244.pdf)

**Table of contents:**

* [Dataset](#dataset)
  + [Data splits](#data-splits)
  + [Coarse-grained label distribution](#coarse-grained-label-distribution)
  + [Fine-grained label distribution](#fine-grained-label-distribution)
* [Licensing](#licensing)
* [Citation](#citation)

## Dataset

This is the largest dataset to date for fine-grained propaganda detection in Arabic. It includes **8,000** paragraphs extracted from over **2,800** Arabic news articles, covering a wide variety of news domains.

Example annotated paragraph:

<img width="350" alt="Screenshot 2024-05-04 at 3 56 26 PM" src="https://github.com/MaramHasanain/ArMPro/assets/3918663/255f6b47-1942-48cb-ba0a-259a79a7f93a">

### Data splits

We split the dataset in a stratified manner, allocating 75%, 8.5%, and 16.5% of the paragraphs to training, development, and testing, respectively. The stratified sampling took the multilabel setting into account, which ensures that persuasion techniques are similarly distributed across the splits.
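As a quick sanity check, the stated split percentages can be compared against the paragraph counts reported in the "Total" row of the binary-label table (6,002 / 672 / 1,326). The sketch below only redoes that arithmetic; the names `SPLIT_COUNTS`, `STATED_PERCENT`, and `split_percentages` are illustrative and not part of the dataset or its tooling.

```python
# Paragraph counts per split, copied from the "Total" row of the binary-label table.
SPLIT_COUNTS = {"train": 6002, "dev": 672, "test": 1326}

# Split percentages as stated in the dataset description.
STATED_PERCENT = {"train": 75.0, "dev": 8.5, "test": 16.5}


def split_percentages(counts):
    """Return each split's share of all paragraphs, as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    observed = split_percentages(SPLIT_COUNTS)
    print(f"total paragraphs: {sum(SPLIT_COUNTS.values())}")
    for name, pct in observed.items():
        print(f"{name}: {pct:.2f}% (stated {STATED_PERCENT[name]}%)")
```

The three counts sum to 8,000 paragraphs, and each observed share is within roughly 0.1 percentage points of the stated value, which is consistent with the percentages being rounded.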
### Coarse-grained label distribution

| **Binary label**   | **Train** | **Dev** | **Test** |
|--------------------|-----------|---------|----------|
| Propagandistic     | 3,777     | 425     | 832      |
| Non-Propagandistic | 2,225     | 247     | 494      |
| **Total**          | **6,002** | **672** | **1,326** |

| **Coarse-grained label** | **Train** | **Dev** | **Test** |
|--------------------------|-----------|---------|----------|
| Manipulative Wording     | 3,460     | 387     | 757      |
| no technique             | 2,225     | 247     | 494      |
| Reputation               | 1,404     | 163     | 314      |
| Justification            | 471       | 48      | 102      |
| Simplification           | 384       | 42      | 82       |
| Call                     | 176       | 21      | 40       |
| Distraction              | 74        | 9       | 16       |
| **Total**                | **8,194** | **917** | **1,805** |

### Fine-grained label distribution

| **Technique**                    | **Train**  | **Dev**   | **Test**  |
|----------------------------------|------------|-----------|-----------|
| Loaded Language                  | 7,862      | 856       | 1,670     |
| no technique                     | 2,225      | 247       | 494       |
| Name Calling/Labeling            | 1,526      | 158       | 328       |
| Exaggeration/Minimisation        | 967        | 113       | 210       |
| Questioning the Reputation       | 587        | 58        | 131       |
| Obfuscation/Vagueness/Confusion  | 562        | 62        | 132       |
| Causal Oversimplification        | 289        | 33        | 67        |
| Doubt                            | 227        | 27        | 49        |
| Appeal to Authority              | 192        | 22        | 42        |
| Flag Waving                      | 174        | 22        | 41        |
| Repetition                       | 123        | 13        | 30        |
| Slogans                          | 101        | 19        | 24        |
| Appeal to Fear/Prejudice         | 93         | 11        | 21        |
| Appeal to Hypocrisy              | 82         | 9         | 17        |
| Consequential Oversimplification | 81         | 10        | 19        |
| False Dilemma/No Choice          | 60         | 6         | 13        |
| Conversation Killer              | 53         | 6         | 13        |
| Appeal to Time                   | 52         | 6         | 12        |
| Appeal to Popularity             | 44         | 4         | 8         |
| Appeal to Values                 | 38         | 5         | 9         |
| Red Herring                      | 38         | 4         | 8         |
| Guilt by Association             | 22         | 2         | 5         |
| Whataboutism                     | 20         | 4         | 4         |
| Straw Man                        | 19         | 2         | 4         |
| **Total**                        | **15,437** | **1,699** | **3,351** |

**Note**: "no technique" refers to paragraphs in which no propagandistic technique is used.

## Licensing

This dataset is licensed under CC BY-NC-SA 4.0.
To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/

## Citation

If you use our dataset in a scientific publication, we would appreciate it if you cited the following paper:

- Maram Hasanain, Fatema Ahmad, and Firoj Alam. 2024. Can GPT-4 Identify Propaganda? Annotation and Detection of Propaganda Spans in News Articles. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2724–2744, Torino, Italia. ELRA and ICCL.

```
@inproceedings{hasanain-etal-2024-gpt,
    title = "Can {GPT}-4 Identify Propaganda? Annotation and Detection of Propaganda Spans in News Articles",
    author = "Hasanain, Maram and Ahmad, Fatema and Alam, Firoj",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.244",
    pages = "2724--2744",
    abstract = "The use of propaganda has spiked on mainstream and social media, aiming to manipulate or mislead users. While efforts to automatically detect propaganda techniques in textual, visual, or multimodal content have increased, most of them primarily focus on English content. The majority of the recent initiatives targeting medium to low-resource languages produced relatively small annotated datasets, with a skewed distribution, posing challenges for the development of sophisticated propaganda detection models. To address this challenge, we carefully develop the largest propaganda dataset to date, ArPro, comprised of 8K paragraphs from newspaper articles, labeled at the text span level following a taxonomy of 23 propagandistic techniques. Furthermore, our work offers the first attempt to understand the performance of large language models (LLMs), using GPT-4, for fine-grained propaganda detection from text. Results showed that GPT-4{'}s performance degrades as the task moves from simply classifying a paragraph as propagandistic or not, to the fine-grained task of detecting propaganda techniques and their manifestation in text. Compared to models fine-tuned on the dataset for propaganda detection at different classification granularities, GPT-4 is still far behind. Finally, we evaluate GPT-4 on a dataset consisting of six other languages for span detection, and results suggest that the model struggles with the task across languages. We made the dataset publicly available for the community.",
}
```


Use cases: