QCRI/COVID-19-disinformation

Name: QCRI/COVID-19-disinformation
Creator: QCRI
Published: 2024-09-09 06:58:27
License: No description available

Source: Hugging Face · Updated 2024-09-09 · Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/QCRI/COVID-19-disinformation

Description:

---
license: cc-by-4.0
task_categories:
- text-classification
language:
- ar
- bg
- nl
- en
pretty_name: ' COVID-19-disinformation'
size_categories:
- 10K<n<100K
dataset_info:
- config_name: arabic
  features:
  - name: tweet_id
    dtype: string
  - name: text
    dtype: string
  - name: q1_label
    dtype: string
  - name: q2_label
    dtype: string
  - name: q3_label
    dtype: string
  - name: q4_label
    dtype: string
  - name: q5_label
    dtype: string
  - name: q6_label
    dtype: string
  - name: q7_label
    dtype: string
  splits:
  - name: binary_train
    num_bytes: 1327709
    num_examples: 3631
  - name: binary_dev
    num_bytes: 119559
    num_examples: 339
  - name: binary_test
    num_bytes: 371852
    num_examples: 996
  - name: multiclass_train
    num_bytes: 1685764
    num_examples: 3631
  - name: multiclass_dev
    num_bytes: 152673
    num_examples: 339
  - name: multiclass_test
    num_bytes: 470286
    num_examples: 996
- config_name: bulgarian
  features:
  - name: tweet_id
    dtype: string
  - name: text
    dtype: string
  - name: q1_label
    dtype: string
  - name: q2_label
    dtype: string
  - name: q3_label
    dtype: string
  - name: q4_label
    dtype: string
  - name: q5_label
    dtype: string
  - name: q6_label
    dtype: string
  - name: q7_label
    dtype: string
  splits:
  - name: binary_train
    num_bytes: 825675
    num_examples: 2710
  - name: binary_dev
    num_bytes: 77271
    num_examples: 251
  - name: binary_test
    num_bytes: 215992
    num_examples: 736
  - name: multiclass_train
    num_bytes: 1084422
    num_examples: 2710
  - name: multiclass_dev
    num_bytes: 101532
    num_examples: 251
  - name: multiclass_test
    num_bytes: 287336
    num_examples: 736
- config_name: dutch
  features:
  - name: tweet_id
    dtype: string
  - name: text
    dtype: string
  - name: q1_label
    dtype: string
  - name: q2_label
    dtype: string
  - name: q3_label
    dtype: string
  - name: q4_label
    dtype: string
  - name: q5_label
    dtype: string
  - name: q6_label
    dtype: string
  - name: q7_label
    dtype: string
  splits:
  - name: binary_train
    num_bytes: 456893
    num_examples: 1950
  - name: binary_dev
    num_bytes: 42346
    num_examples: 181
  - name: binary_test
    num_bytes: 126916
    num_examples: 534
  - name: multiclass_train
    num_bytes: 602248
    num_examples: 1950
  - name: multiclass_dev
    num_bytes: 54308
    num_examples: 181
  - name: multiclass_test
    num_bytes: 166412
    num_examples: 534
- config_name: english
  features:
  - name: tweet_id
    dtype: string
  - name: text
    dtype: string
  - name: q1_label
    dtype: string
  - name: q2_label
    dtype: string
  - name: q3_label
    dtype: string
  - name: q4_label
    dtype: string
  - name: q5_label
    dtype: string
  - name: q6_label
    dtype: string
  - name: q7_label
    dtype: string
  splits:
  - name: binary_train
    num_bytes: 1019808
    num_examples: 3324
  - name: binary_dev
    num_bytes: 94039
    num_examples: 307
  - name: binary_test
    num_bytes: 280445
    num_examples: 911
  - name: multiclass_train
    num_bytes: 1321619
    num_examples: 3324
  - name: multiclass_dev
    num_bytes: 121786
    num_examples: 307
  - name: multiclass_test
    num_bytes: 362128
    num_examples: 911
- config_name: multilang
  features:
  - name: tweet_id
    dtype: string
  - name: text
    dtype: string
  - name: q1_label
    dtype: string
  - name: q2_label
    dtype: string
  - name: q3_label
    dtype: string
  - name: q4_label
    dtype: string
  - name: q5_label
    dtype: string
  - name: q6_label
    dtype: string
  - name: q7_label
    dtype: string
  splits:
  - name: binary_train
    num_bytes: 3630085
    num_examples: 11615
  - name: binary_dev
    num_bytes: 333215
    num_examples: 1078
  - name: binary_test
    num_bytes: 995205
    num_examples: 3177
  - name: multiclass_train
    num_bytes: 4694053
    num_examples: 11615
  - name: multiclass_dev
    num_bytes: 430299
    num_examples: 1078
  - name: multiclass_test
    num_bytes: 1286162
    num_examples: 3177
  download_size: 11409949
  dataset_size: 22738038
configs:
- config_name: arabic
  data_files:
  - split: binary_train
    path: data/arabic_binary_train-*
  - split: binary_dev
    path: data/arabic_binary_dev-*
  - split: binary_test
    path: data/arabic_binary_test-*
  - split: multiclass_train
    path: data/arabic_multiclass_train-*
  - split: multiclass_dev
    path: data/arabic_multiclass_dev-*
  - split: multiclass_test
    path: data/arabic_multiclass_dev-*
- config_name: bulgarian
  data_files:
  - split: binary_train
    path: data/bulgarian_binary_train-*
  - split: binary_dev
    path: data/bulgarian_binary_dev-*
  - split: binary_test
    path: data/bulgarian_binary_test-*
  - split: multiclass_train
    path: data/bulgarian_multiclass_train-*
  - split: multiclass_dev
    path: data/bulgarian_multiclass_dev-*
  - split: multiclass_test
    path: data/bulgarian_multiclass_dev-*
- config_name: dutch
  data_files:
  - split: binary_train
    path: data/dutch_binary_train-*
  - split: binary_dev
    path: data/dutch_binary_dev-*
  - split: binary_test
    path: data/dutch_binary_test-*
  - split: multiclass_train
    path: data/dutch_multiclass_train-*
  - split: multiclass_dev
    path: data/dutch_multiclass_dev-*
  - split: multiclass_test
    path: data/dutch_multiclass_dev-*
- config_name: english
  data_files:
  - split: binary_train
    path: data/english_binary_train-*
  - split: binary_dev
    path: data/english_binary_dev-*
  - split: binary_test
    path: data/english_binary_test-*
  - split: multiclass_train
    path: data/english_multiclass_train-*
  - split: multiclass_dev
    path: data/english_multiclass_dev-*
  - split: multiclass_test
    path: data/english_multiclass_dev-*
- config_name: multilang
  data_files:
  - split: binary_train
    path: data/multilang_binary_train-*
  - split: binary_dev
    path: data/multilang_binary_dev-*
  - split: binary_test
    path: data/multilang_binary_test-*
  - split: multiclass_train
    path: data/multilang_multiclass_train-*
  - split: multiclass_dev
    path: data/multilang_multiclass_dev-*
  - split: multiclass_test
    path: data/multilang_multiclass_dev-*
---

# COVID-19 Infodemic Multilingual Dataset

This repository contains a multilingual dataset related to the COVID-19 infodemic, annotated with fine-grained labels. The dataset is curated to address questions of interest to journalists, fact-checkers, social media platforms, policymakers, and the general public.
The dataset includes tweets in Arabic, Bulgarian, Dutch, and English, focusing on both binary (misinformation detection) and multiclass classification (different types of infodemic content).

## Table of Contents:
- [Dataset Overview](#dataset-overview)
- [Languages and Splits](#languages-and-splits)
- [File Formats](#file-formats)
- [Annotations](#annotations)
- [Dataset Examples](#dataset-examples)
- [Data Statistics](#data-statistics)
- [License](#license)
- [Citation](#citation)

## Dataset Overview

The dataset consists of tweets related to COVID-19, categorized under two tasks:
1. **Binary Classification**: Detecting whether a tweet contains misinformation.
2. **Multiclass Classification**: Classifying the tweet into specific infodemic categories such as conspiracy theories, harmful content, or false cures.

### Languages and Splits

The dataset includes the following languages, each with train, development (dev), and test splits:
- Arabic
- Bulgarian
- Dutch
- English

In addition to individual language datasets, a **multilang** directory contains a multilingual dataset where tweets from all the above languages are combined in the binary and multiclass formats.

### File Formats

The dataset is provided in TSV (Tab-Separated Values) format. Each file contains tweet IDs, labels for seven questions (Q1-Q7), and binary/multiclass annotations. The actual tweet text and associated metadata are not included for privacy reasons.

### Directory Structure

- **Readme.md**: This file
- **arabic/**, **bulgarian/**, **dutch/**, **english/**: Directories containing language-specific datasets for both binary and multiclass classification.
- **multilang/**: A directory containing the multilingual version of the dataset.

Each language and the multilingual directory include three sets:
- `train`
- `dev`
- `test`

The `*_binary_*` files correspond to binary classification, while the `*_multiclass_*` files correspond to multiclass classification.
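
As a rough illustration of the TSV layout described under File Formats, the sketch below parses a tab-separated file with Python's standard `csv` module. It assumes each file starts with a header row and that the column names follow the metadata fields (`tweet_id`, `q1_label`, and so on). The sample rows, IDs, and label strings are invented for the example and are not taken from the dataset.

```python
import csv
import io


def read_tsv(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    # QUOTE_NONE keeps quote characters literal, which is safer for tweet-like text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical sample: column names follow the metadata, values are invented.
SAMPLE = (
    "tweet_id\tq1_label\tq2_label\n"
    "1001\tYes\tProbably no\n"
    "1002\tNo\t\n"
)

rows = read_tsv(io.StringIO(SAMPLE))
for row in rows:
    # An empty string marks a question that has no label in this row.
    print(row["tweet_id"], row["q1_label"], repr(row["q2_label"]))
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.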
## Annotations

The dataset contains labels for the following seven questions (Q1-Q7), each related to different aspects of the tweets:

1. **Is the tweet understandable?**
   - Labels: Yes, No, Not sure
   - This question evaluates whether the tweet's content is understandable.
2. **Does the tweet contain false information?**
   - Labels: Definitely no, Probably no, Not sure, Probably yes, Definitely yes
   - This question assesses the likelihood of false information in the tweet.
3. **Will the tweet’s claim be of interest to the general public?**
   - Labels: Definitely no, Probably no, Not sure, Probably yes, Definitely yes
   - Evaluates whether the tweet’s claim is relevant or interesting to the public.
4. **Is the tweet harmful?**
   - Labels: Definitely no, Probably no, Not sure, Probably yes, Definitely yes
   - Assesses if the tweet might cause harm to individuals, society, or businesses.
5. **Should a professional fact-checker verify the claim?**
   - Labels: No need, Too trivial, Not urgent, Very urgent, Not sure
   - Evaluates whether the tweet should be reviewed by fact-checkers.
6. **Why might the tweet be harmful?**
   - Labels: No harm, Panic, Hate speech, Rumor, Conspiracy, etc.
   - Categorizes the nature of potential harm the tweet might cause.
7. **Should this tweet get the attention of a government entity?**
   - Labels: Not interesting, Calls for action, Blames authorities, etc.
   - Determines if the tweet should be flagged for government attention.

## Dataset Examples

An example from the dataset:

> **Tweet**: "Please don’t take hydroxychloroquine (Plaquenil) plus Azithromycin for #COVID19 UNLESS your doctor prescribes it. Both drugs affect the QT interval of your heart and can lead to arrhythmias and sudden death, especially if you are taking other meds or have a heart condition."

**Labels**:
- Q1: Yes
- Q2: No, probably contains no false information
- Q3: Yes, definitely of interest
- Q4: No, probably not harmful
- Q5: Yes, very urgent
- Q6: No, not harmful
- Q7: Yes, calls for action

## Data Statistics

The split sizes below are the `num_examples` values from the dataset card metadata. The binary and multiclass variants of each split contain the same number of examples.

- **Arabic**: 3,631 train, 339 dev, 996 test (4,966 total)
- **Bulgarian**: 2,710 train, 251 dev, 736 test (3,697 total)
- **Dutch**: 1,950 train, 181 dev, 534 test (2,665 total)
- **English**: 3,324 train, 307 dev, 911 test (4,542 total)
- **Multilang**: 11,615 train, 1,078 dev, 3,177 test (15,870 total), combining the data from all four languages in both binary and multiclass splits.

## License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

## Citation

If you use this dataset, please cite it as:

```
@inproceedings{alam-etal-2021-fighting-covid,
    title = "Fighting the {COVID}-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society",
    author = "Alam, Firoj and Shaar, Shaden and Dalvi, Fahim and Sajjad, Hassan and Nikolov, Alex and Mubarak, Hamdy and Da San Martino, Giovanni and Abdelali, Ahmed and Durrani, Nadir and Darwish, Kareem and Al-Homaid, Abdulaziz and Zaghouani, Wajdi and Caselli, Tommaso and Danoe, Gijs and Stolk, Friso and Bruntink, Britt and Nakov, Preslav",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.56",
    doi = "10.18653/v1/2021.findings-emnlp.56",
    pages = "611--649",
}

@inproceedings{alam2021fighting,
    title={Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms},
    author={Alam, Firoj and Dalvi, Fahim and Shaar, Shaden and Durrani, Nadir and Mubarak, Hamdy and Nikolov, Alex and Da San Martino, Giovanni and Abdelali, Ahmed and Sajjad, Hassan and Darwish, Kareem and others},
    booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
    volume={15},
    pages={913--922},
    year={2021}
}
```
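
The configuration names (`arabic`, `bulgarian`, `dutch`, `english`, `multilang`) and split names (`binary_train`, `multiclass_dev`, and so on) listed in the metadata suggest that a single split could be loaded with the Hugging Face `datasets` library roughly as sketched below. The helper names `split_name` and `load_split` are hypothetical and introduced only for illustration. The sketch assumes the `datasets` package is installed and that the repository is reachable, and it imports the library lazily so the name-building helper works without it.

```python
CONFIGS = ("arabic", "bulgarian", "dutch", "english", "multilang")
TASKS = ("binary", "multiclass")
PARTS = ("train", "dev", "test")


def split_name(task: str, part: str) -> str:
    """Build a split name such as 'binary_train' from a task and a partition."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if part not in PARTS:
        raise ValueError(f"unknown partition: {part!r}")
    return f"{task}_{part}"


def load_split(config: str, task: str, part: str):
    """Load one split from the Hub (needs the `datasets` package and network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    # Lazy import: the helpers above stay usable when `datasets` is not installed.
    from datasets import load_dataset

    return load_dataset(
        "QCRI/COVID-19-disinformation",
        config,
        split=split_name(task, part),
    )
```

For example, `load_split("english", "binary", "train")` would request the `binary_train` split of the `english` configuration. The returned object's `num_rows` attribute can then be compared against the split sizes listed in the metadata.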

