COVID-19-disinformation
Source: ModelScope community (魔搭社区) · Updated: 2025-12-05 · Indexed: 2025-06-21
Download link: https://modelscope.cn/datasets/QCRI/COVID-19-disinformation
Description:
# COVID-19 Infodemic Multilingual Dataset
This repository contains a multilingual dataset related to the COVID-19 infodemic, annotated with fine-grained labels. The dataset is curated to address questions of interest to journalists, fact-checkers, social media platforms, policymakers, and the general public. It includes tweets in Arabic, Bulgarian, Dutch, and English, and supports both binary classification (misinformation detection) and multiclass classification (different types of infodemic content).
## Table of Contents:
- [Dataset Overview](#dataset-overview)
- [Languages and Splits](#languages-and-splits)
- [File Formats](#file-formats)
- [Annotations](#annotations)
- [Dataset Examples](#dataset-examples)
- [Data Statistics](#data-statistics)
- [License](#license)
- [Citation](#citation)
## Dataset Overview
The dataset consists of tweets related to COVID-19, categorized under two tasks:
1. **Binary Classification**:
Detecting whether a tweet contains misinformation.
2. **Multiclass Classification**:
Classifying the tweet into specific infodemic categories such as conspiracy theories, harmful content, or false cures.
### Languages and Splits
The dataset includes the following languages, each with train, development (dev), and test splits:
- Arabic
- Bulgarian
- Dutch
- English
In addition to individual language datasets, a **multilang** directory contains a multilingual dataset where tweets from all the above languages are combined in the binary and multiclass formats.
### File Formats
The dataset is provided in TSV (Tab-Separated Values) format. Each file contains tweet IDs, labels for seven questions (Q1-Q7), and binary/multiclass annotations. The actual tweet text and associated metadata are not included for privacy reasons.
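As a minimal sketch of reading one of these files, the snippet below parses tab-separated text with a header row using Python's standard `csv` module. The column names (`tweet_id`, `q1_label`, `q2_label`, `class_label`) and the sample row are hypothetical placeholders, since this README does not list the actual header; check the first line of a real file before relying on them.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_tsv_rows(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated lines (first line is the header) into dicts."""
    # QUOTE_NONE keeps quote characters literal instead of treating them
    # as field delimiters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


# Hypothetical header and row for illustration only; real columns may differ.
SAMPLE = (
    "tweet_id\tq1_label\tq2_label\tclass_label\n"
    "123\tYes\tProbably no\tno\n"
)

rows = read_tsv_rows(io.StringIO(SAMPLE))
print(rows[0]["tweet_id"], rows[0]["q2_label"])
```

For a file on disk, the same function can be called on `open(path, encoding="utf-8", newline="")`. Tweet IDs are kept as strings here so that long numeric IDs are not altered by numeric conversion.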
### Directory Structure
- **Readme.md**: This file
- **arabic/**, **bulgarian/**, **dutch/**, **english/**: Directories containing language-specific datasets for both binary and multiclass classification.
- **multilang/**: A directory containing the multilingual version of the dataset.
Each language directory and the multilingual directory include three splits:
- `train`
- `dev`
- `test`
The `*_binary_*` files correspond to binary classification, while the `*_multiclass_*` files correspond to multiclass classification.
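The naming convention above can be used to sort files by task. The sketch below applies the two glob patterns with the standard `fnmatch` module; the example file names are hypothetical and only illustrate the `*_binary_*` and `*_multiclass_*` patterns, not the repository's actual file names.

```python
from fnmatch import fnmatch
from typing import Dict, Iterable, List


def group_by_task(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group file names into binary and multiclass lists by name pattern."""
    groups: Dict[str, List[str]] = {"binary": [], "multiclass": []}
    for name in filenames:
        if fnmatch(name, "*_binary_*"):
            groups["binary"].append(name)
        elif fnmatch(name, "*_multiclass_*"):
            groups["multiclass"].append(name)
        # Names matching neither pattern (e.g. Readme.md) are ignored.
    return groups


# Hypothetical file names for illustration only.
example_files = [
    "covid19_binary_english_train.tsv",
    "covid19_multiclass_english_train.tsv",
    "Readme.md",
]
groups = group_by_task(example_files)
print(groups["binary"])
print(groups["multiclass"])
```

In practice the input list could come from `os.listdir` on one of the language directories.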
## Annotations
The dataset contains labels for the following seven questions (Q1-Q7), each related to different aspects of the tweets:
1. **Is the tweet understandable?**
- Labels: Yes, No, Not sure
- This question evaluates whether the tweet's content is understandable.
2. **Does the tweet contain false information?**
- Labels: Definitely no, Probably no, Not sure, Probably yes, Definitely yes
- This question assesses the likelihood of false information in the tweet.
3. **Will the tweet’s claim be of interest to the general public?**
- Labels: Definitely no, Probably no, Not sure, Probably yes, Definitely yes
- Evaluates whether the tweet’s claim is relevant or interesting to the public.
4. **Is the tweet harmful?**
- Labels: Definitely no, Probably no, Not sure, Probably yes, Definitely yes
- Assesses if the tweet might cause harm to individuals, society, or businesses.
5. **Should a professional fact-checker verify the claim?**
- Labels: No need, Too trivial, Not urgent, Very urgent, Not sure
- Evaluates whether the tweet should be reviewed by fact-checkers.
6. **Why might the tweet be harmful?**
- Labels: No harm, Panic, Hate speech, Rumor, Conspiracy, etc.
- Categorizes the nature of potential harm the tweet might cause.
7. **Should this tweet get the attention of a government entity?**
- Labels: Not interesting, Calls for action, Blames authorities, etc.
- Determines if the tweet should be flagged for government attention.
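Q2, Q3, and Q4 share the same five-point scale, so one optional preprocessing step is to map those labels to ordinal scores. The sketch below is one such mapping, assuming the label strings appear as written in this README; the actual files may encode them differently (for example with underscores or prefixes), in which case the lookup table would need adjusting.

```python
from typing import Optional

# Five-point scale listed for Q2-Q4, ordered from least to most affirmative.
LIKERT_SCALE = [
    "Definitely no",
    "Probably no",
    "Not sure",
    "Probably yes",
    "Definitely yes",
]
_LIKERT_INDEX = {label.lower(): i for i, label in enumerate(LIKERT_SCALE)}


def likert_to_score(label: str) -> Optional[int]:
    """Return 0..4 for a recognized scale label, or None otherwise."""
    return _LIKERT_INDEX.get(label.strip().lower())


print(likert_to_score("Probably yes"))  # 3
print(likert_to_score("unknown label"))  # None
```

Returning `None` for unrecognized strings makes encoding mismatches visible instead of silently assigning a score.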
## Dataset Examples
An example from the dataset:
> **Tweet**: "Please don’t take hydroxychloroquine (Plaquenil) plus Azithromycin for #COVID19 UNLESS your doctor prescribes it. Both drugs affect the QT interval of your heart and can lead to arrhythmias and sudden death, especially if you are taking other meds or have a heart condition."
**Labels**:
- Q1: Yes
- Q2: No, probably contains no false information
- Q3: Yes, definitely of interest
- Q4: No, probably not harmful
- Q5: Yes, very urgent
- Q6: No, not harmful
- Q7: Yes, calls for action
## Data Statistics
- **Arabic**: 5,000 binary samples, 4,000 multiclass samples
- **Bulgarian**: 3,000 binary samples, 2,500 multiclass samples
- **Dutch**: 4,000 binary samples, 3,500 multiclass samples
- **English**: 6,000 binary samples, 5,000 multiclass samples
- **Multilang**: Combined data from all languages, provided in both binary and multiclass splits.
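As a quick arithmetic check, the per-language counts above can be summed per task. The totals are only what the multilang set would contain if it were a plain concatenation of the four languages with no filtering or deduplication, which this README does not confirm.

```python
# Per-language sample counts as stated in the list above.
STATED_COUNTS = {
    "arabic": {"binary": 5000, "multiclass": 4000},
    "bulgarian": {"binary": 3000, "multiclass": 2500},
    "dutch": {"binary": 4000, "multiclass": 3500},
    "english": {"binary": 6000, "multiclass": 5000},
}


def total_for_task(task: str) -> int:
    """Sum the stated counts for one task across all four languages."""
    return sum(counts[task] for counts in STATED_COUNTS.values())


print(total_for_task("binary"))      # 18000
print(total_for_task("multiclass"))  # 15000
```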
## License
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
## Citation
If you use this dataset, please cite it as:
```
@inproceedings{alam-etal-2021-fighting-covid,
title = "Fighting the {COVID}-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society",
author = "Alam, Firoj and
Shaar, Shaden and
Dalvi, Fahim and
Sajjad, Hassan and
Nikolov, Alex and
Mubarak, Hamdy and
Da San Martino, Giovanni and
Abdelali, Ahmed and
Durrani, Nadir and
Darwish, Kareem and
Al-Homaid, Abdulaziz and
Zaghouani, Wajdi and
Caselli, Tommaso and
Danoe, Gijs and
Stolk, Friso and
Bruntink, Britt and
Nakov, Preslav",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.56",
doi = "10.18653/v1/2021.findings-emnlp.56",
pages = "611--649",
}
@inproceedings{alam2021fighting,
title={Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms},
author={Alam, Firoj and Dalvi, Fahim and Shaar, Shaden and Durrani, Nadir and Mubarak, Hamdy and Nikolov, Alex and Da San Martino, Giovanni and Abdelali, Ahmed and Sajjad, Hassan and Darwish, Kareem and others},
booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
volume={15},
pages={913--922},
year={2021}
}
```
Provider: maas
Created: 2025-06-17



