AminHasibul/covid-vaccine-conspiracy
Source: Hugging Face (updated 2026-03-23, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/AminHasibul/covid-vaccine-conspiracy
---
language:
- en
license: mit
task_categories:
- text-classification
task_ids:
- multi-class-classification
pretty_name: COVID-19 Vaccine Conspiracy Theory Detection Dataset
size_categories:
- n<1K
tags:
- misinformation
- conspiracy-theory
- covid-19
- vaccine
- nlp
- social-media
- health
- bert
- public-health
- sentiment-analysis
- vaccine-hesitancy
dataset_info:
  features:
  - name: comments
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': not_conspiracy
          '1': conspiracy
  - name: conspiracy_found
    dtype: string
  splits:
  - name: train
    num_examples: 581
---
# COVID-19 Vaccine Conspiracy Theory Detection Dataset
## Dataset Summary
This dataset contains **581 manually labeled social media comments** related to COVID-19 vaccines,
annotated for the presence of conspiracy theories. It supports NLP research on automated vaccine
misinformation detection, a critical public health challenge.
**This dataset has been cited and reused by external researchers.**
### Associated Paper
> Amin, M. H., Madanu, H., Lavu, S., Mansourifar, H., Alsagheer, D., & Shi, W. (2022).
> *Detecting Conspiracy Theory Against COVID-19 Vaccines.*
> arXiv preprint arXiv:2211.13003.
> University of Houston, Department of Computer Science.
**Links:** [arXiv Paper](https://arxiv.org/abs/2211.13003) | [GitHub Repository](https://github.com/AminHasibul/ConspiracyAgaintstCovidVaccines)
---
## Supported Tasks
- **Text Classification:** Binary classification of vaccine-related social media text as conspiracy (1) or not (0)
- **Misinformation Detection:** Identifying vaccine conspiracy narratives in informal online language
- **Sentiment Analysis:** Analyzing public sentiment toward COVID-19 vaccines
- **Health NLP Benchmarking:** Evaluating NLP models on short, noisy social media text in the public health domain
---
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `comments` | string | Raw text of the social media comment |
| `label` | int (0 or 1) | 1 = conspiracy theory present, 0 = no conspiracy theory |
| `conspiracy_found` | string | Human-readable label: "Yes" or "No" |
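
The `label` and `conspiracy_found` columns carry the same information in two forms, so a quick consistency check can catch loading or encoding problems. The following is a minimal sketch, assuming `conspiracy_found` holds exactly the strings "Yes" and "No" as described in the table above. The helper name `find_inconsistent_rows` and the third, deliberately broken row are hypothetical.

```python
# Expected pairing between the integer label and its human-readable form,
# taken from the field descriptions in this card.
EXPECTED = {1: "Yes", 0: "No"}


def find_inconsistent_rows(rows):
    """Return the indices of rows whose `label` and `conspiracy_found` disagree."""
    bad = []
    for i, row in enumerate(rows):
        # Strip whitespace in case the raw file pads its values (an assumption).
        found = str(row["conspiracy_found"]).strip()
        if EXPECTED.get(row["label"]) != found:
            bad.append(i)
    return bad


# Two rows from this card's sample entries, plus one deliberately broken row.
rows = [
    {"comments": "After getting vaccine you catch heart diseases",
     "label": 1, "conspiracy_found": "Yes"},
    {"comments": "Thank you governor for implementing vaccine mandates",
     "label": 0, "conspiracy_found": "No"},
    {"comments": "placeholder text", "label": 0, "conspiracy_found": "Yes"},
]
print(find_inconsistent_rows(rows))  # [2]
```

Because iterating over a loaded split yields one dictionary per row, the same helper can be called as `find_inconsistent_rows(dataset["train"])`.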
### Data Splits
| Split | Size |
|-------|------|
| Full dataset | 581 |
The dataset is provided as a single file, which `load_dataset` exposes as one `train` split. Researchers should define their own evaluation splits.
The paper uses **10-fold cross-validation** for evaluation, which is the recommended protocol for a dataset of this size.
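
One possible way to define such splits is stratified 10-fold cross-validation with scikit-learn. This is a sketch under stated assumptions, not the paper's exact procedure: the helper name `make_folds`, the seed, the use of stratification, and the placeholder label list are all illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def make_folds(labels, n_splits=10, seed=42):
    """Return a list of (train_indices, test_indices) pairs."""
    labels = np.asarray(list(labels))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Only the labels drive stratification, so zeros stand in for the features.
    return list(folds.split(np.zeros(len(labels)), labels))


# Placeholder labels matching the card's size (581) with a near-even class
# ratio; replace them with dataset["train"]["label"].
labels = [0, 1] * 290 + [0]
folds = make_folds(labels)
for i, (train_idx, test_idx) in enumerate(folds[:3]):
    print(f"fold {i}: {len(train_idx)} train / {len(test_idx)} test")
```

Each index pair can then be turned into subsets with `dataset["train"].select(train_idx)` and `dataset["train"].select(test_idx)`.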
### Class Distribution
Approximately balanced — roughly equal conspiracy (1) and non-conspiracy (0) samples.
See Figure 2 in the associated paper for the exact distribution.
### Sample Entries
| comments | label | conspiracy_found |
|----------|-------|-----------------|
| "After getting vaccine you catch heart diseases" | 1 | Yes |
| "Vaccination can have an impact on gender change" | 1 | Yes |
| "Bill Gates spread this Coronavirus by mass vaccination" | 1 | Yes |
| "Fully vaccination can reduce death rate for COVID-19" | 0 | No |
| "Thank you governor for implementing vaccine mandates" | 0 | No |
---
## Usage
### Load the Dataset
```python
from datasets import load_dataset
dataset = load_dataset("AminHasibul/covid-vaccine-conspiracy")
print(dataset["train"][0])
# {'comments': '...', 'label': 1, 'conspiracy_found': 'Yes'}
# Check class distribution
from collections import Counter
labels = dataset["train"]["label"]
print(Counter(labels))
```
### Fine-tune a Classifier (Modern HuggingFace)
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report
# Load and split
dataset = load_dataset("AminHasibul/covid-vaccine-conspiracy")
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
# Tokenize
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize(batch):
    # Pad or truncate every comment to a fixed length of 128 tokens
    return tokenizer(
        batch["comments"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )
tokenized = split.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
# Model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)
# Metrics
def compute_metrics(pred):
    labels = pred.label_ids
    # Pick the class with the highest logit for each example
    preds = np.argmax(pred.predictions, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }
# Training
training_args = TrainingArguments(
    output_dir="./covid-conspiracy-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    # Named "evaluation_strategy" in transformers releases before 4.41
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_dir="./logs",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# Evaluate
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.2%}")
print(f"F1-Score: {results['eval_f1']:.2%}")
```
### Replicate Paper Baseline (Perspective API)
```python
# Perspective API requires a Google API key
# See paper Section IV for methodology
# Best result: Gaussian Naïve Bayes classifier on Perspective scores → 75% accuracy
```
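
The baseline has two stages: score each comment with the Perspective API, then fit a classifier on those scores. The second stage can be sketched offline, assuming the scores have already been fetched into a numeric matrix with one row per comment. The attribute list, the random placeholder scores, and the helper name `evaluate_score_classifier` are illustrative assumptions and not the paper's configuration; see paper Section IV for the actual methodology.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature columns; the paper's exact attribute set is not given here.
ATTRIBUTES = ["TOXICITY", "INSULT", "THREAT"]
METRICS = ["accuracy", "f1", "precision", "recall"]


def evaluate_score_classifier(scores, labels, n_splits=10, seed=42):
    """Run 10-fold CV of Gaussian Naive Bayes on precomputed per-comment scores."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    result = cross_validate(GaussianNB(), scores, labels, cv=folds, scoring=METRICS)
    # Average each metric over the folds.
    return {m: float(result[f"test_{m}"].mean()) for m in METRICS}


# Random placeholder scores in [0, 1); they carry no signal and only
# demonstrate the expected input shapes.
rng = np.random.default_rng(0)
scores = rng.random((100, len(ATTRIBUTES)))
labels = np.array([0, 1] * 50)

print(evaluate_score_classifier(scores, labels))
```

With real Perspective scores in place of the random matrix, the returned dictionary gives the fold-averaged accuracy, F1, precision, and recall.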
---
## Benchmark Results
All results from the associated paper using **10-fold cross-validation**.
### BERT-Base Uncased (12-layer, 768-hidden, 12-heads, 110M parameters)
| Downstream Classifier | Accuracy | F1 | Precision | Recall |
|-----------------------|----------|----|-----------|--------|
| **Logistic Regression** | **69%** | **68%** | **67%** | **68%** |
| XGBoost | 66% | 66% | 67% | 65% |
| Gaussian Naïve Bayes | 51% | 51% | 52% | 51% |
### Google Perspective API
| Downstream Classifier | Accuracy | F1 | Precision | Recall |
|-----------------------|----------|----|-----------|--------|
| **Gaussian Naïve Bayes** | **75%** | **75%** | **75%** | **75%** |
| XGBoost | 65% | 63% | 65% | 65% |
| Logistic Regression | 55% | 53% | 55% | 55% |
**Notable finding:** An 8–9% improvement in accuracy was observed when dataset size
was increased from 400 to 598 samples, indicating significant potential for improvement
with larger annotated datasets.
---
## Dataset Creation
### Collection Methodology
- **Initial collection:** 950 user comments manually collected from online news portals and their Facebook pages
- **Deduplication:** Near-duplicate comments removed and encoding cleaned → 581 unique samples
- **Language filtering:** Non-English comments removed (focus on North American English)
- **Preprocessing:** Stop word removal, lowercasing, abbreviation normalization
(e.g., vac/vaccn/vcn → vaccine; CVD/covd → Covid)
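
The preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' script: the order of operations, the tiny stop-word list, the tokenization regex, and the helper name `preprocess` are assumptions. Only the abbreviation variants come from the list above, and because lowercasing is applied as well, "Covid" is emitted as "covid".

```python
import re

# Abbreviation variants named in the preprocessing step of this card.
ABBREVIATIONS = {
    "vac": "vaccine",
    "vaccn": "vaccine",
    "vcn": "vaccine",
    "cvd": "covid",
    "covd": "covid",
}

# Tiny illustrative stop-word list (an assumption); the original list is not specified.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "for", "and", "this", "by"}


def preprocess(comment):
    """Lowercase, expand known abbreviations, and drop stop words."""
    # Splitting on non-alphanumeric characters also discards punctuation.
    tokens = re.findall(r"[a-z0-9']+", comment.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


print(preprocess("The CVD vaccn is safe"))  # covid vaccine safe
```

Whole-token lookup is used rather than substring replacement so that words such as "vacation" are not rewritten by the "vac" rule.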
### Annotation Schema
Comments were manually labeled by the research team as:
**Label 1 (Conspiracy — "Yes"):** Comment contains a conspiracy claim about COVID-19 vaccines,
including but not limited to:
- Claims vaccines cause unreported harm (heart disease, infertility, gender change)
- Claims of government or pharmaceutical cover-ups
- Links between vaccines and unrelated phenomena (5G, microchips, population control)
- Attribution of malicious intent to vaccine developers, Bill Gates, or governments
**Label 0 (Not Conspiracy — "No"):** Comment does not contain conspiracy content —
includes neutral statements, personal vaccination experiences, factual discussion,
or support for vaccination programs.
### Data Source and Scope
- **Geographic scope:** Primarily North American users
- **Time period:** 2021–2022 (peak COVID-19 vaccine rollout)
- **Platform:** Online news portal comment sections and associated Facebook pages
- **Privacy:** No personally identifiable information included (name, location, gender excluded)
- **Compliance:** All source content was publicly posted
---
## Known Limitations
- **Geographic bias:** North American-centric — may not generalize to other regions or languages
- **English-only:** Does not cover multilingual vaccine misinformation
- **Temporal scope:** 2021–2022 narratives; conspiracy theories evolve and new variants emerge
- **Dataset size:** 581 samples — sufficient for initial benchmarking but limited for deep learning without transfer learning
- **Single-team annotation:** Inter-annotator agreement not formally reported in v1
- **Platform bias:** Sourced from specific platform types; may not represent all online environments
- **Ambiguous cases:** Some comments are difficult to classify (acknowledged in paper Section VII)
---
## Intended Use and Ethical Considerations
### Intended Use
- Academic NLP research on misinformation and conspiracy theory detection
- Benchmarking text classification models for health misinformation
- Public health informatics research on vaccine hesitancy
- Educational use in NLP, computational social science, and digital epidemiology courses
### Out-of-Scope Use
- Automated content moderation without human oversight
- Targeting or penalizing individuals based on their expressed views
- Any commercial application without appropriate review and safeguards
- Generalizing findings beyond North American English-language social media
### Sensitive Content
This dataset contains real social media comments including vaccine misinformation and
conspiracy theories. Views expressed in labeled conspiracy examples do not reflect the
views of the dataset creators. This work is motivated by the goal of **countering**
misinformation, not amplifying it.
---
## Citation
```bibtex
@article{amin2022detecting,
  title={Detecting conspiracy theory against covid-19 vaccines},
  author={Amin, Md Hasibul and Madanu, Harika and Lavu, Sahithi
          and Mansourifar, Hadi and Alsagheer, Dana and Shi, Weidong},
  journal={arXiv preprint arXiv:2211.13003},
  year={2022}
}
```
---
## Authors
**Md Hasibul Amin** (First Author, Lead)
Applied Scientist | ML Engineer | NLP Researcher
Department of Computer Science, University of Houston
[GitHub](https://github.com/AminHasibul) | [LinkedIn](https://linkedin.com/in/aminhasibul) | [Google Scholar](https://scholar.google.com/citations?user=C8gkK7sAAAAJ&hl=en)
Co-authors: Harika Madanu, Sahithi Lavu, Hadi Mansourifar, Dana Alsagheer, Weidong Shi
Department of Computer Science, University of Houston
*Data collection supported by COSC 6376 Cloud Computing Course (Fall 2021), University of Houston.*
---
## License
[MIT License](https://opensource.org/licenses/MIT)
Provided by: AminHasibul



