AminHasibul/covid-vaccine-conspiracy

Name: AminHasibul/covid-vaccine-conspiracy
Creator: AminHasibul
Published: 2026-03-23 17:58:41
License: MIT

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/AminHasibul/covid-vaccine-conspiracy

Resource description:

---
language:
- en
license: mit
task_categories:
- text-classification
task_ids:
- multi-class-classification
pretty_name: COVID-19 Vaccine Conspiracy Theory Detection Dataset
size_categories:
- n<1K
tags:
- misinformation
- conspiracy-theory
- covid-19
- vaccine
- nlp
- social-media
- health
- bert
- public-health
- sentiment-analysis
- vaccine-hesitancy
dataset_info:
  features:
  - name: comments
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': not_conspiracy
          '1': conspiracy
  - name: conspiracy_found
    dtype: string
  splits:
  - name: train
    num_examples: 581
---

# COVID-19 Vaccine Conspiracy Theory Detection Dataset

## Dataset Summary

This dataset contains **581 manually labeled social media comments** related to COVID-19 vaccines, annotated for the presence of conspiracy theories. It supports NLP research on automated vaccine misinformation detection, a critical public health challenge.

**This dataset has been cited and reused by external researchers.**

### Associated Paper

> Amin, M. H., Madanu, H., Lavu, S., Mansourifar, H., Alsagheer, D., & Shi, W. (2022).
> *Detecting Conspiracy Theory Against COVID-19 Vaccines.*
> arXiv preprint arXiv:2211.13003.
> University of Houston, Department of Computer Science.

**Links:** [arXiv Paper](https://arxiv.org/abs/2211.13003) | [GitHub Repository](https://github.com/AminHasibul/ConspiracyAgaintstCovidVaccines)

---

## Supported Tasks

- **Text Classification:** Binary classification of vaccine-related social media text as conspiracy (1) or not (0)
- **Misinformation Detection:** Identifying vaccine conspiracy narratives in informal online language
- **Sentiment Analysis:** Analyzing public sentiment toward COVID-19 vaccines
- **Health NLP Benchmarking:** Evaluating NLP models on short, noisy social media text in the public health domain

---

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `comments` | string | Raw text of the social media comment |
| `label` | int (0 or 1) | 1 = conspiracy theory present, 0 = no conspiracy theory |
| `conspiracy_found` | string | Human-readable label: "Yes" or "No" |

### Data Splits

| Split | Size |
|-------|------|
| Full dataset | 581 |

The dataset is provided as a single file and loads as one `train` split, so researchers should define their own splits. The paper uses **10-fold cross-validation** for evaluation, which is recommended for a dataset of this size.

### Class Distribution

Approximately balanced: roughly equal numbers of conspiracy (1) and non-conspiracy (0) samples. See Figure 2 in the associated paper for the exact distribution.

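Because the card leaves splitting to the user, the following is a minimal sketch of one way to build 10 cross-validation folds with the standard library only. It is not the authors' procedure: stratifying by label, the seed, and the helper name `stratified_kfold` are assumptions, and the toy label list only stands in for `dataset["train"]["label"]`.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=10, seed=42):
    """Yield (train_idx, test_idx) pairs with each class spread evenly across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    offset = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin; the offset keeps overall fold sizes balanced.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % n_splits].append(idx)
        offset += len(indices)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# Toy labels with a made-up class balance (assumption, not the real distribution).
toy_labels = [1] * 290 + [0] * 291
splits = list(stratified_kfold(toy_labels))
print(len(splits), [len(test) for _, test in splits])
```

Each index appears in exactly one test fold, so per-fold scores can be averaged to give a single cross-validated estimate. With the real data, the index lists can be passed to `dataset["train"].select(...)`.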
### Sample Entries

| comments | label | conspiracy_found |
|----------|-------|------------------|
| "After getting vaccine you catch heart diseases" | 1 | Yes |
| "Vaccination can have an impact on gender change" | 1 | Yes |
| "Bill Gates spread this Coronavirus by mass vaccination" | 1 | Yes |
| "Fully vaccination can reduce death rate for COVID-19" | 0 | No |
| "Thank you governor for implementing vaccine mandates" | 0 | No |

---

## Usage

### Load the Dataset

```python
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("AminHasibul/covid-vaccine-conspiracy")
print(dataset["train"][0])
# {'comments': '...', 'label': 1, 'conspiracy_found': 'Yes'}

# Check class distribution
labels = dataset["train"]["label"]
print(Counter(labels))
```

### Fine-tune a Classifier (Modern HuggingFace)

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Load and split
dataset = load_dataset("AminHasibul/covid-vaccine-conspiracy")
split = dataset["train"].train_test_split(test_size=0.2, seed=42)

# Tokenize
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)


def tokenize(batch):
    return tokenizer(
        batch["comments"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )


tokenized = split.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)


# Metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }


# Training
training_args = TrainingArguments(
    output_dir="./covid-conspiracy-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    # Current argument name; transformers releases before 4.41 call it
    # `evaluation_strategy`.
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.2%}")
print(f"F1-Score: {results['eval_f1']:.2%}")
```

### Replicate Paper Baseline (Perspective API)

```python
# Perspective API requires a Google API key
# See paper Section IV for methodology
# Best result: Gaussian Naïve Bayes classifier on Perspective scores → 75% accuracy
```

---

## Benchmark Results

All results are from the associated paper and use **10-fold cross-validation**.

### BERT-Base Uncased (12-layer, 768-hidden, 12-heads, 110M parameters)

| Downstream Classifier | Accuracy | F1 | Precision | Recall |
|-----------------------|----------|----|-----------|--------|
| **Logistic Regression** | **69%** | **68%** | **67%** | **68%** |
| XGBoost | 66% | 66% | 67% | 65% |
| Gaussian Naïve Bayes | 51% | 51% | 52% | 51% |

### Google Perspective API

| Downstream Classifier | Accuracy | F1 | Precision | Recall |
|-----------------------|----------|----|-----------|--------|
| **Gaussian Naïve Bayes** | **75%** | **75%** | **75%** | **75%** |
| XGBoost | 65% | 63% | 65% | 65% |
| Logistic Regression | 55% | 53% | 55% | 55% |

**Notable finding:** Accuracy improved by 8–9% when the dataset size was increased from 400 to 598 samples, which indicates significant potential for improvement with larger annotated datasets.

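To make the four metric columns concrete, here is a minimal standard-library sketch that computes accuracy together with support-weighted precision, recall, and F1 from a list of predictions. Weighted averaging matches `average="weighted"` in the fine-tuning example; whether the paper's tables use the same averaging is not stated in the card, so treat that as an assumption. The function name and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

    precision = recall = f1 = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / n
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        weight = n / total
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions, not taken from the dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

Under weighted averaging, recall always equals accuracy, because each class's recall is weighted by exactly its share of the true labels.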
---

## Dataset Creation

### Collection Methodology

- **Initial collection:** 950 user comments manually collected from online news portals and their Facebook pages
- **Deduplication:** Near-duplicate comments removed and encoding cleaned → 581 unique samples
- **Language filtering:** Non-English comments removed (focus on North American English)
- **Preprocessing:** Stop word removal, lowercasing, abbreviation normalization (e.g., vac/vaccn/vcn → vaccine; CVD/covd → Covid)

### Annotation Schema

Comments were manually labeled by the research team as:

**Label 1 (Conspiracy, "Yes"):** Comment contains a conspiracy claim about COVID-19 vaccines, including but not limited to:

- Claims vaccines cause unreported harm (heart disease, infertility, gender change)
- Claims of government or pharmaceutical cover-ups
- Links between vaccines and unrelated phenomena (5G, microchips, population control)
- Attribution of malicious intent to vaccine developers, Bill Gates, or governments

**Label 0 (Not Conspiracy, "No"):** Comment does not contain conspiracy content. This includes neutral statements, personal vaccination experiences, factual discussion, or support for vaccination programs.

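The preprocessing steps listed under Collection Methodology (lowercasing, abbreviation normalization, stop word removal) can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the abbreviation map contains only the examples named in the card, the stop word list is a deliberately tiny hypothetical one, and `deduplicate` treats comments as duplicates only when their normalized text matches exactly, which is a simpler rule than near-duplicate detection.

```python
import re

# Abbreviations named in the card; a real pipeline likely covers more.
ABBREVIATIONS = {
    "vac": "vaccine", "vaccn": "vaccine", "vcn": "vaccine",
    "cvd": "covid", "covd": "covid",
}

# Hypothetical, deliberately small stop word list (assumption).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "this", "by"}


def preprocess(comment):
    """Lowercase, expand known abbreviations, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", comment.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


def deduplicate(comments):
    """Keep the first comment for each distinct normalized form."""
    seen = set()
    unique = []
    for comment in comments:
        key = preprocess(comment)
        if key not in seen:
            seen.add(key)
            unique.append(comment)
    return unique


raw = ["Got my vcn today!", "got my VACCINE today", "The CVD vac is safe"]
print([preprocess(c) for c in raw])
print(deduplicate(raw))
```

Since the published `comments` field is described as raw text, a step like this would run on the user's side before feature extraction.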
### Data Source and Scope

- **Geographic scope:** Primarily North American users
- **Time period:** 2021–2022 (peak COVID-19 vaccine rollout)
- **Platform:** Online news portal comment sections and associated Facebook pages
- **Privacy:** No personally identifiable information included (name, location, gender excluded)
- **Compliance:** All source content was publicly posted

---

## Known Limitations

- **Geographic bias:** North American-centric; may not generalize to other regions or languages
- **English-only:** Does not cover multilingual vaccine misinformation
- **Temporal scope:** 2021–2022 narratives; conspiracy theories evolve and new variants emerge
- **Dataset size:** 581 samples, sufficient for initial benchmarking but limited for deep learning without transfer learning
- **Single-team annotation:** Inter-annotator agreement not formally reported in v1
- **Platform bias:** Sourced from specific platform types; may not represent all online environments
- **Ambiguous cases:** Some comments are difficult to classify (acknowledged in paper Section VII)

---

## Intended Use and Ethical Considerations

### Intended Use

- Academic NLP research on misinformation and conspiracy theory detection
- Benchmarking text classification models for health misinformation
- Public health informatics research on vaccine hesitancy
- Educational use in NLP, computational social science, and digital epidemiology courses

### Out-of-Scope Use

- Automated content moderation without human oversight
- Targeting or penalizing individuals based on their expressed views
- Any commercial application without appropriate review and safeguards
- Generalizing findings beyond North American English-language social media

### Sensitive Content

This dataset contains real social media comments including vaccine misinformation and conspiracy theories. Views expressed in labeled conspiracy examples do not reflect the views of the dataset creators.
This work is motivated by the goal of **countering** misinformation, not amplifying it.

---

## Citation

```bibtex
@article{amin2022detecting,
  title={Detecting conspiracy theory against covid-19 vaccines},
  author={Amin, Md Hasibul and Madanu, Harika and Lavu, Sahithi and Mansourifar, Hadi and Alsagheer, Dana and Shi, Weidong},
  journal={arXiv preprint arXiv:2211.13003},
  year={2022}
}
```

---

## Authors

**Md Hasibul Amin** (First Author, Lead)

Applied Scientist | ML Engineer | NLP Researcher

Department of Computer Science, University of Houston

[GitHub](https://github.com/AminHasibul) | [LinkedIn](https://linkedin.com/in/aminhasibul) | [Google Scholar](https://scholar.google.com/citations?user=C8gkK7sAAAAJ&hl=en)

Co-authors: Harika Madanu, Sahithi Lavu, Hadi Mansourifar, Dana Alsagheer, Weidong Shi

Department of Computer Science, University of Houston

*Data collection supported by COSC 6376 Cloud Computing Course (Fall 2021), University of Houston.*

---

## License

[MIT License](https://opensource.org/licenses/MIT)

Provider:

AminHasibul
