PrashantRGore/synthetic-faers-1m-v3
Source: Hugging Face | Updated: 2025-12-28 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/PrashantRGore/synthetic-faers-1m-v3
Resource description:
---
language:
- en
license: cc0-1.0
size_categories:
- 1M<n<10M
task_categories:
- text-classification
- tabular-classification
tags:
- pharmacovigilance
- drug-safety
- signal-detection
- adverse-events
- faers
- medical
- healthcare
pretty_name: Synthetic FAERS 1M v3 - With Injected Signals
---
# Synthetic FAERS 1M v3 - Signal Detection Training Dataset
## Dataset Description
**Version:** 3.0 (Signal-Enhanced)
**Records:** 1,000,000 synthetic Individual Case Safety Reports (ICSRs)
**Signals:** 70,803 records carrying injected drug-event associations (7.1% of all records)
**Features:** 36 columns including demographics, clinical labs, causality assessment, and temporal relationships
This is a **fully synthetic** pharmacovigilance dataset designed for training machine learning models in drug safety signal detection. Unlike v2, this version contains **purposefully injected realistic drug-event signals** based on known pharmacovigilance associations.
### Key Features
✅ **100% Synthetic** - No real patient data, fully GDPR/HIPAA compliant
✅ **Signal-Enriched** - 60K strong signals + 10K weak signals for robust ML training
✅ **50+ ML Features** - Derivable per drug-event pair from the 36 record-level columns (demographics, labs, causality, temporal data)
✅ **Realistic Associations** - Based on real-world pharmacovigilance patterns (anonymized)
✅ **Production-Ready** - Validated schema, clean data, ready for disproportionality analysis
## What's New in v3
**Major Enhancement:** Signal Injection
v3 addresses the critical limitation in v2 where random generation resulted in PRR values near 1.0 (no associations). This version includes:
- **8 Strong Signal Drug-Event Pairs** with PRR 3.0-9.0 (e.g., Anticoag-XR → Haemorrhage)
- **2 Borderline Signal Pairs** with PRR 1.7-2.6 for edge case testing
- **Enhanced Causality** - Probable/Likely/Certain assessments for signals
- **Temporal Patterns** - Acute onset (1-90 days) for injected signals
- **Positive Dechallenge/Rechallenge** - Realistic clinical evidence
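For reference, the proportional reporting ratio (PRR) compares how often an event is reported with one drug against how often it is reported with all other drugs. This is the standard textbook definition rather than anything specific to this dataset; the cell names $a$, $b$, $c$, $d$ and the counts in the worked example are illustrative only.

$$
\mathrm{PRR} = \frac{a / (a + b)}{c / (c + d)}
$$

Here $a$ is the number of reports with the drug and the event, $b$ the reports with the drug and any other event, $c$ the reports with other drugs and the event, and $d$ the reports with other drugs and other events. With hypothetical counts $a = 30$, $b = 10$, $c = 10$, $d = 50$:

$$
\mathrm{PRR} = \frac{30 / 40}{10 / 60} = \frac{0.75}{0.1\overline{6}} = 4.5
$$

A PRR near 1.0, as in v2, means the event is reported at about the same rate with and without the drug.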
### Injected Signal Drug-Event Pairs
| Drug | Event | Cases Injected | Expected PRR Range |
|------|-------|----------------|-------------------|
| Anticoag-XR | Haemorrhage | 3,290 | 4.5 - 8.0 |
| Lipidlow | Rhabdomyolysis | 3,155 | 3.8 - 6.5 |
| Vasodilate | Hypotension | 3,490 | 4.0 - 7.0 |
| Neurobalance | Seizure | 2,532 | 3.2 - 5.8 |
| Hepatosan | Hepatic failure | 2,194 | 5.2 - 9.0 |
| Nephroguard | Acute kidney injury | 3,308 | 3.5 - 6.2 |
| Cardiomax | Myocardial infarction | 1,829 | 2.8 - 5.5 |
| Hematocare | Neutropenia | 2,363 | 4.2 - 7.5 |
## Dataset Schema
### Core Columns (36 Total)
**Case Identification**
- `case_id`: Unique SHA256 hash (non-reversible)
- `receive_date`: Synthetic report receipt date
- `country`: ISO 3-letter country code
**Patient Demographics (with Differential Privacy)**
- `age`: Patient age in years (±2 year noise added)
- `age_group`: Regulatory category (neonate, infant, child, adolescent, adult, elderly)
- `sex`: Male/Female/Unknown
- `weight_kg`: Body weight in kg (35% missing)
**Drug Information**
- `suspect_drug`: Fictional drug name (25 unique drugs)
- `indication`: Drug indication/reason for use
- `route`: Route of administration
- `dose`: Dose amount
- `dose_unit`: Dose unit (mg, mcg, etc.)
- `dose_frequency`: Dosing frequency (20% missing)
- `treatment_duration_days`: Treatment duration (30% missing)
**Adverse Event (MedDRA-like Hierarchy)**
- `event_llt`: Lowest Level Term
- `event_pt`: Preferred Term
- `event_hlt`: High Level Term
- `event_hlgt`: High Level Group Term
- `event_soc`: System Organ Class (15 unique SOCs)
**Temporal Relationships**
- `time_to_onset_days`: Days from drug start to event onset
- `event_duration_days`: Event duration in days (40% missing)
**Causality Assessment (WHO-UMC Style)**
- `causality_assessment`: Certain/Probable/Possible/Unlikely/Unclassified
- `dechallenge`: Positive/Negative/Not applicable/Unknown
- `rechallenge`: Positive/Negative/Not applicable/Unknown
**Clinical Context**
- `seriousness`: ICH E2B criteria (Death, Life-threatening, Hospitalization, etc.)
- `outcome`: Event outcome (Recovered, Fatal, Unknown, etc.)
- `action_taken`: Action with suspect drug
- `concomitant_medications`: List of concomitant drugs
- `medical_history`: Relevant medical history
**Laboratory Values (with Realistic Missing Data)**
- `alt_u_l`: ALT (U/L) - 25% missing
- `ast_u_l`: AST (U/L) - 25% missing
- `bilirubin_mg_dl`: Total bilirubin (mg/dL) - 25% missing
- `creatinine_mg_dl`: Serum creatinine (mg/dL) - 25% missing
- `bun_mg_dl`: Blood urea nitrogen (mg/dL) - 25% missing
**Metadata**
- `reporter_type`: Physician/Pharmacist/Consumer/Lawyer/Other
- `report_type`: Spontaneous/Clinical trial/Literature/etc.
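A loaded frame can be compared against the schema above by checking column names and measuring missingness. The sketch below is one possible check, not part of the dataset tooling: `check_schema` is a hypothetical helper, and it assumes missing values arrive as nulls (NaN/None) rather than as empty strings or an "Unknown" label, which the card does not specify. It is shown on a toy frame rather than on the real data.

```python
import pandas as pd

# Column names as documented in the schema (36 in total)
EXPECTED_COLUMNS = [
    "case_id", "receive_date", "country",
    "age", "age_group", "sex", "weight_kg",
    "suspect_drug", "indication", "route", "dose", "dose_unit",
    "dose_frequency", "treatment_duration_days",
    "event_llt", "event_pt", "event_hlt", "event_hlgt", "event_soc",
    "time_to_onset_days", "event_duration_days",
    "causality_assessment", "dechallenge", "rechallenge",
    "seriousness", "outcome", "action_taken",
    "concomitant_medications", "medical_history",
    "alt_u_l", "ast_u_l", "bilirubin_mg_dl", "creatinine_mg_dl", "bun_mg_dl",
    "reporter_type", "report_type",
]

def check_schema(df: pd.DataFrame) -> dict:
    """Report missing or unexpected columns and the null fraction of each column."""
    present = set(df.columns)
    expected = set(EXPECTED_COLUMNS)
    return {
        "missing_columns": sorted(expected - present),
        "unexpected_columns": sorted(present - expected),
        "null_fraction": df.isna().mean().round(3).to_dict(),
    }

# Toy frame for illustration only; real data would come from load_dataset(...)
toy = pd.DataFrame({
    "case_id": ["a", "b", "c", "d"],
    "weight_kg": [70.0, None, 82.5, None],
    "extra": [1, 2, 3, 4],
})
report = check_schema(toy)
print(report["unexpected_columns"], report["null_fraction"]["weight_kg"])
```

On the full dataset, the `null_fraction` entries can be compared with the missing-data rates stated in the schema, for example 35% for `weight_kg`.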
## Use Cases
### 1. **Signal Detection ML Models**
Train supervised models to predict drug-event signals using 50+ features beyond just PRR/chi-square.
### 2. **Disproportionality Analysis**
Test PRR, ROR, BCPNN, MGPS algorithms with known ground truth signals.
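As one example of such a method, the reporting odds ratio (ROR) and an approximate 95% confidence interval can be computed from a 2×2 table. This is a minimal sketch using the standard log-scale (Wald-style) interval; the function name `ror_with_ci` and the cell counts are illustrative, and BCPNN and MGPS are not covered here.

```python
import math

def ror_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Reporting odds ratio with a log-scale confidence interval.

    a: drug and event          b: drug, other events
    c: other drugs and event   d: other drugs, other events
    """
    if min(a, b, c, d) == 0:
        raise ValueError("ROR is undefined when any cell is zero")
    ror = (a * d) / (b * c)
    # Standard error of log(ROR)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se_log)
    upper = math.exp(math.log(ror) + z * se_log)
    return ror, lower, upper

# Hypothetical counts, not taken from the dataset
ror, lower, upper = ror_with_ci(a=30, b=70, c=100, d=900)
print(f"ROR={ror:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

A common screening convention treats a pair as a candidate signal when the lower bound of the interval exceeds 1; whether that threshold suits this dataset is left to the reader.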
### 3. **SISA (Sharded, Isolated, Sliced, and Aggregated) Training**
Train shard-based models that support machine unlearning for right-to-be-forgotten compliance.
### 4. **RAG Systems**
Use signals as triggers for literature mining and evidence retrieval.
### 5. **Algorithm Benchmarking**
Compare performance of different signal detection methods on controlled data.
## Expected Analysis Results
After aggregating 1M ICSRs to drug-event pairs:
- **~1,800 unique drug-event pairs**
- **~150-250 signals with label=1** (using PRR≥2.0, Chi²≥4.0)
- **Signal rate: 8-14%** (realistic for pharmacovigilance)
- **50+ Tier 2 features** per pair for ML training
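The aggregation and labelling step described above could look roughly like the following. This is a sketch under stated assumptions, not the author's pipeline: it treats each record as one `suspect_drug` / `event_pt` pair, uses the Pearson chi-square without continuity correction, and the helper name `label_pairs` is hypothetical. The toy counts are for illustration only.

```python
import pandas as pd

def label_pairs(df: pd.DataFrame, prr_min: float = 2.0, chi2_min: float = 4.0) -> pd.DataFrame:
    """Aggregate records to drug-event pairs and flag PRR/chi-square signals."""
    df = df.dropna(subset=["suspect_drug", "event_pt"])
    n = len(df)
    # a: reports with this drug and this event
    pairs = df.groupby(["suspect_drug", "event_pt"]).size().rename("a").reset_index()
    drug_n = df["suspect_drug"].value_counts()
    event_n = df["event_pt"].value_counts()
    pairs["b"] = pairs["suspect_drug"].map(drug_n) - pairs["a"]  # drug, other events
    pairs["c"] = pairs["event_pt"].map(event_n) - pairs["a"]     # other drugs, event
    pairs["d"] = n - pairs["a"] - pairs["b"] - pairs["c"]        # other drugs, other events
    a, b, c, d = (pairs[k].astype(float) for k in "abcd")
    # PRR; pairs with c == 0 get NaN rather than infinity
    pairs["prr"] = (a / (a + b)) / (c / (c + d)).where(c > 0)
    # Pearson chi-square for the 2x2 table, no continuity correction
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    pairs["chi2"] = (n * (a * d - b * c) ** 2 / denom).where(denom > 0)
    pairs["label"] = ((pairs["prr"] >= prr_min) & (pairs["chi2"] >= chi2_min)).astype(int)
    return pairs

# Toy data: 40 reports for one drug, 60 for another
toy = pd.DataFrame({
    "suspect_drug": ["Anticoag-XR"] * 40 + ["Lipidlow"] * 60,
    "event_pt": ["Haemorrhage"] * 30 + ["Nausea"] * 10
                + ["Haemorrhage"] * 10 + ["Nausea"] * 50,
})
print(label_pairs(toy)[["suspect_drug", "event_pt", "prr", "chi2", "label"]])
```

Requiring both thresholds matters for small counts: a pair can reach a high PRR from a handful of reports while its chi-square stays below 4.0.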
## Quick Start
```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub and convert the train split to pandas
dataset = load_dataset('PrashantRGore/synthetic-faers-1m-v3')
df = dataset['train'].to_pandas()
print(f"Records: {len(df):,}")
print(f"Columns: {len(df.columns)}")
print(df.head())

# Count cases for three of the injected-signal drugs
signal_drugs = ['Anticoag-XR', 'Lipidlow', 'Vasodilate']
for drug in signal_drugs:
    count = (df['suspect_drug'] == drug).sum()
    print(f"{drug}: {count:,} cases")
```
## Privacy & Compliance
**GDPR Compliant:**
- ✅ No PII (patient names, addresses, MRNs)
- ✅ No identifiable reporter information
- ✅ K-anonymity: minimum group size = 5
- ✅ Differential privacy noise on age
- ✅ Geographic data limited to country level
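The k-anonymity claim can be spot-checked by finding the smallest group of records that share the same quasi-identifier values. The card does not say which columns were treated as quasi-identifiers, so the choice of `age_group`, `sex`, and `country` below is an assumption, as is the helper name `min_group_size`; the toy frame is illustrative.

```python
import pandas as pd

def min_group_size(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Size of the smallest group sharing one combination of quasi-identifier values."""
    # dropna=False keeps records with missing values as their own groups
    return int(df.groupby(quasi_identifiers, dropna=False).size().min())

# Toy frame for illustration only
toy = pd.DataFrame({
    "age_group": ["adult"] * 6 + ["elderly"] * 5,
    "sex": ["Female"] * 6 + ["Male"] * 5,
    "country": ["USA"] * 11,
})
k = min_group_size(toy, ["age_group", "sex", "country"])
print(f"k = {k}; satisfies k >= 5: {k >= 5}")
```

Adding more columns to the quasi-identifier list can only shrink the smallest group, so the result depends strongly on which columns are included.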
**HIPAA Compliant:**
- ✅ No PHI (Protected Health Information)
- ✅ No dates of birth (only noise-perturbed ages and age groups)
- ✅ No facility identifiers
**Data Generation:**
- Generated: December 2025
- Method: Faker library + signal injection
- Seed: 42 (reproducible)
## Changelog
**v3.0 (December 2025)** - Signal-Enhanced Release
- ✅ Injected realistic drug-event signals into 70,803 records
- ✅ Added 8 strong signal pairs with PRR 3.0-9.0
- ✅ Added 2 weak signal pairs with PRR 1.7-2.6
- ✅ Enhanced causality assessment for signals
- ✅ Improved temporal patterns (acute onset)
**v2.0 (Previous)** - Random Generation
- ❌ All PRR ≈ 1.0 (no signals)
- ✅ Good for schema testing only
**v1.0 (Deprecated)** - Initial release
## Limitations
⚠️ **Not for Regulatory Submission** - Fully synthetic data
⚠️ **Simplified MedDRA** - Not licensed official MedDRA dictionary
⚠️ **No Drug-Drug Interactions** - Concomitant meds are random
⚠️ **Statistical Patterns Only** - Not based on actual clinical trials
## Citation
```bibtex
@dataset{synthetic_faers_v3_2025,
  title={Synthetic FAERS 1M v3 - Signal Detection Training Dataset},
  author={Gore, Prashant R.},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/PrashantRGore/synthetic-faers-1m-v3}
}
```
## License
**CC0 1.0 Universal (Public Domain)** - Fully synthetic data with no restrictions.
## Related Projects
- [PV-Signal-ML](https://github.com/PrashantRGore/PV_Signal_ML) - Full pipeline using this dataset
- [Drug-Causality-BERT](https://huggingface.co/PrashantRGore/drug-causality-bert-v2) - BERT model for causality assessment
## Contact
For questions or issues, please open a discussion on this dataset's page.
---
**Disclaimer:** This is entirely synthetic data created for machine learning research and software development. It does not contain any real patient information and should not be used for actual drug safety decisions or regulatory submissions.
Provider:
PrashantRGore



