ajay2k3/livestock-health-disease-ssa-synthetic
Source: Hugging Face (updated 2026-01-09; indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/ajay2k3/livestock-health-disease-ssa-synthetic
Description:
---
license: cc-by-4.0
task_categories:
- tabular-regression
- tabular-classification
tags:
- agriculture
- livestock
- africa
- synthetic-data
- food-security
- veterinary
- disease-surveillance
- smallholder-farming
size_categories:
- 1M<n<10M
language:
- en
pretty_name: "Livestock Health & Disease Surveillance - Sub-Saharan Africa (Synthetic)"
---
# Dataset Card: Livestock Health & Disease Surveillance (Synthetic Data)
## Dataset Summary
This synthetic dataset represents **1,000,000 African smallholder households** with livestock systems, capturing livestock health, disease surveillance, veterinary access, and herd management practices across Sub-Saharan Africa. It combines baseline farm characteristics (Dataset 1) with 15 livestock-specific variables to create a comprehensive picture of livestock production systems and animal health challenges.
**Key Features:**
- **1M households** across 5 agro-ecological zones
- **27 variables** (12 base farm + 15 livestock health)
- **African-specific** livestock systems and diseases
- **Literature-grounded** distributions (50+ peer-reviewed sources)
- **Conditional dependencies** modeling real-world relationships
- **Realistic missing data** patterns
## Variables
### Base Farm Characteristics (Dataset 1 - 12 variables)
1. **agro_ecological_zone**: Arid, semi-arid, sub-humid, humid, highland
2. **region_type**: Urban, peri-urban, rural accessible, rural remote
3. **farm_size_ha**: Farm size in hectares
4. **soil_quality_index**: Soil quality (0-100 scale)
5. **rainfall_mm_annual**: Annual rainfall (mm)
6. **household_size**: Number of household members
7. **market_distance_km**: Distance to nearest market
8. **livestock_tlu**: Tropical Livestock Units owned
9. **extension_access**: Access to agricultural extension (yes/no)
10. **fertilizer_use_kg_ha**: Fertilizer application rate
11. **rainfall_mm_season**: Seasonal rainfall (mm)
12. **maize_yield_kg_ha**: Maize yield (kg/ha)
### Livestock Health & Production (NEW - 15 variables)
#### Herd Composition
13. **herd_size_cattle**: Number of cattle owned (0-50+)
14. **herd_size_small_ruminants**: Sheep and goats owned (0-100+)
15. **poultry_count**: Chickens, ducks, etc. (0-200+)
#### Veterinary Services & Access
16. **vet_distance_km**: Distance to nearest veterinary service (1-200 km)
17. **vaccination_coverage_pct**: % of herd vaccinated (0-100%)
18. **vet_visit_annual**: Had veterinary visit in past year (yes/no)
#### Disease & Health
19. **disease_incidence_annual**: Reported disease in past year (yes/no)
20. **disease_type**: Type of disease (FMD, ECF, CBPP, trypanosomiasis, PPR, Newcastle, respiratory, diarrhea, other)
21. **mortality_rate_annual_pct**: Annual livestock mortality rate (%)
22. **pasture_quality_index**: Pasture/rangeland quality (0-100 scale)
#### Management Systems
23. **grazing_system**: Type of grazing (communal, private, mixed, zero-grazing)
24. **water_source_reliability**: Water availability (year-round, seasonal, unreliable)
25. **treatment_access**: Type of treatment accessed (none, traditional, veterinary, both)
26. **feed_supplementation**: Provides supplementary feed (yes/no)
27. **livestock_dependency_index**: Household dependence on livestock (0-100 scale)
## Dataset Statistics
### Livestock Ownership
- **43.4%** of households own cattle
- **62.9%** own small ruminants (sheep/goats)
- **67.5%** keep poultry
- Mean cattle herd size: ~5 animals (among owners)
- Mean small ruminant herd: ~12 animals (among owners)
- Mean poultry flock: ~8 birds (among keepers)
### Disease Burden
- **32.7%** reported disease incidence in past year
- Most common diseases:
  - Newcastle disease (poultry): 20%
  - FMD (foot-and-mouth disease): 18%
  - PPR (peste des petits ruminants): 15%
  - ECF (East Coast fever): 12%
  - Trypanosomiasis: 10%
### Veterinary Access
- **40.3%** had veterinary contact in past year
- Mean distance to vet services: **58.9 km**
- **20%** vaccination coverage (median)
- Treatment types:
  - 35% no treatment
  - 45% traditional remedies only
  - 15% veterinary treatment
  - 5% both traditional and veterinary
### Management Practices
- **50%** use communal grazing systems
- **25%** private grazing
- **20%** mixed systems
- **5%** zero-grazing (intensive)
- **30%** provide feed supplementation
- **40%** have year-round water access
- **35%** seasonal water only
- **25%** unreliable water
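The ownership figures above can be recomputed from the data. The following is a minimal sketch, assuming herd-size columns hold counts where zero means the household owns none and a blank means the value is missing. The helper name `ownership_summary` and the toy rows are illustrative and are not part of the dataset's own tooling.

```python
import pandas as pd

def ownership_summary(df: pd.DataFrame, column: str) -> dict:
    """Share of households owning any animals, and mean count among owners.

    Rows with a missing count are excluded from both figures.
    """
    counts = df[column].dropna()
    owners = counts[counts > 0]
    return {
        "share_owning_pct": 100 * len(owners) / len(counts),
        "mean_among_owners": owners.mean(),
    }

# Toy rows for illustration only (not real dataset values)
toy = pd.DataFrame({"herd_size_cattle": [0, 4, 6, None, 0]})
summary = ownership_summary(toy, "herd_size_cattle")
print(summary)
```

Applied to the full DataFrame, the same helper can be called once per species column (`herd_size_cattle`, `herd_size_small_ruminants`, `poultry_count`).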
## Uses
### Permitted Uses
- **Livestock policy analysis**: Model impacts of disease control programs
- **Veterinary service planning**: Optimize clinic placement and mobile vet routes
- **Disease surveillance system design**: Test outbreak detection algorithms
- **Animal health research**: Train ML models for disease prediction
- **One Health initiatives**: Link livestock-human health systems
- **Extension service planning**: Target interventions by livestock system type
- **Educational purposes**: Teaching livestock epidemiology and policy
- **Climate adaptation**: Model livestock system resilience
- **Value chain analysis**: Link livestock production to markets
- **Research method development**: Test statistical techniques
### Prohibited Uses
- **Not for replacement of real data collection**: Cannot substitute for actual field surveys
- **Not for country-specific policy**: Too generalized for single-country decisions
- **Not for real-time disease outbreak response**: Not actual surveillance data
- **Not for individual farmer targeting**: Synthetic households are not real
- **Not for precise cost-benefit analysis**: Use for methodological prototypes only
## Dataset Creation
### Why This Dataset Exists
Real livestock health data in Sub-Saharan Africa faces critical gaps:
1. **Surveillance gaps**: Most countries lack systematic disease surveillance
2. **Underreporting**: Livestock diseases often go unreported (especially in remote areas)
3. **Fragmented data**: Information scattered across vet clinics, ministries, NGOs
4. **Access restrictions**: Sensitive disease data rarely shared publicly
5. **High collection costs**: Surveys expensive and logistically challenging
6. **Privacy concerns**: Household-level data cannot be openly published
**This synthetic dataset enables:**
- Algorithm development without waiting for data access
- Training of researchers and students
- International collaboration without data sharing barriers
- Rapid prototyping of livestock information systems
- Evidence generation for funding proposals
### Creation Methodology
The dataset was generated in a **four-stage process** following synthetic data best practices:
#### Stage 1: Literature Review (50+ sources)
- Systematic review of livestock systems in SSA
- Disease prevalence studies (FMD, ECF, trypanosomiasis, PPR, Newcastle)
- Veterinary service coverage assessments
- Management practice surveys
- Mortality and productivity benchmarks
#### Stage 2: Parameter Specification (15 files, 60-150 lines each)
- Conditional probability distributions by zone, region, herd size
- Functional relationships (e.g., vet distance → vaccination rates)
- Species-specific disease patterns
- Management system typologies
- Full provenance tracking
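To illustrate what a functional relationship such as "vet distance → vaccination rates" might look like, here is a small sketch. The exponential-decay form, the function name `expected_vaccination_pct`, and both parameter values are assumptions made for this example; the actual parameter files may specify the relationship differently.

```python
import numpy as np

def expected_vaccination_pct(vet_distance_km, max_pct=60.0, half_distance_km=40.0):
    """Hypothetical decay curve: expected coverage halves every `half_distance_km`.

    `max_pct` is the assumed coverage for a household right next to a vet service.
    """
    distance = np.asarray(vet_distance_km, dtype=float)
    return max_pct * 0.5 ** (distance / half_distance_km)

# Expected coverage at 0, 40, and 80 km under the assumed parameters
print(expected_vaccination_pct([0, 40, 80]))
```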
#### Stage 3: Conditional Data Generation
- Base variables from Dataset 1 (smallholder farms)
- Sequential generation respecting dependencies
- Zero-inflated distributions for herd sizes
- Categorical conditioning for disease types
- Realistic missing data (MCAR: 1-10%)
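The zero-inflated sampling and MCAR masking steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the generator used for the dataset: the Poisson count model and the helper names `sample_zero_inflated` and `apply_mcar` are hypothetical, while the ownership share (43.4%), mean herd size among owners (~5), and missing rate (2%) are taken from the figures reported in this card.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_zero_inflated(n, p_zero, mean_if_owner, rng):
    """Draw n herd sizes: zero with probability p_zero, otherwise 1 + Poisson.

    Owners hold at least one animal, so `mean_if_owner` must be >= 1.
    """
    is_owner = rng.random(n) >= p_zero
    sizes = 1 + rng.poisson(mean_if_owner - 1, size=n)
    return np.where(is_owner, sizes, 0)

def apply_mcar(values, missing_rate, rng):
    """Blank out entries completely at random (MCAR) at the given rate."""
    out = values.astype(float)  # astype returns a copy, so the input is untouched
    out[rng.random(len(out)) < missing_rate] = np.nan
    return out

# 43.4% of households own cattle, so 56.6% of draws are structural zeros
cattle = sample_zero_inflated(100_000, p_zero=0.566, mean_if_owner=5.0, rng=rng)
cattle_observed = apply_mcar(cattle, missing_rate=0.02, rng=rng)
```

Separating the "owns any" draw from the "how many" draw is what keeps the large share of zero-herd households from distorting the distribution among owners.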
#### Stage 4: Validation
- Cross-variable consistency checks
- Literature benchmark comparisons
- Logical constraint verification
- Distribution shape validation
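Two of the checks above, logical constraint verification and literature benchmark comparison, could be approximated as shown here. The ranges in `BOUNDS` come from the variable list in this card and the ±10% tolerance from the quality indicators; the helper names and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Documented ranges taken from the variable list in this card
BOUNDS = {
    "vet_distance_km": (1, 200),
    "vaccination_coverage_pct": (0, 100),
    "pasture_quality_index": (0, 100),
}

def count_out_of_range(df: pd.DataFrame, bounds: dict) -> dict:
    """Count non-missing values falling outside each column's [low, high] range."""
    violations = {}
    for column, (low, high) in bounds.items():
        values = df[column].dropna()
        violations[column] = int((~values.between(low, high)).sum())
    return violations

def matches_benchmark(observed: float, benchmark: float, rel_tol: float = 0.10) -> bool:
    """True when `observed` lies within ±rel_tol of a benchmark value."""
    return abs(observed - benchmark) <= rel_tol * abs(benchmark)

# Toy rows for illustration only (deliberately containing two bad values)
toy = pd.DataFrame({
    "vet_distance_km": [0.5, 35.2, 200.0],
    "vaccination_coverage_pct": [10.0, 120.0, None],
    "pasture_quality_index": [55.0, 80.0, 0.0],
})
violations = count_out_of_range(toy, BOUNDS)
print(violations)
```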
## Limitations and Biases
### Known Limitations
1. **Oversimplified disease dynamics**: Real disease spread is more complex than modeled
2. **Static snapshot**: No temporal dynamics (outbreaks, seasonality within year)
3. **No spatial clustering**: Real diseases show geographic clustering not captured
4. **Coarse zones**: 5 AEZ categories don't capture local variation
5. **Missing variables**: No breed info, no herd demographics, no animal-level data
6. **Treatment outcomes**: No data on treatment success/failure
7. **No cost data**: Disease impacts measured only in mortality, not economics
8. **Simplified grazing**: Complex pastoral mobility patterns simplified
9. **Binary disease incidence**: Real incidence is more granular (multiple episodes)
### Potential Biases
1. **Literature bias**: Sources mostly from East Africa (Kenya, Tanzania, Ethiopia)
2. **Veterinary access**: May overestimate coverage in very remote pastoral areas
3. **Disease reporting**: Literature likely underrepresents mild/unreported diseases
4. **Poultry systems**: Village chickens well-represented, commercial systems underrepresented
5. **Traditional knowledge**: Traditional treatment effectiveness may be under-captured
6. **Gender**: No gender disaggregation of livestock ownership/management
7. **Wealth gradient**: Livestock wealth distribution may be too uniform
8. **Conflict zones**: Data may not reflect pastoralist areas affected by conflict
### What This Dataset Is NOT
- ❌ **Not real surveillance data**: Do not use for actual disease outbreak decisions
- ❌ **Not predictive**: Cannot predict real disease occurrence
- ❌ **Not country-specific**: Generalized SSA patterns, not any single country
- ❌ **Not longitudinal**: Single time point, no panel structure
- ❌ **Not spatially explicit**: No GPS coordinates, no spatial autocorrelation
## Technical Specifications
### File Formats
- **CSV**: `livestock_data.csv` (315 MB, 1M rows)
- **Parquet**: `livestock_data.parquet` (111 MB, compressed)
- **Metadata**: `metadata.json` (generation parameters, sources)
- **Data Dictionary**: `data_dictionary.csv` (variable descriptions)
### Missing Data
Realistic missing data rates by variable:
- Herd sizes: 2%
- Vet distance: 4%
- Vaccination coverage: 5%
- Disease incidence: 3%
- Pasture quality: 6%
- Mortality rate: 3%
- Disease type: 10% (conditional on disease occurrence)
- Management variables: 3-4%
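Because the rate for `disease_type` is conditional on disease occurrence, it should be measured only among households that reported disease. The sketch below shows the difference between the overall and conditional rates; it assumes `disease_type` is blank for households without disease, and the toy rows are invented for the example.

```python
import pandas as pd

def missing_rate_pct(series: pd.Series) -> float:
    """Percentage of entries in the series that are missing."""
    return float(series.isna().mean() * 100)

# Toy rows for illustration only
toy = pd.DataFrame({
    "disease_incidence_annual": ["yes", "yes", "no", "yes", "no"],
    "disease_type": ["FMD", None, None, "PPR", None],
})

# Overall rate counts the structurally blank "no" rows as missing
overall = missing_rate_pct(toy["disease_type"])

# Conditional rate looks only at households that reported disease
diseased = toy[toy["disease_incidence_annual"] == "yes"]
conditional = missing_rate_pct(diseased["disease_type"])

print(f"overall: {overall:.1f}%, conditional: {conditional:.1f}%")
```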
### Data Quality Indicators
- ✅ All constraints validated (no impossible values)
- ✅ Conditional dependencies respected
- ✅ Literature benchmarks matched (±10%)
- ✅ Cross-variable correlations logical
- ✅ Missing data patterns realistic
## Ethical Considerations
### Privacy
- **No real households**: All data fully synthetic, cannot identify real people/places
- **No GPS coordinates**: No geographic identifiers that could reveal locations
- **Aggregated patterns only**: Individual records are fictional
### Representation
- **Pan-African focus**: Captures diversity across SSA, not dominated by single region
- **Pastoral systems included**: Arid/semi-arid zones well-represented
- **Smallholder-centric**: Large commercial farms not included
- **Traditional knowledge**: Ethnoveterinary practices acknowledged
### Responsible Use
Users should:
- ✅ Clearly label outputs as based on synthetic data
- ✅ Validate methods on real data before deployment
- ✅ Not overstate generalizability of findings
- ✅ Cite real data sources when transitioning to applications
- ✅ Engage local stakeholders when designing interventions
## Citation Information
If you use this dataset, please cite:
```bibtex
@dataset{livestock_health_synthetic_2024,
  author    = {Electric Sheep Africa},
  title     = {Livestock Health and Disease Surveillance Synthetic Dataset for Sub-Saharan Africa},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/electricsheepafrica/livestock-health-disease-ssa-synthetic}
}
```
### Key Literature Sources
This dataset synthesizes information from 50+ sources, including:
- **Perry & Grace (2009)**: Economic impacts of animal diseases (Journal of Agricultural Economics)
- **Cleaveland et al. (2001)**: Diseases of humans and domestic mammals (Phil Trans Royal Society B)
- **Leonard et al. (2017)**: Veterinary service delivery in developing countries (Rev. sci. tech. Off. int. Epiz)
- **Robinson et al. (2011)**: Global livestock production systems (FAO/ILRI)
- **AU-IBAR (2013)**: Veterinary services delivery in Africa (African Union)
- **McCorkle (1995)**: Ethnoveterinary R&D (Agriculture and Human Values)
- **Herrero et al. (2013)**: Biomass use in global livestock systems (PNAS)
- **Reid et al. (2014)**: Pastoral land development models (Ecology and Society)
Full bibliography available in parameter files (`parameters_livestock/` directory).
## Dataset Structure
### Variable Types
- **Categorical** (9 variables): Zones, disease types, systems
- **Continuous** (14 variables): Herd sizes, distances, indices, rates
- **Binary** (4 variables): Access, incidence, supplementation
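For analysis it can help to convert the yes/no variables into a nullable boolean type so that missing answers stay missing. This is a minimal sketch that assumes the binary variables are stored as the strings `yes` and `no`, as in the sample record; the column list is inferred from the variables marked "(yes/no)" in this card, and `yes_no_to_boolean` is a hypothetical helper.

```python
import pandas as pd

# Columns described as yes/no in the variable list
BINARY_COLUMNS = [
    "extension_access",
    "vet_visit_annual",
    "disease_incidence_annual",
    "feed_supplementation",
]

def yes_no_to_boolean(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with yes/no strings mapped to pandas' nullable boolean dtype."""
    out = df.copy()
    for column in columns:
        if column in out.columns:
            mapped = out[column].map({"yes": True, "no": False})
            out[column] = mapped.astype("boolean")
    return out

# Toy rows for illustration only
toy = pd.DataFrame({"vet_visit_annual": ["yes", "no", None]})
converted = yes_no_to_boolean(toy, BINARY_COLUMNS)
print(converted["vet_visit_annual"])
```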
### Sample Record
```csv
agro_ecological_zone,region_type,herd_size_cattle,disease_incidence_annual,vet_distance_km,...
semi_arid,rural_accessible,4,yes,35.2,...
```
## Updates and Versioning
- **Version**: 1.0
- **Release Date**: November 2024
- **Status**: Stable
- **Planned Updates**: None currently planned
## Contact
**Creator**: Electric Sheep Africa
**Repository**: [GitHub](https://github.com/electricsheepafrica/agriculture-synthetic-data)
**Issues**: Report via GitHub Issues
## License
**CC BY 4.0** (Creative Commons Attribution 4.0 International)
You are free to:
- ✅ Share and redistribute
- ✅ Adapt and build upon
- ✅ Use commercially
Under the condition that you:
- ✅ Give appropriate credit
- ✅ Indicate if changes were made
- ✅ Do not misrepresent as real surveillance data
---
## How to Load
```python
import pandas as pd
from datasets import load_dataset

# Load the full dataset from the Hugging Face Hub
dataset = load_dataset("electricsheepafrica/livestock-health-disease-ssa-synthetic")

# Convert the training split to a pandas DataFrame
df = dataset["train"].to_pandas()

# Alternatively, read a locally downloaded Parquet file directly
df = pd.read_parquet("livestock_data.parquet")
```
## Example Use Cases
### 1. Disease Risk Prediction
```python
# Select features and target for an ML model that predicts disease incidence
X = df[['herd_size_cattle', 'vet_distance_km', 'vaccination_coverage_pct',
        'agro_ecological_zone', 'pasture_quality_index']]
y = df['disease_incidence_annual']
```
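The selected features include a categorical column and missing values, and the target is a yes/no string, so most models need a preparation step first. The following is one possible sketch, assuming rows with missing inputs are simply dropped and the zone is one-hot encoded; `prepare_features` and the toy rows are illustrative, and other imputation or encoding choices may suit a given model better.

```python
import pandas as pd

FEATURES = ["herd_size_cattle", "vet_distance_km", "vaccination_coverage_pct",
            "agro_ecological_zone", "pasture_quality_index"]
TARGET = "disease_incidence_annual"

def prepare_features(df: pd.DataFrame):
    """Drop incomplete rows, one-hot encode the zone, and binarize the target."""
    complete = df.dropna(subset=FEATURES + [TARGET])
    X = pd.get_dummies(complete[FEATURES], columns=["agro_ecological_zone"])
    y = (complete[TARGET] == "yes").astype(int)  # 1 = disease reported
    return X, y

# Toy rows for illustration only (second row has a missing vet distance)
toy = pd.DataFrame({
    "herd_size_cattle": [4, 0, 12],
    "vet_distance_km": [35.2, None, 80.0],
    "vaccination_coverage_pct": [20.0, 0.0, 50.0],
    "agro_ecological_zone": ["semi_arid", "humid", "humid"],
    "pasture_quality_index": [40.0, 70.0, 65.0],
    "disease_incidence_annual": ["yes", "no", "no"],
})
X_toy, y_toy = prepare_features(toy)
print(X_toy.columns.tolist())
```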
### 2. Vet Clinic Placement Optimization
```python
# Find underserved areas
underserved = df[(df['vet_distance_km'] > 60) & (df['livestock_tlu'] > 5)]
```
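Since the dataset has no coordinates, one way to use the underserved filter is to compare zones rather than locations. The sketch below computes the share of underserved households per agro-ecological zone using the same thresholds as the filter above; the helper name and toy rows are assumptions, and rows with a missing distance or TLU value are counted as not underserved because comparisons against missing values evaluate to false.

```python
import pandas as pd

def underserved_share_by_zone(df: pd.DataFrame,
                              min_distance_km: float = 60,
                              min_tlu: float = 5) -> pd.Series:
    """Percentage of households per zone that are far from a vet and hold many TLU."""
    is_underserved = (df["vet_distance_km"] > min_distance_km) & (df["livestock_tlu"] > min_tlu)
    return is_underserved.groupby(df["agro_ecological_zone"]).mean().mul(100).round(1)

# Toy rows for illustration only
toy = pd.DataFrame({
    "agro_ecological_zone": ["arid", "arid", "humid", "humid"],
    "vet_distance_km": [80.0, 30.0, 70.0, None],
    "livestock_tlu": [6.0, 8.0, 2.0, 9.0],
})
shares = underserved_share_by_zone(toy)
print(shares)
```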
### 3. Vaccination Campaign Targeting
```python
# Identify high-risk, low-coverage households
targets = df[(df['vaccination_coverage_pct'] < 20) &
             (df['disease_incidence_annual'] == 'yes')]
```
---
**Dataset 2 of 5** in the African Agriculture & Food Security Synthetic Data Portfolio
Provider: ajay2k3



