
ajay2k3/livestock-health-disease-ssa-synthetic

Source: Hugging Face. Updated 2026-01-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ajay2k3/livestock-health-disease-ssa-synthetic
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-regression
- tabular-classification
tags:
- agriculture
- livestock
- africa
- synthetic-data
- food-security
- veterinary
- disease-surveillance
- smallholder-farming
size_categories:
- 1M<n<10M
language:
- en
pretty_name: "Livestock Health & Disease Surveillance - Sub-Saharan Africa (Synthetic)"
---

# Dataset Card: Livestock Health & Disease Surveillance (Synthetic Data)

## Dataset Summary

This synthetic dataset represents **1,000,000 African smallholder households** with livestock systems, capturing livestock health, disease surveillance, veterinary access, and herd management practices across Sub-Saharan Africa. It combines baseline farm characteristics (Dataset 1) with 15 livestock-specific variables to create a comprehensive picture of livestock production systems and animal health challenges.

**Key Features:**

- **1M households** across 5 agro-ecological zones
- **27 variables** (12 base farm + 15 livestock health)
- **African-specific** livestock systems and diseases
- **Literature-grounded** distributions (50+ peer-reviewed sources)
- **Conditional dependencies** modeling real-world relationships
- **Realistic missing data** patterns

## Variables

### Base Farm Characteristics (Dataset 1 - 12 variables)

1. **agro_ecological_zone**: Arid, semi-arid, sub-humid, humid, highland
2. **region_type**: Urban, peri-urban, rural accessible, rural remote
3. **farm_size_ha**: Farm size in hectares
4. **soil_quality_index**: Soil quality (0-100 scale)
5. **rainfall_mm_annual**: Annual rainfall (mm)
6. **household_size**: Number of household members
7. **market_distance_km**: Distance to nearest market
8. **livestock_tlu**: Tropical Livestock Units owned
9. **extension_access**: Access to agricultural extension (yes/no)
10. **fertilizer_use_kg_ha**: Fertilizer application rate
11. **rainfall_mm_season**: Seasonal rainfall (mm)
12. **maize_yield_kg_ha**: Maize yield (kg/ha)

### Livestock Health & Production (NEW - 15 variables)

#### Herd Composition

13. **herd_size_cattle**: Number of cattle owned (0-50+)
14. **herd_size_small_ruminants**: Sheep and goats owned (0-100+)
15. **poultry_count**: Chickens, ducks, etc. (0-200+)

#### Veterinary Services & Access

16. **vet_distance_km**: Distance to nearest veterinary service (1-200 km)
17. **vaccination_coverage_pct**: % of herd vaccinated (0-100%)
18. **vet_visit_annual**: Had veterinary visit in past year (yes/no)

#### Disease & Health

19. **disease_incidence_annual**: Reported disease in past year (yes/no)
20. **disease_type**: Type of disease (FMD, ECF, CBPP, trypanosomiasis, PPR, Newcastle, respiratory, diarrhea, other)
21. **mortality_rate_annual_pct**: Annual livestock mortality rate (%)
22. **pasture_quality_index**: Pasture/rangeland quality (0-100 scale)

#### Management Systems

23. **grazing_system**: Type of grazing (communal, private, mixed, zero-grazing)
24. **water_source_reliability**: Water availability (year-round, seasonal, unreliable)
25. **treatment_access**: Type of treatment accessed (none, traditional, veterinary, both)
26. **feed_supplementation**: Provides supplementary feed (yes/no)
27. **livestock_dependency_index**: Household dependence on livestock (0-100 scale)

## Dataset Statistics

### Livestock Ownership

- **43.4%** of households own cattle
- **62.9%** own small ruminants (sheep/goats)
- **67.5%** keep poultry
- Mean cattle herd size: ~5 animals (among owners)
- Mean small ruminant herd: ~12 animals (among owners)
- Mean poultry flock: ~8 birds (among keepers)

### Disease Burden

- **32.7%** reported disease incidence in past year
- Most common diseases:
  - Newcastle disease (poultry): 20%
  - FMD (Foot & Mouth): 18%
  - PPR (Peste des Petits Ruminants): 15%
  - ECF (East Coast Fever): 12%
  - Trypanosomiasis: 10%

### Veterinary Access

- **40.3%** had veterinary contact in past year
- Mean distance to vet services: **58.9 km**
- **20%** vaccination coverage (median)
- Treatment types:
  - 35% no treatment
  - 45% traditional remedies only
  - 15% veterinary treatment
  - 5% both traditional and veterinary

### Management Practices

- **50%** use communal grazing systems
- **25%** private grazing
- **20%** mixed systems
- **5%** zero-grazing (intensive)
- **30%** provide feed supplementation
- **40%** have year-round water access
- **35%** seasonal water only
- **25%** unreliable water

## Uses

### Permitted Uses

- **Livestock policy analysis**: Model impacts of disease control programs
- **Veterinary service planning**: Optimize clinic placement and mobile vet routes
- **Disease surveillance system design**: Test outbreak detection algorithms
- **Animal health research**: Train ML models for disease prediction
- **One Health initiatives**: Link livestock-human health systems
- **Extension service planning**: Target interventions by livestock system type
- **Educational purposes**: Teaching livestock epidemiology and policy
- **Climate adaptation**: Model livestock system resilience
- **Value chain analysis**: Link livestock production to markets
- **Research method development**: Test statistical techniques

### Prohibited Uses

- **Not for replacement of real data collection**: Cannot substitute for actual field surveys
- **Not for country-specific policy**: Too generalized for single-country decisions
- **Not for real-time disease outbreak response**: Not actual surveillance data
- **Not for individual farmer targeting**: Synthetic households are not real
- **Not for precise cost-benefit analysis**: Use for methodological prototypes only

## Dataset Creation

### Why This Dataset Exists

Real livestock health data in Sub-Saharan Africa faces critical gaps:

1. **Surveillance gaps**: Most countries lack systematic disease surveillance
2. **Underreporting**: Livestock diseases often go unreported (especially in remote areas)
3. **Fragmented data**: Information scattered across vet clinics, ministries, NGOs
4. **Access restrictions**: Sensitive disease data rarely shared publicly
5. **High collection costs**: Surveys expensive and logistically challenging
6. **Privacy concerns**: Household-level data cannot be openly published

**This synthetic dataset enables:**

- Algorithm development without waiting for data access
- Training of researchers and students
- International collaboration without data sharing barriers
- Rapid prototyping of livestock information systems
- Evidence generation for funding proposals

### Creation Methodology

**Rigorous 4-stage process** following synthetic data best practices:

#### Stage 1: Literature Review (50+ sources)

- Systematic review of livestock systems in SSA
- Disease prevalence studies (FMD, ECF, trypanosomiasis, PPR, Newcastle)
- Veterinary service coverage assessments
- Management practice surveys
- Mortality and productivity benchmarks

#### Stage 2: Parameter Specification (15 files, 60-150 lines each)

- Conditional probability distributions by zone, region, herd size
- Functional relationships (e.g., vet distance → vaccination rates)
- Species-specific disease patterns
- Management system typologies
- Full provenance tracking

#### Stage 3: Conditional Data Generation

- Base variables from Dataset 1 (smallholder farms)
- Sequential generation respecting dependencies
- Zero-inflated distributions for herd sizes
- Categorical conditioning for disease types
- Realistic missing data (MCAR: 1-10%)

#### Stage 4: Validation

- Cross-variable consistency checks
- Literature benchmark comparisons
- Logical constraint verification
- Distribution shape validation

## Limitations and Biases

### Known Limitations

1. **Oversimplified disease dynamics**: Real disease spread is more complex than modeled
2. **Static snapshot**: No temporal dynamics (outbreaks, seasonality within year)
3. **No spatial clustering**: Real diseases show geographic clustering not captured
4. **Coarse zones**: 5 AEZ categories don't capture local variation
5. **Missing variables**: No breed info, no herd demographics, no animal-level data
6. **Treatment outcomes**: No data on treatment success/failure
7. **No cost data**: Disease impacts measured only in mortality, not economics
8. **Simplified grazing**: Complex pastoral mobility patterns simplified
9. **Binary disease incidence**: Real incidence is more granular (multiple episodes)

### Potential Biases

1. **Literature bias**: Sources mostly from East Africa (Kenya, Tanzania, Ethiopia)
2. **Veterinary access**: May overestimate coverage in very remote pastoral areas
3. **Disease reporting**: Literature likely underrepresents mild/unreported diseases
4. **Poultry systems**: Village chickens well-represented, commercial systems underrepresented
5. **Traditional knowledge**: Traditional treatment effectiveness may be under-captured
6. **Gender**: No gender disaggregation of livestock ownership/management
7. **Wealth gradient**: Livestock wealth distribution may be too uniform
8. **Conflict zones**: Data may not reflect pastoralist areas affected by conflict

### What This Dataset Is NOT

- ❌ **Not real surveillance data**: Do not use for actual disease outbreak decisions
- ❌ **Not predictive**: Cannot predict real disease occurrence
- ❌ **Not country-specific**: Generalized SSA patterns, not any single country
- ❌ **Not longitudinal**: Single time point, no panel structure
- ❌ **Not spatially explicit**: No GPS coordinates, no spatial autocorrelation

## Technical Specifications

### File Formats

- **CSV**: `livestock_data.csv` (315 MB, 1M rows)
- **Parquet**: `livestock_data.parquet` (111 MB, compressed)
- **Metadata**: `metadata.json` (generation parameters, sources)
- **Data Dictionary**: `data_dictionary.csv` (variable descriptions)

### Missing Data

Realistic missing data rates by variable:

- Herd sizes: 2%
- Vet distance: 4%
- Vaccination coverage: 5%
- Disease incidence: 3%
- Pasture quality: 6%
- Mortality rate: 3%
- Disease type: 10% (conditional on disease occurrence)
- Management variables: 3-4%

### Data Quality Indicators

- ✅ All constraints validated (no impossible values)
- ✅ Conditional dependencies respected
- ✅ Literature benchmarks matched (±10%)
- ✅ Cross-variable correlations logical
- ✅ Missing data patterns realistic

## Ethical Considerations

### Privacy

- **No real households**: All data fully synthetic, cannot identify real people/places
- **No GPS coordinates**: No geographic identifiers that could reveal locations
- **Aggregated patterns only**: Individual records are fictional

### Representation

- **Pan-African focus**: Captures diversity across SSA, not dominated by single region
- **Pastoral systems included**: Arid/semi-arid zones well-represented
- **Smallholder-centric**: Large commercial farms not included
- **Traditional knowledge**: Ethnoveterinary practices acknowledged

### Responsible Use

Users should:

- ✅ Clearly label outputs as based on synthetic data
- ✅ Validate methods on real data before deployment
- ✅ Not overstate generalizability of findings
- ✅ Cite real data sources when transitioning to applications
- ✅ Engage local stakeholders when designing interventions

## Citation Information

If you use this dataset, please cite:

```bibtex
@dataset{livestock_health_synthetic_2024,
  author = {Electric Sheep Africa},
  title = {Livestock Health and Disease Surveillance Synthetic Dataset for Sub-Saharan Africa},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/electricsheepafrica/livestock-health-disease-ssa-synthetic}
}
```

### Key Literature Sources

This dataset synthesizes information from 50+ sources, including:

- **Perry & Grace (2009)**: Economic impacts of animal diseases (Journal of Agricultural Economics)
- **Cleaveland et al. (2001)**: Diseases of humans and domestic mammals (Phil Trans Royal Society B)
- **Leonard et al. (2017)**: Veterinary service delivery in developing countries (Rev. sci. tech. Off. int. Epiz)
- **Robinson et al. (2011)**: Global livestock production systems (FAO/ILRI)
- **AU-IBAR (2013)**: Veterinary services delivery in Africa (African Union)
- **McCorkle (1995)**: Ethnoveterinary R&D (Agriculture and Human Values)
- **Herrero et al. (2013)**: Biomass use in global livestock systems (PNAS)
- **Reid et al. (2014)**: Pastoral land development models (Ecology and Society)

Full bibliography available in parameter files (`parameters_livestock/` directory).

## Dataset Structure

### Variable Types

- **Categorical** (9 variables): Zones, disease types, systems
- **Continuous** (14 variables): Herd sizes, distances, indices, rates
- **Binary** (4 variables): Access, incidence, supplementation

### Sample Record

```csv
agro_ecological_zone,region_type,herd_size_cattle,disease_incidence_annual,vet_distance_km,...
semi_arid,rural_accessible,4,yes,35.2,...
```

## Updates and Versioning

- **Version**: 1.0
- **Release Date**: November 2024
- **Status**: Stable
- **Planned Updates**: None currently planned

## Contact

**Creator**: Electric Sheep Africa
**Repository**: [GitHub](https://github.com/electricsheepafrica/agriculture-synthetic-data)
**Issues**: Report via GitHub Issues

## License

**CC BY 4.0** (Creative Commons Attribution 4.0 International)

You are free to:

- ✅ Share and redistribute
- ✅ Adapt and build upon
- ✅ Use commercially

Under the condition that you:

- ✅ Give appropriate credit
- ✅ Indicate if changes were made
- ✅ Do not misrepresent as real surveillance data

---

## How to Load

```python
import pandas as pd
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("electricsheepafrica/livestock-health-disease-ssa-synthetic")

# Load as pandas DataFrame
df = dataset['train'].to_pandas()

# Or load Parquet directly
df = pd.read_parquet("livestock_data.parquet")
```

## Example Use Cases

### 1. Disease Risk Prediction

```python
# Train ML model to predict disease incidence
X = df[['herd_size_cattle', 'vet_distance_km', 'vaccination_coverage_pct',
        'agro_ecological_zone', 'pasture_quality_index']]
y = df['disease_incidence_annual']
```

### 2. Vet Clinic Placement Optimization

```python
# Find underserved areas
underserved = df[(df['vet_distance_km'] > 60) & (df['livestock_tlu'] > 5)]
```

### 3. Vaccination Campaign Targeting

```python
# Identify high-risk, low-coverage households
targets = df[(df['vaccination_coverage_pct'] < 20) &
             (df['disease_incidence_annual'] == 'yes')]
```

---

**Dataset 2 of 5** in the African Agriculture & Food Security Synthetic Data Portfolio
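The Stage 3 steps listed under Creation Methodology (zero-inflated herd sizes, a vet-distance to vaccination relationship, disease type conditioned on incidence, MCAR missingness) can be illustrated with a minimal standard-library sketch. This is not the authors' generator: the function names, the exponential tail, the `60.0 * 0.98 ** km` decay curve, and the uniform choice of disease type are all assumptions made for illustration. Only the ownership rate (43.4%), incidence rate (32.7%), and missing-data rates are taken from the card.

```python
import random

# Category list taken from the disease_type variable description.
DISEASES = ["FMD", "ECF", "CBPP", "trypanosomiasis", "PPR",
            "Newcastle", "respiratory", "diarrhea", "other"]

def sample_herd_size(rng, p_own, mean_if_owner):
    """Zero-inflated count: 0 for non-owners, at least 1 for owners."""
    if rng.random() >= p_own:
        return 0
    # Assumed shape: exponential tail, floored and shifted so owners hold >= 1 animal.
    return 1 + int(rng.expovariate(1.0 / max(mean_if_owner - 1, 1e-9)))

def sample_vaccination_pct(rng, vet_distance_km):
    """Hypothetical link: expected coverage falls as the vet gets farther away."""
    expected = 60.0 * (0.98 ** vet_distance_km)
    return min(100.0, max(0.0, rng.gauss(expected, 10.0)))

def mcar(rng, value, rate):
    """Blank out a value completely at random with the given probability."""
    return None if rng.random() < rate else value

def generate_household(rng):
    """Generate one record, drawing each variable after the ones it depends on."""
    cattle = sample_herd_size(rng, p_own=0.434, mean_if_owner=5)
    vet_km = round(rng.uniform(1, 200), 1)
    vacc = round(sample_vaccination_pct(rng, vet_km), 1)
    disease = "yes" if rng.random() < 0.327 else "no"
    # A disease type exists only when disease was reported (uniform choice is a simplification).
    dtype = rng.choice(DISEASES) if disease == "yes" else None
    return {
        "herd_size_cattle": mcar(rng, cattle, 0.02),
        "vet_distance_km": mcar(rng, vet_km, 0.04),
        "vaccination_coverage_pct": mcar(rng, vacc, 0.05),
        "disease_incidence_annual": mcar(rng, disease, 0.03),
        "disease_type": mcar(rng, dtype, 0.10),
    }

rows = [generate_household(random.Random(seed)) for seed in range(5)]
```

Generating variables in dependency order (distance before vaccination, incidence before type) is what keeps the conditional relationships intact; missingness is applied last so it does not disturb them.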
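The Stage 4 validation checks and the "±10%" benchmark indicator can be sketched in the same spirit. The helper names below are hypothetical, and reading "±10%" as a relative tolerance is an assumption; the card does not say whether the margin is relative or absolute.

```python
PERCENT_FIELDS = ("vaccination_coverage_pct", "mortality_rate_annual_pct",
                  "pasture_quality_index", "livestock_dependency_index")
COUNT_FIELDS = ("herd_size_cattle", "herd_size_small_ruminants", "poultry_count")

def within_tolerance(observed, benchmark, rel_tol=0.10):
    """True if observed lies within rel_tol (assumed relative) of the benchmark."""
    return abs(observed - benchmark) <= rel_tol * abs(benchmark)

def check_record(rec):
    """Return a list of logical-constraint violations for one household record."""
    problems = []
    for key in PERCENT_FIELDS:
        value = rec.get(key)
        if value is not None and not 0 <= value <= 100:
            problems.append(f"{key} outside 0-100: {value}")
    for key in COUNT_FIELDS:
        value = rec.get(key)
        if value is not None and value < 0:
            problems.append(f"{key} is negative: {value}")
    # Cross-variable consistency: a disease type requires reported disease.
    if rec.get("disease_incidence_annual") == "no" and rec.get("disease_type") is not None:
        problems.append("disease_type set although no disease was reported")
    return problems

good = {"herd_size_cattle": 4, "vaccination_coverage_pct": 20.0,
        "disease_incidence_annual": "yes", "disease_type": "FMD"}
bad = {"herd_size_cattle": -1, "vaccination_coverage_pct": 120.0,
       "disease_incidence_annual": "no", "disease_type": "PPR"}
print(check_record(good))
print(len(check_record(bad)))
```

Missing values (`None`) are skipped rather than flagged, since the card treats missingness as an intended feature of the data.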
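Because several variables carry missing values, summaries should state how blanks are handled. The sketch below computes disease incidence by agro-ecological zone using only the standard library, skipping unanswered rows. The inline `SAMPLE` rows are made up for illustration, and it is an assumption that missing values appear as empty fields in `livestock_data.csv`.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using two column names from the card; the third row has a missing answer.
SAMPLE = """agro_ecological_zone,disease_incidence_annual
semi_arid,yes
semi_arid,no
semi_arid,
humid,no
humid,yes
humid,yes
"""

def incidence_by_zone(csv_file):
    """Share of households reporting disease per zone, ignoring missing answers."""
    counts = defaultdict(lambda: [0, 0])  # zone -> [yes count, answered count]
    for row in csv.DictReader(csv_file):
        answer = row["disease_incidence_annual"].strip().lower()
        if answer not in ("yes", "no"):
            continue  # treat blanks or unexpected codes as missing
        tally = counts[row["agro_ecological_zone"]]
        tally[0] += answer == "yes"
        tally[1] += 1
    return {zone: yes / answered for zone, (yes, answered) in counts.items()}

shares = incidence_by_zone(io.StringIO(SAMPLE))
print(shares)
```

To run this against the real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open("livestock_data.csv", newline="")`.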
Provider:
ajay2k3