
electricsheepafrica/postharvest-value-chains-ssa-synthetic

Source: Hugging Face. Updated 2025-11-13; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/postharvest-value-chains-ssa-synthetic
Description:
---
license: cc-by-4.0
task_categories:
- tabular-regression
- tabular-classification
tags:
- agriculture
- food-security
- africa
- synthetic-data
- postharvest
- value-chains
- market-access
- storage
- smallholder-farming
size_categories:
- 1M<n<10M
language:
- en
pretty_name: "Post-Harvest Value Chains - Sub-Saharan Africa (Synthetic)"
---

# Dataset Card: Post-Harvest Value Chains - Sub-Saharan Africa (Synthetic Data)

## Dataset Summary

This synthetic dataset represents **1,000,000 African smallholder households** with comprehensive post-harvest value chain data, capturing storage systems, market access, processing, and value addition across Sub-Saharan Africa. It builds upon baseline farm characteristics (Dataset 1) and livestock systems (Dataset 2) to create a complete picture of agricultural production, livestock management, and post-harvest economics.

**Key Features:**
- **1M households** across 5 agro-ecological zones
- **41 variables** (12 base farm + 15 livestock + 14 post-harvest)
- **African-specific** value chain patterns and constraints
- **Literature-grounded**: 20+ peer-reviewed sources for post-harvest variables
- **100% synthetic**: No real households, privacy-preserving
- **Validated**: All benchmarks passed against literature expectations

## Why This Dataset?
Post-harvest losses are the **"missing middle"** of African food security:
- **$4 billion annual losses** in Sub-Saharan Africa
- **20-40% of cereals** lost between harvest and consumption
- **Enough grain lost** to feed 48 million people annually
- **Single largest opportunity** for food security improvement

Yet research is hampered by:
- Lack of household-level value chain data
- Privacy concerns in real data collection
- High cost of primary data collection
- Need for large samples for ML/statistical methods

This synthetic dataset enables:
- **Algorithm development** for post-harvest loss prediction
- **Value chain analysis** without privacy concerns
- **Market access modeling** at scale
- **Policy simulation** of interventions (storage technology, cooperatives, etc.)
- **Teaching** agricultural economics and food systems
- **Benchmarking** methods before deploying on real data

## Dataset Details

### Dataset Description

**Size**: 1,000,000 rows × 41 variables
**Format**: CSV (510 MB) + Parquet (175 MB)
**License**: CC-BY-4.0
**Created**: November 2024
**Version**: 1.0

### Variables

The dataset contains **41 variables** across three thematic areas:

#### **Base Farm Variables (12)** - From Dataset 1

| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `agro_ecological_zone` | categorical | AEZ classification | arid, semi_arid, sub_humid, humid, highland |
| `region_type` | categorical | Settlement type | urban, peri_urban, rural_accessible, rural_remote |
| `farm_size_ha` | continuous | Farm size (hectares) | 0.1-50 ha |
| `soil_quality_index` | continuous | Soil quality (0-100) | 0-100 |
| `rainfall_mm_annual` | continuous | Annual rainfall | 200-2000 mm |
| `household_size` | integer | Number of people | 1-20 |
| `market_distance_km` | continuous | Distance to market | 0.5-150 km |
| `livestock_tlu` | continuous | Tropical Livestock Units | 0-50 TLU |
| `extension_access` | categorical | Agricultural extension | yes, no |
| `fertilizer_use_kg_ha` | continuous | Fertilizer application | 0-300 kg/ha |
| `rainfall_mm_season` | continuous | Seasonal rainfall | 100-1000 mm |
| `maize_yield_kg_ha` | continuous | Maize yield | 0-6000 kg/ha |

#### **Livestock Variables (15)** - From Dataset 2

| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `herd_size_cattle` | continuous | Number of cattle | 0-100 |
| `herd_size_small_ruminants` | continuous | Sheep/goats | 0-150 |
| `poultry_count` | continuous | Chickens/ducks | 0-500 |
| `vet_distance_km` | continuous | Distance to vet | 0.5-200 km |
| `vaccination_coverage_pct` | continuous | Vaccination rate | 0-100% |
| `disease_incidence_annual` | categorical | Disease occurred | yes, no |
| `pasture_quality_index` | continuous | Pasture quality | 0-100 |
| `mortality_rate_annual_pct` | continuous | Livestock mortality | 0-50% |
| `vet_visit_annual` | categorical | Vet contact | yes, no |
| `disease_type` | categorical | Disease type | FMD, ECF, CBPP, Newcastle, etc. |
| `water_source_reliability` | categorical | Water access | year_round, seasonal, unreliable |
| `grazing_system` | categorical | Grazing type | communal, private, mixed, zero |
| `treatment_access` | categorical | Treatment used | none, traditional, veterinary, both |
| `feed_supplementation` | categorical | Supplemental feed | yes, no |
| `livestock_dependency_index` | continuous | Livelihood dependence | 0-100 |

#### **Post-Harvest Value Chain Variables (14)** - NEW in Dataset 3

**Storage Infrastructure (4 variables)**

| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `storage_type` | categorical | Storage facility | none, traditional_granary, improved_silo, warehouse |
| `storage_duration_months` | continuous | Storage period | 0-12 months |
| `storage_loss_pct` | continuous | Grain loss % | 0-80% |
| `hermetic_storage` | categorical | Airtight storage | yes, no |

**Post-Harvest Handling (3 variables)**

| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `drying_method` | categorical | Grain drying | none, sun, tarp, mechanical |
| `pest_control` | categorical | Pest management | none, traditional, chemical, IPM |
| `sorting_grading` | categorical | Quality sorting | yes, no |

**Processing Access (2 variables)**

| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `miller_access_km` | continuous | Distance to mill | 0.5-100 km |
| `value_addition` | categorical | Processing level | none, simple_processing, packaged |

**Market Access (5 variables)**

| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `produce_sold_pct` | continuous | % harvest sold | 0-100% |
| `buyer_type` | categorical | Buyer type | local_trader, cooperative, processor, direct_consumer |
| `price_received_kg` | continuous | Farmgate price | $0.05-0.60/kg |
| `transport_cost_kg` | continuous | Transport cost | $0-0.15/kg |
| `days_to_sell` | continuous | Time to sale | 0-180 days |

## Dataset Statistics

### Post-Harvest Key Statistics

**Storage:**
- **60.6%** use traditional granary storage
- **20.1%** have no storage (immediate consumption/sale)
- **14.3%** use improved storage (silos, PICS bags)
- **5.1%** access warehouse storage
- **Mean storage loss: 30.4%** (validated: 15-40% expected)
- **Hermetic storage adoption: 13.6%** (validated: 10-20% expected)

**Market Access:**
- **66.5%** sell to local traders (trader dominance)
- **18.2%** sell via cooperatives
- **9.4%** sell to processors
- **5.8%** direct to consumers
- **Mean price: $0.21/kg** (validated: $0.18-0.30/kg expected)
- **Mean transport cost: $0.02/kg** (~10% of farmgate price)

**Commercialization:**
- **Mean 43.2% of harvest sold** (validated: 30-55% expected)
- **Mean 50 days from harvest to sale**
- **86%** sell raw produce (no value addition)
- **71%** don't sort/grade produce

**Post-Harvest Handling:**
- **54.6%** use sun drying on ground
- **28%** use tarp/mat (improved)
- **15%** no proper drying
- **40%** use no pest control
- **35%** use traditional pest control methods

## Dataset Creation

### Methodology

This dataset was created using the **Synthetic Data Generation Playbook** methodology:

1. **Literature Review**: Extracted parameters from 20+ peer-reviewed sources on post-harvest systems in Sub-Saharan Africa
2. **Parameter Specification**: Created detailed YAML files for each variable including:
   - Probability distributions grounded in literature
   - Conditional dependencies on base variables
   - African context-specific constraints
   - Validation benchmarks from published research
3. **Dependency Modeling**: Mapped relationships between variables:
   - Storage type → storage losses (traditional 25-40%, improved 2-5%)
   - Buyer type → prices (traders vs cooperatives vs processors)
   - Market distance → transport costs and prices
   - Storage duration → seasonal price arbitrage
   - Hermetic storage → pest control effectiveness
4. **Data Generation**: Used custom generators with:
   - Fixed random seed (42) for reproducibility
   - Conditional probability distributions
   - Realistic correlation structures
   - Missing data mechanisms (MCAR, 2-5% per variable)
5. **Validation**: Validated against literature benchmarks:
   - Storage loss rates
   - Market prices and costs
   - Adoption rates for improved technologies
   - Buyer type distributions
   - Commercialization levels

### Source Data

**Key Literature Sources (Post-Harvest Variables):**
- **World Bank (2011)**: "Missing food: Post-harvest grain losses in Sub-Saharan Africa"
- **Affognon et al. (2015)**: "Unpacking post-harvest losses in Sub-Saharan Africa" - World Development 66: 49-68
- **Tefera et al. (2011)**: "The metal silo: Effective grain storage technology" - Crop Protection 30: 240-245
- **Minten et al. (2013)**: "Value chains and missing markets in Eastern Africa" - Development Policy Review
- **Barrett et al. (2012)**: "Smallholder participation in contract farming" - World Development 40(4): 715-730
- **Bernard et al. (2008)**: "Impact of cooperatives on smallholders' commercialization behavior"
- **Baributsa et al. (2014)**: "PICS bag adoption study" - World Development 66: 49-68
- **Murdock et al. (2012)**: "Hermetic storage effectiveness" - Journal of Stored Products Research
- **FAO GIEWS (2023)**: "Global food price monitoring"
- **Kadjo et al. (2018)**: "Storage technology adoption determinants"
- Plus 10+ additional peer-reviewed sources

Full citations available in parameter files.
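The dependency-modeling and generation steps listed under Methodology can be illustrated with a minimal sketch. It assumes NumPy and pandas. The category shares are taken from the Dataset Statistics section and the loss ranges for traditional and improved storage from the dependency list; the Beta shape parameters, the ranges used for `none` and `warehouse`, and the names `generate_storage`, `STORAGE_SHARES` and `LOSS_RANGE_PCT` are hypothetical and do not reproduce the curators' actual generator.

```python
import numpy as np
import pandas as pd

SEED = 42  # the card states a fixed random seed of 42

# Category shares reported under Dataset Statistics (renormalised below,
# because the rounded figures sum to slightly more than 1).
STORAGE_SHARES = {
    "traditional_granary": 0.606,
    "none": 0.201,
    "improved_silo": 0.143,
    "warehouse": 0.051,
}

# (low, high) loss ranges in percent. Traditional and improved follow the card;
# the other two are placeholder assumptions for illustration only.
LOSS_RANGE_PCT = {
    "traditional_granary": (25.0, 40.0),
    "improved_silo": (2.0, 5.0),
    "none": (10.0, 30.0),      # assumption
    "warehouse": (2.0, 10.0),  # assumption
}


def generate_storage(n, missing_rate=0.03, seed=SEED):
    """Draw storage type, then storage loss conditional on that type."""
    rng = np.random.default_rng(seed)
    types = np.array(list(STORAGE_SHARES))
    probs = np.array(list(STORAGE_SHARES.values()))
    storage_type = rng.choice(types, size=n, p=probs / probs.sum())

    low = np.array([LOSS_RANGE_PCT[t][0] for t in storage_type])
    high = np.array([LOSS_RANGE_PCT[t][1] for t in storage_type])
    # Beta(2, 2) is an arbitrary symmetric shape, rescaled to [low, high].
    loss = low + (high - low) * rng.beta(2.0, 2.0, size=n)

    # MCAR: every value has the same chance of being set to missing.
    loss[rng.random(n) < missing_rate] = np.nan

    return pd.DataFrame({"storage_type": storage_type, "storage_loss_pct": loss})


sample = generate_storage(10_000)
print(sample.groupby("storage_type")["storage_loss_pct"].mean().round(1))
```

Sampling the parent variable first and the dependent variable second keeps each conditional relationship explicit, and reusing the same seed returns an identical frame.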
### Data Fields

Each variable includes:
- **Type**: Continuous (float) or Categorical (string)
- **Unit**: Where applicable (kg, km, percent, USD, etc.)
- **Distribution**: Beta, Gamma, Lognormal, or categorical probabilities
- **Dependencies**: Conditional on base farm and livestock variables
- **Validation**: Expected ranges from literature
- **Missingness**: 2-5% missing data per variable (MCAR mechanism)

## Uses

### Recommended Use Cases

1. **Algorithm Development**
   - Post-harvest loss prediction models
   - Market access optimization
   - Value chain network analysis
   - Price formation modeling
   - Technology adoption prediction
2. **Research Applications**
   - Value chain bottleneck identification
   - Intervention impact simulation
   - Market power analysis
   - Gender gaps in value chains
   - Climate impact on post-harvest systems
3. **Policy Analysis**
   - Storage technology subsidy targeting
   - Cooperative strengthening programs
   - Market infrastructure investment priorities
   - Price stabilization policies
   - Extension service optimization
4. **Education**
   - Agricultural economics teaching
   - Value chain analysis training
   - Data science courses (large realistic dataset)
   - Policy modeling workshops
5. **Method Benchmarking**
   - Test algorithms before real data deployment
   - Compare modeling approaches
   - Validate analytical methods

### Example Use Cases

**Post-Harvest Loss Prediction:**
```python
# Predict storage losses based on storage type, duration, climate
X = df[['storage_type', 'storage_duration_months', 'agro_ecological_zone',
        'hermetic_storage', 'pest_control']]
y = df['storage_loss_pct']
```

**Market Access Modeling:**
```python
# Analyze price received by market access factors
X = df[['market_distance_km', 'buyer_type', 'sorting_grading',
        'transport_cost_kg', 'days_to_sell']]
y = df['price_received_kg']
```

**Technology Adoption:**
```python
# Model hermetic storage adoption drivers
X = df[['extension_access', 'farm_size_ha', 'region_type',
        'storage_type', 'storage_loss_pct']]
y = (df['hermetic_storage'] == 'yes').astype(int)
```

## Limitations

### Known Limitations

1. **Simplified Relationships**
   - Real agricultural systems are more complex
   - Non-linear interactions may be under-represented
   - Political economy factors not explicitly modeled
2. **Cross-Sectional Only**
   - Single time point (no panel/longitudinal data yet)
   - Seasonal dynamics simplified to annual averages
   - Cannot model dynamic adoption or market evolution
3. **Missing Variables**
   - Credit access not included
   - Land tenure not modeled
   - Specific crops beyond maize not detailed
   - Household labor allocation not captured
   - Off-farm income not included
4. **No Spatial Coordinates**
   - Zone-level only (no GPS coordinates)
   - Cannot do spatial analysis or mapping
   - Regional variation simplified
5. **Generalized Parameters**
   - Represents "typical" Sub-Saharan Africa patterns
   - Not country-specific
   - May not capture extreme heterogeneity in specific contexts
6. **Technology Adoption**
   - Some conditional dependencies simplified
   - Social network effects not modeled
   - Innovation diffusion dynamics not captured

### Biases

- **Literature Bias**: Parameters reflect published research, which may under-represent:
  - Marginalized populations
  - Conflict-affected areas
  - Pastoralist systems
  - Urban agriculture
- **Geographic Bias**: Primarily East/Southern Africa patterns (where more research exists)
- **Temporal**: Reflects 2010-2024 period patterns (may not capture recent changes)
- **Data Source Bias**: Literature may over-represent:
  - More accessible populations
  - NGO project areas
  - Cooperative members

### Recommendations

Users should:
1. **Always validate** on real data before production deployment
2. **Acknowledge synthetic nature** in all uses
3. **Understand limitations** for their specific use case
4. **Consider biases** in the underlying literature
5. **Cite properly** (see below)
6. **Not use for**:
   - Actual policy decisions without real data validation
   - Individual farmer targeting
   - Financial risk assessment
   - Legal/regulatory compliance

## Additional Information

### Dataset Curators

Created by **Electric Sheep Africa** using the **Synthetic Data Generation Playbook** methodology:
- Literature-grounded parameter extraction
- African context constraints
- Validated against benchmarks (100% validation pass rate)
- Open methodology and reproducible code

**Playbook Compliance:** ✅ FULL COMPLIANCE VERIFIED

See `PLAYBOOK_COMPLIANCE_DATASET3.md` for detailed verification.
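The benchmark validation mentioned above can be approximated by comparing column summaries with the expected ranges quoted under Dataset Statistics. The sketch below is an illustration under stated assumptions, not the curators' validation code: `check_benchmarks`, `BENCHMARKS` and `HERMETIC_RANGE_PCT` are hypothetical names, and the small `demo` frame is made up to stand in for the real data.

```python
import pandas as pd

# Expected ranges for column means, as quoted under Dataset Statistics.
BENCHMARKS = {
    "storage_loss_pct": (15.0, 40.0),
    "price_received_kg": (0.18, 0.30),
    "produce_sold_pct": (30.0, 55.0),
}
# Expected share of households answering "yes" for hermetic storage.
HERMETIC_RANGE_PCT = (10.0, 20.0)


def check_benchmarks(df):
    """Return one row per benchmark with the observed value and a pass flag."""
    rows = []
    for column, (low, high) in BENCHMARKS.items():
        observed = df[column].mean()  # pandas skips NaN values by default
        rows.append((f"mean {column}", observed, low, high))

    answered = df["hermetic_storage"].dropna()
    share = (answered == "yes").mean() * 100
    rows.append(("hermetic_storage yes %", share, *HERMETIC_RANGE_PCT))

    report = pd.DataFrame(rows, columns=["check", "observed", "low", "high"])
    report["passed"] = (report["observed"] >= report["low"]) & (
        report["observed"] <= report["high"]
    )
    return report


# Made-up stand-in rows; replace with the loaded dataset in practice.
demo = pd.DataFrame({
    "storage_loss_pct": [30, 35, 5, None, 28, 40, 3, 33, 30, 38],
    "price_received_kg": [0.2] * 10,
    "produce_sold_pct": [40.0] * 10,
    "hermetic_storage": ["no"] * 8 + ["yes", None],
})
print(check_benchmarks(demo))
```

Running the same function on a real survey sample would give a like-for-like comparison, in line with the recommendation to validate on real data.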
### Licensing Information

- **Dataset**: Creative Commons Attribution 4.0 (CC-BY-4.0)
- **Code**: MIT License
- **Documentation**: CC-BY-4.0

**Requirements**:
- ✅ Attribution required
- ✅ Must acknowledge synthetic nature
- ✅ Follow acceptable use policy
- ✅ Cite original literature sources when publishing

### Citation

If you use this dataset, please cite:

```bibtex
@dataset{electricsheep2024postharvest,
  title={Post-Harvest Value Chains: Sub-Saharan Africa Synthetic Dataset},
  author={Electric Sheep Africa},
  year={2024},
  publisher={HuggingFace},
  version={1.0},
  url={https://huggingface.co/datasets/electricsheepafrica/postharvest-value-chains-ssa-synthetic},
  note={Synthetic dataset. Generated using literature-grounded methodology. 1M households, 41 variables covering farm systems, livestock, and post-harvest value chains.}
}
```

**Also acknowledge**: This is synthetic data generated from published literature for research purposes.

### Contributions

Contributions welcome via:
- Issue reports (data quality, documentation improvements)
- Additional validation against real datasets
- Extension to new variables or regions
- Method improvements

### Contact

- **Organization**: Electric Sheep Africa
- **Repository**: [GitHub repository link]
- **Issues**: Report via HuggingFace dataset page

## Technical Specifications

### File Formats

- **Parquet**: 174.7 MB (recommended for efficiency)
- **CSV**: 510.2 MB (for readability/compatibility)

### Loading the Data

**Python (Pandas):**
```python
import pandas as pd

# From HuggingFace (parquet)
df = pd.read_parquet('hf://datasets/electricsheepafrica/postharvest-value-chains-ssa-synthetic/postharvest_data.parquet')

# From HuggingFace (CSV)
df = pd.read_csv('hf://datasets/electricsheepafrica/postharvest-value-chains-ssa-synthetic/postharvest_data.csv')

print(f"Shape: {df.shape}")
print(df.head())
```

**Python (HuggingFace Datasets):**
```python
from datasets import load_dataset

dataset = load_dataset('electricsheepafrica/postharvest-value-chains-ssa-synthetic')
df = dataset['train'].to_pandas()

print(f"Shape: {df.shape}")
```

**R:**
```r
library(arrow)

# Parquet
df <- read_parquet('path/to/postharvest_data.parquet')

# CSV
df <- read.csv('path/to/postharvest_data.csv')

dim(df)
head(df)
```

### Data Types

- **Categorical**: Stored as strings
- **Continuous**: Float64
- **Integer**: Int64 (household_size only)
- **Missing**: Represented as NA/NaN

### Reproducibility

**Random Seed**: 42 (fixed)
**Generator Version**: 1.0
**Parameter Files**: Available in source repository

To reproduce:
```bash
git clone [repository]
cd agriculture-food-security-synthetic-data
python scripts/generate_postharvest_data.py --sample-size 1000000 --seed 42
```

## Ethical Considerations

### Privacy

- ✅ **100% synthetic**: No real households
- ✅ **No PII**: No names, locations, identifiers
- ✅ **No re-identification risk**: Entirely generated data

### Potential Misuse

**Do NOT use for:**
- Actual policy decisions without real data validation
- Individual farmer profiling or targeting
- Financial credit scoring
- Insurance underwriting
- Legal proceedings
- Regulatory compliance

**Appropriate Uses:**
- Algorithm development and testing
- Method benchmarking
- Teaching and training
- Research hypothesis generation
- Policy simulation (with caveats)

### Fairness Considerations

- Dataset reflects patterns from literature, which may contain biases
- Under-representation of marginalized groups possible
- Gender dimensions simplified (household-level only)
- Ethnic/cultural diversity not explicitly modeled
- Users should validate fairness metrics on their use case

## Version History

### Version 1.0 (November 2024)

- Initial release
- 1M households
- 41 variables (12 base + 15 livestock + 14 post-harvest)
- Validated against 20+ literature sources
- Full playbook compliance

### Future Versions (Planned)

- Country-specific variants
- Longitudinal/panel structure
- Additional crops (beyond maize focus)
- Climate change scenarios
- Policy intervention scenarios

---

**Dataset Status**: ✅ Production-Ready
**Quality Score**: A (literature-grounded, validated)
**Last Updated**: November 2024
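Because categorical columns are stored as strings and each variable carries 2-5% missing values, the feature selections shown under Example Use Cases need imputation and encoding before most estimators will accept them. One possible preparation step is sketched below using only pandas and NumPy. The imputation choices, the least-squares baseline, and the names `prepare_features`, `fit_linear_baseline` and `predict` are illustrative assumptions rather than anything the card prescribes, and the random `toy` frame only mimics the real column names.

```python
import numpy as np
import pandas as pd

CATEGORICAL = ["storage_type", "agro_ecological_zone", "hermetic_storage", "pest_control"]
NUMERIC = ["storage_duration_months"]
TARGET = "storage_loss_pct"


def prepare_features(df):
    """Impute missing values and one-hot encode the string columns."""
    features = df[CATEGORICAL + NUMERIC].copy()
    for column in CATEGORICAL:
        # Treat a missing category as its own level rather than guessing it.
        features[column] = features[column].fillna("missing")
    for column in NUMERIC:
        # In real use, compute the median on the training split only.
        features[column] = features[column].fillna(features[column].median())
    return pd.get_dummies(features, columns=CATEGORICAL, dtype=float)


def fit_linear_baseline(X, y):
    """Ordinary least squares with an intercept via numpy.linalg.lstsq."""
    design = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y.to_numpy(dtype=float), rcond=None)
    return coef


def predict(coef, X):
    design = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    return design @ coef


# Random stand-in rows that reuse the documented column names and categories.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "storage_type": rng.choice(["none", "traditional_granary", "improved_silo", "warehouse"], n),
    "agro_ecological_zone": rng.choice(["arid", "semi_arid", "sub_humid", "humid", "highland"], n),
    "hermetic_storage": rng.choice(["yes", "no"], n),
    "pest_control": rng.choice(["none", "traditional", "chemical", "IPM"], n),
    "storage_duration_months": rng.uniform(0, 12, n),
    "storage_loss_pct": rng.uniform(0, 80, n),
})
toy.loc[::25, "storage_duration_months"] = np.nan  # simulate missing numerics
toy.loc[5::40, "pest_control"] = np.nan            # simulate missing categories
toy.loc[3::50, TARGET] = np.nan                    # simulate missing targets

train = toy.dropna(subset=[TARGET])  # rows without a target cannot be used for fitting
X = prepare_features(train)
coef = fit_linear_baseline(X, train[TARGET])
preds = predict(coef, X)
print(preds[:5])
```

A linear baseline like this is only a starting point; the same prepared matrix can be passed to any other regressor.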
Provider:
electricsheepafrica