electricsheepafrica/postharvest-value-chains-ssa-synthetic
Hugging Face · Updated 2025-11-13 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/electricsheepafrica/postharvest-value-chains-ssa-synthetic
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-regression
- tabular-classification
tags:
- agriculture
- food-security
- africa
- synthetic-data
- postharvest
- value-chains
- market-access
- storage
- smallholder-farming
size_categories:
- 1M<n<10M
language:
- en
pretty_name: "Post-Harvest Value Chains - Sub-Saharan Africa (Synthetic)"
---
# Dataset Card: Post-Harvest Value Chains - Sub-Saharan Africa (Synthetic Data)
## Dataset Summary
This synthetic dataset represents **1,000,000 African smallholder households** with comprehensive post-harvest value chain data, capturing storage systems, market access, processing, and value addition across Sub-Saharan Africa. It builds upon baseline farm characteristics (Dataset 1) and livestock systems (Dataset 2) to create a complete picture of agricultural production, livestock management, and post-harvest economics.
**Key Features:**
- **1M households** across 5 agro-ecological zones
- **41 variables** (12 base farm + 15 livestock + 14 post-harvest)
- **African-specific** value chain patterns and constraints
- **Literature-grounded**: 20+ peer-reviewed sources for post-harvest variables
- **100% synthetic**: No real households, privacy-preserving
- **Validated**: All benchmarks passed against literature expectations
## Why This Dataset?
Post-harvest losses are the **"missing middle"** of African food security:
- **$4 billion annual losses** in Sub-Saharan Africa
- **20-40% of cereals** lost between harvest and consumption
- **Enough grain lost** to feed 48 million people annually
- **Single largest opportunity** for food security improvement
Yet research is hampered by:
- Lack of household-level value chain data
- Privacy concerns in real data collection
- High cost of primary data collection
- Need for large samples for ML/statistical methods
This synthetic dataset enables:
- **Algorithm development** for post-harvest loss prediction
- **Value chain analysis** without privacy concerns
- **Market access modeling** at scale
- **Policy simulation** of interventions (storage technology, cooperatives, etc.)
- **Teaching** agricultural economics and food systems
- **Benchmarking** methods before deploying on real data
## Dataset Details
### Dataset Description
**Size**: 1,000,000 rows × 41 variables
**Format**: CSV (510 MB) + Parquet (175 MB)
**License**: CC-BY-4.0
**Created**: November 2024
**Version**: 1.0
### Variables
The dataset contains **41 variables** across three thematic areas:
#### **Base Farm Variables (12)** - From Dataset 1
| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `agro_ecological_zone` | categorical | AEZ classification | arid, semi_arid, sub_humid, humid, highland |
| `region_type` | categorical | Settlement type | urban, peri_urban, rural_accessible, rural_remote |
| `farm_size_ha` | continuous | Farm size (hectares) | 0.1-50 ha |
| `soil_quality_index` | continuous | Soil quality (0-100) | 0-100 |
| `rainfall_mm_annual` | continuous | Annual rainfall | 200-2000 mm |
| `household_size` | integer | Number of people | 1-20 |
| `market_distance_km` | continuous | Distance to market | 0.5-150 km |
| `livestock_tlu` | continuous | Tropical Livestock Units | 0-50 TLU |
| `extension_access` | categorical | Agricultural extension | yes, no |
| `fertilizer_use_kg_ha` | continuous | Fertilizer application | 0-300 kg/ha |
| `rainfall_mm_season` | continuous | Seasonal rainfall | 100-1000 mm |
| `maize_yield_kg_ha` | continuous | Maize yield | 0-6000 kg/ha |
#### **Livestock Variables (15)** - From Dataset 2
| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `herd_size_cattle` | continuous | Number of cattle | 0-100 |
| `herd_size_small_ruminants` | continuous | Sheep/goats | 0-150 |
| `poultry_count` | continuous | Chickens/ducks | 0-500 |
| `vet_distance_km` | continuous | Distance to vet | 0.5-200 km |
| `vaccination_coverage_pct` | continuous | Vaccination rate | 0-100% |
| `disease_incidence_annual` | categorical | Disease occurred | yes, no |
| `pasture_quality_index` | continuous | Pasture quality | 0-100 |
| `mortality_rate_annual_pct` | continuous | Livestock mortality | 0-50% |
| `vet_visit_annual` | categorical | Vet contact | yes, no |
| `disease_type` | categorical | Disease type | FMD, ECF, CBPP, Newcastle, etc. |
| `water_source_reliability` | categorical | Water access | year_round, seasonal, unreliable |
| `grazing_system` | categorical | Grazing type | communal, private, mixed, zero |
| `treatment_access` | categorical | Treatment used | none, traditional, veterinary, both |
| `feed_supplementation` | categorical | Supplemental feed | yes, no |
| `livestock_dependency_index` | continuous | Livelihood dependence | 0-100 |
#### **Post-Harvest Value Chain Variables (14)** - NEW in Dataset 3
**Storage Infrastructure (4 variables)**
| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `storage_type` | categorical | Storage facility | none, traditional_granary, improved_silo, warehouse |
| `storage_duration_months` | continuous | Storage period | 0-12 months |
| `storage_loss_pct` | continuous | Grain loss % | 0-80% |
| `hermetic_storage` | categorical | Airtight storage | yes, no |
**Post-Harvest Handling (3 variables)**
| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `drying_method` | categorical | Grain drying | none, sun, tarp, mechanical |
| `pest_control` | categorical | Pest management | none, traditional, chemical, IPM |
| `sorting_grading` | categorical | Quality sorting | yes, no |
**Processing Access (2 variables)**
| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `miller_access_km` | continuous | Distance to mill | 0.5-100 km |
| `value_addition` | categorical | Processing level | none, simple_processing, packaged |
**Market Access (5 variables)**
| Variable | Type | Description | Range/Categories |
|----------|------|-------------|------------------|
| `produce_sold_pct` | continuous | % harvest sold | 0-100% |
| `buyer_type` | categorical | Buyer type | local_trader, cooperative, processor, direct_consumer |
| `price_received_kg` | continuous | Farmgate price | $0.05-0.60/kg |
| `transport_cost_kg` | continuous | Transport cost | $0-0.15/kg |
| `days_to_sell` | continuous | Time to sale | 0-180 days |
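A quick schema check can confirm that a loaded frame carries the post-harvest columns listed in the tables above. This is a minimal sketch: `POST_HARVEST_COLUMNS` and `missing_columns` are hypothetical helper names, not part of the dataset's tooling, and the toy frame stands in for the real data.

```python
import pandas as pd

# The 14 post-harvest columns listed in the tables above.
POST_HARVEST_COLUMNS = [
    "storage_type", "storage_duration_months", "storage_loss_pct",
    "hermetic_storage", "drying_method", "pest_control", "sorting_grading",
    "miller_access_km", "value_addition", "produce_sold_pct", "buyer_type",
    "price_received_kg", "transport_cost_kg", "days_to_sell",
]


def missing_columns(df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from df, in order."""
    return [name for name in expected if name not in df.columns]


# Toy frame with only two of the expected columns (stand-in for the real data).
toy = pd.DataFrame({"storage_type": ["none"], "buyer_type": ["cooperative"]})
absent = missing_columns(toy, POST_HARVEST_COLUMNS)
print(f"{len(absent)} of {len(POST_HARVEST_COLUMNS)} post-harvest columns missing")
```

On the full dataset the returned list should be empty if the file matches this card.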
## Dataset Statistics
### Post-Harvest Key Statistics
**Storage:**
- **60.6%** use traditional granary storage
- **20.1%** have no storage (immediate consumption/sale)
- **14.3%** use improved storage (silos, PICS bags)
- **5.1%** access warehouse storage
- **Mean storage loss: 30.4%** (validated: 15-40% expected)
- **Hermetic storage adoption: 13.6%** (validated: 10-20% expected)
**Market Access:**
- **66.5%** sell to local traders (trader dominance)
- **18.2%** sell via cooperatives
- **9.4%** sell to processors
- **5.8%** direct to consumers
- **Mean price: $0.21/kg** (validated: $0.18-0.30/kg expected)
- **Mean transport cost: $0.02/kg** (~10% of farmgate price)
**Commercialization:**
- **Mean 43.2% of harvest sold** (validated: 30-55% expected)
- **Mean 50 days from harvest to sale**
- **86%** sell raw produce (no value addition)
- **71%** don't sort/grade produce
**Post-Harvest Handling:**
- **54.6%** use sun drying on ground
- **28%** use tarp/mat (improved)
- **15%** no proper drying
- **40%** use no pest control
- **35%** use traditional pest control methods
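Category shares like the ones above can be recomputed from the data with `value_counts`. The sketch below uses a toy frame, and `category_shares` is a hypothetical helper. It assumes shares are taken over non-missing values; the card does not say whether the published figures include or exclude missing entries.

```python
import pandas as pd


def category_shares(series: pd.Series) -> pd.Series:
    """Percentage share of each category among non-missing values."""
    return (series.value_counts(normalize=True) * 100).round(1)


# Toy stand-in for the real data: 10 households, one with a missing value.
toy = pd.DataFrame({
    "storage_type": ["traditional_granary"] * 6 + ["none"] * 2
                    + ["improved_silo", None],
})
shares = category_shares(toy["storage_type"])
print(shares)
```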
## Dataset Creation
### Methodology
This dataset was created using the **Synthetic Data Generation Playbook** methodology:
1. **Literature Review**: Extracted parameters from 20+ peer-reviewed sources on post-harvest systems in Sub-Saharan Africa
2. **Parameter Specification**: Created detailed YAML files for each variable including:
- Probability distributions grounded in literature
- Conditional dependencies on base variables
- African context-specific constraints
- Validation benchmarks from published research
3. **Dependency Modeling**: Mapped relationships between variables:
- Storage type → storage losses (traditional 25-40%, improved 2-5%)
- Buyer type → prices (traders vs cooperatives vs processors)
- Market distance → transport costs and prices
- Storage duration → seasonal price arbitrage
- Hermetic storage → pest control effectiveness
4. **Data Generation**: Used custom generators with:
- Fixed random seed (42) for reproducibility
- Conditional probability distributions
- Realistic correlation structures
- Missing data mechanisms (MCAR, 2-5% per variable)
5. **Validation**: Validated against literature benchmarks:
- Storage loss rates
- Market prices and costs
- Adoption rates for improved technologies
- Buyer type distributions
- Commercialization levels
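The conditional-generation idea in steps 3 and 4 can be illustrated with a small sketch. This is not the project's generator: the Beta(2, 2) shape, the 3% missingness rate, and the restriction to two storage types are assumptions made for illustration. Only the seed (42), the category shares, the loss ranges, and the MCAR mechanism come from this card.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # the card reports a fixed seed of 42
n = 10_000

# Only the two storage types whose loss ranges the card states are sketched.
# Shares are the card's 60.6% and 14.3%, renormalized to sum to 1.
types = np.array(["traditional_granary", "improved_silo"])
weights = np.array([60.6, 14.3])
storage_type = rng.choice(types, size=n, p=weights / weights.sum())

# Loss ranges (percent) per storage type, from the dependency list above.
loss_range = {"traditional_granary": (25.0, 40.0), "improved_silo": (2.0, 5.0)}
low = np.array([loss_range[t][0] for t in storage_type])
high = np.array([loss_range[t][1] for t in storage_type])

# Assumption: a symmetric Beta(2, 2) rescaled into each range.
loss = low + rng.beta(2.0, 2.0, size=n) * (high - low)

# MCAR missingness: every value has the same 3% chance of being blanked,
# independent of all other variables (3% is an arbitrary point in the 2-5% band).
loss[rng.random(n) < 0.03] = np.nan

sample = pd.DataFrame({"storage_type": storage_type, "storage_loss_pct": loss})
print(sample.groupby("storage_type")["storage_loss_pct"].agg(["min", "max", "mean"]))
```

Drawing the loss conditional on the storage type is what produces the dependency between the two columns. Sampling each column independently would lose it.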
### Source Data
**Key Literature Sources (Post-Harvest Variables):**
- **World Bank (2011)**: "Missing food: Post-harvest grain losses in Sub-Saharan Africa"
- **Affognon et al. (2015)**: "Unpacking post-harvest losses in Sub-Saharan Africa" - World Development 66: 49-68
- **Tefera et al. (2011)**: "The metal silo: Effective grain storage technology" - Crop Protection 30: 240-245
- **Minten et al. (2013)**: "Value chains and missing markets in Eastern Africa" - Development Policy Review
- **Barrett et al. (2012)**: "Smallholder participation in contract farming" - World Development 40(4): 715-730
- **Bernard et al. (2008)**: "Impact of cooperatives on smallholders' commercialization behavior"
- **Baributsa et al. (2014)**: "PICS bag adoption study"
- **Murdock et al. (2012)**: "Hermetic storage effectiveness" - Journal of Stored Products Research
- **FAO GIEWS (2023)**: "Global food price monitoring"
- **Kadjo et al. (2018)**: "Storage technology adoption determinants"
- Plus 10+ additional peer-reviewed sources
Full citations available in parameter files.
### Data Fields
Each variable includes:
- **Type**: Continuous (float), Integer (`household_size` only), or Categorical (string)
- **Unit**: Where applicable (kg, km, percent, USD, etc.)
- **Distribution**: Beta, Gamma, Lognormal, or categorical probabilities
- **Dependencies**: Conditional on base farm and livestock variables
- **Validation**: Expected ranges from literature
- **Missingness**: 2-5% missing data per variable (MCAR mechanism)
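Because each variable is described as having 2-5% missing values, it can help to measure missingness before modeling. A minimal sketch on a toy frame follows; `missing_share` is a hypothetical helper, and the toy rates are far higher than those reported for the real data.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values in each column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


toy = pd.DataFrame({
    "storage_loss_pct": [30.0, np.nan, 28.5, 35.0],
    "buyer_type": ["local_trader", "cooperative", None, None],
})
report = missing_share(toy)
print(report)
```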
## Uses
### Recommended Use Cases
1. **Algorithm Development**
- Post-harvest loss prediction models
- Market access optimization
- Value chain network analysis
- Price formation modeling
- Technology adoption prediction
2. **Research Applications**
- Value chain bottleneck identification
- Intervention impact simulation
- Market power analysis
- Gender gaps in value chains
- Climate impact on post-harvest systems
3. **Policy Analysis**
- Storage technology subsidy targeting
- Cooperative strengthening programs
- Market infrastructure investment priorities
- Price stabilization policies
- Extension service optimization
4. **Education**
- Agricultural economics teaching
- Value chain analysis training
- Data science courses (large realistic dataset)
- Policy modeling workshops
5. **Method Benchmarking**
- Test algorithms before real data deployment
- Compare modeling approaches
- Validate analytical methods
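As one illustration of intervention simulation, the sketch below estimates grain saved if traditional-granary households reached the loss rates reported for improved storage. It rests on stated assumptions rather than on anything the dataset prescribes: the whole maize harvest is stored, harvest equals yield times farm size, and the counterfactual loss is the midpoint (3.5%) of the 2-5% range given in this card. The toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "storage_type": ["traditional_granary", "traditional_granary", "improved_silo"],
    "farm_size_ha": [1.0, 2.0, 1.5],
    "maize_yield_kg_ha": [1000.0, 1500.0, 2000.0],
    "storage_loss_pct": [30.0, 40.0, 4.0],
})

COUNTERFACTUAL_LOSS_PCT = 3.5  # assumption: midpoint of the card's 2-5% range

# Assumption: the whole maize harvest is stored, so harvest = yield x area.
harvest_kg = toy["maize_yield_kg_ha"] * toy["farm_size_ha"]

# Only traditional-granary households switch; their loss cannot increase.
switches = toy["storage_type"] == "traditional_granary"
new_loss = toy["storage_loss_pct"].where(
    ~switches, toy["storage_loss_pct"].clip(upper=COUNTERFACTUAL_LOSS_PCT)
)

toy["grain_saved_kg"] = harvest_kg * (toy["storage_loss_pct"] - new_loss) / 100
print(toy[["storage_type", "grain_saved_kg"]])
print("Total saved (kg):", toy["grain_saved_kg"].sum())
```

Results from a simulation like this inherit every assumption built into the synthetic data, so they are best read as a method test rather than an impact estimate.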
### Example Use Cases
**Post-Harvest Loss Prediction:**
```python
# Predict storage losses based on storage type, duration, climate
X = df[['storage_type', 'storage_duration_months', 'agro_ecological_zone',
        'hermetic_storage', 'pest_control']]
y = df['storage_loss_pct']
```
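The feature and target selection above still needs encoding and missing-value handling before a model can be fitted. The following is a minimal baseline sketch using one-hot encoding and ordinary least squares with NumPy. The toy data and its coefficients are invented for illustration, and a linear model is only one possible choice, not the method used to build or validate the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the loss values here are made up.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "storage_type": rng.choice(["traditional_granary", "improved_silo"], size=n),
    "storage_duration_months": rng.uniform(0, 12, size=n),
})
base = np.where(toy["storage_type"] == "traditional_granary", 30.0, 4.0)
toy["storage_loss_pct"] = base + 0.5 * toy["storage_duration_months"]
toy.loc[::25, "storage_loss_pct"] = np.nan  # mimic missing targets

# 1. Drop rows with a missing target or feature.
cols = ["storage_type", "storage_duration_months", "storage_loss_pct"]
clean = toy[cols].dropna()

# 2. One-hot encode the categorical column; drop_first avoids collinearity
#    with the intercept.
X = pd.get_dummies(clean[["storage_type", "storage_duration_months"]],
                   columns=["storage_type"], drop_first=True, dtype=float)
X.insert(0, "intercept", 1.0)
y = clean["storage_loss_pct"].to_numpy()

# 3. Ordinary least squares via NumPy.
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
fitted = pd.Series(coef, index=X.columns)
print(fitted.round(2))
```

The same preparation steps apply if the linear fit is swapped for a tree-based or other model.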
**Market Access Modeling:**
```python
# Analyze price received by market access factors
X = df[['market_distance_km', 'buyer_type', 'sorting_grading',
        'transport_cost_kg', 'days_to_sell']]
y = df['price_received_kg']
```
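Before fitting a price model, a grouped summary can show how prices and transport costs differ by buyer. The sketch below assumes the household bears the transport cost, so a net price is the price received minus transport. That interpretation is an assumption, not something this card states, and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "buyer_type": ["local_trader", "local_trader", "cooperative", "processor"],
    "price_received_kg": [0.18, 0.20, 0.25, 0.28],
    "transport_cost_kg": [0.02, 0.02, 0.03, 0.04],
})

# Net price: what the household keeps per kg after paying for transport
# (assumes transport is paid by the seller).
toy["net_price_kg"] = toy["price_received_kg"] - toy["transport_cost_kg"]

# Transport cost as a share of the price received, in percent.
toy["transport_share_pct"] = 100 * toy["transport_cost_kg"] / toy["price_received_kg"]

summary = toy.groupby("buyer_type")[["net_price_kg", "transport_share_pct"]].mean()
print(summary.round(3))
```

With the means reported in this card, the same ratio is 0.02 / 0.21 ≈ 9.5%, in line with the "~10% of farmgate price" figure.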
**Technology Adoption:**
```python
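# Model hermetic storage adoption drivers
# Drop rows with a missing target first; `== 'yes'` would otherwise
# silently count them as non-adopters.
known = df.dropna(subset=['hermetic_storage'])
X = known[['extension_access', 'farm_size_ha', 'region_type',
           'storage_type', 'storage_loss_pct']]
y = (known['hermetic_storage'] == 'yes').astype(int)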
```
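A simple descriptive check before modeling adoption is the adoption rate within each group. The sketch below uses a toy frame with invented values. It also shows why rows with a missing target are worth dropping first: comparing a missing value to `'yes'` yields `False`, which would be counted as non-adoption.

```python
import pandas as pd

toy = pd.DataFrame({
    "extension_access": ["yes", "yes", "yes", "no", "no", "no", "no", "no"],
    "hermetic_storage": ["yes", "no", "yes", "no", "no", "yes", "no", None],
})

# Keep only households whose adoption status is known.
known = toy.dropna(subset=["hermetic_storage"]).copy()
known["adopted"] = (known["hermetic_storage"] == "yes").astype(int)

# Adoption rate (%) within each extension-access group.
rates = known.groupby("extension_access")["adopted"].mean() * 100
print(rates.round(1))

# For comparison: treating the missing row as a non-adopter lowers the "no" rate.
naive = (toy["hermetic_storage"] == "yes").groupby(toy["extension_access"]).mean() * 100
print(naive.round(1))
```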
## Limitations
### Known Limitations
1. **Simplified Relationships**
- Real agricultural systems are more complex
- Non-linear interactions may be under-represented
- Political economy factors not explicitly modeled
2. **Cross-Sectional Only**
- Single time point (no panel/longitudinal data yet)
- Seasonal dynamics simplified to annual averages
- Cannot model dynamic adoption or market evolution
3. **Missing Variables**
- Credit access not included
- Land tenure not modeled
- Specific crops beyond maize not detailed
- Household labor allocation not captured
- Off-farm income not included
4. **No Spatial Coordinates**
- Zone-level only (no GPS coordinates)
- Cannot do spatial analysis or mapping
- Regional variation simplified
5. **Generalized Parameters**
- Represents "typical" Sub-Saharan Africa patterns
- Not country-specific
- May not capture extreme heterogeneity in specific contexts
6. **Technology Adoption**
- Some conditional dependencies simplified
- Social network effects not modeled
- Innovation diffusion dynamics not captured
### Biases
- **Literature Bias**: Parameters reflect published research, which may under-represent:
- Marginalized populations
- Conflict-affected areas
- Pastoralist systems
- Urban agriculture
- **Geographic Bias**: Primarily East/Southern Africa patterns (where more research exists)
- **Temporal Bias**: Reflects 2010-2024 period patterns (may not capture recent changes)
- **Data Source Bias**: Literature may over-represent:
- More accessible populations
- NGO project areas
- Cooperative members
### Recommendations
Users should:
1. **Always validate** on real data before production deployment
2. **Acknowledge synthetic nature** in all uses
3. **Understand limitations** for their specific use case
4. **Consider biases** in the underlying literature
5. **Cite properly** (see below)
6. **Not use for**:
- Actual policy decisions without real data validation
- Individual farmer targeting
- Financial risk assessment
- Legal/regulatory compliance
## Additional Information
### Dataset Curators
Created by **Electric Sheep Africa** using the **Synthetic Data Generation Playbook** methodology:
- Literature-grounded parameter extraction
- African context constraints
- Validated against benchmarks (100% validation pass rate)
- Open methodology and reproducible code
**Playbook Compliance:** ✅ FULL COMPLIANCE VERIFIED
See `PLAYBOOK_COMPLIANCE_DATASET3.md` for detailed verification.
### Licensing Information
- **Dataset**: Creative Commons Attribution 4.0 (CC-BY-4.0)
- **Code**: MIT License
- **Documentation**: CC-BY-4.0
**Requirements**:
- ✅ Attribution required
- ✅ Must acknowledge synthetic nature
- ✅ Follow acceptable use policy
- ✅ Cite original literature sources when publishing
### Citation
If you use this dataset, please cite:
```bibtex
@dataset{electricsheep2024postharvest,
title={Post-Harvest Value Chains: Sub-Saharan Africa Synthetic Dataset},
author={Electric Sheep Africa},
year={2024},
publisher={HuggingFace},
version={1.0},
url={https://huggingface.co/datasets/electricsheepafrica/postharvest-value-chains-ssa-synthetic},
note={Synthetic dataset. Generated using literature-grounded methodology. 1M households, 41 variables covering farm systems, livestock, and post-harvest value chains.}
}
```
**Also acknowledge**: This is synthetic data generated from published literature for research purposes.
### Contributions
Contributions welcome via:
- Issue reports (data quality, documentation improvements)
- Additional validation against real datasets
- Extension to new variables or regions
- Method improvements
### Contact
- **Organization**: Electric Sheep Africa
- **Repository**: [GitHub repository link]
- **Issues**: Report via HuggingFace dataset page
## Technical Specifications
### File Formats
- **Parquet**: 174.7 MB (recommended for efficiency)
- **CSV**: 510.2 MB (for readability/compatibility)
### Loading the Data
**Python (Pandas):**
```python
import pandas as pd

# Reading hf:// paths requires the huggingface_hub package to be installed.
BASE = 'hf://datasets/electricsheepafrica/postharvest-value-chains-ssa-synthetic'

# Option 1: Parquet (smaller and faster to read)
df = pd.read_parquet(f'{BASE}/postharvest_data.parquet')

# Option 2: CSV (use instead of Option 1 if Parquet support is unavailable)
# df = pd.read_csv(f'{BASE}/postharvest_data.csv')

print(f"Shape: {df.shape}")
print(df.head())
```
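With 1M rows, reading only the columns an analysis needs and storing string columns as categoricals can cut memory use noticeably. The sketch below shows this for CSV with `usecols` and `dtype`. The in-memory CSV stands in for `postharvest_data.csv`, and `load_subset`, `CATEGORICAL`, and `NUMERIC` are hypothetical names chosen for illustration.

```python
import io

import pandas as pd

# Columns needed for a storage-loss analysis (names from the variable tables).
CATEGORICAL = ["storage_type", "hermetic_storage"]
NUMERIC = ["storage_loss_pct"]


def load_subset(source) -> pd.DataFrame:
    """Read only the selected columns, storing string columns as categoricals."""
    return pd.read_csv(
        source,
        usecols=CATEGORICAL + NUMERIC,
        dtype={name: "category" for name in CATEGORICAL},
    )


# In-memory CSV standing in for postharvest_data.csv.
demo_csv = io.StringIO(
    "storage_type,hermetic_storage,storage_loss_pct,days_to_sell\n"
    "traditional_granary,no,31.5,40\n"
    "improved_silo,yes,3.2,90\n"
    "none,no,,0\n"
)
subset = load_subset(demo_csv)
print(subset.dtypes)
```

For the Parquet file, `pd.read_parquet` accepts a `columns` argument that serves the same purpose.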
**Python (HuggingFace Datasets):**
```python
from datasets import load_dataset
dataset = load_dataset('electricsheepafrica/postharvest-value-chains-ssa-synthetic')
df = dataset['train'].to_pandas()
print(f"Shape: {df.shape}")
```
**R:**
```r
library(arrow)
# Parquet
df <- read_parquet('path/to/postharvest_data.parquet')
# CSV
df <- read.csv('path/to/postharvest_data.csv')
dim(df)
head(df)
```
### Data Types
- **Categorical**: Stored as strings
- **Continuous**: Float64
- **Integer**: Int64 (household_size only)
- **Missing**: Represented as NA/NaN
### Reproducibility
**Random Seed**: 42 (fixed)
**Generator Version**: 1.0
**Parameter Files**: Available in source repository
To reproduce:
```bash
git clone [repository]
cd agriculture-food-security-synthetic-data
python scripts/generate_postharvest_data.py --sample-size 1000000 --seed 42
```
## Ethical Considerations
### Privacy
- ✅ **100% synthetic**: No real households
- ✅ **No PII**: No names, locations, identifiers
- ✅ **No re-identification risk**: Entirely generated data
### Potential Misuse
**Do NOT use for:**
- Actual policy decisions without real data validation
- Individual farmer profiling or targeting
- Financial credit scoring
- Insurance underwriting
- Legal proceedings
- Regulatory compliance
**Appropriate Uses:**
- Algorithm development and testing
- Method benchmarking
- Teaching and training
- Research hypothesis generation
- Policy simulation (with caveats)
### Fairness Considerations
- Dataset reflects patterns from literature, which may contain biases
- Under-representation of marginalized groups possible
- Gender dimensions simplified (household-level only)
- Ethnic/cultural diversity not explicitly modeled
- Users should validate fairness metrics on their use case
## Version History
### Version 1.0 (November 2024)
- Initial release
- 1M households
- 41 variables (12 base + 15 livestock + 14 post-harvest)
- Validated against 20+ literature sources
- Full playbook compliance
### Future Versions (Planned)
- Country-specific variants
- Longitudinal/panel structure
- Additional crops (beyond maize focus)
- Climate change scenarios
- Policy intervention scenarios
---
**Dataset Status**: ✅ Production-Ready
**Quality Score**: A (literature-grounded, validated)
**Last Updated**: November 2024
Provider:
electricsheepafrica



