vinod-anbalagan/indian-agri-advice-multilingual
收藏Hugging Face2026-04-18 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/vinod-anbalagan/indian-agri-advice-multilingual
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators: []
language:
- en
- hi
- mr
- bn
- ta
- te
- gu
- ml
- pa
- ur
- kn
- or
language_creators: []
license:
- cc-by-4.0
multilinguality:
- multilingual
pretty_name: "Indian Agricultural Advisory Dataset (Multilingual)"
size_categories:
- 10K<n<100K
tags:
- adaption
- instruction-tuning
- agriculture
- multilingual
- low-resource
- global-south
- icar
- agro-climatic-zones
- kisan-call-centre
- india
task_categories:
- question-answering
task_ids:
- open-domain-qa
---

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.
---
# Indian Agricultural Advisory Dataset — Multilingual
**18,707 rows | 12 languages | 15 agro-climatic zones | 13 categories | 17 crops**
A multilingual agricultural advisory dataset covering all of India's 15 Planning Commission agro-climatic zones, localized to 12 Indian languages. Built from ICAR extension knowledge, India's agro-climatic zone framework, and the Handbook of Agriculture in India. Adapted using [Adaption's Adaptive Data platform](https://adaptionlabs.ai).
- **Companion dataset →** [Tamil Agricultural Advisory (Grade A, 9.4/10)](https://huggingface.co/datasets/vinod-anbalagan/tamil-agri-advisory-qa)
- **GitHub →** [VinodAnbalagan/tamil-agri-dataset-](https://github.com/VinodAnbalagan/tamil-agri-dataset-)
- **Substack →** [The Meta Gradient](https://vinodanbalagan.substack.com)
- **Built for** the Adaption Labs Uncharted Data Challenge 2026
---
## Why This Dataset Exists
India has 150 million farming households across 15 distinct agro-climatic zones. A wheat farmer in Punjab faces completely different conditions than a rice farmer in Bengal or a bajra farmer in the Thar desert. Yet most agricultural AI systems give the same generic advice regardless of zone, soil, or season.
This dataset was born from a single insight discovered while building the [Tamil Agricultural Advisory Dataset](https://huggingface.co/datasets/vinod-anbalagan/tamil-agri-advisory-qa): **metadata specificity, not row count, drives dataset quality.** When every row carries real agro-ecological context — the specific zone, soil type, irrigation method, and season — AI systems can give advice that actually fits the farmer's conditions.
We applied the same framework to all of India: 15 zones, 17 crops, 13 categories, mapped from India's Planning Commission agro-climatic zone data. Then localized to 12 Indian languages to reach over a billion speakers.
---
### Domain
- Agriculture (100%)
---
## Languages
| Language | Rows | Share |
|----------|------|-------|
| Marathi | 2,414 | 12.9% |
| Bengali | 2,177 | 11.6% |
| Gujarati | 2,066 | 11.0% |
| Tamil | 2,034 | 10.9% |
| Telugu | 2,007 | 10.7% |
| Punjabi | 1,974 | 10.6% |
| Malayalam | 1,957 | 10.5% |
| English | 1,800 | 9.6% |
| Hindi | 1,683 | 9.0% |
| Urdu | 277 | 1.5% |
| Kannada | 170 | 0.9% |
| Odia | 148 | 0.8% |
---
### Tone
- Practical (75%)
- Informative (25%)
---
## Agro-Climatic Zones (15)
| Zone | States | Key Crops | Rows |
|------|--------|-----------|------|
| Western Himalayan | J&K, Himachal, Uttarakhand | Rice, maize, wheat, potato, apple | 1,321 |
| Eastern Himalayan | Sikkim, NE states, Tripura | Rice, tea, maize, potato, orange | 1,299 |
| Lower Gangetic Plains | West Bengal, Eastern Bihar | Rice, jute, potato, mango, banana | 1,242 |
| Middle Gangetic Plains | Eastern UP, Bihar | Rice, wheat, sugarcane, potato | 1,352 |
| Upper Gangetic Plains | Central & Western UP | Wheat, sugarcane, rice, potato, mango | 1,335 |
| Trans-Gangetic Plains | Punjab, Haryana, Delhi | Wheat, rice, cotton, sugarcane | 1,317 |
| Eastern Plateau & Hills | Jharkhand, Chhattisgarh, W. Odisha | Rice, groundnut, ragi, soybean | 1,414 |
| Central Plateau & Hills | MP, Rajasthan, UP (Bundelkhand) | Soybean, wheat, gram, cotton | 1,257 |
| Western Plateau & Hills | Maharashtra (Deccan), S. MP | Jowar, cotton, sugarcane, groundnut | 1,286 |
| Southern Plateau & Hills | Karnataka, TN (interior), AP | Rice, ragi, groundnut, cotton, coconut | 1,373 |
| East Coast Plains & Hills | Coastal AP, Odisha, TN | Rice, groundnut, sugarcane, banana | 1,350 |
| West Coast Plains & Ghats | Kerala, coastal Karnataka, Goa | Rice, coconut, arecanut, rubber, pepper | 1,362 |
| Gujarat Plains & Hills | Gujarat | Groundnut, cotton, rice, wheat, bajra | 1,396 |
| Western Dry Region | Rajasthan (Thar) | Bajra, jowar, moth, guar, wheat | 1,299 |
| All (mental health) | All India | Crisis routing | 104 |
---
## Categories (13)
| Category | Rows | Description |
|----------|------|-------------|
| `government_schemes` | 1,606 | PM-KISAN, PMFBY, KCC, subsidies |
| `market_price` | 1,591 | MSP, e-NAM, APMC, direct selling |
| `fertilizer` | 1,585 | NPK dosages, organic inputs, micronutrients |
| `weather_advisory` | 1,576 | Drought, flood, cyclone, frost response |
| `irrigation` | 1,574 | Water management, drip, sprinkler, canal |
| `soil_health` | 1,568 | pH, salinity, organic matter, soil testing |
| `crop_management` | 1,567 | Intercropping, rotation, spacing, weed control |
| `pest_control` | 1,564 | Pest identification and ICAR-grounded management |
| `harvest_timing` | 1,517 | When to harvest, post-harvest storage, drying |
| `crop_disease` | 1,515 | Disease diagnosis and treatment |
| `financial_support` | 1,513 | Crop insurance, loan relief, drought compensation |
| `variety_selection` | 1,427 | ICAR-recommended varieties by zone and season |
| `mental_health_safety` | 104 | Crisis routing — Kisan Call Centre 1551, iCall 9152987821 |
---
## Answer Structure (5-Part ICAR Format)
1. **Situation Assessment** — Acknowledge the farmer's specific zone, crop, soil, and season
2. **Immediate Action** — Exact dosage, timing, cost in rupees
3. **Rationale** — Why this fits this specific agro-climatic zone
4. **Long-term Prevention** — Sustainable practice for future seasons
5. **KVK Referral** — Contact nearest Krishi Vigyan Kendra
---
## Schema (17 Columns)
| Column | Description |
|--------|-------------|
| `id` | Unique row ID |
| `question` | Context-tagged farmer question |
| `answer` | 5-part ICAR advisory answer |
| `enhanced_prompt` | Adaption-enriched prompt |
| `enhanced_completion` | Adaption-enriched advisory (avg 2,731 chars) |
| `reasoning_trace` | Chain-of-thought reasoning |
| `category` | Topic (13 categories) |
| `crop_primary` | Primary crop (17 crops) |
| `soil_type` | Soil classification |
| `irrigation_type` | Irrigation method |
| `farming_practice` | Conventional / organic / integrated |
| `region` | Agro-climatic zone (15 zones) |
| `season` | Kharif / Rabi |
| `growth_stage` | Crop growth stage |
| `severity` | Low / medium / high / urgent |
| `source_type` | Provenance |
| `reasoning_type` | Cognitive pattern |
---
## Data Sources
| Source | What It Grounded |
|--------|------------------|
| Agro-Climatic Zones of India (Planning Commission) | Zone-crop-soil-season mappings for all 15 zones |
| Handbook of Agriculture in India (Oxford, 2007) | National crop agronomy, varieties, dosages |
| Handbook on General Agriculture (ANGRAU) | Crop science, soil science, pest management |
| ICAR-CRIDA District Contingency Plans | Drought and disaster management for 32 districts |
---
## Key Design Principles
- **Distribution by design** — 14 rows per category, 12 rows per zone in the base dataset, balanced before adaptation
- **Metadata is not decorative** — 99.4% fill rate on all context columns; every row grounded in real agro-ecological data
- **Zone-specific, not generic** — the same pest control question gets different answers in different zones because soil, rainfall, and irrigation differ
- **Mental health safety** — crisis helpline routing in every language (Kisan Call Centre 1551, iCall 9152987821)
- **Lesson from Tamil** — this dataset was built after discovering that metadata specificity drives quality scores more than row count or answer length
---
## How This Dataset Was Built
1. **Zone-crop mapping** — extracted valid crop-zone-soil-season combinations from India's 15 agro-climatic zones
2. **Question generation** — 169 base questions with rotating category assignments ensuring each zone gets different question types
3. **Answer expansion** — Cohere command-r-plus generated 5-part ICAR advisory answers grounded in zone-specific context
4. **Adaptation** — Adaption's Adaptive Data platform enriched prompts, completions, and added reasoning traces
5. **Localization** — platform translated and localized to 12 Indian languages, expanding 169 → 18,707 rows
---
### Evaluation Results
**Quality Gains:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/a7a7be60-152c-4f30-b299-eb46fb47cb86.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
**Grade Improvement:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/76b235b4-5720-443a-a62e-6925e8288898.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
**Percentile Chart:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/2d74c174-d013-4c02-8d72-23ddeb3b821d.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
---
## Companion Dataset
This dataset was built using the same framework that produced the **Tamil Agricultural Advisory Dataset** — which scored **Grade A (9.4/10)** on Adaption's platform after 10 iterative submissions. The key insight — that metadata specificity matters more than row count — was discovered during the Tamil work and applied from day one to this India-wide dataset.
- [Tamil Agricultural Advisory Dataset](https://huggingface.co/datasets/vinod-anbalagan/tamil-agri-advisory-qa) — 187 rows, Grade A, 9.4/10, Tamil language
---
## Intended Uses
- Training multilingual agricultural advisory chatbots for Indian farmers
- Building voice-based advisory systems (WhatsApp, IVR) in regional languages
- Evaluating multilingual NLP performance on domain-specific, low-resource Indian language tasks
- Research into context-aware AI for the Global South
- Fine-tuning models for zone-specific agricultural advice across India
---
## Citation
```bibtex
@dataset{anbalagan2026india_agri,
title={Indian Agricultural Advisory Dataset (Multilingual)},
author={Anbalagan, Vinod},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/vinod-anbalagan/indian-agri-advice-multilingual},
license={CC BY 4.0}
}
```
---
Built by **Vinod Anbalagan** — AI/ML researcher, Toronto.
Created as part of the **Adaption Labs Uncharted Data Challenge 2026**.
Dataset adapted using **Adaption's Adaptive Data Platform**.
Research documented on [The Meta Gradient](https://vinodanbalagan.substack.com).
*India has 15 agro-climatic zones, 22 official languages, and 150 million farming households. They all deserve AI that speaks their language and knows their soil.*
提供机构:
vinod-anbalagan



