minhhungg/rice-disease-dataset
Source: Hugging Face · updated 2026-04-03 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/minhhungg/rice-disease-dataset
Resource description:
---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype: string
  splits:
  - name: healthy
    num_bytes: 474126138
    num_examples: 1882
  - name: pests
    num_bytes: 369735476
    num_examples: 7142
  - name: diseases
    num_bytes: 901219319
    num_examples: 8859
  - name: nutrition
    num_bytes: 1053826854
    num_examples: 1156
  - name: train
    num_bytes: 6946207334
    num_examples: 26584
  - name: validation
    num_bytes: 1382383061
    num_examples: 5697
  - name: test
    num_bytes: 1417135780
    num_examples: 5697
  download_size: 14739266671
  dataset_size: 12544633962
configs:
- config_name: default
  data_files:
  - split: healthy
    path: data/healthy-*
  - split: pests
    path: data/pests-*
  - split: diseases
    path: data/diseases-*
  - split: nutrition
    path: data/nutrition-*
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
license: apache-2.0
task_categories:
- zero-shot-classification
language:
- vi
---
This dataset supports my Plant Diagnosis Suite project, available [on GitHub](https://github.com/alberttrann/DiseaseDetector), where you can also find a Disease Detector trained on it.
# Vietnamese Rice Disease & Crop Recommendation Dataset
An agricultural AI dataset collected in Vietnam, containing **37,978 rice plant images** across 21 classes (diseases, pests, nutrient deficiencies, healthy) — for image classification.
---
## Dataset Summary
| Component | Type | Samples | Classes | Task |
|---|---|---|---|---|
| Rice Disease Images | Image (JPG) | 37,978 | 21 | Image Classification |
### Splits Available
| Split | Description | Size |
|---|---|---|
| `healthy` | Original healthy rice images | 3,764 images |
| `pests` | Original pest/insect images (9 classes) | ~14,284 images |
| `diseases` | Original rice disease images (8 classes) | ~17,618 images |
| `nutrition` | Original nutrient deficiency images (3 classes) | 2,312 images |
| `train` | Stratified training split (70%) | 26,584 images |
| `validation` | Stratified validation split (15%) | 5,697 images |
| `test` | Stratified test split (15%) | 5,697 images |
> The `train` / `validation` / `test` splits draw from **all 21 classes** combined and are stratified — use these for model training. The four category splits (`healthy`, `pests`, `diseases`, `nutrition`) reflect the original folder structure of the raw collection.
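One way to confirm that the `train` / `validation` / `test` splits really are stratified is to compare each class's share across splits. The helper below is a minimal sketch rather than part of the dataset tooling: `label_proportions` and `max_proportion_gap` are hypothetical names, and the toy label lists stand in for columns such as `ds["train"]["label"]`.

```python
from collections import Counter


def label_proportions(labels):
    """Return {label: fraction of the split} for one split's label column."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def max_proportion_gap(split_a, split_b):
    """Largest absolute difference in class share between two splits."""
    pa, pb = label_proportions(split_a), label_proportions(split_b)
    return max(abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in set(pa) | set(pb))


# Toy stand-ins for ds["train"]["label"] and ds["validation"]["label"].
toy_train = ["Healthy"] * 70 + ["Blast"] * 35 + ["Hispa"] * 7
toy_val = ["Healthy"] * 15 + ["Blast"] * 8 + ["Hispa"] * 1

gap = max_proportion_gap(toy_train, toy_val)
print(f"largest class-share gap: {gap:.3f}")
```

A gap close to zero for every pair of splits is what a stratified split should produce; a large gap would suggest the splits were drawn differently.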
---
## Dataset Structure
### Image Component — Fields
| Field | Type | Description |
|---|---|---|
| `image` | `Image` | The rice plant photograph |
| `label` | `string` | English class label (see label list below) |
### Loading the Dataset
```python
from datasets import load_dataset
# Load a specific split
ds = load_dataset("minhhungg/rice-disease-dataset", split="train")
# Load all splits
ds = load_dataset("minhhungg/rice-disease-dataset")
# Access train/val/test for model training
train = ds["train"]
val = ds["validation"]
test = ds["test"]
# Quick check
print(train[0])
# {'image': <PIL.JpegImagePlugin...>, 'label': 'Healthy'}
```
---
## Label Definitions
### Image Classes (21 total)
#### Healthy (1 class)
| Label | Vietnamese Name | Count |
|---|---|---|
| `Healthy` | Cây lúa khỏe mạnh | 3,764 |
#### Pests / Insects (9 classes)
| Label | Vietnamese Name | Approx. Count |
|---|---|---|
| `Tungro Virus` | Tungro virus | 3,480 |
| `Hispa` | Sâu gai | 2,922 |
| `Rice Gall Midge` | Sâu năn (Muỗi hành) | 1,582 |
| `Chilo Stem Borer` | Sâu đục thân (Sọc nâu) | 1,490 |
| `Rice Leaf Folder` | Sâu cuốn lá nhỏ | 1,210 |
| `Thrips` | Bọ trĩ | 1,160 |
| `Rice Skipper` | Sâu cuốn lá lớn | 950 |
| `Yellow Stem Borer` | Sâu đục thân (vàng) | 910 |
| `Brown Plant Hopper` | Rầy nâu | 580 |
#### Diseases (8 classes)
| Label | Vietnamese Name | Approx. Count |
|---|---|---|
| `Leaf Scald` | Bệnh cháy lá | 3,340 |
| `Sheath Blight` | Bệnh đốm vằn / khô vằn | 3,156 |
| `Brown Spot` | Bệnh đốm nâu | 3,140 |
| `Bacterial Leaf Blight` | Bệnh bạc lá | 2,950 |
| `Narrow Brown Spot` | Bệnh gạch nâu | 2,832 |
| `Blast` | Bệnh đạo ôn lá và cổ bông | 2,000 |
| `Bakanae Disease` | Bệnh lúa von (lúa đực) | 100 |
| `False Smut` | Bệnh than vàng | 100 |
#### Nutrient Deficiencies (3 classes)
| Label | Vietnamese Name | Count |
|---|---|---|
| `Nitrogen Deficiency` | Thiếu đạm (N) | 880 |
| `Potassium Deficiency` | Thiếu kali (K) | 766 |
| `Phosphorus Deficiency` | Thiếu lân (P) | 666 |
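The four tables above can be collapsed into a single lookup, which is handy for mapping a predicted label back to its category. This is a sketch built only from the counts listed in this card (the pest and disease counts are marked approximate); `CLASS_COUNTS` and `LABEL_TO_CATEGORY` are hypothetical names, not something the dataset ships.

```python
# Per-class counts copied from the label tables in this card.
CLASS_COUNTS = {
    "healthy": {"Healthy": 3764},
    "pests": {
        "Tungro Virus": 3480, "Hispa": 2922, "Rice Gall Midge": 1582,
        "Chilo Stem Borer": 1490, "Rice Leaf Folder": 1210, "Thrips": 1160,
        "Rice Skipper": 950, "Yellow Stem Borer": 910, "Brown Plant Hopper": 580,
    },
    "diseases": {
        "Leaf Scald": 3340, "Sheath Blight": 3156, "Brown Spot": 3140,
        "Bacterial Leaf Blight": 2950, "Narrow Brown Spot": 2832, "Blast": 2000,
        "Bakanae Disease": 100, "False Smut": 100,
    },
    "nutrition": {
        "Nitrogen Deficiency": 880, "Potassium Deficiency": 766,
        "Phosphorus Deficiency": 666,
    },
}

# Invert the nesting: class label -> category split name.
LABEL_TO_CATEGORY = {
    label: category
    for category, classes in CLASS_COUNTS.items()
    for label in classes
}

category_totals = {cat: sum(classes.values()) for cat, classes in CLASS_COUNTS.items()}
total_images = sum(category_totals.values())

print(category_totals)
print(len(LABEL_TO_CATEGORY), "classes,", total_images, "images")
```

The per-category sums (3,764 / 14,284 / 17,618 / 2,312) and the overall total of 37,978 agree with the split table earlier in this card.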
---
## Class Imbalance
The dataset has significant class imbalance across the image component:
```
Most common : Healthy (3,764 images)
Least common: Bakanae Disease and False Smut (100 images each)
Imbalance ratio: 37.6×
```
When training on the combined `train` split, we recommend using **class weights** or **focal loss** to account for this. The stratified split ensures proportional class representation across train/val/test.
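As a rough illustration of both suggestions, the sketch below derives inverse-frequency class weights from the counts in the label tables and defines a per-sample focal loss. Several things here are assumptions: the weights use the full-dataset counts rather than the `train` split (stratification should keep the proportions close), the weighting formula is the common "balanced" heuristic `N / (K * n_c)`, and `gamma=2.0` is just a frequently used default, not a value tuned on this dataset.

```python
import math

# Per-class counts copied from the label tables (some are approximate).
counts = {
    "Healthy": 3764, "Tungro Virus": 3480, "Hispa": 2922,
    "Rice Gall Midge": 1582, "Chilo Stem Borer": 1490,
    "Rice Leaf Folder": 1210, "Thrips": 1160, "Rice Skipper": 950,
    "Yellow Stem Borer": 910, "Brown Plant Hopper": 580,
    "Leaf Scald": 3340, "Sheath Blight": 3156, "Brown Spot": 3140,
    "Bacterial Leaf Blight": 2950, "Narrow Brown Spot": 2832, "Blast": 2000,
    "Bakanae Disease": 100, "False Smut": 100,
    "Nitrogen Deficiency": 880, "Potassium Deficiency": 766,
    "Phosphorus Deficiency": 666,
}

total = sum(counts.values())
num_classes = len(counts)

# "Balanced" inverse-frequency weights: rare classes get larger weights.
weights = {label: total / (num_classes * n) for label, n in counts.items()}


def focal_loss(p_true, weight=1.0, gamma=2.0):
    """Focal loss for one sample, given the predicted probability of its true class.

    With gamma=0 this reduces to (weighted) cross-entropy.
    """
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


print(f"Healthy weight:         {weights['Healthy']:.3f}")
print(f"Bakanae Disease weight: {weights['Bakanae Disease']:.3f}")
print(f"Loss at p=0.9 vs p=0.5: {focal_loss(0.9):.4f} vs {focal_loss(0.5):.4f}")
```

In a PyTorch training loop, weights like these would typically be ordered by class id and passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.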
---
## Image Properties
Based on EDA of 1,050 sampled images (50 per class):
| Property | Value |
|---|---|
| Mean height | 1,223 px |
| Mean width | 1,053 px |
| Most common aspect ratio | 1:1 (square) |
| Height range | 217 – 4,301 px |
| Width range | 201 – 4,364 px |
| Mean file size | 282 KB |
| Mean brightness | 149.5 / 255 |
| Color channels | RGB (3) |
| File format | JPEG (.jpg / .JPG / .jpeg) |
> Images vary widely in resolution. All models trained on this dataset resized inputs to **224 × 224** using `transforms.Resize((224, 224))`.
---
## Data Collection
- **Source**: Field photographs gathered from multiple Internet sources
- **Geographic scope**: Vietnam (Mekong Delta and surrounding agricultural regions)
> **Geographic bias**: Images were collected exclusively in Vietnam. Generalization to rice-growing regions with substantially different climate, rice varieties, or lighting conditions (e.g., South Asia, East Africa) has not been validated.
---
## Dataset Creation
### Preprocessing
Images were **not resized or augmented** in the raw splits — they are served at original resolution. The `train` split CSV was generated with a stratified 70/15/15 split using `sklearn.model_selection.train_test_split(stratify=label)` with `random_state=42`.
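`train_test_split` only produces two partitions per call, so a 70/15/15 split is usually done in two stratified stages. The sketch below shows that pattern on toy data; it is an assumption about how such a split can be built, not the author's actual script, and the file names and labels are placeholders.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the real (path, label) listing: 3 classes x 20 samples.
paths = [f"img_{i:03d}.jpg" for i in range(60)]
labels = ["Healthy"] * 20 + ["Blast"] * 20 + ["Hispa"] * 20

# Stage 1: hold out 30% of the data, stratified by label.
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.30, stratify=labels, random_state=42
)

# Stage 2: split the held-out 30% in half -> 15% validation, 15% test.
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=42
)

print(len(train_p), len(val_p), len(test_p))
print(Counter(train_y), Counter(val_y), Counter(test_y))
```

Applied to 37,978 images, holding out 30% (rounded up) gives 11,394 images and leaves 26,584 for training; halving the held-out part gives 5,697 each, which is consistent with the split sizes listed in this card.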
### Recommended Preprocessing for Training
```python
from torchvision import transforms

# Training: resize, then light geometric and color augmentation
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomRotation(30),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet stats
])

# Validation / test: deterministic resize and normalization only
val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
---
## Usage Examples
### Basic Classification with HuggingFace Trainer
```python
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    Trainer,
    TrainingArguments,
)

ds = load_dataset("minhhungg/rice-disease-dataset")

# Build label <-> id mappings from the train split
labels = sorted(set(ds["train"]["label"]))
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

model_name = "google/efficientnet-b0"
# AutoImageProcessor replaces the deprecated AutoFeatureExtractor
processor = AutoImageProcessor.from_pretrained(model_name)


def preprocess(batch):
    # Force RGB so grayscale or RGBA files do not break the processor
    images = [img.convert("RGB") for img in batch["image"]]
    inputs = processor(images=images, return_tensors="pt")
    inputs["labels"] = [label2id[l] for l in batch["label"]]
    return inputs


ds_processed = ds.map(preprocess, batched=True, remove_columns=["image", "label"])

model = AutoModelForImageClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # replace the pretrained classification head
)

# Note: `eval_strategy` is the current name of `evaluation_strategy`, and
# save_strategy="best" requires a recent transformers release.
args = TrainingArguments(
    output_dir="rice-classifier",
    per_device_train_batch_size=32,
    num_train_epochs=20,
    eval_strategy="epoch",
    save_strategy="best",
    metric_for_best_model="accuracy",
)


def compute_metrics(eval_pred):
    logits, label_ids = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"accuracy": (preds == label_ids).mean()}


trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_processed["train"],
    eval_dataset=ds_processed["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
```
### PyTorch DataLoader (direct)
```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torchvision import transforms

ds = load_dataset("minhhungg/rice-disease-dataset", split="train")

# Map string labels to integer ids
labels = sorted(set(ds["label"]))
label2id = {l: i for i, l in enumerate(labels)}

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])


def collate_fn(batch):
    # Force RGB, apply the transform, and stack into one batch tensor
    images = [transform(item["image"].convert("RGB")) for item in batch]
    labels_t = [label2id[item["label"]] for item in batch]
    return torch.stack(images), torch.tensor(labels_t)


loader = DataLoader(ds, batch_size=32, shuffle=True, collate_fn=collate_fn)
images, labels_t = next(iter(loader))
print(images.shape)  # torch.Size([32, 3, 224, 224])
```
### Filtering to a Single Category
```python
from datasets import load_dataset

# Work only with disease images (original category split)
diseases = load_dataset("minhhungg/rice-disease-dataset", split="diseases")

# Or filter the combined train split down to the 8 disease classes
DISEASE_LABELS = {
    "Leaf Scald", "Sheath Blight", "Brown Spot",
    "Bacterial Leaf Blight", "Narrow Brown Spot",
    "Blast", "Bakanae Disease", "False Smut",
}
train = load_dataset("minhhungg/rice-disease-dataset", split="train")
diseases_only = train.filter(lambda x: x["label"] in DISEASE_LABELS)
```
---
## Limitations & Biases
- **Geographic**: Collected exclusively in Vietnam. Pest and disease appearance may vary under different climate zones, rice varieties, or soil types.
- **Class imbalance**: 37× imbalance between the largest and smallest classes. Models trained without compensation may underperform on `Bakanae Disease` and `False Smut` (100 images each).
- **Image quality variability**: Images range from 201px to 4,364px wide, captured under varying field lighting conditions (overcast, direct sunlight, shade). Resolution normalization is required before training.
- **Single crop**: Only covers rice (*Oryza sativa*). Not applicable to other crops without retraining.
- **Annotation granularity**: Nutrient deficiency labels (N, P, K) are assigned at the nutrient level, not by deficiency severity stage.
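Because of the class imbalance noted above, overall accuracy can hide poor performance on the rarest classes. A per-class view makes that visible. The helper below is a minimal sketch with hypothetical function names and toy predictions; it is not taken from the project's evaluation code.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / true samples of that class."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {label: hits[label] / totals[label] for label in totals}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy case: a classifier that always predicts the majority class.
y_true = ["Healthy"] * 8 + ["False Smut"] * 2
y_pred = ["Healthy"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:    ", accuracy)
print("per class:   ", per_class_recall(y_true, y_pred))
print("macro recall:", macro_recall(y_true, y_pred))
```

In this toy case accuracy is 0.8 while macro recall is only 0.5, because the rare class is never predicted.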
Provider:
minhhungg



