Name: minhhungg/rice-disease-dataset
Creator: minhhungg
Published: 2026-04-03 09:47:12
License: No description provided

Download link:

https://hf-mirror.com/datasets/minhhungg/rice-disease-dataset

Description:

---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype: string
  splits:
  - name: healthy
    num_bytes: 474126138
    num_examples: 1882
  - name: pests
    num_bytes: 369735476
    num_examples: 7142
  - name: diseases
    num_bytes: 901219319
    num_examples: 8859
  - name: nutrition
    num_bytes: 1053826854
    num_examples: 1156
  - name: train
    num_bytes: 6946207334
    num_examples: 26584
  - name: validation
    num_bytes: 1382383061
    num_examples: 5697
  - name: test
    num_bytes: 1417135780
    num_examples: 5697
  download_size: 14739266671
  dataset_size: 12544633962
configs:
- config_name: default
  data_files:
  - split: healthy
    path: data/healthy-*
  - split: pests
    path: data/pests-*
  - split: diseases
    path: data/diseases-*
  - split: nutrition
    path: data/nutrition-*
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
license: apache-2.0
task_categories:
- zero-shot-classification
language:
- vi
---

This dataset supports my Plant Diagnosis Suite project, available [on GitHub](https://github.com/alberttrann/DiseaseDetector), which includes a Disease Detector trained on this dataset.

# Vietnamese Rice Disease & Crop Recommendation Dataset

An agricultural AI dataset collected in Vietnam, containing **37,978 rice plant images** across 21 classes (diseases, pests, nutrient deficiencies, healthy) for image classification.
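
The split sizes declared in the metadata above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the declared numbers, not a description of how the splits were built; the helper name `split_fractions` is hypothetical. It confirms that `train`, `validation`, and `test` add up to the 37,978-image total in roughly 70/15/15 proportions. The four category splits sum to 19,039 in the declared metadata, so it seems safer not to assume they mirror the 37,978-image total without checking the loaded data.

```python
# Example counts copied from the dataset_info metadata above.
SPLIT_EXAMPLES = {
    "healthy": 1882,
    "pests": 7142,
    "diseases": 8859,
    "nutrition": 1156,
    "train": 26584,
    "validation": 5697,
    "test": 5697,
}


def split_fractions(counts, names=("train", "validation", "test")):
    """Return the combined size of `names` and each split's share of it."""
    total = sum(counts[name] for name in names)
    return total, {name: counts[name] / total for name in names}


total, fractions = split_fractions(SPLIT_EXAMPLES)
print(total)  # 37978
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")

# Declared size of the four category splits taken together.
category_total = sum(
    SPLIT_EXAMPLES[name] for name in ("healthy", "pests", "diseases", "nutrition")
)
print(category_total)  # 19039
```
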

---

## Dataset Summary

| Component | Type | Samples | Classes | Task |
|---|---|---|---|---|
| Rice Disease Images | Image (JPG) | 37,978 | 21 | Image Classification |

### Splits Available

| Split | Description | Size |
|---|---|---|
| `healthy` | Original healthy rice images | 3,764 images |
| `pests` | Original pest/insect images (9 classes) | ~14,284 images |
| `diseases` | Original rice disease images (8 classes) | ~17,618 images |
| `nutrition` | Original nutrient deficiency images (3 classes) | 2,312 images |
| `train` | Stratified training split (70%) | 26,584 images |
| `validation` | Stratified validation split (15%) | 5,697 images |
| `test` | Stratified test split (15%) | 5,697 images |

> The `train` / `validation` / `test` splits draw from **all 21 classes** combined and are stratified; use these for model training. The four category splits (`healthy`, `pests`, `diseases`, `nutrition`) reflect the original folder structure of the raw collection.

---

## Dataset Structure

### Image Component Fields

| Field | Type | Description |
|---|---|---|
| `image` | `Image` | The rice plant photograph |
| `label` | `string` | English class label (see label list below) |

### Loading the Dataset

```python
from datasets import load_dataset

# Load a specific split
train_only = load_dataset("minhhungg/rice-disease-dataset", split="train")

# Load all splits
ds = load_dataset("minhhungg/rice-disease-dataset")

# Access train/val/test for model training
train = ds["train"]
val = ds["validation"]
test = ds["test"]

# Quick check
print(train[0])
# {'image': <PIL.JpegImagePlugin...>, 'label': 'Healthy'}
```

---

## Label Definitions

### Image Classes (21 total)

#### Healthy (1 class)

| Label | Vietnamese Name | Count |
|---|---|---|
| `Healthy` | Cây lúa khỏe mạnh | 3,764 |

#### Pests / Insects (9 classes)

| Label | Vietnamese Name | Approx. Count |
|---|---|---|
| `Tungro Virus` | Tungro virus | 3,480 |
| `Hispa` | Sâu gai | 2,922 |
| `Rice Gall Midge` | Sâu năn (Muỗi hành) | 1,582 |
| `Chilo Stem Borer` | Sâu đục thân (Sọc nâu) | 1,490 |
| `Rice Leaf Folder` | Sâu cuốn lá nhỏ | 1,210 |
| `Thrips` | Bọ trĩ | 1,160 |
| `Rice Skipper` | Sâu cuốn lá lớn | 950 |
| `Yellow Stem Borer` | Sâu đục thân (vàng) | 910 |
| `Brown Plant Hopper` | Rầy nâu | 580 |

#### Diseases (8 classes)

| Label | Vietnamese Name | Approx. Count |
|---|---|---|
| `Leaf Scald` | Bệnh cháy lá | 3,340 |
| `Sheath Blight` | Bệnh đốm vằn / khô vằn | 3,156 |
| `Brown Spot` | Bệnh đốm nâu | 3,140 |
| `Bacterial Leaf Blight` | Bệnh bạc lá | 2,950 |
| `Narrow Brown Spot` | Bệnh gạch nâu | 2,832 |
| `Blast` | Bệnh đạo ôn lá và cổ bông | 2,000 |
| `Bakanae Disease` | Bệnh lúa von (lúa đực) | 100 |
| `False Smut` | Bệnh than vàng | 100 |

#### Nutrient Deficiencies (3 classes)

| Label | Vietnamese Name | Count |
|---|---|---|
| `Nitrogen Deficiency` | Thiếu đạm (N) | 880 |
| `Potassium Deficiency` | Thiếu kali (K) | 766 |
| `Phosphorus Deficiency` | Thiếu lân (P) | 666 |

---

## Class Imbalance

The dataset has significant class imbalance across the image component:

```
Most common : Healthy (3,764 images)
Least common: Bakanae Disease, False Smut (100 images each)
Imbalance ratio: 37.6×
```

When training on the combined `train` split, we recommend using **class weights** or **focal loss** to compensate. Because the split is stratified, each class keeps the same proportion in train, validation, and test.

---

## Image Properties

Based on EDA of 1,050 sampled images (50 per class):

| Property | Value |
|---|---|
| Mean height | 1,223 px |
| Mean width | 1,053 px |
| Most common aspect ratio | 1:1 (square) |
| Height range | 217 – 4,301 px |
| Width range | 201 – 4,364 px |
| Mean file size | 282 KB |
| Mean brightness | 149.5 / 255 |
| Color channels | RGB (3) |
| File format | JPEG (.jpg / .JPG / .jpeg) |

> Images vary widely in resolution. All models trained on this dataset resized inputs to **224 × 224** using `transforms.Resize((224, 224))`.

---

## Data Collection

- **Source**: Field photographs gathered from multiple sources on the Internet
- **Geographic scope**: Vietnam (Mekong Delta and surrounding agricultural regions)

> **Geographic bias**: Images were collected exclusively in Vietnam. Generalization to rice-growing regions with substantially different climate, rice varieties, or lighting conditions (e.g., South Asia, East Africa) has not been validated.

---

## Dataset Creation

### Preprocessing

Images in the raw splits were **not resized or augmented**; they are served at original resolution. The `train` split CSV was generated with a stratified 70/15/15 split using `sklearn.model_selection.train_test_split(stratify=label)` with `random_state=42`.

### Recommended Preprocessing for Training

```python
from torchvision import transforms

# Augmentations for training; normalization uses ImageNet statistics
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomRotation(30),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet stats
])

# Deterministic pipeline for validation and test
val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```

---

## Usage Examples

### Basic Classification with HuggingFace Trainer

```python
from datasets import load_dataset
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    Trainer,
    TrainingArguments,
)
import numpy as np

ds = load_dataset("minhhungg/rice-disease-dataset")

# Get label list from train split
labels = sorted(ds["train"].unique("label"))
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

model_name = "google/efficientnet-b0"
# AutoImageProcessor replaces the deprecated AutoFeatureExtractor
processor = AutoImageProcessor.from_pretrained(model_name)

def preprocess(batch):
    # Convert to RGB so grayscale or RGBA files do not break the processor
    images = [img.convert("RGB") for img in batch["image"]]
    inputs = processor(images=images, return_tensors="pt")
    inputs["labels"] = [label2id[l] for l in batch["label"]]
    return inputs

# Only the splits used for training and evaluation need preprocessing
train_ds = ds["train"].map(preprocess, batched=True, remove_columns=["image", "label"])
val_ds = ds["validation"].map(preprocess, batched=True, remove_columns=["image", "label"])

model = AutoModelForImageClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # replace the pretrained classification head
)

args = TrainingArguments(
    output_dir="rice-classifier",
    per_device_train_batch_size=32,
    num_train_epochs=20,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

def compute_metrics(eval_pred):
    logits, label_ids = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"accuracy": (preds == label_ids).mean()}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
```

### PyTorch DataLoader (direct)

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torchvision import transforms

ds = load_dataset("minhhungg/rice-disease-dataset", split="train")
labels = sorted(ds.unique("label"))
label2id = {l: i for i, l in enumerate(labels)}

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def collate_fn(batch):
    # Decode each image as RGB, apply the transform, and stack into one tensor
    images = [transform(item["image"].convert("RGB")) for item in batch]
    labels_t = [label2id[item["label"]] for item in batch]
    return torch.stack(images), torch.tensor(labels_t)

loader = DataLoader(ds, batch_size=32, shuffle=True, collate_fn=collate_fn)
images, labels_t = next(iter(loader))
print(images.shape)  # torch.Size([32, 3, 224, 224])
```

### Filtering to a Single Category

```python
from datasets import load_dataset

# Work only with disease images
diseases = load_dataset("minhhungg/rice-disease-dataset", split="diseases")

# Or filter the combined train split
DISEASE_LABELS = {
    "Leaf Scald", "Sheath Blight", "Brown Spot", "Bacterial Leaf Blight",
    "Narrow Brown Spot", "Blast", "Bakanae Disease", "False Smut",
}
train = load_dataset("minhhungg/rice-disease-dataset", split="train")
diseases_only = train.filter(lambda x: x["label"] in DISEASE_LABELS)
```

---

## Limitations & Biases

- **Geographic**: Collected exclusively in Vietnam. Pest and disease appearance may vary under different climate zones, rice varieties, or soil types.
- **Class imbalance**: 37× imbalance between the largest and smallest classes. Models trained without compensation may underperform on `Bakanae Disease` and `False Smut` (100 images each).
- **Image quality variability**: Images range from 201 px to 4,364 px wide and were captured under varying field lighting conditions (overcast, direct sunlight, shade). Images must be resized to a common resolution before training.
- **Single crop**: Only covers rice (*Oryza sativa*). Not applicable to other crops without retraining.
- **Annotation granularity**: Nutrient deficiency labels (N, P, K) are assigned at the nutrient level, not by deficiency severity stage.
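
One way to compensate for the class imbalance noted above is inverse-frequency class weighting. The following is a minimal sketch, not the author's training setup: it uses the whole-dataset counts from the label tables in this card (the stratified `train` split should have nearly the same proportions) and the common "balanced" formula `weight_c = N / (K * n_c)`. The names `CLASS_COUNTS` and `balanced_class_weights` are hypothetical.

```python
# Per-class image counts copied from the label tables in this card.
CLASS_COUNTS = {
    "Healthy": 3764,
    "Tungro Virus": 3480,
    "Hispa": 2922,
    "Rice Gall Midge": 1582,
    "Chilo Stem Borer": 1490,
    "Rice Leaf Folder": 1210,
    "Thrips": 1160,
    "Rice Skipper": 950,
    "Yellow Stem Borer": 910,
    "Brown Plant Hopper": 580,
    "Leaf Scald": 3340,
    "Sheath Blight": 3156,
    "Brown Spot": 3140,
    "Bacterial Leaf Blight": 2950,
    "Narrow Brown Spot": 2832,
    "Blast": 2000,
    "Bakanae Disease": 100,
    "False Smut": 100,
    "Nitrogen Deficiency": 880,
    "Potassium Deficiency": 766,
    "Phosphorus Deficiency": 666,
}


def balanced_class_weights(counts):
    """Return weight_c = N / (K * n_c) for every class.

    N is the total number of images, K the number of classes, and n_c the
    number of images in class c, so rarer classes receive larger weights.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(CLASS_COUNTS)

# Order the weights by sorted label name so that index i matches the
# label2id mapping built with sorted(...) in the usage examples.
weight_vector = [weights[label] for label in sorted(weights)]

print(f"Healthy: {weights['Healthy']:.3f}")
print(f"Bakanae Disease: {weights['Bakanae Disease']:.3f}")
```

In a PyTorch training loop, `weight_vector` could be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weight_vector))`. Weights this steep (about 37.6× between the rarest and most common class) can destabilize training, so clipping or taking their square root may be worth trying.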


Use cases: