talhakk/agriculture-qa-tokenized
Source: Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/talhakk/agriculture-qa-tokenized
Description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- agriculture
- crops
- farming
- soil-science
- llm
- gemma
- tokenized
- instruction-tuning
- rag
size_categories:
- 10K<n<100K
---
# 🌾 Agriculture-QA Tokenized Dataset (Gemma Ready)
## 🔍 Overview
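The **Agriculture-QA Tokenized Dataset** is a ready-to-train version of the original `agriculture-qa` dataset, prepared for fine-tuning Large Language Models (LLMs). The token IDs come from the Gemma tokenizer, so they apply directly to Gemma models; using the data with other model families such as **LLaMA** or **Mistral** would require re-tokenizing the original `agriculture-qa` text with that model's tokenizer.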
It contains **25,410** question-answer pairs transformed into instruction-style sequences and pre-tokenized for causal language modeling (CLM). This removes the preprocessing step, so you can go straight to fine-tuning.
---
## 🚀 Key Features
* ✅ **25K+ Agriculture QA Pairs:** Comprehensive domain coverage.
* ✅ **Gemma-Compatible:** Pre-tokenized using the Gemma tokenizer.
* ✅ **Instruction-Style Format:** Each sample follows the template `Question: [text] \n Answer: [text]`.
* ✅ **Efficiency:** No padding is applied, so each batch can be padded dynamically to its own longest sequence during training (the card claims roughly 2x faster throughput).
* ✅ **Optimized for LoRA/QLoRA:** Plug-and-play for PEFT libraries.
---
## 🧠 Data Structure
Each entry is a dictionary with the three fields needed for causal LM training:
| Field | Description |
| :--- | :--- |
| `input_ids` | Tokenized sequence of Question + Answer |
| `attention_mask` | Marks real tokens (1) versus padding (0); all ones as stored, because no padding is applied |
| `labels` | The target sequence (identical to `input_ids` for Causal LM) |
**Format Example:**
> **Question:** *How to improve wheat yield?*
> **Answer:** *Improve soil fertility through balanced NPK application...*
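
The three fields in the table above can be illustrated with a minimal sketch of how one record might be assembled. This is not the author's actual script: `format_pair`, `build_record`, and `toy_encode` are hypothetical names, the exact whitespace of the template is an assumption, and `toy_encode` is a whitespace-splitting stand-in for the real Gemma tokenizer so that the example runs without downloading anything.

```python
from typing import Callable, Dict, List

MAX_LENGTH = 512  # the "Max Length" stated in this card


def format_pair(question: str, answer: str) -> str:
    """Render one QA pair in the card's template (exact whitespace is assumed)."""
    return f"Question: {question}\nAnswer: {answer}"


def build_record(
    question: str,
    answer: str,
    encode: Callable[[str], List[int]],
    max_length: int = MAX_LENGTH,
) -> Dict[str, List[int]]:
    """Tokenize, truncate, and attach the mask and labels for causal LM training."""
    input_ids = encode(format_pair(question, answer))[:max_length]  # truncation
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),  # no padding stored, so all ones
        "labels": list(input_ids),  # causal LM: targets are a copy of the inputs
    }


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one integer ID per distinct word."""
    vocab: Dict[str, int] = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


record = build_record("How to improve wheat yield?", "Improve soil fertility.", toy_encode)
print(record)
```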
---
## ⚙️ Preprocessing Pipeline
* **Tokenizer:** `google/gemma-2b` (Transformers)
* **Max Length:** 512 tokens
* **Truncation:** Enabled
* **Padding:** None (Recommended: apply dynamic padding at runtime)
* **Parallelization:** Tokenized in parallel across multiple CPU cores
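
Because the records are stored unpadded, padding has to happen per batch at training time. The following is a minimal pure-Python sketch of that idea, not part of the dataset's own tooling: `pad_batch` is a hypothetical helper, and the `pad_token_id=0` used in the demo is an assumed value that should be read from the tokenizer in practice. Padded label positions are set to `-100`, the value PyTorch's cross-entropy loss ignores by default.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_batch(
    records: List[Dict[str, List[int]]], pad_token_id: int
) -> Dict[str, List[List[int]]]:
    """Right-pad every record to the longest sequence in this batch only."""
    longest = max(len(r["input_ids"]) for r in records)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for r in records:
        gap = longest - len(r["input_ids"])
        batch["input_ids"].append(r["input_ids"] + [pad_token_id] * gap)
        batch["attention_mask"].append(r["attention_mask"] + [0] * gap)  # 0 = padding
        batch["labels"].append(r["labels"] + [IGNORE_INDEX] * gap)  # no loss on padding
    return batch


# Two toy records of different lengths (the IDs are made up for illustration)
records = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [5, 6, 7]},
    {"input_ids": [8], "attention_mask": [1], "labels": [8]},
]
batch = pad_batch(records, pad_token_id=0)  # assumed pad ID for the demo
print(batch)
```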
---
## 📊 Dataset Statistics
| Feature | Value |
| :--- | :--- |
| **Total Samples** | 25,410 |
| **Format** | Tokenized / Instruction-Style |
| **Max Sequence Length** | 512 |
| **Language** | English |
| **Tokenizer** | Gemma (`google/gemma-2b`) |
---
## 🧪 Quick Start (Usage)
### Load the dataset
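```python
from datasets import load_dataset

# Download the pre-tokenized dataset from the Hugging Face Hub
dataset = load_dataset("talhakk/agriculture-qa-tokenized")

# Inspect the first record: a dict with input_ids, attention_mask, and labels
print(dataset["train"][0])
```

Provider: talhakk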



