talhakk/agriculture-qa-tokenized

Name: talhakk/agriculture-qa-tokenized
Creator: talhakk
Published: 2026-04-19 11:55:35
License: 暂无描述

Hugging Face2026-04-19 更新2026-04-26 收录

下载链接：

https://hf-mirror.com/datasets/talhakk/agriculture-qa-tokenized

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - text-generation - question-answering language: - en tags: - agriculture - crops - farming - soil-science - llm - gemma - tokenized - instruction-tuning - rag size_categories: - 10K<n<100K --- # 🌾 Agriculture-QA Tokenized Dataset (Gemma Ready) ## 🔍 Overview The **Agriculture-QA Tokenized Dataset** is a high-performance, ready-to-train version of the original `agriculture-qa` dataset. It has been specifically optimized for Large Language Models (LLMs) like **Gemma, LLaMA, and Mistral**. It contains **25,410** high-quality question-answer pairs transformed into instruction-style sequences and pre-tokenized for causal language modeling ($CLM$). This removes the preprocessing bottleneck, allowing you to jump straight into fine-tuning. --- ## 🚀 Key Features * ✅ **25K+ Agriculture QA Pairs:** Comprehensive domain coverage. * ✅ **Gemma-Compatible:** Pre-tokenized using the Gemma tokenizer. * ✅ **Instruction-Tuned Format:** Structured specifically for `Question: [text] \n Answer: [text]`. * ✅ **Efficiency:** No padding applied (enabling dynamic padding during training for 2x faster throughput). * ✅ **Optimized for LoRA/QLoRA:** Plug-and-play for PEFT libraries. --- ## 🧠 Data Structure Each entry is a dictionary containing the necessary tensors for training: | Field | Description | | :--- | :--- | | `input_ids` | Tokenized sequence of Question + Answer | | `attention_mask` | Mask to avoid performing attention on padding | | `labels` | The target sequence (identical to `input_ids` for Causal LM) | **Format Example:** > **Question:** *How to improve wheat yield?* > **Answer:** *Improve soil fertility through balanced NPK application...* --- ## ⚙️ Preprocessing Pipeline * **Tokenizer:** `google/gemma-2b` (Transformers) * **Max Length:** 512 tokens * **Truncation:** Enabled * **Padding:** None (Recommended: apply dynamic padding at runtime) * **Parallelization:** Multi-core processed for integrity --- ## 📊 Dataset Statistics | Feature | Value | | :--- | :--- | | **Total Samples** | 25,410 | | **Format** | Tokenized / Instruction-Style | | **Max Sequence Length** | 512 | | **Language** | English | | **Base Model** | Gemma | --- ## 🧪 Quick Start (Usage) ### Load the dataset ```python from datasets import load_dataset dataset = load_dataset("talhakk/agriculture-qa-tokenized") print(dataset["train"][0])

提供机构：

talhakk

5,000+

优质数据集

54 个

任务类型

进入经典数据集