
talhakk/agriculture-qa-tokenized

Source: Hugging Face. Updated 2026-04-19. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/talhakk/agriculture-qa-tokenized
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- agriculture
- crops
- farming
- soil-science
- llm
- gemma
- tokenized
- instruction-tuning
- rag
size_categories:
- 10K<n<100K
---

# 🌾 Agriculture-QA Tokenized Dataset (Gemma Ready)

## 🔍 Overview

The **Agriculture-QA Tokenized Dataset** is a high-performance, ready-to-train version of the original `agriculture-qa` dataset. It has been specifically optimized for Large Language Models (LLMs) like **Gemma, LLaMA, and Mistral**.

It contains **25,410** high-quality question-answer pairs transformed into instruction-style sequences and pre-tokenized for causal language modeling (CLM). This removes the preprocessing bottleneck, allowing you to jump straight into fine-tuning.

---

## 🚀 Key Features

* ✅ **25K+ Agriculture QA Pairs:** Comprehensive domain coverage.
* ✅ **Gemma-Compatible:** Pre-tokenized using the Gemma tokenizer.
* ✅ **Instruction-Tuned Format:** Structured specifically for `Question: [text] \n Answer: [text]`.
* ✅ **Efficiency:** No padding applied (enabling dynamic padding during training for 2x faster throughput).
* ✅ **Optimized for LoRA/QLoRA:** Plug-and-play for PEFT libraries.
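
Because the records are stored without padding, each batch has to be padded when it is assembled for training. The following is a minimal sketch of that dynamic padding step in plain Python, not the author's training code. The helper name `pad_batch`, the pad token id `0`, and the label fill value `-100` (the default ignore index of PyTorch's cross-entropy loss) are assumptions made for illustration; in practice a library data collator would normally handle this.

```python
from typing import Dict, List

Record = Dict[str, List[int]]


def pad_batch(examples: List[Record],
              pad_token_id: int = 0,      # assumed pad id, check your tokenizer
              label_pad_id: int = -100    # assumed "ignore" value for the loss
              ) -> Dict[str, List[List[int]]]:
    """Pad every record to the longest sequence in this batch only."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [], "attention_mask": [], "labels": []
    }
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        # Right-pad the token ids with the pad token.
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get mask 0 so attention skips them.
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
        # Padded positions get the ignore value so they add no loss.
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy records of different lengths (token ids are made up).
examples = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [5, 6, 7]},
    {"input_ids": [8, 9], "attention_mask": [1, 1], "labels": [8, 9]},
]
batch = pad_batch(examples)
print(batch["input_ids"])
```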

---

## 🧠 Data Structure

Each entry is a dictionary containing the necessary tensors for training:

| Field | Description |
| :--- | :--- |
| `input_ids` | Tokenized sequence of Question + Answer |
| `attention_mask` | Mask to avoid performing attention on padding |
| `labels` | The target sequence (identical to `input_ids` for Causal LM) |

**Format Example:**

> **Question:** *How to improve wheat yield?*
> **Answer:** *Improve soil fertility through balanced NPK application...*

---

## ⚙️ Preprocessing Pipeline

* **Tokenizer:** `google/gemma-2b` (Transformers)
* **Max Length:** 512 tokens
* **Truncation:** Enabled
* **Padding:** None (Recommended: apply dynamic padding at runtime)
* **Parallelization:** Multi-core processed for integrity

---

## 📊 Dataset Statistics

| Feature | Value |
| :--- | :--- |
| **Total Samples** | 25,410 |
| **Format** | Tokenized / Instruction-Style |
| **Max Sequence Length** | 512 |
| **Language** | English |
| **Base Model** | Gemma |

---

## 🧪 Quick Start (Usage)

### Load the dataset

```python
from datasets import load_dataset

# Download the pre-tokenized dataset from the Hugging Face Hub.
dataset = load_dataset("talhakk/agriculture-qa-tokenized")

# Inspect the first training record (input_ids, attention_mask, labels).
print(dataset["train"][0])
```
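
Once a record is loaded, it can be checked against the properties this card states: the three fields have equal length, sequences are at most 512 tokens, and `labels` match `input_ids`. The sketch below is illustrative only. The function name `check_record` is hypothetical, and the check that `attention_mask` contains only ones is an assumption inferred from the "no padding" note, not something the card states directly.

```python
from typing import Dict, List

MAX_LENGTH = 512  # "Max Length" from the preprocessing settings


def check_record(record: Dict[str, List[int]],
                 max_length: int = MAX_LENGTH) -> List[str]:
    """Return a list of problems found in one tokenized record (empty if none)."""
    problems: List[str] = []
    ids = list(record["input_ids"])
    mask = list(record["attention_mask"])
    labels = list(record["labels"])

    if not (len(ids) == len(mask) == len(labels)):
        problems.append("field lengths differ")
    if len(ids) > max_length:
        problems.append("sequence longer than max_length")
    if labels != ids:
        problems.append("labels differ from input_ids")
    # Assumption: unpadded records should have a mask made only of ones.
    if any(m != 1 for m in mask):
        problems.append("attention_mask contains non-1 values")
    return problems


# Hand-made record with made-up token ids, standing in for dataset["train"][0].
sample = {"input_ids": [2, 10, 11, 12],
          "attention_mask": [1, 1, 1, 1],
          "labels": [2, 10, 11, 12]}
print(check_record(sample))
```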
Provider:
talhakk