NovachronoAI/RAG-Grounded-QA-188k

Name: NovachronoAI/RAG-Grounded-QA-188k
Creator: NovachronoAI
Published: 2026-03-10 09:26:53
License: no description in the listing (the dataset card below states CC BY-SA 4.0)

Source: Hugging Face · Updated 2026-03-10 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/NovachronoAI/RAG-Grounded-QA-188k

Description:

---
language:
- en
license: cc-by-sa-4.0
size_categories:
- 100K<n<1M
task_categories:
- question-answering
- text-generation
task_ids:
- extractive-qa
- open-domain-qa
tags:
- RAG
- retrieval-augmented-generation
- grounded-qa
- anti-hallucination
- context-grounding
- unanswerable-detection
- multi-hop-reasoning
- conversational-qa
- fine-tuning
- qlora
- small-language-models
pretty_name: "RAG Grounded QA 188K"
dataset_info:
- config_name: full
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: source
    dtype: string
  - name: human_annotated
    dtype: bool
  - name: num_context_docs
    dtype: int64
  - name: num_hops
    dtype: int64
  - name: difficulty
    dtype: string
  - name: has_unanswerable
    dtype: bool
  - name: domain
    dtype: string
- config_name: 20k_subset
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: type
    dtype: string
  - name: source
    dtype: string
  - name: human_annotated
    dtype: bool
  - name: num_context_docs
    dtype: int64
  - name: num_hops
    dtype: int64
  - name: difficulty
    dtype: string
  - name: has_unanswerable
    dtype: bool
  - name: domain
    dtype: string
configs:
- config_name: full
  data_files:
  - split: train
    path: full/train-*
  - split: test
    path: full/test-*
- config_name: 20k_subset
  data_files:
  - split: train
    path: 20k_subset/train-*
  - split: test
    path: 20k_subset/test-*
---

<div align="center">

# 🎯 RAG Grounded QA 188K

### The Anti-Hallucination Dataset

**Teach language models to answer from context, or shut up trying.**

[![License: CC BY-SA 4.0](https://img.shields.io/badge/License-CC_BY--SA_4.0-blue.svg)](https://creativecommons.org/licenses/by-sa/4.0/)
[![Dataset Size](https://img.shields.io/badge/Examples-188K-brightgreen)]()
[![Human Annotated](https://img.shields.io/badge/Human_Annotated-93.5%25-gold)]()
[![Sources](https://img.shields.io/badge/Sources-9_Datasets-purple)]()
[![Unanswerable](https://img.shields.io/badge/Unanswerable-20.6%25-red)]()

*Built by [NovachronoAI](https://huggingface.co/NovachronoAI): Precision AI for the real world.*

---

**[Full Dataset (186K)](#loading-the-dataset)** · **[20K Subset](#loading-the-dataset)** · **[Schema](#schema)** · **[Sources](#sources)** · **[Usage Guide](#usage-guide)**

</div>

---

## 🧠 Why This Dataset Exists

Most QA datasets teach models **what to say**. This one also teaches them **when to stay silent**.

RAG (Retrieval-Augmented Generation) systems have a fatal flaw: the model hallucinates when the retrieved documents don't contain the answer. Many current models will confidently fabricate information rather than admit uncertainty.

**RAG Grounded QA 188K** is a carefully engineered blend of 9 premier open-source datasets, unified under a single schema, with one mission:

> **Train models that are grounded, faithful, and honest.**

---

## ✨ Key Features

<table>
<tr>
<td width="50%">

### 🛡️ Anti-Hallucination by Design

Over **20% of all examples are unanswerable**, so the model must learn to refuse gracefully. This includes context-swapped examples, where questions are deliberately paired with the wrong documents.

### 📊 Battle-Tested Sources

Every example traces back to peer-reviewed, widely cited academic datasets. No synthetic slop. **93.5% human-annotated.**

### 🎯 Ready-to-Train Subset

Includes a carefully sampled **20K subset** optimized for small models (0.5B–3B parameters). Plug into QLoRA and go.

</td>
<td width="50%">

### 🔀 Multi-Skill Coverage

Six distinct reasoning types in one dataset:

- ✅ Direct answering from context
- ❌ Confident refusal when context lacks info
- 🔗 Multi-hop reasoning across documents
- 💬 Conversational context tracking
- 🔢 Numerical reasoning from text
- ✔️ Boolean (Yes/No) grounding

### 🏷️ Rich Metadata

Every example is tagged with source, type, difficulty, domain, and annotation quality, which enables custom filtering and ablation studies.
</td>
</tr>
</table>

---

## 📊 Dataset at a Glance

### Distribution by Task Type

<div align="center">
<img src="assets/type_distribution.png" alt="Type Distribution" width="600">
</div>

| Type | Count | Percentage | Purpose |
|:-----|------:|-----------:|:--------|
| ✅ **Answerable** | 82,434 | 47.8% | Core grounded QA: answer strictly from context |
| 💬 **Conversational** | 31,368 | 18.2% | Multi-turn dialogue grounded in documents |
| ❌ **Unanswerable** | 23,744 | 13.8% | Model must refuse because the context lacks the answer |
| 🔗 **Multi-hop** | 20,000 | 11.6% | Reasoning across 2+ documents |
| 🔢 **Numerical Reasoning** | 10,000 | 5.8% | Math, counting, comparisons from text |
| ✔️ **Boolean** | 5,000 | 2.9% | Yes/No with grounded justification |
| 🔄 **Context-Swapped** | ~15,000 | n/a | Boosted unanswerable (included in ❌ count) |

### Distribution by Source

![Source Distribution](assets/source_distribution.png)

| Source | Count | % | What It Contributes |
|:-------|------:|--:|:--------------------|
| [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) | 65,000 | 37.5% | Gold-standard reading comprehension + unanswerable |
| [RAGBench](https://huggingface.co/datasets/rungalileo/ragbench) | 27,255 | 15.7% | Multi-domain RAG evaluation (medical, legal, finance) |
| [QuAC](https://quac.ai/) | 20,000 | 11.6% | Conversational QA with explicit CANNOTANSWER labels |
| [HotpotQA](https://hotpotqa.github.io/) | 20,000 | 11.6% | Multi-hop reasoning across Wikipedia articles |
| [CoQA](https://stanfordnlp.github.io/coqa/) | 15,000 | 8.7% | Conversational comprehension over diverse passages |
| [DROP](https://allennlp.org/drop) | 10,000 | 5.8% | Discrete reasoning: arithmetic and counting from text |
| [neural-bridge](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000) | 9,598 | 5.5% | Clean RAG-native question-context-answer triples |
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | 5,000 | 2.9% | Boolean questions with passage evidence |
| [pints-ai](https://huggingface.co/datasets/pints-ai/Finetune-RAG) | 1,653 | 1.0% | Anti-hallucination with distractor documents |
| Context-Swapped | ~15,000 | n/a | Synthetically mismatched Q+C pairs for refusal training |

### Quality Metrics

![Quality Metrics](assets/quality_metrics.png)

| Metric | Full Dataset | 20K Subset |
|:-------|:------------|:-----------|
| **Total Examples** | ~185,000 | 20,000 |
| **Human Annotated** | 93.5% | 95.1% |
| **Unanswerable Rate** | 20.6% | 22.0% |
| **Unique Sources** | 9 | 9 |
| **Zero Critical Issues** | ✅ | ✅ |
| **Cleaned & Deduplicated** | ✅ | ✅ |

---

## 🗂️ Schema

Every example follows a unified, consistent schema, so no preprocessing is needed:

```json
{
  "id": "squad2_000042",
  "context": "[Document 1]: The Apollo 11 mission landed the first humans on the Moon on July 20, 1969. Commander Neil Armstrong and lunar module pilot Buzz Aldrin formed the American crew that landed the Apollo Lunar Module Eagle.",
  "question": "Who was the commander of Apollo 11?",
  "answer": "Commander Neil Armstrong led the Apollo 11 mission, which landed the first humans on the Moon on July 20, 1969.",
  "type": "answerable",
  "source": "squad_v2",
  "human_annotated": true,
  "num_context_docs": 1,
  "num_hops": 1,
  "difficulty": "easy",
  "has_unanswerable": false,
  "domain": "general"
}
```

### Field Descriptions

| Field | Type | Description |
|:------|:-----|:------------|
| `id` | `string` | Unique identifier with source prefix |
| `context` | `string` | Retrieved document(s) with `[Document N]:` prefix |
| `question` | `string` | User question |
| `answer` | `string` | Grounded answer **or** standardized refusal message |
| `type` | `string` | One of: `answerable`, `unanswerable`, `multi_hop`, `conversational`, `boolean`, `numerical_reasoning` |
| `source` | `string` | Original dataset identifier |
| `human_annotated` | `bool` | `true` if both Q and A are human-written |
| `num_context_docs` | `int` | Number of documents in context |
| `num_hops` | `int` | Reasoning hops required (0 = unanswerable, 1 = single, 2 = multi) |
| `difficulty` | `string` | `easy`, `medium`, or `hard` |
| `has_unanswerable` | `bool` | Quick filter flag for refusal examples |
| `domain` | `string` | Topic domain (e.g., `general`, `pubmedqa`, `finqa`) |

### Standardized Refusal Format

All unanswerable examples use a consistent refusal message:

> *"Based on the provided context, I don't have enough information to answer this question."*

This consistency is critical for training: the model learns one clean refusal pattern.

---

## 🚀 Loading the Dataset

```python
from datasets import load_dataset

# Full dataset (186K examples)
# Best for: larger models (3B+), research, custom sampling
full_ds = load_dataset("NovachronoAI/RAG-Grounded-QA-188k", "full")

# Curated 20K subset
# Best for: small models (0.5B-3B), quick fine-tuning, QLoRA
subset_ds = load_dataset("NovachronoAI/RAG-Grounded-QA-188k", "20k_subset")
```

---

## 🎯 Usage Guide

### For Fine-Tuning Small Models (0.5B – 3B)

Use the `20k_subset` config. It's pre-balanced with 22% unanswerable examples and optimized for small model capacity.

```python
from datasets import load_dataset

ds = load_dataset("NovachronoAI/RAG-Grounded-QA-188k", "20k_subset", split="train")

# System prompt shared by every training example
SYSTEM_PROMPT = """You are a precise answering assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context contains the answer, provide it concisely with relevant details.
- If the context does NOT contain enough information, say: "Based on the provided context, I don't have enough information to answer this question."
- Do NOT use prior knowledge. ONLY use the context.
- If only a partial answer is possible, provide what you can and state what's missing."""


def format_for_training(example):
    """Convert one dataset row into a system/user/assistant chat transcript."""
    user_msg = f"Context:\n{example['context']}\n\nQuestion: {example['question']}"
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": example["answer"]},
    ]
    return {"messages": messages}


formatted = ds.map(format_for_training)
```

### For Larger Models (7B+)

Sample from the `full` config for maximum coverage:

```python
from datasets import load_dataset

ds = load_dataset("NovachronoAI/RAG-Grounded-QA-188k", "full", split="train")

# Draw a random 50K sample (raise toward 100K as compute allows).
# Note: shuffle + select is a uniform sample, so it keeps the dataset's
# natural type proportions rather than balancing the types.
ds_shuffled = ds.shuffle(seed=42)
ds_sampled = ds_shuffled.select(range(50_000))
```

### Custom Filtering

The rich metadata enables precise dataset construction:

```python
from datasets import load_dataset

ds = load_dataset("NovachronoAI/RAG-Grounded-QA-188k", "full", split="train")

# Only unanswerable examples (for refusal training)
refusal_data = ds.filter(lambda x: x["has_unanswerable"])

# Only multi-hop reasoning
multihop = ds.filter(lambda x: x["type"] == "multi_hop")

# Only human-annotated, medium+ difficulty
quality = ds.filter(lambda x: x["human_annotated"] and x["difficulty"] != "easy")

# Specific domain (from RAGBench)
medical = ds.filter(lambda x: x["domain"] == "pubmedqa")
financial = ds.filter(lambda x: x["domain"] == "finqa")
legal = ds.filter(lambda x: x["domain"] == "cuad")
```

---

## 🏗️ How This Dataset Was Built

### Pipeline

![Build Pipeline](assets/build_pipeline.png)

### Quality Assurance Process

The dataset underwent a comprehensive 15-point diagnostic and cleaning pipeline:

| Check | Type | Action |
|:------|:-----|:-------|
| Missing fields | 🚨 Critical | Removed |
| Empty context/question/answer | 🚨 Critical | Removed |
| Exact duplicates | 🚨 Critical | Removed |
| Near-duplicate Q+A pairs | ⚠️ Warning | Removed |
| Broken encoding / garbage chars | ⚠️ Warning | Removed |
| Invalid type/difficulty values | 🚨 Critical | Fixed |
| Wrong refusal labels | ⚠️ Warning | Corrected |
| Missing `[Document]` prefix | 💡 Info | Added |
| Inconsistent refusal text | ⚠️ Warning | Standardized |
| Excessive whitespace | 💡 Info | Normalized |
| HTML artifacts | ⚠️ Warning | Cleaned |
| Context length outliers | 💡 Info | Truncated (>10K chars) |

**Final verification: Zero critical issues remaining.**

### Unanswerable Boosting

To reach the target 20%+ unanswerable rate, we employed **context swapping**: pairing human-written questions with deliberately mismatched human-written contexts. This creates genuinely unanswerable examples where:

- The question is natural and well-formed (human-written)
- The context is real and coherent (human-written)
- The answer genuinely cannot be found in the provided context

This technique is more robust than synthetic generation because both components are authentic.

---

## 📐 Recommended Training Configuration

### For Qwen3.5 (0.5B–4B) / LLaMA 3.2 (1B–3B) / SmolLM2

```yaml
# QLoRA Configuration
quantization: 4-bit (nf4)
lora_rank: 32
lora_alpha: 64
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
lora_dropout: 0.05

# Training
dataset: 20k_subset
epochs: 3-4
batch_size: 4
gradient_accumulation: 4
learning_rate: 2e-4
scheduler: cosine
warmup_ratio: 0.05
max_seq_length: 2048
```

### Phased Training (Recommended)

For best results with small models, train in phases:

| Phase | Data | Epochs | Goal |
|:------|:-----|:-------|:-----|
| **Phase 1** | Only `answerable` + `unanswerable` types | 3-4 | Learn grounding + refusal |
| **Phase 2** | Add `conversational` + `multi_hop` | 2-3 | Add reasoning complexity |
| **Phase 3** | All types | 1-2 | Final polish |

> **Evaluate refusal accuracy after each phase.** Don't proceed until the model refuses correctly on 85%+ of unanswerable examples.
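The phase gate above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the helper names (`in_phase`, `is_refusal`, `refusal_accuracy`, `ready_for_next_phase`) are hypothetical, the phase-to-type mapping is read off the table (with Phase 2 treated as cumulative), and the refusal check is a loose substring match on the standardized refusal sentence, which is an assumption about how strictly a refusal should be scored.

```python
# Hypothetical helpers for phased training; names and matching rule are assumptions.
REFUSAL = ("Based on the provided context, I don't have enough information "
           "to answer this question.")

# Types admitted in each phase; None means "all types".
PHASE_TYPES = {
    1: {"answerable", "unanswerable"},
    2: {"answerable", "unanswerable", "conversational", "multi_hop"},
    3: None,
}


def in_phase(example, phase):
    """True if the example's `type` belongs to the given training phase."""
    allowed = PHASE_TYPES[phase]
    return allowed is None or example["type"] in allowed


def is_refusal(text):
    """Loose match: normalize case, whitespace, and curly apostrophes."""
    normalized = " ".join(text.lower().replace("\u2019", "'").split())
    return "don't have enough information to answer" in normalized


def refusal_accuracy(examples, predictions):
    """Fraction of unanswerable examples whose prediction is a refusal."""
    pairs = [(ex, pred) for ex, pred in zip(examples, predictions)
             if ex["has_unanswerable"]]
    if not pairs:
        raise ValueError("no unanswerable examples in the evaluation set")
    return sum(is_refusal(pred) for _, pred in pairs) / len(pairs)


def ready_for_next_phase(examples, predictions, threshold=0.85):
    """Apply the 85% refusal-accuracy gate before moving to the next phase."""
    return refusal_accuracy(examples, predictions) >= threshold
```

With a loaded split, `ds.filter(lambda x: in_phase(x, 1))` would give the Phase 1 data, and `predictions` would be the model's generated answers for a held-out set, in the same order as `examples`.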
---

## 🏛️ Where This Fits in a RAG System

![RAG Pipeline](assets/rag_pipeline.png)

---

## 📋 Example Outputs (What Your Model Will Learn)

<details>
<summary><b>✅ Answerable: Direct Grounding</b></summary>

**Context:**
> [Document 1]: Tesla delivered 1.81 million vehicles in 2023, a 38% increase from 2022. The Model Y was the world's best-selling car across all fuel types.

**Question:** How many vehicles did Tesla deliver in 2023?

**Answer:** Tesla delivered 1.81 million vehicles in 2023, representing a 38% increase from the previous year. The Model Y was notably the world's best-selling car across all fuel types.

</details>

<details>
<summary><b>❌ Unanswerable: Confident Refusal</b></summary>

**Context:**
> [Document 1]: Python was created by Guido van Rossum and first released in 1991. It emphasizes code readability and supports multiple programming paradigms.

**Question:** What is the latest version of Python?

**Answer:** Based on the provided context, I don't have enough information to answer this question.

</details>

<details>
<summary><b>🔗 Multi-hop: Cross-Document Reasoning</b></summary>

**Context:**
> [Albert Einstein]: Albert Einstein was born in Ulm, Germany, in 1879. He developed the theory of general relativity.
>
> [Nobel Prize]: The 1921 Nobel Prize in Physics was awarded for the discovery of the law of the photoelectric effect.

**Question:** What was Einstein's Nobel Prize awarded for?

**Answer:** Based on the context, Albert Einstein received the 1921 Nobel Prize in Physics for the discovery of the law of the photoelectric effect, not for his theory of general relativity.

</details>

<details>
<summary><b>🔢 Numerical Reasoning</b></summary>

**Context:**
> [Document 1]: The company reported revenue of $4.2 billion in Q3 with operating expenses of $3.1 billion, compared to Q2 revenue of $3.8 billion and expenses of $2.9 billion.

**Question:** How did the company's operating profit change from Q2 to Q3?

**Answer:** Q3 operating profit was $1.1 billion ($4.2B - $3.1B) compared to Q2's $0.9 billion ($3.8B - $2.9B), an increase of approximately $200 million.

</details>

---

## ⚖️ License & Attribution

This dataset is released under **[CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)**.

It combines data from the following sources. Please cite them if used in academic work:

| Source | License | Citation |
|:-------|:--------|:---------|
| SQuAD 2.0 | CC BY-SA 4.0 | [Rajpurkar et al., 2018](https://arxiv.org/abs/1806.03822) |
| HotpotQA | CC BY-SA 4.0 | [Yang et al., 2018](https://arxiv.org/abs/1809.09600) |
| QuAC | MIT | [Choi et al., 2018](https://arxiv.org/abs/1808.07036) |
| CoQA | Multiple | [Reddy et al., 2019](https://arxiv.org/abs/1808.07042) |
| DROP | Apache 2.0 | [Dua et al., 2019](https://arxiv.org/abs/1903.00161) |
| BoolQ | CC BY-SA 3.0 | [Clark et al., 2019](https://arxiv.org/abs/1905.10044) |
| RAGBench | Apache 2.0 | [Galileo AI](https://huggingface.co/rungalileo) |
| neural-bridge | Apache 2.0 | [neural-bridge](https://huggingface.co/neural-bridge) |
| pints-ai | Apache 2.0 | [pints-ai](https://huggingface.co/pints-ai) |

---

## 🏢 About NovachronoAI

We build precision AI tools and datasets for the real world. Our focus is on creating resources that make AI systems more reliable, grounded, and honest.

**This dataset is part of our mission to eliminate hallucination in production AI systems.**

🔗 [HuggingFace](https://huggingface.co/NovachronoAI) · 🌐 [Organization](https://huggingface.co/NovachronoAI)

---

<div align="center">

*If you use this dataset, consider giving it a ♥️ on HuggingFace.*

*Built with care. Cleaned with obsession. Shared with purpose.*

</div>

### Citation

```bibtex
@misc{rag_grounded_qa_188k,
  author       = {NovachronoAI},
  title        = {RAG Grounded QA 188K: The Anti-Hallucination Dataset},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/datasets/NovachronoAI/RAG-Grounded-QA-188k}}
}
```
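The context-swapping idea described under "Unanswerable Boosting" can be illustrated with a short sketch. This is not the pipeline that produced the dataset: the function name `make_context_swapped`, the `_swap` id suffix, and the choice to skip only pairs with identical contexts are all assumptions made for illustration. A real pipeline would likely need a stronger check that the swapped context truly cannot answer the question.

```python
import random

REFUSAL = ("Based on the provided context, I don't have enough information "
           "to answer this question.")


def make_context_swapped(examples, n, seed=0):
    """Pair answerable questions with a different example's context.

    Hypothetical sketch: a differing context does not by itself prove the
    question is unanswerable, so results should still be spot-checked.
    """
    rng = random.Random(seed)
    answerable = [ex for ex in examples if not ex["has_unanswerable"]]
    if len({ex["context"] for ex in answerable}) < 2:
        raise ValueError("need at least two distinct contexts to swap")

    swapped = []
    while len(swapped) < n:
        q_ex, c_ex = rng.sample(answerable, 2)
        if q_ex["context"] == c_ex["context"]:
            continue  # same passage: the question might still be answerable
        swapped.append({
            **q_ex,
            "id": f"{q_ex['id']}_swap{len(swapped)}",  # hypothetical id scheme
            "context": c_ex["context"],
            "num_context_docs": c_ex["num_context_docs"],
            "answer": REFUSAL,
            "type": "unanswerable",
            "num_hops": 0,
            "has_unanswerable": True,
        })
    return swapped
```

The question text is left untouched and the context is copied verbatim from another row, which matches the card's point that both components stay human-written; only the pairing is synthetic.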

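The "For Larger Models" snippet in the usage guide shuffles and takes the first 50,000 rows, which is a uniform random sample. If a sample that is balanced across the `type` field is wanted instead, one option is to pick row indices per type and pass them to `Dataset.select`. The helper below is a hypothetical sketch (the name `balanced_indices` and the equal-quota rule are assumptions, not part of the dataset's tooling); it works on any sequence of type labels.

```python
import random
from collections import defaultdict


def balanced_indices(types, n_total, seed=42):
    """Return row indices giving each type an (almost) equal share of n_total.

    Types with fewer rows than their quota contribute everything they have,
    so the result can be shorter than n_total.
    """
    by_type = defaultdict(list)
    for idx, label in enumerate(types):
        by_type[label].append(idx)
    if not by_type:
        raise ValueError("no rows to sample from")

    rng = random.Random(seed)
    quota = n_total // len(by_type)
    picked = []
    for label in sorted(by_type):  # sorted for reproducibility
        pool = by_type[label]
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(picked)  # avoid long runs of a single type
    return picked
```

With a loaded split `ds`, usage would look like `ds.select(balanced_indices(ds["type"], 50_000))`. Because the smaller types cap out at their full size, the sample will fall short of the requested total whenever a type has fewer rows than its quota.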
Provider:

NovachronoAI
