
rajat5039/wiki-multihop-qa-500k

Source: Hugging Face · Updated 2026-03-21 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/rajat5039/wiki-multihop-qa-500k
Description:
---
language:
- en
license: cc-by-4.0
task_categories:
- question-answering
task_ids:
- extractive-qa
- open-domain-qa
tags:
- multi-hop
- reasoning
- synthetic
- wikipedia
- chain-of-thought
- latent-reasoning
- think-in-silence
size_categories:
- 100K<n<1M
---

# wiki-multihop-qa-500k

**500,000 synthetic multi-hop QA pairs generated from Wikipedia.**

Built for training the [think-in-silence](https://github.com/rajatmalik/think-in-silence) latent reasoning model, a model that reasons entirely in vector space without generating chain-of-thought tokens.

---

## Why This Dataset Exists

Most publicly available QA datasets have two problems for reasoning research:

1. **Too small.** HotpotQA has 113K samples. StrategyQA has 2.8K. That is not enough diversity to train a generalizable reasoning module.
2. **Too many single-hop questions.** A question answerable from one sentence teaches the model that one thinking step is enough. That directly breaks the K-scaling property we're trying to demonstrate.

This dataset was built to fix both. Nearly every pair (about 95%, see the difficulty distribution) requires connecting at least two facts. The model must reason, not look up.

---

## What Makes a Good Multi-Hop Question

**Single-hop (NOT in this dataset):**

```
Q: Where was Marie Curie born?
A: Warsaw
```

One fact. One lookup. No reasoning chain needed.

**Multi-hop (IN this dataset):**

```
Q: In which city did the scientist born in Warsaw later discover polonium?
A: Paris
```

The question chains several facts: (1) Curie was born in Warsaw → (2) she worked in Paris → (3) she discovered polonium there. This is the reasoning pattern that requires K > 1 thinking steps in the think-in-silence architecture.


---

## Dataset Statistics

| Split      | Samples      |
|------------|--------------|
| Train      | ~450,000     |
| Validation | ~25,000      |
| Test       | ~25,000      |
| **Total**  | **~500,000** |

### Difficulty Distribution (Train)

| Score | Type       | Description                        | Approx. % |
|-------|------------|------------------------------------|-----------|
| 0     | single-hop | One fact lookup                    | ~5%       |
| 1     | two-hop    | Connect two facts                  | ~55%      |
| 2     | multi-hop  | Three or more connections required | ~40%      |

> Multi-hop ratio (score ≥ 1): **~95% of the dataset**

### Answer Type Distribution

| Type    | Description                   | Example          |
|---------|-------------------------------|------------------|
| entity  | Named person, place, or thing | "Paris"          |
| phrase  | Short descriptive phrase      | "Atlantic Ocean" |
| numeric | Number or quantity            | "1867"           |
| date    | Year or date                  | "1905"           |
| boolean | Yes/No answer                 | "yes"            |

---

## Data Fields

Each sample contains:

```python
{
    "question": str,          # Multi-hop question requiring 2+ facts
    "answer": str,            # Short answer (1-6 words)
    "hops": int,              # Number of reasoning hops (always >= 2)
    "difficulty_score": int,  # 0=single-hop, 1=two-hop, 2=multi-hop
    "answer_type": str,       # entity | phrase | numeric | date | boolean
    "source": str,            # Wikipedia chunk file the paragraph came from
}
```

### Example Samples

```json
{
  "question": "In which city did the physicist born in Warsaw conduct her Nobel Prize-winning research?",
  "answer": "Paris",
  "hops": 2,
  "difficulty_score": 2,
  "answer_type": "entity",
  "source": "chunk_0042.jsonl"
}
{
  "question": "What element was discovered by the scientist who also founded the Radium Institute in France?",
  "answer": "Polonium",
  "hops": 2,
  "difficulty_score": 2,
  "answer_type": "entity",
  "source": "chunk_0042.jsonl"
}
{
  "question": "Which country's capital city was home to the university where the Nobel Prize winner who was born in 1867 studied?",
  "answer": "France",
  "hops": 2,
  "difficulty_score": 2,
  "answer_type": "entity",
  "source": "chunk_0089.jsonl"
}
```

---

## Generation Pipeline

### Source Corpus

- **500,000 English Wikipedia articles** streamed from `wikimedia/wikipedia` (20231101.en)
- Split into **3,308,934 paragraphs** (30–200 words each)
- Stored as 331 JSONL chunk files

### Generation Model

- **Gemini 2.5 Flash-Lite** via Google AI API
- 10 concurrent workers
- ~21 pairs/second sustained throughput

### Prompt Design

The generation prompt was carefully engineered to force multi-hop output:

```
STRICT RULES:
1. Each question MUST require connecting 2+ facts from the passage
2. Single-fact questions ("Where was X born?") are REJECTED
3. Answers: 1-6 words maximum
4. Output ONLY a JSON array

GOOD: connects 2 facts
BAD: only 1 fact needed — REJECTED
```

The good/bad example in the prompt was critical: without it, the model generated mostly single-hop questions despite the instructions.

### Quality Filtering

Every generated pair passed through 8 filters before being kept:

| Filter                  | What it removes                  |
|-------------------------|----------------------------------|
| `single_hop`            | hops < 2                         |
| `question_too_short`    | fewer than 6 words               |
| `no_question_mark`      | missing `?`                      |
| `bad_question_start`    | doesn't start with question word |
| `answer_too_vague`      | "yes", "no", "it", "they"        |
| `answer_too_long`       | more than 15 words               |
| `answer_in_question`    | trivial lookup                   |
| `question_has_artifact` | generation artifacts             |

Plus near-deduplication (pairs with Jaccard similarity > 0.85 are dropped).

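
The filter table and the deduplication step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the word lists, the artifact pattern, the helper names (`rejection_reason`, `jaccard`, `deduplicate`), and the choice to compute Jaccard similarity over question word sets are all assumptions made for the example.

```python
import re

# Hypothetical word lists and artifact pattern; the real ones are not given in this card.
QUESTION_STARTS = {"what", "which", "who", "whom", "whose", "where", "when",
                   "why", "how", "in", "on", "at", "by", "for", "of", "to"}
VAGUE_ANSWERS = {"yes", "no", "it", "they"}
ARTIFACT_PATTERN = re.compile(r"```|\bjson\b|[{}]", re.IGNORECASE)


def tokens(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rejection_reason(pair):
    """Return the name of the first filter that rejects `pair`, or None if it passes."""
    question = pair["question"].strip()
    answer = pair["answer"].strip()
    q_tokens = tokens(question)
    a_tokens = tokens(answer)

    if pair.get("hops", 0) < 2:
        return "single_hop"
    if len(q_tokens) < 6:
        return "question_too_short"
    if not question.endswith("?"):
        return "no_question_mark"
    if q_tokens[0] not in QUESTION_STARTS:
        return "bad_question_start"
    if not a_tokens or answer.lower() in VAGUE_ANSWERS:
        return "answer_too_vague"
    if len(a_tokens) > 15:
        return "answer_too_long"
    if answer.lower() in question.lower():
        return "answer_in_question"
    if ARTIFACT_PATTERN.search(question):
        return "question_has_artifact"
    return None


def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(tokens(text_a)), set(tokens(text_b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(pairs, threshold=0.85):
    """Greedy near-dedup: keep a pair only if its question is not too similar to a kept one."""
    kept = []
    for pair in pairs:
        if all(jaccard(pair["question"], k["question"]) <= threshold for k in kept):
            kept.append(pair)
    return kept


if __name__ == "__main__":
    sample = {"question": "Where was Marie Curie born?", "answer": "Warsaw", "hops": 1}
    print(rejection_reason(sample))
```

The pairwise comparison in `deduplicate` is quadratic in the number of pairs, so at the scale of roughly a million raw pairs an approximate method such as MinHash with locality-sensitive hashing would likely be needed; the sketch only shows the thresholding logic.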

### Generation Stats

```
Raw pairs generated:      1,050,000
After quality filter:     ~925,000  (88%)
After difficulty filter:  ~504,000  (48% overall retention)
After deduplication:      ~500,000
Total API cost:           ~$35
Generation time:          ~18 hours
```

---

## How to Use

### Basic Loading

```python
from datasets import load_dataset

dataset = load_dataset("rajat5039/wiki-multihop-qa-500k")

# Access splits
train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]

print(train[0])
# {
#     "question": "In which city did the physicist born in Warsaw...",
#     "answer": "Paris",
#     "hops": 2,
#     "difficulty_score": 2,
#     "answer_type": "entity",
#     "source": "chunk_0042.jsonl"
# }
```

### Filter by Difficulty

```python
# Only hardest multi-hop (score=2)
hard = dataset["train"].filter(lambda x: x["difficulty_score"] == 2)

# Only two-hop (score=1)
medium = dataset["train"].filter(lambda x: x["difficulty_score"] == 1)
```

### Filter by Answer Type

```python
# Only entity answers
entities = dataset["train"].filter(lambda x: x["answer_type"] == "entity")

# Only numeric
numeric = dataset["train"].filter(lambda x: x["answer_type"] == "numeric")
```

### Combine with Public Datasets

```python
from datasets import load_dataset, concatenate_datasets

wiki_mhop = load_dataset("rajat5039/wiki-multihop-qa-500k", split="train")
hotpotqa = load_dataset("hotpot_qa", "distractor", split="train")

# concatenate_datasets needs matching features, so keep only the shared columns
cols = ["question", "answer"]
combined = concatenate_datasets([wiki_mhop.select_columns(cols), hotpotqa.select_columns(cols)])
```

---

## Intended Use

### Primary Use: think-in-silence Training

This dataset was built specifically to train the ThoughtModule in [think-in-silence](https://github.com/rajatmalik/think-in-silence), a latent reasoning model that performs K recurrent cross-attention steps in a 256-dimensional space.

Multi-hop questions are essential because:

- Single-hop questions teach the model K=1 is enough
- Multi-hop questions force K > 1 thinking steps
- The K-scaling property only emerges with sufficient multi-hop training signal

### Other Suitable Uses

- Training retrieval-augmented generation (RAG) systems
- Fine-tuning LLMs for multi-step reasoning
- Evaluating question answering systems
- Research on chain-of-thought and reasoning

### Not Suitable For

- Factual question answering benchmarks (answers are synthetic and may contain errors)
- Tasks requiring long-form answers
- Non-English tasks

---

## Limitations

**Answer accuracy is not guaranteed.** This is a synthetically generated dataset. Gemini 2.5 Flash-Lite may occasionally generate incorrect answers or misattribute facts. For the reasoning research this dataset targets, the multi-hop structure of the questions matters more than factual accuracy.

**Wikipedia coverage.** All questions are grounded in Wikipedia (November 2023 snapshot). Topics not well covered in Wikipedia are underrepresented.

**English only.** The source corpus is English Wikipedia. All questions and answers are in English.

**Retention rate.** ~48% of generated pairs passed all filters. The majority of rejections were single-hop questions that slipped through despite the prompt instructions. This means the remaining dataset is high-confidence multi-hop.

---

## Comparison to Related Datasets

| Dataset                   | Size     | Multi-hop | Synthetic | Free  |
|---------------------------|----------|-----------|-----------|-------|
| HotpotQA                  | 113K     | ✓         | ✗         | ✓     |
| StrategyQA                | 2.8K     | ✓         | ✗         | ✓     |
| MuSiQue                   | 20K      | ✓         | ✗         | ✓     |
| **wiki-multihop-qa-500k** | **500K** | **✓**     | **✓**     | **✓** |

The key advantage is scale: 500K multi-hop pairs versus 113K for HotpotQA, the largest alternative listed here.


---

## Generation Code

Full pipeline code is open source:

```
https://github.com/rajatmalik/think-in-silence-data
```

To reproduce this dataset:

```bash
git clone https://github.com/rajatmalik/think-in-silence-data
cd think-in-silence-data
pip install -r requirements.txt
export GOOGLE_API_KEY=your_key
python run.py --yes
```

---

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{malik2026wikimultihop,
  author    = {Malik, Rajat},
  title     = {wiki-multihop-qa-500k: Synthetic Multi-Hop QA from Wikipedia},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/rajatmalik/wiki-multihop-qa-500k},
  note      = {Generated using Gemini 2.5 Flash-Lite from 500K Wikipedia articles}
}
```

---

## Related Work

- **think-in-silence**: the model this dataset was built for: [github.com/rajatmalik/think-in-silence](https://github.com/rajatmalik/think-in-silence)
- **I-JEPA** (Assran et al., 2023): JEPA training objective this project extends to language
- **Coconut** (Hao et al., 2024): related latent reasoning approach
- **HotpotQA** (Yang et al., 2018): original multi-hop QA dataset

---

## License

[Creative Commons Attribution 4.0 (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)

You are free to use, share, and adapt this dataset for any purpose, including commercial use, as long as you give appropriate credit.

---

*Built by Rajat Malik · 2026 · Part of the think-in-silence research project*
Provider:
rajat5039