rajat5039/wiki-multihop-qa-500k
Source: Hugging Face · Updated 2026-03-21 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/rajat5039/wiki-multihop-qa-500k
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- question-answering
task_ids:
- extractive-qa
- open-domain-qa
tags:
- multi-hop
- reasoning
- synthetic
- wikipedia
- chain-of-thought
- latent-reasoning
- think-in-silence
size_categories:
- 100K<n<1M
---
# wiki-multihop-qa-500k
**500,000 synthetic multi-hop QA pairs generated from Wikipedia.**
Built for training the [think-in-silence](https://github.com/rajatmalik/think-in-silence) latent reasoning model — a model that reasons entirely in vector space without generating chain-of-thought tokens.
---
## Why This Dataset Exists
Most publicly available QA datasets have two problems for reasoning research:
1. **Too small.** HotpotQA has 113K samples. StrategyQA has 2.8K. Not enough diversity to train a generalizable reasoning module.
2. **Too many single-hop questions.** A question answerable from one sentence teaches the model that one thinking step is enough. That directly breaks the K-scaling property we're trying to demonstrate.
This dataset was built to fix both. Nearly every pair (about 95% by difficulty score) requires connecting at least two facts. The model must reason, not look up.
---
## What Makes a Good Multi-Hop Question
**Single-hop (NOT in this dataset):**
```
Q: Where was Marie Curie born?
A: Warsaw
```
One fact. One lookup. No reasoning chain needed.
**Multi-hop (IN this dataset):**
```
Q: In which city did the scientist born in Warsaw later discover polonium?
A: Paris
```
Two facts connected: (1) the scientist born in Warsaw is Marie Curie → (2) Curie discovered polonium while working in Paris.
This is the reasoning pattern that requires K > 1 thinking steps in the think-in-silence architecture.
---
## Dataset Statistics
| Split | Samples |
|------------|-----------|
| Train | ~450,000 |
| Validation | ~25,000 |
| Test | ~25,000 |
| **Total** | **~500,000** |
### Difficulty Distribution (Train)
| Score | Type | Description | Approx % |
|-------|-------------|------------------------------------|----------|
| 0 | single-hop | One fact lookup | ~5% |
| 1 | two-hop | Connect two facts | ~55% |
| 2 | multi-hop | Three or more connections required | ~40% |
> Multi-hop ratio (score ≥ 1): **~95% of the dataset**
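
The shares in the table above can be recomputed from the `difficulty_score` field. The following is a minimal sketch that works on any iterable of scores; the helper names `difficulty_shares` and `multi_hop_ratio` are illustrative and not part of the dataset tooling.

```python
from collections import Counter
from typing import Iterable


def difficulty_shares(scores: Iterable[int]) -> dict[int, float]:
    """Map each difficulty score to its fraction of all samples."""
    counts = Counter(scores)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {score: counts[score] / total for score in sorted(counts)}


def multi_hop_ratio(scores: Iterable[int]) -> float:
    """Fraction of samples with difficulty score >= 1 (two-hop or harder)."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(1 for score in scores if score >= 1) / len(scores)
```

With the dataset loaded, the scores could be supplied as `(row["difficulty_score"] for row in dataset["train"])`.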
### Answer Type Distribution
| Type | Description | Example |
|---------|--------------------------------|----------------|
| entity | Named person, place, or thing | "Paris" |
| phrase | Short descriptive phrase | "Atlantic Ocean"|
| numeric | Number or quantity | "1867" |
| date | Year or date | "1905" |
| boolean | Yes/No answer | "yes" |
---
## Data Fields
Each sample contains:
```python
{
"question": str, # Multi-hop question requiring 2+ facts
"answer": str, # Short answer (1-6 words)
"hops": int, # Number of reasoning hops (always >= 2)
"difficulty_score": int, # 0=single-hop, 1=two-hop, 2=multi-hop
"answer_type": str, # entity | phrase | numeric | date | boolean
"source": str, # Wikipedia chunk file the paragraph came from
}
```
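
A record can be checked against this schema before training. The sketch below is one possible validator, assuming the field names and value ranges listed above; `validate_pair` is a hypothetical helper, not something shipped with the dataset.

```python
# Expected Python type for each field, following the schema above
FIELDS = {
    "question": str,
    "answer": str,
    "hops": int,
    "difficulty_score": int,
    "answer_type": str,
    "source": str,
}
ANSWER_TYPES = {"entity", "phrase", "numeric", "date", "boolean"}


def validate_pair(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems = []
    for field, expected in FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if problems:
        return problems  # value checks below assume well-typed fields
    if record["difficulty_score"] not in (0, 1, 2):
        problems.append(f"difficulty_score out of range: {record['difficulty_score']}")
    if record["answer_type"] not in ANSWER_TYPES:
        problems.append(f"unknown answer_type: {record['answer_type']}")
    return problems
```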
### Example Samples
```json
[
  {
    "question": "In which city did the physicist born in Warsaw conduct her Nobel Prize-winning research?",
    "answer": "Paris",
    "hops": 2,
    "difficulty_score": 2,
    "answer_type": "entity",
    "source": "chunk_0042.jsonl"
  },
  {
    "question": "What element was discovered by the scientist who also founded the Radium Institute in France?",
    "answer": "Polonium",
    "hops": 2,
    "difficulty_score": 2,
    "answer_type": "entity",
    "source": "chunk_0042.jsonl"
  },
  {
    "question": "Which country's capital city was home to the university where the Nobel Prize winner who was born in 1867 studied?",
    "answer": "France",
    "hops": 2,
    "difficulty_score": 2,
    "answer_type": "entity",
    "source": "chunk_0089.jsonl"
  }
]
```
---
## Generation Pipeline
### Source Corpus
- **500,000 English Wikipedia articles** streamed from `wikimedia/wikipedia` (20231101.en)
- Split into **3,308,934 paragraphs** (30–200 words each)
- Stored as 331 JSONL chunk files
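
The paragraph splitting and chunking step can be sketched as follows. This is not the pipeline's actual code: it assumes paragraphs are separated by blank lines, that paragraphs outside the 30–200 word range are skipped rather than truncated, that each chunk file holds about 10,000 paragraphs (inferred from 3,308,934 paragraphs across 331 files), and that each JSONL row uses a hypothetical `text` key.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

MIN_WORDS, MAX_WORDS = 30, 200   # paragraph length bounds stated above
PARAGRAPHS_PER_CHUNK = 10_000    # assumption: 3,308,934 / 331 is roughly 10,000


def split_paragraphs(article_text: str) -> Iterator[str]:
    """Yield paragraphs of 30-200 words; shorter or longer ones are skipped."""
    for block in article_text.split("\n\n"):  # assumption: blank lines separate paragraphs
        paragraph = " ".join(block.split())   # collapse internal whitespace
        if MIN_WORDS <= len(paragraph.split()) <= MAX_WORDS:
            yield paragraph


def write_chunks(paragraphs: Iterable[str], out_dir: Path,
                 per_chunk: int = PARAGRAPHS_PER_CHUNK) -> int:
    """Write paragraphs to chunk_0000.jsonl, chunk_0001.jsonl, ...; return the file count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    batch, file_count = [], 0

    def flush() -> None:
        nonlocal batch, file_count
        path = out_dir / f"chunk_{file_count:04d}.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for text in batch:
                handle.write(json.dumps({"text": text}) + "\n")  # "text" key is hypothetical
        batch, file_count = [], file_count + 1

    for paragraph in paragraphs:
        batch.append(paragraph)
        if len(batch) == per_chunk:
            flush()
    if batch:  # write the final, partially filled chunk
        flush()
    return file_count
```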
### Generation Model
- **Gemini 2.5 Flash-Lite** via Google AI API
- 10 concurrent workers
- ~21 pairs/second sustained throughput
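
One way to run 10 concurrent workers is a thread pool, sketched below. The card does not say whether the pipeline uses threads, processes, or asyncio, so the thread pool is an assumption, and `generate_pairs` is a hypothetical placeholder standing in for the real model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

NUM_WORKERS = 10  # worker count stated above


def generate_pairs(paragraph: str) -> list[dict]:
    """Hypothetical placeholder for the model call; returns no pairs."""
    return []


def run_generation(paragraphs: list[str],
                   generate: Callable[[str], list[dict]] = generate_pairs) -> list[dict]:
    """Fan paragraphs out over a thread pool and flatten the per-paragraph results."""
    pairs: list[dict] = []
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        # pool.map yields results in input order, regardless of completion order
        for result in pool.map(generate, paragraphs):
            pairs.extend(result)
    return pairs
```

Threads suit this workload because each worker spends most of its time waiting on network I/O rather than computing.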
### Prompt Design
The generation prompt was carefully engineered to force multi-hop output:
```
STRICT RULES:
1. Each question MUST require connecting 2+ facts from the passage
2. Single-fact questions ("Where was X born?") are REJECTED
3. Answers: 1-6 words maximum
4. Output ONLY a JSON array
GOOD: connects 2 facts
BAD: only 1 fact needed — REJECTED
```
The good/bad example in the prompt was critical — without it, the model generated mostly single-hop questions despite the instructions.
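
Because the prompt asks for a JSON array only, the response still has to be parsed defensively. The sketch below shows one way to do that; `parse_pairs` is a hypothetical helper, and the assumption that each array item carries `question` and `answer` keys follows the dataset fields rather than the full prompt, which is not reproduced here.

```python
import json


def parse_pairs(raw: str) -> list[dict]:
    """Extract question/answer dicts from a response expected to be a JSON array."""
    # Tolerate text around the array by slicing from the first "[" to the last "]"
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [item for item in data
            if isinstance(item, dict) and "question" in item and "answer" in item]
```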
### Quality Filtering
Every generated pair passed through 8 filters before being kept:
| Filter | What it removes |
|--------|----------------|
| `single_hop` | hops < 2 |
| `question_too_short` | fewer than 6 words |
| `no_question_mark` | missing ? |
| `bad_question_start` | doesn't start with question word |
| `answer_too_vague` | "yes", "no", "it", "they" |
| `answer_too_long` | more than 15 words |
| `answer_in_question` | trivial lookup |
| `question_has_artifact` | generation artifacts |
Pairs were then near-deduplicated, treating a Jaccard similarity above 0.85 as a duplicate.
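
The filters and the deduplication step can be approximated as below. This is a sketch under stated assumptions, not the pipeline's code: the question-word and preposition lists, the artifact pattern, the order of the checks, and the choice to compute Jaccard similarity over lowercase question word sets are all illustrative guesses. Only the filter names and thresholds come from the table above.

```python
import re
from typing import Optional

QUESTION_WORDS = {"what", "which", "who", "whom", "whose", "where", "when", "why", "how"}
PREPOSITIONS = {"in", "on", "at", "for", "to", "by", "from", "of", "with", "during"}
VAGUE_ANSWERS = {"yes", "no", "it", "they"}
ARTIFACT_RE = re.compile(r"`|[{}\[\]]|^\s*[QA]\s*:")  # assumption: typical generation debris


def starts_like_question(words: list[str]) -> bool:
    """Accept a question word, optionally preceded by a preposition ('In which ...')."""
    if words and words[0] in PREPOSITIONS:
        words = words[1:]
    return bool(words) and words[0] in QUESTION_WORDS


def rejection_reason(pair: dict) -> Optional[str]:
    """Return the name of the first filter the pair fails, or None if it passes."""
    question = str(pair.get("question", "")).strip()
    answer = str(pair.get("answer", "")).strip()
    q_words = question.lower().split()
    if pair.get("hops", 0) < 2:
        return "single_hop"
    if len(q_words) < 6:
        return "question_too_short"
    if "?" not in question:
        return "no_question_mark"
    if not starts_like_question(q_words):
        return "bad_question_start"
    if not answer or answer.lower() in VAGUE_ANSWERS:
        return "answer_too_vague"
    if len(answer.split()) > 15:
        return "answer_too_long"
    if answer.lower() in question.lower():
        return "answer_in_question"
    if ARTIFACT_RE.search(question):
        return "question_has_artifact"
    return None


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def near_dedup(pairs: list[dict], threshold: float = 0.85) -> list[dict]:
    """Keep a pair only if no already kept question exceeds the similarity threshold."""
    kept: list[dict] = []
    for pair in pairs:
        if all(jaccard(pair["question"], k["question"]) <= threshold for k in kept):
            kept.append(pair)
    return kept
```

The pairwise comparison in `near_dedup` is quadratic in the number of pairs, so it only illustrates the criterion. At the scale of a million raw pairs, an approximate method such as MinHash with locality-sensitive hashing would be the more practical choice.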
### Generation Stats
```
Raw pairs generated: 1,050,000
After quality filter: ~925,000 (88%)
After difficulty filter: ~504,000 (48% overall retention)
After deduplication: ~500,000
Total API cost: ~$35
Generation time: ~18 hours
```
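
The retention percentages follow directly from the counts above. The short calculation below reproduces them from the approximate figures in the stats block; the variable and function names are illustrative.

```python
RAW_PAIRS = 1_050_000  # raw pairs generated

# Approximate counts remaining after each stage, as reported above
STAGES = {
    "quality filter": 925_000,
    "difficulty filter": 504_000,
    "deduplication": 500_000,
}


def retention(count: int, raw: int = RAW_PAIRS) -> float:
    """Fraction of the raw pairs still present after a stage."""
    return count / raw


for stage, count in STAGES.items():
    print(f"after {stage}: {count:,} pairs, {retention(count):.0%} of raw")

# Cost per 1,000 retained pairs, using the approximate $35 total
cost_per_thousand = 35 / (STAGES["deduplication"] / 1_000)
```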
---
## How to Use
### Basic Loading
```python
from datasets import load_dataset
dataset = load_dataset("rajat5039/wiki-multihop-qa-500k")
# Access splits
train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]
print(train[0])
# {
# "question": "In which city did the physicist born in Warsaw...",
# "answer": "Paris",
# "hops": 2,
# "difficulty_score": 2,
# "answer_type": "entity",
# "source": "chunk_0042.jsonl"
# }
```
### Filter by Difficulty
```python
# Only hardest multi-hop (score=2)
hard = dataset["train"].filter(lambda x: x["difficulty_score"] == 2)
# Only two-hop (score=1)
medium = dataset["train"].filter(lambda x: x["difficulty_score"] == 1)
```
### Filter by Answer Type
```python
# Only entity answers
entities = dataset["train"].filter(lambda x: x["answer_type"] == "entity")
# Only numeric
numeric = dataset["train"].filter(lambda x: x["answer_type"] == "numeric")
```
### Combine with Public Datasets
```python
from datasets import load_dataset, concatenate_datasets
wiki_mhop = load_dataset("rajat5039/wiki-multihop-qa-500k", split="train")
hotpotqa = load_dataset("hotpot_qa", "distractor", split="train")
# concatenate_datasets needs matching features, so keep only the shared columns
shared = ["question", "answer"]
combined = concatenate_datasets([
    wiki_mhop.select_columns(shared),
    hotpotqa.select_columns(shared),
])
```
---
## Intended Use
### Primary Use — think-in-silence Training
This dataset was built specifically to train the ThoughtModule in [think-in-silence](https://github.com/rajatmalik/think-in-silence) — a latent reasoning model that performs K recurrent cross-attention steps in a 256-dimensional space.
Multi-hop questions are essential because:
- Single-hop questions teach the model K=1 is enough
- Multi-hop questions force K > 1 thinking steps
- The K-scaling property only emerges with sufficient multi-hop training signal
### Other Suitable Uses
- Training retrieval-augmented generation (RAG) systems
- Fine-tuning LLMs for multi-step reasoning
- Evaluating question answering systems
- Research on chain-of-thought and reasoning
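
For evaluating question answering systems, short answers of this kind are commonly scored with exact match after light normalization. The sketch below follows the SQuAD-style convention (lowercase, strip punctuation and articles); that convention is an assumption here, not a metric the dataset prescribes.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference agree after normalization."""
    return normalize(prediction) == normalize(reference)


def em_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```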
### Not Suitable For
- Factual question answering benchmarks (answers are synthetic, may contain errors)
- Tasks requiring long-form answers
- Non-English tasks
---
## Limitations
**Answer accuracy is not guaranteed.**
This is a synthetically generated dataset. Gemini 2.5 Flash-Lite may occasionally generate incorrect answers or misattribute facts. For research purposes, the reasoning structure (multi-hop) matters more than factual accuracy.
**Wikipedia coverage.**
All questions are grounded in Wikipedia (November 2023 snapshot). Topics not well-covered in Wikipedia are underrepresented.
**English only.**
The source corpus is English Wikipedia. All questions and answers are in English.
**Retention rate.**
~48% of generated pairs passed all filters. The majority of rejections were single-hop questions that slipped through despite the prompt instructions. This means the remaining dataset is high-confidence multi-hop.
---
## Comparison to Related Datasets
| Dataset | Size | Multi-hop | Synthetic | Free |
|----------------|--------|-----------|-----------|------|
| HotpotQA | 113K | ✓ | ✗ | ✓ |
| StrategyQA | 2.8K | ✓ | ✗ | ✓ |
| MuSiQue | 20K | ✓ | ✗ | ✓ |
| **wiki-multihop-qa-500k** | **500K** | **✓** | **✓** | **✓** |
The key advantage is scale: 500K multi-hop pairs, versus 113K for HotpotQA, the largest of the alternatives listed above.
---
## Generation Code
Full pipeline code is open source:
```
https://github.com/rajatmalik/think-in-silence-data
```
To reproduce this dataset:
```bash
git clone https://github.com/rajatmalik/think-in-silence-data
cd think-in-silence-data
pip install -r requirements.txt
export GOOGLE_API_KEY=your_key
python run.py --yes
```
---
## Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{malik2026wikimultihop,
author = {Malik, Rajat},
title = {wiki-multihop-qa-500k: Synthetic Multi-Hop QA from Wikipedia},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/rajat5039/wiki-multihop-qa-500k},
note = {Generated using Gemini 2.5 Flash-Lite from 500K Wikipedia articles}
}
```
---
## Related Work
- **think-in-silence** — The model this dataset was built for: [github.com/rajatmalik/think-in-silence](https://github.com/rajatmalik/think-in-silence)
- **I-JEPA** (Assran et al., 2023) — JEPA training objective this project extends to language
- **Coconut** (Hao et al., 2024) — Related latent reasoning approach
- **HotpotQA** (Yang et al., 2018) — Original multi-hop QA dataset
---
## License
[Creative Commons Attribution 4.0 (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
You are free to use, share, and adapt this dataset for any purpose, including commercial use, as long as you give appropriate credit.
---
*Built by Rajat Malik · 2026 · Part of the think-in-silence research project*
Provider: rajat5039