Shekswess/tlm-dolma-3-ablation
Hugging Face, updated 2025-12-09, indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Shekswess/tlm-dolma-3-ablation
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 1916695462
    num_examples: 1526965
  download_size: 917434100
  dataset_size: 1916695462
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dolma-3 Ablation Dataset (Reasoning-Focused)
This dataset is a **custom ablation subset of Dolma 3**, built from
`allenai/dolma3_dolmino_pool` (English only), and reduced to **~50% of each source** to enable lightweight ablation experiments for tiny-model training (e.g., 50M–200M parameter models).
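
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The helper names `load_ablation_split` and `preview` are illustrative and not part of the dataset. The column names come from the card's metadata (`text`, `source`, `language`, `num_tokens`).

```python
REPO_ID = "Shekswess/tlm-dolma-3-ablation"


def load_ablation_split(streaming: bool = True):
    """Return the `train` split of the dataset.

    Streaming reads rows lazily instead of downloading every file first.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)


def preview(n: int = 3) -> None:
    """Print source, language, token count, and the first 80 characters of n rows."""
    for row in load_ablation_split(streaming=True).take(n):
        print(row["source"], row["language"], row["num_tokens"], row["text"][:80])
```

Calling `preview()` requires network access. Pass `streaming=False` to `load_ablation_split` to materialize the split locally instead.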
The mix intentionally covers **three categories of reasoning datasets**:
### 🟦 1. Text Reasoning (QA, RC, Instruct, Chains-of-Thought)
`reddit_to_flashcards`, `wiki_to_rcqa`, `dolmino_1-flan`, `tulu-3-sft`, `nemotron-synth-qa`, `verifiable-o4mini`, `omr-rewrite-fullthoughts`, `gemini-reasoning-traces`, `qwq-reasoning-traces`, `r1-reasoning-traces`
### 🟥 2. Math Reasoning
`tinymath-mind`, `tinymath-pot`, `math-meta-reasoning`, `cranemath`
### 🟩 3. Code Reasoning
`cranecode`, `code-meta-reasoning`
All samples include **natural-language reasoning traces** or **deep structured reasoning** (math steps, code reasoning, RCQA chains, synthetic CoT).
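
The three category lists above can be written as a lookup table, which makes it easy to filter rows by category. This is a sketch, not the author's tooling: the `category_of` helper is hypothetical, and folding shard suffixes such as `-part1` into the base source name is an assumption. Sources that appear in the statistics tables but not in the category lists (`general_reasoning_mix`, `verifiable-gpt41`) are deliberately left unmapped.

```python
import re
from typing import Optional

# Source names copied from the three category lists in this card.
SOURCES_BY_CATEGORY = {
    "text": [
        "reddit_to_flashcards", "wiki_to_rcqa", "dolmino_1-flan", "tulu-3-sft",
        "nemotron-synth-qa", "verifiable-o4mini", "omr-rewrite-fullthoughts",
        "gemini-reasoning-traces", "qwq-reasoning-traces", "r1-reasoning-traces",
    ],
    "math": ["tinymath-mind", "tinymath-pot", "math-meta-reasoning", "cranemath"],
    "code": ["cranecode", "code-meta-reasoning"],
}

# Inverted index: source name -> category.
CATEGORY_BY_SOURCE = {
    source: category
    for category, sources in SOURCES_BY_CATEGORY.items()
    for source in sources
}


def category_of(source: str) -> Optional[str]:
    """Return "text", "math", or "code", or None for a source not in the lists."""
    # Assumption: a trailing "-partN" marks a shard of the same base source.
    base = re.sub(r"-part\d+$", "", source)
    return CATEGORY_BY_SOURCE.get(base)
```

For example, `category_of("wiki_to_rcqa-part2")` resolves to the text category, while `category_of("general_reasoning_mix")` returns `None`.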
---
## 📊 Dataset Statistics
- **Total rows:** 1,526,965
- **Total tokens (each sample capped at 2,048 tokens):** 482,406,394
- **Ablation mode:** *at most 50% of the rows from each source are retained*
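
The card does not describe how the retained rows were chosen. The sketch below shows one way such a per-source cap could be applied, assuming uniform random sampling with a fixed seed. The function name, the `fraction` and `seed` parameters, and the sampling strategy are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def subsample_per_source(rows, fraction=0.5, seed=0):
    """Keep at most `fraction` of the rows from each source.

    `rows` is an iterable of dicts that each carry a "source" key.
    """
    rng = random.Random(seed)

    # Group rows by their source label.
    by_source = defaultdict(list)
    for row in rows:
        by_source[row["source"]].append(row)

    kept = []
    for group in by_source.values():
        # Flooring guarantees the retained share never exceeds `fraction`.
        k = int(len(group) * fraction)
        kept.extend(rng.sample(group, k))
    return kept
```

Because the count is floored, a source with an odd number of rows keeps slightly less than half, which is consistent with a "max 50%" rule.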
---
### Source Distribution (Rows)
| Source | Rows | % Share |
|-------------------------------|-----------|---------|
| reddit_to_flashcards | 689,664 | 45.17% |
| wiki_to_rcqa-part1 | 125,440 | 8.21% |
| wiki_to_rcqa-part2 | 125,440 | 8.21% |
| wiki_to_rcqa-part3 | 120,501 | 7.89% |
| dolmino_1-flan | 84,992 | 5.57% |
| tinymath-mind | 63,488 | 4.16% |
| cranecode | 47,616 | 3.12% |
| nemotron-synth-qa | 39,936 | 2.62% |
| tinymath-pot | 30,720 | 2.01% |
| tulu-3-sft | 30,720 | 2.01% |
| verifiable-o4mini | 30,208 | 1.98% |
| math-meta-reasoning | 28,160 | 1.84% |
| cranemath | 25,088 | 1.64% |
| general_reasoning_mix | 25,088 | 1.64% |
| code-meta-reasoning | 22,016 | 1.44% |
| verifiable-gpt41 | 15,360 | 1.01% |
| omr-rewrite-fullthoughts | 7,168 | 0.47% |
| gemini-reasoning-traces | 5,120 | 0.34% |
| qwq-reasoning-traces | 5,120 | 0.34% |
| r1-reasoning-traces | 5,120 | 0.34% |
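
The percentage column can be recomputed from the row counts. The sketch below copies the counts from the table above; the `share` helper is illustrative.

```python
# Row counts copied from the "Source Distribution (Rows)" table.
ROWS = {
    "reddit_to_flashcards": 689_664,
    "wiki_to_rcqa-part1": 125_440,
    "wiki_to_rcqa-part2": 125_440,
    "wiki_to_rcqa-part3": 120_501,
    "dolmino_1-flan": 84_992,
    "tinymath-mind": 63_488,
    "cranecode": 47_616,
    "nemotron-synth-qa": 39_936,
    "tinymath-pot": 30_720,
    "tulu-3-sft": 30_720,
    "verifiable-o4mini": 30_208,
    "math-meta-reasoning": 28_160,
    "cranemath": 25_088,
    "general_reasoning_mix": 25_088,
    "code-meta-reasoning": 22_016,
    "verifiable-gpt41": 15_360,
    "omr-rewrite-fullthoughts": 7_168,
    "gemini-reasoning-traces": 5_120,
    "qwq-reasoning-traces": 5_120,
    "r1-reasoning-traces": 5_120,
}

TOTAL_ROWS = sum(ROWS.values())


def share(*sources: str) -> float:
    """Combined percentage of all rows held by the given sources."""
    return round(100 * sum(ROWS[s] for s in sources) / TOTAL_ROWS, 2)


# The per-source counts add up to the total reported under Dataset Statistics.
assert TOTAL_ROWS == 1_526_965
```

Passing several names combines them, so `share("wiki_to_rcqa-part1", "wiki_to_rcqa-part2", "wiki_to_rcqa-part3")` gives the share of `wiki_to_rcqa` as a whole.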
---
### Token Distribution (by Source)
| Source | Tokens | % Share |
|-----------------------------|---------------|---------|
| math-meta-reasoning | 48,678,876 | 10.09% |
| cranecode | 43,386,010 | 8.99% |
| reddit_to_flashcards | 42,275,168 | 8.76% |
| tinymath-mind | 40,143,670 | 8.32% |
| code-meta-reasoning | 37,532,557 | 7.78% |
| dolmino_1-flan | 33,567,740 | 6.96% |
| general_reasoning_mix | 31,005,178 | 6.43% |
| wiki_to_rcqa-part1 | 24,388,916 | 5.06% |
| wiki_to_rcqa-part2 | 24,246,196 | 5.03% |
| wiki_to_rcqa-part3 | 22,597,781 | 4.68% |
| verifiable-gpt41 | 21,290,847 | 4.41% |
| nemotron-synth-qa | 19,074,306 | 3.95% |
| cranemath | 18,710,927 | 3.88% |
| omr-rewrite-fullthoughts | 12,825,044 | 2.66% |
| verifiable-o4mini | 12,653,588 | 2.62% |
| tulu-3-sft | 12,595,916 | 2.61% |
| qwq-reasoning-traces | 10,481,299 | 2.17% |
| tinymath-pot | 10,222,536 | 2.12% |
| gemini-reasoning-traces | 9,980,143 | 2.07% |
| r1-reasoning-traces | 6,749,696 | 1.40% |
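
Row share and token share differ because sample length varies widely between sources. Dividing tokens by rows gives the mean sample length per source. The sketch below does this for a handful of sources, with values copied from the two tables; the selection is arbitrary.

```python
# (rows, tokens) pairs copied from the row and token distribution tables.
STATS = {
    "reddit_to_flashcards": (689_664, 42_275_168),
    "wiki_to_rcqa-part1": (125_440, 24_388_916),
    "math-meta-reasoning": (28_160, 48_678_876),
    "qwq-reasoning-traces": (5_120, 10_481_299),
    "r1-reasoning-traces": (5_120, 6_749_696),
}

# Mean number of tokens per row for each source.
MEAN_TOKENS = {source: tokens / rows for source, (rows, tokens) in STATS.items()}

# Print the sources from shortest to longest average sample.
for source, mean in sorted(MEAN_TOKENS.items(), key=lambda item: item[1]):
    print(f"{source:<24} {mean:8.1f} tokens/row")
```

`reddit_to_flashcards` dominates by row count but averages only about 61 tokens per row. The reasoning-trace sources are far longer, and `qwq-reasoning-traces` averages just under 2,048 tokens, which fits a per-sample cap at that length.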
---
### Token Distribution (by Type of Dataset)
| Type | Tokens | Percentage |
|------|----------------|-------------|
| Text | 270,437,495 | 56.06% |
| Math | 125,772,009 | 26.07% |
| Code | 86,196,890 | 17.87% |
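
The per-type totals add up to the overall token count. They are larger, though, than the sums of the sources named in the Math and Code category lists. The sketch below computes that remainder from the figures in this card. The card does not say where the remainder comes from; one possibility, stated here only as an assumption, is that rows from sources absent from the category lists (`general_reasoning_mix`, `verifiable-gpt41`) are assigned to types individually.

```python
# Totals copied from the "Token Distribution (by Type of Dataset)" table.
TYPE_TOTALS = {"Text": 270_437_495, "Math": 125_772_009, "Code": 86_196_890}
TOTAL_TOKENS = 482_406_394

# Token counts of the sources named in the Math and Code category lists.
LISTED_TOKENS = {
    "Math": {
        "tinymath-mind": 40_143_670,
        "tinymath-pot": 10_222_536,
        "math-meta-reasoning": 48_678_876,
        "cranemath": 18_710_927,
    },
    "Code": {
        "cranecode": 43_386_010,
        "code-meta-reasoning": 37_532_557,
    },
}

# The three type totals account for every token in the dataset.
assert sum(TYPE_TOTALS.values()) == TOTAL_TOKENS

# Tokens of each type that the listed sources do not account for.
UNATTRIBUTED = {
    kind: TYPE_TOTALS[kind] - sum(sources.values())
    for kind, sources in LISTED_TOKENS.items()
}
print(UNATTRIBUTED)
```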
Provider:
Shekswess



