
Shekswess/tlm-dolma-3-ablation

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Shekswess/tlm-dolma-3-ablation
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 1916695462
    num_examples: 1526965
  download_size: 917434100
  dataset_size: 1916695462
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dolma-3 Ablation Dataset (Reasoning-Focused)

This dataset is a **custom ablation subset of Dolma 3**, built from `allenai/dolma3_dolmino_pool` (English only) and reduced to **~50% of each source** to enable lightweight ablation experiments for tiny-model training (e.g., 50M–200M parameter models).

The mix intentionally covers **three categories of reasoning datasets**:

### 🟦 1. Text Reasoning (QA, RC, Instruct, Chains-of-Thought)

`reddit_to_flashcards`, `wiki_to_rcqa`, `dolmino_1-flan`, `tulu-3-sft`, `nemotron-synth-qa`, `verifiable-o4mini`, `omr-rewrite-fullthoughts`, `gemini-reasoning-traces`, `qwq-reasoning-traces`, `r1-reasoning-traces`

### 🟥 2. Math Reasoning

`tinymath-mind`, `tinymath-pot`, `math-meta-reasoning`, `cranemath`

### 🟩 3. Code Reasoning

`cranecode`, `code-meta-reasoning`

All samples include **natural-language reasoning traces** or **deep structured reasoning** (math steps, code reasoning, RCQA chains, synthetic CoT).

---

## 📊 Dataset Statistics

- **Total rows:** 1,526,965
- **Total tokens (capped at 2048):** 482,406,394
- **Ablation mode:** *max 50% rows per source retained*

---

### Source Distribution (Rows)

| Source                   | Rows    | % Share |
|--------------------------|---------|---------|
| reddit_to_flashcards     | 689,664 | 45.17%  |
| wiki_to_rcqa-part1       | 125,440 | 8.21%   |
| wiki_to_rcqa-part2       | 125,440 | 8.21%   |
| wiki_to_rcqa-part3       | 120,501 | 7.89%   |
| dolmino_1-flan           | 84,992  | 5.57%   |
| tinymath-mind            | 63,488  | 4.16%   |
| cranecode                | 47,616  | 3.12%   |
| nemotron-synth-qa        | 39,936  | 2.62%   |
| tinymath-pot             | 30,720  | 2.01%   |
| tulu-3-sft               | 30,720  | 2.01%   |
| verifiable-o4mini        | 30,208  | 1.98%   |
| math-meta-reasoning      | 28,160  | 1.84%   |
| cranemath                | 25,088  | 1.64%   |
| general_reasoning_mix    | 25,088  | 1.64%   |
| code-meta-reasoning      | 22,016  | 1.44%   |
| verifiable-gpt41         | 15,360  | 1.01%   |
| omr-rewrite-fullthoughts | 7,168   | 0.47%   |
| gemini-reasoning-traces  | 5,120   | 0.34%   |
| qwq-reasoning-traces     | 5,120   | 0.34%   |
| r1-reasoning-traces      | 5,120   | 0.34%   |

---

### Token Distribution (by Source)

| Source                   | Tokens     | % Share |
|--------------------------|------------|---------|
| math-meta-reasoning      | 48,678,876 | 10.09%  |
| cranecode                | 43,386,010 | 8.99%   |
| reddit_to_flashcards     | 42,275,168 | 8.76%   |
| tinymath-mind            | 40,143,670 | 8.32%   |
| code-meta-reasoning      | 37,532,557 | 7.78%   |
| dolmino_1-flan           | 33,567,740 | 6.96%   |
| general_reasoning_mix    | 31,005,178 | 6.43%   |
| wiki_to_rcqa-part1       | 24,388,916 | 5.06%   |
| wiki_to_rcqa-part2       | 24,246,196 | 5.03%   |
| wiki_to_rcqa-part3       | 22,597,781 | 4.68%   |
| verifiable-gpt41         | 21,290,847 | 4.41%   |
| nemotron-synth-qa        | 19,074,306 | 3.95%   |
| cranemath                | 18,710,927 | 3.88%   |
| omr-rewrite-fullthoughts | 12,825,044 | 2.66%   |
| verifiable-o4mini        | 12,653,588 | 2.62%   |
| tulu-3-sft               | 12,595,916 | 2.61%   |
| qwq-reasoning-traces     | 10,481,299 | 2.17%   |
| tinymath-pot             | 10,222,536 | 2.12%   |
| gemini-reasoning-traces  | 9,980,143  | 2.07%   |
| r1-reasoning-traces      | 6,749,696  | 1.40%   |

---

### Token Distribution (by Type of Dataset)

| Type | Tokens      | Percentage |
|------|-------------|------------|
| Text | 270,437,495 | 56.06%     |
| Math | 125,772,009 | 26.07%     |
| Code | 86,196,890  | 17.87%     |
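
The token tables above can be recomputed from the `source` and `num_tokens` columns listed in the card's schema. The following is a minimal sketch, not the author's own tooling: the helper names (`stream_train_split`, `tokens_by_source`, `shares`) are hypothetical, and the loader assumes the Hugging Face `datasets` package is installed and the Hub is reachable.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

REPO_ID = "Shekswess/tlm-dolma-3-ablation"  # repository name as given on this page


def stream_train_split() -> Iterator[dict]:
    """Stream the train split row by row instead of downloading it in full.

    Assumption: the `datasets` package is installed and the Hub is reachable.
    """
    from datasets import load_dataset  # lazy import so the helpers below work offline

    return iter(load_dataset(REPO_ID, split="train", streaming=True))


def tokens_by_source(rows: Iterable[dict]) -> Dict[str, int]:
    """Sum the `num_tokens` column per value of the `source` column."""
    totals: Counter = Counter()
    for row in rows:
        totals[row["source"]] += row["num_tokens"]
    return dict(totals)


def shares(totals: Dict[str, int]) -> Dict[str, float]:
    """Convert absolute counts into percentage shares, rounded to two decimals."""
    grand_total = sum(totals.values())
    return {name: round(100 * count / grand_total, 2) for name, count in totals.items()}


# Type-level token counts copied from the "by Type of Dataset" table.
TYPE_TOKENS = {"Text": 270_437_495, "Math": 125_772_009, "Code": 86_196_890}
TYPE_SHARES = shares(TYPE_TOKENS)
```

Calling `shares(tokens_by_source(stream_train_split()))` would produce per-source shares comparable to the "Token Distribution (by Source)" table, provided the hosted data still matches the card. The three type-level counts add up to the stated total of 482,406,394 tokens, and `TYPE_SHARES` gives the same 56.06%, 26.07%, and 17.87% shown in the last table.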
Provider:
Shekswess