tlm-dolma-3-ablation
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/Shekswess/tlm-dolma-3-ablation
Resource description:
## Dataset Description
Dolma 3 subset built from allenai/dolma3_dolmino_pool (English only).
- Total rows: 305,393
- Total tokens (capped at 2048): 97,035,546
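
The card does not say how the 2048 cap is applied. One plausible reading is that each row is truncated to at most 2048 tokens before counting. The sketch below illustrates that reading only: `capped_token_count` and `total_capped_tokens` are hypothetical helper names, and whitespace splitting stands in for the real tokenizer, which the card does not name.

```python
MAX_TOKENS = 2048  # cap stated in the card


def capped_token_count(tokens, cap=MAX_TOKENS):
    """Tokens a single row contributes once it is truncated to `cap`."""
    return min(len(tokens), cap)


def total_capped_tokens(texts, tokenize, cap=MAX_TOKENS):
    """Sum of per-row capped token counts over an iterable of texts."""
    return sum(capped_token_count(tokenize(text), cap) for text in texts)


# Assumption: str.split is only a stand-in for the dataset's tokenizer.
sample_texts = ["a short row", "word " * 3000]
total = total_capped_tokens(sample_texts, str.split)
print(total)  # 3 tokens + 2048 (3000 truncated to the cap) = 2051
```

Under this reading, no source can average more than 2048 tokens per row.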
Source shares (percentage of total rows, with row counts in parentheses):
- reddit_to_flashcards: 45.27% (138,240)
- wiki_to_rcqa-part1: 8.21% (25,088)
- wiki_to_rcqa-part2: 8.21% (25,088)
- wiki_to_rcqa-part3: 7.29% (22,257)
- dolmino_1-flan: 5.70% (17,408)
- tinymath-mind: 4.19% (12,800)
- cranecode: 3.19% (9,728)
- nemotron-synth-qa: 2.68% (8,192)
- tinymath-pot: 2.01% (6,144)
- tulu-3-sft: 2.01% (6,144)
- verifiable-o4mini: 2.01% (6,144)
- math-meta-reasoning: 1.84% (5,632)
- cranemath: 1.68% (5,120)
- general_reasoning_mix: 1.68% (5,120)
- code-meta-reasoning: 1.51% (4,608)
- verifiable-gpt41: 1.01% (3,072)
- omr-rewrite-fullthoughts: 0.50% (1,536)
- gemini-reasoning-traces: 0.34% (1,024)
- qwq-reasoning-traces: 0.34% (1,024)
- r1-reasoning-traces: 0.34% (1,024)
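
The percentages above are each source's row count divided by the total row count. The following is a minimal sketch that recomputes them from the listed counts. The `share` helper is a hypothetical name, and rounding to two decimals is an assumption about how the card's figures were produced.

```python
# Row counts per source, copied from the list above.
rows = {
    "reddit_to_flashcards": 138_240,
    "wiki_to_rcqa-part1": 25_088,
    "wiki_to_rcqa-part2": 25_088,
    "wiki_to_rcqa-part3": 22_257,
    "dolmino_1-flan": 17_408,
    "tinymath-mind": 12_800,
    "cranecode": 9_728,
    "nemotron-synth-qa": 8_192,
    "tinymath-pot": 6_144,
    "tulu-3-sft": 6_144,
    "verifiable-o4mini": 6_144,
    "math-meta-reasoning": 5_632,
    "cranemath": 5_120,
    "general_reasoning_mix": 5_120,
    "code-meta-reasoning": 4_608,
    "verifiable-gpt41": 3_072,
    "omr-rewrite-fullthoughts": 1_536,
    "gemini-reasoning-traces": 1_024,
    "qwq-reasoning-traces": 1_024,
    "r1-reasoning-traces": 1_024,
}

total_rows = sum(rows.values())


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to 2 decimals."""
    return round(100 * count / total, 2)


shares = {name: share(count, total_rows) for name, count in rows.items()}

print(f"Total rows: {total_rows:,}")
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}% ({rows[name]:,})")
```

The per-source counts sum to the stated total of 305,393 rows.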
Tokens by source:
- reddit_to_flashcards: 8,472,334 tokens
- wiki_to_rcqa-part1: 4,831,455 tokens
- wiki_to_rcqa-part2: 4,921,509 tokens
- wiki_to_rcqa-part3: 4,206,844 tokens
- dolmino_1-flan: 7,078,060 tokens
- tinymath-mind: 8,123,976 tokens
- cranecode: 8,894,067 tokens
- nemotron-synth-qa: 3,953,873 tokens
- tinymath-pot: 2,240,612 tokens
- tulu-3-sft: 2,175,610 tokens
- verifiable-o4mini: 2,576,286 tokens
- math-meta-reasoning: 8,756,698 tokens
- cranemath: 3,852,095 tokens
- general_reasoning_mix: 6,749,696 tokens
- code-meta-reasoning: 7,600,575 tokens
- verifiable-gpt41: 4,292,100 tokens
- omr-rewrite-fullthoughts: 2,741,410 tokens
- gemini-reasoning-traces: 2,003,442 tokens
- qwq-reasoning-traces: 2,097,152 tokens
- r1-reasoning-traces: 1,467,752 tokens
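
Row share and token share differ widely across sources, so the mean tokens per row is a useful derived figure. This sketch computes it from the row and token counts listed in this card. The variable names are illustrative, and nothing here comes from the dataset's own tooling.

```python
# (rows, tokens) per source, copied from the two lists in this card.
stats = {
    "reddit_to_flashcards": (138_240, 8_472_334),
    "wiki_to_rcqa-part1": (25_088, 4_831_455),
    "wiki_to_rcqa-part2": (25_088, 4_921_509),
    "wiki_to_rcqa-part3": (22_257, 4_206_844),
    "dolmino_1-flan": (17_408, 7_078_060),
    "tinymath-mind": (12_800, 8_123_976),
    "cranecode": (9_728, 8_894_067),
    "nemotron-synth-qa": (8_192, 3_953_873),
    "tinymath-pot": (6_144, 2_240_612),
    "tulu-3-sft": (6_144, 2_175_610),
    "verifiable-o4mini": (6_144, 2_576_286),
    "math-meta-reasoning": (5_632, 8_756_698),
    "cranemath": (5_120, 3_852_095),
    "general_reasoning_mix": (5_120, 6_749_696),
    "code-meta-reasoning": (4_608, 7_600_575),
    "verifiable-gpt41": (3_072, 4_292_100),
    "omr-rewrite-fullthoughts": (1_536, 2_741_410),
    "gemini-reasoning-traces": (1_024, 2_003_442),
    "qwq-reasoning-traces": (1_024, 2_097_152),
    "r1-reasoning-traces": (1_024, 1_467_752),
}

total_tokens = sum(tokens for _, tokens in stats.values())

# Mean tokens per row for each source.
mean_tokens = {name: tokens / rows for name, (rows, tokens) in stats.items()}

print(f"Total tokens: {total_tokens:,}")
for name, mean in sorted(mean_tokens.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mean:,.1f} tokens/row")
```

The per-source token counts sum to the stated total of 97,035,546. By this measure reddit_to_flashcards has the shortest rows (about 61 tokens on average), while qwq-reasoning-traces averages exactly 2048 tokens per row.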
Target rows per source:
- code-meta-reasoning: 4,301
- cranecode: 9,434
- cranemath: 5,000
- dolmino_1-flan: 16,949
- gemini-reasoning-traces: 949
- general_reasoning_mix: 5,000
- math-meta-reasoning: 5,624
- nemotron-synth-qa: 7,984
- omr-rewrite-fullthoughts: 1,370
- qwq-reasoning-traces: 967
- r1-reasoning-traces: 956
- reddit_to_flashcards: 137,931
- tinymath-mind: 12,636
- tinymath-pot: 6,042
- tulu-3-sft: 6,053
- verifiable-gpt41: 3,000
- verifiable-o4mini: 6,005
- wiki_to_rcqa-part1: 25,064
- wiki_to_rcqa-part2: 25,064
- wiki_to_rcqa-part3: 25,064
Ablation mode: capped at 10% of target rows per source.
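
The actual row counts do not equal the targets exactly, so comparing the two is a quick sanity check. The sketch below flags sources that fall short of their target. It also tests one pattern the numbers suggest: that actual rows equal the target rounded up to a multiple of 512. The 512 block size is inferred from the figures in this card, not stated by it, and `round_up` is a hypothetical helper.

```python
# (target rows, actual rows) per source, copied from the card.
counts = {
    "code-meta-reasoning": (4_301, 4_608),
    "cranecode": (9_434, 9_728),
    "cranemath": (5_000, 5_120),
    "dolmino_1-flan": (16_949, 17_408),
    "gemini-reasoning-traces": (949, 1_024),
    "general_reasoning_mix": (5_000, 5_120),
    "math-meta-reasoning": (5_624, 5_632),
    "nemotron-synth-qa": (7_984, 8_192),
    "omr-rewrite-fullthoughts": (1_370, 1_536),
    "qwq-reasoning-traces": (967, 1_024),
    "r1-reasoning-traces": (956, 1_024),
    "reddit_to_flashcards": (137_931, 138_240),
    "tinymath-mind": (12_636, 12_800),
    "tinymath-pot": (6_042, 6_144),
    "tulu-3-sft": (6_053, 6_144),
    "verifiable-gpt41": (3_000, 3_072),
    "verifiable-o4mini": (6_005, 6_144),
    "wiki_to_rcqa-part1": (25_064, 25_088),
    "wiki_to_rcqa-part2": (25_064, 25_088),
    "wiki_to_rcqa-part3": (25_064, 22_257),
}

BLOCK = 512  # assumption inferred from the numbers, not stated in the card


def round_up(n, block=BLOCK):
    """Smallest multiple of `block` that is >= n (integer arithmetic)."""
    return -(-n // block) * block


# Sources whose actual row count is below the target.
below_target = [name for name, (target, actual) in counts.items()
                if actual < target]

# Sources whose actual row count equals the target rounded up to BLOCK.
matches_round_up = [name for name, (target, actual) in counts.items()
                    if actual == round_up(target)]

print("Below target:", below_target)
print(f"Match round-up-to-{BLOCK}: {len(matches_round_up)} of {len(counts)}")
```

With the figures above, only wiki_to_rcqa-part3 falls below its target, and the other 19 sources match the round-up pattern. The card gives no explanation for either observation.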
Provider:
maas
Created:
2025-11-30



