
tlm-dolma-3-ablation

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/Shekswess/tlm-dolma-3-ablation
Resource description:
## Dataset Description

Dolma 3 subset built from allenai/dolma3_dolmino_pool (English only).

- Total rows: 305,393
- Total tokens (capped at 2048): 97,035,546

Share of rows, row count, token count, and target rows per source:

| Source | Share of rows | Rows | Tokens | Target rows |
| --- | ---: | ---: | ---: | ---: |
| reddit_to_flashcards | 45.27% | 138,240 | 8,472,334 | 137,931 |
| wiki_to_rcqa-part1 | 8.21% | 25,088 | 4,831,455 | 25,064 |
| wiki_to_rcqa-part2 | 8.21% | 25,088 | 4,921,509 | 25,064 |
| wiki_to_rcqa-part3 | 7.29% | 22,257 | 4,206,844 | 25,064 |
| dolmino_1-flan | 5.70% | 17,408 | 7,078,060 | 16,949 |
| tinymath-mind | 4.19% | 12,800 | 8,123,976 | 12,636 |
| cranecode | 3.19% | 9,728 | 8,894,067 | 9,434 |
| nemotron-synth-qa | 2.68% | 8,192 | 3,953,873 | 7,984 |
| tinymath-pot | 2.01% | 6,144 | 2,240,612 | 6,042 |
| tulu-3-sft | 2.01% | 6,144 | 2,175,610 | 6,053 |
| verifiable-o4mini | 2.01% | 6,144 | 2,576,286 | 6,005 |
| math-meta-reasoning | 1.84% | 5,632 | 8,756,698 | 5,624 |
| cranemath | 1.68% | 5,120 | 3,852,095 | 5,000 |
| general_reasoning_mix | 1.68% | 5,120 | 6,749,696 | 5,000 |
| code-meta-reasoning | 1.51% | 4,608 | 7,600,575 | 4,301 |
| verifiable-gpt41 | 1.01% | 3,072 | 4,292,100 | 3,000 |
| omr-rewrite-fullthoughts | 0.50% | 1,536 | 2,741,410 | 1,370 |
| gemini-reasoning-traces | 0.34% | 1,024 | 2,003,442 | 949 |
| qwq-reasoning-traces | 0.34% | 1,024 | 2,097,152 | 967 |
| r1-reasoning-traces | 0.34% | 1,024 | 1,467,752 | 956 |

Ablation mode: capped at 10% of target rows per source.
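The per-source figures in the description can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the dataset's own tooling: the `SOURCES` dictionary and the `summarize` helper are hypothetical names introduced here, and the numbers are copied by hand from the description. It recomputes the totals, each source's share of rows, and the average token count per row.

```python
# Hypothetical helper data: (rows, tokens) per source, copied from the
# dataset description. Nothing here is read from the dataset itself.
SOURCES = {
    "reddit_to_flashcards": (138_240, 8_472_334),
    "wiki_to_rcqa-part1": (25_088, 4_831_455),
    "wiki_to_rcqa-part2": (25_088, 4_921_509),
    "wiki_to_rcqa-part3": (22_257, 4_206_844),
    "dolmino_1-flan": (17_408, 7_078_060),
    "tinymath-mind": (12_800, 8_123_976),
    "cranecode": (9_728, 8_894_067),
    "nemotron-synth-qa": (8_192, 3_953_873),
    "tinymath-pot": (6_144, 2_240_612),
    "tulu-3-sft": (6_144, 2_175_610),
    "verifiable-o4mini": (6_144, 2_576_286),
    "math-meta-reasoning": (5_632, 8_756_698),
    "cranemath": (5_120, 3_852_095),
    "general_reasoning_mix": (5_120, 6_749_696),
    "code-meta-reasoning": (4_608, 7_600_575),
    "verifiable-gpt41": (3_072, 4_292_100),
    "omr-rewrite-fullthoughts": (1_536, 2_741_410),
    "gemini-reasoning-traces": (1_024, 2_003_442),
    "qwq-reasoning-traces": (1_024, 2_097_152),
    "r1-reasoning-traces": (1_024, 1_467_752),
}


def summarize(sources):
    """Return total rows, total tokens, and per-source derived statistics."""
    total_rows = sum(rows for rows, _ in sources.values())
    total_tokens = sum(tokens for _, tokens in sources.values())
    summary = {}
    for name, (rows, tokens) in sources.items():
        summary[name] = {
            # Share of all rows contributed by this source, in percent.
            "share_pct": round(100 * rows / total_rows, 2),
            # Mean token count per row for this source.
            "avg_tokens_per_row": round(tokens / rows, 1),
        }
    return total_rows, total_tokens, summary


total_rows, total_tokens, summary = summarize(SOURCES)
print(f"rows={total_rows:,} tokens={total_tokens:,}")
for name, stats in summary.items():
    print(f"{name:28s} {stats['share_pct']:6.2f}% {stats['avg_tokens_per_row']:8.1f}")
```

The recomputed totals agree with the stated 305,393 rows and 97,035,546 tokens. The averages also show how unevenly tokens are spread: reddit_to_flashcards supplies nearly half the rows but short rows, while qwq-reasoning-traces averages exactly 2,048 tokens per row, which suggests (as an inference, not a statement from the source) that the 2048 cap applies per row.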

Provider:
maas
Created:
2025-11-30