
OpenSeek-Pretrain-100B

ModelScope community · Updated 2026-05-22 · Indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/BAAI/OpenSeek-Pretrain-100B
Resource description:
# OpenSeek Pretraining Dataset v1.0

We have released a 100B-token sample of the **CCI4.0-M2 v1** dataset, including **Chinese and English Web datasets**, domain-specific datasets, and Chain-of-Thought reasoning datasets.

The directory structure matches that of the full dataset. For a detailed description of the dataset, please refer to the [CCI4.0-M2 v1 README](https://huggingface.co/datasets/BAAI/CCI4.0-M2-Base-v1/blob/main/README.md).

# Data Composition

The token counts for each subdirectory are as follows:

| Name                                              | Tokens       | Tokens (B) |
|---------------------------------------------------|--------------|------------|
| Nemotron-CC-high-actual-actual-high               | 1140543860   | 1.14       |
| Nemotron-CC-high-actual-actual-low                | 368646238    | 0.37       |
| Nemotron-CC-high-actual-actual-mid                | 801213010    | 0.80       |
| Nemotron-CC-high-synthetic-distill-high           | 294569308    | 0.29       |
| Nemotron-CC-high-synthetic-distill-low            | 172342068    | 0.17       |
| Nemotron-CC-high-synthetic-distill-mid            | 240998642    | 0.24       |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high  | 556137649    | 0.56       |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low   | 418742390    | 0.42       |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid   | 515733187    | 0.52       |
| Nemotron-CC-high-synthetic-extract_knowledge-high | 475714119    | 0.48       |
| Nemotron-CC-high-synthetic-extract_knowledge-low  | 68996838     | 0.07       |
| Nemotron-CC-high-synthetic-extract_knowledge-mid  | 353316407    | 0.35       |
| Nemotron-CC-high-synthetic-knowledge_list-high    | 268953064    | 0.27       |
| Nemotron-CC-high-synthetic-knowledge_list-low     | 187973360    | 0.19       |
| Nemotron-CC-high-synthetic-knowledge_list-mid     | 238373108    | 0.24       |
| Nemotron-CC-high-synthetic-wrap_medium-high       | 848837296    | 0.85       |
| Nemotron-CC-high-synthetic-wrap_medium-low        | 295324405    | 0.30       |
| Nemotron-CC-high-synthetic-wrap_medium-mid        | 687328353    | 0.69       |
| Nemotron-CC-low-synthetic-wrap_medium-high        | 479896420    | 0.48       |
| Nemotron-CC-low-synthetic-wrap_medium-low         | 206574167    | 0.21       |
| Nemotron-CC-low-synthetic-wrap_medium-mid         | 444865784    | 0.44       |
| Nemotron-CC-medium-actual-actual-high             | 1174405205   | 1.17       |
| Nemotron-CC-medium-actual-actual-low              | 698884310    | 0.70       |
| Nemotron-CC-medium-actual-actual-mid              | 945401567    | 0.95       |
| arxiv                                             | 660912931    | 0.66       |
| books                                             | 483917796    | 0.48       |
| code-high                                         | 1040945650   | 1.04       |
| code-low                                          | 1175000655   | 1.18       |
| code-mid                                          | 996826302    | 1.00       |
| cot_synthesis2_CC-high                            | 386941302    | 0.39       |
| cot_synthesis2_CC-low                             | 51390680     | 0.05       |
| cot_synthesis2_CC-mid                             | 1885475230   | 1.89       |
| cot_synthesis2_OpenSource-high                    | 265167656    | 0.27       |
| cot_synthesis2_OpenSource-low                     | 168830028    | 0.17       |
| cot_synthesis2_OpenSource-mid                     | 334976884    | 0.33       |
| cot_synthesis2_arxiv-high                         | 12894983685  | 12.89      |
| cot_synthesis2_arxiv-low                          | 9177670132   | 9.18       |
| cot_synthesis2_arxiv-mid                          | 10446468216  | 10.45      |
| cot_synthesis2_code-high                          | 473767419    | 0.47       |
| cot_synthesis2_code-low                           | 706636812    | 0.71       |
| cot_synthesis2_code-mid                           | 926436168    | 0.93       |
| cot_synthesis2_math-high                          | 1353517224   | 1.35       |
| cot_synthesis2_math-low                           | 1703361358   | 1.70       |
| cot_synthesis2_math-mid                           | 364330324    | 0.36       |
| cot_synthesis2_wiki-high                          | 650684154    | 0.65       |
| cot_synthesis2_wiki-low                           | 615978070    | 0.62       |
| cot_synthesis2_wiki-mid                           | 814947142    | 0.81       |
| cot_synthesis_CC-high                             | 229324269    | 0.23       |
| cot_synthesis_CC-low                              | 185148748    | 0.19       |
| cot_synthesis_CC-mid                              | 210471356    | 0.21       |
| cot_synthesis_OpenSource-high                     | 420505110    | 0.42       |
| cot_synthesis_OpenSource-low                      | 170987708    | 0.17       |
| cot_synthesis_OpenSource-mid                      | 1321855051   | 1.32       |
| cot_synthesis_arxiv-high                          | 5853027309   | 5.85       |
| cot_synthesis_arxiv-low                           | 7718911399   | 7.72       |
| cot_synthesis_arxiv-mid                           | 9208148090   | 9.21       |
| cot_synthesis_code-high                           | 789672525    | 0.79       |
| cot_synthesis_code-low                            | 417526994    | 0.42       |
| cot_synthesis_code-mid                            | 197436971    | 0.20       |
| cot_synthesis_math-high                           | 522900778    | 0.52       |
| cot_synthesis_math-low                            | 663320643    | 0.66       |
| cot_synthesis_math-mid                            | 660137084    | 0.66       |
| cot_synthesis_wiki-high                           | 412152225    | 0.41       |
| cot_synthesis_wiki-low                            | 367306600    | 0.37       |
| cot_synthesis_wiki-mid                            | 594421619    | 0.59       |
| math-high                                         | 1871864190   | 1.87       |
| math-low                                          | 1745580082   | 1.75       |
| math-mid                                          | 1680811027   | 1.68       |
| pes2o                                             | 6386997158   | 6.39       |
| pes2o-full-train                                  | 1469110938   | 1.47       |
| pes2o-full-val                                    | 14693152     | 0.01       |
| stack                                             | 435813429    | 0.44       |
| wiki                                              | 433002447    | 0.43       |
| zh_cc-high-loss0                                  | 1872431176   | 1.87       |
| zh_cc-high-loss1                                  | 1007405788   | 1.01       |
| zh_cc-high-loss2                                  | 383830893    | 0.38       |
| zh_cc-medidum-loss0                               | 978118384    | 0.98       |
| zh_cc-medidum-loss1                               | 951741139    | 0.95       |
| zh_cc-medidum-loss2                               | 1096769115   | 1.10       |
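
The billions column in the table appears to be the raw token count divided by 10^9 and rounded to two decimals. The sketch below checks that reading on a few rows copied from the table and computes each row's share within that small selection. The names `TOKENS`, `to_billions`, and `shares` are hypothetical helpers introduced for illustration; they are not part of the dataset or of any official tooling.

```python
# Raw token counts copied from four rows of the table above.
TOKENS = {
    "Nemotron-CC-high-actual-actual-high": 1_140_543_860,
    "code-mid": 996_826_302,
    "cot_synthesis2_arxiv-high": 12_894_983_685,
    "pes2o-full-val": 14_693_152,
}


def to_billions(tokens: int) -> float:
    """Convert a raw token count to billions, rounded to two decimals."""
    return round(tokens / 1e9, 2)


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each entry's fraction of the total over the given entries only."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Print the selected rows in billions, with their share of this selection
# (not of the whole dataset, since only four rows are included here).
selection_shares = shares(TOKENS)
for name, n in TOKENS.items():
    print(f"{name}: {to_billions(n):.2f}B ({selection_shares[name]:.1%} of selection)")
```

Passing the full table to `shares` instead of the four-row selection would give per-subdirectory proportions, which may be a useful starting point when deciding sampling weights.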

Provider:
maas
Created:
2025-04-22
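Background overview
OpenSeek-Pretrain-100B is a large-scale pretraining dataset of 100B tokens covering Chinese and English web data, domain-specific data, and Chain-of-Thought reasoning data. It is suitable for a range of natural language processing tasks.
The summary above was collected and generated by 遇见数据集.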