Download link:

https://modelscope.cn/datasets/common-pile/comma_v0.1_training_dataset


Resource overview:

# Comma v0.1 dataset

This repository contains the dataset used to train [Comma v0.1-1T](https://huggingface.co/common-pile/comma-v0.1-1t) and [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t). It is a slightly modified and consolidated version of the [Common Pile v0.1 "filtered" data](https://huggingface.co/collections/common-pile/common-pile-v01-filtered-data-68300bb0a946d10dda697663). If you are looking for the raw Common Pile v0.1 data, please see [this collection](https://huggingface.co/collections/common-pile/common-pile-v01-raw-data-6826b454a5a6a445d0b51b37). You can learn more about Common Pile in [our paper](https://huggingface.co/papers/2506.05209).

## Mixing rates and token counts

The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage. During each stage, we heuristically set mixing rates to upweight or downweight different sources. In the two tables below, we provide the per-source token count, repeat rate, and effective token count (after up/downweighting) for the main and cooldown stages of the Comma v0.1-1T training run. For the Comma v0.1-2T training run, all sources are repeated twice as many times in both stages. Token counts are as provided by the Comma v0.1 tokenizer; using a different tokenizer may change these counts significantly.

| Main stage source             | Tokens (B) | Repeats | Effective tokens (B) |
|-------------------------------|------------|---------|----------------------|
| arxiv_abstracts               | 0.57       | 6       | 3.4                  |
| arxiv_papers                  | 6.0        | 6       | 35.8                 |
| biodiversity_heritage_library | 9.8        | 0.25    | 2.5                  |
| caselaw_access_project        | 19.7       | 1       | 19.7                 |
| cccc                          | 15.2       | 6       | 91.4                 |
| data_provenance_initiative    | 0.92       | 6       | 5.5                  |
| doab                          | 3.0        | 6       | 18.2                 |
| foodista                      | 0.025      | 6       | 0.15                 |
| github_archive                | 11.0       | 6       | 66.1                 |
| library_of_congress           | 9.5        | 0.25    | 2.4                  |
| libretexts                    | 0.093      | 6       | 0.56                 |
| news                          | 0.064      | 6       | 0.38                 |
| oercommons                    | 0.012      | 6       | 0.07                 |
| peS2o                         | 43.3       | 6       | 260.0                |
| pre_1929_books                | 12.4       | 1       | 12.4                 |
| pressbooks                    | 0.14       | 6       | 0.86                 |
| project_gutenberg             | 5.7        | 1       | 5.7                  |
| public_domain_review          | 0.0017     | 6       | 0.010                |
| pubmed                        | 36.6       | 1       | 36.6                 |
| python_enhancement_proposals  | 0.0027     | 6       | 0.016                |
| regulations                   | 1.4        | 6       | 8.2                  |
| stackexchange                 | 23.9       | 6       | 143.2                |
| stackv2_edu                   | 67.8       | 2       | 135.5                |
| stackv2_html                  | 1.2        | 2       | 2.5                  |
| ubuntu_irc                    | 1.9        | 6       | 11.1                 |
| uk_hansard                    | 2.3        | 6       | 14.0                 |
| usgpo                         | 8.8        | 0.25    | 2.2                  |
| uspto                         | 157.4      | 0.25    | 39.4                 |
| wikimedia                     | 15.8       | 6       | 94.7                 |
| wikiteam                      | 4.3        | 4       | 17.2                 |
| youtube                       | 4.7        | 1       | 4.7                  |
| Total                         | 463.6      |         | 1034.4               |

| Cooldown stage source        | Tokens (B) | Repeats | Effective tokens (B) |
|------------------------------|------------|---------|----------------------|
| arxiv_papers                 | 6.0        | 0.5     | 3.0                  |
| cccc                         | 15.2       | 0.3     | 4.6                  |
| data_provenance_initiative   | 0.92       | 2       | 1.8                  |
| doab                         | 3.0        | 2       | 6.1                  |
| foodista                     | 0.025      | 2       | 0.05                 |
| libretexts                   | 0.093      | 2       | 0.19                 |
| news                         | 0.064      | 2       | 0.13                 |
| oercommons                   | 0.012      | 2       | 0.02                 |
| peS2o                        | 43.3       | 0.1     | 4.3                  |
| pressbooks                   | 0.14       | 2       | 0.29                 |
| public_domain_review         | 0.0017     | 2       | 0.003                |
| python_enhancement_proposals | 0.0027     | 2       | 0.005                |
| stackexchange                | 23.9       | 0.25    | 6.0                  |
| stackv2_edu                  | 67.8       | 0.1     | 6.8                  |
| wikimedia                    | 15.8       | 0.4     | 6.3                  |
| Total                        | 176.2      |         | 39.5                 |
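
The relationship between the three table columns can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the helper names (`effective_tokens`, `effective_by_source`) and the `run_multiplier` parameter are hypothetical, and only four main-stage rows are copied in. It assumes that effective tokens are simply tokens multiplied by repeats, and that the Comma v0.1-2T run doubles every repeat rate as described above. Results computed from the rounded table values can differ slightly from the published column (for example, 6.0 × 6 gives 36.0 for arxiv_papers, while the table lists 35.8), presumably because the published figures come from unrounded token counts.

```python
# Tokens (B) and repeat rates copied from a few rows of the main-stage table.
MAIN_STAGE = {
    "arxiv_abstracts": (0.57, 6),
    "pubmed": (36.6, 1),
    "wikiteam": (4.3, 4),
    "library_of_congress": (9.5, 0.25),
}


def effective_tokens(tokens_b, repeats, run_multiplier=1.0):
    """Effective tokens (B) = raw tokens (B) x repeats x run multiplier.

    run_multiplier is a hypothetical knob: 1.0 for the 1T run, 2.0 for the
    2T run, where every source is repeated twice as many times.
    """
    return tokens_b * repeats * run_multiplier


def effective_by_source(stage, run_multiplier=1.0):
    """Apply effective_tokens to every (tokens, repeats) pair in a stage."""
    return {
        name: effective_tokens(tokens_b, repeats, run_multiplier)
        for name, (tokens_b, repeats) in stage.items()
    }


one_t = effective_by_source(MAIN_STAGE)                      # Comma v0.1-1T
two_t = effective_by_source(MAIN_STAGE, run_multiplier=2.0)  # Comma v0.1-2T

for name in MAIN_STAGE:
    print(f"{name:22s} 1T: {one_t[name]:7.2f} B   2T: {two_t[name]:7.2f} B")
```

A repeat rate below 1 (such as 0.25 for library_of_congress) means only a fraction of that source is seen during the stage, which is how the large but noisier sources are downweighted.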


Use cases: