comma_v0.1_training_dataset
Source: ModelScope community. Updated 2025-12-30; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/common-pile/comma_v0.1_training_dataset
# Comma v0.1 dataset
This repository contains the dataset used to train [Comma v0.1-1T](https://huggingface.co/common-pile/comma-v0.1-1t) and [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t).
It is a slightly modified and consolidated version of the [Common Pile v0.1 "filtered" data](https://huggingface.co/collections/common-pile/common-pile-v01-filtered-data-68300bb0a946d10dda697663).
If you are looking for the raw Common Pile v0.1 data, please see [this collection](https://huggingface.co/collections/common-pile/common-pile-v01-raw-data-6826b454a5a6a445d0b51b37).
You can learn more about Common Pile in [our paper](https://huggingface.co/papers/2506.05209).
## Mixing rates and token counts
The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage.
During each stage, we heuristically set mixing rates to up- or down-weight different sources.
In the two tables below, we provide the per-source token count, repeat rate, and effective token count (after up/down-weighting) for the main and cooldown stages of the Comma v0.1-1T training run.
For the Comma v0.1-2T training run, all sources are repeated twice as many times in both stages.
Token counts are as provided by the Comma v0.1 tokenizer; using a different tokenizer may change these counts significantly.

| Main stage | Tokens (B) | Repeats | Effective tokens (B) |
|-------------------------------|------------|---------|----------------------|
| arxiv_abstracts | 0.57 | 6 | 3.4 |
| arxiv_papers | 6.0 | 6 | 35.8 |
| biodiversity_heritage_library | 9.8 | 0.25 | 2.5 |
| caselaw_access_project | 19.7 | 1 | 19.7 |
| cccc | 15.2 | 6 | 91.4 |
| data_provenance_initiative | 0.92 | 6 | 5.5 |
| doab | 3.0 | 6 | 18.2 |
| foodista | 0.025 | 6 | 0.15 |
| github_archive | 11.0 | 6 | 66.1 |
| library_of_congress | 9.5 | 0.25 | 2.4 |
| libretexts | 0.093 | 6 | 0.56 |
| news | 0.064 | 6 | 0.38 |
| oercommons | 0.012 | 6 | 0.07 |
| peS2o | 43.3 | 6 | 260.0 |
| pre_1929_books | 12.4 | 1 | 12.4 |
| pressbooks | 0.14 | 6 | 0.86 |
| project_gutenberg | 5.7 | 1 | 5.7 |
| public_domain_review | 0.0017 | 6 | 0.010 |
| pubmed | 36.6 | 1 | 36.6 |
| python_enhancement_proposals | 0.0027 | 6 | 0.016 |
| regulations | 1.4 | 6 | 8.2 |
| stackexchange | 23.9 | 6 | 143.2 |
| stackv2_edu | 67.8 | 2 | 135.5 |
| stackv2_html | 1.2 | 2 | 2.5 |
| ubuntu_irc | 1.9 | 6 | 11.1 |
| uk_hansard | 2.3 | 6 | 14.0 |
| usgpo | 8.8 | 0.25 | 2.2 |
| uspto | 157.4 | 0.25 | 39.4 |
| wikimedia | 15.8 | 6 | 94.7 |
| wikiteam | 4.3 | 4 | 17.2 |
| youtube | 4.7 | 1 | 4.7 |
| Total | 463.6 | | 1034.4 |

| Cooldown stage | Tokens (B) | Repeats | Effective tokens (B) |
|------------------------------|------------|---------|----------------------|
| arxiv_papers | 6.0 | 0.5 | 3.0 |
| cccc | 15.2 | 0.3 | 4.6 |
| data_provenance_initiative | 0.92 | 2 | 1.8 |
| doab | 3.0 | 2 | 6.1 |
| foodista | 0.025 | 2 | 0.05 |
| libretexts | 0.093 | 2 | 0.19 |
| news | 0.064 | 2 | 0.13 |
| oercommons | 0.012 | 2 | 0.02 |
| peS2o | 43.3 | 0.1 | 4.3 |
| pressbooks | 0.14 | 2 | 0.29 |
| public_domain_review | 0.0017 | 2 | 0.003 |
| python_enhancement_proposals | 0.0027 | 2 | 0.005 |
| stackexchange | 23.9 | 0.25 | 6.0 |
| stackv2_edu | 67.8 | 0.1 | 6.8 |
| wikimedia | 15.8 | 0.4 | 6.3 |
| Total | 176.2 | | 39.5 |
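
The relationship between the columns above can be sketched in a few lines of Python: the effective token count is the raw token count multiplied by the repeat rate, and the 2T run doubles every repeat rate. This is a minimal illustration, not the training code. The names `COOLDOWN_1T`, `effective_tokens`, and `mixing_fractions` are hypothetical, only a subset of cooldown rows is copied in, and the resulting fractions are therefore relative to that subset rather than the full mixture. Because the table's token counts are rounded, some products may differ from the published effective counts in the last digit.

```python
# Illustrative subset of the cooldown-stage table above.
# Each value is (tokens in billions, repeats) for the Comma v0.1-1T run.
COOLDOWN_1T = {
    "arxiv_papers": (6.0, 0.5),
    "cccc": (15.2, 0.3),
    "doab": (3.0, 2),
    "peS2o": (43.3, 0.1),
    "stackexchange": (23.9, 0.25),
    "stackv2_edu": (67.8, 0.1),
    "wikimedia": (15.8, 0.4),
}


def effective_tokens(stage, repeat_multiplier=1.0):
    """Return effective tokens (B) per source: tokens * repeats * multiplier."""
    return {
        name: tokens * repeats * repeat_multiplier
        for name, (tokens, repeats) in stage.items()
    }


def mixing_fractions(effective):
    """Normalize effective token counts into fractions of the listed sources."""
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


eff_1t = effective_tokens(COOLDOWN_1T)
# The 2T run repeats every source twice as many times in both stages.
eff_2t = effective_tokens(COOLDOWN_1T, repeat_multiplier=2.0)
fractions = mixing_fractions(eff_1t)

# Print sources from largest to smallest effective share.
for name in sorted(eff_1t, key=eff_1t.get, reverse=True):
    print(f"{name:15s} {eff_1t[name]:6.2f} B  {fractions[name]:6.1%}")
```

Doubling every repeat rate leaves the relative fractions unchanged, so under this sketch the 1T and 2T runs share the same mixture proportions and differ only in total effective tokens.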
Provider: maas
Created: 2025-06-11



