
comma_v0.1_training_dataset

ModelScope community · Updated 2025-12-30 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/common-pile/comma_v0.1_training_dataset
Resource description:
# Comma v0.1 dataset

This repository contains the dataset used to train [Comma v0.1-1T](https://huggingface.co/common-pile/comma-v0.1-1t) and [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t). It is a slightly modified and consolidated version of the [Common Pile v0.1 "filtered" data](https://huggingface.co/collections/common-pile/common-pile-v01-filtered-data-68300bb0a946d10dda697663). If you are looking for the raw Common Pile v0.1 data, please see [this collection](https://huggingface.co/collections/common-pile/common-pile-v01-raw-data-6826b454a5a6a445d0b51b37). You can learn more about Common Pile in [our paper](https://huggingface.co/papers/2506.05209).

## Mixing rates and token counts

The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage. During each stage, we heuristically set mixing rates to upweight or downweight different sources. In the two tables below, we provide the per-source token count, repeat rate, and effective token count (after up/downweighting) for the main and cooldown stages of the Comma v0.1-1T training run. For the Comma v0.1-2T training run, all sources are repeated 2x as many times in both stages. Token counts are as provided by the Comma v0.1 tokenizer; using a different tokenizer may change these counts significantly.

| Main stage                    | Tokens (B) | Repeats | Effective tokens (B) |
|-------------------------------|------------|---------|----------------------|
| arxiv_abstracts               | 0.57       | 6       | 3.4                  |
| arxiv_papers                  | 6.0        | 6       | 35.8                 |
| biodiversity_heritage_library | 9.8        | 0.25    | 2.5                  |
| caselaw_access_project        | 19.7       | 1       | 19.7                 |
| cccc                          | 15.2       | 6       | 91.4                 |
| data_provenance_initiative    | 0.92       | 6       | 5.5                  |
| doab                          | 3.0        | 6       | 18.2                 |
| foodista                      | 0.025      | 6       | 0.15                 |
| github_archive                | 11.0       | 6       | 66.1                 |
| library_of_congress           | 9.5        | 0.25    | 2.4                  |
| libretexts                    | 0.093      | 6       | 0.56                 |
| news                          | 0.064      | 6       | 0.38                 |
| oercommons                    | 0.012      | 6       | 0.07                 |
| peS2o                         | 43.3       | 6       | 260.0                |
| pre_1929_books                | 12.4       | 1       | 12.4                 |
| pressbooks                    | 0.14       | 6       | 0.86                 |
| project_gutenberg             | 5.7        | 1       | 5.7                  |
| public_domain_review          | 0.0017     | 6       | 0.010                |
| pubmed                        | 36.6       | 1       | 36.6                 |
| python_enhancement_proposals  | 0.0027     | 6       | 0.016                |
| regulations                   | 1.4        | 6       | 8.2                  |
| stackexchange                 | 23.9       | 6       | 143.2                |
| stackv2_edu                   | 67.8       | 2       | 135.5                |
| stackv2_html                  | 1.2        | 2       | 2.5                  |
| ubuntu_irc                    | 1.9        | 6       | 11.1                 |
| uk_hansard                    | 2.3        | 6       | 14.0                 |
| usgpo                         | 8.8        | 0.25    | 2.2                  |
| uspto                         | 157.4      | 0.25    | 39.4                 |
| wikimedia                     | 15.8       | 6       | 94.7                 |
| wikiteam                      | 4.3        | 4       | 17.2                 |
| youtube                       | 4.7        | 1       | 4.7                  |
| Total                         | 463.6      |         | 1034.4               |

| Cooldown stage               | Tokens (B) | Repeats | Effective tokens (B) |
|------------------------------|------------|---------|----------------------|
| arxiv_papers                 | 6.0        | 0.5     | 3.0                  |
| cccc                         | 15.2       | 0.3     | 4.6                  |
| data_provenance_initiative   | 0.92       | 2       | 1.8                  |
| doab                         | 3.0        | 2       | 6.1                  |
| foodista                     | 0.025      | 2       | 0.05                 |
| libretexts                   | 0.093      | 2       | 0.19                 |
| news                         | 0.064      | 2       | 0.13                 |
| oercommons                   | 0.012      | 2       | 0.02                 |
| peS2o                        | 43.3       | 0.1     | 4.3                  |
| pressbooks                   | 0.14       | 2       | 0.29                 |
| public_domain_review         | 0.0017     | 2       | 0.003                |
| python_enhancement_proposals | 0.0027     | 2       | 0.005                |
| stackexchange                | 23.9       | 0.25    | 6.0                  |
| stackv2_edu                  | 67.8       | 0.1     | 6.8                  |
| wikimedia                    | 15.8       | 0.4     | 6.3                  |
| Total                        | 176.2      |         | 39.5                 |
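
As a rough illustration of how the columns above relate, the sketch below multiplies the token count by the repeat rate to get the effective token count, and doubles the repeat rate for the 2T run as described in the text. It is not the authors' tooling: the names `MAIN_STAGE`, `effective_tokens`, `mixing_fractions`, and `run_multiplier` are hypothetical, and only a handful of main-stage rows are copied in. Because the published token counts are rounded, some products differ slightly from the table (for example, 6.0 × 6 = 36.0 against a listed 35.8 for arxiv_papers).

```python
# Rounded per-source token counts (billions) and repeat rates, copied from
# a few rows of the main-stage table. The dict name is illustrative only.
MAIN_STAGE = {
    "arxiv_abstracts": (0.57, 6),
    "caselaw_access_project": (19.7, 1),
    "pubmed": (36.6, 1),
    "uspto": (157.4, 0.25),
    "wikiteam": (4.3, 4),
}


def effective_tokens(tokens_b, repeats, run_multiplier=1.0):
    """Effective tokens (B) = raw tokens (B) x repeat rate x run multiplier.

    run_multiplier is a hypothetical knob: 1.0 for the 1T run, 2.0 for the
    2T run, where every source is repeated twice as many times.
    """
    return tokens_b * repeats * run_multiplier


def mixing_fractions(stage, run_multiplier=1.0):
    """Share of the listed sources' effective tokens contributed by each source."""
    eff = {
        name: effective_tokens(tokens_b, repeats, run_multiplier)
        for name, (tokens_b, repeats) in stage.items()
    }
    total = sum(eff.values())
    return {name: value / total for name, value in eff.items()}


if __name__ == "__main__":
    for name, (tokens_b, repeats) in MAIN_STAGE.items():
        one_t = effective_tokens(tokens_b, repeats)
        two_t = effective_tokens(tokens_b, repeats, run_multiplier=2.0)
        print(f"{name:24s} 1T: {one_t:7.2f} B   2T: {two_t:7.2f} B")
```

Since the 2T run scales every repeat rate by the same factor, the relative mixing fractions within a stage are the same for both runs; only the absolute effective token counts double.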

Provider:
maas
Created:
2025-06-11