
K2Datasets

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-04-12.
Download link:
https://modelscope.cn/datasets/LLM360/K2Datasets
Resource description:
# K2 Dataset Card

The following data mix was used to train [K2](https://huggingface.co/LLM360/K2) and achieve results in line with Llama 2 70B.

## Dataset Details

K2 was trained on 1.4T tokens across two stages. The data sources and data mix for each stage are listed below. In each table, Total Tokens is Starting Tokens times Multiplier, and % of Total is that dataset's share of the stage total.

### Dataset Description: Stage 1

| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [dm-math](https://github.com/google-deepmind/mathematics_dataset) | 4.33B | 3x | 13B | 1% |
| pubmed-abstracts (from the Pile) | 4.77B | 3x | 14.3B | 1.1% |
| uspto (from the Pile) | 4.77B | 3x | 14.3B | 1.1% |
| pubmed-central (from the Pile) | 26B | 1x | 26B | 2% |
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 27.3B | 1x | 27.3B | 2.1% |
| [starcoder.spm](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [starcoder.fim](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [redpajama.stackexchange](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 61.1B | 1x | 61.1B | 4.7% |
| [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) | 132.6B | 0.5x | 66.3B | 5.1% |
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 76.7B | 1x | 76.7B | 5.9% |
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 80.6B | 1x | 80.6B | 6.2% |
| [s2orc](https://allenai.org/data/s2orc) | 107.9B | 1x | 107.9B | 8.3% |
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 22.1B | 6x | 132.6B | 10.2% |
| [refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 612.3B | 1x | 612.3B | 47.1% |
| Totals | - | - | 1.3T | 100% |

### Dataset Description: Stage 2

| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [open-web-math](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | 14.6B | 1x | 14.6B | 21% |
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| [simple-wiki](https://huggingface.co/datasets/allenai/dolma) | 4.3B | 1x | 4.3B | 6.2% |
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| [algebraic-stack](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | 10.9B | 1x | 10.9B | 15.7% |
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 2B | 1x | 2B | 2.9% |
| books | 5.8B | 1x | 5.8B | 8.3% |
| [pes20](https://huggingface.co/datasets/allenai/peS2o) | 1.2B | 1x | 1.2B | 1.8% |
| [pubmed-central (from the Pile)](https://github.com/EleutherAI/pile-pubmedcentral) | 2B | 1x | 2B | 2.9% |
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| python | 20.5B | 1x | 20.5B | 29.6% |
| [s2orc](https://allenai.org/data/s2orc) | 2B | 1x | 2B | 2.9% |
| Totals | - | - | 69.4B\* | 100% |

\* The total is affected by rounding.

#### Data Collection and Processing

A step-by-step tutorial for reproducing K2's data preparation can be found in the [LLM360 Pretraining Suite](https://www.llm360.ai/pretraining.html).

## Bias, Risks, and Limitations

Users should be made aware of the risks, biases, and limitations of the dataset. More information is needed for further recommendations.

## Citation

**BibTeX:**

```bibtex
@misc{
      title={LLM360 K2-65B: Scaling Up Open and Transparent Language Models},
      author={The LLM360 Team},
      year={2024},
}
```
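
The columns of the Stage 1 table in this card are related by simple arithmetic: Total Tokens is Starting Tokens times Multiplier, and % of Total is each dataset's share of the summed totals. The following is a minimal sketch that recomputes those columns from the published Starting Tokens and Multiplier values. The names `STAGE1` and `mix_summary` are hypothetical helpers introduced here for illustration; they are not part of any LLM360 tooling.

```python
# Stage 1 rows copied from the card: (name, starting tokens in billions, multiplier).
STAGE1 = [
    ("dm-math", 4.33, 3),
    ("pubmed-abstracts", 4.77, 3),
    ("uspto", 4.77, 3),
    ("pubmed-central", 26.0, 1),
    ("redpajama.arxiv", 27.3, 1),
    ("starcoder.spm", 67.6, 0.5),
    ("starcoder.fim", 67.6, 0.5),
    ("redpajama.stackexchange", 61.1, 1),
    ("starcoder", 132.6, 0.5),
    ("pile-of-law", 76.7, 1),
    ("redpajama.book", 80.6, 1),
    ("s2orc", 107.9, 1),
    ("redpajama.wikipedia", 22.1, 6),
    ("refinedweb", 612.3, 1),
]


def mix_summary(rows):
    """Return (per-dataset totals in billions, grand total, percent shares)."""
    # Total tokens per dataset = starting tokens * multiplier.
    totals = {name: start * mult for name, start, mult in rows}
    grand_total = sum(totals.values())
    # Share of the stage = dataset total / grand total, as a percentage.
    shares = {name: 100.0 * t / grand_total for name, t in totals.items()}
    return totals, grand_total, shares


totals, grand_total, shares = mix_summary(STAGE1)
print(f"Stage 1 total: {grand_total:.1f}B tokens")
for name, share in shares.items():
    print(f"{name:<26}{totals[name]:>8.1f}B{share:>7.1f}%")
```

Run as-is, the recomputed grand total should land at roughly 1.3T tokens, matching the Totals row. The same function can be applied to the Stage 2 rows to check each row against its listed percentage.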

Provider:
maas
Created:
2025-04-07