K2Datasets
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-04-12.
Download link: https://modelscope.cn/datasets/LLM360/K2Datasets
Description:
# K2 Dataset Card
<!-- Provide a quick summary of the dataset. -->
The following data mix was used to train [K2](https://huggingface.co/LLM360/K2) and achieve results in line with Llama 2 70B.
## Dataset Details
K2 was trained on 1.4T tokens across two stages. The data sources and data mix for each stage are listed below.
### Dataset Description: Stage 1
<!-- Provide a longer summary of what this dataset is. -->
| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [dm-math](https://github.com/google-deepmind/mathematics_dataset) | 4.33B | 3x | 13B | 1% |
| pubmed-abstracts (from the Pile) | 4.77B | 3x | 14.3B | 1.1% |
| uspto (from the Pile) | 4.77B | 3x | 14.3B | 1.1% |
| pubmed-central (from the Pile) | 26B | 1x | 26B | 2% |
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 27.3B | 1x | 27.3B | 2.1% |
| [starcoder.spm](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [starcoder.fim](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [redpajama.stackexchange](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 61.1B | 1x | 61.1B | 4.7% |
| [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) | 132.6B | 0.5x | 66.3B | 5.1% |
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 76.7B | 1x | 76.7B | 5.9% |
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 80.6B | 1x | 80.6B | 6.2% |
| [s2orc](https://allenai.org/data/s2orc) | 107.9B | 1x | 107.9B | 8.3% |
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 22.1B | 6x | 132.6B | 10.2% |
| [refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 612.3B | 1x | 612.3B | 47.1% |
| Totals | - | - | 1.3T | 100% |
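
The arithmetic behind the Stage 1 table can be checked with a short script. This is a minimal sketch, not part of the K2 tooling: it assumes that "Total Tokens" is simply "Starting Tokens" times "Multiplier" and that "% of Total" is each row's share of the summed totals. The names `STAGE1`, `effective_tokens`, and `shares` are illustrative only.

```python
# Stage 1 rows copied from the table above: (starting tokens in billions, multiplier).
STAGE1 = {
    "dm-math": (4.33, 3),
    "pubmed-abstracts": (4.77, 3),
    "uspto": (4.77, 3),
    "pubmed-central": (26.0, 1),
    "redpajama.arxiv": (27.3, 1),
    "starcoder.spm": (67.6, 0.5),
    "starcoder.fim": (67.6, 0.5),
    "redpajama.stackexchange": (61.1, 1),
    "starcoder": (132.6, 0.5),
    "pile-of-law": (76.7, 1),
    "redpajama.book": (80.6, 1),
    "s2orc": (107.9, 1),
    "redpajama.wikipedia": (22.1, 6),
    "refinedweb": (612.3, 1),
}


def effective_tokens(mix):
    """Return {name: starting * multiplier}, in billions of tokens."""
    return {name: start * mult for name, (start, mult) in mix.items()}


def shares(tokens):
    """Return each source's percentage of the summed token count."""
    total = sum(tokens.values())
    return {name: 100.0 * count / total for name, count in tokens.items()}


stage1_tokens = effective_tokens(STAGE1)
stage1_total = sum(stage1_tokens.values())
stage1_shares = shares(stage1_tokens)

print(f"total: {stage1_total:.1f}B tokens")
for name, pct in stage1_shares.items():
    print(f"{name:26s} {stage1_tokens[name]:7.1f}B  {pct:5.1f}%")
```

Under these assumptions the per-row totals sum to roughly 1.3T tokens, and the computed shares agree with the "% of Total" column once rounded to one decimal place.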
### Dataset Description: Stage 2
| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [open-web-math](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | 14.6B | 1x | 14.6B | 21% |
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| [simple-wiki](https://huggingface.co/datasets/allenai/dolma) | 4.3B | 1x | 4.3B | 6.2% |
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| [algebraic-stack](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | 10.9B | 1x | 10.9B | 15.7% |
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 2B | 0.5x | 2B | 2.9% |
| books | 5.8B | 1x | 5.8B | 8.3% |
| [peS2o](https://huggingface.co/datasets/allenai/peS2o) | 1.2B | 1x | 1.2B | 1.8% |
| [pubmed-central (from the Pile)](https://github.com/EleutherAI/pile-pubmedcentral) | 2B | 1x | 2B | 2.9% |
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| python | 20.5B | 1x | 20.5B | 29.6% |
| [s2orc](https://allenai.org/data/s2orc) | 2B | 1x | 2B | 2.9% |
| Totals | - | - | 69.4B* | 100% |
*The total differs slightly from the sum of the rows because of rounding.
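
One way to read the "% of Total" column is as a set of sampling weights over sources. The sketch below is an illustration only: the card does not say how the mix was realized in the K2 training pipeline, and `STAGE2_PERCENT`, `to_probabilities`, and `sample_sources` are hypothetical names. It normalizes the Stage 2 percentages into probabilities and draws a source label for each of a number of documents.

```python
import random
from collections import Counter

# "% of Total" column of the Stage 2 table, used here as sampling weights.
STAGE2_PERCENT = {
    "open-web-math": 21.0,
    "redpajama.arxiv": 2.9,
    "simple-wiki": 6.2,
    "redpajama.book": 2.9,
    "algebraic-stack": 15.7,
    "pile-of-law": 2.9,
    "books": 8.3,
    "peS2o": 1.8,
    "pubmed-central": 2.9,
    "redpajama.wikipedia": 2.9,
    "python": 29.6,
    "s2orc": 2.9,
}


def to_probabilities(percentages):
    """Normalize percentages so that they sum to exactly 1."""
    total = sum(percentages.values())
    return {name: pct / total for name, pct in percentages.items()}


def sample_sources(probabilities, n, seed=0):
    """Draw n source names, each chosen with its mix probability."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


stage2_probs = to_probabilities(STAGE2_PERCENT)
draws = Counter(sample_sources(stage2_probs, n=10_000))
for name, count in draws.most_common(3):
    print(name, count)
```

Normalizing first means the weights stay valid even if the published percentages are slightly off 100% after rounding.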
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
A step-by-step tutorial for reproducing K2's data preparation can be found in the [LLM360 Pretraining Suite](https://www.llm360.ai/pretraining.html).
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Users should be made aware of the risks, biases, and limitations of the dataset. More information is needed before further recommendations can be made.
## Citation
**BibTeX:**
```bibtex
@misc{llm360k2,
title={LLM360 K2-65B: Scaling Up Open and Transparent Language Models},
author={The LLM360 Team},
year={2024},
}
```
Provider: maas
Created: 2025-04-07



