Download link:

https://modelscope.cn/datasets/LLM360/K2Datasets

Resource overview:

# K2 Dataset Card

The following data mix was used to train [K2](https://huggingface.co/LLM360/K2) and achieve results in line with Llama 2 70B.

## Dataset Details

K2 was trained on 1.4T tokens across two stages. The data sources and data mix for each stage are listed below.

### Dataset Description: Stage 1

| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [dm-math](https://github.com/google-deepmind/mathematics_dataset) | 4.33B | 3x | 13B | 1% |
| pubmed-abstracts (from the Pile) | 4.77B | 3x | 14.3B | 1.1% |
| uspto (from the Pile) | 4.77B | 3x | 14.3B | 1.1% |
| pubmed-central (from the Pile) | 26B | 1x | 26B | 2% |
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 27.3B | 1x | 27.3B | 2.1% |
| [starcoder.spm](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [starcoder.fim](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% |
| [redpajama.stackexchange](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 61.1B | 1x | 61.1B | 4.7% |
| [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) | 132.6B | 0.5x | 66.3B | 5.1% |
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 76.7B | 1x | 76.7B | 5.9% |
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 80.6B | 1x | 80.6B | 6.2% |
| [s2orc](https://allenai.org/data/s2orc) | 107.9B | 1x | 107.9B | 8.3% |
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 22.1B | 6x | 132.6B | 10.2% |
| [refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 612.3B | 1x | 612.3B | 47.1% |
| Totals | - | - | 1.3T | 100% |

### Dataset Description: Stage 2

| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [open-web-math](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | 14.6B | 1x | 14.6B | 21% |
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| [simple-wiki](https://huggingface.co/datasets/allenai/dolma) | 4.3B | 1x | 4.3B | 6.2% |
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| [algebraic-stack](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | 10.9B | 1x | 10.9B | 15.7% |
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 2B | 0.5x† | 33.8B† | 2.9% |
| books | 5.8B | 1x | 5.8B | 8.3% |
| [peS2o](https://huggingface.co/datasets/allenai/peS2o) | 1.2B | 1x | 1.2B | 1.8% |
| [pubmed-central (from the Pile)](https://github.com/EleutherAI/pile-pubmedcentral) | 2B | 1x | 2B | 2.9% |
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 2B | 1x | 2B | 2.9% |
| python | 20.5B | 1x | 20.5B | 29.6% |
| [s2orc](https://allenai.org/data/s2orc) | 2B | 1x | 2B | 2.9% |
| Totals | - | - | 69.4B* | 100% |

*Differs slightly from the sum of the rows because of rounding.

†As listed in the source. These two values are inconsistent with the row's 2B starting tokens, its 2.9% share, and the 69.4B stage total, all of which imply a total of about 2B tokens for pile-of-law in Stage 2.

#### Data Collection and Processing

A step-by-step tutorial for reproducing K2's data preparation can be found in the [LLM360 Pretraining Suite](https://www.llm360.ai/pretraining.html).

## Bias, Risks, and Limitations

Users should be made aware of the risks, biases, and limitations of the dataset. More information is needed for further recommendations.

## Citation

**BibTeX:**

```bibtex
@misc{
      title={LLM360 K2-65B: Scaling Up Open and Transparent Language Models},
      author={The LLM360 Team},
      year={2024},
}
```
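The columns of the Stage 1 table are related by simple arithmetic: Total Tokens is Starting Tokens times the Multiplier, and % of Total is each row's share of the summed totals. The following is a minimal sketch of that calculation, not part of the LLM360 tooling. The row values are copied from the Stage 1 table, while the names `STAGE1` and `mix_summary` are hypothetical and chosen only for illustration.

```python
# Stage 1 rows copied from the dataset card:
# (dataset name, starting tokens in billions, multiplier)
STAGE1 = [
    ("dm-math", 4.33, 3.0),
    ("pubmed-abstracts", 4.77, 3.0),
    ("uspto", 4.77, 3.0),
    ("pubmed-central", 26.0, 1.0),
    ("redpajama.arxiv", 27.3, 1.0),
    ("starcoder.spm", 67.6, 0.5),
    ("starcoder.fim", 67.6, 0.5),
    ("redpajama.stackexchange", 61.1, 1.0),
    ("starcoder", 132.6, 0.5),
    ("pile-of-law", 76.7, 1.0),
    ("redpajama.book", 80.6, 1.0),
    ("s2orc", 107.9, 1.0),
    ("redpajama.wikipedia", 22.1, 6.0),
    ("refinedweb", 612.3, 1.0),
]


def mix_summary(rows):
    """Return ({name: (total_tokens_B, percent_of_total)}, grand_total_B).

    Hypothetical helper: total = starting tokens * multiplier,
    percent = 100 * total / sum of all totals.
    """
    totals = {name: start * mult for name, start, mult in rows}
    grand_total = sum(totals.values())
    summary = {
        name: (tokens, 100.0 * tokens / grand_total)
        for name, tokens in totals.items()
    }
    return summary, grand_total


summary, grand_total = mix_summary(STAGE1)

for name, (tokens, percent) in summary.items():
    print(f"{name:<26} {tokens:8.1f}B {percent:5.1f}%")
print(f"{'Totals':<26} {grand_total:8.1f}B")
```

Running the same helper on the Stage 2 rows is a quick way to see which published figures agree with the starting tokens and multipliers and which do not.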


Use cases: