prolong-data-512K

Name: prolong-data-512K
Creator: maas
Published: 2025-12-05 11:51:06
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/princeton-nlp/prolong-data-512K

Description:

# princeton-nlp/prolong-data-512K

[[Paper](https://arxiv.org/pdf/2410.02660)] [[HF Collection](https://huggingface.co/collections/princeton-nlp/prolong-66c72d55d2051a86ac7bd7e4)] [[Code](https://github.com/princeton-nlp/ProLong)]

**ProLong** (Princeton long-context language models) is a family of long-context models produced by continued training and supervised fine-tuning of Llama-3-8B, with a maximum context window of 512K tokens. Our [main ProLong model](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct) is one of the best-performing long-context models at the 10B scale (evaluated by [HELMET](https://github.com/princeton-nlp/helmet)).

To train this strong long-context model, we conduct thorough ablations on the long-context pre-training data, SFT data, and numerous other design choices. We present our findings in our paper, [How to Train Long-Context Language Models (Effectively)](https://arxiv.org/pdf/2410.02660).

Authors: [Tianyu Gao](https://gaotianyu.xyz/about)\*, [Alexander Wettig](https://www.cs.princeton.edu/~awettig/)\*, [Howard Yen](https://howard-yen.github.io/), [Danqi Chen](https://www.cs.princeton.edu/~danqic/) (* equal contribution)

Contact: `{tianyug, awettig}@princeton.edu`

## Dataset Loading

This dataset contains 31B tokens, tokenized with the Llama-3 tokenizer and packed into sequences of 524,288 tokens. The data is stored in **MDS** (Mosaic Data Shard) format and requires [mosaicml-streaming](https://github.com/mosaicml/streaming) to be loaded. Instead of using `datasets.load_dataset`, download the data by cloning the repository or by calling the `huggingface_hub.snapshot_download` function.

When loading the datasets with [mosaicml-streaming](https://github.com/mosaicml/streaming), each entry has the following fields:
- `input_ids`: a 1-dimensional array of length 524,288 containing the token ids
- `indices`: a list of `(start_index, end_index)` tuples that identify the subsequences in `input_ids` belonging to separate documents. This is particularly important for short-context datasets that are packed to a sequence length of 524,288
- `domain`: (optional) a string naming the dataset split

This dataset contains the following subsets as folders:

| Dataset | Tokens | Source | Sequence Length |
|---------|--------|--------|-----------------|
| `thestackv1_concat_by_repo-524288` | 3.2B | [the Stack](https://huggingface.co/datasets/bigcode/the-stack) | Fixed 524,288 |
| `thestackv1_concat_by_repo-65536` | 3.2B | [the Stack](https://huggingface.co/datasets/bigcode/the-stack) | Fixed 65,536 |
| `book-524288` | 2.1B | Books split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Fixed 524,288 |
| `book-65536` | 4.2B | Books split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Fixed 65,536 |
| `fineweb-edu` | 6.4B | [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | Variable |
| `fineweb-2023-50` | 6.4B | 2023-50 snapshot of [fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | Variable |
| `stackexchange` | 1B | Stackexchange split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `dolmawiki` | 1B | Wikipedia split of [Dolma](https://huggingface.co/datasets/allenai/dolma) | Variable |
| `tuluv2` | 250M | [tulu-v2](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | Variable |
| `arxiv` | 1B | ArXiv split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `openwebmath` | 1B | [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) | Variable |
| `textbooks` | 750M | [TextbookChapters](https://huggingface.co/datasets/princeton-nlp/TextbookChapters) | Variable (majority 524,288) |

## The ProLong Models

- [princeton-nlp/Llama-3-8B-ProLong-64k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Base)
- [princeton-nlp/Llama-3-8B-ProLong-64k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Instruct)
- [princeton-nlp/Llama-3-8B-ProLong-512k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Base)
- ⭐ [princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)

## The ProLong Data

- Stage 1 64K training: [princeton-nlp/prolong-data-64K](https://huggingface.co/datasets/princeton-nlp/prolong-data-64K)
- Stage 2 512K training: [princeton-nlp/prolong-data-512K](https://huggingface.co/datasets/princeton-nlp/prolong-data-512K) ← you are here!

## Data Compositions

<img width="80%" alt="image" src="https://github.com/user-attachments/assets/a36a7d0f-4480-4a29-80f3-208477707fb7">

ProLong training data and recipe.

## Citation

```bibtex
@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  year={2024},
}
```
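
## Loading Sketch

The steps described under Dataset Loading can be sketched roughly as follows. This is a minimal illustration, not the authors' loader: the helper names `load_subset` and `split_documents` are hypothetical, the code assumes each subset folder holds its own MDS index and shards, and it assumes every `(start_index, end_index)` pair is half-open (the document is `input_ids[start_index:end_index]`). Check both assumptions against the actual data before relying on them.

```python
from typing import Iterable, List, Sequence, Tuple


def split_documents(
    input_ids: Sequence[int],
    indices: Iterable[Tuple[int, int]],
) -> List[Sequence[int]]:
    """Slice one packed sequence back into its separate documents.

    Assumption: each (start_index, end_index) pair is half-open, so a
    document covers input_ids[start_index:end_index].
    Works on plain lists and on NumPy arrays alike.
    """
    return [input_ids[int(start):int(end)] for start, end in indices]


def load_subset(subset: str, local_dir: str = "prolong-data-512K"):
    """Download one subset folder and open it with mosaicml-streaming.

    Assumption: the folder `<subset>/` contains its own MDS index and shards.
    The imports are local so that split_documents stays usable even when
    huggingface_hub and mosaicml-streaming are not installed.
    """
    from huggingface_hub import snapshot_download
    from streaming import StreamingDataset

    # Fetch only the files of the requested subset folder.
    root = snapshot_download(
        repo_id="princeton-nlp/prolong-data-512K",
        repo_type="dataset",
        allow_patterns=[f"{subset}/*"],
        local_dir=local_dir,
    )
    # Read the already-downloaded shards from local disk, in stored order.
    return StreamingDataset(local=f"{root}/{subset}", shuffle=False)
```

With the packages installed, `dataset = load_subset("tuluv2")` followed by `split_documents(dataset[0]["input_ids"], dataset[0]["indices"])` should recover the individual documents of the first packed sequence. Restricting the download with `allow_patterns` avoids pulling all 31B tokens when only one subset is needed.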


Provider: maas
Created: 2025-08-15
