prolong-data-64K

Name: prolong-data-64K
Creator: maas
Published: 2025-12-05 11:51:06
License: no description available

ModelScope community. Updated 2025-12-05; indexed 2025-09-13.

Download link:

https://modelscope.cn/datasets/princeton-nlp/prolong-data-64K


Description:

# princeton-nlp/prolong-data-64K

[[Paper](https://arxiv.org/pdf/2410.02660)] [[HF Collection](https://huggingface.co/collections/princeton-nlp/prolong-66c72d55d2051a86ac7bd7e4)] [[Code](https://github.com/princeton-nlp/ProLong)]

**ProLong** (Princeton long-context language models) is a family of long-context models obtained by continued training and supervised fine-tuning from Llama-3-8B, with a maximum context window of 512K tokens. Our [main ProLong model](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct) is one of the best-performing long-context models at the 10B scale (evaluated by [HELMET](https://github.com/princeton-nlp/helmet)).

To train this strong long-context model, we conduct thorough ablations on the long-context pre-training data, SFT data, and numerous other design choices. We present our findings in our paper, [How to Train Long-Context Language Models (Effectively)](https://arxiv.org/pdf/2410.02660).

Authors: [Tianyu Gao](https://gaotianyu.xyz/about)\*, [Alexander Wettig](https://www.cs.princeton.edu/~awettig/)\*, [Howard Yen](https://howard-yen.github.io/), [Danqi Chen](https://www.cs.princeton.edu/~danqic/) (\* equal contribution)

Contact: `{tianyug, awettig}@princeton.edu`

## Dataset Loading

This dataset contains 31B tokens, tokenized with the Llama-3 tokenizer and packed into sequences of 65,536 tokens. The data is stored as **MDS** (Mosaic Data Shard) and requires [mosaicml-streaming](https://github.com/mosaicml/streaming) to be loaded. Instead of using `datasets.load_dataset`, download the data by cloning the repository or by calling the `huggingface_hub.snapshot_download` function.

When loading the datasets with [mosaicml-streaming](https://github.com/mosaicml/streaming), each entry has the following fields:

- `input_ids`: a 1-dimensional array of length 65,536 containing the token ids
- `indices`: a list of `(start_index, end_index)` tuples that identify the subsequences in `input_ids` of separate documents. This is particularly important for short-context datasets that are packed to 524,288 sequence length
- `domain`: (optional) string of the dataset split

This dataset contains the following subsets as folders:

| Dataset | Tokens | Source | Sequence Length |
|---------|--------|--------|-----------------|
| `thestackv1_concat_by_repo-65536` | 6.4B | [the Stack](https://huggingface.co/datasets/bigcode/the-stack) | Fixed 65,536 |
| `book-65536` | 6.4B | Books split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Fixed 65,536 |
| `fineweb-edu` | 6.4B | [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | Variable |
| `fineweb-2023-50` | 6.4B | 2023-50 snapshot of [fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | Variable |
| `stackexchange` | 1B | Stackexchange split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `dolmawiki` | 1B | Wikipedia split of [Dolma](https://huggingface.co/datasets/allenai/dolma) | Variable |
| `tuluv2` | 250M | [tulu-v2](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | Variable |
| `arxiv` | 1B | ArXiv split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `openwebmath` | 1B | [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) | Variable |
| `textbooks` | 750M | [TextbookChapters](https://huggingface.co/datasets/princeton-nlp/TextbookChapters) | Variable (majority 65,536) |

## The ProLong Models

- [princeton-nlp/Llama-3-8B-ProLong-64k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Base)
- [princeton-nlp/Llama-3-8B-ProLong-64k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Instruct)
- [princeton-nlp/Llama-3-8B-ProLong-512k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Base)
- ⭐ [princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)

## The ProLong Data

- Stage 1 64K training: [princeton-nlp/prolong-data-64K](https://huggingface.co/datasets/princeton-nlp/prolong-data-64K) ← you are here!
- Stage 2 128K training: [princeton-nlp/prolong-data-512K](https://huggingface.co/datasets/princeton-nlp/prolong-data-512K)

## Data Compositions

<img width="80%" alt="image" src="https://github.com/user-attachments/assets/a36a7d0f-4480-4a29-80f3-208477707fb7">

ProLong training data and recipe.

## Citation

```bibtex
@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  year={2024},
}
```
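
## Loading Example (sketch)

The Dataset Loading section describes downloading a subset folder and reading it with mosaicml-streaming, then using `indices` to recover the documents packed into each `input_ids` sequence. The following is a minimal sketch of those steps, not the authors' own loader. It makes several assumptions: that the Hugging Face repository id matches the ModelScope mirror, that each subset folder directly contains the MDS `index.json`, and that every `(start_index, end_index)` pair is half-open (the document is `input_ids[start:end]`). The helper names `download_subset`, `open_subset`, `split_documents`, and `document_ids` are hypothetical.

```python
import os
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def download_subset(subset: str, local_dir: str) -> str:
    """Download one subset folder and return its local path."""
    # Imported lazily so the pure-Python helpers below work without it.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id="princeton-nlp/prolong-data-64K",  # assumed repository id
        repo_type="dataset",
        allow_patterns=[f"{subset}/*"],  # fetch only the requested folder
        local_dir=local_dir,
    )
    return os.path.join(root, subset)


def open_subset(path: str):
    """Open a downloaded MDS folder for local, indexable reading."""
    from streaming import LocalDataset  # from the mosaicml-streaming package

    return LocalDataset(local=path)


def split_documents(input_ids: Sequence[int], indices: Sequence[Span]) -> List[list]:
    """Slice a packed sequence into one token list per document.

    Assumes half-open spans: a document is input_ids[start:end].
    """
    return [list(input_ids[int(start):int(end)]) for start, end in indices]


def document_ids(length: int, indices: Sequence[Span]) -> List[int]:
    """Label each position with the index of its document (-1 if uncovered)."""
    labels = [-1] * length
    for doc_id, (start, end) in enumerate(indices):
        for pos in range(int(start), int(end)):
            labels[pos] = doc_id
    return labels


# Toy stand-in for one entry: three documents packed into six tokens.
packed = [11, 12, 13, 21, 22, 31]
spans = [(0, 3), (3, 5), (5, 6)]
docs = split_documents(packed, spans)
labels = document_ids(len(packed), spans)
```

With real data, one would call something like `ds = open_subset(download_subset("arxiv", "./prolong-data-64K"))` and then pass `ds[0]["input_ids"]` and `ds[0]["indices"]` to `split_documents`. The per-position labels from `document_ids` are one way to build a mask that keeps attention from crossing document boundaries; whether that matches the training recipe should be checked against the ProLong code.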


Provider: maas
Created: 2025-08-16
