prolong-data-64K
ModelScope community. Updated 2025-12-05; indexed 2025-09-13.
Download link: https://modelscope.cn/datasets/princeton-nlp/prolong-data-64K
# princeton-nlp/prolong-data-64K
[[Paper](https://arxiv.org/pdf/2410.02660)] [[HF Collection](https://huggingface.co/collections/princeton-nlp/prolong-66c72d55d2051a86ac7bd7e4)] [[Code](https://github.com/princeton-nlp/ProLong)]
**ProLong** (<u>Pr</u>incet<u>o</u>n <u>long</u>-context language models) is a family of long-context models obtained by continued training and supervised fine-tuning of Llama-3-8B, with a maximum context window of 512K tokens. Our [main ProLong model](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct) is one of the best-performing long-context models at the 10B scale (evaluated by [HELMET](https://github.com/princeton-nlp/helmet)).
To train this strong long-context model, we conduct thorough ablations on the long-context pre-training data, SFT data, and numerous other design choices. We demonstrate our findings in our paper, [How to Train Long-Context Language Models (Effectively)](https://arxiv.org/pdf/2410.02660).
Authors: [Tianyu Gao](https://gaotianyu.xyz/about)\*, [Alexander Wettig](https://www.cs.princeton.edu/~awettig/)\*, [Howard Yen](https://howard-yen.github.io/), [Danqi Chen](https://www.cs.princeton.edu/~danqic/) (* equal contribution)
Contact: `{tianyug, awettig}@princeton.edu`
## Dataset Loading
This dataset contains 31B tokens, tokenized with the Llama-3 tokenizer and packed into sequences of 65,536 tokens.
The data is stored as **MDS** (Mosaic Data Shard) and requires [mosaicml-streaming](https://github.com/mosaicml/streaming) to be loaded.
Instead of using `datasets.load_dataset`, download the data by cloning the repository or by calling the `huggingface_hub.snapshot_download` function.
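The download step can be sketched as follows. This is a minimal example, not the authors' script: the helper names `subset_patterns` and `download_prolong_64k` are hypothetical, and the glob patterns assume that each subset is a top-level folder in the repository.

```python
from pathlib import Path


def subset_patterns(subsets):
    """Turn subset folder names into glob patterns for `allow_patterns`."""
    # Assumption: every subset lives in a top-level folder of the same name.
    return [f"{name}/*" for name in subsets]


def download_prolong_64k(local_dir, subsets=None):
    """Download the whole dataset, or only the named subset folders."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="princeton-nlp/prolong-data-64K",
        repo_type="dataset",  # this is a dataset repository, not a model
        local_dir=str(Path(local_dir)),
        allow_patterns=subset_patterns(subsets) if subsets else None,
    )
```

For example, `download_prolong_64k("data/prolong-64k", subsets=["arxiv"])` would fetch only the `arxiv` folder, which is useful for inspecting the format before downloading all 31B tokens.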
When loading the datasets with [mosaicml-streaming](https://github.com/mosaicml/streaming), each entry has the following fields:
- `input_ids`: a 1-dimensional array of length 65,536 containing the token ids
- `indices`: a list of `(start_index, end_index)` tuples that identify the subsequences in `input_ids` that belong to separate documents. This is particularly important for short-context datasets, where many documents are packed into a single sequence of length 65,536
- `domain`: (optional) string of the dataset split
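A minimal sketch of how these fields might be used once a subset folder has been downloaded. The helper names are hypothetical, and the code assumes that `end_index` is exclusive (Python slice convention), which the field description does not state explicitly.

```python
def load_subset(subset_dir):
    """Open one downloaded subset folder (e.g. ".../arxiv") as an indexable dataset."""
    # Lazy import: requires the mosaicml-streaming package.
    from streaming import LocalDataset

    return LocalDataset(subset_dir)


def split_documents(input_ids, indices):
    """Slice a packed sequence back into per-document token lists."""
    # Assumption: each (start, end) pair is a half-open range [start, end).
    return [list(input_ids[int(start):int(end)]) for start, end in indices]


def document_ids(length, indices):
    """Label each position with its document number (-1 if no document covers it)."""
    ids = [-1] * length
    for doc, (start, end) in enumerate(indices):
        for pos in range(int(start), int(end)):
            ids[pos] = doc
    return ids
```

With a loaded sample, `split_documents(sample["input_ids"], sample["indices"])` recovers the individual documents, and the per-position labels from `document_ids` are one way to build a mask that keeps attention from crossing document boundaries.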
This dataset contains the following subsets as folders:
| Dataset | Tokens | Source | Sequence Length |
|---------|--------|--------|-----------------|
| `thestackv1_concat_by_repo-65536` | 6.4B | [the Stack](https://huggingface.co/datasets/bigcode/the-stack) | Fixed 65,536 |
| `book-65536` | 6.4B | Books split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Fixed 65,536 |
| `fineweb-edu` | 6.4B | [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | Variable |
| `fineweb-2023-50` | 6.4B | 2023-50 snapshot of [fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | Variable |
| `stackexchange` | 1B | Stackexchange split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `dolmawiki` | 1B | Wikipedia split of [Dolma](https://huggingface.co/datasets/allenai/dolma) | Variable |
| `tuluv2` | 250M | [tulu-v2](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | Variable |
| `arxiv` | 1B | ArXiv split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `openwebmath` | 1B | [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) | Variable |
| `textbooks` | 750M | [TextbookChapters](https://huggingface.co/datasets/princeton-nlp/TextbookChapters) | Variable (majority 65,536) |
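As a quick consistency check, the token counts in the table can be summed and converted into mixture shares. The sketch below copies the rounded figures from the table, so the derived totals and sequence counts are approximate; the function names are hypothetical.

```python
# Token counts in millions, copied from the table above (rounded values).
SUBSET_TOKENS_M = {
    "thestackv1_concat_by_repo-65536": 6400,
    "book-65536": 6400,
    "fineweb-edu": 6400,
    "fineweb-2023-50": 6400,
    "stackexchange": 1000,
    "dolmawiki": 1000,
    "tuluv2": 250,
    "arxiv": 1000,
    "openwebmath": 1000,
    "textbooks": 750,
}
SEQ_LEN = 65_536  # packed sequence length of this dataset


def total_tokens_m():
    """Total tokens across all subsets, in millions."""
    return sum(SUBSET_TOKENS_M.values())


def mixture_share(name):
    """Fraction of all tokens contributed by one subset."""
    return SUBSET_TOKENS_M[name] / total_tokens_m()


def approx_sequences(name):
    """Rough number of full 65,536-token sequences in one subset."""
    return SUBSET_TOKENS_M[name] * 1_000_000 // SEQ_LEN
```

The total comes to 30,600M tokens, which matches the stated 31B after rounding; each of the four 6.4B subsets contributes roughly 21% of the mixture.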
## The ProLong Models
- [princeton-nlp/Llama-3-8B-ProLong-64k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Base)
- [princeton-nlp/Llama-3-8B-ProLong-64k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Instruct)
- [princeton-nlp/Llama-3-8B-ProLong-512k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Base)
- ⭐ [princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)
## The ProLong Data
- Stage 1 64K training: [princeton-nlp/prolong-data-64K](https://huggingface.co/datasets/princeton-nlp/prolong-data-64K) ← you are here!
- Stage 2 512K training: [princeton-nlp/prolong-data-512K](https://huggingface.co/datasets/princeton-nlp/prolong-data-512K)
## Data Compositions
<p align="center">
<img width="80%" alt="image" src="https://github.com/user-attachments/assets/a36a7d0f-4480-4a29-80f3-208477707fb7">
</p>
<p align="center" style="margin-top: 0;">
<em>ProLong training data and recipe.</em>
</p>
## Citation
```bibtex
@article{gao2024prolong,
title={How to Train Long-Context Language Models (Effectively)},
author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
year={2024},
}
```
Provider: maas
Created: 2025-08-16



