
prolong-data-512K

ModelScope community · updated 2025-12-05 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/princeton-nlp/prolong-data-512K
Resource description:
# princeton-nlp/prolong-data-512K

[[Paper](https://arxiv.org/pdf/2410.02660)] [[HF Collection](https://huggingface.co/collections/princeton-nlp/prolong-66c72d55d2051a86ac7bd7e4)] [[Code](https://github.com/princeton-nlp/ProLong)]

**ProLong** (<u>Pr</u>incet<u>o</u>n <u>long</u>-context language models) is a family of long-context models produced by continued training and supervised fine-tuning of Llama-3-8B, with a maximum context window of 512K tokens. Our [main ProLong model](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct) is one of the best-performing long-context models at the 10B scale (evaluated by [HELMET](https://github.com/princeton-nlp/helmet)).

To train this strong long-context model, we conduct thorough ablations on the long-context pre-training data, SFT data, and numerous other design choices. We report our findings in our paper, [How to Train Long-Context Language Models (Effectively)](https://arxiv.org/pdf/2410.02660).

Authors: [Tianyu Gao](https://gaotianyu.xyz/about)\*, [Alexander Wettig](https://www.cs.princeton.edu/~awettig/)\*, [Howard Yen](https://howard-yen.github.io/), [Danqi Chen](https://www.cs.princeton.edu/~danqic/) (\* equal contribution)

Contact: `{tianyug, awettig}@princeton.edu`

## Dataset Loading

This dataset contains 31B tokens, tokenized with the Llama-3 tokenizer and packed into sequences of 524,288 tokens. The data is stored as **MDS** (Mosaic Data Shard) and requires [mosaicml-streaming](https://github.com/mosaicml/streaming) to be loaded. Instead of using `datasets.load_dataset`, download the data by cloning the repository or by calling the `huggingface_hub.snapshot_download` function.

When loading the datasets with [mosaicml-streaming](https://github.com/mosaicml/streaming), each entry has the following fields:

- `input_ids`: a 1-dimensional array of length 524,288 containing the token ids
- `indices`: a list of `(start_index, end_index)` tuples that identify the subsequences in `input_ids` belonging to separate documents. This is particularly important for short-context datasets that are packed to a sequence length of 524,288
- `domain`: (optional) string naming the dataset split

This dataset contains the following subsets as folders:

| Dataset | Tokens | Source | Sequence Length |
|---------|--------|--------|-----------------|
| `thestackv1_concat_by_repo-524288` | 3.2B | [the Stack](https://huggingface.co/datasets/bigcode/the-stack) | Fixed 524,288 |
| `thestackv1_concat_by_repo-65536` | 3.2B | [the Stack](https://huggingface.co/datasets/bigcode/the-stack) | Fixed 65,536 |
| `book-524288` | 2.1B | Books split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Fixed 524,288 |
| `book-65536` | 4.2B | Books split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Fixed 65,536 |
| `fineweb-edu` | 6.4B | [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | Variable |
| `fineweb-2023-50` | 6.4B | 2023-50 snapshot of [fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | Variable |
| `stackexchange` | 1B | Stackexchange split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `dolmawiki` | 1B | Wikipedia split of [Dolma](https://huggingface.co/datasets/allenai/dolma) | Variable |
| `tuluv2` | 250M | [tulu-v2](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | Variable |
| `arxiv` | 1B | ArXiv split of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Variable |
| `openwebmath` | 1B | [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) | Variable |
| `textbooks` | 750M | [TextbookChapters](https://huggingface.co/datasets/princeton-nlp/TextbookChapters) | Variable (majority 524,288) |

## The ProLong Models

- [princeton-nlp/Llama-3-8B-ProLong-64k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Base)
- [princeton-nlp/Llama-3-8B-ProLong-64k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Instruct)
- [princeton-nlp/Llama-3-8B-ProLong-512k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Base)
- ⭐ [princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)

## The ProLong Data

- Stage 1 64K training: [princeton-nlp/prolong-data-64K](https://huggingface.co/datasets/princeton-nlp/prolong-data-64K)
- Stage 2 512K training: [princeton-nlp/prolong-data-512K](https://huggingface.co/datasets/princeton-nlp/prolong-data-512K) ← you are here!

## Data Compositions

<p align="center">
  <img width="80%" alt="image" src="https://github.com/user-attachments/assets/a36a7d0f-4480-4a29-80f3-208477707fb7">
</p>
<p align="center" style="margin-top: 0;">
  <em>ProLong training data and recipe.</em>
</p>

## Citation

```bibtex
@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  year={2024},
}
```
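As a rough illustration of the loading steps described under Dataset Loading, the sketch below downloads one subset folder, opens it with mosaicml-streaming, and uses the `indices` field to cut a packed sequence back into separate documents. Several details are assumptions rather than facts from this card: the helper names (`download_subset`, `open_subset`, `split_documents`, `document_ids`) are hypothetical, each subset folder is assumed to be a self-contained MDS directory, and `end_index` is assumed to be exclusive.

```python
import os
from typing import List, Sequence, Tuple

REPO_ID = "princeton-nlp/prolong-data-512K"  # dataset repository named in this card


def download_subset(subset: str, local_dir: str) -> str:
    """Fetch one subset folder. Requires `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # lazy import: only needed here
    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=[f"{subset}/*"],  # restrict the download to one folder
        local_dir=local_dir,
    )


def open_subset(subset: str, local_dir: str):
    """Open a downloaded subset. Requires `pip install mosaicml-streaming`.

    Assumption: the subset folder itself is an MDS directory.
    """
    from streaming import LocalDataset  # lazy import: only needed here
    return LocalDataset(local=os.path.join(local_dir, subset))


def split_documents(input_ids: Sequence[int],
                    indices: Sequence[Tuple[int, int]]) -> List[Sequence[int]]:
    """Slice a packed sequence into documents (assumes half-open spans)."""
    docs = []
    for start, end in indices:
        if not 0 <= start <= end <= len(input_ids):
            raise ValueError(f"span ({start}, {end}) is outside the sequence")
        docs.append(input_ids[start:end])
    return docs


def document_ids(length: int, indices: Sequence[Tuple[int, int]]) -> List[int]:
    """Label each position with its document number, or -1 if no span covers it."""
    ids = [-1] * length
    for doc, (start, end) in enumerate(indices):
        for pos in range(start, end):
            ids[pos] = doc
    return ids


if __name__ == "__main__":
    # Toy stand-in for one entry; a real entry would come from open_subset(...)[i].
    toy_ids = list(range(100, 110))
    toy_indices = [(0, 4), (4, 10)]
    for doc in split_documents(toy_ids, toy_indices):
        print(len(doc), doc)
```

Per-position document labels such as those from `document_ids` are one way to build a mask that stops tokens from attending across document boundaries. Whether a training setup needs such a mask depends on its own recipe.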

Provider: maas
Created: 2025-08-15