LongCorpus-2.5B

Name: LongCorpus-2.5B
Creator: maas
Published: 2025-12-10 16:21:09
License: No description available

ModelScope community; updated 2025-12-10; indexed 2025-01-25

Download link:

https://modelscope.cn/datasets/DAMO-NLP-SG/LongCorpus-2.5B


Description:

We collect a 2.5B-token training dataset from various domains for long-context continual pre-training. The composition of this dataset is as follows (partially inspired by [Long-Data-Collection](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections)):

| Domain | Proportion | Source |
| ------------- | ---------- | ------ |
| Book | 40% | [Redpajama-Book](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| Arxiv | 20% | [Redpajama-Arxiv](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| General | 20% | [Redpajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| Code | 10% | [LCC-Python](https://huggingface.co/datasets/microsoft/LCC_python) |
| QA | 5% | [Natural Questions](https://ai.google.com/research/NaturalQuestions/) |
| Summarization | 5% | [BookSum](https://github.com/salesforce/booksum) |

We have also curated a test dataset comprising 250 million tokens, mirroring the same composition. The selection criteria ensured that the average n-gram similarity (for n=2, 3, 4) with the training set is below 10%. This threshold effectively excludes all QA and Summarization data, resulting in a test corpus where the distribution of tokens across Book, Arxiv, General, and Code categories follows a ratio of 4:2:2:1, respectively.
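To make the proportions concrete, the following is a rough sketch of the per-domain token budgets implied by the description. It assumes the totals are exactly 2,500,000,000 and 250,000,000 tokens, which the description only states approximately; the function and constant names are illustrative and not part of the dataset.

```python
# Nominal sizes taken from the description; the exact token counts are an assumption.
TRAIN_TOKENS = 2_500_000_000
TEST_TOKENS = 250_000_000

# Training mix in whole percent, as listed in the composition table.
TRAIN_MIX_PERCENT = {
    "Book": 40,
    "Arxiv": 20,
    "General": 20,
    "Code": 10,
    "QA": 5,
    "Summarization": 5,
}

# Test mix after QA and Summarization are filtered out (ratio 4:2:2:1).
TEST_RATIO = {"Book": 4, "Arxiv": 2, "General": 2, "Code": 1}


def train_budget() -> dict:
    """Approximate number of training tokens per domain."""
    return {d: TRAIN_TOKENS * pct // 100 for d, pct in TRAIN_MIX_PERCENT.items()}


def test_budget() -> dict:
    """Approximate number of test tokens per domain (integer division, so rounded down)."""
    total_weight = sum(TEST_RATIO.values())
    return {d: TEST_TOKENS * w // total_weight for d, w in TEST_RATIO.items()}


if __name__ == "__main__":
    for domain, tokens in train_budget().items():
        print(f"train {domain:<14} {tokens:>13,}")
    for domain, tokens in test_budget().items():
        print(f"test  {domain:<14} {tokens:>13,}")
```

Under these assumptions, Book accounts for roughly 1B training tokens and roughly 111M test tokens.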

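The description does not specify how the n-gram similarity is computed, so the following is only a minimal sketch of one plausible reading: for each n in {2, 3, 4}, take the fraction of a candidate document's distinct n-grams that also occur in the training set, average the three fractions, and keep the document for the test set if the average is below 10%. Whitespace tokenization and all function names here are assumptions for illustration, not the authors' pipeline.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGRAM_SIZES = (2, 3, 4)  # n values named in the description
THRESHOLD = 0.10         # "below 10%" from the description

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[Ngram]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(train_docs: Iterable[str]) -> Dict[int, Set[Ngram]]:
    """Collect every training n-gram, keyed by n (assumes whitespace tokens)."""
    index: Dict[int, Set[Ngram]] = {n: set() for n in NGRAM_SIZES}
    for doc in train_docs:
        tokens = doc.split()
        for n in NGRAM_SIZES:
            index[n] |= ngrams(tokens, n)
    return index


def avg_ngram_similarity(doc: str, index: Dict[int, Set[Ngram]]) -> float:
    """Average, over n, of the share of the document's n-grams seen in training."""
    tokens = doc.split()
    scores = []
    for n in NGRAM_SIZES:
        grams = ngrams(tokens, n)
        # A document too short to have any n-grams contributes zero overlap.
        scores.append(len(grams & index[n]) / len(grams) if grams else 0.0)
    return sum(scores) / len(scores)


def keep_for_test(doc: str, index: Dict[int, Set[Ngram]]) -> bool:
    """True if the document is dissimilar enough to enter the test set."""
    return avg_ngram_similarity(doc, index) < THRESHOLD
```

A real pipeline over billions of tokens would likely need hashed or sampled n-grams rather than in-memory sets; this sketch only illustrates the thresholding logic.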

Provider:

maas

Created:

2025-01-20
