LongCorpus-2.5B
ModelScope community · Updated 2025-12-10 · Indexed 2025-01-25
Download link:
https://modelscope.cn/datasets/DAMO-NLP-SG/LongCorpus-2.5B
Description:
We collect a 2.5B-token training dataset from various domains for long-context continual pre-training. Its composition is as follows (partially inspired by [Long-Data-Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections)):

| Domain | Proportion | Source |
| ------------- | ---------- | ------ |
| Book | 40% | [Redpajama-Book](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| Arxiv | 20% | [Redpajama-Arxiv](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| General | 20% | [Redpajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| Code | 10% | [LCC-Python](https://huggingface.co/datasets/microsoft/LCC_python) |
| QA | 5% | [Natural Questions](https://ai.google.com/research/NaturalQuestions/) |
| Summarization | 5% | [BookSum](https://github.com/salesforce/booksum) |
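
To make the proportions concrete, the following sketch converts the table into approximate per-domain token counts. Only the 2.5B total and the percentages come from the table; the names `TOTAL_TOKENS`, `PROPORTIONS`, and `token_budget` are hypothetical, and the resulting counts are plain arithmetic rather than published figures.

```python
# Illustrative arithmetic only: the total and percentages are taken from the
# composition table; the derived counts are approximations, not official numbers.
TOTAL_TOKENS = 2_500_000_000

# Percentage of the training set assigned to each domain (from the table).
PROPORTIONS = {
    "Book": 40,
    "Arxiv": 20,
    "General": 20,
    "Code": 10,
    "QA": 5,
    "Summarization": 5,
}


def token_budget(total: int, proportions: dict) -> dict:
    """Return the approximate number of tokens per domain."""
    # The percentages must cover the whole dataset.
    assert sum(proportions.values()) == 100
    return {domain: total * pct // 100 for domain, pct in proportions.items()}


budget = token_budget(TOTAL_TOKENS, PROPORTIONS)

if __name__ == "__main__":
    for domain, tokens in budget.items():
        print(f"{domain:<14}{tokens / 1e9:.3f}B tokens")
```

Under these numbers, Book accounts for roughly 1B tokens, and QA and Summarization for roughly 125M tokens each.
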

We have also curated a 250M-token test dataset drawn from the same sources. Test samples were selected so that their average n-gram similarity (for n = 2, 3, 4) with the training set is below 10%. In practice, this threshold excludes all QA and Summarization data, so the test corpus contains only the Book, Arxiv, General, and Code categories, with tokens distributed in a 4:2:2:1 ratio, respectively.
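
The selection criterion can be sketched roughly as follows. The description does not define how n-gram similarity is computed, so this sketch makes two assumptions: documents are tokenized by whitespace, and similarity for a given n is the fraction of a candidate's distinct n-grams that also occur anywhere in the training set. All function names are hypothetical, and this is not the authors' actual pipeline.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGRAM_SIZES = (2, 3, 4)  # n = 2, 3, 4, as stated in the description
THRESHOLD = 0.10         # keep candidates with average similarity below 10%


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(train_docs: Iterable[str]) -> Dict[int, Set[Tuple[str, ...]]]:
    """Collect every training n-gram, per n (assumption: whitespace tokens)."""
    index: Dict[int, Set[Tuple[str, ...]]] = {n: set() for n in NGRAM_SIZES}
    for doc in train_docs:
        tokens = doc.split()
        for n in NGRAM_SIZES:
            index[n] |= ngrams(tokens, n)
    return index


def avg_ngram_similarity(doc: str, index: Dict[int, Set[Tuple[str, ...]]]) -> float:
    """Average, over n, of the share of the doc's n-grams seen in training.

    Assumption: this overlap ratio stands in for the unspecified metric.
    """
    tokens = doc.split()
    scores = []
    for n in NGRAM_SIZES:
        grams = ngrams(tokens, n)
        # A document too short to form any n-gram contributes zero overlap.
        scores.append(len(grams & index[n]) / len(grams) if grams else 0.0)
    return sum(scores) / len(scores)


def select_test_docs(candidates: Iterable[str], train_docs: Iterable[str]) -> List[str]:
    """Keep only candidates whose average similarity is below the threshold."""
    index = build_index(train_docs)
    return [d for d in candidates if avg_ngram_similarity(d, index) < THRESHOLD]
```

For example, calling `select_test_docs` with a candidate that repeats a training document verbatim drops it, while a candidate sharing no bigrams, trigrams, or 4-grams with the training set is kept. At the scale of a 2.5B-token corpus, an in-memory set of n-grams would likely be replaced by hashing or sampling, which this sketch does not attempt.
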
Provider:
maas
Created:
2025-01-20



