
LongCorpus-2.5B

Source: ModelScope community. Updated 2025-12-10. Indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/DAMO-NLP-SG/LongCorpus-2.5B
Resource description:
We collected a 2.5B-token training dataset from various domains for long-context continual pre-training. The composition of this dataset is as follows (partially inspired by [Long-Data-Collection](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections)):

| Domain | Proportion | Source |
| ------------- | ---------- | ------ |
| Book | 40% | [Redpajama-Book](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| Arxiv | 20% | [Redpajama-Arxiv](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| General | 20% | [Redpajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) |
| Code | 10% | [LCC-Python](https://huggingface.co/datasets/microsoft/LCC_python) |
| QA | 5% | [Natural Questions](https://ai.google.com/research/NaturalQuestions/) |
| Summarization | 5% | [BookSum](https://github.com/salesforce/booksum) |

We have also curated a test dataset of 250 million tokens, drawn from the same composition. Test samples were selected so that their average n-gram similarity (for n = 2, 3, 4) with the training set is below 10%. This threshold effectively excludes all QA and Summarization data, so the tokens in the resulting test corpus are distributed across the Book, Arxiv, General, and Code categories in a ratio of 4:2:2:1, respectively.
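The description does not say how the n-gram similarity is computed, so the following is only a minimal sketch of one plausible reading: for each candidate test document, take the fraction of its distinct n-grams that also occur anywhere in the training set, average that fraction over n = 2, 3, 4, and keep the document only if the average is below 10%. The function names (`ngrams`, `build_train_index`, `avg_ngram_similarity`, `keep_for_test`), the use of pre-tokenized lists of strings, and the set-overlap definition of similarity are all assumptions made for illustration, not the dataset authors' actual pipeline.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGRAM_SIZES = (2, 3, 4)   # n values named in the dataset description
THRESHOLD = 0.10          # "below 10%" from the dataset description

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of distinct n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_docs: Iterable[List[str]]) -> Dict[int, Set[NGram]]:
    """Collect every n-gram seen in the training documents, per n."""
    index: Dict[int, Set[NGram]] = {n: set() for n in NGRAM_SIZES}
    for tokens in train_docs:
        for n in NGRAM_SIZES:
            index[n] |= ngrams(tokens, n)
    return index


def avg_ngram_similarity(tokens: List[str], index: Dict[int, Set[NGram]]) -> float:
    """Average, over n, of the share of candidate n-grams found in training data.

    Assumed metric: overlap of distinct n-grams (hypothetical choice).
    """
    scores = []
    for n in NGRAM_SIZES:
        candidate = ngrams(tokens, n)
        if not candidate:  # document shorter than n tokens
            continue
        scores.append(len(candidate & index[n]) / len(candidate))
    return sum(scores) / len(scores) if scores else 0.0


def keep_for_test(tokens: List[str], index: Dict[int, Set[NGram]],
                  threshold: float = THRESHOLD) -> bool:
    """Keep a candidate only if its average similarity is below the threshold."""
    return avg_ngram_similarity(tokens, index) < threshold
```

Under this reading, QA and Summarization samples would be dropped if they share most of their question templates or source passages with the training split, which is consistent with the description's remark that the threshold excludes them. An in-memory set works only for small corpora; at the 2.5B-token scale a hashed or approximate index would likely be needed.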

Provider:
maas
Created:
2025-01-20