CofeAI/NanoData
Source: Hugging Face. Updated 2024-06-11; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/CofeAI/NanoData
Resource description:
---
license: other
license_name: other
license_link: LICENSE
task_categories:
- text-generation
language:
- en
size_categories:
- 100B<n<1T
---
### Dataset Description
To help researchers use [NanoLM](https://github.com/cofe-ai/nanoLM?tab=readme-ov-file) for comparative analysis across different model designs, we build a curated pre-training dataset from the data sources of existing large-scale models (e.g., Llama, Falcon, GPT-3). It covers diverse domains to improve the generalization of the resulting models.
#### Dataset Creation
The data is mainly post-processed and filtered from [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [RedPajamaV2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2).
We develop a series of cleaning steps that remove redundant formatting, garbled characters, formula errors, duplicated paragraphs, low-quality text, and other unwanted content. After interleaved document-level deduplication within each independent subset, we obtain the final high-quality dataset.
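
The exact cleaning and deduplication code is not described here, so the following is only a minimal sketch of document-level exact deduplication, assuming each subset is available as a list of strings. The helper names `normalize` and `dedup_documents` are hypothetical, and hashing whitespace-normalized, lowercased text is an assumption rather than the pipeline actually used for NanoData.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_documents(docs):
    """Keep the first occurrence of each document, comparing normalized text by hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Apply the filter to one subset at a time, mirroring the per-subset description.
subset = ["Hello  world.", "hello world.", "A different document."]
print(dedup_documents(subset))
```

Exact hashing only catches documents that are identical after normalization; near-duplicate detection (for example MinHash) would need a different approach.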
#### Dataset Summary
| Dataset | Num Tokens (B) |
| -------------- | -------------- |
| CommonCrawl | 67.00 |
| C4 | 15.00 |
| Wikipedia (En) | 5.14 |
| Books | 4.48 |
| ArXiv | 2.50 |
| StackExchange | 2.00 |
| Total | 97.12 |

We release approximately 100B tokens of data. We also recommend adding a code dataset such as [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) or [The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) to improve the model's performance on code and reasoning.
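
To make the mixture concrete, the sketch below converts the per-subset token counts from the table into sampling shares and estimates how many code tokens would be needed to reach a chosen code fraction. The function names and the 10% code fraction are illustrative assumptions, not recommendations from the dataset authors. Shares are computed against the sum of the listed rows, so they add up to 1 regardless of the reported total.

```python
# Token counts in billions, copied from the summary table.
token_counts_b = {
    "CommonCrawl": 67.00,
    "C4": 15.00,
    "Wikipedia (En)": 5.14,
    "Books": 4.48,
    "ArXiv": 2.50,
    "StackExchange": 2.00,
}


def mixture_shares(counts):
    """Return each subset's fraction of the summed token count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def code_tokens_needed(text_tokens_b, code_fraction):
    """Billions of code tokens to add so code makes up code_fraction of the final mix."""
    if not 0 <= code_fraction < 1:
        raise ValueError("code_fraction must be in [0, 1)")
    return text_tokens_b * code_fraction / (1 - code_fraction)


shares = mixture_shares(token_counts_b)
for name, share in shares.items():
    print(f"{name:15s} {share:6.1%}")

# Hypothetical target: code should be 10% of the combined corpus.
text_total_b = sum(token_counts_b.values())
print(f"code tokens to add: {code_tokens_needed(text_total_b, 0.10):.1f}B")
```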
### Citation
To cite NanoLM, please use:
```
@misc{yao2024nanolm,
title={nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales},
author={Yiqun Yao and Siqi Fan and Xiusheng Huang and Xuezhi Fang and Xiang Li and Ziyi Ni and Xin Jiang and Xuying Meng and Peng Han and Shuo Shang and Kang Liu and Aixin Sun and Yequan Wang},
year={2024},
eprint={2304.06875},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Acknowledgement
The data is mainly curated and filtered from [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [RedPajamaV2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2). We extend our gratitude to the original authors for their innovative work and for making it available to the community.
### License
The NanoLM code used for dataset processing and loss prediction is licensed under the Apache 2.0 license.
For the curated data, please refer to the licenses of the original sources:
* [Common Crawl Foundation Terms of Use](https://commoncrawl.org/terms-of-use)
* [C4 license](https://huggingface.co/datasets/allenai/c4#license)
* Books: [the_pile_books3 license](https://huggingface.co/datasets/defunct-datasets/the_pile_books3#licensing-information) and [pg19 license](https://huggingface.co/datasets/deepmind/pg19#licensing-information)
* [ArXiv Terms of Use](https://info.arxiv.org/help/api/tou.html)
* [Wikipedia License](https://huggingface.co/datasets/legacy-datasets/wikipedia#licensing-information)
* [StackExchange license on the Internet Archive](https://archive.org/details/stackexchange)

Provider: CofeAI
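### Summary of Original Information
#### Dataset Description
#### Dataset Creation
This dataset was built to support researchers who use NanoLM for comparative analysis across different model designs. The data is mainly processed and filtered from RedPajama and RedPajamaV2 through a series of cleaning steps, including the removal of redundant formatting, garbled characters, formula errors, duplicated paragraphs, and low-quality text.
#### Dataset Summary
| Dataset | Num Tokens (B) |
| -------------- | -------------- |
| CommonCrawl | 67.00 |
| C4 | 15.00 |
| Wikipedia (En) | 5.14 |
| Books | 4.48 |
| ArXiv | 2.50 |
| StackExchange | 2.00 |
| Total | 97.12 |

The dataset contains about 100B tokens. Users are advised to add code datasets such as StarCoderData and The Stack V2 to improve model performance on code and reasoning.
#### License
The curated data remains subject to the licenses of its original sources, which are listed in the License section above.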



