
CofeAI/NanoData

Hugging Face, updated 2024-06-11, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/CofeAI/NanoData
Description:
---
license: other
license_name: other
license_link: LICENSE
task_categories:
- text-generation
language:
- en
size_categories:
- 100B<n<1T
---

### Dataset Description

To help researchers use [NanoLM](https://github.com/cofe-ai/nanoLM?tab=readme-ov-file) for comparative analysis across different model designs, we built a curated pre-training dataset from the data sources of existing large-scale models (i.e., Llama, Falcon, GPT-3). It covers diverse domains to improve the generalization capabilities of the resulting models.

#### Dataset Creation

The data is mainly post-processed and filtered from [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [RedPajamaV2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2). We developed a series of cleaning steps to remove redundant formatting, garbled characters, formula errors, duplicated paragraphs, low-quality text, and other unwanted content. After interleaved document-level deduplication of each independent subset, we obtain a high-quality dataset.

#### Dataset Summary

| Dataset        | Num Tokens (B) |
| -------------- | -------------- |
| CommonCrawl    | 67.00          |
| C4             | 15.00          |
| Wikipedia (En) | 5.14           |
| Books          | 4.48           |
| ArXiv          | 2.50           |
| StackExchange  | 2.00           |
| Total          | 97.12          |

We release the data with approximately 100B tokens. We also recommend adding code datasets such as [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) or [The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) to improve the model's performance on code and reasoning.
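As a rough illustration of how the summary table translates into a data mixture, the sketch below computes each subset's share of the tokens. The counts are copied from the table; the names `TOKENS_B` and `mixture_shares` are made up for this example and are not part of NanoLM. Note that the six listed rows add up to 96.12B, slightly below the stated total of 97.12B, so the sketch normalizes by the sum of the rows rather than by the stated total.

```python
# Token counts in billions, copied from the Dataset Summary table.
TOKENS_B = {
    "CommonCrawl": 67.00,
    "C4": 15.00,
    "Wikipedia (En)": 5.14,
    "Books": 4.48,
    "ArXiv": 2.50,
    "StackExchange": 2.00,
}


def mixture_shares(tokens_b):
    """Return each subset's fraction of the summed token count."""
    total = sum(tokens_b.values())
    return {name: count / total for name, count in tokens_b.items()}


if __name__ == "__main__":
    print(f"sum of listed subsets: {sum(TOKENS_B.values()):.2f}B")
    for name, share in mixture_shares(TOKENS_B).items():
        # Print each subset with its share as a percentage.
        print(f"{name:<15} {share:6.1%}")
```

If a code dataset is mixed in as recommended, adding one more entry to the dictionary with its token count would rebalance the shares accordingly.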

### Citation

To cite NanoLM, please use:

```
@misc{yao2024nanolm,
      title={nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales},
      author={Yiqun Yao and Siqi fan and Xiusheng Huang and Xuezhi Fang and Xiang Li and Ziyi Ni and Xin Jiang and Xuying Meng and Peng Han and Shuo Shang and Kang Liu and Aixin Sun and Yequan Wang},
      year={2024},
      eprint={2304.06875},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Acknowledgement

The data is mainly curated and filtered from [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [RedPajamaV2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2). We extend our gratitude to the original authors for their innovative work and for making it available to the community.

### License

The NanoLM code used for dataset processing and loss prediction is licensed under the Apache 2.0 license. For the curated data, please refer to the licenses of the original sources:

* [Common Crawl Foundation Terms of Use](https://commoncrawl.org/terms-of-use)
* [C4 license](https://huggingface.co/datasets/allenai/c4#license)
* Books: [the_pile_books3 license](https://huggingface.co/datasets/defunct-datasets/the_pile_books3#licensing-information) and [pg19 license](https://huggingface.co/datasets/deepmind/pg19#licensing-information)
* [ArXiv Terms of Use](https://info.arxiv.org/help/api/tou.html)
* [Wikipedia License](https://huggingface.co/datasets/legacy-datasets/wikipedia#licensing-information)
* [StackExchange license on the Internet Archive](https://archive.org/details/stackexchange)
Provider:
CofeAI
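Summary of Original Information

Dataset Description

Dataset Creation

This dataset was built to support researchers using NanoLM for comparative analysis across different model designs. The data is mainly derived from RedPajama and RedPajamaV2 through a series of cleaning and filtering steps, including the removal of redundant formatting, garbled characters, formula errors, duplicated paragraphs, and low-quality text.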
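
The cleaning pipeline itself is not spelled out here, so the following is only a minimal sketch of what two of the named steps (removing garbled characters and duplicated paragraphs) might look like. The function names, the choice of U+FFFD and control characters as "garbled", and the hash-based paragraph comparison are all assumptions made for illustration, not the authors' implementation.

```python
import hashlib
import unicodedata


def strip_garbled(text):
    """Drop U+FFFD replacement characters and control characters (assumed definition of 'garbled')."""
    kept = []
    for ch in text:
        if ch == "\ufffd":
            continue
        # Keep newlines and tabs; drop every other control character.
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t":
            continue
        kept.append(ch)
    return "".join(kept)


def drop_duplicate_paragraphs(text):
    """Keep the first occurrence of each paragraph, compared case-insensitively after whitespace normalization."""
    seen = set()
    unique = []
    for para in text.split("\n\n"):
        normalized = " ".join(para.split()).lower()
        if not normalized:
            continue
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        unique.append(para.strip())
    return "\n\n".join(unique)


def clean_document(text):
    """Apply both illustrative cleaning steps to a single document."""
    return drop_duplicate_paragraphs(strip_garbled(text))
```

Hashing normalized paragraphs keeps memory use modest on long documents; a real pipeline would likely add near-duplicate detection and quality filtering on top of exact matching like this.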

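Dataset Summary

Dataset Num Tokens (B)
CommonCrawl 67.00
C4 15.00
Wikipedia (En) 5.14
Books 4.48
ArXiv 2.50
StackExchange 2.00
Total 97.12

The dataset contains approximately 100B tokens. Users are advised to add code datasets such as StarCoderData or The Stack V2 to improve model performance on code and reasoning.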

License

The curated data is subject to the licenses of the respective original sources.
