CofeAI/NanoData

Name: CofeAI/NanoData
Creator: CofeAI
Published: 2024-06-11 11:03:05
License: No description available

Source: Hugging Face. Updated 2024-06-11; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/CofeAI/NanoData



Resource description:

---
license: other
license_name: other
license_link: LICENSE
task_categories:
- text-generation
language:
- en
size_categories:
- 100B<n<1T
---

### Dataset Description

To help researchers use [NanoLM](https://github.com/cofe-ai/nanoLM?tab=readme-ov-file) for comparative analysis across different model designs, we built a curated pre-training dataset from the data sources of existing large-scale models (i.e., Llama, Falcon, GPT-3). It covers diverse domains to improve the generalization capabilities of the resulting models.

#### Dataset Creation

The data is mainly post-processed and filtered from [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [RedPajamaV2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2). We developed a series of cleaning steps to remove redundant formatting, garbled characters, formula errors, duplicated paragraphs, low-quality text, and other unwanted content. After interleaved document-level deduplication within each independent subset, we obtain a high-quality dataset.

#### Dataset Summary

| Dataset        | Num Tokens (B) |
| -------------- | -------------- |
| CommonCrawl    | 67.00          |
| C4             | 15.00          |
| Wikipedia (En) | 5.14           |
| Books          | 4.48           |
| ArXiv          | 2.50           |
| StackExchange  | 2.00           |
| Total          | 97.12          |

We release the data with approximately 100B tokens. We also recommend that users add code datasets such as [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) or [The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) to improve the model's performance on code and reasoning.
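
The document-level deduplication mentioned under Dataset Creation can be illustrated with a minimal sketch. The exact procedure used for NanoData is not described in this card, so the code below assumes plain exact-match deduplication on whitespace-normalized, lowercased text hashed with SHA-256, applied to each subset independently. The function names `normalize`, `dedup_documents`, and `dedup_subsets` are hypothetical and are not part of NanoLM.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase the text.

    Assumption: trivial formatting differences should not defeat matching.
    """
    return " ".join(text.split()).lower()


def dedup_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized content is seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate after normalization: drop it
        seen.add(digest)
        yield doc


def dedup_subsets(subsets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Deduplicate every subset on its own; no matching across subsets."""
    return {name: list(dedup_documents(docs)) for name, docs in subsets.items()}
```

Calling `dedup_subsets({"wiki": [...], "books": [...]})` keeps the first occurrence of each document within a subset and leaves identical documents in different subsets untouched. Near-duplicate detection (for example MinHash) would need a different approach and is not covered here.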
### Citation

To cite NanoLM, please use:

```
@misc{yao2024nanolm,
      title={nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales},
      author={Yiqun Yao and Siqi Fan and Xiusheng Huang and Xuezhi Fang and Xiang Li and Ziyi Ni and Xin Jiang and Xuying Meng and Peng Han and Shuo Shang and Kang Liu and Aixin Sun and Yequan Wang},
      year={2024},
      eprint={2304.06875},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Acknowledgement

The data is mainly curated and filtered from [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) and [RedPajamaV2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2). We extend our gratitude to the original authors for their innovative work and for making it available to the community.

### License

The NanoLM code used for dataset processing and loss prediction is licensed under the Apache 2.0 license. For the curated data, please refer to the licenses of the original sources:

* [Common Crawl Foundation Terms of Use](https://commoncrawl.org/terms-of-use)
* [C4 license](https://huggingface.co/datasets/allenai/c4#license)
* Books: [the_pile_books3 license](https://huggingface.co/datasets/defunct-datasets/the_pile_books3#licensing-information) and [pg19 license](https://huggingface.co/datasets/deepmind/pg19#licensing-information)
* [ArXiv Terms of Use](https://info.arxiv.org/help/api/tou.html)
* [Wikipedia License](https://huggingface.co/datasets/legacy-datasets/wikipedia#licensing-information)
* [StackExchange license on the Internet Archive](https://archive.org/details/stackexchange)

Provider:

CofeAI

Summary of original information