amd/Instella-Long

Source: Hugging Face. Updated 2025-11-14; indexed 2026-02-07.

Download link:

https://hf-mirror.com/datasets/amd/Instella-Long


Resource description:

---
license: other
license_link: LICENSE
---

# Instella-Long

The Instella-Long dataset is a collection of pre-training and instruction-following data used to train [Instella-3B-Long-Instruct](https://huggingface.co/amd/Instella-3B-Long-Instruct).

The pre-training data is sourced from [Prolong](https://arxiv.org/pdf/2410.02660). For the SFT data, we use public datasets: [Ultrachat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), [Tülu-3 Instruction Following](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following), and [MMLU auxiliary train set](https://huggingface.co/datasets/cais/mmlu). In addition, we generate synthetic long instruction data from the book and arXiv documents in our pre-training corpus and from the dclm subset of [Dolmino-Mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124). We use [Qwen2.5-14B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M) as a teacher model to synthetically generate question-answer pairs for the documents.

All of the above data are tokenized with our tokenizer and packed into fixed-length sequences ready for training.

## Load Data

The data is in MDS format and can be loaded through [mosaicml-streaming](https://github.com/mosaicml/streaming). The data can be downloaded by cloning the repository or with the `huggingface_hub.snapshot_download` function. An example of loading the data is as follows:

```python
>>> from streaming import LocalDataset
>>> dataset = LocalDataset("path/to/dataset")
>>> dataset[0]
```

## License

The [amd/Instella-Long](https://huggingface.co/datasets/amd/Instella-Long) data is licensed for academic and research purposes under a ResearchRAIL license.
For more information on license and terms, refer to our [LICENSE](./LICENSE), [NOTICES](./NOTICES), and the corresponding licenses of each dataset subset linked below:

- [Prolong](https://arxiv.org/pdf/2410.02660) (MIT License)
- [Ultrachat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (MIT License)
- [OpenMathinstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) (CC-BY-4.0 License)
- [Tülu-3 Instruction Following](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) (ODC-BY-1.0 License)
- [MMLU auxiliary train set](https://huggingface.co/datasets/cais/mmlu) (MIT License)
- [Dolmino-Mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124) (ODC-BY-1.0 License)
- [Qwen2.5-14B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M) (Apache 2.0 License)

## Citations

Feel free to cite our Instella-3B models:

```text
@article{instella,
  title={Instella: Fully Open Language Models with Stellar Performance},
  author={Liu, Jiang and Wu, Jialian and Yu, Xiaodong and Su, Yusheng and Mishra, Prakamya and Ramesh, Gowtham and Ranjan, Sudhanshu and Manem, Chaitanya and Sun, Ximeng and Wang, Ze and Brahma, Pratik Prabhanjan and Liu, Zicheng and Barsoum, Emad},
  journal={arXiv preprint arXiv:2511.10628},
  year={2025}
}
```
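
As a supplement to the Load Data section, here is a minimal sketch of the download-then-load flow it describes, combining `huggingface_hub.snapshot_download` with `streaming.LocalDataset`. The helper names (`find_mds_dirs`, `download_dataset`, `load_first_sample`, `main`) and the local path `./Instella-Long` are hypothetical. The sketch also assumes, without confirmation from the README, that the repository may hold several MDS directories, each identified by its own `index.json`; check the actual layout after downloading.

```python
import os
from typing import List


def find_mds_dirs(root: str) -> List[str]:
    """Return the sorted directories under `root` that contain an `index.json`.

    Assumption: each MDS dataset directory is marked by an `index.json` file.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "index.json" in filenames:
            found.append(dirpath)
    return sorted(found)


def download_dataset(local_dir: str) -> str:
    """Download the dataset repository and return the local path."""
    # Imported lazily so find_mds_dirs works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="amd/Instella-Long",
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
    )


def load_first_sample(mds_dir: str):
    """Open one MDS directory and return its first sample."""
    from streaming import LocalDataset

    dataset = LocalDataset(mds_dir)
    return dataset[0]


def main() -> None:
    # Hypothetical local path; the full download may be large.
    root = download_dataset("./Instella-Long")
    for mds_dir in find_mds_dirs(root):
        sample = load_first_sample(mds_dir)
        print(mds_dir, type(sample))
```

Calling `main()` performs the full download, so it may be worth running `find_mds_dirs` against an already cloned copy first to see which directories are loadable.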

Provider:

amd
