
InfLLM-V2-data-5B

ModelScope community · updated 2026-01-06 · indexed 2025-11-29
Download link:
https://modelscope.cn/datasets/OpenBMB/InfLLM-V2-data-5B
# InfLLM-V2 Long-Context Training Dataset with 5B Tokens

**Project Links**: [[Paper](https://arxiv.org/abs/2509.24663)] [[InfLLM-V2 Models](https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base)] [[CUDA Kernel Code](https://github.com/OpenBMB/infllmv2_cuda_impl)]

---

## 🚀 About InfLLM-V2

**InfLLM-V2** is a native sparse attention framework designed for efficient processing of long-sequence texts. Its core advantage is that, without any extra parameters, it matches the performance of dense attention in short-text scenarios and switches seamlessly to a sparse mode for long-text scenarios, achieving significant end-to-end acceleration.

To support community reproduction and further exploration, we are open-sourcing the full suite of resources for the InfLLM-V2 project, including:

* **Initial Weights**: [InfLLM-V2-Short-Dense-Base](https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base) (the base model before continued training on long texts).
* **Training Data**: `InfLLM-V2-data-5B` (📍 **This Dataset**).
* **Final Model**: [InfLLM-V2-Long-Sparse-Base](https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base) (the final model after training on long-text data).

## ✨ Dataset Description

This dataset contains **5B tokens** of long-text data used for training **InfLLM-V2**.

We demonstrate that **only 5B tokens of high-quality long-text data** are needed to unlock the model's sparse attention capabilities, without resorting to the trillion-scale data required by other methods. Using this dataset, researchers can efficiently reproduce our results or explore more advanced training methods for long-context models.

### Data Composition and Specifications

**1. Data Composition**

This dataset is a carefully curated mixture from sources including web data, source code, scientific papers, and Wikipedia, augmented with a selection of high-quality in-house data.

**2. Specifications**

- **Total Tokens**: Approximately 5 billion (5B).
- **Tokenizer**: Processed using the tokenizer from [MiniCPM4](https://huggingface.co/openbmb/MiniCPM4.1-8B).
- **Data Format**: Sharded Parquet (`.parquet`).
- **Data Fields**:
  - `input_ids`: (list[int]) The list of encoded token IDs.
  - `text`: (string) The original text.

### How to Use

Given the large size of the dataset, it is **highly recommended** to load it in **streaming mode** using the Hugging Face `datasets` library to avoid memory exhaustion.

```python
from datasets import load_dataset

# Recommended: Load in streaming mode to save memory
ds = load_dataset("openbmb/InfLLM-V2-data-5B", split="train", streaming=True)
```

## The InfLLM-V2 Training Workflow

The long-context capability of InfLLM-V2 is achieved through continued training on high-quality long-text data.

- **Step 1: Start from the base model.**
  - [**InfLLM-V2-Short-Dense-Base**](https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base): The base model pre-trained on short texts, featuring dense attention.
- **Step 2: Continue training on this dataset.**
  - Use this dataset (`InfLLM-V2-data-5B`) to perform continued training on the base model.
- **Step 3: Get the final long-context model.**
  - [**InfLLM-V2-Long-Sparse-Base**](https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base): The final model after training, equipped with long-context and sparse attention capabilities.

## Related Projects

- **Models:**
  - **[openbmb/MiniCPM4.1-8B](https://huggingface.co/openbmb/MiniCPM4.1-8B):** A model trained with InfLLM V2 that supports fusion thinking.
  - **[openbmb/MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B):** A model trained with InfLLM V2.
- **CUDA Kernels:**
  - [OpenBMB/infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl): The CUDA implementation of the core sparse attention kernels for InfLLM-V2.
- **Training Data:**
  - [openbmb/InfLLM-V2-data-5B](https://huggingface.co/datasets/openbmb/InfLLM-V2-data-5B) (this dataset).

## Citation

If you use our work in your research, please cite our paper:

```bibtex
@misc{zhao2025infllmv2densesparseswitchableattention,
  title={InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation},
  author={Weilin Zhao and Zihan Zhou and Zhou Su and Chaojun Xiao and Yuxuan Li and Yanghao Li and Yudi Zhang and Weilun Zhao and Zhen Li and Yuxiang Huang and Ao Sun and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2509.24663},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.24663},
}
```
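
## Example: Inspecting Record Lengths

As a complement to the streaming recommendation under How to Use, the sketch below shows one way to check sequence lengths without loading the whole dataset. It assumes only the two documented fields (`input_ids` and `text`). The helper name `summarize_lengths`, its `limit` parameter, and the in-memory `sample` records are hypothetical and exist purely for illustration; they are not part of the dataset or of the `datasets` library.

```python
from itertools import islice
from typing import Dict, Iterable, List


def summarize_lengths(records: Iterable[dict], limit: int = 1000) -> Dict[str, float]:
    """Summarize token counts over at most the first `limit` records.

    Hypothetical helper: it only assumes each record has an `input_ids` list,
    as described in the dataset's field specification.
    """
    # islice stops consuming the iterable after `limit` records
    lengths: List[int] = [len(rec["input_ids"]) for rec in islice(records, limit)]
    if not lengths:
        return {"records": 0, "total_tokens": 0, "max_tokens": 0, "mean_tokens": 0.0}
    return {
        "records": len(lengths),
        "total_tokens": sum(lengths),
        "max_tokens": max(lengths),
        "mean_tokens": sum(lengths) / len(lengths),
    }


# Tiny made-up stand-in for real records, using the two documented fields
sample = [
    {"input_ids": [1, 2, 3], "text": "a b c"},
    {"input_ids": [4, 5], "text": "d e"},
    {"input_ids": [6, 7, 8, 9, 10], "text": "f g h i j"},
]

stats = summarize_lengths(sample, limit=2)
print(stats)
```

Because the helper accepts any iterable of records, the streaming dataset object returned by `load_dataset(..., streaming=True)` should be usable in place of `sample`, in which case iteration stops after `limit` records rather than walking all shards.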

Provider: maas
Created: 2025-11-27