openbmb/InfLLM-V2-data-5B
Source: Hugging Face. Updated: 2025-10-25. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/openbmb/InfLLM-V2-data-5B
Resource description:
---
license: apache-2.0
language:
- en
- zh
tags:
- long-context
- infllm
---
# InfLLM-V2 Long-Context Training Dataset with 5B Tokens
**Project Links**: [[Paper](https://arxiv.org/abs/2509.24663)] [[InfLLM-V2 Models](https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base)] [[CUDA Kernel Code](https://github.com/OpenBMB/infllmv2_cuda_impl)]
---
## 🚀 About InfLLM-V2
**InfLLM-V2** is a native sparse attention framework for efficient processing of long-sequence texts. Without adding any extra parameters, it maintains performance comparable to dense attention in short-text scenarios and switches seamlessly to a sparse mode in long-text scenarios, achieving significant end-to-end acceleration.
To support community reproduction and further exploration, we are open-sourcing the full suite of resources for the InfLLM-V2 project, including:
* **Initial Weights**: [InfLLM-V2-Short-Dense-Base](https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base) (The base model before continued training on long texts).
* **Training Data**: `InfLLM-V2-data-5B` (📍 **This Dataset**).
* **Final Model**: [InfLLM-V2-Long-Sparse-Base](https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base) (The final model after being trained on long-text data).
## ✨ Dataset Description
This dataset contains **5B tokens** of long-text data used for training **InfLLM-V2**.
We demonstrate that **only 5B tokens of high-quality long-text data** are needed to successfully unlock the model's powerful sparse attention capabilities, without resorting to the trillion-scale data required by other methods. Using this dataset, researchers can efficiently reproduce our results or explore more advanced training methods for long-context models.
### Data Composition and Specifications
**1. Data Composition**
This dataset is a carefully curated mixture from sources including web data, source code, scientific papers, and Wikipedia, augmented with a selection of high-quality in-house data.
**2. Specifications**
- **Total Tokens**: Approximately 5 Billion (5B).
- **Tokenizer**: Processed using the tokenizer from [MiniCPM4](https://huggingface.co/openbmb/MiniCPM4.1-8B).
- **Data Format**: Sharded Parquet (`.parquet`).
- **Data Fields**:
  - `input_ids`: (list[int]) The list of encoded token IDs.
  - `text`: (string) The original text.
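
Because each record carries its pre-tokenized `input_ids`, token statistics can be gathered without loading a tokenizer. The following is a minimal sketch, assuming records are dictionaries with the `input_ids` field listed above. The function name `token_stats` and its `limit` parameter are hypothetical, introduced only for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Optional


def token_stats(records: Iterable[Dict[str, Any]],
                limit: Optional[int] = None) -> Dict[str, int]:
    """Count documents and tokens over the first `limit` records (all if None).

    Assumes each record has an `input_ids` list, as described in the
    dataset card. `token_stats` is an illustrative helper, not part of
    any library.
    """
    docs = 0
    total = 0
    longest = 0
    # islice(records, None) iterates over the whole input.
    for record in islice(records, limit):
        n = len(record["input_ids"])
        docs += 1
        total += n
        longest = max(longest, n)
    return {"documents": docs, "total_tokens": total, "max_tokens": longest}
```

Passing a bounded `limit` (for example `token_stats(ds, limit=1000)` on a streaming dataset `ds`) gives a quick estimate of document lengths without iterating over the full dataset.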
### How to Use
Given the large size of the dataset, it is **highly recommended** to load it in **streaming mode** using the Hugging Face `datasets` library to avoid memory exhaustion.
```python
from datasets import load_dataset
# Recommended: Load in streaming mode to save memory
ds = load_dataset("openbmb/InfLLM-V2-data-5B", split="train", streaming=True)
```
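
A streaming dataset is consumed as an iterator, so a bounded preview is a safe first step before any full pass. The sketch below assumes each record is a dictionary with the `input_ids` and `text` fields described in the specifications. The helper `peek` is a hypothetical name and not part of the `datasets` library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Summarize the first `n` records of a (possibly streaming) dataset.

    Illustrative helper: reads only `n` records, so it does not exhaust
    or materialize the full stream.
    """
    summaries = []
    for record in islice(records, n):
        summaries.append({
            "num_tokens": len(record["input_ids"]),   # length of the token ID list
            "text_preview": record["text"][:80],      # first 80 characters of raw text
        })
    return summaries
```

With the streaming dataset from the snippet above, `peek(ds)` would return summaries for the first three records.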
## The InfLLM-V2 Training Workflow
The long-context capability of InfLLM-V2 is achieved through continued training on high-quality long-text data.
- **Step 1: Start from the base model.**
  - [**InfLLM-V2-Short-Dense-Base**](https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base): The base model pre-trained on short texts, featuring dense attention.
- **Step 2: Continue training on this dataset.**
  - Use this dataset (`InfLLM-V2-data-5B`) to perform continued training on the base model.
- **Step 3: Get the final long-context model.**
  - [**InfLLM-V2-Long-Sparse-Base**](https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base): The final model after training, equipped with powerful long-context and sparse attention capabilities.
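
This card does not describe how documents are batched during Step 2. One common approach in continued pre-training is to concatenate pre-tokenized documents and cut them into fixed-length blocks. The sketch below illustrates that approach as an assumption, not as the authors' pipeline. `pack_blocks` and `block_size` are hypothetical names, and the sketch ignores document boundaries (no separator tokens or cross-document attention masking).

```python
from typing import Dict, Iterable, Iterator, List


def pack_blocks(records: Iterable[Dict[str, List[int]]],
                block_size: int) -> Iterator[List[int]]:
    """Concatenate `input_ids` lists and yield consecutive fixed-length blocks.

    Assumption for illustration only: plain concatenation with no
    separator tokens. A trailing remainder shorter than `block_size`
    is dropped.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    buffer: List[int] = []
    for record in records:
        buffer.extend(record["input_ids"])
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
```

Because the function is a generator, it composes with a streaming dataset and keeps only one partial block in memory at a time.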
## Related Projects
- **Models:**
  - **[openbmb/MiniCPM4.1-8B](https://huggingface.co/openbmb/MiniCPM4.1-8B):** A model trained with InfLLM V2 that supports fusion thinking.
  - **[openbmb/MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B):** A model trained with InfLLM V2.
- **CUDA Kernels:**
  - [OpenBMB/infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl): The CUDA implementation of the core sparse attention kernels for InfLLM-V2.
- **Training Data:**
  - [openbmb/InfLLM-V2-data-5B](https://huggingface.co/datasets/openbmb/InfLLM-V2-data-5B) (This dataset).
## Citation
If you use our work in your research, please cite our paper:
```bibtex
@misc{zhao2025infllmv2densesparseswitchableattention,
      title={InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation},
      author={Weilin Zhao and Zihan Zhou and Zhou Su and Chaojun Xiao and Yuxuan Li and Yanghao Li and Yudi Zhang and Weilun Zhao and Zhen Li and Yuxiang Huang and Ao Sun and Xu Han and Zhiyuan Liu},
      year={2025},
      eprint={2509.24663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24663},
}
```
Provider:
openbmb



