Honey-Data-15M
ModelScope Community · Updated 2026-05-23 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/Open-Bee/Honey-Data-15M
Description:
# Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
[[🏠 Homepage](https://open-bee.github.io/)] [[📖 Arxiv Paper](https://arxiv.org/pdf/2510.13795)] [[🤗 Models & Datasets](https://huggingface.co/collections/Open-Bee/bee-8b-68ecbf10417810d90fbd9995)] [[💻 Code](https://github.com/Open-Bee)]
## Introduction
We introduce **Bee-8B**, a new state-of-the-art, fully open 8B Multimodal Large Language Model (MLLM) designed to close the performance gap with proprietary models by focusing on data quality.
Bee-8B is trained on our new **Honey-Data-15M** corpus, a high-quality supervised fine-tuning (SFT) dataset of approximately 15 million samples. This dataset was meticulously created with our transparent, adaptable, and open-source data curation pipeline, **HoneyPipe**, which systematically cleans noisy data and enriches it with a novel dual-level (short and long) Chain-of-Thought (CoT) strategy.
This dataset enables Bee-8B to achieve exceptional performance, particularly in complex reasoning, establishing a new standard for fully open MLLMs.
## Key Features
- **High-Quality, Large-Scale Dataset:** We release **Honey-Data-15M**, a new 15M-sample SFT corpus. It has undergone extensive cleaning to remove widespread noise and has been enriched with dual-level CoT reasoning to enhance advanced problem-solving capabilities.
- **Fully Open-Source Data Curation Suite:** We provide not just the data, but the entire methodology. **HoneyPipe** and its underlying framework **DataStudio** offer the community a transparent and reproducible pipeline, moving beyond static dataset releases.
- **State-of-the-Art Open Model:** Our model, **Bee-8B**, achieves state-of-the-art performance among fully open MLLMs and is highly competitive with recent semi-open models like InternVL3.5-8B, demonstrating the power of high-quality data.
## Honey-Data-15M
> [!NOTE]
> The dataset's responses adhere to two specific tag structures: Short CoT responses are formatted as `<think>\n\n</think>\n\n{short CoT Response}`, while Long CoT responses follow the format `<think>\n{Long CoT Reasoning}\n</think>\n\n`. More details about the dataset can be found in the [Paper](https://arxiv.org/abs/2510.13795).
> [!NOTE]
> The complete dataset is 4.71 TB and has been fully uploaded. Due to a bug in the dataset viewer, the size and number of items displayed by Hugging Face are inaccurate.
Honey-Data-15M is a large-scale, high-quality supervised fine-tuning (SFT) dataset containing approximately **15 million** meticulously curated samples. We built this dataset with the core objective of addressing the quality bottleneck in current open-source data by systematically cleaning widespread data noise and enriching the data with an innovative **"Dual-Level Chain-of-Thought (CoT)"** strategy.
The dataset's composition is as follows:
* **Approximately 12.2 million short CoT samples**: Designed to instill foundational, step-by-step logical inference in the model.
* **Approximately 2.7 million long CoT samples**: Focused on more intricate, multi-step reasoning problems that challenge and enhance the model's advanced cognitive abilities.
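
The two response layouts described in the note above can be told apart with a small parser. The following is a minimal sketch, not part of the released tooling: the names `ParsedResponse` and `parse_response` are hypothetical, and it assumes that whatever text follows the closing `</think>` tag is the final answer.

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    reasoning: str     # text between <think> and </think> (empty for short CoT)
    answer: str        # text after </think> (assumed to be the final answer)
    is_long_cot: bool  # True when the <think> block is non-empty


# Matches both layouts from the dataset note:
#   short: "<think>\n\n</think>\n\n{response}"
#   long:  "<think>\n{reasoning}\n</think>\n\n..."
_THINK_RE = re.compile(r"<think>\n(.*?)\n</think>\n\n(.*)", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Split a response string into its reasoning and answer parts."""
    match = _THINK_RE.fullmatch(text)
    if match is None:
        raise ValueError("response does not follow the <think>...</think> layout")
    reasoning, answer = match.group(1), match.group(2)
    return ParsedResponse(reasoning, answer, is_long_cot=bool(reasoning.strip()))


short = parse_response("<think>\n\n</think>\n\nThe answer is 4.")
long_ = parse_response("<think>\nAdd 2 and 2.\nThat gives 4.\n</think>\n\nThe answer is 4.")
print(short.is_long_cot, long_.is_long_cot)
```

A check like this can be used to count short and long CoT samples in a subset, or to strip the empty `<think>` block before display.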
## Usage
To load the dataset, you can refer to the following code:
```python
from PIL import Image
from datasets import load_dataset

# Load one sample (using the CoSyn_Math subset as an example)
item = load_dataset(
    "Open-Bee/Honey-Data-15M",
    name="CoSyn_Math",
    split="train",
)[0]

# Extract data fields
item_id = item['id']
conversations = item['conversations']
images_data = item.get('images', [])
source = item.get('source', None)
img_phash = item.get('img_phash', None)
img_size = item.get('img_size', None)

# Save images and record paths
image_paths = []
for img_idx, image_data in enumerate(images_data):
    image_path = f"{item_id}_{img_idx}.jpg"

    # Save image (datasets automatically converts to PIL Image object)
    if isinstance(image_data, Image.Image):
        # JPEG format requires RGB mode
        if image_data.mode in ('RGBA', 'LA', 'P'):
            image_data = image_data.convert('RGB')
        image_data.save(image_path, format='JPEG')
        image_paths.append(image_path)

# Build sample: a single path if there is one image, otherwise the list of paths
sample = {
    'id': item_id,
    'conversations': conversations,
    'image': image_paths[0] if len(image_paths) == 1 else image_paths,
    'source': source,
    'img_phash': img_phash,
    'img_size': img_size,
}

# Print result
print(sample)
```
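
Because the complete data spans several terabytes, it may be preferable to inspect a few samples without downloading a whole subset first. The sketch below uses the `streaming=True` option of `load_dataset` for this; the helper names `first_n` and `stream_subset` are hypothetical, and the subset name is simply the one used in the example above.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def first_n(samples: Iterable[Dict[str, Any]], n: int) -> Iterator[Dict[str, Any]]:
    """Yield at most n samples without consuming the rest of the stream."""
    return islice(samples, n)


def stream_subset(name: str = "CoSyn_Math", split: str = "train"):
    """Open one subset in streaming mode so samples are fetched lazily."""
    # Imported lazily so first_n() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "Open-Bee/Honey-Data-15M",
        name=name,
        split=split,
        streaming=True,  # returns an iterable dataset instead of downloading everything
    )
```

For example, `for item in first_n(stream_subset(), 3): print(item["id"])` would print the IDs of the first three streamed samples; this call needs network access and the `datasets` package.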
## Licensing Information
The `Honey-Data-15M` dataset is a collection composed of multiple publicly available sub-datasets. Each of these sub-datasets is governed by its own original license.
- **Sub-dataset Licenses:** Users of `Honey-Data-15M` must strictly adhere to the specific licensing terms and conditions of each original sub-dataset included in this collection. We recommend you carefully review the original license for each sub-dataset before use.
- **Prompts and Responses:** To the extent that we hold any intellectual property rights in the modified prompts and newly generated responses created for this project, these contributions are made available under the **Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)** license.
- **Copyright Concerns:** This dataset is compiled for academic research purposes. If you believe any content within `Honey-Data-15M` infringes upon your copyright, please contact us immediately at yi.zhang.4096[at]gmail.com. We will promptly review and address the matter, including removing the content in question once the claim is verified.
## Acknowledgements
> [!NOTE]
> If you believe we have missed acknowledging any important data source that should be explicitly mentioned here, please contact us.
Honey-Data-15M is built upon a large collection of publicly available datasets. We extend our deepest gratitude to the creators and maintainers of the following major datasets.
- [LLaVA-OneVision-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data): A comprehensive multimodal instruction tuning dataset
- [MAmmoTH-VL-Instruct-12M](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M): A large-scale vision-language instruction dataset for mathematical reasoning
- [VisualWebInstruct](https://huggingface.co/datasets/TIGER-Lab/VisualWebInstruct): A dataset for web-based visual instruction following
- [ArXiv-OCR-v0.2](https://huggingface.co/datasets/nz/arxiv-ocr-v0.2): OCR data from ArXiv papers for document understanding
- [CoSyn-400K](https://huggingface.co/datasets/allenai/CoSyn-400K): Synthetic data for visual reasoning across multiple domains
- [PixMo Collection](https://huggingface.co/collections/allenai/pixmo): A collection of high-quality vision-language datasets
- And many other datasets including [Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M), and numerous individual datasets across VQA, OCR, Charts, STEM, and other domains.
## Citation
If you use our dataset in your research, please cite our paper:
```bibtex
@misc{zhang2025beehighqualitycorpusfullstack,
title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
author={Yi Zhang and Bolin Ni and Xin-Sheng Chen and Heng-Rui Zhang and Yongming Rao and Houwen Peng and Qinglin Lu and Han Hu and Meng-Hao Guo and Shi-Min Hu},
year={2025},
eprint={2510.13795},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.13795},
}
```
Provider:
maas
Created:
2025-10-20



