Bee-Training-Data-Stage1
Source: ModelScope community. Updated 2026-01-06; indexed 2025-11-08.
Download link: https://modelscope.cn/datasets/Open-Bee/Bee-Training-Data-Stage1
# Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
[[🏠 Homepage](https://open-bee.github.io/)] [[📖 Arxiv Paper](https://arxiv.org/pdf/2510.13795)] [[🤗 Models & Datasets](https://huggingface.co/collections/Open-Bee/bee-8b-68ecbf10417810d90fbd9995)] [[💻 Code (coming soon)](https://github.com/Open-Bee)]
## Introduction
We introduce **Bee-8B**, a new state-of-the-art, fully open 8B Multimodal Large Language Model (MLLM) designed to close the performance gap with proprietary models by focusing on data quality.
Bee-8B is trained on our new **Honey-Data-15M** corpus, a high-quality supervised fine-tuning (SFT) dataset of approximately 15 million samples. This dataset was meticulously created with our transparent, adaptable, and open-source data curation pipeline, **HoneyPipe**, which systematically cleans noisy data and enriches it with a novel dual-level (short and long) Chain-of-Thought (CoT) strategy.
This dataset enables Bee-8B to achieve exceptional performance, particularly in complex reasoning, establishing a new standard for fully open MLLMs.
## Key Features
- **High-Quality, Large-Scale Dataset:** We release **Honey-Data-15M**, a new 15M-sample SFT corpus. It has undergone extensive cleaning to remove widespread noise and has been enriched with dual-level CoT reasoning to enhance advanced problem-solving capabilities.
- **Fully Open-Source Data Curation Suite:** We provide not just the data, but the entire methodology. **HoneyPipe** and its underlying framework **DataStudio** offer the community a transparent and reproducible pipeline, moving beyond static dataset releases.
- **State-of-the-Art Open Model:** Our model, **Bee-8B**, achieves state-of-the-art performance among fully open MLLMs and is highly competitive with recent semi-open models like InternVL3.5-8B, demonstrating the power of high-quality data.
## Bee-Training-Data-Stage1
`Bee-Training-Data-Stage1` is the dataset used for **Stage 1 training**, the first stage of the Bee-8B training recipe.
## Usage
Example code to load this pre-training dataset (assuming a data structure with `image` and `text` fields):
```python
from PIL import Image
from datasets import load_dataset

# Load the dataset and take the first sample
dataset_name = "Open-Bee/Bee-Training-Data-Stage1"
item = load_dataset(dataset_name, split="train")[0]

# Extract data fields
item_id = item.get('id', 'default_id')
image_data = item['image']
text_data = item['text']

# Path the image will be saved to
image_path = f"{item_id}.jpg"

# Save image (datasets automatically converts to a PIL Image object)
if isinstance(image_data, Image.Image):
    # JPEG cannot store alpha or palette modes, so convert those to RGB
    if image_data.mode in ('RGBA', 'LA', 'P'):
        image_data = image_data.convert('RGB')
    image_data.save(image_path, format='JPEG')

# Build sample
sample = {
    'id': item_id,
    'text': text_data,
    'image_path': image_path
}

# Print result
print(sample)
```
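Loading the full `train` split just to read one item can be slow for a corpus of this size. The following is a minimal sketch, not part of the official recipe, that streams a handful of samples and writes them as JPEG files plus a JSONL manifest. It keeps the same assumption as the example above (each item has `image` and `text` fields, and optionally `id`). The names `export_samples`, `stream_stage1`, `manifest.jsonl`, and `stage1_preview` are hypothetical and chosen only for illustration.

```python
import json
from itertools import islice
from pathlib import Path


def export_samples(items, out_dir, limit=8):
    """Save up to `limit` items as JPEG files and list them in a JSONL manifest.

    Assumes each item is a dict with an `image` (PIL-style object) and a `text`
    field; `id` is optional. Returns the number of samples written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with (out_dir / "manifest.jsonl").open("w", encoding="utf-8") as manifest:
        for index, item in enumerate(islice(items, limit)):
            # Fall back to a positional id when the item carries none.
            item_id = str(item.get("id", f"sample_{index}"))
            image = item["image"]
            # Normalize every non-RGB mode (alpha, palette, ...) before JPEG encoding.
            if image.mode != "RGB":
                image = image.convert("RGB")
            image_path = out_dir / f"{item_id}.jpg"
            image.save(image_path, format="JPEG")
            record = {"id": item_id, "text": item["text"], "image_path": str(image_path)}
            manifest.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


def stream_stage1(limit=8, out_dir="stage1_preview"):
    """Stream the first `limit` samples instead of downloading the whole split."""
    # Imported lazily so export_samples can be reused without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "Open-Bee/Bee-Training-Data-Stage1", split="train", streaming=True
    )
    return export_samples(stream, out_dir, limit=limit)
```

Calling `stream_stage1(limit=8)` needs network access and the `datasets` and `Pillow` packages. If the actual column names differ from `image` and `text`, adjust the keys in `export_samples` accordingly.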
## Licensing Information
The `Bee-Training-Data-Stage1` dataset is built upon several publicly available, large-scale web-scraped datasets.
- **Sub-dataset Licenses:** Users of `Bee-Training-Data-Stage1` must strictly adhere to the specific licensing terms and conditions of each original sub-dataset from which it is derived. We recommend you carefully review the original license for each sub-dataset before use.
- **Prompts and Responses:** To the extent that we hold any intellectual property rights in the modified prompts and newly generated responses created for this project, these contributions are made available under the **Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)** license.
- **Copyright Concerns:** This dataset is compiled for academic research purposes. If you believe any content within `Bee-Training-Data-Stage1` infringes upon your copyright, please contact us immediately at yi.zhang.4096[at]gmail.com.
## Acknowledgements
> [!NOTE]
> If you believe we have missed acknowledging any important data source that should be explicitly mentioned here, please contact us.
`Bee-Training-Data-Stage1` is built upon a large collection of publicly available datasets. We extend our deepest gratitude to the creators and maintainers of the following major datasets:
- [COYO-700M](https://github.com/kakaobrain/coyo-dataset): A large-scale, open-source image-text pair dataset.
- [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain): An open-source image-text pair dataset for vision-language pre-training.
## Citation
If you use our dataset or model in your research, please cite our paper:
```bibtex
@misc{zhang2025beehighqualitycorpusfullstack,
title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
author={Yi Zhang and Bolin Ni and Xin-Sheng Chen and Heng-Rui Zhang and Yongming Rao and Houwen Peng and Qinglin Lu and Han Hu and Meng-Hao Guo and Shi-Min Hu},
year={2025},
eprint={2510.13795},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.13795},
}
```
Provider: maas
Created: 2025-11-04
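Summary compiled by the hosting site (遇见数据集): Bee-Training-Data-Stage1 is the Stage 1 training dataset for the Bee-8B multimodal large language model. Per the dataset card, it is image-text data built on publicly available datasets such as COYO-700M and LLaVA-Pretrain. The approximately 15-million-sample Honey-Data-15M corpus, cleaned and enriched with dual-level Chain-of-Thought reasoning, is the project's SFT dataset.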



