Bee-Training-Data-Stage2

Name: Bee-Training-Data-Stage2
Creator: maas
Published: 2026-01-06 16:51:16
License: No description available

ModelScope community · Updated 2026-01-06 · Indexed 2025-11-08

Download link:

https://modelscope.cn/datasets/Open-Bee/Bee-Training-Data-Stage2

Resource description:

# Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

[[🏠 Homepage](https://open-bee.github.io/)] [[📖 Arxiv Paper](https://arxiv.org/pdf/2510.13795)] [[🤗 Models & Datasets](https://huggingface.co/collections/Open-Bee/bee-8b-68ecbf10417810d90fbd9995)] [[💻 Code (coming soon)](https://github.com/Open-Bee)]

## Introduction

We introduce **Bee-8B**, a new state-of-the-art, fully open 8B Multimodal Large Language Model (MLLM) designed to close the performance gap with proprietary models by focusing on data quality.

Bee-8B is trained on our new **Honey-Data-15M** corpus, a high-quality supervised fine-tuning (SFT) dataset of approximately 15 million samples. This dataset was meticulously created with our transparent, adaptable, and open-source data curation pipeline, **HoneyPipe**, which systematically cleans noisy data and enriches it with a novel dual-level (short and long) Chain-of-Thought (CoT) strategy.

This dataset enables Bee-8B to achieve exceptional performance, particularly in complex reasoning, establishing a new standard for fully open MLLMs.

## Key Features

- **High-Quality, Large-Scale Dataset:** We release **Honey-Data-15M**, a new 15M-sample SFT corpus. It has undergone extensive cleaning to remove widespread noise and has been enriched with dual-level CoT reasoning to enhance advanced problem-solving capabilities.
- **Fully Open-Source Data Curation Suite:** We provide not just the data, but the entire methodology. **HoneyPipe** and its underlying framework **DataStudio** offer the community a transparent and reproducible pipeline, moving beyond static dataset releases.
- **State-of-the-Art Open Model:** Our model, **Bee-8B**, achieves state-of-the-art performance among fully open MLLMs and is highly competitive with recent semi-open models like InternVL3.5-8B, demonstrating the power of high-quality data.

## Bee-Training-Data-Stage2

`Bee-Training-Data-Stage2` is the second stage of the Bee-8B training recipe, intended for **Stage 2 training**.

## Usage

Example code to load this pre-training dataset (assuming a data structure with `image` and `text` fields):

```python
from PIL import Image
from datasets import load_dataset

# Load dataset
dataset_name = "Open-Bee/Bee-Training-Data-Stage2"
item = load_dataset(dataset_name, split="train")[0]

# Extract data fields
item_id = item.get('id', 'default_id')
image_data = item['image']
text_data = item['text']

# Save image and record path
image_path = f"{item_id}.jpg"

# Save image (datasets automatically converts to PIL Image object)
if isinstance(image_data, Image.Image):
    # JPEG format requires RGB mode
    if image_data.mode in ('RGBA', 'LA', 'P'):
        image_data = image_data.convert('RGB')
    image_data.save(image_path, format='JPEG')

# Build sample
sample = {
    'id': item_id,
    'text': text_data,
    'image_path': image_path
}

# Print result
print(sample)
```

## Licensing Information

The `Bee-Training-Data-Stage2` dataset is built upon several publicly available, large-scale web-scraped datasets.

- **Sub-dataset Licenses:** Users of `Bee-Training-Data-Stage2` must strictly adhere to the specific licensing terms and conditions of each original sub-dataset from which it is derived. We recommend you carefully review the original license for each sub-dataset before use.
- **Prompts and Responses:** To the extent that we hold any intellectual property rights in the modified prompts and newly generated responses created for this project, these contributions are made available under the **Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)** license.
- **Copyright Concerns:** This dataset is compiled for academic research purposes. If you believe any content within `Bee-Training-Data-Stage2` infringes upon your copyright, please contact us immediately at yi.zhang.4096[at]gmail.com.

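The loading snippet in the Usage section handles a single record. A minimal sketch of the same steps applied to many records might look like the following. It keeps the Usage example's assumption that rows expose `image` and `text` fields. The names `export_samples`, `safe_stem`, and `JPEG_READY_MODES` are hypothetical helpers introduced here for illustration, not part of the Bee release. The image is handled by duck typing (only `mode`, `convert`, and `save` are used, which PIL images provide), so the block needs no third-party import.

```python
import os
import re
from itertools import islice

# Assumption: RGB and grayscale images are written as-is; any other mode
# (RGBA, LA, P, ...) is converted to RGB before saving as JPEG.
JPEG_READY_MODES = {"RGB", "L"}


def safe_stem(item_id):
    """Turn an arbitrary id into a file-name-safe stem (hypothetical helper)."""
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", str(item_id)).strip("._")
    return stem or "default_id"


def export_samples(records, out_dir, limit=None):
    """Save each record's image as JPEG and return lightweight sample dicts.

    `records` is any iterable of dict-like rows with `image` and `text`
    fields. Rows without an image get `image_path=None`.
    """
    os.makedirs(out_dir, exist_ok=True)
    samples = []
    # islice stops after `limit` rows without pulling extra ones from the source.
    for index, row in enumerate(islice(records, limit)):
        item_id = row.get("id", f"row_{index}")
        image = row["image"]
        image_path = None
        if image is not None and hasattr(image, "save"):
            if image.mode not in JPEG_READY_MODES:
                image = image.convert("RGB")
            image_path = os.path.join(out_dir, f"{safe_stem(item_id)}.jpg")
            image.save(image_path, format="JPEG")
        samples.append({"id": item_id, "text": row["text"], "image_path": image_path})
    return samples
```

With the real dataset, the iterable could be the object returned by `load_dataset("Open-Bee/Bee-Training-Data-Stage2", split="train", streaming=True)`, so that only the first `limit` rows are fetched rather than the full split; whether the rows actually carry an `id` field is not confirmed by the dataset card.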
## Acknowledgements

> [!NOTE]
> If you believe we have missed acknowledging any important data source that should be explicitly mentioned here, please contact us.

`Bee-Training-Data-Stage2` is built upon a large collection of publicly available datasets. We extend our deepest gratitude to the creators and maintainers of the following major datasets:

- [LAION-5B](https://laion.ai/blog/laion-5b/): A large-scale, open image-text dataset.
- [COYO-700M](https://github.com/kakaobrain/coyo-dataset): A large-scale, open-source image-text pair dataset.
- [Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1): A large-scale, open-source text dataset for complex reasoning.
- [LLaVA-OneVision-Mid-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data): An open-source image-text pair dataset for mid-level vision-language pre-training.

## Citation

If you use our dataset or model in your research, please cite our paper:

```bibtex
@misc{zhang2025beehighqualitycorpusfullstack,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Yi Zhang and Bolin Ni and Xin-Sheng Chen and Heng-Rui Zhang and Yongming Rao and Houwen Peng and Qinglin Lu and Han Hu and Meng-Hao Guo and Shi-Min Hu},
  year={2025},
  eprint={2510.13795},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.13795},
}
```


Provider: maas

Created: 2025-11-04
