Honey-Data-1M

ModelScope community · updated 2026-05-01 · indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/Open-Bee/Honey-Data-1M

Resource description:

# Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

[[🏠 Homepage](https://open-bee.github.io/)] [[📖 Arxiv Paper](https://arxiv.org/pdf/2510.13795)] [[🤗 Models & Datasets](https://huggingface.co/collections/Open-Bee/bee-8b-68ecbf10417810d90fbd9995)] [[💻 Code](https://github.com/Open-Bee)]

## Introduction

We introduce **Bee-8B**, a new state-of-the-art, fully open 8B Multimodal Large Language Model (MLLM) designed to close the performance gap with proprietary models by focusing on data quality.

Bee-8B is trained on our new **Honey-Data-15M** corpus, a high-quality supervised fine-tuning (SFT) dataset of approximately 15 million samples. This dataset was meticulously created with our transparent, adaptable, and open-source data curation pipeline, **HoneyPipe**, which systematically cleans noisy data and enriches it with a novel dual-level (short and long) Chain-of-Thought (CoT) strategy.

This dataset enables Bee-8B to achieve exceptional performance, particularly in complex reasoning, establishing a new standard for fully open MLLMs.

## Key Features

- **High-Quality, Large-Scale Dataset:** We release **Honey-Data-15M**, a new 15M-sample SFT corpus. It has undergone extensive cleaning to remove widespread noise and has been enriched with dual-level CoT reasoning to enhance advanced problem-solving capabilities.
- **Fully Open-Source Data Curation Suite:** We provide not just the data, but the entire methodology. **HoneyPipe** and its underlying framework **DataStudio** offer the community a transparent and reproducible pipeline, moving beyond static dataset releases.
- **State-of-the-Art Open Model:** Our model, **Bee-8B**, achieves state-of-the-art performance among fully open MLLMs and is highly competitive with recent semi-open models like InternVL3.5-8B, demonstrating the power of high-quality data.

## Honey-Data-1M

> [!NOTE]
> The dataset's responses adhere to two specific tag structures: Short CoT responses are formatted as `<think>\n\n</think>\n\n{short CoT Response}`, while Long CoT responses follow the format `<think>\n{Long CoT Reasoning}\n</think>\n\n`. More details about the dataset can be found in the [Paper](https://arxiv.org/abs/2510.13795).

Honey-Data-1M is a high-quality, 1-million-sample subset curated from the full 15-million-sample Honey-Data-15M corpus. It was developed to serve two primary purposes:

- To act as an efficient refinement SFT dataset, used in Stage 4 of the Bee-8B training recipe to further polish the model's capabilities.
- To provide an accessible, high-quality training option for researchers and developers with limited computational resources.

This 1M subset was constructed using a meticulous, multi-faceted selection strategy. The goal was to create a more rational and balanced topic distribution across key domains (like STEM, Chart, Document, OCR, and General) and to achieve an approximate 1:1 ratio between long-chain and short-chain CoT conversations.
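
The two tag structures described in the note above can be told apart with a small parser. The following is a minimal sketch, not part of the official tooling: the names `parse_response` and `ParsedResponse` and the sample strings are hypothetical, and it assumes that an empty `<think>` block marks a short CoT response while a non-empty one marks a long CoT response. Text after `</think>` is treated as the visible answer and may be empty.

```python
import re
from typing import NamedTuple

# Matches a leading <think>...</think> block followed by the remaining text.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*(.*)$", re.DOTALL)


class ParsedResponse(NamedTuple):
    reasoning: str  # text inside <think>...</think>, stripped
    answer: str     # text after </think>, stripped (may be empty)
    kind: str       # "short", "long", or "untagged"


def parse_response(text: str) -> ParsedResponse:
    """Split a response into reasoning and answer, and label its CoT type."""
    match = THINK_RE.match(text)
    if match is None:
        # No <think> block at the start: return the text unchanged as the answer.
        return ParsedResponse("", text.strip(), "untagged")
    reasoning = match.group(1).strip()
    answer = match.group(2).strip()
    # Assumption: an empty reasoning block means a short CoT response.
    return ParsedResponse(reasoning, answer, "long" if reasoning else "short")


# Made-up sample strings that follow the two documented tag structures.
short = parse_response("<think>\n\n</think>\n\nThe answer is 4.")
long_ = parse_response("<think>\n2 + 2 = 4.\n</think>\n\nThe answer is 4.")
print(short.kind, long_.kind)  # prints: short long
```

Counting the `kind` values over the assistant turns of a sample of conversations is one way to check the approximate 1:1 long-to-short ratio on your own copy of the data.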

## Usage

To load the dataset, you can refer to the following code:

```python
from PIL import Image
from datasets import load_dataset

# Load dataset
item = load_dataset("Open-Bee/Honey-Data-1M", split="train")[0]

# Extract data fields
item_id = item['id']
conversations = item['conversations']
images_data = item.get('images', [])
source = item.get('source', None)
img_phash = item.get('img_phash', None)
img_size = item.get('img_size', None)

# Save images and record paths
image_paths = []
for img_idx, image_data in enumerate(images_data):
    image_filename = f"{item_id}_{img_idx}.jpg"
    image_path = image_filename

    # Save image (datasets automatically converts to PIL Image object)
    if isinstance(image_data, Image.Image):
        # JPEG format requires RGB mode
        if image_data.mode in ('RGBA', 'LA', 'P'):
            image_data = image_data.convert('RGB')
        image_data.save(image_path, format='JPEG')
        image_paths.append(image_path)

# Build sample
sample = {
    'id': item_id,
    'conversations': conversations,
    'image': image_paths[0] if len(image_paths) == 1 else image_paths,
    'source': source,
    'img_phash': img_phash,
    'img_size': img_size,
}

# Print result
print(sample)
```

## Licensing Information

The `Honey-Data-1M` dataset is a collection composed of multiple publicly available sub-datasets. Each of these sub-datasets is governed by its own original license.

- **Sub-dataset Licenses:** Users of `Honey-Data-1M` must strictly adhere to the specific licensing terms and conditions of each original sub-dataset included in this collection. We recommend you carefully review the original license for each sub-dataset before use.
- **Prompts and Responses:** To the extent that we hold any intellectual property rights in the modified prompts and newly generated responses created for this project, these contributions are made available under the **Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)** license.
- **Copyright Concerns:** This dataset is compiled for academic research purposes. If you believe any content within `Honey-Data-1M` infringes upon your copyright, please contact us immediately at yi.zhang.4096[at]gmail.com. We will promptly review and address the matter, including the removal of concerned content upon verification.

## Acknowledgements

> [!NOTE]
> If you believe we have missed acknowledging any important data source that should be explicitly mentioned here, please contact us.

Honey-Data-1M is built upon a large collection of publicly available datasets. We extend our deepest gratitude to the creators and maintainers of the following major datasets.

- [LLaVA-OneVision-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data): A comprehensive multimodal instruction tuning dataset
- [MAmmoTH-VL-Instruct-12M](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M): A large-scale vision-language instruction dataset for mathematical reasoning
- [VisualWebInstruct](https://huggingface.co/datasets/TIGER-Lab/VisualWebInstruct): A dataset for web-based visual instruction following
- [ArXiv-OCR-v0.2](https://huggingface.co/datasets/nz/arxiv-ocr-v0.2): OCR data from ArXiv papers for document understanding
- [CoSyn-400K](https://huggingface.co/datasets/allenai/CoSyn-400K): Synthetic data for visual reasoning across multiple domains
- [PixMo Collection](https://huggingface.co/collections/allenai/pixmo): A collection of high-quality vision-language datasets
- And many other datasets including [Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [Cambrian](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M), and numerous individual datasets across VQA, OCR, Charts, STEM, and other domains.

## Citation

If you use our dataset in your research, please cite our paper:

```bibtex
@misc{zhang2025beehighqualitycorpusfullstack,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Yi Zhang and Bolin Ni and Xin-Sheng Chen and Heng-Rui Zhang and Yongming Rao and Houwen Peng and Qinglin Lu and Han Hu and Meng-Hao Guo and Shi-Min Hu},
  year={2025},
  eprint={2510.13795},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.13795},
}
```

Provider: maas

Created: 2025-11-01
