HoneyBee

Name: HoneyBee
Creator: maas
Published: 2025-12-26 16:53:19
License: No description provided

ModelScope community. Updated 2025-12-26; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/facebook/HoneyBee


Description:

# HoneyBee: Data Recipes for Vision-Language Reasoners

This is the official data release for the paper: https://arxiv.org/abs/2510.12225.

Github Repo: https://github.com/facebookresearch/HoneyBee_VLM.

## Abstract

Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting of 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.

![image](https://cdn-uploads.huggingface.co/production/uploads/61c5c25705aa54027c52f7b3/pz-sjA_aCUBx9i0hryLky.png)

The data is composed of three components:

1. Questions from OpenThoughts3 (OT3), and chain-of-thoughts from Llama-4 Scout (`q_source='OpenThoughts3'`). We do not re-distribute questions from OT3.
2. Images and questions from ViRL, and chain-of-thoughts from Llama-4 Scout (`q_source='ViRL'`). We do not re-distribute images and questions from ViRL.
3. Images from ViRL, and new questions and chain-of-thoughts from Llama-4 Scout (`q_source='Ours'`). We do not re-distribute images from ViRL.

## Pointers

1. Use this link to download the images from the ViRL dataset: https://huggingface.co/datasets/TIGER-Lab/ViRL39K/blob/main/images.zip
2. Use this script to merge our data release with the original questions from the OT3 and ViRL datasets: https://huggingface.co/datasets/facebook/HoneyBee/blob/main/full_data.py

## Data Explanation

```
q_source: question source
q_id: unique id that will help in populating the questions from original source
image_path: image path from the ViRL data release
question: original question from OT3, ViRL, or Llama-4 Scout generated question
cot: Llama-4 Scout generated chain-of-thought (CoT). As per our insights in the paper, the cot consists of image caption (within <caption> and </caption> tags) from Llama-4 followed by solution to the question. The final answer is enclosed within \boxed{}.
```

## Results of Training with HoneyBee

![image](https://cdn-uploads.huggingface.co/production/uploads/61c5c25705aa54027c52f7b3/BsDvu3FHvoIksQWYoZxtQ.png)

## License Information

The data is released under CC-BY-NC. The data are outputs of Llama 4, and subject to the Llama 4 license (https://github.com/meta-llama/llama-models/tree/main/models/llama4). If you use this portion of the data to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name. Third party content pulled from other locations is subject to its own licenses and you may have other legal obligations or restrictions that govern your use of that content.

## Citation

```
@article{bansal2025honeybee,
  title={HoneyBee: Data Recipes for Vision-Language Reasoners},
  author={Bansal, Hritik and Sachan, Devendra Singh and Chang, Kai-Wei and Grover, Aditya and Ghosh, Gargi and Yih, Wen-tau and Pasunuru, Ramakanth},
  journal={arXiv preprint arXiv:2510.12225},
  year={2025}
}
```
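The `cot` field described under Data Explanation packs an image caption (inside `<caption>` tags) and a solution ending in a `\boxed{}` answer into one string. The following is a minimal sketch of how such a string might be split apart. The function names `split_cot` and `last_boxed` and the sample text are hypothetical and not part of the release; treating a missing caption as `None` and taking the last `\boxed{}` as the final answer are assumptions, not documented behavior.

```python
import re
from typing import Optional, Tuple

# Non-greedy match so only the first <caption>...</caption> pair is captured.
CAPTION_RE = re.compile(r"<caption>(.*?)</caption>", re.DOTALL)


def split_cot(cot: str) -> Tuple[Optional[str], str]:
    # Returns (caption, solution). If no caption tags are found
    # (an assumption for defensive handling), caption is None.
    match = CAPTION_RE.search(cot)
    if match is None:
        return None, cot.strip()
    caption = match.group(1).strip()
    solution = cot[match.end():].strip()
    return caption, solution


def last_boxed(text: str) -> Optional[str]:
    # Returns the contents of the last \boxed{...} in text, counting
    # braces so that nested LaTeX such as \frac{1}{2} is kept whole.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: no reliable answer


if __name__ == "__main__":
    # Invented sample that only mimics the documented layout.
    sample_cot = (
        "<caption>A right triangle with legs of length 1.</caption>\n"
        "The area is half the product of the legs, so the answer is "
        r"\boxed{\frac{1}{2}}."
    )
    caption, solution = split_cot(sample_cot)
    print(caption)
    print(last_boxed(solution))
```

Brace counting is used instead of a single regular expression because boxed answers often contain nested braces, which a simple pattern would truncate.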


Provider: maas

Created: 2025-10-16
