kaist-ai/volcano-train

Name: kaist-ai/volcano-train
Creator: kaist-ai
Published: 2023-11-13 11:37:08
License: no description provided

Hugging Face · Updated 2023-11-13 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/kaist-ai/volcano-train

Resource description:

---
task_categories:
- image-to-text
language:
- en
tags:
- image-to-text
- image-captioning
- visual-question-answering
size_categories:
- 1M<n<10M
---

# Data details

- **274K multimodal feedback and revision data**
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data

# Data collection

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6550c4f27bbfce1878f5f280/lhOvTWfETB2T58ZhhyIZa.png)

Since no multimodal feedback data for training is publicly available as of this writing and human labeling is costly, we used a proprietary LLM to generate feedback data. As shown in the figure, we use an open-source LMM to provide an initial answer to a question about an image. Since current proprietary LLMs cannot process images, we provide object details in text and captions as a proxy for the image. For each data instance, we feed the LLM image information consisting of object details and captions, the question, the initial response, and the gold answer as a reference answer, allowing the model to evaluate the given inputs and produce feedback.

The proprietary LLM might exploit the gold answer to generate the feedback, which can cause inaccuracies in feedback at inference time, when the gold answer is not provided. To avoid this, we give the LLM clear prompts to use the text-formatted image details when generating feedback. When constructing the revision data, we set up the system to predict the existing gold answer as the output, using the feedback data, image, question, and initial response obtained from the previous steps as input, without involving any separate model generation process.

Although Volcano is trained using the language modeling objective in a manner consistent with traditional VLMs, it not only follows instructions but can also provide critical feedback based on image information and subsequently self-revise. This enhanced ability is attributed to Volcano's combined training with visual instruction tuning data, feedback, and revision data.
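The feedback-generation step described under data collection can be sketched roughly as follows. This is only an illustration of how the five text inputs might be assembled into one prompt; the class name, field names, and prompt wording are hypothetical and are not taken from the dataset or from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class FeedbackRequest:
    """Text-only inputs for the feedback LLM (all names here are hypothetical)."""
    object_details: str    # object information in text, a proxy for the image
    captions: str          # image captions, also a proxy for the image
    question: str
    initial_response: str  # answer produced by the open-source LMM
    gold_answer: str       # reference answer, only available when building data


def build_feedback_prompt(req: FeedbackRequest) -> str:
    """Assemble one prompt asking for feedback grounded in the image details."""
    return "\n".join([
        "Evaluate the initial response and write feedback.",
        "Base the feedback on the image details, not on the reference answer alone.",
        f"Object details: {req.object_details}",
        f"Captions: {req.captions}",
        f"Question: {req.question}",
        f"Initial response: {req.initial_response}",
        f"Reference answer: {req.gold_answer}",
    ])
```

The second instruction line mirrors the stated intent of steering the LLM toward the text-formatted image details, since the gold answer will not be available at inference time.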

Provided by:

kaist-ai

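Summary of original information

Dataset details

Dataset size

274K multimodal feedback and revision data
558K image-text pairs filtered from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
450K academic-task-oriented VQA data mixture.
40K ShareGPT data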
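As a quick sanity check, the listed subset sizes can be added up and compared with the `size_categories: 1M<n<10M` value declared in the dataset card. The dictionary keys below are made-up labels for this sketch, and the figures are the rounded "K" values from the list, so the total is approximate.

```python
# Subset sizes in thousands of examples, as listed above (rounded figures).
# The keys are illustrative labels, not names used by the dataset.
subset_sizes_k = {
    "feedback_and_revision": 274,
    "blip_captioned_pairs": 558,
    "gpt_instruction_following": 158,
    "academic_vqa_mixture": 450,
    "sharegpt": 40,
}

total_k = sum(subset_sizes_k.values())
total_examples = total_k * 1_000

# The dataset card declares size_categories: 1M<n<10M.
in_declared_bucket = 1_000_000 < total_examples < 10_000_000

print(f"{total_k}K examples, within 1M<n<10M: {in_declared_bucket}")
# prints "1480K examples, within 1M<n<10M: True"
```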

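Data collection

Because no multimodal feedback data for training is currently publicly available and human labeling is costly, we use a proprietary large language model (LLM) to generate the feedback data.
An open-source large multimodal model (LMM) provides the initial answer to a question about an image. Since current proprietary LLMs cannot process images, we supply object details and captions in text form as a proxy for the image.
For each data instance, we feed the LLM the image information (object details and captions), the question, the initial response, and the reference answer, so that it can evaluate the inputs and produce feedback.
To avoid inaccurate feedback at inference time, we explicitly prompt the LLM to use the text-formatted image details when generating feedback.
When constructing the revision data, we set up the system to predict the existing reference answer as the output, using the feedback data, image, question, and initial response from the previous steps as input, without any separate model generation process.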
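The revision-data construction can be illustrated with a minimal sketch: the training target is simply the existing reference answer, and no model is called to produce it. The record layout and function name below are assumptions made for illustration and may not match the actual schema of the released files.

```python
from typing import TypedDict


class RevisionExample(TypedDict):
    """One revision training instance (hypothetical field names)."""
    image_path: str
    question: str
    initial_response: str
    feedback: str
    target: str


def make_revision_example(
    image_path: str,
    question: str,
    initial_response: str,
    feedback: str,
    reference_answer: str,
) -> RevisionExample:
    """Pair the inputs from earlier steps with the reference answer as target."""
    return {
        "image_path": image_path,
        "question": question,
        "initial_response": initial_response,
        "feedback": feedback,
        # The target is copied from the reference answer; nothing is generated.
        "target": reference_answer,
    }
```

Reusing the reference answer as the target means the revision targets need no additional model generation step beyond the feedback already collected.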

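Model training

Volcano is trained with the same language modeling objective as traditional vision-language models (VLMs), yet it can not only follow instructions but also provide critical feedback based on image information and then revise its own answers.
This enhanced ability is attributed to Volcano's joint training on visual instruction tuning data, feedback data, and revision data.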
