THUDM/CogVLM-SFT-311K

Name: THUDM/CogVLM-SFT-311K
Creator: THUDM
Published: 2023-12-26 10:03:17
License: not listed here (the dataset card below declares cc-by-nc-4.0)

Source: Hugging Face. Updated: 2023-12-26. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/THUDM/CogVLM-SFT-311K

Description:

---
license: cc-by-nc-4.0
---

# CogVLM-SFT-311K: Bilingual Visual Instruction Data in CogVLM SFT

CogVLM-SFT-311K is the primary aligned corpus used in the initial training of CogVLM v1.0. The dataset was constructed as follows:

1. Approximately 3,500 high-quality data samples were selected from the open-source [MiniGPT-4](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) data; this subset is referred to as minigpt4-3500.
2. Minigpt4-3500 was merged with [Llava-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and translated into Chinese by a language model.
3. We discovered significant noise in the detailed-description part of minigpt4-3500 and Llava-Instruct. We therefore corrected the Chinese corpus and retranslated it into English.

## Dataset Information

The dataset contains three folders, corresponding to the mixed part of minigpt4-3500 and Llava, the Llava single-turn conversations, and the multi-turn conversations. Their layout is as follows:

```
.CogVLM-SFT-311K
├── llava_details-minigpt4_3500_formate
├── llava_instruction_multi_conversations_formate
└── llava_instruction_single_conversation_formate
```

In the open-source release, each folder is organized as follows:

```
.llava_details-minigpt4_3500_formate
├── images
│   └── 00000001.jpg
└── labels
    └── 00000001.json
```

Images are stored in the images folder, while labels containing the corresponding image description or dialogue are stored in the labels folder.

## Dataset Quantity

+ llava_details-minigpt4_3500_formate: 22,464 images and descriptions
+ llava_instruction_multi_conversations_formate: 56,673 images and multi-turn conversations
+ llava_instruction_single_conversation_formate: 76,634 images and single-turn conversations

## Dataset Format

Caption format for image description

```
{
  "captions": [
    {
      "role": "caption",
      "content": "The photograph features a beach scene with a group of people surfing in the ocean. There are ten individuals spaced out in the water, sitting or lying on various sizes of body boards. The surfers line up in a row, close to the sandy shoreline. Some of the body boards are visible floating on the surface of the water, with the surfers riding on them or nearby."
    }
  ]
}
```

Conversation format for image dialogue

```
{
  "conversations": [
    {
      "role": "user",
      "content": "What can be inferred about the zebras' behavior and surroundings?"
    },
    {
      "role": "assistant",
      "content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
    }
  ]
}
```

## License

+ Due to non-commercial agreements, we did not use these data in the bilingual version of CogVLM or any other models involving commercialization.
+ The dataset license adheres to Attribution-NonCommercial 4.0 International. Use should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use. You may not use these data for any **commercial activities**.

## References

This project utilizes data and concepts based on the following research papers:

- Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. arXiv preprint arXiv:2304.08485.

Provider:

THUDM

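Collected summary

Dataset introduction

Construction

In vision-language modeling, high-quality bilingual instruction data is important for improving a model's cross-lingual understanding. Construction of CogVLM-SFT-311K began by selecting roughly 3,500 high-quality samples from the open-source MiniGPT-4 data, forming the minigpt4-3500 subset. That subset was then merged with Llava-Instruct-150K and translated into Chinese by a language model. Because the detailed-description portion of the source data contained significant noise, the team corrected the Chinese corpus and translated it back into English, with the aim of improving accuracy and keeping the two languages consistent.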

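Features

The dataset pairs images with aligned text in two languages. It contains three separate folders, covering detailed descriptions (the mixed minigpt4-3500 and Llava part), single-turn conversations, and multi-turn conversations, for a total of more than 150,000 images with associated text. Image descriptions use a structured `captions` format, while conversation data is stored as `user` and `assistant` turns that mirror human-model interaction. The dataset is released under a non-commercial license and is intended for research use, giving instruction fine-tuning of vision-language models a consistently formatted multimodal resource.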
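The total image count implied by the per-folder figures in the dataset card can be checked with a few lines of arithmetic. This is only a sanity check; the dictionary name is illustrative and not part of the dataset.

```python
# Per-folder sample counts as listed in the dataset card's "Dataset Quantity" section.
folder_counts = {
    "llava_details-minigpt4_3500_formate": 22_464,
    "llava_instruction_multi_conversations_formate": 56_673,
    "llava_instruction_single_conversation_formate": 76_634,
}

total_images = sum(folder_counts.values())
print(total_images)      # 155771
print(total_images * 2)  # 311542
```

Doubling the total gives 311,542, which would match the "311K" in the dataset name if each sample is counted once per language. That reading is an inference from the numbers, not something the dataset card states.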

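Usage

CogVLM-SFT-311K is intended mainly for the supervised fine-tuning stage of vision-language models. Users load each image from the images folder together with its corresponding JSON label file, extract the caption or conversation, and build input-output pairs. The data supports task formats such as single-turn description generation and multi-turn dialogue, which makes it straightforward to plug into a training pipeline aimed at bilingual visual understanding and generation. Use must comply with the non-commercial license and with applicable data-use policies.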
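The loading step described above can be sketched as follows. This is a minimal sketch, not an official loader: the function names `load_samples` and `to_pairs` are hypothetical, and it assumes that an image and its label share a file stem (`images/00000001.jpg` with `labels/00000001.json`), as the layout in the dataset card suggests.

```python
import json
from pathlib import Path


def load_samples(subset_dir):
    """Yield (image_path, kind, messages) for each label file in one subset folder.

    Assumes the layout shown in the dataset card: images/<id>.jpg and
    labels/<id>.json, paired by file stem (the pairing rule is an assumption).
    """
    subset_dir = Path(subset_dir)
    for label_path in sorted((subset_dir / "labels").glob("*.json")):
        image_path = subset_dir / "images" / f"{label_path.stem}.jpg"
        if not image_path.exists():
            continue  # skip labels without a matching image
        with label_path.open(encoding="utf-8") as f:
            label = json.load(f)
        # A label holds either a "captions" list or a "conversations" list.
        if "captions" in label:
            yield image_path, "captions", label["captions"]
        elif "conversations" in label:
            yield image_path, "conversations", label["conversations"]


def to_pairs(kind, messages):
    """Turn one label into (prompt, response) pairs.

    Caption labels carry no prompt, so the prompt is left as None;
    conversations are paired as consecutive user/assistant turns.
    """
    if kind == "captions":
        return [(None, m["content"]) for m in messages if m["role"] == "caption"]
    pairs = []
    for prev, cur in zip(messages, messages[1:]):
        if prev["role"] == "user" and cur["role"] == "assistant":
            pairs.append((prev["content"], cur["content"]))
    return pairs
```

Calling `load_samples` on one of the three subset folders and passing each result to `to_pairs` yields the input-output pairs for that subset. Leaving the caption prompt as `None` is a deliberate choice here: the label files shown in the dataset card contain no instruction text for captions, so the training code has to supply its own.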

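Background and challenges

Background overview

As multimodal AI has developed rapidly, vision-language models (VLMs) have become a key bridge between image understanding and natural language processing. Against this background, the Knowledge Engineering Group at Tsinghua University (THUDM) built CogVLM-SFT-311K in 2023 to supply high-quality bilingual visual instruction data for the initial training of CogVLM v1.0. The dataset targets one central question: how visual instruction fine-tuning can improve a model's performance on tasks such as image description and dialogue generation across languages. It combines material from the open-source MiniGPT-4 and LLaVA-Instruct-150K resources, applies Chinese-English translation and noise correction, and serves as a data foundation for later multimodal model development.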

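Current challenges

Within vision-language alignment, CogVLM-SFT-311K addresses the problem of modeling fine-grained semantic links between images and text, and in particular of producing accurate, coherent responses to visual instructions across languages. Its construction involved several difficulties. First, the open-source source data (MiniGPT-4 and LLaVA-Instruct) contained significant noise, especially in the detailed descriptions, which had to be corrected and retranslated to ensure data quality. Second, the bilingual conversion depends on careful use of a language model to avoid semantic distortion or loss of cultural context. Third, the merged data has to balance several formats (single-turn conversations, multi-turn conversations, and image descriptions) so that it covers a range of visual reasoning tasks. Together these points show why building high-quality multimodal data is both difficult and necessary.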

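Common scenarios

Typical use case

Multimodal alignment research in vision-language modeling increasingly depends on high-quality bilingual instruction data. THUDM/CogVLM-SFT-311K combines selected parts of MiniGPT-4 and LLaVA-Instruct-150K, then applies translation and correction, to produce a corpus of image descriptions, single-turn conversations, and multi-turn conversations. Its most typical use is as supervised fine-tuning data for vision-language models, CogVLM v1.0 in particular, to improve how well a model understands image content and produces coherent, accurate responses in both languages. Its consistent format and corrected annotations support better performance on multi-turn interaction and detailed description tasks.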

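Practical applications

Models trained on THUDM/CogVLM-SFT-311K can be integrated into assistants, educational tools, or content-creation platforms to provide image analysis and bilingual interaction, subject to the dataset's non-commercial license. Examples include helping visually impaired users understand their surroundings, explaining visual material in cross-language education, and generating captions and descriptions for multimedia content. In these settings the models can make human-computer interaction more natural and information access more efficient.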

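Derived and related work

The dataset's construction approach and annotation strategy have informed later research. CogVLM v1.0 used it as the primary alignment corpus in its initial training, and the model's multimodal understanding owes much to fine-tuning on this data. The dataset also gives the community a reference point for exploring more efficient data-cleaning methods, studying visual instruction fine-tuning for low-resource languages, and developing multi-turn dialogue models that support complex reasoning.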

The content above was collected and summarized by 遇见数据集.
