---
license: cc-by-nc-4.0
---
# CogVLM-SFT-311K: Bilingual Visual Instruction Data in CogVLM SFT
CogVLM-SFT-311K is the primary alignment corpus used in the initial training of CogVLM v1.0. The dataset was constructed as follows:
1. Approximately 3,500 high-quality samples were selected from the open-source [MiniGPT-4](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) dataset; this subset is referred to as minigpt4-3500.
2. minigpt4-3500 was merged with [Llava-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and translated into Chinese using a language model.
3. We found significant noise in the detailed-description portion of minigpt4-3500 and Llava-Instruct, so we corrected the Chinese corpus and retranslated it into English.
## Dataset Information
The dataset contains three folders: the mixed minigpt4-3500 and Llava detailed-description data, the Llava single-turn conversations, and the Llava multi-turn conversations. The layout is as follows:
```
.CogVLM-SFT-311K
├── llava_details-minigpt4_3500_formate
├── llava_instruction_multi_conversations_formate
└── llava_instruction_single_conversation_formate
```
In the open-source release, each dataset folder is laid out as follows:
```
.llava_details-minigpt4_3500_formate
├── images
│   └── 00000001.jpg
└── labels
    └── 00000001.json
```
Images are stored in the images folder, while labels containing the corresponding image description or dialogue are stored in the labels folder.
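
The layout above can be walked with a few lines of Python. This is a minimal sketch, not an official loader: the function name `iter_image_label_pairs` is hypothetical, and it assumes every image is a `.jpg` whose label file shares the same file stem, as in the `00000001.jpg` / `00000001.json` example.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_image_label_pairs(subset_dir: str) -> Iterator[Tuple[Path, Path]]:
    """Yield (image_path, label_path) pairs that share a file stem.

    Assumption: images are .jpg files and each label is <stem>.json.
    Images without a matching label file are skipped.
    """
    root = Path(subset_dir)
    labels_dir = root / "labels"
    # Sort so the iteration order is deterministic across runs.
    for image_path in sorted((root / "images").glob("*.jpg")):
        label_path = labels_dir / f"{image_path.stem}.json"
        if label_path.is_file():
            yield image_path, label_path
```

Calling `list(iter_image_label_pairs("CogVLM-SFT-311K/llava_details-minigpt4_3500_formate"))` would return the matched pairs for one folder; the path is an example and depends on where the data was downloaded.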
## Dataset Quantity
+ llava_details-minigpt4_3500_formate: 22,464 images and descriptions
+ llava_instruction_multi_conversations_formate: 56,673 images and multi-turn conversations
+ llava_instruction_single_conversation_formate: 76,634 images and single-turn conversations
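
A downloaded copy can be checked against the counts listed above. The sketch below is illustrative only: `EXPECTED_COUNTS` and `count_files` are hypothetical names, and it assumes the `images`/`labels` layout described earlier with one `.json` label per image.

```python
from pathlib import Path
from typing import Dict, Tuple

# Per-folder counts copied from the list above.
EXPECTED_COUNTS = {
    "llava_details-minigpt4_3500_formate": 22_464,
    "llava_instruction_multi_conversations_formate": 56_673,
    "llava_instruction_single_conversation_formate": 76_634,
}


def count_files(root: str) -> Dict[str, Tuple[int, int]]:
    """Return {folder: (num_images, num_labels)} for each known folder under root."""
    counts = {}
    for name in EXPECTED_COUNTS:
        folder = Path(root) / name
        if not folder.is_dir():
            continue  # folder not downloaded; skip it rather than fail
        n_images = sum(1 for p in (folder / "images").glob("*") if p.is_file())
        n_labels = sum(1 for p in (folder / "labels").glob("*.json"))
        counts[name] = (n_images, n_labels)
    return counts
```

Comparing each returned tuple with `EXPECTED_COUNTS[name]` flags an incomplete download. The three listed counts sum to 155,771 image-text pairs.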
## Dataset Format
### Caption format for image description
```
{
  "captions": [
    {
      "role": "caption",
      "content": "The photograph features a beach scene with a group of people surfing in the ocean. There are ten individuals spaced out in the water, sitting or lying on various sizes of body boards. The surfers line up in a row, close to the sandy shoreline. Some of the body boards are visible floating on the surface of the water, with the surfers riding on them or nearby."
    }
  ]
}
```
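
Reading a caption label amounts to parsing the JSON and collecting the `content` fields. The following is a minimal sketch under the format shown above; `load_captions` is a hypothetical helper name, and it assumes every entry in `captions` uses the `"caption"` role.

```python
import json
from typing import List


def load_captions(label_text: str) -> List[str]:
    """Return the caption strings stored in one label file's JSON text."""
    record = json.loads(label_text)
    # Keep only entries tagged with the "caption" role, as in the example above.
    return [
        entry["content"]
        for entry in record.get("captions", [])
        if entry.get("role") == "caption"
    ]
```

For a file on disk, the text can be read first, for example with `Path(label_path).read_text(encoding="utf-8")`, and then passed to `load_captions`.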
### Conversation format for image dialogue
```
{
  "conversations": [
    {
      "role": "user",
      "content": "What can be inferred about the zebras' behavior and surroundings?"
    },
    {
      "role": "assistant",
      "content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
    }
  ]
}
```
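
A conversation label can be turned into question-answer pairs by walking the `conversations` list in order. This sketch is an illustration rather than the authors' loader: `conversation_turns` is a hypothetical name, and it assumes multi-turn files simply repeat the `user` / `assistant` pattern shown above within the same list.

```python
import json
from typing import List, Tuple


def conversation_turns(label_text: str) -> List[Tuple[str, str]]:
    """Pair each user message with the assistant reply that follows it."""
    messages = json.loads(label_text)["conversations"]
    turns = []
    pending_question = None
    for message in messages:
        if message["role"] == "user":
            pending_question = message["content"]
        elif message["role"] == "assistant" and pending_question is not None:
            turns.append((pending_question, message["content"]))
            pending_question = None  # wait for the next user message
    return turns
```

A single-turn file yields a list with one pair; an assistant message with no preceding user message is ignored.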
## License
+ Because of the non-commercial terms, we did not use this data in the bilingual version of CogVLM or in any other model involving commercialization.
+ The dataset is licensed under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Use of the data must also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use

You may not use this data for any **commercial activities**.
## References
This project utilizes data and concepts based on the following research papers:
- Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. arXiv preprint arXiv:2304.08485.