llava-instruct-zh-600k
收藏魔搭社区2026-01-07 更新2025-07-26 收录
下载链接:
https://modelscope.cn/datasets/opencsg/llava-instruct-zh-600k
下载链接
链接失效反馈官方服务:
资源简介:
* 仿照 [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) ,使用 [Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) 合成的用于微调**中文**VLM的数据;也可以与英文数据集混合使用,训练多语言VLM
* 任务类型为基于单张图片的问答和对话,每个样本都对应一张不同的图片,其中大部分图片包含中文字符,更适合中文场景下视觉语言模型的训练。
* 图片从各类中文网站上爬取
* 包含3类任务:日常对话、复杂推理、描述图片。日常对话通常是5轮对话,其余任务是1轮对话。
* 每种任务的数量如下:
| 任务类型 | 数量 |
|------------|---------|
| 日常对话 | 247,431 |
| 复杂推理 | 194,646 |
| 描述图片 | 199,791 |
* 用于生成对话数据的prompt如下
日常对话
```text
设计一个你和一个询问这张照片的人之间的对话。答案应该是视觉AI助手看到图像并回答问题的语气。
你需要提出不同的问题并给出相应的答案。问题可以包括询问图像视觉内容的问题,包括对象类型、对象计数、对象动作、对象位置、对象之间的相对位置等。必须是有明确答案的问题,即
(1) 人们可以在图像中明确看到问题所问的内容,并且可以自信地回答;
(2) 人们可以从图像中肯定地确定它不在图像中。
问题还可以包括与图像中的内容相关的复杂问题,例如,询问图像中对象的背景知识,要求讨论图像中发生的事件等。同样,不要问不确定的细节。
在回答复杂问题时要提供详细的答案。例如,给出详细的例子或推理步骤,使内容更具说服力和组织性。如有必要,您可以包含多个段落。最多不超过5轮对话。
以 用户:...\n\n助手:...\n\n用户:...\n\n助手:...\n\n 的格式返回。
```
复杂推理
```text
设计一个关于这张照片的问题,并提供详细的答案。必须是除了描述场景以外的复杂问题。
要回答这些问题,首先需要理解视觉内容,然后根据背景知识进行推理,来解释事情为什么会这样发生,或者为用户的请求提供指导和帮助。你可以通过不在问题中包含视觉内容的细节来使问题具有挑战性,这样回答问题时必须先根据视觉内容进行推理。
以 问题:...\n\n回答:...\n\n 的格式返回。
```
描述图片
```text
详细描述这张照片的场景。包括对象计数、对象位置、对象之间的相对位置等详细信息。
当图片中存在文字时,需要将文字提取出来进行描述。描述应尽可能全面,尽量覆盖所有对象。
```
This dataset is synthesized using [Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct), following the format of [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K), and is intended for fine-tuning **Chinese** VLMs; it can also be mixed with English datasets to train multilingual VLMs.
The task types include single-image-based question answering and dialogue. Each sample corresponds to a distinct image, and most of these images contain Chinese characters, making this dataset more suitable for training visual language models in Chinese scenarios.
All images are crawled from various Chinese websites.
The dataset contains three types of tasks: daily dialogue, complex reasoning, and image captioning. Daily dialogue tasks typically consist of 5 rounds of conversation, while the other two types are single-round tasks.
The task counts are as follows:
| Task Type | Quantity |
|--------------------|-----------|
| Daily Dialogue | 247,431 |
| Complex Reasoning | 194,646 |
| Image Captioning | 199,791 |
The prompts used to generate the dialogue data are as follows:
#### Daily Dialogue
text
Design a conversation between you and a person asking about this photo. The response should be in the tone of a visual AI assistant viewing the image and answering questions.
You need to propose diverse questions and provide corresponding answers. Questions can include those about the visual content of the image, such as object types, object counts, object actions, object locations, relative positions between objects, etc. Questions must have definite answers, i.e.,
(1) The content asked in the question can be clearly seen in the image and answered with confidence;
(2) People can definitely determine from the image that the content is not present in the image.
Questions can also include complex questions related to the content in the image, such as asking for background knowledge of objects in the image, requesting discussions about events occurring in the image, etc. Similarly, do not ask for uncertain details.
Provide detailed answers when responding to complex questions. For example, give detailed examples or reasoning steps to make the content more persuasive and well-organized. Multiple paragraphs are allowed if necessary. The conversation must not exceed 5 rounds.
Return in the format: User: ...
Assistant: ...
User: ...
Assistant: ...
#### Complex Reasoning
text
Design a question about this photo and provide a detailed answer. The question must be a complex one other than describing the scene.
To answer these questions, you first need to understand the visual content, then perform reasoning based on background knowledge to explain why things happen this way, or provide guidance and assistance for the user's request. You can make the question challenging by not including details of the visual content in the question, so that answering the question requires first reasoning based on the visual content.
Return in the format: Question: ...
Answer: ...
#### Image Captioning
text
Describe the scene in this photo in detail. Include detailed information such as object counts, object locations, relative positions between objects, etc.
If there is text in the image, extract and describe the text. The description should be as comprehensive as possible, covering all objects in the image.
提供机构:
maas
创建时间:
2025-07-21



