winninghealth/windata-vision-synthetics-zh-300k
Source: Hugging Face. Updated: 2024-12-19. Indexed: 2024-12-21.
Download link:
https://hf-mirror.com/datasets/winninghealth/windata-vision-synthetics-zh-300k
Description:
We compiled and generated a Chinese multimodal image-text instruction dataset containing approximately 300,000 entries and about 200,000 images, covering scenarios such as documents, charts, mathematics, and OCR. Open-source data offers few Chinese image-text instruction sets, and the ones that exist tend to have overly brief descriptions. To address this, we designed a synthetic data generation method built on open-source models. First, we used Qwen2-vl-72B-Instruct to generate relatively detailed Chinese caption instruction data. Next, we randomly selected 1-4 images from the same scenario together with their Chinese captions and passed the captions to our large language model, WiNGPT-2.6, which was prompted through a system instruction to ask a question in each round. Each question, along with the images, was then given to Qwen2-vl-72B-Instruct to answer. Repeating this loop a set number of times yielded multi-turn, multi-image dialogue data. We filtered the generated data with rules based on answer length and sentence repetition; math problems were additionally filtered against the answers in the original data. When building the final caption instruction set, we designed more than a hundred questions for each scenario to keep the caption data diverse; for the dialogue data, we had WiNGPT-2.6 generate questions across different scenarios to obtain varied questions. The result is a synthetic Chinese multimodal image-text instruction set that is diverse, has fairly detailed answers, and is of reasonable quality.
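The generation loop and rule-based filter described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: `ask` and `answer` are hypothetical callables standing in for WiNGPT-2.6 and Qwen2-vl-72B-Instruct, passing the earlier turns to both models is an assumption, and the character n-gram repetition measure and all thresholds (`min_len`, `max_len`, `max_repeat`) are placeholder choices, since the description does not state the actual rules.

```python
import random
from typing import Callable, Dict, List, Sequence

Turn = Dict[str, str]


def build_dialogue(
    scene_items: Sequence[Dict[str, str]],
    ask: Callable[[List[str], List[Turn]], str],
    answer: Callable[[List[str], str, List[Turn]], str],
    num_rounds: int,
    rng: random.Random,
) -> Dict[str, object]:
    """Assemble one multi-image, multi-turn sample from a single scenario."""
    # Pick 1-4 images, with their captions, from the same scenario.
    k = rng.randint(1, min(4, len(scene_items)))
    picked = rng.sample(list(scene_items), k)
    images = [item["image"] for item in picked]
    captions = [item["caption"] for item in picked]

    turns: List[Turn] = []
    for _ in range(num_rounds):
        # Hypothetical questioner: a text-only model that sees the captions.
        question = ask(captions, turns)
        # Hypothetical answerer: a vision model that sees the images.
        reply = answer(images, question, turns)
        turns.append({"question": question, "answer": reply})
    return {"images": images, "turns": turns}


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of character n-grams that duplicate an earlier n-gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def keep_sample(
    sample: Dict[str, object],
    min_len: int = 20,       # assumed threshold
    max_len: int = 2000,     # assumed threshold
    max_repeat: float = 0.3, # assumed threshold
) -> bool:
    """Drop samples whose answers are too short, too long, or too repetitive."""
    for turn in sample["turns"]:
        text = turn["answer"]
        if not (min_len <= len(text) <= max_len):
            return False
        if repeated_ngram_ratio(text) > max_repeat:
            return False
    return True
```

In use, `build_dialogue` would be called once per sampled group of images, and only samples for which `keep_sample` returns `True` would be kept. The additional answer check for math problems mentioned in the description is not covered by this sketch.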
Provider:
winninghealth



