coffeeii/minimind-v_dataset
Source: Hugging Face. Updated 2026-04-29; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/coffeeii/minimind-v_dataset
Description:
All image-text data used in this round of training comes from the ALLaVA-4V series. Compared with earlier data stitched together from several LLaVA-derived sets, ALLaVA-4V is more consistent in quality, provides native Chinese-English parallel text, and has richer fine-grained descriptions. It draws on two sub-sources: high-quality images selected from LAION (mainly natural images) and images selected from the VFLAN instruction stream (mostly documents, charts, and synthetic scenes). The Pretrain part contains about 1.27 million samples and is used mainly for basic alignment of visual tokens to language tokens. The SFT part contains about 2.9 million samples and mixes several task formats, including reasoning-style Q&A, long descriptions, and text-only dialogue. The Chinese-English ratio is roughly balanced, and all images are resized to 256×256 and packed in JPEG format.
Provided by:
coffeeii



