ProVision-10M

Name: ProVision-10M
Creator: maas
Published: 2025-12-05 16:46:44
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-08-23

Download link:

https://modelscope.cn/datasets/Salesforce/ProVision-10M

Resource description:

<h1 align="center"> ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models </h1>

ProVision is an extendable data generation engine that produces instruction data for large multimodal language models (MLMs).

In particular, it synthesizes instruction data via data generators (Python programs) and scene graphs rather than proprietary models. It also includes a scene graph generation pipeline consisting of various state-of-the-art models (e.g., an object detection model). Thus, one can generate instruction data for any given image by first generating the scene graph and then applying data generators.

ProVision supports generation of both single-image and multi-image instruction data. One can also extend the engine by adding new data generators.

**You are currently viewing the ProVision-10M dataset.**

![pipeline](pipeline.png)

## Dataset Details

### Dataset Sources

- **Repository**: https://github.com/JieyuZ2/ProVision
- **Paper:** https://arxiv.org/abs/2412.07012
- **Blog:**
- **Source Data:** [Visual Genome](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html)/[GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html) and [DataComp](https://www.datacomp.ai/dcclip/index.html#home)

## Uses

Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This repository is being released for research purposes only.

### Direct Use

ProVision-10M is designed to facilitate research in training multimodal language models.

### Out-of-Scope Use

ProVision-10M was built to make research into large multimodal models more accessible. Training models that ingest or generate personally identifying information (such as images of people’s faces and other sensitive content) and using the dataset for military applications are both inappropriate uses of ProVision-10M.
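The introduction describes data generators as Python programs that turn a scene graph into instruction data. The following is a minimal sketch of that idea, not ProVision's actual code: the scene graph schema, the `ATTRIBUTE_VOCAB` list, and the function names `attribute_generator` and `relation_generator` are all hypothetical and chosen only for illustration. It shows how one annotated graph can yield both short-answer and multiple-choice samples without calling any model.

```python
import random

# Hypothetical toy scene graph; the real ProVision schema may differ.
scene_graph = {
    "objects": {
        "o1": {"name": "dog", "attributes": ["brown"]},
        "o2": {"name": "sofa", "attributes": ["red"]},
    },
    # (subject id, predicate, object id)
    "relations": [("o1", "sitting on", "o2")],
}

# Hypothetical pool used to draw wrong options for multiple-choice questions.
ATTRIBUTE_VOCAB = ("brown", "red", "white", "black")


def attribute_generator(graph, rng, vocabulary=ATTRIBUTE_VOCAB, num_choices=4):
    """Emit one sample per (object, attribute) pair in the graph."""
    samples = []
    for obj in graph["objects"].values():
        for attr in obj["attributes"]:
            # Distractors are attributes the object does not have.
            distractors = [a for a in vocabulary if a not in obj["attributes"]]
            choices = rng.sample(distractors, num_choices - 1) + [attr]
            rng.shuffle(choices)
            samples.append({
                "question": f"Which attribute describes the {obj['name']}?",
                "short_answer": attr,              # short-answer form
                "choices": choices,                # multiple-choice form
                "answer_index": choices.index(attr),
            })
    return samples


def relation_generator(graph):
    """Emit one short-answer sample per relation edge in the graph."""
    samples = []
    for subj_id, predicate, obj_id in graph["relations"]:
        subj = graph["objects"][subj_id]["name"]
        obj = graph["objects"][obj_id]["name"]
        samples.append({
            "question": f"What is the relation from the {subj} to the {obj}?",
            "short_answer": predicate,
        })
    return samples


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for sample in attribute_generator(scene_graph, rng) + relation_generator(scene_graph):
        print(sample)
```

Because every answer is read directly off the graph, the correctness of each sample depends only on the quality of the scene graph, which is presumably why the engine pairs the generators with a scene graph generation pipeline.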
## Dataset Creation

### Curation Rationale

ProVision-10M was created to demonstrate the potential of programmatically synthesizing instruction data for training multimodal language models.

### Source Data

The dataset is built upon two data sources:

- we use 74,289 images and scene graphs from Visual Genome (the GQA version)
- we use 126,106 images from DataComp

### Dataset summary

**We do not release the images, please download the images from their original sources (GQA/DataComp)**

| Split | Size | Format | Description |
| :------------| :------ | :------ | :---- |
| vgs_sa | 1537630 | short answer | single-image instruction data based on Visual Genome |
| vgs_mc | 1537630 | multiple choice | single-image instruction data based on Visual Genome |
| vgm_sa_2_img | 1400000 | short answer | 2-image instruction data based on Visual Genome |
| vgm_mc_2_img | 1400000 | multiple choice | 2-image instruction data based on Visual Genome |
| vgm_sa_3_img | 1400000 | short answer | 3-image instruction data based on Visual Genome |
| vgm_mc_3_img | 1400000 | multiple choice | 3-image instruction data based on Visual Genome |
| vgm_sa_4_img | 1400000 | short answer | 4-image instruction data based on Visual Genome |
| vgm_mc_4_img | 1400000 | multiple choice | 4-image instruction data based on Visual Genome |
| dcs_sa | 2294572 | short answer | single-image instruction data based on DataComp images |
| dcs_mc | 2294572 | multiple choice | single-image instruction data based on DataComp images |
| dcm_sa_2_img | 1400000 | short answer | 2-image instruction data based on DataComp images |
| dcm_mc_2_img | 1400000 | multiple choice | 2-image instruction data based on DataComp images |
| dcm_sa_3_img | 1400000 | short answer | 3-image instruction data based on DataComp images |
| dcm_mc_3_img | 1400000 | multiple choice | 3-image instruction data based on DataComp images |
| dcm_sa_4_img | 1400000 | short answer | 4-image instruction data based on DataComp images |
| dcm_mc_4_img | 1400000 | multiple choice | 4-image instruction data based on DataComp images |

## License

We release ProVision-10M under a CC-BY-NC-4.0 license.

## Ethical Considerations

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

## Citation

```
@article{zhang2024provision,
  title={ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models},
  author={Zhang, Jieyu and Xue, Le and Song, Linxin and Wang, Jun and Huang, Weikai and Shu, Manli and Yan, An and Ma, Zixian and Niebles, Juan Carlos and Xiong, Caiming and others},
  journal={arXiv preprint arXiv:2412.07012},
  year={2024}
}
```
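The split names in the dataset summary appear to follow a regular pattern: a source prefix (`vg` or `dc`), `s` or `m` for single-image or multi-image, a format code (`sa` or `mc`), and, for multi-image splits, an image count. The sketch below decodes a split name under that reading and tallies the listed sizes per source. The reading of the pattern is inferred from the table rather than documented by the authors, and the helper names `parse_split` and `total_by_source` are hypothetical.

```python
import re

# Sizes copied from the dataset summary table.
SPLIT_SIZES = {
    "vgs_sa": 1_537_630, "vgs_mc": 1_537_630,
    "vgm_sa_2_img": 1_400_000, "vgm_mc_2_img": 1_400_000,
    "vgm_sa_3_img": 1_400_000, "vgm_mc_3_img": 1_400_000,
    "vgm_sa_4_img": 1_400_000, "vgm_mc_4_img": 1_400_000,
    "dcs_sa": 2_294_572, "dcs_mc": 2_294_572,
    "dcm_sa_2_img": 1_400_000, "dcm_mc_2_img": 1_400_000,
    "dcm_sa_3_img": 1_400_000, "dcm_mc_3_img": 1_400_000,
    "dcm_sa_4_img": 1_400_000, "dcm_mc_4_img": 1_400_000,
}

# Assumed naming scheme: <source><s|m>_<format>[_<n>_img]
SPLIT_PATTERN = re.compile(r"^(vg|dc)(s|m)_(sa|mc)(?:_(\d+)_img)?$")
SOURCES = {"vg": "Visual Genome", "dc": "DataComp"}
FORMATS = {"sa": "short answer", "mc": "multiple choice"}


def parse_split(name):
    """Decode a split name into its source, format, and image count."""
    match = SPLIT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized split name: {name!r}")
    source, mode, fmt, count = match.groups()
    # Single-image names carry no count; multi-image names must carry one.
    if (mode == "s") != (count is None):
        raise ValueError(f"inconsistent split name: {name!r}")
    return {
        "source": SOURCES[source],
        "format": FORMATS[fmt],
        "num_images": int(count) if count else 1,
    }


def total_by_source(sizes=SPLIT_SIZES):
    """Sum the listed split sizes for each source dataset."""
    totals = {}
    for name, size in sizes.items():
        source = parse_split(name)["source"]
        totals[source] = totals.get(source, 0) + size
    return totals


if __name__ == "__main__":
    print(parse_split("vgm_mc_3_img"))
    print(total_by_source())
```

A helper like this can be handy when selecting subsets, for example keeping only multiple-choice splits with at most two images per sample.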


Provider:

maas

Created:

2025-08-17
