General-Purpose Synthetic Image-Text Dataset

Name: General-Purpose Synthetic Image-Text Dataset (通用合成图文数据集)
Creator: 中电数据产业集团有限公司
Published: 2026-04-10 19:03:05
License: No description available

国家数据集管理服务平台 (National Dataset Management Service Platform), updated 2026-04-10, indexed 2026-04-29

Download link:

https://www.ndsms.cn/dataRetrieval/datasetDetail/?id=f296242522c808825a7ada719795aea3

Resource description:

The General-Purpose Synthetic Image-Text Dataset is a synthetic image-text dataset at the scale of tens of millions of pairs. It is built in two stages. First, physical-world knowledge is extracted from open-source image data using a prompt knowledge base together with large models for image segmentation and image-to-text generation; the extracted knowledge is then fused, classified, and graded to form a physical-world knowledge database. Second, a synthetic image engine draws on this database and, again using the prompt knowledge base together with a text-to-image model, generates image data; during this step large language models (LLMs) generalize and augment the text. The generated images are classified, graded, and automatically annotated to build the synthetic image-text pair dataset. The text is primarily in Chinese. More than 30 million image-text pairs have been synthesized so far, and the dataset continues to grow. The core technique is an intelligent processing method grounded in semantic processing, which combines industry knowledge with domain requirements to deeply process and enhance the original image data and thereby build a high-quality, aligned image-text dataset.
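The two-stage flow described above can be pictured roughly as follows. This is only an illustrative sketch, not the publisher's implementation: every name in it (`KnowledgeEntry`, `ImageTextPair`, `build_knowledge_base`, `synthesize_pairs`, `demo`) is hypothetical, the models are passed in as plain callables, "fusion" is approximated by simple de-duplication, and the toy stand-ins in `demo` use strings in place of real images.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class KnowledgeEntry:
    """One item in the (hypothetical) physical-world knowledge database."""
    description: str
    category: str
    grade: int


@dataclass
class ImageTextPair:
    """One synthesized image with its text and annotation labels."""
    image: bytes
    text: str
    category: str
    grade: int


def build_knowledge_base(
    images: Iterable[str],
    segment: Callable[[str], List[str]],
    caption: Callable[[str], str],
    classify: Callable[[str], Tuple[str, int]],
) -> List[KnowledgeEntry]:
    """Stage 1: segment source images, caption each region, classify and grade."""
    entries: Dict[str, KnowledgeEntry] = {}
    for image in images:
        for region in segment(image):
            text = caption(region)
            category, grade = classify(text)
            # Assumption: "fusion" is modeled as de-duplication by description.
            entries.setdefault(text, KnowledgeEntry(text, category, grade))
    return list(entries.values())


def synthesize_pairs(
    entries: Iterable[KnowledgeEntry],
    augment: Callable[[str], List[str]],
    text_to_image: Callable[[str], bytes],
    annotate: Callable[[bytes, str], Tuple[str, int]],
) -> List[ImageTextPair]:
    """Stage 2: augment text with an LLM, generate images, then annotate them."""
    pairs: List[ImageTextPair] = []
    for entry in entries:
        for prompt in augment(entry.description):
            image = text_to_image(prompt)
            category, grade = annotate(image, prompt)
            pairs.append(ImageTextPair(image, prompt, category, grade))
    return pairs


def demo() -> List[ImageTextPair]:
    """Run both stages with toy stand-ins for the real models."""
    source_images = ["cat on sofa", "cat on sofa", "dog in park"]
    knowledge = build_knowledge_base(
        source_images,
        segment=lambda img: [img],            # one region per image
        caption=lambda region: region,        # caption equals the region text
        classify=lambda text: ("animal", 1),  # fixed category and grade
    )
    return synthesize_pairs(
        knowledge,
        augment=lambda text: [text, text + ", photo"],
        text_to_image=lambda prompt: prompt.encode("utf-8"),
        annotate=lambda image, prompt: ("animal", 1),
    )
```

Passing the models in as callables keeps the sketch independent of any particular segmentation, captioning, or text-to-image system, none of which the listing names.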

Provider:

中电数据产业集团有限公司

Created:

2026-04-10

Collection Summary

Dataset Introduction