text-to-image-2M

Name: text-to-image-2M
Creator: maas
Published: 2026-04-25 03:38:50
License: No description provided

ModelScope Community | Updated: 2026-04-25 | Indexed: 2025-04-05

Download link:

https://modelscope.cn/datasets/AI-ModelScope/text-to-image-2M

Resource description:

# text-to-image-2M: A High-Quality, Diverse Text-to-Image Training Dataset

## Overview

`text-to-image-2M` is a curated text-image pair dataset designed for fine-tuning text-to-image models. The dataset consists of approximately 2 million samples, carefully selected and enhanced to meet the high demands of text-to-image model training. The motivation behind creating this dataset stems from the observation that datasets with over 1 million samples tend to produce better fine-tuning results. However, existing publicly available datasets often have limitations:

- **Image Understanding Datasets**: Image quality is not guaranteed.
- **Informally Collected or Task-Specific Datasets**: Categories are unbalanced, or the data lacks diversity.
- **Size Constraints**: Available datasets are either too small or too large, and subsets sampled from large datasets often lack diversity.

To address these issues, we combined and enhanced existing high-quality datasets using state-of-the-art text-to-image and captioning models to create `text-to-image-2M`. It includes `data_512_2M`, a 2M-sample 512x512 fine-tuning dataset, and `data_1024_10K`, a 10K-sample high-quality, high-resolution dataset (for high-resolution adaptation).

## Dataset Composition

### data_512_2M

The dataset is composed of several high-quality subsets, as detailed below:

| **Source** | **Samples** | **Prompts** | **Images** |
|------------|-------------|-------------|------------|
| [**LLaVA-next fine-tuning dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) | ~700K | Re-captioned using Qwen2-VL | Original images |
| [**LLaVA-pretrain dataset**](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | ~500K | Original prompts | Images generated by Flux-dev |
| [**ProGamerGov synthetic dataset (DALL·E 3)**](https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions) | ~900K | Filtered for validity | Center-cropped and validity-filtered images |
| **GPT-4o generated dataset** | 100K | Generated by GPT-4o | Images generated by Flux-dev |

### data_1024_10K

10K images generated by Flux-dev, with prompts generated by GPT-4o.

## Usage

The dataset uses the [WebDataset](https://github.com/webdataset/webdataset) format and can be accessed with Hugging Face's `datasets` library:

```py
from datasets import load_dataset

base_url = "https://huggingface.co/datasets/jackyhate/text-to-image-2M/resolve/main/data_512_2M/data_{i:06d}.tar"
num_shards = 46  # Number of webdataset tar files
urls = [base_url.format(i=i) for i in range(num_shards)]
dataset = load_dataset("webdataset", data_files={"train": urls}, split="train", streaming=True)

# Example of iterating through the dataset
for sample in dataset:
    print(sample)  # one row: a single image with its associated columns
    break
```

* Note that as long as `streaming=True` is set in the example above, the dataset does not have to be downloaded in full.
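
Because the shards are plain numbered tar files, a quick inspection does not need all 46 of them. The sketch below is one possible way to stream only a few shards and look at the first samples; the helper names `shard_urls`, `peek`, and `stream_subset` are hypothetical and not part of the dataset or of the `datasets` library, and the column names of each sample are not documented here, so they are left to be inspected at run time.

```python
from itertools import islice
from typing import Any, Iterable, List

# URL template and shard count are taken from the usage example above.
BASE_URL = (
    "https://huggingface.co/datasets/jackyhate/text-to-image-2M/"
    "resolve/main/data_512_2M/data_{i:06d}.tar"
)
NUM_SHARDS = 46


def shard_urls(first: int = 0, count: int = NUM_SHARDS) -> List[str]:
    """Return URLs for `count` consecutive shards starting at index `first`."""
    if first < 0 or count < 0 or first + count > NUM_SHARDS:
        raise ValueError("requested shard range is outside 0..NUM_SHARDS")
    return [BASE_URL.format(i=i) for i in range(first, first + count)]


def peek(samples: Iterable[Any], n: int) -> List[Any]:
    """Materialize at most the first `n` items of any (possibly lazy) iterable."""
    return list(islice(samples, n))


def stream_subset(first: int = 0, count: int = 1):
    """Open a streaming dataset over a subset of shards (needs network access)."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "webdataset",
        data_files={"train": shard_urls(first, count)},
        split="train",
        streaming=True,
    )
```

For example, `peek(stream_subset(0, 1), 4)` would read four samples from the first shard only; calling `.keys()` on one of them shows which columns are actually present.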

## Acknowledgments

This dataset builds on the work of several open-source projects, including:

- [**LLaVA-next fine-tuning dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data)
- [**LLaVA-pretrain dataset**](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- [**ProGamerGov synthetic dataset (DALL·E 3)**](https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions)
- **GPT-4o**
- **Flux-1.0-dev**

We thank the contributors of these datasets and models for making this project possible.

## Citation

```bibtex
@misc{zou2024text2image,
  title={text-to-image-2M: A high-quality, diverse text--image training dataset},
  author={Kai Zou},
  year={2024},
  doi={10.57967/hf/3066}
}
```


Provider:

maas

Created:

2025-04-02

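Collected summary

Dataset introduction

Background and challenges

Background overview

text-to-image-2M is a dataset of approximately 2 million high-quality text-image pairs designed for fine-tuning text-to-image models. It addresses the shortcomings of existing datasets in quality, balance, and diversity. The dataset integrates high-quality data from multiple sources, including re-captioned LLaVA datasets and synthetic data generated with advanced models such as DALL·E 3 and GPT-4o, and it offers two resolution options to suit different training needs.

The content above was collected and summarized by 遇见数据集.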
