text-to-image-2M
Source: ModelScope community (魔搭社区) · Updated 2026-04-25 · Indexed 2025-04-05
Download link:
https://modelscope.cn/datasets/AI-ModelScope/text-to-image-2M
Description:
# text-to-image-2M: A High-Quality, Diverse Text-to-Image Training Dataset
## Overview
`text-to-image-2M` is a curated text-image pair dataset designed for fine-tuning text-to-image models. The dataset consists of approximately 2 million samples, carefully selected and enhanced to meet the high demands of text-to-image model training. The motivation behind creating this dataset stems from the observation that datasets with over 1 million samples tend to produce better fine-tuning results. However, existing publicly available datasets often have limitations:
- **Image understanding datasets**: Image quality is not guaranteed.
- **Informally collected or task-specific datasets**: Categories are unbalanced, or the data lacks diversity.
- **Size constraints**: Available datasets are either too small or too large, and subsets sampled from large datasets often lack diversity.
To address these issues, we combined and enhanced existing high-quality datasets using state-of-the-art text-to-image and captioning models to create `text-to-image-2M`. It has two parts: `data_512_2M`, a fine-tuning dataset of about 2M samples at 512x512, and `data_1024_10K`, a set of 10K high-quality, high-resolution samples intended for high-resolution adaptation.
## Dataset Composition
### data_512_2M
The dataset is composed of several high-quality subsets, as detailed below:
| **Source** | **Samples** | **Prompts** | **Images** |
|-------------------------------------------------|-------------|--------------------------------------|---------------------------------------------|
| [**LLaVA-next fine-tuning dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) | ~700K | Re-captioned using Qwen2-VL | Original images |
| [**LLaVA-pretrain dataset**](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | ~500K | Original prompts | Images generated by Flux-dev |
| [**ProGamerGov synthetic dataset (DALL·E 3)**](https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions) | ~900K | Filtered for validity | Center-cropped and validity-filtered images |
| **GPT-4o generated dataset** | 100K | Generated by GPT-4o | Images generated by Flux-dev |
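As a quick sanity check on the table above, the per-source counts can be summed and turned into shares. This is only a sketch: the figures are the rounded values from the table (in thousands), and the dictionary keys are shortened labels chosen here for illustration, not identifiers used in the dataset files.

```python
# Approximate per-source sample counts, copied from the table (in thousands).
# These are rounded figures, so the total and shares are approximate too.
SUBSETS_K = {
    "LLaVA-next fine-tuning": 700,
    "LLaVA-pretrain": 500,
    "ProGamerGov synthetic (DALL-E 3)": 900,
    "GPT-4o generated": 100,
}


def composition(counts):
    """Return (total, {name: fraction of total}) for a mapping of counts."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


total_k, shares = composition(SUBSETS_K)
for name, share in shares.items():
    print(f"{name:<34} {SUBSETS_K[name]:>4}K  {share:6.1%}")
print(f"total: about {total_k / 1000:.1f}M samples")
```

With these rounded inputs the sources add up to roughly 2.2M samples, which is in line with the "approximately 2 million" stated in the overview.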
### data_1024_10K
10K images generated by Flux-dev, with prompts generated by GPT-4o.
## Usage
The dataset uses the [WebDataset](https://github.com/webdataset/webdataset) format and can be loaded with the Hugging Face `datasets` library:
```py
from datasets import load_dataset

# URL template for the data_512_2M shards (WebDataset tar files)
base_url = "https://huggingface.co/datasets/jackyhate/text-to-image-2M/resolve/main/data_512_2M/data_{i:06d}.tar"
num_shards = 46  # Number of webdataset tar files
urls = [base_url.format(i=i) for i in range(num_shards)]

# streaming=True reads shards lazily instead of downloading everything first
dataset = load_dataset("webdataset", data_files={"train": urls}, split="train", streaming=True)

# Example of iterating through the dataset
for sample in dataset:
    print(sample)  # a single row: one image with its associated columns
    break
```
* Note that with `streaming=True`, as in the example above, the dataset does not have to be downloaded in full.
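When only a quick look at the data is needed, it may help to stream a few shards rather than all 46 and to inspect the column names before writing a training loop. The sketch below assumes the shard naming and count from the usage example; the helper names `shard_urls`, `peek`, and `stream_subset` are hypothetical and not part of the dataset or of the `datasets` library. The column names are not documented on this page, so the sketch discovers them instead of assuming them.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# URL template and shard count taken from the usage example.
BASE_URL = (
    "https://huggingface.co/datasets/jackyhate/text-to-image-2M"
    "/resolve/main/data_512_2M/data_{i:06d}.tar"
)
NUM_SHARDS = 46


def shard_urls(first: int = 0, count: int = NUM_SHARDS) -> List[str]:
    """Return URLs for `count` consecutive shards starting at index `first`."""
    if first < 0 or count < 0 or first + count > NUM_SHARDS:
        raise ValueError("requested shards fall outside 0..NUM_SHARDS-1")
    return [BASE_URL.format(i=i) for i in range(first, first + count)]


def peek(samples: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, str]]:
    """Summarise the first `n` samples as {column name: value type name}."""
    return [
        {key: type(value).__name__ for key, value in sample.items()}
        for sample in islice(samples, n)
    ]


def stream_subset(count: int = 1):
    """Stream only the first `count` shards (requires network access)."""
    # Imported lazily so the pure helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "webdataset",
        data_files={"train": shard_urls(0, count)},
        split="train",
        streaming=True,
    )
```

Calling `peek(stream_subset(1), n=1)` would then list the columns of the first sample while fetching from a single shard only.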
## Acknowledgments
This dataset builds on the work of several open-source projects, including:
- [**LLaVA-next fine-tuning dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data)
- [**LLaVA-pretrain dataset**](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- [**ProGamerGov synthetic dataset (DALL·E 3)**](https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions)
- **GPT-4o**
- **Flux-1.0-dev**
We thank the contributors of these datasets and models for making this project possible.
## Citation
```bibtex
@misc{zou2024text2image,
  title={text-to-image-2M: A high-quality, diverse text--image training dataset},
  author={Kai Zou},
  year={2024},
  doi={10.57967/hf/3066}
}
```
Provider: maas
Created: 2025-04-02
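Background overview
text-to-image-2M is a dataset of about 2 million high-quality text-image pairs designed for fine-tuning text-to-image models. It addresses the shortcomings of existing datasets in quality, balance, and diversity. The dataset combines high-quality data from several sources, including re-captioned LLaVA data and synthetic data generated by models such as DALL·E 3 and GPT-4o, and it offers two resolution options for different training needs.
Summary compiled by 遇见数据集.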



