KempnerInstituteAI/flux.2-dev-synthetic-2M

Name: KempnerInstituteAI/flux.2-dev-synthetic-2M
Creator: KempnerInstituteAI
Published: 2026-04-08 19:40:42
License: no description provided

Source: Hugging Face. Updated 2026-04-08; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/KempnerInstituteAI/flux.2-dev-synthetic-2M

Resource description:

---
license: other
language:
- en
task_categories:
- text-to-image
annotations_creators:
- machine-generated
source_datasets:
- text-to-image-2M
tags:
- synthetic
- diffusion
- text-to-image
- webdataset
- flux
- generative-models
size_categories:
- 1M<n<10M
viewer: false
configs:
- config_name: default
  data_files:
  - split: train
    path: train/*.tar
---

# Flux2.dev Synthetic: 2.2M Text-to-Image Pairs at 512×512

## Dataset Summary

This dataset contains **~2.2 million** large-scale synthetic image–caption pairs generated using the FLUX.2-dev diffusion model:

- **Model**: [black-forest-labs/FLUX.2-dev](https://huggingface.co/black-forest-labs/FLUX.2-dev)
- **Caption source**: [Text2Image-2M](https://huggingface.co/datasets/jackyhate/text-to-image-2M)
- **Total samples**: 2,282,665 (571 shards × ~4000 samples)
- **Resolution**: 512 × 512
- **Image format**: PNG (lossless)
- **Total shards**: 571
- **Samples per shard**: 4000 (last shard: 2665)
- **Total size**: ~865 GB

Each sample consists of:

- `.png` image file
- `.txt` caption file (with the same base name as the image)
- `.json` metadata file (with the same base name as the image)

The dataset is stored in WebDataset format for scalable streaming and distributed training.

## Example Samples

| Caption | Image |
|---------|-------|
| A woman in a hat stands in front of a colorful umbrella. | ![](figures/00060040.png) |
| A plate of Chinese food with a variety of dishes including dumplings, noodles, and vegetables. There is a small bowl of red sauce on the side. The plate is placed on a white tablecloth, and there are chopsticks resting on the tablecloth to the right of the plate. | ![](figures/00060147.png) |
| an open air sculpture with trees in the background | ![](figures/00063454.png) |
| An animated figure, likely a young girl with green eyes and black hair, enjoys a meal of noodles at an outdoor wooden table amidst a calm, scenic environment. | ![](figures/00063999.png) |

## Dataset Statistics

| Property | Value |
|--------|------|
| Total Samples | 2,282,665 |
| Total Shards | 571 |
| Samples per Shard | 4000 (last shard: 2665) |
| Image Resolution | 512×512 |
| Image Format | PNG |
| Total Dataset Size | ~865 GB |
| Caption Source | Text2Image-2M |
| Generation Model | FLUX.2-dev |
| Inference Steps | 50 |
| Guidance Scale | 3.5 |

## Dataset Structure

```bash
flux2_synthetic_000000.tar
flux2_synthetic_000001.tar
...
```

Each `.tar` shard contains grouped samples:

```bash
00020999.png
00020999.txt
00020999.json
```

## Example Metadata

```json
{
  "id": 20999,
  "seed": 42,
  "num_inference_steps": 50,
  "guidance_scale": 3.5,
  "height": 512,
  "width": 512,
  "model": "black-forest-labs/FLUX.2-dev",
  "shard_id": 5
}
```

## Intended Use

The captions are sourced from the Text2Image-2M dataset, which contains a wide variety of image descriptions. The synthetic images were generated using fixed parameters (50 steps, guidance scale 3.5) to create a large-scale dataset for training and evaluating text-to-image models.

Any use of this dataset should follow the terms of the original FLUX.2-dev and Text2Image-2M licenses. See the License section below for details.
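
The shard layout described under Dataset Structure and Example Metadata suggests a simple mapping from a sample id to the shard that holds it. The sketch below is a minimal illustration, assuming ids are zero-based and assigned contiguously with 4000 samples per shard. That assumption matches the one metadata example on the card (id 20999, shard_id 5) but the card does not confirm it for every sample. The helper names `shard_sizes` and `locate_sample` are hypothetical and not part of the dataset.

```python
# Figures taken from the Dataset Statistics table on the card.
SAMPLES_PER_SHARD = 4000
TOTAL_SHARDS = 571
TOTAL_SAMPLES = 2_282_665


def shard_sizes():
    """Return the sample count of every shard; only the last shard is partial."""
    last = TOTAL_SAMPLES - SAMPLES_PER_SHARD * (TOTAL_SHARDS - 1)
    return [SAMPLES_PER_SHARD] * (TOTAL_SHARDS - 1) + [last]


def locate_sample(sample_id):
    """Map a sample id to (shard_id, shard file name, base name in the shard).

    ASSUMPTION: ids are contiguous and zero-based, so the shard index is the
    integer quotient of the id by the shard size.
    """
    if not 0 <= sample_id < TOTAL_SAMPLES:
        raise ValueError(f"sample_id out of range: {sample_id}")
    shard_id = sample_id // SAMPLES_PER_SHARD
    shard_name = f"flux2_synthetic_{shard_id:06d}.tar"  # six-digit shard index
    base_name = f"{sample_id:08d}"                      # eight-digit sample key
    return shard_id, shard_name, base_name


if __name__ == "__main__":
    print(shard_sizes()[-1])      # size of the final, partial shard
    print(locate_sample(20999))   # id from the card's metadata example
```

For the card's metadata example, `locate_sample(20999)` returns shard 5, which agrees with the `shard_id` field shown there, and the per-shard sizes sum to the stated total of 2,282,665.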
## Loading the Dataset

### Using WebDataset (Recommended for PyTorch)

```python
import webdataset as wds

dataset = (
    wds.WebDataset("path/to/flux2_synthetic_{000000..000570}.tar")
    .decode("pil")
    .to_tuple("png", "txt", "json")
)

for image, caption, metadata in dataset:
    # image: PIL Image
    # caption: str
    # metadata: dict (automatically parsed from JSON)
    print(f"Caption: {caption}")
    print(f"Seed: {metadata['seed']}")
```

### Streaming from Hugging Face (Example)

```python
from datasets import load_dataset

# Shard URL pattern as given on the card. The card's config lists the shards
# under train/*.tar, so adjust the path if the files are not found.
dataset = load_dataset(
    "webdataset",
    data_files="https://huggingface.co/datasets/KempnerInstituteAI/flux.2-dev-synthetic-2M/resolve/main/{000000..000570}.tar",
    split="train",
    streaming=True,
)

# Example: iterate through samples
for sample in dataset:
    # The webdataset builder names columns after the file extensions in a shard
    image = sample["png"]
    caption = sample["txt"]
    metadata = sample["json"]
    # Your processing here
```

## General Details

Images were generated using:

- Model: black-forest-labs/FLUX.2-dev
- Steps: 50 inference steps
- Guidance scale: 3.5
- Seed: Stored per-sample
- Resolution: 512×512
- Format: PNG (lossless)

Captions originate from the Text2Image-2M corpus.

## Dataset Creation

This dataset was generated using a distributed pipeline:

1. Captions were extracted from the Text2Image-2M dataset
2. Images were generated using FLUX.2-dev with fixed parameters (50 steps, guidance 3.5)
3. Generation was distributed across 571 shards for parallel processing
4. Each sample includes the original caption and generation metadata
5. Output was packaged in WebDataset format for efficient streaming

## Data Characteristics

- **All images are synthetic** (generated by FLUX.2-dev, not real photographs)
- Captions were not modified from the Text2Image-2M source
- No additional filtering applied beyond successful generation
- No watermarking was applied
- No human annotations were added
- All samples use consistent generation parameters (steps=50, guidance=3.5)

## Limitations and Considerations

- **Synthetic Biases**: Images reflect the biases and characteristics of the FLUX.2-dev model
- **Caption Quality**: Caption quality depends entirely on the Text2Image-2M dataset
- **Limited Diversity**: Generated with fixed parameters, which may limit stylistic diversity
- **No Quality Filtering**: All successfully generated images are included without quality assessment
- **Generation Artifacts**: A small number of samples may contain visual artifacts or generation failures typical of diffusion-based image synthesis
- **Not Ground Truth**: Synthetic images should not be treated as factually accurate representations

## License

Users are responsible for ensuring compliance with all applicable licenses. This dataset combines content from two sources, each with its own license:

### Images: FLUX.2-dev License

All images are synthetic outputs generated using [black-forest-labs/FLUX.2-dev](https://huggingface.co/black-forest-labs/FLUX.2-dev).

- **You MUST review the [FLUX.2-dev license](https://huggingface.co/black-forest-labs/FLUX.2-dev/blob/main/LICENSE.md)** before using this dataset.

### Captions: Text2Image-2M License

Captions originate from the [Text2Image-2M](https://huggingface.co/datasets/jackyhate/text-to-image-2M) dataset.

- **Review the [Text2Image-2M license](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)** for caption usage terms.

### This Dataset

This dataset compilation is intended for research purposes only.
Any use should proceed only after confirming compliance with the applicable upstream licenses.

## Citation

If you use this dataset, please cite:

```bibtex
@misc{naeem_khoshnevis_2026,
  author    = { Naeem Khoshnevis and Gabriel Guo and Eric Vanden-Eijnden and Nicholas Boffi and Michael S. Albergo },
  title     = { flux.2-dev-synthetic-2M (Revision 0e2762f) },
  year      = { 2026 },
  note      = { Generated with FLUX.2-dev using Text2Image-2M captions. The project was co-advised by Nicholas Boffi and Michael S. Albergo. },
  url       = { https://huggingface.co/datasets/KempnerInstituteAI/flux.2-dev-synthetic-2M },
  doi       = { 10.57967/hf/8311 },
  publisher = { Hugging Face }
}
```

Additionally, please cite the FLUX.2-dev model:

```bibtex
@software{flux2_dev,
  title={FLUX.2-dev},
  author={Black Forest Labs},
  year={2024},
  url={https://huggingface.co/black-forest-labs/FLUX.2-dev}
}
```

And the Text2Image-2M dataset:

```bibtex
@misc{zk_2024,
  author    = { zk },
  title     = { text-to-image-2M (Revision e64fca4) },
  year      = 2024,
  url       = { https://huggingface.co/datasets/jackyhate/text-to-image-2M },
  doi       = { 10.57967/hf/3066 },
  publisher = { Hugging Face }
}
```

## Acknowledgements

This dataset was generated as part of research conducted by Naeem Khoshnevis, Gabriel Guo, Eric Vanden-Eijnden, Nicholas Boffi, and Michael S. Albergo, with Nicholas Boffi and Michael S. Albergo serving as shared advisors on the project.

The work used compute resources provided by the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. The large-scale image generation pipeline relied on the Institute's high-performance GPU infrastructure, including distributed compute clusters and high-throughput storage systems.

We gratefully acknowledge the technical support and infrastructure engineering efforts that made this large-scale synthetic data generation possible.
