Echo-4o-Image
Source: ModelScope community; updated 2026-01-06; indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Echo-4o-Image
Resource description:
# Echo-4o-Image Dataset
[Paper](https://huggingface.co/papers/2508.09987) | [Project Page](https://yejy53.github.io/Echo-4o) | [Code](https://github.com/yejy53/Echo-4o)
## Introduction
Echo-4o-Image is a 180K-scale synthetic dataset generated by GPT-4o, designed to advance open-source models in image generation. While real-world image datasets are valuable, synthetic images offer crucial advantages, especially in addressing blind spots in real-world coverage:
* **Complementing Rare Scenarios:** Synthetic data can generate examples for scenarios less represented in real-world datasets, such as surreal fantasy or multi-reference image generation, which are common in user queries.
* **Clean and Controllable Supervision:** Unlike real-world data, which often contains complex background noise and misalignment between text and image, synthetic images provide pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment.
This dataset was used to fine-tune the unified multimodal generation baseline Bagel into Echo-4o, which performs strongly across standard benchmarks. Echo-4o-Image also consistently improves other foundation models (e.g., OmniGen2, BLIP3-o), indicating strong transferability.
## Echo-4o-Image Dataset Details
Echo-4o-Image is a large-scale synthetic dataset distilled from GPT-4o, containing approximately 179,000 samples. It spans three distinct task types:
* **38K surreal fantasy generation tasks:** Designed to address imaginative content.
* **73K multi-reference image generation tasks:** For scenarios requiring multiple visual cues.
* **68K complex instruction execution tasks:** To improve adherence to detailed textual prompts.
For better visualization, an online gallery showcasing representative samples from our dataset is available: [Online Gallery](https://yejy53.github.io/Echo-4o/)
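As a quick sanity check on the composition above, the following sketch sums the three task counts and computes each task's share of the dataset. The dictionary keys are illustrative names chosen here, not identifiers from the dataset itself.

```python
# Task counts in thousands of samples, as listed in the section above.
# The key names are illustrative, not field names used by the dataset.
task_counts_k = {
    "surreal_fantasy": 38,
    "multi_reference": 73,
    "complex_instruction": 68,
}


def task_shares(counts):
    """Return each task's share of the total as a percentage, rounded to 1 decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_k = sum(task_counts_k.values())
shares = task_shares(task_counts_k)
print(total_k, shares)
```

The counts sum to 179K, which matches the "approximately 179,000 samples" figure, with multi-reference generation as the largest share.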
## Data Structure
The dataset typically organizes data within compressed packages (e.g., `.tar.gz` files referenced in `configs`). Inside these packages, data is arranged as follows:
```
- package_idx/
--- package_idx.json # metadata for samples in this package
--- images/
----- 00001.png
----- 00002.png
...
```
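The layout above can be read with a few lines of Python once a package has been extracted. This is a minimal sketch, not the repository's loader: `load_package` is a hypothetical helper, it assumes the metadata file is named after its package directory as the listing suggests, and it returns the JSON unparsed beyond `json.load` because the metadata schema is not documented here.

```python
import json
from pathlib import Path


def load_package(package_dir):
    """Read one extracted package: its metadata JSON and its sorted image paths."""
    package_dir = Path(package_dir)
    # Assumed layout: <package_idx>/<package_idx>.json and <package_idx>/images/*.png
    metadata_path = package_dir / f"{package_dir.name}.json"
    with metadata_path.open(encoding="utf-8") as f:
        metadata = json.load(f)  # schema is not specified here, so return it as-is
    image_paths = sorted((package_dir / "images").glob("*.png"))
    return metadata, image_paths
```

Calling `load_package("path/to/package_idx")` returns the parsed metadata together with the image paths in filename order, which is a convenient starting point for inspecting a package before wiring it into a training pipeline.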
## Usage
This dataset can be used to train and fine-tune text-to-image models and to extend them to multi-reference image generation.
### Training
Training builds on existing frameworks (e.g., Bagel).
1. **Data Preparation:** Follow data preparation guidelines, ensuring multi-reference data adheres to the expected format.
2. **Training Process:** Training scripts use interfaces and parameters similar to established models (e.g., Bagel), allowing for seamless integration with existing training commands and configurations.
### Inference
* **Text-to-Image Tasks:** For standard text-to-image generation, follow the inference process of base models (e.g., Bagel).
* **Multi-Reference Tasks:** Specific examples and guides for tasks involving multiple references are provided in the [official GitHub repository](https://github.com/yejy53/Echo-4o).
### Code and Supporting Files
The associated GitHub repository provides crucial supporting files for working with the dataset:
* **Attributes and Subjects:** `./code/attributes_and_subjects.json` contains dictionaries defining various attributes and subjects used in the dataset.
* **Range-sensitive filtering:** `./code/range_sensitive_filter.json` contains metadata for data filtering, and `./code/data_filter.py` converts it for use in dataloaders.
* **Data Loader:** `./code/dataloader.py` provides an example of how to load the data into image pairs, incorporating filtering and balanced resampling.
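To illustrate what balanced resampling means in general, the sketch below weights each sample by the inverse of its group size, so that every group is drawn with the same overall probability. It is a generic illustration and not the logic in `./code/dataloader.py`; the function names, the grouping by label, and the fixed seed are all assumptions made here, and the range-sensitive filtering step is omitted.

```python
import random
from collections import Counter


def balanced_weights(labels):
    """Weight each sample by 1 / (size of its group) so all groups have equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def resample(samples, labels, k, seed=0):
    """Draw k samples with replacement, giving every group the same overall probability."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.choices(samples, weights=balanced_weights(labels), k=k)
```

For example, with three samples labeled `"a"` and one labeled `"b"`, each `"a"` sample gets weight 1/3 and the `"b"` sample gets weight 1, so both groups are equally likely to be drawn despite their different sizes.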
## Evaluation Benchmarks
The paper introduces two novel benchmarks for rigorously evaluating image generation capabilities:
* **GenEval++:** Increases instruction complexity and uses an automated evaluator (powered by GPT-4.1) to mitigate score saturation and provide a more accurate assessment of text-to-image instruction following.
* **Imagine-Bench:** Focuses on imaginative content, offering a comprehensive evaluation of conceptual creativity and visual consistency across dimensions like fantasy fulfillment, identity preservation, and aesthetic quality.
Detailed guides for these benchmarks can be found in the [EVAL section of the GitHub repository](https://github.com/yejy53/Echo-4o/blob/main/EVAL.md).
## Acknowledgements
We would like to thank the following open-source projects and research works:
* [Bagel](https://github.com/ByteDance-Seed/Bagel)
* [BLIP3o](https://github.com/JiuhaiChen/BLIP3o)
* [OmniGen2](https://github.com/VectorSpaceLab/OmniGen2)
## Citation
If you find this dataset or the associated work useful for your research, please cite the paper:
```bib
@article{ye2025echo4o,
title={Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation},
author={Junyan Ye and Dongzhi Jiang and Zihao Wang and Leqi Zhu and Zhenghao Hu and Zilong Huang and Jun He and Zhiyuan Yan and Jinghua Yu and Hongsheng Li and Conghui He and Weijia Li},
journal={arXiv preprint arXiv:2508.09987},
year={2025},
}
```
Provider: maas
Created: 2025-08-18



