csfufu/Unify-Agent-Toy-Data
Source: Hugging Face | Updated: 2026-04-09 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/csfufu/Unify-Agent-Toy-Data
Resource description:
---
task_categories:
- text-to-image
---
# Unify-Agent
[**Paper**](https://arxiv.org/abs/2603.29620) | [**Code**](https://github.com/shawn0728/Unify-Agent)
This repository contains the official resources for [**Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis**](https://arxiv.org/abs/2603.29620), including our **large-scale training data pipeline**, **agent trajectory dataset**, and **FactIP benchmark** for knowledge-intensive image generation.
## 👀 Intro
<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/showcase.png?raw=true" alt="Unify-Agent Overview" width="80%">
</div>
We introduce **Unify-Agent**, a unified multimodal agent for **world-grounded image synthesis**, together with a new data foundation for training and evaluating search-grounded image generation systems.
A central contribution of this project is the construction of a **tailored multimodal data pipeline** for agentic image generation. Based on this pipeline, we curate **143K high-quality agent trajectories** that supervise the full process of **thinking, searching, grounding, recaptioning, and generation**. These trajectories are designed to teach models how to actively acquire and integrate external world knowledge, rather than relying only on frozen parametric memory.
In addition, we introduce **FactIP**, a new benchmark for **factual, knowledge-intensive, and long-tail image generation**, covering **12 categories** of real-world concepts that explicitly require external knowledge grounding.
Together, these resources make Unify-Agent not only a model, but also a **data and benchmark suite** for advancing research on **agent-based image generation**.
## 📦 Unify-Agent Dataset
Our training data is built to support **end-to-end agentic image generation**. Instead of supervising only the final prompt-image pair, we supervise the full reasoning and retrieval pipeline behind grounded generation.
<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/dataset.png?raw=true" alt="Unify-Agent Data Pipeline" width="85%">
</div>
The dataset contains **143K high-quality agent trajectories**, each covering key stages of the generation process:
- **THINK**: analyze the prompt and identify missing knowledge
- **RESEARCH**: retrieve relevant textual and visual evidence
- **RECAPTION**: transform evidence into grounded generation instructions
- **GENERATE**: synthesize the final image
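The four stages above suggest a simple record layout. The following is a minimal sketch of how one trajectory might be represented in Python. Only the stage names come from this card; the class names `Step` and `Trajectory`, the `content` field, and the assumption that each stage appears exactly once are hypothetical, since the released schema is not documented here.

```python
from dataclasses import dataclass, field
from typing import List

# Stage names are taken from the dataset description above.
# Everything else in this sketch is a hypothetical illustration.
STAGES = ("THINK", "RESEARCH", "RECAPTION", "GENERATE")


@dataclass
class Step:
    stage: str    # one of STAGES
    content: str  # hypothetical payload: reasoning text, evidence, caption, or image path

    def __post_init__(self) -> None:
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")


@dataclass
class Trajectory:
    prompt: str
    steps: List[Step] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True if the steps visit the four stages once each, in the listed order."""
        return tuple(step.stage for step in self.steps) == STAGES


example = Trajectory(
    prompt="A photo of a little-known landmark",
    steps=[
        Step("THINK", "The landmark's appearance is not known; it must be looked up."),
        Step("RESEARCH", "Placeholder for retrieved text snippets and reference images."),
        Step("RECAPTION", "Placeholder for a caption grounded in the retrieved evidence."),
        Step("GENERATE", "output.png"),
    ],
)
print(example.is_complete())  # True
```

Real trajectories may interleave or repeat stages (for example, several retrieval rounds), in which case the completeness check would need to be relaxed.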
This trajectory-level supervision enables models to learn:
- how to detect knowledge gaps in open-world prompts
- how to search for supporting evidence from multiple modalities
- how to convert retrieved evidence into generation-ready captions
- how to preserve factual identity and visual consistency during synthesis
We believe this dataset provides a strong foundation for future work on **search-grounded generation**, **multimodal agents**, and **world-knowledge-intensive text-to-image systems**.
## 🔍 FactIP Benchmark
To evaluate grounded image generation in realistic open-world settings, we build **FactIP**, a new benchmark targeting **factual and long-tail concept generation**.
<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/construction.png?raw=true" alt="FactIP Benchmark Categories" width="80%">
</div>
FactIP contains **three major groups** — **Character**, **Scene**, and **Object** — and **12 fine-grained subcategories**, covering diverse factual generation scenarios such as:
- celebrities
- animated characters
- landmarks
- cultural relics
- food
- toys
- mythology
The full benchmark contains **2,462 prompts**, and we also provide a **mini test subset** with category proportions aligned to the full benchmark.
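A subset whose category proportions match the full set can be drawn with proportional stratified sampling. The sketch below illustrates that idea on toy data; it is not the authors' procedure. The function name `stratified_subset`, the sampling fraction, and the placeholder prompts are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def stratified_subset(
    items: List[Tuple[str, str]], fraction: float, seed: int = 0
) -> List[Tuple[str, str]]:
    """Sample roughly `fraction` of the prompts from every category.

    `items` is a list of (category, prompt) pairs. Each category keeps
    at least one prompt so that no category disappears from the subset.
    """
    rng = random.Random(seed)
    by_category: Dict[str, List[str]] = defaultdict(list)
    for category, prompt in items:
        by_category[category].append(prompt)

    subset: List[Tuple[str, str]] = []
    for category in sorted(by_category):  # sorted for reproducibility
        prompts = by_category[category]
        k = max(1, round(len(prompts) * fraction))
        subset.extend((category, p) for p in rng.sample(prompts, k))
    return subset


# Toy data with placeholder prompts; the category sizes are invented.
toy = (
    [("landmarks", f"landmark prompt {i}") for i in range(60)]
    + [("food", f"food prompt {i}") for i in range(30)]
    + [("toys", f"toy prompt {i}") for i in range(10)]
)
mini = stratified_subset(toy, fraction=0.1)
print(Counter(category for category, _ in mini))
```

With a fraction of 0.1, the 60/30/10 split of the toy data becomes 6/3/1 in the subset, so the category ratios are preserved.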
FactIP is designed to test whether a model can generate images that are not only visually plausible, but also **factually grounded**, **identity-consistent**, and **faithful to real-world knowledge**.
## 🏆 Performance
Unify-Agent substantially improves factual visual synthesis over its base unified model and strong open-source baselines across **FactIP**, **WiSE**, **KiTTEN**, and **T2I-FactualBench**.
<div align="center">
<img src="https://github.com/shawn0728/Unify-Agent/blob/main/images/comparison.png?raw=true" alt="Performance Comparison" width="85%">
</div>
Our method produces images that better preserve:
- **subject identity**
- **fine-grained visual attributes**
- **prompt-specific details**
- **real-world factual grounding**
These results highlight the value of our **data construction pipeline**, **trajectory supervision**, and **benchmark design** for building more reliable image generation agents.
## 🧠 Why This Dataset Matters
Conventional text-to-image training mainly focuses on final prompt-image alignment, but many real-world generation tasks require much richer capabilities: identifying missing knowledge, retrieving evidence, resolving ambiguity, and grounding visual details before synthesis.
Our dataset is designed specifically for this setting. By supervising the full agent workflow instead of only the final output, Unify-Agent opens up new directions for:
- **agentic text-to-image generation**
- **search-augmented image synthesis**
- **benchmarking factual visual generation**
- **training unified multimodal models with external knowledge access**
## 📦 Release Status
The repository is now available, and the **code, dataset, benchmark, and checkpoints** are being prepared for full release.
Please stay tuned for upcoming updates.
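In the meantime, the files currently hosted in this dataset repository can be fetched and inspected with the `huggingface_hub` library. The sketch below assumes that `huggingface_hub` is installed and that the repository is publicly readable; the helper names `fetch_toy_data` and `list_files` and the default directory name are hypothetical. The file layout of the repository is not documented in this card, so the code only lists what it finds.

```python
from pathlib import Path
from typing import List

# Repository name as shown at the top of this card.
REPO_ID = "csfufu/Unify-Agent-Toy-Data"


def fetch_toy_data(local_dir: str = "unify_agent_toy_data") -> Path:
    """Download a snapshot of the dataset repository and return its local path."""
    # Imported lazily so this sketch can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)
    return Path(path)


def list_files(root: Path) -> List[str]:
    """Return the sorted relative paths of all non-hidden files under `root`."""
    files = []
    for p in root.rglob("*"):
        rel = p.relative_to(root)
        # Skip hidden entries such as download caches.
        if p.is_file() and not any(part.startswith(".") for part in rel.parts):
            files.append(str(rel))
    return sorted(files)
```

A typical session would call `root = fetch_toy_data()` followed by `list_files(root)` to see which files are available before writing any parsing code. To download through a mirror, the `HF_ENDPOINT` environment variable can be set before `huggingface_hub` is imported.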
## Citation
If you find this work or dataset helpful, please consider citing:
```bibtex
@article{chen2026unify,
title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
journal={arXiv preprint arXiv:2603.29620},
year={2026}
}
```
Provider:
csfufu



