UNO-1M
Source: ModelScope community. Updated 2026-01-09; indexed 2025-09-06.
Download link:
https://modelscope.cn/datasets/bytedance-research/UNO-1M

<h3 align="center">
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
</h3>
<p align="center">
<a href="https://github.com/bytedance/UNO"><img alt="GitHub stars" src="https://img.shields.io/github/stars/bytedance/UNO"></a>
<a href="https://bytedance.github.io/UNO/"><img alt="Project page" src="https://img.shields.io/badge/Project%20Page-UNO-blue"></a>
<a href="https://arxiv.org/abs/2504.02160"><img alt="arXiv paper" src="https://img.shields.io/badge/arXiv%20paper-UNO-b31b1b.svg"></a>
<a href="https://huggingface.co/bytedance-research/UNO"><img alt="Hugging Face model" src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=green"></a>
<a href="https://huggingface.co/datasets/bytedance-research/UNO-1M"><img alt="Hugging Face dataset" src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Dataset&color=yellow"></a>
<a href="https://huggingface.co/spaces/bytedance-research/UNO-FLUX"><img alt="Hugging Face demo" src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=demo&color=orange"></a>
</p>
## Overview
UNO-1M is a large-scale dataset (~1M paired images) constructed with the in-context generation pipeline introduced in the [UNO](https://arxiv.org/abs/2504.02160) paper. It offers diverse categories (more than 365), high resolution (around 1024x1024) with varying aspect ratios, high image quality (images are produced by state-of-the-art text-to-image models), and high subject consistency (pairs are filtered by a VLM-based chain-of-thought (CoT) filter). You can train on this dataset to reproduce the [UNO model](https://huggingface.co/bytedance-research/UNO) or to build your own subject-driven generation model. The entire dataset is open-sourced to benefit research.
## Label Format
| Key name | Type | Description |
| ---------------- | ------ | ---------------------------------------------- |
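| `img_path1` | `str` | Path to the first reference image, relative to `images/`. |
| `img_path2` | `str` | Path to the second reference image, relative to `images/`. |
| `caption` | `dict` | Captions for both images, the same-subject judgment, and the subject words. |
| `vlm_filter_cot` | `dict` | CoT output of the VLM filter: per-part scores (`score_part`) and the final score (`score_final`). |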
## Dataset Structure
### Directory Structure
```bash
output_root/
├── images/
│   ├── split1.tar.gz
│   ├── split2.tar.gz
│   └── ...
├── labels/
│   ├── split1.json
│   ├── split2.json
│   └── ...
```
After extraction:
```bash
output_root/
├── images/
│   ├── split1/
│   │   ├── object365_w1024_h1536_split_Bread_0_0_1_725x1024.png
│   │   ├── object365_w1024_h1536_split_Bread_0_0_2_811x1024.png
│   │   └── ...
│   └── ...
├── labels/
│   ├── split1.json
│   ├── split2.json
│   └── ...
...
```
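The extraction step can be scripted. The following is a minimal sketch, assuming the layout shown above and assuming that each `images/splitN.tar.gz` archive already contains a top-level `splitN/` folder. The function name `extract_splits` is illustrative and not part of any official tooling.

```python
import tarfile
from pathlib import Path


def extract_splits(output_root: str) -> list[Path]:
    """Unpack every images/*.tar.gz archive in place and return the archives processed.

    Assumption: each archive contains a top-level `splitN/` folder, so
    extracting into `images/` yields `images/splitN/...`.
    """
    images_dir = Path(output_root) / "images"
    archives = sorted(images_dir.glob("*.tar.gz"))
    # Use the safer "data" extraction filter when this Python version provides it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for archive in archives:
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extractall(path=images_dir, **extra)
    return archives
```

Calling `extract_splits("output_root")` leaves the archives in place, so delete them afterwards if disk space matters.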
## Usage
UNO-1M ships with rich labels: each record keeps both the per-part breakdown of the consistency score and the final consistency score. The dataset can be used for:
- Text-to-image generation
- Subject-driven generation
- Scored-filter training
- Consistency reward model training
**Note:**
For subject-driven generation, we recommend using pairs whose consistency score is at least 3.5 (stored under the `score_final` key in the JSON labels). In the UNO paper, we train only on pairs with a perfect score of 4. Refer to our technical report for more details.
An example label record is shown below:
```json
{
  "img_path1": "split1/class_generation_w1024_h1536_split_v1_Food_0_0_1_793x1024.png",
  "img_path2": "split1/class_generation_w1024_h1536_split_v1_Food_0_0_2_743x1024.png",
  "caption": {
    "img_path1": "A bowl of beef stew with carrots and garnished with a sprig of parsley, placed on a white surface.",
    "img_path2": "A bowl of beef stew with carrots and garnished with a sprig of parsley, placed on a wooden table surrounded by autumn leaves.",
    "judgment": "same",
    "subject": [
      "beef stew with carrots"
    ]
  },
  "vlm_filter_cot": {
    "score_part": {
      "Beef Chunks": 4.0,
      "Carrots": 3.0,
      "Appearance of the Stew": 4.0,
      "Garnish": 4.0
    },
    "score_final": 3.5
  }
}
```
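The recommended score threshold can be applied with a few lines of Python. This is a minimal sketch, assuming each `labels/splitN.json` file holds a JSON list of records shaped like the example above, and that image paths are relative to `output_root/images`. Both points are assumptions, and the helper names `load_records` and `filter_pairs` are illustrative.

```python
import json
from pathlib import Path


def load_records(label_file: str) -> list[dict]:
    """Read one label file. Assumes the file holds a JSON list of records."""
    with open(label_file, encoding="utf-8") as f:
        return json.load(f)


def filter_pairs(records, images_root: str, min_score: float = 3.5):
    """Keep pairs whose `score_final` is at least `min_score`.

    Returns (path1, path2, caption1, caption2, score) tuples, with image
    paths resolved against `images_root` (assumed to be `output_root/images`).
    """
    root = Path(images_root)
    kept = []
    for rec in records:
        score = rec["vlm_filter_cot"]["score_final"]
        if score < min_score:
            continue  # Drop pairs below the consistency threshold.
        caption = rec["caption"]
        kept.append((
            root / rec["img_path1"],
            root / rec["img_path2"],
            caption["img_path1"],
            caption["img_path2"],
            score,
        ))
    return kept
```

Passing `min_score=4.0` would keep only perfect-score pairs, which matches the training setting the note describes for the UNO paper.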
## Citation
If you find our dataset helpful, please consider citing our work:
```bibtex
@article{wu2025less,
title={Less-to-more generalization: Unlocking more controllability by in-context generation},
author={Wu, Shaojin and Huang, Mengqi and Wu, Wenxu and Cheng, Yufeng and Ding, Fei and He, Qian},
journal={arXiv preprint arXiv:2504.02160},
year={2025}
}
```

Provider: maas
Created: 2025-08-25



