unipic_nano_3images
ModelScope community · Updated 2026-05-09 · Indexed 2026-05-10
Download link:
https://modelscope.cn/datasets/Skywork/unipic_nano_3images
Resource description:
---
license: apache-2.0
task_categories:
- image-to-image
- text-to-image
language:
- en
tags:
- image-composition
- multi-image
- image-fusion
- image-editing
- unipic
- 3-image-input
pretty_name: UniPic Nano 3Images
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.jsonl"
---
# UniPic-Nano-3Images: A Multi-Image Composition Dataset
## ⚡ Quick Start
The image archive is split into multiple parts for easier downloading. To reconstruct and extract:
```bash
# Step 1: Concatenate split files into a single zip
cat nano-banana.part_* > nano-banana-3images.zip
# Step 2: Extract the images
unzip nano-banana-3images.zip
```
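On systems without `cat` and `unzip`, the same two steps can be sketched in Python with only the standard library. The helper name `reassemble_and_extract` and its parameters are illustrative assumptions, not part of the dataset tooling; the sketch also assumes that the part suffixes sort lexicographically into the correct order, as they do for the shell glob above.

```python
import glob
import shutil
import zipfile


def reassemble_and_extract(part_glob, zip_path, out_dir):
    """Concatenate split parts into one zip file, then extract it.

    Hypothetical helper mirroring `cat parts > x.zip && unzip x.zip`.
    Returns the list of part files in the order they were joined.
    """
    parts = sorted(glob.glob(part_glob))  # lexicographic order
    if not parts:
        raise FileNotFoundError(f"no files match {part_glob!r}")
    with open(zip_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so a large part is never held in memory
                shutil.copyfileobj(src, merged)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return parts
```

Calling `reassemble_and_extract("nano-banana.part_*", "nano-banana-3images.zip", ".")` would correspond to the two shell commands, extracting into the current directory.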
## 📖 Overview
**UniPic-Nano-3Images** is a high-quality multi-image composition dataset containing **35,394** samples designed for training advanced image fusion and composition models. Each sample consists of **3 input images** and **1 output image**, where elements from all three input images are seamlessly combined based on natural language instructions. This dataset is part of the **UniPic** series and has been used in **UniPic3** for training multi-image composition models with complex, multi-element fusion capabilities.
## 🎨 Demo: Multi-Image Composition
This dataset enables models to intelligently combine subjects and multiple objects from three separate images into a single coherent output based on natural language instructions:

*Example: Person from Image1 combined with objects from Image2 and Image3 to create a seamless multi-element composition*
## 🎯 Key Features
- **3-Image Input**: Each sample uses exactly 3 input images for complex composition
- **Multi-Element Fusion**: Combines person + object + object/scene in sophisticated ways
- **Diverse Composition Patterns**: Covers 20+ different composition scenarios with multiple actions
- **High Quality**: 35,394 carefully curated samples with detailed natural language instructions
- **Production Ready**: Used in UniPic3 for real-world multi-image composition applications
- **Simple Format**: JSON Lines format (one JSON object per line) with a straightforward input/output structure
## 📊 Dataset Statistics
### Composition Pattern Distribution
| Composition Pattern | Count | Percentage | Description |
|---------------------|-------|------------|-------------|
| **Person + Object + Object** | 7,468 | 21.1% | Person with two different objects |
| **Person + Wearable + Object** | 5,065 | 14.3% | Person wearing item and holding/near object |
| **Person + Wearable + Wearable** | 3,511 | 9.9% | Person wearing two different items |
| **Person + Object + Wearable** | 1,728 | 4.9% | Person with object and wearing item |
| **Person + Furniture + Object** | 1,572 | 4.4% | Person on furniture with object |
| **Person + Instrument + Object** | 1,537 | 4.3% | Person playing instrument with object |
| **Person + Furniture + Furniture** | 1,100 | 3.1% | Person with two furniture items |
| **Person + Vehicle + Object** | 942 | 2.7% | Person in/on vehicle with object |
| **Person + Instrument + Wearable** | 825 | 2.3% | Person playing instrument wearing item |
| **Person + Wearable + Instrument** | 814 | 2.3% | Person wearing item playing instrument |
| **Person + Furniture + Wearable** | 785 | 2.2% | Person on furniture wearing item |
| **Person + Object + Vehicle** | 709 | 2.0% | Person with object and vehicle |
| **Person + Instrument + Instrument** | 701 | 2.0% | Person with two instruments |
| **Person + Wearable + Vehicle** | 700 | 2.0% | Person wearing item in/on vehicle |
| **Person + Wearable + Furniture** | 685 | 1.9% | Person wearing item on furniture |
| **Other Patterns** | 8,252 | 23.3% | Various other composition combinations |
| **Total** | **35,394** | **100%** | All multi-image composition samples |
### Action Combination Distribution
| Action Combination | Count | Percentage | Description |
|--------------------|-------|------------|-------------|
| **Wearing + Holding** | 6,380 | 18.0% | Person wearing one item and holding another |
| **Holding + Standing** | 4,387 | 12.4% | Person holding items while standing in scene |
| **Holding + Sitting** | 2,991 | 8.5% | Person holding items while sitting |
| **Wearing + Standing** | 2,064 | 5.8% | Person wearing items while standing in scene |
| **Wearing + Sitting** | 1,532 | 4.3% | Person wearing items while sitting |
| **Playing + Sitting** | 737 | 2.1% | Person playing instrument while sitting |
| **Holding + Driving** | 562 | 1.6% | Person holding items while driving |
| **Reading + Sitting** | 144 | 0.4% | Person reading while sitting |
| **Other Multi-Actions** | 15,752 | 44.5% | Various other action combinations |
| **Single Actions** | 845 | 2.4% | Single action compositions |
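
The percentages in the action table follow directly from the counts. As a small sanity check, the sketch below recomputes them; the counts are copied from the table, while the names `ACTION_COUNTS` and `percentages` are illustrative.

```python
# Counts copied from the Action Combination Distribution table
ACTION_COUNTS = {
    "Wearing + Holding": 6380,
    "Holding + Standing": 4387,
    "Holding + Sitting": 2991,
    "Wearing + Standing": 2064,
    "Wearing + Sitting": 1532,
    "Playing + Sitting": 737,
    "Holding + Driving": 562,
    "Reading + Sitting": 144,
    "Other Multi-Actions": 15752,
    "Single Actions": 845,
}


def percentages(counts):
    """Return each category's share of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for name, share in percentages(ACTION_COUNTS).items():
        print(f"{name}: {share}%")
```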
## 📁 Dataset Structure
### Data Format
Each sample is one line of the JSONL file, holding a JSON object with the following structure:
```json
{
  "input_images": ["path/to/0.png", "path/to/1.png", "path/to/2.png"],
  "instruction": "A woman from Image1 is elegantly playing the violin from Image2 while sitting on the plush purple sofa from Image3, creating a sophisticated and artistic scene.",
  "output_image": "path/to/fusion_result.png"
}
```
### Field Descriptions
- **`input_images`**: List of exactly 3 input image paths
  - `Image1`: Typically contains the main subject (person)
  - `Image2`: Contains the first object/element to be composed (often wearable, instrument, or furniture)
  - `Image3`: Contains the second object/element to be composed (often furniture, scene, or additional object)
- **`instruction`**: Natural language description of how to combine all three images, typically following patterns like:
  - Subject description from Image1
  - First action with element from Image2
  - Second action with element from Image3
  - Scene/atmosphere description
- **`output_image`**: Path to the composed output image
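
The field rules above can be turned into a lightweight record check before training. This is a minimal sketch: `validate_sample` and `REQUIRED_KEYS` are hypothetical names, and the checks cover only what the field descriptions state (three input paths, a non-empty instruction, one output path).

```python
REQUIRED_KEYS = {"input_images", "instruction", "output_image"}


def validate_sample(sample):
    """Return a list of problems found in one record; empty means it passed."""
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    images = sample["input_images"]
    if not isinstance(images, list) or len(images) != 3:
        problems.append("input_images must be a list of exactly 3 paths")
    elif not all(isinstance(p, str) and p for p in images):
        problems.append("every input image path must be a non-empty string")

    instruction = sample["instruction"]
    if not isinstance(instruction, str) or not instruction.strip():
        problems.append("instruction must be a non-empty string")

    output = sample["output_image"]
    if not isinstance(output, str) or not output:
        problems.append("output_image must be a non-empty string")
    return problems
```

Returning a list of messages rather than raising lets a caller scan the whole file and report every malformed record at once.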
### Composition Pattern
The dataset follows a consistent 3-element composition pattern:
```
[Subject from Image1] + [Element from Image2] + [Element from Image3] → [Fused Output]
```
Example instructions:
- "A woman from Image1 is standing in front of a camera, holding a wine glass from Image2, with a serene sunset sky from Image3 in the background."
- "A woman in a navy polka dot dress from Image1 is elegantly playing the violin from Image2 while sitting on the plush purple sofa from Image3, creating a sophisticated and artistic scene."
- "A young woman with curly hair from Image1 is elegantly holding a golden handbag from Image2 and standing next to a metal bucket filled with glowing wires from Image3, creating a futuristic and stylish scene."
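
Because the example instructions name their sources literally as `Image1`, `Image2`, and `Image3`, a regular expression can check which inputs an instruction mentions. This sketch assumes that naming convention holds; the source does not state that every instruction follows it, and the function names are illustrative.

```python
import re

# Matches the literal tokens Image1, Image2, Image3 as whole words
IMAGE_REF = re.compile(r"\bImage([123])\b")


def referenced_images(instruction):
    """Return the sorted image indices (1-3) mentioned in an instruction."""
    return sorted({int(n) for n in IMAGE_REF.findall(instruction)})


def references_all_inputs(instruction):
    """True when the instruction mentions all three input images."""
    return referenced_images(instruction) == [1, 2, 3]
```

Samples for which `references_all_inputs` returns `False` may be worth inspecting by hand, since they deviate from the three-element pattern shown above.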
## 🚀 Usage
### Loading the Dataset
#### Using Hugging Face Datasets
```python
from datasets import load_dataset
# Load the dataset from Hugging Face
dataset = load_dataset("Skywork/unipic_nano_3images", split="train")
# Access a sample
sample = dataset[0]
print(f"Input images: {sample['input_images']}") # 3 images
print(f"Instruction: {sample['instruction']}")
print(f"Output image: {sample['output_image']}")
```
#### Direct JSON Loading
```python
import json

# Load from local JSONL file (one JSON object per line)
samples = []
with open("unipic_nano_3images.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        samples.append(json.loads(line))

print(f"Total samples: {len(samples)}")  # 35,394
```
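
After extracting the archive, it can help to confirm that every path in the JSONL file resolves to a file on disk. The sketch below assumes the paths are relative to a single image root directory, which the source does not spell out; `find_missing_files` is a hypothetical helper.

```python
from pathlib import Path


def find_missing_files(samples, image_root):
    """Return (sample_index, relative_path) pairs for files not found on disk."""
    root = Path(image_root)
    missing = []
    for index, sample in enumerate(samples):
        # Check the three inputs and the output image of each sample
        for rel_path in [*sample["input_images"], sample["output_image"]]:
            if not (root / rel_path).is_file():
                missing.append((index, rel_path))
    return missing
```

An empty result means all referenced images are present under the chosen root.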
#### Using PyTorch DataLoader
```python
import json
import os

from PIL import Image
from torch.utils.data import DataLoader, Dataset


class UniPicNano3ImagesDataset(Dataset):
    """Yields three input images, the instruction, and the target image."""

    def __init__(self, jsonl_path, image_root):
        self.image_root = image_root
        with open(jsonl_path, "r", encoding="utf-8") as f:
            # Skip blank lines so a trailing newline does not break parsing
            self.samples = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.samples)

    def _load(self, rel_path):
        # Convert to RGB so every image has the same number of channels
        return Image.open(os.path.join(self.image_root, rel_path)).convert("RGB")

    def __getitem__(self, idx):
        sample = self.samples[idx]
        return {
            "input_images": [self._load(p) for p in sample["input_images"]],
            "instruction": sample["instruction"],
            "output_image": self._load(sample["output_image"]),
        }


def collate_samples(batch):
    # The default collate function cannot batch PIL images,
    # so return the batch as a plain list of sample dicts.
    return batch


dataset = UniPicNano3ImagesDataset("unipic_nano_3images.jsonl", "images/")
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collate_samples)
```
### Filtering by Action Combination
```python
import json


def categorize_composition(instruction):
    """Categorize a sample by the action keywords in its instruction.

    Checks run in order and the first match wins; anything else is 'Other'.
    """
    instruction = instruction.lower()
    has_wearing = 'wearing' in instruction
    has_holding = 'holding' in instruction
    has_sitting = 'sitting' in instruction
    has_standing = 'standing' in instruction
    has_playing = 'playing' in instruction
    if has_wearing and has_holding:
        return 'Wearing + Holding'
    elif has_holding and has_sitting:
        return 'Holding + Sitting'
    elif has_holding and has_standing:
        return 'Holding + Standing'
    elif has_wearing and has_sitting:
        return 'Wearing + Sitting'
    elif has_playing and has_sitting:
        return 'Playing + Sitting'
    return 'Other'


# Filter samples by action combination
with open("unipic_nano_3images.jsonl", "r", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

wearing_holding = [
    s for s in samples
    if categorize_composition(s['instruction']) == 'Wearing + Holding'
]
print(f"Wearing + Holding samples: {len(wearing_holding)}")  # ~6,380
```
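
The first-match categorizer above assigns each sample a single label. To see the whole distribution at once, one option is to record every action keyword an instruction contains and count the resulting combinations. This is a sketch under assumptions: the keyword list is taken from the action table, matching is by whole word, and the names `ACTIONS`, `action_signature`, and `tally_actions` are illustrative. Its counts are not guaranteed to reproduce the table.

```python
import re
from collections import Counter

# Action keywords drawn from the Action Combination Distribution table
ACTIONS = ("wearing", "holding", "standing", "sitting", "playing", "driving", "reading")
_WORD = re.compile(r"[a-z]+")


def action_signature(instruction):
    """Return the action keywords present in an instruction, in ACTIONS order."""
    words = set(_WORD.findall(instruction.lower()))
    return tuple(action for action in ACTIONS if action in words)


def tally_actions(samples):
    """Count how many samples share each combination of action keywords."""
    return Counter(action_signature(s["instruction"]) for s in samples)
```

Calling `tally_actions(samples).most_common(10)` on the loaded samples lists the ten most frequent keyword combinations.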
## 🔬 Task Categories
### 1. Multi-Wearable Compositions (27.1%)
Person from Image1 wearing/using items from Image2 and Image3:
- **Wearable + Wearable**: Two different accessories or clothing items
- **Wearable + Object**: Wearing item while holding/near another object
- **Wearable + Furniture**: Wearing item while on furniture
### 2. Object Interaction Compositions (36.4%)
Person from Image1 interacting with objects from Image2 and Image3:
- **Object + Object**: Holding or interacting with two different objects
- **Object + Furniture**: Holding object while on furniture
- **Object + Vehicle**: Holding object in/on vehicle
- **Object + Scene**: Holding object in a specific scene/background
### 3. Activity + Element Compositions (24.3%)
Person from Image1 performing activities with elements from Image2 and Image3:
- **Playing Instrument + Sitting**: Playing music while seated
- **Reading + Sitting**: Reading while on furniture
- **Holding + Driving**: Holding items while operating vehicle
### 4. Complex Scene Compositions (12.2%)
Person from Image1 in complex scenes with multiple elements:
- **Furniture + Furniture**: Person with multiple furniture pieces
- **Instrument + Instrument**: Person with multiple instruments
- **Vehicle + Scene**: Person in vehicle with background scene
## 🎓 Applications
This dataset is designed for training and evaluating:
- **Advanced Multi-Image Composition Models**: Learn to combine 3+ images seamlessly
- **Complex Scene Understanding**: Models that understand spatial relationships between multiple elements
- **Instruction-Following Vision Models**: Models that follow complex, multi-part composition instructions
- **Multi-Element Fusion**: Sophisticated blending of person + multiple objects/scenes
## 🔗 Related Work
This dataset is part of the **UniPic** dataset series:
- **UniPic3**: A unified multi-image composition framework. For more details, see [Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling](https://arxiv.org/abs/2601.15664)
- **UniPic-Nano-2Images**: The 2-image version of this dataset with 41,812 samples
## 📝 Citation
If you use this dataset in your research, please cite:
```bibtex
@article{wei2026skywork,
title={Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling},
author={Wei, Hongyang and Liu, Hongbo and Wang, Zidong and Peng, Yi and Xu, Baixin and Wu, Size and Zhang, Xuying and He, Xianglong and Liu, Zexiang and Wang, Peiyu and others},
journal={arXiv preprint arXiv:2601.15664},
year={2026}
}
```
## 📄 License
Please refer to the license terms on the [Hugging Face dataset page](https://huggingface.co/datasets/Skywork/unipic_nano_3images).
Provider: maas
Created: 2026-01-31