Skywork/unipic_seedream_6images
Source: Hugging Face · Updated: 2026-02-10 · Indexed: 2026-05-10
Download link:
https://hf-mirror.com/datasets/Skywork/unipic_seedream_6images
Resource description:
---
license: apache-2.0
task_categories:
- image-to-image
- text-to-image
language:
- en
tags:
- image-composition
- multi-image
- image-fusion
- image-editing
- unipic
- 6-image-input
pretty_name: UniPic Nano 6Images
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.jsonl"
---
# UniPic-Nano-6Images: A Complex Multi-Image Composition Dataset
## ⚡ Quick Start
The image archive is split into multiple parts for easier downloading. To reconstruct and extract:
```bash
# Step 1: Concatenate split files into a single zip
cat nano-banana.part_* > nano-banana-6images.zip
# Step 2: Extract the images
unzip nano-banana-6images.zip
```
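On systems without `cat` and `unzip` (for example, a stock Windows shell), the same two steps can be sketched in Python. The part-file pattern and archive name come from the commands above; the function name `reassemble_and_extract` and the choice of output directory are illustrative assumptions, and the sketch assumes the part suffixes sort lexicographically in the intended order.

```python
import glob
import shutil
import zipfile


def reassemble_and_extract(part_pattern, zip_path, out_dir):
    """Concatenate split parts in sorted order, then unzip the result."""
    parts = sorted(glob.glob(part_pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {part_pattern!r}")
    # Step 1: stream every part into a single zip file
    with open(zip_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)
    # Step 2: extract the images
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return parts


# Example call (file names taken from the shell commands above):
# reassemble_and_extract("nano-banana.part_*", "nano-banana-6images.zip", ".")
```

Streaming with `shutil.copyfileobj` avoids loading a whole part into memory, which matters for multi-gigabyte archives.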
## 📖 Overview
**UniPic-Nano-6Images** is a high-quality multi-image composition dataset of **41,508** samples for training image fusion and composition models. Each sample consists of **6 input images** and **1 output image**, in which elements from all six inputs are combined according to a natural language instruction. The dataset is part of the **UniPic** series and was used in **UniPic3** to train multi-image composition models that fuse many elements at once.
## 🎯 Key Features
- **6-Image Input**: Each sample uses exactly 6 input images for highly complex composition
- **Multi-Element Fusion**: Combines one person with five other objects or elements
- **Diverse Composition Patterns**: Covers extensive composition scenarios with multiple simultaneous actions
- **High Quality**: 41,508 carefully curated samples with detailed natural language instructions
- **Production Ready**: Used in UniPic3 for real-world multi-image composition applications
- **Simple Format**: Clean JSON format with straightforward input/output structure
## 📊 Dataset Statistics
### Action Distribution
| Action | Count | Percentage | Description |
|--------|-------|------------|-------------|
| **Holding** | 36,435 | 87.8% | Person holding objects |
| **Wearing** | 28,803 | 69.4% | Person wearing accessories/clothing |
| **Standing** | 26,558 | 64.0% | Person standing in scene |
| **Sitting** | 11,104 | 26.8% | Person sitting on furniture |
| **Playing** | 6,759 | 16.3% | Person playing instruments |
| **Resting** | 4,012 | 9.7% | Object resting in scene |
| **Leaning** | 3,359 | 8.1% | Person/object leaning |
| **Using** | 3,337 | 8.0% | Person using devices |
| **Carrying** | 2,446 | 5.9% | Person carrying items |
| **Other Actions** | ~5,000 | 12.1% | Cleaning, cooking, driving, riding, etc. |
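The percentages in this table appear to be each count divided by the 41,508 total samples. Because one instruction can contain several actions, the shares overlap and add up to more than 100%. A minimal sketch that reproduces the first three rows (the helper name `share` is hypothetical):

```python
TOTAL_SAMPLES = 41_508  # dataset size stated in the Overview

# Counts copied from the Action Distribution table
action_counts = {"Holding": 36_435, "Wearing": 28_803, "Standing": 26_558}


def share(count, total=TOTAL_SAMPLES):
    """Percentage of all samples, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {action: share(n) for action, n in action_counts.items()}
print(shares)  # {'Holding': 87.8, 'Wearing': 69.4, 'Standing': 64.0}
```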
### Action Combination Distribution
| Action Combination | Count | Percentage | Description |
|--------------------|-------|------------|-------------|
| **Holding + Standing + Wearing** | 15,609 | 37.6% | Person standing, wearing items, and holding objects |
| **Holding + Sitting + Wearing** | 5,560 | 13.4% | Person sitting, wearing items, and holding objects |
| **Holding + Standing** | 4,477 | 10.8% | Person standing and holding objects |
| **Holding + Wearing** | 2,378 | 5.7% | Person wearing and holding items |
| **Holding + Sitting** | 1,858 | 4.5% | Person sitting and holding objects |
| **Holding + Playing + Standing** | 1,022 | 2.5% | Person standing, playing instrument, holding objects |
| **Standing + Wearing** | 999 | 2.4% | Person standing and wearing items |
| **Holding + Sitting + Standing** | 930 | 2.2% | Complex pose combinations |
| **Sitting + Wearing** | 922 | 2.2% | Person sitting and wearing items |
| **Holding + Playing + Wearing** | 831 | 2.0% | Person wearing, playing, and holding |
| **Other Combinations** | ~4,922 | 11.9% | Various other action combinations |
### Element Type Distribution
| Element Type | Count | Percentage | Description |
|--------------|-------|------------|-------------|
| **Objects** | 39,043 | 94.1% | Handheld items (cups, bottles, books, cameras, etc.) |
| **Wearables** | 30,686 | 73.9% | Accessories (glasses, hats, watches, jewelry, etc.) |
| **Furniture** | 16,087 | 38.8% | Seating and surfaces (sofas, chairs, beds, etc.) |
| **Vehicles** | 12,414 | 29.9% | Transportation (cars, motorcycles, bicycles, etc.) |
| **Appliances** | 11,247 | 27.1% | Home devices (refrigerators, lamps, TVs, etc.) |
| **Instruments** | 10,217 | 24.6% | Musical instruments (piano, guitar, drums, etc.) |
| **Scenes/Backgrounds** | 6,762 | 16.3% | Environmental elements (trees, buildings, etc.) |
### Top Object Categories
| Object Category | Count | Object Category | Count |
|-----------------|-------|-----------------|-------|
| Plate | 2,952 | Surfboard | 2,195 |
| Cup | 2,823 | Chair | 2,190 |
| Wine Glass | 2,800 | Drum | 2,175 |
| Kettle | 2,654 | Couch | 2,125 |
| Pot | 2,637 | Saxophone | 2,091 |
| Bottle | 2,623 | Towel | 2,022 |
| Canned | 2,610 | Tea Pot | 2,001 |
| Bucket | 2,593 | Candle | 1,995 |
| Bowl | 2,569 | Handbag | 1,989 |
| Guitar | 2,517 | Baseball Bat | 1,953 |
| Tennis Racket | 2,458 | Stool | 1,912 |
| Vase | 2,426 | Flute | 1,867 |
| Piano | 2,347 | Bed | 1,850 |
| Fishing Rod | 2,336 | Backpack | 1,840 |
| Golf Club | 2,301 | | |
| Skateboard | 2,284 | | |
## 📁 Dataset Structure
### Data Format
Each sample in the dataset is a JSON object with the following structure:
```json
{
  "input_images": [
    "path/to/0.png",
    "path/to/1.png",
    "path/to/2.png",
    "path/to/3.png",
    "path/to/4.png",
    "path/to/5.png"
  ],
  "instruction": "A man from Image1 is sitting on a metallic sofa from Image6, holding a glass of red wine from Image2, with a pink teapot and cup from Image3 beside him on a table, a golden frying pan from Image4 nearby, and an acoustic guitar from Image5 resting against the sofa, creating a relaxed and sophisticated atmosphere.",
  "output_image": "path/to/fusion_result.png",
  "id": 6
}
```
### Field Descriptions
- **`input_images`**: List of exactly 6 input image paths
  - `Image1`: Contains the main subject (person/people)
  - `Image2-6`: Contain the objects/elements to be composed (objects, accessories, furniture, instruments, vehicles, scenes, etc.)
- **`instruction`**: Natural language description of how to combine all six images, typically made up of:
  - A subject description from Image1
  - Multiple actions involving elements from Image2-6
  - A scene/atmosphere description
- **`output_image`**: Path to the composed output image
- **`id`**: Unique identifier for the sample
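Before training, it can help to check that every record matches the fields described above. The following is a minimal sketch, not part of the dataset tooling; the function name `validate_sample` and the exact rules (non-empty strings, exactly six paths) are assumptions drawn from the field descriptions.

```python
def validate_sample(sample):
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    images = sample.get("input_images")
    if not isinstance(images, list) or len(images) != 6:
        problems.append("input_images must be a list of exactly 6 paths")
    elif not all(isinstance(p, str) and p for p in images):
        problems.append("every input image path must be a non-empty string")
    for key in ("instruction", "output_image"):
        value = sample.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key} must be a non-empty string")
    if "id" not in sample:
        problems.append("id is missing")
    return problems
```

Returning a list of messages rather than raising on the first error makes it easy to log every issue in a large JSONL file in one pass.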
### Composition Pattern
The dataset follows a consistent 6-element composition pattern:
```
[Subject from Image1] + [Elements from Image2-6] → [Fused Output]
```
Example instructions:
- "A man from Image1 is sitting on a metallic sofa from Image6, holding a glass of red wine from Image2, with a pink teapot and cup from Image3 beside him on a table, a golden frying pan from Image4 nearby, and an acoustic guitar from Image5 resting against the sofa."
- "A woman from Image1 is elegantly wearing the black wireless earbuds from Image2, sitting in front of the open laptop from Image3, holding the yellow lotion bottle from Image4, with the metallic bottle from Image5 beside her, and the purple parking meter from Image6 in the background."
- "A woman with curly hair from Image1 is confidently wearing the sleek black motorcycle helmet from Image2, holding a colorful fishing rod from Image3, wearing a light blue surgical mask from Image4, standing next to an intricately designed silver teapot from Image5, and sporting a stylish brown suede high-top boot from Image6."
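Instructions refer to inputs as `Image1` through `Image6`, while `input_images` is a zero-indexed list. The sketch below assumes that `ImageN` corresponds to `input_images[N - 1]`, which matches the example record (`0.png` to `5.png`) but is not stated explicitly by the dataset; the helper name `referenced_images` is hypothetical.

```python
import re

# Matches Image1 ... Image6 as whole tokens (so "Image12" is ignored)
IMAGE_REF = re.compile(r"\bImage([1-6])\b")


def referenced_images(instruction, input_images):
    """Map each ImageN mentioned in the instruction to its path, in order of first mention."""
    mapping = {}
    for match in IMAGE_REF.finditer(instruction):
        n = int(match.group(1))
        # Assumption: ImageN refers to list position N - 1
        mapping.setdefault(f"Image{n}", input_images[n - 1])
    return mapping
```

A result with fewer than six keys would indicate an instruction that does not mention every input, which may be worth flagging during data cleaning.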
## 🚀 Usage
### Loading the Dataset
#### Using Hugging Face Datasets
```python
from datasets import load_dataset
# Load the dataset from Hugging Face
dataset = load_dataset("Skywork/unipic_seedream_6images", split="train")
# Access a sample
sample = dataset[0]
print(f"Input images: {sample['input_images']}") # 6 images
print(f"Instruction: {sample['instruction']}")
print(f"Output image: {sample['output_image']}")
```
#### Direct JSON Loading
```python
import json

# Load from local JSONL file
samples = []
with open("seedream_6imgs_all.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line.strip())
        samples.append(sample)

print(f"Total samples: {len(samples)}")  # 41,508
```
#### Using PyTorch DataLoader
```python
import json
import os

from PIL import Image
from torch.utils.data import Dataset, DataLoader


class UniPicNano6ImagesDataset(Dataset):
    def __init__(self, jsonl_path, image_root):
        self.samples = []
        with open(jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                self.samples.append(json.loads(line.strip()))
        self.image_root = image_root

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Load all 6 input images
        input_imgs = [
            Image.open(os.path.join(self.image_root, sample["input_images"][i]))
            for i in range(6)
        ]
        # Load output image
        output = Image.open(os.path.join(self.image_root, sample["output_image"]))
        return {
            "input_images": input_imgs,
            "instruction": sample["instruction"],
            "output_image": output,
            "id": sample["id"],
        }


def collate_samples(batch):
    # The default collate function cannot batch PIL images,
    # so return the batch as a plain list of sample dicts.
    return batch


dataset = UniPicNano6ImagesDataset("seedream_6imgs_all.jsonl", "images/")
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_samples)
```
### Filtering by Composition Pattern
```python
import json


def categorize_composition(instruction):
    """Categorize a sample based on its composition pattern."""
    instruction = instruction.lower()
    has_wearing = 'wearing' in instruction
    has_holding = 'holding' in instruction
    has_sitting = 'sitting' in instruction
    has_standing = 'standing' in instruction
    has_playing = 'playing' in instruction

    actions = []
    if has_wearing: actions.append('Wearing')
    if has_holding: actions.append('Holding')
    if has_sitting: actions.append('Sitting')
    if has_standing: actions.append('Standing')
    if has_playing: actions.append('Playing')
    return ' + '.join(actions) if actions else 'Other'


# Filter samples by composition pattern
with open("seedream_6imgs_all.jsonl", "r") as f:
    samples = [json.loads(line) for line in f]

standing_wearing_holding = [
    s for s in samples
    if categorize_composition(s['instruction']) == 'Wearing + Holding + Standing'
]
print(f"Standing + Wearing + Holding samples: {len(standing_wearing_holding)}")  # ~15,609
```
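To tally every combination at once rather than filtering for a single one, a `Counter` over the same keyword matching can be used. This is a sketch under assumptions: the keyword list and the alphabetical ordering (chosen to match labels such as "Holding + Standing + Wearing" in the statistics tables) are illustrative, and the names `action_combo` and `combo_counts` are hypothetical. Simple substring matching may not reproduce the published counts exactly.

```python
from collections import Counter

# Keywords to look for; extend this tuple to track more actions
ACTIONS = ("holding", "playing", "sitting", "standing", "wearing")


def action_combo(instruction):
    """Join the matched action keywords alphabetically, e.g. 'Holding + Standing + Wearing'."""
    text = instruction.lower()
    found = sorted(a.capitalize() for a in ACTIONS if a in text)
    return " + ".join(found) if found else "Other"


def combo_counts(instructions):
    """Count how often each action combination occurs."""
    return Counter(action_combo(text) for text in instructions)
```

Calling `combo_counts(s["instruction"] for s in samples).most_common(10)` on the loaded records would list the ten most frequent combinations.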
## 🔬 Task Categories
### 1. Multi-Object Compositions (94.1%)
Person from Image1 interacting with multiple objects from Image2-6:
- **Multi-Handheld**: Holding multiple objects simultaneously
- **Object + Scene**: Objects placed in specific scenes/backgrounds
- **Object + Furniture**: Objects positioned on/near furniture
### 2. Wearable + Object Compositions (73.9%)
Person from Image1 wearing items and holding objects:
- **Accessory + Handheld**: Wearing glasses/watches while holding items
- **Clothing + Handheld**: Wearing specific clothing while carrying objects
- **Full Ensemble**: Complete outfit with multiple accessories and held items
### 3. Furniture + Activity Compositions (38.8%)
Person from Image1 on furniture with various activities:
- **Sitting + Playing**: Sitting on furniture while playing instruments
- **Sitting + Holding**: Seated with multiple held objects
- **Standing Near**: Standing near furniture with objects
### 4. Complex Multi-Element Compositions (29.9%+)
Person from Image1 in complex scenes with vehicles, appliances, instruments:
- **Vehicle + Objects**: Person in/on vehicle with multiple objects
- **Instrument + Accessories**: Playing instrument while wearing items
- **Appliance + Scene**: Using appliances in specific settings
## 🎓 Applications
This dataset is designed for training and evaluating:
- **Advanced Multi-Image Composition Models**: Learn to combine six input images seamlessly
- **Complex Scene Understanding**: Models that understand spatial relationships between many elements
- **Instruction-Following Vision Models**: Models that follow highly complex, multi-part composition instructions
- **Multi-Element Fusion**: Sophisticated blending of person + multiple objects/scenes/accessories
## 🔗 Related Work
This dataset is part of the **UniPic** dataset series:
- **UniPic3**: A unified multi-image composition framework. For more details, see [Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling](https://arxiv.org/abs/2601.15664)
- **UniPic-Nano-2Images**: The 2-image version with simpler compositions
- **UniPic-Nano-3Images**: The 3-image version with moderate complexity
## 📝 Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{wei2026skyworkunipic30unified,
title={Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling},
author={Hongyang Wei and Hongbo Liu and Zidong Wang and Yi Peng and Baixin Xu and Size Wu and Xuying Zhang and Xianglong He and Zexiang Liu and Peiyu Wang and Xuchen Song and Yangguang Li and Yang Liu and Yahui Zhou},
year={2026},
eprint={2601.15664},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.15664},
}
```
## 📄 License
The dataset card metadata declares the license as Apache-2.0. For the full terms, refer to the [Hugging Face dataset page](https://huggingface.co/datasets/Skywork/unipic_seedream_6images).
---
Provided by:
Skywork



