InternVL-U/ScaleEdit-12M
Source: Hugging Face · Updated 2026-04-07 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/InternVL-U/ScaleEdit-12M
Resource description:
---
license: mit
task_categories:
- image-to-image
language:
- en
tags:
- image-editing
- instruction-based-editing
- multimodal
- computer-vision
- scaleedit
- internvl
size_categories:
- 10M<n<100M
---
# ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
<div>
[arXiv](https://arxiv.org/abs/2603.20644)
[GitHub](https://github.com/gzchen4ai/ScaleEdit-12M)
[Hugging Face](https://huggingface.co/datasets/InternVL-U/ScaleEdit-12M)
</div>
## 📌 Overview
**The largest open-source instruction-based image editing dataset to date.**
ScaleEdit-12M contains **12.4 million** rigorously verified instruction–image pairs spanning **23 task families** across diverse real and synthetic visual domains. It was constructed using **ScaleEditor**, a fully open-source hierarchical multi-agent framework that eliminates the need for costly proprietary APIs.

## 🔥 News
- **[2026/04/03]** 🚀 ScaleEdit-12M is released on [[Hugging Face]](https://huggingface.co/datasets/InternVL-U/ScaleEdit-12M).
- **[2026/03/24]** 🔥 The ScaleEdit-12M paper is released on [[arXiv]](https://arxiv.org/abs/2603.20644).
- **[2026/03/06]** 🔥 The InternVL-U **technical report** is released. Check it out on [[arXiv]](https://arxiv.org/abs/2603.09877).
## ✅ TODO
- [x] Release ScaleEdit-12M dataset
- [ ] Release ScaleEdit-1M subset
- [ ] Release ScaleEditor framework
## 📊 Dataset Structure
### Repository Layout
The dataset is organized into **23 task-specific subdirectories**, each containing multiple sharded Parquet files. The directory naming follows the pattern `{category_id}_{task_name}`:
```
ScaleEdit-12M/
├── README.md
├── 1.1_style_transfer/ # Global editing tasks
│ ├── style_transfer_0000.parquet # (~31.7 GB per shard)
│ ├── style_transfer_0001.parquet
│ ├── ...
│ └── style_transfer_0015.parquet
├── 1.2_tone_adjustment/
│ └── tone_adjustment_XXXX.parquet
├── 1.3_viewpoint_transformation/
├── 1.4_background_replacement/
├── 2.1_object_addition/ # Object editing tasks
├── 2.2_object_removal/
├── 2.3_object_replacement/
├── 2.4_action_editing/
├── 2.5_part_extraction/
├── 3.1_color_change/ # Attribute editing tasks
├── 3.2_material_change/
├── 3.3_visual_beautification/
├── 3.4_count_change/
├── 3.5_size_change/
├── 4.1_movie_poster_text_editing/ # Text editing tasks
├── 4.2_gui_interface_text_editing/
├── 4.3_object_surface_text_editing/
├── 4.4_building_surface_text_editing/
├── 5.1_perceptual_reasoning/ # Knowledge-infused tasks
├── 5.2_symbolic_reasoning/
├── 5.3_social_reasoning/
├── 5.4_scientific_reasoning/
└── 6.1_compositional_editing/ # Compositional tasks
```
Each task folder contains **multiple Parquet shards** (typically ~31–32 GB each) named `{task_name}_{shard_index:04d}.parquet`. The number of shards varies by task depending on the volume of data in that category.
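The two naming patterns can be turned into small path helpers. This is a minimal sketch, not part of the dataset tooling: `parse_task_dir` and `shard_path` are hypothetical names, and the regular expression assumes category IDs of the form `digits.digits` and lowercase task names, as in the tree above. Only the `1.1_style_transfer` shard range (0000 through 0015) is shown in the layout; shard counts for other tasks are not stated here and should be discovered by listing the repository.

```python
import re
from typing import Tuple

# Task folders follow "{category_id}_{task_name}", e.g. "1.1_style_transfer".
_TASK_DIR = re.compile(r"^(?P<category_id>\d+\.\d+)_(?P<task_name>[a-z_]+)$")


def parse_task_dir(dirname: str) -> Tuple[str, str]:
    """Split a task folder name into (category_id, task_name)."""
    match = _TASK_DIR.match(dirname)
    if match is None:
        raise ValueError(f"not a task directory: {dirname!r}")
    return match["category_id"], match["task_name"]


def shard_path(dirname: str, shard_index: int) -> str:
    """Build the relative path "{dir}/{task_name}_{shard_index:04d}.parquet"."""
    _, task_name = parse_task_dir(dirname)
    return f"{dirname}/{task_name}_{shard_index:04d}.parquet"


# The 16 style-transfer shards listed in the layout (0000 through 0015).
style_transfer_shards = [shard_path("1.1_style_transfer", i) for i in range(16)]
```

Passing such relative paths to a download tool lets you fetch a single task family instead of the full dataset.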
### Parquet Schema
Each Parquet file contains the following columns:
| Column | Type | Description |
|---|---|---|
| `id` | `int64` | Unique identifier for the sample |
| `edit_task` | `string` | Task category name (e.g., `"style_transfer"`, `"object_addition"`) |
| `edit_instruction` | `string` | Natural-language editing instruction |
| `source_image` | `binary` | Raw bytes of the source image (pre-edit) |
| `edited_image` | `binary` | Raw bytes of the edited image (post-edit) |
| `source_image_width` | `int64` | Width of the source image in pixels |
| `source_image_height` | `int64` | Height of the source image in pixels |
| `edited_image_width` | `int64` | Width of the edited image in pixels |
| `edited_image_height` | `int64` | Height of the edited image in pixels |
| `instruction_following_score` | `int64` | Quality score: how well the edit follows the instruction (1–3) |
| `editing_consistency_score` | `int64` | Quality score: consistency between source and edited images (1–3) |
| `generation_quality_score` | `int64` | Quality score: overall visual quality of the edited image (1–3) |
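
For downstream code it can help to wrap a row in a typed record and check it against the schema. The sketch below is illustrative only: `EditSample` and `sample_from_row` are hypothetical names, and the input is assumed to be a mapping keyed by the column names in the table above.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple

SCORE_COLUMNS = (
    "instruction_following_score",
    "editing_consistency_score",
    "generation_quality_score",
)


@dataclass(frozen=True)
class EditSample:
    """Hypothetical typed view of one row; not part of any released tooling."""
    id: int
    edit_task: str
    edit_instruction: str
    source_image: bytes
    edited_image: bytes
    source_size: Tuple[int, int]   # (width, height) in pixels
    edited_size: Tuple[int, int]   # (width, height) in pixels
    scores: Tuple[int, int, int]   # (IF, EC, GQ), each documented as 1-3


def sample_from_row(row: Mapping) -> EditSample:
    """Build an EditSample, rejecting scores outside the documented 1-3 range."""
    scores = tuple(int(row[name]) for name in SCORE_COLUMNS)
    if any(not 1 <= score <= 3 for score in scores):
        raise ValueError(f"score outside 1-3: {scores}")
    return EditSample(
        id=int(row["id"]),
        edit_task=row["edit_task"],
        edit_instruction=row["edit_instruction"],
        source_image=bytes(row["source_image"]),
        edited_image=bytes(row["edited_image"]),
        source_size=(int(row["source_image_width"]), int(row["source_image_height"])),
        edited_size=(int(row["edited_image_width"]), int(row["edited_image_height"])),
        scores=scores,
    )
```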
### Example Row
```json
{
  "id": 0,
  "edit_task": "object_addition",
  "edit_instruction": "Add a red and white striped safety barrier at the edge of the platform on the right side of the image.",
  "source_image": <binary bytes>,
  "edited_image": <binary bytes>,
  "source_image_width": 2000,
  "source_image_height": 1500,
  "edited_image_width": 2000,
  "edited_image_height": 1500,
  "instruction_following_score": 3,
  "editing_consistency_score": 3,
  "generation_quality_score": 3
}
```
The `source_image` and `edited_image` columns store images as raw binary bytes. They can be decoded into PIL images:
```python
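import io
from PIL import Image

# `row` is one record from a shard: a mapping from column name to value.
source = Image.open(io.BytesIO(row["source_image"]))
edited = Image.open(io.BytesIO(row["edited_image"]))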
```
### Quality Scores
Every sample has been scored through ScaleEditor's **task-aware quality verification mechanism** across three dimensions, each rated on a 1–3 scale:
- **Instruction Following (IF, 1–3):** Does the edited image accurately reflect the intent of the instruction?
- **Editing Consistency (EC, 1–3):** Are unedited regions preserved? Is the edit spatially coherent with the source?
- **Generation Quality (GQ, 1–3):** Is the output image free of artifacts, distortions, and visual defects?
Only samples with IF = 3, EC ≥ 2, and GQ ≥ 2 are retained in ScaleEdit-12M.
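The retention rule is simple enough to state as a predicate. The sketch below is a restatement for illustration, not the ScaleEditor implementation: `is_retained` and `keep_row` are hypothetical names. Since released rows already satisfy the rule, the per-row scores are mainly useful for carving out stricter subsets, for example by requiring EC = 3.

```python
from itertools import product


def is_retained(if_score: int, ec_score: int, gq_score: int) -> bool:
    """Retention rule stated above: IF = 3, EC >= 2, GQ >= 2."""
    return if_score == 3 and ec_score >= 2 and gq_score >= 2


def keep_row(row: dict, min_ec: int = 2, min_gq: int = 2) -> bool:
    """Filter a row by its score columns; raise the minimums for a stricter subset."""
    return (
        row["instruction_following_score"] == 3
        and row["editing_consistency_score"] >= min_ec
        and row["generation_quality_score"] >= min_gq
    )


# Of the 27 possible (IF, EC, GQ) triples on a 1-3 scale, 4 pass the rule.
passing = [scores for scores in product((1, 2, 3), repeat=3) if is_retained(*scores)]
```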
## 🛠️ Highlights
ScaleEdit-12M was constructed using the **ScaleEditor** framework, which consists of three stages:
1. **Source Image Expansion** — Curates and expands source images from diverse real and synthetic domains, infusing world knowledge to enable knowledge-grounded editing tasks.
2. **Adaptive Multi-Agent Editing** — An ensemble of specialized agents generates editing instructions and corresponding edited images, adapting strategies per task family.
3. **Task-Aware Quality Verification** — A multi-dimensional scoring system evaluates instruction following, editing consistency, and generation quality, filtering out low-quality samples.

Fine-tuning leading foundation models on ScaleEdit-12M yields consistent improvements:
- **Up to +10.4%** on ImgEdit and **+35.1%** on GEdit for general editing benchmarks
- **Up to +150.0%** on RISE and **+26.5%** on KRIS-Bench for knowledge-infused editing benchmarks
These gains were demonstrated on both UniWorld-V1 and Bagel, showing that open-source agentic pipelines can approach commercial-grade data quality.
## 🌟 Citation
```bibtex
@article{chen2026scaleedit,
title={ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework},
author={Chen, Guanzhou and Cui, Erfei and Tian, Changyao and Yang, Danni and Yang, Ganlin and Qiao, Yu and Li, Hongsheng and Luo, Gen and Zhang, Hongjie},
journal={arXiv preprint arXiv:2603.20644},
year={2026}
}
@article{tian2026internvl,
title={InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing},
author={Tian, Changyao and Yang, Danni and Chen, Guanzhou and Cui, Erfei and Wang, Zhaokai and Duan, Yuchen and Yin, Penghao and Chen, Sitao and Yang, Ganlin and Liu, Mingxin and others},
journal={arXiv preprint arXiv:2603.09877},
year={2026}
}
```
Provided by:
InternVL-U
Collected summary
Dataset introduction

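Construction
In computer vision and multimodal learning, high-quality instruction-driven image editing data is a key resource for advancing models. ScaleEdit-12M was built with ScaleEditor, an open-source hierarchical multi-agent framework that generates and filters data systematically in three stages. First, the framework curates and expands source images from diverse real and synthetic visual domains, infusing world knowledge for knowledge-driven editing tasks. Next, an ensemble of specialized agents adaptively generates editing instructions and the corresponding edited images across the 23 task families. Finally, a task-aware quality verification mechanism scores and filters samples along three dimensions (instruction following, editing consistency, and generation quality) and retains only high-quality samples, which secures both the scale and the reliability of the dataset.
Features
As the largest open-source instruction-based image editing dataset to date, ScaleEdit-12M contains more than 12.4 million rigorously verified instruction–image pairs, and its defining characteristic is broad coverage of visual editing tasks. The dataset is organized into 23 task-specific subdirectories covering six broad categories: global editing, object editing, attribute editing, text editing, knowledge-infused editing, and compositional editing. This structured taxonomy gives model training a clear task orientation. Every sample carries three quality scores, which hold edits to a high standard of semantic adherence, spatial consistency, and visual fidelity. The dataset was built entirely with an open-source pipeline, avoiding reliance on costly proprietary APIs and giving the community a reproducible, high-quality, large-scale training resource.
Usage
For researchers and developers, ScaleEdit-12M is distributed as sharded Parquet files for efficient loading and processing. Data is stored in per-task directories, and each Parquet file contains a unique identifier, the edit task type, the natural-language editing instruction, the binary data of the source and edited images, the image dimensions, and the three quality scores. Users can read the Parquet files with standard data-loading libraries and decode the binary image data with tools such as PIL. The dataset is designed for training and evaluating instruction-following image editing models, and is particularly suited to fine-tuning large multimodal foundation models to improve their performance on complex tasks such as general editing and knowledge-infused editing.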
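Background and challenges
Background overview
In AI-driven image editing, instruction-based editing aims to manipulate image content precisely through natural-language instructions, which places very high demands on a model's semantic understanding and generation abilities. ScaleEdit-12M was created and publicly released by the InternVL-U team in 2026 and is currently the largest open-source instruction-based image editing dataset. It contains more than 12.4 million rigorously verified instruction–image pairs covering 23 task families such as style transfer, object editing, and attribute modification. Its core research goal is to provide high-quality, large-scale, and diverse data for training general and controllable image editing models. Because it was built with the fully open-source ScaleEditor multi-agent framework, the dataset reduces the dependence of data generation on proprietary APIs, advances open-source research on complex visual tasks, and has had a marked effect on the performance of multimodal foundation models.
Current challenges
Instruction-based image editing faces several challenges of its own. The core one is ensuring that a model accurately understands open-domain natural-language instructions and carries out complex local or global visual modifications while preserving the overall consistency and realism of the image. ScaleEdit-12M is intended to address these challenges systematically by supplying training data. Building the dataset posed significant difficulties as well. First, generating editing samples at scale that are both high-quality and diverse requires coordinating multiple specialized agents for instruction generation and image synthesis while keeping strategies adaptive across tasks. Second, establishing an efficient and reliable quality verification mechanism to filter out low-quality data requires multi-scale evaluation along three dimensions (instruction following, editing consistency, and generation quality), a process that is computationally expensive and demands carefully designed evaluation criteria.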
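Common scenarios
Classic use cases
In instruction-guided image editing, ScaleEdit-12M is a core resource for training and evaluating multimodal generative models. Its 23 task families, ranging from style transfer and object addition or removal to knowledge-enhanced editing, give researchers a data foundation for building models that accurately understand natural-language instructions and perform complex visual edits. Through large-scale, high-quality instruction–image pairs, the dataset pushes the performance limits of generative models in following user intent and preserving image consistency.
Derived and related work
The release of the dataset has spurred a series of derivative studies on efficient multimodal editing. Its core framework, ScaleEditor, is itself an open-source contribution and has inspired subsequent data-generation paradigms based on agent collaboration. The dataset has also been used to fine-tune frontier foundation models such as UniWorld-V1 and Bagel, yielding significant gains on benchmarks including ImgEdit, GEdit, and RISE, and advancing open-source work toward unified models for understanding, reasoning, and generation (such as InternVL-U).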
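Recent research on the dataset
Latest research directions
In instruction-guided image editing, building large-scale, high-quality datasets is becoming key to improving model generalization and controllability. As the largest open-source instruction-based image editing dataset to date, ScaleEdit-12M focuses its frontier research on using a multi-agent framework to scale data generation while keeping quality under control. The dataset covers 23 editing task types and introduces a task-aware quality verification mechanism, providing abundant samples for complex scenarios such as knowledge-enhanced editing and compositional editing. Related research interest centers on democratizing open-source multimodal models, for example the unified understanding, reasoning, and generation framework advocated by the InternVL-U technical report. These advances have significantly improved model performance on general editing benchmarks such as ImgEdit and GEdit and on knowledge-infused benchmarks such as RISE and KRIS-Bench, marking an important step for open-source agentic pipelines toward commercial-grade data quality and laying a data foundation for scalable, low-cost visual content creation systems.
The content above was collected and summarized by 遇见数据集.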



