Ditto-1M
ModelScope community · updated 2026-05-16 · indexed 2025-11-03
Download link: https://modelscope.cn/datasets/AI-ModelScope/Ditto-1M
# Ditto-1M: A High-Quality Synthetic Dataset for Instruction-Based Video Editing
> **Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset** <br>
> Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
<div align=center>
<img src="./assets/data_teaser.jpg" width=850px>
</div>
**Figure:** Our proposed synthetic data generation pipeline can automatically produce high-quality and highly diverse video editing data, encompassing both global and local editing tasks.
<div align=center>
## 🔗 **Links & Resources**
[**📄 Paper**](https://arxiv.org/abs/2510.15742)
[**🌐 Project Page**](https://ezioby.github.io/Ditto_page/)
[**💻 Github Code**](https://github.com/EzioBy/Ditto)
[**📦 Model Weights**](https://huggingface.co/QingyanBai/Ditto_models/tree/main)
</div>
## Update Log
#### - [√] 10/22/2025 - We have uploaded the CSVs that can be used directly for model training with DiffSynth-Studio, as well as the metadata JSON for the sim2real setting.
#### - [√] 10/22/2025 - We have finished uploading all the videos in the dataset!
## Dataset Overview
Ditto-1M is a comprehensive dataset of one million high-fidelity video editing triplets designed to tackle the fundamental challenge of instruction-based video editing. This dataset was generated using our novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models.
The dataset contains diverse video editing scenarios including:
- **Global style transfer**: Artistic style changes, color grading, and visual effects
- **Global freeform editing**: Complex scene modifications, environment changes, and creative transformations
- **Local editing**: Precise object modifications, attribute changes, and local transformations
## Dataset Structure
The dataset is organized as follows:
```
Ditto-1M/
├── mini_test_videos/ # 30+ video cases for testing
├── videos/ # Main video data
│ ├── source/ # Source videos (original videos)
│ ├── local/ # Local editing results
│ ├── global_style1/ # Global style editing
│ ├── global_style2/ # Global style editing
│ ├── global_freeform1/ # Freeform editing
│ ├── global_freeform2/ # Freeform editing
│ └── global_freeform3/ # Freeform editing (relatively hard)
├── source_video_captions/ # QwenVL generated captions for source videos
├── training_metadata/ # Training metadata including video paths and editing instructions
└── csvs_for_DiffSynth/ # CSVs for model training with DiffSynth-Studio
```
### Data Categories
- **Source Videos (~180 GB)**: Original videos before editing
- **Global Style (~230 + 120 GB)**: Artistic style transformations and color grading
- **Global Freeform (~370 + 430 + 270 GB)**: Complex scene modifications and creative editing
- **Local Editing (~530 GB)**: Precise modifications to specific objects or regions
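As a rough planning aid, the approximate sizes listed above can be summed to estimate how much disk space a given download needs. The sketch below only restates those figures; the dictionary keys and the helper name `estimated_size_gb` are illustrative and not part of the dataset, and the sizes refer to the files as listed, not to any extracted size.

```python
# Approximate sizes in GB, copied from the category list above.
# Keys are illustrative labels, not official folder names.
CATEGORY_SIZES_GB = {
    "source": [180],
    "global_style": [230, 120],
    "global_freeform": [370, 430, 270],
    "local": [530],
}


def estimated_size_gb(categories):
    """Sum the approximate sizes (in GB) of the selected categories."""
    return sum(sum(CATEGORY_SIZES_GB[name]) for name in categories)


# All categories together: about 2130 GB, i.e. roughly 2 TB.
print(estimated_size_gb(CATEGORY_SIZES_GB))    # 2130
# Source videos plus local edits only.
print(estimated_size_gb(["source", "local"]))  # 710
```

Every editing category is paired with the source videos, so any subset estimate should include the `source` entry.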
### Training Metadata
Each metadata JSON file contains triplet items with the following fields:
- `source_path`: Path to the source video
- `instruction`: Editing instruction
- `edited_path`: Path to the corresponding edited video
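A minimal sketch of reading these triplets is shown below. It assumes that each metadata file holds a JSON array of objects carrying the three fields above; the exact file layout is not documented here, so adjust the parsing if the files are structured differently. The names `EditTriplet` and `iter_triplets` are hypothetical helpers introduced only for illustration.

```python
import json
from typing import Iterator, NamedTuple


class EditTriplet(NamedTuple):
    """One training example: source video, instruction, edited video."""
    source_path: str
    instruction: str
    edited_path: str


def iter_triplets(metadata_file) -> Iterator[EditTriplet]:
    """Yield triplets from one metadata file.

    Assumption: the file is a JSON array of objects, each with the keys
    `source_path`, `instruction`, and `edited_path`.
    """
    with open(metadata_file, encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        yield EditTriplet(
            source_path=item["source_path"],
            instruction=item["instruction"],
            edited_path=item["edited_path"],
        )
```

Iterating with `for triplet in iter_triplets(path_to_json):` then gives access to `triplet.source_path`, `triplet.instruction`, and `triplet.edited_path`.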
## Downloading and Extracting the Dataset
### Full Dataset Download
```python
from datasets import load_dataset
# Download the entire dataset
dataset = load_dataset("QingyanBai/Ditto-1M")
```
### Selective Download
Because the videos folder is large (~2TB), you can download only the subsets you need if you plan to train on specific tasks:
```python
from huggingface_hub import snapshot_download

# Download the metadata and source captions
snapshot_download(
    repo_id="QingyanBai/Ditto-1M",
    repo_type="dataset",
    local_dir="./Ditto-1M",
    allow_patterns=["source_video_captions/*", "training_metadata/*"]
)

# Download only the mini test videos
snapshot_download(
    repo_id="QingyanBai/Ditto-1M",
    repo_type="dataset",
    local_dir="./Ditto-1M",
    allow_patterns=["mini_test_videos/*"]
)

# Download the local editing data
snapshot_download(
    repo_id="QingyanBai/Ditto-1M",
    repo_type="dataset",
    local_dir="./Ditto-1M",
    allow_patterns=["videos/source/*", "videos/local/*"]
)

# Download the global editing videos
snapshot_download(
    repo_id="QingyanBai/Ditto-1M",
    repo_type="dataset",
    local_dir="./Ditto-1M",
    allow_patterns=["videos/source/*", "videos/global_style1/*", "videos/global_style2/*", "videos/global_freeform1/*", "videos/global_freeform2/*"]
)

# Download only the style editing videos
snapshot_download(
    repo_id="QingyanBai/Ditto-1M",
    repo_type="dataset",
    local_dir="./Ditto-1M",
    allow_patterns=["videos/source/*", "videos/global_style1/*", "videos/global_style2/*"]
)
```
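The calls above differ only in their `allow_patterns` list, so the list can be built from folder names instead of being written out by hand. The following is a small sketch; `VIDEO_SUBSETS` and `build_allow_patterns` are hypothetical names introduced here, and the folder names are taken from the dataset structure shown earlier.

```python
# Folder names under videos/, as listed in the dataset structure.
VIDEO_SUBSETS = (
    "local",
    "global_style1",
    "global_style2",
    "global_freeform1",
    "global_freeform2",
    "global_freeform3",
)


def build_allow_patterns(subsets, include_source=True):
    """Return allow_patterns for the requested video subsets.

    Source videos are included by default because every edited video
    is paired with its source video.
    """
    unknown = [name for name in subsets if name not in VIDEO_SUBSETS]
    if unknown:
        raise ValueError(f"Unknown subset(s): {unknown}")
    patterns = ["videos/source/*"] if include_source else []
    patterns += [f"videos/{name}/*" for name in subsets]
    return patterns


# Example: the result can be passed to snapshot_download(allow_patterns=...).
print(build_allow_patterns(["global_style1", "global_style2"]))
```

Rejecting unknown names early avoids a download that silently matches nothing because of a typo in a folder name.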
### Extracting the Video Data
On Linux/macOS or Windows (with Git Bash/WSL):
```bash
# Navigate to the directory containing the split files
cd path/to/your/dataset/part
# For example, to extract the global_style1 videos
# ("-f -" makes tar read the archive from standard input):
cat global_style1.tar.gz.* | tar -zxvf -
```
This command concatenates all the split parts and pipes the result directly to tar for extraction. This saves disk space, because no intermediate merged file is created, and time, because you can start previewing videos as soon as they are extracted instead of waiting for the whole archive to be merged first.
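Where a shell with `cat` and `tar` is not available, the same streaming idea can be sketched in Python with the standard `tarfile` module. This is an illustrative alternative, not part of the official tooling: `ChainedReader` and `extract_split_archive` are hypothetical names, and the sketch assumes that sorting the part file names lexicographically gives the correct order, as the shell glob does.

```python
import tarfile
from pathlib import Path


class ChainedReader:
    """Expose several files as one continuous, read-only byte stream."""

    def __init__(self, paths):
        self._paths = iter(paths)
        self._current = None

    def read(self, size=-1):
        chunks = []
        remaining = size
        while size < 0 or remaining > 0:
            if self._current is None:
                next_path = next(self._paths, None)
                if next_path is None:
                    break  # every part has been consumed
                self._current = open(next_path, "rb")
            data = self._current.read(remaining if size >= 0 else -1)
            if not data:
                # Current part exhausted: move on to the next one.
                self.close()
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_split_archive(part_dir, prefix, out_dir="."):
    """Extract `<prefix>.*` parts found in part_dir without merging them on disk."""
    parts = sorted(Path(part_dir).glob(prefix + ".*"))
    if not parts:
        raise FileNotFoundError(f"No parts matching {prefix}.* in {part_dir}")
    reader = ChainedReader(parts)
    try:
        # "r|gz" reads the gzip-compressed tar as a forward-only stream.
        with tarfile.open(fileobj=reader, mode="r|gz") as archive:
            archive.extractall(out_dir)
    finally:
        reader.close()
    return parts
```

For example, `extract_split_archive(".", "global_style1.tar.gz")` would mirror the shell command above when run in the directory that contains the parts.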
## Dataset Statistics
- **Total Examples**: 1,000,000+ video editing triplets
- **Video Resolution**: Landscape or portrait (1280×720 / 720×1280)
- **Video Length**: 101 frames per video
- **Categories**: Global style, Global freeform, Local editing
- **Instructions**: Captions and editing instructions generated by intelligent agents
- **Quality Control**: Processed with the data filtering pipeline and enhanced with the denoising enhancer
## Citation
If you find this dataset useful, please consider citing our paper:
```bibtex
@article{bai2025ditto,
title={Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset},
author={Bai, Qingyan and Wang, Qiuyu and Ouyang, Hao and Yu, Yue and Wang, Hanlin and Wang, Wen and Cheng, Ka Leong and Ma, Shuailei and Zeng, Yanhong and Liu, Zichen and Xu, Yinghao and Shen, Yujun and Chen, Qifeng},
journal={arXiv preprint arXiv:2510.15742},
year={2025}
}
```
Provider: maas
Created: 2025-10-21



