ysicong/Sora100K
Source: Hugging Face (updated 2026-04-09, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/ysicong/Sora100K
Resource description:
---
pretty_name: Sora100K
license: other
language:
- en
tags:
- video
- multimodal
- tabular
- text-to-video
- video-editing
- datasets
configs:
- config_name: text_to_video_generation
  default: true
  data_files:
  - split: train
    path: "Text-to-Video Generation/*.csv"
- config_name: single_turn_video_editing
  data_files:
  - split: train
    path: "Single-Turn Videos Editing/*.csv"
- config_name: multi_turn_video_editing
  data_files:
  - split: train
    path: "Multi-Turn Videos Editing/*.csv"
---
# Sora100K
This page serves as both the **dataset website** and the **supplementary materials website** for the ACM MM 2026 Dataset Track submission.
## Quick Navigation
- [Dataset Overview](#dataset-overview)
- [Key Statistics](#key-statistics)
- [Dataset Structure](#dataset-structure)
- [Data Source](#data-source)
- [Access and License](#access-and-license)
- [How to Obtain the Videos](#how-to-obtain-the-videos)
- [Ethical Considerations, Privacy, and Limitations](#ethical-considerations-privacy-and-limitations)
- [Supplementary Materials](#supplementary-materials)
- [Loading the Dataset](#loading-the-dataset)
- [Citation](#citation)
---
# Dataset Website
## Dataset Overview

**Sora100K** is a large-scale multimodal video dataset for studying **text-to-video generation**, **single-turn video editing**, and **multi-turn video editing** under a unified metadata and analysis framework.
Beyond generation and editing, the dataset is intended to support metadata-driven analyses of **multi-shot composition**, **scene transitions**, **editing trajectories**, and other structural properties of video creation workflows.
This Hugging Face repository currently releases the **metadata layer** of Sora100K together with documentation and supplementary materials. The underlying raw video files are **not directly redistributed** through the Hugging Face repository.
## Key Statistics

The complete Sora100K resource covers three task settings:
- **Text-to-Video Generation**
- **Single-turn Video Editing**
- **Multi-turn Video Editing**
At the full-resource level, Sora100K contains:
- **103,439** videos in total
- **18,451** generation samples
- **76,964** single-turn editing samples
- **8,024** multi-turn editing samples or chains
This repository releases the **metadata layer** for these subsets rather than directly redistributing the underlying raw video files.
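
As a quick consistency check on the figures above, the three subset counts add up to the reported total. The snippet below is only a worked sum; the dictionary keys reuse the config names from the dataset card and are not field names from the released tables.

```python
# Subset sizes as reported in the Key Statistics list above.
subset_counts = {
    "text_to_video_generation": 18_451,
    "single_turn_video_editing": 76_964,
    "multi_turn_video_editing": 8_024,
}
reported_total = 103_439

# The subsets should account for every video in the full resource.
computed_total = sum(subset_counts.values())
print(computed_total == reported_total)

# Share of each subset in the full resource, as a percentage.
shares = {
    name: round(100 * count / reported_total, 1)
    for name, count in subset_counts.items()
}
print(shares)
```
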
## Dataset Structure
The repository is organized into three subset-specific folders:
- [Text-to-Video Generation](./Text-to-Video%20Generation/README.md)
- [Single-Turn Video Editing](./Single-Turn%20Videos%20Editing/README.md)
- [Multi-Turn Video Editing](./Multi-Turn%20Videos%20Editing/README.md)
Each subset folder contains its own `README.md` file describing:
- the task setting
- the current release status
- the released metadata files
- representative fields
- subset-specific notes and limitations
### Repository-level organization
At a high level, this repository contains:
- subset-level metadata files organized by task setting
- subset-specific documentation under each folder
- supplementary materials under `supplementary/`
- visual or illustrative assets under `doc/`
### Current release status
The three subsets are not currently released at exactly the same level of completeness.
- The **Single-Turn Videos Editing** folder currently reflects the most mature part of the released metadata and is the primary subset used in the current experiments.
- The **Text-to-Video Generation** folder corresponds to the generation subset of Sora100K, and its metadata organization may continue to expand as additional generation-specific files are prepared.
- The **Multi-Turn Videos Editing** folder corresponds to the multi-turn editing subset organized as editing chains, with subset-specific metadata and documentation provided separately.
Readers should refer to the `README.md` file inside each subset folder for the most accurate description of included files and their meanings.
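
To make the idea of an editing chain concrete, here is a minimal sketch of how chains could be rebuilt from source-to-edit pairs. It assumes each metadata row carries a `source_post_id` and an `edited_post_id`; the actual column layout of the multi-turn files may differ, so the subset `README.md` remains the reference. The function name and the post IDs in the usage lines are made up for illustration.

```python
from collections import defaultdict

def build_edit_chains(rows):
    """Group (source_post_id, edited_post_id) pairs into root-to-leaf chains.

    Assumptions: every row is a dict with both keys, and the edit graph is
    acyclic. A source with several edits simply yields several chains.
    """
    children = defaultdict(list)
    edited = set()
    for row in rows:
        children[row["source_post_id"]].append(row["edited_post_id"])
        edited.add(row["edited_post_id"])

    # Roots are posts that appear as a source but never as an edit result.
    roots = [src for src in children if src not in edited]

    chains = []

    def walk(node, path):
        path = path + [node]
        if node not in children:  # leaf: no further edits recorded
            chains.append(path)
            return
        for child in children[node]:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return chains

# Usage with made-up identifiers: one two-step chain and one single edit.
rows = [
    {"source_post_id": "p0", "edited_post_id": "p1"},
    {"source_post_id": "p1", "edited_post_id": "p2"},
    {"source_post_id": "q0", "edited_post_id": "q1"},
]
print(build_edit_chains(rows))
```
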

## Data Source
The Sora100K resource is constructed from videos and structured records associated with multiple video creation settings, including text-to-video generation, single-turn video editing, and multi-turn video editing.
During dataset construction and preprocessing, per-sample structured records such as `meta.json`, `result.json`, and `scenedetect.json` were processed and converted into tabular metadata files for easier analysis, filtering, benchmarking, and metadata-driven studies of generation and editing workflows.
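
The conversion described above can be sketched roughly as follows. This is not the authors' preprocessing code: it assumes one directory per sample containing a `meta.json` file, and both function names are illustrative. Nested values are flattened with dotted keys so that each sample becomes one CSV row.

```python
import csv
import json
from pathlib import Path

def flatten(record, prefix=""):
    """Flatten nested dicts into a single level using dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def json_records_to_csv(root, out_csv, filename="meta.json"):
    """Collect `<root>/<sample_dir>/<filename>` files into one CSV table.

    Returns the number of rows written.
    """
    rows = []
    for path in sorted(Path(root).glob(f"*/{filename}")):
        with path.open(encoding="utf-8") as fh:
            rows.append(flatten(json.load(fh)))

    # Use the union of keys as the header; missing fields become empty cells.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```
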
The currently released metadata does not cover all subsets at exactly the same level of completeness. In particular, some structured files are currently more mature for editing-related subsets than for the generation subset. Subset-specific details should therefore be checked in the corresponding `README.md` file under each subset folder.
This release publishes metadata only; the underlying raw video files are not directly distributed. A dedicated download tool is provided for retrieving the videos. If the videos can no longer be downloaded after Sora2 goes offline, please contact the authors, who will provide a direct link to a copy of the dataset for academic and non-commercial research use only.
## Access and License
### Released Content
This repository provides the **metadata layer** of Sora100K, including:
- subset-level metadata tables
- dataset-level documentation
- subset-specific documentation
- supplementary materials for review and reuse
- retrieval instructions and/or utilities for accessing underlying videos from original or otherwise authorized sources, when permitted
The goal of this release is to support dataset analysis, benchmarking, metadata-based retrieval, and structured studies of generation and editing workflows while keeping the release aligned with source-level access conditions.
### Access Scope
This Hugging Face repository does **not** directly redistribute raw video files as part of the dataset release.
Instead, the repository provides metadata, documentation, and access support that can be used to recover or obtain underlying videos from original or otherwise authorized sources, subject to source availability and access permissions.
Additional details about currently released files and supported retrieval workflows are provided in the corresponding `README.md` file under each subset folder.
### License Note
This repository is marked as `license: other` because the released resource consists of **metadata, documentation, and related utilities**, while the underlying raw media may involve mixed ownership or platform-specific rights conditions.
Unless otherwise stated, this repository does **not** claim relicensing or redistribution rights for any raw media referenced by the metadata. Users are responsible for complying with the original platform terms, creator rights, and any applicable laws or regulations when retrieving or using underlying videos.
### Review-Time Access
For ACM MM 2026 Dataset Track review, this page serves as the official dataset website. Reviewers and area chairs can use this repository to inspect:
- the dataset overview
- the repository structure
- subset-level documentation
- access conditions
- supplementary materials linked from this page
- the provided retrieval workflow
## How to Obtain the Videos
This Hugging Face repository releases the **metadata layer** of Sora100K together with documentation and retrieval support for accessing underlying videos from original or otherwise authorized sources, when permitted.
To facilitate reuse, the repository provides:
- released metadata files for identifying samples
- retrieval instructions and/or utilities for recovering source-level video records
- subset-specific notes describing currently available access workflows
In general, users can obtain underlying videos by:
1. using the released metadata files to identify the target samples
2. following the provided retrieval workflow or scripts in this repository
3. accessing the corresponding videos from original or otherwise authorized sources, subject to source availability and access permissions
4. regenerating valid source-level access links when required by the original platform
### Important Note on Links
Some metadata fields may contain temporary signed URLs or short-lived download links. Such links may expire and should **not** be treated as stable or permanent identifiers.
For long-term reference and recovery, the recommended identifiers are:
- `sample_id`
- `source_post_id`
- `edited_post_id`
Additional subset-specific access details are provided in the corresponding `README.md` file under each subset folder.
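
One way to act on this advice is to key any local index by the stable identifiers and to treat URL-valued fields as disposable. The sketch below assumes metadata rows are available as dictionaries. The heuristic for spotting signed links (an HTTP(S) URL that carries a query string), the helper names, and the `download_url` field in the example are assumptions for illustration, not part of the released schema.

```python
from urllib.parse import urlsplit

# Identifiers recommended above for long-term reference and recovery.
STABLE_ID_FIELDS = ("sample_id", "source_post_id", "edited_post_id")

def looks_like_signed_url(value):
    """Heuristic: an http(s) URL with a query string is likely short-lived."""
    if not isinstance(value, str):
        return False
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.query)

def split_row(row):
    """Return (stable_ids, volatile_field_names) for one metadata row."""
    stable = {key: row[key] for key in STABLE_ID_FIELDS if row.get(key)}
    volatile = [key for key, value in row.items() if looks_like_signed_url(value)]
    return stable, volatile

# Made-up row; `download_url` is a hypothetical field name.
row = {
    "sample_id": "s_0001",
    "source_post_id": "post_abc",
    "download_url": "https://example.com/v.mp4?sig=xyz&exp=1",
}
stable, volatile = split_row(row)
print(stable)
print(volatile)
```
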
## Ethical Considerations, Privacy, and Limitations
### Intended Use
Sora100K is intended to support research on multimodal video generation and editing, with particular focus on:
- dataset analysis and benchmarking for text-to-video generation, single-turn video editing, and multi-turn video editing
- metadata-driven retrieval, filtering, and dataset organization
- studies of source-to-edit relationships and editing trajectories
- scene-level and structural analysis, including temporal organization and multi-shot composition
- reproducible data curation, preprocessing, and subset-level evaluation workflows
In its current Hugging Face release form, the repository is especially suited for **metadata-based analysis and benchmark construction**, as well as for research workflows that rely on structured identifiers, tabular metadata, and subset-level documentation.
### Limitations
The current Hugging Face release does not directly redistribute raw video files and instead focuses on metadata, documentation, and subset-level organization.
Users should also note that:
- some source-level links or signed URLs referenced in metadata may be temporary or may expire over time
- different subsets may currently be released at different levels of completeness
- some annotations or structural metadata may currently be available only for specific subsets or source videos
- the dataset may inherit biases, artifacts, or coverage imbalances from the original generation or editing platform
- access to underlying videos may depend on source availability, platform policies, or authorization conditions
These limitations should be considered when using the resource for large-scale retrieval, reconstruction of raw media, or cross-subset comparisons.
### Responsible Use
Users are responsible for ensuring that any access, retrieval, download, or use of underlying videos complies with:
- the original platform terms of service
- copyright and related rights requirements
- applicable laws and regulations
- any source-specific access or authorization conditions
This repository releases metadata, documentation, and related materials only, and does not claim relicensing or redistribution rights for underlying raw video assets unless explicitly stated otherwise.
---
# Supplementary Materials
This section serves as the entry point to the supplementary materials for the ACM MM 2026 Dataset Track submission of **Sora100K**.
The supplementary materials include extended details that complement the main paper and the dataset website, including:
- extended statistics
- additional dataset examples and visualizations
- taxonomy and annotation details
- metadata schema and field descriptions
- notes on preprocessing and metadata conversion
- additional clarifications for dataset usage
- utility experiment details
### Supplementary Navigation
- [Comparison of Video Datasets](./supplementary/README.md#s1-comparison-of-video-datasets)
- [Taxonomy and Label Definitions](./supplementary/README.md#s2-taxonomy-and-label-definitions)
- [Metadata Schema and Field Descriptions](./supplementary/README.md#s3-metadata-schema-and-field-descriptions)
- [Additional Dataset Examples](./supplementary/README.md#s4-additional-dataset-examples)
- [VBench Evaluation Details](./supplementary/README.md#s6-vbench-evaluation-details)
- [Utility Experiment Details](./supplementary/README.md#s7-utility-experiment-details)
---
## Loading the Dataset
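Each subset is exposed as a separate config and can be loaded with the `datasets` library:
```python
from datasets import load_dataset

# Each config name matches a `config_name` entry in the dataset card's YAML header.
gen_ds = load_dataset("ysicong/Sora100K", "text_to_video_generation")
single_edit_ds = load_dataset("ysicong/Sora100K", "single_turn_video_editing")
multi_edit_ds = load_dataset("ysicong/Sora100K", "multi_turn_video_editing")

# Every config defines a single "train" split; print the first row of each.
print(gen_ds["train"][0])
print(single_edit_ds["train"][0])
print(multi_edit_ds["train"][0])
```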
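
Before loading, it can help to check which configs actually have CSV files behind them. The following sketch assumes a local checkout of the repository (for example via `git clone`) and reuses the glob patterns from the dataset card's `configs` block; the function names are illustrative.

```python
from pathlib import Path

# Glob patterns copied from the `configs` block of the dataset card.
CONFIG_PATTERNS = {
    "text_to_video_generation": "Text-to-Video Generation/*.csv",
    "single_turn_video_editing": "Single-Turn Videos Editing/*.csv",
    "multi_turn_video_editing": "Multi-Turn Videos Editing/*.csv",
}

def configs_with_data(repo_root):
    """Map each config name to the CSV file names its pattern matches locally."""
    root = Path(repo_root)
    return {
        name: sorted(path.name for path in root.glob(pattern))
        for name, pattern in CONFIG_PATTERNS.items()
    }

def loadable_configs(repo_root):
    """Config names whose pattern matches at least one CSV file."""
    return [name for name, files in configs_with_data(repo_root).items() if files]
```
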
> If a subset folder does not yet contain any CSV file, remove the corresponding config from the YAML block temporarily and add it back after the files are uploaded.
---
## Citation
If you use this resource, please cite the corresponding paper and dataset page.
Provided by:
ysicong



