ysicong/Sora100K

Name: ysicong/Sora100K
Creator: ysicong
Published: 2026-04-09 09:12:37
License: no description provided

Source: Hugging Face (updated 2026-04-09, indexed 2026-03-29)

Download link:

https://hf-mirror.com/datasets/ysicong/Sora100K

Description:

---
pretty_name: Sora100K
license: other
language:
  - en
tags:
  - video
  - multimodal
  - tabular
  - text-to-video
  - video-editing
  - datasets
configs:
  - config_name: text_to_video_generation
    default: true
    data_files:
      - split: train
        path: "Text-to-Video Generation/*.csv"
  - config_name: single_turn_video_editing
    data_files:
      - split: train
        path: "Single-Turn Videos Editing/*.csv"
  - config_name: multi_turn_video_editing
    data_files:
      - split: train
        path: "Multi-Turn Videos Editing/*.csv"
---

# Sora100K

This page serves as both the **dataset website** and the **supplementary materials website** for the ACM MM 2026 Dataset Track submission.

## Quick Navigation

- [Dataset Overview](#dataset-overview)
- [Key Statistics](#key-statistics)
- [Dataset Structure](#dataset-structure)
- [Data Source](#data-source)
- [Access and License](#access-and-license)
- [How to Obtain the Videos](#how-to-obtain-the-videos)
- [Ethical Considerations, Privacy, and Limitations](#ethical-considerations-privacy-and-limitations)
- [Supplementary Materials](#supplementary-materials)
- [Loading the Dataset](#loading-the-dataset)
- [Citation](#citation)

---

# Dataset Website

## Dataset Overview

![Figure 1](./asset/Figer1.png)

**Sora100K** is a large-scale multimodal video dataset for studying **text-to-video generation**, **single-turn video editing**, and **multi-turn video editing** under a unified metadata and analysis framework.

Beyond generation and editing, the dataset is intended to support metadata-driven analyses of **multi-shot composition**, **scene transitions**, **editing trajectories**, and other structural properties of video creation workflows.

This Hugging Face repository currently releases the **metadata layer** of Sora100K together with documentation and supplementary materials. The underlying raw video files are **not directly redistributed** through the Hugging Face repository.

## Key Statistics

![Figure 2](./asset/Figer2.png)

The complete Sora100K resource covers three task settings:

- **Text-to-Video Generation**
- **Single-Turn Video Editing**
- **Multi-Turn Video Editing**

At the full-resource level, Sora100K contains:

- **103,439** videos in total
- **18,451** generation samples
- **76,964** single-turn editing samples
- **8,024** multi-turn editing samples or chains

This repository releases the **metadata layer** for these subsets rather than the underlying raw video files.

## Dataset Structure

The repository is organized into three subset-specific folders:

- [Text-to-Video Generation](./Text-to-Video%20Generation/README.md)
- [Single-Turn Video Editing](./Single-Turn%20Videos%20Editing/README.md)
- [Multi-Turn Video Editing](./Multi-Turn%20Videos%20Editing/README.md)

Each subset folder contains its own `README.md` file describing:

- the task setting
- the current release status
- the released metadata files
- representative fields
- subset-specific notes and limitations

### Repository-level organization

At a high level, this repository contains:

- subset-level metadata files organized by task setting
- subset-specific documentation under each folder
- supplementary materials under `supplementary/`
- visual or illustrative assets under `doc/`

### Current release status

The three subsets are not currently released at the same level of completeness.

- The **Single-Turn Videos Editing** folder holds the most mature part of the released metadata and is the primary subset used in the current experiments.
- The **Text-to-Video Generation** folder corresponds to the generation subset of Sora100K. Its metadata organization may continue to expand as additional generation-specific files are prepared.
- The **Multi-Turn Videos Editing** folder corresponds to the multi-turn editing subset, which is organized as editing chains. Its subset-specific metadata and documentation are provided separately.

Refer to the `README.md` file inside each subset folder for the most accurate description of the included files and their meanings.

![Figure 3](./asset/Figer3.png)

## Data Source

Sora100K is constructed from videos and structured records associated with three video creation settings: text-to-video generation, single-turn video editing, and multi-turn video editing.

During dataset construction and preprocessing, per-sample structured records such as `meta.json`, `result.json`, and `scenedetect.json` were converted into tabular metadata files for easier analysis, filtering, benchmarking, and metadata-driven studies of generation and editing workflows.

The released metadata does not yet cover all subsets at the same level of completeness. In particular, some structured files are currently more mature for the editing subsets than for the generation subset. Check the `README.md` file under each subset folder for subset-specific details.

This dataset publishes metadata only; the underlying raw video files are not directly distributed. A dedicated download tool is provided for retrieving the videos. If the videos can no longer be downloaded after Sora2 goes offline, please contact the authors, who will provide a direct link to a copy of the dataset for academic and non-commercial research use only.
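The conversion of per-sample records into tabular metadata described in this section can be illustrated with a minimal sketch. This is not the authors' preprocessing pipeline: the one-directory-per-sample layout, the dotted column names, the use of the directory name as `sample_id`, and the helper names `flatten` and `records_to_csv` are all assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Store lists as JSON strings so each sample stays on one row.
            flat[name] = json.dumps(value, ensure_ascii=False)
        else:
            flat[name] = value
    return flat


def records_to_csv(sample_dirs, out_csv, record_name="meta.json"):
    """Write one CSV row per sample directory that contains `record_name`."""
    rows = []
    for sample_dir in sample_dirs:
        path = Path(sample_dir) / record_name
        if not path.is_file():
            continue  # skip samples that lack this record
        with path.open(encoding="utf-8") as fh:
            row = flatten(json.load(fh))
        # Assumption: the directory name doubles as the sample identifier.
        row.setdefault("sample_id", Path(sample_dir).name)
        rows.append(row)

    # Use the union of all keys so records with different fields share a header.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Taking the union of keys across records means a field that exists for only some samples becomes an empty cell elsewhere, which matches the note above that some structured files are more complete for certain subsets.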
## Access and License

### Released Content

This repository provides the **metadata layer** of Sora100K, including:

- subset-level metadata tables
- dataset-level documentation
- subset-specific documentation
- supplementary materials for review and reuse
- retrieval instructions and/or utilities for accessing underlying videos from original or otherwise authorized sources, when permitted

The release is meant to support dataset analysis, benchmarking, metadata-based retrieval, and structured studies of generation and editing workflows while staying aligned with source-level access conditions.

### Access Scope

This Hugging Face repository does **not** directly redistribute raw video files as part of the dataset release.

Instead, it provides metadata, documentation, and access support that can be used to obtain the underlying videos from original or otherwise authorized sources, subject to source availability and access permissions.

Details about the currently released files and supported retrieval workflows are given in the `README.md` file under each subset folder.

### License Note

This repository is marked as `license: other` because the released resource consists of **metadata, documentation, and related utilities**, while the underlying raw media may involve mixed ownership or platform-specific rights conditions.

Unless otherwise stated, this repository does **not** claim relicensing or redistribution rights for any raw media referenced by the metadata. Users are responsible for complying with the original platform terms, creator rights, and any applicable laws or regulations when retrieving or using underlying videos.

### Review-Time Access

For ACM MM 2026 Dataset Track review, this page serves as the official dataset website. Reviewers and area chairs can use this repository to inspect:

- the dataset overview
- the repository structure
- subset-level documentation
- access conditions
- supplementary materials linked from this page
- the provided retrieval workflow

## How to Obtain the Videos

This Hugging Face repository releases the **metadata layer** of Sora100K together with documentation and retrieval support for accessing underlying videos from original or otherwise authorized sources, when permitted.

To facilitate reuse, the repository provides:

- released metadata files for identifying samples
- retrieval instructions and/or utilities for recovering source-level video records
- subset-specific notes describing currently available access workflows

In general, users can obtain the underlying videos by:

1. using the released metadata files to identify the target samples
2. following the retrieval workflow or scripts provided in this repository
3. accessing the corresponding videos from original or otherwise authorized sources, subject to source availability and access permissions
4. regenerating valid source-level access links when required by the original platform

### Important Note on Links

Some metadata fields may contain temporary signed URLs or short-lived download links. Such links may expire and should **not** be treated as stable or permanent identifiers.

For long-term reference and recovery, the recommended identifiers are:

- `sample_id`
- `source_post_id`
- `edited_post_id`

Additional subset-specific access details are provided in the `README.md` file under each subset folder.
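The distinction between expiring links and stable identifiers can be made concrete with a small sketch. The list of query parameters used to spot signed URLs is a heuristic assumption, not something documented for Sora100K, and the `video_url` column in the usage example is hypothetical; only `sample_id`, `source_post_id`, and `edited_post_id` come from the note above.

```python
from urllib.parse import parse_qs, urlparse

# Assumption: query parameters that commonly mark signed, expiring URLs.
SIGNED_URL_PARAMS = {
    "signature", "sig", "expires", "se",
    "x-amz-signature", "x-amz-expires",
}

# Identifiers recommended above for long-term reference.
STABLE_ID_FIELDS = ("sample_id", "source_post_id", "edited_post_id")


def looks_temporary(value):
    """Heuristically flag http(s) URLs whose query string carries signing parameters."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https"):
        return False
    params = {name.lower() for name in parse_qs(parsed.query)}
    return bool(params & SIGNED_URL_PARAMS)


def stable_reference(row):
    """Keep only the recommended identifiers that are present and non-empty."""
    return {field: row[field] for field in STABLE_ID_FIELDS if row.get(field)}


# Usage with a hypothetical metadata row.
row = {
    "sample_id": "s1",
    "source_post_id": "p1",
    "edited_post_id": "",
    "video_url": "https://example.com/v.mp4?sig=abc&se=2026-01-01",
}
reference = stable_reference(row)
expiring_columns = [name for name, value in row.items() if looks_temporary(value)]
```

Storing `reference` rather than the URL keeps citations and experiment logs valid after a link expires; a fresh link can then be regenerated through the retrieval workflow.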
## Ethical Considerations, Privacy, and Limitations

### Ethical Considerations and Privacy

Sora100K is intended to support research on multimodal video generation and editing, with particular focus on:

- dataset analysis and benchmarking for text-to-video generation, single-turn video editing, and multi-turn video editing
- metadata-driven retrieval, filtering, and dataset organization
- studies of source-to-edit relationships and editing trajectories
- scene-level and structural analysis, including temporal organization and multi-shot composition
- reproducible data curation, preprocessing, and subset-level evaluation workflows

In its current Hugging Face release, the repository is especially suited to **metadata-based analysis and benchmark construction**, and to research workflows that rely on structured identifiers, tabular metadata, and subset-level documentation.

### Limitations

The current Hugging Face release does not directly redistribute raw video files; it focuses on metadata, documentation, and subset-level organization.

Users should also note that:

- some source-level links or signed URLs referenced in the metadata may be temporary or may expire over time
- different subsets may currently be released at different levels of completeness
- some annotations or structural metadata may currently be available only for specific subsets or source videos
- the dataset may inherit biases, artifacts, or coverage imbalances from the original generation or editing platform
- access to underlying videos may depend on source availability, platform policies, or authorization conditions

These limitations should be considered when using the resource for large-scale retrieval, reconstruction of raw media, or cross-subset comparisons.
### Responsible Use

Users are responsible for ensuring that any access, retrieval, download, or use of underlying videos complies with:

- the original platform terms of service
- copyright and related rights requirements
- applicable laws and regulations
- any source-specific access or authorization conditions

This repository releases metadata, documentation, and related materials only. It does not claim relicensing or redistribution rights for underlying raw video assets unless explicitly stated otherwise.

---

# Supplementary Materials

This section is the entry point to the supplementary materials for the ACM MM 2026 Dataset Track submission of **Sora100K**.

The supplementary materials complement the main paper and the dataset website with:

- extended statistics
- additional dataset examples and visualizations
- taxonomy and annotation details
- metadata schema and field descriptions
- notes on preprocessing and metadata conversion
- additional clarifications for dataset usage
- utility experiment details

### Supplementary Navigation

- [Comparison of Video Datasets](./supplementary/README.md#s1-comparison-of-video-datasets)
- [Taxonomy and Label Definitions](./supplementary/README.md#s2-taxonomy-and-label-definitions)
- [Metadata Schema and Field Descriptions](./supplementary/README.md#s3-metadata-schema-and-field-descriptions)
- [Additional Dataset Examples](./supplementary/README.md#s4-additional-dataset-examples)
- [VBench Evaluation Details](./supplementary/README.md#s6-vbench-evaluation-details)
- [Utility Experiment Details](./supplementary/README.md#s7-utility-experiment-details)

---

## Loading the Dataset

Example:

```python
from datasets import load_dataset

# Each config name matches a `config_name` entry in the YAML block of this card.
gen_ds = load_dataset("ysicong/Sora100K", "text_to_video_generation")
single_edit_ds = load_dataset("ysicong/Sora100K", "single_turn_video_editing")
multi_edit_ds = load_dataset("ysicong/Sora100K", "multi_turn_video_editing")

# Every config exposes a single `train` split; print its first metadata row.
print(gen_ds["train"][0])
print(single_edit_ds["train"][0])
print(multi_edit_ds["train"][0])
```

> If a subset folder does not yet contain any CSV file, temporarily remove the corresponding config from the YAML block and add it back after the files are uploaded.

---

## Citation

If you use this resource, please cite the corresponding paper and dataset page.

Provided by:

ysicong
