Soul-AILab/VividHead

Name: Soul-AILab/VividHead
Creator: Soul-AILab
Published: 2026-02-12 09:20:28
License: No description available

Hugging Face | Updated 2026-02-12 | Indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/Soul-AILab/VividHead

Resource description:

---
license: apache-2.0
task_categories:
- image-to-video
pretty_name: VividHead
size_categories:
- 100K<n<1M
---

<div align="center">

<h1>SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads</h1>

[Tan Yu*](https://jiayoujiayoujiayoua.github.io/), [Qian Qiao*](https://qianqiaoai.github.io/)<sup>✉</sup>, [Le Shen*](https://openreview.net/profile?id=%7ELe_Shen3), [Ke Zhou](https://github.com/jokerz0624), [Jincheng Hu](#), [Dian Sheng](#), [Bo Hu](#), [Haoming Qin](#), [Jun Gao](#), [Changhai Zhou](#), [Shunshun Yin](#), [Siyuan Liu](#) <sup>✉</sup>

<sup>*</sup>Equal Contribution <sup>✉</sup>Corresponding Author

<a href='https://soul-ailab.github.io/soulx-flashhead/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
<a href='https://arxiv.org/abs/2602.07449'><img src='https://img.shields.io/badge/Technical-Report-red'></a>
<a href='https://huggingface.co/Soul-AILab/SoulX-FlashHead-1_3B'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>

</div>

# VividHead Dataset

## Highlights

- 🔥 **Large-scale, high-quality talking-head dataset** with **330K clips** and **782 hours** of head-cropped videos
- 🔥 **Broad diversity** across **15+ languages** and a **wide age range (0–60+)**
- 🔥 **Rich annotations** including age, gender, ethnicity, and language
- 🔥 **Unified and standardized processing**, with a consistent **FPS = 25** and **resolution = 512 × 512**

## ShowCase

## 🌰 Examples

<table style="width: 100%; border-collapse: collapse; border: none;">
  <tr style="border: none;">
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/6732.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/13464.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/26927.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/30292.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
  </tr>
  <tr style="border: none;">
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/37024.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/40390.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/53737.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
    <td style="width: 25%; border: none; padding: 5px; vertical-align: top;">
      <video src="https://huggingface.co/datasets/Soul-AILab/VividHead/resolve/main/assets/87511.mp4" controls muted loop style="width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"></video>
    </td>
  </tr>
</table>

## Dataset Statistics

This
dataset exhibits strong diversity across multiple dimensions:

- **Duration**: 3s–60s+, bimodal (peaks at ~5s and ~10s), mean **8.37s**; most clips fall in the 3–15s range.
- **Age**: 31–45 (432.5h), 19–30 (277.2h), 46–60 (61.3h), 60+ (10.4h), 0–19 (0.2h).
- **Language (Top 10)**: English (651.4h), Chinese (67.5h), Russian (8.7h), Spanish (7.1h), Portuguese (6.4h), Welsh (5.4h), Hindi (5.3h), German (3.6h), French (3.0h), Korean (2.7h); 15+ languages in total.
- **Gender & ethnicity**: Male (552.8h), Female (229.0h); White (506.7h), Asian (113.1h), Latino/Hispanic (56.5h), Middle Eastern (42.9h), Black (36.4h).

<table>
  <tr>
    <td align="center"><img src="assets/duration_distribution_chart.png" width="90%"/><br/><b>Duration</b></td>
    <td align="center"><img src="assets/age_group_distribution_chart.png" width="90%"/><br/><b>Age group</b></td>
  </tr>
  <tr>
    <td align="center"><img src="assets/language_distribution_chart.png" width="90%"/><br/><b>Language (Top 10)</b></td>
    <td align="center"><img src="assets/gender_ethnicity_distribution_chart.png" width="90%"/><br/><b>Gender & ethnicity</b></td>
  </tr>
</table>

## Comparison with Other Datasets

| Dataset | Speakers | Face Crop | Clips | Hours | Resolution | Language | Age | Ethnicity | Source |
|---------------|----------|-----------|--------|-------|------------------|----------|--------|-----------|--------|
| MEAD | 60 | ✅ | 281.4K | 39 | 384p | English | 20–35 | – | Lab |
| HDTF | 362 | ✅ | 10K | 15.8 | 512p | – | – | – | Wild |
| AVSpeech | 150K | ❌ | 2.5M | 4700 | 720p, 1080p | – | – | – | Wild |
| Hallo3 | – | ✅ | 101.5K | 70 | 720p | – | – | – | Wild |
| OpenHumanVid | – | ❌ | 13.4M | 16.7K | 720p | – | – | – | Wild |
| TalkVid | 7,729 | ❌ | 281.4K | 1244 | 1080p, 2160p | 15 lang. | 0–60+ | 3 | Wild |
| SpeakerVid | 83K | ❌ | 5.2M | 8.7K | 1080p | – | – | – | Wild |
| **Ours** | **60K** | ✅ | **330K** | **782** | **512p** | **15 lang.** | **0–60+** | **3** | **Wild** |

# Data Processing Pipeline

Our data processing pipeline is designed to construct a large-scale, high-quality talking-head dataset through systematic preprocessing, filtering, and annotation, ensuring sample uniqueness, temporal consistency, and reliable multi-modal supervision.

## Data Preprocessing Stage

1. **Data collection**: Aggregates initial content from web videos and various open-source videos to build a diverse raw data pool.
2. **Deduplication & Slicing**: Employs MD5 hash verification to eliminate redundant content and uses PySceneDetect to divide long videos into coherent clips ranging from 3 to 60+ seconds.
3. **Standardize to 25 FPS**: Normalizes all video clips to a uniform frame rate of 25 FPS using FFmpeg to ensure temporal consistency for model training.

## Data Filter & Annotation Stage

4. **Face detection & crop**: Detects face visibility and crops valid sequences to a centered $512 \times 512$ resolution.
5. **Jump cut detection**: Uses optical flow analysis to identify and exclude sequences containing scene discontinuities or abrupt transitions.
6. **Faceless filter**: Screens and excludes frames where a detectable face is missing or the head region is improperly framed.
7. **DWPose extraction & hand-filter**: Extracts body keypoints and strictly removes clips featuring hand-over-face occlusion to prevent generation artifacts.
8. **Lip-sync**: Utilizes the SyncNet model to calculate confidence scores (LSE-C and LSE-D), discarding any samples with poor audio-visual alignment.
9. **Audio feature & attribute labeling**: Extracts robust streaming features via Wav2Vec and annotates metadata including language, ethnicity, age, and gender.
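
Steps 2 and 3 of the preprocessing stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that deduplication hashes the raw bytes of each file, and that frame-rate normalization uses FFmpeg's `fps` video filter. The helper names `md5_of_file`, `deduplicate`, and `ffmpeg_25fps_command` are hypothetical and do not appear in the dataset card.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths):
    """Keep the first file seen for each distinct MD5 digest, preserving input order.

    Assumption: two videos count as duplicates only if their bytes are identical.
    """
    seen = set()
    unique = []
    for path in paths:
        key = md5_of_file(path)
        if key not in seen:
            seen.add(key)
            unique.append(Path(path))
    return unique


def ffmpeg_25fps_command(src, dst, fps=25):
    """Build (but do not run) an ffmpeg argument list that resamples a clip to `fps`.

    Assumption: the `fps` video filter is used; the source does not name the exact flags.
    """
    return ["ffmpeg", "-y", "-i", str(src), "-vf", f"fps={fps}", str(dst)]
```

The command list can be passed to `subprocess.run` once FFmpeg is installed. Byte-level hashing only catches exact copies; re-encoded duplicates would need a perceptual method, which the dataset card does not describe.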
## 📚 Citation

If you find our work useful in your research, please consider citing:

```
@misc{yu2026soulxflashheadoracleguidedgenerationinfinite,
      title={SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads},
      author={Tan Yu and Qian Qiao and Le Shen and Ke Zhou and Jincheng Hu and Dian Sheng and Bo Hu and Haoming Qin and Jun Gao and Changhai Zhou and Shunshun Yin and Siyuan Liu},
      year={2026},
      eprint={2602.07449},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.07449},
}
```

# License

Our VividHead dataset is released under the CC-BY-4.0 license and is intended for research and non-commercial purposes. The video samples are collected from publicly available datasets.

Provided by:

Soul-AILab
