EngineeringAI-LAB/3DTalkingDataset

Name: EngineeringAI-LAB/3DTalkingDataset
Creator: EngineeringAI-LAB
Published: 2026-04-08 07:03:25
License: apache-2.0 (as declared in the dataset card)

Source: Hugging Face. Updated 2026-04-08; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/EngineeringAI-LAB/3DTalkingDataset


Description:

---
license: apache-2.0
language:
- en
size_categories:
- 100M<n<1B
---

# Dataset Card for Dataset Curation of 3DXTalker

### Dataset Description

- **Repository:** https://github.com/EngineeringAI-LAB/3DXTalker/tree/main
- **Paper:** https://arxiv.org/abs/2602.10516
- **Project:** https://engineeringai-lab.github.io/3DXTalker.github.io/

### Dataset Summary

This dataset is a large-scale, curated collection of talking head videos built for tasks such as high-fidelity 3D talking avatar generation, lip synchronization, and pose dynamics modeling. It aggregates and standardizes data from six prominent sources (**GRID, RAVDESS, MEAD, VoxCeleb2, HDTF, CelebV-HQ**), processed through a data curation pipeline that enforces quality in face alignment, resolution, and audio-visual synchronization. It covers both lab and in-the-wild recording environments and a wide range of subjects.

### Supported Tasks and Leaderboards

- **3D Talking Head Generation:** Synthesizing realistic talking videos from driving speech.
- **Audio-Driven Lip Synchronization:** Aligning lip movements precisely with input speech.
- **Emotion Analysis & Synthesis:** Leveraging the emotional diversity in datasets like RAVDESS and MEAD.
- **Audio-Driven Head Pose Synthesis:** Modeling natural head movements and orientation directly from driving speech.

## Dataset Structure

```
trainset/
├── V0-GRID/                        # 6,570 sequences from GRID corpus
│   ├── V0-s1-00001/
│   │   ├── audio.wav               # (N,) audio data
│   │   ├── cam.npy                 # (T, 3) camera parameters
│   │   ├── detailcode.npy          # (T, 128) facial details
│   │   ├── envelope.npy            # (N,) audio envelope
│   │   ├── expcode.npy             # (T, 50) expression codes
│   │   ├── lightcode.npy           # (T, 9, 3) lighting
│   │   ├── metadata.pkl            # Sequence metadata
│   │   ├── posecode.npy            # (T, 6) head pose
│   │   ├── refimg.npy              # (C, H, W) reference image
│   │   ├── shapecode.npy           # (T, 100) shape codes
│   │   └── texcode.npy             # (T, 50) texture codes
│   ├── V0-s1-00002/
│   │   └── ... (same 11 files)
│   ├── V0-s1-00003/
│   └── ... (6,570 total sequences)
│
├── V1-RAVDESS/                     # 583 sequences from RAVDESS dataset
│   ├── V1-Song-Actor_01-00001/
│   │   └── ... (same 11 files)
│   ├── V1-Song-Actor_01-00002/
│   ├── V1-Speech-Actor_01-00001/
│   ├── V1-Speech-Actor_02-00001/
│   └── ... (583 total sequences)
│
├── V2-MEAD/                        # 1,939 sequences from MEAD dataset
│   ├── V2-M003-angry-00001/
│   │   └── ... (same 11 files)
│   ├── V2-M003-angry-00002/
│   ├── V2-M003-happy-00001/
│   ├── V2-W009-sad-00001/
│   └── ... (1,939 total sequences)
│
├── V3-VoxCeleb2/                   # 1,296 sequences from VoxCeleb2
│   ├── {sequence_id}/
│   │   └── ... (same 11 files)
│   └── ... (1,296 total sequences)
│
├── V4-HDTF/                        # 350 sequences from HDTF dataset
│   ├── {sequence_id}/
│   │   └── ... (same 11 files)
│   └── ... (350 total sequences)
│
└── V5-CelebV-HQ/                   # 768 sequences from CelebV-HQ dataset
    ├── {sequence_id}/
    │   └── ... (same 11 files)
    └── ... (768 total sequences)
```

## Data Format Details

### File Overview

| File | Type | Shape | Description |
|------|------|-------|-------------|
| `audio.wav` | Audio | (N_samples,) | Original audio waveform |
| `cam.npy` | Parameters | (N_frames, 3) | Camera parameters (position/scale) |
| `detailcode.npy` | Parameters | (N_frames, 128) | Facial detail codes (wrinkles, fine features) |
| `envelope.npy` | Parameters | (N_audio_samples,) | Audio envelope/amplitude over time |
| `expcode.npy` | Parameters | (N_frames, 50) | FLAME expression parameters (50-dim) |
| `lightcode.npy` | Parameters | (N_frames, 9, 3) | Spherical harmonics lighting (9 coefficients × RGB) |
| `metadata.pkl` | Metadata | N/A | Sequence metadata (integer or dict) |
| `posecode.npy` | Parameters | (N_frames, 6) | 3 head pose + 3 jaw pose |
| `refimg.npy` | Image | (3, 224, 224) | Reference image (RGB, 224×224 pixels) |
| `shapecode.npy` | Parameters | (N_frames, 100) | FLAME shape parameters (100-dim) |
| `texcode.npy` | Parameters | (N_frames, 50) | Texture codes (50-dim) |

### Coordinate Systems and Conventions

- **FLAME model**: 3D Morphable Face Model with 5023 vertices
- **Expression space**: 50-dimensional linear basis
- **Shape space**: 100-dimensional PCA space
- **Pose representation**: 3 head pose + 3 jaw pose
- **Lighting**: 2nd-order spherical harmonics (9 coefficients)

### Temporal Synchronization

- **Video frames**: 25 FPS (frames per second)
- **Audio samples**: 16,000 samples per second
- All video parameters (`expcode`, `shapecode`, `detailcode`, `posecode`, `cam`, `lightcode`, `texcode`) share the same `N_frames` dimension
- Audio and video are temporally aligned (frame 0 corresponds to the start of the audio)

### Data Statistics

The dataset comprises **11,706** total video samples, spanning approximately **67.4 hours** of talking-head footage. The data is categorized by environment (Lab vs. Wild) and includes varying resolutions and subject diversity.

#### Detailed Statistics (from Curation Pipeline)

| Dataset | ID | Environment | Year | Raw Resolution | Size (samples) | Subjects | Total Duration (s) | Hours (h) | Avg. Duration (s/sample) |
|---------|----|-------------|------|----------------|----------------|----------|--------------------|-----------|--------------------------|
| **GRID** | V0 | Lab | 2006 | 720 × 576 | 6,600 | 34 | 99,257.81 | 27.57 | 15.04 |
| **RAVDESS** | V1 | Lab | 2018 | 1280 × 1024 | 613 | 24 | 10,071.88 | 2.80 | 16.43 |
| **MEAD** | V2 | Lab | 2020 | 1920 × 1080 | 1,969 | 60 | 42,868.77 | 11.91 | 21.77 |
| **VoxCeleb2** | V3 | Wild | 2018 | 360P~720P | 1,326 | 1k+ | 21,528.20 | 5.98 | 16.24 |
| **HDTF** | V4 | Wild | 2021 | 720P~1080P | 400 | 300+ | 55,452.08 | 15.40 | 138.63 |
| **CelebV-HQ** | V5 | Wild | 2022 | 512 × 512 | 798 | 700+ | 13,486.20 | 3.75 | 16.90 |

### Data Splits

The dataset follows a strict training and testing split protocol to ensure fair evaluation. The test set takes 30 samples from each sub-dataset, except HDTF, which contributes 50.


| Dataset | ID | Total Size | Training Set | Test Set |
|---------|----|------------|--------------|----------|
| **GRID** | V0 | 6,600 | 6,570 | 30 |
| **RAVDESS** | V1 | 613 | 583 | 30 |
| **MEAD** | V2 | 1,969 | 1,939 | 30 |
| **VoxCeleb2** | V3 | 1,326 | 1,296 | 30 |
| **HDTF** | V4 | 400 | 350 | 50 |
| **CelebV-HQ** | V5 | 798 | 768 | 30 |
| **Summary** | | **11,706** | **11,506** | **200** |

## Dataset Creation

### Curation Rationale

Raw videos from the wild (e.g., VoxCeleb2, CelebV-HQ) often contain background noise, diverse languages, or varying resolutions. This dataset is produced by the following curation pipeline, designed to ensure high-quality audio-visual consistency:

1. **Duration Filtering:** To facilitate temporal modeling, short clips from lab datasets are concatenated to form 10–20s sequences, while wild samples shorter than 10s are filtered out.
2. **Signal-to-Noise Ratio (SNR) Filtering:** Clips with strong background noise, music, or environmental interference are removed based on SNR thresholds to ensure clean audio features.
3. **Language Filtering:** Linguistic consistency is enforced by using **Whisper** to discard non-English samples or those with low detection confidence.
4. **Audio-Visual Sync Filtering:** **SyncNet** is used to eliminate clips with poor lip synchronization, abrupt scene cuts, or off-screen speakers (e.g., voice-overs).
5. **Resolution Normalization:** All videos are resized and center-cropped to a unified **512×512** resolution and re-encoded at **25 FPS** with standardized RGB to harmonize data from diverse sources.

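
Step 1 of the curation pipeline (duration filtering) can be illustrated with a minimal sketch. The 10 s and 20 s bounds come from the card; the greedy packing of consecutive clips and the function names `filter_wild` and `group_lab_clips` are assumptions made for illustration, since the card does not say how lab clips are grouped before concatenation.

```python
from typing import List

MIN_SEC = 10.0  # lower duration bound stated in the card
MAX_SEC = 20.0  # upper duration bound stated in the card


def filter_wild(durations: List[float], min_sec: float = MIN_SEC) -> List[int]:
    """Return indices of wild clips that are at least min_sec long."""
    return [i for i, dur in enumerate(durations) if dur >= min_sec]


def group_lab_clips(
    durations: List[float],
    min_sec: float = MIN_SEC,
    max_sec: float = MAX_SEC,
) -> List[List[int]]:
    """Greedily pack consecutive lab clips into groups of min_sec..max_sec.

    Hypothetical grouping strategy: a group is closed as soon as the next
    clip would push it past max_sec. Groups outside the bounds are dropped.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    total = 0.0
    for idx, dur in enumerate(durations):
        if current and total + dur > max_sec:
            # Close the running group; keep it only if it is within bounds.
            if min_sec <= total <= max_sec:
                groups.append(current)
            current, total = [], 0.0
        current.append(idx)
        total += dur
    if current and min_sec <= total <= max_sec:
        groups.append(current)
    return groups
```

With ten 3-second clips, this packing yields one group of six clips (18 s) and one of four (12 s). Each group would then be concatenated into a single training sequence.
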
### Source Video Data

- **GRID:** https://zenodo.org/records/3625687
- **RAVDESS:** https://zenodo.org/records/1188976
- **MEAD:** https://wywu.github.io/projects/MEAD/MEAD.html
- **VoxCeleb2:** https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html
- **HDTF:** https://huggingface.co/datasets/global-optima-research/HDTF
- **CelebV-HQ:** https://github.com/CelebV-HQ/CelebV-HQ/

## Citation

```bibtex
@misc{wang20263dxtalkerunifyingidentitylip,
  title={3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars},
  author={Zhongju Wang and Zhenhong Sun and Beier Wang and Yifu Wang and Daoyi Dong and Huadong Mo and Hongdong Li},
  year={2026},
  eprint={2602.10516},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.10516},
}
```
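
The per-sequence layout described under Data Format Details and Temporal Synchronization can be read along the following lines. This is a sketch, not code shipped with the dataset: the helper names `load_sequence` and `frame_to_sample_range` are hypothetical, and the only facts taken from the card are the file names, the trailing array shapes, the 25 FPS frame rate, and the 16,000 Hz sample rate.

```python
from pathlib import Path
from typing import Dict, Tuple

import numpy as np

FPS = 25                 # video frame rate stated in the card
SAMPLE_RATE = 16_000     # audio sample rate stated in the card
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS  # 640 audio samples per video frame

# Trailing (per-frame) shapes from the File Overview table.
FRAME_SHAPES: Dict[str, Tuple[int, ...]] = {
    "cam": (3,),
    "detailcode": (128,),
    "expcode": (50,),
    "lightcode": (9, 3),
    "posecode": (6,),
    "shapecode": (100,),
    "texcode": (50,),
}


def load_sequence(seq_dir) -> Dict[str, np.ndarray]:
    """Load the per-frame .npy files of one sequence directory.

    Checks the documented trailing shapes and that all arrays share the
    same N_frames dimension.
    """
    seq_dir = Path(seq_dir)
    codes = {}
    for name, tail in FRAME_SHAPES.items():
        arr = np.load(seq_dir / f"{name}.npy")
        if arr.shape[1:] != tail:
            raise ValueError(f"{name}: expected (T, {tail}), got {arr.shape}")
        codes[name] = arr
    frame_counts = {arr.shape[0] for arr in codes.values()}
    if len(frame_counts) != 1:
        raise ValueError(f"inconsistent N_frames: {sorted(frame_counts)}")
    return codes


def frame_to_sample_range(frame_idx: int) -> Tuple[int, int]:
    """Map a video frame index to its [start, end) audio sample range.

    Assumes frame 0 starts at audio sample 0, as the card states.
    """
    start = frame_idx * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME
```

For instance, `load_sequence("trainset/V0-GRID/V0-s1-00001")` would return a dictionary of seven arrays, and `frame_to_sample_range(25)` gives the audio samples covering the frame one second into the clip. `metadata.pkl`, `audio.wav`, `envelope.npy`, and `refimg.npy` are left out here because they are not indexed by frame.
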
