Pinch-Research/lipsync-hdtf-training-data

Name: Pinch-Research/lipsync-hdtf-training-data
Creator: Pinch-Research
Published: 2026-04-17 13:36:39
License: No description provided

Source: Hugging Face. Updated: 2026-04-17. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Pinch-Research/lipsync-hdtf-training-data

Resource description:

---
license: apache-2.0
task_categories:
- video-classification
tags:
- lipsync
- talking-head
- audio-visual
- HDTF
- X-Dub
pretty_name: Lipsync Training Data (HDTF Teacher Pairs + Landmarks)
size_categories:
- 1K<n<10K
---

# Lipsync Training Data

Preprocessed datasets for audio-driven lip-sync model training, generated from the HDTF (High-Definition Talking Face) dataset.

## Contents

### 1. `xdub_teacher_pairs/` (1.3 GB)

**1,449 same-speaker lip-synced video pairs** generated by [X-Dub](https://github.com/KlingAIResearch/X-Dub) (Wan2.2-TI2V-5B release). Each pair consists of:

- **Source video**: an HDTF clip of speaker S saying utterance A
- **Teacher output**: X-Dub's lip-synced version of the same clip with audio from a *different* utterance B by the *same* speaker S

These are pseudo-pairs for distillation training: the teacher output has the source's pose/identity but with different lip shapes matching the alternate audio. The alternate audio is muxed into the teacher output mp4.

**Generation cost**: ~25 GPU-hours on 4×H100 (X-Dub inference at ~4 min/clip with 30 DDIM steps).

Structure:

```
xdub_teacher_pairs/
├── videos/
│   └── {src_stem}__x__{audio_stem}.mp4   (512×512, 25fps, with alt audio muxed)
└── meta/
    └── {src_stem}__x__{audio_stem}.json  (frame counts, alignment info)
```

### 2. `xdub_teacher_pairs_manifest.json` (964 KB)

Validated manifest of all 1,449 teacher pairs with metadata:

- `source`: path to the original HDTF clip
- `audio`: path to the alternate audio source clip
- `teacher`: path to the X-Dub teacher output
- `n_aligned`: min(source_frames, teacher_frames), the safe frame count for training
- `src_speaker` / `alt_speaker`: speaker IDs (always same speaker)
- `duration_s`: clip duration in seconds

All pairs are **same-speaker only** per the X-Dub paper's recommendation (Sec 3.1): *"we sample a_alt from the same speaker as V_real, avoiding instability from unseen data or cross-identity audio-visual combinations."*

### 3. `hdtf_landmarks/` (2.4 GB)

**MediaPipe FaceLandmarker landmarks for all 6,965 HDTF clips**, computed per-frame. Each `.npz` file contains:

- `landmarks`: `(n_frames, 478, 2)` float16, normalized [0,1] xy coordinates for 468 face mesh + 10 iris landmarks
- `valid`: `(n_frames,)` uint8, 1 if detection succeeded for that frame

Generated using MediaPipe FaceLandmarker (float16 v1 model) on all clips in `data/hdtf/filtered/{WDA,WRA,RD}/`.

**Useful for**: face/lip mask generation, face region extraction, head pose estimation, any talking-head research using HDTF.

## Source Data

- **HDTF**: 6,965 clips, 341 speakers, ~19 hours total, all 512×512 at 25fps. Face-cropped frontal studio recordings.
- **X-Dub teacher**: [KlingAIResearch/X-Dub](https://github.com/KlingAIResearch/X-Dub) (Apache 2.0), Wan2.2-TI2V-5B public release.

## How this data was used

This data was created as part of a project to train a proprietary lip-sync model via teacher distillation from X-Dub. The teacher pairs provide (source, alt-audio) → (lip-synced output) training triples where the model learns to imitate X-Dub's lip-sync ability on a smaller/faster architecture.

The landmarks were used to generate face/lip region masks for loss weighting during training (X-Dub paper App D: `L_wFM = (1 + w·M_face + w_lip·M_lip) ⊙ L_FM`).

## License

- This preprocessed data: Apache 2.0
- HDTF source videos: subject to HDTF's original license (research use)
- X-Dub teacher outputs: generated using X-Dub's Apache 2.0 code + released model weights
- MediaPipe landmarks: generated using Google's MediaPipe (Apache 2.0)
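
The manifest fields and the `{src_stem}__x__{audio_stem}` naming scheme described above lend themselves to a small filtering helper. The sketch below is illustrative only: the card does not say whether the manifest's top level is a list or a mapping, so `normalize_manifest` accepts either (an assumption), and the function names, the `min_frames` threshold, and the example file stems are hypothetical.

```python
import json
from pathlib import Path


def normalize_manifest(data):
    """Return manifest records as a list.

    Assumption: the top level is either a list of records or a dict whose
    values are records. The dataset card does not specify which.
    """
    return list(data.values()) if isinstance(data, dict) else list(data)


def load_manifest(path):
    """Read xdub_teacher_pairs_manifest.json from a local path."""
    with open(path, "r", encoding="utf-8") as f:
        return normalize_manifest(json.load(f))


def usable_pairs(entries, min_frames=25):
    """Keep same-speaker pairs with at least `min_frames` aligned frames.

    `min_frames` is a hypothetical threshold (25 frames = 1 s at 25 fps).
    The speaker check is defensive; the card says pairs are always same-speaker.
    """
    return [
        e for e in entries
        if e["src_speaker"] == e["alt_speaker"] and e["n_aligned"] >= min_frames
    ]


def split_pair_name(filename):
    """Split '{src_stem}__x__{audio_stem}.mp4' into (src_stem, audio_stem)."""
    stem = Path(filename).stem
    src_stem, sep, audio_stem = stem.partition("__x__")
    if not sep:
        raise ValueError(f"not a teacher-pair filename: {filename}")
    return src_stem, audio_stem
```

When decoding a pair for training, reading only the first `n_aligned` frames of both the source and the teacher video keeps the two streams the same length.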
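
The loss-weighting formula quoted above, `L_wFM = (1 + w·M_face + w_lip·M_lip) ⊙ L_FM`, can be sketched with NumPy from the landmark arrays. This is a minimal illustration under several assumptions, not the authors' pipeline: masks are axis-aligned bounding boxes rather than face or lip contours, the weights `w_face` and `w_lip` are placeholder values (the card gives none), frames without a detection keep a uniform weight of 1, and `LIP_IDX` is the outer-lip index set commonly used with MediaPipe's face mesh, which should be verified against the MediaPipe version in use.

```python
import numpy as np

# Assumed outer-lip landmark indices for the MediaPipe face mesh topology.
LIP_IDX = [61, 146, 91, 181, 84, 17, 314, 405, 321, 375,
           291, 409, 270, 269, 267, 0, 37, 39, 40, 185]


def load_landmarks(npz_path):
    """Load `landmarks` (n_frames, 478, 2) and `valid` (n_frames,) from one .npz."""
    with np.load(npz_path) as data:
        return data["landmarks"], data["valid"]


def box_mask(points_xy, height, width):
    """Binary mask over the bounding box of normalized [0,1] xy points."""
    pts = points_xy.astype(np.float32)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    c0 = int(np.clip(np.floor(x0 * width), 0, width))
    c1 = int(np.clip(np.ceil(x1 * width), 0, width))
    r0 = int(np.clip(np.floor(y0 * height), 0, height))
    r1 = int(np.clip(np.ceil(y1 * height), 0, height))
    mask = np.zeros((height, width), dtype=np.float32)
    mask[r0:r1, c0:c1] = 1.0
    return mask


def frame_weights(landmarks, valid, height, width, w_face=1.0, w_lip=2.0):
    """Per-pixel weights 1 + w_face*M_face + w_lip*M_lip for every frame."""
    n_frames = landmarks.shape[0]
    weights = np.ones((n_frames, height, width), dtype=np.float32)
    for t in range(n_frames):
        if not valid[t]:
            continue  # no detection: leave the uniform weight of 1
        m_face = box_mask(landmarks[t], height, width)
        m_lip = box_mask(landmarks[t, LIP_IDX], height, width)
        weights[t] += w_face * m_face + w_lip * m_lip
    return weights


def weighted_loss(per_pixel_loss, weights):
    """Mean of the element-wise product, i.e. mean((1 + ...) ⊙ L_FM)."""
    return float((weights * per_pixel_loss).mean())
```

If the loss is computed in a downsampled latent space rather than at 512×512, passing the latent height and width to `frame_weights` produces masks at that resolution directly, because the landmarks are stored in normalized coordinates.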

Provided by:

Pinch-Research
