TalkVid

Name: TalkVid
Creator: maas
Published: 2026-04-29 21:31:47
License: No description provided

ModelScope community. Updated 2026-04-29; indexed 2025-09-06.

Download link:

https://modelscope.cn/datasets/FreedomIntelligence/TalkVid


Resource description:

# TalkVid Dataset

This repository hosts the [**TalkVid**](https://github.com/FreedomIntelligence/TalkVid) dataset.

- Paper: [TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis](https://huggingface.co/papers/2508.13618)
- arXiv paper: https://arxiv.org/abs/2508.13618
- Project Page: https://freedomintelligence.github.io/talk-vid
- GitHub: https://github.com/FreedomIntelligence/TalkVid

## Abstract

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research.
Code and data can be found at https://github.com/FreedomIntelligence/TalkVid.

## Dataset Overview

**TalkVid** is a large-scale and diversified open-source dataset for audio-driven talking head synthesis, featuring:

- **Scale**: 7,729 unique speakers with over 1,244 hours of HD/4K footage
- **Diversity**: Covers 15 languages and a wide age range (0–60+ years)
- **Quality**: High-resolution videos (1080p & 2160p) with comprehensive quality filtering
- **Rich Context**: Full upper-body presence, unlike head-only datasets
- **Annotations**: High-quality captions and comprehensive metadata

**More example videos** can be found in our [🌐 Project Page](https://freedomintelligence.github.io/talk-vid).

### Data Format

```json
{
  "id": "videovideoTr6MMsoWAog-scene1-scene1",
  "height": 1080,
  "width": 1920,
  "fps": 24.0,
  "start-time": 0.1,
  "start-frame": 0,
  "end-time": 5.141666666666667,
  "end-frame": 121,
  "durations": "5.042s",
  "info": {
    "Person ID": "597",
    "Ethnicity": "White",
    "Age Group": "60+",
    "Gender": "Male",
    "Video Link": "https://www.youtube.com/watch?v=Tr6MMsoWAog",
    "Language": "English",
    "Video Category": "Personal Experience"
  },
  "description": "The provided image sequence shows an older man in a suit, likely being interviewed or participating in a recorded conversation. He is seated and maintains a consistent, upright posture. Across the frames, his head rotates incrementally towards the camera's right, suggesting he is addressing someone off-screen in that direction. His facial expressions also show subtle shifts, likely related to speaking or reacting. No significant movements of the hands, arms, or torso are observed. Because these are still images, any dynamic motion analysis is limited to inferring likely movements from the subtle positional changes between frames.",
  "dover_scores": 8.9,
  "cotracker_ratio": 0.9271857142448425,
  "head_detail": {
    "scores": {
      "avg_movement": 97.92236052453518,
      "min_movement": 89.4061028957367,
      "avg_rotation": 93.79223716779671,
      "min_rotation": 70.42514759667668,
      "avg_completeness": 100.0,
      "min_completeness": 100.0,
      "avg_resolution": 383.14267156972596,
      "min_resolution": 349.6849455656829,
      "avg_orientation": 80.29047955896623,
      "min_orientation": 73.27433271185937
    }
  }
}
```

### Data Statistics

The dataset is diverse across multiple dimensions:

- **Languages**: English, Chinese, Arabic, Polish, German, Russian, French, Korean, Portuguese, Japanese, Thai, Spanish, Italian, Hindi
- **Age Groups**: 0–19, 19–30, 31–45, 46–60, 60+
- **Video Quality**: HD (1080p) and 4K (2160p) resolution with Dover score (mean ≈ 8.55), Cotracker ratio (mean ≈ 0.92), and head-detail scores concentrated in the 90–100 range
- **Duration Distribution**: Balanced segments of 3–30 seconds, suited to training

## Sample Usage

We provide an easy-to-use inference script for generating talking head videos.

### Environment Setup

```bash
# Create conda environment
conda create -n talkvid python=3.10 -y
conda activate talkvid

# Install dependencies
pip install -r requirements.txt

# Install additional dependencies for video processing
conda install -c conda-forge 'ffmpeg<7' -y
conda install torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
```

### Model Downloads

Before running inference, download the required model checkpoints:

```bash
# Download the model checkpoints
huggingface-cli download tk93/V-Express --local-dir V-Express
mv V-Express/model_ckpts model_ckpts
mv V-Express/*.bin model_ckpts/v-express
rm -rf V-Express/
```

### Quick Inference

Run the inference script from the command line to generate a talking head video.
#### Command Line Usage

```bash
# Single sample inference
bash scripts/inference.sh

# Or run directly with Python
cd src
python src/inference.py \
    --reference_image_path "./test_samples/short_case/tys/ref.jpg" \
    --audio_path "./test_samples/short_case/tys/aud.mp3" \
    --kps_path "./test_samples/short_case/tys/kps.pth" \
    --output_path "./output.mp4" \
    --retarget_strategy "naive_retarget" \
    --num_inference_steps 25 \
    --guidance_scale 3.5 \
    --context_frames 24
```

## Citation

If our work is helpful for your research, please consider giving a star ⭐ and citing our paper 📝

```bibtex
@misc{chen2025talkvidlargescalediversifieddataset,
  title={TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis},
  author={Shunian Chen and Hejin Huang and Yexin Liu and Zihan Ye and Pengcheng Chen and Chenghao Zhu and Michael Guan and Rongsheng Wang and Junying Chen and Guanbin Li and Ser-Nam Lim and Harry Yang and Benyou Wang},
  year={2025},
  eprint={2508.13618},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.13618},
}
```

## License

### Dataset License

The **TalkVid dataset** is released under [Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/), allowing only non-commercial research use.

### Code License

The **source code** is released under [Apache License 2.0](LICENSE), allowing both academic and commercial use with proper attribution.
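
## Example: Filtering Metadata Records (Sketch)

The record layout shown under Data Format lends itself to simple quality filtering. The following is a minimal sketch, not part of the TalkVid codebase: the function names (`clip_seconds`, `keep_record`, `filter_records`) and the threshold values are hypothetical, chosen only for illustration, and the sketch works on in-memory dictionaries because this page does not state how the metadata files are laid out on disk. The sample record is abridged from the Data Format example.

```python
import json

# One record, abridged from the Data Format example in this README.
SAMPLE_RECORD = json.loads("""
{
  "id": "videovideoTr6MMsoWAog-scene1-scene1",
  "fps": 24.0,
  "start-time": 0.1,
  "start-frame": 0,
  "end-time": 5.141666666666667,
  "end-frame": 121,
  "durations": "5.042s",
  "info": {"Language": "English", "Age Group": "60+"},
  "dover_scores": 8.9,
  "cotracker_ratio": 0.9271857142448425
}
""")


def clip_seconds(record):
    """Clip length in seconds, derived from the start/end timestamps."""
    return record["end-time"] - record["start-time"]


def keep_record(record, min_dover=8.5, min_cotracker=0.9, language=None):
    """Return True if a record passes the thresholds.

    The default thresholds are hypothetical illustration values,
    not the ones used by the dataset's curation pipeline.
    """
    if record["dover_scores"] < min_dover:
        return False
    if record["cotracker_ratio"] < min_cotracker:
        return False
    if language is not None and record["info"]["Language"] != language:
        return False
    return True


def filter_records(records, **thresholds):
    """Keep only the records that pass keep_record with the given thresholds."""
    return [r for r in records if keep_record(r, **thresholds)]


if __name__ == "__main__":
    print(round(clip_seconds(SAMPLE_RECORD), 3))
    print(len(filter_records([SAMPLE_RECORD], language="English")))
```

In the sample record, the timestamp difference and the frame span divided by `fps` both agree with the `durations` field to three decimals, so either could serve as a sanity check when loading records.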


Provider:

maas

Created:

2025-08-21
