HuMoSet
Source: ModelScope community. Updated: 2026-05-16. Indexed: 2025-12-27.
Download link:
https://modelscope.cn/datasets/leoniuschen/HuMoSet
<div align="center">
<h1> HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning </h1>
<a href="https://arxiv.org/abs/2509.08519"><img src="https://img.shields.io/badge/arXiv%20paper-2509.08519-b31b1b.svg"></a>
<a href="https://phantom-video.github.io/HuMo/"><img src="https://img.shields.io/badge/Project_page-More_visualizations-green"></a>
<a href="https://modelscope.cn/datasets/leoniuschen/HuMoSet"><img src="https://img.shields.io/badge/Dataset-Download-red?logo=googlechrome&logoColor=red"></a>
<a href="https://huggingface.co/bytedance-research/HuMo"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>
<a href='https://openbayes.com/console/public/tutorials/KhniTI5hwrf'><img src='https://img.shields.io/badge/Live Playground-OpenBayes贝式计算-blue'></a>
[Liyang Chen](https://scholar.google.com/citations?user=jk6jWXgAAAAJ&hl)<sup> * </sup>, [Tianxiang Ma](https://tianxiangma.github.io/)<sup> * </sup>, [Jiawei Liu](https://scholar.google.com/citations?user=X21Fz-EAAAAJ), [Bingchuan Li](https://scholar.google.com/citations?user=ac5Se6QAAAAJ)<sup> † </sup>, <br>[Zhuowei Chen](https://scholar.google.com/citations?user=ow1jGJkAAAAJ), [Lijie Liu](https://liulj13.github.io/), [Xu He](https://scholar.google.com/citations?user=KMrFk2MAAAAJ&hl), [Gen Li](https://scholar.google.com/citations?user=wqA7EIoAAAAJ), [Qian He](https://scholar.google.com/citations?user=9rWWCgUAAAAJ), [Zhiyong Wu](https://scholar.google.com/citations?user=7Xl6KdkAAAAJ)<sup> § </sup><br>
<sup> * </sup>Equal contribution, <sup> † </sup>Project lead, <sup> § </sup>Corresponding author
Tsinghua University | Intelligent Creation Team, ByteDance
</div>
## Introduction
HuMoSet is a comprehensive human-centric video dataset containing about 670,000 video samples. It is designed to advance research in controllable video generation.
### Key Features
- Diverse Reference Images: For every video sample, we provide a corresponding reference image featuring the same identity (ID) but with distinct variations in clothing, accessories, background, and hairstyle. This diversity is crucial for robust identity preservation training.
- Dense Video Descriptions: We utilize Qwen2.5-VL to generate dense, high-quality descriptive captions for each video, enabling fine-grained text-to-video capabilities.
- Audio-Visual Synchronization: All video samples are strictly processed to ensure perfect synchronization between audio and visual tracks.
- Open Source Origin: All videos and reference images are curated exclusively from open-source datasets (such as **[OpenHumanVid](https://github.com/fudan-generative-vision/OpenHumanVid)**). No internal or proprietary company data is included.
### Potential Applications
By leveraging HuMoSet on top of existing video foundation models, researchers and developers can explore a wide range of applications, including but not limited to:
1. **Talking Human Models:** Training highly realistic talking head generation systems.
2. **Multimodal Control:** Developing models like **[HuMo](https://github.com/Phantom-video/HuMo)** with precise multimodal conditional control capabilities, supporting inputs such as **text, reference images, and audio**.
3. **Customized Video Generation:** Creating advanced generative models (e.g., **[Sora 2-level capabilities](https://openai.com/index/sora-2)**) that support customized identity and voice preservation.
## Demonstration
The reference image of the person in the video is displayed in the top-left corner, while the video description is shown below the video.
<table class="center">
<!-- Row 1 -->
<tr>
<td width=25% style="border: none">
<video src="asset/video/000a522f92a96fc3126ead73376d2092.mp4" controls width="100%"></video>
</td>
<td width=25% style="border: none">
<video src="asset/video/001955692ad769e927008d0b9d24ca14.mp4" controls width="100%"></video>
</td>
<td width=25% style="border: none">
<video src="asset/video/00462dcb946f63dd46de095717e4d0d1.mp4" controls width="100%"></video>
</td>
<td width=25% style="border: none">
<video src="asset/video/0059276f0359e11345a018afd153fd36.mp4" controls width="100%"></video>
</td>
</tr>
<tr style="text-align: center;">
<td width=25% style="border: none">A middle-aged man with short, graying hair sits upright in a dimly lit home setting, facing the camera. He wears a purple-and-white plaid shirt, remains mostly still, and speaks with a serious, concerned expression.</td>
<td width=25% style="border: none">In an office-like setting, a blonde woman in a black leather jacket faces a man in a dark suit seen from behind. She remains still, maintains eye contact, and displays a serious, focused expression, suggesting determination.</td>
<td width=25% style="border: none">Against a gray stone wall, a woman in a tan military uniform stands upright, speaking with a serious, focused expression. A similarly dressed man stands behind her holding a rifle, remaining still and attentive.</td>
<td width=25% style="border: none">In a dimly lit office with bookshelves, a man wearing glasses and a vest sits facing a woman, holding and gesturing with a plaid shirt as he speaks earnestly. The woman, mostly still and seen from the side, listens attentively.</td>
</tr>
<!-- Row 2 -->
<tr>
<td width=25% style="border: none">
<video src="asset/video/0071e4a8b4028b46216bb97c2ef11265.mp4" controls width="100%"></video>
</td>
<td width=25% style="border: none">
<video src="asset/video/00a0ccf45b30ae435d8af62e1389ea51.mp4" controls width="100%"></video>
</td>
<td width=25% style="border: none">
<video src="asset/video/00a1bc0299048596f13b361afb3fc7f5.mp4" controls width="100%"></video>
</td>
<td width=25% style="border: none">
<video src="asset/video/00a9e945550e7141aaa1b2f04454e96e.mp4" controls width="100%"></video>
</td>
</tr>
<tr style="text-align: center;">
<td width=25% style="border: none">In a wood-paneled office, a man in a tweed jacket and tie sits upright and speaks with a serious, thoughtful expression to someone in a dark suit seen from behind.</td>
<td width=25% style="border: none">Outdoors in front of a brick house, a red-haired woman wearing gardening gloves holds pruning shears and faces the camera, appearing focused as she explains something.</td>
<td width=25% style="border: none">In a store or office setting, a man in a maroon sweater sits facing another person, maintaining steady eye contact with a neutral, slightly focused expression while the other listens from off-camera.</td>
<td width=25% style="border: none">In a dim, bluish environment, a young boy in a red jacket leans against a large marine creature. He opens his eyes and shifts from calm to concerned, showing fear and vulnerability as the creature gently rests a hand on his shoulder in comfort.</td>
</tr>
</table>
## Download
You can download the dataset from ModelScope, either with the ModelScope CLI or by cloning the repository with Git:
```bash
# Option 1: Using ModelScope. Much faster for users in the Chinese Mainland
# The quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "modelscope[framework]"
modelscope download --dataset leoniuschen/HuMoSet --local_dir ./HuMoSet
# Option 2: Using Git
git lfs install
git clone https://modelscope.cn/datasets/leoniuschen/HuMoSet.git
```
Dataset Structure:
- `video/`: This folder contains the target video files.
- `reference_image/`: This folder stores the corresponding reference image for each video.
- `video_caption.parquet`: A metadata file containing the dense descriptions for all videos.
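The layout above can be explored with a short script. The following is a minimal sketch, not the authors' tooling: it assumes the dataset was downloaded to `./HuMoSet`, and it assumes that a video and its reference image share the same filename stem, which this README does not confirm. The column names of `video_caption.parquet` are not documented here either, so the sketch only prints them. The helper names `index_by_stem`, `pair_videos_with_references`, and `load_captions` are hypothetical.

```python
from pathlib import Path


def index_by_stem(folder: Path) -> dict[str, Path]:
    """Map each file's name without its extension to the file's path."""
    return {p.stem: p for p in sorted(folder.iterdir()) if p.is_file()}


def pair_videos_with_references(root: Path) -> list[tuple[Path, Path]]:
    """Pair files in video/ with files in reference_image/ that share a stem.

    ASSUMPTION: matching by filename stem is a guess, not documented behavior.
    """
    videos = index_by_stem(root / "video")
    references = index_by_stem(root / "reference_image")
    shared = sorted(videos.keys() & references.keys())
    return [(videos[stem], references[stem]) for stem in shared]


def load_captions(root: Path):
    """Load the caption metadata; needs pandas plus pyarrow or fastparquet."""
    import pandas as pd  # lazy import keeps the pairing helper stdlib-only

    return pd.read_parquet(root / "video_caption.parquet")


if __name__ == "__main__":
    root = Path("./HuMoSet")  # assumed download location
    pairs = pair_videos_with_references(root)
    print(f"{len(pairs)} video/reference pairs found")
    captions = load_captions(root)
    # Inspect the real schema before relying on any column name.
    print(captions.columns.tolist())
```

If the two folders turn out to contain nested subdirectories, or to use a different naming scheme, `index_by_stem` would need to be adapted (for example by using `Path.rglob`).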
## Acknowledgements
Our work builds upon and is greatly inspired by several outstanding open-source projects, including [Wan2.1](https://github.com/Wan-Video/Wan2.1), [Phantom](https://github.com/Phantom-video/Phantom), [SeedVR](https://github.com/IceClear/SeedVR?tab=readme-ov-file), [MEMO](https://github.com/memoavatar/memo), [Hallo3](https://github.com/fudan-generative-vision/hallo3), [OpenHumanVid](https://github.com/fudan-generative-vision/OpenHumanVid), [OpenS2V-Nexus](https://github.com/PKU-YuanGroup/OpenS2V-Nexus), [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), [Qwen2.5-VL](https://arxiv.org/abs/2502.13923) and [Whisper](https://github.com/openai/whisper). We sincerely thank the authors and contributors of these projects for generously sharing their excellent code and ideas.
## ⭐ Citation
If HuMo is helpful, please consider giving the [repo](https://github.com/Phantom-video/HuMo) a ⭐.
If you find this project useful for your research, please consider citing our [paper](https://arxiv.org/abs/2509.08519).
### BibTeX
```bibtex
@misc{chen2025humo,
title={HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning},
author={Liyang Chen and Tianxiang Ma and Jiawei Liu and Bingchuan Li and Zhuowei Chen and Lijie Liu and Xu He and Gen Li and Qian He and Zhiyong Wu},
year={2025},
eprint={2509.08519},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.08519},
}
```
## License of HuMoSet
The video samples are collected from publicly available datasets. Users must comply with [the license](./LICENSE) when using these video samples.
## 📧 Contact
If you have any comments or questions regarding this open-source project, please open a new issue or contact [Liyang Chen](https://leoniuschen.github.io/) and [Tianxiang Ma](https://tianxiangma.github.io/).
Provider: maas
Created: 2025-12-19



