noakraicer/ID-LoRA-TalkVid
Source: Hugging Face · Updated 2026-03-19 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/noakraicer/ID-LoRA-TalkVid
---
configs:
- config_name: default
  drop_labels: true
license: cc-by-4.0
task_categories:
- text-to-video
- text-to-audio
tags:
- audio-video-generation
- speaker-identity
- id-lora
- in-context-lora
- multimodal
- ltx-video
pretty_name: "TalkVid Preprocessed for ID-LoRA"
size_categories:
- 1K<n<10K
---
# TalkVid Preprocessed Dataset for ID-LoRA
Preprocessed training data for **[ID-LoRA](https://github.com/ID-LoRA/ID-LoRA)**: identity-driven audio-video personalization with In-Context LoRA ([paper](https://huggingface.co/papers/2603.10256)).
## Overview
| Property | Value |
|----------|-------|
| Source dataset | TalkVid |
| Training pairs | 5,796 |
| Unique videos | 5,803 |
| Speakers | 600 |
| Resolution | Original (1080p/4K); latents computed at 512×512 |
| Frame rate | 25 fps |
| Frames per clip | 121 (~4.84 s) |
## Quick Start
### Browse and load data
`load_dataset` returns training pairs with playable video objects and metadata:
```python
from datasets import load_dataset
ds = load_dataset("noakraicer/ID-LoRA-TalkVid")
print(ds["train"].column_names) # ['target', 'reference', 'speaker_cluster_id', 'target_caption']
print(ds["train"][0]["target_caption"])
```
> **Note:** Loading videos requires the `decord` package (`pip install decord`).
### Download for training
```python
from huggingface_hub import snapshot_download
# Full dataset (videos + precomputed latents)
snapshot_download("noakraicer/ID-LoRA-TalkVid", repo_type="dataset", local_dir="./data/talkvid")

# Precomputed latents only (skip videos)
snapshot_download(
    "noakraicer/ID-LoRA-TalkVid",
    repo_type="dataset",
    allow_patterns=["precomputed/**", "train/metadata.jsonl"],
    local_dir="./data/talkvid",
)
```
### Extract precomputed latents
After downloading, extract the `.tar.zst` archives:
```bash
cd ./data/talkvid/precomputed
for f in *.tar.zst; do
  tar --use-compress-program=unzstd -xf "$f"
done
```
This creates per-video `.pt` files in `latents/`, `audio_latents/`, `audio_latents_clean/`, and `conditions/`.
## Dataset Structure
```
ID-LoRA-TalkVid/
├── train/
│   ├── metadata.jsonl              # Pair metadata (5,796 rows)
│   └── {video_id}.mp4              # 5,803 unique video clips
└── precomputed/
    ├── latents.tar.zst             # Video VAE latents
    ├── audio_latents.tar.zst       # Audio VAE latents (target)
    ├── audio_latents_clean.tar.zst # Denoised audio VAE latents (reference)
    └── conditions.tar.zst          # Text + caption embeddings (Gemma 3)
```
Video filenames follow the pattern `{video_id}.mp4`. Latent filenames follow `{video_id}.pt`.
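
The naming pattern above can be turned into a small lookup. The following is a minimal sketch, assuming the dataset was downloaded to `./data/talkvid` and the archives were extracted in place; `latent_paths` is a hypothetical helper, not part of the ID-LoRA code.

```python
from pathlib import Path

# Directory names as produced by the extraction step; the root location is
# an assumption that matches the local_dir used in the download example.
LATENT_DIRS = ("latents", "audio_latents", "audio_latents_clean", "conditions")


def latent_paths(video_id: str, root: str = "./data/talkvid") -> dict[str, Path]:
    """Return the expected file locations for one clip (hypothetical helper)."""
    base = Path(root)
    paths = {"video": base / "train" / f"{video_id}.mp4"}
    for name in LATENT_DIRS:
        paths[name] = base / "precomputed" / name / f"{video_id}.pt"
    return paths


# "example_id" is a placeholder, not a real video id from the dataset.
for kind, path in latent_paths("example_id").items():
    print(f"{kind:>20}: {path} (exists: {path.exists()})")
```

Whether every clip has a file in all four directories is not stated here, so checking `path.exists()` before loading is a reasonable precaution.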
## Metadata Columns
Each row in `train/metadata.jsonl` represents one training pair. When loaded via `load_dataset`, the `*_file_name` columns are resolved to video objects, so the loaded dataset exposes the following columns:
| Column | Type | Description |
|--------|------|-------------|
| `target` | Video | Target video clip |
| `reference` | Video | Reference video clip (same speaker, different clip) |
| `speaker_cluster_id` | string | Speaker identity cluster |
| `target_caption` | string | Structured caption with `[VISUAL]`, `[SPEECH]`, `[SOUNDS]`, `[TEXT]` sections |
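
For training pipelines that read the raw metadata file rather than going through `load_dataset`, the pairs can be grouped by speaker with the standard library alone. This is a minimal sketch: the raw column names `target_file_name` and `reference_file_name` are an assumption inferred from the `*_file_name` pattern and the loaded column names, and `pairs_by_speaker` is a hypothetical helper.

```python
import json
from collections import defaultdict
from pathlib import Path


def pairs_by_speaker(metadata_path):
    """Group (target, reference) file names by speaker_cluster_id."""
    groups = defaultdict(list)
    with Path(metadata_path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            # Assumed raw column names; adjust if the file differs.
            pair = (row["target_file_name"], row["reference_file_name"])
            groups[row["speaker_cluster_id"]].append(pair)
    return dict(groups)


if __name__ == "__main__":
    path = Path("./data/talkvid/train/metadata.jsonl")
    if path.exists():
        groups = pairs_by_speaker(path)
        print(len(groups), "speakers;", sum(map(len, groups.values())), "pairs")
```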
## Precomputed Latents
Ready-to-train representations stored as `.tar.zst` archives, each containing per-video `.pt` files.
| Archive | Description |
|---------|-------------|
| `latents.tar.zst` | Video VAE latents |
| `audio_latents.tar.zst` | Audio VAE latents — **target** audio (original recording, including environmental sounds) |
| `audio_latents_clean.tar.zst` | Audio VAE latents — **reference** audio (denoised speech, used for IC conditioning) |
| `conditions.tar.zst` | Text embeddings from Gemma 3, computed from the full structured caption |
### Audio Latent Types
Training uses two types of audio latents:
- **`audio_latents`** (target): Audio encoded from the original video. This is what the model learns to generate.
- **`audio_latents_clean`** (reference): Denoised audio with environmental sounds removed, used as in-context conditioning. Clean audio helps the model focus on speaker identity rather than background noise.
See the [training code](https://github.com/ID-LoRA/ID-LoRA) for how reference audio latents are mapped to training pairs.
### Latent File Format
Each `.pt` file is a dict:
- **Video latents**: `latents` (bf16 `[C, F, H, W]`), `num_frames`, `height`, `width`, `fps`
- **Audio latents**: `latents` (fp32 `[C, T, F]`), `num_time_steps`, `frequency_bins`, `duration`
- **Conditions**: `prompt_embeds` (bf16 `[seq_len, hidden]`), `prompt_attention_mask`
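
A quick sanity check against the keys listed above can catch a truncated or mismatched file before training starts. The sketch below works on an already-loaded dict and uses only the standard library; `missing_keys` is a hypothetical helper, and the sample record is a placeholder rather than real data.

```python
# Keys listed in this section for each kind of .pt file.
EXPECTED_KEYS = {
    "video": {"latents", "num_frames", "height", "width", "fps"},
    "audio": {"latents", "num_time_steps", "frequency_bins", "duration"},
    "conditions": {"prompt_embeds", "prompt_attention_mask"},
}


def missing_keys(record: dict, kind: str) -> set:
    """Return the documented keys that are absent from a loaded .pt dict."""
    return EXPECTED_KEYS[kind] - set(record)


# Stand-in for a loaded audio latent file; the values are placeholders.
fake_audio = {"latents": None, "num_time_steps": 0, "frequency_bins": 0}
print(missing_keys(fake_audio, "audio"))  # {'duration'}
```

With PyTorch installed, the record would typically come from `torch.load(path, map_location="cpu")`; the files may contain keys beyond those documented here, which this check deliberately ignores.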
## Caption Format
Each caption follows a structured format with four tagged sections:
```
[VISUAL]: <scene description, people, actions, setting>
[SPEECH]: <word-for-word speech transcription>
[SOUNDS]: <speaker vocal style, environmental sounds>
[TEXT]: <on-screen text or "None">
```
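
Because the tags are fixed, a caption can be split back into its four sections with a regular expression. This is a minimal sketch, assuming each tag appears as `[TAG]:` exactly as in the template above; `parse_caption` is a hypothetical helper and the sample caption is made up for illustration.

```python
import re

SECTION_RE = re.compile(r"\[(VISUAL|SPEECH|SOUNDS|TEXT)\]:\s*")


def parse_caption(caption: str) -> dict:
    """Split a structured caption into {tag: text}."""
    parts = SECTION_RE.split(caption)
    # With one capture group, re.split yields
    # [prefix, tag1, text1, tag2, text2, ...].
    return {tag: text.strip() for tag, text in zip(parts[1::2], parts[2::2])}


# Made-up caption for illustration only; not taken from the dataset.
example = (
    "[VISUAL]: A woman speaks to the camera in an office. "
    "[SPEECH]: Thanks for joining us today. "
    "[SOUNDS]: Calm female voice, faint keyboard clicks. "
    "[TEXT]: None"
)
sections = parse_caption(example)
print(sections["SPEECH"])  # Thanks for joining us today.
```

The split works whether sections are separated by spaces or newlines. A transcript that itself contains a literal tag such as `[TEXT]:` would be split incorrectly, so treat the result as best-effort.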
## Training
This dataset is designed for training [ID-LoRA](https://github.com/ID-LoRA/ID-LoRA) adapters built on [LTX-2](https://github.com/Lightricks/LTX-Video).
Each pair consists of a target video and a reference video from the **same speaker**.
The model jointly generates audio and video, transferring the speaker's voice identity from the reference
while the visual appearance is controlled via first-frame conditioning and the text prompt.
See the [paper](https://huggingface.co/papers/2603.10256) and
[code](https://github.com/ID-LoRA/ID-LoRA) for training details.
## Citation
```bibtex
@misc{dahan2026idloraidentitydrivenaudiovideopersonalization,
  title         = {ID-LoRA: Identity-Driven Audio-Video Personalization
                   with In-Context LoRA},
  author        = {Aviad Dahan and Moran Yanuka and Noa Kraicer and Lior Wolf and Raja Giryes},
  year          = {2026},
  eprint        = {2603.10256},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2603.10256}
}
```
Provided by:
noakraicer



