noakraicer/ID-LoRA-TalkVid

Name: noakraicer/ID-LoRA-TalkVid
Creator: noakraicer
Published: 2026-03-19 16:39:52
License: not specified on this listing (the dataset card below declares `cc-by-4.0`)

Source: Hugging Face | Updated: 2026-03-19 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/noakraicer/ID-LoRA-TalkVid

Resource description:

---
configs:
- config_name: default
  drop_labels: true
license: cc-by-4.0
task_categories:
- text-to-video
- text-to-audio
tags:
- audio-video-generation
- speaker-identity
- id-lora
- in-context-lora
- multimodal
- ltx-video
pretty_name: "TalkVid Preprocessed for ID-LoRA"
size_categories:
- 1K<n<10K
---

# TalkVid Preprocessed Dataset for ID-LoRA

Preprocessed training data for **[ID-LoRA](https://github.com/ID-LoRA/ID-LoRA)**: identity-driven audio-video personalization with In-Context LoRA ([paper](https://huggingface.co/papers/2603.10256)).

## Overview

| Property | Value |
|----------|-------|
| Source dataset | TalkVid |
| Training pairs | 5,796 |
| Unique videos | 5,803 |
| Speakers | 600 |
| Resolution | Original (1080p/4K); latents computed at 512×512 |
| Frame rate | 25 fps |
| Frames per clip | 121 (~4.84 s) |

## Quick Start

### Browse and load data

`load_dataset` returns training pairs with playable video objects and metadata:

```python
from datasets import load_dataset

ds = load_dataset("noakraicer/ID-LoRA-TalkVid")
print(ds["train"].column_names)
# ['target', 'reference', 'speaker_cluster_id', 'target_caption']
print(ds["train"][0]["target_caption"])
```

> **Note:** Loading videos requires the `decord` package (`pip install decord`).

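
The `target_caption` string printed above is a structured caption with `[VISUAL]`, `[SPEECH]`, `[SOUNDS]`, and `[TEXT]` sections. The following is a minimal sketch of one way to split it into a dictionary. `parse_caption` is a hypothetical helper, not part of the dataset or the ID-LoRA code, and it assumes that each tag appears at most once and that the tag strings do not occur inside the section text. The example caption is invented for illustration and is not taken from the dataset.

```python
import re

# Tags named in the dataset card's caption format.
CAPTION_TAGS = ("VISUAL", "SPEECH", "SOUNDS", "TEXT")

# Matches "[TAG]" followed by an optional colon, for the known tags only.
_TAG_RE = re.compile(r"\[(%s)\]:?" % "|".join(CAPTION_TAGS))


def parse_caption(caption: str) -> dict[str, str]:
    """Split a structured caption into {tag: text} (hypothetical helper)."""
    sections: dict[str, str] = {}
    matches = list(_TAG_RE.finditer(caption))
    for i, match in enumerate(matches):
        # A section runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(caption)
        sections[match.group(1)] = caption[match.end():end].strip()
    return sections


# Invented caption, used only to exercise the parser.
example = (
    "[VISUAL]: A woman speaks to the camera in an office. "
    "[SPEECH]: Welcome back to the channel. "
    "[SOUNDS]: Calm female voice, faint keyboard clicks. "
    "[TEXT]: None"
)
parsed = parse_caption(example)
print(parsed["SPEECH"])  # Welcome back to the channel.
```

With the dataset loaded, the same helper could be applied as `parse_caption(ds["train"][0]["target_caption"])`, provided the real captions follow the documented layout.
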
### Download for training

```python
from huggingface_hub import snapshot_download

# Full dataset (videos + precomputed latents)
snapshot_download("noakraicer/ID-LoRA-TalkVid", repo_type="dataset", local_dir="./data/talkvid")

# Precomputed latents only (skip videos)
snapshot_download(
    "noakraicer/ID-LoRA-TalkVid",
    repo_type="dataset",
    allow_patterns=["precomputed/**", "train/metadata.jsonl"],
    local_dir="./data/talkvid",
)
```

### Extract precomputed latents

After downloading, extract the `.tar.zst` archives:

```bash
cd ./data/talkvid/precomputed
for f in *.tar.zst; do
  tar --use-compress-program=unzstd -xf "$f"
done
```

This creates per-video `.pt` files in `latents/`, `audio_latents/`, `audio_latents_clean/`, and `conditions/`.

## Dataset Structure

```
ID-LoRA-TalkVid/
├── train/
│   ├── metadata.jsonl               # Pair metadata (5,796 rows)
│   └── {video_id}.mp4               # 5,803 unique video clips
└── precomputed/
    ├── latents.tar.zst              # Video VAE latents
    ├── audio_latents.tar.zst        # Audio VAE latents (target)
    ├── audio_latents_clean.tar.zst  # Denoised audio VAE latents (reference)
    └── conditions.tar.zst           # Text + caption embeddings (Gemma 3)
```

Video filenames follow the pattern `{video_id}.mp4`. Latent filenames follow `{video_id}.pt`.

## Metadata Columns

Each row in `train/metadata.jsonl` represents one training pair. When loaded via `load_dataset`, the `*_file_name` columns are resolved to video objects:

| Column | Type | Description |
|--------|------|-------------|
| `target` | Video | Target video clip |
| `reference` | Video | Reference video clip (same speaker, different clip) |
| `speaker_cluster_id` | string | Speaker identity cluster |
| `target_caption` | string | Structured caption with `[VISUAL]`, `[SPEECH]`, `[SOUNDS]`, `[TEXT]` sections |

## Precomputed Latents

Ready-to-train representations stored as `.tar.zst` archives, each containing per-video `.pt` files.


| Archive | Description |
|---------|-------------|
| `latents.tar.zst` | Video VAE latents |
| `audio_latents.tar.zst` | Audio VAE latents: **target** audio (original recording, including environmental sounds) |
| `audio_latents_clean.tar.zst` | Audio VAE latents: **reference** audio (denoised speech, used for IC conditioning) |
| `conditions.tar.zst` | Text embeddings from Gemma 3, computed from the full structured caption |

### Audio Latent Types

Training uses two types of audio latents:

- **`audio_latents`** (target): Audio encoded from the original video. This is what the model learns to generate.
- **`audio_latents_clean`** (reference): Denoised audio with environmental sounds removed, used as in-context conditioning. Clean audio helps the model focus on speaker identity rather than background noise.

See the [training code](https://github.com/ID-LoRA/ID-LoRA) for how reference audio latents are mapped to training pairs.

### Latent File Format

Each `.pt` file is a dict:

- **Video latents**: `latents` (bf16 `[C, F, H, W]`), `num_frames`, `height`, `width`, `fps`
- **Audio latents**: `latents` (fp32 `[C, T, F]`), `num_time_steps`, `frequency_bins`, `duration`
- **Conditions**: `prompt_embeds` (bf16 `[seq_len, hidden]`), `prompt_attention_mask`

## Caption Format

Each caption follows a structured format with four tagged sections:

```
[VISUAL]: <scene description, people, actions, setting>
[SPEECH]: <word-for-word speech transcription>
[SOUNDS]: <speaker vocal style, environmental sounds>
[TEXT]: <on-screen text or "None">
```

## Training

This dataset is designed for training [ID-LoRA](https://github.com/ID-LoRA/ID-LoRA) adapters built on [LTX-2](https://github.com/Lightricks/LTX-Video). Each pair consists of a target video and a reference video from the **same speaker**.

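
Before training, it can be useful to confirm that an extracted `.pt` file carries the keys listed under Latent File Format. The sketch below is one way to do that. `EXPECTED_KEYS` and `missing_keys` are hypothetical names introduced here, not part of the ID-LoRA code. To stay runnable without PyTorch, it checks a plain dict whose values are placeholders rather than real dataset contents.

```python
# Keys listed under "Latent File Format" in the dataset card.
EXPECTED_KEYS = {
    "video": ("latents", "num_frames", "height", "width", "fps"),
    "audio": ("latents", "num_time_steps", "frequency_bins", "duration"),
    "conditions": ("prompt_embeds", "prompt_attention_mask"),
}


def missing_keys(record: dict, kind: str) -> list[str]:
    """Return the documented keys that are absent from a loaded .pt dict."""
    return [key for key in EXPECTED_KEYS[kind] if key not in record]


# Stand-in for a loaded video latent file; the values are placeholders only.
fake_video_record = {
    "latents": None,
    "num_frames": 121,
    "height": 512,
    "width": 512,
    "fps": 25,
}
print(missing_keys(fake_video_record, "video"))  # []
print(missing_keys(fake_video_record, "audio"))  # ['num_time_steps', 'frequency_bins', 'duration']
```

With PyTorch installed, a real record would typically come from `torch.load(path, map_location="cpu")`, where `path` points at a `{video_id}.pt` file in one of the extracted directories. The exact location depends on where the archives were unpacked.
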
The model jointly generates audio and video, transferring the speaker's voice identity from the reference while the visual appearance is controlled via first-frame conditioning and the text prompt.

See the [paper](https://huggingface.co/papers/2603.10256) and [code](https://github.com/ID-LoRA/ID-LoRA) for training details.

## Citation

```bibtex
@misc{dahan2026idloraidentitydrivenaudiovideopersonalization,
  title         = {ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA},
  author        = {Aviad Dahan and Moran Yanuka and Noa Kraicer and Lior Wolf and Raja Giryes},
  year          = {2026},
  eprint        = {2603.10256},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2603.10256}
}
```

Provider:

noakraicer
