AVGen-Bench
收藏AVGen-Bench 数据集概述
数据集简介
AVGen-Bench 是一个用于多粒度评估**文本到音视频(T2AV)**生成的任务驱动型基准。
评估框架概述
该基准从三个粒度评估 T2AV 系统:基础单模态质量、跨模态对齐和细粒度语义可控性。
基准对比特点
与先前基准相比,AVGen-Bench 强调联合音视频评估、更丰富的细粒度指标以及现实世界的复杂提示。
主要定量结果
AVGen-Bench 遵循论文的 10 个维度叙事,涵盖视觉/音频质量、同步性、文本/人脸/音乐/语音可控性、物理合理性和整体语义对齐。表格中 AV/Lip 作为互补的同步性测量,Lo-Phy/Hi-Phy 作为互补的物理合理性测量。
| 模型 | 组件 | Vis | Aud (PQ) | AV | Lip | Text | Face | Music | Speech | Lo-Phy | Hi-Phy | Holistic | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Veo 3.1-fast | Veo 3.1-fast | 0.960 | 6.64 | 0.21 | 2.39 | 75.10 | 52.77 | 3.13 | 94.53 | 3.68 | 67.43 | 86.27 | 67.87 |
| Veo 3.1-quality | Veo 3.1-quality | 0.954 | 6.77 | 0.24 | 3.59 | 76.53 | 52.90 | 5.00 | 96.09 | 3.74 | 68.53 | 84.10 | 66.28 |
| Sora-2 | Sora-2 | 0.848 | 5.91 | 0.25 | 4.50 | 74.84 | 51.17 | 7.81 | 88.63 | 4.05 | 78.95 | 88.89 | 64.16 |
| Wan2.6 | Wan2.6 | 0.959 | 7.15 | 0.30 | 4.32 | 76.95 | 49.27 | 1.75 | 89.33 | 3.69 | 66.92 | 80.98 | 62.97 |
| Seedance-1.5 Pro | Seedance-1.5 Pro | 0.970 | 7.48 | 0.26 | 3.43 | 38.28 | 54.42 | 1.88 | 93.45 | 3.72 | 66.88 | 77.38 | 62.55 |
| Kling-V2.6 | Kling-V2.6 | 0.906 | 6.93 | 0.21 | 2.30 | 14.52 | 57.33 | 5.00 | 89.62 | 3.84 | 63.92 | 76.74 | 61.82 |
| LTX-2.3 | LTX-2.3 | 0.858 | 7.11 | 0.36 | 2.00 | 54.17 | 45.06 | 1.38 | 86.66 | 3.99 | 64.31 | 65.22 | 59.97 |
| NanoBanana2 + MOVA | NanoBanana2 MOVA | 0.890 | 6.71 | 0.44 | 2.70 | 68.26 | 41.33 | 0.59 | 82.45 | 3.91 | 60.95 | 72.48 | 58.10 |
| LTX-2 | LTX-2 | 0.828 | 6.84 | 0.23 | 4.76 | 24.76 | 48.53 | 5.75 | 87.07 | 4.05 | 60.20 | 66.59 | 56.62 |
| Emu3.5 + MOVA | Emu3.5 MOVA | 0.911 | 6.80 | 0.38 | 4.83 | 64.72 | 48.44 | 0.62 | 81.74 | 3.89 | 55.85 | 66.55 | 56.12 |
| Wan2.2 + HunyuanVideo-Foley | Wan2.2 HunyuanVideo-Foley | 0.936 | 6.60 | 0.23 | 5.38 | 48.46 | 36.23 | 3.44 | 53.40 | 3.90 | 54.11 | 60.63 | 53.29 |
| Ovi | Ovi | 0.839 | 6.31 | 0.37 | 5.40 | 41.36 | 49.05 | 11.25 | 76.49 | 3.93 | 52.92 | 57.45 | 52.02 |
指标方向说明:Vis、Aud (PQ)、Text、Face、Music、Speech、Lo-Phy、Hi-Phy 和 Holistic 分数越高越好;AV 和 Lip 分数越低越好。 模型排序:按 Total 分数降序排列。粗体标记每项指标的最佳分数,斜体标记次佳分数。橙色标签表示专有组件,蓝色标签表示开源组件。
细粒度评估案例
展示了六个细粒度评估模块的详细工作流程,以及 AVGen-Bench 揭示的代表性失败模式。
失败演示视频
展示了附录 A 中的多模型定性失败案例。每个案例显示原始提示以及 Veo 3.1 Fast、Ovi、LTX-2 和 Kling 2.6 的并排输出。
案例 1:提示文本渲染("Your customers are talking")
原始提示:A single wind-up chattering teeth toy clacks continuously against a solid teal background. The scene cuts to a blue screen displaying the white text "Your customers are talking," abruptly followed by rows of multi-colored chattering teeth toys all moving at once, creating a loud chaotic mechanical clatter. A green screen appears with the text "Are you listening?" before cutting to a generic product logo and a "Try it free" button on a white background as the noise ceases.
案例 2:预告片标题渲染("EIGHTY-SEVEN SECONDS")
原始提示:Four-shot high-tempo teaser with clean sync hits. Shot 1: Inside a bank vault, fluorescent hum and distant alarms; a timer on a device beeps faster as a thief whispers, "Eighty-seven seconds, move." Shot 2: Close-up of a glass cutter scoring a pane with a sharp scratch, then a suction cup pops as the circle lifts free, landing on a bass hit. Shot 3: Smash cut to a getaway car; engine revs, tires chirp, and the car fishtails out of a tight alley with gravel spraying and rattling off the chassis. Shot 4: A final slow-motion shot of a duffel bag hitting the pavement with a heavy thud as sirens surge; the title EIGHTY-SEVEN SECONDS slams onto black with a metallic logo sting.
案例 3:物理合理性(克拉尼板)
原始提示:A top-down view of a black square metal plate sprinkled evenly with fine white sand as a tone generator plays a pure sine wave that sweeps upward in pitch. As the plate begins to vibrate, the rising tone makes the sand suddenly jitter and chatter across the metal, then fall quiet as grains slide into crisp geometric nodal lines that sharpen and rearrange each time the pitch crosses a new resonance.
案例 4:物理合理性(Briggs-Rauscher 反应)
原始提示:A high-speed time-lapse shows a beaker on a magnetic stirrer, the stir plate motor making a steady whir as a stir bar spins. The beaker contains a Briggs-Rauscher mixture (hydrogen peroxide, potassium iodate, malonic acid, and a metal-ion catalyst with starch indicator). While the vortex turns, the liquid repeatedly cycles through several distinct visible states in a rhythmic pattern, switching abruptly and then returning again and again as the stirring continues.
案例 5:语义错位(度假广告)
原始提示:A young boy hits a beach ball as a group of children runs past him and jumps into a swimming pool with loud splashes, while a voiceover states, "We went on vacation with a toe dipper." The camera follows the kids underwater as bubbles roar and feet kick past the lens, and the voiceover finishes, "and left with a cannonballer." Finally, the view resurfaces to show a laughing girl in the water as on-screen text reads "Book your family home now."
案例 6:音乐音高准确性(单音 A4)
原始提示:A zoomed-in tutorial shot of a clean-tone electric guitar fretboard and picking hand. The player frets a single note A4 and plucks it four times with even timing, letting each note ring briefly. The pitch stays stable (no bend, no vibrato), and no other strings ring.
引用
如果觉得 AVGen-Bench 有用,请引用:
@misc{zhou2026avgenbenchtaskdrivenbenchmarkmultigranular, title={AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation}, author={Ziwei Zhou and Zeyuan Lai and Rui Wang and Yifan Yang and Zhen Xing and Yuqing Yang and Qi Dai and Lili Qiu and Chong Luo}, year={2026}, eprint={2604.08540}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2604.08540}, }




