AVGen-Bench

Name: AVGen-Bench
Creator: Fudan University; University of Science and Technology of China; Microsoft Research Asia
Published: 2026-04-10 01:59:39
License: No description available

arXiv; updated 2026-04-10; indexed 2026-04-11

Download link:

http://aka.ms/avgenbench

Resource overview:

AVGen-Bench is a text-audio-visual generation evaluation benchmark jointly constructed by Microsoft Research Asia and other institutions. It includes 235 meticulously designed cross-scenario prompts, covering 11 subcategories across three domains: professional media production, creator economy, and physical world simulation. The dataset adopts a task-driven construction strategy, where candidate prompts are generated via GPT-5.2 and then manually filtered. On average, each prompt contains 88.54 tokens, with 44% involving speech synthesis and 88% including environmental sound effects. Its innovation lies in decoupling prompt design from its corresponding evaluation metrics, focusing on fine-grained semantic alignment capabilities such as musical pitch control and physical law simulation, thus providing a standardized test framework for the semantic controllability of multimodal generation models.

Provided by:

Fudan University; University of Science and Technology of China; Microsoft Research Asia

Created:

2026-04-10

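Summary of original information

AVGen-Bench dataset overview

Dataset introduction

AVGen-Bench is a task-driven benchmark for multi-granular evaluation of **text-to-audio-video (T2AV)** generation.

Evaluation framework overview

The benchmark evaluates T2AV systems at three granularities: basic uni-modal quality, cross-modal alignment, and fine-grained semantic controllability.

Comparison with prior benchmarks

Compared with prior benchmarks, AVGen-Bench emphasizes joint audio-video evaluation, richer fine-grained metrics, and complex real-world prompts.

Main quantitative results

AVGen-Bench follows the paper's 10-dimension framing, covering visual/audio quality, synchronization, text/face/music/speech controllability, physical plausibility, and holistic semantic alignment. In the table, AV and Lip are complementary synchronization measures, and Lo-Phy and Hi-Phy are complementary physical plausibility measures.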

Model	Components	Vis	Aud (PQ)	AV	Lip	Text	Face	Music	Speech	Lo-Phy	Hi-Phy	Holistic	Total
Veo 3.1-fast	Veo 3.1-fast	0.960	6.64	0.21	2.39	75.10	52.77	3.13	94.53	3.68	67.43	86.27	67.87
Veo 3.1-quality	Veo 3.1-quality	0.954	6.77	0.24	3.59	76.53	52.90	5.00	96.09	3.74	68.53	84.10	66.28
Sora-2	Sora-2	0.848	5.91	0.25	4.50	74.84	51.17	7.81	88.63	4.05	78.95	88.89	64.16
Wan2.6	Wan2.6	0.959	7.15	0.30	4.32	76.95	49.27	1.75	89.33	3.69	66.92	80.98	62.97
Seedance-1.5 Pro	Seedance-1.5 Pro	0.970	7.48	0.26	3.43	38.28	54.42	1.88	93.45	3.72	66.88	77.38	62.55
Kling-V2.6	Kling-V2.6	0.906	6.93	0.21	2.30	14.52	57.33	5.00	89.62	3.84	63.92	76.74	61.82
LTX-2.3	LTX-2.3	0.858	7.11	0.36	2.00	54.17	45.06	1.38	86.66	3.99	64.31	65.22	59.97
NanoBanana2 + MOVA	NanoBanana2 MOVA	0.890	6.71	0.44	2.70	68.26	41.33	0.59	82.45	3.91	60.95	72.48	58.10
LTX-2	LTX-2	0.828	6.84	0.23	4.76	24.76	48.53	5.75	87.07	4.05	60.20	66.59	56.62
Emu3.5 + MOVA	Emu3.5 MOVA	0.911	6.80	0.38	4.83	64.72	48.44	0.62	81.74	3.89	55.85	66.55	56.12
Wan2.2 + HunyuanVideo-Foley	Wan2.2 HunyuanVideo-Foley	0.936	6.60	0.23	5.38	48.46	36.23	3.44	53.40	3.90	54.11	60.63	53.29
Ovi	Ovi	0.839	6.31	0.37	5.40	41.36	49.05	11.25	76.49	3.93	52.92	57.45	52.02

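Metric directions: higher is better for Vis, Aud (PQ), Text, Face, Music, Speech, Lo-Phy, Hi-Phy, and Holistic; lower is better for AV and Lip. Model ordering: rows are sorted by Total in descending order. In the original table, bold marks the best score for each metric and italics mark the second best; orange tags denote proprietary components and blue tags denote open-source components.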
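The direction rule matters when reading the table: the best value in a column is the maximum for most metrics but the minimum for AV and Lip. The following is a minimal Python sketch of that rule, using three rows copied from the table. The helper name `best_model` is hypothetical and is not part of any released benchmark code; treating Total as higher-is-better is an assumption based on the descending sort order.

```python
# Three rows copied from the results table (subset of columns).
ROWS = {
    "Veo 3.1-fast": {"Vis": 0.960, "AV": 0.21, "Lip": 2.39, "Music": 3.13, "Total": 67.87},
    "Sora-2": {"Vis": 0.848, "AV": 0.25, "Lip": 4.50, "Music": 7.81, "Total": 64.16},
    "Ovi": {"Vis": 0.839, "AV": 0.37, "Lip": 5.40, "Music": 11.25, "Total": 52.02},
}

# Per the direction note: AV and Lip are lower-is-better.
# Every other column here is treated as higher-is-better (assumption for Total).
LOWER_IS_BETTER = {"AV", "Lip"}


def best_model(rows, metric):
    """Return the model with the best score on `metric`, respecting its direction."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(rows, key=lambda name: rows[name][metric])


if __name__ == "__main__":
    for metric in ("Vis", "AV", "Lip", "Music", "Total"):
        print(metric, "->", best_model(ROWS, metric))
```

Among these three rows, Ovi has the highest Music score despite the lowest Total, which illustrates why per-metric rankings and the overall ranking should be read separately.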

Fine-grained evaluation cases

Shows the detailed workflows of the six fine-grained evaluation modules, along with representative failure modes revealed by AVGen-Bench.

Failure demo videos

Shows the multi-model qualitative failure cases from Appendix A. Each case shows the original prompt alongside side-by-side outputs from Veo 3.1 Fast, Ovi, LTX-2, and Kling 2.6.

Case 1: Prompted text rendering ("Your customers are talking")

Original prompt: A single wind-up chattering teeth toy clacks continuously against a solid teal background. The scene cuts to a blue screen displaying the white text "Your customers are talking," abruptly followed by rows of multi-colored chattering teeth toys all moving at once, creating a loud chaotic mechanical clatter. A green screen appears with the text "Are you listening?" before cutting to a generic product logo and a "Try it free" button on a white background as the noise ceases.
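One way to picture a text-rendering check for a prompt like this is to pull the quoted phrases out of the prompt and compare them with whatever an OCR system reads from the frames. The sketch below is an illustration only: the helper names, the normalization, and the similarity measure (`difflib.SequenceMatcher`) are assumptions, not the benchmark's actual scoring rule, and the OCR lines are made-up stand-ins.

```python
import re
from difflib import SequenceMatcher


def quoted_phrases(prompt):
    """Extract double-quoted phrases from a prompt as the expected on-screen text."""
    return [m.strip() for m in re.findall(r'"([^"]+)"', prompt)]


def normalize(text):
    """Lowercase and drop punctuation so OCR casing/punctuation noise is ignored."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def best_match(expected, ocr_lines):
    """Highest similarity ratio in [0, 1] between `expected` and any OCR line."""
    target = normalize(expected)
    return max(
        (SequenceMatcher(None, target, normalize(line)).ratio() for line in ocr_lines),
        default=0.0,  # no OCR output at all counts as a miss
    )


if __name__ == "__main__":
    # Shortened excerpt of the Case 1 prompt.
    prompt = (
        'A blue screen displays the white text "Your customers are talking," '
        'then a green screen with the text "Are you listening?" and a '
        '"Try it free" button.'
    )
    # Hypothetical OCR output from a generated clip (one phrase misspelled, one missing).
    ocr_lines = ["Your custmers are talking", "TRY IT FREE!"]
    for phrase in quoted_phrases(prompt):
        print(f"{phrase!r}: {best_match(phrase, ocr_lines):.2f}")
```

A per-phrase similarity like this separates a slightly misspelled rendering from text that never appears, which a single pass/fail flag would hide.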

Case 2: Trailer title rendering ("EIGHTY-SEVEN SECONDS")

Original prompt: Four-shot high-tempo teaser with clean sync hits. Shot 1: Inside a bank vault, fluorescent hum and distant alarms; a timer on a device beeps faster as a thief whispers, "Eighty-seven seconds, move." Shot 2: Close-up of a glass cutter scoring a pane with a sharp scratch, then a suction cup pops as the circle lifts free, landing on a bass hit. Shot 3: Smash cut to a getaway car; engine revs, tires chirp, and the car fishtails out of a tight alley with gravel spraying and rattling off the chassis. Shot 4: A final slow-motion shot of a duffel bag hitting the pavement with a heavy thud as sirens surge; the title EIGHTY-SEVEN SECONDS slams onto black with a metallic logo sting.

Case 3: Physical plausibility (Chladni plate)

Original prompt: A top-down view of a black square metal plate sprinkled evenly with fine white sand as a tone generator plays a pure sine wave that sweeps upward in pitch. As the plate begins to vibrate, the rising tone makes the sand suddenly jitter and chatter across the metal, then fall quiet as grains slide into crisp geometric nodal lines that sharpen and rearrange each time the pitch crosses a new resonance.

Case 4: Physical plausibility (Briggs-Rauscher reaction)

Original prompt: A high-speed time-lapse shows a beaker on a magnetic stirrer, the stir plate motor making a steady whir as a stir bar spins. The beaker contains a Briggs-Rauscher mixture (hydrogen peroxide, potassium iodate, malonic acid, and a metal-ion catalyst with starch indicator). While the vortex turns, the liquid repeatedly cycles through several distinct visible states in a rhythmic pattern, switching abruptly and then returning again and again as the stirring continues.

Case 5: Semantic misalignment (vacation ad)

Original prompt: A young boy hits a beach ball as a group of children runs past him and jumps into a swimming pool with loud splashes, while a voiceover states, "We went on vacation with a toe dipper." The camera follows the kids underwater as bubbles roar and feet kick past the lens, and the voiceover finishes, "and left with a cannonballer." Finally, the view resurfaces to show a laughing girl in the water as on-screen text reads "Book your family home now."

Case 6: Musical pitch accuracy (single note A4)

Original prompt: A zoomed-in tutorial shot of a clean-tone electric guitar fretboard and picking hand. The player frets a single note A4 and plucks it four times with even timing, letting each note ring briefly. The pitch stays stable (no bend, no vibrato), and no other strings ring.
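A pitch-accuracy check for a prompt like Case 6 can be pictured as comparing detected note frequencies with the target pitch. Under standard tuning, A4 is 440 Hz, and the deviation of a frequency f from it in cents is 1200 · log2(f / 440). The sketch below assumes a pitch tracker has already produced one frequency per pluck; the function names, the example frequencies, and the 50-cent tolerance are illustrative assumptions, not the benchmark's published rule.

```python
import math

A4_HZ = 440.0  # standard concert pitch for A4 (MIDI note 69)


def cents_off(freq_hz, target_hz=A4_HZ):
    """Signed deviation of `freq_hz` from `target_hz` in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(freq_hz / target_hz)


def all_in_tune(freqs_hz, target_hz=A4_HZ, tolerance_cents=50.0):
    """True if every detected note lies within the tolerance of the target pitch."""
    return all(abs(cents_off(f, target_hz)) <= tolerance_cents for f in freqs_hz)


if __name__ == "__main__":
    # Hypothetical detections for the four plucks requested by the prompt.
    plucks = [440.0, 441.2, 438.9, 440.5]
    print(all_in_tune(plucks))           # all four stay within a few cents of A4
    print(all_in_tune([440.0, 466.16]))  # second note is about a semitone sharp
```

Working in cents rather than raw hertz keeps the tolerance musically uniform across registers, so the same threshold applies to a low or high target note.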

Citation

If you find AVGen-Bench useful, please cite:

@misc{zhou2026avgenbenchtaskdrivenbenchmarkmultigranular,
  title={AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation},
  author={Ziwei Zhou and Zeyuan Lai and Rui Wang and Yifan Yang and Zhen Xing and Yuqing Yang and Qi Dai and Lili Qiu and Chong Luo},
  year={2026},
  eprint={2604.08540},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.08540},
}

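Collected summary

About the dataset

Construction

Evaluation of text-to-audio-video generation has long been fragmented: existing benchmarks often assess audio and video in isolation or rely on coarse embedding similarity, which fails to capture the fine-grained joint correctness that realistic prompts require. AVGen-Bench takes a task-driven approach: its prompt set is not reverse-engineered around existing metrics but grounded in real user intent and application scenarios. Through a human-in-the-loop pipeline, a large language model generates candidate prompts across 11 realistic categories spanning three domains (professional media production, the creator economy, and physical world simulation), and human reviewers then filter them rigorously. The result is a set of 235 high-quality, complex, and diverse tasks, which keeps the evaluation aligned with real creative needs.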

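Features

The core feature of AVGen-Bench is its multi-level, fine-grained evaluation framework. Unlike earlier benchmarks, it is described as the first to evaluate audio-video generation jointly, and it moves beyond coarse aesthetic scores. The framework combines lightweight expert models with multimodal large language models into a suite covering ten dimensions. Beyond basic per-modality quality and cross-modal synchronization, it adds dedicated measures of fine-grained controllability and semantic correctness: scene text legibility, facial identity consistency, musical pitch accuracy, speech intelligibility and coherence, physical plausibility, and holistic semantic alignment. This allows systematic diagnosis of specific failure modes on complex, realistic tasks.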

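Usage

To evaluate a model with AVGen-Bench, researchers first run the text-to-audio-video model under test on the benchmark's 235 task prompts to generate the corresponding audio-video content. The benchmark's multi-granular evaluation suite then analyzes the outputs automatically. The suite uses a hybrid strategy: basic per-modality quality and synchronization are computed by specialized pretrained models (such as Q-Align, Audiobox-Aesthetic, and Syncformer), while fine-grained semantic control is verified by an inference pipeline that chains expert models (such as PaddleOCR, InsightFace, and Basic-Pitch) as feature extractors with a multimodal large language model (such as Gemini). The resulting per-metric scores expose the gap between a model's audio-visual aesthetics and its fine-grained semantic reliability, giving clear direction for model diagnosis and follow-up research.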
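The two-stage workflow (generate one clip per prompt, then score each clip on every dimension) can be sketched as a small loop. Everything in the sketch is hypothetical: `run_benchmark`, the `Generator` and `Evaluator` interfaces, and the stand-in scorers are illustrative names, and averaging scores per dimension is an assumed aggregation, not necessarily the one the benchmark uses. No real model or evaluator is called.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical interfaces: a generator maps a prompt to the path of a generated
# clip; an evaluator maps (prompt, clip_path) to a score for one dimension.
Generator = Callable[[str], str]
Evaluator = Callable[[str, str], float]


def run_benchmark(
    prompts: List[str],
    generate: Generator,
    evaluators: Dict[str, Evaluator],
) -> Dict[str, float]:
    """Generate one clip per prompt, score it on every dimension, average per dimension."""
    per_metric: Dict[str, List[float]] = {name: [] for name in evaluators}
    for prompt in prompts:
        clip_path = generate(prompt)
        for name, evaluate in evaluators.items():
            per_metric[name].append(evaluate(prompt, clip_path))
    # Dimensions with no scores (empty prompt list) are omitted from the result.
    return {name: mean(scores) for name, scores in per_metric.items() if scores}


if __name__ == "__main__":
    # Stand-ins so the sketch runs without any model. Real evaluators would wrap
    # the expert models and the multimodal LLM pipeline described in the text.
    demo_prompts = ["a dog barks twice", "a single note A4 on guitar"]
    fake_generate = lambda prompt: f"clip_{len(prompt)}.mp4"
    fake_evaluators = {
        "visual_quality": lambda prompt, clip: 0.9,
        "prompt_length": lambda prompt, clip: float(len(prompt.split())),
    }
    print(run_benchmark(demo_prompts, fake_generate, fake_evaluators))
```

Keeping each dimension behind the same `(prompt, clip_path) -> score` interface means signal-level scorers and LLM-based judges can be mixed in one run without changing the loop.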

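Background and challenges

Background overview

With the rapid progress of generative AI, text-to-audio-video generation is becoming a core interface for media creation. Evaluation in this area, however, has long been fragmented: most existing benchmarks assess audio or video quality in isolation or rely on coarse embedding-similarity measures, which fail to capture the fine-grained joint correctness that realistic prompts require. To address this, research teams from institutions including Microsoft Research Asia, Fudan University, and the University of Science and Technology of China jointly released AVGen-Bench in 2026. The benchmark uses a high-quality prompt set covering 11 real-world categories to drive comprehensive evaluation of text-to-audio-video generation models. Its central research question is how to quantify semantic alignment and fine-grained controllability in multimodal generation, with the aim of advancing immersive media generation.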

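Current challenges

The core challenge in the text-to-audio-video generation field that AVGen-Bench targets is fine-grained cross-modal semantic alignment. Specifically, a model must generate high-fidelity audio-visual content while precisely following complex constraints in the prompt, such as rendering specific text, controlling musical pitch, keeping speech coherent, and respecting physical logic. Current models are generally weak on such tasks and show clear deficits in semantic reliability. On the dataset construction side, the challenges are to design a prompt set that reflects real user intent while being sufficiently complex, and to develop a hybrid evaluation framework that combines lightweight expert models with multimodal large language models, balancing signal-level precision with high-level semantic reasoning so that failure modes can be diagnosed systematically.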

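Common scenarios

Classic use cases

In text-to-audio-video generation, AVGen-Bench serves as a task-driven benchmark whose classic use is multi-granular, fine-grained joint evaluation of frontier generative models. The dataset builds a high-quality prompt set from 11 realistic categories across three domains (professional media production, the creator economy, and physical world simulation), intended to mirror real user creative intent. Its core application is systematic diagnosis of model bottlenecks in audio-video synchronization, semantic alignment, physical plausibility, and related areas. It gives the research community a standardized, reproducible evaluation setting and supports the shift of generative models from perceptual quality toward semantic controllability.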

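Practical applications

In practice, AVGen-Bench provides a key evaluation standard for developing and optimizing multimedia content creation tools. The scenarios it covers, including movie trailers, advertisements, music tutorials, and gameplay videos, correspond directly to real needs in professional media production and the creator economy. By evaluating a model's ability in complex multi-shot narrative, fine audio control, and simulation of physical dynamics, the benchmark helps developers identify and fix weaknesses in applications such as educational content, entertainment media, and simulation training. For example, when generating a chemistry lab teaching video, the benchmark can check whether the model accurately simulates the physical phenomena of sodium reacting with water, supporting the scientific accuracy and educational value of the generated content.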

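Derived and related work

According to this summary, the release of AVGen-Bench has prompted a series of follow-up studies on fine-grained audio-video generation evaluation. Its multi-granular hybrid evaluation paradigm is said to have inspired later work such as PhyT2V to go deeper on physical plausibility evaluation, and its dedicated evaluation methods for modules such as text rendering and pitch control are said to have been integrated into extended benchmarks such as VBench++. The model weaknesses the dataset reveals, in music-theory understanding, semantic consistency, and elsewhere, are also credited with motivating hybrid architectures such as MAViD to explore stronger cross-modal consistency. Together, these works are described as advancing a shift in generative models from probabilistic texture synthesis toward physically grounded world simulation.

The content above was collected and summarized by 遇见数据集.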
