IntelligenceLab/VideoHallu

Source: Hugging Face. Updated: 2025-06-04. Indexed: 2025-11-01.
Download link:
https://hf-mirror.com/datasets/IntelligenceLab/VideoHallu
Description:
VideoHallu is a benchmark dataset for evaluating and mitigating multi-modal hallucinations in synthetic videos. It includes synthetic videos generated by models such as Sora, Veo2, and Kling, paired with expert-crafted counterintuitive QA to evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors. VideoHallu evaluates MLLMs' abnormality detection abilities with examples across four categories: alignment, consistency, common sense, and physics. We benchmark state-of-the-art MLLMs, including GPT-4o, Gemini-2.5-Pro, and Qwen-2.5-VL, as well as forefront models such as Video-R1 and VideoChat-R1. We observe that these models perform well on many real-world benchmarks such as MVBench and MovieChat, but still struggle with basic physics-based and common sense reasoning in synthetic videos. We further show that post-training with Group Relative Policy Optimization (GRPO), using curriculum learning on datasets that combine video QA with counterintuitive common sense and physics reasoning over real and synthetic videos, improves MLLMs' abnormality detection and critical thinking. This demonstrates the value of targeted training for improving their understanding of common sense and physical laws.
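
For readers unfamiliar with GRPO, the core idea named in the description is that each sampled answer is scored relative to the other answers sampled for the same prompt. The following is a minimal sketch of that group-relative normalization only, not the authors' training code. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration; real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and standard deviation.

    `rewards` holds one scalar reward per answer sampled for a single prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one video question
# (1.0 = answer judged correct, 0.0 = judged incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline. When every answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal.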
Provided by:
IntelligenceLab