
streamingbench

ModelScope community, updated 2026-01-08, indexed 2026-01-03
Download link:
https://modelscope.cn/datasets/Lingcco1/streamingbench
Resource description:
[**NEW!** 2025.05.15] 🔥: [Seed1.5-VL](https://github.com/ByteDance-Seed/Seed1.5-VL) achieved SOTA among all models with a score of 82.80 on the Proactive Output task.

[**NEW!** 2025.03.17] ⭐: [ViSpeeker](https://arxiv.org/abs/2503.12769) achieved open-source SOTA with a score of 61.60 on the Omni-Source Understanding task.

[**NEW!** 2025.01.14] 🚀: [MiniCPM-o 2.6](https://github.com/OpenBMB/MiniCPM-o) achieved streaming SOTA with a score of 66.01 on the overall benchmark.

[**NEW!** 2025.01.06] 🏆: [Dispider](https://github.com/Mark12Ding/Dispider) achieved streaming SOTA with a score of 53.12 on the overall benchmark.

[**NEW!** 2024.12.09] 🎉: [InternLM-XComposer2.5-OmniLive](https://github.com/InternLM/InternLM-XComposer) achieved 73.79 on Real-Time Visual Understanding.

------

## 🎞️ Overview

As multimodal large language models (MLLMs) continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before any query is made. This is far from the human ability to process and respond to video streams in real time while capturing the dynamic nature of multimedia content. To bridge this gap, **StreamingBench** introduces the first comprehensive benchmark for streaming video understanding in MLLMs.

### Key Evaluation Aspects

- 🎯 **Real-time Visual Understanding**: Can the model process and respond to visual changes in real time?
- 🔊 **Omni-source Understanding**: Does the model integrate visual and audio inputs synchronously in real-time video streams?
- 🎬 **Contextual Understanding**: Can the model comprehend the broader context within video streams?
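To make the offline versus streaming distinction concrete, the sketch below selects which frames a model may use for a query asked at a given timestamp. It is a minimal illustration and not StreamingBench's own code: the helper `visible_frames`, its parameters (including the optional `context_seconds` window), and the one-frame-per-second sampling are all assumptions made for this example.

```python
from typing import List, Optional


def visible_frames(
    frame_times: List[float],
    query_time: float,
    context_seconds: Optional[float] = None,
) -> List[float]:
    """Return the timestamps of frames a streaming model may use.

    Hypothetical helper: only frames at or before `query_time` are visible.
    If `context_seconds` is given, frames older than that window are dropped.
    """
    start = 0.0 if context_seconds is None else max(0.0, query_time - context_seconds)
    return [t for t in frame_times if start <= t <= query_time]


# Assumed setup: a 10-second clip sampled at one frame per second.
frame_times = [float(t) for t in range(10)]

# Offline setting: every frame is pre-loaded before the query.
offline = frame_times
# Streaming setting: nothing after the query time (t = 4 s) is available.
streaming = visible_frames(frame_times, query_time=4.0)
# Streaming with a bounded context window of 3 seconds before the query.
windowed = visible_frames(frame_times, query_time=8.0, context_seconds=3.0)
```

Here `offline` holds all ten frames, while `streaming` stops at the query time and `windowed` additionally drops frames older than the window.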
### Dataset Statistics

- 📊 **900** diverse videos
- 📝 **4,500** human-annotated QA pairs
- ⏱️ Five questions per video at different timestamps

#### 🎬 Video Categories

<div align="center">
<img src="./figs/StreamingBench_Video.png" width="80%" alt="Video Categories">
</div>

#### 🔍 Task Taxonomy

<div align="center">
<img src="./figs/task_taxonomy.png" width="80%" alt="Task Taxonomy">
</div>

## 📐 Dataset Examples

https://github.com/user-attachments/assets/e6d1655d-ab3f-47a7-973a-8fd6c8962307

<div align="center">
<video width="100%" controls>
<source src="./figs/example.video" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>

## 🔮 Evaluation Pipeline

### Requirements

- Python 3.x
- ffmpeg-python

### Data Preparation

1. **Download Dataset**: Retrieve all necessary files from the [StreamingBench Dataset](https://huggingface.co/datasets/mjuicem/StreamingBench).

2. **Decompress Files**: Extract the downloaded files and organize them in the `./data` directory as follows:

   ```
   StreamingBench/
   ├── data/
   │   ├── real/          # Unzip Real Time Visual Understanding_*.zip into this folder
   │   ├── omni/          # Unzip other .zip files into this folder
   │   ├── sqa/           # Unzip Sequential Question Answering_*.zip into this folder
   │   └── proactive/     # Unzip Proactive Output_*.zip into this folder
   ```

3. **Preprocess Data**: Run the following command to preprocess the data:

   ```bash
   cd ./scripts
   bash preprocess.sh
   ```

### Model Preparation

Prepare your own model for evaluation by following the instructions provided [here](./docs/model_guide.md). This guide will help you set up and configure your model to ensure it is ready for testing against the dataset.

### Evaluation

Now you can run the benchmark:

```sh
bash eval.sh
```

This will run the benchmark and save the results to the specified output file.
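Returning to the decompression step under Data Preparation, the routing of archives into the four folders can be sketched in Python. This is only an illustration of the layout described there, not part of the repository: `target_folder` and `extract_all` are hypothetical helpers, and the sketch assumes each archive can be extracted directly into its folder without further rearranging.

```python
from pathlib import Path
import zipfile


def target_folder(zip_name: str) -> str:
    """Map an archive name to its folder under ./data, following the layout above."""
    if zip_name.startswith("Real Time Visual Understanding_"):
        return "real"
    if zip_name.startswith("Sequential Question Answering_"):
        return "sqa"
    if zip_name.startswith("Proactive Output_"):
        return "proactive"
    return "omni"  # every other .zip file goes into omni/


def extract_all(download_dir: Path, data_dir: Path) -> None:
    """Extract each .zip in `download_dir` into its folder under `data_dir`."""
    for archive in sorted(download_dir.glob("*.zip")):
        destination = data_dir / target_folder(archive.name)
        destination.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(destination)
```

A call such as `extract_all(Path("./downloads"), Path("./data"))` would populate the folders, where `./downloads` is an assumed location for the downloaded archives.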
After the evaluation run finishes, calculate the metrics with:

```sh
bash stats.sh
```

## 🔬 Experimental Results

### Performance of Various MLLMs on StreamingBench

- 60 seconds of context preceding the query time (Main)

<div align="center">
<img src="./figs/result_2.png" width="80%" alt="Task Taxonomy">
</div>

- All Context (+ Long Context)

<div align="center">
<img src="./figs/result_1.png" width="80%" alt="Task Taxonomy">
</div>

- Comparison of Main Experiment vs. 60 Seconds of Video Context

<div align="center">
<img src="./figs/heatmap.png" width="80%" alt="Task Taxonomy">
</div>

### Performance of Different MLLMs on the Proactive Output Task

*"≤ xs" means that the answer is considered correct if the actual output time is within x seconds of the ground truth.*

<div align="center">
<img src="./figs/po.png" width="80%" alt="Task Taxonomy">
</div>

## 📝 Citation

```bibtex
@article{lin2024streaming,
  title={StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding},
  author={Junming Lin and Zheng Fang and Chi Chen and Zihao Wan and Fuwen Luo and Peng Li and Yang Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2411.03628},
  year={2024}
}
```
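As a supplement to the "≤ xs" criterion used for the Proactive Output task, the following sketch computes accuracy at several time tolerances. It is a hedged illustration rather than the logic in `stats.sh`: the function `accuracy_within`, the symmetric treatment of early and late outputs, the handling of missing outputs as incorrect, and the sample timestamps are all assumptions.

```python
from typing import Dict, Optional, Sequence


def accuracy_within(
    predicted: Sequence[Optional[float]],
    ground_truth: Sequence[float],
    thresholds: Sequence[float],
) -> Dict[float, float]:
    """Fraction of items whose output time is within x seconds of the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    results: Dict[float, float] = {}
    for x in thresholds:
        # Assumption: early and late outputs are treated alike (absolute difference),
        # and a missing output (None) never counts as correct.
        hits = sum(
            1
            for p, g in zip(predicted, ground_truth)
            if p is not None and abs(p - g) <= x
        )
        results[x] = hits / len(ground_truth) if ground_truth else 0.0
    return results


# Made-up timestamps in seconds; None means the model never produced an output.
predicted = [10.5, 33.0, None, 61.0]
ground_truth = [10.0, 30.0, 45.0, 60.0]
scores = accuracy_within(predicted, ground_truth, thresholds=[1.0, 2.0, 5.0])
```

Widening the tolerance can only keep or raise the score, which is why results for this task are reported per threshold.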

Provider:
maas
Created:
2025-12-26