streamingbench
Source: ModelScope community (updated 2026-01-08, indexed 2026-01-03)
Download link: https://modelscope.cn/datasets/Lingcco1/streamingbench
Resource description:
[**NEW!** 2025.05.15] 🔥: [Seed1.5-VL](https://github.com/ByteDance-Seed/Seed1.5-VL) achieved SOTA among all models with a score of 82.80 on the Proactive Output task.
[**NEW!** 2025.03.17] ⭐: [ViSpeak](https://arxiv.org/abs/2503.12769) achieved Open-Source SOTA with a score of 61.60 on the Omni-Source Understanding task.
[**NEW!** 2025.01.14] 🚀: [MiniCPM-o 2.6](https://github.com/OpenBMB/MiniCPM-o) achieved Streaming SOTA with a score of 66.01 on the Overall benchmark.
[**NEW!** 2025.01.06] 🏆: [Dispider](https://github.com/Mark12Ding/Dispider) achieved Streaming SOTA with a score of 53.12 on the Overall benchmark.
[**NEW!** 2024.12.09] 🎉: [InternLM-XComposer2.5-OmniLive](https://github.com/InternLM/InternLM-XComposer) achieved 73.79 on Real-Time Visual Understanding.
------
## 🎞️ Overview
As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before making queries. However, this is far from the human ability to process and respond to video streams in real-time, capturing the dynamic nature of multimedia content. To bridge this gap, **StreamingBench** introduces the first comprehensive benchmark for streaming video understanding in MLLMs.
### Key Evaluation Aspects
- 🎯 **Real-time Visual Understanding**: Can the model process and respond to visual changes in real-time?
- 🔊 **Omni-source Understanding**: Does the model integrate visual and audio inputs synchronously in real-time video streams?
- 🎬 **Contextual Understanding**: Can the model comprehend the broader context within video streams?
### Dataset Statistics
- 📊 **900** diverse videos
- 📝 **4,500** human-annotated QA pairs
- ⏱️ Five questions per video at different timestamps
#### 🎬 Video Categories
<div align="center">
<img src="./figs/StreamingBench_Video.png" width="80%" alt="Video Categories">
</div>
#### 🔍 Task Taxonomy
<div align="center">
<img src="./figs/task_taxonomy.png" width="80%" alt="Task Taxonomy">
</div>
## 📐 Dataset Examples
https://github.com/user-attachments/assets/e6d1655d-ab3f-47a7-973a-8fd6c8962307
<div align="center">
<video width="100%" controls>
<source src="./figs/example.video" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
## 🔮 Evaluation Pipeline
### Requirements
- Python 3.x
- ffmpeg-python
### Data Preparation
1. **Download Dataset**: Retrieve all necessary files from the [StreamingBench Dataset](https://huggingface.co/datasets/mjuicem/StreamingBench).
2. **Decompress Files**: Extract the downloaded files and organize them in the `./data` directory as follows:
```
StreamingBench/
├── data/
│ ├── real/ # Unzip Real Time Visual Understanding_*.zip into this folder
│ ├── omni/ # Unzip other .zip files into this folder
│ ├── sqa/ # Unzip Sequential Question Answering_*.zip into this folder
│ └── proactive/ # Unzip Proactive Output_*.zip into this folder
```
3. **Preprocess Data**: Run the following command to preprocess the data:
```bash
cd ./scripts
bash preprocess.sh
```
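Step 2 can also be scripted. The following is a minimal sketch, assuming the downloaded archives keep the filename prefixes shown in the directory tree above; the helper names `target_dir_for` and `extract_all` are hypothetical and are not part of the StreamingBench repository.

```python
import zipfile
from pathlib import Path

# Prefixes copied from the directory tree above. Matching archives by
# filename prefix is an assumption about how the downloads are named.
PREFIX_TO_DIR = {
    "Real Time Visual Understanding_": "real",
    "Sequential Question Answering_": "sqa",
    "Proactive Output_": "proactive",
}


def target_dir_for(zip_name: str) -> str:
    """Return the data/ subfolder for an archive; unmatched names go to omni/."""
    for prefix, folder in PREFIX_TO_DIR.items():
        if zip_name.startswith(prefix):
            return folder
    return "omni"


def extract_all(download_dir: str, data_root: str = "./data") -> None:
    """Extract every .zip in download_dir into its matching subfolder of data_root."""
    for archive in sorted(Path(download_dir).glob("*.zip")):
        dest = Path(data_root) / target_dir_for(archive.name)
        dest.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
```

Calling `extract_all("./downloads")` from the `StreamingBench/` directory would then populate `./data` before `preprocess.sh` is run; the `./downloads` path is only an example.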
### Model Preparation
Prepare your own model for evaluation by following the instructions provided [here](./docs/model_guide.md). This guide will help you set up and configure your model to ensure it is ready for testing against the dataset.
### Evaluation
Now you can run the benchmark:
```sh
bash eval.sh
```
This will run the benchmark and save the results to the specified output file. Then you can calculate the metrics using the following command:
```sh
bash stats.sh
```
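The format of the output file is not described here, and `stats.sh` remains the authoritative way to compute the metrics. As a rough illustration of the kind of aggregation involved, the sketch below computes per-task accuracy over a hypothetical list of records; the field names `task`, `answer`, and `prediction` are assumptions, not the repository's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_task(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Compute the fraction of correct predictions for each task.

    Each record is assumed (hypothetically) to carry a task name, the
    ground-truth option letter, and the model's predicted option letter.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for record in records:
        task = record["task"]
        totals[task] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / totals[task] for task in totals}
```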
## 🔬 Experimental Results
### Performance of Various MLLMs on StreamingBench
- 60 seconds of context preceding the query time (Main)
<div align="center">
<img src="./figs/result_2.png" width="80%" alt="Task Taxonomy">
</div>
- All Context (+ Long Context)
<div align="center">
<img src="./figs/result_1.png" width="80%" alt="Task Taxonomy">
</div>
- Comparison of Main Experiment vs. 60 Seconds of Video Context
<div align="center">
<img src="./figs/heatmap.png" width="80%" alt="Task Taxonomy">
</div>
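The 60-second setting listed above limits the model to the video that immediately precedes the query timestamp. The sketch below shows one way to compute that window and cut it out with `ffmpeg-python`. It is an illustration under stated assumptions, not the benchmark's own preprocessing: the function names are hypothetical, and clamping the window at the start of the video is an assumption.

```python
from typing import Tuple


def context_window(query_time: float, context_seconds: float = 60.0) -> Tuple[float, float]:
    """Return (start, end) in seconds of the video span visible at query_time."""
    if query_time < 0:
        raise ValueError("query_time must be non-negative")
    # Clamp at 0 so early queries simply see less than the full window.
    start = max(0.0, query_time - context_seconds)
    return start, query_time


def clip_context(src: str, dst: str, query_time: float, context_seconds: float = 60.0) -> None:
    """Write the context window of src to dst (needs ffmpeg-python and the ffmpeg binary)."""
    import ffmpeg  # imported lazily so context_window works without ffmpeg-python

    start, end = context_window(query_time, context_seconds)
    # ss seeks to the window start; t limits the duration to the window length.
    ffmpeg.input(src, ss=start, t=end - start).output(dst).run(overwrite_output=True)
```

For a question asked at 90 seconds, `context_window(90.0)` yields the span from 30 to 90 seconds; for a question at 25 seconds, the window starts at 0.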
### Performance of Different MLLMs on the Proactive Output Task
*"≤ xs" means that the answer is considered correct if the actual output time is within x seconds of the ground truth.*
<div align="center">
<img src="./figs/po.png" width="80%" alt="Task Taxonomy">
</div>
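The "≤ xs" criterion can be written as a small check. This is a minimal sketch that assumes "within x seconds" means an absolute difference in either direction; whether the official scoring also accepts outputs that arrive before the ground-truth time is not stated here. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def within_tolerance(output_time: float, ground_truth_time: float, x: float) -> bool:
    """True if the output time is at most x seconds from the ground truth."""
    return abs(output_time - ground_truth_time) <= x


def accuracy_at(pairs: Iterable[Tuple[float, float]], x: float) -> float:
    """Fraction of (output_time, ground_truth_time) pairs within x seconds."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(within_tolerance(out, gt, x) for out, gt in pairs)
    return hits / len(pairs)
```

Sweeping `x` over several thresholds produces one accuracy value per "≤ xs" column.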
## 📝 Citation
```bibtex
@article{lin2024streaming,
title={StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding},
author={Junming Lin and Zheng Fang and Chi Chen and Zihao Wan and Fuwen Luo and Peng Li and Yang Liu and Maosong Sun},
journal={arXiv preprint arXiv:2411.03628},
year={2024}
}
```
Provider: maas
Created: 2025-12-26



