STAR-Bench
Source: ModelScope community (updated 2025-12-19, indexed 2025-11-03)
Download link:
https://modelscope.cn/datasets/Shanghai_AI_Laboratory/STAR-Bench
<div align="center">
<div align="center">
<img src="assets/4d_logo.png" width="160"/>
<h1 align="center">
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
</h1>
</div>
<p align="center">
<a href="https://scholar.google.com/citations?user=iELd-Q0AAAAJ"><strong>Zihan Liu<sup>*</sup></strong></a>
·
<a href="https://scholar.google.com/citations?user=mXSpi2kAAAAJ&hl=zh-CN"><strong>Zhikang Niu<sup>*</sup></strong></a>
·
<a href="https://github.com/akkkkkkkkki/"><strong>Qiuyang Xiao</strong></a>
·
<a href="https://scholar.google.com/citations?user=WYwBrzAAAAAJ&hl=en"><strong>Zhisheng Zheng</strong></a>
·
<a href="https://github.com/yrqqqq404"><strong>Ruoqi Yuan</strong></a>
·
<a href="https://yuhangzang.github.io/"><strong>Yuhang Zang<sup>†</sup></strong></a>
<br>
<a href="https://scholar.google.com/citations?user=sJkqsqkAAAAJ"><strong>Yuhang Cao</strong></a>
·
<a href="https://lightdxy.github.io/"><strong>Xiaoyi Dong</strong></a>
·
<a href="https://scholar.google.com/citations?user=P4yNnSkAAAAJ&hl=zh-TW"><strong>Jianze Liang</strong></a>
·
<a href="https://scholar.google.com/citations?user=d6u01FkAAAAJ&hl=en"><strong>Xie Chen</strong></a>
·
<a href="https://scholar.google.com/citations?user=QVHvhM4AAAAJ&hl=en"><strong>Leilei Sun</strong></a>
·
<a href="http://dahua.site/"><strong>Dahua Lin</strong></a>
·
<a href="https://myownskyw7.github.io/"><strong>Jiaqi Wang<sup>†</sup></strong></a>
</p>
<p align="center" style="font-size: 1em; margin-top: -1em"> <sup>*</sup> Equal Contribution. <sup>†</sup>Corresponding authors. </p>
<p align="center" style="font-size: 1.2em; margin-top: 0.5em">
📖<a href="https://huggingface.co/papers/2510.24693">Paper</a> | 📖<a href="https://arxiv.org/abs/2510.24693">arXiv</a>
|🏠<a href="https://github.com/InternLM/StarBench">Code</a>
|🌐<a href="https://internlm.github.io/StarBench/">Homepage</a>
| 🤗<a href="https://huggingface.co/datasets/internlm/STAR-Bench">Dataset</a>
</p>
</div>
## 🌈Overview
We formalize <strong>audio 4D intelligence</strong>, defined as reasoning over sound dynamics in time and 3D space, and introduce <strong>STAR-Bench</strong> to measure it. STAR-Bench combines a <strong>Foundational Acoustic Perception</strong> setting (six attributes under absolute and relative regimes) with a <strong>Holistic Spatio-Temporal Reasoning</strong> setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories.
<p style="text-align: center;">
<img src="assets/teaser.png" alt="teaser" width="100%">
</p>
Unlike prior benchmarks, where answering from captions alone reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidence that it targets <strong>linguistically hard-to-describe cues</strong>.
Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
Benchmark examples are illustrated below. You can also visit the [homepage](https://internlm.github.io/StarBench/) for a more intuitive overview.
</p>
<p style="text-align: center;">
<img src="assets/bench_examples.png" alt="STAR-Bench Examples" width="100%">
</p>
<!-- A comparative overview of our benchmark against other representative audio benchmarks is shown below.
<p style="text-align: center;">
<img src="assets/bench_compare.png" alt="Comparison among Benchmarks" width="100%">
</p> -->
## 📊Results and Analysis
Evaluation results of various models on STAR-Bench v0.5 are shown below.
The leaderboard for v1.0 will be released soon.
<p style="text-align: center;">
<img src="assets/results.png" alt="Results" width="100%">
</p>
Error distribution across temporal and spatial tasks:
<p style="text-align: center;">
<img src="assets/error_dist.png" alt="Error distribution" width="100%">
</p>
## 💡 Key Insights
- 🔥 **A clear capability hierarchy between closed-source and open-source models.** Closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning.
- 🔥 **Enhancing dense audio captioning.** Open-source models struggle to produce dense, fine-grained captions, which limits their perceptual sensitivity and ability to extract embedded knowledge. Bridging this gap is a crucial first step.
- 🔥 **Improving multi-audio reasoning.** Open-source models lag significantly in comparing, integrating, and grounding information across multiple audio clips.
- 🔥 **Moving beyond channel-averaged audio preprocessing.** The common practice of averaging multi-channel audio into a mono signal is a major bottleneck for spatial reasoning. Developing architectures that natively process multi-channel cues is essential for unlocking genuine spatial awareness.
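To make the last insight concrete, here is a minimal sketch (not taken from the STAR-Bench code) of why channel averaging discards spatial cues. It assumes a simplified stereo model in which a source's direction is encoded only as a level difference between the left and right channels; all function names are illustrative.

```python
import math


def make_stereo_tone(freq_hz, sample_rate, n_samples, left_gain, right_gain):
    """Return (left, right) channels of one tone with per-channel gains."""
    left, right = [], []
    for n in range(n_samples):
        s = math.sin(2 * math.pi * freq_hz * n / sample_rate)
        left.append(left_gain * s)
        right.append(right_gain * s)
    return left, right


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def level_difference_db(left, right):
    """Inter-channel level difference in dB (positive: louder on the left)."""
    return 20 * math.log10(rms(left) / rms(right))


def downmix_to_mono(left, right):
    """Average the two channels, as many audio front ends do."""
    return [(l + r) / 2 for l, r in zip(left, right)]


# The same tone, once panned toward the left and once toward the right.
panned_left = make_stereo_tone(440, 16000, 1600, left_gain=0.9, right_gain=0.1)
panned_right = make_stereo_tone(440, 16000, 1600, left_gain=0.1, right_gain=0.9)

ild_left = level_difference_db(*panned_left)
ild_right = level_difference_db(*panned_right)

# After averaging, the two differently located sources become identical.
mono_left = downmix_to_mono(*panned_left)
mono_right = downmix_to_mono(*panned_right)
indistinguishable = mono_left == mono_right
```

In this toy setup the two stereo signals differ by roughly 19 dB in inter-channel level, yet their mono downmixes are sample-for-sample identical, so a model that only sees the downmix cannot tell left from right.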
## ⚙️Data Curation
<p style="text-align: center;">
<img src="assets/data_dist.png" alt="" width="90%">
</p>
All audio for the foundational perception task is synthesized, either through precise parameterization or with the physics-based Pyroomacoustics simulator, giving complete control over acoustic parameters. Domain experts rigorously validate the task difficulty levels, which are then calibrated through human testing.<br>
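
As an illustration of what "precise parameterization" can look like, the sketch below synthesizes a pair of pure tones whose pitch differs by a controlled amount, which could serve as one item in a relative-comparison regime. This is an assumption-laden example, not the authors' generation code: the choice of pitch as the attribute, the sine-tone stimulus, and every function name are hypothetical.

```python
import math


def synth_tone(freq_hz, duration_s, amplitude, sample_rate=16000):
    """Synthesize a sine tone as a list of float samples."""
    n_samples = int(round(duration_s * sample_rate))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def make_relative_pitch_pair(base_hz, cents, duration_s=0.5, amplitude=0.5):
    """Build two tones whose pitch differs by `cents` (100 cents = 1 semitone).

    Returns the reference tone, the shifted tone, and the shifted frequency.
    Smaller `cents` values give a harder discrimination item.
    """
    shifted_hz = base_hz * 2 ** (cents / 1200)
    reference = synth_tone(base_hz, duration_s, amplitude)
    shifted = synth_tone(shifted_hz, duration_s, amplitude)
    return reference, shifted, shifted_hz


# One octave (1200 cents) above 440 Hz is exactly 880 Hz.
reference, shifted, shifted_hz = make_relative_pitch_pair(440.0, cents=1200)
```

Because every parameter is explicit, the difficulty of an item can be varied by changing a single number, which is the kind of control that synthesis offers over recorded audio.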
For the holistic spatio-temporal reasoning task, the curation process comprises four key stages, including human annotation and final selection based on human performance, as illustrated below.
<p style="text-align: center;">
<img src="assets/pipeline.png" alt="pipeline" width="90%">
</p>
## 🛠️ Sample Usage
The `ALMEval_code/` directory is partially adapted from [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and [Kimi-Audio-Evalkit](https://github.com/MoonshotAI/Kimi-Audio-Evalkit).
It provides a unified evaluation pipeline for multimodal large models on **STAR-Bench**.
**Step 1: Prepare Environment**
```bash
git clone https://github.com/InternLM/StarBench.git
cd StarBench
conda create -n starbench python=3.10.0
conda activate starbench
pip install -r requirements.txt
cd ALMEval_code
```
**Step 2: Get STAR-Bench v1.0 Dataset**
Download the STAR-Bench v1.0 dataset from 🤗[HuggingFace](https://huggingface.co/datasets/internlm/STAR-Bench):
```bash
huggingface-cli download --repo-type dataset --resume-download internlm/STAR-Bench --local-dir your_local_data_dir
```
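If you prefer to download from Python, the following sketch should be roughly equivalent to the CLI command above. It assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the wrapper name `download_star_bench` is illustrative and not part of the repository.

```python
REPO_ID = "internlm/STAR-Bench"
REPO_TYPE = "dataset"


def download_star_bench(local_dir):
    """Download the STAR-Bench dataset repository into `local_dir`.

    The import is done lazily so that this module can be loaded even when
    huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type=REPO_TYPE,
        local_dir=local_dir,
    )
```

Calling `download_star_bench("your_local_data_dir")` returns the path of the local folder once the download finishes.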
**Step 3: Set Up Your Model for Evaluation**
Currently supported models include: `Qwen2.5-Omni`, `Qwen2-Audio-Instruct`, `DeSTA2.5-Audio`, `Phi4-MM`, `Kimi-Audio`, `MiDashengLM`, `Step-Audio-2-mini`, `Gemma-3n-E4B-it`, `Gemini` and `GPT-4o Audio`.
<!-- `Ming-Lite-Omni-1.5`,`Xiaomi-MiMo-Audio`,`MiniCPM-O-v2.6`,`Audio Flamingo 3`, -->
To integrate a new model, create a new file `yourmodel.py` under the `models/` directory and implement the `generate_inner()` function.
✅ Example: `generate_inner()`
```python
def generate_inner(self, msg):
    """Generate the model's response for one benchmark sample.

    Args:
        msg: dict, input format as below

        msg = {
            "meta": {
                "id": ...,
                "task": ...,
                "category": ...,
                "sub-category": ...,
                "options": ...,
                "answer": ...,
                "answer_letter": ...,
                "rotate_id": ...,
            },
            "prompts": [
                {"type": "text", "value": "xxxx"},
                {"type": "audio", "value": "audio1.wav"},
                {"type": "text", "value": "xxxx"},
                {"type": "audio", "value": "audio2.wav"},
                ...
            ]
        }
    """
    # Return the model's textual response
    return "your model output here"
```
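For a slightly fuller picture, here is a hedged sketch of a placeholder model class that walks the interleaved `prompts` list. The class name, its `NAME` value, and the constructor signature are hypothetical; the real base class and any additional required methods should be checked against the files already in `models/`.

```python
class EchoAudioModel:
    """Placeholder model that only reports what it was given."""

    NAME = "echo-audio"  # hypothetical identifier, referenced by `base_model`

    def __init__(self, model_path=None):
        # A real model would load its weights from model_path here.
        self.model_path = model_path

    def generate_inner(self, msg):
        texts, audio_paths = [], []
        # Prompts are interleaved, so iterate once and keep the order.
        for part in msg["prompts"]:
            if part["type"] == "text":
                texts.append(part["value"])
            elif part["type"] == "audio":
                audio_paths.append(part["value"])
        # A real model would decode the audio files and run inference here.
        return (
            f"received {len(texts)} text segment(s) "
            f"and {len(audio_paths)} audio clip(s)"
        )


sample_msg = {
    "meta": {"id": "demo"},
    "prompts": [
        {"type": "text", "value": "Which clip comes first?"},
        {"type": "audio", "value": "audio1.wav"},
        {"type": "text", "value": "Compare it with:"},
        {"type": "audio", "value": "audio2.wav"},
    ],
}
response = EchoAudioModel().generate_inner(sample_msg)
```

The method receives text and audio parts in prompt order and must return a plain string, which the evaluation pipeline then scores.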
**Step 4: Configure Model Settings**
Modify the configuration file: `/models/model.yaml`.
For existing models, you may need to update parameters such as `model_path` to match your local model weight path.
To add a new model variant, follow these steps:
1. Create a new top-level key for your alias (e.g., `my_model_variant:`).
2. Set `base_model` to the `NAME` attribute of the corresponding Python class.
3. Add any necessary arguments for the class's `__init__` method under `init_args`.
Example:
```yaml
qwen25-omni:
  base_model: qwen25-omni
  init_args:
    model_path: your_model_weight_path_here
```
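The relationship between an alias, `base_model`, and `init_args` can be pictured with the sketch below. It is an assumption about how such a config is typically resolved, not the repository's actual loader: the registry, `register`, `build_model`, and `DummyModel` are all hypothetical, and a plain dict stands in for the parsed YAML file.

```python
MODEL_REGISTRY = {}


def register(cls):
    """Index a model class by its NAME attribute."""
    MODEL_REGISTRY[cls.NAME] = cls
    return cls


@register
class DummyModel:
    NAME = "dummy-base"  # the value that `base_model` must match

    def __init__(self, model_path, max_new_tokens=64):
        self.model_path = model_path
        self.max_new_tokens = max_new_tokens


def build_model(alias, config):
    """Look up the alias, find the class by NAME, and pass init_args."""
    entry = config[alias]
    model_cls = MODEL_REGISTRY[entry["base_model"]]
    return model_cls(**entry.get("init_args", {}))


# Stand-in for the parsed YAML: two aliases sharing one base class.
config = {
    "dummy-short": {
        "base_model": "dummy-base",
        "init_args": {"model_path": "weights/", "max_new_tokens": 16},
    },
    "dummy-default": {
        "base_model": "dummy-base",
        "init_args": {"model_path": "weights/"},
    },
}
short_model = build_model("dummy-short", config)
default_model = build_model("dummy-default", config)
```

Under this reading, several aliases can point at the same class and differ only in their constructor arguments, which is what makes model variants cheap to add.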
**Step 5: Run Evaluation**
Run the following command:
```bash
python ./run.py \
  --model qwen25-omni \
  --data starbench_default \
  --dataset_root your_local_data_dir \
  --work-dir ./eval_results
```
Evaluation results will be automatically saved to the `./eval_results` directory.
You can also evaluate specific subtasks or their combinations by modifying the `--data` argument.
The full list of available task names can be found in `ALMEval_code/datasets/__init__.py`.
Example: Evaluate only the temporal reasoning and spatial reasoning tasks:
```bash
python ./run.py \
--model qwen25-omni \
--data tr sr \
--dataset_root your_local_data_dir \
--work-dir ./eval_results
```
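When evaluating several models or task subsets in a row, it can help to build the command programmatically. The helper below is an illustrative sketch that only assembles the flags shown above; `build_eval_command` is a hypothetical name and is not part of the repository.

```python
import shlex


def build_eval_command(model, tasks, dataset_root, work_dir="./eval_results"):
    """Assemble the run.py invocation as an argument list."""
    return [
        "python", "./run.py",
        "--model", model,
        "--data", *tasks,  # one or more task names, e.g. ["tr", "sr"]
        "--dataset_root", dataset_root,
        "--work-dir", work_dir,
    ]


cmd = build_eval_command("qwen25-omni", ["tr", "sr"], "your_local_data_dir")
# Shell-ready form of the command, useful for logging.
cmd_text = shlex.join(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the `ALMEval_code` directory.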
## ✒️Citation
```bibtex
@article{liu2025starbench,
title={STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence},
author={Liu, Zihan and Niu, Zhikang and Xiao, Qiuyang and Zheng, Zhisheng and Yuan, Ruoqi and Zang, Yuhang and Cao, Yuhang and Dong, Xiaoyi and Liang, Jianze and Chen, Xie and Sun, Leilei and Lin, Dahua and Wang, Jiaqi},
journal={arXiv preprint arXiv:2510.24693},
year={2025}
}
```
## 📄 License
  **Usage and License Notices**: The data and code are intended and licensed for research use only.
## Acknowledgement
We sincerely thank <a href="https://2077ai.com" target="_blank">2077AI</a> for providing the platform that supported our data annotation, verification, and review processes.
Provider: maas
Created: 2025-10-29



