RynnEC-Bench
收藏魔搭社区2026-05-08 更新2025-08-09 收录
下载链接:
https://modelscope.cn/datasets/DAMO_Academy/RynnEC-Bench
下载链接
链接失效反馈官方服务:
资源简介:
# RynnEC-Bench
RynnEC-Bench evaluates fine-grained embodied understanding models from the perspectives of **object cognition** and **spatial cognition** in open-world scenario. The benchmark includes 507 video clips captured in real household scenarios.
<p align="center">
<img src="https://github.com/alibaba-damo-academy/RynnEC/blob/main/assets/bench.png?raw=true" width="90%" style="margin-bottom: 0.2;"/>
<p>
---
| Model | <font color="red">*Overall Mean* </font> | Object Properties | Seg. DR | Seg. SR | <font color="red">*Object Mean*</font> | Ego. His. | Ego. Pres. | Ego. Fut. | World Size | World Dis. | World PR | <font color="red">*Spatial Mean*</font> |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| GPT-4o | 28.3 | 41.1 | --- | --- | 33.9 | 13.4 | 22.8 | 6.0 | 24.3 | 16.7 | 36.1 | 22.2 |
| GPT-4.1 | 33.5 | 45.9 | --- | --- | 37.8 | 17.2 | 27.6 | 6.1 | 35.9 | 30.4 | 45.7 | 28.8 |
| Genimi-2.5 Pro | 45.5 | 64.0 | --- | --- | 52.7 | 9.3 | 36.7 | 8.1 | 47.0 | 29.9 | 69.3 | 37.8 |
| VideoLLaMA3-7B | 27.3 | 36.7 | --- | --- | 30.2 | 5.1 | 26.8 | 1.2 | 30.0 | 19.0 | 34.9 | 24.1 |
| InternVL3-78B | 29.0 | 45.3 | --- | --- | 37.3 | 9.0 | 31.8 | 2.2 | 10.9 | 30.9 | 26.0 | 20.0 |
| Qwen2.5-VL-72B | 36.4 | 54.2 | --- | --- | 44.7 | 11.3 | 24.8 | 7.2 | 27.2 | 22.9 | 83.7 | 27.4 |
| DAM-3B | 15.6 | 22.2 | --- | --- | 18.3 | 2.8 | 14.1 | 1.3 | 28.7 | 6.1 | 18.3 | 12.6 |
| VideoRefer-VL3-7B | 32.9 | 44.1 | --- | --- | 36.3 | 5.8 | 29.0 | 6.1 | 38.1 | 30.7 | 28.8 | 29.3 |
| Sa2VA-4B | 4.9 | 5.9 | 35.3 | 14.8 | 9.4 | 0.0 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| VideoGlaMM-4B | 9.0 | 16.4 | 5.8 | 4.2 | 14.4 | 4.1 | 4.7 | 1.4 | 0.8 | 0.0 | 0.3 | 3.2 |
| RGA3-7B | 10.5 | 15.2 | 32.8 | 23.4 | 17.5 | 0.0 | 5.5 | 6.1 | 1.2 | 0.9 | 0.0 | 3.0 |
| RoboBrain-2.0-32B | 24.2 | 25.1 | --- | --- | 20.7 | 8.8 | 34.1 | 0.2 | 37.2 | 30.4 | 3.6 | 28.0 |
| **RynnEC-2B** | **54.4** | **59.3** | **46.2** | **36.9** | **56.3** | **30.1** | **47.2** | **23.8** | **67.4** | **31.2** | **85.8** | **52.3** |
---
## 1. Object Cognition
Object cognition is divided into the object properties cognition task and the referring object segmentation tasks.
### Object Property Cognition
This subset is designed to evaluate the model's ability to recognize object attributes. It is subdivided into 10 categories: category, color, material, shape, state, position, function surface detail, size, counting. It comprises a total of 10354 curated data entries. All entries have been manually annotated and verified.
🌟 **Data Balance**: To address evaluation biases from inconsistent object distributions across houses, we first established a real-world object frequency distribution by analyzing 500k indoor images with GPT-4o. We then employed frequency-based sampling to ensure our benchmark"s data mirrors this real-world distribution, and further balanced question difficulty for a more objective and realistic evaluation.
#### Data Format
```json
[
{
"video_id": "path to video",
"video": ["frame id"],
"conversations": [{"from": "human", "value": "question"}, {"from": "gpt", "value": "answer"}],
"type": "task type",
"masks": [{"frame id": {"size": [1080, 1920], "counts": "mask rle"}}],
"mask_ids": ["which frame is mask in"],
"timestamps": ["timestamp in video"],
"class_name": ["object class"] # not for counting
}
]
```
- `masks`: masks is a list of dicts, every dict is for one object.
- `mask_ids`: The correspondence frame of mask in 'video'.
- All the segmentations are in `RLE` format.
### Referring Object Segmentation
This subset is designed to evaluate the model's ability of precise instance segmentation. The task is divided into direct referring problems and situational referring problems. Direct referring problems involve only combinations of descriptions for the instance, while contextual referring problems are set within a scenario, requiring MLLMs to perform reasoning in order to identify the target object. All entries have been manually annotated and verified.
#### Data Format
```json
[
{
"video_id": "path to video",
"video": ["frame id"],
"conversations": [{"from": "human", "value": "question"}],
"type": "task type",
"masks": [{"frame id": {"size": [1080, 1920], "counts": "mask rle"}}],
"mask_ids": ["which frame is mask in"],
"timestamps": ["timestamp in video"],
}
]
```
## 2. Spatial Cognition
Spatial cognition involves MLLMs processing egocentric videos to form a 3D spatial awareness. The subset is categorized into two main types: ego-centric and world-centric. Ego-centric cognition focuses on the agent's own relationship with the environment across time, while world-centric cognition assesses the understanding of the objective 3D layout and properties of the world, such as size, distance, and position.
```json
[
{
"video_id": "path to video",
"video": ["frame id"],
"conversations": [{"from": "human", "value": "question"}, {"from": "gpt", "value": "answer"}],
"type": "task type",
"masks": [{"frame id": {"size": [1080, 1920], "counts": "mask rle"}}],
"mask_ids": ["which frame is mask in"],
"timestamps": ["timestamp in video"],
}
]
```
## 3. Data download
The data and annotation of RynnEC-Bench can be downloaded [here](https://huggingface.co/datasets/Alibaba-DAMO-Academy/RynnEC-Bench/tree/main). You should first unzip the `RynnECBench_data.zip`
Data structure:
```bash
RynnEC
└── data
└── RynnEC-Bench
├── object_cognition.json
├── object_segmentation.json
├── spatial_cognition.json
└── data
└── ...(videos)
```
## 4. Evaluation
More details can be found in [RynnEC github](https://github.com/alibaba-damo-academy/RynnEC/tree/main#).
## 5. RynnEC-Bench-mini
Considering RynnEC-Bench is very large, we also provide a subset of RynnEC-Bench, RynnEC-Bench-mini, which includes 2k object property cognition, 2k spatial cognition, and 1k object segmentation. You can validate your model first on RynnEC-Bench-mini for debugging. Please refer to the three *-mini.json file.
## 6. Citation
If you find RynnEC useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{dang2025rynnecbringingmllmsembodied,
title={RynnEC: Bringing MLLMs into Embodied World},
author={Ronghao Dang and Yuqian Yuan and Yunxuan Mao and Kehan Li and Jiangpin Liu and Zhikai Wang and Xin Li and Fan Wang and Deli Zhao},
year={2025},
eprint={2508.14160},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.14160},
}
```
# RynnEC-Bench 基准测试集
RynnEC-Bench 从**物体认知(object cognition)**与**空间认知(spatial cognition)**两个维度,对开放世界场景下的细粒度具身理解模型进行评估。该基准测试集包含507段采集自真实家庭场景的视频片段。
<p align="center">
<img src="https://github.com/alibaba-damo-academy/RynnEC/blob/main/assets/bench.png?raw=true" width="90%" style="margin-bottom: 0.2;"/>
</p>
---
| 模型 | <font color="red">*总体均值*</font> | 物体属性 | Seg. DR | Seg. SR | <font color="red">*物体认知均值*</font> | 自我历史视角 | 自我当前视角 | 自我未来视角 | 世界空间尺寸 | 世界空间距离 | 世界空间位姿 | <font color="red">*空间认知均值*</font> |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| GPT-4o | 28.3 | 41.1 | --- | --- | 33.9 | 13.4 | 22.8 | 6.0 | 24.3 | 16.7 | 36.1 | 22.2 |
| GPT-4.1 | 33.5 | 45.9 | --- | --- | 37.8 | 17.2 | 27.6 | 6.1 | 35.9 | 30.4 | 45.7 | 28.8 |
| Gemini-2.5 Pro | 45.5 | 64.0 | --- | --- | 52.7 | 9.3 | 36.7 | 8.1 | 47.0 | 29.9 | 69.3 | 37.8 |
| VideoLLaMA3-7B | 27.3 | 36.7 | --- | --- | 30.2 | 5.1 | 26.8 | 1.2 | 30.0 | 19.0 | 34.9 | 24.1 |
| InternVL3-78B | 29.0 | 45.3 | --- | --- | 37.3 | 9.0 | 31.8 | 2.2 | 10.9 | 30.9 | 26.0 | 20.0 |
| Qwen2.5-VL-72B | 36.4 | 54.2 | --- | --- | 44.7 | 11.3 | 24.8 | 7.2 | 27.2 | 22.9 | 83.7 | 27.4 |
| DAM-3B | 15.6 | 22.2 | --- | --- | 18.3 | 2.8 | 14.1 | 1.3 | 28.7 | 6.1 | 18.3 | 12.6 |
| VideoRefer-VL3-7B | 32.9 | 44.1 | --- | --- | 36.3 | 5.8 | 29.0 | 6.1 | 38.1 | 30.7 | 28.8 | 29.3 |
| Sa2VA-4B | 4.9 | 5.9 | 35.3 | 14.8 | 9.4 | 0.0 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| VideoGlaMM-4B | 9.0 | 16.4 | 5.8 | 4.2 | 14.4 | 4.1 | 4.7 | 1.4 | 0.8 | 0.0 | 0.3 | 3.2 |
| RGA3-7B | 10.5 | 15.2 | 32.8 | 23.4 | 17.5 | 0.0 | 5.5 | 6.1 | 1.2 | 0.9 | 0.0 | 3.0 |
| RoboBrain-2.0-32B | 24.2 | 25.1 | --- | --- | 20.7 | 8.8 | 34.1 | 0.2 | 37.2 | 30.4 | 3.6 | 28.0 |
| **RynnEC-2B** | **54.4** | **59.3** | **46.2** | **36.9** | **56.3** | **30.1** | **47.2** | **23.8** | **67.4** | **31.2** | **85.8** | **52.3** |
---
## 1. 物体认知
物体认知分为物体属性认知任务与指代表象分割任务两类。
### 物体属性认知
该子任务用于评估模型识别物体属性的能力,共划分为10个类别:类别、颜色、材质、形状、状态、位置、功能表面细节、尺寸、计数。该子集总计包含10354条经过精心整理的数据条目,所有条目均经过人工标注与核验。
🌟 **数据均衡性设计**:为解决不同家庭场景中物体分布不均导致的评估偏差问题,我们首先通过GPT-4o分析50万张室内图像,构建了真实世界的物体频率分布。随后我们采用基于频率的采样方式,确保基准数据集的数据分布与真实世界一致,并进一步平衡了问题难度,以实现更客观、更贴合实际场景的评估。
#### 数据格式
json
[
{
"video_id": "path to video",
"video": ["frame id"],
"conversations": [{"from": "human", "value": "question"}, {"from": "gpt", "value": "answer"}],
"type": "task type",
"masks": [{"frame id": {"size": [1080, 1920], "counts": "mask rle"}}],
"mask_ids": ["which frame is mask in"],
"timestamps": ["timestamp in video"],
"class_name": ["object class"] # not for counting
}
]
- `masks`:掩码列表,每个字典对应一个目标物体。
- `mask_ids`:掩码对应的视频帧编号。
- 所有分割掩码均采用`RLE`格式。
### 指代表象分割
该子任务用于评估模型实现精准实例分割的能力,任务分为直接指称问题与情境指称问题两类。直接指称问题仅基于实例的描述组合进行识别,而情境指称问题则嵌入于具体场景中,需要多模态大语言模型(MLLMs)通过推理以定位目标物体。所有条目均经过人工标注与核验。
#### 数据格式
json
[
{
"video_id": "path to video",
"video": ["frame id"],
"conversations": [{"from": "human", "value": "question"}],
"type": "task type",
"masks": [{"frame id": {"size": [1080, 1920], "counts": "mask rle"}}],
"mask_ids": ["which frame is mask in"],
"timestamps": ["timestamp in video"],
}
]
## 2. 空间认知
空间认知任务要求多模态大语言模型(MLLMs)处理第一人称视角视频,以构建三维空间认知能力。该子任务分为两大类别:自我中心视角与世界中心视角。自我中心视角认知聚焦于智能体随时间推移与环境的交互关系,而世界中心视角认知则评估模型对客观三维空间布局及世界属性(如尺寸、距离、位置)的理解能力。
json
[
{
"video_id": "path to video",
"video": ["frame id"],
"conversations": [{"from": "human", "value": "question"}, {"from": "gpt", "value": "answer"}],
"type": "task type",
"masks": [{"frame id": {"size": [1080, 1920], "counts": "mask rle"}}],
"mask_ids": ["which frame is mask in"],
"timestamps": ["timestamp in video"],
}
]
## 3. 数据下载
RynnEC-Bench 的数据集与标注文件可通过[此链接](https://huggingface.co/datasets/Alibaba-DAMO-Academy/RynnEC-Bench/tree/main)下载。请首先解压`RynnECBench_data.zip`压缩包。
数据集结构:
bash
RynnEC
└── data
└── RynnEC-Bench
├── object_cognition.json
├── object_segmentation.json
├── spatial_cognition.json
└── data
└── ...(videos)
## 4. 评估
更多评估细节可查阅[RynnEC 官方GitHub仓库](https://github.com/alibaba-damo-academy/RynnEC/tree/main#)。
## 5. RynnEC-Bench-mini 迷你基准集
考虑到RynnEC-Bench数据集规模较大,我们同时提供了其迷你子集RynnEC-Bench-mini,该子集包含2000条物体属性认知数据、2000条空间认知数据以及1000条指代表象分割数据。您可先在RynnEC-Bench-mini上验证模型效果以进行调试,相关数据可参考三个`*-mini.json`文件。
## 6. 引用
若您的研究或应用中使用了RynnEC数据集,请采用以下BibTeX格式进行引用:
bibtex
@misc{dang2025rynnecbringingmllmsembodied,
title={RynnEC: Bringing MLLMs into Embodied World},
author={Ronghao Dang and Yuqian Yuan and Yunxuan Mao and Kehan Li and Jiangpin Liu and Zhikai Wang and Xin Li and Fan Wang and Deli Zhao},
year={2025},
eprint={2508.14160},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.14160},
}
提供机构:
maas
创建时间:
2025-08-07



