MMEVAL/mmevalpro
Source: Hugging Face. Updated: 2024-10-15. Indexed: 2025-11-03.
Download link:
https://hf-mirror.com/datasets/MMEVAL/mmevalpro
Description:
---
language:
- en
- zh
license: cc-by-sa-4.0
task_categories:
- multiple-choice
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: triplet_id
    dtype: int64
  - name: question
    dtype: string
  - name: choices
    sequence: string
  - name: answer
    dtype: string
  - name: image
    dtype: image
  - name: source
    dtype: string
  - name: question_category
    dtype: string
  - name: eval_type
    dtype: string
  splits:
  - name: test
    num_bytes: 755169661.25
    num_examples: 6414
  download_size: 252419064
  dataset_size: 755169661.25
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
tags:
- image
---
<h1 align="center">MMEvalPro</h1>
# Dataset Card for MMEvalPro
We create **MMEvalPro** for more accurate and efficient evaluation of Large Multimodal Models. It is designed to avoid Type-I errors through a **trilogy** evaluation pipeline and more rigorous metrics. For each original question taken from an existing benchmark, human annotators add one **perception** question and one **knowledge** anchor question through a meticulous annotation process, so that every original question becomes part of a triplet of three questions.
## Data Format
```json
{
    "index": [int64] The global index of the question,
    "image": [image] A PIL image file,
    "triplet_id": [int64] The global index of the triplet the question belongs to,
    "question": [string] The question text,
    "choices": [list] Choice options for multiple-choice problems,
    "answer": [string] The correct answer for the problem,
    "source": [string] The dataset source of the question, from ['MMMU','ScienceQA','MathVista'],
    "question_category": [string] The sub-category of the question,
    "eval_type": [string] The evaluation type, from ['Origin','Perception','Knowledge']
}
```
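As a rough illustration of the format above, the following sketch checks one record for the documented fields and allowed values. The function name `check_record` and the example record are hypothetical and not part of the dataset or its tooling.

```python
from typing import Any

# Field names and allowed values are taken from the format description above.
REQUIRED_FIELDS = {
    "index", "image", "triplet_id", "question", "choices",
    "answer", "source", "question_category", "eval_type",
}
SOURCES = {"MMMU", "ScienceQA", "MathVista"}
EVAL_TYPES = {"Origin", "Perception", "Knowledge"}


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if record.get("source") not in SOURCES:
        problems.append(f"unexpected source: {record.get('source')!r}")
    if record.get("eval_type") not in EVAL_TYPES:
        problems.append(f"unexpected eval_type: {record.get('eval_type')!r}")
    return problems


# Hypothetical record, for illustration only (not taken from the dataset).
example = {
    "index": 0, "image": None, "triplet_id": 0,
    "question": "Which option is correct?", "choices": ["A", "B"],
    "answer": "A", "source": "MMMU",
    "question_category": "example", "eval_type": "Origin",
}
print(check_record(example))  # []
```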
## Load Dataset
```python
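from datasets import load_dataset

# "../MMEvalPro" points at a local copy of the dataset repository; the Hub ID
# from the download link above ("MMEVAL/mmevalpro") can presumably be passed instead.
dataset = load_dataset("../MMEvalPro")
print(dataset)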
```
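Since every question belongs to a triplet, one sanity check after loading is that each `triplet_id` covers all three evaluation types exactly once. The sketch below shows one way to do this on plain rows; the helper `incomplete_triplets` and the sample rows are hypothetical, and with the real data the rows would have to be built from the `triplet_id` and `eval_type` columns of the test split.

```python
from collections import defaultdict

EVAL_TYPES = {"Origin", "Perception", "Knowledge"}


def incomplete_triplets(rows):
    """Return triplet_ids whose rows do not cover all three eval types exactly once."""
    seen = defaultdict(list)
    for row in rows:
        seen[row["triplet_id"]].append(row["eval_type"])
    return sorted(
        triplet_id
        for triplet_id, types in seen.items()
        if len(types) != 3 or set(types) != EVAL_TYPES
    )


# Hypothetical rows: triplet 1 is complete, triplet 2 lacks its Knowledge question.
rows = [
    {"triplet_id": 1, "eval_type": "Origin"},
    {"triplet_id": 1, "eval_type": "Perception"},
    {"triplet_id": 1, "eval_type": "Knowledge"},
    {"triplet_id": 2, "eval_type": "Origin"},
    {"triplet_id": 2, "eval_type": "Perception"},
]
print(incomplete_triplets(rows))  # [2]
```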
## Automatic Evaluation
🔔 To evaluate a model on the dataset automatically, we provide example code that computes the genuine accuracy, the average accuracy, and further analysis metrics from the model outputs and the ground-truth labels.
The outputs for all questions should be saved in a single JSON file that follows the format of `./demo_model_output.json`:
```json
[
    {
        "index": 0,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Origin"
    },
    {
        "index": 1,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Perception"
    },
    {
        "index": 2,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Knowledge"
    }
    ...
]
```
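To make the two headline metrics concrete, here is a minimal sketch that scores records in the format above. It is not the repository's `auto_score.py`. It assumes that genuine accuracy counts a triplet as correct only when all three of its questions are answered correctly, that average accuracy is the share of correctly answered questions, and that `model_output` is compared with `answer` by exact string match. The function name `score` and the sample records are hypothetical.

```python
from collections import defaultdict


def score(records):
    """Compute average and genuine accuracy (in percent) from output records."""
    # Collect one correctness flag per question, grouped by triplet.
    triplets = defaultdict(list)
    for r in records:
        triplets[r["triplet_id"]].append(r["model_output"] == r["answer"])
    n_questions = sum(len(flags) for flags in triplets.values())
    n_correct = sum(sum(flags) for flags in triplets.values())
    # Assumption: a triplet only counts if every question in it is correct.
    n_genuine = sum(all(flags) for flags in triplets.values())
    return {
        "average_score": round(100 * n_correct / n_questions, 2),
        "genuine_accuracy_score": round(100 * n_genuine / len(triplets), 2),
    }


# Hypothetical outputs: triplet 1 is fully correct, triplet 2 has one wrong answer.
records = [
    {"index": 0, "model_output": "A", "answer": "A", "triplet_id": 1, "eval_type": "Origin"},
    {"index": 1, "model_output": "B", "answer": "B", "triplet_id": 1, "eval_type": "Perception"},
    {"index": 2, "model_output": "C", "answer": "C", "triplet_id": 1, "eval_type": "Knowledge"},
    {"index": 3, "model_output": "A", "answer": "A", "triplet_id": 2, "eval_type": "Origin"},
    {"index": 4, "model_output": "A", "answer": "B", "triplet_id": 2, "eval_type": "Perception"},
    {"index": 5, "model_output": "D", "answer": "D", "triplet_id": 2, "eval_type": "Knowledge"},
]
print(score(records))  # {'average_score': 83.33, 'genuine_accuracy_score': 50.0}
```

The example shows why the two numbers diverge: five of six answers are right, yet only one of the two triplets is answered consistently.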
Then run `./auto_score.py` to compute the scores.
```bash
# --model_output: model output file in JSON format
# --output_path:  path where the result is saved
python auto_score.py \
    --model_output ./demo_model_output.json \
    --output_path ./demo_score.json
```
The resulting score file looks like this:
```json
{
    "MMMU": {
        "genuine_accuracy_score": 18.88,
        "average_score": 54.87,
        "origin_score": 46.61,
        "perception_score": 64.01,
        "knowledge_score": 53.98
    },
    "MathVista": {
        "genuine_accuracy_score": 16.85,
        "average_score": 53.15,
        "origin_score": 57.41,
        "perception_score": 51.11,
        "knowledge_score": 50.93
    },
    "ScienceQA": {
        "genuine_accuracy_score": 49.01,
        "average_score": 77.07,
        "origin_score": 84.27,
        "perception_score": 72.92,
        "knowledge_score": 74.03
    },
    "Macro_Average": {
        "genuine_accuracy_score": 28.25,
        "average_score": 61.7,
        "origin_score": 62.76,
        "perception_score": 62.68,
        "knowledge_score": 59.65
    },
    "Micro_Average": {
        "genuine_accuracy_score": 36.11,
        "average_score": 67.51,
        "origin_score": 71.52,
        "perception_score": 66.0,
        "knowledge_score": 65.01
    }
}
```
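In the example numbers, each `Macro_Average` entry matches the unweighted mean of the three per-source scores. The sketch below reproduces two of them; this is an observation about the example file, not a description of how `auto_score.py` is implemented, and the helper `macro_average` is hypothetical. `Micro_Average` presumably pools all questions across sources, which cannot be recomputed from the score file alone.

```python
# Per-source scores copied from the example score file above.
scores = {
    "MMMU": {"genuine_accuracy_score": 18.88, "average_score": 54.87},
    "MathVista": {"genuine_accuracy_score": 16.85, "average_score": 53.15},
    "ScienceQA": {"genuine_accuracy_score": 49.01, "average_score": 77.07},
}


def macro_average(scores, metric):
    """Unweighted mean of one metric over all sources, rounded to two decimals."""
    values = [per_source[metric] for per_source in scores.values()]
    return round(sum(values) / len(values), 2)


print(macro_average(scores, "genuine_accuracy_score"))  # 28.25
print(macro_average(scores, "average_score"))           # 61.7
```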
## License
The new contributions to our dataset are distributed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license, including
The copyright of the images and the original questions belongs to the authors of MMMU, ScienceQA, and MathVista.
- **Purpose:** The dataset was primarily designed for use as a test set.
- **Commercial Use:** The dataset can be used commercially as a test set, but using it as a training set is prohibited. By accessing or using this dataset, you acknowledge and agree to abide by these terms in conjunction with the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
Provider:
MMEVAL



