
MMEVAL/mmevalpro

Hugging Face, updated 2024-10-15, indexed 2025-11-03
Download link:
https://hf-mirror.com/datasets/MMEVAL/mmevalpro
Resource description:
---
language:
- en
- zh
license: cc-by-sa-4.0
task_categories:
- multiple-choice
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: triplet_id
    dtype: int64
  - name: question
    dtype: string
  - name: choices
    sequence: string
  - name: answer
    dtype: string
  - name: image
    dtype: image
  - name: source
    dtype: string
  - name: question_category
    dtype: string
  - name: eval_type
    dtype: string
  splits:
  - name: test
    num_bytes: 755169661.25
    num_examples: 6414
  download_size: 252419064
  dataset_size: 755169661.25
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
tags:
- image
---

<h1 align="center">MMEvalPro</h1>

# Dataset Card for MMEvalPro

We create **MMEvalPro** for more accurate and efficient evaluation of Large Multimodal Models. It is designed to avoid Type-I errors through a **trilogy** evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one **perception** question and one **knowledge** anchor question through a meticulous annotation process.

## Data Format

```json
{
    "index": [int64] The global index of the question text,
    "image": [image] A PIL image file,
    "triplet_id": [int64] The global index of the triplet the question belongs to,
    "question": [string] The question text,
    "choices": [list] Choice options for multiple-choice problems,
    "answer": [string] The correct answer for the problem,
    "source": [string] The dataset source of the question, from ['MMMU','ScienceQA','MathVista'],
    "question_category": [string] The sub-category of the question,
    "eval_type": [string] The evaluation type, from ['Origin','Perception','Knowledge']
}
```

## Load Dataset

```python
from datasets import load_dataset

# "../MMEvalPro" is a path relative to the working directory; adjust it to where the dataset lives.
dataset = load_dataset("../MMEvalPro")
print(dataset)
```

## Automatic Evaluation

🔔 To automatically evaluate a model on the dataset and compute the genuine accuracy, the average accuracy and the other analysis metrics, we provide example code that computes the scores from the model outputs and the ground-truth labels.

The output for all questions should be saved in a JSON file that follows the format of `./demo_model_output.json`:

```json
[
    {
        "index": 0,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Origin"
    },
    {
        "index": 1,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Perception"
    },
    {
        "index": 2,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Knowledge"
    }
    ...
]
```

Then you can run `./auto_score.py` to get the scores.

```bash
# --model_output: model output file in JSON format
# --output_path:  path to save the result
python auto_score.py \
    --model_output ./demo_model_output.json \
    --output_path ./demo_score.json
```

The overall score file looks like this:

```json
{
    "MMMU": {
        "genuine_accuracy_score": 18.88,
        "average_score": 54.87,
        "origin_score": 46.61,
        "perception_score": 64.01,
        "knowledge_score": 53.98
    },
    "MathVista": {
        "genuine_accuracy_score": 16.85,
        "average_score": 53.15,
        "origin_score": 57.41,
        "perception_score": 51.11,
        "knowledge_score": 50.93
    },
    "ScienceQA": {
        "genuine_accuracy_score": 49.01,
        "average_score": 77.07,
        "origin_score": 84.27,
        "perception_score": 72.92,
        "knowledge_score": 74.03
    },
    "Macro_Average": {
        "genuine_accuracy_score": 28.25,
        "average_score": 61.7,
        "origin_score": 62.76,
        "perception_score": 62.68,
        "knowledge_score": 59.65
    },
    "Micro_Average": {
        "genuine_accuracy_score": 36.11,
        "average_score": 67.51,
        "origin_score": 71.52,
        "perception_score": 66.0,
        "knowledge_score": 65.01
    }
}
```

## License

The new contributions to our dataset are distributed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license. The copyright of the images and the original questions belongs to the authors of MMMU, ScienceQA and MathVista.

- **Purpose:** The dataset was primarily designed for use as a test set.
- **Commercial Use:** The dataset can be used commercially as a test set, but using it as a training set is prohibited.

By accessing or using this dataset, you acknowledge and agree to abide by these terms in conjunction with the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
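
For readers who want to see how the scores in the Automatic Evaluation section relate to the model output file, here is a minimal sketch. It is not the contents of `auto_score.py`. It assumes that genuine accuracy counts a triplet as correct only when all of its questions are answered correctly, that a prediction is correct when `model_output` equals `answer` after trimming whitespace, and that the average score pools all questions. The function name `score` and the sample `records` are hypothetical.

```python
from collections import defaultdict


def score(records):
    """Return per-type, average and triplet-level accuracy in percent.

    Assumption: a triplet counts toward genuine accuracy only if every
    question sharing its triplet_id is answered correctly.
    """
    per_type = defaultdict(lambda: [0, 0])  # eval_type -> [correct, total]
    triplets = defaultdict(list)            # triplet_id -> correctness flags

    for r in records:
        correct = r["model_output"].strip() == r["answer"].strip()
        per_type[r["eval_type"]][0] += int(correct)
        per_type[r["eval_type"]][1] += 1
        triplets[r["triplet_id"]].append(correct)

    # One score per evaluation type, e.g. "origin_score".
    result = {f"{t.lower()}_score": 100 * c / n for t, (c, n) in per_type.items()}

    # Accuracy over all questions, regardless of type.
    total = sum(n for _, n in per_type.values())
    result["average_score"] = 100 * sum(c for c, _ in per_type.values()) / total

    # Share of triplets in which every question is correct.
    solved = sum(all(flags) for flags in triplets.values())
    result["genuine_accuracy_score"] = 100 * solved / len(triplets)
    return result


# Hypothetical records in the same shape as demo_model_output.json.
records = [
    {"index": 0, "model_output": "A", "answer": "A", "triplet_id": 1, "eval_type": "Origin"},
    {"index": 1, "model_output": "B", "answer": "B", "triplet_id": 1, "eval_type": "Perception"},
    {"index": 2, "model_output": "C", "answer": "C", "triplet_id": 1, "eval_type": "Knowledge"},
    {"index": 3, "model_output": "A", "answer": "A", "triplet_id": 2, "eval_type": "Origin"},
    {"index": 4, "model_output": "A", "answer": "B", "triplet_id": 2, "eval_type": "Perception"},
    {"index": 5, "model_output": "D", "answer": "D", "triplet_id": 2, "eval_type": "Knowledge"},
]
result = score(records)
print(result)
```

In this toy input the second triplet has one wrong perception answer, so it contributes to the per-type and average scores but not to the triplet-level score. Under the stated assumption, this is why a genuine accuracy value can sit well below the average accuracy, as in the score file shown earlier.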

Provider:
MMEVAL