MaCBench-Ablations

Name: MaCBench-Ablations
Creator: maas
Published: 2025-10-09 16:35:51
License: No description provided

ModelScope community. Updated 2025-10-09; indexed 2025-08-16.

Download link:

https://modelscope.cn/datasets/jablonkagroup/MaCBench-Ablations

Description:

# MaCBench-Ablations

![MaCBench Logo](MacBench_logo.png)

[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/MaCBench)
[![Leaderboard](https://img.shields.io/badge/🏆%20Leaderboard-Live-orange)](https://huggingface.co/spaces/jablonkagroup/MaCBench-Leaderboard)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2411.16955)
[![Code](https://img.shields.io/badge/💻-Code-purple)](https://github.com/lamalab-org/chembench/tree/main)
[![Website](https://img.shields.io/badge/🌐-Website-green)](https://macbench.lamalab.org/)

*A Chemistry and Materials Benchmark for evaluating Vision Large Language Models*

---

## ⚠️ **IMPORTANT NOTICE - NOT FOR TRAINING**

### 🚫 **THIS DATASET IS STRICTLY FOR EVALUATION PURPOSES ONLY** 🚫

**DO NOT USE THIS DATASET FOR TRAINING OR FINE-TUNING MODELS**

This benchmark is designed exclusively for **evaluation and testing** of existing models. Using this data for training would compromise the integrity of the benchmark and invalidate evaluation results. Please respect the evaluation-only nature of this dataset to maintain fair and meaningful comparisons across different AI systems.

---

**MaCBench-Ablations** is the ablation dataset for MaCBench. This data is the result of systematically removing or altering specific components of the original MaCBench dataset to evaluate the impact on model performance. Thus the **MaCBench-Ablations** dataset serves as a valuable tool for understanding the contributions of different data modalities and question types in the evaluation of LLMs.
As a result, the dataset comprises 1153 multimodal questions about chemistry, each designed to probe specific aspects of model performance.

👨‍🔬 **MaCBench** is a benchmark designed to evaluate the chemistry and materials **multimodal capabilities** of Large Language Models (LLMs).

🔬 The questions in the corpus span different fields of chemistry and materials science and cover various levels of complexity, from simple multiple-choice questions (MCQs) to open-ended, reasoning-based questions.

All the content is open-source under the **MIT license**, allowing for:

- ✅ Commercial and non-commercial use
- ✅ Modification and derivatives
- ✅ Distribution and private use
- ⚠️ Attribution required
- ⚠️ No warranty provided

## 📖 Data Fields

The dataset contains the following fields for all the questions:

- `uuid` (str): a unique identifier for the question, which can be used to locate the question in the dataset and to track its performance over time. 🔑
- `image` (image): an image associated with the question. It gives the question a visual representation and supplies context the LLM needs to answer accurately. 🖼️
- `canary` (str): a canary string intended to keep the dataset from leaking into training sets. 🐦
- `name` (str): the name of the question, which can be used to identify the question in the dataset. 📝
- `description` (str): a description of the question, which explains the context and the expected answer. 📄
- `keywords` (list of str): a list of keywords that index the question and make it easier to search for similar questions. 🏷️
- `preferred_score` (str): the preferred metric for the question, used to evaluate LLM performance on it and to compare results across models. For MCQ questions it is `multiple_choice_grade`; for most open-ended questions it is `mae`. 📊
- `metrics` (list of str): a list of metrics that can be used to evaluate the question and to compare results across models. They differ between MCQ and open-ended questions. For MCQ questions, the metric is `multiple_choice_grade`. For open-ended questions, the metrics are `exact_string_match`, `mae`, and `mse`. 📏
- `examples` (list of dict): a list of examples that show the question and the expected answer. Each example contains the following fields:
  - `input` (str): the question to be answered. ❓
  - `target` (str, optional): the expected value for open-ended questions. For multiple-choice questions, this field is empty. 🎯
  - `target_scores` (str, optional): for MCQ questions, the choices, with the correct answer or answers labeled `1` and the incorrect ones labeled `0`. For open-ended questions, this field is empty. ✅
  - `relative_tolerance` (float, optional): the relative tolerance for open-ended questions, i.e. the acceptable deviation from the expected answer. For MCQ questions, this field is empty. ⚖️
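
The field layout above can be illustrated with a minimal sketch that scores a single entry of `examples`. This is not the chembench implementation. The two records, the `score_example` helper, the assumption that `target_scores` holds a JSON-encoded object, the all-or-nothing grading of MCQs, and the way `relative_tolerance` is applied (`|prediction - target| <= tolerance * |target|`) are all hypothetical choices made for illustration.

```python
import json

# Hypothetical records that follow the field layout described above.
mcq_example = {
    "input": "Which piece of glassware is shown in the image?",
    "target": None,
    # Assumption: the string field holds a JSON object mapping choice -> 0/1.
    "target_scores": '{"Burette": 1, "Pipette": 0, "Condenser": 0}',
    "relative_tolerance": None,
}
open_example = {
    "input": "How many peaks does the spectrum in the image contain?",
    "target": "4",
    "target_scores": None,
    "relative_tolerance": 0.05,
}


def score_example(example, answer):
    """Score one answer against one example (illustrative helper).

    For MCQs, `answer` is a list of chosen option strings.
    For open-ended questions, `answer` is a numeric string.
    """
    if example.get("target_scores"):
        scores = json.loads(example["target_scores"])
        correct = {choice for choice, score in scores.items() if score == 1}
        # All-or-nothing: every correct option chosen, and nothing else.
        grade = 1.0 if set(answer) == correct else 0.0
        return {"multiple_choice_grade": grade}

    target = float(example["target"])
    error = abs(float(answer) - target)
    result = {
        "exact_string_match": 1.0 if str(answer).strip() == example["target"].strip() else 0.0,
        "mae": error,       # absolute error of this single example
        "mse": error ** 2,  # squared error of this single example
    }
    tolerance = example.get("relative_tolerance")
    if tolerance is not None:
        # Assumed reading of the field: deviation relative to the target value.
        result["within_tolerance"] = error <= tolerance * abs(target)
    return result
```

Calling `score_example(mcq_example, ["Burette"])` yields a `multiple_choice_grade` of `1.0`, while `score_example(open_example, "4.1")` reports an absolute error of about `0.1` and marks the answer as within tolerance, since 5 % of the target 4 is 0.2. Averaging the per-example `mae` and `mse` values over all questions would give the dataset-level metrics.
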

## 🚀 Usage

```python
# Benchmark runner (import path assumed: chembench.evaluate)
from chembench.evaluate import ChemBenchmark
from chembench.prompter import PrompterBuilder

prompter = PrompterBuilder.from_model_object(
    model="anthropic/claude-3-5-sonnet-20240620",  # model name
    prompt_type="multimodal_instruction",  # multimodal instruction prompt type
)

# Load the ablation questions from the Hugging Face Hub
benchmark = ChemBenchmark.from_huggingface("jablonkagroup/MaCBench-Ablations")

# Run every question through the prompter and collect the results
results = benchmark.bench(
    prompter,
)
```

For more in-depth usage, please refer to the [documentation](https://lamalab-org.github.io/chembench/). 📚

## 📄 Citation

If you use MaCBench in your research, please cite:

```bibtex
@article{alampara2024probing,
  title   = {Probing the limitations of multimodal language models for chemistry and materials research},
  author  = {Nawaf Alampara and Mara Schilling-Wilhelmi and Martiño Ríos-García and Indrajeet Mandal and Pranav Khetarpal and Hargun Singh Grover and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2024},
  journal = {arXiv preprint arXiv: 2411.16955}
}
```

## 👥 Contact & Support

- **📄 Paper**: [arXiv Publication](https://arxiv.org/abs/2411.16955)
- **🌐 Website**: [MaCBench Project](https://macbench.lamalab.org/)
- **🤗 Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench)
- **🤗 Ablation Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench-Ablations)
- **🤗 Prompt Ablation Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench-Prompt-Ablations)
- **🏆 Leaderboard**: [Model Performance Rankings](https://huggingface.co/spaces/jablonkagroup/MaCBench-Leaderboard)
- **💻 Code**: [GitHub Repository](https://github.com/lamalab-org/macbench)
- **📚 Documentation**: [Full Documentation for the Benchmark Runtime](https://lamalab-org.github.io/chembench/)
- **❓ Issues**: Report problems or ask questions via the Hugging Face dataset page or GitHub repository

---

![LamaLab logo](png-file.png)

Advancing the evaluation of AI systems in chemistry and materials science


Provider:

maas

Created:

2025-05-28
