MaCBench
Source: ModelScope community. Updated 2025-12-05; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/jablonkagroup/MaCBench
Resource description:
# MaCBench
<div align="center">

[Hugging Face dataset](https://huggingface.co/datasets/jablonkagroup/MaCBench)
[Leaderboard](https://huggingface.co/spaces/jablonkagroup/MaCBench-Leaderboard)
[MIT license](https://opensource.org/licenses/MIT)
[arXiv:2411.16955](https://arxiv.org/abs/2411.16955)
[ChemBench on GitHub](https://github.com/lamalab-org/chembench/tree/main)
[Project website](https://macbench.lamalab.org/)
*A Chemistry and Materials Benchmark for evaluating Vision Large Language Models*
</div>
---
## ⚠️ **IMPORTANT NOTICE - NOT FOR TRAINING**
<div align="center">
### 🚫 **THIS DATASET IS STRICTLY FOR EVALUATION PURPOSES ONLY** 🚫
**DO NOT USE THIS DATASET FOR TRAINING OR FINE-TUNING MODELS**
This benchmark is designed exclusively for **evaluation and testing** of existing models. Using this data for training would compromise the integrity of the benchmark and invalidate evaluation results. Please respect the evaluation-only nature of this dataset to maintain fair and meaningful comparisons across different AI systems.
</div>
---
MaCBench is a benchmark designed to evaluate the chemistry and materials **multimodal capabilities** of Large Language Models (LLMs). 🔬 The questions in the corpus span different fields of chemistry and materials science, capturing various levels of complexity, from simple multiple-choice questions (MCQs) to open-ended reasoning-based questions.
The dataset comprises **over 1100 high-quality questions** manually curated by chemistry and materials experts. 👨‍🔬 Every question in the corpus contains a **text description and an image**, and the image must be analyzed to answer the question correctly. The benchmark is designed to be used with the [ChemBench engine](https://github.com/lamalab-org/chembench/tree/main), but it can also be used with any other benchmarking framework. ⚡
All the content is made open-source under the **MIT license**, allowing for:
- ✅ Commercial and non-commercial use
- ✅ Modification and derivatives
- ✅ Distribution and private use
- ⚠️ Attribution required
- ⚠️ No warranty provided
## 🗂️ Subsets of the Dataset
MaCBench is organized into distinct categories that comprehensively evaluate different aspects of chemistry and materials science. The dataset contains **35 configs** spanning over **1100 questions** across the following major areas:
### 📊 Data Extraction
**Hand-drawn Molecules** (29 questions)
- `handdrawn-molecules`: Systematic naming of hand-drawn organic molecules, testing the ability to interpret sketched chemical structures and apply IUPAC nomenclature rules. ✏️
**Organic Chemistry** (81 questions)
- `chirality`: Determination of the number of chiral centers in molecules, including their configuration, spatial orientation, and priority groups according to Cahn-Ingold-Prelog rules. 🔄
- `isomers`: Identification of isomeric relationships between two molecules, including structural, geometric, and optical isomers. 🔀
- `organic-molecules`: Systematic naming of organic molecules following IUPAC nomenclature, testing knowledge of functional groups and naming conventions. 🧬
- `org-schema`: Extraction of components such as solvents, temperature, or yield from organic reaction schemas with SMILES notation. ⚗️
- `org-schema-wo-smiles`: Analysis of organic reaction schemas with visual references for molecule identification without relying on SMILES strings. 🖼️
**Tables and Plots** (407 questions)
- `tables-qa`: Analysis of composition tables, requiring extraction and interpretation of quantitative data from tabular formats. 📋
- `us-patent-figures`: Extraction of information from scientific figures in US patents, testing ability to interpret complex technical diagrams. 📄
- `us-patent-plots`: Interpretation of 2D plots presented in US patents, focusing on data visualization analysis. 📈
### 🔬 In Silico and Lab Experiments
**Lab QA** (80 questions)
- `chem-lab-basic`: Review of images taken in a chemistry lab focusing on safety protocols and proper laboratory practices. 🥽
- `chem-lab-comparison`: Comparison of laboratory images to identify correct practices and violations of good laboratory standards. ⚖️
- `chem-lab-equipments`: Identification and classification of laboratory glassware and other equipment commonly used in chemistry. 🧪
**Crystal Structure Analysis** (209 questions)
- `cif-atomic-species`: Determination of the number of different atomic species from crystal structure images. ⚛️
- `cif-density`: Determination of the density from crystal structure images using crystallographic data. 📏
- `cif-symmetry`: Determination of the point group from crystal structure images, testing symmetry recognition. 🔄
- `cif-volume`: Determination of the volume from crystal structure images using unit cell parameters. 📐
- `cif-crystal-system`: Determination of the crystal system (cubic, tetragonal, etc.) from crystal structure images. 💎
### 📊 Data Interpretation
**AFM Image Analysis** (50 questions)
- `afm-image`: Analysis of topography in various specimens using atomic force microscope images, testing surface characterization skills. 🔍
**Adsorption Isotherm** (155 questions)
- `mof-capacity-comparison`, `mof-capacity-order`, `mof-capacity-value`: Analysis of adsorption capacities in metal-organic frameworks. 🏭
- `mof-henry-constant-comparison`, `mof-henry-constant-order`: Evaluation of Henry's constants from isotherm data. 📊
- `mof-adsorption-strength-comparison`, `mof-adsorption-strength-order`: Assessment of adsorption strength characteristics. 💪
- `mof-working-capacity-comparison`, `mof-working-capacity-order`, `mof-working-capacity-value`: Determination and comparison of working capacities. ⚙️
**Electronic Structure** (24 questions)
- `electronic-structure`: Analysis of the electronic structure of materials, including determination of direct or indirect bandgaps and metallic characteristics. ⚡
**NMR and MS Spectra** (20 questions)
- `spectral-analysis`: Identification of halide atoms using MS isotope patterns and substitution positions on benzene rings using ¹H NMR spectra. 📡
**XRD QA** (80 questions)
- `xrd-pattern-matching`: Determination of crystal type from XRD patterns through phase identification. 🎯
- `xrd-pattern-shape`: Selection of the crystalline or amorphous nature from XRD pattern characteristics. 📊
- `xrd-peak-position`: Determination of the peak position of the most intense peak from XRD patterns. 📍
- `xrd-relative-intensity`: Ordering of the peak positions of the three most intense peaks from XRD patterns. 📊
With these comprehensive subsets, MaCBench provides a thorough evaluation of chemistry and materials science capabilities, covering fundamental concepts through advanced analytical techniques. The benchmark tests both theoretical knowledge and practical application skills essential for modern chemistry and materials research. 🎯
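As a quick sanity check on the numbers above, the per-category question counts can be tallied. The sketch below only copies the counts from the category headings in this section; the names `CATEGORY_COUNTS`, `total_questions`, and `largest_category` are illustrative and not part of the dataset or of ChemBench.

```python
# Question counts copied from the category headings in this README.
CATEGORY_COUNTS = {
    "Hand-drawn Molecules": 29,
    "Organic Chemistry": 81,
    "Tables and Plots": 407,
    "Lab QA": 80,
    "Crystal Structure Analysis": 209,
    "AFM Image Analysis": 50,
    "Adsorption Isotherm": 155,
    "Electronic Structure": 24,
    "NMR and MS Spectra": 20,
    "XRD QA": 80,
}


def total_questions(counts: dict[str, int]) -> int:
    """Sum the per-category question counts."""
    return sum(counts.values())


def largest_category(counts: dict[str, int]) -> str:
    """Return the category with the most questions."""
    return max(counts, key=counts.get)


if __name__ == "__main__":
    print(total_questions(CATEGORY_COUNTS))   # 1135
    print(largest_category(CATEGORY_COUNTS))  # Tables and Plots
```

The listed counts sum to 1135, which is consistent with the "over 1100 questions" stated above.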
## 📖 Data Fields
The dataset contains the following fields for all the questions:
- `uuid` (str): a unique identifier for the question, which can be used to track the question and its performance over time. 🔑
- `image` (image): the image associated with the question, which provides the visual information the model needs to answer it. 🖼️
- `canary` (str): a canary string that helps prevent the dataset from leaking into training sets. 🐦
- `name` (str): the name of the question, which can be used to identify it in the dataset. 📝
- `description` (str): a description of the question, which gives the context and the expected kind of answer. 📄
- `keywords` (list of str): keywords that index the question and make it easier to search for similar questions. 🏷️
- `preferred_score` (str): the preferred metric for scoring the question and for comparing results across models. For MCQs it is `multiple_choice_grade`; for most open-ended questions it is `mae`. 📊
- `metrics` (list of str): the metrics that can be used to evaluate the question. They differ between question types: for MCQs the metric is `multiple_choice_grade`; for open-ended questions the metrics are `exact_string_match`, `mae`, and `mse`. 📏
- `examples` (list of dict): a list of examples that contain the question and the expected answer. Each example has the following fields:
  - `input` (str): the question to be answered. ❓
  - `target` (str, optional): the expected value for open-ended questions. For MCQs this field is empty. 🎯
  - `target_scores` (str, optional): for MCQs, the choices, with the correct answer or answers labeled `1` and the incorrect ones labeled `0`. For open-ended questions this field is empty. ✅
  - `relative_tolerance` (float, optional): the acceptable relative deviation from the expected answer for open-ended questions. For MCQs this field is empty. ⚖️
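To make the scoring-related fields concrete, here is a minimal sketch of how `target_scores`, `target`, and `relative_tolerance` could be used to score answers. It is not ChemBench's implementation. It assumes that `target_scores` is a JSON-encoded mapping from option text to `0`/`1` (the card only says it is a string), it uses a simple all-or-nothing grade as a stand-in for `multiple_choice_grade`, and all function names and example values are hypothetical.

```python
import json


def correct_options(target_scores: str) -> set[str]:
    """Return the options labeled 1 (assumes a JSON mapping option -> 0/1)."""
    scores = json.loads(target_scores)
    return {option for option, score in scores.items() if score == 1}


def all_or_nothing_grade(target_scores: str, chosen: set[str]) -> float:
    """1.0 if the chosen options exactly match the correct ones, else 0.0."""
    return 1.0 if chosen == correct_options(target_scores) else 0.0


def within_tolerance(prediction: float, target: float, relative_tolerance: float) -> bool:
    """Check whether a numeric answer lies within the relative tolerance of the target."""
    return abs(prediction - target) <= relative_tolerance * abs(target)


def mae(predictions: list[float], targets: list[float]) -> float:
    """Mean absolute error over paired predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def mse(predictions: list[float], targets: list[float]) -> float:
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


if __name__ == "__main__":
    # Hypothetical MCQ example: option "2" is the correct one.
    mcq_scores = '{"1": 0, "2": 1, "3": 0}'
    print(all_or_nothing_grade(mcq_scores, {"2"}))

    # Hypothetical open-ended example: target stored as a string.
    target = float("4.5")
    print(within_tolerance(4.6, target, relative_tolerance=0.05))
    print(mae([4.6, 5.0], [target, target]), mse([4.6, 5.0], [target, target]))
```

If the actual encoding of `target_scores` differs from the assumed JSON mapping, only `correct_options` needs to change.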
## 🚀 Usage
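```python
# ChemBenchmark was not imported in the original snippet; the import path
# below is assumed to be chembench.evaluate, as in the ChemBench documentation.
from chembench.evaluate import ChemBenchmark
from chembench.prompter import PrompterBuilder

# Wrap the model under evaluation in a ChemBench prompter.
prompter = PrompterBuilder.from_model_object(
    model="anthropic/claude-3-5-sonnet-20240620",  # model identifier
    prompt_type="multimodal_instruction",  # prompt type for image + text questions
)

# Load MaCBench from the Hugging Face Hub and run the benchmark.
benchmark = ChemBenchmark.from_huggingface("jablonkagroup/MaCBench")
results = benchmark.bench(
    prompter,
)
```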
For more in-depth usage, please refer to the [documentation](https://lamalab-org.github.io/chembench/). 📚
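Since the benchmark can also be used with other frameworks, the following sketch shows one way to read a single config with the Hugging Face `datasets` library and turn an MCQ example into a lettered prompt, for evaluation only. The helper names `iter_rows` and `format_mcq_prompt` are hypothetical, the split names are not assumed (all returned splits are iterated), and `target_scores` is again assumed to be a JSON-encoded mapping from option text to `0`/`1`.

```python
import json


def format_mcq_prompt(question: str, target_scores: str) -> tuple[str, list[str]]:
    """Render a lettered multiple-choice prompt and return it with the correct letters."""
    scores = json.loads(target_scores)  # assumed JSON mapping option -> 0/1
    lines = [question]
    correct = []
    for index, (option, score) in enumerate(scores.items()):
        letter = chr(ord("A") + index)
        lines.append(f"{letter}. {option}")
        if score == 1:
            correct.append(letter)
    return "\n".join(lines), correct


def iter_rows(config_name: str):
    """Yield every row of one MaCBench config (needs `datasets` and network access)."""
    # Imported lazily so that format_mcq_prompt works without `datasets` installed.
    from datasets import load_dataset

    dataset_dict = load_dataset("jablonkagroup/MaCBench", name=config_name)
    for split in dataset_dict.values():
        for row in split:
            # Each row carries the fields described under "Data Fields".
            yield row


if __name__ == "__main__":
    # Hypothetical question text and options, for illustration only.
    prompt, answer = format_mcq_prompt(
        "How many chiral centers does the molecule have?",
        '{"1": 0, "2": 1, "3": 0}',
    )
    print(prompt)
    print(answer)
```

A call such as `iter_rows("chirality")` would stream the rows of that config; the image in each row then has to be passed to the model together with the formatted prompt.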
## 📄 Citation
If you use MaCBench in your research, please cite:
```bibtex
@article{alampara2024probing,
title = {Probing the limitations of multimodal language models for chemistry and materials research},
author = {Nawaf Alampara and Mara Schilling-Wilhelmi and Martiño Ríos-García and Indrajeet Mandal and Pranav Khetarpal and Hargun Singh Grover and N. M. Anoop Krishnan and Kevin Maik Jablonka},
year = {2024},
journal = {arXiv preprint arXiv: 2411.16955}
}
```
## 👥 Contact & Support
- **📄 Paper**: [arXiv Publication](https://arxiv.org/abs/2411.16955)
- **🌐 Website**: [MaCBench Project](https://macbench.lamalab.org/)
- **🤗 Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench)
- **🤗 Ablation Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench-Ablations)
- **🤗 Prompt Ablation Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench-Prompt-Ablations)
- **🏆 Leaderboard**: [Model Performance Rankings](https://huggingface.co/spaces/jablonkagroup/MaCBench-Leaderboard)
- **💻 Code**: [GitHub Repository](https://github.com/lamalab-org/macbench)
- **📚 Documentation**: [Full Documentation for the Benchmark Runtime](https://lamalab-org.github.io/chembench/)
- **❓ Issues**: Report problems or ask questions via the Hugging Face dataset page or GitHub repository
---
<div align="center">

<i>Advancing the evaluation of AI systems in chemistry and materials science</i>
</div>
Provider:
maas
Created:
2025-05-27
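Collected summary
Dataset introduction

Background and challenges
Background overview
MaCBench is a benchmark dataset for evaluating the multimodal capabilities of vision large language models in chemistry and materials science. It is intended for evaluation only and must not be used for training. It contains more than 1100 questions manually curated by experts, each combining a text description with an image, and covers subfields such as data extraction, laboratory experiments, and data analysis, with the aim of testing both the theoretical knowledge and the practical skills of models.
The above content was collected and summarized by 遇见数据集.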



