
MaCBench-Prompt-Ablations

ModelScope community, updated 2025-12-05, indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/jablonkagroup/MaCBench-Prompt-Ablations
Description:
# MaCBench-Prompt-Ablations

<div align="center">

![MaCBench Logo](MacBench_logo.png)

[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/MaCBench)
[![Leaderboard](https://img.shields.io/badge/🏆%20Leaderboard-Live-orange)](https://huggingface.co/spaces/jablonkagroup/MaCBench-Leaderboard)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2411.16955)
[![Code](https://img.shields.io/badge/💻-Code-purple)](https://github.com/lamalab-org/chembench/tree/main)
[![Website](https://img.shields.io/badge/🌐-Website-green)](https://macbench.lamalab.org/)

*A Chemistry and Materials Benchmark for evaluating Vision Large Language Models*

</div>

---

## ⚠️ **IMPORTANT NOTICE - NOT FOR TRAINING**

<div align="center">

### 🚫 **THIS DATASET IS STRICTLY FOR EVALUATION PURPOSES ONLY** 🚫

**DO NOT USE THIS DATASET FOR TRAINING OR FINE-TUNING MODELS**

This benchmark is designed exclusively for **evaluation and testing** of existing models. Using this data for training would compromise the integrity of the benchmark and invalidate evaluation results. Please respect the evaluation-only nature of this dataset to maintain fair and meaningful comparisons across different AI systems.

</div>

---

**MaCBench-Prompt-Ablations** is the prompt-ablation dataset for MaCBench. It was built by systematically modifying the prompt content in order to measure the effect on model performance. The dataset therefore serves as a tool for understanding how small changes in prompt design affect model performance and how fragile that performance is.

**MaCBench** is a benchmark designed to evaluate the **multimodal capabilities** of Large Language Models (LLMs) in chemistry and materials science.
🔬 The questions in the corpus span different fields of chemistry and materials science and range in complexity from simple multiple-choice questions (MCQs) to open-ended, reasoning-based questions.

All content is open-source under the **MIT license**, with the following terms:

- ✅ Commercial and non-commercial use
- ✅ Modification and derivatives
- ✅ Distribution and private use
- ⚠️ Attribution required
- ⚠️ No warranty provided

## 📖 Data Fields

Every question in the dataset has the following fields:

- `uuid` (str): a unique identifier for the question, which can be used to look the question up in the dataset and to track its performance over time. 🔑
- `image` (image): an image associated with the question. It provides the visual context the model needs to understand the question and answer it accurately. 🖼️
- `canary` (str): a canary string intended to keep the dataset from leaking into a training set. 🐦
- `name` (str): the name of the question, which can be used to identify the question in the dataset. 📝
- `description` (str): a description of the question that explains its context and the expected answer. 📄
- `keywords` (list of str): keywords used to index the question and to search for similar questions in the dataset. 🏷️
- `preferred_score` (str): the preferred metric for scoring the question, used when comparing results with other models. For MCQ questions it is `multiple_choice_grade`; for most open-ended questions it is `mae`. 📊
- `metrics` (list of str): the metrics that can be used to evaluate the question and to compare results with other models. They differ between MCQ and open-ended questions. For MCQ questions, the metric is `multiple_choice_grade`. For open-ended questions, the metrics are `exact_string_match`, `mae`, and `mse`. 📏
- `examples` (list of dict): a list of examples showing the question and the expected answer. Each example contains the following fields:
  - `input` (str): the question to be answered. ❓
  - `target` (str, optional): the expected value for open-ended questions. For MCQ questions, this field is empty. 🎯
  - `target_scores` (str, optional): for MCQ questions, the choices, with the correct answer or answers labeled `1` and the incorrect ones labeled `0`. For open-ended questions, this field is empty. ✅
  - `relative_tolerance` (float, optional): for open-ended questions, the acceptable relative deviation from the expected answer. For MCQ questions, this field is empty. ⚖️

## 🚀 Usage

```python
from chembench.evaluate import ChemBenchmark
from chembench.prompter import PrompterBuilder

# Build a prompter for the model under evaluation, using the multimodal instruction prompt type.
prompter = PrompterBuilder.from_model_object(
    model="anthropic/claude-3-5-sonnet-20240620",
    prompt_type="multimodal_instruction",
)

# Load the prompt-ablation questions from the Hugging Face Hub.
benchmark = ChemBenchmark.from_huggingface("jablonkagroup/MaCBench-Prompt-Ablations")

# Run the benchmark with the prompter and collect the results.
results = benchmark.bench(
    prompter,
)
```

For more in-depth usage, please refer to the [documentation](https://lamalab-org.github.io/chembench/). 📚
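
The `target_scores`, `target`, and `relative_tolerance` fields described under Data Fields suggest how a single answer might be scored. The following is a minimal sketch of that idea, not ChemBench's actual implementation. It assumes that `target_scores` is a JSON object string mapping each choice to `0` or `1`, and that `relative_tolerance` is a fraction of the target value. The function names and the example records are hypothetical and are not taken from the dataset.

```python
import json
from typing import Sequence


def multiple_choice_grade(target_scores: str, chosen: Sequence[str]) -> float:
    """Return 1.0 when the chosen options are exactly those labeled 1, else 0.0.

    Assumption: `target_scores` is a JSON object string such as '{"A": 0, "B": 1}'.
    """
    labels = json.loads(target_scores)
    correct = {choice for choice, label in labels.items() if label == 1}
    return 1.0 if set(chosen) == correct else 0.0


def within_relative_tolerance(target: float, predicted: float, relative_tolerance: float) -> bool:
    """Check that the prediction deviates from the target by at most the given fraction."""
    return abs(predicted - target) <= relative_tolerance * abs(target)


def mean_absolute_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Average absolute deviation over a set of open-ended questions (`mae`)."""
    return sum(abs(p - t) for t, p in zip(targets, predictions)) / len(targets)


def mean_squared_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Average squared deviation over a set of open-ended questions (`mse`)."""
    return sum((p - t) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical example records, made up for illustration only.
mcq_example = {"input": "Which option is correct?", "target_scores": '{"A": 0, "B": 1, "C": 0}'}
open_example = {"input": "What is the value?", "target": "100.0", "relative_tolerance": 0.05}

print(multiple_choice_grade(mcq_example["target_scores"], ["B"]))
print(
    within_relative_tolerance(
        float(open_example["target"]), 104.0, open_example["relative_tolerance"]
    )
)
```

In this sketch an MCQ answer only counts when the selected set matches the labeled set exactly, which is one plausible reading of "the correct answer or answers"; a partial-credit scheme would be an equally reasonable alternative.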
## 📄 Citation

If you use MaCBench in your research, please cite:

```bibtex
@article{alampara2024probing,
  title   = {Probing the limitations of multimodal language models for chemistry and materials research},
  author  = {Nawaf Alampara and Mara Schilling-Wilhelmi and Martiño Ríos-García and Indrajeet Mandal and Pranav Khetarpal and Hargun Singh Grover and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2024},
  journal = {arXiv preprint arXiv:2411.16955}
}
```

## 👥 Contact & Support

- **📄 Paper**: [arXiv Publication](https://arxiv.org/abs/2411.16955)
- **🌐 Website**: [MaCBench Project](https://macbench.lamalab.org/)
- **🤗 Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench)
- **🤗 Ablation Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench-Ablations)
- **🤗 Prompt Ablation Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/MaCBench-Prompt-Ablations)
- **🏆 Leaderboard**: [Model Performance Rankings](https://huggingface.co/spaces/jablonkagroup/MaCBench-Leaderboard)
- **💻 Code**: [GitHub Repository](https://github.com/lamalab-org/macbench)
- **📚 Documentation**: [Full Documentation for the Benchmark Runtime](https://lamalab-org.github.io/chembench/)
- **❓ Issues**: Report problems or ask questions via the Hugging Face dataset page or GitHub repository

---

<div align="center">

![LamaLab logo](png-file.png)

<i>Advancing the evaluation of AI systems in chemistry and materials science</i>

</div>

Provider:
maas
Created:
2025-05-27