
ChemBench

ModelScope community · Updated 2025-11-27 · Indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/jablonkagroup/ChemBench
Resource description:
# ChemBench

<div align="center">

![ChemBench Logo](CHEMBENCH_LOGO_NO_BACK.png)

[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/ChemBench)
[![Leaderboard](https://img.shields.io/badge/🏆%20Leaderboard-Live-orange)](https://huggingface.co/spaces/jablonkagroup/ChemBench-Leaderboard)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://www.nature.com/articles/s41557-025-01815-x)
[![Code](https://img.shields.io/badge/💻-Code-purple)](https://github.com/lamalab-org/chembench/tree/main)
[![Website](https://img.shields.io/badge/🌐-Website-green)](https://chembench.lamalab.org/)

*A manually curated benchmark for evaluating chemistry and materials capabilities of Large Language Models*

</div>

---

## ⚠️ **IMPORTANT NOTICE - NOT FOR TRAINING**

<div align="center">

### 🚫 **THIS DATASET IS STRICTLY FOR EVALUATION PURPOSES ONLY** 🚫

**DO NOT USE THIS DATASET FOR TRAINING OR FINE-TUNING MODELS**

This benchmark is designed exclusively for **evaluation and testing** of existing models. Using this data for training would compromise the integrity of the benchmark and invalidate evaluation results. Please respect the evaluation-only nature of this dataset to maintain fair and meaningful comparisons across different AI systems.

</div>

---

## 📋 Dataset Summary

ChemBench is a meticulously crafted benchmark designed to assess the chemistry and materials science capabilities of Large Language Models (LLMs). 🧪 This comprehensive evaluation suite spans diverse chemical disciplines and complexity levels, from straightforward multiple-choice questions to sophisticated open-ended reasoning challenges that demand both deep chemical knowledge and advanced reasoning skills.

The benchmark comprises **over 2,700 high-quality questions** manually curated by chemistry and materials science experts.
Each question is designed to test specific aspects of chemical understanding, making ChemBench an invaluable resource for researchers developing and evaluating AI systems in the chemical sciences. 🔬

### 📊 Dataset Statistics

- **🎯 Total Questions**: 2,700+ expertly curated questions
- **👨‍🔬 Expert Curation**: Manually created by chemistry and materials science professionals
- **🌐 Multi-field Coverage**: Spanning 9 major chemistry and materials science domains
- **📈 Complexity Range**: From basic concepts to advanced reasoning challenges
- **⚖️ Question Types**: Multiple-choice and open-ended format questions

## 🗂️ Dataset Configurations

ChemBench encompasses nine major chemistry and materials science domains, each designed to evaluate specific aspects of chemical knowledge and reasoning:

### 🔬 **Analytical Chemistry** (`analytical_chemistry`)

Questions covering spectroscopic techniques, chromatography, mass spectrometry, and other analytical methods essential for chemical analysis and characterization.

### 🧬 **Chemical Preference** (`chemical_preference`)

Human preference-based questions evaluating compounds on oral bioavailability, toxicity profiles, drug-likeness, and other pharmacologically relevant properties.

### ⚗️ **General Chemistry** (`general_chemistry`)

Fundamental chemistry concepts including periodic table properties, chemical bonding theories, stoichiometry, and basic thermodynamics.

### 🔵 **Inorganic Chemistry** (`inorganic_chemistry`)

Properties and reactions of inorganic compounds, coordination chemistry, organometallics, solid-state chemistry, and crystal structures.

### **Materials Science** (`materials_science`)

Material properties and applications covering polymers, ceramics, nanomaterials, composites, and advanced functional materials.

### 🌿 **Organic Chemistry** (`organic_chemistry`)

Organic compound properties, reaction mechanisms, functional group chemistry, synthesis strategies, and stereochemistry.

### ⚡ **Physical Chemistry** (`physical_chemistry`)

Fundamental principles including thermodynamics, kinetics, electrochemistry, quantum chemistry, and statistical mechanics.

### ⚙️ **Technical Chemistry** (`technical_chemistry`)

Practical applications in chemical engineering, process design, industrial chemistry, and chemical manufacturing.

### ⚠️ **Toxicity and Safety** (`toxicity_and_safety`)

Chemical safety assessment, environmental chemistry, toxicology, risk assessment, and regulatory compliance.

## 📜 License

All content is made open-source under the **MIT** license, allowing for:

- ✅ Commercial and non-commercial use
- ✅ Modification and derivatives
- ✅ Distribution and private use
- ⚠️ Attribution required
- ⚠️ No warranty provided

## 📖 Data Fields

The dataset contains comprehensive metadata and content fields designed for robust evaluation and analysis:

### 🔍 **Core Question Data**

- **`canary`** (str): Anti-contamination string to prevent training data leakage
- **`description`** (str): Detailed context and background information for the question
- **`name`** (str): Unique question identifier for easy reference and tracking
- **`uuid`** (str): Universal unique identifier ensuring dataset integrity
- **`subfield`** (str): Specific chemistry/materials subcategory beyond the main configuration

### 📝 **Question Content & Answers**

- **`examples`** (list of dict): Question-answer pairs containing:
  - **`input`** (str): The complete question to be answered
  - **`target`** (str, optional): Expected answer for open-ended questions
  - **`target_scores`** (str, optional): Answer key for multiple-choice questions (1=correct, 0=incorrect)

### 🏷️ **Categorization & Metadata**

- **`keywords`** (list of str): Searchable tags including:
  - **Difficulty levels**: `difficulty-basic`, `difficulty-intermediate`, `difficulty-advanced`
  - **Required skills**: `requires-knowledge`, `requires-reasoning`, `requires-calculation`, `requires-intuition`
  - **Topic-specific keywords** for content discovery

### 📊 **Evaluation Framework**

- **`metrics`** (list of str): Available evaluation metrics:
  - **Multiple-choice**: `multiple_choice_grade`
  - **Open-ended**: `exact_string_match`, `mae` (Mean Absolute Error), `mse` (Mean Squared Error)
- **`preferred_score`** (str): Recommended primary evaluation metric for the question

### 🛠️ **Tool Usage Indicators**

- **`in_humansubset_w_tool`** (bool): Whether question requires computational tools
- **`in_humansubset_wo_tool`** (bool): Whether question can be solved without tools

## 🚀 Usage

### Quick Start with ChemBench Engine

```python
from chembench.evaluate import ChemBenchmark
from chembench.prompter import PrompterBuilder
from chembench.utils import enable_logging
from dotenv import load_dotenv

# Setup environment and logging
load_dotenv(".env")
enable_logging()

# Load the benchmark
benchmark = ChemBenchmark.from_huggingface()

# Configure your model
model = "openai/gpt-4"
prompter = PrompterBuilder.from_model_object(model=model)

# Run evaluation
results = benchmark.bench(prompter)

# Submit results to leaderboard
benchmark.submit(results)
```

For comprehensive documentation and advanced usage patterns, visit our [documentation](https://lamalab-org.github.io/chembench/).

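The field layout described under Data Fields can also be read without the ChemBench engine. The sketch below is a minimal illustration, not the engine's scoring code. The record is made up for the example, the helper names (`difficulty_of`, `parse_target_scores`, `score_numeric`) are hypothetical, and it assumes `target_scores` arrives as a JSON-encoded mapping from answer option to 1 or 0, which the field description suggests but does not confirm.

```python
import json

# Hypothetical record following the documented schema; not a real ChemBench question.
record = {
    "name": "example-question",
    "keywords": ["difficulty-basic", "requires-knowledge", "periodic-table"],
    "preferred_score": "multiple_choice_grade",
    "examples": [
        {
            "input": "Which element has the symbol Na?",
            # Assumption: target_scores is a JSON string mapping option -> 1/0.
            "target_scores": json.dumps({"Sodium": 1, "Nitrogen": 0, "Neon": 0}),
        }
    ],
}


def difficulty_of(keywords):
    """Return the level encoded in a `difficulty-*` keyword, or None if absent."""
    for keyword in keywords:
        if keyword.startswith("difficulty-"):
            return keyword[len("difficulty-"):]
    return None


def parse_target_scores(example):
    """Decode the answer key and return the set of options marked correct (1)."""
    raw = example.get("target_scores")
    if raw is None:
        return set()  # open-ended question: the answer lives in `target`
    scores = json.loads(raw) if isinstance(raw, str) else raw
    return {option for option, score in scores.items() if score == 1}


def score_numeric(predictions, targets):
    """Compute mean absolute error and mean squared error for numeric answers."""
    errors = [float(p) - float(t) for p, t in zip(predictions, targets)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse}


print(difficulty_of(record["keywords"]))
print(parse_target_scores(record["examples"][0]))
print(score_numeric(["2.0", "3.0"], ["1.0", "5.0"]))
```

Grouping questions by `difficulty_of(...)` or by the `requires-*` keywords gives a per-skill breakdown, and the `preferred_score` field indicates which metric to report for each question.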
## 🎯 Use Cases

ChemBench serves multiple purposes in the development and evaluation of AI systems for chemistry:

- **🤖 LLM Evaluation**: Comprehensive assessment of large language models' chemistry knowledge and reasoning capabilities
- **📊 Model Comparison**: Standardized benchmarking for comparing different AI models across chemistry domains
- **Research Development**: Identifying strengths and weaknesses in AI systems to guide future research directions
- **🎓 Educational Assessment**: Evaluating AI tutoring systems and educational tools for chemistry learning
- **🏢 Industry Applications**: Testing AI systems before deployment in pharmaceutical, materials, and chemical industries
- **🧪 Expert Validation**: Comparing AI performance against human chemistry experts and professionals

## ⚠️ Limitations & Considerations

- **🎯 Scope**: Focused on chemistry and materials science; may not cover all specialized subdisciplines
- **📚 Knowledge Cutoff**: Reflects current scientific understanding; new discoveries may not be included
- **🌍 Language**: Primarily English content, limiting multilingual applications
- **⚖️ Complexity Distribution**: While spanning basic to advanced levels, expert-level questions may be limited
- **🔄 Dynamic Field**: Chemistry knowledge evolves rapidly; regular updates recommended
- **👥 Expert Bias**: Reflects perspectives and knowledge of curating experts
- **📊 Evaluation Metrics**: Some nuanced chemical reasoning may not be fully captured by current metrics

## 🛠️ Data Processing Pipeline

ChemBench follows a rigorous curation and validation process:

1. **👨‍🔬 Expert Curation**: Questions created by chemistry and materials science professionals
2. **📚 Content Review**: Multi-expert validation of question accuracy and relevance
3. **🏷️ Metadata Assignment**: Comprehensive tagging with keywords, difficulty levels, and skill requirements
4. **⚖️ Quality Control**: Systematic review for clarity, accuracy, and appropriate difficulty distribution
5. **🔧 Format Standardization**: Consistent JSON structure across all chemistry domains
6. **✅ Validation Testing**: Pilot testing with human experts to ensure question quality
7. **📊 Statistical Analysis**: Distribution analysis to ensure balanced representation across topics

## 🏗️ ChemBench Framework

This dataset is designed to work seamlessly with the **ChemBench evaluation engine**, providing:

### 🚀 **Key Features**

- **🔄 Automated Evaluation**: Streamlined assessment pipeline for various model types
- **📈 Leaderboard Integration**: Direct submission to public performance leaderboards
- **🛠️ Tool Integration**: Support for models with and without computational tool access
- **📊 Comprehensive Metrics**: Multiple evaluation approaches for different question types
- **🌐 Community Driven**: Open-source framework encouraging community contributions

### 💡 **Flexibility**

While optimized for the ChemBench engine, the dataset can be adapted for use with any benchmarking framework, making it accessible to the broader AI research community.

## 📄 Citation

If you use ChemBench in your research, please cite:

```bibtex
@article{Mirza2025,
  title = {A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists},
  ISSN = {1755-4349},
  url = {http://dx.doi.org/10.1038/s41557-025-01815-x},
  DOI = {10.1038/s41557-025-01815-x},
  journal = {Nature Chemistry},
  publisher = {Springer Science and Business Media LLC},
  author = {Mirza, Adrian and Alampara, Nawaf and Kunchapu, Sreekanth and Ríos-García, Martiño and Emoekabu, Benedict and Krishnan, Aswanth and Gupta, Tanya and Schilling-Wilhelmi, Mara and Okereke, Macjonathan and Aneesh, Anagha and Asgari, Mehrdad and Eberhardt, Juliane and Elahi, Amir Mohammad and Elbeheiry, Hani M. and Gil, María Victoria and Glaubitz, Christina and Greiner, Maximilian and Holick, Caroline T. and Hoffmann, Tim and Ibrahim, Abdelrahman and Klepsch, Lea C. and K\"{o}ster, Yannik and Kreth, Fabian Alexander and Meyer, Jakob and Miret, Santiago and Peschel, Jan Matthias and Ringleb, Michael and Roesner, Nicole C. and Schreiber, Johanna and Schubert, Ulrich S. and Stafast, Leanne M. and Wonanke, A. D. Dinga and Pieler, Michael and Schwaller, Philippe and Jablonka, Kevin Maik},
  year = {2025},
  month = may
}
```

## 👥 Contact & Support

- **📄 Paper**: [Nature Chemistry Publication](https://www.nature.com/articles/s41557-025-01815-x)
- **🌐 Website**: [ChemBench Project](https://chembench.lamalab.org/)
- **🤗 Dataset**: [Hugging Face](https://huggingface.co/datasets/jablonkagroup/ChemBench)
- **🏆 Leaderboard**: [Model Performance Rankings](https://huggingface.co/spaces/jablonkagroup/ChemBench-Leaderboard)
- **💻 Code**: [GitHub Repository](https://github.com/lamalab-org/chembench/tree/main)
- **📚 Documentation**: [Full Documentation](https://lamalab-org.github.io/chembench/)
- **❓ Issues**: Report problems or ask questions via the Hugging Face dataset page or GitHub repository

---

<div align="center">

![ChemBench Logo](CHEMBENCH_LOGO_NO_BACK.png)

<i>Advancing the evaluation of AI systems in chemistry and materials science</i>

</div>

Provider:
maas
Created:
2025-05-29