IDEA-AI4S/ChemO
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/IDEA-AI4S/ChemO
Resource description:
---
license: apache-2.0
size_categories:
- 1K<n<10K
task_categories:
- question-answering
- image-text-to-text
tags:
- chemistry
- agent
- olympiad
- benchmark
- llm-evaluation
- science
- multimodal
language:
- en
---
# 🧪 **ChemO Dataset**
[Hugging Face Papers](https://huggingface.co/papers/2511.16205)
[arXiv:2511.16205](https://arxiv.org/abs/2511.16205)
📄 **Paper**: [ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025](https://huggingface.co/papers/2511.16205)
# ChemO Version 1.1
Now with CDXML Files! 🎉
The ChemO dataset has been officially released after meticulous proofreading and preparation. This benchmark is built from the **International Chemistry Olympiad (IChO) 2025** and represents a new frontier in automated chemical problem-solving.
## 🌟 Key Features
- **🏆 Olympiad-Level Benchmark** - Challenging problems from IChO 2025 for evaluating advanced AI reasoning
- **🔬 Multimodal Symbolic Language** - Addresses chemistry's unique combination of text, formulas, and molecular structures
- **📊 Two Novel Assessment Methods**:
  - **AER (Assessment-Equivalent Reformulation)** - Converts visual output requirements (e.g., drawing molecules) into computationally tractable formats
  - **SVE (Structured Visual Enhancement)** - Diagnostic mechanism to separate visual perception from core chemical reasoning capabilities
## 📦 What's Included
The current release includes:
- ✅ **Original Problems** - Complete problem sets with additional chapter markers for Problems and Solutions sections (no other modifications to the original content)
- ✅ **Well-structured JSON Files** - Clean, organized data designed for:
  - 🤖 **MLLM Benchmarking** - Olympiad-level chemistry reasoning evaluation
  - 🔗 **Multi-Agent System Testing** - Hierarchical agent collaboration assessment
  - 🎯 **Multimodal Reasoning** - Text, formula, and molecular structure understanding
- ✅ **CDXML Files** - Molecular structure files now available in `JSON/cdxml/`
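
CDXML is an XML-based format, so the structure files can be inspected with a standard XML parser. The sketch below is an illustration only: it counts `<n>` (node) and `<b>` (bond) elements, which is how CDXML generally represents atoms and bonds. The sample document is hand-written for this example and is not taken from the dataset, and the helper names `count_nodes_and_bonds` and `count_in_file` are hypothetical. Nodes are not always single atoms (they can also be abbreviations or nested fragments), so treat the counts as a rough sanity check rather than a chemical analysis.

```python
import xml.etree.ElementTree as ET


def count_nodes_and_bonds(root):
    """Count <n> (node) and <b> (bond) elements anywhere under `root`."""
    nodes = len(list(root.iter("n")))
    bonds = len(list(root.iter("b")))
    return nodes, bonds


def count_in_file(path):
    """Parse a CDXML file from disk and count its nodes and bonds."""
    return count_nodes_and_bonds(ET.parse(path).getroot())


# Hand-written sample (an assumption, not a file from JSON/cdxml/):
# two nodes joined by one bond.
SAMPLE = """<CDXML>
  <page>
    <fragment>
      <n id="1" p="100 100"/>
      <n id="2" p="130 100"/>
      <b B="1" E="2"/>
    </fragment>
  </page>
</CDXML>"""

print(count_nodes_and_bonds(ET.fromstring(SAMPLE)))  # (2, 1)
```

For a real file, `count_in_file` can be pointed at any `.cdxml` path under `JSON/cdxml/`; the file names in that directory are not documented here, so list the directory first.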
## 📋 Dataset Structure
The ChemO dataset consists of **9 problems** from IChO 2025, with each problem provided as a structured JSON file (`1.json` through `9.json` in `JSON/`).
```
JSON/
├── 1.json ~ 9.json # Problem and solution data in structured JSON format
├── images/ # All referenced images indexed in JSON files
└── cdxml/ # Molecular structure files in CDXML format
```
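
Given the layout above, a minimal loading sketch might look like the following. It assumes only what the tree shows: numbered `.json` files sitting directly inside a `JSON/` directory. The internal schema of each file is not described in this card, so the code inspects the top-level type and keys instead of assuming field names. The helper names `load_problems` and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def load_problems(json_dir):
    """Load every `<number>.json` file in `json_dir`, keyed by problem number."""
    problems = {}
    for path in Path(json_dir).glob("*.json"):
        if path.stem.isdigit():  # skip any JSON file that is not a numbered problem
            with path.open(encoding="utf-8") as fh:
                problems[int(path.stem)] = json.load(fh)
    return dict(sorted(problems.items()))


def summarize(problems):
    """Return the top-level type (and sorted keys, for objects) of each problem file."""
    summary = {}
    for number, data in problems.items():
        keys = sorted(data) if isinstance(data, dict) else None
        summary[number] = (type(data).__name__, keys)
    return summary
```

With a local copy of the repository, `summarize(load_problems("JSON"))` gives a quick view of what each file contains before any evaluation code is written. One way to obtain a local copy is `huggingface_hub.snapshot_download(repo_id="IDEA-AI4S/ChemO", repo_type="dataset")`, which returns the path of the downloaded snapshot.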
## 📚 Data Source
All problems are sourced from **IChO 2025**: https://www.icho2025.ae/problems
## 🚀 State-of-the-Art Results
Our ChemLabs multi-agent system combined with SVE achieves **93.6/100** on ChemO, surpassing the estimated human gold medal threshold and establishing a new benchmark in automated chemical problem-solving.
## 🤝 Community
We appreciate your patience and look forward to your feedback as we continue to improve this resource for the community. Feel free to reach out to us at jerry.sy.bai@gmail.com.
Future updates will primarily be maintained at the following link: https://huggingface.co/IDEA-AI4S.
## 📄 Citation
If you use ChemO in your research, please cite our paper:
```bibtex
@article{qiang2025chemlabs,
title={ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025},
author={Xu, Qiang and Bai, Shengyuan and Chen, Leqing and Liu, Zijing and Li, Yu},
journal={arXiv preprint arXiv:2511.16205},
year={2025}
}
```
Provider: IDEA-AI4S



