CMPhysBench
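CMPhysBench Dataset Overview
Basic Dataset Information
- Name: CMPhysBench
- License: Apache 2.0
- Task category: Question answering
- Language: English
- Tags: Condensed matter physics
- Size: 520 samples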
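Dataset Description
CMPhysBench is a new benchmark for evaluating the capabilities of large language models (LLMs) in condensed matter physics. The dataset contains more than 520 carefully curated graduate-level questions covering representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems.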
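Key Features
- Focuses on calculation problems, requiring LLMs to independently generate complete solutions
- Introduces the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and more accurately assesses the similarity between a prediction and the ground truth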
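To make the idea of non-binary partial credit concrete, the following is a minimal sketch of an edit-distance-based score. It is not the SEED implementation: it assumes answers are plain expression strings, compares them token by token with a standard Levenshtein distance, and maps the result linearly onto a 0 to 100 scale. The function names `tokenize`, `edit_distance`, and `partial_credit` are hypothetical and chosen only for illustration.

```python
import re


def tokenize(expr: str) -> list[str]:
    """Split an expression string into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", expr)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def partial_credit(prediction: str, reference: str) -> float:
    """Illustrative 0-100 score: 100 for token-identical answers, less as they diverge.

    Assumption: a simple linear mapping, NOT the scoring rule used by SEED.
    """
    pred, ref = tokenize(prediction), tokenize(reference)
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 100.0
    dist = edit_distance(pred, ref)
    return 100.0 * (longest - dist) / longest


if __name__ == "__main__":
    # A prediction that drops the trailing "/2" still earns some credit.
    print(partial_credit("hbar*omega", "hbar*omega/2"))
    print(partial_credit("hbar*omega/2", "hbar*omega/2"))
```

A binary grader would give the first prediction zero; a distance-based score separates a near miss from an unrelated answer. A string-level comparison like this one cannot recognize mathematically equivalent expressions written differently, which is one reason a real metric needs more than token matching.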
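Evaluation Results
The best-performing model, Grok-4, achieves an average SEED score of only 36 and an accuracy of 28% on CMPhysBench, indicating a significant capability gap in this frontier domain relative to traditional physics.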
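The two reported numbers measure different things: an average score rewards partially correct answers, while accuracy does not. The sketch below shows how both could be aggregated from per-problem scores. It assumes scores lie on a 0 to 100 scale and that only full-credit answers count as correct; both are assumptions for illustration, not the benchmark's documented procedure. The function `summarize` and the sample scores are hypothetical.

```python
from statistics import mean


def summarize(scores: list[float], full_credit: float = 100.0) -> tuple[float, float]:
    """Return (mean score, accuracy in percent) for a list of per-problem scores.

    Assumption: an answer counts as correct only if it reaches full credit.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n_correct = sum(1 for s in scores if s >= full_credit)
    accuracy = 100.0 * n_correct / len(scores)
    return mean(scores), accuracy


if __name__ == "__main__":
    # Hypothetical per-problem scores, invented for illustration only.
    example_scores = [100.0, 40.0, 0.0, 20.0]
    avg, acc = summarize(example_scores)
    print(f"mean score = {avg}, accuracy = {acc}%")
```

Under these assumptions the mean score is never lower than the accuracy, because partially correct answers add to the mean while counting as wrong for accuracy.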
Related Resources
- Paper: https://arxiv.org/abs/2508.18124
- Code: https://github.com/CMPhysBench/CMPhysBench
- Data: https://huggingface.co/datasets/weidawang/CMPhysBench
- License: https://github.com/CMPhysBench/CMPhysBench/blob/main/LICENSE
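Acknowledgments
CMPhysBench was inspired by earlier dataset efforts, including PHYBench, PHYSICS, GPQA, and OlympiadBench. The SEED scoring method extends and improves on the Expression Edit Distance (EED) metric from PHYBench.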
Citation
@misc{wang2025cmphysbench,
  title={CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics},
  author={Weida Wang and Dongchen Huang and Jiatong Li and Tengchao Yang and Ziyang Zheng and Di Zhang and Dong Han and Benteng Chen and Binzhao Luo and Zhiyu Liu and Kunling Liu and Zhiyuan Gao and Shiqi Geng and Wei Ma and Jiaming Su and Xin Li and Shuchen Pu and Yuhan Shui and Qianjia Cheng and Zhihao Dou and Dongfei Cui and Changyong He and Jin Zeng and Zeke Xie and Mao Su and Dongzhan Zhou and Yuqiang Li and Wanli Ouyang and Yunqi Cai and Xi Dai and Shufei Zhang and Lei Bai and Jinguang Cheng and Zhong Fang and Hongming Weng},
  year={2025},
  eprint={2508.18124},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.18124},
}




