Eureka-Lab/PHYBench
Source: Hugging Face. Updated: 2025-05-16. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/Eureka-Lab/PHYBench
Overview:
PHYBench is a large-scale benchmark dataset for evaluating the physical perception and reasoning capabilities of large language models (LLMs). It contains 500 physics problems: 100 detailed examples with complete solutions, and 400 examples that contain only the question and its tags. The dataset is designed to challenge models on reasoning in real-world scenarios, multi-step reasoning, and symbolic precision. The evaluation metric is the Expression Edit Distance (EED) Score, which measures the similarity between a model-generated answer and the ground-truth answer. The dataset also includes human baseline performance for comparison.
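To give a feel for an edit-distance-based similarity score, here is a minimal toy sketch. It is not the official EED implementation: it assumes a plain token-level Levenshtein distance over the expression string and a simple normalization to the range 0 to 1. The function names (tokenize, edit_distance, toy_similarity) are hypothetical and do not come from the PHYBench code.

```python
import re


def tokenize(expr: str) -> list[str]:
    """Split an expression into identifiers, numbers, and single-character symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", expr)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Classic Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def toy_similarity(predicted: str, reference: str) -> float:
    """Toy score in [0, 1]: 1.0 for identical token sequences (assumed normalization)."""
    pred, ref = tokenize(predicted), tokenize(reference)
    if not pred and not ref:
        return 1.0
    return 1.0 - edit_distance(pred, ref) / max(len(pred), len(ref))


if __name__ == "__main__":
    print(toy_similarity("m*g*h", "m*g*h"))
    print(toy_similarity("2*m*g*h", "m*g*h"))
```

A score of this kind gives partial credit to an answer that is close to the reference rather than treating every mismatch as a complete failure. The token-level comparison here is only a stand-in and ignores algebraic equivalence, so "g*m*h" and "m*g*h" would be scored as different.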
Provided by:
Eureka-Lab



