PhysReason
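PhysReason Dataset Overview
Dataset Introduction
PhysReason is a comprehensive benchmark dataset for physics reasoning. It contains 1,200 physics problems spanning multiple domains and is designed to evaluate how well models apply physics knowledge and reason with it.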
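Key Features
- Number of problems: 1,200 physics problems
- Problem types: 25% knowledge-based, 75% reasoning-based
- Number of theorems: 147 physics theorems
- Problems with figures: 81% of the problems include a diagram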
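Data Collection Process
- Acquisition: collected from university entrance exams and competitions worldwide
- Standardization: performed with the MinerU framework
- Translation: a two-stage process with expert verification
- Search prevention: easily searchable problems are excluded
- Difficulty classification: based on solving time and theorem complexity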
Comparison with Existing Benchmarks
<table>
<tr> <th>Benchmark</th> <th>Multimodal</th> <th>Size</th> <th>Knowledge</th> <th>Question type</th> <th>Avg. T (question)</th> <th>Step-by-step solution</th> <th>Avg. T (solution)</th> <th>Avg. steps</th> </tr>
<tr> <td>JEEBench</td> <td>❌</td> <td>123</td> <td>CEE</td> <td>OE,MC</td> <td>169.7</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>MMLU-Pro</td> <td>❌</td> <td>1299</td> <td>COL</td> <td>MC</td> <td>52.1</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>GPQA</td> <td>❌</td> <td>227</td> <td>PH.D.</td> <td>OE</td> <td>111.4</td> <td>❌</td> <td>197.2</td> <td>3.6</td> </tr>
<tr> <td>SciEval</td> <td>❌</td> <td>1657</td> <td>-</td> <td>OE,MC</td> <td>154.5</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>SciBench</td> <td>✅</td> <td>295</td> <td>COL</td> <td>OE</td> <td>80.5</td> <td>❌</td> <td>315.9</td> <td>2.8</td> </tr>
<tr> <td>MMMU</td> <td>✅</td> <td>443</td> <td>COL</td> <td>OE,MC</td> <td>53.8</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>ScienceQA</td> <td>✅</td> <td>617</td> <td>K1-K12</td> <td>MC</td> <td>13.3</td> <td>❌</td> <td>63.0</td> <td>2.4</td> </tr>
<tr> <td>OlympiadBench</td> <td>✅</td> <td>2334</td> <td>COMP</td> <td>OE</td> <td>222.0</td> <td>❌</td> <td>199.8</td> <td>3.7</td> </tr>
<tr> <td>EMMA</td> <td>✅</td> <td>156</td> <td>-</td> <td>MC</td> <td>109.5</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Ours-Knowledge</td> <td>✅</td> <td>300</td> <td>CEE+COMP</td> <td>OE</td> <td>163.7</td> <td>✅</td> <td>196.5</td> <td>3.3</td> </tr>
<tr> <td>Ours-Easy</td> <td>✅</td> <td>300</td> <td>CEE+COMP</td> <td>OE</td> <td>171.2</td> <td>✅</td> <td>241.5</td> <td>5.0</td> </tr>
<tr> <td>Ours-Medium</td> <td>✅</td> <td>300</td> <td>CEE+COMP</td> <td>OE</td> <td>229.2</td> <td>✅</td> <td>391.3</td> <td>8.4</td> </tr>
<tr> <td>Ours-Hard</td> <td>✅</td> <td>300</td> <td>CEE+COMP</td> <td>OE</td> <td>340.9</td> <td>✅</td> <td>936.1</td> <td>15.6</td> </tr>
<tr style="background-color: #f8f9fa;"> <td>Ours-Full</td> <td>✅</td> <td>1200</td> <td>CEE+COMP</td> <td>OE</td> <td>226.3</td> <td>✅</td> <td>441.3</td> <td>8.1</td> </tr>
</table>
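The Ours-Full row can be read as an aggregate of the four 300-problem subsets. The following is a minimal sanity check, not part of the dataset tooling: it assumes the three averaged columns of Ours-Full are plain means over the equally sized subsets, and the variable and function names are hypothetical.

```python
# Averaged columns copied from the comparison table, per 300-problem subset:
# (question average, solution average, average steps).
subsets = {
    "Knowledge": (163.7, 196.5, 3.3),
    "Easy": (171.2, 241.5, 5.0),
    "Medium": (229.2, 391.3, 8.4),
    "Hard": (340.9, 936.1, 15.6),
}
reported_full = (226.3, 441.3, 8.1)  # the Ours-Full row


def column_means(rows):
    """Unweighted mean of each column; valid here because every subset has 300 problems."""
    columns = zip(*rows.values())
    return tuple(sum(column) / len(rows) for column in columns)


means = column_means(subsets)
for mean, reported in zip(means, reported_full):
    # The table is rounded to one decimal, so allow a small tolerance.
    print(f"computed {mean:.3f} vs reported {reported}")
```

Under that assumption the computed means agree with the reported Ours-Full values to within the table's one-decimal rounding.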
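Evaluation Framework
- PSAS-A (answer-level evaluation): scores a response on the basis of its sub-question answers. It extracts the answers from the model's reasoning process, verifies their semantic consistency, and computes the score by weighting each sub-question according to the length of its solution steps.
- PSAS-S (step-level evaluation): provides a detailed step-by-step assessment in four stages: data extraction, scoring, first-error-step detection, and error analysis. It identifies where the model first departs from the correct reasoning path and classifies the error type.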
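The two scoring ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's implementation: the text does not give the exact weighting formula, so the sketch assumes each sub-question's weight is proportional to the number of steps in its reference solution, and it assumes per-step correctness labels are already available. `SubQuestion`, `psas_a_score`, and `first_error_step` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SubQuestion:
    """One sub-question of a problem (hypothetical structure)."""
    num_solution_steps: int  # length of the reference solution for this part
    answer_correct: bool     # whether the extracted answer matched semantically


def psas_a_score(sub_questions: Sequence[SubQuestion]) -> float:
    """Answer-level score: correct sub-questions weighted by solution length (assumed)."""
    total = sum(sq.num_solution_steps for sq in sub_questions)
    if total == 0:
        return 0.0
    earned = sum(sq.num_solution_steps for sq in sub_questions if sq.answer_correct)
    return earned / total


def first_error_step(step_is_correct: Sequence[bool]) -> Optional[int]:
    """Step-level helper: 1-based index of the first incorrect step, or None if all pass."""
    for index, ok in enumerate(step_is_correct, start=1):
        if not ok:
            return index
    return None


parts = [SubQuestion(2, True), SubQuestion(6, False)]
print(psas_a_score(parts))                            # 0.25
print(first_error_step([True, True, False, True]))    # 3
```

With this weighting, a model that solves only the short first part of a long problem earns a small share of the score, which matches the intent of weighting by solution length.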
Experimental Results
<table>
<tr> <th>Model</th> <th>Input</th> <th>Knowledge</th> <th>Easy</th> <th>Medium</th> <th>Hard</th> <th>Avg.</th> </tr>
<tr> <td colspan="7" style="background-color: #f8f9fa; font-weight: bold;">Non-O-like models</td> </tr>
<tr> <td>Qwen2VL-72B</td> <td>Q, I</td> <td>41.92/62.47</td> <td>24.04/45.26</td> <td>15.97/36.13</td> <td>4.83/24.23</td> <td>16.96/42.88</td> </tr>
<tr> <td>InternVL2.5-78B</td> <td>Q, I</td> <td>28.34/64.71</td> <td>24.16/50.69</td> <td>17.72/38.56</td> <td>9.71/25.95</td> <td>19.98/45.89</td> </tr>
<tr> <td>GPT-4o</td> <td>Q, I</td> <td>50.71/65.82</td> <td>33.87/51.98</td> <td>22.73/42.36</td> <td>11.03/24.71</td> <td>29.58/47.23</td> </tr>
<tr> <td>Deepseek-V3-671B</td> <td>Q, IC</td> <td>55.86/66.14</td> <td>40.06/52.77</td> <td>26.63/44.02</td> <td>13.73/26.87</td> <td>34.07/48.42</td> </tr>
<tr> <td>Claude-3.5-Sonnet</td> <td>Q, I</td> <td>54.14/66.45</td> <td>41.35/55.85</td> <td>28.14/44.86</td> <td>15.11/28.51</td> <td>34.69/49.88</td> </tr>
<tr> <td>Gemini-2.0-Flash</td> <td>Q, I</td> <td>65.08/75.04</td> <td>54.84/68.60</td> <td>39.79/55.67</td> <td>21.99/38.39</td> <td>45.20/60.40</td> </tr>
<tr> <td>Gemini-2.0-Pro</td> <td>Q, I</td> <td>67.99/79.01</td> <td>55.43/71.47</td> <td>44.29/57.74</td> <td>23.81/42.66</td> <td>47.88/62.74</td> </tr>
<tr> <td colspan="7" style="background-color: #f8f9fa; font-weight: bold;">O-like models</td> </tr>
<tr> <td>o1-mini</td> <td>Q, IC</td> <td>53.90/65.74</td> <td>35.21/52.26</td> <td>22.24/40.19</td> <td>10.61/26.80</td> <td>30.49/47.18</td> </tr>
<tr> <td>QvQ-72B</td> <td>Q, I</td> <td>62.44/70.92</td> <td>53.74/64.65</td> <td>28.18/54.88</td> <td>14.30/36.47</td> <td>32.67/57.66</td> </tr>
</table>




