Supplementary data for the paper: System 2 thinking in OpenAI’s o1-preview model: Near-perfect performance on a mathematics exam
Source: 4TU.ResearchData. Updated: 2025-05-12. Indexed: 2026-04-23.
Download link:
https://data.4tu.nl/datasets/2e663686-f656-4ff2-bb21-567ba4d4f03e/3
Description:
The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI’s benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch ‘Mathematics B’ final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff of o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview, which means that sometimes there is ‘luck’ (the answer is correct) or ‘bad luck’ (the output has diverged into something that is incorrect). We demonstrate that a self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI’s new model series holds great potential, certain risks must be considered.
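The self-consistency idea mentioned in the description (prompt repeatedly, keep the most common answer) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `most_common_answer`, the whitespace and case normalization, and the sample answers are all assumptions made for illustration, and no model is actually queried.

```python
from collections import Counter


def most_common_answer(answers):
    """Return (answer, share) for the most frequent answer among repeated runs.

    Hypothetical helper: normalizing by stripping whitespace and lowercasing
    is an assumption; real exam answers would likely need a more careful
    comparison (e.g. numeric or symbolic equivalence).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    answer, votes = counts.most_common(1)[0]
    # share = fraction of runs that agreed with the winning answer
    return answer, votes / len(normalized)


# Made-up final answers from five repeated prompts of the same question.
samples = ["x = 3", "x = 3", "x = -3", "X = 3 ", "x = 3"]
answer, share = most_common_answer(samples)
print(answer, share)
```

A low agreement share could serve as a signal that the question deserves manual checking, since the description notes that individual runs sometimes diverge into an incorrect answer.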
Provider:
Eisma, Yke Bauke
Created:
2025-05-12



