Supplementary data for the paper 'System 2 thinking in OpenAI’s o1-preview model: Near-perfect performance on a mathematics exam'
4TU.ResearchData | Updated 2024-10-18 | Indexed 2026-04-23
Download link:
https://data.4tu.nl/datasets/2e663686-f656-4ff2-bb21-567ba4d4f03e/2
Description:
The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI’s benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch ‘Mathematics B’ final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff of o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview, which means that sometimes there is ‘luck’ (the answer is correct) or ‘bad luck’ (the output has diverged into something that is incorrect). We demonstrate that a self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI’s new model series holds great potential, certain risks must be considered.
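The self-consistency idea described above (prompt repeatedly, keep the most common answer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `most_common_answer`, the whitespace and case normalization, and the sample answers are all hypothetical and are not taken from the dataset or the paper's own scripts.

```python
from collections import Counter


def most_common_answer(answers):
    """Return (answer, share) for the most frequent answer among repeated runs.

    Hypothetical helper: answers are compared after trimming whitespace and
    lowercasing, which is an assumption made for this sketch.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


# Made-up final answers from five repeated prompts of the same exam question.
runs = ["x = 3", "x = 3", "x = 5", "X = 3 ", "x = 3"]
best, share = most_common_answer(runs)
print(best, share)
```

The returned share gives a rough sense of how strongly the runs agree; a low share might flag a question where the model's output tends to diverge.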
Provided by:
Eisma, Yke Bauke
Created:
2024-10-18



