Supplementary data for the paper 'Can ChatGPT pass high school exams on English language comprehension?'

4TU.ResearchData | Updated: 2023-09-12 | Indexed: 2026-04-23

Download link:

https://data.4tu.nl/datasets/545f8ead-235a-4eb6-8f32-aebb030dbbad/1

Description:

Launched in late November 2022, ChatGPT, a large language model chatbot, has garnered considerable attention. However, questions remain about its capabilities. In this study, ChatGPT was used to complete national high school exams in the Netherlands on the topic of English reading comprehension. In late December 2022, we submitted the exam questions through the ChatGPT web interface (GPT-3.5). According to official norms, ChatGPT achieved a mean grade of 7.3 on the Dutch scale of 1 to 10, comparable to the mean grade of 6.99 among all students who took the exam in the Netherlands. However, ChatGPT occasionally required re-prompting to arrive at an explicit answer; without these nudges, the overall grade was 6.5. In March 2023, API access was made available, and a new version of ChatGPT, GPT-4, was released. We submitted the same exams to the API, and GPT-4 achieved a score of 8.3 without any need for re-prompting. Additionally, a bootstrapping method that introduced randomness through ChatGPT’s ‘temperature’ parameter proved effective in self-identifying potentially incorrect answers. Finally, a re-assessment conducted with the GPT-4 model updated as of June 2023 showed no substantial change in the overall score. The present findings highlight significant opportunities but also raise concerns about the impact of ChatGPT and similar large language models on educational assessment.
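
The idea of using repeated, randomized responses to spot potentially incorrect answers can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flag_uncertain`, the 0.8 agreement threshold, and the sample data are all hypothetical. The sketch assumes each question has already been answered several times at a non-zero temperature, and it flags questions where the sampled answers disagree.

```python
from collections import Counter


def flag_uncertain(samples_per_question, min_agreement=0.8):
    """Summarize repeated answers per question.

    samples_per_question maps a question id to a list of sampled answers.
    Returns {question_id: (majority_answer, agreement, flagged)}.
    The 0.8 default threshold is an illustrative assumption.
    """
    report = {}
    for qid, samples in samples_per_question.items():
        if not samples:
            raise ValueError(f"no samples for question {qid!r}")
        # Normalize so that "a" and " A " count as the same answer.
        counts = Counter(s.strip().upper() for s in samples)
        answer, votes = counts.most_common(1)[0]
        agreement = votes / len(samples)
        # Low agreement across samples marks the answer as suspect.
        report[qid] = (answer, agreement, agreement < min_agreement)
    return report


# Hypothetical data: five sampled answers for each of two questions.
samples = {
    "Q1": ["A", "A", "A", "A", "A"],
    "Q2": ["B", "C", "B", "D", "C"],
}
report = flag_uncertain(samples)
for qid, (answer, agreement, flagged) in report.items():
    print(qid, answer, f"{agreement:.2f}", "CHECK" if flagged else "ok")
```

In this toy data the answers to Q1 are unanimous, while those to Q2 are split and would be flagged for review. Majority agreement is only one possible consistency measure; the paper and its supplementary files describe the procedure actually used.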

Created:

2023-09-12