
A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course

Source: DataCite Commons. Updated 2024-04-23; indexed 2024-08-19.
Download link:
https://figshare.com/articles/dataset/A_comparison_of_Human_GPT-3_5_and_GPT-4_Performance_in_a_University-Level_Coding_Course/25673799/1
Description:
Data from "A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course". This study evaluates the performance of two ChatGPT variants, GPT-3.5 and GPT-4, each with and without prompt engineering, against student-only work and a mixed category containing both student and GPT-4 contributions, in university-level physics coding assignments written in Python. We compared 50 student submissions with 50 AI-generated submissions across different categories; each was marked blindly by three independent markers, yielding n = 300 data points. Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = 2.482 × 10^-10). Prompt engineering significantly improved scores for both GPT-4 (p = 1.661 × 10^-4) and GPT-3.5 (p = 4.967 × 10^-9). Additionally, the blinded markers were asked to guess the authorship of each submission on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They identified authorship accurately: 92.1% of the work categorized as 'Definitely Human' was human-authored. Simplifying this to a binary 'AI' or 'Human' categorization gave an average accuracy rate of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.
Provider:
figshare
Created:
2024-04-23
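
The description reports each category as a mean with a standard error (SE) and a binary authorship accuracy obtained by collapsing the four-point Likert scale. The following is a minimal sketch of how such figures could be computed, not the authors' analysis code. The function names, the two middle Likert labels ('Probably AI', 'Probably Human'), and the example values are all hypothetical; the dataset's actual column names and label encoding are not stated in this record.

```python
import math
import statistics

# Assumed label set: only 'Definitely AI' and 'Definitely Human' are named
# in the description; the two middle labels are hypothetical placeholders.
LIKERT_TO_BINARY = {
    "Definitely AI": "AI",
    "Probably AI": "AI",
    "Probably Human": "Human",
    "Definitely Human": "Human",
}


def mean_and_se(scores):
    """Return (mean, standard error of the mean) for a list of marks."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the SE")
    mean = statistics.fmean(scores)
    # SE = sample standard deviation / sqrt(n)
    se = statistics.stdev(scores) / math.sqrt(n)
    return mean, se


def binary_accuracy(guesses, truths):
    """Collapse Likert guesses to 'AI'/'Human' and return the fraction correct."""
    if len(guesses) != len(truths) or not truths:
        raise ValueError("guesses and truths must be non-empty and equal length")
    collapsed = [LIKERT_TO_BINARY[g] for g in guesses]
    correct = sum(c == t for c, t in zip(collapsed, truths))
    return correct / len(truths)


if __name__ == "__main__":
    # Made-up illustrative values, not taken from the dataset.
    marks = [80.0, 90.0, 100.0]
    print(mean_and_se(marks))
    print(binary_accuracy(
        ["Definitely AI", "Probably Human", "Definitely Human", "Probably AI"],
        ["AI", "AI", "Human", "Human"],
    ))
```

To apply this to the real data, the marks and guesses would first need to be loaded from the files at the download link and mapped onto these structures.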