Chinese EFL Learners' Writing Evaluation by ChatGPT
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://data.mendeley.com/datasets/8fbzsg82p9
Description:
The dataset provides ChatGPT's ratings, with scores and comments, of writings by 82 Chinese EFL learners, together with scores from reliable manual rating. Researchers can use it for quantitative or qualitative research on the reliability of EFL writing evaluation with ChatGPT, taking the manual ratings as a reference. It has two parts: 1) ChatGPT's ratings with scores and comments, and 2) statistics on the overall, average, and aspect-specific scores from manual and ChatGPT rating.
1. EFL Writings with ChatGPT's Rating
The Spoken and Written English Corpus of Chinese Learners (Version 2.0) (Wen et al., 2008) contains 270 EFL expository compositions, written by 270 Chinese EFL learners within a time limit of 30 minutes. Their IDs run from "WEXP0001" to "WEXP0270".
Eighty-two compositions were randomly sampled from the 270 compositions. The sample size was determined with the power analysis software G*Power (Faul et al., 2007; Faul et al., 2009). A set of 82 random numbers out of 270 was generated using the Random Numbers Generator.
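The sampling step can be illustrated with a minimal Python sketch. It is not the tool used to build the dataset: the function name `sample_composition_ids` and the seed value are hypothetical, and the seed is included only so that the illustration is reproducible. The sketch draws 82 distinct numbers from 1 to 270 without replacement and maps them to the corpus ID format.

```python
import random


def sample_composition_ids(population_size=270, sample_size=82, seed=2023):
    """Draw `sample_size` distinct composition IDs without replacement.

    The seed is an assumption made for reproducibility of this sketch;
    the dataset description does not report one.
    """
    rng = random.Random(seed)
    # Composition numbers run from 1 to population_size inclusive.
    numbers = rng.sample(range(1, population_size + 1), sample_size)
    # Format as WEXP0001 ... WEXP0270, sorted for readability.
    return [f"WEXP{n:04d}" for n in sorted(numbers)]


sampled_ids = sample_composition_ids()
print(len(sampled_ids), "compositions sampled")
```

Sampling without replacement guarantees that no composition is selected twice, which matches the description of 82 distinct writings.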
ChatGPT's ratings were generated by asking ChatGPT to rate the 82 EFL writings one by one. The next day, ChatGPT rated the same 82 writings again with the same prompts to obtain a second set of scores.
2. Scores of Manual and ChatGPT's Rating
The spreadsheet provides ChatGPT's ratings of the EFL compositions, with overall and aspect-specific scores, alongside the corresponding manual rating scores. For the manual rating, three experienced raters scored each composition on language (40%), content (30%), and organization (30%); the total score was the sum of the three parts. The three raters' total scores and their scores on each aspect were then averaged.
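The manual scoring scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes a 100-point scale on which the percentages translate into maximum points of 40, 30, and 30, and the names `MAX_POINTS`, `total_score`, and `average_scores` as well as the example ratings are made up for this sketch.

```python
from statistics import mean

# Assumption: a 100-point scale, so the weights become maximum points.
MAX_POINTS = {"language": 40, "content": 30, "organization": 30}


def total_score(scores):
    """Sum the three aspect scores after checking each is within range."""
    for aspect, maximum in MAX_POINTS.items():
        if not 0 <= scores[aspect] <= maximum:
            raise ValueError(f"{aspect} score out of range: {scores[aspect]}")
    return sum(scores[aspect] for aspect in MAX_POINTS)


def average_scores(ratings):
    """Average each aspect score and the total score across raters."""
    averaged = {a: mean(r[a] for r in ratings) for a in MAX_POINTS}
    averaged["total"] = mean(total_score(r) for r in ratings)
    return averaged


# Made-up ratings from three raters for one composition.
ratings = [
    {"language": 30, "content": 22, "organization": 21},
    {"language": 28, "content": 24, "organization": 20},
    {"language": 32, "content": 23, "organization": 22},
]
averages = average_scores(ratings)
print(averages)
```

Because the total is a plain sum of the aspect scores, the averaged total equals the sum of the averaged aspect scores.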
Inter-rater reliability was analyzed for every pair of raters. The reliabilities were high, ranging from 0.710 to 0.785, and statistically significant (p < 0.01).
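A pairwise reliability analysis of this kind can be sketched as below. The description does not name the coefficient that was used, so the Pearson correlation here is an assumption, and the function names and rater labels are hypothetical. The sketch computes one coefficient per pair of raters and does not compute p-values.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def pairwise_reliability(scores_by_rater):
    """Return {(rater_a, rater_b): r} for every pair of raters."""
    return {
        (a, b): pearson_r(scores_by_rater[a], scores_by_rater[b])
        for a, b in combinations(sorted(scores_by_rater), 2)
    }


# Made-up total scores from three raters for five compositions.
example_scores = {
    "rater1": [70, 65, 80, 75, 60],
    "rater2": [72, 60, 78, 74, 65],
    "rater3": [68, 66, 82, 70, 62],
}
for pair, r in pairwise_reliability(example_scores).items():
    print(pair, round(r, 3))
```

With three raters this yields three coefficients, one for each pair, which corresponds to reporting reliabilities as a range.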
References
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149-1160. https://doi.org/10.3758/BRM.41.4.1149
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175-191. https://doi.org/10.3758/BF03193146
Wen, Q., Wang, L., & Liang, M. (2008). Spoken and Written English Corpus of Chinese Learners (Version 2.0). Foreign Language Teaching and Research Press.
Created:
2023-04-18



