Chinese and English Essay Evaluation Dataset
Indexed by the National Basic Science Data Center (国家基础学科公共科学数据中心) on 2026-01-30
Download link:
https://nbsdc.cn/general/dataDetail?id=67fb648c195d265448044a22&type=1
Resource description:
This dataset contains 10,000 Chinese compositions and 10,000 English compositions, stored as images in common formats such as JPG/JPEG, and is intended for training automated essay scoring (AES) models. The data come from real student answer sheets collected at four schools: Wuhan Meiyuan Middle School, Huaibei Haigong School, Suixi County No. 2 Middle School of Anhui Province, and the Yongding Branch of the High School Affiliated to Capital Normal University. Each composition image is accompanied by its prompt and score. The compositions span six grade levels, from Grade 7 (first year of junior high school) to Grade 12 (third year of senior high school).
The collection and standardization workflow is as follows:
1) Convert paper exam papers into digital image formats (e.g., JPEG, PNG) using scanning equipment;
2) Extract the content within the essay region via technical methods;
3) Train an essay text recognition model using pre-annotated data;
4) Employ the trained model to annotate unlabelled exam papers, followed by manual inspection and correction of the annotation results;
5) Iterate steps 3 and 4 until the dataset size meets the preset requirements.
Please note that each composition is paired with a score and a prompt, yet some composition entries do not include the original test paper and its corresponding answer key.
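The five-step workflow above is a model-assisted annotation loop: train a recognizer on the labeled pool, let it draft labels for new pages, have a person correct the drafts, and repeat. The sketch below illustrates only that control flow. The `EssayImage` record, the `bootstrap_annotation` function, the `batch_size` parameter, and the toy `train`/`review` stand-ins are all hypothetical names introduced for illustration; the dataset authors' actual tooling and recognition model are not described in the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EssayImage:
    """One cropped essay region (step 2). Field names are hypothetical."""
    image_path: str
    transcript: Optional[str] = None  # None until a label exists


def bootstrap_annotation(
    labeled: list[EssayImage],
    unlabeled: list[EssayImage],
    train: Callable[[list[EssayImage]], Callable[[str], str]],
    review: Callable[[EssayImage, str], str],
    target_size: int,
    batch_size: int = 2,
) -> list[EssayImage]:
    """Grow the labeled pool by alternating training (step 3) with
    model-drafted labels that a human corrects (step 4)."""
    labeled = list(labeled)
    pending = list(unlabeled)
    # Step 5: repeat until the size requirement is met or no pages remain.
    while len(labeled) < target_size and pending:
        recognize = train(labeled)  # step 3: retrain on the current pool
        batch, pending = pending[:batch_size], pending[batch_size:]
        for item in batch:  # step 4: draft with the model, then correct
            draft = recognize(item.image_path)
            item.transcript = review(item, draft)
            labeled.append(item)
    return labeled


# Toy stand-ins so the sketch runs without any OCR library (assumptions).
def toy_train(pool):
    return lambda path: f"draft text for {path}"


def toy_review(item, draft):
    return draft.replace("draft", "checked")


seed = [EssayImage("essay_000.jpg", "hand-labeled text")]
todo = [EssayImage(f"essay_{i:03d}.jpg") for i in range(1, 6)]
result = bootstrap_annotation(seed, todo, toy_train, toy_review, target_size=4)
```

Because pages are processed in batches, the loop can overshoot `target_size` by up to one batch; in this toy run the pool grows from 1 to 5 items. A real pipeline would replace `toy_train` with an actual text recognition model and `toy_review` with a human correction step.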
Provider:
Capital Normal University (首都师范大学)
Aggregated Summary
Dataset Introduction

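Background and Challenges
Background Overview
This dataset is a resource for training automated essay scoring models. It contains 10,000 Chinese and 10,000 English compositions, stored as images in formats such as JPG and JPEG. The data come from real answer sheets of students at several secondary schools and span six grade levels, from the first year of junior high school to the third year of senior high school. Every composition is paired with a prompt and a score, and the data were processed through scanning, essay-region extraction, and manual correction.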
The above content was collected and summarized by 遇见数据集.



