
A Messy Handwriting Dataset with Student Crossouts and Corrections (Line-version)

Research Data Australia, indexed 2024-08-03
Download link:
https://researchdata.edu.au/a-messy-handwriting-line-version/2831127
Description:
This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).

Within the central repository, there is a subfolder for each document, containing that document converted into lines. All images are in .png format. The main folder contains three .txt files:

1) SMHD.txt contains all the line-level transcriptions in the form image name, threshold value, label. For example: 0001-000,178 Bombay Phenotype :-
2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset that have crossed-out and inserted text.
3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. It can be used as a test set. The images in this file are not included in SMHD.txt.

In the transcription files, any crossed-out content is denoted by the '#' symbol, which makes it easy to identify files with or without such modifications.

Dataset Description:
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are a primary source for assessing student learning in academic institutes. In our experience as academics, student examination papers tend to be messy, with all kinds of insertions and corrections, and would thus be a great source of documents for investigating handwritten text recognition (HTR) in the wild. Unfortunately, student examination papers are not available due to ethical considerations, so we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Handwritten papers in academia usually have ruled lines, so we drew lines in light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt.
The filled handwritten documents were scanned at a resolution of 300 dpi with a grey-level depth of 8 bits.

Collection Process:
The collection was carried out in four different ways.

In the first exercise, we asked participants to summarize a given text in their own words. We called this the summary-based dataset. The summary writing task included 60 undergraduate students studying the English language as a subject. After obtaining their consent, we distributed printed text articles and asked each participant to choose one article, read it, and summarize it in a paragraph within 15 minutes. The printed articles were collected from the Internet and covered different topics: current political situations, daily life activities, and the Covid-19 pandemic.

In the second exercise, we asked participants to write an essay on a topic from a given list or on any topic of their choice. We called this the essay-based dataset. It was collected from 250 high school students, who were given 30 minutes to think about the topic and write.

In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called this the subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.

Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to note down every possible sentence the speaker delivered during the lecture. After the lesson, which lasted almost 10 minutes, we asked students to recheck their notes and compare them with those of their classmates. We did not impose any time restrictions for rechecking.
We observed more cross-outs and corrections in the class notes than in the summary-based and academic-based collections.

In all four exercises, we did not impose any rules on participants, for example regarding spacing or the choice of pen. We asked them to cross out text if it seemed inappropriate. Although writers usually made corrections on a second read, we also gave an extra 5 minutes for corrections.
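
The SMHD.txt layout described above (image name, threshold value, label, with '#' marking crossed-out content) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes each entry looks like the single example given in the description (image name, a comma, an integer threshold, then the label after a space or comma), and the names `LineRecord`, `parse_line` and `load_transcriptions` are hypothetical. If the real files use a different delimiter layout, the regular expression would need adjusting.

```python
import re
from typing import List, NamedTuple

# Assumed layout, inferred from the one example in the description:
#   0001-000,178 Bombay Phenotype :-
# i.e. <image name>,<threshold><space or comma><label>
_ENTRY = re.compile(r"^\s*([^,]+?)\s*,\s*(\d+)[,\s]\s*(.*?)\s*$")


class LineRecord(NamedTuple):
    image_name: str   # e.g. "0001-000"
    threshold: int    # e.g. 178
    label: str        # transcription text; '#' marks crossed-out content

    @property
    def has_crossout(self) -> bool:
        # The description states that crossed-out content is denoted by '#'.
        return "#" in self.label


def parse_line(raw: str) -> LineRecord:
    """Parse one transcription entry under the assumed layout."""
    match = _ENTRY.match(raw)
    if match is None:
        raise ValueError(f"Unrecognized entry: {raw!r}")
    name, threshold, label = match.groups()
    return LineRecord(name, int(threshold), label)


def load_transcriptions(path: str) -> List[LineRecord]:
    """Read every non-empty entry from a transcription file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

For instance, `parse_line("0001-000,178 Bombay Phenotype :-")` yields a record with image name `0001-000`, threshold `178` and label `Bombay Phenotype :-`. Filtering on `has_crossout` then separates lines with cross-outs from clean ones.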

Provider:
RMIT University, Australia