Data associated with Optical Character Recognition for Pre-Digital Historical Documents using Large Language Models
Source: DataCite Commons. Updated: 2025-09-26. Indexed: 2026-05-07
Download link:
https://data.lib.vt.edu/articles/dataset/Data_associated_with_Optical_Character_Recognition_for_Pre-Digital_Historical_Documents_using_Large_Language_Models/28540643/1
Description:
This is a collection of text image clips from scanned historical real estate documents from the early 20th century. The dataset is a subset of document clips from the Chicago Covenants project. The clips are accompanied by a TSV file that matches each clip to a label describing its visual condition and to a transcription of the text in the clip. The purpose of this collection is to provide ground truth for testing optical character recognition (OCR) technologies on scanned historical documents whose challenging visual characteristics can lead to OCR errors. The collection can also be used as the basis of a training set for improving OCR technologies. Cautionary Note: This collection includes terms that may be offensive to some. Unfortunately, these terms were part of this period in history.
Provider:
University Libraries, Virginia Tech
Created:
2025-09-26
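
The description says the TSV pairs each clip with a visual-condition label and a ground-truth transcription. As a rough illustration of how such a file might be used to test an OCR system, the sketch below computes the character error rate (CER) per condition label. The column names (`filename`, `condition`, `transcription`), the sample rows, and the OCR outputs are all hypothetical placeholders made up for this example; the actual TSV header and contents should be checked against the downloaded dataset.

```python
import csv
import io
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground-truth text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def load_ground_truth(tsv_file) -> list:
    """Read tab-separated rows into dicts keyed by the header row."""
    # QUOTE_NONE keeps any quotation marks in transcriptions as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def mean_cer_by_condition(rows: list, ocr_outputs: dict) -> dict:
    """Average CER for each visual-condition label."""
    scores = defaultdict(list)
    for row in rows:
        # Column names here are assumptions, not the dataset's real header.
        hypothesis = ocr_outputs.get(row["filename"], "")
        cer = character_error_rate(row["transcription"], hypothesis)
        scores[row["condition"]].append(cer)
    return {label: sum(v) / len(v) for label, v in scores.items()}


# Invented sample rows standing in for the real TSV file.
SAMPLE_TSV = (
    "filename\tcondition\ttranscription\n"
    "clip_001.png\tfaded\tthe said premises\n"
    "clip_002.png\tskewed\tlot ten in block two\n"
)

# Invented OCR results; in practice these would come from the system under test.
SAMPLE_OCR = {
    "clip_001.png": "the sa1d premises",
    "clip_002.png": "lot ten in block two",
}

if __name__ == "__main__":
    rows = load_ground_truth(io.StringIO(SAMPLE_TSV))
    for label, cer in mean_cer_by_condition(rows, SAMPLE_OCR).items():
        print(f"{label}: mean CER = {cer:.3f}")
```

To run this against the real data, one would replace `io.StringIO(SAMPLE_TSV)` with an open file handle on the dataset's TSV and fill `SAMPLE_OCR` with the output of the OCR system being evaluated. Grouping by condition label is one possible design choice: it shows which visual characteristics cause the most errors, which matches the stated purpose of the collection.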



