Data associated with Optical Character Recognition for Pre-Digital Historical Documents using Large Language Models

Name: Data associated with Optical Character Recognition for Pre-Digital Historical Documents using Large Language Models
Creator: University Libraries, Virginia Tech
Published: 2025-09-26 20:22:40
License: No description available

Source: DataCite Commons (updated 2025-09-26, indexed 2026-05-07)

Download link:

https://data.lib.vt.edu/articles/dataset/Data_associated_with_Optical_Character_Recognition_for_Pre-Digital_Historical_Documents_using_Large_Language_Models/28540643/1

Description:

This is a collection of text image clips from scanned historical real estate documents from the early 20th century. The dataset is a subset of document clips from the Chicago Covenants project. The clips are accompanied by a TSV file that matches each clip to a label describing its visual condition and to the transcription of the text in the clip. The purpose of this collection is to provide ground truth for testing optical character recognition (OCR) technologies on scanned historical documents whose challenging visual characteristics can lead to OCR errors. The collection can also be used as the basis of a training set for improving OCR technologies.

Cautionary Note: This collection includes terms that may be offensive to some. Unfortunately, these terms were part of this point in history.
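
As a rough illustration of how the TSV ground truth might be used to test an OCR system, the sketch below computes a character error rate (CER) per visual-condition label. The column names (`clip`, `condition`, `transcription`), the file names, and the sample rows are all hypothetical placeholders invented for this example; the record does not state the actual TSV header, so adjust them to match the real file.

```python
import csv
import io
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer_by_condition(tsv_file, ocr_outputs):
    """Return {condition label: CER} for OCR output keyed by clip name.

    Column names are assumptions, not taken from the dataset itself.
    Clips missing from ocr_outputs are scored as empty strings.
    """
    errors = defaultdict(int)
    chars = defaultdict(int)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        truth = row["transcription"]
        guess = ocr_outputs.get(row["clip"], "")
        errors[row["condition"]] += edit_distance(guess, truth)
        chars[row["condition"]] += len(truth)
    return {c: errors[c] / chars[c] for c in chars if chars[c]}


# Invented placeholder rows, NOT real records from the collection.
SAMPLE_TSV = (
    "clip\tcondition\ttranscription\n"
    "clip_001.png\tfaded\tlot ten\n"
    "clip_002.png\tclean\tblock two\n"
)
# Invented OCR output for the placeholder clips.
SAMPLE_OCR = {"clip_001.png": "lot tan", "clip_002.png": "block two"}

rates = cer_by_condition(io.StringIO(SAMPLE_TSV), SAMPLE_OCR)
for condition, rate in sorted(rates.items()):
    print(f"{condition}: CER = {rate:.3f}")
```

With the real data, `io.StringIO(SAMPLE_TSV)` would be replaced by an open handle on the downloaded TSV (for example `open(path, newline="", encoding="utf-8")`), and `SAMPLE_OCR` by the output of whichever OCR system is being tested. Grouping by condition label makes it easier to see which visual characteristics cause the most errors.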

Provider:

University Libraries, Virginia Tech

Created:

2025-09-26
