
Greetings From! Historical Postcards Address Transcription Dataset

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10005565
Description:
This dataset provides both Ground Truth (GT) and Handwritten Text Recognition (HTR) transcriptions of historical postcard addresses. It stems from a project to extract address information from historical picture postcards from Belgium, France, Germany, Luxembourg, the Netherlands, and the UK. The dataset covers the backs of 500 historically significant postcards. The research associated with this dataset will be presented at the Computational Humanities Research Conference, December 6-8, 2023, Paris, France.

Scope and Content:
HTR Material: Handwritten Text Recognition outputs for 500 postcards.
GT Material: Ground Truth transcriptions created by human transcribers for the same set of 500 postcards.

File Structure and Formats:
For both HTR and GT Material, the following files are provided:
JPEG images: scanned or digitized images of the postcards.
.txt: plain text transcriptions of the postcards.
_tei.xml: transcriptions rendered in the TEI XML format.
.pdf: PDF presentation of the postcards along with their transcriptions.
mets.xml: METS (Metadata Encoding and Transmission Standard) schema for the data.
page folder: XML files for individual images, offering metadata and structural information.
metadata.xml: metadata concerning the dataset.
GT_addresses_GPT4.json and HTR_addresses_GPT4.json: JSON files detailing individual address data for each postcard in structured format.

Annotation and Transcription:
GT: Ground Truth data was annotated by human transcribers who examined both the images of the postcards and the outputs of the HTR system. Transcribers made corrections according to predefined conventions: # marks illegible characters, * at the start of a line marks a line without address information (e.g., a person's name), and @ at the start of a line marks an irrelevant line.
HTR: The HTR versions were produced by a state-of-the-art HTR system (Transkribus Text Titan I).

The .json files hold precise address details derived from the main data, which were processed using OpenAI's GPT-4 Large Language Model.
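The GT transcription conventions described above (# for illegible characters, a leading * for lines without address information, a leading @ for irrelevant lines) can be applied when reading the .txt files. The following is a minimal sketch, not the project's own tooling: the names GTLine, classify_gt_line, and address_lines are hypothetical, the sample text is invented for illustration, and the sketch assumes the marker is the first non-whitespace character on a line.

```python
from dataclasses import dataclass


@dataclass
class GTLine:
    kind: str       # "address", "non_address", or "irrelevant"
    text: str       # line content with the leading marker removed
    illegible: int  # number of '#' markers found in the line


def classify_gt_line(line: str) -> GTLine:
    """Classify one GT line by its leading marker (assumed conventions)."""
    stripped = line.strip()
    if stripped.startswith("@"):
        kind, text = "irrelevant", stripped[1:].strip()
    elif stripped.startswith("*"):
        kind, text = "non_address", stripped[1:].strip()
    else:
        kind, text = "address", stripped
    return GTLine(kind, text, text.count("#"))


def address_lines(transcription: str) -> list[str]:
    """Return only the non-empty lines that carry address information."""
    parsed = (classify_gt_line(raw) for raw in transcription.splitlines())
    return [p.text for p in parsed if p.text and p.kind == "address"]


# Invented sample, not taken from the dataset.
sample = "*Monsieur J. Dup#nt\n12 rue de la Gare\nLuxembourg\n@Printed in Germany"
print(address_lines(sample))
```

Keeping the marker handling in one function makes it easy to adjust if the actual files place markers differently than assumed here.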
Created:
2023-10-16