Scrambled text: training Language Models to correct OCR errors using synthetic data

Name: Scrambled text: training Language Models to correct OCR errors using synthetic data
Creator: University College London
Published: 2025-04-18 08:55:20
License: No description available

DataCite Commons; updated 2025-04-18; indexed 2025-04-17

Download link:

https://rdr.ucl.ac.uk/articles/dataset/Scrambled_text_training_Language_Models_to_correct_OCR_errors_using_synthetic_data/27108334/1

Resource description:

This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT-4o. These articles are available both as a csv, with the prompt parameters as columns, and as individual text files.

The files in the repository are as follows:

ncse_hf_dataset: A Hugging Face dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with the original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.

synth_gt.zip: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed length of tokens, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25, and 10.

synthetic_articles.zip: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.

synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.

The data in this repo is used by the code repositories associated with the project:

https://github.com/JonnoB/scrambledtext_analysis

https://github.com/JonnoB/training_lms_with_synthetic_data
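
The description does not give the names of the files inside each zip archive, so a reasonable first step after downloading is to list what an archive contains. The following is a minimal sketch using only the Python standard library. The helper name list_members is hypothetical and is not part of the project's code repositories, and the sketch assumes the archive has already been downloaded locally.

```python
import zipfile
from pathlib import PurePosixPath


def list_members(archive, suffix):
    """Return sorted names of archive members whose extension equals suffix.

    `archive` may be a path or a binary file-like object.
    The comparison is case-insensitive, so ".parquet" also matches ".PARQUET".
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if PurePosixPath(name).suffix.lower() == suffix.lower()
        )


# Example call, assuming synth_gt.zip sits in the working directory:
#     list_members("synth_gt.zip", ".parquet")
```

According to the description above, that call should return five names for synth_gt.zip, one per observation length. The same helper can be pointed at synthetic_articles.zip with ".csv" or at synthetic_articles_text.zip with ".txt", assuming the text files carry that extension.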

Provider:

University College London

Created:

2024-09-27
