five

Scrambled text: training Language Models to correct OCR errors using synthetic data

收藏
DataCite Commons2025-04-18 更新2025-04-17 收录
下载链接:
https://rdr.ucl.ac.uk/articles/dataset/Scrambled_text_training_Language_Models_to_correct_OCR_errors_using_synthetic_data/27108334/1
下载链接
链接失效反馈
官方服务:
资源简介:
This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data".In addition it contains the 10,000 synthetic 19th century articles generated using GPT4o. These articles are available both as a csv with the prompt parameters as columns as well as the articles as individual text files.The files in the repository are as follows<b>ncse_hf_dataset</b>: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed groundtruth. This dataset is used as the testset in the paper<b>synth_gt.zip</b>: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. The each parquet file is made up of observations of a fixed length of tokens, for a total of 2 Million tokens. The observation lengths are 200, 100, 50, 25, 10.<b>synthetic_articles.zip</b>: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.<b>synthetic_articles_text.zip</b>: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.The data in this repo is used by the code repositories associated with the project https://github.com/JonnoB/scrambledtext_analysishttps://github.com/JonnoB/training_lms_with_synthetic_data<br>
提供机构:
University College London
创建时间:
2024-09-27
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作