agentlans/ocr-correction
Source: Hugging Face | Updated: 2024-12-11 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/agentlans/ocr-correction
Summary:
This dataset is designed for optical character recognition (OCR) correction. It contains samples of OCR text from English books and newspapers sourced from the Internet Archive, each paired with an AI-corrected version. Every instance holds the raw OCR text and its corrected counterpart. The dataset is split into training and validation sets, totaling 49,047 samples. The corrections were generated with a custom Llama 3.1 8B model; the results were then filtered for samples with similar input and output lengths, and those showing significant quality improvement were selected. Known limitations include a lack of surrounding context, text-only corrections, possible inaccuracies in numbers and dates, spacing issues, and truncations or extra details introduced by the AI correction. The dataset aims to improve OCR technology, broaden access to historical texts, and contribute to the preservation of cultural heritage.
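The length-similarity filter mentioned in the description can be illustrated with a minimal sketch. The source does not say how similarity was measured, so the ratio metric, the `min_ratio` threshold of 0.8, and the function names below are all assumptions made for illustration, not the dataset author's actual procedure.

```python
def length_ratio(raw: str, corrected: str) -> float:
    """Return shorter length divided by longer length, a value in [0, 1]."""
    longer = max(len(raw), len(corrected))
    if longer == 0:
        return 1.0  # two empty strings are treated as identical in length
    return min(len(raw), len(corrected)) / longer


def keep_similar_length(pairs, min_ratio=0.8):
    """Keep (raw, corrected) pairs whose length ratio is at least min_ratio.

    min_ratio is a hypothetical threshold; the source does not state one.
    """
    return [(raw, cor) for raw, cor in pairs if length_ratio(raw, cor) >= min_ratio]


# Toy (raw OCR, corrected) pairs, invented for this example.
pairs = [
    ("Tlie qnick brown fox", "The quick brown fox"),
    # A correction that truncated most of the input, so it is dropped.
    ("Tlie qnick brown fox jumps ovcr the lazy d0g", "The quick"),
]
kept = keep_similar_length(pairs)
```

A filter of this kind guards against two of the limitations listed above: corrections that truncate the input and corrections that add material not present in the scan.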
Provided by:
agentlans



