NealCaren/newspaper-ocr-gold

Name: NealCaren/newspaper-ocr-gold
Creator: NealCaren
Published: 2026-03-23 14:00:02
License: CC BY 4.0

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/NealCaren/newspaper-ocr-gold

Description:

---
license: cc-by-4.0
task_categories:
- image-to-text
tags:
- ocr
- historical-newspapers
- fine-tuning
language:
- en
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: transcription
    dtype: string
  - name: resolution
    dtype: string
  - name: scale
    dtype: float64
  - name: width
    dtype: int64
  - name: height
    dtype: int64
  - name: page_id
    dtype: string
  - name: line_id
    dtype: string
  - name: confidence
    dtype: float64
  - name: flag
    dtype: string
  splits:
  - name: train
    num_bytes: 717769970
    num_examples: 51572
  - name: val
    num_bytes: 62524729
    num_examples: 4487
  - name: test
    num_bytes: 98808848
    num_examples: 6554
  download_size: 871517347
  dataset_size: 879103547
---

# newspaper-ocr-gold

Gold-standard OCR training data for historical newspaper scans.

## Contents

- 13,371 line-level transcriptions verified by Qwen3-VL 235B
- Line crop PNG images from 100 newspaper pages
- 73 unique titles spanning the 1840s to the 2010s
- Train/val/test split by page (80/10/10)

## Splits

| Split | Pages | Lines  |
|-------|-------|--------|
| train | 80    | 11,044 |
| val   | 10    | 1,111  |
| test  | 10    | 1,216  |

## Files

- `verified_lines.jsonl`: full metadata (split, page_id, line_id, crop_path, transcription, confidence, flag)
- `sample_metadata.json`: page sampling details (73 titles, decade distribution)
- `train_images.tar.gz`, `val_images.tar.gz`, `test_images.tar.gz`: line crop PNGs organized as `{split}/{page_id}/lines/line_NNNN.png`

## Usage

```python
import json
import tarfile

from huggingface_hub import hf_hub_download

# Download verified labels
path = hf_hub_download("NealCaren/newspaper-ocr-gold",
                       "verified_lines.jsonl", repo_type="dataset")
with open(path) as f:
    lines = [json.loads(l) for l in f]

# Download and extract images for each split
for split in ["train", "val", "test"]:
    tar = hf_hub_download("NealCaren/newspaper-ocr-gold",
                          f"{split}_images.tar.gz", repo_type="dataset")
    with tarfile.open(tar) as t:
        t.extractall("./gold_data/")
```

## Quality

- 49% clean, 47% partial (line cut off at a word boundary), 3% degraded
- Mean confidence: 0.95
- Verified by Qwen3-VL 235B via OpenRouter (blind transcription, no OCR input)
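
The quality flags and confidence scores above can be used to pick a training subset from the records loaded out of `verified_lines.jsonl`. The following is a minimal sketch, not part of the dataset's own tooling. It assumes each record is a dict with the documented `split`, `flag`, and `confidence` keys, and that the `flag` values are the literal strings `clean`, `partial`, and `degraded` (the card lists these categories but does not confirm the exact strings). The helper names `select_lines` and `flag_shares` and the default threshold of 0.9 are hypothetical.

```python
from collections import Counter


def select_lines(records, split, allowed_flags=("clean",), min_confidence=0.9):
    """Return records from one split that pass the flag and confidence filter.

    `records` is a list of dicts such as those parsed from verified_lines.jsonl.
    The flag strings and the 0.9 default are assumptions for illustration.
    """
    return [
        r for r in records
        if r["split"] == split
        and r["flag"] in allowed_flags
        and r["confidence"] >= min_confidence
    ]


def flag_shares(records):
    """Return the fraction of records carrying each flag value."""
    counts = Counter(r["flag"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {flag: n / total for flag, n in counts.items()}
```

With `lines` loaded as in the usage snippet, `select_lines(lines, "train")` would keep only high-confidence clean training lines, and `flag_shares(lines)` gives a quick check of the flag distribution against the percentages reported under Quality.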
