Teklia/POPP-line

Name: Teklia/POPP-line
Creator: Teklia
Published: 2025-02-11 09:56:52
License: not specified

Source: Hugging Face. Updated: 2025-02-11. Indexed: 2024-06-22.

Download link:

https://hf-mirror.com/datasets/Teklia/POPP-line

Description:

The POPP-line dataset contains images of French civil census records from early 20th-century Paris, each paired with its text transcription. It targets image-to-text tasks, and every image has been resized to a fixed height of 128 pixels. The dataset is split into training, validation, and test sets totaling 4,791 samples, and is suitable for automatic text recognition (ATR), handwritten text recognition (HTR), and optical character recognition (OCR).
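As a rough illustration of the fixed 128-pixel height, the sketch below computes the output size of a line image. It assumes the aspect ratio is preserved when resizing, which the description does not state; `scaled_size` is a hypothetical helper, and the 8600x256 input is made up for the example.

```python
def scaled_size(width: int, height: int, target_height: int = 128) -> tuple[int, int]:
    """Return (new_width, target_height), keeping the aspect ratio (an assumption)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale the width by the same factor that maps height to target_height.
    new_width = max(1, round(width * target_height / height))
    return new_width, target_height


# Hypothetical source line image of 8600x256 pixels.
size = scaled_size(8600, 256)
print(size)
```

Under this assumption, image widths vary with the length of the line while the height stays constant.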

Provider:

Teklia

Summary of original information

POPP - line level: dataset overview

Dataset description

Dataset name: POPP-line
Language: French
Task category: image-to-text
Tags: atr, htr, ocr, historical, handwritten

Dataset structure

Data instances

{
  "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4300x128 at 0x1A800E8E190>,
  "text": "Joly Ernest 88 Indre M par Employé Roblot!18377"
}

Data fields

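image: a PIL.Image.Image object containing the image. Note that the image file is decoded automatically when the image column is accessed (e.g. dataset[0]["image"]). Decoding a large number of image files can take a long time, so select the sample index first and the "image" column second: dataset[0]["image"] should always be preferred over dataset["image"][0].
text: the transcription of the text line shown in the image.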
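The access-order advice for the image field can be illustrated with a toy model. The sketch below is not the Hugging Face `datasets` library: `ToyImageDataset` is a hypothetical stand-in that decodes an image whenever one is read and counts the decodes, so the cost of row-first versus column-first access becomes visible.

```python
class ToyImageDataset:
    """Hypothetical stand-in for a dataset whose "image" column decodes on access."""

    def __init__(self, rows):
        self._rows = rows        # list of dicts with encoded "image" bytes and "text"
        self.decode_calls = 0    # number of images decoded so far

    def _decode(self, raw: bytes) -> str:
        self.decode_calls += 1
        return raw.decode("utf-8")  # placeholder for real JPEG decoding

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only this row's image is decoded.
            row = self._rows[key]
            return {"image": self._decode(row["image"]), "text": row["text"]}
        if key == "image":
            # Column access: every image in the split is decoded.
            return [self._decode(row["image"]) for row in self._rows]
        return [row[key] for row in self._rows]


rows = [{"image": f"img-{i}".encode(), "text": f"line {i}"} for i in range(1000)]

by_row = ToyImageDataset(rows)
first = by_row[0]["image"]            # preferred: index first, then column
print(by_row.decode_calls)            # 1

by_col = ToyImageDataset(rows)
first_again = by_col["image"][0]      # discouraged: column first, then index
print(by_col.decode_calls)            # 1000
```

With the real library, the same pattern would start from something like `load_dataset("Teklia/POPP-line", split="train")`, with the repository id taken from the download link above; that call needs the `datasets` package and network access, so it is left out of the runnable sketch.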

Data splits

Training set: 3,834 samples
Validation set: 479 samples
Test set: 478 samples
Total: 4,791 samples
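A quick sanity check of the split sizes listed above, showing that they add up to the stated total and correspond to roughly an 80/10/10 partition. The dictionary keys are illustrative names, not necessarily the split identifiers used by the dataset.

```python
# Split sizes as listed above; the key names are assumptions.
splits = {"train": 3834, "validation": 479, "test": 478}

total = sum(splits.values())
# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)   # 4791
print(shares)  # {'train': 80.0, 'validation': 10.0, 'test': 10.0}
```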
