agomberto/FrenchCensus-handwritten-texts
Source: Hugging Face · Updated 2023-11-28 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/agomberto/FrenchCensus-handwritten-texts
Resource description:
---
language:
- fr
license: mit
size_categories:
- 1K<n<10K
task_categories:
- image-to-text
tags:
- imate-to-text
- trocr
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 501750699.816
    num_examples: 5601
  - name: validation
    num_bytes: 45084242.0
    num_examples: 707
  - name: test
    num_bytes: 49133043.0
    num_examples: 734
  download_size: 459795745
  dataset_size: 595967984.816
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
## Source
This repository contains 3 datasets created within the POPP project ([Project for the Oceration of the Paris Population Census](https://popp.hypotheses.org/#ancre2)) for the task of handwritten text recognition. These datasets were published in [Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10) at DAS 2022.
The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, so each page in these datasets corresponds to 30 lines.
Only the lines are published here. The full pages are available on [Zenodo](https://zenodo.org/record/6581158). This dataset is made up of 4800 annotated lines extracted from 80 double pages of the 1926 Paris census.
## Data Info
Since the lines are extracted from table rows, we defined 4 special characters to describe the structure of the text:
- ¤ : indicates an empty cell
- / : indicates the separation into columns
- ? : indicates that the content of the cell following this symbol is written above the regular baseline
- ! : indicates that the content of the cell following this symbol is written below the regular baseline
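The four special characters above can be turned back into a cell structure with a few lines of Python. The following is a minimal sketch, not part of the dataset tooling: it assumes that `/` sits between cells, that `¤` is the whole content of an empty cell, and that `?` or `!` appears as the first character of the cell it qualifies. The `Cell` class, the `parse_line` function, and the sample transcription are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass
from typing import List

EMPTY = "¤"      # empty cell
SEPARATOR = "/"  # column separator
ABOVE = "?"      # following content is written above the baseline
BELOW = "!"      # following content is written below the baseline


@dataclass
class Cell:
    text: str
    empty: bool = False
    position: str = "baseline"  # "baseline", "above" or "below"


def parse_line(transcription: str) -> List[Cell]:
    """Split one transcribed table row into cells (assumed marker layout)."""
    cells = []
    for raw in transcription.split(SEPARATOR):
        raw = raw.strip()
        if raw == EMPTY:
            cells.append(Cell(text="", empty=True))
        elif raw.startswith(ABOVE):
            cells.append(Cell(text=raw[1:].strip(), position="above"))
        elif raw.startswith(BELOW):
            cells.append(Cell(text=raw[1:].strip(), position="below"))
        else:
            cells.append(Cell(text=raw))
    return cells


# Hypothetical transcription, not taken from the dataset
for cell in parse_line("Dupont/Jean/¤/?1890/!Paris"):
    print(cell)
```

Keeping the markers in the target text lets a recognition model learn the table layout together with the content; a parser like this one is only needed afterwards, to map predictions back to columns.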
There are three splits: train, validation and test.
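As a quick sanity check, the share of each split can be computed from the `num_examples` values listed in the metadata at the top of this card. This is a small sketch: the counts are copied from that metadata, and the variable names are only illustrative.

```python
# Example counts copied from the dataset card metadata (num_examples per split)
SPLIT_SIZES = {"train": 5601, "validation": 707, "test": 734}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

# Print each split with its line count and its share of the whole dataset
for name, share in shares.items():
    print(f"{name:<10} {SPLIT_SIZES[name]:>5} lines  {share:.1%}")
```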
## How to use it
```python
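from datasets import load_dataset
import numpy as np

# Download (or load from cache) the train, validation and test splits
dataset = load_dataset("agomberto/FrenchCensus-handwritten-texts")

# Pick a random training example; indexing the row first avoids decoding every image
i = int(np.random.randint(len(dataset["train"])))
example = dataset["train"][i]
img = example["image"]  # PIL image of one table row
text = example["text"]  # transcription, including the special characters
print(text)
img  # displays the image in a notebook; use img.show() in a script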
```
## BibTeX entry and citation info
```bibtex
@InProceedings{10.1007/978-3-031-06555-2_10,
author="Constum, Thomas
and Kempf, Nicolas
and Paquet, Thierry
and Tranouez, Pierrick
and Chatelain, Cl{\'e}ment
and Br{\'e}e, Sandra
and Merveille, Fran{\c{c}}ois",
editor="Uchida, Seiichi
and Barney, Elisa
and Eglin, V{\'e}ronique",
title="Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early $20^{th}$ Century Paris Census",
booktitle="Document Analysis Systems",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="143--157",
abstract="We aim to build a vast database (up to 9 million individuals) from the handwritten tabular nominal census of Paris of 1926, 1931 and 1936, each composed of about 100,000 handwritten simple pages in a tabular format. We created a complete pipeline that goes from the scan of double pages to text prediction while minimizing the need for segmentation labels. We describe how weighted finite state transducers, writer specialization and self-training further improved our results. We also introduce through this communication two annotated datasets for handwriting recognition that are now publicly available, and an open-source toolkit to apply WFST on CTC lattices.",
isbn="978-3-031-06555-2"
}
```
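Provided by:
agomberto
Summary of original information
Dataset overview
Basic information
- Language: French
- License: MIT
- Size category: 1K<n<10K
- Task category: image-to-text
- Tags: image-to-text, trocr
Dataset structure
- Features:
  - image: dtype image
  - text: dtype string
Data splits
- Train:
  - Bytes: 501750699.816
  - Examples: 5601
- Validation:
  - Bytes: 45084242.0
  - Examples: 707
- Test:
  - Bytes: 49133043.0
  - Examples: 734
Dataset size
- Download size: 459795745 bytes
- Dataset size: 595967984.816 bytes
Configuration
- Default configuration:
  - Train: data/train-*
  - Validation: data/validation-*
  - Test: data/test-*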



