agomberto/FrenchCensus-handwritten-texts

Name: agomberto/FrenchCensus-handwritten-texts
Creator: agomberto
Published: 2023-11-28 17:35:18
License: MIT

Source: Hugging Face. Updated: 2023-11-28. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/agomberto/FrenchCensus-handwritten-texts

Resource description:

---
language:
- fr
license: mit
size_categories:
- 1K<n<10K
task_categories:
- image-to-text
tags:
- imate-to-text
- trocr
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 501750699.816
    num_examples: 5601
  - name: validation
    num_bytes: 45084242.0
    num_examples: 707
  - name: test
    num_bytes: 49133043.0
    num_examples: 734
  download_size: 459795745
  dataset_size: 595967984.816
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

## Source

This repository contains 3 datasets created within the POPP project ([Project for the Oceration of the Paris Population Census](https://popp.hypotheses.org/#ancre2)) for the task of handwriting text recognition. These datasets were published in [Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census at DAS 2022](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10).

The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin”. They contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, so each page in these datasets corresponds to 30 lines.

We publish only the lines here. The full pages are available [on Zenodo](https://zenodo.org/record/6581158).

This dataset is made of 4800 annotated lines extracted from 80 double pages of the 1926 Paris census (80 double pages × 2 pages × 30 rows).

## Data Info

Since the lines are extracted from table rows, we defined 4 special characters to describe the structure of the text:

- ¤ : indicates an empty cell
- / : indicates the separation into columns
- ? : indicates that the content of the cell following this symbol is written above the regular baseline
- ! : indicates that the content of the cell following this symbol is written below the regular baseline

There are three splits: train, validation, and test.
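The four special characters described under Data Info can be turned back into a cell-by-cell structure. The following is a minimal sketch, not the authors' tooling: the `Cell` class, the `parse_line` function, and the example string are hypothetical, and it assumes that `?` and `!` appear as the first character of the cell they qualify and that `¤` stands alone in an empty cell.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    text: str       # cell content with the structural markers stripped
    position: str   # "baseline", "above", or "below"
    empty: bool     # True when the cell holds only the ¤ marker


def parse_line(transcription: str) -> List[Cell]:
    """Split one transcribed table row into cells (hypothetical helper)."""
    cells = []
    for raw in transcription.split("/"):  # "/" separates columns
        raw = raw.strip()
        position = "baseline"
        if raw.startswith("?"):           # "?" marks content above the baseline
            position, raw = "above", raw[1:].strip()
        elif raw.startswith("!"):         # "!" marks content below the baseline
            position, raw = "below", raw[1:].strip()
        empty = raw == "¤"                # "¤" marks an empty cell
        cells.append(Cell("" if empty else raw, position, empty))
    return cells


# Made-up example row, not taken from the dataset
example = "Dupont/Jean/¤/?1898/!Paris"
for cell in parse_line(example):
    print(cell)
```

Real transcriptions may place the markers differently (for instance with surrounding spaces or inside a cell), so check a few samples from the dataset before relying on this parsing.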
## How to use it

```python
from datasets import load_dataset
import numpy as np

dataset = load_dataset("agomberto/FrenchCensus-handwritten-texts")

# Pick a random training example; indexing the row first avoids decoding the whole image column
i = int(np.random.randint(len(dataset["train"])))
sample = dataset["train"][i]
img = sample["image"]  # PIL image of one table row
text = sample["text"]  # transcription, including the special characters

print(text)
img  # in a notebook, this displays the line image
```

## BibTeX entry and citation info

```bibtex
@InProceedings{10.1007/978-3-031-06555-2_10,
author="Constum, Thomas and Kempf, Nicolas and Paquet, Thierry and Tranouez, Pierrick and Chatelain, Cl{\'e}ment and Br{\'e}e, Sandra and Merveille, Fran{\c{c}}ois",
editor="Uchida, Seiichi and Barney, Elisa and Eglin, V{\'e}ronique",
title="Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early 20th Century Paris Census",
booktitle="Document Analysis Systems",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="143--157",
abstract="We aim to build a vast database (up to 9 million individuals) from the handwritten tabular nominal census of Paris of 1926, 1931 and 1936, each composed of about 100,000 handwritten simple pages in a tabular format. We created a complete pipeline that goes from the scan of double pages to text prediction while minimizing the need for segmentation labels. We describe how weighted finite state transducers, writer specialization and self-training further improved our results. We also introduce through this communication two annotated datasets for handwriting recognition that are now publicly available, and an open-source toolkit to apply WFST on CTC lattices.",
isbn="978-3-031-06555-2"
}
```

Provided by:

agomberto

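Summary of original information

Dataset overview

Basic information

Language: French
License: MIT
Size category: 1K<n<10K
Task category: image-to-text
Tags: image-to-text, trocr

Dataset structure

Features:
- image: image data type
- text: string data type

Data splits

Train:
- Size in bytes: 501750699.816
- Number of examples: 5601
Validation:
- Size in bytes: 45084242.0
- Number of examples: 707
Test:
- Size in bytes: 49133043.0
- Number of examples: 734

Dataset size

Download size (bytes): 459795745
Dataset size (bytes): 595967984.816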
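As a quick consistency check on the figures above, the per-split numbers can be summed and compared with the reported dataset size. This is a small sketch using only the values listed on this page; the `splits` dictionary and variable names are illustrative.

```python
# Split statistics copied from the dataset metadata above
splits = {
    "train": {"num_bytes": 501750699.816, "num_examples": 5601},
    "validation": {"num_bytes": 45084242.0, "num_examples": 707},
    "test": {"num_bytes": 49133043.0, "num_examples": 734},
}

total_examples = sum(s["num_examples"] for s in splits.values())  # 7042
total_bytes = sum(s["num_bytes"] for s in splits.values())

for name, s in splits.items():
    share = s["num_examples"] / total_examples          # fraction of all examples
    avg_kb = s["num_bytes"] / s["num_examples"] / 1000  # average size per example
    print(f"{name}: {share:.1%} of examples, about {avg_kb:.0f} kB per example")

print(f"total examples: {total_examples}")
print(f"total bytes: {total_bytes:.3f}")
```

The summed byte counts match the reported dataset size of 595967984.816, and the three splits together hold 7042 examples.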

Configuration

Default configuration:
- Train: data/train-*
- Validation: data/validation-*
- Test: data/test-*
