
agomberto/FrenchCensus-handwritten-texts

Hugging Face · Updated 2023-11-28 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/agomberto/FrenchCensus-handwritten-texts
Resource description:
---
language:
- fr
license: mit
size_categories:
- 1K<n<10K
task_categories:
- image-to-text
tags:
- imate-to-text
- trocr
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 501750699.816
    num_examples: 5601
  - name: validation
    num_bytes: 45084242.0
    num_examples: 707
  - name: test
    num_bytes: 49133043.0
    num_examples: 734
  download_size: 459795745
  dataset_size: 595967984.816
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

## Source

This repository contains 3 datasets created within the POPP project ([Project for the Oceration of the Paris Population Census](https://popp.hypotheses.org/#ancre2)) for the task of handwritten text recognition. These datasets were published in [Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10) at DAS 2022.

The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin”. They contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, so each page in these datasets corresponds to 30 lines.

We publish only the lines here. If you want the pages, go [here](https://zenodo.org/record/6581158).

This dataset is made of 4800 annotated lines extracted from 80 double pages of the 1926 Paris census.

## Data Info

Since the lines are extracted from table rows, we defined 4 special characters to describe the structure of the text:

- ¤ : indicates an empty cell
- / : indicates the separation into columns
- ? : indicates that the content of the cell following this symbol is written above the regular baseline
- ! : indicates that the content of the cell following this symbol is written below the regular baseline

There are three splits: train, validation and test.
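The four special characters above can be turned back into a cell structure with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the `Cell` class, the `parse_line` function and the sample string are all hypothetical, and the sketch assumes that a `?` or `!` marker appears at the start of the cell it qualifies and that whitespace around `/` is not significant.

```python
from dataclasses import dataclass
from typing import List

EMPTY = "¤"      # marks an empty cell
SEPARATOR = "/"  # separates columns
ABOVE = "?"      # following content is written above the baseline
BELOW = "!"      # following content is written below the baseline


@dataclass
class Cell:
    text: str                    # cell content with the markers removed
    position: str = "baseline"   # "baseline", "above" or "below"
    empty: bool = False          # True when the cell was marked with ¤


def parse_line(transcription: str) -> List[Cell]:
    """Split one transcribed table row into cells (assumed marker layout)."""
    cells = []
    for raw in transcription.split(SEPARATOR):
        raw = raw.strip()
        if raw == EMPTY:
            cells.append(Cell(text="", empty=True))
        elif raw.startswith(ABOVE):
            cells.append(Cell(text=raw[1:].strip(), position="above"))
        elif raw.startswith(BELOW):
            cells.append(Cell(text=raw[1:].strip(), position="below"))
        else:
            cells.append(Cell(text=raw))
    return cells


# Invented sample row, for illustration only (not taken from the dataset)
for cell in parse_line("Martin/Louise/¤/?1898/!Paris"):
    print(cell)
```

If real transcriptions place the markers differently (for example in the middle of a cell), the `startswith` checks would need to be adapted.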
## How to use it

```python
from datasets import load_dataset
import numpy as np

# Download the dataset from the Hugging Face Hub (cached after the first call)
dataset = load_dataset("agomberto/FrenchCensus-handwritten-texts")

# Pick a random training example; indexing the row first decodes only one image
i = int(np.random.randint(len(dataset["train"])))
example = dataset["train"][i]
img = example["image"]
text = example["text"]

print(text)
img  # in a notebook, this displays the line image
```

## BibTeX entry and citation info

```bibtex
@InProceedings{10.1007/978-3-031-06555-2_10,
  author="Constum, Thomas and Kempf, Nicolas and Paquet, Thierry and Tranouez, Pierrick and Chatelain, Cl{\'e}ment and Br{\'e}e, Sandra and Merveille, Fran{\c{c}}ois",
  editor="Uchida, Seiichi and Barney, Elisa and Eglin, V{\'e}ronique",
  title="Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early {$20^{th}$} Century Paris Census",
  booktitle="Document Analysis Systems",
  year="2022",
  publisher="Springer International Publishing",
  address="Cham",
  pages="143--157",
  abstract="We aim to build a vast database (up to 9 million individuals) from the handwritten tabular nominal census of Paris of 1926, 1931 and 1936, each composed of about 100,000 handwritten simple pages in a tabular format. We created a complete pipeline that goes from the scan of double pages to text prediction while minimizing the need for segmentation labels. We describe how weighted finite state transducers, writer specialization and self-training further improved our results. We also introduce through this communication two annotated datasets for handwriting recognition that are now publicly available, and an open-source toolkit to apply WFST on CTC lattices.",
  isbn="978-3-031-06555-2"
}
```
Provider:
agomberto
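Summary of original information

Dataset overview

Basic information

  • Language: French
  • License: MIT
  • Size category: 1K<n<10K
  • Task category: image-to-text
  • Tags: image-to-text, trocr

Dataset structure

  • Features:
    • image: dtype image
    • text: dtype string

Data splits

  • Train:
    • Bytes: 501750699.816
    • Examples: 5601
  • Validation:
    • Bytes: 45084242.0
    • Examples: 707
  • Test:
    • Bytes: 49133043.0
    • Examples: 734

Dataset size

  • Download size: 459795745 bytes
  • Dataset size: 595967984.816 bytes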
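As a quick sanity check on the figures above, the short sketch below recomputes the total number of lines, each split's share, and the sum of the per-split byte counts. It uses only the numbers listed on this page; the variable names are illustrative.

```python
import math

# Example and byte counts copied from the split metadata on this page
examples = {"train": 5601, "validation": 707, "test": 734}
num_bytes = {"train": 501750699.816, "validation": 45084242.0, "test": 49133043.0}
dataset_size = 595967984.816

total = sum(examples.values())
shares = {name: n / total for name, n in examples.items()}

for name, n in examples.items():
    print(f"{name:<10} {n:>5} lines  {shares[name]:.1%}")
print(f"{'total':<10} {total:>5} lines")

# The per-split byte counts should add up to the reported dataset size
bytes_match = math.isclose(sum(num_bytes.values()), dataset_size, rel_tol=1e-9)
print("byte counts consistent:", bytes_match)
```

The split is roughly 80/10/10 between train, validation and test.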

Configuration

  • Default config:
    • Train: data/train-*
    • Validation: data/validation-*
    • Test: data/test-*