biglam/early_printed_books_font_detection

Name: biglam/early_printed_books_font_detection
Creator: biglam
Published: 2022-10-28 15:39:50
License: cc-by-nc-sa-4.0

Source: Hugging Face. Updated 2022-10-28; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/biglam/early_printed_books_font_detection


Resource description:

---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: labels
    sequence:
      class_label:
        names:
          0: greek
          1: antiqua
          2: other_font
          3: not_a_font
          4: italic
          5: rotunda
          6: textura
          7: fraktur
          8: schwabacher
          9: hebrew
          10: bastarda
          11: gotico_antiqua
  splits:
  - name: test
    num_bytes: 2345451
    num_examples: 10757
  - name: train
    num_bytes: 5430875
    num_examples: 24866
  download_size: 44212934313
  dataset_size: 7776326
annotations_creators:
- expert-generated
language: []
language_creators: []
license:
- cc-by-nc-sa-4.0
multilinguality: []
pretty_name: Early Printed Books Font Detection Dataset
size_categories:
- 10K<n<100K
source_datasets: []
tags: []
task_categories:
- image-classification
task_ids:
- multi-label-image-classification
---

# Dataset Card for Early Printed Books Font Detection Dataset

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** https://doi.org/10.5281/zenodo.3366686
- **Paper:** https://doi.org/10.1145/3352631.3352640
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

> This dataset is composed of photos of various resolution of 35'623 pages of printed books dating from the 15th to the 18th century. Each page has been attributed by experts from one to five labels corresponding to the font groups used in the text, with two extra-classes for non-textual content and fonts not present in the following list: Antiqua, Bastarda, Fraktur, Gotico Antiqua, Greek, Hebrew, Italic, Rotunda, Schwabacher, and Textura.

[More Information Needed]

### Supported Tasks and Leaderboards

The primary use case for this dataset is multi-label image classification.

- `multi-label-image-classification`: This dataset can be used to train a model for multi-label image classification, where each image can have one or more labels.
- `image-classification`: This dataset could also be adapted to predict only a single label for each image.

### Languages

The dataset includes books from a range of libraries (see below for further details). The paper doesn't provide a detailed breakdown by language. However, the books are from the 15th to the 18th century and appear to be dominated by European languages of that period. The dataset also includes Hebrew.

[More Information Needed]

## Dataset Structure

This dataset has a single configuration.

### Data Instances

An example instance from this dataset:

```python
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=3072x3840 at 0x7F6AC192D850>,
 'labels': [5]}
```

### Data Fields

This dataset contains two fields:

- `image`: the image of the book page
- `labels`: one or more labels for the font used in the book page depicted in the `image`

### Data Splits

The dataset is divided into a train and a test split with the following numbers of examples:

- train: 24,866
- test: 10,757

## Dataset Creation

### Curation Rationale

The dataset was created to help train and evaluate automatic methods for font detection. The paper describing the dataset also states that:

> data was cherry-picked, thus it is not statistically representative of what can be found in libraries. For example, as we had a small amount of Textura at the start, we specifically looked for more pages containing this font group, so we can expect that less than 3.6 % of randomly selected pages from libraries would contain Textura.

### Source Data

#### Initial Data Collection and Normalization

The images in this dataset are from books held by the British Library (London), Bayerische Staatsbibliothek München, Staatsbibliothek zu Berlin, Universitätsbibliothek Erlangen, Universitätsbibliothek Heidelberg, Staats- und Universitätsbibliothek Göttingen, Stadt- und Universitätsbibliothek Köln, Württembergische Landesbibliothek Stuttgart and Herzog August Bibliothek Wolfenbüttel.

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@github-username](https://github.com/<github-username>) for adding this dataset.

Provider:

biglam

Summary of original information

Dataset overview

Dataset name

Name: Early Printed Books Font Detection Dataset

Dataset features

Features:
- image: image data
- labels: label data, with the following classes:
  - 0: greek
  - 1: antiqua
  - 2: other_font
  - 3: not_a_font
  - 4: italic
  - 5: rotunda
  - 6: textura
  - 7: fraktur
  - 8: schwabacher
  - 9: hebrew
  - 10: bastarda
  - 11: gotico_antiqua
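
The class list above maps integer ids to font-group names. A minimal sketch of decoding a `labels` value into readable names, assuming the id order shown above; `ID2LABEL`, `LABEL2ID` and `decode_labels` are illustrative names and not part of the dataset or any library:

```python
# Font-group names in id order, copied from the class list above.
ID2LABEL = [
    "greek", "antiqua", "other_font", "not_a_font", "italic", "rotunda",
    "textura", "fraktur", "schwabacher", "hebrew", "bastarda", "gotico_antiqua",
]
# Reverse lookup: name -> integer id.
LABEL2ID = {name: idx for idx, name in enumerate(ID2LABEL)}


def decode_labels(label_ids):
    """Turn a list of integer label ids into font-group names."""
    return [ID2LABEL[i] for i in label_ids]


print(decode_labels([5]))  # ['rotunda']
```

The value `[5]` is the `labels` field of the sample instance, so that page is labelled as Rotunda.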

Dataset structure

Data instance (Python): {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=3072x3840 at 0x7F6AC192D850>, 'labels': [5]}
Data fields:
- image: the image of the book page
- labels: one or more labels for the font used in the image
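
A hedged sketch of how a few instances with these two fields might be inspected. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the id shown on this page; depending on the library version and on how the repository is stored, additional arguments may be required. `peek_examples` is a hypothetical helper name. Streaming is used because the listed download size is about 44 GB.

```python
from itertools import islice


def peek_examples(split="train", limit=3):
    """Stream a few examples instead of downloading the whole dataset."""
    # Imported lazily so the sketch can be read without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote data without a full download.
    ds = load_dataset(
        "biglam/early_printed_books_font_detection",
        split=split,
        streaming=True,
    )
    # Each example is a dict with an `image` (assumed to decode to a PIL
    # image) and a `labels` list of integer class ids.
    return [(ex["image"].size, ex["labels"]) for ex in islice(ds, limit)]
```

Calling `peek_examples()` needs network access, so it is left uncalled here.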

Dataset splits

Splits:
- train: 24,866 examples
- test: 10,757 examples
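
As a quick check on these numbers, the two splits can be summed and expressed as proportions. This is plain arithmetic on the counts listed above; `SPLIT_SIZES` is an illustrative name.

```python
# Split sizes as listed above.
SPLIT_SIZES = {"train": 24_866, "test": 10_757}

total = sum(SPLIT_SIZES.values())
fractions = {name: n / total for name, n in SPLIT_SIZES.items()}

print(total)  # 35623
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

The total of 35,623 agrees with the page count given in the dataset summary, and the split works out to roughly 70% train and 30% test.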

Dataset size

Download size: 44,212,934,313 bytes
Dataset size: 7,776,326 bytes
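
To put the download size in more familiar units, the byte count can be converted as sketched below. The per-page average is only a rough figure: it assumes the download consists almost entirely of the 35,623 page images, which the listing does not state. The helper names are illustrative.

```python
DOWNLOAD_BYTES = 44_212_934_313   # download size listed above
NUM_EXAMPLES = 24_866 + 10_757    # train + test examples


def to_gb(n_bytes):
    """Decimal gigabytes (10**9 bytes)."""
    return n_bytes / 10**9


def to_gib(n_bytes):
    """Binary gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


# Rough average size per page image, under the assumption stated above.
avg_mb = DOWNLOAD_BYTES / NUM_EXAMPLES / 10**6

print(f"{to_gb(DOWNLOAD_BYTES):.1f} GB / {to_gib(DOWNLOAD_BYTES):.1f} GiB")
print(f"about {avg_mb:.2f} MB per page on average")
```

The listed dataset size (7,776,326 bytes) is far smaller than the download size, so it presumably does not count the image data itself; that is an inference, not something the card states.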

License

License: cc-by-nc-sa-4.0

Task categories

Task category: image-classification
Task ID: multi-label-image-classification
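
For multi-label image classification, the variable-length list of label ids is commonly turned into a fixed-length multi-hot target vector. The sketch below shows that encoding for the 12 classes of this dataset, plus one possible way to adapt to single-label classification by keeping only pages with exactly one label. Both helpers are illustrative assumptions, not a method prescribed by the dataset card.

```python
NUM_CLASSES = 12  # ids 0..11 in the class list of this dataset


def to_multi_hot(label_ids, num_classes=NUM_CLASSES):
    """Encode a list of label ids as a multi-hot vector of floats."""
    vec = [0.0] * num_classes
    for i in label_ids:
        vec[i] = 1.0
    return vec


def to_single_label(label_ids):
    """Return the label if the page has exactly one, otherwise None."""
    return label_ids[0] if len(label_ids) == 1 else None


print(to_multi_hot([5]))
print(to_single_label([1, 4]))  # None: this page carries two labels
```

A multi-hot target pairs naturally with a per-class sigmoid output and binary cross-entropy loss, whereas the single-label variant discards multi-font pages and would shrink the training set.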

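Dataset creation

Annotation creators: expert-generated
Curation rationale: to train and evaluate automatic font detection methods
Source data: images of books from multiple libraries

Considerations for using the data

Representativeness: the data was cherry-picked and is not statistically representative of what can be found in libraries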
