biglam/illustrated_ads
Hugging Face. Updated 2023-01-18; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/biglam/illustrated_ads
Resource description:
---
annotations_creators:
- expert-generated
language: []
language_creators: []
license:
- cc0-1.0
multilinguality: []
pretty_name: 19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels
size_categories:
- n<1K
source_datasets: []
tags:
- lam
- historic newspapers
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
---
# Dataset Card for 19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:** [https://doi.org/10.5281/zenodo.5838410](https://doi.org/10.5281/zenodo.5838410)
- **Paper:** [https://doi.org/10.46430/phen0101](https://doi.org/10.46430/phen0101)
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
The dataset contains images derived from [Newspaper Navigator](https://news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress [Chronicling America](https://chroniclingamerica.loc.gov/) collection.
> [The Newspaper Navigator dataset](https://news-navigator.labs.loc.gov/) consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project. source: https://news-navigator.labs.loc.gov/
One of the categories of visual content in Newspaper Navigator is 'advertisements'. This dataset contains a sample of those advertisement images, each with an additional label indicating whether the advert is 'illustrated' or 'not illustrated'.
This dataset was created for use in a [Programming Historian tutorial](http://programminghistorian.github.io/ph-submissions/lessons/computer-vision-deep-learning-pt1). Its primary aim is to provide a realistic example dataset for teaching computer vision with digitised heritage material.
### Supported Tasks and Leaderboards
- `image-classification`: the primary purpose of this dataset is to classify historic newspaper images identified as 'advertisements' into 'illustrated' and 'not illustrated' categories.
### Languages
[More Information Needed]
## Dataset Structure
### Data Instances
An example instance from this dataset:
``` python
{'file': 'pst_fenske_ver02_data_sn84026497_00280776129_1880042101_0834_002_6_96.jpg',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=L size=388x395 at 0x7F9A72038950>,
 'label': 0,
 'pub_date': Timestamp('1880-04-21 00:00:00'),
 'page_seq_num': 834,
 'edition_seq_num': 1,
 'batch': 'pst_fenske_ver02',
 'lccn': 'sn84026497',
 'box': [0.649412214756012,
         0.6045778393745422,
         0.8002520799636841,
         0.7152365446090698],
 'score': 0.9609346985816956,
 'ocr': "H. II. IIASLKT & SOXN, Dealers in General Merchandise In New Store Room nt HASLET'S COS ITERS, 'JTionoMtii, ln. .Tau'y 1st, 1?0.",
 'place_of_publication': 'Tionesta, Pa.',
 'geographic_coverage': "['Pennsylvania--Forest--Tionesta']",
 'name': 'The Forest Republican. [volume]',
 'publisher': 'Ed. W. Smiley',
 'url': 'https://news-navigator.labs.loc.gov/data/pst_fenske_ver02/data/sn84026497/00280776129/1880042101/0834/002_6_96.jpg',
 'page_url': 'https://chroniclingamerica.loc.gov/data/batches/pst_fenske_ver02/data/sn84026497/00280776129/1880042101/0834.jp2'}
```
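A few fields in the instance above need light post-processing before use. The sketch below is a minimal illustration, not part of the dataset's own tooling. It assumes that `box` holds `[x1, y1, x2, y2]` as fractions of the page width and height (this card does not state the convention), and that `geographic_coverage` is a Python-style list serialised as a string, as it appears in the example. The helper names and the page size are hypothetical.

```python
import ast


def parse_geographic_coverage(value):
    """Turn the stringified list (e.g. "['A--B--C']") into a real list."""
    return list(ast.literal_eval(value))


def box_to_pixels(box, page_width, page_height):
    """Scale a normalised [x1, y1, x2, y2] box to integer pixel coordinates.

    Assumption: the box is expressed as fractions of the page size.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * page_width),
        round(y1 * page_height),
        round(x2 * page_width),
        round(y2 * page_height),
    )


# Values copied from the example instance above.
example = {
    "box": [0.649412214756012, 0.6045778393745422,
            0.8002520799636841, 0.7152365446090698],
    "geographic_coverage": "['Pennsylvania--Forest--Tionesta']",
}

places = parse_geographic_coverage(example["geographic_coverage"])
# Hypothetical page size in pixels, chosen only for illustration.
pixel_box = box_to_pixels(example["box"], page_width=1000, page_height=2000)
print(places)
print(pixel_box)
```

`ast.literal_eval` is used rather than `eval` so that only Python literals are accepted from the metadata string.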
### Data Fields
[More Information Needed]
### Data Splits
The dataset contains a single split.
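Because only one split is provided, any held-out set has to be carved out by the user. The following is a minimal, standard-library sketch of a stratified split over the integer `label` field. The function name, the 80/20 ratio, the seed, and the toy labels are assumptions made for illustration and are not part of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=42):
    """Return (train_indices, valid_indices) with similar label proportions."""
    # Group row indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, valid = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        valid.extend(indices[:n_valid])
        train.extend(indices[n_valid:])
    return sorted(train), sorted(valid)


# Toy labels standing in for the dataset's `label` column.
toy_labels = [0] * 10 + [1] * 5
train_idx, valid_idx = stratified_split(toy_labels)
print(len(train_idx), len(valid_idx))
```

With the Hugging Face `datasets` library, the resulting index lists could be passed to `Dataset.select` to build the two subsets.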
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
The annotation process is described in the [nnanno GitHub repository](https://github.com/Living-with-machines/nnanno).
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The card metadata lists the license as CC0 1.0 (`cc0-1.0`).
### Citation Information
``` bibtex
@dataset{van_strien_daniel_2021_5838410,
  author    = {van Strien, Daniel},
  title     = {{19th Century United States Newspaper Advert images
                with 'illustrated' or 'non illustrated' labels}},
  month     = oct,
  year      = 2021,
  publisher = {Zenodo},
  version   = {0.0.1},
  doi       = {10.5281/zenodo.5838410},
  url       = {https://doi.org/10.5281/zenodo.5838410}
}
```
### Contributions
Thanks to [@davanstrien](https://github.com/davanstrien) for adding this dataset.
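Provider:
biglam
Summary of original information
Dataset overview
Dataset name
- Name: 19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels
Dataset attributes
- Language: no specific language information
- License: CC0-1.0
- Multilinguality: not applicable
- Size: fewer than 1,000 instances
- Annotation creators: expert-generated
- Task category: image classification
- Task ID: multi-class image classification
Dataset description
- Summary: The dataset contains images extracted from Newspaper Navigator, which draws on the Library of Congress Chronicling America collection. Each advertisement image is labelled 'illustrated' or 'non illustrated'.
- Use: Primarily for teaching computer vision, particularly for working with digitised heritage material.
Dataset structure
- Data instances: Each instance includes the file name, image, label, publication date, page sequence number, edition sequence number, batch, LCCN, bounding box, score, OCR text, place of publication, geographic coverage, name, publisher, URL, and page URL.
- Data fields: Include the image, the label, and further details.
- Data splits: A single split.
Dataset creation
- Source data: The data comes from Newspaper Navigator, which contains visual content extracted from 16,358,041 historic newspaper pages.
- Annotations: The annotation process is described in a GitHub repository.
Considerations for use
- Social impact: no details provided.
- Discussion of biases: no details provided.
- Other known limitations: no details provided.
Additional information
- Dataset creator: Daniel van Strien
- License information: CC0-1.0
- Citation information: `@dataset{van_strien_daniel_2021_5838410, author = {van Strien, Daniel}, title = {{19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels}}, month = oct, year = 2021, publisher = {Zenodo}, version = {0.0.1}, doi = {10.5281/zenodo.5838410}, url = {https://doi.org/10.5281/zenodo.5838410}}`
- Contributions: Thanks to @davanstrien for adding this dataset.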



