biglam/newspaper-navigator
Hugging Face | Updated 2025-05-20 | Indexed 2025-05-31
Download link:
https://hf-mirror.com/datasets/biglam/newspaper-navigator
Resource summary:
The Newspaper Navigator dataset is a Parquet-converted collection drawn from over 16 million pages of historical US newspapers, annotated with bounding boxes, predicted visual types (for example, photos and maps), and OCR content. It is split into configurations by predicted visual content type, such as cartoons, comics, illustrations, maps, and photos. Each configuration includes metadata such as publication date, location, LCCN, bounding boxes, OCR text, IIIF image links, and source URLs. The dataset supports tasks such as image classification, object detection, and image feature extraction, and can be used in historical research, machine learning, digital humanities, and education. The README also explains how to load the dataset with HuggingFace Datasets and covers the dataset structure, potential applications, and ethical considerations.
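Since the summary mentions loading through HuggingFace Datasets and per-type configurations, a minimal Python sketch of that workflow might look like the following. The configuration name "maps", the "train" split, and the `pub_date` field name are assumptions made for illustration and are not confirmed by this listing; check the dataset README for the actual names.

```python
from datetime import date

REPO_ID = "biglam/newspaper-navigator"  # repository name taken from this listing


def stream_config(config_name="maps"):
    """Lazily stream one configuration instead of downloading it in full.

    The config name and the "train" split are assumptions, not confirmed here.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(REPO_ID, config_name, split="train", streaming=True)


def in_year_range(record, first_year, last_year, date_field="pub_date"):
    """Return True if the record's ISO date (YYYY-MM-DD) lies in the year range.

    `date_field` is a hypothetical field name for the publication date.
    Records with a missing or unparseable date are treated as out of range.
    """
    value = record.get(date_field)
    if not value:
        return False
    try:
        year = date.fromisoformat(str(value)[:10]).year
    except ValueError:
        return False
    return first_year <= year <= last_year
```

A typical use would be to iterate over `stream_config("maps")` and keep only the records for which `in_year_range(record, 1900, 1910)` is true. Streaming is chosen here because the full collection is large, so filtering record by record avoids a complete download.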
Provider:
biglam
Dataset introduction

Background overview
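This is a dataset of visual content from historical newspapers. It draws on over 16 million pages of historical US newspapers and covers visual types such as photos, maps, and comics. It provides rich metadata and IIIF image links, and is suited to historical research, machine learning, and digital humanities. The dataset is distributed in Parquet format, which makes it easy to query and to integrate into modern machine learning pipelines.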
The content above was collected and summarized by 遇见数据集.



