biglam/loc_beyond_words

Name: biglam/loc_beyond_words
Creator: biglam
Published: 2025-05-07 10:59:45
License: No description provided

Source: Hugging Face. Updated 2025-05-07. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/biglam/loc_beyond_words

Description:

The Beyond Words dataset is a crowdsourced collection of bounding box annotations on World War I-era historical newspaper pages from the Library of Congress’s Chronicling America collection. It includes annotations for seven types of visual content — photographs, illustrations, maps, comics, editorial cartoons, headlines, and advertisements — to train the visual content recognition model behind the Newspaper Navigator project. It serves as a foundational dataset for large-scale document layout analysis in historical archives.

Provided by:

biglam

Summary of original information

Dataset overview

Dataset name

Name: Beyond Words

Dataset features

Features:
- image_id: integer (int64)
- image: image
- width: integer (int32)
- height: integer (int32)
- objects: sequence
  - bw_id: string
  - category_id: class label
    - 0: Photograph
    - 1: Illustration
    - 2: Map
    - 3: Comics/Cartoon
    - 4: Editorial Cartoon
    - 5: Headline
    - 6: Advertisement
  - image_id: string
  - id: integer (int64)
  - area: integer (int64)
  - bbox: sequence of 4 floats (float32)
  - iscrowd: boolean (bool)
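
As a rough illustration of how records with this schema might be handled, the sketch below maps `category_id` values to the label names listed above and counts annotations per category. It rests on two assumptions that the source does not state, so check both against real records: that the `objects` sequence arrives as a dictionary of parallel lists (the usual in-memory layout for a sequence of structs in the Hugging Face `datasets` library), and that `bbox` follows the COCO-style `[x, y, width, height]` convention. The helper names and all values in `sample_record` are made up for illustration.

```python
from collections import Counter

# Label names in id order, copied from the category_id listing above.
CATEGORY_NAMES = [
    "Photograph", "Illustration", "Map", "Comics/Cartoon",
    "Editorial Cartoon", "Headline", "Advertisement",
]


def iter_objects(objects):
    """Yield one dict per annotation from a dict-of-lists `objects` field.

    Assumption: `objects` maps each field name to a list with one entry
    per annotation.
    """
    keys = list(objects)
    for values in zip(*(objects[key] for key in keys)):
        yield dict(zip(keys, values))


def bbox_to_corners(bbox):
    """Convert [x, y, width, height] to (x_min, y_min, x_max, y_max).

    Assumption: COCO-style boxes; the source does not state the format.
    """
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def count_categories(record):
    """Count annotations per category name for a single page record."""
    return Counter(
        CATEGORY_NAMES[obj["category_id"]]
        for obj in iter_objects(record["objects"])
    )


# Hypothetical record shaped like the schema above; every value is made up.
sample_record = {
    "image_id": 0,
    "width": 1000,
    "height": 1500,
    "objects": {
        "bw_id": ["a", "b", "c"],
        "category_id": [0, 5, 5],
        "image_id": ["0", "0", "0"],
        "id": [1, 2, 3],
        "area": [20000, 25000, 6000],
        "bbox": [
            [10.0, 20.0, 200.0, 100.0],
            [0.0, 0.0, 500.0, 50.0],
            [300.0, 400.0, 100.0, 60.0],
        ],
        "iscrowd": [False, False, False],
    },
}

category_counts = count_categories(sample_record)
first_box_corners = bbox_to_corners(sample_record["objects"]["bbox"][0])
```

With the Hugging Face `datasets` library installed, real records would typically come from `load_dataset("biglam/loc_beyond_words", split="train")` in place of `sample_record`.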

Dataset splits

Splits:
- train: 2846 examples, 2854507 bytes
- validation: 712 examples, 731782 bytes
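
The split counts above imply a ratio of roughly 80/20 between train and validation. A minimal sketch of that arithmetic, using only the two counts listed above (the variable names are illustrative):

```python
# Example counts copied from the split listing above.
SPLIT_SIZES = {"train": 2846, "validation": 712}

total_examples = sum(SPLIT_SIZES.values())
split_fractions = {
    name: count / total_examples for name, count in SPLIT_SIZES.items()
}

for name, fraction in split_fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} examples ({fraction:.1%})")
```

The total of 3558 examples falls within the 1K<n<10K size category reported for the dataset.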

Dataset size

Download size: 1200053819 bytes
Dataset size: 3586289 bytes
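
The byte counts are easier to read in larger units, and the reported dataset size can be checked against the per-split sizes. The sketch below does both; `human_size` is a hypothetical helper using decimal (SI) units, not part of any library. The source does not explain why the download is so much larger than the reported dataset size; one plausible reading is that the download includes the page images while the dataset size covers only the tabular records, but that is an assumption.

```python
# Byte counts copied from the listing above.
SPLIT_BYTES = {"train": 2854507, "validation": 731782}
DOWNLOAD_SIZE_BYTES = 1200053819
DATASET_SIZE_BYTES = 3586289


def human_size(num_bytes):
    """Format a byte count using decimal (SI) units, e.g. 1000 B = 1 kB."""
    units = ["B", "kB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1000


# The reported dataset size should equal the sum of the split sizes.
splits_match_total = sum(SPLIT_BYTES.values()) == DATASET_SIZE_BYTES

download_readable = human_size(DOWNLOAD_SIZE_BYTES)
dataset_readable = human_size(DATASET_SIZE_BYTES)
print(f"download: {download_readable}, dataset: {dataset_readable}")
```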

License

License: cc0-1.0

Task category

Task category: object detection

Size category

Size category: 1K<n<10K
