ReadingTimeMachine/historical_dla
Source: Hugging Face. Updated 2024-03-25; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/ReadingTimeMachine/historical_dla
Resource description:
---
license: apache-2.0
language:
- en
---
## Dataset Introduction
This dataset contains roughly 6,000 hand-annotated pages with bounding boxes for figures, figure captions, tables, and math formulas.
Coverage is most complete for figures and captions; some pages might not have all of their tables and math formulas annotated.
The format is JSON and the fields include lists, so the Hugging Face dataset viewer does not necessarily display it well.
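To see what a row contains, one option is to load the file and summarize the first record. The sketch below is not part of the dataset's own tooling: `load_rows` and `describe_row` are hypothetical helpers, and it assumes the file is either a single JSON array of records or JSON Lines (one object per line). The actual layout and field names are not stated here, so check them against the file you download.

```python
import json
from pathlib import Path


def load_rows(path):
    """Load records from a JSON array file or a JSON Lines file (assumed layouts)."""
    text = Path(path).read_text(encoding="utf-8").strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single top-level object is wrapped so callers always get a list.
    return data if isinstance(data, list) else [data]


def describe_row(row):
    """Map each field name to its type name, with the length for list fields."""
    summary = {}
    for key, value in row.items():
        if isinstance(value, list):
            summary[key] = f"list[{len(value)}]"
        else:
            summary[key] = type(value).__name__
    return summary
```

Calling `describe_row(load_rows(path)[0])` on a downloaded file gives a quick view of which fields hold lists (for example, lists of box coordinates) before any plotting is attempted.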

## How to use this data
To plot an example, see the [trial_example_from_data.ipynb](https://huggingface.co/datasets/ReadingTimeMachine/historical_dla/blob/main/trial_example_from_data.ipynb) notebook.
This assumes you have the data and the [data_utils.py](https://huggingface.co/datasets/ReadingTimeMachine/historical_dla/blob/main/data_utils.py) file in the same location as your notebook.
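One way to get the notebook and `data_utils.py` into the same folder is to download them from the links above. The sketch below is an illustration only: `to_download_url` and `fetch_missing` are hypothetical helper names, and it relies on the Hugging Face Hub convention that a `/blob/` page URL has a matching `/resolve/` download URL. The data files themselves are not named here, so they are not included in the list.

```python
from pathlib import Path
from urllib.request import urlretrieve

# File links as given on the dataset card (these are "blob" page URLs).
CARD_LINKS = [
    "https://huggingface.co/datasets/ReadingTimeMachine/historical_dla/blob/main/trial_example_from_data.ipynb",
    "https://huggingface.co/datasets/ReadingTimeMachine/historical_dla/blob/main/data_utils.py",
]


def to_download_url(blob_url):
    """Turn a Hub 'blob' page URL into the matching 'resolve' download URL."""
    return blob_url.replace("/blob/", "/resolve/", 1)


def fetch_missing(links, target_dir="."):
    """Download each linked file into target_dir unless it is already there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    fetched = []
    for link in links:
        destination = target / link.rsplit("/", 1)[-1]
        if not destination.exists():
            urlretrieve(to_download_url(link), str(destination))
            fetched.append(destination.name)
    return fetched
```

Running `fetch_missing(CARD_LINKS)` from the notebook's folder downloads only the files that are not already present and returns their names.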
The following packages will have to be installed:
```python
matplotlib
numpy
pandas
wand  # pip package: Wand (requires ImageMagick)
PIL  # pip package: Pillow
wget
cv2  # OpenCV; pip package: opencv-python
```
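Before opening the notebook, it can help to check which of these packages are already importable. The sketch below uses the standard library's `importlib.util.find_spec`; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the dataset's code, and the pip names are the usual ones for each import name.

```python
from importlib.util import find_spec

# Import name -> usual pip package name (the two differ for some entries).
REQUIRED = {
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "pandas": "pandas",
    "wand": "Wand",
    "PIL": "Pillow",
    "wget": "wget",
    "cv2": "opencv-python",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if find_spec(module) is None
    ]
```

`missing_packages()` returns a list that can be passed to `pip install`. Note that `wand` can be found by this check while still failing at import time if the ImageMagick system library is absent.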
On Google Colab, to install `wand` we found we had to do the following (this is not in the linked notebook):
```python
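# System libraries that the Wand bindings depend on
!apt install imagemagick
!apt-get install libmagickwand-dev
!pip install Wand
# Remove ImageMagick's default policy file, which can otherwise block some operations
!rm /etc/ImageMagick-6/policy.xml
!pip install wget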
```
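Provider:
ReadingTimeMachine
Summary of original information
Dataset overview
Dataset contents
- About 6,000 pages with hand-annotated bounding boxes covering figures, figure captions, tables, and math formulas.
- Annotation coverage is broader for figures and their captions; some pages may not have all tables and math formulas annotated.
Data format
- The data is in JSON format and includes lists.
Usage example
- An example of how to use the data is provided, including installing the required Python packages and the installation steps specific to Google Colab.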



