mchelali/forbin_dataset
Source: Hugging Face (updated 2025-12-09, indexed 2025-12-20)
Download link: https://hf-mirror.com/datasets/mchelali/forbin_dataset
---
dataset_name: "Forbin Dataset"
tags:
- humanities
- digital-humanities
- archives
- historical-documents
- text-detection
- polygon-annotation
- verso-recto photographs
license: cc-by-nc-4.0
task_categories:
- object-detection
- feature-extraction
- image-classification
pretty_name: "Forbin Dataset: A collection of historical photographs with archival metadata"
---
# Forbin Dataset: *A collection of historical photographs with archival metadata*
This repository hosts the *Forbin Dataset*, a large-scale collection of historical photographs taken or collected by **Victor Forbin (1868–1947)**.
This Hugging Face version of the dataset provides:
- COCO-style annotations (segmentation polygons)
- Archival metadata (Box ID, description, notes, dates when available)
- A lightweight **explorer interface** (HTML/JS) to preview images and annotations: [https://mchelali.github.io/forbin_dataset/](https://mchelali.github.io/forbin_dataset/)
## 📜 Dataset Description
The Forbin Dataset contains digitized historical photographs from the personal archives of Victor Forbin, a French explorer, photographer, and writer.
Images are accompanied by rich metadata and manually extracted segmentation polygons suitable for:
- Computer Vision
- Document Analysis
- Cultural Heritage Studies
- Machine Learning Research
The sample included here is intended for **illustration and early experimentation only**.
The upcoming full release will contain tens of thousands of images with complete metadata and annotations.
## 🛠️ Data Access and Usage Instructions
Given the size of the image archives, the dataset must be loaded in a two-step process: **Local Download** followed by **Indexing**.
### 1. Downloading the Raw Data Files (Images and Annotations) ⬇️
The dataset is distributed as WebDataset archives (`.tar`) and separate JSON annotation files. **You must download these files locally before starting the training process.**
| File | Content | Note |
| :--- | :--- | :--- |
| **`forbin_all.json`** | All Image IDs, metadata, and annotations (for annotated images). | Used for full dataset indexing. |
| **`forbin_annotated.json`** | Only images that have associated annotations (simplified index). | Useful for training on annotation tasks. |
| **`data/*.tar`** | WebDataset archives containing all raw images. | **Large files.** |
#### **Mode A: Via the Hugging Face Command Line Interface (CLI)**
This is the fastest method for users familiar with the terminal.
```bash
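# Requires a recent huggingface_hub, which provides the `hf` command: pip install -U huggingface_hub
# (older releases expose the same command as `huggingface-cli download`)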
hf download mchelali/forbin_dataset --repo-type dataset --local-dir ./forbin_data_local
```
#### **Mode B: Via Python (Recommended for Resumable Downloads)**
This method uses the official `huggingface_hub` Python API, which automatically resumes an interrupted download.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mchelali/forbin_dataset",
    repo_type="dataset",
    local_dir="./forbin_data_local",  # Your chosen destination folder
)
```
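After either download mode finishes, it can help to check what actually landed on disk before building a loader. The sketch below assumes the layout described in the table above (JSON files at the top level of the local directory, archives under `data/`). The function name `inventory` is hypothetical and not part of any library.

```python
from pathlib import Path


def inventory(local_dir):
    """List the JSON manifests and .tar archives found under local_dir.

    Assumes JSON files sit at the top level and archives under data/,
    as described in the file table; adjust the patterns if the layout differs.
    """
    root = Path(local_dir)
    manifests = sorted(p.name for p in root.glob("*.json"))
    archives = sorted(root.glob("data/*.tar"))
    total_bytes = sum(p.stat().st_size for p in archives)
    return {
        "manifests": manifests,
        "archives": [p.name for p in archives],
        "archive_bytes": total_bytes,
    }


if __name__ == "__main__":
    # Same destination folder as in the download examples.
    print(inventory("./forbin_data_local"))
```

An empty `archives` list after a download suggests that the transfer was interrupted or that the destination folder differs from the one passed here.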
#### **Web Download Interface (For Humanities and Social Sciences (SHS) Researchers):**
For users less familiar with the command line, we provide a dedicated web interface to download the individual `.tar` archives one by one:
➡️ **Web Download Interface:** [https://mchelali.github.io/forbin_dataset/download.html](https://mchelali.github.io/forbin_dataset/download.html)
-----
### 2. Indexing and Annotation Usage 📚
Once the `*.json` and `*.tar` files are downloaded locally, you can build your own data loading pipeline.
**Annotation Format:**
All annotations (including textual metadata, bounding boxes, and segmentation polygons) are provided in the standard **COCO (Common Objects in Context) format**. This ensures compatibility with existing computer vision tools and libraries like PyTorch, TensorFlow, and `pycocotools`.
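As a small illustration of the polygon format: in COCO, a polygon segmentation is a flat list `[x1, y1, x2, y2, ...]` and a bounding box is `[x, y, width, height]`. The sketch below derives a box and an area from one polygon. The helper names and the example coordinates are illustrative and are not taken from the dataset.

```python
def polygon_to_bbox(polygon):
    """Return the COCO-style [x, y, width, height] box enclosing a flat polygon."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


def polygon_area(polygon):
    """Area of a simple (non self-intersecting) polygon via the shoelace formula."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


# Illustrative polygon (not from the dataset): a 40 x 20 rectangle.
example = [10, 10, 50, 10, 50, 30, 10, 30]
print(polygon_to_bbox(example))  # [10, 10, 40, 20]
print(polygon_area(example))     # 800.0
```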
The JSON file acts as your **Manifest** (Index Table). It links the image ID (via `image_id`) to the image's location within the `.tar` archives (via the `file_names` field in the `images` section).
**To use the dataset:**
1. Load the JSON file (`forbin_all.json` or `forbin_annotated.json`) into your program.
2. Use the Python `tarfile` (or `webdataset`) library to open the corresponding `.tar` archive and load the image bytes based on the path provided in the `file_names` field.
3. Apply the COCO annotations (found in the `annotations` section of the JSON) to the loaded image.
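The three steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the authors' loader: `load_manifest` and `read_member` are hypothetical helper names, and the code assumes that the path stored in an image record matches a member name inside the archive. Which archive holds a given image is not specified here, so the caller has to supply it.

```python
import json
import tarfile
from collections import defaultdict


def load_manifest(json_path):
    """Load a COCO-style JSON file and group its annotations by image_id."""
    with open(json_path, "r", encoding="utf-8") as f:
        coco = json.load(f)
    anns_by_image = defaultdict(list)
    for ann in coco.get("annotations", []):
        anns_by_image[ann["image_id"]].append(ann)
    return coco.get("images", []), anns_by_image


def read_member(tar_path, member_name):
    """Return the raw bytes of one file stored inside a .tar archive.

    Raises KeyError if the archive has no member with that name.
    """
    with tarfile.open(tar_path, "r") as tar:
        extracted = tar.extractfile(member_name)
        if extracted is None:
            # extractfile returns None for directories and other non-file members.
            raise FileNotFoundError(f"{member_name} is not a regular file")
        return extracted.read()
```

A typical use is to call `load_manifest` on `forbin_annotated.json`, pick an image record, pass the value of its `file_names` field to `read_member` together with the path of the archive that contains it, and decode the returned bytes with an image library. Reopening the archive for every image is slow, so for bulk reads it is better to keep one `tarfile` handle open or to stream the archive with `webdataset`.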
## 🔖 License
This sample dataset is released under the following license:
**Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)**
➡️ https://creativecommons.org/licenses/by-nc/4.0/
This means:
- ✔ You must provide attribution
- ✔ You may share and adapt the material
- ❌ You may **not** use it for commercial purposes
## 📚 Citation
If you use this dataset or the sample in academic work, please cite the forthcoming data paper:
```
[Under review]
Chelali M., Gosselet S. K., Cloppet F., Kurtz C., Bloch I. and Foliard D.,
The Forbin Dataset: A collection of historical photographs with archival metadata, 2025.
```
## 🤝 Acknowledgment of Authors
This dataset originates from the personal archives of **Victor Forbin**, digitized and curated by the *High Vision Project – Archives & Vision Initiative*.
All annotation and data processing work was performed by the project contributors.
This work is supported by the French National Research Agency (ANR) under project **ANR-24-CE38-4079**.
Provider: mchelali



