mchelali/forbin_dataset

Name: mchelali/forbin_dataset
Creator: mchelali
Published: 2025-12-09 08:49:50
License: CC BY-NC 4.0

Source: Hugging Face. Updated: 2025-12-09. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/mchelali/forbin_dataset


Description:

---
dataset_name: "Forbin Dataset"
tags:
- humanities
- digital-humanities
- archives
- historical-documents
- text-detection
- polygon-annotation
- verso-recto photographs
license: cc-by-nc-4.0
task_categories:
- object-detection
- feature-extraction
- image-classification
pretty_name: "Forbin Dataset: A collection of historical photographs with archival metadata"
---

# Forbin Dataset: *A collection of historical photographs with archival metadata*

This repository hosts the *Forbin Dataset*, a large-scale collection of historical photographs taken or collected by **Victor Forbin (1868–1947)**. This HuggingFace dataset version provides:

- COCO-style annotations (segmentation polygons)
- Archival metadata (Box ID, description, notes, dates when available)
- A lightweight **explorer interface** (HTML/JS) to preview images and annotations: [https://mchelali.github.io/forbin_dataset/](https://mchelali.github.io/forbin_dataset/)

## 📜 Dataset Description

The Forbin Dataset contains digitized historical photographs from the personal archives of Victor Forbin, a French explorer, photographer, and writer. Images are accompanied by rich metadata and manually extracted segmentation polygons suitable for:

- Computer Vision
- Document Analysis
- Cultural Heritage Studies
- Machine Learning Research

The sample included here is intended for **illustration and early experimentation only**. The upcoming full release will contain tens of thousands of images with complete metadata and annotations.

## 🛠️ Data Access and Usage Instructions

Given the size of the image archives, the dataset must be loaded in a two-step process: **Local Download** followed by **Indexing**.

### 1. Downloading the Raw Data Files (Images and Annotations) ⬇️

The dataset is distributed as WebDataset archives (`.tar`) and separate JSON annotation files.
**You must download these files locally before starting the training process.**

| File | Content | Note |
| :--- | :--- | :--- |
| **`forbin_all.json`** | All Image IDs, metadata, and annotations (for annotated images). | Used for full dataset indexing. |
| **`forbin_annotated.json`** | Only images that have associated annotations (simplified index). | Useful for training on annotation tasks. |
| **`data/*.tar`** | WebDataset archives containing all raw images. | **Large files.** |

#### **Mode A: Via the Hugging Face Command Line Interface (CLI)**

This is the fastest method for users familiar with the terminal.

```bash
# Requires installation: pip install huggingface_hub
hf download mchelali/forbin_dataset --repo-type dataset --local-dir ./forbin_data_local
```

#### **Mode B: Via Python (Recommended for Resumable Downloads)**

This reliable method uses the official Python API, which automatically handles resuming the download process if interrupted.

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mchelali/forbin_dataset",
    repo_type="dataset",
    local_dir="./forbin_data_local"  # Your chosen destination folder
)
```

#### **Web Download Interface (For SHS Researchers):**

For users less familiar with the command line, we provide a dedicated web interface to download the individual `.tar` archives one by one:

➡️ **Web Download Interface:** [https://mchelali.github.io/forbin_dataset/download.html](https://mchelali.github.io/forbin_dataset/download.html)

-----

### 2. Indexing and Annotation Usage 📚

Once the `*.json` and `*.tar` files are downloaded locally, you can build your own data loading pipeline.

**Annotation Format:** All annotations (including textual metadata, bounding boxes, and segmentation polygons) are provided in the standard **COCO (Common Objects in Context) format**. This ensures compatibility with existing computer vision tools and libraries like PyTorch, TensorFlow, and `pycocotools`.

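
As a rough illustration of what working with a COCO-style file can look like, the sketch below loads one of the JSON files and groups annotation records by `image_id`. It relies only on the top-level `annotations` list that the COCO format defines; the helper names `load_manifest` and `index_annotations` are hypothetical and not part of the dataset's tooling.

```python
import json
from collections import defaultdict


def load_manifest(json_path):
    """Read a COCO-style JSON file and return its content as a dict."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def index_annotations(coco):
    """Group annotation records by the image they belong to.

    Returns a dict mapping image_id -> list of annotation dicts.
    Images without annotations simply do not appear as keys.
    """
    by_image = defaultdict(list)
    for ann in coco.get("annotations", []):
        by_image[ann["image_id"]].append(ann)
    return dict(by_image)
```

Assuming the JSON files sit at the top of the download folder (not confirmed here), a call such as `index_annotations(load_manifest("./forbin_data_local/forbin_annotated.json"))` would give a lookup table from image ID to its polygons and boxes.
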
The JSON file acts as your **Manifest** (Index Table). It links the image ID (via `image_id`) to the image's location within the `.tar` archives (via the `file_names` field in the `images` section).

**To use the dataset:**

1. Load the JSON file (`forbin_all.json` or `forbin_annotated.json`) into your program.
2. Use the Python `tarfile` (or `webdataset`) library to open the corresponding `.tar` archive and load the image bytes based on the path provided in the `file_names` field.
3. Apply the COCO annotations (found in the `annotations` section of the JSON) to the loaded image.

## 🔖 License

This sample dataset is released under the following license:

**Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)**
➡️ https://creativecommons.org/licenses/by-nc/4.0/

This means:

- ✔ You must provide attribution
- ✔ You may share and adapt the material
- ❌ You may **not** use it for commercial purposes

## 📚 Citation

If you use this dataset or the sample in academic work, please cite the forthcoming data paper:

```
[Under review] Chelali M., Gosselet S. K., Cloppet F., Kurtz C., Bloch I. and Foliard D., The Forbin Dataset: A collection of historical photographs with archival metadata, 2025.
```

## 🤝 Acknowledgment of Authors

This dataset originates from the personal archives of **Victor Forbin**, digitized and curated by the *High Vision Project – Archives & Vision Initiative*. All annotation and data processing work was performed by the project contributors. This work is supported by the French National Research Agency under the **ANR-24-CE38-4079** project.
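
The usage steps listed under "Indexing and Annotation Usage" might be sketched as follows, using only the standard-library `tarfile` module. Several points are assumptions rather than facts about this dataset: that the path stored in the image record matches the member name inside the archive exactly, that the record does not say which `.tar` shard holds the image (hence the scan over all shards), and that the field may be spelled either `file_names` (as described above) or `file_name` (as in standard COCO). The function names are hypothetical.

```python
import tarfile
from pathlib import Path


def image_path(image_record):
    """Return the archive path stored in an image record.

    Tries `file_names` first (the name used in the dataset card), then
    `file_name` (standard COCO). Which one applies is an assumption.
    """
    for key in ("file_names", "file_name"):
        if key in image_record:
            return image_record[key]
    raise KeyError("image record has no file_names/file_name field")


def read_image_bytes(tar_dir, member_name):
    """Scan every .tar file in tar_dir and return the bytes of member_name."""
    for tar_path in sorted(Path(tar_dir).glob("*.tar")):
        with tarfile.open(tar_path, "r") as archive:
            try:
                member = archive.getmember(member_name)
            except KeyError:
                continue  # not in this shard, try the next one
            extracted = archive.extractfile(member)
            if extracted is not None:
                return extracted.read()
    raise FileNotFoundError(member_name)


def annotations_for(coco, image_id):
    """Return the annotation records whose image_id matches."""
    return [a for a in coco.get("annotations", []) if a["image_id"] == image_id]
```

Scanning every shard for each image is simple but slow on large archives; for training, building a one-time lookup from member name to shard, or streaming with the `webdataset` library, would likely be a better fit.
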


Provider:

mchelali
