
mchelali/forbin_dataset

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/mchelali/forbin_dataset
Resource description:
---
dataset_name: "Forbin Dataset"
tags:
- humanities
- digital-humanities
- archives
- historical-documents
- text-detection
- polygon-annotation
- verso-recto photographs
license: cc-by-nc-4.0
task_categories:
- object-detection
- feature-extraction
- image-classification
pretty_name: "Forbin Dataset: A collection of historical photographs with archival metadata"
---

# Forbin Dataset: *A collection of historical photographs with archival metadata*

This repository hosts the *Forbin Dataset*, a large-scale collection of historical photographs taken or collected by **Victor Forbin (1868–1947)**.

This HuggingFace dataset version provides:

- COCO-style annotations (segmentation polygons)
- Archival metadata (Box ID, description, notes, dates when available)
- A lightweight **explorer interface** (HTML/JS) to preview images and annotations: [https://mchelali.github.io/forbin_dataset/](https://mchelali.github.io/forbin_dataset/)

## 📜 Dataset Description

The Forbin Dataset contains digitized historical photographs from the personal archives of Victor Forbin, a French explorer, photographer, and writer. Images are accompanied by rich metadata and manually extracted segmentation polygons suitable for:

- Computer Vision
- Document Analysis
- Cultural Heritage Studies
- Machine Learning Research

The sample included here is intended for **illustration and early experimentation only**. The upcoming full release will contain tens of thousands of images with complete metadata and annotations.

## 🛠️ Data Access and Usage Instructions

Given the size of the image archives, the dataset must be loaded in a two-step process: **Local Download** followed by **Indexing**.

### 1. Downloading the Raw Data Files (Images and Annotations) ⬇️

The dataset is distributed as WebDataset archives (`.tar`) and separate JSON annotation files.
**You must download these files locally before starting the training process.**

| File | Content | Note |
| :--- | :--- | :--- |
| **`forbin_all.json`** | All Image IDs, metadata, and annotations (for annotated images). | Used for full dataset indexing. |
| **`forbin_annotated.json`** | Only images that have associated annotations (simplified index). | Useful for training on annotation tasks. |
| **`data/*.tar`** | WebDataset archives containing all raw images. | **Large files.** |

#### **Mode A: Via the Hugging Face Command Line Interface (CLI)**

This is the fastest method for users familiar with the terminal.

```bash
# Requires installation: pip install huggingface_hub
hf download mchelali/forbin_dataset --repo-type dataset --local-dir ./forbin_data_local
```

#### **Mode B: Via Python (Recommended for Resumable Downloads)**

This reliable method uses the official Python API, which automatically handles resuming the download process if interrupted.

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mchelali/forbin_dataset",
    repo_type="dataset",
    local_dir="./forbin_data_local"  # Your chosen destination folder
)
```

#### **Web Download Interface (For SHS Researchers):**

For users less familiar with the command line, we provide a dedicated web interface to download the individual `.tar` archives one by one:

➡️ **Web Download Interface:** [https://mchelali.github.io/forbin_dataset/download.html](https://mchelali.github.io/forbin_dataset/download.html)

-----

### 2. Indexing and Annotation Usage 📚

Once the `*.json` and `*.tar` files are downloaded locally, you can build your own data loading pipeline.

**Annotation Format:** All annotations (including textual metadata, bounding boxes, and segmentation polygons) are provided in the standard **COCO (Common Objects in Context) format**. This ensures compatibility with existing computer vision tools and libraries like PyTorch, TensorFlow, and `pycocotools`.
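
To make the COCO layout concrete, here is a minimal sketch that groups annotations by `image_id` and converts a flat polygon into (x, y) pairs. It assumes the standard COCO conventions (`bbox` as `[x, y, width, height]`, each polygon in `segmentation` as a flat `[x1, y1, x2, y2, ...]` list). The miniature `sample` document, its values, the category name, and the helper names `index_annotations` and `polygon_points` are hypothetical and are not taken from the Forbin files.

```python
import json
from collections import defaultdict


def index_annotations(coco):
    """Group COCO annotations by the image they belong to."""
    by_image = defaultdict(list)
    for ann in coco.get("annotations", []):
        by_image[ann["image_id"]].append(ann)
    return dict(by_image)


def polygon_points(flat_polygon):
    """Convert one flat COCO polygon [x1, y1, x2, y2, ...] to (x, y) pairs."""
    return list(zip(flat_polygon[0::2], flat_polygon[1::2]))


# Hypothetical miniature COCO document, for illustration only.
sample = json.loads("""
{
  "images": [{"id": 1, "width": 100, "height": 80}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "bbox": [10, 20, 30, 40],
     "segmentation": [[10, 20, 40, 20, 40, 60, 10, 60]]}
  ],
  "categories": [{"id": 1, "name": "text"}]
}
""")

by_image = index_annotations(sample)
first = by_image[1][0]                              # first annotation of image 1
points = polygon_points(first["segmentation"][0])   # its first polygon
print(points)
```

With the real data, `sample` would instead come from calling `json.load` on `forbin_annotated.json` (or `forbin_all.json`) in the download folder.
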
The JSON file acts as your **Manifest** (Index Table). It links the image ID (via `image_id`) to the image's location within the `.tar` archives (via the `file_names` field in the `images` section).

**To use the dataset:**

1. Load the JSON file (`forbin_all.json` or `forbin_annotated.json`) into your program.
2. Use the Python `tarfile` (or `webdataset`) library to open the corresponding `.tar` archive and load the image bytes based on the path provided in the `file_names` field.
3. Apply the COCO annotations (found in the `annotations` section of the JSON) to the loaded image.

## 🔖 License

This sample dataset is released under the following license:

**Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)**
➡️ https://creativecommons.org/licenses/by-nc/4.0/

This means:

- ✔ You must provide attribution
- ✔ You may share and adapt the material
- ❌ You may **not** use it for commercial purposes

## 📚 Citation

If you use this dataset or the sample in academic work, please cite the forthcoming data paper:

```
[Under review] Chelali M., Gosselet S. K., Cloppet F., Kurtz C., Bloch I. and Foliard D., The Forbin Dataset: A collection of historical photographs with archival metadata, 2025.
```

## 🤝 Acknowledgment of Authors

This dataset originates from the personal archives of **Victor Forbin**, digitized and curated by the *High Vision Project – Archives & Vision Initiative*. All annotation and data processing work was performed by the project contributors.

This work is supported by the French National Research Agency under the **ANR-24-CE38-4079** project.
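
As an illustration of step 2 of the usage list in section 2 (loading image bytes from a `.tar` archive), the sketch below uses the standard-library `tarfile` module. It builds a tiny in-memory archive so that it runs without the dataset. The member name `images/0001.jpg`, the placeholder payload, and the helper `read_image_bytes` are hypothetical. Which archive under `data/` holds a given image is not stated here, so locating the right `.tar` is left out.

```python
import io
import tarfile


def read_image_bytes(archive, member_name):
    """Return the raw bytes of one regular-file member of an open tar archive."""
    member = archive.extractfile(member_name)  # raises KeyError if the name is absent
    if member is None:  # directories and special files carry no data
        raise ValueError(f"{member_name!r} is not a regular file")
    with member:
        return member.read()


# Build a small in-memory archive with one hypothetical member.
payload = b"not-a-real-jpeg"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    info = tarfile.TarInfo(name="images/0001.jpg")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Reopen the archive for reading and fetch the member by name.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    data = read_image_bytes(archive, "images/0001.jpg")

print(len(data))
```

With the downloaded data, the archive would be opened with `tarfile.open` on a file under `./forbin_data_local/data/`, and the member name would come from the `file_names` field of the JSON manifest. The returned bytes can then be decoded with an image library of your choice.
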

Provider: mchelali