biglam/sloane-catalogues

Name: biglam/sloane-catalogues
Creator: biglam
Published: 2025-08-15 15:10:29
License: No description provided

Source: Hugging Face. Updated 2025-08-15. Indexed 2026-01-03.

Download link:

https://hf-mirror.com/datasets/biglam/sloane-catalogues


Resource description:

---
task_categories:
- image-to-text
language:
- en
tags:
- ocr
- handwriting-recognition
- historical-documents
- library-science
- index-cards
size_categories:
- n<1K
---

# Sloane Manuscript Index Cards Dataset

## Dataset Description

This dataset contains digitized index cards from the British Library's Sloane Manuscript collection. The cards form a catalog system used to index the Sloane manuscripts, one of the founding collections of the British Museum (now British Library).

### Dataset Sources

- **Repository:** British Library Digital Collections
- **Original Dataset:** [Sloane Catalogues Dataset](https://bl.iro.bl.uk/concern/datasets/30d16800-bcb2-4d86-8ffe-22f623424860)
- **License:** CC Public Domain Mark 1.0

## Dataset Structure

The dataset contains index cards from multiple collections (c_1 through c_8 and d). The cards are primarily handwritten historical catalog entries; some collections also contain typewritten divider pages or forms.

### Data Fields

- `image`: The index card image
- `filename`: Original filename
- `collection`: Source collection identifier (e.g., sloane_ms_3972_c!2_jpegs)
- `page_number`: Page number extracted from the filename
- `source`: Source attribution

### Content Types

The cards contain several types of catalog information, including:

- Author names and biographical information
- Manuscript titles and descriptions
- Shelfmarks and reference numbers
- Cross-references to other manuscripts
- Historical notes and annotations

## Use Cases

This dataset is suited to:

- Testing OCR and handwriting recognition models on historical documents
- Evaluating Vision-Language Models (VLMs) on catalog cards
- Training models for library catalog digitization
- Information extraction from semi-structured historical documents
- Benchmarking on challenging handwritten text from multiple time periods

## Example Usage

```python
from datasets import load_dataset

# Load the dataset. The repository ID follows this listing
# (biglam/sloane-catalogues); the original card referenced
# "davanstrien/sloane-index-cards".
# Assumption: the data is published as a single "train" split.
dataset = load_dataset("biglam/sloane-catalogues", split="train")

# Filter by collection
c2_cards = dataset.filter(lambda x: 'c_2' in x['collection'])

# Access an image
sample = dataset[0]
image = sample['image']  # PIL Image
```

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{sloane_index_cards_2024,
  title={Sloane Manuscript Index Cards Dataset},
  author={British Library},
  year={2024},
  publisher={Hugging Face},
  note={Derived from British Library Digital Collections}
}
```

## Acknowledgments

Thanks to the British Library for making these historical materials available under an open license. This dataset was created as part of research into AI applications for GLAM (Galleries, Libraries, Archives, and Museums) institutions.
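
As a rough illustration of working with the `collection` and `page_number` fields described above, the sketch below counts cards per collection and puts them in reading order. It operates on plain dictionaries, so it does not need the dataset to be downloaded. The helper names (`count_by_collection`, `sort_cards`) and the sample records are hypothetical, and the code assumes `page_number` is an integer or a numeric string, which the dataset card does not confirm.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping


def count_by_collection(records: Iterable[Mapping[str, Any]]) -> Counter:
    """Count how many cards carry each `collection` value."""
    return Counter(record["collection"] for record in records)


def sort_cards(records: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Order cards by collection, then by page number.

    Assumption: `page_number` can be converted with int().
    """
    return sorted(
        records,
        key=lambda r: (r["collection"], int(r["page_number"])),
    )


# Hypothetical records standing in for dataset rows (images omitted).
# The collection names and filenames here are invented for illustration.
sample_records = [
    {"filename": "example_0002.jpg", "collection": "example_c_2", "page_number": 2},
    {"filename": "example_0001.jpg", "collection": "example_c_2", "page_number": 1},
    {"filename": "example_0001.jpg", "collection": "example_c_1", "page_number": 1},
]

counts = count_by_collection(sample_records)
ordered = sort_cards(sample_records)

for collection, n_cards in sorted(counts.items()):
    print(f"{collection}: {n_cards} card(s)")
```

Both helpers accept any iterable of row-like mappings, so they should also work on rows of the loaded dataset, although iterating full rows is likely to decode every image along the way.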

Provider:

biglam
