NationalLibraryOfScotland/Scottish-School-Exam-Papers

Name: NationalLibraryOfScotland/Scottish-School-Exam-Papers
Creator: NationalLibraryOfScotland
Published: 2025-07-18 11:59:38
License: No description available

Hugging Face · Updated 2025-07-18 · Indexed 2026-01-03

Download link:

https://hf-mirror.com/datasets/NationalLibraryOfScotland/Scottish-School-Exam-Papers

Resource description:

---
dataset_info:
  features:
  - name: document_id
    dtype: string
  - name: page_number
    dtype: int32
  - name: file_identifier
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: alto_xml
    dtype: string
  - name: has_image
    dtype: bool
  - name: has_alto
    dtype: bool
  - name: document_metadata
    dtype: string
  - name: has_metadata
    dtype: bool
  - name: exam_type
    dtype: string
  - name: exam_year
    dtype: string
  - name: exam_reference
    dtype: string
  splits:
  - name: train
    num_bytes: 1646707701.999
    num_examples: 11141
  download_size: 1403366547
  dataset_size: 1646707701.999
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc0-1.0
task_categories:
- text-generation
language:
- en
tags:
- ocr
---

# Scottish School Exam Papers Dataset

## Dataset Description

This dataset contains digitised Scottish school examination papers from the National Library of Scotland's (NLS) digital collections. The papers represent historical educational assessment materials that have been processed with Optical Character Recognition (OCR) to extract text content alongside the original page images.

### Dataset Summary

- **Source**: [National Library of Scotland - Scottish School Exam Papers](https://data.nls.uk/data/digitised-collections/scottish-exams/)
- **Format**: Image-text pairs with OCR output
- **Processing**: Converted from ALTO XML and JPG images to Hugging Face dataset format
- **Contents**: Historical Scottish school examination papers with both scanned images and extracted text

## Dataset Structure

### Data Fields

Each record in the dataset contains the following fields:

- `document_id` (string): Unique identifier for the exam paper document
- `page_number` (int): Sequential page number within the document (1, 2, 3, etc.)
- `file_identifier` (string): Original file identifier from the NLS dataset
- `image` (image): Scanned image of the exam paper page
- `text` (string): OCR-extracted text from the page
- `alto_xml` (string): Raw ALTO XML containing detailed OCR information
- `has_image` (bool): Whether the page has an associated image file
- `has_alto` (bool): Whether the page has associated ALTO OCR data
- `document_metadata` (string): Full metadata description from inventory (e.g., "Leaving Certificate - 1888 - P.P.1888 XLI")
- `has_metadata` (bool): Whether metadata is available for this document
- `exam_type` (string): Type of examination (e.g., "Leaving Certificate", "Scottish Education Department")
- `exam_year` (string): Year of the examination (extracted from metadata)
- `exam_reference` (string): Reference number from the metadata (if available)

### Data Statistics

The `train` split contains 11,141 page records, with varying levels of OCR quality and completeness. Some pages may have images without corresponding OCR text, typically due to:

- Poor scan quality
- Handwritten content
- Complex layouts or mathematical notation
- Degraded original documents

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("NationalLibraryOfScotland/Scottish-School-Exam-Papers")

# Access the data
for example in dataset['train']:
    print(f"Document: {example['document_id']}")
    print(f"Page: {example['page_number']}")
    print(f"Text preview: {example['text'][:200]}...")
    # Display image if needed
    example['image'].show()
    break
```

### Filtering for Complete Records

To work only with pages that have both images and OCR text:

```python
# Filter the train split for complete records
# (calling len() on the full DatasetDict would count splits, not pages)
complete_pages = dataset['train'].filter(lambda x: x['has_image'] and x['has_alto'])
print(f"Complete pages: {len(complete_pages)}")
```

### Working with ALTO XML

The ALTO XML contains detailed OCR information including:

- Text positioning and layout
- Confidence scores
- Character-level bounding boxes

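
The positional and confidence information listed above is normally carried as attributes on each ALTO `String` element. A minimal sketch, assuming the standard ALTO attributes `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`, and `WC` (word confidence) are present in this dataset's XML, which has not been verified here. `extract_word_boxes` is a hypothetical helper name, and tags are matched by local name so the sketch does not depend on one ALTO namespace version:

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def _to_float(value):
    """Convert an attribute value to float, or None if missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def extract_word_boxes(alto_xml):
    """Return one dict per ALTO String element with text, box, and confidence."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        if _local(elem.tag) != 'String':
            continue
        words.append({
            'text': elem.get('CONTENT', ''),
            'x': _to_float(elem.get('HPOS')),
            'y': _to_float(elem.get('VPOS')),
            'width': _to_float(elem.get('WIDTH')),
            'height': _to_float(elem.get('HEIGHT')),
            # WC is an assumed attribute here; it is None when absent
            'confidence': _to_float(elem.get('WC')),
        })
    return words
```

Coordinates are expressed in the unit declared by the file's `MeasurementUnit` element, so they may need scaling before being drawn on top of `example['image']`.
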
```python
import xml.etree.ElementTree as ET

# Parse ALTO XML for detailed OCR information
def parse_alto_details(alto_xml):
    root = ET.fromstring(alto_xml)
    ns = {'alto': 'http://www.loc.gov/standards/alto/v3/alto.xsd'}

    # Print the text of each line by joining its String elements
    for textline in root.findall('.//alto:TextLine', ns):
        line_text = []
        for string_elem in textline.findall('./alto:String', ns):
            content = string_elem.get('CONTENT', '')
            if content:
                line_text.append(content)
        if line_text:
            print(' '.join(line_text))
```

## Dataset Creation

### Source Data

The original data comes from the National Library of Scotland's digitisation programme, as part of their 'One Third Digital' strategic aim. These items were digitised as part of a larger project to digitise all Scottish exam papers to 2006. The exam papers represent various subjects, years, and educational levels from Scotland's educational history.

### Processing Pipeline

1. **Original Format**: The source data consists of:
   - High-resolution JPG scans of exam papers
   - ALTO XML files containing OCR output
   - METS metadata files with page ordering information
   - Inventory CSV with document-level metadata

2. **Conversion Process**: Using the custom `convert_scottish_exams.py` script:
   - Parsed METS XML files to extract proper page ordering
   - Paired image files with their corresponding ALTO XML
   - Extracted plain text from ALTO XML while preserving line breaks
   - Mapped file identifiers to sequential page numbers (1, 2, 3, etc.)
   - Extracted exam type and year from metadata descriptions
   - Preserved all metadata associations and added structured fields

## Considerations for Using this Data

### Historical Context

These examination papers represent historical educational practices and may contain:

- Outdated terminology or perspectives
- Historical biases in educational content
- References to historical events and contexts

### OCR Quality

The quality of text extraction varies based on:

- Original document condition
- Print quality and typography
- Presence of handwritten annotations
- Mathematical or scientific notation

### Recommended Use Cases

- Historical education research
- OCR system evaluation and improvement
- Educational content analysis
- Digital humanities research
- Training data for historical document processing

## Additional Information

### Licensing

Please refer to the [National Library of Scotland's terms of use](https://data.nls.uk/data/digitised-collections/scottish-exams/) for the original data. Users should comply with NLS's licensing terms when using this dataset.

### Citation

If you use this dataset, please cite both the original source and this Hugging Face dataset:

```bibtex
@misc{nls_scottish_exams,
  title={Scottish School Exam Papers},
  author={National Library of Scotland},
  howpublished={\url{https://data.nls.uk/data/digitised-collections/scottish-exams/}},
}
```

### Acknowledgments

Dataset converted to Hugging Face format by [davanstrien](https://huggingface.co/davanstrien)
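
The metadata extraction step of the processing pipeline is described in this card but not shown. A rough sketch of how a description such as "Leaving Certificate - 1888 - P.P.1888 XLI" could be split into the `exam_type`, `exam_year`, and `exam_reference` fields, assuming the parts are always separated by " - ". `split_exam_metadata` is a hypothetical helper and is not the logic of `convert_scottish_exams.py`:

```python
import re


def split_exam_metadata(description):
    """Split a 'type - year - reference' description into structured fields."""
    parts = [part.strip() for part in description.split(' - ')]
    exam_type = parts[0] if parts and parts[0] else None
    exam_year = None
    exam_reference = None
    for index, part in enumerate(parts[1:], start=1):
        # Treat the first part that is exactly a four-digit number as the year
        if re.fullmatch(r'\d{4}', part):
            exam_year = part
            # Everything after the year is kept as the reference (assumption)
            exam_reference = ' - '.join(parts[index + 1:]) or None
            break
    return {
        'exam_type': exam_type,
        'exam_year': exam_year,
        'exam_reference': exam_reference,
    }
```

Descriptions without a four-digit year part leave `exam_year` and `exam_reference` as `None`, which mirrors the "if available" wording used for `exam_reference` in the field list.
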

Provided by:

NationalLibraryOfScotland
