NationalLibraryOfScotland/Scottish-School-Exam-Papers
Source: Hugging Face. Updated: 2025-07-18. Indexed: 2026-01-03.
Download link: https://hf-mirror.com/datasets/NationalLibraryOfScotland/Scottish-School-Exam-Papers
---
dataset_info:
  features:
  - name: document_id
    dtype: string
  - name: page_number
    dtype: int32
  - name: file_identifier
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: alto_xml
    dtype: string
  - name: has_image
    dtype: bool
  - name: has_alto
    dtype: bool
  - name: document_metadata
    dtype: string
  - name: has_metadata
    dtype: bool
  - name: exam_type
    dtype: string
  - name: exam_year
    dtype: string
  - name: exam_reference
    dtype: string
  splits:
  - name: train
    num_bytes: 1646707701.999
    num_examples: 11141
  download_size: 1403366547
  dataset_size: 1646707701.999
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc0-1.0
task_categories:
- text-generation
language:
- en
tags:
- ocr
---
# Scottish School Exam Papers Dataset
## Dataset Description
This dataset contains digitised Scottish school examination papers from the National Library of Scotland's (NLS) digital collections. The papers represent historical educational assessment materials that have been processed with Optical Character Recognition (OCR) to extract text content alongside the original page images.
### Dataset Summary
- **Source**: [National Library of Scotland - Scottish School Exam Papers](https://data.nls.uk/data/digitised-collections/scottish-exams/)
- **Format**: Image-text pairs with OCR output
- **Processing**: Converted from ALTO XML and JPG images to Hugging Face dataset format
- **Contents**: Historical Scottish school examination papers with both scanned images and extracted text
## Dataset Structure
### Data Fields
Each record in the dataset contains the following fields:
- `document_id` (string): Unique identifier for the exam paper document
- `page_number` (int32): Sequential page number within the document (1, 2, 3, etc.)
- `file_identifier` (string): Original file identifier from the NLS dataset
- `image` (image): Scanned image of the exam paper page
- `text` (string): OCR-extracted text from the page
- `alto_xml` (string): Raw ALTO XML containing detailed OCR information
- `has_image` (bool): Whether the page has an associated image file
- `has_alto` (bool): Whether the page has associated ALTO OCR data
- `document_metadata` (string): Full metadata description from inventory (e.g., "Leaving Certificate - 1888 - P.P.1888 XLI")
- `has_metadata` (bool): Whether metadata is available for this document
- `exam_type` (string): Type of examination (e.g., "Leaving Certificate", "Scottish Education Department")
- `exam_year` (string): Year of the examination (extracted from metadata)
- `exam_reference` (string): Reference number from the metadata (if available)
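Because `exam_year` is stored as a string, selecting papers by year range needs a conversion step first. The following is a minimal sketch; `year_as_int` is a hypothetical helper name, and treating any value that is not exactly four digits as missing is an assumption about how the field is populated.

```python
import re

def year_as_int(exam_year):
    """Return the year as an int, or None if it is missing or not four digits."""
    if exam_year is None:
        return None
    value = exam_year.strip()
    if re.fullmatch(r"\d{4}", value):
        return int(value)
    return None

# Invented sample values, for illustration only
years = [year_as_int(v) for v in ["1888", " 1902 ", "", None, "c.1890"]]
print(years)
```

Values that cannot be converted come back as `None`, so they can be skipped explicitly rather than raising an error partway through a pass over the data.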
### Data Statistics
The dataset includes examination papers with varying levels of OCR quality and completeness. Some pages may have images without corresponding OCR text, typically due to:
- Poor scan quality
- Handwritten content
- Complex layouts or mathematical notation
- Degraded original documents
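How many pages fall into each category can be checked from the `has_image` and `has_alto` flags. The following is a minimal sketch, assuming records behave like plain dicts with the fields listed above; `summarise_completeness` is a hypothetical helper name and the sample records are invented for illustration.

```python
from collections import Counter

def summarise_completeness(records):
    """Count pages by which components (image, ALTO OCR) are present."""
    counts = Counter()
    for rec in records:
        if rec["has_image"] and rec["has_alto"]:
            counts["image_and_ocr"] += 1
        elif rec["has_image"]:
            counts["image_only"] += 1
        elif rec["has_alto"]:
            counts["ocr_only"] += 1
        else:
            counts["neither"] += 1
    return dict(counts)

# Invented sample records, for illustration only
sample = [
    {"has_image": True, "has_alto": True},
    {"has_image": True, "has_alto": False},
    {"has_image": True, "has_alto": True},
]
summary = summarise_completeness(sample)
print(summary)
```

When applying this to the real data, it is likely worth restricting the `train` split to these two columns first (for example with `select_columns`), so that page images are not decoded during the pass.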
## Usage
### Loading the Dataset
```python
from datasets import load_dataset

# Load the dataset (repository ID as given in this card's title)
dataset = load_dataset("NationalLibraryOfScotland/Scottish-School-Exam-Papers")

# Access the data
for example in dataset['train']:
    print(f"Document: {example['document_id']}")
    print(f"Page: {example['page_number']}")
    # Pages without OCR may have empty text
    print(f"Text preview: {(example['text'] or '')[:200]}...")
    # Display image if needed
    if example['has_image']:
        example['image'].show()
    break
```
### Filtering for Complete Records
To work only with pages that have both images and OCR text:
```python
# Filter for complete records (filter the split, so len() counts pages rather than splits)
complete_pages = dataset['train'].filter(lambda x: x['has_image'] and x['has_alto'])
print(f"Complete pages: {len(complete_pages)}")
```
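Each record is a single page, so reconstructing the full text of an exam paper means grouping records by `document_id` and ordering them by `page_number`. The following is a minimal sketch, assuming records behave like plain dicts; `assemble_documents` is a hypothetical helper name, the sample records are invented, and joining pages with a blank line is an arbitrary choice.

```python
from collections import defaultdict

def assemble_documents(records):
    """Group page records by document and join page texts in page order."""
    pages = defaultdict(list)
    for rec in records:
        # Pages without OCR text may carry None or an empty string
        pages[rec["document_id"]].append((rec["page_number"], rec["text"] or ""))
    return {
        doc_id: "\n\n".join(text for _, text in sorted(doc_pages))
        for doc_id, doc_pages in pages.items()
    }

# Invented sample records, for illustration only
sample = [
    {"document_id": "doc-a", "page_number": 2, "text": "Second page"},
    {"document_id": "doc-a", "page_number": 1, "text": "First page"},
    {"document_id": "doc-b", "page_number": 1, "text": None},
]
docs = assemble_documents(sample)
print(docs["doc-a"])
```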
### Working with ALTO XML
The ALTO XML contains detailed OCR information including:
- Text positioning and layout
- Confidence scores
- Character-level bounding boxes
```python
import xml.etree.ElementTree as ET

# Parse ALTO XML for detailed OCR information
def parse_alto_details(alto_xml):
    root = ET.fromstring(alto_xml)
    ns = {'alto': 'http://www.loc.gov/standards/alto/v3/alto.xsd'}
    # Print the text of each line, joining its words with spaces
    for textline in root.findall('.//alto:TextLine', ns):
        line_text = []
        for string_elem in textline.findall('./alto:String', ns):
            content = string_elem.get('CONTENT', '')
            if content:
                line_text.append(content)
        if line_text:
            print(' '.join(line_text))
```
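The positioning and confidence information can be read from the attributes of each `String` element. The following is a minimal sketch: the attribute names `HPOS`, `VPOS`, `WIDTH`, `HEIGHT` and `WC` come from the ALTO schema, but whether every file in this dataset populates them is an assumption, so missing values are returned as `None`. `extract_words` is a hypothetical helper name and the sample XML is invented for illustration.

```python
import xml.etree.ElementTree as ET

def _num(value):
    """Convert an attribute value to float, keeping missing values as None."""
    return float(value) if value is not None else None

def extract_words(alto_xml):
    """Return one dict per ALTO String element: text, bounding box, confidence."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        # Strip any "{namespace}" prefix so the ALTO version does not matter
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": elem.get("CONTENT", ""),
            # Box as (HPOS, VPOS, WIDTH, HEIGHT) in the file's measurement unit
            "box": tuple(_num(elem.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")),
            "confidence": _num(elem.get("WC")),
        })
    return words

# Invented ALTO fragment, for illustration only
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/v3/alto.xsd">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="LEAVING" HPOS="100" VPOS="50" WIDTH="120" HEIGHT="20" WC="0.95"/>
    <SP/>
    <String CONTENT="CERTIFICATE" HPOS="230" VPOS="50" WIDTH="180" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

words = extract_words(SAMPLE_ALTO)
for word in words:
    print(word)
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here: it keeps the sketch working if the files declare a different ALTO namespace from the one used in the example above.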
## Dataset Creation
### Source Data
The original data comes from the National Library of Scotland's digitisation programme, as part of their 'One Third Digital' strategic aim. These items were digitised as part of a larger project to digitise all Scottish exam papers up to 2006. The exam papers represent various subjects, years, and educational levels from Scotland's educational history.
### Processing Pipeline
1. **Original Format**: The source data consists of:
- High-resolution JPG scans of exam papers
- ALTO XML files containing OCR output
- METS metadata files with page ordering information
- Inventory CSV with document-level metadata
2. **Conversion Process**: Using the custom `convert_scottish_exams.py` script:
- Parsed METS XML files to extract proper page ordering
- Paired image files with their corresponding ALTO XML
- Extracted plain text from ALTO XML while preserving line breaks
- Mapped file identifiers to sequential page numbers (1, 2, 3, etc.)
- Extracted exam type and year from metadata descriptions
- Preserved all metadata associations and added structured fields
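The step that derives structured fields from a metadata description can be sketched as follows. This is not the logic of `convert_scottish_exams.py`, which is not shown here; it is a minimal illustration that assumes descriptions follow the `Type - Year - Reference` pattern of the example given under Data Fields. `parse_description` is a hypothetical helper name.

```python
import re

def parse_description(description):
    """Split a 'Type - Year - Reference' description into structured fields."""
    parts = [p.strip() for p in description.split(" - ", 2)]
    exam_type = parts[0] or None
    exam_year = None
    exam_reference = None
    for part in parts[1:]:
        # The first four-digit part is taken as the year; anything else as the reference
        if exam_year is None and re.fullmatch(r"\d{4}", part):
            exam_year = part
        elif exam_reference is None:
            exam_reference = part
    return {
        "exam_type": exam_type,
        "exam_year": exam_year,  # kept as a string, matching the dataset schema
        "exam_reference": exam_reference,
    }

fields = parse_description("Leaving Certificate - 1888 - P.P.1888 XLI")
print(fields)
```

Descriptions that do not follow the assumed pattern simply yield `None` for the parts that cannot be identified, rather than raising an error.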
## Considerations for Using this Data
### Historical Context
These examination papers represent historical educational practices and may contain:
- Outdated terminology or perspectives
- Historical biases in educational content
- References to historical events and contexts
### OCR Quality
The quality of text extraction varies based on:
- Original document condition
- Print quality and typography
- Presence of handwritten annotations
- Mathematical or scientific notation
### Recommended Use Cases
- Historical education research
- OCR system evaluation and improvement
- Educational content analysis
- Digital humanities research
- Training data for historical document processing
## Additional Information
### Licensing
Please refer to the [National Library of Scotland's terms of use](https://data.nls.uk/data/digitised-collections/scottish-exams/) for the original data, and comply with NLS's licensing terms when using this dataset. This card's metadata declares the license as `cc0-1.0`.
### Citation
If you use this dataset, please cite both the original source and this Hugging Face dataset:
```bibtex
@misc{nls_scottish_exams,
title={Scottish School Exam Papers},
author={National Library of Scotland},
howpublished={\url{https://data.nls.uk/data/digitised-collections/scottish-exams/}},
}
```
### Acknowledgments
Dataset converted to Hugging Face format by [davanstrien](https://huggingface.co/davanstrien)
Provider: NationalLibraryOfScotland



