The Canadian Vichy Intercepts: OCR Processing Output Dataset, Version 2 (mistral-ocr-202512)
Source: Zenodo. Updated: 2026-04-24. Indexed: 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.17214263
Resource description:
Overview
This dataset contains the full output of the Version 2 OCR processing pipeline applied to the Canadian Vichy Intercepts corpus — a collection of 13,848 microfilmed diplomatic telegrams intercepted and transcribed by the Examination Unit of Canada between September 1941 and July 1945. The source documents are held by Library and Archives Canada (BAC/LAC) on microfilm reels T-17425 through T-17429 and are accessible via IIIF manifests through the BAC/LAC digital collections.
The corpus covers two overlapping series:
Vichy France communications: September 1941 – March 1945
France libre (Free French) communications: April 1943 – July 1945
OCR transcription was performed using the mistral-ocr-202512 model (Mistral AI) via the official API. The pipeline extracted structured text from each document image, segmenting content into three zones: document header, body, and footer. Images were retrieved directly from IIIF endpoints using manifest files.
This deposit contains the pipeline output only. The processing script (mistral_ocr_v2_reprocess.py) is archived separately in a companion record within this Zenodo community, alongside its GitLab repository. A human-readable version of the corpus is accessible through the project's Omeka Classic platform at: https://omeka.uottawa.ca/examination-unit/
Source Material
Field | Value
Holding institution | Bibliothèque et Archives Canada (BAC/LAC)
Microfilm reels | T-17425, T-17426, T-17427, T-17428, T-17429
Total images processed | 13,848
Date range | September 1941 – July 1945
Document type | Intercepted diplomatic telegrams
Original language | French (primary); English
Access | IIIF (International Image Interoperability Framework)
Processing Pipeline
Processing was performed using a Python script (mistral_ocr_v2_reprocess.py) operating as follows:
1. IIIF manifests (v2 and v3) were parsed to extract image URLs for each canvas.
2. Each image was submitted to the Mistral OCR API (mistral-ocr-202512) with structured extraction enabled.
3. API responses were parsed to isolate three text zones per page: header, body, and footer.
4. Results were saved incrementally in batches of 50 images, with checkpoint files written after each batch to enable resumption.
5. Aggregated datasets, performance reports, and structured outputs were generated upon completion.
Model: mistral-ocr-202512 (Mistral AI)
Processing date: January 2026
Pipeline version: V2
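The manifest-parsing step can be illustrated with a minimal sketch. It is not taken from mistral_ocr_v2_reprocess.py: the function name image_urls is hypothetical, and the code assumes the manifests follow the standard IIIF Presentation API 2.x and 3.0 layouts.

```python
def image_urls(manifest: dict) -> list[str]:
    """Return the image URLs of every canvas in a IIIF Presentation manifest.

    Hypothetical helper, not the project's code. Handles the standard
    v2 layout (sequences -> canvases -> images -> resource) and the
    v3 layout (items -> annotation pages -> annotations -> body).
    """
    urls = []
    if "sequences" in manifest:  # Presentation API 2.x
        for sequence in manifest["sequences"]:
            for canvas in sequence.get("canvases", []):
                for image in canvas.get("images", []):
                    urls.append(image["resource"]["@id"])
    else:  # Presentation API 3.0
        for canvas in manifest.get("items", []):
            for page in canvas.get("items", []):
                for annotation in page.get("items", []):
                    body = annotation.get("body", {})
                    if body.get("type") == "Image":
                        urls.append(body["id"])
    return urls
```

Branching on the presence of the "sequences" key is one simple way to tell the two manifest versions apart; a stricter implementation might inspect the "@context" value instead.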
Archive Contents
This deposit consists of six TAR archives, each corresponding to a directory produced by the pipeline.
batch_progress.tar
JSON checkpoint files recording the processing state after each batch of 50 images. One file per manifest (five total), named by manifest identifier. These files enabled fault-tolerant resumption of processing, and they record which images were successfully processed in each run, which makes them useful for auditing and reproducibility.
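A checkpoint-and-resume loop of this kind might look like the sketch below. The {"processed": [...]} layout and the function names are assumptions made for illustration; the actual schema of the files in batch_progress.tar is not described here and should be inspected directly.

```python
import json
from pathlib import Path

BATCH_SIZE = 50  # batch size stated in the dataset description


def load_done(checkpoint: Path) -> set[str]:
    """Return the image IDs already recorded, or an empty set on a first run."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text(encoding="utf-8"))["processed"])
    return set()


def process_with_checkpoints(image_ids, checkpoint: Path, process) -> int:
    """Process pending images in batches, saving state after each batch.

    `process` is a caller-supplied function. The checkpoint layout is an
    assumed schema, not necessarily the one used by the real pipeline.
    Returns the number of images processed in this run.
    """
    done = load_done(checkpoint)
    pending = [i for i in image_ids if i not in done]
    for start in range(0, len(pending), BATCH_SIZE):
        for image_id in pending[start:start + BATCH_SIZE]:
            process(image_id)
            done.add(image_id)
        # Write to a temporary file and rename it, so that an interrupted
        # write cannot leave a truncated checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"processed": sorted(done)}), encoding="utf-8")
        tmp.replace(checkpoint)
    return len(pending)
```

With this design, a run that is interrupted loses at most one batch of work, and a second call with the same checkpoint file skips everything already recorded.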
datasets.tar
Consolidated datasets aggregating results across all five microfilm reels. Includes:
full_dataset_v2.json — complete structured dataset with all extracted fields per image
structured_telegrams_v2.csv — tabular dataset with columns for filename, manifest ID, page number, canvas ID, header text, body text, footer text, full text, character counts (header/body/footer/total), word count, header/footer presence flags, image URL, and processing timestamp
omeka_import_v2.csv — Dublin Core–formatted CSV for direct import into the Omeka Classic platform
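As a hedged illustration of working with the tabular dataset, the sketch below totals word counts per manifest using only the standard library. The default column names manifest_id and word_count are assumptions; the description lists the columns in prose only, so the real header row should be checked first.

```python
import csv
from collections import defaultdict


def words_per_manifest(csv_file, manifest_col="manifest_id", words_col="word_count"):
    """Sum word counts per manifest from an open CSV file.

    The default column names are assumptions, not confirmed headers.
    Empty word-count cells are treated as zero.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row[manifest_col]] += int(row[words_col] or 0)
    return dict(totals)
```

The function expects a file handle for structured_telegrams_v2.csv opened with newline="" and encoding="utf-8", as the csv module recommends, and the column arguments can be adjusted once the actual headers are known.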
json_responses.tar
Raw JSON responses returned by the Mistral OCR API, one file per processed image. Each file preserves the complete API output, including confidence indicators and structured zone content. These files constitute the unprocessed primary output of the OCR pipeline and support independent reanalysis or alternative parsing strategies.
reports.tar
Human- and machine-readable reports generated at the end of processing. Includes:
technical_performance_report.json — per-manifest and aggregate statistics (success rate, character counts, processing time, estimated API cost)
structure_analysis_report.json — analysis of header and footer extraction rates across the corpus
error_log.json — log of all images that failed OCR processing, with error messages
processing_summary.txt — plain-text summary report suitable for inclusion in research documentation
structured_output.tar
Per-document JSON files containing the parsed, structured representation of each telegram, with header, body, and footer fields separated. One file per image. These files form the basis of the datasets.tar aggregations and support document-level analysis without requiring re-parsing of raw API responses.
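For document-level analysis, the per-image JSON files can be read straight from the archive without unpacking it. The following is a sketch under assumptions: the key names header, body, and footer mirror the zones named above but are not confirmed, and both function names are hypothetical.

```python
import json
import tarfile


def iter_documents(tar: tarfile.TarFile):
    """Yield (member name, parsed JSON) for each .json file in an open archive."""
    for member in tar:
        if member.isfile() and member.name.endswith(".json"):
            with tar.extractfile(member) as handle:
                yield member.name, json.load(handle)


def zone_lengths(doc: dict, zones=("header", "body", "footer")) -> dict:
    """Character count per zone; missing or null zones count as zero.

    The zone key names are assumed, not taken from the real files.
    """
    return {zone: len(doc.get(zone) or "") for zone in zones}
```

An archive opened with tarfile.open("structured_output.tar") can be passed to iter_documents directly, which avoids writing thousands of small files to disk.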
text_files.tar
Plain-text transcriptions organized into four subdirectories:
full_text/ — complete transcription of each image (header + body + footer concatenated)
headers/ — header zone text only
body/ — body zone text only
footers/ — footer zone text only
One .txt file per image in each subdirectory. File naming follows the convention: V2_{manifest_id}_page_{NNNN}_{label}.txt. These files are suitable for downstream natural language processing tasks, including named entity recognition (NER), topic modelling, and full-text search indexing.
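The naming convention can be parsed with a regular expression, as in the sketch below. It assumes that NNNN is a zero-padded four-digit page number and that the label contains no dot; the exact label values and manifest identifiers are not listed here, so the pattern is kept permissive.

```python
import re

# Assumes a four-digit page number and a label without dots. Manifest IDs
# may themselves contain underscores, hence the permissive first group.
NAME_RE = re.compile(
    r"^V2_(?P<manifest_id>.+)_page_(?P<page>\d{4})_(?P<label>[^.]+)\.txt$"
)


def parse_name(filename: str):
    """Split a transcription filename into (manifest_id, page, label).

    Returns None when the name does not follow the convention.
    """
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return match["manifest_id"], int(match["page"]), match["label"]
```

Parsing the names this way makes it possible to group the four zone files that belong to the same image, for example when building a search index.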
Related Resources
Resource | Description | URL / Identifier
Web publication | Omeka Classic platform (IIIF + full-text search) | https://omeka.uottawa.ca/examination-unit/
Processing code | Companion Zenodo record (same community) | See companion record
GitLab repository | Source code with version history | See companion record
Metadata extraction dataset | Structured metadata extracted from OCR output (V2) | See companion record
Provider:
University of Ottawa
Created:
2026-04-24



