The Canadian Vichy Intercepts: OCR Processing Output Dataset, Version 2 (mistral-ocr-202512)

Source: Zenodo (updated 2026-04-24; indexed 2026-05-26)
Download link:
https://zenodo.org/doi/10.5281/zenodo.17214263
Description:
Overview

This dataset contains the full output of the Version 2 OCR processing pipeline applied to the Canadian Vichy Intercepts corpus, a collection of 13,848 microfilmed diplomatic telegrams intercepted and transcribed by the Examination Unit of Canada between September 1941 and July 1945. The source documents are held by Library and Archives Canada (BAC/LAC) on microfilm reels T-17425 through T-17429 and are accessible via IIIF manifests through the BAC/LAC digital collections.

The corpus covers two overlapping series:

- Vichy France communications: September 1941 – March 1945
- France libre (Free French) communications: April 1943 – July 1945

OCR transcription was performed using the mistral-ocr-202512 model (Mistral AI) via the official API. The pipeline extracted structured text from each document image, segmenting content into three zones: document header, body, and footer. Images were retrieved directly from IIIF endpoints using manifest files.

This deposit contains the pipeline output only. The processing script (mistral_ocr_v2_reprocess.py) is archived separately in a companion record within this Zenodo community, alongside its GitLab repository.

A human-readable version of the corpus is accessible through the project's Omeka Classic platform at: https://omeka.uottawa.ca/examination-unit/

Source Material

| Field | Value |
|---|---|
| Holding institution | Bibliothèque et Archives Canada (BAC/LAC) |
| Microfilm reels | T-17425, T-17426, T-17427, T-17428, T-17429 |
| Total images processed | 13,848 |
| Date range | September 1941 – July 1945 |
| Document type | Intercepted diplomatic telegrams |
| Original language | French (primary); English |
| Access | IIIF (International Image Interoperability Framework) |

Processing Pipeline

Processing was performed using a Python script (mistral_ocr_v2_reprocess.py) operating as follows:

1. IIIF manifests (v2 and v3) were parsed to extract image URLs for each canvas.
2. Each image was submitted to the Mistral OCR API (mistral-ocr-202512) with structured extraction enabled.
3. API responses were parsed to isolate three text zones per page: header, body, and footer.
4. Results were saved incrementally in batches of 50 images, with checkpoint files written after each batch to enable resumption.
5. Aggregated datasets, performance reports, and structured outputs were generated upon completion.

Model: mistral-ocr-202512 (Mistral AI)
Processing date: January 2026
Pipeline version: V2

Archive Contents

This deposit consists of six TAR archives, each corresponding to a directory produced by the pipeline.

batch_progress.tar

JSON checkpoint files recording the processing state after each batch of 50 images. One file per manifest (five total), named by manifest identifier. These files enabled fault-tolerant resumption of processing and document which images were successfully processed in each run. Useful for auditing and reproducibility.

datasets.tar

Consolidated datasets aggregating results across all five microfilm reels. Includes:

- full_dataset_v2.json: complete structured dataset with all extracted fields per image
- structured_telegrams_v2.csv: tabular dataset with columns for filename, manifest ID, page number, canvas ID, header text, body text, footer text, full text, character counts (header/body/footer/total), word count, header/footer presence flags, image URL, and processing timestamp
- omeka_import_v2.csv: Dublin Core–formatted CSV for direct import into the Omeka Classic platform

json_responses.tar

Raw JSON responses returned by the Mistral OCR API, one file per processed image. Each file preserves the complete API output, including confidence indicators and structured zone content. These files constitute the unprocessed primary output of the OCR pipeline and support independent reanalysis or alternative parsing strategies.

reports.tar

Human- and machine-readable reports generated at the end of processing. Includes:

- technical_performance_report.json: per-manifest and aggregate statistics (success rate, character counts, processing time, estimated API cost)
- structure_analysis_report.json: analysis of header and footer extraction rates across the corpus
- error_log.json: log of all images that failed OCR processing, with error messages
- processing_summary.txt: plain-text summary report suitable for inclusion in research documentation

structured_output.tar

Per-document JSON files containing the parsed, structured representation of each telegram, with header, body, and footer fields separated. One file per image. These files form the basis of the datasets.tar aggregations and support document-level analysis without requiring re-parsing of raw API responses.

text_files.tar

Plain-text transcriptions organized into four subdirectories:

- full_text/: complete transcription of each image (header + body + footer concatenated)
- headers/: header zone text only
- body/: body zone text only
- footers/: footer zone text only

One .txt file per image in each subdirectory. File naming follows the convention: V2_{manifest_id}_page_{NNNN}_{label}.txt. These files are suitable for downstream natural language processing tasks, including named entity recognition (NER), topic modelling, and full-text search indexing.

Related Resources

| Resource | Description | URL / Identifier |
|---|---|---|
| Web publication | Omeka Classic platform (IIIF + full-text search) | https://omeka.uottawa.ca/examination-unit/ |
| Processing code | Companion Zenodo record (same community) | See companion record |
| GitLab repository | Source code with version history | See companion record |
| Metadata extraction dataset | Structured metadata extracted from OCR output (V2) | See companion record |
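
The description says that IIIF manifests in both v2 and v3 form were parsed to extract one image URL per canvas. The sketch below illustrates how that step could look; it is not taken from mistral_ocr_v2_reprocess.py. The function name `image_urls` is hypothetical, and the key paths follow the IIIF Presentation API 2.x and 3.0 layouts rather than anything confirmed about the BAC/LAC manifests specifically.

```python
from typing import Any


def image_urls(manifest: dict[str, Any]) -> list[str]:
    """Collect image URLs from a parsed IIIF Presentation manifest.

    Hypothetical helper: handles the common v2 and v3 layouts only.
    """
    urls: list[str] = []
    if "sequences" in manifest:
        # Presentation API 2.x: sequences -> canvases -> images -> resource
        for sequence in manifest["sequences"]:
            for canvas in sequence.get("canvases", []):
                for image in canvas.get("images", []):
                    urls.append(image["resource"]["@id"])
    elif "items" in manifest:
        # Presentation API 3.0: canvas -> annotation page -> annotation -> body
        for canvas in manifest["items"]:
            for page in canvas.get("items", []):
                for annotation in page.get("items", []):
                    body = annotation.get("body", {})
                    # Bodies that are lists or Choice objects are skipped here.
                    if isinstance(body, dict) and "id" in body:
                        urls.append(body["id"])
    else:
        raise ValueError("not a recognised IIIF Presentation manifest")
    return urls
```

A manifest fetched over HTTP would first be decoded with `json.loads` and then passed to this function. Bodies expressed as lists or Choice objects are ignored in this sketch and would need extra handling if the manifests use them.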
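
The batch-and-checkpoint behaviour described for the pipeline (batches of 50 images, a checkpoint written after each batch, resumption after interruption) can be sketched as follows. This is an illustration under assumptions, not the archived script: the checkpoint layout (`{"done": [...]}`), the function names, and the `process` callback are all hypothetical, and the real files in batch_progress.tar may be structured differently.

```python
import json
from pathlib import Path
from typing import Callable


def load_done(checkpoint: Path) -> set[str]:
    """Return the IDs recorded as processed, or an empty set if no checkpoint exists."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text(encoding="utf-8"))["done"])
    return set()


def run_in_batches(
    image_ids: list[str],
    process: Callable[[str], None],
    checkpoint: Path,
    batch_size: int = 50,
) -> int:
    """Process pending images in batches, checkpointing after each batch.

    Returns the number of images that were pending when the run started.
    """
    done = load_done(checkpoint)
    pending = [i for i in image_ids if i not in done]
    for start in range(0, len(pending), batch_size):
        for image_id in pending[start:start + batch_size]:
            process(image_id)  # e.g. submit to the OCR API and save the result
            done.add(image_id)
        # Write to a temporary file first so an interrupted write
        # does not leave a truncated checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"done": sorted(done)}), encoding="utf-8")
        tmp.replace(checkpoint)
    return len(pending)
```

If a run stops partway through a batch, the images in that unfinished batch are simply processed again on the next run, which trades a small amount of repeated work for a simple checkpoint format.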
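
For downstream work on text_files.tar, the naming convention V2_{manifest_id}_page_{NNNN}_{label}.txt can be split back into its parts. The sketch below assumes that NNNN is a zero-padded four-digit page number and that manifest IDs and labels may contain underscores; neither detail is stated in the description, and the example filenames are invented for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed reading of the convention: "V2_", manifest ID, "_page_",
# four digits, "_", label, ".txt".
PATTERN = re.compile(
    r"^V2_(?P<manifest_id>.+)_page_(?P<page>\d{4})_(?P<label>.+)\.txt$"
)


class TextFileName(NamedTuple):
    manifest_id: str
    page: int
    label: str


def parse_name(filename: str) -> Optional[TextFileName]:
    """Split a transcription filename into its parts, or return None if it does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return TextFileName(match["manifest_id"], int(match["page"]), match["label"])


# Invented example filename, not taken from the archive:
print(parse_name("V2_example-manifest_page_0042_body.txt"))
```

Grouping files by `(manifest_id, page)` then makes it straightforward to pair the header, body, and footer files that belong to the same image.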
Provider:
University of Ottawa
Created:
2026-04-24