ISOB-Small-Hard
Source: DataCite Commons. Updated 2026-05-07; indexed 2026-05-18.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/ANSN7M
Resource description:
ISOB Dataset – Representative Sample Release (SMALL-HARD)
Note: This release contains a representative subset of the full ISOB dataset. The complete dataset, including all document images, OCR files, and metadata, will be released upon paper acceptance.
Overview
The ISOB (Indian Scripts OCR Benchmark) dataset is a multilingual, India-centric OCR benchmark designed for research in document understanding, OCR, and multilingual Document-VLM systems.
Each sample contains:
A document image (.jpg)
Its corresponding OCR transcription (.txt)
The dataset focuses on real-world Indian documents containing multiple languages and scripts within the same page, capturing naturally occurring multilingual layouts, noisy scans, and complex formatting patterns commonly found in archival and institutional documents.
This representative release showcases multilingual image–text pairs spanning multiple Indian languages and scripts.
Motivation
India has one of the world’s most diverse document ecosystems, with large volumes of valuable textual data spread across regional languages, scripts, and historical archives. However, authentic multilingual OCR datasets for Indian languages remain limited and fragmented.
Many documents contain:
Multiple scripts within a single page
Region-specific dialects
Non-standard layouts, tables, and equations
Historical or hand-digitized archival material
The ISOB dataset addresses these challenges by providing:
Real-world multilingual document layouts
OCR-aligned transcriptions
Benchmarking data for multilingual OCR and VLM systems
Complex mixed-script OCR scenarios beyond synthetic settings
The dataset was curated from authentic sources collected through legally compliant archival and institutional collaborations.
Dataset Structure
Each sample consists of:
Component | Description
Image File | Document image in .jpg format
OCR File | Corresponding OCR transcription in .txt format
File Naming Convention
File names encode the languages present in the document as underscore-separated tokens, followed by a version tag.
Example:
hocr_assamese_bodo_maithili_urdu_v0141.txt
This enables easy language-based filtering and multilingual benchmarking experiments.
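The naming convention can be turned into a small parser for language-based filtering. The following is a minimal sketch, not part of the dataset's tooling: the function name `parse_languages` is hypothetical, and the pattern (a `hocr_` prefix, lowercase language tokens, then a version tag such as `v0141`) is an assumption inferred from the example file name.

```python
import re
from pathlib import Path

# Assumed layout, inferred from the example names: "hocr_" prefix,
# underscore-separated lowercase language tokens, then a version tag
# such as "v0141". Anything after the version tag is ignored.
_NAME_RE = re.compile(r"^hocr_(?P<langs>[a-z]+(?:_[a-z]+)*)_v(?P<version>\d+)")


def parse_languages(filename: str) -> list[str]:
    """Return the language tokens encoded in an ISOB-style file name."""
    match = _NAME_RE.match(Path(filename).name)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    return match.group("langs").split("_")


def contains_language(filename: str, language: str) -> bool:
    """True if the given language token appears in the file name."""
    return language.lower() in parse_languages(filename)


print(parse_languages("hocr_assamese_bodo_maithili_urdu_v0141.txt"))
# ['assamese', 'bodo', 'maithili', 'urdu']
```

Filtering a directory listing for, say, Urdu then reduces to keeping the names for which `contains_language(name, "urdu")` is true.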
Languages Covered
The representative release includes documents containing:
Assamese
Bengali
Bodo
Dogri
Gujarati
Hindi
Kannada
Kashmiri
Konkani
Maithili
Malayalam
Manipuri
Marathi
Nepali
Odia
Punjabi
Sanskrit
Santali
Sindhi
Tamil
Telugu
Urdu
The complete ISOB release will cover all 22 officially recognized Indian languages at significantly larger scale.
Example Files
Image File | OCR File | Languages Present
hocr_assamese_bodo_maithili_urdu_v0141_edited_gpu4_s4044.jpg | hocr_assamese_bodo_maithili_urdu_v0141.txt | Assamese, Bodo, Maithili, Urdu
hocr_bengali_hindi_maithili_v0008_edited_gpu0_s43.jpg | hocr_bengali_hindi_maithili_v0008.txt | Bengali, Hindi, Maithili
hocr_hindi_assamese_telugu_santali_v0139_edited_gpu4_s4048.jpg | hocr_hindi_assamese_telugu_santali_v0139.txt | Hindi, Assamese, Telugu, Santali
hocr_konkani_punjabi_hindi_sanskrit_v0033_edited_gpu6_s6043.jpg | hocr_konkani_punjabi_hindi_sanskrit_v0033.txt | Konkani, Punjabi, Hindi, Sanskrit
hocr_maithili_dogri_tamil_v0102_edited_gpu4_s4052.jpg | hocr_maithili_dogri_tamil_v0102.txt | Maithili, Dogri, Tamil
A complete file listing is available in the repository.
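In the examples above, each image name extends the OCR file's stem with an extra suffix (for instance `_edited_gpu4_s4044`), so the two cannot be matched by swapping extensions alone. The sketch below pairs them on the shared prefix up to the version tag. It assumes that this prefix is unique per sample, which the examples suggest but the description does not state; `sample_key` and `pair_samples` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumed pairing key: everything up to and including the version tag,
# e.g. "hocr_bengali_hindi_maithili_v0008".
_KEY_RE = re.compile(r"^(hocr_[a-z_]+_v\d+)")


def sample_key(filename):
    """Return the shared prefix of an image/OCR pair, or None if no match."""
    match = _KEY_RE.match(Path(filename).name)
    return match.group(1) if match else None


def pair_samples(filenames):
    """Group .jpg and .txt names by key; keep only complete pairs."""
    pairs = {}
    for name in filenames:
        key = sample_key(name)
        suffix = Path(name).suffix.lower()
        if key is None or suffix not in {".jpg", ".txt"}:
            continue  # skip files that do not follow the assumed convention
        slot = "image" if suffix == ".jpg" else "ocr"
        pairs.setdefault(key, {})[slot] = name
    return {k: v for k, v in pairs.items() if "image" in v and "ocr" in v}


files = [
    "hocr_bengali_hindi_maithili_v0008_edited_gpu0_s43.jpg",
    "hocr_bengali_hindi_maithili_v0008.txt",
    "hocr_maithili_dogri_tamil_v0102.txt",  # no image in this list
]
for key, pair in pair_samples(files).items():
    print(key, pair["image"], pair["ocr"])
```

Dropping incomplete pairs rather than raising an error is a design choice for exploratory use; a stricter loader could report unmatched files instead.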
Language Diversity
This subset intentionally emphasizes multilingual complexity, with many documents containing 3–5 languages simultaneously.
Language Frequency Across Released Files
Hindi: 11 files
Konkani: 7 files
Assamese: 6 files
Maithili: 6 files
Odia: 6 files
Gujarati: 5 files
Santali: 5 files
Sindhi: 5 files
Bengali: 4 files
Bodo: 4 files
Kannada: 4 files
Sanskrit: 4 files
Telugu: 3 files
Malayalam: 3 files
Urdu: 2 files
Dogri: 2 files
Tamil: 2 files
Kashmiri: 1 file
Nepali: 1 file
Punjabi: 1 file
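Counts of this kind can be recomputed from the file names themselves. The sketch below does so for the five example OCR files only, so its numbers cover that sample rather than the full release. The helper `languages_in` is hypothetical and assumes the same name layout as the examples (a `hocr` prefix, language tokens, then a version tag).

```python
from collections import Counter

# The five OCR file names shown under "Example Files".
EXAMPLE_FILES = [
    "hocr_assamese_bodo_maithili_urdu_v0141.txt",
    "hocr_bengali_hindi_maithili_v0008.txt",
    "hocr_hindi_assamese_telugu_santali_v0139.txt",
    "hocr_konkani_punjabi_hindi_sanskrit_v0033.txt",
    "hocr_maithili_dogri_tamil_v0102.txt",
]


def languages_in(name):
    """Tokens between the 'hocr' prefix and the version tag (assumed layout)."""
    tokens = name.rsplit(".", 1)[0].split("_")
    langs = []
    for token in tokens[1:]:
        if token.startswith("v") and token[1:].isdigit():
            break  # reached the version tag, e.g. "v0141"
        langs.append(token)
    return langs


# Number of files in which each language appears.
file_counts = Counter(
    lang for name in EXAMPLE_FILES for lang in set(languages_in(name))
)
# Distribution of the number of languages per document.
per_document = Counter(len(languages_in(name)) for name in EXAMPLE_FILES)

for lang, count in file_counts.most_common():
    print(f"{lang.capitalize()}: {count} file{'s' if count != 1 else ''}")
print(dict(per_document))
```

Replacing `EXAMPLE_FILES` with the repository's complete file listing would give the release-wide figures.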
Usage
The dataset is suitable for:
Multilingual OCR training
OCR benchmarking and evaluation
Document-VLM research
Script identification
Mixed-language document understanding
Historical document digitization
Each OCR transcription corresponds one-to-one with its image, enabling supervised learning and evaluation workflows.
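Because each image has exactly one reference transcription, a system's output can be scored directly against it. The description does not name an evaluation metric; the sketch below assumes character error rate (CER), a common choice for OCR, computed as Levenshtein distance divided by reference length. It counts Unicode code points after NFC normalization, which is a simplification for Indic scripts where one visible character may span several code points.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (two-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    # Assumption: NFC normalization so equivalent encodings compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcription is empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))  # one substitution over four characters: 0.25
```

In practice the reference would be read from a sample's .txt file and the hypothesis produced by the OCR or VLM system under test.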
Provider:
Harvard Dataverse
Created:
2026-05-07



