
ZihCiLin/traditional-chinese-ocr-synthetic

Source: Hugging Face · Updated 2026-01-02 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ZihCiLin/traditional-chinese-ocr-synthetic
Description:
---
language:
- zh
license: cc-by-nc-4.0
task_categories:
- image-to-text
tags:
- ocr
- historical-document
- synthetic-data
- traditional-chinese
- vertical-text
size_categories:
- 1M<n<10M
---

# Traditional Chinese OCR Synthetic Dataset

A large-scale synthetic dataset containing **4.1 million** image-text pairs specifically designed for **Traditional Chinese historical document recognition**.

## Dataset Overview

Existing large-scale Traditional Chinese OCR datasets (e.g., TCSynth) are primarily designed for scene text recognition, characterized by:

- Horizontal layouts
- Short text sequences (2-5 characters on average)
- Modern commonly-used characters

These characteristics differ significantly from historical manuscripts, which typically feature:

- **Vertical writing** as the primary layout
- **Long sentences** (20-40 characters)
- **Archaic and variant characters**
- **Visual degradation** (paper aging, ink fading, stains)

This dataset specifically addresses this domain gap by providing a configurable synthetic data generation pipeline tailored for historical document OCR training.

**Open-Source Generator**: [https://github.com/Jason9339/ocr-synth-generator](https://github.com/Jason9339/ocr-synth-generator)

## Dataset Statistics

| Split | Description | Samples | Layout | Purpose |
|-------|-------------|--------:|--------|---------|
| **train** | Training set | 4,102,200 | Mixed H/V | Model training |
| **test_random** | Test Set 1 ($T_1$) | 1,000 | Mixed H/V | Non-semantic evaluation |
| **test_semantic** | Test Set 2 ($T_2$) | 395 | Mixed H/V | Semantic text evaluation |

**Total Samples**: 4,103,595 | **Total Size**: ~76 GB

### Layout Distribution

| Layout Type | Training | Test Random | Test Semantic | Total |
|-------------|----------|-------------|---------------|-------|
| Horizontal | 2,051,100 | 500 | 193 | 2,051,793 |
| Vertical | 2,051,100 | 500 | 202 | 2,051,802 |

## Data Structure

Each sample contains three fields:

| Field | Type | Description |
|-------|------|-------------|
| `image` | Image (PNG) | Synthetic document image (~384×384 pixels) |
| `text` | String | Ground truth text (Traditional Chinese) |
| `layout` | ClassLabel | Layout orientation: `horizontal` or `vertical` |

## Dataset Characteristics

### Text Properties

- **Character Set**: Traditional Chinese CNS11643 standard
- **Vocabulary Size**: 13,172 characters including archaic and rare variants
- **Sentence Length**:
  - Training set: Primarily 20-30 characters/sentence
  - Significantly longer than typical scene text (2-5 characters)
  - Reflects long-sentence structure of historical prose
- **Content Types**:
  - **Training Set**: Diverse text patterns with balanced character distribution
  - **Test Set 1** ($T_1$): Random character sequences (no semantic structure)
  - **Test Set 2** ($T_2$): Semantically coherent historical-style text

### Visual Properties

**Font Resources** (143 Traditional Chinese fonts):

- Noto Sans/Serif TC family
- GenRyuMin (Source Han Serif derivative)
- GenSekiGothic
- LINE Seed TW
- Jason Handwriting series
- Other open-source Traditional Chinese fonts

**Background Textures** (225+ types):

- Paper textures (aged, modern, handmade)
- Wood grain
- Fabric and cloth
- Marble and stone
- Abstract patterns

**Synthetic Effects** (simulating historical degradation):

- **Blur** (σ ∈ [0,2]): Simulates focus issues
- **Elastic Distortion**: Simulates paper warping
- **Skew** (±15°): Simulates scanning angle variations
- **Stroke Variation** (0-2px): Simulates ink bleeding
- **Color Range** (#000000-#808080): Simulates ink fading
- **Textured Backgrounds**: Simulate paper aging and stains

### Vertical Text Generation

Character-wise 90° rotation technique for authentic vertical layout:

- Each character is individually rotated counterclockwise by 90°
- Rotated characters are arranged left-to-right
- Produces authentic classical vertical reading style (top-to-bottom, right-to-left progression)
- Preserves character clarity while avoiding quality loss from full-image rotation

## Usage

### Loading the Dataset

Due to the large size (76GB), streaming mode is recommended:

```python
from datasets import load_dataset

# Load in streaming mode
dataset = load_dataset("ZihCiLin/traditional-chinese-ocr-synthetic", streaming=True)

# Access a sample
sample = next(iter(dataset['train']))
print(sample['text'])    # Text content
print(sample['layout'])  # 'horizontal' or 'vertical'
sample['image'].show()   # Display image
```

### Filtering by Layout

```python
# Use only vertical layout data
vertical_data = dataset['train'].filter(lambda x: x['layout'] == 'vertical')

# Use only horizontal layout data
horizontal_data = dataset['train'].filter(lambda x: x['layout'] == 'horizontal')
```

## Data Generation Pipeline

**Open-Source Generator**: [https://github.com/Jason9339/ocr-synth-generator](https://github.com/Jason9339/ocr-synth-generator)

Extensively modified from TextRecognitionDataGenerator with key technical features:

1. **Character Rotation System**: Authentic vertical text rendering
2. **143 Traditional Chinese Fonts**: Curated font collection
3. **225+ Background Textures**: Expanded from original 4 to 225 diverse backgrounds
4. **Configurable Effects Pipeline**: Full control over blur, distortion, skew, stroke, and color
5. **LMDB Integration**: Efficient large-scale data generation
6. **Custom Label Support**: Flexible text source integration

## Limitations

### Synthetic Data Nature

- **Not Real Documents**: All images are synthetically generated, simulating but not fully capturing historical manuscript complexity
- **Simplified Degradation**: Real document aging involves complex processes only approximately modeled
- **Font Coverage**: Despite 143 fonts, historical calligraphic styles may not be fully represented

### Distribution Gaps

- **Character Frequency**: Synthetic character distribution may differ from real historical corpora
- **Semantic Patterns**: Training text artificially generated, may not fully reflect authentic historical writing styles
- **Context Specificity**: Does not capture period-specific or genre-specific language patterns

## Related Resources

### Real Historical Document Dataset

- **[Traditional Chinese Historical OCR Dataset (Lo Chia-Lun Manuscripts)](https://huggingface.co/datasets/ZihCiLin/traditional-chinese-historical-ocr-lo-chia-lun)**: 921 manually annotated real historical document samples with comprehensive geometric metadata (bounding boxes, layout regions, reading order)

### Tools

- **[Data Generator](https://github.com/Jason9339/ocr-synth-generator)**: Configurable pipeline used to generate this dataset
- **[Annotation System](https://github.com/Jason9339/document-ocr-annotation-system)**: Web-based annotation tool for Traditional Chinese historical documents

## License

This dataset is released under **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)**.

- Allowed: Academic research, education, non-commercial use
- Prohibited: Commercial use without permission

Full license: [https://creativecommons.org/licenses/by-nc/4.0/](https://creativecommons.org/licenses/by-nc/4.0/)

---

**Dataset Version**: 1.0

**Last Updated**: 2025-01-02
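
### Note on Layout Values

The Data Structure table lists `layout` as a `ClassLabel`, while the filtering example compares it against a string. In the `datasets` library, `ClassLabel` columns are usually returned as integer ids, so a string comparison might match nothing. The helper below is a minimal sketch that accepts either representation. The names `LAYOUT_NAMES`, `layout_name`, and `is_vertical` are hypothetical, and the label order (`horizontal` = 0, `vertical` = 1) is an assumption that should be checked against the dataset's actual feature definition.

```python
from typing import Any, Mapping, Sequence

# Assumed label order (not confirmed by the dataset card); verify it against
# the dataset's feature metadata before relying on it.
LAYOUT_NAMES: Sequence[str] = ("horizontal", "vertical")


def layout_name(sample: Mapping[str, Any],
                names: Sequence[str] = LAYOUT_NAMES) -> str:
    """Return the sample's layout as a string.

    Handles both cases: the field already being a string, or being an
    integer class id that indexes into `names`.
    """
    value = sample["layout"]
    if isinstance(value, str):
        return value
    return names[int(value)]


def is_vertical(sample: Mapping[str, Any]) -> bool:
    """Predicate that can be passed to a dataset's filter method."""
    return layout_name(sample) == "vertical"


def is_horizontal(sample: Mapping[str, Any]) -> bool:
    """Predicate that can be passed to a dataset's filter method."""
    return layout_name(sample) == "horizontal"
```

With the streaming dataset loaded as in the Usage section, `dataset['train'].filter(is_vertical)` would then select vertical samples regardless of how the label is encoded, provided the assumed label order is right.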
Provided by:
ZihCiLin