ZihCiLin/traditional-chinese-ocr-synthetic
Source: Hugging Face; updated 2026-01-02; indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ZihCiLin/traditional-chinese-ocr-synthetic
Resource description:
---
language:
- zh
license: cc-by-nc-4.0
task_categories:
- image-to-text
tags:
- ocr
- historical-document
- synthetic-data
- traditional-chinese
- vertical-text
size_categories:
- 1M<n<10M
---
# Traditional Chinese OCR Synthetic Dataset
A large-scale synthetic dataset containing **4.1 million** image-text pairs specifically designed for **Traditional Chinese historical document recognition**.
## Dataset Overview
Existing large-scale Traditional Chinese OCR datasets (e.g., TCSynth) are primarily designed for scene text recognition, characterized by:
- Horizontal layouts
- Short text sequences (2-5 characters on average)
- Modern commonly-used characters
These characteristics differ significantly from historical manuscripts, which typically feature:
- **Vertical writing** as the primary layout
- **Long sentences** (20-40 characters)
- **Archaic and variant characters**
- **Visual degradation** (paper aging, ink fading, stains)
This dataset addresses that domain gap. It was produced with a configurable synthetic data generation pipeline tailored for historical document OCR training.
**Open-Source Generator**: [https://github.com/Jason9339/ocr-synth-generator](https://github.com/Jason9339/ocr-synth-generator)
## Dataset Statistics
| Split | Description | Samples | Layout | Purpose |
|-------|-------------|--------:|--------|---------|
| **train** | Training set | 4,102,200 | Mixed H/V | Model training |
| **test_random** | Test Set 1 ($T_1$) | 1,000 | Mixed H/V | Non-semantic evaluation |
| **test_semantic** | Test Set 2 ($T_2$) | 395 | Mixed H/V | Semantic text evaluation |
**Total Samples**: 4,103,595 | **Total Size**: ~76 GB
### Layout Distribution
| Layout Type | Training | Test Random | Test Semantic | Total |
|-------------|----------|-------------|---------------|-------|
| Horizontal | 2,051,100 | 500 | 193 | 2,051,793 |
| Vertical | 2,051,100 | 500 | 202 | 2,051,802 |
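The per-layout counts can be cross-checked against the split sizes in the statistics table. The sketch below only re-adds the numbers quoted in the two tables; the dictionary names are illustrative and are not part of the dataset.

```python
# Counts copied from the tables above; variable names are illustrative only.
split_sizes = {"train": 4_102_200, "test_random": 1_000, "test_semantic": 395}
layout_counts = {
    "horizontal": {"train": 2_051_100, "test_random": 500, "test_semantic": 193},
    "vertical": {"train": 2_051_100, "test_random": 500, "test_semantic": 202},
}

# Each split size should equal the sum of its horizontal and vertical counts.
for split, size in split_sizes.items():
    per_layout = sum(counts[split] for counts in layout_counts.values())
    assert per_layout == size, (split, per_layout, size)

# Row totals per layout, and the grand total across all splits.
layout_totals = {name: sum(counts.values()) for name, counts in layout_counts.items()}
grand_total = sum(split_sizes.values())
print(layout_totals)  # {'horizontal': 2051793, 'vertical': 2051802}
print(grand_total)    # 4103595
```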
## Data Structure
Each sample contains three fields:
| Field | Type | Description |
|-------|------|-------------|
| `image` | Image (PNG) | Synthetic document image (~384×384 pixels) |
| `text` | String | Ground truth text (Traditional Chinese) |
| `layout` | ClassLabel | Layout orientation: `horizontal` or `vertical` |
## Dataset Characteristics
### Text Properties
- **Character Set**: Traditional Chinese CNS11643 standard
- **Vocabulary Size**: 13,172 characters including archaic and rare variants
- **Sentence Length**:
  - Training set: Primarily 20-30 characters/sentence
  - Significantly longer than typical scene text (2-5 characters)
  - Reflects long-sentence structure of historical prose
- **Content Types**:
  - **Training Set**: Diverse text patterns with balanced character distribution
  - **Test Set 1** ($T_1$): Random character sequences (no semantic structure)
  - **Test Set 2** ($T_2$): Semantically coherent historical-style text
### Visual Properties
**Font Resources** (143 Traditional Chinese fonts):
- Noto Sans/Serif TC family
- GenRyuMin (Source Han Serif derivative)
- GenSekiGothic
- LINE Seed TW
- Jason Handwriting series
- Other open-source Traditional Chinese fonts
**Background Textures** (225+ types):
- Paper textures (aged, modern, handmade)
- Wood grain
- Fabric and cloth
- Marble and stone
- Abstract patterns
**Synthetic Effects** (simulating historical degradation):
- **Blur** (σ ∈ [0,2]): Simulates focus issues
- **Elastic Distortion**: Simulates paper warping
- **Skew** (±15°): Simulates scanning angle variations
- **Stroke Variation** (0-2px): Simulates ink bleeding
- **Color Range** (#000000-#808080): Simulates ink fading
- **Textured Backgrounds**: Simulate paper aging and stains
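As a rough illustration of how the ranges listed above could be turned into per-image settings, the sketch below draws one parameter set at random. It is not the generator's actual configuration code: `EffectParams` and `sample_effect_params` are hypothetical names, uniform sampling is an assumption, and the ink color is assumed to be a neutral gray between `#000000` and `#808080`.

```python
import random
from dataclasses import dataclass


@dataclass
class EffectParams:
    blur_sigma: float      # Gaussian blur sigma, in [0, 2]
    skew_degrees: float    # skew angle, in [-15, 15]
    stroke_width_px: int   # extra stroke width, 0 to 2 pixels
    ink_color: str         # gray level between #000000 and #808080


def sample_effect_params(rng: random.Random) -> EffectParams:
    """Draw one set of effect parameters (uniform sampling is an assumption)."""
    gray = rng.randint(0x00, 0x80)  # same value for R, G and B gives a gray ink
    return EffectParams(
        blur_sigma=rng.uniform(0.0, 2.0),
        skew_degrees=rng.uniform(-15.0, 15.0),
        stroke_width_px=rng.randint(0, 2),
        ink_color=f"#{gray:02x}{gray:02x}{gray:02x}",
    )


# A seeded generator keeps the drawn settings reproducible.
params = sample_effect_params(random.Random(0))
print(params)
```

Seeding the `random.Random` instance per sample is one way to make a generated image reproducible from its index alone.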
### Vertical Text Generation
Vertical samples are rendered with a character-wise 90° rotation technique:
- Each character is individually rotated counterclockwise by 90°
- Rotated characters are arranged left-to-right
- Produces authentic classical vertical reading style (top-to-bottom, right-to-left progression)
- Preserves character clarity while avoiding quality loss from full-image rotation
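The geometry of the steps above can be checked on toy data. The sketch below uses small grids of labels as stand-ins for rendered glyphs; the helper names are hypothetical and this is not the generator's code. It assumes that a vertical sample is a left-to-right strip of rotated characters, which reads as a top-to-bottom column once the strip is turned 90° clockwise.

```python
def rot90_ccw(grid):
    """Rotate a 2D grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def rot90_cw(grid):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def arrange_left_to_right(glyphs):
    """Place equally tall glyphs side by side, first glyph on the left."""
    height = len(glyphs[0])
    return [[px for g in glyphs for px in g[r]] for r in range(height)]


def stack_top_to_bottom(glyphs):
    """Reference vertical column: first glyph on top."""
    return [list(row) for g in glyphs for row in g]


# Two toy 2x3 "glyphs" standing in for rendered characters.
glyph_a = [["a1", "a2", "a3"], ["a4", "a5", "a6"]]
glyph_b = [["b1", "b2", "b3"], ["b4", "b5", "b6"]]

# Rotate each glyph counterclockwise, then lay the results out left to right.
strip = arrange_left_to_right([rot90_ccw(g) for g in (glyph_a, glyph_b)])

# Turning the whole strip clockwise recovers an upright top-to-bottom column.
column = rot90_cw(strip)
assert column == stack_top_to_bottom([glyph_a, glyph_b])
```

Under this reading, a recognizer sees a horizontal strip of rotated characters, while the character order still matches the top-to-bottom reading order of the original column.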
## Usage
### Loading the Dataset
Because the dataset is large (~76 GB), streaming mode is recommended:
```python
from datasets import load_dataset
# Load in streaming mode
dataset = load_dataset("ZihCiLin/traditional-chinese-ocr-synthetic", streaming=True)
# Access a sample
sample = next(iter(dataset['train']))
print(sample['text']) # Text content
print(sample['layout']) # 'horizontal' or 'vertical'
sample['image'].show() # Display image
```
### Filtering by Layout
```python
# Use only vertical layout data
vertical_data = dataset['train'].filter(lambda x: x['layout'] == 'vertical')
# Use only horizontal layout data
horizontal_data = dataset['train'].filter(lambda x: x['layout'] == 'horizontal')
```
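Because `layout` is declared as a ClassLabel, samples may carry an integer class id rather than the string name, in which case a comparison against `'vertical'` would match nothing. The sketch below normalizes both representations. `layout_name` and `is_vertical` are hypothetical helpers, and the label order in `LAYOUT_NAMES` is an assumption that should be checked against the dataset's own feature metadata.

```python
# Assumed label order; verify against the dataset's feature metadata before use.
LAYOUT_NAMES = ("horizontal", "vertical")


def layout_name(value, names=LAYOUT_NAMES):
    """Return the layout as a string, whether `value` is a class id or a name."""
    if isinstance(value, str):
        if value not in names:
            raise ValueError(f"unknown layout name: {value!r}")
        return value
    index = int(value)
    if not 0 <= index < len(names):
        raise ValueError(f"unknown layout id: {value!r}")
    return names[index]


def is_vertical(sample):
    """Predicate usable with `.filter(...)` on either representation."""
    return layout_name(sample["layout"]) == "vertical"


print(is_vertical({"layout": 1}))           # True under the assumed label order
print(is_vertical({"layout": "vertical"}))  # True
print(is_vertical({"layout": 0}))           # False under the assumed label order
```

With such a predicate, the vertical subset could be selected as `dataset['train'].filter(is_vertical)`.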
## Data Generation Pipeline
The [open-source generator](https://github.com/Jason9339/ocr-synth-generator) is extensively modified from TextRecognitionDataGenerator, with the following key technical features:
1. **Character Rotation System**: Authentic vertical text rendering
2. **143 Traditional Chinese Fonts**: Curated font collection
3. **225+ Background Textures**: Expanded from original 4 to 225 diverse backgrounds
4. **Configurable Effects Pipeline**: Full control over blur, distortion, skew, stroke, and color
5. **LMDB Integration**: Efficient large-scale data generation
6. **Custom Label Support**: Flexible text source integration
## Limitations
### Synthetic Data Nature
- **Not Real Documents**: All images are synthetically generated, simulating but not fully capturing historical manuscript complexity
- **Simplified Degradation**: Real document aging involves complex processes that are only approximately modeled here
- **Font Coverage**: Despite 143 fonts, historical calligraphic styles may not be fully represented
### Distribution Gaps
- **Character Frequency**: Synthetic character distribution may differ from real historical corpora
- **Semantic Patterns**: Training text is artificially generated and may not fully reflect authentic historical writing styles
- **Context Specificity**: Does not capture period-specific or genre-specific language patterns
## Related Resources
### Real Historical Document Dataset
- **[Traditional Chinese Historical OCR Dataset (Lo Chia-Lun Manuscripts)](https://huggingface.co/datasets/ZihCiLin/traditional-chinese-historical-ocr-lo-chia-lun)**: 921 manually annotated real historical document samples with comprehensive geometric metadata (bounding boxes, layout regions, reading order)
### Tools
- **[Data Generator](https://github.com/Jason9339/ocr-synth-generator)**: Configurable pipeline used to generate this dataset
- **[Annotation System](https://github.com/Jason9339/document-ocr-annotation-system)**: Web-based annotation tool for Traditional Chinese historical documents
## License
This dataset is released under **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)**.
- Allowed: Academic research, education, non-commercial use
- Prohibited: Commercial use without permission
Full license: [https://creativecommons.org/licenses/by-nc/4.0/](https://creativecommons.org/licenses/by-nc/4.0/)
---
**Dataset Version**: 1.0
**Last Updated**: 2025-01-02
Provider:
ZihCiLin



