davanstrien/bpl-card-catalog-glm-ocr
Source: Hugging Face. Updated 2026-02-27. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/davanstrien/bpl-card-catalog-glm-ocr
Resource description:
---
tags:
- ocr
- document-processing
- glm-ocr
- markdown
- uv-script
- generated
---
# Document OCR using GLM-OCR
This dataset contains OCR results for the images in [davanstrien/bpl-card-catalog-with-ocr](https://huggingface.co/datasets/davanstrien/bpl-card-catalog-with-ocr), produced with GLM-OCR, a compact 0.9B-parameter OCR model that scores 94.62% on OmniDocBench V1.5.
## Processing Details
- **Source Dataset**: [davanstrien/bpl-card-catalog-with-ocr](https://huggingface.co/datasets/davanstrien/bpl-card-catalog-with-ocr)
- **Model**: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)
- **Task**: text recognition
- **Number of Samples**: 453,006
- **Processing Time**: 1315.3 min (about 21.9 hours)
- **Processing Date**: 2026-02-27 11:43 UTC
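
The sample count and processing time above imply an average throughput, which can help when estimating the cost of a similar run. The sketch below derives it from those two figures only. It assumes the reported time is wall-clock time for the whole job, so any model loading or upload time would be included in the average; `throughput` is a hypothetical helper name, not part of the OCR script.

```python
# Figures copied from the Processing Details list above.
NUM_SAMPLES = 453_006
PROCESSING_MINUTES = 1315.3


def throughput(num_samples: int, minutes: float) -> dict[str, float]:
    """Return derived throughput figures for a batch OCR run."""
    seconds = minutes * 60
    return {
        "hours": minutes / 60,
        "samples_per_minute": num_samples / minutes,
        "samples_per_second": num_samples / seconds,
        "ms_per_sample": 1000 * seconds / num_samples,
    }


stats = throughput(NUM_SAMPLES, PROCESSING_MINUTES)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

This works out to roughly 5.7 images per second on average, or about 174 ms per card.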
### Configuration
- **Image Column**: `image`
- **Output Column**: `markdown`
- **Dataset Split**: `train`
- **Batch Size**: 64
- **Max Model Length**: 8,192 tokens
- **Max Output Tokens**: 8,192
- **Temperature**: 0.01
- **Top P**: 1e-05
- **GPU Memory Utilization**: 90.0%
## Model Information
GLM-OCR is a compact, high-performance OCR model:
- 0.9B parameters
- 94.62% on OmniDocBench V1.5
- CogViT visual encoder + GLM-0.5B language decoder
- Multi-Token Prediction (MTP) loss for efficiency
- Multilingual: zh, en, fr, es, ru, de, ja, ko
- MIT licensed
## Dataset Structure
The dataset contains all original columns plus:
- `markdown`: The extracted text in markdown format
- `inference_info`: JSON list tracking all OCR models applied to this dataset
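
Since `inference_info` is stored as JSON, it has to be decoded before use. The following is a minimal sketch of one way to do that. The helper names `parse_inference_info` and `first_row` are hypothetical, the `train` split of the output dataset is an assumption carried over from the source split, and the keys in the example value are illustrative only, because this card does not document the schema of each entry.

```python
import json


def parse_inference_info(raw):
    """Decode an `inference_info` value into a list of dicts.

    Accepts a JSON string or an already-decoded value; a single
    object is wrapped in a list so callers can always iterate.
    """
    if raw is None or raw == "":
        return []
    entries = json.loads(raw) if isinstance(raw, str) else raw
    if isinstance(entries, dict):
        entries = [entries]
    return list(entries)


def first_row(repo_id="davanstrien/bpl-card-catalog-glm-ocr"):
    """Stream one row from the Hub (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the parser works without it

    ds = load_dataset(repo_id, split="train", streaming=True)
    return next(iter(ds))


# Hypothetical value: these key names are illustrative, not the card's documented schema.
example = json.dumps([{"model_id": "zai-org/GLM-OCR", "column_name": "markdown"}])
entries = parse_inference_info(example)
for entry in entries:
    print(entry)
```

With network access, `parse_inference_info(first_row()["inference_info"])` would apply the same decoding to a real row; inspecting that result is the reliable way to learn the actual keys.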
## Reproduction
```bash
uv run https://huggingface.co/datasets/uv-scripts/ocr/raw/main/glm-ocr-v2.py \
davanstrien/bpl-card-catalog-with-ocr \
<output-dataset> \
--image-column image \
--batch-size 64 \
--task ocr
```
Generated with [UV Scripts](https://huggingface.co/uv-scripts) (glm-ocr-v2.py)
Provided by:
davanstrien



