
davanstrien/bpl-card-catalog-glm-ocr

Source: Hugging Face. Updated 2026-02-27. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/davanstrien/bpl-card-catalog-glm-ocr
Resource description:
---
tags:
- ocr
- document-processing
- glm-ocr
- markdown
- uv-script
- generated
---

# Document OCR using GLM-OCR

This dataset contains OCR results from images in [davanstrien/bpl-card-catalog-with-ocr](https://huggingface.co/datasets/davanstrien/bpl-card-catalog-with-ocr) using GLM-OCR, a compact 0.9B OCR model achieving SOTA performance.

## Processing Details

- **Source Dataset**: [davanstrien/bpl-card-catalog-with-ocr](https://huggingface.co/datasets/davanstrien/bpl-card-catalog-with-ocr)
- **Model**: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)
- **Task**: text recognition
- **Number of Samples**: 453,006
- **Processing Time**: 1315.3 min
- **Processing Date**: 2026-02-27 11:43 UTC

### Configuration

- **Image Column**: `image`
- **Output Column**: `markdown`
- **Dataset Split**: `train`
- **Batch Size**: 64
- **Max Model Length**: 8,192 tokens
- **Max Output Tokens**: 8,192
- **Temperature**: 0.01
- **Top P**: 1e-05
- **GPU Memory Utilization**: 90.0%

## Model Information

GLM-OCR is a compact, high-performance OCR model:

- 0.9B parameters
- 94.62% on OmniDocBench V1.5
- CogViT visual encoder + GLM-0.5B language decoder
- Multi-Token Prediction (MTP) loss for efficiency
- Multilingual: zh, en, fr, es, ru, de, ja, ko
- MIT licensed

## Dataset Structure

The dataset contains all original columns plus:

- `markdown`: The extracted text in markdown format
- `inference_info`: JSON list tracking all OCR models applied to this dataset

## Reproduction

```bash
uv run https://huggingface.co/datasets/uv-scripts/ocr/raw/main/glm-ocr-v2.py \
    davanstrien/bpl-card-catalog-with-ocr \
    <output-dataset> \
    --image-column image \
    --batch-size 64 \
    --task ocr
```

Generated with [UV Scripts](https://huggingface.co/uv-scripts) (glm-ocr-v2.py)
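
The card describes `inference_info` as a JSON list tracking the OCR models applied, stored alongside the `markdown` output column. A minimal Python sketch of reading those two columns from a single row might look like the following. The card does not document the fields inside each `inference_info` entry, so the `model_id` and `column_name` keys, the `models_applied` helper, and the hand-built `sample_row` are all assumptions made for illustration, not the dataset's confirmed schema.

```python
import json


def models_applied(row):
    """Return the model identifiers recorded in a row's `inference_info`.

    Hypothetical helper: the key name "model_id" is an assumption,
    since the card only says the column holds a JSON list.
    """
    raw = row.get("inference_info") or "[]"
    # Accept either a JSON-encoded string or an already-decoded list.
    entries = json.loads(raw) if isinstance(raw, str) else raw
    return [e.get("model_id", "<unknown>") for e in entries if isinstance(e, dict)]


# Hand-built stand-in for one dataset row; the values are illustrative only.
sample_row = {
    "markdown": "# Example card\n\nAuthor, Title. Boston, 1901.",
    "inference_info": json.dumps(
        [{"model_id": "zai-org/GLM-OCR", "column_name": "markdown"}]
    ),
}

print(models_applied(sample_row))
# First line of the OCR output for this row.
print(sample_row["markdown"].splitlines()[0])
```

With the Hugging Face `datasets` library installed, rows of this shape could presumably be obtained with `load_dataset("davanstrien/bpl-card-catalog-glm-ocr", split="train", streaming=True)`, which avoids downloading all 453,006 samples before inspecting the first few. Checking the actual keys of a real `inference_info` entry before relying on the assumed names is advisable.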
Provider:
davanstrien