
davanstrien/falcon-ocr-test-nocompile

Hugging Face | Updated 2026-04-07 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/davanstrien/falcon-ocr-test-nocompile
Resource description:
---
tags:
- ocr
- document-processing
- falcon-ocr
- plain
- uv-script
- generated
---

# Document Processing using Falcon OCR (plain mode)

This dataset contains OCR results from images in [davanstrien/ufo-ColPali](https://huggingface.co/datasets/davanstrien/ufo-ColPali) using [Falcon OCR](https://huggingface.co/tiiuae/Falcon-OCR), a 0.3B early-fusion vision-language model.

## Processing Details

- **Source Dataset**: [davanstrien/ufo-ColPali](https://huggingface.co/datasets/davanstrien/ufo-ColPali)
- **Model**: [tiiuae/Falcon-OCR](https://huggingface.co/tiiuae/Falcon-OCR)
- **Task Mode**: `plain` - Full-page text extraction
- **Number of Samples**: 20
- **Processing Time**: 9.3 min
- **Processing Date**: 2026-04-07 20:27 UTC

### Configuration

- **Image Column**: `image`
- **Dataset Split**: `train`
- **Max Output Tokens**: 2,048
- **Backend**: Transformers

## Model Information

Falcon OCR is a compact early-fusion model that processes image patches and text tokens in a shared Transformer. Key results:

- 80.3% on olmOCR benchmark
- 88.64% on OmniDocBench
- 87.1% on multi-column documents (best in class)
- 90.3% on tables (best in class)

## Reproduction

```bash
uv run https://huggingface.co/datasets/uv-scripts/ocr/raw/main/falcon-ocr.py \
    davanstrien/ufo-ColPali \
    <output-dataset> \
    --task-mode plain \
    --image-column image
```

Generated with [UV Scripts](https://huggingface.co/uv-scripts)
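
As a rough sanity check on the processing details above, the reported totals (20 samples in 9.3 minutes) imply an average per-image time. The sketch below only does that arithmetic. The variable names are illustrative, and the card does not say whether the total includes model loading, so the result is best read as an upper bound on per-image latency rather than a measured figure.

```python
# Values copied from the "Processing Details" list in the card.
num_samples = 20
processing_minutes = 9.3

# Average wall-clock time per image, assuming the reported total
# covers the whole run (this is an assumption, not stated in the card).
seconds_per_sample = processing_minutes * 60 / num_samples
samples_per_minute = num_samples / processing_minutes

print(f"{seconds_per_sample:.1f} s per image")
print(f"{samples_per_minute:.2f} images per minute")
```

This works out to a little under half a minute per image with the Transformers backend on this 20-sample run.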
Provided by:
davanstrien