sarvamai/olmOCR-Bench-English
Source: Hugging Face. Updated 2026-02-05; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/sarvamai/olmOCR-Bench-English
Description:
This is a filtered version of the olmOCR-bench dataset that keeps only English documents, intended for OCR (Optical Character Recognition) and document understanding tasks. It covers several document types: math formulas, headers and footers, long documents with tiny text, multi-column layouts, historical scanned documents, and tables. The README gives the number of test cases and PDF files before and after filtering, along with retention percentages. The dataset consists of README.md, stats.json, and a bench_data directory containing the test cases and PDF files for each category. The dataset inherits the ODC-BY-1.0 license from the original olmOCR-bench dataset.
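The layout described above (stats.json plus a bench_data directory) can be inspected with a short script once a local copy has been downloaded. The following is a minimal sketch, not part of the dataset's own tooling: the function name `summarize_bench` is hypothetical, the schema of stats.json is not documented here so it is loaded as-is, and the idea that each PDF's parent folder name identifies its category is an assumption about how bench_data is organized.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_bench(root):
    """Return (stats, pdf_counts) for a local copy of the dataset at `root`.

    Hypothetical helper: only README.md, stats.json and bench_data are
    named in the description; everything else here is an assumption.
    """
    root = Path(root)

    # stats.json is named in the description, but its schema is not,
    # so return whatever JSON it holds (or None if the file is absent).
    stats_path = root / "stats.json"
    stats = None
    if stats_path.is_file():
        stats = json.loads(stats_path.read_text(encoding="utf-8"))

    # Assumption: PDFs sit somewhere under bench_data, and the name of
    # the folder holding each PDF is a reasonable grouping key.
    pdf_counts = Counter()
    bench_dir = root / "bench_data"
    if bench_dir.is_dir():
        for pdf in bench_dir.rglob("*.pdf"):
            pdf_counts[pdf.parent.name] += 1

    return stats, dict(pdf_counts)
```

Calling `summarize_bench("path/to/local/copy")` returns the parsed stats.json content and a dictionary of PDF counts per folder, which can be compared by hand against the counts reported in the README.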
Provider:
sarvamai



