nehruperumalla/forms
Hugging Face, updated 2022-06-16, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/nehruperumalla/forms
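Resource summary:
CORD is a receipt dataset for post-OCR parsing, sourced from the clovaai GitHub repository. Bounding box coordinates are normalized with respect to image width and height, and some infrequent labels are replaced with "O". The dataset is mainly used for post-OCR parsing tasks in document intelligence.
Provided by: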
nehruperumalla
Original information summary
CORD: A Consolidated Receipt Dataset for Post-OCR Parsing
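Dataset overview
- Data source: the CORD dataset was cloned from the clovaai GitHub repository.
- Coordinate system: bounding box coordinates are normalized with respect to image width and height.
- Label handling: labels that occur very rarely are replaced with "O". The replaced labels are: replacing_labels = ["menu.etc", "menu.itemsubtotal", "menu.sub_etc", "menu.sub_unitprice", "menu.vatyn", "void_menu.nm", "void_menu.price", "sub_total.othersvc_price"]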
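The two preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the uploader's actual script: the function names `normalize_box` and `replace_rare_labels`, the `(x0, y0, x1, y1)` box layout, and the `[0, 1]` target range are all assumptions, since the dataset card states neither the box format nor the scale. The example values are hypothetical.

```python
from typing import List, Sequence

# Labels listed in the dataset overview as being replaced with "O".
REPLACING_LABELS = {
    "menu.etc",
    "menu.itemsubtotal",
    "menu.sub_etc",
    "menu.sub_unitprice",
    "menu.vatyn",
    "void_menu.nm",
    "void_menu.price",
    "sub_total.othersvc_price",
}


def normalize_box(box: Sequence[float], width: float, height: float) -> List[float]:
    """Scale a pixel box (x0, y0, x1, y1) by the image width and height.

    Assumption: boxes are axis-aligned and the target range is [0, 1];
    the dataset card does not specify either detail.
    """
    x0, y0, x1, y1 = box
    return [x0 / width, y0 / height, x1 / width, y1 / height]


def replace_rare_labels(labels: Sequence[str]) -> List[str]:
    """Map every label in REPLACING_LABELS to "O"; keep the rest unchanged."""
    return ["O" if label in REPLACING_LABELS else label for label in labels]


# Hypothetical example values, not taken from the dataset.
example_box = normalize_box([100, 50, 300, 150], width=1000, height=500)
example_labels = replace_rare_labels(["menu.nm", "menu.vatyn", "total.total_price"])
```

If a downstream model expects a different coordinate scale (for instance integer coordinates), the normalized values would need to be rescaled accordingly before use.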
Citation information
CORD: A Consolidated Receipt Dataset for Post-OCR Parsing
@article{park2019cord,
  title={CORD: A Consolidated Receipt Dataset for Post-OCR Parsing},
  author={Park, Seunghyun and Shin, Seung and Lee, Bado and Lee, Junyeop and Surh, Jaeheung and Seo, Minjoon and Lee, Hwalsuk},
  booktitle={Document Intelligence Workshop at Neural Information Processing Systems},
  year={2019}
}
Post-OCR parsing: building simple and robust parser via BIO tagging
@article{hwang2019post,
  title={Post-OCR parsing: building simple and robust parser via BIO tagging},
  author={Hwang, Wonseok and Kim, Seonghyeon and Yim, Jinyeong and Seo, Minjoon and Park, Seunghyun and Park, Sungrae and Lee, Junyeop and Lee, Bado and Lee, Hwalsuk},
  booktitle={Document Intelligence Workshop at Neural Information Processing Systems},
  year={2019}
}



