WEwoCram/CommonForms
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/WEwoCram/CommonForms
Description:
CommonForms is a large, diverse dataset for form field detection, casting the problem as object detection: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of form fields.
Key Features:
* **Scale:** Roughly 55,000 documents comprising over 450,000 pages.
* **Source:** Constructed by filtering Common Crawl to find PDFs with fillable elements.
* **Diversity:** Contains a diverse mixture of languages (one third non-English) and domains, with no single domain making up more than 25% of the dataset.
* **Purpose:** The first large-scale dataset released for form field detection, aimed at fostering the development of robust form field detectors.
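Because the task is framed as object detection, a prediction is normally scored by comparing its box and class against a ground-truth field. The sketch below illustrates that idea with intersection-over-union (IoU). It is only an illustration: the `FormField` schema, the `(x0, y0, x1, y1)` corner box format, and the 0.5 threshold are assumptions, not the dataset's documented annotation format or evaluation protocol. Only the three field type names come from the description above.

```python
from dataclasses import dataclass

# The three field types named in the dataset description.
FIELD_TYPES = ("Text Input", "Choice Button", "Signature")


@dataclass(frozen=True)
class FormField:
    """One annotated or predicted form field (hypothetical schema).

    Coordinates are assumed to be page-image corners with x0 < x1 and y0 < y1.
    """
    field_type: str
    x0: float
    y0: float
    x1: float
    y1: float

    def __post_init__(self):
        if self.field_type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.field_type!r}")
        if self.x1 <= self.x0 or self.y1 <= self.y0:
            raise ValueError("box must have positive width and height")

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def iou(a: FormField, b: FormField) -> float:
    """Intersection-over-union of two boxes, in the range [0, 1]."""
    inter_w = min(a.x1, b.x1) - max(a.x0, b.x0)
    inter_h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    return inter / (a.area + b.area - inter)


def is_match(pred: FormField, truth: FormField, threshold: float = 0.5) -> bool:
    """A prediction matches when the type agrees and IoU meets the threshold.

    The 0.5 default is a common detection convention, assumed here.
    """
    return pred.field_type == truth.field_type and iou(pred, truth) >= threshold
```

For example, `is_match(FormField("Signature", 0, 0, 10, 10), FormField("Signature", 1, 0, 10, 10))` compares two signature boxes that overlap on nine tenths of the larger one, so it passes at the assumed 0.5 threshold.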
Provider:
WEwoCram



