WEwoCram/CommonForms

Name: WEwoCram/CommonForms
Creator: WEwoCram
Published: 2026-04-28 00:55:44
License: No description available

Hugging Face | Updated 2026-04-28 | Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/WEwoCram/CommonForms


Resource overview:

CommonForms is a large, diverse dataset for form field detection. It casts the problem as object detection: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of each form field.

Key Features:

* **Scale:** Roughly 55,000 documents comprising over 450,000 pages.
* **Source:** Constructed by filtering Common Crawl to find PDFs with fillable elements.
* **Diversity:** Contains a mixture of languages (one third non-English) and domains, with no single domain making up more than 25% of the dataset.
* **Purpose:** The first large-scale dataset released for form field detection, aimed at fostering the development of robust form field detectors.
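To make the object detection framing concrete, here is a minimal Python sketch of how one form field might be represented and how a predicted field could be compared against a ground-truth one. The three class names come from the description above; everything else (the `FormField` record, the corner-coordinate box convention, the `iou` and `is_correct` helpers, and the 0.5 threshold) is an illustrative assumption and not the dataset's actual schema or its official evaluation protocol.

```python
from dataclasses import dataclass

# The three field types named in the description; the identifiers are illustrative.
FIELD_TYPES = ("text_input", "choice_button", "signature")


@dataclass(frozen=True)
class FormField:
    """One form field on a page image (hypothetical schema, corner coordinates)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    field_type: str

    def __post_init__(self):
        if self.field_type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.field_type}")
        if self.x_max <= self.x_min or self.y_max <= self.y_min:
            raise ValueError("box must have positive width and height")

    @property
    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def iou(a: FormField, b: FormField) -> float:
    """Intersection over union of two boxes, a value in [0, 1]."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    return inter / (a.area + b.area - inter)


def is_correct(pred: FormField, truth: FormField, threshold: float = 0.5) -> bool:
    """Assumed matching rule: same field type and IoU at or above the threshold."""
    return pred.field_type == truth.field_type and iou(pred, truth) >= threshold
```

A detector for this task would output a list of such records per page image; a rule like `is_correct` is one common way to decide whether a predicted field matches an annotated one.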

Provider:

WEwoCram
