Voxel51/form_understanding_in_noisy_scanned_documents_plus
Source: Hugging Face. Updated 2025-10-21. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/Voxel51/form_understanding_in_noisy_scanned_documents_plus
Description:
FUNSD+ (Form Understanding in Noisy Scanned Documents Plus) is an enhanced version of the original FUNSD dataset for form understanding tasks. It provides ground-truth data for extracting structured information from scanned forms, with annotations for entity recognition and for relation extraction between field labels and their values. FUNSD+ addresses labeling inconsistencies found in the original FUNSD dataset and expands the collection from 199 to 1,113 documents. The annotations cover headers, questions (field labels), answers (field values), and the relationships between them, which makes the dataset suitable for training and evaluating models for key-value extraction, document layout analysis, and form understanding. Each sample includes a scanned form image, word-level OCR tokens with bounding boxes, entity labels (header, question, answer, other), grouped words that form semantic units, and links between groups that capture question-answer relationships.
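The per-sample structure described above can be sketched in Python as follows. This is only an illustrative model, not the dataset's actual schema: the class names (`Word`, `Group`, `FormSample`), the field names, the box format, and the toy values are all assumptions made for the example. It shows how question-answer links between labeled groups could be turned into key-value pairs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    # Assumed box format: (x0, y0, x1, y1); the real dataset may differ.
    box: tuple[int, int, int, int]


@dataclass
class Group:
    """A semantic unit: several OCR words sharing one entity label."""
    id: int
    label: str  # one of "header", "question", "answer", "other"
    words: list[Word]

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class FormSample:
    image_path: str  # hypothetical path to the scanned form image
    groups: list[Group]
    # Each link is (question_group_id, answer_group_id); an assumed encoding.
    links: list[tuple[int, int]] = field(default_factory=list)


def key_value_pairs(sample: FormSample) -> list[tuple[str, str]]:
    """Resolve question -> answer links into (key, value) text pairs."""
    by_id = {g.id: g for g in sample.groups}
    pairs = []
    for q_id, a_id in sample.links:
        q, a = by_id.get(q_id), by_id.get(a_id)
        # Skip dangling links and links that do not join a question to an answer.
        if q is None or a is None:
            continue
        if q.label == "question" and a.label == "answer":
            pairs.append((q.text, a.text))
    return pairs


# Toy sample with made-up content, for illustration only.
sample = FormSample(
    image_path="form_0001.png",
    groups=[
        Group(0, "header", [Word("FAX", (10, 10, 40, 24)), Word("SHEET", (44, 10, 90, 24))]),
        Group(1, "question", [Word("Date:", (10, 40, 50, 54))]),
        Group(2, "answer", [Word("10/21/98", (60, 40, 120, 54))]),
        Group(3, "question", [Word("Sent", (10, 60, 40, 74)), Word("by:", (44, 60, 66, 74))]),
        Group(4, "answer", [Word("J.", (76, 60, 88, 74)), Word("Smith", (92, 60, 130, 74))]),
    ],
    links=[(1, 2), (3, 4)],
)

pairs = key_value_pairs(sample)
print(pairs)
```

Keeping the links as a separate list of group-id pairs, rather than nesting answers inside questions, allows a question to link to several answers, or to none, without changing the group records.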
Provided by:
Voxel51



