VANVAN6992/StructBill-CN

Name: VANVAN6992/StructBill-CN
Creator: VANVAN6992
Published: 2026-04-11 01:22:27
License: CC BY-NC-SA 4.0

Hugging Face · Updated 2026-04-11 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/VANVAN6992/StructBill-CN

Resource overview:

---
license: cc-by-nc-sa-4.0
task_categories:
- text-generation
- table-question-answering
language:
- zh
tags:
- medical
size_categories:
- 1K<n<10K
---

# Dataset Card for StructBill-CN

**StructBill-CN** is a comprehensive benchmark dataset tailored for Schema-based Unified Extraction in Visual Document Understanding (VDU). It specifically targets the direct-ingestion extraction of complex, hierarchical information (both global Key-Value pairs and nested wireless tables) from high-resolution Chinese medical statement images, with a strong emphasis on evaluating structural accuracy and arithmetic logical consistency.

## Dataset Details

### Dataset Description

Automated transformation of complex statement images into queryable databases is a critical yet unresolved challenge. While Multimodal Large Language Models (MLLMs) excel in general perception, they struggle with precise direct-ingestion tasks, particularly when processing wireless tables where the absence of visual grid lines renders traditional Table Structure Recognition (TSR) ineffective.

StructBill-CN bridges the gap between visual cues and semantic structure. Unlike traditional datasets that focus heavily on physical bounding boxes, StructBill-CN features logical structure annotations for both global Key-Value pairs and complex line-item tables. It compels models to comprehend semantic layouts and business logic (such as deterministic arithmetic rules like `Price * Quantity = Amount`) rather than merely performing physical visual detection.

- **Curated by:** The authors of the paper *"StructBill-CN: Benchmarking and Improving Logical Consistency in Visual Document Understanding with Schema-Reinforced Policy Optimization"*
- **Language(s) (NLP):** Chinese (zh-CN)
- **License:** [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

### Dataset Sources

To strictly comply with original data distribution agreements, our repository explicitly separates our novel annotations from third-party raw images.

- **Repository (Annotations & Internal Test Data):** [Insert HuggingFace Anonymous Link Here]
- **Original Images - CHIP-2022:** [Aliyun Tianchi Platform - CHIP 2022 Shared Task](https://www.google.com/search?q=https://tianchi.aliyun.com/dataset/137694) *(Requires Tianchi account)*
- **Original Images - SIBR-med:** [Official SIBR-med GitHub Repository](https://www.google.com/search?q=https://github.com/HustVision/SIBR)
- **Paper:** *Under Review at IJCAI 2026*

## Uses

### Direct Use

This dataset is designed for academic research in the field of Document AI, specifically for:

- Evaluating and training Multimodal Large Language Models (MLLMs) on Visual Information Extraction (VIE).
- Benchmarking models on parsing complex, wireless, and borderless tables.
- Assessing the logical reasoning and arithmetic consistency capabilities of VDU systems.
- Developing reinforcement learning algorithms (like SRPO) for document alignment and schema-following tasks.

### Out-of-Scope Use

- **Commercial Use:** Prohibited under the CC BY-NC-SA 4.0 license.
- **High-Risk Decision Making:** Models trained on this dataset should not be deployed in real-world healthcare or financial automated auditing without a robust Human-in-the-loop (HITL) review system.

## Dataset Structure

StructBill-CN comprises 3,596 high-resolution images covering 8 distinct business schemas.

**Strict Compliance Distribution Policy (Decoupled Release):** To comply with data privacy policies and third-party distribution agreements, the dataset is structured as follows:

- **1. What We Provide Here (Available Now):**
  - **Unified Annotations:** Our curated, hierarchical JSON annotations for *all* 3,596 instances (CHIP-2022, SIBR-med, and Internal-Wild).
  - **Internal-Wild Test Images:** The original image files for the Out-of-Distribution (OOD) test set of our proprietary Internal-Wild data. These have been fully de-identified.
- **2. Third-Party Source Images (Download Required):**
  - For the CHIP-2022 and SIBR-med subsets, **we only provide the annotations**. Researchers must download the original raw images directly from their respective official platforms (linked in the Dataset Sources section above) to pair with our JSON files.
- **3. Internal-Wild Training Set (Pending Release):**
  - We have completed the rigorous de-identification process for the training set images and obtained the necessary permissions for public release. To strictly maintain the **double-blind review policy** for IJCAI, this subset is temporarily withheld. It will be uploaded to our official, non-anonymous repository immediately upon the publication of the paper.

## Dataset Creation

### Curation Rationale

Existing benchmarks predominantly focus on simple KV extraction or ruled tables with relatively static layouts. They fail to expose model deficiencies in semantic alignment when dealing with borderless tables, structural ambiguity, and extreme visual density. StructBill-CN was created to establish an "Ingestion-Ready" benchmark that mimics real-world database schemas, forcing models to infer structure from content logic.

### Source Data

#### Data Collection and Processing

The dataset aggregates data from three main sources:

1. **CHIP-2022 (1,700 items):** Inpatient/Outpatient/Pharmacy invoices and Discharge records.
2. **SIBR-med (600 items):** Fee lists and Notification notes.
3. **Internal-Wild (1,296 items):** Real-world private data representing complex itemized billing lists and dense KV layouts.

### Annotations

#### Annotation process

The annotation protocol strictly prioritizes **semantic attribution over physical location**. Instead of traditional bounding box coordinates, annotations are formatted as a Hierarchical JSON standard. In the presence of printing offsets or wireless table layouts, labels are assigned based on the logical business context. Furthermore, all numerical fields (Price, Quantity, Amount) were cross-validated to ensure arithmetic consistency in the Ground Truth.

#### Personal and Sensitive Information

**Strict Ethical & Privacy Statement:** All real-world data (Internal-Wild) has undergone rigorous **de-identification and anonymization**. All Protected Health Information (PHI), including patient names, personal identification numbers, specific medical institution names, and exact dates, has been thoroughly redacted, masked, or replaced with synthetic placeholders. **No sensitive personal data is exposed in this benchmark.**

### Recommendations

Users should be aware that while the dataset aims to benchmark arithmetic consistency, current MLLMs may still hallucinate numbers. It is highly recommended to implement deterministic rule-checkers on top of model outputs when using these systems in practical scenarios.

## Citation

**BibTeX:**

```
@article{structbillcn2026,
  title={StructBill-CN: Benchmarking and Improving Logical Consistency in Visual Document Understanding with Schema-Reinforced Policy Optimization},
  author={Anonymous Authors},
  journal={Under Review at IJCAI},
  year={2026}
}
```

*(Note: The citation will be updated with author names and official publication details upon acceptance.)*

## Dataset Card Contact

For questions regarding the dataset, data licensing, or to request the removal of source images based on copyright claims, please contact: `vanvan6992@gamil.com`

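Provider:

VANVAN6992

Collected summary

Dataset introduction

Construction method

Within visual document understanding, StructBill-CN addresses structured information extraction from complex medical document images by consolidating heterogeneous data from several sources. Its core sources are the inpatient and outpatient invoices of the CHIP-2022 shared task, the fee lists of SIBR-med, and internally collected real-world billing data, for a total of 3,596 high-resolution images. Annotation abandons the traditional physical bounding-box approach in favor of a hierarchical JSON standard based on semantic attribution, which prioritizes logical business context over visual position. Numerical fields were cross-validated for arithmetic consistency so that the annotations conform to real business rules.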

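Usage

The dataset is intended primarily for academic research in visual document understanding, in particular for evaluating and training multimodal large language models on parsing complex wireless tables. Researchers must first download the original CHIP-2022 and SIBR-med images from their official platforms and pair them with the unified JSON annotations provided in the dataset; the images of the internal test set are already included. Because models may hallucinate numbers, it is advisable to validate model outputs with a deterministic rule checker. Note that commercial use is prohibited and that the dataset is not suitable for high-risk decision-making scenarios.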
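The pairing step described above could be sketched as follows. This is a minimal illustration, not the dataset's official loader: it assumes that each annotation file shares its filename stem with the corresponding image (for example `0001.json` for `0001.jpg`), which the dataset card does not confirm. The function name and the example paths are hypothetical.

```python
from pathlib import PurePath


def pair_images_with_annotations(image_names, annotation_names):
    """Match image files to JSON annotation files by shared filename stem.

    ASSUMPTION (not stated in the dataset card): annotation `abc.json`
    belongs to image `abc.jpg` / `abc.png`.

    Returns (pairs, images_without_annotation, annotations_without_image).
    """
    # Index annotations by stem, ignoring anything that is not a .json file.
    ann_by_stem = {
        PurePath(name).stem: name
        for name in annotation_names
        if PurePath(name).suffix.lower() == ".json"
    }
    pairs, missing_annotation = [], []
    for image in sorted(image_names):
        stem = PurePath(image).stem
        if stem in ann_by_stem:
            # pop() so that leftovers are the annotations with no image.
            pairs.append((image, ann_by_stem.pop(stem)))
        else:
            missing_annotation.append(image)
    return pairs, missing_annotation, sorted(ann_by_stem.values())
```

Returning the unmatched files on both sides makes it easy to spot third-party images that were not downloaded, or downloaded under different names, before any training or evaluation run.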

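Background and challenges

Background overview

Automatically converting complex medical document images into structured data is a long-standing technical problem in visual document understanding. StructBill-CN was built in 2026 by its research team to provide a benchmark for schema-based unified information extraction from high-resolution Chinese medical documents. It focuses on extracting hierarchical information from wireless tables, which lack visible grid lines, and emphasizes structural accuracy and arithmetic logical consistency. The dataset supports the evaluation and improvement of multimodal large language models on precise visual information extraction and serves as an experimental platform for document intelligence research.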

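Current challenges

The core challenge StructBill-CN addresses is modeling the consistency of semantic structure and arithmetic logic in visual document understanding. Traditional methods rely on physical bounding boxes and struggle with structure recognition in wireless tables and with reasoning over complex business logic. Building the dataset required integrating medical images from multiple sources, performing strict de-identification, and handling the complexity of hierarchical logical annotation. The annotations also had to follow the principle of semantic attribution first and remain exact in their arithmetic relations, so that they mimic the direct-ingestion requirements of real-world database schemas.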

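Common scenarios

Classic use cases

In visual document understanding, StructBill-CN is used to evaluate how well multimodal large language models extract structured information from images of complex medical bills. Using high-resolution Chinese medical document images, it mimics real-world database schemas and requires models to infer semantic structure from borderless tables and dense layouts, which drives performance improvements on direct-ingestion tasks. Its classic use cases are training and testing models on parsing global key-value pairs and nested wireless tables, with emphasis on logical consistency and arithmetic rule verification, giving document intelligence research a standardized benchmark.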

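Academic problems addressed

StructBill-CN targets a core academic problem in visual document understanding: how to convert complex visual documents into structured data accurately. Traditional methods depend on physical bounding-box detection and cope poorly with wireless tables and ambiguous semantic layouts. By introducing logical structure annotations and arithmetic consistency verification, the dataset pushes models beyond visual perception toward understanding business logic and semantic relations. It fills a gap in existing benchmarks, which do not evaluate structural accuracy and logical reasoning, and provides data for studying the robustness of multimodal models in real-world settings.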

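Practical applications

In practice, StructBill-CN serves as a reference for automated document processing systems in healthcare and finance. In medical bill auditing, for example, a system can train a model on the dataset to extract key information from dense invoice images, such as drug price, quantity, and total amount, and automatically verify arithmetic consistency. Such an application improves data entry efficiency and, combined with human-in-the-loop review, reduces the risk of manual errors, providing a technical basis for automation in these industries.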
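The automatic arithmetic verification mentioned above can be illustrated with a small deterministic checker for the rule `Price * Quantity = Amount`. This is a sketch under stated assumptions, not the benchmark's evaluation code: the field names `price`, `quantity`, and `amount`, the flat list-of-rows layout, and the 0.01 rounding tolerance are all hypothetical and would need to be adapted to the actual JSON schema.

```python
from decimal import Decimal, InvalidOperation


def check_line_items(items, tolerance=Decimal("0.01")):
    """Return the indices of rows that violate price * quantity = amount.

    ASSUMPTIONS: each row is a dict with hypothetical keys "price",
    "quantity", and "amount"; `tolerance` absorbs currency rounding.
    Rows with missing or non-numeric fields are reported as violations.
    """
    violations = []
    for index, row in enumerate(items):
        try:
            # Decimal avoids binary floating-point error on currency values.
            price = Decimal(str(row["price"]))
            quantity = Decimal(str(row["quantity"]))
            amount = Decimal(str(row["amount"]))
            consistent = abs(price * quantity - amount) <= tolerance
        except (KeyError, InvalidOperation):
            consistent = False
        if not consistent:
            violations.append(index)
    return violations
```

Flagged rows can then be routed to a human reviewer instead of being written to the database, which matches the human-in-the-loop recommendation in the dataset card.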

Recent research on the dataset