ibm-research/VAREX

Name: ibm-research/VAREX
Creator: ibm-research
Published: 2026-03-18 07:51:34
License: cdla-permissive-2.0 (per the dataset card)

Hugging Face · Updated 2026-03-18 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/ibm-research/VAREX

Dataset description:

---
license: cdla-permissive-2.0
task_categories:
- document-question-answering
- image-to-text
language:
- en
tags:
- document-extraction
- structured-extraction
- document-ai
- form-understanding
- multimodal
- benchmark
- json-schema
size_categories:
- 1K<n<10K
arxiv: 2603.15118
---

# VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

VAREX (VARied-schema EXtraction) is a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. It comprises **1,777 documents** with **1,771 unique schemas** across three structural categories, each provided in four input modalities. Ground truth is deterministic: it is generated via a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, and it is validated through three-phase quality assurance achieving ~98.5% field-level accuracy.

**Paper:** [arXiv:2603.15118](https://arxiv.org/abs/2603.15118)

**Evaluation code & scoring:** [github.com/udibarzi/varex-bench](https://github.com/udibarzi/varex-bench)

## Quick Start

```python
import json

from datasets import load_dataset

ds = load_dataset("ibm-research/VAREX", split="benchmark")
doc = ds[0]

print(doc["doc_id"])  # e.g., "1044"
print(doc["split"])   # "Flat", "Nested", or "Table"

# `schema` and `ground_truth` are stored as JSON strings, so parse them first.
schema = json.loads(doc["schema"])
gt = json.loads(doc["ground_truth"])

image = doc["image"]        # PIL Image, 200 DPI
text = doc["text_layout"]   # Spatial text with layout
```

## Columns

| Column | Type | Description |
|--------|------|-------------|
| `doc_id` | string | Unique document identifier |
| `split` | string | Structural category: Flat, Nested, or Table |
| `image` | Image | Document page rendered at 200 DPI (primary evaluation modality) |
| `image_50dpi` | Image | Document page rendered at 50 DPI (resolution robustness evaluation) |
| `schema` | string | JSON Schema defining the extraction target |
| `ground_truth` | string | JSON ground truth values |
| `text_flow` | string | Plain text in reading order |
| `text_layout` | string | Spatial text with whitespace-preserved layout |

## Input Modalities

| Modality | Paper code | Column(s) to use |
|----------|------------|------------------|
| Plain Text | P | `text_flow` |
| Spatial Text | S | `text_layout` |
| Image | V | `image` (or `image_50dpi` for robustness) |
| Spatial Text + Image | S+V | `text_layout` + `image` |

## Document Splits

| Split | Documents | Description |
|-------|-----------|-------------|
| Flat | 299 | Simple key-value schemas, no nesting |
| Nested | 1,146 | Schemas with nested objects |
| Table | 332 | Schemas with arrays of objects |

## PDF Files

Original filled PDFs are available in the `pdfs/` directory of this repository. Each filename corresponds to the `doc_id` column (e.g., doc_id `"1044"` → `pdfs/1044.pdf`). These allow researchers to apply their own text extraction or parsing pipelines.

## Scoring

Evaluation code, scoring scripts, and field exclusion lists are maintained at: **[github.com/udibarzi/varex-bench](https://github.com/udibarzi/varex-bench)**

The benchmark uses Exact Match (EM) as the primary metric with order-invariant array matching via the Hungarian algorithm. 610 field-level exclusions are applied at scoring time for fields with known ground truth issues.

## Citation

```bibtex
@inproceedings{varex2026,
  title  = {VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents},
  author = {Barzelay, Udi and Azulai, Ophir and Shapira, Inbar and Friedman, Idan and Abo Dahood, Foad and Lee, Madison and Daniels, Abraham},
  year   = {2026}
}
```

## License

Community Data License Agreement – Permissive, Version 2.0
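
As an illustration of the Input Modalities table and the `pdfs/` naming convention described above, the following is a minimal sketch of how one might select the columns for a given paper code. The names `MODALITY_COLUMNS`, `select_inputs`, and `pdf_path` are hypothetical helpers introduced here and are not part of the dataset or the evaluation repository. Applying the 50 DPI image only to the `V` modality is an assumption based on how the table is worded.

```python
from typing import Any, Dict

# Paper modality code -> dataset columns, copied from the Input Modalities table.
MODALITY_COLUMNS = {
    "P": ["text_flow"],
    "S": ["text_layout"],
    "V": ["image"],
    "S+V": ["text_layout", "image"],
}


def select_inputs(doc: Dict[str, Any], modality: str, low_res: bool = False) -> Dict[str, Any]:
    """Return only the columns that a modality uses (hypothetical helper)."""
    if modality not in MODALITY_COLUMNS:
        raise ValueError(f"unknown modality code: {modality!r}")
    inputs = {}
    for column in MODALITY_COLUMNS[modality]:
        # Assumption: the 50 DPI render is swapped in only for the image-only
        # modality, since the table mentions it only in the "V" row.
        use_low_res = low_res and modality == "V" and column == "image"
        inputs[column] = doc["image_50dpi" if use_low_res else column]
    return inputs


def pdf_path(doc_id: str) -> str:
    """Map a doc_id to its filled PDF, following the pdfs/<doc_id>.pdf convention."""
    return f"pdfs/{doc_id}.pdf"
```

With a row loaded as in the Quick Start, `select_inputs(doc, "S+V")` would return the `text_layout` and `image` values, and `pdf_path(doc["doc_id"])` would give the relative path of the matching PDF.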
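
To make the scoring idea above more concrete, here is a rough sketch of field-level Exact Match with order-invariant array matching. It is not the official scorer, which lives in the varex-bench repository. In particular, it swaps the Hungarian algorithm for a brute-force search over permutations (the same optimum on small arrays, but factorial cost), it applies no value normalization, and it ignores the 610 field-level exclusions. The function name `exact_match_counts` is hypothetical.

```python
from itertools import permutations
from typing import Any, Tuple


def exact_match_counts(pred: Any, gt: Any) -> Tuple[int, int]:
    """Return (exact matches, number of ground-truth leaf fields)."""
    if isinstance(gt, dict):
        pred = pred if isinstance(pred, dict) else {}
        hits = total = 0
        for key, gt_value in gt.items():
            h, t = exact_match_counts(pred.get(key), gt_value)
            hits += h
            total += t
        return hits, total

    if isinstance(gt, list):
        pred = pred if isinstance(pred, list) else []
        total = sum(exact_match_counts(None, item)[1] for item in gt)
        # Pad with None so every ground-truth item can be paired with something.
        candidates = pred + [None] * max(0, len(gt) - len(pred))
        # Brute-force stand-in for the Hungarian algorithm: try every way of
        # assigning predicted items to ground-truth items and keep the best.
        best = 0
        for assignment in permutations(candidates, len(gt)):
            hits = sum(exact_match_counts(p, g)[0] for p, g in zip(assignment, gt))
            best = max(best, hits)
        return best, total

    # Leaf value: strict equality, with no normalization (an assumption).
    return (1 if pred == gt else 0), 1


gt = {"name": "Ann", "rows": [{"id": 1, "amt": "5"}, {"id": 2, "amt": "7"}]}
pred = {"name": "Ann", "rows": [{"id": 2, "amt": "7"}, {"id": 1, "amt": "6"}]}
hits, total = exact_match_counts(pred, gt)
print(f"EM = {hits}/{total}")  # prints "EM = 4/5"
```

The predicted rows are in the opposite order from the ground truth, yet the row with `id` 2 still counts as fully correct; only the wrong `amt` in the other row is penalized. For arrays longer than a handful of items, a polynomial-time assignment solver would be the practical choice.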

Provided by:

ibm-research
