ibm-research/VAREX
Source: Hugging Face. Updated 2026-03-18; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/ibm-research/VAREX
Resource description:
---
license: cdla-permissive-2.0
task_categories:
- document-question-answering
- image-to-text
language:
- en
tags:
- document-extraction
- structured-extraction
- document-ai
- form-understanding
- multimodal
- benchmark
- json-schema
size_categories:
- 1K<n<10K
arxiv: 2603.15118
---
# VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
VAREX (VARied-schema EXtraction) is a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. It comprises **1,777 documents** with **1,771 unique schemas** across three structural categories (Flat, Nested, and Table), and every document is provided in four input modalities. Ground truth is deterministic: a Reverse Annotation pipeline programmatically fills PDF templates with synthetic values, and the result is validated through three-phase quality assurance that achieves ~98.5% field-level accuracy.
**Paper:** [arXiv:2603.15118](https://arxiv.org/abs/2603.15118)
**Evaluation code & scoring:** [github.com/udibarzi/varex-bench](https://github.com/udibarzi/varex-bench)
## Quick Start
```python
from datasets import load_dataset
import json
ds = load_dataset("ibm-research/VAREX", split="benchmark")
doc = ds[0]
print(doc["doc_id"]) # e.g., "1044"
print(doc["split"]) # "Flat", "Nested", or "Table"
schema = json.loads(doc["schema"])
gt = json.loads(doc["ground_truth"])
image = doc["image"] # PIL Image, 200 DPI
text = doc["text_layout"] # Spatial text with layout
```
## Columns
| Column | Type | Description |
|--------|------|-------------|
| `doc_id` | string | Unique document identifier |
| `split` | string | Structural category: Flat, Nested, or Table |
| `image` | Image | Document page rendered at 200 DPI (primary evaluation modality) |
| `image_50dpi` | Image | Document page rendered at 50 DPI (resolution robustness evaluation) |
| `schema` | string | JSON Schema defining the extraction target |
| `ground_truth` | string | JSON ground truth values |
| `text_flow` | string | Plain text in reading order |
| `text_layout` | string | Spatial text with whitespace-preserved layout |
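
Because `schema` and `ground_truth` are stored as JSON strings, they need decoding before any field-level comparison. The following is a minimal sketch of one way to flatten a decoded record into leaf fields. The helper `leaf_paths`, its path notation, and the sample row are hypothetical and are not part of the dataset or the official tooling.

```python
import json

def leaf_paths(value, prefix=""):
    """Yield (path, leaf) pairs for every scalar in a decoded JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from leaf_paths(child, f"{prefix}[{index}]")
    else:
        yield prefix, value

# Hypothetical row for illustration only; real rows come from the dataset.
row = {
    "ground_truth": json.dumps(
        {"name": "A", "address": {"city": "B"}, "items": [{"qty": 1}]}
    )
}
fields = dict(leaf_paths(json.loads(row["ground_truth"])))
```

The same function can be applied to a model's decoded output, so that predictions and ground truth are compared path by path.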
## Input Modalities
| Modality | Paper code | Column(s) to use |
|----------|------------|------------------|
| Plain Text | P | `text_flow` |
| Spatial Text | S | `text_layout` |
| Image | V | `image` (or `image_50dpi` for robustness) |
| Spatial Text + Image | S+V | `text_layout` + `image` |
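
The mapping in the table above can be written as a small lookup. This is a sketch under stated assumptions: `MODALITY_COLUMNS` and `select_inputs` are hypothetical names, and the `low_res` switch is allowed only for the `V` modality because that is the only place the table mentions `image_50dpi`.

```python
# Paper code -> dataset columns, copied from the table above.
MODALITY_COLUMNS = {
    "P": ("text_flow",),
    "S": ("text_layout",),
    "V": ("image",),
    "S+V": ("text_layout", "image"),
}

def select_inputs(doc, modality, low_res=False):
    """Return only the columns of `doc` that the given modality uses."""
    columns = MODALITY_COLUMNS[modality]
    if low_res:
        if modality != "V":
            raise ValueError("low_res is only defined here for the V modality")
        columns = ("image_50dpi",)
    return {name: doc[name] for name in columns}
```

For example, `select_inputs(doc, "S+V")` returns a dict holding the `text_layout` string and the 200 DPI `image` for one row.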
## Document Splits
| Split | Documents | Description |
|-------|-----------|-------------|
| Flat | 299 | Simple key-value schemas, no nesting |
| Nested | 1,146 | Schemas with nested objects |
| Table | 332 | Schemas with arrays of objects |
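
A loaded copy can be checked against these counts with a simple tally. The sketch below assumes the `split` column has been read into any iterable of strings (for instance `ds["split"]`); the names `EXPECTED_COUNTS`, `split_counts`, and `count_mismatches` are hypothetical.

```python
from collections import Counter

# Document counts per split, copied from the table above.
EXPECTED_COUNTS = {"Flat": 299, "Nested": 1146, "Table": 332}

def split_counts(split_labels):
    """Tally how many documents fall into each structural category."""
    return Counter(split_labels)

def count_mismatches(counts, expected=EXPECTED_COUNTS):
    """Return {split: (found, expected)} for every split whose count differs."""
    return {
        name: (counts.get(name, 0), target)
        for name, target in expected.items()
        if counts.get(name, 0) != target
    }
```

An empty result from `count_mismatches(split_counts(ds["split"]))` would indicate that the loaded data matches the table.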
## PDF Files
Original filled PDFs are available in the `pdfs/` directory of this repository. Each filename corresponds to the `doc_id` column (e.g., doc_id `"1044"` → `pdfs/1044.pdf`). These allow researchers to apply their own text extraction or parsing pipelines.
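
The filename convention above can be sketched as follows. The helper names are hypothetical, and the download step assumes the `huggingface_hub` package is installed; it uses that library's `hf_hub_download` function with `repo_type="dataset"`.

```python
from pathlib import PurePosixPath

REPO_ID = "ibm-research/VAREX"

def pdf_repo_path(doc_id):
    """Map a doc_id such as "1044" to its path inside the repository."""
    return str(PurePosixPath("pdfs") / f"{doc_id}.pdf")

def download_pdf(doc_id):
    """Fetch one PDF into the local Hugging Face cache and return its local path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=pdf_repo_path(doc_id),
        repo_type="dataset",
    )
```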
## Scoring
Evaluation code, scoring scripts, and field exclusion lists are maintained at:
**[github.com/udibarzi/varex-bench](https://github.com/udibarzi/varex-bench)**
The benchmark uses Exact Match (EM) as the primary metric, with order-invariant array matching via the Hungarian algorithm. At scoring time, 610 field-level exclusions are applied to fields with known ground-truth issues.
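
To illustrate what order-invariant array matching means, the sketch below pairs predicted rows with ground-truth rows so that the total number of exactly matching fields is maximized. It is not the official scorer: `row_score` and `match_rows` are hypothetical helpers, the similarity measure is an assumption, and the assignment is solved with SciPy's `linear_sum_assignment`, which solves the same assignment problem the Hungarian algorithm addresses. The authoritative implementation is in the repository linked above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def row_score(pred_row, gt_row):
    """Fraction of ground-truth fields that pred_row reproduces exactly."""
    if not gt_row:
        return 1.0 if not pred_row else 0.0
    hits = sum(pred_row.get(key) == value for key, value in gt_row.items())
    return hits / len(gt_row)

def match_rows(pred_rows, gt_rows):
    """Return (pred_index, gt_index) pairs that maximize the total row score."""
    if not pred_rows or not gt_rows:
        return []
    scores = np.array(
        [[row_score(pred, gt) for gt in gt_rows] for pred in pred_rows]
    )
    pred_idx, gt_idx = linear_sum_assignment(scores, maximize=True)
    return [(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)]
```

With this pairing, a prediction that lists the correct table rows in a different order than the ground truth is not penalized for the ordering.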
## Citation
```bibtex
@inproceedings{varex2026,
title = {VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents},
author = {Barzelay, Udi and Azulai, Ophir and Shapira, Inbar and Friedman, Idan and Abo Dahood, Foad and Lee, Madison and Daniels, Abraham},
year = {2026}
}
```
## License
Community Data License Agreement – Permissive, Version 2.0
Provider:
ibm-research



