
ibm-research/VAREX

Source: Hugging Face. Updated 2026-03-18. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/ibm-research/VAREX
Resource description:
---
license: cdla-permissive-2.0
task_categories:
- document-question-answering
- image-to-text
language:
- en
tags:
- document-extraction
- structured-extraction
- document-ai
- form-understanding
- multimodal
- benchmark
- json-schema
size_categories:
- 1K<n<10K
arxiv: 2603.15118
---

# VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

VAREX (VARied-schema EXtraction) is a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. It comprises **1,777 documents** with **1,771 unique schemas** across three structural categories, each provided in four input modalities. Ground truth is deterministic: it is generated via a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, and it is validated through three-phase quality assurance achieving ~98.5% field-level accuracy.

**Paper:** [arXiv:2603.15118](https://arxiv.org/abs/2603.15118)

**Evaluation code & scoring:** [github.com/udibarzi/varex-bench](https://github.com/udibarzi/varex-bench)

## Quick Start

```python
from datasets import load_dataset
import json

ds = load_dataset("ibm-research/VAREX", split="benchmark")
doc = ds[0]
print(doc["doc_id"])  # e.g., "1044"
print(doc["split"])   # "Flat", "Nested", or "Table"

schema = json.loads(doc["schema"])
gt = json.loads(doc["ground_truth"])
image = doc["image"]       # PIL Image, 200 DPI
text = doc["text_layout"]  # Spatial text with layout
```

## Columns

| Column | Type | Description |
|--------|------|-------------|
| `doc_id` | string | Unique document identifier |
| `split` | string | Structural category: Flat, Nested, or Table |
| `image` | Image | Document page rendered at 200 DPI (primary evaluation modality) |
| `image_50dpi` | Image | Document page rendered at 50 DPI (resolution robustness evaluation) |
| `schema` | string | JSON Schema defining the extraction target |
| `ground_truth` | string | JSON ground truth values |
| `text_flow` | string | Plain text in reading order |
| `text_layout` | string | Spatial text with whitespace-preserved layout |

## Input Modalities

| Modality | Paper code | Column(s) to use |
|----------|------------|------------------|
| Plain Text | P | `text_flow` |
| Spatial Text | S | `text_layout` |
| Image | V | `image` (or `image_50dpi` for robustness) |
| Spatial Text + Image | S+V | `text_layout` + `image` |

## Document Splits

| Split | Documents | Description |
|-------|-----------|-------------|
| Flat | 299 | Simple key-value schemas, no nesting |
| Nested | 1,146 | Schemas with nested objects |
| Table | 332 | Schemas with arrays of objects |

## PDF Files

Original filled PDFs are available in the `pdfs/` directory of this repository. Each filename corresponds to the `doc_id` column (e.g., doc_id `"1044"` → `pdfs/1044.pdf`). These allow researchers to apply their own text extraction or parsing pipelines.

## Scoring

Evaluation code, scoring scripts, and field exclusion lists are maintained at: **[github.com/udibarzi/varex-bench](https://github.com/udibarzi/varex-bench)**

The benchmark uses Exact Match (EM) as the primary metric, with order-invariant array matching via the Hungarian algorithm. At scoring time, 610 field-level exclusions are applied for fields with known ground truth issues.

## Citation

```bibtex
@inproceedings{varex2026,
  title  = {VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents},
  author = {Barzelay, Udi and Azulai, Ophir and Shapira, Inbar and Friedman, Idan and Abo Dahood, Foad and Lee, Madison and Daniels, Abraham},
  year   = {2026}
}
```

## License

Community Data License Agreement – Permissive, Version 2.0
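
## Example: Order-Invariant Array Scoring (Illustrative Sketch)

The Scoring section states that Exact Match is computed with order-invariant array matching via the Hungarian algorithm. The sketch below illustrates that idea only; it is not the official scorer, which lives in the varex-bench repository. To stay dependency-free it substitutes a brute-force search over row permutations for the Hungarian algorithm, which finds the same optimal pairing but is practical only for short arrays. The function names, the padding of missing rows, and the handling of empty ground truth are assumptions made for this example, and no value normalization or field exclusion is applied.

```python
from itertools import permutations


def matching_fields(pred_row: dict, gold_row: dict) -> int:
    """Count gold fields whose predicted value is exactly equal."""
    return sum(1 for key, value in gold_row.items()
               if key in pred_row and pred_row[key] == value)


def best_assignment_matches(pred_rows: list, gold_rows: list) -> int:
    """Maximum number of exactly matched fields over all one-to-one row pairings.

    Brute force over permutations: a stand-in (assumption) for the Hungarian
    algorithm, suitable only for short arrays.
    """
    n = max(len(pred_rows), len(gold_rows))
    # Pad the shorter side with empty rows so every row has a partner.
    pred = pred_rows + [{}] * (n - len(pred_rows))
    gold = gold_rows + [{}] * (n - len(gold_rows))
    return max(
        sum(matching_fields(pred[p], gold[g]) for g, p in enumerate(order))
        for order in permutations(range(n))
    )


def array_exact_match(pred_rows: list, gold_rows: list) -> float:
    """Fraction of gold fields matched under the best row pairing."""
    total = sum(len(row) for row in gold_rows)
    if total == 0:
        # Convention chosen for this sketch: an empty target is satisfied
        # only by an empty prediction.
        return 1.0 if not pred_rows else 0.0
    return best_assignment_matches(pred_rows, gold_rows) / total


# Hypothetical table rows: the prediction swaps row order and gets one qty wrong.
gold = [{"name": "A", "qty": 1}, {"name": "B", "qty": 2}]
pred = [{"name": "B", "qty": 2}, {"name": "A", "qty": 3}]

print(array_exact_match(pred, gold))  # 0.75
```

Comparing rows position by position would score this prediction 0 of 4 fields; pairing rows by best match credits 3 of 4, which is the point of order-invariant matching for the Table split.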
Provider:
ibm-research