NIAT-Pro/needle-in-a-table-pro

Name: NIAT-Pro/needle-in-a-table-pro
Creator: NIAT-Pro
Published: 2026-04-11 06:33:18
License: No description available

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/NIAT-Pro/needle-in-a-table-pro


Resource description:

---
language:
- en
pretty_name: NIAT-Pro
license: openrail
task_categories:
- question-answering
- text-generation
tags:
- tables
- tabular-reasoning
- long-context
- benchmark
- question-answering
- retrieval
- reasoning
- llm-evaluation
- structured-data
- large-tables
- tabular-formats
size_categories:
- 10K<n<100K
---

# NIAT-Pro: Needle-In-A-Table-Pro

NIAT-Pro is a benchmark for evaluating how well large language models understand and reason over large tables under controlled variations of tabular format, table size, and information position. It extends the original Needle-In-A-Table setting from simple cell lookup to one-hop, two-hop, and four-hop tasks, and studies performance across 11 table representations:

- CSV
- TSV
- PSV
- JSON
- XML
- YAML
- Markdown
- HTML
- LaTeX
- SQL
- Free-form text

NIAT-Pro is designed to study long-context tabular understanding at realistic scales using three public datasets: HAR, SECOM, and WEC. It systematically controls table representation, row and column scaling, and target information position to enable rigorous analysis of LLM behavior on large tables.

## Overview

Existing tabular benchmarks often rely on relatively small tables, fix contextual properties within each sample, and report only coarse average accuracy. NIAT-Pro is designed to address these limitations by:

- using substantially larger tables
- systematically varying format, row and column scaling, and target information position
- including retrieval and reasoning tasks of increasing complexity
- enabling factorial analysis of how these factors affect model performance

## Tasks

NIAT-Pro includes three levels of task complexity.

### One-hop lookup

This task evaluates direct retrieval of a target cell value from row and column cues.
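
As a rough illustration of what a one-hop lookup asks for, the sketch below retrieves a single cell from a small CSV table given a row cue and a column cue. The toy table, the `id` key column, and the `one_hop_lookup` helper are all hypothetical; the actual question wording and answer format in `qas.json` may differ.

```python
import csv
import io

# Hypothetical toy table; real NIAT-Pro tables are far larger.
TOY_CSV = """id,f1,f2,f3
r1,0.12,3.4,7
r2,0.56,1.8,9
r3,0.91,2.2,4
"""


def one_hop_lookup(csv_text, row_key, column, key_column="id"):
    """Return the cell at (row_key, column) as a string, or None if absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # The row cue identifies the record; the column cue selects the field.
        if row[key_column] == row_key:
            return row.get(column)
    return None


print(one_hop_lookup(TOY_CSV, "r2", "f2"))  # prints 1.8
```

A ground-truth function like this can be used to check a model's answer on the toy table; it is not part of the benchmark itself.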
### Two-hop reasoning

This task includes two types of questions:

- finding the maximum or minimum value of a given column
- table navigation relative to a base cell

### Four-hop reasoning

This task further increases complexity by defining the base position implicitly through an extreme value in a column, then asking the model to navigate relative to that position and retrieve the final target value.

## Source datasets

NIAT-Pro is constructed from three public tabular datasets spanning different domains:

- SECOM: semiconductor manufacturing
- WEC: wave energy converters
- HAR: human activity recognition

These datasets cover engineering, environmental, and health-related domains.

## Repository structure and folder hierarchy

The repository is organized by dataset name, then by row and column scaling factors, then by benchmark scenarios defined by information positions, and finally by the 11 table-format files for the same test scenario.

At the root level, the repository contains the three dataset folders and the README file:

```text
NIAT-Pro/
├── har/
├── secom/
├── wec/
└── README.md
```

Here:

- `har`, `secom`, and `wec` are dataset names
- each dataset folder stores benchmark artifacts derived from that source dataset

## Dataset-level hierarchy

Inside each dataset folder, there are multiple subfolders named in the form `Srow{}_Scol{}`. These indicate the scaling factors applied to the row and column dimensions of the benchmark tables.

A dataset folder has the following general structure:

```text
<dataset_name>/
├── Srow{row_scale}_Scol{col_scale}/
├── Srow{row_scale}_Scol{col_scale}/
├── ...
├── T_s.csv
├── benchmark_summary.json
└── qa_spec.json
```

Meaning of these items:

- `Srow{row_scale}_Scol{col_scale}`: a benchmark subset with a specific row scaling factor and column scaling factor
- `T_s.csv`: the informative subtable used during benchmark construction
- `benchmark_summary.json`: summary metadata for the benchmark instances under this dataset
- `qa_spec.json`: dataset-level question and answer specification, not specific questions and answers

## Scaling-factor level hierarchy

Inside each `Srow{}_Scol{}` folder, the structure is:

```text
Srow{row_scale}_Scol{col_scale}/
├── benches/
├── manifest.json
└── qas.json
```

Meaning of these items:

- `qas.json`: records the questions and answers for this scaling setting; these are unified for all scenario subfolders inside `benches/`
- `manifest.json`: metadata describing the scenario inventory and files under this scaling setting
- `benches/`: contains benchmark scenarios created by varying information positions

This means that for a fixed dataset and a fixed pair of row and column scaling factors, the questions and answers are shared across the different information-position scenarios, while the actual rendered benchmark tables differ by scenario.

## Benchmark-scenario hierarchy

Inside `benches/`, each subfolder name is of the form `i{}_j{}`:

```text
benches/
├── i01_j01/
├── i01_j02/
├── i01_j03/
├── i02_j01/
├── i02_j02/
├── i02_j03/
├── i03_j01/
├── i03_j02/
└── i03_j03/
```

These folders represent information positions.

Meaning of `i{}_j{}`:

- `i` is the row-position index
- `j` is the column-position index

They indicate where the target information is placed in the benchmark table. In the benchmark design, information position is systematically controlled across row and column dimensions, corresponding to top, middle, and bottom positions along rows and front, middle, and back positions along columns.
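
The folder-naming conventions above can be decoded mechanically. The following is a minimal sketch, assuming integer scale factors (as in `Srow6_Scol6`) and assuming that position indices 1, 2, 3 map to top/middle/bottom and front/middle/back in that order; the README states the correspondence but not the exact index order. The helper names are hypothetical.

```python
import re

SCALE_RE = re.compile(r"^Srow(\d+)_Scol(\d+)$")
POSITION_RE = re.compile(r"^i(\d+)_j(\d+)$")

# Assumption: indices 1/2/3 map to these labels in this order.
ROW_LABELS = {1: "top", 2: "middle", 3: "bottom"}
COL_LABELS = {1: "front", 2: "middle", 3: "back"}


def parse_scale(name):
    """Parse 'Srow6_Scol6' into (row_scale, col_scale)."""
    match = SCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a scaling folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def parse_position(name):
    """Parse 'i01_j03' into (row_index, col_index)."""
    match = POSITION_RE.match(name)
    if match is None:
        raise ValueError(f"not a position folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def describe_path(path):
    """Describe '<dataset>/<scale>/benches/<position>/<file>'."""
    dataset, scale, _benches, position, filename = path.split("/")
    row_scale, col_scale = parse_scale(scale)
    i, j = parse_position(position)
    return {
        "dataset": dataset,
        "row_scale": row_scale,
        "col_scale": col_scale,
        "row_position": ROW_LABELS.get(i, str(i)),
        "col_position": COL_LABELS.get(j, str(j)),
        "file": filename,
    }


print(describe_path("har/Srow6_Scol6/benches/i01_j01/table.csv"))
```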
## Format-file hierarchy

Inside each `i{}_j{}` folder, the repository stores the same benchmark scenario rendered into 11 different formats, together with scenario metadata:

```text
i{row_pos}_j{col_pos}/
├── meta.json
├── table.csv
├── table.html
├── table.json
├── table.md
├── table.nl.txt
├── table.psv
├── table.sql
├── table.tex
├── table.tsv
├── table.xml
└── table.yaml
```

Meaning of these files:

- `meta.json`: metadata for the specific benchmark scenario
- `table.csv`, `table.tsv`, `table.psv`: delimiter-separated representations
- `table.json`, `table.xml`, `table.yaml`: hierarchical serialization formats
- `table.md`, `table.html`, `table.tex`: markup-oriented formats
- `table.sql`: executable relational representation
- `table.nl.txt`: free-form natural-language rendering

These 11 files correspond to the 11 tabular formats studied in NIAT-Pro.

## Full hierarchy example

The full folder hierarchy can be summarized as follows:

```text
NIAT-Pro/
├── har/
│   ├── Srow6_Scol6/
│   │   ├── benches/
│   │   │   ├── i01_j01/
│   │   │   │   ├── meta.json
│   │   │   │   ├── table.csv
│   │   │   │   ├── table.html
│   │   │   │   ├── table.json
│   │   │   │   ├── table.md
│   │   │   │   ├── table.nl.txt
│   │   │   │   ├── table.psv
│   │   │   │   ├── table.sql
│   │   │   │   ├── table.tex
│   │   │   │   ├── table.tsv
│   │   │   │   ├── table.xml
│   │   │   │   └── table.yaml
│   │   │   ├── i01_j02/
│   │   │   ├── i01_j03/
│   │   │   ├── i02_j01/
│   │   │   ├── i02_j02/
│   │   │   ├── i02_j03/
│   │   │   ├── i03_j01/
│   │   │   ├── i03_j02/
│   │   │   └── i03_j03/
│   │   ├── manifest.json
│   │   └── qas.json
│   ├── Srow{...}_Scol{...}/
│   ├── T_s.csv
│   ├── benchmark_summary.json
│   └── qa_spec.json
├── secom/
├── wec/
└── README.md
```

## How to interpret one path

For example, the path below:

```text
har/Srow6_Scol6/benches/i01_j01/table.csv
```

can be interpreted as:

- `har`: the HAR source dataset
- `Srow6_Scol6`: row scaling factor 6 and column scaling factor 6
- `benches/i01_j01`: the benchmark scenario where the informative content is placed at information position `(i=1, j=1)`
- `table.csv`: the CSV rendering of that exact scenario

The corresponding `qas.json` in `har/Srow6_Scol6/` provides the unified question and answer set for all `i{}_j{}` scenario folders under that same scaling configuration.

## Benchmark construction

NIAT-Pro is generated through a controlled pipeline that:

1. selects an informative subtable from the original source table
2. expands rows and columns in a controlled manner
3. places informative content at controlled row and column positions
4. renders the same table scenario into multiple tabular formats

This construction is designed to keep target information and question content aligned across settings so that performance differences can be more cleanly attributed to the manipulated factors.

## Controlled factors

### Tabular format

The benchmark includes 11 different tabular formats. Format choice has a substantial impact on LLM performance, and CSV is not always the best-performing representation.

### Table size

Table size is controlled along both row length and column width, yielding structures such as short-and-narrow, short-and-wide, long-and-narrow, and long-and-wide.

### Information position

Target information is placed at controlled positions across both rows and columns, enabling analysis of early, middle, and late positions in the table context.

## Intended uses

NIAT-Pro is intended for:

- benchmarking LLM tabular understanding
- studying long-context reasoning over structured data
- comparing different tabular representations
- evaluating sensitivity to table size and information position
- testing methods such as direct encoding, RAG, code execution, Code-RAG, and few-shot test-time scaling

## Loading notes

Because the repository is organized as nested benchmark artifacts rather than a single flat table, users may prefer loading specific JSON files or writing a small parser over the folder hierarchy.
Example:

```python
from pathlib import Path
import json

root = Path("NIAT-Pro")
dataset = "har"
scale = "Srow6_Scol6"
pos = "i01_j01"

# Questions and answers are shared by all scenarios under one scaling setting.
qas = json.loads((root / dataset / scale / "qas.json").read_text())
# Scenario-specific metadata and the CSV rendering of the table.
meta = json.loads((root / dataset / scale / "benches" / pos / "meta.json").read_text())
table_csv = (root / dataset / scale / "benches" / pos / "table.csv").read_text()

print(meta)
print(qas[0] if isinstance(qas, list) and len(qas) > 0 else qas)
print(table_csv[:500])
```

## License

The current repository lists the dataset license as `openrail`. Please verify the final repository-level license choice for consistency with the included files and redistribution plan.

## Citation

If you use NIAT-Pro, please cite:

```bibtex
@article{yuan2026niatpro,
  title={Needle-In-A-Table-Pro: Tabular Formats Matter When Table Size and Information Position Jointly Shape LLMs' Understanding of Large Tables},
  author={},
  journal={Preprint},
  year={2026}
}
```

## Acknowledgements

NIAT-Pro is built on public datasets from semiconductor manufacturing, wave energy systems, and human activity recognition, and is released to support research on robust long-context tabular understanding.

Provided by:

NIAT-Pro

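Collected summary

Dataset introduction

Construction

In the field of tabular data understanding, the construction of NIAT-Pro reflects a systematic experimental design. The dataset extracts informative subtables from three public tabular sources (semiconductor manufacturing, wave energy converters, and human activity recognition) and generates table variants of different sizes by precisely controlling row and column scaling factors. The core of the construction process is a grid layout of information positions: target content is placed at predefined row and column coordinates, producing benchmark scenarios in which the location of the information is known. Finally, each scenario is rendered into 11 table formats, including CSV, JSON, and XML, which keeps results strictly comparable across formats and provides a finely controlled environment for analyzing the long-context understanding of large language models on structured data.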

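Features

The defining feature of NIAT-Pro is its multidimensional, systematic design of controllable variables. It covers progressively harder tasks, from simple lookup through two-hop to four-hop reasoning, and, more importantly, it combines three key dimensions factorially: table format, table size, and information position. The dataset includes 11 common table serialization formats, ranging from delimiter-separated text to markup languages and executable code, to test how well models adapt to diverse data representations. Rows and columns are scaled independently to produce table shapes from short-and-narrow to long-and-wide, and these shapes are combined with early, middle, and late information positions along both rows and columns, so that researchers can isolate the individual and interaction effects of each factor on LLM table-understanding performance.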
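
To make the factorial design concrete, the sketch below enumerates the table files expected under one scaling setting: 3 row positions × 3 column positions × 11 formats. The file names are taken from the README; the `scenario_files` helper is hypothetical, and the assumption that every scaling setting contains all nine `i{}_j{}` positions is based only on the example tree in the README.

```python
from itertools import product

# The 11 format files listed in the README.
FORMAT_FILES = [
    "table.csv", "table.tsv", "table.psv",
    "table.json", "table.xml", "table.yaml",
    "table.md", "table.html", "table.tex",
    "table.sql", "table.nl.txt",
]

# Assumption: positions form a full 3 x 3 grid, i01..i03 by j01..j03.
POSITIONS = [f"i{i:02d}_j{j:02d}" for i, j in product(range(1, 4), repeat=2)]


def scenario_files(dataset, scale):
    """List every position-by-format table path under one scaling setting."""
    return [
        f"{dataset}/{scale}/benches/{pos}/{name}"
        for pos, name in product(POSITIONS, FORMAT_FILES)
    ]


paths = scenario_files("har", "Srow6_Scol6")
print(len(paths))  # prints 99 (9 positions x 11 formats)
print(paths[0])    # prints har/Srow6_Scol6/benches/i01_j01/table.csv
```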

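Usage

To use NIAT-Pro for model evaluation and research, load the data by following its hierarchical file structure. The dataset is organized into four directory levels: source dataset, scaling configuration, information-position scenario, and format file. A typical workflow first selects the dataset folder for the domain of interest, then the subdirectory for the required row and column scaling factors. In that directory, the unified `qas.json` file provides the questions and answers for all information-position scenarios, while each `i{}_j{}` scenario folder stores the renderings of the same table content in the different formats, together with its metadata. A simple path-parsing script is enough to load the table content of a given scenario along with the corresponding question set, and then to test and compare model performance under controlled variables.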
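
The workflow described above could be sketched as a small walker over the folder hierarchy. This is only an illustration: `iter_scenarios` and `load_scenario` are hypothetical helper names, the layout is assumed to match the README exactly, and nothing is assumed about the internal structure of `qas.json` beyond it being valid JSON.

```python
from pathlib import Path
import json


def iter_scenarios(root, dataset):
    """Yield (scale_name, position_name, scenario_dir) for one dataset."""
    dataset_dir = Path(root) / dataset
    if not dataset_dir.is_dir():
        return
    for scale_dir in sorted(dataset_dir.glob("Srow*_Scol*")):
        benches = scale_dir / "benches"
        if not benches.is_dir():
            continue
        for pos_dir in sorted(benches.glob("i*_j*")):
            if pos_dir.is_dir():
                yield scale_dir.name, pos_dir.name, pos_dir


def load_scenario(scenario_dir, fmt="csv"):
    """Return (table_text, qas) for one scenario folder.

    `fmt` is the suffix after 'table.', e.g. 'csv', 'json', or 'nl.txt'.
    """
    scenario_dir = Path(scenario_dir)
    table_text = (scenario_dir / f"table.{fmt}").read_text(encoding="utf-8")
    # qas.json sits two levels up: <scale>/benches/<position> -> <scale>.
    qas_path = scenario_dir.parent.parent / "qas.json"
    qas = json.loads(qas_path.read_text(encoding="utf-8"))
    return table_text, qas
```

For example, `for scale, pos, path in iter_scenarios("NIAT-Pro", "har")` visits every scenario of the HAR dataset, and `load_scenario(path, "md")` returns the Markdown rendering together with the shared question set.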

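Background and challenges

Background overview

As large language models (LLMs) advance rapidly, the ability to understand and reason over structured data, tabular data in particular, has become a key dimension for measuring a model's cognitive capability. NIAT-Pro (Needle-In-A-Table-Pro) is a benchmark dataset designed to systematically evaluate LLM performance across diverse table formats, sizes, and information positions. It was built by a research team in 2026, and its core research question is how table representation, size scaling, and target information position jointly shape LLMs' semantic parsing of, and multi-hop reasoning over, large tables. By integrating three public datasets, semiconductor manufacturing (SECOM), wave energy converters (WEC), and human activity recognition (HAR), NIAT-Pro provides a rigorous, controlled experimental platform for research on long-context table understanding and moves table-reasoning benchmarks in a more realistic and systematic direction.

Current challenges

The dataset targets a core challenge in table understanding: evaluating whether LLMs can retrieve information precisely and reason in multiple steps over complex, large tables. On the task side, difficulty rises from simple one-hop lookup, to two-hop reasoning that involves identifying an extreme value or navigating relative to a base cell, to four-hop queries whose base position is defined implicitly, progressively testing a model's logical deduction and context integration. On the construction side, the challenge lies in controlling several variables systematically: the same informative content must be embedded precisely in 11 heterogeneous table formats (such as CSV, JSON, and HTML), and questions and answers must stay consistent across row and column scaling factors. In addition, the row and column position of the information in the table (for example top, middle, or bottom) is finely controlled so that sensitivity to different regions of a long context can be analyzed, which requires a construction pipeline with high reproducibility and alignment precision.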
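
The multi-hop tasks can be illustrated on a toy table. The sketch below is one possible reading of the task descriptions, not the benchmark's own generator: the table, the offset convention (rows down, columns right), the choice of the maximum as the extreme value, and the helper names are all assumptions.

```python
# Hypothetical toy table; column names and values are illustrative only.
COLUMNS = ["c1", "c2", "c3"]
ROWS = [
    [5, 10, 15],
    [9, 20, 25],
    [2, 30, 35],
    [7, 40, 45],
]


def argmax_row(rows, col_idx):
    """Index of the row holding the maximum value in the given column."""
    return max(range(len(rows)), key=lambda r: rows[r][col_idx])


def navigate(rows, base_row, base_col, d_row, d_col):
    """Two-hop style navigation: move (d_row, d_col) from a base cell."""
    r, c = base_row + d_row, base_col + d_col
    if not (0 <= r < len(rows) and 0 <= c < len(rows[0])):
        raise IndexError("navigation leaves the table")
    return rows[r][c]


def four_hop(rows, columns, extreme_col, d_row, d_col):
    """Locate the base cell via a column maximum, then navigate from it."""
    col_idx = columns.index(extreme_col)
    base_row = argmax_row(rows, col_idx)
    return navigate(rows, base_row, col_idx, d_row, d_col)


# Max of c1 is 9 (row 1); two rows down and one column right gives 40.
print(four_hop(ROWS, COLUMNS, "c1", 2, 1))  # prints 40
```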

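Common scenarios

Typical use cases

In LLM evaluation, NIAT-Pro is used to systematically test how well models understand and reason over large tables. Its typical use case is to pose tasks ranging from one-hop lookup to four-hop reasoning across multiple table formats (such as CSV, JSON, and HTML) while controlling row and column scaling and information position, giving a comprehensive picture of a model's long-context performance on structured data.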

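Derived and related work

Work associated with NIAT-Pro includes retrieval-augmented generation (RAG) approaches to table understanding, comparisons of encoding strategies for long-context tables, and evaluations of few-shot test-time scaling. These lines of work further advance structured-data reasoning, multi-format table processing, and applications of large language models in fields such as engineering and environmental science.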

Recent research on the dataset