data-product-benchmark
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/ibm-research/data-product-benchmark
Resource overview:
### Dataset Description
This dataset provides a benchmark for automatic data product creation. The task is framed as follows: given a natural-language data product request and a corpus of text and tables, identify the tables and text documents that are relevant to the request and should be included in the resulting data product. The benchmark brings together three variants, HybridQA, TAT-QA, and ConvFinQA, each consisting of:
- A corpus of text passages and tables, and
- A set of data product requests along with their corresponding ground-truth tables and text.
This benchmark enables systematic evaluation of approaches for discovering tables and text for automatic creation of data products from data lakes with tables and text.
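Because the task output is a set of tables and text passages, a prediction can be compared with the ground-truth data product using set overlap. The sketch below is one possible way to score a single request; the card does not state which metrics the authors use, so precision, recall, and F1 are an assumption here, and the identifiers (`t1`, `p1`, ...) are hypothetical.

```python
def precision_recall_f1(predicted, ground_truth):
    """Score one data product request by set overlap.

    `predicted` and `ground_truth` are iterables of item identifiers
    (tables and text passages). The metric choice is an assumption,
    not the benchmark's official evaluation protocol.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    hits = len(predicted & ground_truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical IDs: two of three predictions are correct,
# and two of four ground-truth items are found.
p, r, f = precision_recall_f1(["t1", "t2", "p1"], ["t1", "p1", "p2", "p3"])
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.67 R=0.50 F1=0.57
```

Averaging these per-request scores over a split would give one summary number per dataset variant.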
## Dataset Details
<!-- Provide the basic links for the dataset. -->
- **Repository:** https://github.com/ibm/data-product-benchmark
- **Paper:**
### Curation Rationale
Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery and generation is of great industry interest, as it enables discovery in large data lakes and supports analytical Data Product Requests (DPRs).
Currently, there is no benchmark established specifically for data product discovery. Existing datasets focus on answering single factoid questions over individual tables rather than collecting multiple data assets for broader, coherent products.
To address this gap, we introduce DPBench, the first user-request-driven data product benchmark over hybrid table-text corpora.
Our framework systematically repurposes existing table-text QA datasets such as ConvFinQA, TAT-QA, and HybridQA in three steps: clustering related tables and passages into coherent data products, generating professional-level analytical requests that span both data sources, and validating benchmark quality through multi-LLM evaluation.
### Source Datasets
| Dataset | Paper | Links |
|-----------|-------|-------|
| **HybridQA** | [*HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data* (Chen et al., EMNLP Findings 2020)](https://aclanthology.org/2020.findings-emnlp.91/) | [GitHub](https://github.com/wenhuchen/HybridQA) -- [Website](https://hybridqa.github.io/)|
| **TAT-QA** | [*TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance* (Zhu et al., ACL-IJCNLP 2021)](https://aclanthology.org/2021.acl-long.254/) | [GitHub](https://github.com/NExTplusplus/TAT-QA) -- [Website](https://nextplusplus.github.io/TAT-QA/)|
| **ConvFinQA** | [*ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering* (Chen et al., EMNLP 2022)](https://aclanthology.org/2022.emnlp-main.421/) | [GitHub](https://github.com/czyssrs/ConvFinQA) |
## Dataset Structure
```
benchmark_data/
├── ConvFinQA/
│   ├── ConvFinQA_dev.jsonl            # DPRs + ground truth DPs
│   ├── ConvFinQA_test.jsonl
│   ├── ConvFinQA_train.jsonl
│   └── ConvFinQA-corpus/
│       └── ConvFinQA_corpus.jsonl     # text + table corpora
├── HybridQA/
│   ├── HybridQA_dev.jsonl
│   ├── HybridQA_test.jsonl
│   ├── HybridQA_train.jsonl
│   └── HybridQA-corpus/
│       └── HybirdQA_corpus.jsonl
└── TATQA/
    ├── TATQA_dev.jsonl
    ├── TATQA_test.jsonl
    ├── TATQA_train.jsonl
    └── TATQA-corpus/
        └── TATQA_corpus.jsonl
```
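A minimal sketch of reading this layout with the Python standard library follows. It assumes that each `*-corpus/` directory sits inside its dataset directory, as the tree appears to show, and that every file holds one JSON object per line. The helper names `read_jsonl`, `split_path`, and `corpus_path` are hypothetical, and the record fields are not documented in this card, so the code does not assume any. The corpus file is located by pattern instead of by exact name because the tree lists the HybridQA corpus as `HybirdQA_corpus.jsonl`.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_path(root, dataset, split):
    """Build the path of a DPR split file, e.g. TATQA/TATQA_dev.jsonl."""
    return Path(root) / dataset / f"{dataset}_{split}.jsonl"


def corpus_path(root, dataset):
    """Locate the corpus file inside <dataset>/<dataset>-corpus/ (assumed layout)."""
    corpus_dir = Path(root) / dataset / f"{dataset}-corpus"
    matches = sorted(corpus_dir.glob("*_corpus.jsonl"))
    if not matches:
        raise FileNotFoundError(f"no corpus file found in {corpus_dir}")
    return matches[0]
```

With a local copy of the data, `list(read_jsonl(split_path("benchmark_data", "TATQA", "dev")))` would load the TAT-QA dev requests, and `corpus_path("benchmark_data", "TATQA")` would point at the matching corpus.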
#### Data Collection and Processing
We propose repurposing traditional table–text QA datasets to construct new benchmarks for data product discovery.
Rather than focusing on single-table QA, we reinterpret these resources at the table level. By clustering similar QA pairs across multiple tables and their associated passages, we simulate broader data products. We then generate high-level Data Product Requests (DPRs) that abstract away from the low-level questions, while the associated tables and passages serve as ground-truth data products.
This reframing enables us to systematically transform QA datasets into DPR benchmarks, providing a cost-effective, scalable alternative to manual construction.
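The aggregation step described above can be illustrated with a toy sketch: once QA pairs have been assigned to clusters, the tables and passages of each cluster are merged into one ground-truth data product. Everything in it is an assumption made for illustration, including the field names (`id`, `question`, `table_id`, `passage_ids`) and the `cluster_of` mapping, which stands in for whatever clustering method the authors actually used. The generation of the DPR text itself is not shown.

```python
from collections import defaultdict


def build_data_products(qa_pairs, cluster_of):
    """Merge the tables and passages of clustered QA pairs.

    qa_pairs:   list of dicts with hypothetical keys
                'id', 'question', 'table_id', 'passage_ids'.
    cluster_of: mapping from QA pair id to a cluster label,
                assumed to come from a separate clustering step.
    """
    products = defaultdict(
        lambda: {"questions": [], "tables": set(), "passages": set()}
    )
    for qa in qa_pairs:
        product = products[cluster_of[qa["id"]]]
        product["questions"].append(qa["question"])
        product["tables"].add(qa["table_id"])
        product["passages"].update(qa["passage_ids"])
    return dict(products)


# Toy input: q1 and q2 fall in the same cluster, q3 in another.
toy_pairs = [
    {"id": "q1", "question": "Revenue in 2019?", "table_id": "t1", "passage_ids": ["p1"]},
    {"id": "q2", "question": "Revenue growth?", "table_id": "t2", "passage_ids": ["p1", "p2"]},
    {"id": "q3", "question": "Head count?", "table_id": "t3", "passage_ids": []},
]
toy_products = build_data_products(toy_pairs, {"q1": 0, "q2": 0, "q3": 1})
```

In this toy run, cluster `0` collects tables `t1` and `t2` with passages `p1` and `p2`. A high-level request written for that cluster would have these items as its ground truth.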
### Benchmark statistics
| Dataset | Split | # of DPRs | # of Tables | # of Text Passages |
|-----------|-------|-------------|---------------|----------------------|
| **HybridQA** | Train | 4,843 | 12,378 | 41,608 |
| | Dev | 2,008 | ↑ | ↑ |
| | Test | 1,980 | ↑ | ↑ |
| **TAT-QA** | Train | 820 | 2,757 | 4,760 |
| | Dev | 147 | ↑ | ↑ |
| | Test | 176 | ↑ | ↑ |
| **ConvFinQA** | Train | 2,113 | 4,976 | 8,721 |
| | Dev | 373 | ↑ | ↑ |
| | Test | 627 | ↑ | ↑ |

↑ repeats the value from the row above: each dataset has one corpus (see Dataset Structure), which all three splits share.
## Citation
If you find this dataset useful in your research, please cite our paper:
**BibTeX:**

    @article{zhangdp2025,
      title={From Factoid Questions to Data Product Requests: Benchmarking Data Product Discovery over Tables and Text},
      author={Zhang, Liangliang and Mihindukulasooriya, Nandana and D'Souza, Niharika S. and Shirai, Sola and Dash, Sarthak and Ma, Yao and Samulowitz, Horst},
      journal={arXiv preprint},
      year={2025}
    }

Provider: maas
Created: 2025-10-04



