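Dataset Overview: Human Centric Tables Question Answering (HCTQA)
Basic Information
- Language: English (en)
- License: MIT
- Tags: tables, benchmark, question answering, large language models, document understanding, multimodal
- Dataset name: Human Centric Tables Question Answering (HCTQA)
- Size: 10K < n < 100K
- Task category: question answering
- Task IDs: document question answering, visual question answering
- Annotation creators: expert-generated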
Dataset Configuration
- Config name: default
- Data files:
  - Train: train.parquet
  - Validation: val.parquet
  - Test: test.parquet
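
The split files listed above can be read directly as Parquet. The following is a minimal sketch, assuming the three files sit together in a local directory and that pandas with a Parquet engine such as pyarrow is installed. The `data_dir` argument and both helper names are hypothetical and not part of the dataset.

```python
from pathlib import Path

# Split name -> file name, as listed in the dataset configuration.
SPLIT_FILES = {
    "train": "train.parquet",
    "validation": "val.parquet",
    "test": "test.parquet",
}


def split_path(data_dir, split):
    """Return the path of the Parquet file for a split (hypothetical helper)."""
    if split not in SPLIT_FILES:
        raise KeyError(f"unknown split {split!r}; expected one of {sorted(SPLIT_FILES)}")
    return Path(data_dir) / SPLIT_FILES[split]


def load_split(data_dir, split):
    """Load one split as a pandas DataFrame."""
    # Imported lazily so split_path() works even without pandas installed.
    import pandas as pd

    return pd.read_parquet(split_path(data_dir, split))
```

A call such as `load_split("path/to/hctqa", "train")` would then return the training split, where `path/to/hctqa` stands in for wherever the files were downloaded.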
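Dataset Description
HCTQA is a benchmark dataset for evaluating the question-answering performance of large language models on complex real-world and synthetic tables. It contains both real-world and synthetic tables, along with their associated images, CSV files, and structured metadata. The questions span several levels of complexity and require models to handle complex structural reasoning, numerical aggregation, and context-dependent understanding.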
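Dataset Contents
- Real-world tables: 2,188, with 9,835 human-annotated question-answer pairs
- Synthetic tables: 4,679, with 67,500 programmatically generated question-answer pairs
- Dataset type field: identifies whether a sample comes from a real-world source (realWorldHCTs) or from synthetic data (syntheticHCTs)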
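
Because every sample carries a `dataset_type` value, the real-world and synthetic subsets can be separated with a simple group-by. A minimal sketch, assuming each sample is available as a plain dict; the `split_by_type` helper and the toy records are illustrative only.

```python
from collections import defaultdict

# The two values named in this card.
KNOWN_TYPES = ("realWorldHCTs", "syntheticHCTs")


def split_by_type(records):
    """Group sample dicts by their dataset_type value."""
    groups = defaultdict(list)
    for record in records:
        kind = record["dataset_type"]
        if kind not in KNOWN_TYPES:
            raise ValueError(f"unexpected dataset_type: {kind!r}")
        groups[kind].append(record)
    return dict(groups)


# Toy records for illustration; not taken from the dataset.
toy = [
    {"question_id": "q1", "dataset_type": "realWorldHCTs"},
    {"question_id": "q2", "dataset_type": "syntheticHCTs"},
    {"question_id": "q3", "dataset_type": "syntheticHCTs"},
]
by_type = split_by_type(toy)
print({kind: len(rows) for kind, rows in by_type.items()})
```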
Dataset Structure
Features
- table_id: string
- table_csv_path: string
- table_image_url: string
- table_image_local_path: string
- table_csv_format: string
- table_properties: string
- question_id: string
- question: string
- question_template: string
- question_properties: string
- answer: string
- prompt: string
- prompt_without_system: string
- dataset_type: string
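
All fourteen features are strings, so a record can be checked against the schema before use. A minimal sketch; the `validate_record` helper is hypothetical and only mirrors the field list above.

```python
# Field names copied from the feature list; every feature is a string.
FEATURES = (
    "table_id",
    "table_csv_path",
    "table_image_url",
    "table_image_local_path",
    "table_csv_format",
    "table_properties",
    "question_id",
    "question",
    "question_template",
    "question_properties",
    "answer",
    "prompt",
    "prompt_without_system",
    "dataset_type",
)


def validate_record(record):
    """Return a list of schema problems; an empty list means the record is valid."""
    problems = []
    for name in FEATURES:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(
                f"{name} should be a string, got {type(record[name]).__name__}"
            )
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a malformed row at once.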
Dataset Splits
| Config | Split | Number of Examples (placeholder) |
| --- | --- | --- |
| RealWorld | Train | 7,500 |
| RealWorld | Test | 2,335 |
| Synthetic | Train | 55,000 |
| Synthetic | Test | 12,500 |
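
The split sizes can be cross-checked against the question-answer totals stated earlier in this card (9,835 real-world and 67,500 synthetic pairs). The following quick check uses only figures that appear in this document; `config_totals` is a hypothetical helper.

```python
# Split sizes from the table above (marked as placeholders in the source).
SPLITS = {
    ("RealWorld", "Train"): 7_500,
    ("RealWorld", "Test"): 2_335,
    ("Synthetic", "Train"): 55_000,
    ("Synthetic", "Test"): 12_500,
}

# Question-answer pair totals stated earlier in this card.
TOTALS = {"RealWorld": 9_835, "Synthetic": 67_500}


def config_totals(splits):
    """Sum the split sizes for each configuration."""
    totals = {}
    for (config, _split), count in splits.items():
        totals[config] = totals.get(config, 0) + count
    return totals


summed = config_totals(SPLITS)
# The per-config sums should match the stated totals.
assert summed == TOTALS

# Show each split's size and its share of its configuration.
for (config, split), count in SPLITS.items():
    print(f"{config:9s} {split:5s} {count:>6,d} {count / summed[config]:.1%}")
```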
Sample Entry Structure
```json
{
  "table_id": "arxiv--1--1118",
  "table_info": {
    "table_csv_path": "../tables/csvs/arxiv--1--1118.csv",
    "table_image_url": "https://hcsdtables.qcri.org/datasets/all_images/arxiv_1_1118.jpg",
    "table_image_local_path": "../tables/images/arxiv--1--1118.jpg",
    "table_properties": {
      "Standard Relational Table": true,
      "Row Nesting": false,
      "Column Aggregation": false,
      ...
    },
    "table_formats": {
      "csv": ",0,1,2\n0,Domain,Average Text Length,Aspects Identified\n1,Journalism,50,44\n..."
    }
  },
  "questions": [
    {
      "question_id": "arxiv--1--1118--M0",
      "question": "Report the Domain and the Average Text Length where the Aspects Identified equals 72",
      "gt": "{Psychology | 86} || {Linguistics | 90}",
      "question_properties": {
        "Row Filter": true,
        "Aggregation": false,
        "Returned Columns": true,
        ...
      }
    }
    ...
  ]
}
```
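
The `gt` string in the sample appears to encode a multi-row answer: rows are separated by `||`, each row is wrapped in braces, and values within a row are separated by `|`. This layout is inferred from the single example above rather than from a specification, so the parser below is a sketch under that assumption. `parse_gt` is a hypothetical helper.

```python
def parse_gt(gt):
    """Split a ground-truth string into rows of cell values.

    Assumes the '{a | b} || {c | d}' layout seen in the sample entry.
    Values that themselves contain '|' or braces are not handled.
    """
    rows = []
    for chunk in gt.split("||"):
        chunk = chunk.strip()
        # Drop the surrounding braces when present.
        if chunk.startswith("{") and chunk.endswith("}"):
            chunk = chunk[1:-1]
        rows.append([cell.strip() for cell in chunk.split("|")])
    return rows


print(parse_gt("{Psychology | 86} || {Linguistics | 90}"))
# [['Psychology', '86'], ['Linguistics', '90']]
```

All values are kept as strings. Whether a cell such as `86` should be compared numerically is left to the caller.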
Table Properties
| Property Name |
| --- |
| Standard Relational Table |
| Multi Level Column |
| Balanced Multi Level Column |
| Symmetric Multi Level Column |
| Unbalanced Multi Level Column |
| Asymmetric Multi Level Column |
| Column Aggregation |
| Global Column Aggregation |
| Local Column-Group Aggregation |
| Explicit Column Aggregation Terms |
| Implicit Column Aggregation Terms |
| Row Nesting |
| Balanced Row Nesting |
| Symmetric Row Nesting |
| Unbalanced Row Nesting |
| Asymmetric Row Nesting |
| Row Aggregation |
| Global Row Aggregation |
| Local Row-Group Aggregation |
| Explicit Row Aggregation Terms |
| Implicit Row Aggregation Terms |
| Split Header Cell |
| Row Group Label |
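
Since `table_properties` is stored as a string, the structural flags need to be decoded before tables can be filtered on them. A minimal sketch, assuming the string holds a JSON object that maps property names to booleans, as the nested sample entry suggests. This card does not confirm the serialization. `has_properties` and the toy rows are hypothetical.

```python
import json


def has_properties(record, required):
    """True if every name in `required` is flagged true for the record's table."""
    # Assumption: table_properties is a JSON-encoded dict of booleans.
    flags = json.loads(record["table_properties"])
    return all(flags.get(name, False) is True for name in required)


# Toy rows for illustration; not taken from the dataset.
rows = [
    {"table_id": "t1", "table_properties": json.dumps({"Row Nesting": True, "Column Aggregation": True})},
    {"table_id": "t2", "table_properties": json.dumps({"Row Nesting": True, "Column Aggregation": False})},
    {"table_id": "t3", "table_properties": json.dumps({"Standard Relational Table": True})},
]

wanted = ["Row Nesting", "Column Aggregation"]
matching = [row["table_id"] for row in rows if has_properties(row, wanted)]
print(matching)  # ['t1']
```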
Question Properties
| Property Name |
| --- |
| Row Filter |
| Row Filter Condition Type Lookup |
| Row Filter Condition Type Expression |
| Row Filter Involved Columns Single |
| Row Filter Involved Columns Multiple |
| Row Filter Max Depth Of Involved Columns |
| Row Filter Retained Rows Single |
| Row Filter Retained Rows Multiple |
| Row Filter Num Of Conditions |
| Returned Columns |
| Returned Columns Project On Plain |
| Returned Columns Project On Expression |
| Returned Columns Max Depth |
| Returned Columns Expression In Table Present |
| Returned Columns Expression In Table Not Present |
| Returned Columns Num Of Output Columns |
| Yes/No |
| Aggregation |
| Aggregation Type Sum |
| Aggregation Type Avg |
| Aggregation Grouping Global |
| Aggregation Grouping Local |
| Rank |
| Rank Type |
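
These flags make it possible to break model accuracy down by question type. The sketch below assumes predictions are scored by exact string match against the reference answer, which is a simplification chosen for illustration and not the benchmark's official metric. It also assumes `question_properties` has already been decoded into a dict. Some properties appear to be numeric rather than boolean (for example `Row Filter Num Of Conditions`), so only values that are literally `True` are counted. `accuracy_by_property` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_property(examples):
    """Map each property flagged True to the exact-match accuracy of its questions.

    Each example is a dict with 'properties' (dict), 'answer' and 'prediction'.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for ex in examples:
        correct = ex["prediction"].strip() == ex["answer"].strip()
        for name, value in ex["properties"].items():
            if value is True:
                counts[name] += 1
                hits[name] += int(correct)
    return {name: hits[name] / counts[name] for name in counts}


# Toy examples for illustration; not taken from the dataset.
examples = [
    {"properties": {"Row Filter": True, "Aggregation": False}, "answer": "42", "prediction": "42"},
    {"properties": {"Row Filter": True, "Aggregation": True}, "answer": "7", "prediction": "8"},
    {"properties": {"Aggregation": True, "Rank": True}, "answer": "A", "prediction": " A "},
]
scores = accuracy_by_property(examples)
print(scores)
```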