Open Government Data Corpus for Table Search

Mendeley Data | Updated 2024-05-10 | Indexed 2024-06-27

Download link:

https://zenodo.org/records/7908079


Description:

Increasing amounts of structured data can provide value for research and business if the relevant data can be located. Often the data sits in a data lake without a consistent schema, which makes locating useful data challenging. Table search is a growing research area, but existing benchmarks have been limited to displayed tables. Tables sized and formatted for display in a Wikipedia page or ArXiv paper differ considerably from data tables in both scale and style. By using metadata associated with open data from government portals, we create the first dataset to benchmark search over data tables at scale. We demonstrate three styles of table-to-table related-table search. The three notions of table relatedness are: tables produced by the same organization, tables distributed as part of the same dataset, and tables with a high degree of overlap in the annotated tags. The keyword tags provided with the metadata also permit the automatic creation of a benchmark for keyword search over tables. We provide baselines on this dataset using existing methods, including traditional and neural approaches.

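The three relatedness notions and the tag-derived keyword benchmark described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TableMeta` record and its field names are hypothetical and do not reflect the corpus schema, Jaccard similarity is one possible reading of "a high degree of overlap", and the 0.5 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableMeta:
    """Hypothetical metadata record; field names are illustrative only."""
    table_id: str
    organization: str
    dataset_id: str
    tags: frozenset = field(default_factory=frozenset)


def tag_jaccard(a: TableMeta, b: TableMeta) -> float:
    """Jaccard similarity of two tag sets (assumed overlap measure)."""
    union = a.tags | b.tags
    if not union:
        return 0.0
    return len(a.tags & b.tags) / len(union)


def relatedness(a: TableMeta, b: TableMeta, tag_threshold: float = 0.5) -> dict:
    """Evaluate the three relatedness notions for a pair of tables.

    tag_threshold is an assumed cutoff, not a value from the dataset.
    """
    return {
        "same_organization": a.organization == b.organization,
        "same_dataset": a.dataset_id == b.dataset_id,
        "tag_overlap": tag_jaccard(a, b) >= tag_threshold,
    }


def keyword_qrels(tables) -> dict:
    """Map each tag (used as a keyword query) to the ids of tables carrying it."""
    qrels = {}
    for table in tables:
        for tag in table.tags:
            qrels.setdefault(tag, set()).add(table.table_id)
    return qrels
```

For example, two tables published by the same organization in different datasets, sharing two of four distinct tags, would count as related under the organization and tag-overlap notions but not under the dataset notion.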

Created:

2023-06-28
