CardBench

Name: CardBench
Creator: Google
Published: 2024-08-29 07:25:25
License: not specified

Source: arXiv; updated 2024-08-29; indexed 2024-08-31

Download link:

https://github.com/google-research/google-research/tree/master/CardBench_zero_shot_cardinality_training

Overview:

CardBench is a benchmark released by Google for learned cardinality estimation. It contains 20 diverse real-world databases. The datasets range in size from hundreds of gigabytes to several petabytes and were created by randomly sampling the original datasets. The workload consists of single-table queries and binary join queries, each with multiple filter predicates. Building the benchmark involves computing statistics, generating and executing queries, and producing annotated query graphs. CardBench targets database query optimization: more accurate cardinality estimates lead to better query plans and therefore better query performance.

Provider:

Google

Created:

2024-08-29

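Collected summary

Dataset introduction

Construction

CardBench was built by collecting and analyzing thousands of queries over 20 distinct real-world databases. Besides the queries themselves, the dataset includes statistics for each query, such as row counts, column counts, data types, and numbers of distinct values. The cardinality of each query (the number of rows the query produces) is provided as the label. Queries come in two types: single-table queries and binary join queries. A single-table query filters one table with 1 to 4 filter predicates; a binary join query joins two tables and filters each table with 1 to 3 filter predicates. The dataset is designed to provide diverse queries and data distributions that challenge the robustness of cardinality estimation models.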
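The two query shapes described above can be illustrated with a minimal sketch. This is not CardBench's own query generator: the schema, the operator list, the join keys, and all function names below are hypothetical, chosen only to show the "1 to 4 predicates on one table" and "1 to 3 predicates per joined table" limits.

```python
import random

# Hypothetical toy schema: table name -> {column name: candidate literal values}.
# Not taken from CardBench; for illustration only.
SCHEMA = {
    "movies": {"year": [1990, 2000, 2010], "rating": [3, 4, 5],
               "votes": [10, 100, 1000], "runtime": [90, 120]},
    "reviews": {"score": [1, 2, 3, 4, 5], "helpful": [0, 1, 5],
                "length": [50, 200]},
}
OPERATORS = ["=", "<", ">", "<=", ">="]


def random_predicates(table, max_predicates, rng):
    """Pick between 1 and max_predicates filter predicates on distinct columns."""
    columns = SCHEMA[table]
    count = rng.randint(1, min(max_predicates, len(columns)))
    chosen = rng.sample(sorted(columns), count)
    return [
        f"{table}.{col} {rng.choice(OPERATORS)} {rng.choice(columns[col])}"
        for col in chosen
    ]


def single_table_query(table, rng):
    """Single-table query: one table filtered by 1 to 4 predicates."""
    predicates = random_predicates(table, 4, rng)
    return f"SELECT COUNT(*) FROM {table} WHERE " + " AND ".join(predicates)


def binary_join_query(left, right, left_key, right_key, rng):
    """Binary join query: two tables, each filtered by 1 to 3 predicates."""
    predicates = random_predicates(left, 3, rng) + random_predicates(right, 3, rng)
    return (
        f"SELECT COUNT(*) FROM {left} JOIN {right} "
        f"ON {left}.{left_key} = {right}.{right_key} WHERE "
        + " AND ".join(predicates)
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(single_table_query("movies", rng))
    print(binary_join_query("movies", "reviews", "id", "movie_id", rng))
```

Executing each generated `COUNT(*)` query against the database would then yield the true cardinality that serves as the training label.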

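Features

CardBench is characterized by its diversity, its complexity, and its broad coverage of data distributions. It comprises 20 distinct databases covering data that ranges from GitHub repositories to movie ratings. The dataset is intended to give researchers a comprehensive environment for testing and comparing cardinality estimation methods. It provides detailed statistics, such as table row counts, column counts, data types, and numbers of distinct values, together with a cardinality label for each query. It also ships a query generator and flexible infrastructure that can be used to generate more complex training queries.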
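As a rough illustration of the kind of per-column statistics listed above (row count, data type, number of distinct values), here is a minimal sketch over in-memory rows. The function name and the returned field names are assumptions for this example; CardBench's own statistics scripts are not reproduced here.

```python
def column_statistics(rows, column):
    """Summarize one column of a table given as a list of dicts.

    The field names in the result are illustrative, not CardBench's format.
    """
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # Report the Python type of the first non-null value as the data type.
        "data_type": type(non_null[0]).__name__ if non_null else "unknown",
    }


if __name__ == "__main__":
    table = [{"year": 1999}, {"year": 2005}, {"year": 1999}, {"year": None}]
    print(column_statistics(table, "year"))
```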

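Usage

CardBench is used to train and test cardinality estimation models. Researchers can train models on the provided queries and statistics and use the cardinality labels to evaluate model accuracy. The query generator and supporting infrastructure can also be used to produce more complex training queries. Accuracy is measured with q-error, the factor by which the predicted cardinality deviates from the true cardinality. With CardBench, researchers can systematically evaluate and compare different kinds of cardinality estimation methods and advance research on this important problem.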
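For reference, q-error is commonly defined as max(estimate / truth, truth / estimate), so a perfect estimate scores 1 and over- and underestimates are penalized symmetrically. The sketch below follows that common definition; clamping both values to at least 1 to avoid division by zero is an assumption of this example, not something stated in the source.

```python
def q_error(estimated, actual):
    """Return max(est / act, act / est), with both values clamped to >= 1.

    The clamp is a convention assumed here so that empty results
    (cardinality 0) do not cause a division by zero.
    """
    est = max(float(estimated), 1.0)
    act = max(float(actual), 1.0)
    return max(est / act, act / est)


if __name__ == "__main__":
    print(q_error(100, 10))  # overestimate by a factor of 10
    print(q_error(10, 100))  # underestimate by a factor of 10
    print(q_error(50, 50))   # exact estimate
```

Because the metric is a ratio, a model that predicts 10 rows when the truth is 100 is scored the same as one that predicts 1,000.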

Background and challenges

Background overview

Cardinality estimation (CE) is a critical component in optimizing query performance in relational databases. Traditionally, CE techniques have been based on heuristics and simple analytical models, which often make assumptions about data uniformity and the independence of columns in tables. These methods have well-known limitations that can lead to suboptimal query execution plans. To address these limitations, learned CE models have been proposed, which use machine learning techniques to improve accuracy. However, these models have not been widely adopted due to their high training overheads. To facilitate research and development in this area, the CardBench benchmark was created. CardBench is a dataset that contains thousands of queries over 20 distinct real-world databases, and it is designed to enable systematic evaluation and training of learned CE models. It includes scripts to compute data summary statistics and generate queries, as well as two training datasets with true cardinalities. CardBench was created by a team at Google Inc. and released in 2024, with the aim of fostering research on the important problem of CE and improving on recent directions such as pre-trained CE models.
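To make the independence assumption mentioned above concrete, the following sketch compares an estimate that multiplies per-column selectivities with the true row count on a small table whose columns are correlated. The table contents and function names are invented for illustration.

```python
def selectivity(rows, column, value):
    """Fraction of rows where column equals value."""
    return sum(1 for r in rows if r[column] == value) / len(rows)


def independence_estimate(rows, predicates):
    """Classic heuristic: multiply per-column selectivities, scale by row count."""
    estimate = float(len(rows))
    for column, value in predicates.items():
        estimate *= selectivity(rows, column, value)
    return estimate


def true_cardinality(rows, predicates):
    """Count rows that satisfy every equality predicate."""
    return sum(
        1 for r in rows if all(r[c] == v for c, v in predicates.items())
    )


# Invented example data: "model" determines "make", so the columns are correlated.
CARS = (
    [{"make": "Honda", "model": "Civic"}] * 4
    + [{"make": "Toyota", "model": "Corolla"}] * 4
    + [{"make": "Honda", "model": "Accord"}] * 2
)

if __name__ == "__main__":
    query = {"make": "Honda", "model": "Civic"}
    print(independence_estimate(CARS, query))  # about 2.4 rows
    print(true_cardinality(CARS, query))       # 4 rows
```

Here the heuristic underestimates because every Civic is a Honda. Learned models aim to capture this kind of correlation from data.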

Current challenges

The challenges associated with CardBench start with the difficulty of cardinality estimation itself, which is crucial for optimizing query performance in databases. The dataset addresses the lack of a systematic benchmark for learned CE models, which was not previously available. Building it requires generating a diverse and complex set of queries that can stress CE models and test their robustness across various real-world domains. The dataset also aims to facilitate the development of pre-trained CE models that can be used on unseen datasets without extensive retraining. Achieving this requires models that generalize well to new data and that can be fine-tuned effectively while maintaining high accuracy. Creating such a large-scale benchmark is also computationally expensive, because collecting true cardinalities means running thousands of queries. Despite these challenges, CardBench provides a valuable resource for the database and machine learning communities to advance the state of the art in cardinality estimation.

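Common scenarios

Typical use cases

CardBench is a key benchmark for evaluating and training learned cardinality estimation models. It contains thousands of queries from 20 distinct databases, covering a wide range of data distributions and query complexity. It is intended to help researchers systematically measure the progress of new learned approaches and to encourage their development. Typical use cases include training and testing cardinality estimation models based on graph neural networks (GNNs) and transformers, and studying how these models perform in different settings: instance-based, zero-shot, and fine-tuned.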
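One way to read the settings named above is as different train/test splits over the databases: zero-shot holds out an entire database, and fine-tuning then spends a small budget of labeled queries from that unseen database. The sketch below assumes each sample is a dict with a `database` field; that layout and both function names are hypothetical, not CardBench's file format.

```python
def zero_shot_split(samples, held_out_database):
    """Train on every database except one; test on the unseen database."""
    train = [s for s in samples if s["database"] != held_out_database]
    test = [s for s in samples if s["database"] == held_out_database]
    return train, test


def fine_tune_split(samples, held_out_database, budget):
    """Use `budget` queries from the unseen database for fine-tuning,
    and keep the rest of that database for evaluation."""
    _, unseen = zero_shot_split(samples, held_out_database)
    return unseen[:budget], unseen[budget:]


if __name__ == "__main__":
    # Hypothetical sample layout: one dict per labeled query.
    samples = [
        {"database": "a", "query": "q1", "cardinality": 10},
        {"database": "a", "query": "q2", "cardinality": 25},
        {"database": "b", "query": "q3", "cardinality": 7},
        {"database": "b", "query": "q4", "cardinality": 300},
    ]
    train, test = zero_shot_split(samples, "b")
    print(len(train), len(test))
```

An instance-based model, by contrast, would be trained and tested on queries from the same database.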

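Academic problems addressed

CardBench addresses both the limitations of traditional cardinality estimation techniques and the barriers to adopting learned models. Traditional methods typically rely on heuristics and simple analytical models, while learned models have shown higher accuracy. However, the high training overhead of learned models limits their use in practice. By providing large-scale training datasets, CardBench lowers the barrier for researchers to develop and train learned cardinality estimation models and fosters research on pre-trained models. It also helps explore how well learned models generalize in the zero-shot setting and how effective it is to fine-tune pre-trained models on limited data.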

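Derived related work

The release of CardBench has supported a range of related research. Using its training datasets, researchers have explored learned cardinality estimation models, including GNN-based and transformer-based models, and evaluated their performance in different settings. CardBench has also encouraged the use of pre-trained models for cardinality estimation and the study of zero-shot learning and fine-tuning. It provides an important data foundation and experimental platform for research on cardinality estimation.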

The content above was collected and summarized by 遇见数据集.
