TabLib Sample Version
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/bytedance/plm4ndv
Description:
This dataset is a subset of TabLib containing a variety of relational tables, intended for evaluating methods that estimate the number of distinct values (NDV) in a column. It consists of 77 Parquet files, filtered to keep columns with useful semantics. The column data types are diverse, dominated by large integers, strings, and double-precision floats. The subset is 69 GB, about 0.1% of the full TabLib dataset (69 TB).
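To make the NDV estimation task concrete, here is a minimal Python sketch that contrasts the exact distinct count of a column with an estimate computed from a random sample. The estimator shown is GEE (Charikar et al., 2000), used here only as a classical sampling baseline; it is not taken from this dataset or from the linked repository. The function names and the synthetic column are illustrative assumptions.

```python
import math
import random
from collections import Counter


def exact_ndv(column):
    """Ground truth: the number of distinct values in the full column."""
    return len(set(column))


def gee_estimate(sample, population_size):
    """GEE estimate of NDV from a uniform random sample of a column.

    f_j is the number of distinct values that occur exactly j times in
    the sample. Values seen once are scaled by sqrt(N / n); values seen
    more than once are counted as they are.
    """
    n = len(sample)
    if n == 0:
        return 0.0
    # Map "occurrence count j" -> "how many distinct values have that count".
    freq_of_freq = Counter(Counter(sample).values())
    f1 = freq_of_freq.get(1, 0)
    seen_more_than_once = sum(c for j, c in freq_of_freq.items() if j >= 2)
    return math.sqrt(population_size / n) * f1 + seen_more_than_once


if __name__ == "__main__":
    # Synthetic stand-in for one table column (assumption: not real TabLib data).
    rng = random.Random(0)
    column = [rng.randrange(1000) for _ in range(100_000)]
    sample = rng.sample(column, 1_000)
    print("exact NDV:", exact_ndv(column))
    print("GEE estimate:", gee_estimate(sample, len(column)))
```

An evaluation on this dataset would compare such an estimate against the exact count for each column, typically reading only a sample of the rows.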
Provider:
TabLib



