Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/10600047
Description:
Files composing the YADL data lake, for the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes (Experiment, Analysis & Benchmark Paper)"
We present an in-depth analysis of data discovery for analytics in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three key steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration on both lakes, our study outlines the importance of accurately retrieving join candidates, and the efficiency of simple aggregation methods. We report new insights on the benefits of existing solutions and on their limitations, aiming at guiding future research in this space.
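To make the first two steps concrete, the following is a minimal sketch of retrieval by value overlap followed by a left join that averages one-to-many matches. It is not the paper's implementation: the function names (`containment`, `retrieve`, `merge_with_mean`), the 0.5 threshold, and the toy tables are all assumptions chosen for illustration, and the prediction step is left out.

```python
from statistics import mean


def containment(query_values, candidate_values):
    """Fraction of distinct query values that also appear in the candidate column."""
    query = set(query_values)
    if not query:
        return 0.0
    return len(query & set(candidate_values)) / len(query)


def retrieve(base_keys, lake, threshold=0.5):
    """Return (table_name, column_name, score) for candidate join columns, best first.

    `lake` is a hypothetical in-memory stand-in for a data lake:
    a dict mapping table names to lists of row dicts.
    """
    hits = []
    for table_name, rows in lake.items():
        columns = rows[0].keys() if rows else []
        for col in columns:
            score = containment(base_keys, [r[col] for r in rows])
            if score >= threshold:  # threshold is an illustrative assumption
                hits.append((table_name, col, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)


def merge_with_mean(base, key, candidate, cand_key, value_col):
    """Left-join `candidate` onto `base`, averaging one-to-many matches."""
    groups = {}
    for row in candidate:
        groups.setdefault(row[cand_key], []).append(row[value_col])
    merged = []
    for row in base:
        values = groups.get(row[key])
        # Unmatched base rows keep a missing value, as in a left join.
        merged.append({**row, value_col: mean(values) if values else None})
    return merged


# Toy data (invented for this sketch, not taken from YADL).
base = [{"city": "Paris", "target": 1}, {"city": "Rome", "target": 0}]
lake = {
    "population": [
        {"name": "Paris", "pop": 2.1},
        {"name": "Paris", "pop": 2.3},
        {"name": "Oslo", "pop": 0.7},
    ],
    "unrelated": [{"name": "Lima", "pop": 9.7}],
}

candidates = retrieve([r["city"] for r in base], lake)
table, column, score = candidates[0]
augmented = merge_with_mean(base, "city", lake[table], column, "pop")
```

Here the only column that passes the threshold is `population.name`, and the two `Paris` rows are collapsed into a single averaged value so that the augmented table keeps one row per base row. A downstream model would then be trained on `augmented`.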
Archives provided here follow the notation used for the experiments, which is different from what is reported in the paper. The four YADL versions available here are:
"binary_update" (YADL Binary)
"wordnet_full" (YADL Base)
"wordnet_vldb_10" (YADL 10k)
"wordnet_vldb_50" (YADL 50k)
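Because the archive names differ from the names used in the paper, a small lookup table can help when scripting against the files. This is only a convenience sketch: the dictionary restates the mapping listed above, and the helper name `paper_name` is hypothetical.

```python
# Mapping from archive name (experiment notation) to the name used in the paper.
ARCHIVE_TO_PAPER_NAME = {
    "binary_update": "YADL Binary",
    "wordnet_full": "YADL Base",
    "wordnet_vldb_10": "YADL 10k",
    "wordnet_vldb_50": "YADL 50k",
}


def paper_name(archive_name):
    """Translate an archive name into the data lake name reported in the paper."""
    try:
        return ARCHIVE_TO_PAPER_NAME[archive_name]
    except KeyError:
        known = ", ".join(sorted(ARCHIVE_TO_PAPER_NAME))
        raise KeyError(f"unknown archive {archive_name!r}; expected one of: {known}")
```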
Created:
2024-07-04



