five

Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

收藏
NIAID Data Ecosystem2026-05-01 收录
下载链接:
https://zenodo.org/record/8098908
下载链接
链接失效反馈
官方服务:
资源简介:
Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport. The file ground_truth.csv is a comma separated file containing approximate functional dependencies. table describes the relation we refer to, lhs and rhs reference two columns of those relations where semantically we found that lhs implies rhs. The file excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small. Dataset References adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security. dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2. hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part the dataset collection to that paper. t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part the dataset collection to that paper. tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part the dataset collection to that paper.
创建时间:
2023-07-01
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作