Data lakes for clustering
Source: Mendeley Data. Updated: 2023-01-27. Indexed: 2024-06-26.
Download link:
https://data.mendeley.com/datasets/kd9rr3vcr6
Description:
This dataset contains the online materials that accompany the article "RÓMULO: A Clustering Proposal in the Context of Data Lakes", by Patricia Jiménez, Juan C. Roldán, and Rafael Corchuelo, which was submitted for evaluation to Big Data Research.
The materials are organised into the following folders:
- "data-lakes": each subfolder corresponds to a data lake, and each CSV file inside a data lake corresponds to a dataset. The last column of every dataset is called "clazz", but it is set to "0" in all cases. A few of the original datasets had a class, but it was removed so that neither RóMULO nor its competitors can use it, since they are all unsupervised proposals.
- "results": it provides the results of testing RóMULO and other competitors on the previous data lakes. The results consist of several "*-results.csv" files that provide effectiveness and efficiency results for each proposal used in the experimentation.
- "system": it provides the Python code required to run and test RóMULO. There is a "launch.cmd" script that launches the experimentation.
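The folder layout above suggests a simple way to read one data lake into memory. The following is a minimal sketch, not part of the distributed code: the function name load_data_lake is hypothetical, and it assumes that each CSV file is comma-separated and starts with a header row that names the "clazz" column.

```python
import csv
from pathlib import Path


def load_data_lake(lake_dir):
    """Map each CSV file name (without extension) to its rows, minus "clazz".

    Assumption: files are comma-separated and have a header row.
    """
    datasets = {}
    for csv_path in sorted(Path(lake_dir).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            rows = list(csv.DictReader(handle))
        for row in rows:
            # "clazz" is always "0" in these files, so it carries no information.
            row.pop("clazz", None)
        datasets[csv_path.stem] = rows
    return datasets
```

Values are kept as strings, so any numeric conversion is left to the caller. A call such as load_data_lake("data-lakes/<subfolder>") would return one entry per dataset in that data lake, where <subfolder> stands for whichever data lake is of interest.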
COMPETITORS
-------------------
The implementations of AffinityPropagation, MeanShift, and OPTICS-XI are available in scikit-learn. The implementation of GSPPCA is available from the authors at https://github.com/pamattei/GSPPCA. The implementation of PQC is available from the authors at https://github.com/racaes/PQC.
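As an illustration of how the three scikit-learn competitors can be invoked, here is a minimal sketch. The function name run_competitors is hypothetical, and the hyperparameters shown (random_state=0, min_samples=5, default bandwidth estimation) are assumptions chosen for the example, not the settings used in the article. OPTICS-XI is taken here to mean scikit-learn's OPTICS with cluster_method="xi".

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, MeanShift, OPTICS


def run_competitors(X):
    """Fit three scikit-learn clusterers on X and return their label arrays.

    Hyperparameters are illustrative assumptions, not the article's settings.
    """
    models = {
        "AffinityPropagation": AffinityPropagation(random_state=0),
        "MeanShift": MeanShift(),  # bandwidth is estimated from the data
        "OPTICS-XI": OPTICS(cluster_method="xi", min_samples=5),
    }
    # Each estimator returns one integer label per row; OPTICS uses -1 for noise.
    return {name: model.fit_predict(X) for name, model in models.items()}
```

The function expects a numeric array with one row per record, for instance the feature columns of a single dataset after removing "clazz" and converting the values to floats.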
Created:
2021-01-13



