Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy
Source: DataCite Commons. Updated: 2023-07-01. Indexed: 2024-08-18.
Download link:
https://figshare.com/articles/dataset/Metatasks_for_Auto-Sklearn_1_-_ROC_AUC_and_Balanced_Accuracy/23613627
Description:
Prediction data of base models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark, for Balanced Accuracy and ROC AUC.

The files of this figshare item include data that was collected for the paper:

Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML. Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos. Second International Conference on Automated Machine Learning, 2023.

The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.

In detail, the data contains the predictions of base models on validation and test data, as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.

The data was collected by code created for the paper, which is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.

The data is intended for, but not limited to, evaluating post hoc ensembling methods for AutoML with assembled.

Details:
The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.

The link resolves to a directory containing the following:
example_metatasks: an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: the Metatasks obtained by running Auto-Sklearn 1 for ROC AUC.
metatasks_bacc.zip: the Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.

The sizes after unzipping each archive in full are:
metatasks_roc_auc.zip: ~450GB
metatasks_bacc.zip: ~330GB
We suggest extracting only the files that are of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.

The metatask .zip files contain 2 subdirectories for Metatasks produced based on TopN or SiloTopN pruning (see paper for details). In each of these subdirectories, 2 files per metatask exist: one .json file with metadata information and one .hdf or .csv file containing the prediction data. The details on how these should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metatasks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
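The suggestion to extract only the files of interest can be sketched with Python's standard zipfile module. This is a minimal sketch under stated assumptions, not part of the assembled framework: the helper names, the member path "TopN/metatask_3.*", and the assumption that a metatask's .json and .hdf/.csv files share the same file stem are all hypothetical and should be checked against the actual archive listing.

```python
import fnmatch
import zipfile
from pathlib import PurePosixPath

DATA_SUFFIXES = {".hdf", ".csv"}


def select_members(names, pattern):
    """Return the file members (not directories) whose path matches a glob pattern."""
    return sorted(
        name for name in names
        if not name.endswith("/") and fnmatch.fnmatchcase(name, pattern)
    )


def pair_metatask_files(names):
    """Group members by path without suffix.

    Assumption (not confirmed by the dataset description): the metadata
    .json and the prediction .hdf/.csv of one metatask share the same stem.
    """
    pairs = {}
    for name in names:
        path = PurePosixPath(name)
        if path.suffix == ".json":
            role = "metadata"
        elif path.suffix in DATA_SUFFIXES:
            role = "data"
        else:
            continue
        pairs.setdefault(str(path.with_suffix("")), {})[role] = name
    return pairs


def extract_selected(zip_path, pattern, out_dir):
    """Extract only the members matching the pattern; return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        wanted = select_members(archive.namelist(), pattern)
        archive.extractall(path=out_dir, members=wanted)
    return wanted


# Hypothetical usage; the member path below is illustrative only:
# extract_selected("metatasks_bacc.zip", "TopN/metatask_3.*", "extracted")
```

Listing the archive with `zipfile.ZipFile(...).namelist()` first is a cheap way to find the real member names before extracting anything, since reading the listing does not decompress the data.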
Provider:
figshare
Created:
2023-07-01



