Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy

Name: Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy
Creator: figshare
Published: 2023-07-01 18:00:22
License: No description available

Source: DataCite Commons
Updated: 2023-07-01
Indexed: 2024-08-18

Download link:

https://figshare.com/articles/dataset/Metatasks_for_Auto-Sklearn_1_-_ROC_AUC_and_Balanced_Accuracy/23613627

Description:

Prediction data of base models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark, for Balanced Accuracy and ROC AUC.

The files of this figshare item include data that was collected for the paper: Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML, Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos, Second International Conference on Automated Machine Learning, 2023.

The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.

In detail, the data contains the predictions of base models on validation and test data, as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.

The data was collected by code created for the paper, which is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.

The data is intended for, but not limited to, evaluating post hoc ensembling methods for AutoML with assembled.

Details

The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.

The link resolves to a directory containing the following:

example_metatasks: an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: the Metatasks obtained by running Auto-Sklearn 1 for ROC AUC.
metatasks_bacc.zip: the Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.

The size after unzipping the entire file is:

metatasks_roc_auc.zip: ~450GB
metatasks_bacc.zip: ~330GB

We suggest extracting only the files of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.

Each metatask .zip file contains 2 subdirectories, one for Metatasks produced with TopN pruning and one for Metatasks produced with SiloTopN pruning (see paper for details). Each subdirectory holds 2 files per metatask: one .json file with metadata and one .hdf or .csv file containing the prediction data. The details on how these should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metatasks, we advise looking at the file content and metadata individually, or parsing them with Metatasks first.
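The suggestion to extract only the files of interest can be sketched with Python's standard library. This is a minimal illustration, not the authors' tooling: the function names `pair_metatask_files` and `extract_selected` are hypothetical, and the member paths inside the real archives are an assumption (the description only states that each metatask has one .json file and one .hdf or .csv file inside a pruning-specific subdirectory). For actually loading a Metatask, the assembled framework remains the reference.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions named in the dataset description: metadata (.json) and
# prediction data (.hdf or .csv).
METATASK_SUFFIXES = {".json", ".csv", ".hdf"}


def pair_metatask_files(member_names):
    """Group archive members by their path without the extension.

    Returns a dict of the form {stem: {suffix: member_name}}, so the
    metadata file and the prediction file of one metatask end up together.
    Members with other extensions (and directories) are ignored.
    """
    pairs = defaultdict(dict)
    for name in member_names:
        path = PurePosixPath(name)
        if path.suffix in METATASK_SUFFIXES:
            pairs[str(path.with_suffix(""))][path.suffix] = name
    return dict(pairs)


def extract_selected(zip_path, out_dir, keyword):
    """Extract only the members whose path contains `keyword`.

    The archive is read member by member, so the full archive is never
    unpacked. Returns the list of extracted member names.
    """
    with zipfile.ZipFile(zip_path) as archive:
        wanted = [
            name
            for name in archive.namelist()
            if keyword in name and not name.endswith("/")
        ]
        for name in wanted:
            archive.extract(name, path=out_dir)
    return wanted
```

A possible workflow is to call `zipfile.ZipFile(path).namelist()` first, inspect the result of `pair_metatask_files` to see which metatasks exist, and then pass a distinguishing part of the path (for example a subdirectory or dataset name, as they appear in the listing) to `extract_selected`.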

Provider: figshare
Created: 2023-07-01
