OpenML-CC18

OpenDataLab · Updated 2026-05-17 · Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/OpenML-CC18
Resource description:

We advocate the use of curated, comprehensive benchmark suites of machine learning datasets, backed by standardized OpenML-based interfaces and complementary software toolkits written in Python, Java, and R. We demonstrate how such suites and toolkits make comprehensive benchmarking studies easy to run. The salient features of OpenML benchmark suites are: (i) ease of use through standardized data formats, APIs, and existing client libraries; (ii) machine-readable meta-information about the contents of the suite; and (iii) online sharing of results, enabling large-scale comparisons. As the first such suite, we propose OpenML-CC18, a machine learning benchmark suite of 72 classification datasets carefully curated from the thousands of datasets on OpenML.

The inclusion criteria are:

- Classification tasks on dense datasets with independent observations
- Number of classes >= 2, with at least 20 observations in each class and a minority-to-majority class ratio above 5%
- 500 <= number of observations <= 100,000
- Number of features after one-hot encoding < 5,000
- No artificial datasets
- No subsets of larger datasets and no binarized versions of other datasets
- No datasets that can be predicted perfectly from a single feature or with a simple decision tree
- Source or reference available

If you use this benchmark suite, please cite:

Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. "OpenML Benchmarking Suites." arXiv:1708.03731v2 [stat.ML] (2019).

@article{oml-benchmarking-suites,
  title={OpenML Benchmarking Suites},
  author={Bernd Bischl and Giuseppe Casalicchio and Matthias Feurer and Frank Hutter and Michel Lang and Rafael G. Mantovani and Jan N. van Rijn and Joaquin Vanschoren},
  year={2019},
  journal={arXiv:1708.03731v2 [stat.ML]}
}
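
The quantitative inclusion criteria in the description can be written as a simple check. The sketch below is illustrative only and is not the suite authors' selection code. The function name `meets_cc18_numeric_criteria` and its parameters are hypothetical, and it assumes the number of observations equals the sum of the per-class counts. The qualitative criteria (no artificial data, no subsets or binarized versions, no trivially predictable datasets, source available) need manual review and are not covered.

```python
from typing import Mapping


def meets_cc18_numeric_criteria(
    class_counts: Mapping[str, int],
    n_features_after_onehot: int,
) -> bool:
    """Check only the quantitative OpenML-CC18 inclusion criteria.

    Hypothetical helper: `class_counts` maps each class label to its
    number of observations; `n_features_after_onehot` is the feature
    count once categorical features are one-hot encoded.
    """
    counts = list(class_counts.values())

    # Number of classes >= 2.
    if len(counts) < 2:
        return False

    # Every class has at least 20 observations.
    if min(counts) < 20:
        return False

    # Minority-to-majority class ratio must exceed 5%.
    if min(counts) / max(counts) <= 0.05:
        return False

    # 500 <= number of observations <= 100,000 (assumed to be the
    # sum of the class counts).
    n_observations = sum(counts)
    if not 500 <= n_observations <= 100_000:
        return False

    # Fewer than 5,000 features after one-hot encoding.
    return n_features_after_onehot < 5000
```

For example, a two-class dataset with 300 and 400 observations and 50 encoded features passes, while one with 30 and 900 observations fails the class-ratio criterion.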
Provider:
OpenDataLab
Created:
2022-08-19