polinaeterna/tabular-benchmark
Source: Hugging Face. Updated 2023-09-28; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/polinaeterna/tabular-benchmark
---
annotations_creators: []
license: []
pretty_name: tabular_benchmark
tags: []
task_categories:
- tabular-classification
- tabular-regression
configs:
- config_name: clf_cat_covertype
  data_files: clf_cat/covertype.csv
- config_name: clf_num_Higgs
  data_files: clf_num/Higgs.csv
---
# Tabular Benchmark
## Dataset Description
This dataset is a curated collection of datasets from [OpenML](https://www.openml.org/), assembled to benchmark the performance of various machine learning algorithms.
- **Repository:** https://github.com/LeoGrin/tabular-benchmark/community
- **Paper:** https://hal.archives-ouvertes.fr/hal-03723551v2/document
### Dataset Summary
The benchmark is a curated set of tabular learning tasks, including:
- Regression from Numerical and Categorical Features
- Regression from Numerical Features
- Classification from Numerical and Categorical Features
- Classification from Numerical Features
### Supported Tasks and Leaderboards
- `tabular-regression`
- `tabular-classification`
## Dataset Structure
### Data Splits
This dataset consists of four splits (folders), organized by task and by the datasets included in each task.
- reg_num: Task identifier for regression on numerical features.
- reg_cat: Task identifier for regression on numerical and categorical features.
- clf_num: Task identifier for classification on numerical features.
- clf_cat: Task identifier for classification on numerical and categorical features.
To load a specific dataset, pass `task_name/dataset_name` to the `data_files` argument of `load_dataset`, as shown below:
```python
from datasets import load_dataset
dataset = load_dataset("inria-soda/tabular-benchmark", data_files="reg_cat/house_sales.csv")
```
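The `task_name/dataset_name` convention can be wrapped in a small helper that rejects unknown task folders before anything is downloaded. This is a minimal sketch: the names `VALID_TASKS`, `data_file_path`, and `load_benchmark_table` are hypothetical and not part of the dataset or of the `datasets` library, and the `.csv` suffix is assumed from the file names shown in this card.

```python
# Task folders listed under "Data Splits".
VALID_TASKS = ("reg_num", "reg_cat", "clf_num", "clf_cat")


def data_file_path(task: str, dataset_name: str) -> str:
    """Build the `task_name/dataset_name.csv` string passed to `data_files`."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {VALID_TASKS}")
    return f"{task}/{dataset_name}.csv"


def load_benchmark_table(task: str, dataset_name: str,
                         repo: str = "inria-soda/tabular-benchmark"):
    """Load one benchmark table; the default repo is the one used in this card."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo, data_files=data_file_path(task, dataset_name))
```

For example, `data_file_path("reg_cat", "house_sales")` yields the same `data_files` value as the snippet above.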
## Dataset Creation
### Curation Rationale
This dataset is curated to benchmark the performance of tree-based models against neural networks. The paper describes the selection criteria as follows:
- **Heterogeneous columns**. Columns should correspond to features of different nature. This excludes images or signal datasets where each column corresponds to the same signal on different sensors.
- **Not high dimensional**. We only keep datasets with a d/n ratio below 1/10.
- **Undocumented datasets**. We remove datasets where too little information is available. We did keep datasets with hidden column names if it was clear that the features were heterogeneous.
- **I.I.D. data**. We remove stream-like datasets or time series.
- **Real-world data**. We remove artificial datasets but keep some simulated datasets. The difference is subtle, but we try to keep simulated datasets if learning these datasets is of practical importance (like the Higgs dataset), and not just a toy example to test specific model capabilities.
- **Not too small**. We remove datasets with too few features (< 4) and too few samples (< 3 000). For benchmarks on numerical features only, we remove categorical features before checking if enough features and samples are remaining.
- **Not too easy**. We remove datasets which are too easy. Specifically, we remove a dataset if a simple model (max of a single tree and a regression, logistic or OLS) reaches a score whose relative difference with the score of both a default Resnet (from Gorishniy et al. [2021]) and a default HistGradientBoosting model (from scikit-learn) is below 5%. Other benchmarks use different metrics to remove too easy datasets, like removing datasets perfectly separated by a single decision classifier [Bischl et al., 2021], but this ignores varying Bayes rate across datasets. As tree ensembles are superior to simple trees and logistic regression [Fernández-Delgado et al., 2014], a close score for the simple and powerful models suggests that we are already close to the best achievable score.
- **Not deterministic**. We remove datasets where the target is a deterministic function of the data. This mostly means removing datasets on games like poker and chess. Indeed, we believe that these datasets are very different from most real-world tabular datasets, and should be studied separately.
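The quantitative criteria above (d/n ratio, minimum size, and the 5% "too easy" rule) can be sketched as simple checks. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, a dataset is treated as too small if either threshold is missed, and "relative difference" is assumed to mean `(strong - simple) / |strong|` for positive, higher-is-better scores.

```python
def passes_size_filters(n_samples: int, n_features: int) -> bool:
    """Check the 'not too small' and 'not high dimensional' criteria."""
    # Assumed reading: failing either minimum removes the dataset.
    if n_features < 4 or n_samples < 3000:
        return False
    # Keep only datasets with a d/n ratio strictly below 1/10.
    return n_features / n_samples < 1 / 10


def is_too_easy(simple_score: float, resnet_score: float,
                hgbt_score: float, threshold: float = 0.05) -> bool:
    """Check the 'not too easy' rule against both stronger models."""

    def relative_gap(strong_score: float) -> float:
        # Assumed definition of relative difference; scores must be non-zero.
        return (strong_score - simple_score) / abs(strong_score)

    # Too easy only if the simple model is within the threshold of BOTH models.
    return (relative_gap(resnet_score) < threshold
            and relative_gap(hgbt_score) < threshold)
```

Here `simple_score` stands for the better of the single tree and the linear model, as described in the criterion. The remaining criteria (heterogeneous columns, documentation, i.i.d. data, real-world origin, non-determinism) require manual judgment and are not captured by this sketch.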
### Source Data
**Numerical Classification**
|dataset_name|n_samples|n_features|original_link|new_link|
|---|---|---|---|---|
|electricity|38474.0|7.0|https://www.openml.org/d/151|https://www.openml.org/d/44120|
|covertype|566602.0|10.0|https://www.openml.org/d/293|https://www.openml.org/d/44121|
|pol|10082.0|26.0|https://www.openml.org/d/722|https://www.openml.org/d/44122|
|house_16H|13488.0|16.0|https://www.openml.org/d/821|https://www.openml.org/d/44123|
|MagicTelescope|13376.0|10.0|https://www.openml.org/d/1120|https://www.openml.org/d/44125|
|bank-marketing|10578.0|7.0|https://www.openml.org/d/1461|https://www.openml.org/d/44126|
|Bioresponse|3434.0|419.0|https://www.openml.org/d/4134|https://www.openml.org/d/45019|
|MiniBooNE|72998.0|50.0|https://www.openml.org/d/41150|https://www.openml.org/d/44128|
|default-of-credit-card-clients|13272.0|20.0|https://www.openml.org/d/42477|https://www.openml.org/d/45020|
|Higgs|940160.0|24.0|https://www.openml.org/d/42769|https://www.openml.org/d/44129|
|eye_movements|7608.0|20.0|https://www.openml.org/d/1044|https://www.openml.org/d/44130|
|Diabetes130US|71090.0|7.0|https://www.openml.org/d/4541|https://www.openml.org/d/45022|
|jannis|57580.0|54.0|https://www.openml.org/d/41168|https://www.openml.org/d/45021|
|heloc|10000.0|22.0|https://www.kaggle.com/datasets/averkiyoliabev/home-equity-line-of-creditheloc?select=heloc_dataset_v1+%281%29.csv|https://www.openml.org/d/45026|
|credit|16714.0|10.0|https://www.kaggle.com/c/GiveMeSomeCredit/data?select=cs-training.csv|https://www.openml.org/d/44089|
|california|20634.0|8.0|https://www.dcc.fc.up.pt/ltorgo/Regression/cal_housing.html|https://www.openml.org/d/45028|
**Categorical Classification**
|dataset_name|n_samples|n_features|original_link|new_link|
|---|---|---|---|---|
|electricity|38474.0|8.0|https://www.openml.org/d/151|https://www.openml.org/d/44156|
|eye_movements|7608.0|23.0|https://www.openml.org/d/1044|https://www.openml.org/d/44157|
|covertype|423680.0|54.0|https://www.openml.org/d/1596|https://www.openml.org/d/44159|
|albert|58252.0|31.0|https://www.openml.org/d/41147|https://www.openml.org/d/45035|
|compas-two-years|4966.0|11.0|https://www.openml.org/d/42192|https://www.openml.org/d/45039|
|default-of-credit-card-clients|13272.0|21.0|https://www.openml.org/d/42477|https://www.openml.org/d/45036|
|road-safety|111762.0|32.0|https://www.openml.org/d/42803|https://www.openml.org/d/45038|
**Numerical Regression**
|dataset_name|n_samples|n_features|original_link|new_link|
|---|---|---|---|---|
|cpu_act|8192.0|21.0|https://www.openml.org/d/197|https://www.openml.org/d/44132|
|pol|15000.0|26.0|https://www.openml.org/d/201|https://www.openml.org/d/44133|
|elevators|16599.0|16.0|https://www.openml.org/d/216|https://www.openml.org/d/44134|
|wine_quality|6497.0|11.0|https://www.openml.org/d/287|https://www.openml.org/d/44136|
|Ailerons|13750.0|33.0|https://www.openml.org/d/296|https://www.openml.org/d/44137|
|yprop_4_1|8885.0|42.0|https://www.openml.org/d/416|https://www.openml.org/d/45032|
|houses|20640.0|8.0|https://www.openml.org/d/537|https://www.openml.org/d/44138|
|house_16H|22784.0|16.0|https://www.openml.org/d/574|https://www.openml.org/d/44139|
|delays_zurich_transport|5465575.0|9.0|https://www.openml.org/d/40753|https://www.openml.org/d/45034|
|diamonds|53940.0|6.0|https://www.openml.org/d/42225|https://www.openml.org/d/44140|
|Brazilian_houses|10692.0|8.0|https://www.openml.org/d/42688|https://www.openml.org/d/44141|
|Bike_Sharing_Demand|17379.0|6.0|https://www.openml.org/d/42712|https://www.openml.org/d/44142|
|nyc-taxi-green-dec-2016|581835.0|9.0|https://www.openml.org/d/42729|https://www.openml.org/d/44143|
|house_sales|21613.0|15.0|https://www.openml.org/d/42731|https://www.openml.org/d/44144|
|sulfur|10081.0|6.0|https://www.openml.org/d/23515|https://www.openml.org/d/44145|
|medical_charges|163065.0|5.0|https://www.openml.org/d/42720|https://www.openml.org/d/44146|
|MiamiHousing2016|13932.0|14.0|https://www.openml.org/d/43093|https://www.openml.org/d/44147|
|superconduct|21263.0|79.0|https://www.openml.org/d/43174|https://www.openml.org/d/44148|
**Categorical Regression**
|dataset_name|n_samples|n_features|original_link|new_link|
|---|---|---|---|---|
|topo_2_1|8885.0|255.0|https://www.openml.org/d/422|https://www.openml.org/d/45041|
|analcatdata_supreme|4052.0|7.0|https://www.openml.org/d/504|https://www.openml.org/d/44055|
|visualizing_soil|8641.0|4.0|https://www.openml.org/d/688|https://www.openml.org/d/44056|
|delays_zurich_transport|5465575.0|12.0|https://www.openml.org/d/40753|https://www.openml.org/d/45045|
|diamonds|53940.0|9.0|https://www.openml.org/d/42225|https://www.openml.org/d/44059|
|Allstate_Claims_Severity|188318.0|124.0|https://www.openml.org/d/42571|https://www.openml.org/d/45046|
|Mercedes_Benz_Greener_Manufacturing|4209.0|359.0|https://www.openml.org/d/42570|https://www.openml.org/d/44061|
|Brazilian_houses|10692.0|11.0|https://www.openml.org/d/42688|https://www.openml.org/d/44062|
|Bike_Sharing_Demand|17379.0|11.0|https://www.openml.org/d/42712|https://www.openml.org/d/44063|
|Airlines_DepDelay_1M|1000000.0|5.0|https://www.openml.org/d/42721|https://www.openml.org/d/45047|
|nyc-taxi-green-dec-2016|581835.0|16.0|https://www.openml.org/d/42729|https://www.openml.org/d/44065|
|abalone|4177.0|8.0|https://www.openml.org/d/42726|https://www.openml.org/d/45042|
|house_sales|21613.0|17.0|https://www.openml.org/d/42731|https://www.openml.org/d/44066|
|seattlecrime6|52031.0|4.0|https://www.openml.org/d/42496|https://www.openml.org/d/45043|
|medical_charges|163065.0|5.0|https://www.openml.org/d/42720|https://www.openml.org/d/45048|
|particulate-matter-ukair-2017|394299.0|6.0|https://www.openml.org/d/42207|https://www.openml.org/d/44068|
|SGEMM_GPU_kernel_performance|241600.0|9.0|https://www.openml.org/d/43144|https://www.openml.org/d/44069|
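The source tables above share one five-column layout, so a row can be turned into a typed record when the metadata is needed programmatically. The following is a minimal sketch; `BenchmarkRow` and `parse_row` are hypothetical names introduced here, and the parser assumes that no cell contains a `|` character.

```python
from typing import NamedTuple


class BenchmarkRow(NamedTuple):
    dataset_name: str
    n_samples: int
    n_features: int
    original_link: str
    new_link: str


def parse_row(line: str) -> BenchmarkRow:
    """Parse one pipe-delimited table row into a typed record."""
    # Drop the outer pipes, split on the inner ones, and trim spaces or quotes.
    cells = [c.strip().strip('"') for c in line.strip().strip("|").split("|")]
    if len(cells) != 5:
        raise ValueError(f"expected 5 cells, got {len(cells)}")
    name, n_samples, n_features, original_link, new_link = cells
    # Counts appear as strings such as "21613.0"; go through float before int.
    return BenchmarkRow(name, int(float(n_samples)), int(float(n_features)),
                        original_link, new_link)
```

A parsed row exposes `n_samples` and `n_features` as integers, which makes it straightforward to compute quantities such as the d/n ratio.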
### Dataset Curators
Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux.
### Licensing Information
[More Information Needed]
### Citation Information
Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux. Why do tree-based models still outperform deep
learning on typical tabular data? NeurIPS 2022 Datasets and Benchmarks Track, Nov 2022, New
Orleans, United States. ⟨hal-03723551v2⟩
Provider:
polinaeterna
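## Collected Summary
### Dataset Introduction
#### Construction
Tabular benchmarks are essential for evaluating model performance in machine learning. The tabular-benchmark dataset is built by selecting diverse datasets from the OpenML platform according to strict criteria intended to keep the benchmark both sound and practically useful. The curators exclude high-dimensional, poorly documented, and non-i.i.d. data, and retain real-world datasets with heterogeneous columns. Tasks that are too easy or deterministic are also removed, so that the benchmark stays challenging and supports a fair comparison of tree-based models and neural networks on tabular data.
#### Usage
The tabular-benchmark dataset can be loaded with the Hugging Face `datasets` library. Build a path from the task type and the dataset name and pass it to the `load_dataset` function, for example by setting `data_files` to `'reg_cat/house_sales.csv'`. This lets users select individual tasks (such as numerical classification or categorical regression) for model training and evaluation, and run benchmarks one table at a time.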
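### Background and Challenges
#### Background
Tabular data is a major carrier of structured information and is widespread in real-world settings such as finance, healthcare, and the social sciences. The tabular-benchmark dataset, created in 2022 by Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux, aims to systematically evaluate the performance gap between tree-based models and neural networks on tabular tasks. It is drawn from OpenML, covers classification and regression tasks, and distinguishes numerical from categorical features. Its core research question concerns how well different algorithms generalize, and how efficient they are, on heterogeneous tabular data. It provides a rigorous basis for model comparison and encourages more standardized evaluation of tabular learning methods.
#### Current Challenges
Learning on tabular data faces several challenges. First, the data is highly heterogeneous, with complex interactions and nonlinear relationships between features that traditional models struggle to capture. Second, high-dimensional, sparse categorical features make representation learning harder and can lead to overfitting or wasted computation. Building the dataset requires strict selection of data that reflects real-world conditions: high-dimensional, too-easy, or deterministic datasets are excluded, and the remaining data must be i.i.d. and of adequate size. This process involves substantial manual review and balancing to keep the benchmark fair and representative.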
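### Common Scenarios
#### Typical Use Case
Processing and analyzing tabular data, the main carrier of structured information, is a long-standing research topic in machine learning. By selecting and consolidating diverse tabular tasks from OpenML, tabular-benchmark provides a standardized platform for evaluating algorithms. Its most typical use is the systematic comparison of tree-based models and neural networks on tabular data, across regression and classification tasks with numerical and categorical features, such as Higgs particle identification and house price prediction. These comparisons give an empirical basis for model selection and tuning.
#### Academic Problems Addressed
Tabular learning has long faced a trade-off between generalization and efficiency. Through strict selection criteria, such as excluding high-dimensional, deterministic, or too-easy datasets, tabular-benchmark focuses on real-world, i.i.d., and challenging tasks, which reduces the selection bias found in earlier benchmarks. It gives researchers a reliable, heterogeneous evaluation setting, has fueled the discussion of the relative strengths of tree-based models and deep learning on tabular data, and supports fairer, more reproducible algorithm comparisons.
#### Practical Applications
Tabular data is widely used in finance, healthcare, transportation, and other domains, but model deployment is often constrained by data heterogeneity and scale. tabular-benchmark includes real-world tasks such as credit card default prediction, medical charge estimation, and transport delay analysis, which makes it a directly usable reference for practitioners. For example, models evaluated on this benchmark can inform risk control at banks or the tuning of urban traffic management systems, linking academic research to industrial use.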
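### Recent Research
#### Latest Research Directions
tabular-benchmark is a curated benchmarking resource for machine learning on tabular data. It gathers diverse real-world tabular tasks from OpenML, covering regression and classification with numerical and categorical features, and was designed to systematically compare tree-based models and neural networks on typical tabular data. Recent work uses the benchmark to analyze the difficulties deep learning faces on tabular data, such as implicit learning of feature interactions and the limits of its inductive biases, and to explore hybrid models that combine tree structures with neural architectures to improve generalization and interpretability. The dataset was released at NeurIPS 2022; the accompanying discussion prompted the community to rethink what makes tabular learning distinctive and encouraged fairer, more rigorous algorithm comparisons.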