inria-soda/STRABLE-benchmark
Source: Hugging Face. Updated 2026-03-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/inria-soda/STRABLE-benchmark
Resource description:
---
license: unknown
language:
- en
pretty_name: STRABLE Benchmark
---
# STRABLE: Benchmarking Tabular Machine Learning with Strings
This dataset card describes the STRABLE benchmark, a comprehensive suite designed for evaluating machine learning models on tabular data containing strings.
## Dataset Details
### Dataset Description
Benchmarking tabular data has revealed the benefit of dedicated architectures, pushing the state of the art. However, real-world tables often contain string entries beyond pure numbers, a setting that has been understudied due to a lack of a solid benchmarking suite.
STRABLE is a benchmarking corpus of 108 tables containing both strings and numbers. Each table is a carefully curated learning problem, and together they span diverse application fields to enable the empirical study of tabular learning with strings.
### Dataset Sources
- **Repository:** https://github.com/soda-inria/strable
- **Paper:** (Coming Soon)
- **Project Page:** https://soda-inria.github.io/strable
## Uses
### Direct Use
This dataset is intended for researchers and practitioners evaluating tabular machine learning pipelines. It lets users address a central research question about string representations in tables: whether dedicated end-to-end learners are needed, or whether modular architectures (combining string encoders with tabular learners) are sufficient. The datasets cover binary classification, multi-class classification, and regression tasks.
### Out-of-Scope Use
This dataset consists of tables with string entries as found "in the wild"; it does not contain long-form free text or document-level data. Each table is a static cross-sectional snapshot, so the benchmark is unsuitable for evaluating time-series models or temporal dynamics.
## Dataset Structure
The corpus comprises 108 distinct tables, each exposed as an individual dataset configuration. Every dataset lives in its own folder, which contains a `config.json` file and a `data.parquet` file.
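The folder layout can be inspected without downloading any table. The sketch below is not part of the original card: `dataset_folders` and `list_strable_datasets` are hypothetical helper names, and the code assumes each dataset folder holds the two files named above. `list_repo_files` is the `huggingface_hub` function that returns every file path in a repository.

```python
from collections import defaultdict


def dataset_folders(paths):
    """Return the folders that contain both config.json and data.parquet."""
    files_by_folder = defaultdict(set)
    for path in paths:
        folder, sep, filename = path.rpartition("/")
        if sep:  # skip files at the repository root, such as README.md
            files_by_folder[folder].add(filename)
    required = {"config.json", "data.parquet"}
    return sorted(f for f, names in files_by_folder.items() if required <= names)


def list_strable_datasets(repo_id="inria-soda/STRABLE-benchmark"):
    """List dataset folders on the Hub (requires network access)."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return dataset_folders(list_repo_files(repo_id, repo_type="dataset"))
```

Under these assumptions, each name returned by `list_strable_datasets()` could be passed as `data_dir` when loading a table.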
Because the target variable names vary across the 108 datasets, you must extract the correct target dynamically from the associated `config.json` file. Here is how to load a specific dataset and run a baseline pipeline:
```python
import json

from datasets import load_dataset
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

repo_id = "inria-soda/STRABLE-benchmark"
dataset_name = "aijob_ai-ml-ds-salaries"

# 1. Load the dataset using data_dir
dataset = load_dataset(repo_id, data_dir=dataset_name)
df = dataset["train"].to_pandas()

# 2. Download and read the dataset's config.json to get the target column
config_path = hf_hub_download(
    repo_id=repo_id,
    repo_type="dataset",
    filename=f"{dataset_name}/config.json",
)
with open(config_path, "r") as f:
    config = json.load(f)

# Adjust the key "target_name" if your config.json uses a different key (e.g., "target_column")
target_col = config["target_name"]

# 3. Separate features and target
X = df.drop(columns=[target_col])
y = df[target_col]

# 4. Run a baseline regression pipeline (Ridge with R2 scoring assumes a numeric target)
pipeline = make_pipeline(TableVectorizer(), Ridge())
scores = cross_val_score(pipeline, X, y, cv=3, scoring="r2")
print(f"Mean R2 Score: {scores.mean():.3f}")
```
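The example above uses `Ridge` with R2 scoring, which only fits regression targets. Because the corpus also covers binary and multi-class classification, other datasets need a different estimator and metric. The sketch below guesses the task type from the target column. The heuristic, the `max_classes` threshold, and the names `infer_task` and `SCORING` are assumptions made for illustration; if a dataset's `config.json` states the task explicitly, that information should take precedence.

```python
import pandas as pd


def infer_task(y: pd.Series, max_classes: int = 20) -> str:
    """Guess the task type from the target column (heuristic, not from the card)."""
    n_unique = y.nunique(dropna=True)
    if n_unique == 2:
        return "binary"
    # Numeric targets with many distinct values are treated as regression.
    if pd.api.types.is_numeric_dtype(y) and n_unique > max_classes:
        return "regression"
    return "multiclass"


# Hypothetical mapping from task type to a scikit-learn scoring string.
SCORING = {"binary": "roc_auc", "multiclass": "accuracy", "regression": "r2"}
```

A numeric target with few distinct values (for example a 1 to 5 rating) would be classified as multi-class by this rule, so the result is worth checking by hand before swapping `Ridge` for a classifier.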
## Dataset Creation
Data was aggregated from 33 distinct sources spanning 8 application fields: Commerce, Economy, Education, Energy, Food, Health, Infrastructure, and Social.
### Data Collection and Processing
We applied minimal pre-processing to structure the data into a tabular format amenable to machine learning, prioritizing real-world heterogeneity.
- We flattened nested structures and removed duplicate rows.
- We dropped single-value columns, all-null columns, and rows with missing labels.
- We removed features that act as a trivial function of the target to prevent data leakage.
- We did not impute missing entries; they are preserved for the encoder-learner pipelines to handle natively.
- We sub-sampled large tables to a maximum of 75,000 rows to ensure computational feasibility.
- For regression tasks, we applied a skewness-minimization protocol to the target variables.
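
The steps listed above could be approximated in pandas along the following lines. This is an illustrative sketch, not the authors' pipeline: `minimal_preprocess` and `least_skewed_target` are hypothetical names, flattening of nested structures and leakage removal are left out, and a column counts as single-valued only if it has one distinct value when missing entries are counted as a value. The card does not specify the skewness-minimization protocol, so choosing between the identity and a signed `log1p` transform is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd


def minimal_preprocess(df, target_col, max_rows=75_000, seed=0):
    """Approximate the listed cleaning steps on an already-flat DataFrame."""
    df = df.drop_duplicates()
    df = df.dropna(subset=[target_col])        # rows with missing labels
    df = df.dropna(axis="columns", how="all")  # all-null columns
    # Single-value feature columns (missing entries count as a value here).
    features = [c for c in df.columns if c != target_col]
    constant = [c for c in features if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)
    if len(df) > max_rows:                     # sub-sample large tables
        df = df.sample(n=max_rows, random_state=seed)
    return df.reset_index(drop=True)


def least_skewed_target(y):
    """Among a few candidate transforms, pick the one with the smallest |skewness|."""
    candidates = {
        "identity": y,
        "signed_log1p": np.sign(y) * np.log1p(np.abs(y)),
    }
    name = min(candidates, key=lambda k: abs(candidates[k].skew()))
    return name, candidates[name]
```

Missing feature values are deliberately left untouched, in line with the card's statement that entries are not imputed.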
## Citation
```bibtex
@unpublished{strable2026,
title={STRABLE: Benchmarking Tabular Machine Learning with Strings},
author={Anonymous Authors},
year={2026}
}
```
Provider: inria-soda



