inria-soda/STRABLE-benchmark

Name: inria-soda/STRABLE-benchmark
Creator: inria-soda
Published: 2026-03-04 17:27:41
License: No description available

Hugging Face, updated 2026-03-04, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/inria-soda/STRABLE-benchmark


Resource description:

---
license: unknown
language:
- en
pretty_name: STRABLE Benchmark
---

# STRABLE: Benchmarking Tabular Machine Learning with Strings

This dataset card describes the STRABLE benchmark, a comprehensive suite designed for evaluating machine learning models on tabular data containing strings.

## Dataset Details

### Dataset Description

Benchmarking tabular data has revealed the benefit of dedicated architectures, pushing the state of the art. However, real-world tables often contain string entries beyond pure numbers, a setting that has been understudied due to a lack of a solid benchmarking suite. STRABLE is a comprehensive benchmarking corpus of 108 tables containing both strings and numbers. These datasets are carefully curated learning problems across diverse application fields to enable the empirical study of tabular learning with strings.

### Dataset Sources

- **Repository:** https://github.com/soda-inria/strable
- **Paper:** (Coming Soon)
- **Project Page:** https://soda-inria.github.io/strable

## Uses

### Direct Use

This dataset is intended to be used by researchers and practitioners evaluating tabular machine learning pipelines. It allows users to answer critical research questions regarding string representations in tables: whether dedicated end-to-end learners are needed, or if modular architectures (combining string encoders with tabular learners) are sufficient. Datasets cover binary classification, multi-class classification, and regression tasks.

### Out-of-Scope Use

This dataset consists of tables with string entries that can be found "in the wild" rather than long-form free text or document-level data. The data extracts represent a static cross-sectional snapshot, making it unsuitable for evaluating time-series or temporal dynamics.

## Dataset Structure

The corpus comprises 108 distinct tables organized into individual dataset configurations. Datasets are organized into folders, containing a `config.json` file and a `data.parquet` file.
Because the target variable names vary across the 108 datasets, you must extract the correct target dynamically from the associated `config.json` file. Here is how to load a specific dataset and run a baseline pipeline:

```python
import json

from datasets import load_dataset
from huggingface_hub import hf_hub_download
from skrub import TableVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

repo_id = "inria-soda/STRABLE-benchmark"
dataset_name = "aijob_ai-ml-ds-salaries"

# 1. Load the dataset using data_dir
dataset = load_dataset(repo_id, data_dir=dataset_name)
df = dataset["train"].to_pandas()

# 2. Download and read the specific config.json to get the target column
config_path = hf_hub_download(
    repo_id=repo_id,
    repo_type="dataset",
    filename=f"{dataset_name}/config.json",
)
with open(config_path, "r") as f:
    config = json.load(f)

# Adjust the key "target_name" if your json uses a different key name (e.g., "target_column")
target_col = config["target_name"]

# 3. Separate features and target
X = df.drop(columns=[target_col])
y = df[target_col]

# 4. Run pipeline (Ridge and the R2 score assume a regression target)
pipeline = make_pipeline(TableVectorizer(), Ridge())
scores = cross_val_score(pipeline, X, y, cv=3, scoring="r2")
print(f"Mean R2 Score: {scores.mean():.3f}")
```

## Dataset Creation

Data was aggregated from 33 distinct sources spanning 8 application fields: Commerce, Economy, Education, Energy, Food, Health, Infrastructure, and Social.

## Data Collection and Processing

We applied minimal pre-processing to structure the data into a tabular format amenable to machine learning, prioritizing real-world heterogeneity.

- We flattened nested structures and removed duplicate rows.
- We dropped single-value columns, all-null columns, and rows with missing labels.
- We removed features that act as a trivial function of the target to prevent data leakage.
- We did not impute missing entries; they are preserved for the encoder-learner pipelines to handle natively.
- We sub-sampled large tables to a maximum of 75,000 rows to ensure computational feasibility.
- For regression tasks, we applied a skewness-minimization protocol to the target variables.

```bibtex
@unpublished{strable2026,
  title={STRABLE: Benchmarking Tabular Machine Learning with Strings},
  author={Anonymous Authors},
  year={2026}
}
```
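
As a rough illustration of the row and column filters listed under Data Collection and Processing, the sketch below applies them to a pandas DataFrame. It is not the authors' curation code: the function name `minimal_preprocess`, its parameters, and the random seed are hypothetical, and treating a column as "single-value" only when it has one distinct entry with missing values counted as their own entry is an assumption. Flattening of nested structures, leakage removal, and the skewness-minimization step are left out because the card does not specify them in enough detail.

```python
import pandas as pd


def minimal_preprocess(df, target_col, max_rows=75_000, seed=0):
    """Hypothetical sketch of the filters described in the card."""
    # Drop rows whose label is missing.
    df = df.dropna(subset=[target_col])

    # Remove exact duplicate rows, keeping the first occurrence.
    df = df.drop_duplicates()

    # Drop all-null and single-value columns. With dropna=False a missing
    # value counts as one distinct entry, so an all-null column has exactly
    # one distinct entry and is dropped, while a constant column that also
    # has missing entries is kept (assumption).
    n_distinct = df.nunique(dropna=False)
    keep = [c for c in df.columns if c == target_col or n_distinct[c] > 1]
    df = df[keep]

    # Sub-sample large tables down to the row cap stated in the card.
    if len(df) > max_rows:
        df = df.sample(n=max_rows, random_state=seed)

    # Missing feature entries are deliberately left in place, not imputed.
    return df.reset_index(drop=True)


# Toy table: one duplicate row, one missing label, one constant column,
# and one all-null column.
toy = pd.DataFrame(
    {
        "city": ["Paris", "Paris", "Lyon", "Nice"],
        "const": ["x", "x", "x", "x"],
        "empty": [None, None, None, None],
        "salary": [10.0, 10.0, 20.0, None],
    }
)
clean = minimal_preprocess(toy, target_col="salary")
print(clean)
```

On the toy table only the `city` and `salary` columns and two rows survive. The published tables are described as already processed, so a step like this would only matter when preparing new tables in the same spirit.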

Provider:

inria-soda
