ai-for-data/TabBench

Name: ai-for-data/TabBench
Creator: ai-for-data
Published: 2026-04-11 10:30:54
License: no description provided

Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/ai-for-data/TabBench


Resource description:

---
language:
- en
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-classification
- feature-extraction
- text-retrieval
tags:
- tabular
- embedding
- benchmark
- contrastive-learning
- retrieval
- classification
pretty_name: TabBench - Tabular Embedding Benchmark
---

<div align="center">

# TabBench: Tabular Embedding Benchmark

### A Comprehensive Evaluation Suite for Tabular Embedding Models

[![GitHub](https://img.shields.io/badge/GitHub-TabEmbed-black)](https://github.com/qiangminjie27/TabEmbed)

</div>

---

## Overview

**TabBench** is a comprehensive benchmark designed to evaluate the tabular understanding capability of embedding models. It assesses two critical dimensions of tabular representation: **linear separability** (via classification) and **semantic alignment** (via retrieval). TabBench aggregates diverse datasets from four authoritative repositories and provides a standardized evaluation pipeline.

## Benchmark Statistics

| Category | Count | Samples / Corpus |
|:---|---:|---:|
| **Classification** | | |
| Grinsztajn | 56 datasets | 521,889 |
| OpenML-CC18 | 66 datasets | 249,939 |
| OpenML-CTR23 | 34 datasets | 210,026 |
| UniPredict | 155 datasets | 386,618 |
| *Classification Total* | *311 datasets* | *1,368,472* |
| **Retrieval** | | |
| Corpus | - | 1,394,247 |
| Numeric Queries | 10,000 | - |
| Categorical Queries | 10,000 | - |
| Mixed Queries | 10,000 | - |
| *Retrieval Total* | *30,000 queries* | *1,394,247* |

## Data Format

### Serialization

All tabular rows are serialized into natural language using the template:

```
The {column_name} is {value}. The {column_name} is {value}. ...
```

For example:

```
The age is 25. The occupation is Engineer. The salary is 75000.50. The city is New York.
```

### Classification Task

Each dataset directory contains:

- `train.jsonl` / `test.jsonl`: Each line is a JSON object with the following fields:
  - `text`: Serialized tabular row
  - `label`: Target label (string)
  - `dataset`: Dataset name
  - `benchmark`: Source benchmark name
  - `task_type`: Task type (`clf`)
- `train.csv` / `test.csv`: Original tabular data in CSV format
- `metadata.json`: Dataset metadata including `dataset`, `benchmark`, `sub_benchmark`, `task_type`, `data_type`, `target_column`, `label_values`, `num_labels`, `train_samples`, `test_samples`, `train_label_distribution`, `test_label_distribution`

```json
{"text": "The age is 36. The workclass is Private. The fnlwgt is 172256.0. ...", "label": ">50K", "dataset": "adult", "benchmark": "openml_cc18", "task_type": "clf"}
```

### Retrieval Task

The retrieval directory contains:

- `corpus.jsonl`: Global corpus of serialized rows (~1.4M documents), each with fields `idx`, `text`, `label`, `dataset`, `benchmark`
- `queries.jsonl`: All retrieval queries (30,000 total: 10k numeric + 10k categorical + 10k mixed)

Corpus format:

```json
{"idx": 0, "text": "The V1 is 3.0. The V2 is 559.0. ...", "label": "1.0", "dataset": "albert", "benchmark": "grinsztajn"}
```

Query format:

```json
{
  "task": "retrieval",
  "query_id": "retrieval_numeric_000001",
  "query_text": "find records where Easter is 0",
  "query_type": "numeric",
  "conditions": [{"field": "Easter", "operator": "==", "value": 0.0, "type": "numeric"}],
  "num_conditions": 1,
  "matching_indices": [1384050, 1384051, ...],
  "num_matches": 1822
}
```

## Evaluation Protocol

### Classification (Linear Probing)

1. Extract frozen embeddings for all samples using the target model
2. Train an independent Logistic Regression classifier per dataset (`max_iter=1000`, `random_state=42`)
3. Report **Accuracy** and **Macro-F1** on the test split

### Retrieval (Dense Retrieval)

1. Encode all corpus documents and queries
2. Build a Faiss `IndexFlatIP` index (cosine similarity via L2-normalized vectors)
3. Retrieve top-k documents for each query
4. Report **MRR@10** and **nDCG@10**

### Overall Score

The **Overall** metric is the macro-average of Accuracy, F1, MRR@10, and nDCG@10.

## Leaderboard

| Model | #Params | Overall | Accuracy | F1 | MRR@10 | nDCG@10 |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Jina-Embeddings-v3 | 0.6B | 41.48 | 60.33 | 46.11 | 32.49 | 26.98 |
| Jasper-Token-Compression | 0.6B | 42.75 | 61.25 | 47.69 | 33.56 | 28.50 |
| Qwen3-Embedding-0.6B | 0.6B | 44.92 | 62.81 | 50.32 | 36.00 | 30.56 |
| **TabEmbed-0.6B** | **0.6B** | **65.27** | **67.16** | **56.56** | **71.72** | **65.64** |
| F2LLM-4B | 4B | 48.02 | 64.92 | 52.48 | 40.60 | 34.08 |
| Octen-Embedding-4B | 4B | 48.62 | 65.36 | 53.64 | 40.97 | 34.51 |
| Qwen3-Embedding-4B | 4B | 48.91 | 65.09 | 52.72 | 42.04 | 35.76 |
| **TabEmbed-4B** | **4B** | **70.71** | **69.51** | **59.75** | **79.33** | **74.25** |
| SFR-Embedding-Mistral | 7B | 49.42 | 64.28 | 50.75 | 44.23 | 38.41 |
| Linq-Embed-Mistral | 7B | 50.74 | 66.06 | 53.33 | 44.65 | 38.92 |
| GTE-Qwen2-7B-Instruct | 7B | 51.27 | 64.67 | 51.76 | 47.44 | 41.19 |
| Qwen3-Embedding-8B | 8B | 48.03 | 65.08 | 52.81 | 40.06 | 34.16 |
| **TabEmbed-8B** | **8B** | **71.62** | **69.88** | **60.19** | **80.58** | **75.83** |

## Quick Start

```bash
# Clone the evaluation code
git clone https://github.com/qiangminjie27/TabEmbed.git
cd TabEmbed
pip install -r requirements.txt

# Run evaluation
python src/run_benchmark.py \
    --benchmark_dir /path/to/TabBench \
    --model_name_or_path your-model-name \
    --output_dir results/ \
    --max_seq_length 1024 \
    --batch_size 64
```

## Source Datasets

TabBench is built upon the following high-quality data sources:

- [Grinsztajn et al. (2022)](https://arxiv.org/abs/2207.08815): Tree-based models benchmark
- [OpenML-CC18](https://www.openml.org/s/99): OpenML curated classification benchmark
- [OpenML-CTR23](https://www.openml.org/s/336): OpenML tabular regression benchmark
- [UniPredict](https://arxiv.org/abs/2310.03266): Universal prediction benchmark

Raw evaluation data is sourced from [tabula-8b-eval-suite](https://huggingface.co/datasets/mlfoundations/tabula-8b-eval-suite).

## Citation

If you use TabBench in your research, please cite:

> Paper coming soon. Please check back later for the BibTeX citation.

## License

This benchmark is released under the [MIT License](https://opensource.org/licenses/MIT).

Note: The individual upstream datasets included in this benchmark may have their own respective licenses. Please refer to the original data sources for their specific terms.
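
The serialization template from the Data Format section can be illustrated with a short Python sketch. This is not the benchmark's own preprocessing code: the helper names `serialize_row` and `to_record` are hypothetical, the `income` column name is made up for the example, and dropping the target column from the text is an assumption. Number formatting (for instance `75000.50` versus `75000.5`) is also left to Python's default string conversion here.

```python
import json


def serialize_row(row: dict) -> str:
    """Render a row as 'The {column_name} is {value}.' sentences joined by spaces."""
    return " ".join(f"The {column} is {value}." for column, value in row.items())


def to_record(row: dict, target_column: str, dataset: str, benchmark: str) -> dict:
    """Build one record with the fields listed for train.jsonl / test.jsonl.

    Assumption: the target column is excluded from the serialized text.
    """
    features = {k: v for k, v in row.items() if k != target_column}
    return {
        "text": serialize_row(features),
        "label": str(row[target_column]),  # labels are stored as strings
        "dataset": dataset,
        "benchmark": benchmark,
        "task_type": "clf",
    }


# Hypothetical row; "income" is an illustrative target column name.
row = {"age": 36, "workclass": "Private", "income": ">50K"}
record = to_record(row, target_column="income", dataset="adult", benchmark="openml_cc18")
print(json.dumps(record))
```

Each printed line has the same shape as a line of `train.jsonl`, so a list of such records can be written out one JSON object per line.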
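
The linear-probing steps in the Evaluation Protocol can be sketched with scikit-learn as follows. The classifier settings (`max_iter=1000`, `random_state=42`) and the two reported metrics come from the protocol; everything else is an assumption for illustration, including the `linear_probe` and `fake_embeddings` helpers and the synthetic vectors that stand in for frozen model embeddings. The actual pipeline in the TabEmbed repository may differ in preprocessing and other defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score


def linear_probe(train_emb, train_labels, test_emb, test_labels):
    """Fit a logistic regression on frozen embeddings and score the test split."""
    clf = LogisticRegression(max_iter=1000, random_state=42)
    clf.fit(train_emb, train_labels)
    predictions = clf.predict(test_emb)
    return {
        "accuracy": accuracy_score(test_labels, predictions),
        "macro_f1": f1_score(test_labels, predictions, average="macro"),
    }


def fake_embeddings(rng, n_per_class, dim=8):
    """Synthetic stand-in for model embeddings: two well-separated clusters."""
    high = rng.normal(loc=3.0, size=(n_per_class, dim))
    low = rng.normal(loc=-3.0, size=(n_per_class, dim))
    features = np.vstack([high, low])
    labels = np.array([">50K"] * n_per_class + ["<=50K"] * n_per_class)
    return features, labels


rng = np.random.default_rng(0)
train_x, train_y = fake_embeddings(rng, n_per_class=100)
test_x, test_y = fake_embeddings(rng, n_per_class=50)
scores = linear_probe(train_x, train_y, test_x, test_y)
print(scores)
```

In a real run, the embedding model stays fixed and one such classifier is trained per dataset, so the scores reflect how linearly separable the embeddings already are.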
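
For the retrieval metrics and the Overall score, a minimal pure-Python sketch is given below. It assumes binary relevance (a document is relevant if its `idx` appears in the query's `matching_indices`) and an ideal ranking capped at k documents; the card does not state the exact nDCG convention, so treat this as one common formulation rather than the benchmark's reference implementation. The function names are hypothetical, and the ranked list would normally come from the Faiss index search.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def overall_score(accuracy, f1, mrr, ndcg):
    """Unweighted mean of the four reported metrics."""
    return (accuracy + f1 + mrr + ndcg) / 4


# Hypothetical ranking for one query with two relevant documents.
relevant = {1384050, 1384051}
ranked = [7, 1384051, 12, 1384050, 3, 4, 5, 6, 8, 9]
print(mrr_at_k(ranked, relevant))   # first relevant hit is at rank 2
print(ndcg_at_k(ranked, relevant))

# Leaderboard row for TabEmbed-0.6B: Accuracy, F1, MRR@10, nDCG@10.
print(round(overall_score(67.16, 56.56, 71.72, 65.64), 2))
```

Per-query values are averaged over all queries to obtain the reported MRR@10 and nDCG@10. The last line reproduces the 65.27 Overall value listed for TabEmbed-0.6B, which is consistent with the Overall column being a plain mean of the four metric columns.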


Provider:

ai-for-data
