avewright/tabula-pretraining-corpus
Source: Hugging Face. Updated 2026-03-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/avewright/tabula-pretraining-corpus
Description:
---
license: other
task_categories:
- tabular-classification
- tabular-regression
language:
- en
tags:
- tabular
- synthetic
- real-data
- pretraining
- tabpfn
pretty_name: "Tabula Pretraining Corpus"
configs:
- config_name: datagen_001
  data_files:
  - split: train
    path: datagen_001/*
- config_name: datagen_002
  data_files:
  - split: train
    path: datagen_002/*
- config_name: datagen_003
  data_files:
  - split: train
    path: datagen_003/*
- config_name: datagen_004
  data_files:
  - split: train
    path: datagen_004/*
- config_name: datagen_005
  data_files:
  - split: train
    path: datagen_005/*
- config_name: datagen_006
  data_files:
  - split: train
    path: datagen_006/*
- config_name: datagen_007
  data_files:
  - split: train
    path: datagen_007/*
- config_name: datagen_008
  data_files:
  - split: train
    path: datagen_008/*
- config_name: datagen_009
  data_files:
  - split: train
    path: datagen_009/*
- config_name: datagen_010
  data_files:
  - split: train
    path: datagen_010/*
- config_name: datagen_011
  data_files:
  - split: train
    path: datagen_011/*
- config_name: datagen_012
  data_files:
  - split: train
    path: datagen_012/*
- config_name: datagen_013
  data_files:
  - split: train
    path: datagen_013/*
- config_name: datagen_014
  data_files:
  - split: train
    path: datagen_014/*
- config_name: datagen_078
  data_files:
  - split: train
    path: datagen_078/*
- config_name: datagen_079
  data_files:
  - split: train
    path: datagen_079/*
- config_name: datagen_080
  data_files:
  - split: train
    path: datagen_080/*
- config_name: datagen_081
  data_files:
  - split: train
    path: datagen_081/*
- config_name: datagen_082
  data_files:
  - split: train
    path: datagen_082/*
- config_name: datagen_083
  data_files:
  - split: train
    path: datagen_083/*
- config_name: datagen_084
  data_files:
  - split: train
    path: datagen_084/*
- config_name: datagen_085
  data_files:
  - split: train
    path: datagen_085/*
- config_name: datagen_086
  data_files:
  - split: train
    path: datagen_086/*
- config_name: datagen_087
  data_files:
  - split: train
    path: datagen_087/*
- config_name: datagen_088
  data_files:
  - split: train
    path: datagen_088/*
- config_name: datagen_089
  data_files:
  - split: train
    path: datagen_089/*
- config_name: datagen_090
  data_files:
  - split: train
    path: datagen_090/*
- config_name: datagen_091
  data_files:
  - split: train
    path: datagen_091/*
- config_name: datagen_092
  data_files:
  - split: train
    path: datagen_092/*
- config_name: datagen_093
  data_files:
  - split: train
    path: datagen_093/*
- config_name: datagen_094
  data_files:
  - split: train
    path: datagen_094/*
- config_name: datagen_095
  data_files:
  - split: train
    path: datagen_095/*
- config_name: datagen_096
  data_files:
  - split: train
    path: datagen_096/*
- config_name: datagen_097
  data_files:
  - split: train
    path: datagen_097/*
- config_name: datagen_098
  data_files:
  - split: train
    path: datagen_098/*
- config_name: datagen_099
  data_files:
  - split: train
    path: datagen_099/*
- config_name: datagen_100
  data_files:
  - split: train
    path: datagen_100/*
- config_name: datagen_101
  data_files:
  - split: train
    path: datagen_101/*
- config_name: datagen_102
  data_files:
  - split: train
    path: datagen_102/*
- config_name: datagen_103
  data_files:
  - split: train
    path: datagen_103/*
- config_name: datagen_104
  data_files:
  - split: train
    path: datagen_104/*
- config_name: datagen_105
  data_files:
  - split: train
    path: datagen_105/*
- config_name: datagen_106
  data_files:
  - split: train
    path: datagen_106/*
- config_name: datagen_107
  data_files:
  - split: train
    path: datagen_107/*
- config_name: datagen_108
  data_files:
  - split: train
    path: datagen_108/*
- config_name: datagen_109
  data_files:
  - split: train
    path: datagen_109/*
- config_name: datagen_110
  data_files:
  - split: train
    path: datagen_110/*
- config_name: datagen_111
  data_files:
  - split: train
    path: datagen_111/*
- config_name: datagen_112
  data_files:
  - split: train
    path: datagen_112/*
- config_name: datagen_113
  data_files:
  - split: train
    path: datagen_113/*
- config_name: datagen_114
  data_files:
  - split: train
    path: datagen_114/*
- config_name: datagen_115
  data_files:
  - split: train
    path: datagen_115/*
- config_name: datagen_116
  data_files:
  - split: train
    path: datagen_116/*
- config_name: datagen_117
  data_files:
  - split: train
    path: datagen_117/*
- config_name: datagen_118
  data_files:
  - split: train
    path: datagen_118/*
- config_name: datagen_119
  data_files:
  - split: train
    path: datagen_119/*
- config_name: datagen_120
  data_files:
  - split: train
    path: datagen_120/*
- config_name: datagen_121
  data_files:
  - split: train
    path: datagen_121/*
- config_name: datagen_122
  data_files:
  - split: train
    path: datagen_122/*
- config_name: datagen_123
  data_files:
  - split: train
    path: datagen_123/*
- config_name: datagen_124
  data_files:
  - split: train
    path: datagen_124/*
- config_name: datagen_125
  data_files:
  - split: train
    path: datagen_125/*
- config_name: datagen_126
  data_files:
  - split: train
    path: datagen_126/*
- config_name: datagen_127
  data_files:
  - split: train
    path: datagen_127/*
- config_name: datagen_129
  data_files:
  - split: train
    path: datagen_129/*
- config_name: datagen_130
  data_files:
  - split: train
    path: datagen_130/*
- config_name: datagen_131
  data_files:
  - split: train
    path: datagen_131/*
- config_name: datagen_132
  data_files:
  - split: train
    path: datagen_132/*
- config_name: datagen_133
  data_files:
  - split: train
    path: datagen_133/*
- config_name: datagen_134
  data_files:
  - split: train
    path: datagen_134/*
- config_name: datagen_135
  data_files:
  - split: train
    path: datagen_135/*
- config_name: datagen_136
  data_files:
  - split: train
    path: datagen_136/*
- config_name: datagen_137
  data_files:
  - split: train
    path: datagen_137/*
- config_name: datagen_138
  data_files:
  - split: train
    path: datagen_138/*
- config_name: datagen_139
  data_files:
  - split: train
    path: datagen_139/*
- config_name: datagen_140
  data_files:
  - split: train
    path: datagen_140/*
- config_name: datagen_143
  data_files:
  - split: train
    path: datagen_143/*
- config_name: datagen_144
  data_files:
  - split: train
    path: datagen_144/*
- config_name: datagen_145
  data_files:
  - split: train
    path: datagen_145/*
- config_name: datagen_146
  data_files:
  - split: train
    path: datagen_146/*
- config_name: datagen_147
  data_files:
  - split: train
    path: datagen_147/*
- config_name: datagen_148
  data_files:
  - split: train
    path: datagen_148/*
- config_name: datagen_149
  data_files:
  - split: train
    path: datagen_149/*
- config_name: datagen_150
  data_files:
  - split: train
    path: datagen_150/*
- config_name: datagen_151
  data_files:
  - split: train
    path: datagen_151/*
- config_name: datagen_152
  data_files:
  - split: train
    path: datagen_152/*
- config_name: datagen_154
  data_files:
  - split: train
    path: datagen_154/*
- config_name: datagen_156
  data_files:
  - split: train
    path: datagen_156/*
- config_name: datagen_158
  data_files:
  - split: train
    path: datagen_158/*
- config_name: datagen_159
  data_files:
  - split: train
    path: datagen_159/*
- config_name: datagen_160
  data_files:
  - split: train
    path: datagen_160/*
- config_name: datagen_163
  data_files:
  - split: train
    path: datagen_163/*
- config_name: datagen_164
  data_files:
  - split: train
    path: datagen_164/*
- config_name: datagen_165
  data_files:
  - split: train
    path: datagen_165/*
- config_name: datagen_166
  data_files:
  - split: train
    path: datagen_166/*
- config_name: datagen_167
  data_files:
  - split: train
    path: datagen_167/*
- config_name: datagen_168
  data_files:
  - split: train
    path: datagen_168/*
- config_name: datagen_169
  data_files:
  - split: train
    path: datagen_169/*
- config_name: datagen_170
  data_files:
  - split: train
    path: datagen_170/*
- config_name: datagen_172
  data_files:
  - split: train
    path: datagen_172/*
- config_name: datagen_173
  data_files:
  - split: train
    path: datagen_173/*
- config_name: datagen_174
  data_files:
  - split: train
    path: datagen_174/*
- config_name: datagen_176
  data_files:
  - split: train
    path: datagen_176/*
---
# Tabula Pretraining Corpus
A continuously growing tabular pretraining corpus for the Tabula foundation model,
which uses TabPFN-style in-context learning. The corpus is built by an autonomous
agent that alternates between harvesting permissively licensed real datasets and
generating high-quality synthetic ones.
## Usage
```python
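from datasets import load_dataset
from huggingface_hub import HfApi

REPO_ID = "avewright/tabula-pretraining-corpus"

# Load a specific batch config
ds = load_dataset(REPO_ID, name="datagen_001")

# List available configs by checking the top-level repo folders
api = HfApi()
files = api.list_repo_files(REPO_ID, repo_type="dataset")
configs = sorted({f.split("/")[0] for f in files if "/" in f and f.startswith("datagen_")})
print(configs)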
```
Each config represents a batch with its own column schema (different domains have
different feature names). Load configs individually rather than all at once.
## Stats (auto-updated)
| Metric | Value |
|--------|-------|
| Total rows | 7,014,523 |
| Real-data batches | 14 |
| Synthetic batches | 94 |
| Last updated | 2026-03-13 18:17 UTC |
## Schema
Every row has its feature columns plus a `_source_meta` column, a JSON string with the following keys:
- `batch_id`, `source_type`, `source_id`, `domain`, `task_type`, `license`, `citation_key`
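
Because `_source_meta` is stored as a JSON string, it has to be decoded before its keys can be used. The following is a minimal sketch, assuming a row is available as a plain dict. The feature names, the metadata values, and the `split_row` helper are invented for illustration and are not taken from the corpus.

```python
import json

# Hypothetical row: feature columns plus the `_source_meta` JSON string.
# All values here are illustrative placeholders, not real corpus data.
row = {
    "feature_a": 0.42,
    "feature_b": "red",
    "_source_meta": json.dumps({
        "batch_id": "datagen_001",
        "source_type": "synthetic",
        "source_id": "synthetic:TreePrior",
        "domain": "example-domain",
        "task_type": "classification",
        "license": "example-license",
        "citation_key": "example_key",
    }),
}


def split_row(row):
    """Return (features, meta): the feature columns and the decoded metadata."""
    meta = json.loads(row["_source_meta"])
    features = {k: v for k, v in row.items() if k != "_source_meta"}
    return features, meta


features, meta = split_row(row)
print(meta["batch_id"], meta["source_type"], sorted(features))
```

Keeping the metadata separate from the features this way avoids accidentally feeding the provenance string to a model as an input column.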
## Sources & Citations
| batch_id | source_type | method | source_id | n_datasets | total_rows | status |
|----------|-------------|--------|-----------|------------|------------|--------|
| datagen_001 | synthetic | TreePrior | synthetic:TreePrior | 4 | 3500 | success |
| datagen_002 | synthetic | TreePrior | synthetic:TreePrior | 6 | 17000 | success |
| datagen_003 | synthetic | SCM | synthetic:SCM | 14 | 47000 | success |
| datagen_004 | synthetic | TreePrior | synthetic:TreePrior | 5 | 17000 | success |
| datagen_005 | synthetic | SCM | synthetic:SCM | 11 | 33000 | success |
| datagen_006 | synthetic | TreePrior | synthetic:TreePrior | 7 | 15500 | success |
| datagen_007 | synthetic | SCM | synthetic:SCM | 12 | 62500 | success |
| datagen_008 | synthetic | TreePrior | synthetic:TreePrior | 5 | 17500 | success |
| datagen_009 | synthetic | TreePrior | synthetic:TreePrior | 5 | 8500 | success |
| datagen_010 | synthetic | TreePrior | synthetic:TreePrior | 8 | 15500 | success |
| datagen_011 | synthetic | SCM | synthetic:SCM | 13 | 56500 | success |
| datagen_012 | synthetic | GaussianMixture | synthetic:GaussianMixture | 10 | 33500 | success |
| datagen_013 | synthetic | Polynomial | synthetic:Polynomial | 14 | 39000 | success |
| datagen_014 | synthetic | Regression | synthetic:Regression | 13 | 61500 | success |
| datagen_078 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 5 | 6500 | success |
| datagen_079 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 12 | 55000 | success |
| datagen_080 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 11 | 47000 | success |
| datagen_081 | synthetic | TreePrior | synthetic:TreePrior | 6 | 23500 | success |
| datagen_082 | synthetic | SCM | synthetic:SCM | 15 | 46000 | success |
| datagen_083 | synthetic | GaussianMixture | synthetic:GaussianMixture | 10 | 41000 | success |
| datagen_084 | synthetic | Polynomial | synthetic:Polynomial | 15 | 95500 | success |
| datagen_085 | synthetic | Regression | synthetic:Regression | 15 | 57500 | success |
| datagen_086 | synthetic | TimeSeries | synthetic:TimeSeries | 12 | 19500 | success |
| datagen_087 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 4 | 4000 | success |
| datagen_088 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 13 | 46500 | success |
| datagen_089 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 12 | 38000 | success |
| datagen_090 | synthetic | TreePrior | synthetic:TreePrior | 4 | 31000 | success |
| datagen_091 | synthetic | SCM | synthetic:SCM | 15 | 63000 | success |
| datagen_092 | synthetic | GaussianMixture | synthetic:GaussianMixture | 8 | 34000 | success |
| datagen_093 | synthetic | Polynomial | synthetic:Polynomial | 10 | 23000 | success |
| datagen_094 | synthetic | Regression | synthetic:Regression | 14 | 74000 | success |
| datagen_095 | synthetic | TimeSeries | synthetic:TimeSeries | 14 | 19500 | success |
| datagen_096 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 8 | 25000 | success |
| datagen_097 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 14 | 43500 | success |
| datagen_098 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 12 | 43000 | success |
| datagen_099 | synthetic | TreePrior | synthetic:TreePrior | 6 | 14500 | success |
| datagen_100 | synthetic | SCM | synthetic:SCM | 12 | 62500 | success |
| datagen_101 | synthetic | GaussianMixture | synthetic:GaussianMixture | 13 | 65500 | success |
| datagen_102 | synthetic | Polynomial | synthetic:Polynomial | 13 | 53000 | success |
| datagen_103 | synthetic | Regression | synthetic:Regression | 15 | 56500 | success |
| datagen_104 | synthetic | TimeSeries | synthetic:TimeSeries | 11 | 16500 | success |
| datagen_105 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 3 | 7500 | success |
| datagen_106 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 14 | 71000 | success |
| datagen_107 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 9 | 34000 | success |
| datagen_108 | synthetic | TreePrior | synthetic:TreePrior | 4 | 12500 | success |
| datagen_109 | synthetic | SCM | synthetic:SCM | 11 | 37000 | success |
| datagen_110 | synthetic | GaussianMixture | synthetic:GaussianMixture | 11 | 56500 | success |
| datagen_111 | real | real-ingest | pmlb:522_pm10\|pmlb:spambase\|pmlb:573_cpu_act\|pmlb: | 5 | 14193 | success |
| datagen_112 | synthetic | Polynomial | synthetic:Polynomial | 14 | 114000 | success |
| datagen_113 | synthetic | Regression | synthetic:Regression | 13 | 170500 | success |
| datagen_114 | real | real-ingest | pmlb:_deprecated_solar_flare_1\|pmlb:_deprecated_pr | 3 | 996 | success |
| datagen_115 | synthetic | TimeSeries | synthetic:TimeSeries | 14 | 22500 | success |
| datagen_116 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 3 | 4500 | success |
| datagen_117 | real | real-ingest | pmlb:biomed\|pmlb:_deprecated_prnn_fglass\|pmlb:spec | 4 | 998 | success |
| datagen_118 | synthetic | TimeSeries | synthetic:TimeSeries | 12 | 14500 | success |
| datagen_119 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 1 | 500 | success |
| datagen_120 | real | real-ingest | pmlb:balance_scale\|pmlb:583_fri_c1_1000_50\|pmlb:so | 5 | 3300 | success |
| datagen_121 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 14 | 198000 | success |
| datagen_122 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 12 | 61500 | success |
| datagen_123 | synthetic | TreePrior | synthetic:TreePrior | 7 | 64500 | success |
| datagen_124 | real | real-ingest | pmlb:GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1\|p | 4 | 21160 | success |
| datagen_125 | synthetic | SCM | synthetic:SCM | 14 | 90500 | success |
| datagen_126 | synthetic | GaussianMixture | synthetic:GaussianMixture | 13 | 127000 | success |
| datagen_127 | synthetic | Polynomial | synthetic:Polynomial | 11 | 70000 | success |
| datagen_129 | real | real-ingest | hf:BrotherTony/employee-burnout-turnover-predictio | 2 | 54180 | success |
| datagen_130 | synthetic | Regression | synthetic:Regression | 14 | 162500 | success |
| datagen_131 | synthetic | TimeSeries | synthetic:TimeSeries | 12 | 19500 | success |
| datagen_132 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 8 | 19000 | success |
| datagen_133 | real | real-ingest | pmlb:628_fri_c3_1000_5\|pmlb:560_bodyfat\|pmlb:appen | 3 | 1358 | success |
| datagen_134 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 10 | 116500 | success |
| datagen_135 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 13 | 183500 | success |
| datagen_136 | synthetic | TreePrior | synthetic:TreePrior | 4 | 111000 | success |
| datagen_137 | real | real-ingest | pmlb:_deprecated_cleveland\|pmlb:GAMETES_Heterogene | 5 | 3128 | success |
| datagen_138 | synthetic | SCM | synthetic:SCM | 12 | 85000 | success |
| datagen_139 | synthetic | GaussianMixture | synthetic:GaussianMixture | 14 | 131000 | success |
| datagen_140 | synthetic | Polynomial | synthetic:Polynomial | 12 | 152500 | success |
| datagen_143 | synthetic | Regression | synthetic:Regression | 15 | 126500 | success |
| datagen_144 | synthetic | TimeSeries | synthetic:TimeSeries | 11 | 19500 | success |
| datagen_145 | real | real-ingest | pmlb:auto_insurance_symboling\|pmlb:breast_cancer\|p | 5 | 2241 | success |
| datagen_146 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 4 | 24000 | success |
| datagen_147 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 10 | 140000 | success |
| datagen_148 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 10 | 168000 | success |
| datagen_149 | real | real-ingest | pmlb:strogatz_predprey2\|pmlb:vehicle\|pmlb:_depreca | 3 | 1936 | success |
| datagen_150 | synthetic | TreePrior | synthetic:TreePrior | 6 | 20500 | success |
| datagen_151 | synthetic | SCM | synthetic:SCM | 13 | 195000 | success |
| datagen_152 | synthetic | GaussianMixture | synthetic:GaussianMixture | 9 | 131000 | success |
| datagen_154 | synthetic | Polynomial | synthetic:Polynomial | 10 | 69000 | success |
| datagen_156 | synthetic | Regression | synthetic:Regression | 15 | 96500 | success |
| datagen_158 | synthetic | TimeSeries | synthetic:TimeSeries | 14 | 19500 | success |
| datagen_159 | real | real-ingest | openml:14\|openml:44 | 2 | 6601 | success |
| datagen_160 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 6 | 116500 | success |
| datagen_163 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 15 | 190000 | success |
| datagen_164 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 9 | 82500 | success |
| datagen_165 | real | real-ingest | pmlb:_deprecated_wdbc\|pmlb:631_fri_c1_500_5\|pmlb:3 | 5 | 53229 | success |
| datagen_166 | synthetic | TreePrior | synthetic:TreePrior | 7 | 43000 | success |
| datagen_167 | synthetic | SCM | synthetic:SCM | 11 | 91500 | success |
| datagen_168 | synthetic | GaussianMixture | synthetic:GaussianMixture | 13 | 223500 | success |
| datagen_169 | real | real-ingest | pmlb:heart_disease_cleveland\|pmlb:1193_BNG_lowbwt\| | 3 | 31657 | success |
| datagen_170 | synthetic | Polynomial | synthetic:Polynomial | 15 | 206500 | success |
| datagen_172 | synthetic | TreePrior | synthetic:TreePrior | 9 | 245000 | success |
| datagen_173 | synthetic | SCM | synthetic:SCM | 13 | 277000 | success |
| datagen_174 | real | real-ingest | openml:48\|openml:9\|openml:29 | 3 | 1046 | success |
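
The table above can be summarized programmatically, for example to total rows per `source_type`. The sketch below is one possible approach, not part of the corpus tooling. It uses four rows taken from the table, and the `parse_row` helper is a hypothetical name. It assumes the first three and last three columns are always present, so that a `source_id` containing `|` separators (escaped or not) can be recovered from whatever sits in between.

```python
from collections import defaultdict

# Four rows taken from the table above (a subset, for illustration).
TABLE_ROWS = """\
| datagen_001 | synthetic | TreePrior | synthetic:TreePrior | 4 | 3500 | success |
| datagen_002 | synthetic | TreePrior | synthetic:TreePrior | 6 | 17000 | success |
| datagen_003 | synthetic | SCM | synthetic:SCM | 14 | 47000 | success |
| datagen_159 | real | real-ingest | openml:14|openml:44 | 2 | 6601 | success |
"""


def parse_row(line):
    """Parse one table row; `source_id` may itself contain '|' separators."""
    cells = [c.strip() for c in line.strip().split("|")[1:-1]]
    # Columns 0-2 and the last three are fixed; the middle cells are source_id.
    source_id = "|".join(cells[3:-3]).replace("\\|", "|")
    return {
        "batch_id": cells[0],
        "source_type": cells[1],
        "method": cells[2],
        "source_id": source_id,
        "n_datasets": int(cells[-3]),
        "total_rows": int(cells[-2]),
        "status": cells[-1],
    }


rows_by_type = defaultdict(int)
for line in TABLE_ROWS.splitlines():
    parsed = parse_row(line)
    rows_by_type[parsed["source_type"]] += parsed["total_rows"]

print(dict(rows_by_type))  # {'synthetic': 67500, 'real': 6601}
```
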
## Key Citations
```bibtex
@inproceedings{hollmann2023tabpfn,
title = {TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second},
author = {Hollmann, Noah and M{\"u}ller, Samuel and Eggensperger, Katharina and Hutter, Frank},
booktitle = {ICLR},
year = {2023}
}
@article{vanschoren2014openml,
title = {OpenML: Networked Science in Machine Learning},
author = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
journal = {ACM SIGKDD Explorations},
year = {2014}
}
@article{Olson2017PMLB,
title = {PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison},
author = {Olson, Randal S. and La Cava, William and Orzechowski, Patryk and Urbanowicz, Ryan J. and Moore, Jason H.},
journal = {BioData Mining},
volume = {10},
number = {36},
year = {2017}
}
@article{scholkopf2021causal,
title = {Toward Causal Representation Learning},
author = {Sch{\"o}lkopf, Bernhard and others},
journal = {Proceedings of the IEEE},
year = {2021}
}
```
## License
Individual rows carry their own `license` field inside `_source_meta`.
Synthetic rows are Apache 2.0. Real rows carry the original source license.
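
Since licensing is recorded per row, it can help to audit which licenses appear before filtering. The sketch below counts rows by license and keeps only an allowed set. The rows, the `make_row` and `license_counts` helpers, and the license strings are placeholders; the exact spelling of license values in the corpus is an assumption and should be checked against real rows.

```python
import json
from collections import Counter


def make_row(source_type, license_name):
    """Build a hypothetical row; only `_source_meta` matters for this sketch."""
    meta = {"source_type": source_type, "license": license_name}
    return {"x": 1.0, "_source_meta": json.dumps(meta)}


# Placeholder rows; license strings are illustrative, not confirmed values.
rows = [
    make_row("synthetic", "apache-2.0"),
    make_row("synthetic", "apache-2.0"),
    make_row("real", "example-source-license"),
]


def license_counts(rows):
    """Count rows per license value found inside `_source_meta`."""
    return Counter(json.loads(r["_source_meta"])["license"] for r in rows)


counts = license_counts(rows)

# Keep only rows whose license is in an explicitly allowed set.
allowed = {"apache-2.0"}
kept = [r for r in rows if json.loads(r["_source_meta"])["license"] in allowed]

print(counts, len(kept))
```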
Provider:
avewright



