Name: avewright/tabula-pretraining-corpus
Creator: avewright
Published: 2026-03-13 18:17:49
License: No description available

Download link:

https://hf-mirror.com/datasets/avewright/tabula-pretraining-corpus

Resource description:

---
license: other
task_categories:
- tabular-classification
- tabular-regression
language:
- en
tags:
- tabular
- synthetic
- real-data
- pretraining
- tabpfn
pretty_name: "Tabula Pretraining Corpus"
configs:
- config_name: datagen_001
  data_files:
  - split: train
    path: datagen_001/*
- config_name: datagen_002
  data_files:
  - split: train
    path: datagen_002/*
- config_name: datagen_003
  data_files:
  - split: train
    path: datagen_003/*
- config_name: datagen_004
  data_files:
  - split: train
    path: datagen_004/*
- config_name: datagen_005
  data_files:
  - split: train
    path: datagen_005/*
- config_name: datagen_006
  data_files:
  - split: train
    path: datagen_006/*
- config_name: datagen_007
  data_files:
  - split: train
    path: datagen_007/*
- config_name: datagen_008
  data_files:
  - split: train
    path: datagen_008/*
- config_name: datagen_009
  data_files:
  - split: train
    path: datagen_009/*
- config_name: datagen_010
  data_files:
  - split: train
    path: datagen_010/*
- config_name: datagen_011
  data_files:
  - split: train
    path: datagen_011/*
- config_name: datagen_012
  data_files:
  - split: train
    path: datagen_012/*
- config_name: datagen_013
  data_files:
  - split: train
    path: datagen_013/*
- config_name: datagen_014
  data_files:
  - split: train
    path: datagen_014/*
- config_name: datagen_078
  data_files:
  - split: train
    path: datagen_078/*
- config_name: datagen_079
  data_files:
  - split: train
    path: datagen_079/*
- config_name: datagen_080
  data_files:
  - split: train
    path: datagen_080/*
- config_name: datagen_081
  data_files:
  - split: train
    path: datagen_081/*
- config_name: datagen_082
  data_files:
  - split: train
    path: datagen_082/*
- config_name: datagen_083
  data_files:
  - split: train
    path: datagen_083/*
- config_name: datagen_084
  data_files:
  - split: train
    path: datagen_084/*
- config_name: datagen_085
  data_files:
  - split: train
    path: datagen_085/*
- config_name: datagen_086
  data_files:
  - split: train
    path: datagen_086/*
- config_name: datagen_087
  data_files:
  - split: train
    path: datagen_087/*
- config_name: datagen_088
  data_files:
  - split: train
    path: datagen_088/*
- config_name: datagen_089
  data_files:
  - split: train
    path: datagen_089/*
- config_name: datagen_090
  data_files:
  - split: train
    path: datagen_090/*
- config_name: datagen_091
  data_files:
  - split: train
    path: datagen_091/*
- config_name: datagen_092
  data_files:
  - split: train
    path: datagen_092/*
- config_name: datagen_093
  data_files:
  - split: train
    path: datagen_093/*
- config_name: datagen_094
  data_files:
  - split: train
    path: datagen_094/*
- config_name: datagen_095
  data_files:
  - split: train
    path: datagen_095/*
- config_name: datagen_096
  data_files:
  - split: train
    path: datagen_096/*
- config_name: datagen_097
  data_files:
  - split: train
    path: datagen_097/*
- config_name: datagen_098
  data_files:
  - split: train
    path: datagen_098/*
- config_name: datagen_099
  data_files:
  - split: train
    path: datagen_099/*
- config_name: datagen_100
  data_files:
  - split: train
    path: datagen_100/*
- config_name: datagen_101
  data_files:
  - split: train
    path: datagen_101/*
- config_name: datagen_102
  data_files:
  - split: train
    path: datagen_102/*
- config_name: datagen_103
  data_files:
  - split: train
    path: datagen_103/*
- config_name: datagen_104
  data_files:
  - split: train
    path: datagen_104/*
- config_name: datagen_105
  data_files:
  - split: train
    path: datagen_105/*
- config_name: datagen_106
  data_files:
  - split: train
    path: datagen_106/*
- config_name: datagen_107
  data_files:
  - split: train
    path: datagen_107/*
- config_name: datagen_108
  data_files:
  - split: train
    path: datagen_108/*
- config_name: datagen_109
  data_files:
  - split: train
    path: datagen_109/*
- config_name: datagen_110
  data_files:
  - split: train
    path: datagen_110/*
- config_name: datagen_111
  data_files:
  - split: train
    path: datagen_111/*
- config_name: datagen_112
  data_files:
  - split: train
    path: datagen_112/*
- config_name: datagen_113
  data_files:
  - split: train
    path: datagen_113/*
- config_name: datagen_114
  data_files:
  - split: train
    path: datagen_114/*
- config_name: datagen_115
  data_files:
  - split: train
    path: datagen_115/*
- config_name: datagen_116
  data_files:
  - split: train
    path: datagen_116/*
- config_name: datagen_117
  data_files:
  - split: train
    path: datagen_117/*
- config_name: datagen_118
  data_files:
  - split: train
    path: datagen_118/*
- config_name: datagen_119
  data_files:
  - split: train
    path: datagen_119/*
- config_name: datagen_120
  data_files:
  - split: train
    path: datagen_120/*
- config_name: datagen_121
  data_files:
  - split: train
    path: datagen_121/*
- config_name: datagen_122
  data_files:
  - split: train
    path: datagen_122/*
- config_name: datagen_123
  data_files:
  - split: train
    path: datagen_123/*
- config_name: datagen_124
  data_files:
  - split: train
    path: datagen_124/*
- config_name: datagen_125
  data_files:
  - split: train
    path: datagen_125/*
- config_name: datagen_126
  data_files:
  - split: train
    path: datagen_126/*
- config_name: datagen_127
  data_files:
  - split: train
    path: datagen_127/*
- config_name: datagen_129
  data_files:
  - split: train
    path: datagen_129/*
- config_name: datagen_130
  data_files:
  - split: train
    path: datagen_130/*
- config_name: datagen_131
  data_files:
  - split: train
    path: datagen_131/*
- config_name: datagen_132
  data_files:
  - split: train
    path: datagen_132/*
- config_name: datagen_133
  data_files:
  - split: train
    path: datagen_133/*
- config_name: datagen_134
  data_files:
  - split: train
    path: datagen_134/*
- config_name: datagen_135
  data_files:
  - split: train
    path: datagen_135/*
- config_name: datagen_136
  data_files:
  - split: train
    path: datagen_136/*
- config_name: datagen_137
  data_files:
  - split: train
    path: datagen_137/*
- config_name: datagen_138
  data_files:
  - split: train
    path: datagen_138/*
- config_name: datagen_139
  data_files:
  - split: train
    path: datagen_139/*
- config_name: datagen_140
  data_files:
  - split: train
    path: datagen_140/*
- config_name: datagen_143
  data_files:
  - split: train
    path: datagen_143/*
- config_name: datagen_144
  data_files:
  - split: train
    path: datagen_144/*
- config_name: datagen_145
  data_files:
  - split: train
    path: datagen_145/*
- config_name: datagen_146
  data_files:
  - split: train
    path: datagen_146/*
- config_name: datagen_147
  data_files:
  - split: train
    path: datagen_147/*
- config_name: datagen_148
  data_files:
  - split: train
    path: datagen_148/*
- config_name: datagen_149
  data_files:
  - split: train
    path: datagen_149/*
- config_name: datagen_150
  data_files:
  - split: train
    path: datagen_150/*
- config_name: datagen_151
  data_files:
  - split: train
    path: datagen_151/*
- config_name: datagen_152
  data_files:
  - split: train
    path: datagen_152/*
- config_name: datagen_154
  data_files:
  - split: train
    path: datagen_154/*
- config_name: datagen_156
  data_files:
  - split: train
    path: datagen_156/*
- config_name: datagen_158
  data_files:
  - split: train
    path: datagen_158/*
- config_name: datagen_159
  data_files:
  - split: train
    path: datagen_159/*
- config_name: datagen_160
  data_files:
  - split: train
    path: datagen_160/*
- config_name: datagen_163
  data_files:
  - split: train
    path: datagen_163/*
- config_name: datagen_164
  data_files:
  - split: train
    path: datagen_164/*
- config_name: datagen_165
  data_files:
  - split: train
    path: datagen_165/*
- config_name: datagen_166
  data_files:
  - split: train
    path: datagen_166/*
- config_name: datagen_167
  data_files:
  - split: train
    path: datagen_167/*
- config_name: datagen_168
  data_files:
  - split: train
    path: datagen_168/*
- config_name: datagen_169
  data_files:
  - split: train
    path: datagen_169/*
- config_name: datagen_170
  data_files:
  - split: train
    path: datagen_170/*
- config_name: datagen_172
  data_files:
  - split: train
    path: datagen_172/*
- config_name: datagen_173
  data_files:
  - split: train
    path: datagen_173/*
- config_name: datagen_174
  data_files:
  - split: train
    path: datagen_174/*
- config_name: datagen_176
  data_files:
  - split: train
    path: datagen_176/*
---

# Tabula Pretraining Corpus

A continuously growing tabular pretraining corpus for the Tabula foundation model (TabPFN-style in-context learning). It is built by an autonomous agent that alternates between harvesting permissively licensed real datasets and generating high-quality synthetic ones.

## Usage

```python
from datasets import load_dataset
from huggingface_hub import HfApi

REPO_ID = "avewright/tabula-pretraining-corpus"

# Load a specific batch config
ds = load_dataset(REPO_ID, name="datagen_001")

# List available configs by checking repo folders
api = HfApi()
files = api.list_repo_files(REPO_ID, repo_type="dataset")
configs = sorted({f.split("/")[0] for f in files if f.startswith("datagen_")})
```

Each config represents a batch with its own column schema (different domains have different feature names). Because schemas differ between configs, load configs individually rather than all at once.

## Stats (auto-updated)

| Metric | Value |
|--------|-------|
| Total rows | 7,014,523 |
| Real-data batches | 14 |
| Synthetic batches | 94 |
| Last updated | 2026-03-13 18:17 UTC |

## Schema

Every row has feature columns plus a `_source_meta` column, a JSON string with the following keys:

- `batch_id`, `source_type`, `source_id`, `domain`, `task_type`, `license`, `citation_key`

## Sources & Citations

| batch_id | source_type | method | source_id | n_datasets | total_rows | status |
|----------|-------------|--------|-----------|------------|------------|--------|
| datagen_001 | synthetic | TreePrior | synthetic:TreePrior | 4 | 3500 | success |
| datagen_002 | synthetic | TreePrior | synthetic:TreePrior | 6 | 17000 | success |
| datagen_003 | synthetic | SCM | synthetic:SCM | 14 | 47000 | success |
| datagen_004 | synthetic | TreePrior | synthetic:TreePrior | 5 | 17000 | success |
| datagen_005 | synthetic | SCM | synthetic:SCM | 11 | 33000 | success |
| datagen_006 | synthetic | TreePrior | synthetic:TreePrior | 7 | 15500 | success |
| datagen_007 | synthetic | SCM | synthetic:SCM | 12 | 62500 | success |
| datagen_008 | synthetic | TreePrior | synthetic:TreePrior | 5 | 17500 | success |
| datagen_009 | synthetic | TreePrior | synthetic:TreePrior | 5 | 8500 | success |
| datagen_010 | synthetic | TreePrior | synthetic:TreePrior | 8 | 15500 | success |
| datagen_011 | synthetic | SCM | synthetic:SCM | 13 | 56500 | success |
| datagen_012 | synthetic | GaussianMixture | synthetic:GaussianMixture | 10 | 33500 | success |
| datagen_013 | synthetic | Polynomial | synthetic:Polynomial | 14 | 39000 | success |
| datagen_014 | synthetic | Regression | synthetic:Regression | 13 | 61500 | success |
| datagen_078 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 5 | 6500 | success |
| datagen_079 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 12 | 55000 | success |
| datagen_080 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 11 | 47000 | success |
| datagen_081 | synthetic | TreePrior | synthetic:TreePrior | 6 | 23500 | success |
| datagen_082 | synthetic | SCM | synthetic:SCM | 15 | 46000 | success |
| datagen_083 | synthetic | GaussianMixture | synthetic:GaussianMixture | 10 | 41000 | success |
| datagen_084 | synthetic | Polynomial | synthetic:Polynomial | 15 | 95500 | success |
| datagen_085 | synthetic | Regression | synthetic:Regression | 15 | 57500 | success |
| datagen_086 | synthetic | TimeSeries | synthetic:TimeSeries | 12 | 19500 | success |
| datagen_087 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 4 | 4000 | success |
| datagen_088 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 13 | 46500 | success |
| datagen_089 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 12 | 38000 | success |
| datagen_090 | synthetic | TreePrior | synthetic:TreePrior | 4 | 31000 | success |
| datagen_091 | synthetic | SCM | synthetic:SCM | 15 | 63000 | success |
| datagen_092 | synthetic | GaussianMixture | synthetic:GaussianMixture | 8 | 34000 | success |
| datagen_093 | synthetic | Polynomial | synthetic:Polynomial | 10 | 23000 | success |
| datagen_094 | synthetic | Regression | synthetic:Regression | 14 | 74000 | success |
| datagen_095 | synthetic | TimeSeries | synthetic:TimeSeries | 14 | 19500 | success |
| datagen_096 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 8 | 25000 | success |
| datagen_097 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 14 | 43500 | success |
| datagen_098 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 12 | 43000 | success |
| datagen_099 | synthetic | TreePrior | synthetic:TreePrior | 6 | 14500 | success |
| datagen_100 | synthetic | SCM | synthetic:SCM | 12 | 62500 | success |
| datagen_101 | synthetic | GaussianMixture | synthetic:GaussianMixture | 13 | 65500 | success |
| datagen_102 | synthetic | Polynomial | synthetic:Polynomial | 13 | 53000 | success |
| datagen_103 | synthetic | Regression | synthetic:Regression | 15 | 56500 | success |
| datagen_104 | synthetic | TimeSeries | synthetic:TimeSeries | 11 | 16500 | success |
| datagen_105 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 3 | 7500 | success |
| datagen_106 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 14 | 71000 | success |
| datagen_107 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 9 | 34000 | success |
| datagen_108 | synthetic | TreePrior | synthetic:TreePrior | 4 | 12500 | success |
| datagen_109 | synthetic | SCM | synthetic:SCM | 11 | 37000 | success |
| datagen_110 | synthetic | GaussianMixture | synthetic:GaussianMixture | 11 | 56500 | success |
| datagen_111 | real | real-ingest | pmlb:522_pm10\|pmlb:spambase\|pmlb:573_cpu_act\|pmlb: | 5 | 14193 | success |
| datagen_112 | synthetic | Polynomial | synthetic:Polynomial | 14 | 114000 | success |
| datagen_113 | synthetic | Regression | synthetic:Regression | 13 | 170500 | success |
| datagen_114 | real | real-ingest | pmlb:_deprecated_solar_flare_1\|pmlb:_deprecated_pr | 3 | 996 | success |
| datagen_115 | synthetic | TimeSeries | synthetic:TimeSeries | 14 | 22500 | success |
| datagen_116 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 3 | 4500 | success |
| datagen_117 | real | real-ingest | pmlb:biomed\|pmlb:_deprecated_prnn_fglass\|pmlb:spec | 4 | 998 | success |
| datagen_118 | synthetic | TimeSeries | synthetic:TimeSeries | 12 | 14500 | success |
| datagen_119 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 1 | 500 | success |
| datagen_120 | real | real-ingest | pmlb:balance_scale\|pmlb:583_fri_c1_1000_50\|pmlb:so | 5 | 3300 | success |
| datagen_121 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 14 | 198000 | success |
| datagen_122 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 12 | 61500 | success |
| datagen_123 | synthetic | TreePrior | synthetic:TreePrior | 7 | 64500 | success |
| datagen_124 | real | real-ingest | pmlb:GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1\|p | 4 | 21160 | success |
| datagen_125 | synthetic | SCM | synthetic:SCM | 14 | 90500 | success |
| datagen_126 | synthetic | GaussianMixture | synthetic:GaussianMixture | 13 | 127000 | success |
| datagen_127 | synthetic | Polynomial | synthetic:Polynomial | 11 | 70000 | success |
| datagen_129 | real | real-ingest | hf:BrotherTony/employee-burnout-turnover-predictio | 2 | 54180 | success |
| datagen_130 | synthetic | Regression | synthetic:Regression | 14 | 162500 | success |
| datagen_131 | synthetic | TimeSeries | synthetic:TimeSeries | 12 | 19500 | success |
| datagen_132 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 8 | 19000 | success |
| datagen_133 | real | real-ingest | pmlb:628_fri_c3_1000_5\|pmlb:560_bodyfat\|pmlb:appen | 3 | 1358 | success |
| datagen_134 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 10 | 116500 | success |
| datagen_135 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 13 | 183500 | success |
| datagen_136 | synthetic | TreePrior | synthetic:TreePrior | 4 | 111000 | success |
| datagen_137 | real | real-ingest | pmlb:_deprecated_cleveland\|pmlb:GAMETES_Heterogene | 5 | 3128 | success |
| datagen_138 | synthetic | SCM | synthetic:SCM | 12 | 85000 | success |
| datagen_139 | synthetic | GaussianMixture | synthetic:GaussianMixture | 14 | 131000 | success |
| datagen_140 | synthetic | Polynomial | synthetic:Polynomial | 12 | 152500 | success |
| datagen_143 | synthetic | Regression | synthetic:Regression | 15 | 126500 | success |
| datagen_144 | synthetic | TimeSeries | synthetic:TimeSeries | 11 | 19500 | success |
| datagen_145 | real | real-ingest | pmlb:auto_insurance_symboling\|pmlb:breast_cancer\|p | 5 | 2241 | success |
| datagen_146 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 4 | 24000 | success |
| datagen_147 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 10 | 140000 | success |
| datagen_148 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 10 | 168000 | success |
| datagen_149 | real | real-ingest | pmlb:strogatz_predprey2\|pmlb:vehicle\|pmlb:_depreca | 3 | 1936 | success |
| datagen_150 | synthetic | TreePrior | synthetic:TreePrior | 6 | 20500 | success |
| datagen_151 | synthetic | SCM | synthetic:SCM | 13 | 195000 | success |
| datagen_152 | synthetic | GaussianMixture | synthetic:GaussianMixture | 9 | 131000 | success |
| datagen_154 | synthetic | Polynomial | synthetic:Polynomial | 10 | 69000 | success |
| datagen_156 | synthetic | Regression | synthetic:Regression | 15 | 96500 | success |
| datagen_158 | synthetic | TimeSeries | synthetic:TimeSeries | 14 | 19500 | success |
| datagen_159 | real | real-ingest | openml:14\|openml:44 | 2 | 6601 | success |
| datagen_160 | synthetic | MixedType_TreePrior | synthetic:MixedType_TreePrior | 6 | 116500 | success |
| datagen_163 | synthetic | MixedType_SCM | synthetic:MixedType_SCM | 15 | 190000 | success |
| datagen_164 | synthetic | MixedType_GaussianMixture | synthetic:MixedType_GaussianMixture | 9 | 82500 | success |
| datagen_165 | real | real-ingest | pmlb:_deprecated_wdbc\|pmlb:631_fri_c1_500_5\|pmlb:3 | 5 | 53229 | success |
| datagen_166 | synthetic | TreePrior | synthetic:TreePrior | 7 | 43000 | success |
| datagen_167 | synthetic | SCM | synthetic:SCM | 11 | 91500 | success |
| datagen_168 | synthetic | GaussianMixture | synthetic:GaussianMixture | 13 | 223500 | success |
| datagen_169 | real | real-ingest | pmlb:heart_disease_cleveland\|pmlb:1193_BNG_lowbwt\| | 3 | 31657 | success |
| datagen_170 | synthetic | Polynomial | synthetic:Polynomial | 15 | 206500 | success |
| datagen_172 | synthetic | TreePrior | synthetic:TreePrior | 9 | 245000 | success |
| datagen_173 | synthetic | SCM | synthetic:SCM | 13 | 277000 | success |
| datagen_174 | real | real-ingest | openml:48\|openml:9\|openml:29 | 3 | 1046 | success |

## Key Citations

```bibtex
@inproceedings{hollmann2023tabpfn,
  title     = {TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second},
  author    = {Hollmann, Noah and M{\"u}ller, Samuel and Eggensperger, Katharina and Hutter, Frank},
  booktitle = {ICLR},
  year      = {2023}
}

@article{vanschoren2014openml,
  title   = {OpenML: Networked Science in Machine Learning},
  author  = {Vanschoren, Joaquin and van Rijn, Jan N. and Bischl, Bernd and Torgo, Luis},
  journal = {ACM SIGKDD Explorations},
  year    = {2014}
}

@article{Olson2017PMLB,
  title   = {PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison},
  author  = {Olson, Randal S. and La Cava, William and Orzechowski, Patryk and Urbanowicz, Ryan J. and Moore, Jason H.},
  journal = {BioData Mining},
  volume  = {10},
  number  = {36},
  year    = {2017}
}

@article{scholkopf2021causal,
  title   = {Toward Causal Representation Learning},
  author  = {Sch{\"o}lkopf, Bernhard and others},
  journal = {Proceedings of the IEEE},
  year    = {2021}
}
```

## License

Individual rows carry their own `license` field inside `_source_meta`. Synthetic rows are licensed under Apache 2.0; real rows carry the license of their original source.
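
Since `_source_meta` is stored as a JSON string and each row carries its own `license` value, a small helper can separate the feature columns from the metadata and filter rows by license. The following is a minimal sketch, not part of the corpus tooling: the helper names `split_row` and `has_allowed_license`, the example feature names, and the exact license strings (`"Apache-2.0"`, `"MIT"`) are assumptions made for illustration. The actual license values should be checked against the data.

```python
import json
from typing import Any

META_COLUMN = "_source_meta"  # metadata column named in the Schema section


def split_row(row: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (features, meta): feature columns and the parsed `_source_meta`."""
    meta = json.loads(row[META_COLUMN])
    features = {k: v for k, v in row.items() if k != META_COLUMN}
    return features, meta


def has_allowed_license(row: dict[str, Any], allowed: set[str]) -> bool:
    """True if the row's `license` metadata value is in the allowed set."""
    _, meta = split_row(row)
    return meta.get("license") in allowed


# Hypothetical rows: feature names and license strings are made up for this example.
rows = [
    {
        "x0": 0.1,
        "x1": 2.0,
        "target": 1,
        "_source_meta": json.dumps(
            {"batch_id": "datagen_001", "source_type": "synthetic", "license": "Apache-2.0"}
        ),
    },
    {
        "age": 54,
        "chol": 230,
        "target": 0,
        "_source_meta": json.dumps(
            {"batch_id": "datagen_111", "source_type": "real", "license": "MIT"}
        ),
    },
]

apache_rows = [r for r in rows if has_allowed_license(r, {"Apache-2.0"})]
features, meta = split_row(apache_rows[0])
print(meta["batch_id"], sorted(features))
```

With a config loaded through `load_dataset`, the same predicate could presumably be passed to `Dataset.filter`, for example `ds.filter(lambda r: has_allowed_license(r, {"Apache-2.0"}))`, again assuming the license strings match.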

Use cases: