alexodavies/cleantablib

Name: alexodavies/cleantablib
Creator: alexodavies
Published: 2026-02-25 16:53:09
License: not listed here (the dataset card declares apache-2.0)

Hugging Face · Updated 2026-02-25 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/alexodavies/cleantablib

Description:

---
license: apache-2.0
task_categories:
- tabular-classification
- tabular-regression
language:
- en
size_categories:
- 100K<n<1M
configs:
- config_name: stage1
  data_files:
  - split: train
    path: data/stage1/**/*.parquet
- config_name: stage2
  data_files:
  - split: train
    path: data/stage2/**/*.parquet
- config_name: stage3
  data_files:
  - split: train
    path: data/stage3/**/*.parquet
- config_name: stage4
  data_files:
  - split: train
    path: data/stage4/**/*.parquet
---

# CleanTabLib

A cleaned and processed version of [TabLib](https://huggingface.co/datasets/approximatelabs/tablib-v1-full), a large-scale collection of tabular data from diverse sources (GitHub, CommonCrawl, and others). Each table has been filtered for quality, columns classified as categorical or continuous, and optionally normalized/encoded for direct use in machine learning.

## Quick Start

```python
from datasets import load_dataset
import pyarrow as pa

ds = load_dataset("alexodavies/cleantablib", "stage4")

for example in ds['train']:
    table_id = example['table_id']
    metadata = example['metadata']

    # Deserialize the Arrow IPC bytes back to a table
    reader = pa.RecordBatchStreamReader(example['arrow_bytes'])
    table = reader.read_all()
    df = table.to_pandas()
```

## Dataset Structure

Files are organized into sharded subdirectories to stay within platform limits:

```
data/
  stage1/
    shard_00/
      batch_00001.parquet
      ...
    shard_01/
      ...
  stage2/
    shard_00/
      batch_00001.parquet
      ...
    shard_01/
      ...
  stage3/
    shard_00/
      batch_00001.parquet
      ...
    shard_01/
      ...
  stage4/
    shard_00/
      batch_00001.parquet
      ...
    shard_01/
      ...
```

Each stage is an independent config. Use `load_dataset(repo, config_name)` to load a specific stage. Most users will want **stage4** (fully processed) or **stage1** (original filtered tables).

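If you work with individual parquet files rather than a whole config, it can help to split a file path into its stage, shard, and batch numbers. The following is a minimal sketch, not part of the dataset's tooling: `ShardPath`, `parse_shard_path`, and `build_shard_path` are hypothetical helper names, and the assumptions that there is exactly one shard directory level and that the zero-padding widths are 2 and 5 digits are inferred only from the example names in the tree above.

```python
import re
from typing import NamedTuple


class ShardPath(NamedTuple):
    """Hypothetical container for the three numbers encoded in a batch path."""
    stage: int
    shard: int
    batch: int


# Assumed layout: data/stage<N>/shard_<XX>/batch_<NNNNN>.parquet
_BATCH_PATH = re.compile(r"^data/stage(\d+)/shard_(\d+)/batch_(\d+)\.parquet$")


def parse_shard_path(path: str) -> ShardPath:
    """Split a repository-relative batch path into stage, shard and batch."""
    match = _BATCH_PATH.match(path)
    if match is None:
        raise ValueError(f"not a recognised batch path: {path!r}")
    stage, shard, batch = (int(group) for group in match.groups())
    return ShardPath(stage, shard, batch)


def build_shard_path(stage: int, shard: int, batch: int) -> str:
    """Inverse of parse_shard_path; padding widths are an assumption."""
    return f"data/stage{stage}/shard_{shard:02d}/batch_{batch:05d}.parquet"


parsed = parse_shard_path("data/stage4/shard_00/batch_00001.parquet")
print(parsed.stage, parsed.shard, parsed.batch)
```

A path that does not follow the assumed layout raises `ValueError` instead of being silently mis-parsed, which is useful if the real repository nests shards differently.
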
## Dataset Statistics

| Stage | Tables | Files | Size | Rows (mean) | Rows (median) | Cols (mean) | Cols (median) |
|-------|--------|-------|------|-------------|---------------|-------------|---------------|
| stage1 | 2,988,000 | 7,470 | 107.1 GB | 1,042 | 113 | 6 | 4 |
| stage2 | 2,210,180 | 14,879 | 75.2 GB | 1,271 | 116 | 6 | 5 |
| stage3 | 2,210,043 | 14,879 | 92.3 GB | 1,270 | 116 | 6 | 5 |
| stage4 | 2,117,811 | 14,879 | 198.6 GB | 1,232 | 117 | 6 | 5 |

**Pass-through rates:**

- Stage 1 to Stage 2: 74.0%
- Stage 2 to Stage 3: 100.0%
- Stage 3 to Stage 4: 95.8%

### Column Types (stage2)

| Type | Count | % |
|------|-------|---|
| categorical | 3,766,554 | 30.3% |
| ambiguous | 3,526,030 | 28.4% |
| continuous | 3,332,725 | 26.8% |
| likely_text_or_id | 1,808,006 | 14.5% |

### Column Types (stage3)

| Type | Count | % |
|------|-------|---|
| categorical | 3,766,202 | 30.3% |
| ambiguous | 3,525,681 | 28.4% |
| continuous | 3,332,522 | 26.8% |
| likely_text_or_id | 1,807,915 | 14.5% |

### ML Classifications (stage3)

| Classification | Count | % |
|----------------|-------|---|
| categorical | 5,512,597 | 51.9% |
| continuous | 4,550,758 | 42.8% |
| needs_llm_review | 561,050 | 5.3% |

### Classification Sources (stage3)

| Source | Count | % |
|--------|-------|---|
| heuristic | 7,098,724 | 66.8% |
| ml_high | 2,094,239 | 19.7% |
| ml_medium | 870,392 | 8.2% |
| ml_low | 561,050 | 5.3% |

### Column Types (stage4)

| Type | Count | % |
|------|-------|---|
| categorical | 3,718,359 | 30.8% |
| ambiguous | 3,371,838 | 27.9% |
| continuous | 3,284,452 | 27.2% |
| likely_text_or_id | 1,711,168 | 14.2% |

## Processing Pipeline

### Stage 1: Filter

Basic quality filtering applied to raw TabLib tables:

- Minimum **64 rows**
- Minimum **2 columns**
- Maximum **50% missing values**
- Row count must be >= column count
- Tables exceeding 1 GB estimated memory are skipped
- Duplicate column names are deduplicated

### Stage 2: Heuristic Classification

Rule-based column type classification:

| Rule | Classification |
|------|----------------|
| Uniqueness < 5% | Categorical |
| Uniqueness > 30% and numeric | Continuous |
| Uniqueness > 95% and non-numeric | Dropped (likely ID/text) |
| Everything else | Ambiguous (sent to Stage 3) |

String columns with numeric-like values (commas, K/M/B suffixes, scientific notation) are converted.

### Stage 3: ML Classifier

A Random Forest classifier resolves ambiguous columns from Stage 2. It extracts 33 distribution-agnostic features (string length patterns, character distributions, entropy, numeric convertibility) and predicts categorical vs. continuous.

Confidence levels:

- **High** (>0.75): accepted automatically
- **Medium** (0.65-0.75): accepted with lower confidence
- **Low** (<0.65): marked for review, dropped in Stage 4

### Stage 4: Normalization & Encoding

Final transformations to make data ML-ready:

- **Continuous columns**: z-score normalization (`(x - mean) / std`)
- **Categorical columns**: integer encoding (1, 2, 3, ...)
- **Low-confidence columns**: dropped

All transformation parameters are stored in `metadata` for reversibility.

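The Stage 2 rules, the Stage 3 confidence bands, and the Stage 4 transforms described above can be illustrated with a small standard-library sketch. This is not the pipeline's actual code. The function names are hypothetical, "uniqueness" is assumed to mean distinct non-missing values divided by non-missing values, the standard deviation is assumed to be the population one, the zero-variance fallback is an assumption, and categorical codes are assumed to be assigned in order of first appearance. The label strings and the `mean`, `std`, and `mapping` keys follow the names used elsewhere in this card.

```python
from statistics import mean, pstdev


def classify_column(values):
    """Stage 2 style label for one column, given as a list of raw values."""
    present = [v for v in values if v is not None]
    if not present:
        return "ambiguous"  # assumption: an empty column is left undecided
    uniqueness = len(set(present)) / len(present)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if uniqueness < 0.05:
        return "categorical"
    if uniqueness > 0.30 and numeric:
        return "continuous"
    if uniqueness > 0.95 and not numeric:
        return "likely_text_or_id"
    return "ambiguous"


def confidence_band(probability):
    """Stage 3 style band: >0.75 high, 0.65 to 0.75 medium, <0.65 low."""
    if probability > 0.75:
        return "ml_high"
    if probability >= 0.65:
        return "ml_medium"
    return "ml_low"


def zscore(values):
    """Stage 4 style normalization; returns scaled values and their parameters."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # assumption: avoid dividing by zero
    return [(x - mu) / sigma for x in values], {"mean": mu, "std": sigma}


def integer_encode(values):
    """Stage 4 style encoding with codes 1, 2, 3, ... in first-seen order."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)
    return [mapping[v] for v in values], {"mapping": mapping}


colour = ["red", "blue", "red", "green"] * 25      # 3 distinct values in 100
price = [i * 1.5 for i in range(100)]              # 100 distinct numbers
print(classify_column(colour), classify_column(price))

scaled, norm_params = zscore([1.0, 2.0, 3.0])
restored = [s * norm_params["std"] + norm_params["mean"] for s in scaled]
codes, enc_params = integer_encode(["a", "b", "a"])
print(restored, codes, enc_params["mapping"])
```

Returning the parameters alongside the transformed values mirrors the idea that every Stage 4 transform can be undone from stored metadata; the real metadata layout may differ from the plain dictionaries used here.
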
## Reversing Transformations

Stage 4 metadata contains the parameters needed to undo normalization and encoding:

```python
import pyarrow as pa

# Load a stage4 example
metadata = example['metadata']
reader = pa.RecordBatchStreamReader(example['arrow_bytes'])
table = reader.read_all()
df = table.to_pandas()

# Reverse z-score normalization for a continuous column
norm_params = metadata['stage4_normalized_columns']['my_column']
df['my_column'] = df['my_column'] * norm_params['std'] + norm_params['mean']

# Reverse integer encoding for a categorical column
enc_params = metadata['stage4_encoded_columns']['my_category']
inverse_map = {v: k for k, v in enc_params['mapping'].items()}
df['my_category'] = df['my_category'].map(inverse_map)
```

## Schema

Each row in the parquet files represents one table:

| Column | Type | Description |
|--------|------|-------------|
| `table_id` | string | Unique identifier for the source table |
| `arrow_bytes` | binary | Serialized PyArrow table in IPC streaming format |
| `metadata` | struct | Processing metadata from all pipeline stages |

## Source

Built from [approximatelabs/tablib-v1-full](https://huggingface.co/datasets/approximatelabs/tablib-v1-full). If you use this dataset, please cite the original TabLib work.

Provider:

alexodavies
