
rachitgoyell/vayu-cleaned

Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/rachitgoyell/vayu-cleaned
Description:
---
license: cc-by-4.0
language:
- en
tags:
- air-quality
- india
- cpcb
- pollution
- environment
- aqi
- cleaned
- machine-learning
pretty_name: VAYU — Cleaned & Model-Ready CPCB Air Quality Data (India)
size_categories:
- 100K<n<1M
---

# VAYU: Cleaned and Model-Ready CPCB Air Quality Data

Cleaned, feature-engineered, and split datasets ready for ML training. Produced by the VAYU data preparation pipeline from raw CPCB sensor data.

## Contents

| Folder | File | Use With |
|---|---|---|
| `05_shared/` | `master_cleaned.parquet` | All models (primary cleaned file) |
| `05_shared/` | `master_cleaned.csv` | Same, human-readable backup |
| `01_regression/` | `regression_train.csv` | Linear Regression, Multiple Regression |
| `01_regression/` | `regression_test.csv` | Linear Regression, Multiple Regression |
| `02_classification/` | `clf_train_scaled.csv` | Logistic Regression, KNN, SVM |
| `02_classification/` | `clf_train_unscaled.csv` | Decision Trees, Random Forest |
| `02_classification/` | `clf_test_scaled.csv` | Logistic Regression, KNN, SVM |
| `02_classification/` | `clf_test_unscaled.csv` | Decision Trees, Random Forest |
| `02_classification/` | `label_map.json` | Decode integer predictions to category names |
| `03_clustering/` | `city_profiles_scaled.csv` | K-Means clustering |
| `03_clustering/` | `city_clusters.csv` | City cluster assignments + labels |
| `04_dimensionality/` | `pollutant_matrix_scaled.csv` | PCA, t-SNE, SVD |
| `04_dimensionality/` | `pca_components.csv` | Pre-computed PCA coordinates |
| `04_dimensionality/` | `pca_loadings.csv` | Component loadings + variance explained |
| `04_dimensionality/` | `tsne_sample.csv` | t-SNE 2D coordinates (15k sample) |

## Cleaning Operations Applied

1. Sentinel value `999` → `NaN` (CPCB sensor error code)
2. Physical range validation per pollutant (unit-aware; CO stored as µg/m³)
3. Forward fill short gaps ≤ 3 hours within each city
4. Drop rows where all pollutants are NaN (extended outages)
5. Deduplicate on city + datetime
6. Parse datetime, extract month / hour / day_of_week / season
7. Derive AQI_category from numeric AQI using CPCB breakpoints

## Target Variables

| Task | Target | Range |
|---|---|---|
| Regression | AQI (numeric) | 0 – 500 |
| Classification | AQI category (integer encoded) | 0 = Good → 5 = Severe |
| Clustering | None (unsupervised) | n/a |

## Related Repository

Raw source data: [rachitgoyell/vayu-raw](https://huggingface.co/datasets/rachitgoyell/vayu-raw)
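
The cleaning operations listed above can be approximated in pandas. What follows is a minimal sketch, not the VAYU pipeline itself. The column names (`city`, `datetime`, `AQI`, `PM2.5`, `PM10`) and the helper `clean` are assumptions, because the card does not publish the schema. The sketch parses datetimes first so rows can be sorted before forward filling, and it assumes one row per city per hour, so a fill limit of 3 rows corresponds to 3 hours. Range validation and season extraction are left out because the card does not give the thresholds or the season mapping. The category bins are the standard CPCB AQI bands (upper bounds 50, 100, 200, 300, 400), and the integer codes are assumed to match the 0 = Good → 5 = Severe encoding in the target table.

```python
import numpy as np
import pandas as pd

# Hypothetical pollutant columns; the real files may use other names.
POLLUTANTS = ["PM2.5", "PM10"]

# Standard CPCB AQI bands, upper bounds inclusive:
# 0 Good, 1 Satisfactory, 2 Moderate, 3 Poor, 4 Very Poor, 5 Severe.
AQI_BINS = [-np.inf, 50, 100, 200, 300, 400, np.inf]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Parse datetimes up front so rows can be ordered in time per city.
    out["datetime"] = pd.to_datetime(out["datetime"])
    out = out.sort_values(["city", "datetime"]).reset_index(drop=True)

    # Step 1: the sentinel value 999 becomes NaN.
    out[POLLUTANTS] = out[POLLUTANTS].replace(999, np.nan)
    # Step 3: forward fill at most 3 consecutive missing rows within each city.
    out[POLLUTANTS] = out.groupby("city")[POLLUTANTS].ffill(limit=3)
    # Step 4: drop rows where every pollutant is still missing.
    out = out.dropna(subset=POLLUTANTS, how="all")
    # Step 5: keep one row per (city, datetime).
    out = out.drop_duplicates(subset=["city", "datetime"], keep="first")

    # Step 6: calendar features (season omitted; its definition is not given).
    out["month"] = out["datetime"].dt.month
    out["hour"] = out["datetime"].dt.hour
    out["day_of_week"] = out["datetime"].dt.dayofweek
    # Step 7: integer-encoded AQI category from the numeric AQI.
    out["AQI_category"] = pd.cut(out["AQI"], bins=AQI_BINS, labels=False)
    return out.reset_index(drop=True)


# Toy input: one city, six hourly rows plus an exact duplicate of the first.
raw = pd.DataFrame({
    "city": ["Delhi"] * 7,
    "datetime": [f"2024-01-01 0{h}:00" for h in range(6)] + ["2024-01-01 00:00"],
    "PM2.5": [40, 999, np.nan, np.nan, np.nan, 60, 40],
    "PM10": [80, 999, np.nan, np.nan, np.nan, 120, 80],
    "AQI": [45, 90, 150, 250, 350, 450, 45],
})
cleaned = clean(raw)
print(cleaned[["datetime", "PM2.5", "PM10", "AQI", "AQI_category"]])
```

In the toy input, the 04:00 row is the fourth consecutive gap, so it stays empty after the limited forward fill and is dropped as an extended outage. To turn the integer codes back into names for the published files, use `label_map.json` rather than the band names in the comment above.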
Provider:
rachitgoyell