rachitgoyell/vayu-cleaned
Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/rachitgoyell/vayu-cleaned
Resource description:
---
license: cc-by-4.0
language:
- en
tags:
- air-quality
- india
- cpcb
- pollution
- environment
- aqi
- cleaned
- machine-learning
pretty_name: VAYU — Cleaned & Model-Ready CPCB Air Quality Data (India)
size_categories:
- 100K<n<1M
---
# VAYU — Cleaned and Model-Ready CPCB Air Quality Data
Cleaned, feature-engineered, and split datasets ready for ML training.
Produced by the VAYU data preparation pipeline from raw CPCB sensor data.
## Contents
| Folder | File | Use With |
|---|---|---|
| `05_shared/` | `master_cleaned.parquet` | All models — primary cleaned file |
| `05_shared/` | `master_cleaned.csv` | Same, human-readable backup |
| `01_regression/` | `regression_train.csv` | Linear Regression, Multiple Regression |
| `01_regression/` | `regression_test.csv` | Linear Regression, Multiple Regression |
| `02_classification/` | `clf_train_scaled.csv` | Logistic Regression, KNN, SVM |
| `02_classification/` | `clf_train_unscaled.csv` | Decision Trees, Random Forest |
| `02_classification/` | `clf_test_scaled.csv` | Logistic Regression, KNN, SVM |
| `02_classification/` | `clf_test_unscaled.csv` | Decision Trees, Random Forest |
| `02_classification/` | `label_map.json` | Decode integer predictions to category names |
| `03_clustering/` | `city_profiles_scaled.csv` | K-Means clustering |
| `03_clustering/` | `city_clusters.csv` | City cluster assignments + labels |
| `04_dimensionality/` | `pollutant_matrix_scaled.csv` | PCA, t-SNE, SVD |
| `04_dimensionality/` | `pca_components.csv` | Pre-computed PCA coordinates |
| `04_dimensionality/` | `pca_loadings.csv` | Component loadings + variance explained |
| `04_dimensionality/` | `tsne_sample.csv` | t-SNE 2D coordinates (15k sample) |
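A minimal loading sketch, assuming the repository layout matches the Folder and File columns above. `repo_path` and `load_table` are hypothetical helper names, not part of the dataset; `hf_hub_download` comes from the `huggingface_hub` package, and reading the Parquet file additionally requires `pyarrow` or `fastparquet`.

```python
REPO_ID = "rachitgoyell/vayu-cleaned"


def repo_path(folder: str, filename: str) -> str:
    """Join a folder and file name from the Contents table into a repo-relative path."""
    return f"{folder.strip('/')}/{filename}"


def load_table(folder: str, filename: str):
    """Download one file from the dataset repo and read it into a DataFrame."""
    # Imported lazily so the path helper works without these packages installed.
    import pandas as pd
    from huggingface_hub import hf_hub_download

    local_file = hf_hub_download(
        repo_id=REPO_ID,
        filename=repo_path(folder, filename),
        repo_type="dataset",
    )
    if filename.endswith(".parquet"):
        return pd.read_parquet(local_file)
    return pd.read_csv(local_file)
```

For example, `load_table("01_regression", "regression_train.csv")` would fetch the regression training split, provided the file sits at that path in the repository.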
## Cleaning Operations Applied
1. Replace the sentinel value `999` (the CPCB sensor error code) with `NaN`
2. Validate each pollutant against its physical range (unit-aware: CO is stored as µg/m³)
3. Forward-fill short gaps (≤ 3 hours) within each city
4. Drop rows where all pollutants are `NaN` (extended outages)
5. Deduplicate on city + datetime
6. Parse datetime and extract month, hour, day_of_week, and season
7. Derive AQI_category from numeric AQI using CPCB breakpoints
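Steps 1 to 6 can be sketched in pandas as follows. This is an illustration under stated assumptions, not the VAYU pipeline itself: the column names (`city`, `datetime`, `PM2.5`, `CO`), the range limits, and the month-to-season mapping are all hypothetical, and the data is assumed to be hourly. Note that `ffill(limit=3)` fills at most three consecutive rows, so a longer gap is partially filled, which may differ from how the pipeline treats gaps. The datetime is parsed first because the fill needs rows in time order.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and limits; the card does not list the real ones.
POLLUTANTS = ["PM2.5", "CO"]
VALID_RANGE = {"PM2.5": (0.0, 1000.0), "CO": (0.0, 50000.0)}
# Assumed month-to-season mapping; the pipeline's definition may differ.
SEASON = {12: "winter", 1: "winter", 2: "winter",
          3: "summer", 4: "summer", 5: "summer",
          6: "monsoon", 7: "monsoon", 8: "monsoon", 9: "monsoon",
          10: "post_monsoon", 11: "post_monsoon"}


def clean(df: pd.DataFrame, max_gap: int = 3) -> pd.DataFrame:
    out = df.copy()
    # Parse timestamps up front so rows can be ordered within each city.
    out["datetime"] = pd.to_datetime(out["datetime"])
    out = out.sort_values(["city", "datetime"])
    # Step 1: the sentinel 999 becomes NaN.
    out[POLLUTANTS] = out[POLLUTANTS].replace(999, np.nan)
    # Step 2: values outside the assumed physical range become NaN.
    for col, (lo, hi) in VALID_RANGE.items():
        out[col] = out[col].where(out[col].between(lo, hi))
    # Step 3: forward-fill up to max_gap consecutive rows within each city.
    out[POLLUTANTS] = out.groupby("city")[POLLUTANTS].ffill(limit=max_gap)
    # Step 4: drop rows where every pollutant is still missing.
    out = out.dropna(subset=POLLUTANTS, how="all")
    # Step 5: keep one row per (city, datetime).
    out = out.drop_duplicates(subset=["city", "datetime"])
    # Step 6: calendar features.
    out["month"] = out["datetime"].dt.month
    out["hour"] = out["datetime"].dt.hour
    out["day_of_week"] = out["datetime"].dt.dayofweek  # Monday = 0
    out["season"] = out["month"].map(SEASON)
    return out.reset_index(drop=True)


# Toy input: a sentinel row, a duplicate, a negative reading, and an all-missing row.
raw = pd.DataFrame({
    "city": ["Delhi"] * 5 + ["Pune"] * 2,
    "datetime": ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00",
                 "2024-01-01 02:00", "2024-01-01 03:00",
                 "2024-01-01 00:00", "2024-01-01 01:00"],
    "PM2.5": [80.0, 999.0, 90.0, 90.0, -4.0, 999.0, 40.0],
    "CO": [1200.0, 999.0, 1300.0, 1300.0, 1250.0, 999.0, 700.0],
})
cleaned = clean(raw)
```

On the toy frame, the Delhi sentinel row is filled from the previous hour, the duplicate 02:00 row is collapsed, the negative PM2.5 reading is replaced by the last valid value, and the first Pune row is dropped because nothing precedes it to fill from.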
## Target Variables
| Task | Target | Range |
|---|---|---|
| Regression | AQI (numeric) | 0 – 500 |
| Classification | AQI category (integer encoded) | 0 = Good → 5 = Severe |
| Clustering | None (unsupervised) | — |
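The classification target can be reproduced from the numeric AQI with a lookup over the CPCB National AQI bands (0–50 Good, 51–100 Satisfactory, 101–200 Moderate, 201–300 Poor, 301–400 Very Poor, 401–500 Severe). The sketch below assumes the integer codes follow that order, which is consistent with the 0 = Good and 5 = Severe endpoints in the table. The label spellings and the function name `aqi_category` are illustrative, and `label_map.json` remains the authoritative mapping.

```python
from bisect import bisect_left

# Upper bound of each CPCB AQI band, in increasing order of severity.
UPPER_BOUNDS = [50, 100, 200, 300, 400, 500]
# Assumed label order; check against label_map.json before relying on it.
LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_category(aqi: float) -> int:
    """Return the integer category (0 to 5) for a numeric AQI in [0, 500]."""
    if not 0 <= aqi <= 500:
        # Also rejects NaN, since comparisons with NaN are False.
        raise ValueError(f"AQI out of range: {aqi!r}")
    # bisect_left finds the first band whose upper bound is >= aqi.
    return bisect_left(UPPER_BOUNDS, aqi)
```

For instance, `LABELS[aqi_category(350)]` evaluates to `"Very Poor"`, and boundary values such as 50 and 200 fall into the lower band.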
## Related Repository
Raw source data:
[rachitgoyell/vayu-raw](https://huggingface.co/datasets/rachitgoyell/vayu-raw)
Provider:
rachitgoyell



