
cosuleabianca/eea-pm25-forecasting

Source: Hugging Face. Updated 2026-02-01. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cosuleabianca/eea-pm25-forecasting
Resource description:
---
license: cc-by-4.0
task_categories:
- time-series-forecasting
- tabular-regression
tags:
- air-quality
- pm25
- forecasting
- environment
- europe
- eea
language:
- en
pretty_name: EEA PM2.5 Air Quality Dataset (Europe)
size_categories:
- 1M<n<10M
---

# EEA PM2.5 Air Quality Dataset

Hourly air quality measurements from the European Environment Agency (EEA) for PM2.5 forecasting research.

## Dataset Description

This dataset contains hourly air pollutant concentrations and meteorological data from monitoring stations across 5 European cities, prepared for machine learning forecasting tasks.

### Data Sources

- **Air Quality**: European Environment Agency (EEA) Air Quality Portal
- **Weather**: Open-Meteo Archive API

### Coverage

- **Time Period**: 2018-01-08 to 2024-12-31
- **Countries**: 5 (AT, BE, ES, FI, FR)
- **Cities**: Wien, Paris, Madrid, Antwerpen, Helsinki
- **Monitoring Stations**: 38
- **Total Records**: 1,945,153 hourly observations
- **Total Features**: 81 columns

## Dataset Files

### Raw Data (Parquet)

| File | Description |
|------|-------------|
| `PM2.5.parquet` | PM2.5 concentrations (all sites) |
| `PM2.5_filtered.parquet` | PM2.5 (filtered to quality sites) |
| `NO2.parquet` | NO2 concentrations (all sites) |
| `NO2_filtered.parquet` | NO2 (filtered sites) |
| `PM10.parquet` | PM10 concentrations (all sites) |
| `PM10_filtered.parquet` | PM10 (filtered sites) |

### ML-Ready Dataset

| File | Description | Size |
|------|-------------|------|
| `ml_ready_dataset_full_realistic.csv` | Feature-engineered dataset | ~1.6 GB |

## Features (81 columns)

### Metadata

- `Start`: Original timestamp
- `Country`: Country code (AT, BE, ES, FI, FR)
- `SiteNumber`: Station identifier
- `dt_utc`: Timestamp in UTC
- `dt_local`: Timestamp in local timezone

### Site Metadata (Geographic)

- `Latitude`, `Longitude`, `Altitude`

### Station Type (One-Hot Encoded)

- `StationType_background`
- `StationType_industrial`
- `StationType_traffic`

### Station Area (One-Hot Encoded)

- `StationArea_rural`
- `StationArea_rural-nearcity`
- `StationArea_suburban`
- `StationArea_urban`

### Weather Features (Open-Meteo API)

- `temperature_2m`: Air temperature at 2m (°C)
- `relative_humidity_2m`: Relative humidity (%)
- `dew_point_2m`: Dew point temperature (°C)
- `wind_u`: East-west wind component (m/s)
- `wind_v`: North-south wind component (m/s)
- `precipitation`: Hourly precipitation (mm)
- `surface_pressure`: Surface pressure (hPa)

### Target Variable

- `PM2.5`: Current PM2.5 concentration (µg/m³)

### Pollutant Features

- `NO2`: Current NO2 concentration
- `PM10`: Current PM10 concentration

### Temporal Features

- `hour`, `day_of_week`, `day_of_month`, `month`, `year`
- `is_weekend`: Weekend indicator (0/1)
- `season`: Season indicator
- `hour_sin`, `hour_cos`: Cyclical hour encoding
- `month_sin`, `month_cos`: Cyclical month encoding

### Lag Features (1h, 2h, 3h, 6h, 12h, 24h, 168h)

- `PM2.5_lag_1h`, `PM2.5_lag_2h`, `PM2.5_lag_3h`, `PM2.5_lag_6h`, `PM2.5_lag_12h`, `PM2.5_lag_24h`, `PM2.5_lag_168h`
- `NO2_lag_1h`, `NO2_lag_2h`, `NO2_lag_3h`, `NO2_lag_6h`, `NO2_lag_12h`, `NO2_lag_24h`, `NO2_lag_168h`
- `PM10_lag_1h`, `PM10_lag_2h`, `PM10_lag_3h`, `PM10_lag_6h`, `PM10_lag_12h`, `PM10_lag_24h`, `PM10_lag_168h`

### Rolling Mean Features (3h, 6h, 12h, 24h windows)

- `PM2.5_rolling_mean_3h`, `PM2.5_rolling_mean_6h`, `PM2.5_rolling_mean_12h`, `PM2.5_rolling_mean_24h`
- `NO2_rolling_mean_3h`, `NO2_rolling_mean_6h`, `NO2_rolling_mean_12h`, `NO2_rolling_mean_24h`
- `PM10_rolling_mean_3h`, `PM10_rolling_mean_6h`, `PM10_rolling_mean_12h`, `PM10_rolling_mean_24h`

### Rolling Std Features (3h, 6h, 12h, 24h windows)

- `PM2.5_rolling_std_3h`, `PM2.5_rolling_std_6h`, `PM2.5_rolling_std_12h`, `PM2.5_rolling_std_24h`
- `NO2_rolling_std_3h`, `NO2_rolling_std_6h`, `NO2_rolling_std_12h`, `NO2_rolling_std_24h`
- `PM10_rolling_std_3h`, `PM10_rolling_std_6h`, `PM10_rolling_std_12h`, `PM10_rolling_std_24h`

## Data Quality

### Filtering Criteria

Stations included meet these quality thresholds:

- **Train completeness**: ≥50% (2018-2022)
- **Test completeness**: ≥50% (2023-2024)
- **Maximum gap**: ≤168 hours

### Preprocessing

- Sentinel values (<0) replaced with NaN
- Time-based lag/rolling features (handles data gaps correctly)
- Weather data merged by nearest hour
- Local timezone conversion for temporal features
- **No missing values** in final dataset

## Stations by Country

| Country | City | Stations |
|---------|------|----------|
| AT | Wien | 10 |
| BE | Antwerpen | 8 |
| ES | Madrid | 9 |
| FI | Helsinki | 5 |
| FR | Paris | 6 |

## Usage

### Load with Pandas

```python
import pandas as pd

# Load ML-ready dataset
df = pd.read_csv("ml_ready_dataset_full_realistic.csv")

# Train/test split (temporal)
train = df[df['dt_utc'] < '2023-01-01']
test = df[df['dt_utc'] >= '2023-01-01']
```

### Load with Hugging Face Datasets

```python
from datasets import load_dataset

dataset = load_dataset("cosuleabianca/eea-pm25-dataset")
```

### Load Raw Parquet Files

```python
import pandas as pd

pm25 = pd.read_parquet("PM2.5_filtered.parquet")
no2 = pd.read_parquet("NO2_filtered.parquet")
```

## Train/Test Split

| Split | Period | Purpose |
|-------|--------|---------|
| Train | 2018-01-08 to 2022-12-31 | Model training |
| Test | 2023-01-01 to 2024-12-31 | Evaluation |

This temporal split simulates real-world forecasting scenarios.

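The Preprocessing notes mention time-based lag and rolling features that handle data gaps correctly. The card does not show how this is done, so the following is only a minimal sketch of one possible approach, not the dataset author's pipeline. The function name `add_gap_aware_features`, its parameters, and the choice to exclude the current hour from the rolling window (`closed="left"`) are assumptions made for illustration. It assumes one row per (`SiteNumber`, `dt_utc`) and a datetime-typed `dt_utc` column.

```python
import pandas as pd


def add_gap_aware_features(df, col="PM2.5", lags_h=(1, 24), windows_h=(3, 24)):
    """Add per-station lag and rolling-mean columns keyed on timestamps.

    Hypothetical helper: assumes one row per (SiteNumber, dt_utc) and a
    datetime64 `dt_utc` column.
    """
    out = df.sort_values(["SiteNumber", "dt_utc"]).reset_index(drop=True)

    # Lags: move each observation forward by the lag and join it back on the
    # exact timestamp, so a missing hour yields NaN instead of a stale value
    # (a row-based shift would silently reach across the gap).
    for lag in lags_h:
        past = out[["SiteNumber", "dt_utc", col]].copy()
        past["dt_utc"] = past["dt_utc"] + pd.Timedelta(hours=lag)
        past = past.rename(columns={col: f"{col}_lag_{lag}h"})
        out = out.merge(past, on=["SiteNumber", "dt_utc"], how="left")

    # Rolling means: a time-based window covers a fixed duration rather than
    # a fixed number of rows. closed="left" (an assumption) keeps only
    # observations strictly before the current timestamp.
    parts = []
    for _, group in out.groupby("SiteNumber", sort=False):
        group = group.copy()
        series = group.set_index("dt_utc")[col]
        for w in windows_h:
            rolled = series.rolling(pd.Timedelta(hours=w), closed="left").mean()
            group[f"{col}_rolling_mean_{w}h"] = rolled.to_numpy()
        parts.append(group)
    return pd.concat(parts, ignore_index=True)


# Toy example: one station with the 03:00 observation missing.
demo = pd.DataFrame({
    "SiteNumber": [1, 1, 1, 1],
    "dt_utc": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-01 02:00", "2023-01-01 04:00",
    ]),
    "PM2.5": [10.0, 20.0, 30.0, 50.0],
})
features = add_gap_aware_features(demo, lags_h=(1,), windows_h=(3,))
```

In this toy example the 04:00 row gets a NaN one-hour lag because 03:00 is absent, whereas a plain `shift(1)` would have reused the 02:00 value.
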
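The Filtering Criteria under Data Quality (completeness ≥50% and a maximum gap of ≤168 hours) can be illustrated with a small check on one station's timestamps. This is a hedged sketch, not the logic of the repository's filtering script: `passes_quality` and its parameters are hypothetical, completeness is assumed to mean the share of expected hourly timestamps that are present, and a gap is assumed to mean the spacing between consecutive observations. Per the card, such a check would be applied separately to the train period (2018-2022) and the test period (2023-2024).

```python
import pandas as pd


def passes_quality(timestamps, start, end, min_completeness=0.5, max_gap_hours=168):
    """Check one station's valid-observation timestamps against the thresholds.

    Hypothetical helper; the definitions of completeness and gap are assumptions.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    ts = pd.DatetimeIndex(timestamps).unique().sort_values()
    ts = ts[(ts >= start) & (ts <= end)]
    if len(ts) == 0:
        return False

    # Share of the expected hourly timestamps that are actually present.
    expected_hours = int((end - start) / pd.Timedelta(hours=1)) + 1
    completeness = len(ts) / expected_hours

    # Largest spacing between consecutive observations, in hours.
    if len(ts) > 1:
        max_gap = (ts[1:] - ts[:-1]).max() / pd.Timedelta(hours=1)
    else:
        max_gap = 0.0

    return bool(completeness >= min_completeness and max_gap <= max_gap_hours)


# Toy example over a 20-day period (480 expected hours).
start, end = "2023-01-01 00:00", "2023-01-20 23:00"
full = pd.date_range(start, end, freq=pd.Timedelta(hours=1))
sparse = full[::3]                       # about 33% complete
gappy = full[:100].append(full[300:])    # about 58% complete, one 201-hour gap

full_ok = passes_quality(full, start, end)
sparse_ok = passes_quality(sparse, start, end)
gappy_ok = passes_quality(gappy, start, end)
```

Here only the fully populated series would pass: the sparse one fails on completeness and the gappy one fails on the maximum-gap threshold.
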
## Regenerating the Dataset

If you prefer to regenerate from raw EEA data:

```bash
# Clone the repository
git clone https://github.com/CosuleaBianca/eea-pm25
cd eea-pm25

# Install dependencies
pip install -r requirements.txt

# Run data pipeline
python dataset_build/src/download_pollutants.py
python dataset_build/src/filter_pm25_sites.py
python dataset_build/src/process_data.py
python dataset_build/src/prepare_ml_dataset.py
python dataset_build/src/coverage_only_v6.py
python dataset_build/src/dataset_full_realistic_v6.py
```

## Citation

If you use this dataset, please cite:

```bibtex
@misc{eea-pm25-dataset,
  author = {Chisilev Bianca-Iuliana},
  title = {EEA PM2.5 Air Quality Dataset for Europe},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/cosuleabianca/eea-pm25-dataset}
}
```

## Links

- **GitHub Repository**: [Github repository](https://github.com/CosuleaBianca/eea-pm25)
- **Pre-trained Models**: [Models](https://huggingface.co/cosuleabianca/eea-pm25)

## License

CC BY 4.0 - You are free to share and adapt, with attribution.

Original data from the European Environment Agency is provided under the EEA standard reuse policy.

Provider:
cosuleabianca