
cassini-team-todo/eea-waterbase

Hugging Face · Updated 2026-04-24 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/cassini-team-todo/eea-waterbase
Description:
---
pretty_name: EEA Waterbase (WISE-4) v2018.1
license: other
license_name: eea-reuse-policy
license_link: https://www.eea.europa.eu/en/legal-notice
language:
- en
tags:
- water-quality
- environmental
- eea
- wise
- eu
- hydrology
- biology-eqr
- copernicus-adjacent
size_categories:
- 10M<n<100M
configs:
- config_name: disaggregated
  data_files: Waterbase_v2018_1_T_WISE4_DisaggregatedData.parquet
- config_name: aggregated
  data_files: Waterbase_v2018_1_T_WISE4_AggregatedData.parquet
- config_name: aggregated_by_waterbody
  data_files: Waterbase_v2018_1_T_WISE4_AggregatedDataByWaterBody.parquet
- config_name: biology_eqr
  data_files: Waterbase_v2018_1_T_WISE4_BiologyEQRData.parquet
- config_name: biology_eqr_classification
  data_files: Waterbase_v2018_1_T_WISE4_BiologyEQRClassificationProcedure.parquet
- config_name: monitoring_sites
  data_files: Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet
---

# EEA Waterbase (WISE-4) v2018.1

Mirror of the European Environment Agency's **Waterbase – Water Quality ICM** (WISE-4) tabular release, version 2018.1. Contains station-level and water-body-level measurements of chemical and biological determinands in European surface and ground waters, plus the monitoring-site registry with coordinates.

Uploaded here for convenient team access during the **11th CASSINI Hackathon – EU Space for Water**. This is a redistribution of the original EEA CSVs, converted to Parquet (snappy compression) for faster loading and type preservation. See Source & Licensing below.
## Files

| File | Size | Rows | What it is |
|---|---|---|---|
| `Waterbase_v2018_1_T_WISE4_DisaggregatedData.parquet` | 331 MB | 33,848,578 | Per-sample measurements (one row per sampling date) |
| `Waterbase_v2018_1_T_WISE4_AggregatedData.parquet` | 82 MB | 3,211,183 | Per-site yearly aggregates (min/mean/max/median/stddev, LOQ flags) |
| `Waterbase_v2018_1_T_WISE4_AggregatedDataByWaterBody.parquet` | 507 KB | 20,251 | Same aggregates rolled up per water body, with per-class site counts |
| `Waterbase_v2018_1_T_WISE4_BiologyEQRData.parquet` | 587 KB | 29,741 | Biological Ecological Quality Ratio results per monitoring site |
| `Waterbase_v2018_1_T_WISE4_BiologyEQRClassificationProcedure.parquet` | 63 KB | 2,553 | EQR classification boundary values (country × water-body type × determinand) |
| `Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet` | 1.56 MB | 56,464 | Monitoring-site registry: IDs, water-body link, `lon`/`lat`, confidentiality flag |

Total: ~416 MB across all six files (vs. ~10 GB as CSV).

## Schema highlights

### Measurement tables (`Disaggregated`, `Aggregated`, `AggregatedDataByWaterBody`)

Shared keys:

- `monitoringSiteIdentifier` / `waterBodyIdentifier` + `*IdentifierScheme`: join keys
- `parameterWaterBodyCategory`: `RW` (river), `LW` (lake), `GW` (groundwater), `TW`/`CW` (transitional/coastal)
- `observedPropertyDeterminandCode`: typically CAS codes, e.g. `CAS_7440-38-2` (arsenic), `CAS_14797-55-8` (nitrate)
- `procedureAnalysedFraction`, `procedureAnalysedMedia`, `resultUom` (e.g. `ug/L`, `mg{NO3}/L`)
- `procedureLOQValue`: limit of quantification; paired `resultQuality*BelowLOQ` flags

**Disaggregated** adds `phenomenonTimeSamplingDate` + `resultObservedValue`.

**Aggregated** / **AggregatedByWaterBody** add `phenomenonTimeReferenceYear`, `parameterSamplingPeriod`, `resultNumberOfSamples`, and min/mean/max/median/stddev columns.

**AggregatedByWaterBody** additionally provides `resultNumberOfSitesClass1..5`.
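The shared keys above lend themselves to simple filters. The following is a minimal sketch, not part of the original release: it builds a tiny made-up DataFrame using the Disaggregated column names listed above, keeps one water-body category and one determinand code, and averages the observed values per site. The site IDs, the values and the helper name `select_determinand` are hypothetical.

```python
import pandas as pd

# Made-up rows that reuse the Disaggregated column names from this card.
samples = pd.DataFrame(
    {
        "monitoringSiteIdentifier": ["SITE_A", "SITE_A", "SITE_B", "SITE_C"],
        "parameterWaterBodyCategory": ["RW", "RW", "LW", "RW"],
        "observedPropertyDeterminandCode": [
            "CAS_14797-55-8",  # nitrate
            "CAS_14797-55-8",
            "CAS_14797-55-8",
            "CAS_7440-38-2",  # arsenic
        ],
        "resultObservedValue": [10.0, 14.0, 3.0, 0.5],
        "resultUom": ["mg{NO3}/L", "mg{NO3}/L", "mg{NO3}/L", "ug/L"],
    }
)


def select_determinand(df, category, code):
    """Keep rows for one water-body category and one determinand code."""
    mask = (df["parameterWaterBodyCategory"] == category) & (
        df["observedPropertyDeterminandCode"] == code
    )
    return df.loc[mask]


river_nitrate = select_determinand(samples, "RW", "CAS_14797-55-8")

# Group by unit as well, so values reported in different units are never averaged together.
site_means = (
    river_nitrate.groupby(["monitoringSiteIdentifier", "resultUom"])["resultObservedValue"]
    .mean()
    .reset_index()
)
print(site_means)
```

On the real 34M-row table the same filter is cheaper if only the needed columns are read first, as shown in the Usage section.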
### Biology EQR tables

- `observedPropertyDeterminandBiologyEQRCode`: `EEA_*` codes (instead of CAS)
- `resultEcologicalStatusClassValue`, `resultEQRValue`, `resultNormalisedEQRValue`
- Classification procedure table gives boundary values for classes 1/2, 2/3, 3/4, 4/5 per country and water-body type

### Monitoring sites

`monitoringSiteIdentifier`, `waterBodyIdentifier`, `confidentialityStatus`, `lon`, `lat`. Join on `monitoringSiteIdentifier` to put any measurement on a map.

## Usage

### With `datasets`

```python
from datasets import load_dataset

# Small tables: fine to load fully
sites = load_dataset("cassini-team-todo/eea-waterbase", "monitoring_sites", split="train")
eqr = load_dataset("cassini-team-todo/eea-waterbase", "biology_eqr", split="train")

# Large table: stream to avoid materialising all ~34M rows in memory
disagg = load_dataset(
    "cassini-team-todo/eea-waterbase",
    "disaggregated",
    split="train",
    streaming=True,
)
for row in disagg.take(5):
    print(row)
```

### With pandas / pyarrow (direct file access)

```python
import pandas as pd

sites = pd.read_parquet("Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet")
agg = pd.read_parquet("Waterbase_v2018_1_T_WISE4_AggregatedData.parquet")

# Read only the columns you need from the ~34M-row disaggregated file
cols = [
    "monitoringSiteIdentifier",
    "observedPropertyDeterminandCode",
    "phenomenonTimeSamplingDate",
    "resultObservedValue",
    "resultUom",
]
disagg = pd.read_parquet(
    "Waterbase_v2018_1_T_WISE4_DisaggregatedData.parquet",
    columns=cols,
)
```

### Joining measurements to coordinates

```python
import pandas as pd

sites = pd.read_parquet("Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet")
agg = pd.read_parquet("Waterbase_v2018_1_T_WISE4_AggregatedData.parquet")

# Left join keeps every measurement row, even if its site has no registry entry
geo = agg.merge(
    sites[["monitoringSiteIdentifier", "lon", "lat"]],
    on="monitoringSiteIdentifier",
    how="left",
)
```

## Known quirks

- Numeric columns use `.` as decimal separator; missing values are null.
- `parameterSamplingPeriod` is an ISO-interval-ish string (`2012-01--2012-12`), not a proper date.
- `phenomenonTimeSamplingDate` (Disaggregated) and `metadata_beginLifeSpanVersion` are proper timestamps in Parquet.
- LOQ handling is explicit: a `resultQualityMeanBelowLOQ = 1` flag means the reported mean is a substitution, not a direct measurement.
- Some `metadata_observationStatus = U` rows carry `QC_LEGACY_*` remarks; filter if you want only `A` (accepted) records.

## Source & Licensing

- **Publisher:** European Environment Agency (EEA)
- **Original URL:** https://www.eea.europa.eu/en/datahub (search "Waterbase – Water Quality ICM"); product discovery page: https://discomap.eea.europa.eu/
- **Version:** v2018.1, published 2018-04-05
- **Format:** Original CSVs were converted to Parquet (snappy compression) with `pyarrow`. No rows were filtered or modified; the UTF-8 BOM on the first column header was stripped. Schema matches the original 1:1.
- **Reuse:** Governed by the EEA legal notice (https://www.eea.europa.eu/en/legal-notice), which authorises reuse with attribution. **Users of this mirror must comply with the EEA's terms.** We are only redistributing for hackathon convenience and claim no additional rights.

### Attribution

> Source: European Environment Agency, Waterbase – Water Quality ICM (WISE-4), version 2018.1.

## Citation

```
European Environment Agency (2018). Waterbase – Water Quality ICM (WISE-4), v2018.1. https://www.eea.europa.eu/en/datahub
```

## Project context

Part of the [11th CASSINI Hackathon – EU Space for Water](https://taikai.network/cassinihackathons/hackathons/space-for-water). Combined with Copernicus Sentinel-2/3 observations, this in-situ record serves as ground truth for satellite-derived water-quality products.
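The items listed under Known quirks can be handled with a few lines of pandas. The sketch below is illustrative only and runs on made-up rows: it assumes every `parameterSamplingPeriod` value follows the `YYYY-MM--YYYY-MM` shape shown in the example, and that the LOQ flag compares equal to `1` when set. The helper `split_sampling_period` and the output columns `periodStart` / `periodEnd` are hypothetical names, not part of the dataset.

```python
import pandas as pd


def split_sampling_period(period):
    """Split a 'YYYY-MM--YYYY-MM' string into (start, end) monthly periods."""
    start, end = period.split("--")
    return pd.Period(start, freq="M"), pd.Period(end, freq="M")


# Made-up rows that reuse the Aggregated column names mentioned in this card.
agg = pd.DataFrame(
    {
        "monitoringSiteIdentifier": ["SITE_A", "SITE_B", "SITE_C"],
        "parameterSamplingPeriod": ["2012-01--2012-12"] * 3,
        "resultQualityMeanBelowLOQ": [0, 1, 0],
        "metadata_observationStatus": ["A", "A", "U"],
    }
)

# Parse the interval string into two usable period columns (hypothetical names).
periods = agg["parameterSamplingPeriod"].map(split_sampling_period)
agg["periodStart"] = periods.map(lambda p: p[0])
agg["periodEnd"] = periods.map(lambda p: p[1])

# Keep accepted records whose mean is a direct measurement, not an LOQ substitution.
clean = agg[
    (agg["resultQualityMeanBelowLOQ"] != 1)
    & (agg["metadata_observationStatus"] == "A")
]
print(clean[["monitoringSiteIdentifier", "periodStart", "periodEnd"]])
```

Whether to drop or keep below-LOQ means depends on the analysis; dropping them is only one possible choice.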
Provider:
cassini-team-todo