cassini-team-todo/eea-waterbase
Source: Hugging Face. Updated 2026-04-24; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/cassini-team-todo/eea-waterbase
Resource description:
---
pretty_name: EEA Waterbase (WISE-4) v2018.1
license: other
license_name: eea-reuse-policy
license_link: https://www.eea.europa.eu/en/legal-notice
language:
- en
tags:
- water-quality
- environmental
- eea
- wise
- eu
- hydrology
- biology-eqr
- copernicus-adjacent
size_categories:
- 10M<n<100M
configs:
- config_name: disaggregated
  data_files: Waterbase_v2018_1_T_WISE4_DisaggregatedData.parquet
- config_name: aggregated
  data_files: Waterbase_v2018_1_T_WISE4_AggregatedData.parquet
- config_name: aggregated_by_waterbody
  data_files: Waterbase_v2018_1_T_WISE4_AggregatedDataByWaterBody.parquet
- config_name: biology_eqr
  data_files: Waterbase_v2018_1_T_WISE4_BiologyEQRData.parquet
- config_name: biology_eqr_classification
  data_files: Waterbase_v2018_1_T_WISE4_BiologyEQRClassificationProcedure.parquet
- config_name: monitoring_sites
  data_files: Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet
---
# EEA Waterbase (WISE-4) — v2018.1
Mirror of the European Environment Agency's **Waterbase – Water Quality ICM** (WISE-4) tabular release, version 2018.1. Contains station-level and water-body-level measurements of chemical and biological determinands in European surface and ground waters, plus the monitoring-site registry with coordinates.
Uploaded here for convenient team access during the **11th CASSINI Hackathon – EU Space for Water**. This is a redistribution of the original EEA CSVs, converted to Parquet (snappy compression) for faster loading and type preservation. See Source & Licensing below.
## Files
| File | Size | Rows | What it is |
|---|---|---|---|
| `Waterbase_v2018_1_T_WISE4_DisaggregatedData.parquet` | 331 MB | 33,848,578 | Per-sample measurements (one row per sampling date) |
| `Waterbase_v2018_1_T_WISE4_AggregatedData.parquet` | 82 MB | 3,211,183 | Per-site yearly aggregates (min/mean/max/median/stddev, LOQ flags) |
| `Waterbase_v2018_1_T_WISE4_AggregatedDataByWaterBody.parquet` | 507 KB | 20,251 | Same aggregates rolled up per water body, with per-class site counts |
| `Waterbase_v2018_1_T_WISE4_BiologyEQRData.parquet` | 587 KB | 29,741 | Biological Ecological Quality Ratio results per monitoring site |
| `Waterbase_v2018_1_T_WISE4_BiologyEQRClassificationProcedure.parquet` | 63 KB | 2,553 | EQR classification boundary values (country × water-body type × determinand) |
| `Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet` | 1.56 MB | 56,464 | Monitoring-site registry: IDs, water-body link, `lon`/`lat`, confidentiality flag |
Total: ~416 MB across all six files (vs. ~10 GB as CSV).
## Schema highlights
### Measurement tables (`Disaggregated`, `Aggregated`, `AggregatedByWaterBody`)
Shared keys:
- `monitoringSiteIdentifier` / `waterBodyIdentifier` + `*IdentifierScheme` — join keys
- `parameterWaterBodyCategory` — `RW` (river), `LW` (lake), `GW` (groundwater), `TW`/`CW` (transitional/coastal)
- `observedPropertyDeterminandCode` — typically CAS codes, e.g. `CAS_7440-38-2` (arsenic), `CAS_14797-55-8` (nitrate)
- `procedureAnalysedFraction`, `procedureAnalysedMedia`, `resultUom` (e.g. `ug/L`, `mg{NO3}/L`)
- `procedureLOQValue` — limit of quantification; paired `resultQuality*BelowLOQ` flags
**Disaggregated** adds `phenomenonTimeSamplingDate` + `resultObservedValue`.
**Aggregated** / **AggregatedByWaterBody** add `phenomenonTimeReferenceYear`, `parameterSamplingPeriod`, `resultNumberOfSamples`, and min/mean/max/median/stddev columns.
**AggregatedByWaterBody** additionally provides `resultNumberOfSitesClass1..5`.
### Biology EQR tables
- `observedPropertyDeterminandBiologyEQRCode` — `EEA_*` codes (instead of CAS)
- `resultEcologicalStatusClassValue`, `resultEQRValue`, `resultNormalisedEQRValue`
- Classification procedure table gives boundary values for classes 1/2, 2/3, 3/4, 4/5 per country and water-body type
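To illustrate how the four boundary values could be applied, the sketch below assigns a class from 1 to 5 to an EQR value. The function name `classify_eqr`, the example boundary numbers, and the convention that a higher EQR means a better class (class 1) are assumptions made for illustration; take the real boundaries and their column names from the classification procedure table.

```python
from bisect import bisect_right


def classify_eqr(eqr, boundaries):
    """Return a class from 1 (best) to 5 (worst) for an EQR value.

    `boundaries` holds the 1/2, 2/3, 3/4 and 4/5 boundary values in that
    order, so they must be descending. A value that equals a boundary is
    assigned to the better of the two classes.
    """
    if len(boundaries) != 4 or list(boundaries) != sorted(boundaries, reverse=True):
        raise ValueError("expected four descending boundary values")
    ascending = list(boundaries)[::-1]  # 4/5, 3/4, 2/3, 1/2
    # Count how many boundaries the value reaches or exceeds
    reached = bisect_right(ascending, eqr)
    return 5 - reached


# Made-up boundaries, not taken from the dataset
example_boundaries = [0.8, 0.6, 0.4, 0.2]
print(classify_eqr(0.7, example_boundaries))
```

Whether a value exactly on a boundary belongs to the better or the worse class is a choice made here, not something the dataset card specifies.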
### Monitoring sites
Columns: `monitoringSiteIdentifier`, `waterBodyIdentifier`, `confidentialityStatus`, `lon`, `lat`. `monitoringSiteIdentifier` is the join key for placing any measurement on a map.
## Usage
### With `datasets`
```python
from datasets import load_dataset
# Small tables — fine to load fully
sites = load_dataset("cassini-team-todo/eea-waterbase", "monitoring_sites", split="train")
eqr = load_dataset("cassini-team-todo/eea-waterbase", "biology_eqr", split="train")
# Large table — stream to avoid materialising all ~34M rows in memory
disagg = load_dataset(
    "cassini-team-todo/eea-waterbase",
    "disaggregated",
    split="train",
    streaming=True,
)
for row in disagg.take(5):
    print(row)
```
### With pandas / pyarrow (direct file access)
```python
import pandas as pd
sites = pd.read_parquet("Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet")
agg = pd.read_parquet("Waterbase_v2018_1_T_WISE4_AggregatedData.parquet")
# Read only the columns you need from the ~34M-row disaggregated file
cols = ["monitoringSiteIdentifier", "observedPropertyDeterminandCode",
        "phenomenonTimeSamplingDate", "resultObservedValue", "resultUom"]
disagg = pd.read_parquet(
    "Waterbase_v2018_1_T_WISE4_DisaggregatedData.parquet",
    columns=cols,
)
```
### Joining measurements to coordinates
```python
import pandas as pd
sites = pd.read_parquet("Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData.parquet")
agg = pd.read_parquet("Waterbase_v2018_1_T_WISE4_AggregatedData.parquet")
geo = agg.merge(sites[["monitoringSiteIdentifier", "lon", "lat"]],
                on="monitoringSiteIdentifier", how="left")
```
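A left join silently keeps measurements whose site has no registry entry, leaving `lon`/`lat` null. The sketch below, run on invented toy frames, shows one way to surface such rows and to guard against duplicated site IDs. The column name `resultMeanValue` is a placeholder, and whether the real registry has exactly one row per site is an assumption that `validate` would test.

```python
import pandas as pd

# Toy frames with invented values; "S3" has no registry entry
sites = pd.DataFrame({
    "monitoringSiteIdentifier": ["S1", "S2"],
    "lon": [4.35, 12.57],
    "lat": [50.85, 55.68],
})
agg = pd.DataFrame({
    "monitoringSiteIdentifier": ["S1", "S2", "S3"],
    "resultMeanValue": [1.2, 3.4, 5.6],  # placeholder column name
})

geo = agg.merge(
    sites[["monitoringSiteIdentifier", "lon", "lat"]],
    on="monitoringSiteIdentifier",
    how="left",
    validate="many_to_one",  # raises MergeError if a site ID repeats in `sites`
    indicator=True,          # adds a `_merge` column describing each row's origin
)

# Site IDs that appear in the measurements but not in the registry
unmatched = geo.loc[geo["_merge"] == "left_only", "monitoringSiteIdentifier"].unique()
print(list(unmatched))
```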
## Known quirks
- Numeric columns use `.` as decimal separator; missing values are null.
- `parameterSamplingPeriod` is an ISO-interval-ish string (`2012-01--2012-12`), not a proper date.
- `phenomenonTimeSamplingDate` (Disaggregated) and `metadata_beginLifeSpanVersion` are proper timestamps in Parquet.
- LOQ handling is explicit: a `resultQualityMeanBelowLOQ = 1` flag means the reported mean is a substitution, not a direct measurement.
- Some `metadata_observationStatus = U` rows carry `QC_LEGACY_*` remarks — filter if you want only `A` (accepted) records.
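Because `parameterSamplingPeriod` is a string, it has to be split before it can be used in date arithmetic. The helper below is a minimal sketch that handles only the `YYYY-MM--YYYY-MM` pattern shown above; the name `parse_sampling_period` is hypothetical, and other patterns may occur in the data, in which case the function returns `None`.

```python
import re
from datetime import date

# Matches e.g. "2012-01--2012-12"; months are restricted to 01..12
_PERIOD = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])--(\d{4})-(0[1-9]|1[0-2])$")


def parse_sampling_period(period):
    """Return (first day of start month, first day of end month), or None.

    None is returned for any string that does not match 'YYYY-MM--YYYY-MM'.
    """
    match = _PERIOD.match(period)
    if match is None:
        return None
    start_year, start_month, end_year, end_month = map(int, match.groups())
    return date(start_year, start_month, 1), date(end_year, end_month, 1)


print(parse_sampling_period("2012-01--2012-12"))
```

With pandas, the helper can be applied column-wise, for example `agg["parameterSamplingPeriod"].map(parse_sampling_period)`, after null values have been handled.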
## Source & Licensing
- **Publisher:** European Environment Agency (EEA)
- **Original URL:** https://www.eea.europa.eu/en/datahub (search "Waterbase – Water Quality ICM") — product discovery page: https://discomap.eea.europa.eu/
- **Version:** v2018.1, published 2018-04-05
- **Format:** Original CSVs were converted to Parquet (snappy compression) with `pyarrow`. No rows were filtered or modified; the UTF-8 BOM on the first column header was stripped. Schema matches the original 1:1.
- **Reuse:** Governed by the EEA legal notice — https://www.eea.europa.eu/en/legal-notice — which authorises reuse with attribution. **Users of this mirror must comply with the EEA's terms.** We are only redistributing for hackathon convenience and claim no additional rights.
### Attribution
> Source: European Environment Agency, Waterbase – Water Quality ICM (WISE-4), version 2018.1.
## Citation
```
European Environment Agency (2018). Waterbase – Water Quality ICM (WISE-4), v2018.1.
https://www.eea.europa.eu/en/datahub
```
## Project context
Part of the [11th CASSINI Hackathon – EU Space for Water](https://taikai.network/cassinihackathons/hackathons/space-for-water). Combined with Copernicus Sentinel-2/3 observations, this in-situ record serves as ground truth for satellite-derived water-quality products.
Provider:
cassini-team-todo



