claysheaff/enviolations
Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/claysheaff/enviolations
Resource description:
---
license: cc0-1.0
language:
- en
pretty_name: Enviolations — US Environmental Compliance Data
tags:
- environmental
- epa
- compliance
- violations
- government-data
- regulatory
- united-states
size_categories:
- 10M<n<100M
configs:
- config_name: facilities
  data_files: data/facilities.parquet
- config_name: violations
  data_files: data/violations.parquet
- config_name: facility_scores
  data_files: data/facility_scores.parquet
- config_name: facility_matches
  data_files: data/facility_matches.parquet
- config_name: unified_facilities
  data_files: data/unified_facilities.parquet
---
# Enviolations — US Environmental Compliance Data
A snapshot of normalized US environmental compliance data aggregated from federal EPA programs (ECHO, RCRA, CAA, SDWA, SEMS, UCMR, PFAS) and 50+ state environmental agencies. It covers ~7.3M facilities and ~1.1M violations, with cross-source entity resolution and risk scoring.
> **Not for regulated environmental due diligence (Phase I ESAs), compliance attestations, lending, insurance, or any decision with material legal or financial consequences.** See [Disclaimer](#disclaimer) at the bottom.
**Snapshot date**: March 23, 2026. This is a frozen snapshot — government data drifts. For fresh data, run the source code at https://github.com/csheaff/enviolations.
## Source code
The Python pipeline that produced this dataset is open-sourced at https://github.com/csheaff/enviolations (MIT). Anyone can re-run it to refresh the data or add new sources.
## What's in the dataset
Five Parquet files, each loadable as a separate config:
| Config | Rows | Description |
|---|---|---|
| `facilities` | ~7.3M | Master facility list. One row per (source, source_id) pair. 74 sources covering all 50 states + DC. |
| `violations` | ~1.1M | Violation and enforcement records linked to facilities. Includes type, date, severity, description. |
| `facility_scores` | ~4.1M | Computed risk score (0–100, where 0 = clean, 100 = worst) per facility. Includes violation count, NAICS tier, program count. |
| `facility_matches` | ~7.3M | Cross-source entity-resolution clusters. Each row maps a (source, source_id) to a `canonical_id` shared across matched duplicates. 4 match tiers: address → geo → name → fuzzy. |
| `unified_facilities` | ~4.1M | Materialized cross-source merged view. One row per resolved entity (deduplicated across sources). |
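```python
from datasets import load_dataset

# Load just the facility scores
scores = load_dataset("claysheaff/enviolations", "facility_scores")

# Or all five. No default config is declared, so request each one by name.
configs = ["facilities", "violations", "facility_scores", "facility_matches", "unified_facilities"]
ds = {name: load_dataset("claysheaff/enviolations", name) for name in configs}
```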
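The `facility_matches` config maps each (source, source_id) pair to a `canonical_id`, so matched duplicates can be grouped without re-running entity resolution. The following is a minimal sketch on hand-made rows. The `source`, `source_id`, and `canonical_id` names come from the table above, but the sample values and the `group_by_canonical` helper are hypothetical, and the real Parquet schema may contain additional columns.

```python
from collections import defaultdict

# Toy rows shaped like facility_matches; all values are invented for illustration.
matches = [
    {"source": "epa_echo", "source_id": "A1", "canonical_id": "c-001"},
    {"source": "tceq", "source_id": "T9", "canonical_id": "c-001"},
    {"source": "epa_rcra", "source_id": "R4", "canonical_id": "c-002"},
]


def group_by_canonical(rows):
    """Collect the (source, source_id) pairs that share a canonical_id."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["canonical_id"]].append((row["source"], row["source_id"]))
    return dict(clusters)


clusters = group_by_canonical(matches)
# Clusters with more than one member are records matched across sources.
duplicates = {cid: members for cid, members in clusters.items() if len(members) > 1}
print(duplicates)
```

The same grouping applies to the real table once it is loaded, provided the column names match the description.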
## Coverage
74 sources across:
- **EPA programs**: ECHO, RCRA (hazardous waste), CAA (air), SDWA (drinking water), SEMS (Superfund), UCMR (unregulated contaminants), PFAS
- **All 50 states + DC**: ArcGIS, Socrata, REST, GIS, and CSV-export upstream patterns
- **PFAS-specific**: dedicated EPA + IL/MI/NJ/OH/WI PFAS layers
Top-10 sources by facility count:
| Source | Facilities |
|---|---|
| nm_nmed (NM permits) | 2.16M |
| epa_echo | 1.33M |
| tceq (TX) | 737K |
| epa_rcra | 570K |
| nj_dep | 335K |
| epa_pfas | 203K |
| mn_pca | 189K |
| epa_caa | 183K |
| epa_sdwa | 142K |
| va_deq | 111K |
## Score convention
`facility_scores.score`: **0 = clean (low risk), 100 = worst (high risk).** Matches industry standards (EPA HRS, HUD NSPIRE). Do not invert it.
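Because higher means worse, any thresholding should treat large values as the risky end of the scale. A small sketch, assuming hypothetical band cut-offs (25, 50, 75) and a hypothetical `risk_band` helper, neither of which is part of the dataset:

```python
def risk_band(score: float) -> str:
    """Map a 0-100 score to a coarse label; higher scores mean higher risk.

    The cut-offs are hypothetical and chosen only for illustration.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 75:
        return "severe"
    if score >= 50:
        return "elevated"
    if score >= 25:
        return "moderate"
    return "low"


for s in (0, 30, 100):
    print(s, risk_band(s))
```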
## Known limitations
These are inherent to the upstream government source data. Plan around them; they cannot be fixed in post-processing.
- **Name truncation at 30 characters.** EPA ECHO and many state APIs cap facility names. The full name doesn't exist in the upstream data.
- **Violation date gaps.** Some sources (notably Indiana spill records) have ~95% null `violation_date`. The dates don't exist in the upstream API.
- **Sentinel/centroid coordinates.** ~68K EPA facilities use county or state centroid coordinates instead of real GPS. Mitigated by sentinel detection in the pipeline, but the underlying coordinates can't be improved without a separate geocoding source.
- **Source-specific gotchas** (e.g. PA storage-tank "1975" in city field, NJ PFAS missing zip codes) are documented in the source code repo's README.
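Downstream code can at least flag records affected by these limitations rather than silently trusting them. The sketch below works on plain dictionaries. `violation_date` is the column named above, while `name`, `latitude`, `longitude`, the sample centroid, and the `quality_flags` helper are assumptions made for illustration and may not match the real schema.

```python
NAME_CAP = 30  # truncation length stated in the limitations above

# Hypothetical sentinel list; the pipeline's own sentinel detection is not reproduced here.
KNOWN_CENTROIDS = {(39.8283, -98.5795)}


def quality_flags(record: dict) -> list[str]:
    """Return the limitation flags that may apply to one record."""
    flags = []
    name = record.get("name") or ""
    if len(name) >= NAME_CAP:
        # A name at the cap may have been cut off upstream.
        flags.append("name_possibly_truncated")
    if record.get("violation_date") is None:
        flags.append("missing_violation_date")
    coords = (record.get("latitude"), record.get("longitude"))
    if coords in KNOWN_CENTROIDS:
        flags.append("centroid_coordinates")
    return flags


sample = {
    "name": "ACME CHEMICAL MANUFACTURING PL",  # exactly 30 characters
    "violation_date": None,
    "latitude": 39.8283,
    "longitude": -98.5795,
}
print(quality_flags(sample))
```

Flagging instead of dropping keeps the affected rows available for analyses that do not depend on the unreliable field.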
## Source data licensing
The underlying data is published by US federal and state government agencies. Federal government data is generally public domain (17 U.S.C. § 105); state data licenses vary but are almost universally permissive when published as open data. This aggregated dataset is released under **CC0** (public domain) — no attribution required, but a link back is appreciated.
## Caveats
- This is a **frozen snapshot**, not a live feed. Government data drifts over time (new violations, retired facilities, agency reorganizations). For fresh data, run the source pipeline.
- The South Carolina connector reflects the 2024 SCDHEC → SC DES reorganization. Pre-reorganization data was ingested before the old endpoint was retired.
- Maintenance is best-effort. Issues and PRs on the source code repo may sit a while.
## Disclaimer
This dataset is provided **AS IS**, without warranty of any kind, express or implied. The data was aggregated from public US federal and state environmental agency sources; transcription, geocoding, normalization, and entity-resolution errors are possible and have been observed.
**Do NOT use this for:**
- ASTM E1527-21 Phase I Environmental Site Assessments or any other regulated environmental due-diligence work product
- Regulatory compliance attestations or filings
- Lending decisions, insurance underwriting, or actuarial analysis
- Legal proceedings, real estate transactions, or any decision with material legal or financial consequences
For authoritative facility records, **contact the source agency directly** (EPA via [echo.epa.gov](https://echo.epa.gov/), or the relevant state environmental agency).
**Takedown / correction requests:** open an issue at https://github.com/csheaff/enviolations/issues. Maintenance is best-effort; issues will be reviewed when time allows.
The author makes no representation that any specific facility, violation, or score reflects current compliance status, and disclaims all liability for decisions made in reliance on this dataset.
## Citation
If you use this data, no citation is required (CC0). A link back to https://huggingface.co/datasets/claysheaff/enviolations or the source repo is appreciated.
Provider:
claysheaff



