
claysheaff/enviolations

Source: Hugging Face | Updated: 2026-04-27 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/claysheaff/enviolations
Resource overview:
---
license: cc0-1.0
language:
- en
pretty_name: Enviolations — US Environmental Compliance Data
tags:
- environmental
- epa
- compliance
- violations
- government-data
- regulatory
- united-states
size_categories:
- 10M<n<100M
configs:
- config_name: facilities
  data_files: data/facilities.parquet
- config_name: violations
  data_files: data/violations.parquet
- config_name: facility_scores
  data_files: data/facility_scores.parquet
- config_name: facility_matches
  data_files: data/facility_matches.parquet
- config_name: unified_facilities
  data_files: data/unified_facilities.parquet
---

# Enviolations — US Environmental Compliance Data

A snapshot of normalized US environmental compliance data aggregated from federal EPA programs (ECHO, RCRA, CAA, SDWA, SEMS, UCMR, PFAS) and 50+ state environmental agencies. ~7.3M facilities, ~1.1M violations, with cross-source entity resolution and risk scoring.

> **Not for regulated environmental due diligence (Phase I ESAs), compliance attestations, lending, insurance, or any decision with material legal or financial consequences.** See [Disclaimer](#disclaimer) at the bottom.

**Snapshot date**: March 23, 2026. This is a frozen snapshot, and government data drifts. For fresh data, run the source code at https://github.com/csheaff/enviolations.

## Source code

The Python pipeline that produced this dataset is open-sourced at https://github.com/csheaff/enviolations (MIT). Anyone can re-run it to refresh the data or add new sources.

## What's in the dataset

Five Parquet files, each loadable as a separate config:

| Config | Rows | Description |
|---|---|---|
| `facilities` | ~7.3M | Master facility list. One row per (source, source_id) pair. 74 sources covering all 50 states + DC. |
| `violations` | ~1.1M | Violation and enforcement records linked to facilities. Includes type, date, severity, description. |
| `facility_scores` | ~4.1M | Computed risk score (0–100, where 0 = clean, 100 = worst) per facility. Includes violation count, NAICS tier, program count. |
| `facility_matches` | ~7.3M | Cross-source entity-resolution clusters. Each row maps a (source, source_id) to a `canonical_id` shared across matched duplicates. 4 match tiers: address → geo → name → fuzzy. |
| `unified_facilities` | ~4.1M | Materialized cross-source merged view. One row per resolved entity (deduplicated across sources). |

```python
from datasets import load_dataset

# Load just the facility scores
scores = load_dataset("claysheaff/enviolations", "facility_scores")

# Or all of them, one named config at a time (the card declares no default config)
config_names = ["facilities", "violations", "facility_scores", "facility_matches", "unified_facilities"]
ds = {name: load_dataset("claysheaff/enviolations", name) for name in config_names}
```

## Coverage

74 sources across:

- **EPA programs**: ECHO, RCRA (hazardous waste), CAA (air), SDWA (drinking water), SEMS (Superfund), UCMR (unregulated contaminants), PFAS
- **All 50 states + DC**: ArcGIS, Socrata, REST, GIS, and CSV-export upstream patterns
- **PFAS-specific**: dedicated EPA + IL/MI/NJ/OH/WI PFAS layers

Top-10 sources by facility count:

| Source | Facilities |
|---|---|
| nm_nmed (NM permits) | 2.16M |
| epa_echo | 1.33M |
| tceq (TX) | 737K |
| epa_rcra | 570K |
| nj_dep | 335K |
| epa_pfas | 203K |
| mn_pca | 189K |
| epa_caa | 183K |
| epa_sdwa | 142K |
| va_deq | 111K |

## Score convention

`facility_scores.score`: **0 = clean (low risk), 100 = worst (high risk).** Matches industry standards (EPA HRS, HUD NSPIRE). Do not invert it.

## Known limitations

These are inherent to the upstream government source data. Plan around them; they cannot be fixed in post-processing.

- **Name truncation at 30 characters.** EPA ECHO and many state APIs cap facility names. The full name doesn't exist in the upstream data.
- **Violation date gaps.** Some sources (notably Indiana spill records) have ~95% null `violation_date`. The dates don't exist in the upstream API.
- **Sentinel/centroid coordinates.** ~68K EPA facilities use county or state centroid coordinates instead of real GPS. Mitigated by sentinel detection in the pipeline, but the underlying coordinates can't be improved without a separate geocoding source.
- **Source-specific gotchas** (e.g. PA storage-tank "1975" in city field, NJ PFAS missing zip codes) are documented in the source code repo's README.

## Source data licensing

The underlying data is published by US federal and state government agencies. Federal government data is generally public domain (17 U.S.C. § 105); state data licenses vary but are almost universally permissive when published as open data.

This aggregated dataset is released under **CC0** (public domain): no attribution required, but a link back is appreciated.

## Caveats

- This is a **frozen snapshot**, not a live feed. Government data drifts over time (new violations, retired facilities, agency reorganizations). For fresh data, run the source pipeline.
- The South Carolina connector reflects the 2024 SCDHEC → SC DES reorganization. Pre-reorg data was already ingested before that endpoint was retired.
- Maintenance is best-effort. Issues and PRs on the source code repo may sit a while.

## Disclaimer

This dataset is provided **AS IS**, without warranty of any kind, express or implied. The data was aggregated from public US federal and state environmental agency sources; transcription, geocoding, normalization, and entity-resolution errors are possible and have been observed.

**Do NOT use this for:**

- ASTM E1527-21 Phase I Environmental Site Assessments or any other regulated environmental due-diligence work product
- Regulatory compliance attestations or filings
- Lending decisions, insurance underwriting, or actuarial analysis
- Legal proceedings, real estate transactions, or any decision with material legal or financial consequences

For authoritative facility records, **contact the source agency directly** (EPA via [echo.epa.gov](https://echo.epa.gov/), or the relevant state environmental agency).

**Takedown / correction requests:** open an issue at https://github.com/csheaff/enviolations/issues. Maintenance is best-effort; issues will be reviewed when time allows.

The author makes no representation that any specific facility, violation, or score reflects current compliance status, and disclaims all liability for decisions made in reliance on this dataset.

## Citation

If you use this data, no citation is required (CC0). A link back to https://huggingface.co/datasets/claysheaff/enviolations or the source repo is appreciated.
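
As a minimal sketch of how the `facility_matches` clusters described under "What's in the dataset" might be consumed, the snippet below groups (source, source_id) pairs by their shared `canonical_id`. The column names `source`, `source_id`, and `canonical_id` come from the table descriptions above; the sample rows, the helper name `cluster_by_canonical_id`, and the list-of-dicts layout are illustrative assumptions, not the dataset's confirmed schema or the pipeline's own code.

```python
from collections import defaultdict

# Hypothetical sample rows for illustration only; real rows would come from
# the `facility_matches` config and may carry additional columns.
sample_matches = [
    {"source": "epa_echo", "source_id": "A1", "canonical_id": "c-001"},
    {"source": "tceq", "source_id": "T9", "canonical_id": "c-001"},
    {"source": "nj_dep", "source_id": "N4", "canonical_id": "c-002"},
]


def cluster_by_canonical_id(rows):
    """Group (source, source_id) pairs that share a canonical_id."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["canonical_id"]].append((row["source"], row["source_id"]))
    return dict(clusters)


clusters = cluster_by_canonical_id(sample_matches)
for canonical_id, members in clusters.items():
    # Each cluster lists the source records resolved to one entity.
    print(canonical_id, members)
```

Clusters with more than one member are the cross-source duplicates; under the description above, each cluster would presumably correspond to one row of `unified_facilities`, though that link is not verified here.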
Provider:
claysheaff