sachinkg12/us-county-hazard-features
Source: Hugging Face. Updated 2026-03-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sachinkg12/us-county-hazard-features
---
license: apache-2.0
task_categories:
- tabular-classification
tags:
- disaster-prediction
- fema
- hazard-assessment
- climate
- geospatial
- us-counties
- multi-hazard
- cascade-interactions
pretty_name: US County Multi-Hazard Features for Disaster Declaration Prediction
size_categories:
- 1M<n<10M
---
# US County Multi-Hazard Features Dataset
A curated, ML-ready dataset of **1,014,930 county-month observations** spanning **3,222 US counties** from 2000 to 2026, integrating 7 federal data sources into 42 engineered features for predicting FEMA disaster declarations 90 days in advance.
## Dataset Summary
| Property | Value |
|----------|-------|
| **Rows** | 1,014,930 |
| **Columns** | 50 (42 features + 4 temporal columns + 2 target columns + 2 identifiers) |
| **Counties** | 3,222 (all US counties with available data) |
| **Time span** | 2000-01 to 2026-03 |
| **Granularity** | County-month |
| **Target** | `declaration_next_90d` — binary, FEMA disaster declaration within 90 days |
| **Positive rate** | 11.02% |
| **Format** | Apache Parquet (Snappy compression) |
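
The row count is consistent with a complete panel of one row per county per month. The check below is a minimal sketch that assumes no county-months are missing, which the card does not state explicitly; `months_inclusive` is a hypothetical helper, not part of the dataset tooling.

```python
def months_inclusive(start: str, end: str) -> int:
    """Count calendar months from start to end (both 'YYYY-MM'), inclusive."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


n_months = months_inclusive("2000-01", "2026-03")
n_counties = 3222

# Assumption: every county has a row for every month in the span.
expected_rows = n_counties * n_months
print(n_months, expected_rows)  # 315 1014930
```
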
## Data Sources
This dataset integrates 7 federal data sources — all free, no API keys required:
| Source | What |
|--------|------|
| **FEMA Disaster Declarations** | Every federal disaster declaration by county (1953–present) |
| **USGS Earthquakes** | M2.5+ seismic events with coordinates (1964–present) |
| **NOAA Storm Events** | Tornadoes, floods, hurricanes, hail + casualties/damage (2000–present) |
| **US Census Bureau** | County demographics, housing, economics |
| **US Drought Monitor** | Weekly drought severity by county, D0–D4 (2000–present) |
| **NIFC Wildfires** | Wildfire incidents with acres burned (2000–present) |
| **NFIP Flood Claims** | National Flood Insurance Program claims and payouts (1978–present) |
## Features (42 total)
### FEMA History (7)
Rolling window declaration counts and recency metrics.
| Feature | Type | Description |
|---------|------|-------------|
| `declarations_1yr` | int | Disaster declarations in prior 1 year |
| `declarations_3yr` | int | Disaster declarations in prior 3 years |
| `declarations_5yr` | int | Disaster declarations in prior 5 years |
| `declarations_10yr` | int | Disaster declarations in prior 10 years |
| `months_since_last_decl` | int | Months since most recent declaration (-1 if none) |
| `major_disaster_ratio` | float | Fraction of declarations that were major disasters |
| `ia_program_ratio` | float | Fraction with Individual Assistance programs |
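
The rolling counts and the recency metric can be sketched as follows. This is an illustration, not the pipeline's actual code: it assumes windows are measured in whole months and include only declarations strictly before the observation month, and the function names are hypothetical.

```python
from typing import List


def month_index(year_month: str) -> int:
    """Map 'YYYY-MM' to a running month number so months can be subtracted."""
    year, month = map(int, year_month.split("-"))
    return year * 12 + (month - 1)


def declarations_in_window(decl_months: List[str], obs_month: str, years: int) -> int:
    """Count declarations in the years * 12 months strictly before obs_month."""
    obs = month_index(obs_month)
    return sum(1 for m in decl_months if 0 < obs - month_index(m) <= years * 12)


def months_since_last(decl_months: List[str], obs_month: str) -> int:
    """Months since the most recent prior declaration, or -1 if there is none."""
    obs = month_index(obs_month)
    gaps = [obs - month_index(m) for m in decl_months if month_index(m) < obs]
    return min(gaps) if gaps else -1


# Hypothetical declaration history for one county.
history = ["2011-05", "2017-09", "2019-03", "2019-11"]
counts = {years: declarations_in_window(history, "2020-01", years) for years in (1, 3, 5, 10)}
print(counts)                                  # {1: 2, 3: 3, 5: 3, 10: 4}
print(months_since_last(history, "2020-01"))   # 2
print(months_since_last([], "2020-01"))        # -1
```

Excluding the observation month itself keeps the features free of information from the prediction window.
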
### Storm Events (10)
NOAA severe weather aggregations.
| Feature | Type | Description |
|---------|------|-------------|
| `storm_event_count_1yr` | int | Storm events in prior 1 year |
| `storm_event_count_5yr` | int | Storm events in prior 5 years |
| `storm_deaths_5yr` | int | Storm-related deaths in prior 5 years |
| `storm_injuries_5yr` | int | Storm-related injuries in prior 5 years |
| `storm_property_damage_5yr` | float | Property damage ($) in prior 5 years |
| `storm_crop_damage_5yr` | float | Crop damage ($) in prior 5 years |
| `tornado_count_5yr` | int | Tornado events in prior 5 years |
| `flood_count_5yr` | int | Flood events in prior 5 years |
| `hail_count_5yr` | int | Hail events in prior 5 years |
| `max_tor_f_scale_5yr` | int | Maximum tornado F-scale in prior 5 years |
### Socioeconomic (5)
US Census demographic and economic indicators.
| Feature | Type | Description |
|---------|------|-------------|
| `population` | long | County population |
| `housing_units` | long | Number of housing units |
| `median_home_value` | long | Median home value ($) |
| `population_density` | float | People per square mile |
| `land_area_sq_mi` | float | County land area in square miles |
### Drought (4)
US Drought Monitor severity metrics.
| Feature | Type | Description |
|---------|------|-------------|
| `drought_severity_avg_5yr` | float | Average drought severity score (5yr) |
| `drought_max_severity_5yr` | float | Maximum drought severity score (5yr) |
| `severe_drought_weeks_5yr` | int | Weeks of severe drought (D2+) in 5 years |
| `drought_d4_pct_max_5yr` | float | Peak percentage of county in D4 (exceptional) drought |
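
As a rough sketch of how two of these features could be derived from weekly Drought Monitor records: the card does not say when a week counts as "severe", so the 50% area threshold below is an assumption, as are the input field names `d2_plus_pct` and `d4_pct`.

```python
from typing import Dict, List


def drought_features(weeks: List[Dict[str, float]], d2_threshold_pct: float = 50.0) -> Dict[str, float]:
    """Aggregate weekly drought records for one county over a 5-year window.

    Each record holds the percentage of county area in D2 or worse
    ('d2_plus_pct') and in D4 ('d4_pct'). Both field names are hypothetical.
    """
    # Assumption: a week is "severe" when at least d2_threshold_pct of the
    # county area is in D2 or worse.
    severe_weeks = sum(1 for w in weeks if w["d2_plus_pct"] >= d2_threshold_pct)
    # Peak share of the county in D4; 0.0 when there are no records.
    d4_peak = max((w["d4_pct"] for w in weeks), default=0.0)
    return {"severe_drought_weeks_5yr": severe_weeks, "drought_d4_pct_max_5yr": d4_peak}


# Four hypothetical weekly records.
weekly = [
    {"d2_plus_pct": 10.0, "d4_pct": 0.0},
    {"d2_plus_pct": 60.0, "d4_pct": 5.0},
    {"d2_plus_pct": 80.0, "d4_pct": 12.5},
    {"d2_plus_pct": 40.0, "d4_pct": 0.0},
]
drought = drought_features(weekly)
print(drought)  # {'severe_drought_weeks_5yr': 2, 'drought_d4_pct_max_5yr': 12.5}
```
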
### Wildfire (4)
NIFC wildfire incident metrics.
| Feature | Type | Description |
|---------|------|-------------|
| `wildfire_count_1yr` | int | Wildfire incidents in prior 1 year |
| `wildfire_count_5yr` | int | Wildfire incidents in prior 5 years |
| `wildfire_acres_burned_5yr` | float | Total acres burned in prior 5 years |
| `wildfire_max_acres_5yr` | float | Largest single wildfire (acres) in 5 years |
### NFIP Flood Insurance (3)
National Flood Insurance Program claim patterns.
| Feature | Type | Description |
|---------|------|-------------|
| `nfip_claim_count_5yr` | int | NFIP claims in prior 5 years |
| `nfip_total_payout_5yr` | float | Total NFIP payouts ($) in prior 5 years |
| `nfip_avg_payout_5yr` | float | Average NFIP payout ($) in prior 5 years |
### Spatial (2)
Neighborhood and state-level context.
| Feature | Type | Description |
|---------|------|-------------|
| `neighbor_avg_declarations_5yr` | float | Average 5yr declarations of neighboring counties |
| `state_avg_declarations_5yr` | float | Average 5yr declarations across the state |
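
A minimal sketch of the two spatial averages, assuming a county adjacency mapping is available (the card does not say where adjacency comes from) and that the state is identified by the first two FIPS digits. Whether the state average includes the county itself is also an assumption; here it does.

```python
from statistics import mean
from typing import Dict, List


def neighbor_average(values: Dict[str, float], adjacency: Dict[str, List[str]], fips: str) -> float:
    """Average a per-county value over the counties adjacent to `fips`."""
    neighbors = [n for n in adjacency.get(fips, []) if n in values]
    return mean(values[n] for n in neighbors) if neighbors else 0.0


def state_average(values: Dict[str, float], fips: str) -> float:
    """Average a per-county value over all counties sharing the state prefix."""
    same_state = [v for county, v in values.items() if county[:2] == fips[:2]]
    return mean(same_state)


# Hypothetical 5-year declaration counts and adjacency list.
declarations_5yr = {"01001": 5, "01021": 2, "01047": 6, "01051": 4, "13001": 10}
adjacency = {"01001": ["01021", "01047", "01051"]}

print(neighbor_average(declarations_5yr, adjacency, "01001"))  # 4
print(state_average(declarations_5yr, "01001"))                # 4.25
```
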
### Cascade Interaction Features (7)
**Novel contribution**: Four multiplicative interaction terms capture multi-hazard co-occurrence, and three count features summarize compound storms and active hazard chains.
| Feature | Type | Description |
|---------|------|-------------|
| `cascade_drought_fire_risk` | float | `drought_severity_6mo × log1p(wildfire_acres_1yr)` |
| `cascade_fire_flood_risk` | float | `log1p(burn_scar_acres_18mo) × flood_events_1yr` |
| `cascade_hurricane_flood_risk` | float | `hurricane_declarations_60d × flood_events_30d` |
| `cascade_earthquake_landslide_risk` | float | `significant_quakes_90d × severe_storms_30d` |
| `cascade_storm_compound_count` | int | Severe storms in prior 30 days (compound events) |
| `cascade_active_chains` | int | Count of active cascade interactions (0–5) |
| `cascade_max_chain_length` | int | Longest active hazard chain (1–3) |
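
The formulas in the table can be sketched in a few lines. The input names come from the table itself, but they are intermediate quantities rather than published columns. The rule for `cascade_active_chains` is an assumption: a chain counts as active when its value is greater than zero, and the five chains are the four products plus the compound storm count.

```python
import math
from typing import Dict


def cascade_features(x: Dict[str, float]) -> Dict[str, float]:
    """Compute the cascade interaction terms for one county-month."""
    drought_fire = x["drought_severity_6mo"] * math.log1p(x["wildfire_acres_1yr"])
    fire_flood = math.log1p(x["burn_scar_acres_18mo"]) * x["flood_events_1yr"]
    hurricane_flood = x["hurricane_declarations_60d"] * x["flood_events_30d"]
    quake_landslide = x["significant_quakes_90d"] * x["severe_storms_30d"]
    storm_compound = int(x["severe_storms_30d"])

    # Assumption: a chain is "active" when its value is positive.
    chains = [drought_fire, fire_flood, hurricane_flood, quake_landslide, storm_compound]
    return {
        "cascade_drought_fire_risk": drought_fire,
        "cascade_fire_flood_risk": fire_flood,
        "cascade_hurricane_flood_risk": hurricane_flood,
        "cascade_earthquake_landslide_risk": quake_landslide,
        "cascade_storm_compound_count": storm_compound,
        "cascade_active_chains": sum(1 for c in chains if c > 0),
    }


# Hypothetical inputs: an old burn scar, recent floods, one hurricane declaration.
features = cascade_features({
    "drought_severity_6mo": 2.0,
    "wildfire_acres_1yr": 0.0,
    "burn_scar_acres_18mo": 1000.0,
    "flood_events_1yr": 3,
    "hurricane_declarations_60d": 1,
    "flood_events_30d": 2,
    "significant_quakes_90d": 0,
    "severe_storms_30d": 4,
})
print(features["cascade_active_chains"])  # 3
```

Using `log1p` on acreage keeps a handful of very large fires from dominating the product, and a zero in either factor switches the interaction off entirely.
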
### Target & Metadata
| Column | Type | Description |
|--------|------|-------------|
| `fips` | string | 5-digit FIPS county code |
| `year_month` | string | Observation month (YYYY-MM) |
| `declaration_next_90d` | bool | **Target**: FEMA declaration within 90 days |
| `declaration_type_next_90d` | string | Declaration type if positive (DR, EM, etc.) |
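
To make the target definition concrete, here is one way such a label could be built. The card does not say where the 90-day window starts; this sketch assumes it opens the day after the observation month ends, so that nothing from the window overlaps the features. Function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable


def month_end(year_month: str) -> date:
    """Last calendar day of a 'YYYY-MM' month."""
    year, month = map(int, year_month.split("-"))
    first_of_next = date(year + month // 12, month % 12 + 1, 1)
    return first_of_next - timedelta(days=1)


def declaration_next_90d(year_month: str, declaration_dates: Iterable[date], horizon_days: int = 90) -> bool:
    """True if any declaration falls within horizon_days after the month ends."""
    # Assumption: the window is (month end, month end + horizon_days].
    start = month_end(year_month)
    end = start + timedelta(days=horizon_days)
    return any(start < d <= end for d in declaration_dates)


print(declaration_next_90d("2023-06", [date(2023, 8, 15)]))  # True
print(declaration_next_90d("2023-06", [date(2023, 10, 1)]))  # False
print(declaration_next_90d("2023-06", [date(2023, 6, 15)]))  # False
```
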
## Usage
```python
import pandas as pd

df = pd.read_parquet("us-county-hazard-features.parquet")

# Temporal train/val/test split (recommended). "YYYY-MM" strings are
# zero-padded, so plain string comparison orders them chronologically.
train = df[df["year_month"] < "2022-01"]
val = df[(df["year_month"] >= "2022-01") & (df["year_month"] < "2023-01")]
test = df[(df["year_month"] >= "2023-01") & (df["year_month"] <= "2024-12")]
# Note: exclude months after 2024-12; FEMA declaration data for them is incomplete.

# Feature columns: the 42 features, without identifiers, targets, or the
# temporal columns (see the paper for the ablation justification).
NON_FEATURE_COLS = [
    "fips", "year_month", "declaration_next_90d", "declaration_type_next_90d",
    "month_of_year", "is_hurricane_season", "is_tornado_season", "is_wildfire_season",
]
FEATURE_COLS = [c for c in df.columns if c not in NON_FEATURE_COLS]
```
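
The split boundaries can be checked without loading the data. The sketch below mirrors the cutoffs from the snippet above in a hypothetical `assign_split` helper and counts how many months of the 2000-01 to 2026-03 span land in each split.

```python
from collections import Counter
from typing import Optional


def assign_split(year_month: str) -> Optional[str]:
    """Return 'train', 'val', or 'test' for a 'YYYY-MM' month, else None."""
    if year_month < "2022-01":
        return "train"
    if year_month < "2023-01":
        return "val"
    if year_month <= "2024-12":
        return "test"
    return None  # after 2024-12: excluded because declarations are incomplete


all_months = [
    f"{year}-{month:02d}"
    for year in range(2000, 2027)
    for month in range(1, 13)
    if f"{year}-{month:02d}" <= "2026-03"
]
split_counts = Counter(assign_split(ym) for ym in all_months)
print(split_counts["train"], split_counts["val"], split_counts["test"], split_counts[None])
# 264 12 24 15
```

The string comparison is only safe because months are zero-padded: an unpadded `"2022-9"` would sort after `"2022-10"`.
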
## Benchmark Results
Models evaluated with the temporal split; XGBoost performs best:
| Model | ROC-AUC | PR-AUC | F1 |
|-------|---------|--------|-----|
| Naive (prior) | 0.500 | 0.549 | 0.000 |
| Logistic Regression | 0.542 | 0.139 | 0.199 |
| Random Forest | 0.845 | 0.287 | 0.334 |
| **XGBoost** | **0.893** | **0.555** | **0.482** |
95% bootstrap CI for the XGBoost ROC-AUC: [0.890, 0.896]
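
A note on reading the naive baseline: its PR-AUC of 0.549 is far above the 11.02% positive rate, which is what a constant-score model earns under the step-wise average-precision definition. One possible explanation, not confirmed by the source, is that PR-AUC was computed with the trapezoidal rule, which joins the single operating point (recall 1, precision equal to the positive rate) to the conventional endpoint (recall 0, precision 1). The sketch below compares the two conventions; the 9.8% test-split positive rate is a hypothetical value chosen for illustration.

```python
from typing import Dict


def constant_baseline_pr_auc(positive_rate: float) -> Dict[str, float]:
    """PR-AUC of a model that assigns every row the same score.

    Such a model has one operating point: recall = 1, precision = positive_rate.
    """
    # Step-wise average precision: one step of height positive_rate over recall 0..1.
    average_precision = positive_rate
    # Trapezoidal rule: straight line from (recall 0, precision 1)
    # to (recall 1, precision positive_rate).
    trapezoid = (1.0 + positive_rate) / 2.0
    return {"average_precision": average_precision, "trapezoid": trapezoid}


# Hypothetical test-split positive rate; the card reports only the overall 11.02%.
baseline = constant_baseline_pr_auc(0.098)
print(round(baseline["average_precision"], 3), round(baseline["trapezoid"], 3))  # 0.098 0.549
```

In scikit-learn terms, the first convention corresponds to `average_precision_score` and the second to `auc` applied to the output of `precision_recall_curve`. When comparing the PR-AUC column across models, it helps to confirm that all rows use the same convention.
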
## Key Findings
1. **FEMA Dominance**: Removing the FEMA history features drops ROC-AUC from 0.89 to 0.63. Declaration history is the strongest predictor, which suggests the federal declaration process is path-dependent.
2. **Cascade Interactions**: Multi-hazard cascade features improve compound disaster detection (recall lift +2.9% for cascade events, ROC-AUC 0.907 vs 0.893 overall).
3. **Declaration Equity**: Low-income counties (Q1) show 2.3x higher prediction residuals than wealthy counties (Q4) at the same hazard exposure level (p < 1e-100), suggesting structural inequities in federal disaster declarations.
## Citation
```bibtex
@dataset{gupta2026uscountyhazard,
title={US County Multi-Hazard Features for Disaster Declaration Prediction},
author={Gupta, Sachin},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/sachinkg12/us-county-hazard-features}
}
```
## License
Apache 2.0
## Source Code
[HazardCast](https://github.com/sachinkg12/HazardCast) — Full pipeline: data ingestion, feature engineering, model training, and REST API.
Provider: sachinkg12



