Sashank-810/crisisnet-dataset
Hugging Face | Updated 2026-03-24 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Sashank-810/crisisnet-dataset
Description:
---
license: cc-by-4.0
task_categories:
- tabular-classification
- text-classification
- graph-ml
language:
- en
tags:
- finance
- credit-risk
- default-prediction
- time-series
- nlp
- graph
- energy
- earnings-calls
- sec-filings
- corporate-finance
pretty_name: CrisisNet — Corporate Default Risk Dataset
size_categories:
- 1K<n<10K
dataset_info:
  splits:
  - name: train
  - name: validation
  - name: test
---
# CrisisNet — Corporate Default Risk Dataset
> *"Every cancer screening programme works on one insight: the disease speaks before the patient feels it. CrisisNet applies the same logic to corporate finance."*
CrisisNet is a multi-modal, network-aware dataset for building early-warning systems for corporate financial distress. It covers **40 U.S. Energy sector companies** (S&P 500) over **10 years (2015–2025)** across three parallel signal types: time series financials, NLP text from filings and earnings calls, and a supply-chain network graph.
---
## Dataset Structure
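The dataset is organised into six top-level folders: three module folders that mirror the data modules of the CrisisNet architecture (A, B, and C), plus the labels, the pre-computed splits, and the master company list. Module D (fusion) consumes the outputs of the other three and has no folder of its own.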
```
Module_1/ ← Time Series & Credit Risk (Module A)
Module_2/ ← NLP Text: 10-K Filings + Earnings Calls (Module B)
Module_3/ ← Supply Chain Network Graph (Module C)
Labels/ ← Default & Distress Events (ground truth for all modules)
splits/ ← Pre-computed train / validation / test splits
data/ ← Master company list
```
---
## Module_1 — Time Series & Credit Risk Engine
**Purpose:** Feeds the `X_ts(c,t)` feature vector — the financial heartbeat monitor.
### `Module_1/market_data/`
| File | Description |
|------|-------------|
| `all_prices.parquet` | Daily OHLCV stock prices for all 40 tickers, 2015–present (2,821 rows × 205 cols) |
| `all_prices.csv` | Same as above in CSV format |
| `financials/{TICKER}_income.csv` | Quarterly income statement per company |
| `financials/{TICKER}_balance_sheet.csv` | Quarterly balance sheet per company |
| `financials/{TICKER}_cashflow.csv` | Quarterly cash flow statement per company |
| `financials/{TICKER}_info.csv` | Company metadata (sector, market cap, description) |
> **Note:** CHK, HES, MRO, PXD, and SWN have only `_info.csv` because these companies were acquired or delisted (Pioneer→Exxon, Hess→Chevron, Marathon Oil→ConocoPhillips, Southwestern→Chesapeake; the combined Chesapeake/Southwestern company was then renamed Expand Energy and stopped trading as CHK). They are intentionally kept as **distress/exit label cases**.
### `Module_1/credit_spreads/`
22 FRED series (2005–present) covering credit spreads, treasury yields, macro indicators, and energy prices:
| Series | Description |
|--------|-------------|
| `BAMLH0A0HYM2` | ICE BofA US High Yield OAS — primary distress signal |
| `BAA10Y`, `AAA10Y` | Moody's corporate bond spreads |
| `VIXCLS` | CBOE VIX — market fear gauge |
| `DCOILWTICO`, `DCOILBRENTEU` | WTI & Brent crude oil prices |
| `DHHNGSP` | Henry Hub natural gas spot price |
| `T10Y2Y` | 10Y-2Y Treasury spread (recession predictor) |
| `DGS10`, `DGS2`, `DGS3MO` | Treasury yields |
| `UNRATE`, `CPIAUCSL`, `FEDFUNDS`, `INDPRO` | Macro indicators |
| `fred_all_series.parquet` | All 22 series combined (5,681 rows × 22 cols) |
### `Module_1/sec_xbrl/`
Structured XBRL financial data from SEC EDGAR for 35/40 companies:
- `company_facts/{TICKER}_facts.json` — All XBRL-reported line items (Assets, Liabilities, Revenue, EPS, etc.) with full quarterly history
- `submissions/{TICKER}_submissions.json` — Filing history (dates, form types, accession numbers)
- `ticker_cik_mapping.csv` — Ticker ↔ SEC CIK number mapping
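
The following is a minimal sketch of how one line item could be pulled out of a `{TICKER}_facts.json` file. It assumes the files keep the layout of the SEC EDGAR company-facts API (`facts` → taxonomy → concept → `units` → list of entries with `end`, `val`, `form`, and `filed`), which the dataset card does not state explicitly. The function name `point_in_time_series` is hypothetical. It keeps the earliest-filed value for each period end, so later restatements do not leak into a walk-forward backtest.

```python
def point_in_time_series(company_facts, concept, taxonomy="us-gaap",
                         unit="USD", forms=("10-K", "10-Q")):
    """Return [(period_end, filed_date, value)] for one XBRL concept.

    Assumes the EDGAR company-facts layout. For each period end, only the
    first filing that reported it is kept (earliest `filed` date).
    """
    entries = (company_facts.get("facts", {})
               .get(taxonomy, {})
               .get(concept, {})
               .get("units", {})
               .get(unit, []))
    first_reported = {}
    for entry in entries:
        if entry.get("form") not in forms:
            continue  # skip forms other than the ones requested
        end = entry["end"]
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if end not in first_reported or entry["filed"] < first_reported[end]["filed"]:
            first_reported[end] = entry
    return [(end, first_reported[end]["filed"], first_reported[end]["val"])
            for end in sorted(first_reported)]
```

This works most cleanly for balance-sheet concepts such as `Assets`. Flow concepts such as revenue can report several durations with the same period end, so they would need an extra filter on the period start.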
---
## Module_2 — NLP: 10-K Filings & Earnings Calls
**Purpose:** Feeds the `X_nlp(c,t)` feature vector via LDA topic modelling and FinBERT sentiment.
### `Module_2/10k_extracted/10-K/`
353 structured JSON files, one per company per year (2015–2024). Each file contains:
```json
{
"item_1": "Business description — supply chain, customers, operations...",
"item_1a": "Risk factors — debt levels, commodity exposure, going concern...",
"item_7": "Management Discussion & Analysis — earnings narrative...",
"item_7a": "Market risk disclosures — interest rate, commodity hedging...",
"item_8": "Financial statements narrative..."
}
```
**Naming:** `{CIK}_{FormType}_{Year}_{AccessionNumber}.json`
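
As a rough illustration of the naming scheme, the sketch below splits a filename into its four parts. The regular expression assumes that the CIK is numeric, the year has four digits, and the form type contains no underscore; the helper name `parse_filing_name` and the filename in the usage note are made up for illustration.

```python
import re
from typing import NamedTuple


class FilingName(NamedTuple):
    cik: str
    form_type: str
    year: int
    accession: str


# {CIK}_{FormType}_{Year}_{AccessionNumber}.json
_FILING_PATTERN = re.compile(
    r"^(?P<cik>\d+)_(?P<form>[^_]+)_(?P<year>\d{4})_(?P<accession>.+)\.json$"
)


def parse_filing_name(filename):
    """Split a 10-K JSON filename into CIK, form type, year, and accession number."""
    match = _FILING_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return FilingName(match["cik"], match["form"],
                      int(match["year"]), match["accession"])
```

For a hypothetical file named `1234_10-K_2019_0001234-20-000001.json`, this returns CIK `1234`, form type `10-K`, year `2019`, and accession `0001234-20-000001`. The CIK can then be joined against `ticker_cik_mapping.csv` to recover the ticker.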
**Usage for NLP:**
- Run **LDA** (Gensim) or **BERTopic** on `item_7` (MD&A) to extract latent distress topics
- Apply **FinBERT** sentence-by-sentence to `item_1a` (Risk Factors) for sentiment time series
- Track **KL-divergence** of topic distributions between consecutive filings (year-over-year for the annual 10-Ks) as a leading signal
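
The KL-divergence signal from the list above could be computed along these lines. This is a sketch under stated assumptions: each filing has already been reduced to a topic-probability vector (for example by LDA), the shift is measured as KL(current ‖ previous), and a small `eps` smooths zero-probability topics. The function names are illustrative, not part of the dataset.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same topics."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Smooth and renormalise so zero-probability topics do not produce log(0).
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    p_total, q_total = sum(p), sum(q)
    return sum((a / p_total) * math.log((a / p_total) / (b / q_total))
               for a, b in zip(p, q))


def topic_shift(series):
    """series: [(period, topic_distribution)] in chronological order.

    Returns [(period, KL(current || previous))] starting from the second period.
    """
    return [(period, kl_divergence(dist, series[i][1]))
            for i, (period, dist) in enumerate(series[1:])]
```

A sudden jump in the shift series for one company means its filing language has moved away from the previous filing's topic mix, which is the kind of change the bullet above proposes as a leading signal.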
### `Module_2/transcripts/`
Earnings call Q&A transcripts from HuggingFace (`lamini/earnings-calls-qa`):
- 860,164 Q&A records from public company earnings calls
- Fields: `question`, `answer`, `ticker`, `date`
- Re-download: `datasets.load_dataset("lamini/earnings-calls-qa")`
---
## Module_3 — Supply Chain Network Graph
**Purpose:** Feeds the `X_graph(c,t)` feature vector via community detection and contagion simulation.
### `Module_3/edges_template.csv`
30 pre-populated directed edges representing known Energy sector supplier-customer relationships:
```
source, target, relationship_type, description
SLB, XOM, service_provider, oilfield services
HAL, CVX, service_provider, oilfield services
EPD, VLO, pipeline_supplier, NGL supply
...
```
**Usage:** Load into NetworkX → run Louvain community detection → compute DebtRank contagion scores.
### `Module_3/customer_disclosures_raw.csv`
660 customer/supplier disclosure mentions extracted from 10-K Item 1 and Item 7 sections. Use these to augment the graph edges with NLP-extracted relationships.
---
## Labels — Default & Distress Events
**Purpose:** Ground truth labels for all three modules.
### `Labels/energy_defaults_curated.csv`
24 curated bankruptcy/default events (2001–2021):
```
company, ticker, event_date, event_type, details
Chesapeake Energy, CHK, 2020-06-28, Chapter 11, COVID + legacy debt...
Whiting Petroleum, WLL, 2020-04-01, Chapter 11, COVID oil crash
...
```
### `Labels/distress_from_drawdowns.csv`
76 mechanically detected distress episodes from stock price drawdowns (>50% peak-to-trough within 6 months) — useful as soft labels for the ML model.
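
One way the drawdown rule might be reproduced is sketched below. It is not the authors' exact procedure: it assumes that six months is about 126 trading days, that the peak is the highest close in that trailing window, and that a day is flagged when the close sits more than 50% below that peak. The name `drawdown_flags` is hypothetical.

```python
def drawdown_flags(closes, window=126, threshold=0.5):
    """Flag days whose close is more than `threshold` below the trailing peak.

    closes: chronological list of closing prices for one ticker.
    window: trailing look-back in trading days (126 is roughly six months).
    """
    flags = []
    for t, price in enumerate(closes):
        peak = max(closes[max(0, t - window + 1): t + 1])
        drawdown = (peak - price) / peak
        flags.append(drawdown > threshold)
    return flags
```

Consecutive flagged days would then be merged into a single episode before being used as a soft label.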
### `Labels/lopucki_brd_reference.json`
Reference pointer to the Florida-UCLA LoPucki Bankruptcy Research Database (1,000+ cases, 1979–2022) for cross-referencing additional default events.
---
## Train / Validation / Test Splits
**Split strategy: temporal walk-forward** (no lookahead leakage)
| Split | Period | Rationale |
|-------|--------|-----------|
| `train` | 2015–2021 | Includes 2015–16 oil crash + 2020 COVID wave defaults |
| `validation` | 2022 | Post-COVID recovery, hyperparameter tuning |
| `test` | 2023–2025 | Held-out, never seen during training |
Pre-split parquet files are in `splits/`:
```
splits/
  stock_prices/           train.parquet, validation.parquet, test.parquet
  fred_macro/             train.parquet, validation.parquet, test.parquet
  labels/
    energy_defaults/      train.parquet, validation.parquet, test.parquet
    distress_drawdowns/   train.parquet, validation.parquet, test.parquet
  10k_filings/            train_manifest.json, validation_manifest.json, test_manifest.json
```
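
For data that is not pre-split (for example the per-ticker financials), the same temporal rule can be applied by date. The sketch below copies the year boundaries from the split table; treating dates outside 2015–2025 as unassigned is an assumption, since the card does not say how earlier FRED or default records were handled.

```python
from datetime import date

# Year boundaries copied from the split table; both ends inclusive.
SPLIT_YEARS = {
    "train": (2015, 2021),
    "validation": (2022, 2022),
    "test": (2023, 2025),
}


def assign_split(day):
    """Return the split name for a date, or None if it falls outside 2015-2025."""
    for name, (first_year, last_year) in SPLIT_YEARS.items():
        if first_year <= day.year <= last_year:
            return name
    return None
```

For filings, the date that matters is the filing date rather than the fiscal period end, because a 10-K for fiscal 2021 only becomes public in 2022.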
---
## Recommended Usage
### Module A — Time Series Credit Risk Engine
```python
import numpy as np
import pandas as pd
# Load training data
prices_train = pd.read_parquet("splits/stock_prices/train.parquet")
fred_train = pd.read_parquet("splits/fred_macro/train.parquet")
labels_train = pd.read_parquet("splits/labels/distress_drawdowns/train.parquet")
# Feature engineering: rolling volatility, Merton Distance-to-Default
# 30-day rolling log-return volatility per ticker, annualised with sqrt(252).
# Assumes a two-level column index with the price field ("Close") at level 0.
# (pd.np was removed from pandas, so NumPy is imported directly.)
close = prices_train.xs("Close", axis=1, level=0)
log_ret = np.log(close / close.shift(1))
vol_30d = log_ret.rolling(30).std() * (252**0.5)
# Altman Z-Score (benchmark) — requires balance sheet data:
# Z = 1.2*X1 + 1.4*X2 + 3.3*X3 + 0.6*X4 + 1.0*X5
# where X1=Working Capital/TA, X2=Retained Earnings/TA, X3=EBIT/TA,
# X4=Market Cap/Book Liabilities, X5=Revenue/TA
# Walk-forward cross-validation (never use future data in training window)
# Use an expanding window: train on all data up to t (at least 36 months), predict t+1m to t+6m
```
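
The two benchmark quantities named in the comments above can be sketched as plain functions. The Altman Z-Score follows the coefficients and ratio definitions given in the block. The distance-to-default uses the textbook Merton form and takes asset value and asset volatility as inputs; estimating those two from equity data is a separate step that is not shown here, and the argument names are illustrative.

```python
import math


def altman_z(working_capital, retained_earnings, ebit, market_cap,
             total_liabilities, revenue, total_assets):
    """Altman Z = 1.2*X1 + 1.4*X2 + 3.3*X3 + 0.6*X4 + 1.0*X5."""
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_cap / total_liabilities
    x5 = revenue / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5


def distance_to_default(asset_value, debt, drift, asset_vol, horizon=1.0):
    """Merton distance-to-default over `horizon` years.

    DD = (ln(V / D) + (mu - sigma^2 / 2) * T) / (sigma * sqrt(T))
    """
    numerator = math.log(asset_value / debt) + (drift - 0.5 * asset_vol ** 2) * horizon
    return numerator / (asset_vol * math.sqrt(horizon))
```

With total assets of 1,000, working capital 200, retained earnings 300, EBIT 100, market cap 600, liabilities 500, and revenue 800, `altman_z` returns about 2.51. Lower values of either measure indicate higher distress risk.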
### Module B — NLP Topic Modelling
```python
import json
from gensim import corpora, models  # LDA baseline
# from bertopic import BERTopic  # upgraded model
# Load 10-K MD&A and risk-factor sections for the training period
with open("splits/10k_filings/train_manifest.json") as f:
    train_manifest = json.load(f)
corpus = []
for fpath in train_manifest["files"]:
    with open(fpath) as f:
        filing = json.load(f)
    text = filing.get("item_7", "") + " " + filing.get("item_1a", "")
    corpus.append(text)
# Preprocessing: tokenise, remove Safe Harbor boilerplate, financial stopwords
# financial_stopwords = ["forward-looking", "may", "could", "believe", "expect", ...]
# Train LDA (K=15 topics typical for earnings text)
# Compare coherence scores across K=10,15,20,25 to find optimal
# FinBERT sentiment on risk factor sentences:
# from transformers import pipeline
# finbert = pipeline("text-classification", model="ProsusAI/finbert")
```
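
The preprocessing step mentioned in the comments could look roughly like the sketch below. It uses only the five stopwords that the block above spells out (the rest of that list is elided in the source and is not guessed here), a simple regex tokenizer that keeps hyphenated words together, and a minimum token length of three characters. The tokenizer and the length cut-off are assumptions, not the authors' pipeline.

```python
import re

# Only the stopwords listed explicitly above; extend as needed.
FINANCIAL_STOPWORDS = {"forward-looking", "may", "could", "believe", "expect"}

# Lower-case words, keeping internal hyphens ("forward-looking") in one token.
_TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def tokenize(text, stopwords=FINANCIAL_STOPWORDS, min_len=3):
    """Lower-case, tokenise, and drop stopwords and very short tokens."""
    tokens = _TOKEN.findall(text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]
```

The resulting token lists can be passed to `corpora.Dictionary` and `doc2bow` to build the bag-of-words input that Gensim's LDA expects.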
### Module C — Supply Chain Network
```python
import pandas as pd
import networkx as nx
import community as community_louvain  # pip install python-louvain
# Build directed graph from known edges
# skipinitialspace strips the blank after each comma, as shown in the template excerpt
edges = pd.read_csv("Module_3/edges_template.csv", skipinitialspace=True)
G = nx.from_pandas_edgelist(edges, "source", "target",
                            edge_attr="relationship_type",
                            create_using=nx.DiGraph())
# Add every company as a node, including those that have no edge yet
# (G.nodes[ticker] alone raises KeyError for tickers missing from the edge list)
companies = pd.read_csv("data/company_list.csv")
for _, row in companies.iterrows():
    G.add_node(row["ticker"], subsector=row["subsector"])
# Louvain community detection
partition = community_louvain.best_partition(G.to_undirected())
# Centrality metrics
betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G)
# DebtRank contagion: mark one node as defaulted, propagate stress
# proportional to edge weights through the graph
```
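
The DebtRank step described in the last comment can be sketched in plain Python as follows. This follows the usual DebtRank state machine (undistressed, distressed, inactive), in which each node passes its stress on exactly once. Two things are assumptions: `edges_template.csv` as shown has no weight column, so the impact weights here are placeholders that would have to be estimated, and the direction in which stress flows along a supplier-customer edge is a modelling choice.

```python
def debtrank(impact, shocked, shock=1.0):
    """Propagate distress through a weighted directed graph.

    impact: {j: {i: w}} where w in [0, 1] is the share of j's stress passed to i.
    shocked: nodes that start in distress with stress level `shock`.
    Returns the final stress level h in [0, 1] for every node.
    """
    nodes = set(impact) | {i for nbrs in impact.values() for i in nbrs} | set(shocked)
    h = {n: 0.0 for n in nodes}
    state = {n: "U" for n in nodes}  # U = undistressed, D = distressed, I = inactive
    for n in shocked:
        h[n] = shock
        state[n] = "D"
    while any(s == "D" for s in state.values()):
        distressed = [n for n in nodes if state[n] == "D"]
        new_h = dict(h)
        for j in distressed:
            for i, w in impact.get(j, {}).items():
                new_h[i] = min(1.0, new_h[i] + w * h[j])
        for n in nodes:
            if state[n] == "D":
                state[n] = "I"  # a node propagates only once
            elif state[n] == "U" and new_h[n] > 0:
                state[n] = "D"
        h = new_h
    return h
```

With the `G` built above, the `impact` dictionary could be filled from edge attributes once weights are available. Running the function once per company as the shocked node gives a ranking of how much stress each company can spread.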
### Module D — Fusion & Health Score
```python
import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV
# Concatenate feature vectors from all three modules:
# X = pd.concat([X_ts, X_nlp, X_graph], axis=1) # (company × quarter) index
# Walk-forward split (expanding window)
# Train: 2015Q1 → 2021Q4 | Val: 2022Q1 → 2022Q4 | Test: 2023Q1 →  (same periods as the pre-computed splits)
# LightGBM with early stopping on val AUC
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05,
                           num_leaves=31, min_child_samples=10)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          eval_metric="auc",
          callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)])
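# Calibrate to produce true probabilities (Platt scaling).
# Note: recent scikit-learn releases deprecate cv="prefit"; there, wrap the fitted
# model in sklearn.frozen.FrozenEstimator instead.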
calibrated = CalibratedClassifierCV(model, cv="prefit", method="sigmoid")
calibrated.fit(X_val, y_val)
# SHAP for interpretability
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```
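
The expanding-window walk-forward scheme referred to above can be sketched as a small fold generator. The function name and the `min_train` and `horizon` parameters are illustrative; periods can be quarters, months, or years as long as they are in chronological order.

```python
def expanding_window_folds(periods, min_train, horizon=1):
    """Build walk-forward folds as (train_periods, test_periods) pairs.

    The training window always starts at the first period and grows by one
    period per fold; the test window is the next `horizon` periods, so no
    fold ever trains on data from its own future.
    """
    folds = []
    for end in range(min_train, len(periods) - horizon + 1):
        folds.append((periods[:end], periods[end:end + horizon]))
    return folds
```

For example, with yearly periods 2015 to 2019 and `min_train=3`, the first fold trains on 2015–2017 and tests on 2018, and the second trains on 2015–2018 and tests on 2019.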
---
## Company Universe
40 S&P 500 Energy sector companies across 6 subsectors:
| Subsector | Tickers |
|-----------|---------|
| Integrated Oil | XOM, CVX, OXY |
| Exploration & Production | COP, EOG, PXD*, DVN, FANG, MRO*, APA, OVV, HES*, CTRA, MTDR, PR, CHRD |
| Oilfield Services | SLB, HAL, BKR, FTI, NOV |
| Refining | VLO, MPC, PSX, DK, PBF |
| Midstream/Pipelines | KMI, WMB, OKE, ET, EPD, TRGP, DTM, AM |
| Natural Gas / LNG | EQT, AR, RRC, SWN*, CHK*, LNG |
*\* Delisted/acquired/bankrupt — useful as distress/exit label cases*
---
## Research Questions (from project proposal)
- **RQ1 — Prediction:** Can we predict corporate default events 3–6 months in advance using time series + NLP + network features with higher AUC-ROC than Altman Z-Score?
- **RQ2 — Contagion:** Which companies act as 'super-spreaders' of financial distress — and can community detection identify them before a crisis?
- **RQ3 — Narrative Signal:** Do sentiment and topic shifts in earnings call language provide a statistically significant leading signal for credit deterioration?
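
For the AUC-ROC comparison in RQ1, one detail is easy to get wrong: a low Altman Z-Score means higher risk, so the score has to be negated before it is ranked against default labels. The sketch below computes AUC from pairwise comparisons in plain Python; it is meant as an illustration of the comparison, and `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def auc_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: 1 for default/distress, 0 otherwise.
    scores: higher means riskier. Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def altman_baseline_auc(labels, z_scores):
    """AUC of the Altman benchmark: negate Z so that higher means riskier."""
    return auc_roc(labels, [-z for z in z_scores])
```

The model's calibrated probabilities and the negated Z-Scores can then be evaluated on the same test rows to answer RQ1.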
---
## Citation
```bibtex
@dataset{crisisnet2025,
title = {CrisisNet: A Multi-Modal Corporate Default Risk Dataset},
author = {Sashank and team},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/Sashank-810/crisisnet-dataset}
}
```
---
## License
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) — Free to use for research and commercial purposes with attribution.
Data sourced from: Yahoo Finance (yfinance), FRED API (St. Louis Fed), SEC EDGAR (public domain), HuggingFace `lamini/earnings-calls-qa`.
Provider:
Sashank-810



