Sashank-810/crisisnet-dataset

Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Sashank-810/crisisnet-dataset
Description:
---
license: cc-by-4.0
task_categories:
- tabular-classification
- text-classification
- graph-ml
language:
- en
tags:
- finance
- credit-risk
- default-prediction
- time-series
- nlp
- graph
- energy
- earnings-calls
- sec-filings
- corporate-finance
pretty_name: CrisisNet — Corporate Default Risk Dataset
size_categories:
- 1K<n<10K
dataset_info:
  splits:
  - name: train
  - name: validation
  - name: test
---

# CrisisNet — Corporate Default Risk Dataset

> *"Every cancer screening programme works on one insight: the disease speaks before the patient feels it. CrisisNet applies the same logic to corporate finance."*

CrisisNet is a multi-modal, network-aware dataset for building early-warning systems for corporate financial distress. It covers **40 U.S. Energy sector companies** (S&P 500) over **10 years (2015–2025)** across three parallel signal types: time series financials, NLP text from filings and earnings calls, and a supply-chain network graph.

---

## Dataset Structure

The dataset is organised into six top-level folders: three hold the inputs for analytical Modules A–C of the CrisisNet architecture, and three hold the shared labels, pre-computed splits, and master company list.

```
Module_1/   ← Time Series & Credit Risk (Module A)
Module_2/   ← NLP Text: 10-K Filings + Earnings Calls (Module B)
Module_3/   ← Supply Chain Network Graph (Module C)
Labels/     ← Default & Distress Events (ground truth for all modules)
splits/     ← Pre-computed train / validation / test splits
data/       ← Master company list
```

---

## Module_1 — Time Series & Credit Risk Engine

**Purpose:** Feeds the `X_ts(c,t)` feature vector, the financial heartbeat monitor.

### `Module_1/market_data/`

| File | Description |
|------|-------------|
| `all_prices.parquet` | Daily OHLCV stock prices for all 40 tickers, 2015–present (2,821 rows × 205 cols) |
| `all_prices.csv` | Same as above in CSV format |
| `financials/{TICKER}_income.csv` | Quarterly income statement per company |
| `financials/{TICKER}_balance_sheet.csv` | Quarterly balance sheet per company |
| `financials/{TICKER}_cashflow.csv` | Quarterly cash flow statement per company |
| `financials/{TICKER}_info.csv` | Company metadata (sector, market cap, description) |

> **Note:** CHK, HES, MRO, PXD, SWN have only `_info.csv`. These companies were acquired or delisted (Pioneer→Exxon, Hess→Chevron, Marathon Oil→ConocoPhillips, Southwestern→Chesapeake). They are intentionally kept as **distress/exit label cases**.

### `Module_1/credit_spreads/`

22 FRED series (2005–present) covering credit spreads, treasury yields, macro indicators, and energy prices:

| Series | Description |
|--------|-------------|
| `BAMLH0A0HYM2` | ICE BofA US High Yield OAS (primary distress signal) |
| `BAA10Y`, `AAA10Y` | Moody's corporate bond spreads |
| `VIXCLS` | CBOE VIX (market fear gauge) |
| `DCOILWTICO`, `DCOILBRENTEU` | WTI & Brent crude oil prices |
| `DHHNGSP` | Henry Hub natural gas spot price |
| `T10Y2Y` | 10Y-2Y Treasury spread (recession predictor) |
| `DGS10`, `DGS2`, `DGS3MO` | Treasury yields |
| `UNRATE`, `CPIAUCSL`, `FEDFUNDS`, `INDPRO` | Macro indicators |
| `fred_all_series.parquet` | All 22 series combined (5,681 rows × 22 cols) |

### `Module_1/sec_xbrl/`

Structured XBRL financial data from SEC EDGAR for 35/40 companies:

- `company_facts/{TICKER}_facts.json`: all XBRL-reported line items (Assets, Liabilities, Revenue, EPS, etc.) with full quarterly history
- `submissions/{TICKER}_submissions.json`: filing history (dates, form types, accession numbers)
- `ticker_cik_mapping.csv`: Ticker ↔ SEC CIK number mapping

---

## Module_2 — NLP: 10-K Filings & Earnings Calls

**Purpose:** Feeds the `X_nlp(c,t)` feature vector via LDA topic modelling and FinBERT sentiment.

### `Module_2/10k_extracted/10-K/`

353 structured JSON files, one per company per year (2015–2024). Each file contains:

```json
{
  "item_1": "Business description — supply chain, customers, operations...",
  "item_1a": "Risk factors — debt levels, commodity exposure, going concern...",
  "item_7": "Management Discussion & Analysis — earnings narrative...",
  "item_7a": "Market risk disclosures — interest rate, commodity hedging...",
  "item_8": "Financial statements narrative..."
}
```

**Naming:** `{CIK}_{FormType}_{Year}_{AccessionNumber}.json`

**Usage for NLP:**

- Run **LDA** (Gensim) or **BERTopic** on `item_7` (MD&A) to extract latent distress topics
- Apply **FinBERT** sentence-by-sentence to `item_1a` (Risk Factors) for sentiment time series
- Track **KL-divergence** of topic distributions quarter-over-quarter as a leading signal

### `Module_2/transcripts/`

Earnings call Q&A transcripts from HuggingFace (`lamini/earnings-calls-qa`):

- 860,164 Q&A records from public company earnings calls
- Fields: `question`, `answer`, `ticker`, `date`
- Re-download: `datasets.load_dataset("lamini/earnings-calls-qa")`

---

## Module_3 — Supply Chain Network Graph

**Purpose:** Feeds the `X_graph(c,t)` feature vector via community detection and contagion simulation.

### `Module_3/edges_template.csv`

30 pre-populated directed edges representing known Energy sector supplier-customer relationships:

```
source, target, relationship_type, description
SLB, XOM, service_provider, oilfield services
HAL, CVX, service_provider, oilfield services
EPD, VLO, pipeline_supplier, NGL supply
...
```

**Usage:** Load into NetworkX → run Louvain community detection → compute DebtRank contagion scores.

### `Module_3/customer_disclosures_raw.csv`

660 customer/supplier disclosure mentions extracted from 10-K Item 1 and Item 7 sections. Use these to augment the graph edges with NLP-extracted relationships.

---

## Labels — Default & Distress Events

**Purpose:** Ground truth labels for all three modules.

### `Labels/energy_defaults_curated.csv`

24 curated bankruptcy/default events (2001–2021):

```
company, ticker, event_date, event_type, details
Chesapeake Energy, CHK, 2020-06-28, Chapter 11, COVID + legacy debt...
Whiting Petroleum, WLL, 2020-04-01, Chapter 11, COVID oil crash
...
```

### `Labels/distress_from_drawdowns.csv`

76 mechanically detected distress episodes from stock price drawdowns (>50% peak-to-trough within 6 months), useful as soft labels for the ML model.

### `Labels/lopucki_brd_reference.json`

Reference pointer to the Florida-UCLA LoPucki Bankruptcy Research Database (1,000+ cases, 1979–2022) for cross-referencing additional default events.

---

## Train / Validation / Test Splits

**Split strategy: temporal walk-forward** (no lookahead leakage)

| Split | Period | Rationale |
|-------|--------|-----------|
| `train` | 2015–2021 | Includes 2015–16 oil crash + 2020 COVID wave defaults |
| `validation` | 2022 | Post-COVID recovery, hyperparameter tuning |
| `test` | 2023–2025 | Held-out, never seen during training |

Pre-split parquet files are in `splits/`:

```
splits/
  stock_prices/          train.parquet, validation.parquet, test.parquet
  fred_macro/            train.parquet, validation.parquet, test.parquet
  labels/
    energy_defaults/     train.parquet, validation.parquet, test.parquet
    distress_drawdowns/  train.parquet, validation.parquet, test.parquet
  10k_filings/           train_manifest.json, validation_manifest.json, test_manifest.json
```

---

## Recommended Usage

### Module A — Time Series Credit Risk Engine

```python
import numpy as np
import pandas as pd

# Load training data
prices_train = pd.read_parquet("splits/stock_prices/train.parquet")
fred_train = pd.read_parquet("splits/fred_macro/train.parquet")
labels_train = pd.read_parquet("splits/labels/distress_drawdowns/train.parquet")

# Feature engineering: rolling volatility, Merton Distance-to-Default
# 30-day rolling log-return volatility per ticker, annualised
# (assumes MultiIndex columns with the price field at level 0)
close = prices_train.xs("Close", axis=1, level=0)
log_ret = np.log(close / close.shift(1))
vol_30d = log_ret.rolling(30).std() * (252**0.5)

# Altman Z-Score (benchmark); requires balance sheet data:
# Z = 1.2*X1 + 1.4*X2 + 3.3*X3 + 0.6*X4 + 1.0*X5
# where X1=Working Capital/TA, X2=Retained Earnings/TA, X3=EBIT/TA,
#       X4=Market Cap/Book Liabilities, X5=Revenue/TA

# Walk-forward cross-validation (never use future data in training window)
# Use expanding window: train on t-36m to t, predict t+1m to t+6m
```

### Module B — NLP Topic Modelling

```python
import json
from gensim import corpora, models  # LDA baseline
# from bertopic import BERTopic     # upgraded model

# Load 10-K MD&A sections for training period
with open("splits/10k_filings/train_manifest.json") as f:
    train_manifest = json.load(f)

corpus = []
for fpath in train_manifest["files"]:
    with open(fpath) as f:
        filing = json.load(f)
    text = filing.get("item_7", "") + " " + filing.get("item_1a", "")
    corpus.append(text)

# Preprocessing: tokenise, remove Safe Harbor boilerplate, financial stopwords
# financial_stopwords = ["forward-looking", "may", "could", "believe", "expect", ...]

# Train LDA (K=15 topics typical for earnings text)
# Compare coherence scores across K=10,15,20,25 to find optimal

# FinBERT sentiment on risk factor sentences:
# from transformers import pipeline
# finbert = pipeline("text-classification", model="ProsusAI/finbert")
```

### Module C — Supply Chain Network

```python
import pandas as pd
import networkx as nx
import community as community_louvain  # pip install python-louvain

# Build directed graph from known edges
edges = pd.read_csv("Module_3/edges_template.csv")
G = nx.from_pandas_edgelist(edges, "source", "target",
                            edge_attr="relationship_type",
                            create_using=nx.DiGraph())

# Add nodes from company list; add_node also creates tickers that have
# no edge yet, which a direct G.nodes[...] assignment would reject
companies = pd.read_csv("data/company_list.csv")
for _, row in companies.iterrows():
    G.add_node(row["ticker"], subsector=row["subsector"])

# Louvain community detection
partition = community_louvain.best_partition(G.to_undirected())

# Centrality metrics
betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G)

# DebtRank contagion: mark one node as defaulted, propagate stress
# proportional to edge weights through the graph
```

### Module D — Fusion & Health Score

```python
import lightgbm as lgb
import shap
from sklearn.calibration import CalibratedClassifierCV

# Concatenate feature vectors from all three modules:
# X = pd.concat([X_ts, X_nlp, X_graph], axis=1)  # (company × quarter) index
# X_train, y_train, X_val, y_val, X_test are assumed to be cut from X and
# the labels along the walk-forward split below.

# Walk-forward split (expanding window), matching the splits table
# Train: 2015Q1 → 2021Q4 | Val: 2022Q1 → 2022Q4 | Test: 2023Q1 →

# LightGBM with early stopping on val AUC
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05,
                           num_leaves=31, min_child_samples=10)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          eval_metric="auc",
          callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)])

# Calibrate to produce true probabilities (Platt scaling)
# Newer scikit-learn releases deprecate cv="prefit" in favour of wrapping
# the fitted model in sklearn.frozen.FrozenEstimator.
calibrated = CalibratedClassifierCV(model, cv="prefit", method="sigmoid")
calibrated.fit(X_val, y_val)

# SHAP for interpretability
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```

---

## Company Universe

40 S&P 500 Energy sector companies across 6 subsectors:

| Subsector | Tickers |
|-----------|---------|
| Integrated Oil | XOM, CVX, OXY |
| Exploration & Production | COP, EOG, PXD*, DVN, FANG, MRO*, APA, OVV, HES*, CTRA, MTDR, PR, CHRD |
| Oilfield Services | SLB, HAL, BKR, FTI, NOV |
| Refining | VLO, MPC, PSX, DK, PBF |
| Midstream/Pipelines | KMI, WMB, OKE, ET, EPD, TRGP, DTM, AM |
| Natural Gas / LNG | EQT, AR, RRC, SWN*, CHK*, LNG |

*\* Delisted/acquired/bankrupt; useful as distress/exit label cases*

---

## Research Questions (from project proposal)

- **RQ1 — Prediction:** Can we predict corporate default events 3–6 months in advance using time series + NLP + network features with higher AUC-ROC than Altman Z-Score?
- **RQ2 — Contagion:** Which companies act as 'super-spreaders' of financial distress, and can community detection identify them before a crisis?
- **RQ3 — Narrative Signal:** Do sentiment and topic shifts in earnings call language provide a statistically significant leading signal for credit deterioration?

---

## Citation

```bibtex
@dataset{crisisnet2025,
  title     = {CrisisNet: A Multi-Modal Corporate Default Risk Dataset},
  author    = {Sashank and team},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/Sashank-810/crisisnet-dataset}
}
```

---

## License

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/): free to use for research and commercial purposes with attribution.

Data sourced from: Yahoo Finance (yfinance), FRED API (St. Louis Fed), SEC EDGAR (public domain), HuggingFace `lamini/earnings-calls-qa`.
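
As a supplement to the Module A notes, the Altman Z-Score benchmark formula quoted there (Z = 1.2·X1 + 1.4·X2 + 3.3·X3 + 0.6·X4 + 1.0·X5) can be written as a small helper. This is a sketch only: the function name `altman_z_score`, its argument names, and the example figures are hypothetical, and mapping these inputs to columns of the `financials/{TICKER}_*.csv` files is left open because the card does not list those column names.

```python
def altman_z_score(working_capital: float,
                   retained_earnings: float,
                   ebit: float,
                   market_cap: float,
                   total_liabilities: float,
                   revenue: float,
                   total_assets: float) -> float:
    """Altman Z-Score with the coefficients quoted in the Module A notes."""
    if total_assets <= 0 or total_liabilities <= 0:
        raise ValueError("total_assets and total_liabilities must be positive")
    x1 = working_capital / total_assets      # Working Capital / TA
    x2 = retained_earnings / total_assets    # Retained Earnings / TA
    x3 = ebit / total_assets                 # EBIT / TA
    x4 = market_cap / total_liabilities      # Market Cap / Book Liabilities
    x5 = revenue / total_assets              # Revenue / TA
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5


# Made-up figures for illustration (not taken from the dataset)
example_z = altman_z_score(working_capital=100.0, retained_earnings=200.0,
                           ebit=150.0, market_cap=600.0,
                           total_liabilities=400.0, revenue=1000.0,
                           total_assets=1000.0)
```

Computed per company and quarter, a score like this gives the baseline that RQ1 asks the fused model to beat on AUC-ROC.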
Provider:
Sashank-810