elroyg/fin-jepa-study0
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/elroyg/fin-jepa-study0
---
license: mit
task_categories:
- tabular-classification
tags:
- finance
- xbrl
- distress-prediction
- sec-edgar
- tabular
- financial-statements
pretty_name: "Fin-JEPA Study 0: XBRL Financial Distress Dataset"
size_categories:
- 10K<n<100K
language:
- en
configs:
- config_name: default
  data_files:
  - split: train
    path: default/train.parquet
  - split: validation
    path: default/validation.parquet
  - split: test
    path: default/test.parquet
  default: true
dataset_info:
- config_name: default
  features:
  - name: cik
    dtype: string
  - name: ticker
    dtype: string
  - name: fiscal_year
    dtype: int64
  - name: period_end
    dtype: date32
  - name: filed_date
    dtype: date32
  - name: sector
    dtype: string
  - name: sic_code
    dtype: string
  - name: total_assets
    dtype: float64
  - name: total_liabilities
    dtype: float64
  - name: total_equity
    dtype: float64
  - name: current_assets
    dtype: float64
  - name: current_liabilities
    dtype: float64
  - name: retained_earnings
    dtype: float64
  - name: cash_equivalents
    dtype: float64
  - name: total_debt
    dtype: float64
  - name: total_revenue
    dtype: float64
  - name: cost_of_sales
    dtype: float64
  - name: operating_income
    dtype: float64
  - name: net_income
    dtype: float64
  - name: interest_expense
    dtype: float64
  - name: cash_from_operations
    dtype: float64
  - name: cash_from_investing
    dtype: float64
  - name: cash_from_financing
    dtype: float64
  - name: stock_decline
    dtype: int8
  - name: earnings_restate
    dtype: int8
  - name: audit_qualification
    dtype: int8
  - name: sec_enforcement
    dtype: int8
  - name: bankruptcy
    dtype: int8
  - name: earnings_restate_source
    dtype: string
  - name: fwd_ret_252d
    dtype: float64
  - name: mkt_adj_252d
    dtype: float64
  - name: delisted
    dtype: bool
  - name: period_end_xbrl
    dtype: date32
  - name: period_end_label
    dtype: date32
  splits:
  - name: train
    num_examples: 14748
  - name: validation
    num_examples: 6066
  - name: test
    num_examples: 14895
---
# Fin-JEPA Study 0: XBRL Financial Distress Dataset
A tabular dataset of **35,709** company-year observations for
**4,520** unique SEC filers (2012-2023), linking XBRL financial
statement features extracted from 10-K filings to five binary distress
outcomes. Built for the first gate of the
[Financial JEPA](https://github.com/elroy-galbraith/fin-jepa) project.
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("elroyg/fin-jepa-study0")
print(ds)
# DatasetDict({
# train: Dataset(num_rows=14748, ...),
# validation: Dataset(num_rows=6066, ...),
# test: Dataset(num_rows=14895, ...),
# })
train = ds["train"].to_pandas()
print(train.columns.tolist())
```
## Dataset Structure
### Configs
| Config | Description | Files |
|--------|-------------|-------|
| `default` | Merged features + labels, temporally split | `default/{train,validation,test}.parquet` |
| (raw files) | Individual source parquets for advanced use | `raw/*.parquet` (see below) |
### Columns (34 total, including 2 audit columns)
**Identifiers / Metadata (7)**
| Column | Type | Description |
|--------|------|-------------|
| `cik` | string | SEC Central Index Key (10-digit, zero-padded) |
| `ticker` | string | Equity ticker symbol (nullable) |
| `fiscal_year` | int64 | Fiscal year of the 10-K filing |
| `period_end` | date | Fiscal year-end date |
| `filed_date` | date | Date the 10-K was filed with the SEC |
| `sector` | string | Fama-French 12-industry sector |
| `sic_code` | string | 4-digit SIC code (nullable) |
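
Because `cik` is stored as a 10-digit, zero-padded string, CIKs obtained elsewhere as integers or unpadded strings need normalising before they can be joined against this dataset. A minimal sketch of that step follows; `normalize_cik` is a hypothetical helper introduced here for illustration, not part of the dataset tooling.

```python
def normalize_cik(cik) -> str:
    """Return a CIK as a 10-digit, zero-padded string.

    Hypothetical helper: accepts an int or a string of digits.
    """
    digits = str(cik).strip()
    if not digits.isdigit():
        raise ValueError(f"CIK must contain only digits, got {cik!r}")
    if len(digits) > 10:
        raise ValueError(f"CIK has more than 10 digits: {cik!r}")
    return digits.zfill(10)


print(normalize_cik(320193))        # 0000320193
print(normalize_cik("0000320193"))  # 0000320193 (already padded, unchanged)
```
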
**XBRL Financial Features (16)** --- all `float64`, sourced from SEC EDGAR Company Facts API
| Column | Statement | XBRL Concept(s) |
|--------|-----------|-----------------|
| `total_assets` | Balance Sheet | Assets |
| `total_liabilities` | Balance Sheet | Liabilities |
| `total_equity` | Balance Sheet | StockholdersEquity (+fallback) |
| `current_assets` | Balance Sheet | AssetsCurrent |
| `current_liabilities` | Balance Sheet | LiabilitiesCurrent |
| `retained_earnings` | Balance Sheet | RetainedEarningsAccumulatedDeficit |
| `cash_equivalents` | Balance Sheet | CashAndCashEquivalentsAtCarryingValue (+fallbacks) |
| `total_debt` | Balance Sheet | Computed: LongTermDebt + ShortTermBorrowings |
| `total_revenue` | Income | Revenues (+fallbacks) |
| `cost_of_sales` | Income | CostOfGoodsSold (+fallbacks) |
| `operating_income` | Income | OperatingIncomeLoss |
| `net_income` | Income | NetIncomeLoss (+fallbacks) |
| `interest_expense` | Income | InterestExpense (+fallback) |
| `cash_from_operations` | Cash Flow | NetCashProvidedByUsedInOperatingActivities (+fallback) |
| `cash_from_investing` | Cash Flow | NetCashProvidedByUsedInInvestingActivities (+fallback) |
| `cash_from_financing` | Cash Flow | NetCashProvidedByUsedInFinancingActivities (+fallback) |
> Features are **un-normalised** (raw USD values as reported). NaN means the
> concept was not reported in the filing. See the
> [feature engineering module](https://github.com/elroy-galbraith/fin-jepa/blob/main/src/fin_jepa/data/feature_engineering.py)
> for ratio computation, YoY changes, winsorisation, and quantile normalisation.
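
Since the features are raw USD values with NaN for unreported concepts, ratios have to guard against both missing inputs and zero denominators. The sketch below shows one way to do that with pandas. It is not the project's feature engineering module: the function name `add_basic_ratios`, the three ratios chosen, and the demo values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_basic_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Add three illustrative ratios; NaN inputs or zero denominators give NaN."""
    out = df.copy()
    # Turn zero denominators into NaN so division never produces +/-inf.
    assets = out["total_assets"].where(out["total_assets"] != 0)
    cur_liab = out["current_liabilities"].where(out["current_liabilities"] != 0)
    out["leverage"] = out["total_liabilities"] / assets
    out["current_ratio"] = out["current_assets"] / cur_liab
    out["roa"] = out["net_income"] / assets
    return out


# Synthetic rows (not taken from the dataset).
demo = pd.DataFrame({
    "total_assets": [100.0, 0.0, 50.0],
    "total_liabilities": [60.0, 10.0, np.nan],
    "current_assets": [30.0, 5.0, 20.0],
    "current_liabilities": [15.0, 0.0, 10.0],
    "net_income": [8.0, -1.0, 2.5],
})
ratios = add_basic_ratios(demo)
print(ratios[["leverage", "current_ratio", "roa"]])
```

Keeping NaN rather than imputing at this stage preserves the "not reported" signal for whatever normalisation step comes next.
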
**Distress Labels (5)** --- nullable `Int8` (0 = no event, 1 = event, NaN = unavailable), plus the string provenance column `earnings_restate_source` (the 34th column)
| Column | Definition | Source | Positive Rate |
|--------|-----------|--------|--------------|
| `stock_decline` | Market-adjusted 252-day return < -20% | yfinance (forward returns from filing date) | 43.2% |
| `earnings_restate` | Either (a) a 10-K/A-family filing (10-K/A, 10-KT/A, NT 10-K/A) within 1095 days of `period_end` in the EDGAR index, or (b) the XBRL Company Facts API reports ≥2 10-K filings for the same `(cik, period_end)`. See `earnings_restate_source` for per-row provenance. | Reconciled: EDGAR quarterly index + XBRL amendment registry | 22.3% |
| `audit_qualification` | Going-concern or adverse audit opinion | External CSV (Audit Analytics) | 0.0% |
| `sec_enforcement` | SEC enforcement action disclosure | EDGAR EFTS full-text search (noisy proxy) | 0.5% |
| `bankruptcy` | Chapter 7/11 filing (8-K Item 1.03) | EDGAR EFTS full-text search | 0.8% |
> **NaN semantics:** NaN means "data unavailable", **not** "no event".
> Training code must mask NaN labels in the loss function.
>
> **Delisted companies:** Firms with fewer than 252 trading days of data
> after their filing date are marked `delisted=True` and assigned
> `stock_decline=1`.
>
> **`earnings_restate` reconciliation (v1.1):** The v1.0 label was a narrow
> 10-K/A proxy within a 365-day window, and only ~9% of XBRL-detected
> amendments carried `earnings_restate=1`. v1.1 reconciles the two signals:
> the label fires if either (a) a 10-K/A-family filing (including
> `NT 10-K/A` and `10-KT/A`) appears in the EDGAR index within 1095 days
> of `period_end`, or (b) the XBRL Company Facts API reports multiple
> 10-K filings for the same `(cik, period_end)`. The `earnings_restate_source`
> column records per-row which signal fired (`edgar`, `xbrl`, `both`, or
> `none`) so consumers can reproduce v1.0 by filtering to
> `earnings_restate_source in ("edgar", "both")` with the narrow horizon
> applied offline, or ablate by source in Study 1. This is still a proxy;
> for ground truth, use Audit Analytics.
>
> **`audit_qualification` coverage:** This label may be all-NaN if no
> external Audit Analytics data was available at build time.
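
The NaN-masking requirement above can be sketched as a binary cross-entropy that averages only over labelled entries. This NumPy version is a minimal illustration, not the project's training code. The function name `masked_bce` and the choice to return 0.0 when every label is missing are assumptions. It expects the label columns as a float array (for example via `astype("float64")`), so that missing labels appear as NaN.

```python
import numpy as np


def masked_bce(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over entries whose label is not NaN."""
    mask = ~np.isnan(labels)
    if not mask.any():
        return 0.0  # assumption: an all-missing batch contributes no loss
    p = np.clip(probs[mask], eps, 1.0 - eps)  # avoid log(0)
    y = labels[mask]
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(loss.mean())


# Two samples, two labels; one label is unavailable and must be ignored.
probs = np.array([[0.9, 0.2], [0.1, 0.7]])
labels = np.array([[1.0, np.nan], [0.0, 1.0]])
print(masked_bce(probs, labels))
```

Filling NaN labels with 0 instead would teach the model that "unavailable" means "no event", which is what the note above warns against.
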
**Market Context (3)**
| Column | Type | Description |
|--------|------|-------------|
| `fwd_ret_252d` | float64 | Log return over 252 trading days from filing date |
| `mkt_adj_252d` | float64 | Market-adjusted return (stock - S&P 500) |
| `delisted` | bool | True if < 252 trading days remain post-filing |
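
The `stock_decline` rule described earlier (market-adjusted return below -20%, with delisted firms set to 1) can be re-derived from these columns as a consistency check. The sketch below rests on assumptions: that the threshold is applied directly to `mkt_adj_252d` as stored, and that a missing return on a non-delisted row yields a missing label. `derive_stock_decline` is a hypothetical name, and the build pipeline may differ in detail.

```python
import numpy as np
import pandas as pd


def derive_stock_decline(mkt_adj: pd.Series, delisted: pd.Series,
                         threshold: float = -0.20) -> pd.Series:
    """Recompute the label: 1 if return < threshold or delisted, NA if unknown."""
    below = (mkt_adj < threshold).astype(float)          # 1.0 / 0.0
    vals = np.where(mkt_adj.isna(), np.nan, below)       # unknown return -> NaN
    vals = np.where(delisted.to_numpy(dtype=bool), 1.0, vals)  # delisted -> 1
    return pd.Series(vals, index=mkt_adj.index).astype("Int8")


# Synthetic values (not taken from the dataset).
mkt_adj = pd.Series([-0.35, 0.10, np.nan, np.nan])
delisted = pd.Series([False, False, True, False])
derived = derive_stock_decline(mkt_adj, delisted)
print(derived.tolist())
```

Comparing `derived` against the shipped `stock_decline` column is one way to spot rows where the assumptions above do not hold.
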
**Audit Columns (2)** --- for diagnosing date mismatches between pipelines
| Column | Type | Description |
|--------|------|-------------|
| `period_end_xbrl` | date | Fiscal year-end date from XBRL filing metadata (= `period_end`) |
| `period_end_label` | date | Fiscal year-end date from the label/market pipeline |
> The XBRL and label pipelines sometimes disagree on the canonical fiscal
> year-end date by a few days (e.g. Dec 29 vs Dec 26 for companies whose
> fiscal year ends on the last Saturday of December). The dataset joins on
> `(cik, fiscal_year)` and uses the XBRL date as canonical `period_end`.
> Rows where the two dates differ by more than 14 days are dropped as
> mismatches. These audit columns let you verify alignment.
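
One way to verify that alignment is to compute the absolute gap in days between the two audit columns and confirm it never exceeds the 14-day tolerance. The sketch below runs on a small synthetic frame. `period_end_gap_days` is a hypothetical helper, and the CIKs and dates are made up for illustration.

```python
import pandas as pd


def period_end_gap_days(df: pd.DataFrame) -> pd.Series:
    """Absolute difference in days between the XBRL and label period-end dates."""
    xbrl = pd.to_datetime(df["period_end_xbrl"])
    label = pd.to_datetime(df["period_end_label"])
    return (xbrl - label).abs().dt.days


# Synthetic rows (not taken from the dataset).
demo = pd.DataFrame({
    "cik": ["0000000001", "0000000002"],
    "period_end_xbrl": ["2018-12-29", "2019-12-31"],
    "period_end_label": ["2018-12-31", "2019-12-31"],
})
gaps = period_end_gap_days(demo)
print(gaps.tolist())             # [2, 0]
print(bool((gaps <= 14).all()))  # True
```
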
## Temporal Splits
Splits are **strictly temporal** to prevent look-ahead bias:
| Split | Period End Range | Rows | Purpose |
|-------|-----------------|------|---------|
| `train` | ≤ 2017-12-31 | 14,748 | Model training |
| `validation` | 2018-01-01 to 2019-12-31 | 6,066 | Hyperparameter tuning |
| `test` | 2020-01-01 to 2023-12-31 | 14,895 | Final evaluation |
> The test period covers COVID-19, high-inflation, and rising-rate regimes,
> testing out-of-distribution generalisation.
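
The boundaries in the table above can be applied to any frame that has a `period_end` column, for instance after merging the raw files yourself. This is a minimal sketch: `assign_split` is a hypothetical helper, and it does not enforce the 2023-12-31 upper bound of the test period.

```python
import pandas as pd

TRAIN_END = pd.Timestamp("2017-12-31")  # train: period_end <= 2017-12-31
VAL_END = pd.Timestamp("2019-12-31")    # validation: 2018-01-01 .. 2019-12-31


def assign_split(period_end: pd.Series) -> pd.Series:
    """Map each fiscal period-end date to 'train', 'validation', or 'test'."""
    dates = pd.to_datetime(period_end)
    split = pd.Series("test", index=dates.index)
    split[dates <= VAL_END] = "validation"
    split[dates <= TRAIN_END] = "train"   # applied last so it overrides validation
    return split


demo_dates = pd.Series(["2017-12-31", "2018-01-01", "2019-12-31", "2020-01-01"])
print(assign_split(demo_dates).tolist())
# ['train', 'validation', 'validation', 'test']
```

Splitting on `period_end` rather than at random keeps every validation and test observation strictly later than the training data.
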
## Data Sources
All data is derived from **public, free sources**, except the optional `audit_qualification` label, which depends on an external Audit Analytics CSV:
| Component | Source | URL |
|-----------|--------|-----|
| XBRL Features | SEC EDGAR Company Facts API | `https://data.sec.gov/api/xbrl/companyfacts/` |
| Filing Index | SEC EDGAR Quarterly Index | `https://www.sec.gov/Archives/edgar/full-index/` |
| Market Data | Yahoo Finance (via yfinance) | `https://finance.yahoo.com/` |
| SEC Enforcement | EDGAR EFTS Full-Text Search | `https://efts.sec.gov/LATEST/search-index` |
| Bankruptcy | EDGAR EFTS (8-K Item 1.03) | `https://efts.sec.gov/LATEST/search-index` |
The universe includes **delisted, bankrupt, and acquired companies** ---
there is no survivorship bias.
## Raw Files
For advanced users who want the individual source tables, four parquets
are available in the `raw/` directory:
```python
from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="elroyg/fin-jepa-study0",
    repo_type="dataset",
    filename="raw/xbrl_features.parquet",
)
```
| File | Rows | Description |
|------|------|-------------|
| `raw/xbrl_features.parquet` | ~45k | 16 raw XBRL features per company-year |
| `raw/label_database.parquet` | ~45k | 5 binary distress labels |
| `raw/company_universe.parquet` | ~14k | Company metadata (one row per filer) |
| `raw/market_aligned.parquet` | ~45k | Forward returns at 4 horizons + market benchmarks |
## Citation
```bibtex
@dataset{galbraith2025finjepa,
author = {Galbraith, Elroy},
title = {Fin-JEPA Study 0: XBRL Financial Distress Dataset},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/elroyg/fin-jepa-study0}
}
```
## License
MIT
Provider: elroyg



