
elroyg/fin-jepa-study0

Source: Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/elroyg/fin-jepa-study0
Resource description:
---
license: mit
task_categories:
- tabular-classification
tags:
- finance
- xbrl
- distress-prediction
- sec-edgar
- tabular
- financial-statements
pretty_name: "Fin-JEPA Study 0: XBRL Financial Distress Dataset"
size_categories:
- 10K<n<100K
language:
- en
configs:
- config_name: default
  data_files:
  - split: train
    path: default/train.parquet
  - split: validation
    path: default/validation.parquet
  - split: test
    path: default/test.parquet
  default: true
dataset_info:
- config_name: default
  features:
  - name: cik
    dtype: string
  - name: ticker
    dtype: string
  - name: fiscal_year
    dtype: int64
  - name: period_end
    dtype: date32
  - name: filed_date
    dtype: date32
  - name: sector
    dtype: string
  - name: sic_code
    dtype: string
  - name: total_assets
    dtype: float64
  - name: total_liabilities
    dtype: float64
  - name: total_equity
    dtype: float64
  - name: current_assets
    dtype: float64
  - name: current_liabilities
    dtype: float64
  - name: retained_earnings
    dtype: float64
  - name: cash_equivalents
    dtype: float64
  - name: total_debt
    dtype: float64
  - name: total_revenue
    dtype: float64
  - name: cost_of_sales
    dtype: float64
  - name: operating_income
    dtype: float64
  - name: net_income
    dtype: float64
  - name: interest_expense
    dtype: float64
  - name: cash_from_operations
    dtype: float64
  - name: cash_from_investing
    dtype: float64
  - name: cash_from_financing
    dtype: float64
  - name: stock_decline
    dtype: int8
  - name: earnings_restate
    dtype: int8
  - name: audit_qualification
    dtype: int8
  - name: sec_enforcement
    dtype: int8
  - name: bankruptcy
    dtype: int8
  - name: earnings_restate_source
    dtype: string
  - name: fwd_ret_252d
    dtype: float64
  - name: mkt_adj_252d
    dtype: float64
  - name: delisted
    dtype: bool
  - name: period_end_xbrl
    dtype: date32
  - name: period_end_label
    dtype: date32
  splits:
  - name: train
    num_examples: 14748
  - name: validation
    num_examples: 6066
  - name: test
    num_examples: 14895
---

# Fin-JEPA Study 0: XBRL Financial Distress Dataset

A tabular dataset of **35,709** company-year observations for **4,520** unique SEC
filers (2012–2023), linking XBRL financial statement features extracted from 10-K filings to five binary distress outcomes. Built for the first gate of the [Financial JEPA](https://github.com/elroy-galbraith/fin-jepa) project.

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("elroyg/fin-jepa-study0")
print(ds)
# DatasetDict({
#     train: Dataset(num_rows=14748, ...),
#     validation: Dataset(num_rows=6066, ...),
#     test: Dataset(num_rows=14895, ...),
# })

train = ds["train"].to_pandas()
print(train.columns.tolist())
```

## Dataset Structure

### Configs

| Config | Description | Files |
|--------|-------------|-------|
| `default` | Merged features + labels, temporally split | `default/{train,validation,test}.parquet` |
| (raw files) | Individual source parquets for advanced use | `raw/*.parquet` (see below) |

### Columns (34 total, including 2 audit columns)

**Identifiers / Metadata (7)**

| Column | Type | Description |
|--------|------|-------------|
| `cik` | string | SEC Central Index Key (10-digit, zero-padded) |
| `ticker` | string | Equity ticker symbol (nullable) |
| `fiscal_year` | int64 | Fiscal year of the 10-K filing |
| `period_end` | date | Fiscal year-end date |
| `filed_date` | date | Date the 10-K was filed with the SEC |
| `sector` | string | Fama-French 12-industry sector |
| `sic_code` | string | 4-digit SIC code (nullable) |

**XBRL Financial Features (16)**: all `float64`, sourced from the SEC EDGAR Company Facts API

| Column | Statement | XBRL Concept(s) |
|--------|-----------|-----------------|
| `total_assets` | Balance Sheet | Assets |
| `total_liabilities` | Balance Sheet | Liabilities |
| `total_equity` | Balance Sheet | StockholdersEquity (+fallback) |
| `current_assets` | Balance Sheet | AssetsCurrent |
| `current_liabilities` | Balance Sheet | LiabilitiesCurrent |
| `retained_earnings` | Balance Sheet | RetainedEarningsAccumulatedDeficit |
| `cash_equivalents` | Balance Sheet | CashAndCashEquivalentsAtCarryingValue (+fallbacks) |
| `total_debt` | Balance Sheet | Computed: LongTermDebt + ShortTermBorrowings |
| `total_revenue` | Income | Revenues (+fallbacks) |
| `cost_of_sales` | Income | CostOfGoodsSold (+fallbacks) |
| `operating_income` | Income | OperatingIncomeLoss |
| `net_income` | Income | NetIncomeLoss (+fallbacks) |
| `interest_expense` | Income | InterestExpense (+fallback) |
| `cash_from_operations` | Cash Flow | NetCashProvidedByUsedInOperatingActivities (+fallback) |
| `cash_from_investing` | Cash Flow | NetCashProvidedByUsedInInvestingActivities (+fallback) |
| `cash_from_financing` | Cash Flow | NetCashProvidedByUsedInFinancingActivities (+fallback) |

> Features are **un-normalised** (raw USD values as reported). NaN means the
> concept was not reported in the filing. See the
> [feature engineering module](https://github.com/elroy-galbraith/fin-jepa/blob/main/src/fin_jepa/data/feature_engineering.py)
> for ratio computation, YoY changes, winsorisation, and quantile normalisation.

**Distress Labels (5)**: nullable `Int8` (0 = no event, 1 = event, NaN = unavailable)

| Column | Definition | Source | Positive Rate |
|--------|-----------|--------|--------------|
| `stock_decline` | Market-adjusted 252-day return < -20% | yfinance (forward returns from filing date) | 43.2% |
| `earnings_restate` | Either (a) a 10-K/A-family filing (10-K/A, 10-KT/A, NT 10-K/A) within 1095 days of `period_end` in the EDGAR index, or (b) the XBRL Company Facts API reports ≥2 10-K filings for the same `(cik, period_end)`. See `earnings_restate_source` for per-row provenance. | Reconciled: EDGAR quarterly index + XBRL amendment registry | 22.3% |
| `audit_qualification` | Going-concern or adverse audit opinion | External CSV (Audit Analytics) | 0.0% |
| `sec_enforcement` | SEC enforcement action disclosure | EDGAR EFTS full-text search (noisy proxy) | 0.5% |
| `bankruptcy` | Chapter 7/11 filing (8-K Item 1.03) | EDGAR EFTS full-text search | 0.8% |

> **NaN semantics:** NaN means "data unavailable", **not** "no event".
> Training code must mask NaN labels in the loss function.
>
> **Delisted companies:** Firms with fewer than 252 trading days of data
> after their filing date are marked `delisted=True` and assigned
> `stock_decline=1`.
>
> **`earnings_restate` reconciliation (v1.1):** The v1.0 label was a narrow
> 10-K/A proxy within a 365-day window, and only ~9% of XBRL-detected
> amendments carried `earnings_restate=1`. v1.1 reconciles the two signals:
> the label fires if either (a) a 10-K/A-family filing (including
> `NT 10-K/A` and `10-KT/A`) appears in the EDGAR index within 1095 days
> of `period_end`, or (b) the XBRL Company Facts API reports multiple
> 10-K filings for the same `(cik, period_end)`. The `earnings_restate_source`
> column records per-row which signal fired (`edgar`, `xbrl`, `both`, or
> `none`) so consumers can reproduce v1.0 by filtering to
> `earnings_restate_source in ("edgar", "both")` with the narrow horizon
> applied offline, or ablate by source in Study 1. This is still a proxy;
> for ground truth, use Audit Analytics.
>
> **`audit_qualification` coverage:** This label may be all-NaN if no
> external Audit Analytics data was available at build time.

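
The NaN-masking requirement in the note above can be sketched as follows. This is a minimal NumPy illustration, not the project's training code: the function name `masked_bce` and the `eps` clipping constant are assumptions introduced here, and only the label column names come from the dataset card.

```python
import numpy as np

# Label columns as listed in the dataset card.
LABEL_COLUMNS = [
    "stock_decline",
    "earnings_restate",
    "audit_qualification",
    "sec_enforcement",
    "bankruptcy",
]


def masked_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over entries whose label is not NaN.

    `masked_bce` is a hypothetical helper for illustration. `probs` and
    `labels` are arrays of the same shape, e.g. (n_rows, n_labels).
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)

    mask = ~np.isnan(labels)            # True where a label is available
    safe = np.where(mask, labels, 0.0)  # placeholder value under the mask

    per_entry = -(safe * np.log(probs) + (1.0 - safe) * np.log(1.0 - probs))
    n_valid = mask.sum()
    if n_valid == 0:
        return 0.0                      # e.g. an all-NaN audit_qualification batch
    return float((per_entry * mask).sum() / n_valid)
```

With pandas, one way to obtain the label matrix is `train[LABEL_COLUMNS].to_numpy(dtype=float, na_value=np.nan)`. Treating NaN as 0 instead of masking it would silently turn "data unavailable" into "no event", which the note above warns against.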

**Market Context (3)**

| Column | Type | Description |
|--------|------|-------------|
| `fwd_ret_252d` | float64 | Log return over 252 trading days from filing date |
| `mkt_adj_252d` | float64 | Market-adjusted return (stock - S&P 500) |
| `delisted` | bool | True if < 252 trading days remain post-filing |

**Audit Columns (2)**: for diagnosing date mismatches between pipelines

| Column | Type | Description |
|--------|------|-------------|
| `period_end_xbrl` | date | Fiscal year-end date from XBRL filing metadata (= `period_end`) |
| `period_end_label` | date | Fiscal year-end date from the label/market pipeline |

> The XBRL and label pipelines sometimes disagree on the canonical fiscal
> year-end date by a few days (e.g. Dec 29 vs Dec 26 for companies whose
> fiscal year ends on the last Saturday of December). The dataset joins on
> `(cik, fiscal_year)` and uses the XBRL date as canonical `period_end`.
> Rows where the two dates differ by more than 14 days are dropped as
> mismatches. These audit columns let you verify alignment.

## Temporal Splits

Splits are **strictly temporal** to prevent look-ahead bias:

| Split | Period End Range | Rows | Purpose |
|-------|-----------------|------|---------|
| `train` | ≤ 2017-12-31 | 14,748 | Model training |
| `validation` | 2018-01-01 to 2019-12-31 | 6,066 | Hyperparameter tuning |
| `test` | 2020-01-01 to 2023-12-31 | 14,895 | Final evaluation |

> The test period covers COVID-19, high-inflation, and rising-rate regimes,
> testing out-of-distribution generalisation.

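
The split boundaries in the table above and the 14-day audit rule can be expressed as two small checks. This is a sketch using only the standard library; the helper names `assign_split` and `dates_aligned` are hypothetical, and the actual build pipeline may implement these rules differently.

```python
from datetime import date


def assign_split(period_end):
    """Map a fiscal period-end date to the split named in the table above.

    Hypothetical helper. Returns None for dates after the documented range.
    """
    if period_end <= date(2017, 12, 31):
        return "train"
    if period_end <= date(2019, 12, 31):
        return "validation"
    if period_end <= date(2023, 12, 31):
        return "test"
    return None


def dates_aligned(period_end_xbrl, period_end_label, max_gap_days=14):
    """True if the two pipelines' period-end dates differ by at most 14 days.

    Rows failing this check are the ones the card describes as dropped.
    """
    gap = abs((period_end_xbrl - period_end_label).days)
    return gap <= max_gap_days
```

Applied to a loaded split, every row's `period_end` should map back to its own split name, and `dates_aligned(row.period_end_xbrl, row.period_end_label)` should hold for every retained row if the published files follow the rules as described.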
## Data Sources

All data is derived from **public, free sources**:

| Component | Source | URL |
|-----------|--------|-----|
| XBRL Features | SEC EDGAR Company Facts API | `https://data.sec.gov/api/xbrl/companyfacts/` |
| Filing Index | SEC EDGAR Quarterly Index | `https://www.sec.gov/Archives/edgar/full-index/` |
| Market Data | Yahoo Finance (via yfinance) | `https://finance.yahoo.com/` |
| SEC Enforcement | EDGAR EFTS Full-Text Search | `https://efts.sec.gov/LATEST/search-index` |
| Bankruptcy | EDGAR EFTS (8-K Item 1.03) | `https://efts.sec.gov/LATEST/search-index` |

The universe includes **delisted, bankrupt, and acquired companies**; there is no survivorship bias.

## Raw Files

For advanced users who want the individual source tables, four parquets are available in the `raw/` directory:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="elroyg/fin-jepa-study0",
    repo_type="dataset",
    filename="raw/xbrl_features.parquet",
)
```

| File | Rows | Description |
|------|------|-------------|
| `raw/xbrl_features.parquet` | ~45k | 16 raw XBRL features per company-year |
| `raw/label_database.parquet` | ~45k | 5 binary distress labels |
| `raw/company_universe.parquet` | ~14k | Company metadata (one row per filer) |
| `raw/market_aligned.parquet` | ~45k | Forward returns at 4 horizons + market benchmarks |

## Citation

```bibtex
@dataset{galbraith2025finjepa,
  author    = {Galbraith, Elroy},
  title     = {Fin-JEPA Study 0: XBRL Financial Distress Dataset},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/elroyg/fin-jepa-study0}
}
```

## License

MIT

Provider:
elroyg