ttchopper/openfundex
Source: Hugging Face · Updated 2026-04-20 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ttchopper/openfundex
Resource description:
---
license: cc-by-4.0
language:
- en
pretty_name: OpenFundex
tags:
- finance
- value-investing
- sec-filings
- fundamental-analysis
- tabular
task_categories:
- tabular-classification
- tabular-regression
size_categories:
- 100K<n<1M
configs:
  - config_name: default
    default: true
    data_files:
      - { split: train, path: train_clean.parquet }
      - { split: validation, path: validation_clean.parquet }
      - { split: test, path: test_clean.parquet }
  - config_name: full
    data_files:
      - { split: train, path: train.parquet }
      - { split: validation, path: validation.parquet }
      - { split: test, path: test.parquet }
  - config_name: live
    data_files:
      - { split: clean, path: recent_clean.parquet }
      - { split: full, path: recent.parquet }
---
# OpenFundex Dataset
A structured dataset of SEC financial filings for deep value analysis and financial distress prediction.
## Dataset Description
- **License**: CC-BY-4.0
- **Language**: English
### Summary
OpenFundex contains financial statement data extracted from SEC EDGAR filings,
enriched with derived financial metrics and labeled with established quality scores
(Piotroski F-Score, Altman Z'-Score, Graham metrics). Designed for training ML
models to assess company financial health and identify deep value opportunities.
**Key design decision:** This dataset uses only fundamental data from SEC filings.
No market prices or equity trading data are included, which avoids the survivorship bias of price-based datasets.
## Supported Tasks
- **Tabular Classification**: Predict financial distress, bankruptcy, value creation, fundamental improvement
- **Tabular Regression**: Predict quality scores (f_score, z_prime_score, composite_quality_score), growth rates
- **Anomaly Detection**: Identify companies in financial distress or with QA anomalies
## Dataset Structure
### Splits
| Split | Records | Companies | Date Range |
|-------|---------|-----------|------------|
| train | 221,779 | 11,786 | 2008-12-31 to 2019-12-31 |
| validation | 45,109 | 7,207 | 2020-01-31 to 2021-12-31 |
| test | 47,334 | 7,208 | 2022-01-31 to 2023-12-31 |
| recent | 43,288 | 6,507 | 2024-01-31 to 2026-02-28 |
**Total records:** 357,510
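The card metadata declares three configs, and the v0.5.0 changelog entry describes how each one loads. The following is a minimal sketch of that loader contract, assuming the Hugging Face `datasets` library is installed. The `CONFIG_SPLITS` table and the `load_openfundex` helper are illustrative names, not part of the dataset.

```python
from typing import Dict, List

# Config -> split names, copied from the `configs` block in the card metadata.
CONFIG_SPLITS: Dict[str, List[str]] = {
    "default": ["train", "validation", "test"],  # *_clean.parquet files
    "full": ["train", "validation", "test"],     # unfiltered historical files
    "live": ["clean", "full"],                   # recent*.parquet files
}


def load_openfundex(config: str = "default", split: str = "train"):
    """Load one split of one config, validating the names first.

    Hypothetical convenience wrapper around `datasets.load_dataset`.
    """
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown config {config!r}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the name checks above work without the library.
    from datasets import load_dataset

    return load_dataset("ttchopper/openfundex", config, split=split)
```

For example, `load_openfundex("live", "clean")` would request the `clean` split of the `live` config (`recent_clean.parquet`), while `load_openfundex()` would request the default training split.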
### Feature Groups
| Group | Count |
|-------|-------|
| Identifiers | 5 |
| Context | 4 |
| Raw Features (SEC XBRL) | 33 |
| Derived Features | 8 |
| Engineered Features | 23 |
| QA Flags | 4 |
| Prediction Targets | 21 |
| Rank Targets | 14 |
## Scoring Models
- **Piotroski F-Score** (0-9): Nine binary signals measuring profitability, leverage, and operating efficiency. Null when no prior quarter is available for the delta signals.
- **Altman Z'-Score** (Float): Private-firm bankruptcy risk variant with zone classification (safe/grey/distress). Null for financial firms (SIC 6000-6999).
- **Beneish Coverage** (0-8): Count of computable M-Score components. Full M-Score is computed transiently during enrichment but not retained.
- **Graham Metrics**: Graham Number, NCAV/share, tangible book value/share, net working capital/share, defensive score (0-5).
- **Quality Signals**: Cash conversion ratio, accrual ratio, free cash flow margin.
- **Composite Quality Score**: Z-score normalized average of key quality signals within each quarter cross-section.
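Two of the scores above have widely published closed forms. The sketch below uses the textbook Altman Z'-Score coefficients and zone cutoffs (2.90 and 1.23) and the classic Graham Number formula. The card does not show the pipeline's own implementation, so the function names, argument names, and null handling here are assumptions.

```python
import math
from typing import Optional, Tuple


def altman_z_prime(
    working_capital: float,
    retained_earnings: float,
    ebit: float,
    book_equity: float,
    sales: float,
    total_assets: float,
    total_liabilities: float,
    sic: Optional[int] = None,
) -> Tuple[Optional[float], Optional[str]]:
    """Return (Z'-Score, zone), or (None, None) when it is not computable."""
    # Financial firms (SIC 6000-6999) get a null score, as the card states.
    if sic is not None and 6000 <= sic <= 6999:
        return None, None
    # Assumed guard: avoid dividing by zero or negative denominators.
    if total_assets <= 0 or total_liabilities <= 0:
        return None, None
    z = (
        0.717 * working_capital / total_assets
        + 0.847 * retained_earnings / total_assets
        + 3.107 * ebit / total_assets
        + 0.420 * book_equity / total_liabilities
        + 0.998 * sales / total_assets
    )
    if z > 2.90:
        zone = "safe"
    elif z >= 1.23:
        zone = "grey"
    else:
        zone = "distress"
    return z, zone


def graham_number(eps: float, bvps: float) -> Optional[float]:
    """Graham Number = sqrt(22.5 * EPS * BVPS); None if either input is not positive."""
    if eps <= 0 or bvps <= 0:
        return None
    return math.sqrt(22.5 * eps * bvps)
```

Both functions need only balance-sheet and income-statement inputs, which is consistent with the dataset's choice to exclude market prices.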
## Target Columns
21 forward-looking prediction targets using same-quarter year-over-year comparisons:
### 1-Year Targets
- **Growth rates** (6): BVPS, equity, earnings, revenue, OCF, FCF growth
- **Level/delta** (2): Forward ROE, margin expansion
- **Binary** (5): ROA improved, fundamentals improved (≥3 of 5 metrics), value created (equity grew AND ROE>0), survived (filed Q+4 AND no bankruptcy in window), filed for bankruptcy (petition within 365 days)
### 2-Year Targets
- **Growth rates** (6): Same metrics as 1-year, over 2-year horizon
- **Binary** (2): Survived (filed Q+8 AND no bankruptcy in window), filed for bankruptcy (petition within 730 days)
Survived and bankruptcy targets are mutually exclusive by construction. All targets are null when forward quarter data is unavailable.
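As an illustration of how such targets can be labeled, the sketch below computes one growth target and the "value created" flag from a current quarter and its same-quarter forward value. The denominator convention (absolute value of the current figure), the use of forward ROE, and the function names are assumptions; the card specifies only the definitions in the list above and that targets are null when forward data is missing.

```python
from typing import Optional


def yoy_growth(current: Optional[float], forward: Optional[float]) -> Optional[float]:
    """Growth from the current quarter to the same quarter one or two years on.

    Returns None when either value is missing or the base is zero.
    """
    if current is None or forward is None or current == 0:
        return None
    # abs() keeps the sign meaningful when the base is negative (assumed convention).
    return (forward - current) / abs(current)


def value_created(
    equity_now: Optional[float],
    equity_fwd: Optional[float],
    roe_fwd: Optional[float],
) -> Optional[bool]:
    """Binary target: equity grew AND ROE > 0; None when forward data is missing."""
    if equity_now is None or equity_fwd is None or roe_fwd is None:
        return None
    return equity_fwd > equity_now and roe_fwd > 0
```

Returning `None` instead of `False` for missing forward data keeps "no label" distinct from a negative label, which matters for the most recent rows where the forward quarter has not been filed yet.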
### Rank-Transformed Targets (14 columns)
Cross-sectional percentile ranks in the interval (0, 1] for all 14 Float64 targets, computed per quarter
using `rank("average") / count()`. Raw growth targets are extremely skewed
(mean ~3.1, median ~0.03) and produce negative information coefficients (IC) for
regression models. Rank-transforming the targets yields an IC of ~0.37.
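The rank transform can be sketched in plain Python for a single quarter's cross-section. This is an illustration of `rank("average") / count()`, not the pipeline's code. It assumes nulls are left null and excluded from the count, and `percentile_rank` is a hypothetical name.

```python
from typing import Dict, List, Optional


def percentile_rank(values: List[Optional[float]]) -> List[Optional[float]]:
    """Average-rank percentile in (0, 1] for one quarter's target values."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    first: Dict[float, int] = {}
    last: Dict[float, int] = {}
    # Record the first and last 1-based sorted position of each distinct value.
    for pos, v in enumerate(present, start=1):
        first.setdefault(v, pos)
        last[v] = pos
    # Tied values share the mean of the positions they occupy.
    return [
        None if v is None else (first[v] + last[v]) / 2 / n
        for v in values
    ]
```

Applied to `[0.10, 5.0, None, 0.10, -0.2]`, the outlier `5.0` maps to `1.0` and the two tied values share rank 2.5 of 4, so the extreme growth value no longer dominates the scale.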
## Dataset Creation
### Source Data
All data sourced from SEC EDGAR: Financial Statement Data Sets (FSDS) for fundamentals
and Full-Text Search API (EFTS) for 8-K Item 1.03 bankruptcy filings.
No market data providers. No third-party data. No equity pricing data.
### Pipeline
1. **Ingest**: Download quarterly SEC FSDS ZIP files and bankruptcy events from EDGAR
2. **Parse**: Extract XBRL financial data, normalize 32 tags to standard fields
3. **Enrich**: Compute derived ratios (8), scoring models (5), bankruptcy flags, and QA flags
4. **Label**: Generate 21 forward-looking prediction targets (including bankruptcy)
5. **Split**: Temporal train/validation/test/recent splits with leakage validation
6. **Evaluate**: Quality checks, ML fitness, and publication readiness
7. **Publish**: Stage and upload to Hugging Face Hub
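Step 5 can be sketched as a date-based assignment followed by an ordering check. The boundaries below are inferred from the date ranges in the Splits table, and the check covers only overlap between split date ranges. The pipeline's actual leakage validation is not described in the card, so treat the names and logic as assumptions.

```python
from datetime import date
from typing import Dict, List

# Last period-end date of each historical split (inferred from the Splits table).
SPLIT_END = [
    ("train", date(2019, 12, 31)),
    ("validation", date(2021, 12, 31)),
    ("test", date(2023, 12, 31)),
]
SPLIT_ORDER = ["train", "validation", "test", "recent"]


def assign_split(period_end: date) -> str:
    """Map a filing's period-end date to a split name."""
    for name, last_day in SPLIT_END:
        if period_end <= last_day:
            return name
    return "recent"


def check_no_temporal_overlap(dates_by_split: Dict[str, List[date]]) -> None:
    """Raise ValueError if any split contains dates at or after a later split's start."""
    present = [s for s in SPLIT_ORDER if dates_by_split.get(s)]
    for earlier, later in zip(present, present[1:]):
        if max(dates_by_split[earlier]) >= min(dates_by_split[later]):
            raise ValueError(f"{earlier} overlaps {later} in time")
```

Assigning by date first and validating afterwards keeps the two concerns separate: a bug in the assignment rule shows up as a failed check instead of silently mixing periods.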
## Considerations
### Known Limitations
- XBRL coverage varies: some companies report fewer standardized tags
- F-Score delta components require prior quarter data (null for first appearance)
- Z'-Score was designed for manufacturing firms; interpretation varies by sector
- No market data: cannot compute price-based metrics (P/E, market cap, etc.)
- `bankruptcy_chapter` is populated for bankrupt CIKs (values like `"7"`, `"11"`,
`"15"`, `"9"` extracted via 8-K Item 1.03 filing bodies); non-bankrupt rows
are null. Edge cases not yet handled: chapter spelled out as a word
("chapter eleven") and USC-only citations with no "chapter" keyword both
yield null.
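The two edge cases in the last item are easy to see with a simplified pattern. The regex below is a hypothetical stand-in, not the pipeline's extractor: it omits the guardword-gated window search mentioned in the changelog and simply looks for the word "chapter" followed by one of the numeric chapters.

```python
import re
from typing import Optional

# Hypothetical pattern: "chapter" followed by a numeric chapter 7, 9, 11, or 15.
_CHAPTER_RE = re.compile(r"\bchapter\s+(7|9|11|15)\b", re.IGNORECASE)


def extract_chapter(item_103_text: str) -> Optional[str]:
    """Return the chapter number as a string, or None when no match is found."""
    match = _CHAPTER_RE.search(item_103_text)
    return match.group(1) if match else None
```

With this pattern, "filed a voluntary petition under Chapter 11" yields `"11"`, while "chapter eleven" and a citation such as "11 U.S.C. 101" both yield `None`, matching the limitations described above.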
### Bias Considerations
- **No survivorship bias**: Uses only SEC filing data, not equity market prices
- **Temporal integrity**: Strict time-based splits prevent data leakage
- **Sector bias**: Z'-Score thresholds may not be equally applicable across all sectors
- **Financial firms excluded from Z'-Score**: Financial companies (SIC 6000-6999) have null Z'-Score values
## License
This dataset is released under the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/).
The underlying SEC data is in the public domain.
## Citation
```bibtex
@dataset{openfundex,
  title={OpenFundex: SEC Financial Filings for Deep Value Analysis},
  author={Danielson, Luke},
  year={2026},
  url={https://github.com/danielukea/openfundex},
  license={CC-BY-4.0}
}
```
## Changelog
### v0.6.0 — 2026-04-17
- **`bankruptcy_chapter` populated:** Chapter number (`"7"`, `"11"`, `"15"`, `"9"`)
is now extracted from the 8-K Item 1.03 filing body for each bankrupt CIK and
stored in the `bankruptcy_chapter` column. Non-bankrupt rows remain null.
Extraction uses a guardword-gated Item 1.03 window search; known limitations
(word-spelled chapters, USC-only citations) are documented above.
### v0.5.0 — 2026-04-17
- **Loader contract:** dataset card now declares explicit `configs` and `data_files`. `load_dataset("ttchopper/openfundex")` returns the QA-filtered historical splits (`train`/`validation`/`test`); unfiltered historical data is available via `load_dataset(..., "full")`; live 2024+ inference data via `load_dataset(..., "live")`.
- Previously the card omitted `configs:`, causing the Hub to glob all parquets and misassemble splits (paired `_clean` files were concatenated or surfaced as extra splits, and `recent*` was not declared at all).
- Raw parquet access (`hf://datasets/ttchopper/openfundex/train.parquet`) is unaffected.
Provider:
ttchopper



