
tidy-finance/factor-library-grid

Hugging Face | Updated 2026-03-20 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/tidy-finance/factor-library-grid
Resource description:
---
pretty_name: Tidy Finance Factor Library Specification Grid
license: mit
language:
- en
tags:
- finance
- asset-pricing
- factor-models
- portfolio-sorts
- empirical-finance
size_categories:
- 100K<n<1M
---

# Tidy Finance Factor Library: Specification Grid

Lookup table mapping specification IDs to portfolio sorting configurations. Use this dataset together with the [Portfolio Returns](https://huggingface.co/datasets/tidy-finance/portfolio-returns) dataset to identify the methodological choices behind each factor return series.

## Dataset Details

### Dataset Description

The dataset contains approximately 180,000 unique specification paths for constructing long-short portfolio returns. Each row defines a complete set of preprocessing and sorting choices (sample exclusions, lagging convention, breakpoint definition, weighting scheme, rebalancing frequency). The `id` column links to the corresponding return series in the Portfolio Returns dataset.

- **Curated by:** Christoph Frey (Lancaster University), Christoph Scheuch (Tidy Intelligence), Stefan Voigt (University of Copenhagen), Patrick Weiss (Reykjavík University)
- **Funded by:** Danish Finance Institute
- **License:** MIT

### Dataset Sources

- **Repository:** [https://github.com/tidy-finance/jss-multilingual-factor-library](https://github.com/tidy-finance/jss-multilingual-factor-library)
- **R package:** [https://github.com/tidy-finance/r-tidyfinance](https://github.com/tidy-finance/r-tidyfinance)
- **Python package:** [https://github.com/tidy-finance/py-tidyfinance](https://github.com/tidy-finance/py-tidyfinance)
- **Demo:** [https://app-download-center.cloud.sdu.dk/](https://app-download-center.cloud.sdu.dk/)

## Uses

### Direct Use

- Joining with the Portfolio Returns dataset to filter or group factor returns by specific methodological choices.
- Robustness and sensitivity analysis: selecting subsets of specifications to study how preprocessing decisions affect factor premia.
- Replication: documenting the exact configuration behind a reported result.

### Out-of-Scope Use

- Standalone analysis. The grid contains no return data and must be joined with the Portfolio Returns dataset via the `id` column.

## Dataset Structure

The dataset consists of a single Parquet file with 13 columns and approximately 180,000 rows.

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>id</code></td>
      <td>int32</td>
      <td>Unique specification identifier, foreign key to the Portfolio Returns dataset</td>
    </tr>
    <tr>
      <td><code>sorting_variable</code></td>
      <td>string</td>
      <td>Sorting characteristic (e.g., <code>sv_ag</code> for asset growth, <code>sv_bm</code> for book-to-market)</td>
    </tr>
    <tr>
      <td><code>exclude_size</code></td>
      <td>double</td>
      <td>Size exclusion threshold: <code>0</code> (none) or <code>0.2</code> (bottom 20th NYSE percentile)</td>
    </tr>
    <tr>
      <td><code>exclude_financials</code></td>
      <td>bool</td>
      <td>Whether financial firms (SIC 6000-6799) are excluded</td>
    </tr>
    <tr>
      <td><code>exclude_utilities</code></td>
      <td>bool</td>
      <td>Whether utility firms (SIC 4900-4999) are excluded</td>
    </tr>
    <tr>
      <td><code>exclude_negative_earnings</code></td>
      <td>bool</td>
      <td>Whether firms with negative earnings are excluded</td>
    </tr>
    <tr>
      <td><code>sorting_variable_lag</code></td>
      <td>string</td>
      <td>Lagging convention: <code>3m</code>, <code>6m</code>, or <code>ff</code> (Fama-French)</td>
    </tr>
    <tr>
      <td><code>rebalancing</code></td>
      <td>string</td>
      <td>Rebalancing frequency: <code>monthly</code> or <code>annual</code> (July)</td>
    </tr>
    <tr>
      <td><code>breakpoints_main</code></td>
      <td>double</td>
      <td>Number of quantile portfolios for the primary sort: <code>5</code> or <code>10</code></td>
    </tr>
    <tr>
      <td><code>sorting_method</code></td>
      <td>string</td>
      <td>Sorting method: <code>univariate</code>, <code>bivariate-dependent</code>, or <code>bivariate-independent</code></td>
    </tr>
    <tr>
      <td><code>breakpoints_secondary</code></td>
      <td>double</td>
      <td>Number of quantile portfolios for the secondary sort (size): <code>2</code>, <code>5</code>, or <code>NA</code> for univariate sorts</td>
    </tr>
    <tr>
      <td><code>breakpoints_exchanges</code></td>
      <td>string</td>
      <td>Exchanges used for breakpoint computation: <code>NYSE</code> or <code>AMEX|NASDAQ|NYSE</code></td>
    </tr>
    <tr>
      <td><code>weighting_scheme</code></td>
      <td>string</td>
      <td>Portfolio weighting: <code>EW</code> (equal-weighted) or <code>VW</code> (value-weighted)</td>
    </tr>
  </tbody>
</table>

## Dataset Creation

### Curation Rationale

Factor construction involves many subjective methodological choices. Rather than committing to a single specification, we enumerate all valid combinations to enable systematic robustness analysis and transparent reporting.

### Source Data

#### Data Collection and Processing

The grid is generated programmatically from the full factorial combination of preprocessing choices, with invalid configurations removed (e.g., univariate sorts have no secondary breakpoints; market equity is excluded from bivariate sorts where size is the secondary variable; earnings-to-market excludes configurations that allow negative earnings). See `code/01_define_portfolio_sorts_grid.R` in the companion repository for the exact generation logic.

#### Who are the source data producers?

The grid is a methodological artifact created by the dataset authors. No external data sources are involved.

### Personal and Sensitive Information

The dataset contains no personal or sensitive information. All columns describe portfolio sorting configurations.

## Bias, Risks, and Limitations

- The grid reflects the authors' choice of specification dimensions and does not cover all possible methodological variations (e.g., alternative industry classifications, different minimum listing requirements, or alternative risk-free rate definitions).
- Some specifications may produce portfolios with very few stocks in certain months, particularly for smaller sorting variables or restrictive exclusion criteria.

### Recommendations

Always join with the Portfolio Returns dataset via the `id` column. When reporting results, cite the specific `id` or the full set of column values to ensure reproducibility.

## Citation

**BibTeX:**

```bibtex
@article{frey2026transparent,
  title={A Transparent Financial Risk Factor Library},
  author={Frey, Christoph and Scheuch, Christoph and Voigt, Stefan and Weiss, Patrick},
  year={2026},
  journal={Working Paper}
}
```

## Dataset Card Authors

Christoph Frey, Christoph Scheuch, Stefan Voigt, Patrick Weiss

## Dataset Card Contact

Stefan Voigt (stefan.voigt@econ.ku.dk), Patrick Weiss (patrickw@ru.is)
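
The "full factorial combination with invalid configurations removed" described under Data Collection and Processing can be illustrated with a small toy sketch. This is not the authors' generation code (that lives in the R script named in the card). It covers only five of the grid's dimensions, uses the allowed values listed in the column table, and assigns hypothetical sequential IDs that do not correspond to the real `id` values. The rule that bivariate sorts always carry a secondary breakpoint is an assumption inferred from the `NA`-for-univariate note; the helper names `is_valid`, `build_toy_grid`, and `select_ids` are made up for illustration.

```python
from itertools import product

# Allowed values copied from the column table (a subset of dimensions only).
SORTING_METHODS = ["univariate", "bivariate-dependent", "bivariate-independent"]
BREAKPOINTS_MAIN = [5, 10]
BREAKPOINTS_SECONDARY = [None, 2, 5]  # None stands in for NA
WEIGHTING_SCHEMES = ["EW", "VW"]
REBALANCING = ["monthly", "annual"]


def is_valid(sorting_method, breakpoints_secondary):
    """Univariate sorts have no secondary breakpoints.

    Assumption: bivariate sorts always require one.
    """
    if sorting_method == "univariate":
        return breakpoints_secondary is None
    return breakpoints_secondary is not None


def build_toy_grid():
    """Enumerate the full factorial product and drop invalid combinations."""
    rows = []
    combos = product(
        SORTING_METHODS,
        BREAKPOINTS_MAIN,
        BREAKPOINTS_SECONDARY,
        WEIGHTING_SCHEMES,
        REBALANCING,
    )
    for method, main, secondary, weighting, rebalancing in combos:
        if not is_valid(method, secondary):
            continue
        rows.append(
            {
                "id": len(rows) + 1,  # hypothetical ID, not the dataset's
                "sorting_method": method,
                "breakpoints_main": main,
                "breakpoints_secondary": secondary,
                "weighting_scheme": weighting,
                "rebalancing": rebalancing,
            }
        )
    return rows


def select_ids(grid, **choices):
    """Return the IDs of all rows matching the given column values."""
    return [
        row["id"]
        for row in grid
        if all(row[column] == value for column, value in choices.items())
    ]


if __name__ == "__main__":
    grid = build_toy_grid()
    vw_univariate = select_ids(
        grid, sorting_method="univariate", weighting_scheme="VW"
    )
    print(len(grid), vw_univariate)
```

The IDs returned by a selection like this are what one would use to look up the matching return series in the Portfolio Returns dataset. With the real data, the same filter-then-join pattern applies to the `id` column of the Parquet file.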
Provider:
tidy-finance