tidy-finance/factor-library-grid
Source: Hugging Face | Updated: 2026-03-20 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/tidy-finance/factor-library-grid
Resource description:
---
pretty_name: Tidy Finance Factor Library Specification Grid
license: mit
language:
- en
tags:
- finance
- asset-pricing
- factor-models
- portfolio-sorts
- empirical-finance
size_categories:
- 100K<n<1M
---
# Tidy Finance Factor Library: Specification Grid
Lookup table mapping specification IDs to portfolio sorting configurations. Use this dataset together with the [Portfolio Returns](https://huggingface.co/datasets/tidy-finance/portfolio-returns) dataset to identify the methodological choices behind each factor return series.
## Dataset Details
### Dataset Description
The dataset contains approximately 180,000 unique specification paths for constructing long-short portfolio returns. Each row defines a complete set of preprocessing and sorting choices (sample exclusions, lagging convention, breakpoint definition, weighting scheme, rebalancing frequency). The `id` column links to the corresponding return series in the Portfolio Returns dataset.
- **Curated by:** Christoph Frey (Lancaster University), Christoph Scheuch (Tidy Intelligence), Stefan Voigt (University of Copenhagen), Patrick Weiss (Reykjavík University)
- **Funded by:** Danish Finance Institute
- **License:** MIT
### Dataset Sources
- **Repository:** [https://github.com/tidy-finance/jss-multilingual-factor-library](https://github.com/tidy-finance/jss-multilingual-factor-library)
- **R package:** [https://github.com/tidy-finance/r-tidyfinance](https://github.com/tidy-finance/r-tidyfinance)
- **Python package:** [https://github.com/tidy-finance/py-tidyfinance](https://github.com/tidy-finance/py-tidyfinance)
- **Demo:** [https://app-download-center.cloud.sdu.dk/](https://app-download-center.cloud.sdu.dk/)
## Uses
### Direct Use
- Joining with the Portfolio Returns dataset to filter or group factor returns by specific methodological choices.
- Robustness and sensitivity analysis: selecting subsets of specifications to study how preprocessing decisions affect factor premia.
- Replication: documenting the exact configuration behind a reported result.
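
A minimal sketch of the join-and-filter workflow listed above, using toy data and only the Python standard library. The `id` key and the specification columns come from this card; the `date` and `ret` column names and every value below are assumptions made for illustration, not the actual layout of the Portfolio Returns dataset. With pandas, the same step would be a `merge` on `id`.

```python
from collections import defaultdict
from statistics import mean

# Toy stand-in for the specification grid (only a few of the 13 columns).
grid = [
    {"id": 1, "sorting_variable": "sv_bm", "weighting_scheme": "VW"},
    {"id": 2, "sorting_variable": "sv_bm", "weighting_scheme": "EW"},
    {"id": 3, "sorting_variable": "sv_ag", "weighting_scheme": "VW"},
]

# Toy stand-in for the Portfolio Returns dataset.
# ASSUMPTION: the `date` and `ret` column names and all values are illustrative.
returns = [
    {"id": 1, "date": "2020-01", "ret": 0.010},
    {"id": 1, "date": "2020-02", "ret": 0.030},
    {"id": 2, "date": "2020-01", "ret": 0.020},
    {"id": 2, "date": "2020-02", "ret": 0.040},
    {"id": 3, "date": "2020-01", "ret": -0.010},
]


def join_on_id(grid_rows, return_rows):
    """Inner-join return rows with their specification via the `id` key."""
    spec_by_id = {row["id"]: row for row in grid_rows}
    return [
        {**spec_by_id[r["id"]], **r}
        for r in return_rows
        if r["id"] in spec_by_id
    ]


def mean_return_by(joined_rows, column):
    """Average `ret` within each value of one specification column."""
    groups = defaultdict(list)
    for row in joined_rows:
        groups[row[column]].append(row["ret"])
    return {key: mean(values) for key, values in groups.items()}


joined = join_on_id(grid, returns)

# Keep only book-to-market specifications, then compare weighting schemes.
bm_only = [row for row in joined if row["sorting_variable"] == "sv_bm"]
premia_by_weighting = mean_return_by(bm_only, "weighting_scheme")
```

Grouping by a single specification column while holding the sorting variable fixed is one simple way to see how a single methodological choice shifts the average return.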
### Out-of-Scope Use
- Standalone analysis. The grid contains no return data and must be joined with the Portfolio Returns dataset via the `id` column.
## Dataset Structure
The dataset consists of a single Parquet file with 13 columns and approximately 180,000 rows.
<table>
<thead>
<tr>
<th>Column</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>id</code></td>
<td>int32</td>
<td>Unique specification identifier, foreign key to the Portfolio Returns dataset</td>
</tr>
<tr>
<td><code>sorting_variable</code></td>
<td>string</td>
<td>Sorting characteristic (e.g., <code>sv_ag</code> for asset growth, <code>sv_bm</code> for book-to-market)</td>
</tr>
<tr>
<td><code>exclude_size</code></td>
<td>double</td>
<td>Size exclusion threshold: <code>0</code> (no exclusion) or <code>0.2</code> (stocks below the 20th percentile of NYSE market capitalization are excluded)</td>
</tr>
<tr>
<td><code>exclude_financials</code></td>
<td>bool</td>
<td>Whether financial firms (SIC 6000-6799) are excluded</td>
</tr>
<tr>
<td><code>exclude_utilities</code></td>
<td>bool</td>
<td>Whether utility firms (SIC 4900-4999) are excluded</td>
</tr>
<tr>
<td><code>exclude_negative_earnings</code></td>
<td>bool</td>
<td>Whether firms with negative earnings are excluded</td>
</tr>
<tr>
<td><code>sorting_variable_lag</code></td>
<td>string</td>
<td>Lagging convention: <code>3m</code>, <code>6m</code>, or <code>ff</code> (Fama-French)</td>
</tr>
<tr>
<td><code>rebalancing</code></td>
<td>string</td>
<td>Rebalancing frequency: <code>monthly</code> or <code>annual</code> (July)</td>
</tr>
<tr>
<td><code>breakpoints_main</code></td>
<td>double</td>
<td>Number of quantile portfolios for the primary sort: <code>5</code> or <code>10</code></td>
</tr>
<tr>
<td><code>sorting_method</code></td>
<td>string</td>
<td>Sorting method: <code>univariate</code>, <code>bivariate-dependent</code>, or <code>bivariate-independent</code></td>
</tr>
<tr>
<td><code>breakpoints_secondary</code></td>
<td>double</td>
<td>Number of quantile portfolios for the secondary sort (size): <code>2</code>, <code>5</code>, or <code>NA</code> for univariate sorts</td>
</tr>
<tr>
<td><code>breakpoints_exchanges</code></td>
<td>string</td>
<td>Exchanges used for breakpoint computation: <code>NYSE</code> or <code>AMEX|NASDAQ|NYSE</code></td>
</tr>
<tr>
<td><code>weighting_scheme</code></td>
<td>string</td>
<td>Portfolio weighting: <code>EW</code> (equal-weighted) or <code>VW</code> (value-weighted)</td>
</tr>
</tbody>
</table>
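
The documented values in the table can be turned into a quick sanity check for a single specification row. The sketch below is not part of the companion packages: `ALLOWED` and `validate_spec` are hypothetical names, `NA` is assumed to have been converted to Python's `None` after loading, and the rule that bivariate sorts always carry a secondary breakpoint is an inference from the table rather than something the card states outright.

```python
# Allowed values per column, transcribed from the table above.
# `id` and `sorting_variable` are omitted because the table lists only examples.
ALLOWED = {
    "exclude_size": {0, 0.2},
    "exclude_financials": {True, False},
    "exclude_utilities": {True, False},
    "exclude_negative_earnings": {True, False},
    "sorting_variable_lag": {"3m", "6m", "ff"},
    "rebalancing": {"monthly", "annual"},
    "breakpoints_main": {5, 10},
    "sorting_method": {"univariate", "bivariate-dependent", "bivariate-independent"},
    # ASSUMPTION: NA is represented as None once the row is in Python.
    "breakpoints_secondary": {2, 5, None},
    "breakpoints_exchanges": {"NYSE", "AMEX|NASDAQ|NYSE"},
    "weighting_scheme": {"EW", "VW"},
}


def validate_spec(spec):
    """Return a list of human-readable problems; an empty list means the row looks fine."""
    problems = []
    for column, allowed in ALLOWED.items():
        if column not in spec:
            problems.append(f"missing column: {column}")
        elif spec[column] not in allowed:
            problems.append(f"unexpected value for {column}: {spec[column]!r}")

    # Univariate sorts have no secondary breakpoints (stated in the table).
    # ASSUMPTION: conversely, bivariate sorts always have one.
    is_univariate = spec.get("sorting_method") == "univariate"
    has_secondary = spec.get("breakpoints_secondary") is not None
    if is_univariate == has_secondary:
        problems.append("sorting_method and breakpoints_secondary are inconsistent")
    return problems


example = {
    "exclude_size": 0.2,
    "exclude_financials": True,
    "exclude_utilities": False,
    "exclude_negative_earnings": False,
    "sorting_variable_lag": "ff",
    "rebalancing": "annual",
    "breakpoints_main": 5,
    "sorting_method": "bivariate-independent",
    "breakpoints_secondary": 2,
    "breakpoints_exchanges": "NYSE",
    "weighting_scheme": "VW",
}
problems = validate_spec(example)
```

If the file is read with pandas, missing values typically arrive as `NaN` rather than `None`, so they would need converting before a check like this.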
## Dataset Creation
### Curation Rationale
Factor construction involves many subjective methodological choices. Rather than committing to a single specification, we enumerate all valid combinations to enable systematic robustness analysis and transparent reporting.
### Source Data
#### Data Collection and Processing
The grid is generated programmatically as the full factorial combination of the preprocessing and sorting choices, after which invalid configurations are removed. For example, univariate sorts have no secondary breakpoints; market equity is not used as a sorting variable in bivariate sorts, where size is already the secondary variable; and earnings-to-market is only combined with configurations that exclude firms with negative earnings. See `code/01_define_portfolio_sorts_grid.R` in the companion repository for the exact generation logic.
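
The "full factorial, then filter" idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a port of the authors' R script: only four of the dimensions are included, `sv_me` and `sv_ep` are hypothetical labels for market equity and earnings-to-market, and the sequential `id` numbering is made up.

```python
from itertools import product

# ASSUMPTION: a reduced set of dimensions; the real grid has more.
# `sv_me` (market equity) and `sv_ep` (earnings-to-market) are hypothetical labels.
DIMENSIONS = {
    "sorting_variable": ["sv_bm", "sv_me", "sv_ep"],
    "sorting_method": ["univariate", "bivariate-dependent", "bivariate-independent"],
    "breakpoints_secondary": [None, 2, 5],
    "exclude_negative_earnings": [False, True],
}


def is_valid(spec):
    """Apply the three example exclusion rules described in the text."""
    univariate = spec["sorting_method"] == "univariate"
    has_secondary = spec["breakpoints_secondary"] is not None
    # Univariate sorts have no secondary breakpoints.
    # ASSUMPTION: bivariate sorts always need one.
    if univariate == has_secondary:
        return False
    # Market equity is not sorted against size in bivariate sorts.
    if spec["sorting_variable"] == "sv_me" and not univariate:
        return False
    # Earnings-to-market requires that negative earnings are excluded.
    if spec["sorting_variable"] == "sv_ep" and not spec["exclude_negative_earnings"]:
        return False
    return True


def build_grid(dimensions):
    """Enumerate every combination, drop invalid ones, and number the rest."""
    names = list(dimensions)
    candidates = (dict(zip(names, values)) for values in product(*dimensions.values()))
    valid = [spec for spec in candidates if is_valid(spec)]
    # ASSUMPTION: ids are assigned sequentially; the real scheme may differ.
    return [{"id": i, **spec} for i, spec in enumerate(valid, start=1)]


grid = build_grid(DIMENSIONS)
```

In this toy setup the 54 raw combinations shrink to 17 valid specifications, which shows how quickly the exclusion rules prune the factorial product.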
#### Who are the source data producers?
The grid is a methodological artifact created by the dataset authors. No external data sources are involved.
### Personal and Sensitive Information
The dataset contains no personal or sensitive information. All columns describe portfolio sorting configurations.
## Bias, Risks, and Limitations
- The grid reflects the authors' choice of specification dimensions and does not cover all possible methodological variations (e.g., alternative industry classifications, different minimum listing requirements, or alternative risk-free rate definitions).
- Some specifications may produce portfolios with very few stocks in certain months, particularly for sorting variables with limited coverage or under restrictive exclusion criteria.
### Recommendations
Always join with the Portfolio Returns dataset via the `id` column. When reporting results, cite the specific `id` or the full set of column values to ensure reproducibility.
## Citation
**BibTeX:**
```bibtex
@article{frey2026transparent,
title={A Transparent Financial Risk Factor Library},
author={Frey, Christoph and Scheuch, Christoph and Voigt, Stefan and Weiss, Patrick},
year={2026},
journal={Working Paper}
}
```
## Dataset Card Authors
Christoph Frey, Christoph Scheuch, Stefan Voigt, Patrick Weiss
## Dataset Card Contact
Stefan Voigt (stefan.voigt@econ.ku.dk), Patrick Weiss (patrickw@ru.is)
Provider:
tidy-finance



