bguzzo2k/ohlc_1d_mixture
Source: Hugging Face | Updated: 2026-03-21 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/bguzzo2k/ohlc_1d_mixture
Resource description:
# Macroeconomic & S&P 500 Yahoo Finance Dataset
This repository contains a comprehensive historical dataset for 936 financial instruments, including S&P 500 components, broad market indices, commodities, currencies, and macroeconomic indicators. The data is programmatically extracted from the Yahoo Finance API, cleaned, and normalized for use in quantitative modeling and machine learning.
**Dataset Hub:** [bguzzo2k/ohlc_1d_mixture](https://huggingface.co/datasets/bguzzo2k/ohlc_1d_mixture)
## Repository Structure
- **`1d/`**: Raw daily OHLCV data in Parquet format (`{TICKER}_max_1d.parquet`).
- **`log_ret_norm/`**: Normalized log-return data optimized for stationarity (`{TICKER}_max_1d_norm.parquet`).
- **`bulk/`**: Multi-ticker raw download aggregates before individual splitting.
- **`utils/`**: Python scripts for data acquisition, normalization, and configuration.
- **`tickers_list.toml`**: Central configuration file defining all 936 tracked assets across 9 distinct asset classes.
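
The file-naming patterns above can be captured in two small helpers. This is a minimal sketch, not part of the repository's `utils/` scripts: the function names are hypothetical, and `AAPL` is used only as an illustrative ticker.

```python
from pathlib import PurePosixPath


def raw_path(ticker: str) -> PurePosixPath:
    """Relative path of a ticker's raw daily OHLCV file inside the repository."""
    return PurePosixPath("1d") / f"{ticker}_max_1d.parquet"


def norm_path(ticker: str) -> PurePosixPath:
    """Relative path of a ticker's normalized log-return file."""
    return PurePosixPath("log_ret_norm") / f"{ticker}_max_1d_norm.parquet"


print(raw_path("AAPL"))   # 1d/AAPL_max_1d.parquet
print(norm_path("AAPL"))  # log_ret_norm/AAPL_max_1d_norm.parquet
```

With a local copy of the repository, a file can then be loaded with `pandas.read_parquet(repo_root / raw_path("AAPL"))`, where `repo_root` is the path of that copy.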
---
## Dataset Construction Pipeline
The dataset is built by a modular pipeline of four steps, so it can be regenerated from the configuration file and the scripts in `utils/`:
### 1. Asset Configuration (`tickers_list.toml`)
The scope of the dataset is defined in `tickers_list.toml`. It organizes tickers into logical groups, including:
- **Broad Market Indices** (S&P 500, NASDAQ, Global Benchmarks)
- **Fixed Income & Yields** (US Treasuries, Corporate/International Bonds)
- **Commodities & Futures** (Energy, Metals, Agriculture)
- **FX Matrices** (Major/Emerging Currency Pairs)
- **Sectors & Thematics** (GICS Sectors, AI, Robotics)
- **Digital Assets** (Cryptocurrencies)
- **S&P 500 Constituents** (Individual equities)
### 2. Data Acquisition (`utils/yfinace_dowloader.py`)
Uses the `yfinance` library to fetch the maximum available history (`period="max"`) at a 1-day interval.
- **Filtering:** Discards non-trading days, identified by a NaN `Open` value.
- **Standardization:** Retains the core OHLCV columns and formats the date index as `YYYY-MM-DD` strings.
- **Persistence:** Individual ticker files are stored in `1d/` with Snappy compression.
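
The three steps above could look roughly like the following. This is a sketch under assumptions, not the contents of the downloader script: the function names are hypothetical, and the use of `yf.Ticker(...).history(...)` is one plausible way to request the data, which the actual script may do differently (for example through bulk downloads).

```python
import numpy as np
import pandas as pd

OHLCV = ["Open", "High", "Low", "Close", "Volume"]


def clean_daily(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the filtering and standardization steps to one ticker's frame."""
    out = df.dropna(subset=["Open"])   # drop non-trading days (NaN Open)
    out = out[OHLCV].copy()            # keep only the core OHLCV columns
    out.index = pd.to_datetime(out.index).strftime("%Y-%m-%d")
    out.index.name = "Date"
    return out


def download_and_store(ticker: str, out_dir: str = "1d") -> None:
    """Hypothetical end-to-end step; needs network access and a Parquet engine."""
    import yfinance as yf  # imported lazily so clean_daily works without it

    raw = yf.Ticker(ticker).history(period="max", interval="1d")
    clean_daily(raw).to_parquet(
        f"{out_dir}/{ticker}_max_1d.parquet", compression="snappy"
    )


# Synthetic frame standing in for a yfinance response.
demo = pd.DataFrame(
    {
        "Open": [10.0, np.nan, 10.5],
        "High": [10.4, np.nan, 10.9],
        "Low": [9.8, np.nan, 10.2],
        "Close": [10.2, np.nan, 10.7],
        "Volume": [100, 0, 120],
        "Dividends": [0.0, 0.0, 0.0],
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
cleaned = clean_daily(demo)
print(cleaned)
```

Keeping the cleaning logic separate from the network call makes it possible to check the filtering on synthetic data, as done with `demo` here.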
### 3. Log-Return Normalization (`utils/dataset_normalizartion.py`)
Transforms raw price levels into stationary log-returns using the previous day's close ($Close_{t-1}$) as the reference:
- **Calculation:** $Metric_{norm} = \ln(Metric_t / Close_{t-1})$
- **Numerical Stability:** Implements clipping (threshold: $1 \times 10^{-9}$) to prevent numerical instability.
- **Parallelization:** Employs a `ThreadPoolExecutor` for high-throughput processing across the 936-file dataset.
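
The calculation, clipping, and thread-pool steps can be sketched in plain Python as follows. Several details are assumptions made for illustration: clipping is applied here to both the numerator and the reference close before taking the logarithm (the script may clip elsewhere), the tickers `AAA` and `BBB` and their prices are made up, and the pool size is arbitrary.

```python
import math
from concurrent.futures import ThreadPoolExecutor

EPS = 1e-9  # clipping threshold quoted above
COLS = ("Open", "High", "Low", "Close")


def normalize(rows):
    """rows: list of dicts with Open/High/Low/Close, oldest first.

    Returns one dict per day starting from the second day, because the
    first day has no previous close to use as a reference.
    """
    out = []
    for prev, cur in zip(rows, rows[1:]):
        ref = max(prev["Close"], EPS)  # assumed: clip the reference close
        out.append(
            {f"{c}_Norm": math.log(max(cur[c], EPS) / ref) for c in COLS}
        )
    return out


# Made-up price series for two hypothetical tickers.
series = {
    "AAA": [
        {"Open": 100.0, "High": 101.0, "Low": 99.0, "Close": 100.0},
        {"Open": 101.0, "High": 103.0, "Low": 100.0, "Close": 102.0},
    ],
    "BBB": [
        {"Open": 50.0, "High": 51.0, "Low": 49.0, "Close": 50.0},
        {"Open": 49.5, "High": 50.0, "Low": 48.0, "Close": 49.0},
    ],
}

# One task per ticker; pool.map preserves the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(series, pool.map(normalize, series.values())))

print(round(results["AAA"][0]["Close_Norm"], 6))  # 0.019803
```

Threads help mainly when each task is dominated by reading and writing Parquet files; for the pure-Python arithmetic in this toy example they add no speedup.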
### 4. S&P 500 Maintenance (`utils/sp500_toml_generator.py`)
A utility script that scrapes the current S&P 500 composition from Wikipedia to update the `tickers_list.toml` configuration dynamically.
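
One detail such a generator has to handle is that Wikipedia writes share classes with a dot (`BRK.B`) while Yahoo Finance uses a hyphen (`BRK-B`). The sketch below shows only that conversion and the rendering of a TOML table; the scraping itself is omitted, and the function names, the `sp500_constituents` table name, and the `tickers` key are hypothetical.

```python
def to_yahoo_symbol(symbol: str) -> str:
    """Convert a Wikipedia-style symbol (BRK.B) to Yahoo's form (BRK-B)."""
    return symbol.strip().upper().replace(".", "-")


def render_toml_group(group: str, symbols: list[str]) -> str:
    """Render a de-duplicated, sorted ticker list as one TOML table."""
    tickers = sorted({to_yahoo_symbol(s) for s in symbols})
    body = ", ".join(f'"{t}"' for t in tickers)
    return f"[{group}]\ntickers = [{body}]\n"


print(render_toml_group("sp500_constituents", ["MSFT", "BRK.B", "AAPL"]))
# [sp500_constituents]
# tickers = ["AAPL", "BRK-B", "MSFT"]
```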
---
## Data Schema
### Raw Data (`1d/`)
| Column | Type | Description |
| :--- | :--- | :--- |
| **Index (Date)** | String | ISO 8601 Date (`YYYY-MM-DD`) |
| **Open/High/Low/Close** | Float64 | Price metrics (USD or Local Currency) |
| **Volume** | Int64 | Trading volume |
### Normalized Data (`log_ret_norm/`)
| Column | Type | Description |
| :--- | :--- | :--- |
| **Index (Date)** | String | ISO 8601 Date (`YYYY-MM-DD`) |
| **Open_Norm** | Float64 | $\ln(Open_t / Close_{t-1})$ |
| **High_Norm** | Float64 | $\ln(High_t / Close_{t-1})$ |
| **Low_Norm** | Float64 | $\ln(Low_t / Close_{t-1})$ |
| **Close_Norm** | Float64 | $\ln(Close_t / Close_{t-1})$ |
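
Because every normalized column is a log-ratio against the previous close, price levels can be recovered by inverting the formula: $Close_t = Close_{t-1} \cdot e^{Close\_Norm_t}$. The sketch below illustrates this round trip with made-up prices; the function name is hypothetical, and a starting close must be supplied from the raw data because the normalized files contain only ratios.

```python
import math


def rebuild_close(first_close: float, close_norm: list[float]) -> list[float]:
    """Invert Close_Norm: Close_t = Close_{t-1} * exp(Close_Norm_t)."""
    closes = [first_close]
    for r in close_norm:
        closes.append(closes[-1] * math.exp(r))
    return closes


# Made-up closing prices, normalized and then reconstructed.
closes = [100.0, 102.0, 99.96]
close_norm = [math.log(b / a) for a, b in zip(closes, closes[1:])]
rebuilt = rebuild_close(closes[0], close_norm)

print([round(x, 6) for x in rebuilt])
```

The same inversion applies to the other columns, for example $Open_t = Close_{t-1} \cdot e^{Open\_Norm_t}$, once the close series has been rebuilt.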
Provider:
bguzzo2k



