elkassabgi/hfdatalibrary
Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/elkassabgi/hfdatalibrary
---
license: cc-by-4.0
language:
- en
pretty_name: HF Data Library
tags:
- finance
- high-frequency
- intraday
- OHLCV
- market-microstructure
- financial-econometrics
- equity
- ETF
- realized-volatility
size_categories:
- 1B<n<10B
task_categories:
- time-series-forecasting
- tabular-regression
---
# HF Data Library: High-Frequency U.S. Equity Data
A free, research-grade collection of OHLCV (open, high, low, close, volume) data for **1,391 U.S. equities and ETFs**, covering December 2002 through the present (45 tickers extend back to January 1991). Data is available at timeframes from 1-minute to monthly and is updated weekly via an automated pipeline.
**Maintainer:** Ahmed Elkassabgi, University of Central Arkansas
**ORCID:** [0000-0002-5926-7493](https://orcid.org/0000-0002-5926-7493)
**Permanent DOI:** [10.5281/zenodo.19501605](https://doi.org/10.5281/zenodo.19501605)
## Where to download
**This Hugging Face repository contains documentation only.** The actual data is hosted at:
➡️ **https://hfdatalibrary.com**
Free registration required (email, ORCID, or Google). Data is available as direct downloads (Parquet or CSV) or via REST API at `https://api.hfdatalibrary.com`.
## What's in the dataset
- **1,391 tickers** of U.S. equities and ETFs
- **1.53 billion** 1-minute bars (clean version)
- **December 2002 – present** (with 45 tickers extending to January 1991)
- **Weekly automated updates**
### Cleaning versions
Two cleaning versions are provided:
- **Raw:** as received from the source, no modifications
- **Clean:** nine-step cleaning pipeline applied, including outside-hours removal, OHLC violation checks, duplicate removal, the Brownlees-Gallo outlier filter, and splice-boundary adjustment
A gap-filled version is intentionally **not** distributed — see the accompanying paper for documented biases introduced by LOCF gap-filling. Researchers who need a regular grid can apply LOCF to the Clean version themselves.
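For researchers who do want a regular grid, the following is a minimal sketch of LOCF applied to one trading session of Clean 1-minute bars. It is an illustration, not the library's code: the function name `locf_regular_grid` and the `filled` flag column are hypothetical, and it assumes the input covers a single session (applied to multi-day data as written, it would also fill overnight minutes).

```python
import pandas as pd


def locf_regular_grid(bars: pd.DataFrame) -> pd.DataFrame:
    """Reindex one session of 1-minute bars onto a regular grid and fill gaps by LOCF."""
    bars = bars.set_index("datetime").sort_index()
    grid = pd.date_range(bars.index.min(), bars.index.max(), freq="1min")
    out = bars.reindex(grid)

    # Flag synthetic bars before filling so they can be excluded later.
    out["filled"] = out["Close"].isna()

    # Carry the last observed close forward; a filled bar is flat at that price.
    out["Close"] = out["Close"].ffill()
    for col in ("Open", "High", "Low"):
        out[col] = out[col].fillna(out["Close"])

    # No trades were observed in a filled bar.
    out["Volume"] = out["Volume"].fillna(0).astype("int64")
    return out.rename_axis("datetime")
```

Keeping the `filled` flag makes it possible to drop the synthetic bars again, which matters because filled bars contribute zero returns.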
### Available timeframes
Both cleaning versions are available in the following timeframes:
| Timeframe | Description |
|---|---|
| 1-minute | Base data (highest resolution) |
| 5-minute | Aggregated from 1-minute |
| 15-minute | Aggregated from 1-minute |
| 30-minute | Aggregated from 1-minute |
| Hourly | Aggregated from 1-minute |
| Daily | Open-to-close per trading day |
| Weekly | Aggregated to trading weeks |
| Monthly | Aggregated to calendar months |
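The aggregation from 1-minute bars to a coarser timeframe can be reproduced locally with pandas. The sketch below shows the standard OHLCV rule (first open, highest high, lowest low, last close, summed volume). The function name `aggregate_bars` is hypothetical, and the left-labelled, left-closed bar convention is an assumption; the convention used for the published files may differ.

```python
import pandas as pd

# Standard OHLCV aggregation rule for combining finer bars into coarser ones.
OHLCV_AGG = {
    "Open": "first",
    "High": "max",
    "Low": "min",
    "Close": "last",
    "Volume": "sum",
}


def aggregate_bars(bars: pd.DataFrame, rule: str = "5min") -> pd.DataFrame:
    """Aggregate 1-minute OHLCV bars into bars of width `rule`."""
    out = (
        bars.set_index("datetime")
        .sort_index()
        .resample(rule, label="left", closed="left")
        .agg(OHLCV_AGG)
    )
    # Intervals with no observed 1-minute bars (overnight, gaps) have no
    # opening price; drop them rather than emit empty bars.
    return out.dropna(subset=["Open"])
```

For example, `aggregate_bars(df, "15min")` produces 15-minute bars from a 1-minute DataFrame loaded as in the quick start.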
### Pre-computed academic variables
25 variables computed daily for each ticker in each cleaning version:
**Volatility (5):** Realized variance (1-min and 5-min sampling), bipower variation (BNS 2004), Parkinson (1980), Yang-Zhang (2000)
**Spreads (2):** Roll (1984) implied spread, Corwin-Schultz (2012) high-low spread
**Autocorrelation (3):** First-order return autocorrelation, variance ratio (5-min), variance ratio (10-min)
**Jump detection (3):** BNS z-statistic, BNS jump indicators at 1% and 5% levels
**Liquidity (4):** Amihud (2002) illiquidity ratio, daily dollar volume, share volume, observed trade count
**Data quality (4):** Gap rate, observed bars per day, longest gap, max bars since last trade
**Returns (4):** Open-to-close return, overnight return, daily high-low range, intraday return standard deviation
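Three of the volatility measures listed above can be sketched from a single day's bars. The snippet is a minimal illustration using textbook definitions, not the library's implementation: the function names are hypothetical, and details such as overnight-return treatment and small-sample scaling are assumptions. The data dictionary is the authority on how the published variables are defined.

```python
import numpy as np
import pandas as pd


def realized_variance(close: pd.Series) -> float:
    """Sum of squared intraday log returns."""
    r = np.log(close).diff().dropna()
    return float((r ** 2).sum())


def bipower_variation(close: pd.Series) -> float:
    """(pi / 2) times the sum of products of adjacent absolute log returns."""
    abs_r = np.log(close).diff().dropna().abs()
    # The first product is NaN (no preceding return) and is skipped by sum().
    return float((np.pi / 2) * (abs_r * abs_r.shift(1)).sum())


def parkinson_variance(high: float, low: float) -> float:
    """Parkinson range-based variance estimate from one day's high and low."""
    return float(np.log(high / low) ** 2 / (4 * np.log(2)))
```

Realized variance picks up both diffusive variation and jumps, while bipower variation is robust to jumps, which is why the difference between the two underlies the BNS jump statistics.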
## Data sources
- **Pre-March 2022:** PiTrading, derived from the consolidated tape (CTA/UTP)
- **Post-March 2022:** IEX Exchange HIST
## Quick start (Python)
```python
from io import BytesIO

import pandas as pd
import requests

# Register at https://hfdatalibrary.com to get an API key
API_KEY = "your-key-here"

# Get a download token (links expire after 10 minutes)
r = requests.get(
    "https://api.hfdatalibrary.com/v1/download-token/AAPL",
    params={"version": "clean", "format": "parquet", "timeframe": "1min"},
    headers={"X-API-Key": API_KEY},
    timeout=30,
)
r.raise_for_status()  # raise on HTTP error responses
url = r.json()["url"]

# Download the file and load it into a DataFrame
resp = requests.get(url, timeout=300)
resp.raise_for_status()
df = pd.read_parquet(BytesIO(resp.content))
print(df.head())
```
## File schema
Each ticker is a single Parquet (or CSV) file. For 1-minute data:
| Column | Type | Description |
|---|---|---|
| datetime | datetime64 | Bar timestamp (Eastern Time) |
| Open | float64 | Opening price (split/dividend adjusted) |
| High | float64 | Highest price during the bar |
| Low | float64 | Lowest price during the bar |
| Close | float64 | Closing price |
| Volume | int64 | Shares traded |
| source | string | "pitrading" (pre-March 2022) or "iex" (post-March 2022) |
Higher timeframes (5-min, 15-min, daily, etc.) follow the same schema but with the `datetime` column resampled to the chosen interval.
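After loading a file, it can be worth checking it against the schema above. The following is a minimal sketch of such a check, assuming the column names in the table; the function name `check_bars` is hypothetical and this is not part of the library's cleaning pipeline.

```python
import pandas as pd

EXPECTED_COLUMNS = ["datetime", "Open", "High", "Low", "Close", "Volume", "source"]


def check_bars(bars: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that violate basic OHLCV consistency rules."""
    missing = [c for c in EXPECTED_COLUMNS if c not in bars.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    bad = (
        # High must be the largest of the four prices in the bar.
        (bars["High"] < bars[["Open", "Close", "Low"]].max(axis=1))
        # Low must be the smallest of the four prices in the bar.
        | (bars["Low"] > bars[["Open", "Close", "High"]].min(axis=1))
        # Volume cannot be negative.
        | (bars["Volume"] < 0)
        # Each timestamp should appear only once per ticker file.
        | bars["datetime"].duplicated(keep=False)
    )
    return bars[bad]
```

An empty result means the file passed these checks. On the Raw version, a non-empty result is expected wherever the source contains OHLC violations or duplicates.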
## License
This dataset is licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/).
You are free to share and adapt the material for any purpose, including commercially, provided you give appropriate credit.
## How to cite
```bibtex
@dataset{elkassabgi2026hfdatalibrary,
  author    = {Elkassabgi, Ahmed},
  title     = {{HF Data Library: High-Frequency U.S. Equity Data (1-Minute OHLCV)}},
  year      = {2026},
  version   = {1.0},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19501605},
  url       = {https://hfdatalibrary.com}
}
```
## Links
- **Website:** https://hfdatalibrary.com
- **API:** https://api.hfdatalibrary.com
- **Documentation:** https://hfdatalibrary.com/pages/docs.html
- **Data dictionary:** https://hfdatalibrary.com/pages/dictionary.html
- **Code samples:** https://hfdatalibrary.com/pages/code.html
- **GitHub:** https://github.com/elkassabgi/hfdatalibrary
- **Zenodo:** https://zenodo.org/records/19501605
- **Contact:** admin@hfdatalibrary.com



