elkassabgi/hfdatalibrary

Name: elkassabgi/hfdatalibrary
Creator: elkassabgi
Published: 2026-04-11 19:58:59
License: (no description available)

Hugging Face · Updated 2026-04-11 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/elkassabgi/hfdatalibrary

Description:

---
license: cc-by-4.0
language:
- en
pretty_name: HF Data Library
tags:
- finance
- high-frequency
- intraday
- OHLCV
- market-microstructure
- financial-econometrics
- equity
- ETF
- realized-volatility
size_categories:
- 1B<n<10B
task_categories:
- time-series-forecasting
- tabular-regression
---

# HF Data Library: High-Frequency U.S. Equity Data

[![Website](https://img.shields.io/badge/Website-hfdatalibrary.com-2563eb)](https://hfdatalibrary.com) [![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.19501605-blue)](https://doi.org/10.5281/zenodo.19501605) [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

Free, research-grade collection of OHLCV (Open-High-Low-Close-Volume) data for **1,391 U.S. equities and ETFs**, covering December 2002 through the present (45 tickers extend back to January 1991). Data is available in multiple timeframes, from 1-minute up to monthly, and is updated weekly by an automated pipeline.

**Maintainer:** Ahmed Elkassabgi, University of Central Arkansas
**ORCID:** [0000-0002-5926-7493](https://orcid.org/0000-0002-5926-7493)
**Permanent DOI:** [10.5281/zenodo.19501605](https://doi.org/10.5281/zenodo.19501605)

## Where to download

**This Hugging Face repository contains documentation only.** The actual data is hosted at:

➡️ **https://hfdatalibrary.com**

Free registration is required (email, ORCID, or Google). Data is available as direct downloads (Parquet or CSV) or via the REST API at `https://api.hfdatalibrary.com`.

## What's in the dataset

- **1,391 tickers** of U.S. equities and ETFs
- **1.53 billion** 1-minute bars (clean version)
- **December 2002 – present** (with 45 tickers extending back to January 1991)
- **Weekly automated updates**

### Cleaning versions

Two cleaning versions are provided:

- **Raw:** as received from the source, no modifications
- **Clean:** nine-step cleaning pipeline applied, including outside-hours removal, OHLC violation handling, duplicate removal, the Brownlees-Gallo outlier filter, and splice-boundary adjustment

A gap-filled version is intentionally **not** distributed; see the accompanying paper for the documented biases that LOCF (last observation carried forward) gap-filling introduces. Researchers who need a regular grid can apply LOCF to the Clean version themselves.

### Available timeframes

Both cleaning versions are aggregated into multiple timeframes:

| Timeframe | Description |
|---|---|
| 1-minute | Base data (highest resolution) |
| 5-minute | Aggregated from 1-minute |
| 15-minute | Aggregated from 1-minute |
| 30-minute | Aggregated from 1-minute |
| Hourly | Aggregated from 1-minute |
| Daily | Open-to-close per trading day |
| Weekly | Aggregated to trading weeks |
| Monthly | Aggregated to calendar months |

### Pre-computed academic variables

25 variables are computed daily for each ticker in each cleaning version:

- **Volatility (5):** Realized variance (1-min and 5-min sampling), bipower variation (BNS 2004), Parkinson (1980), Yang-Zhang (2000)
- **Spreads (2):** Roll (1984) implied spread, Corwin-Schultz (2012) high-low spread
- **Autocorrelation (3):** First-order return autocorrelation, variance ratio (5-min), variance ratio (10-min)
- **Jump detection (3):** BNS z-statistic, BNS jump indicators at the 1% and 5% levels
- **Liquidity (4):** Amihud (2002) illiquidity ratio, daily dollar volume, share volume, observed trade count
- **Data quality (4):** Gap rate, observed bars per day, longest gap, max bars since last trade
- **Returns (4):** Open-to-close return, overnight return, daily high-low range, intraday return standard deviation

## Data sources

- **Pre-March 2022:** PiTrading, derived from the consolidated tape (CTA/UTP)
- **Post-March 2022:** IEX Exchange HIST

## Quick start (Python)

```python
import requests
import pandas as pd
from io import BytesIO

# Register at https://hfdatalibrary.com to get an API key
API_KEY = "your-key-here"

# Get a download token (links expire after 10 minutes)
r = requests.get(
    "https://api.hfdatalibrary.com/v1/download-token/AAPL",
    params={"version": "clean", "format": "parquet", "timeframe": "1min"},
    headers={"X-API-Key": API_KEY},
    timeout=30,
)
r.raise_for_status()  # fail early on HTTP errors
url = r.json()["url"]

# Download the file and load it into a DataFrame
resp = requests.get(url, timeout=300)
resp.raise_for_status()
df = pd.read_parquet(BytesIO(resp.content))
print(df.head())
```

## File schema

Each ticker is a single Parquet (or CSV) file. For 1-minute data:

| Column | Type | Description |
|---|---|---|
| datetime | datetime64 | Bar timestamp (Eastern Time) |
| Open | float64 | Opening price (split/dividend adjusted) |
| High | float64 | Highest price during the bar |
| Low | float64 | Lowest price during the bar |
| Close | float64 | Closing price |
| Volume | int64 | Shares traded |
| source | string | "pitrading" (pre-March 2022) or "iex" (post-March 2022) |

Higher timeframes (5-min, 15-min, daily, etc.) follow the same schema, with the `datetime` column resampled to the chosen interval.

## License

This dataset is licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/). You are free to share and adapt the material for any purpose, including commercially, provided you give appropriate credit.

## How to cite

```bibtex
@dataset{elkassabgi2026hfdatalibrary,
  author    = {Elkassabgi, Ahmed},
  title     = {{HF Data Library: High-Frequency U.S. Equity Data (1-Minute OHLCV)}},
  year      = {2026},
  version   = {1.0},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19501605},
  url       = {https://hfdatalibrary.com}
}
```

## Links

- **Website:** https://hfdatalibrary.com
- **API:** https://api.hfdatalibrary.com
- **Documentation:** https://hfdatalibrary.com/pages/docs.html
- **Data dictionary:** https://hfdatalibrary.com/pages/dictionary.html
- **Code samples:** https://hfdatalibrary.com/pages/code.html
- **GitHub:** https://github.com/elkassabgi/hfdatalibrary
- **Zenodo:** https://zenodo.org/records/19501605
- **Contact:** admin@hfdatalibrary.com
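
## Example: aggregating 1-minute bars locally

The timeframe table states that the higher timeframes are aggregated from the 1-minute bars. For readers who want a custom interval, the following is a minimal sketch of that kind of aggregation with pandas. It assumes only the column names from the file schema; the function name `resample_ohlcv` and the bin convention (left-labelled, left-closed) are illustrative assumptions, not the library's documented procedure, so results may differ from the published aggregates at bin boundaries.

```python
import pandas as pd


def resample_ohlcv(bars: pd.DataFrame, rule: str = "5min") -> pd.DataFrame:
    """Aggregate 1-minute OHLCV bars to a coarser interval.

    Assumes the documented columns: datetime, Open, High, Low, Close, Volume.
    The left-labelled, left-closed bins are an assumption for illustration.
    """
    out = (
        bars.set_index("datetime")
        .sort_index()
        .resample(rule, label="left", closed="left")
        .agg({"Open": "first", "High": "max", "Low": "min",
              "Close": "last", "Volume": "sum"})
    )
    # Drop intervals with no observed bars rather than gap-filling them.
    return out.dropna(subset=["Open"]).reset_index()


# Synthetic bars with made-up prices (not real market data).
demo = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-01-02 09:30", "2024-01-02 09:31", "2024-01-02 09:36"]
    ),
    "Open": [10.0, 10.2, 10.1],
    "High": [10.3, 10.4, 10.2],
    "Low": [9.9, 10.1, 10.0],
    "Close": [10.2, 10.3, 10.15],
    "Volume": [100, 200, 50],
})
five_min = resample_ohlcv(demo)
print(five_min)
```

Empty intervals are dropped here on purpose, in keeping with the README's choice not to distribute a gap-filled version.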

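
## Example: daily realized variance and Parkinson variance

The pre-computed variables include realized variance and the Parkinson (1980) estimator. As a rough cross-check, the sketch below computes textbook versions of both from intraday bars: realized variance as the sum of squared intraday log returns of `Close`, and Parkinson variance as `ln(High/Low)^2 / (4 ln 2)` using the day's high and low. The function names are illustrative, and the library's exact definitions (sampling scheme, treatment of the overnight return, scaling) are not stated in this README, so treat any mismatch with the published values as expected until checked against the data dictionary.

```python
import numpy as np
import pandas as pd


def realized_variance(close: pd.Series) -> float:
    """Sum of squared log returns within one day (no overnight return)."""
    log_returns = np.log(close).diff().dropna()
    return float((log_returns ** 2).sum())


def daily_measures(bars: pd.DataFrame) -> pd.DataFrame:
    """Per-day realized variance and Parkinson variance from intraday bars.

    Assumes the documented columns: datetime, High, Low, Close.
    """
    bars = bars.sort_values("datetime")
    day = bars["datetime"].dt.date
    rv = bars.groupby(day)["Close"].apply(realized_variance)
    high = bars.groupby(day)["High"].max()
    low = bars.groupby(day)["Low"].min()
    parkinson = np.log(high / low) ** 2 / (4.0 * np.log(2.0))
    return pd.DataFrame({"rv": rv, "parkinson": parkinson})


# Synthetic bars with made-up prices (not real market data).
demo_bars = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-01-02 09:30", "2024-01-02 09:31", "2024-01-02 09:32"]
    ),
    "High": [101.0, 102.0, 101.0],
    "Low": [99.0, 100.0, 99.5],
    "Close": [100.0, 101.0, 100.0],
})
measures = daily_measures(demo_bars)
print(measures)
```

For 5-minute sampling, the same function can be applied to the 5-minute file instead of the 1-minute one.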
