elkassabgi/hfdatalibrary

Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/elkassabgi/hfdatalibrary
---
license: cc-by-4.0
language:
- en
pretty_name: HF Data Library
tags:
- finance
- high-frequency
- intraday
- OHLCV
- market-microstructure
- financial-econometrics
- equity
- ETF
- realized-volatility
size_categories:
- 1B<n<10B
task_categories:
- time-series-forecasting
- tabular-regression
---

# HF Data Library: High-Frequency U.S. Equity Data

[![Website](https://img.shields.io/badge/Website-hfdatalibrary.com-2563eb)](https://hfdatalibrary.com)
[![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.19501605-blue)](https://doi.org/10.5281/zenodo.19501605)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

Free, research-grade collection of OHLCV (Open-High-Low-Close-Volume) data for **1,391 U.S. equities and ETFs**, covering December 2002 through the present (45 tickers extend back to January 1991). Data is available in multiple timeframes from 1-minute up to monthly and is updated weekly by an automated pipeline.

**Maintainer:** Ahmed Elkassabgi, University of Central Arkansas
**ORCID:** [0000-0002-5926-7493](https://orcid.org/0000-0002-5926-7493)
**Permanent DOI:** [10.5281/zenodo.19501605](https://doi.org/10.5281/zenodo.19501605)

## Where to download

**This Hugging Face repository contains documentation only.** The actual data is hosted at:

➡️ **https://hfdatalibrary.com**

Free registration is required (email, ORCID, or Google). Data is available as direct downloads (Parquet or CSV) or via the REST API at `https://api.hfdatalibrary.com`.

## What's in the dataset

- **1,391 tickers** of U.S. equities and ETFs
- **1.53 billion** 1-minute bars (clean version)
- **December 2002 to present** (with 45 tickers extending back to January 1991)
- **Weekly automated updates**

### Cleaning versions

Two cleaning versions are provided:

- **Raw:** as received from the source, no modifications
- **Clean:** nine-step cleaning pipeline applied, including outside-hours removal, OHLC violation handling, duplicate removal, the Brownlees-Gallo outlier filter, and splice-boundary adjustment

A gap-filled version is intentionally **not** distributed; see the accompanying paper for the documented biases introduced by LOCF (last observation carried forward) gap-filling. Researchers who need a regular grid can apply LOCF to the Clean version themselves.

### Available timeframes

All cleaning versions are aggregated into multiple timeframes:

| Timeframe | Description |
|---|---|
| 1-minute | Base data (highest resolution) |
| 5-minute | Aggregated from 1-minute |
| 15-minute | Aggregated from 1-minute |
| 30-minute | Aggregated from 1-minute |
| Hourly | Aggregated from 1-minute |
| Daily | Open-to-close per trading day |
| Weekly | Aggregated to trading weeks |
| Monthly | Aggregated to calendar months |

### Pre-computed academic variables

25 variables are computed daily for each ticker in each cleaning version:

- **Volatility (5):** Realized variance (1-min and 5-min sampling), bipower variation (BNS 2004), Parkinson (1980), Yang-Zhang (2000)
- **Spreads (2):** Roll (1984) implied spread, Corwin-Schultz (2012) high-low spread
- **Autocorrelation (3):** First-order return autocorrelation, variance ratio (5-min), variance ratio (10-min)
- **Jump detection (3):** BNS z-statistic, BNS jump indicators at 1% and 5% levels
- **Liquidity (4):** Amihud (2002) illiquidity ratio, daily dollar volume, share volume, observed trade count
- **Data quality (4):** Gap rate, observed bars per day, longest gap, max bars since last trade
- **Returns (4):** Open-to-close return, overnight return, daily high-low range, intraday return standard deviation

## Data sources

- **Pre-March 2022:** PiTrading, derived from the consolidated tape (CTA/UTP)
- **Post-March 2022:** IEX Exchange HIST

## Quick start (Python)

```python
import requests
import pandas as pd
from io import BytesIO

# Register at https://hfdatalibrary.com to get an API key
API_KEY = "your-key-here"

# Get a download token (links expire after 10 minutes)
r = requests.get(
    "https://api.hfdatalibrary.com/v1/download-token/AAPL",
    params={"version": "clean", "format": "parquet", "timeframe": "1min"},
    headers={"X-API-Key": API_KEY},
)
r.raise_for_status()  # stop early on an invalid key or unknown ticker
url = r.json()["url"]

# Download the file
data = requests.get(url).content
df = pd.read_parquet(BytesIO(data))
print(df.head())
```

## File schema

Each ticker is a single Parquet (or CSV) file. For 1-minute data:

| Column | Type | Description |
|---|---|---|
| datetime | datetime64 | Bar timestamp (Eastern Time) |
| Open | float64 | Opening price (split/dividend adjusted) |
| High | float64 | Highest price during the bar |
| Low | float64 | Lowest price during the bar |
| Close | float64 | Closing price |
| Volume | int64 | Shares traded |
| source | string | "pitrading" (pre-2022) or "iex" (post-2022) |

Higher timeframes (5-min, 15-min, daily, etc.) follow the same schema, with the `datetime` column resampled to the chosen interval.

## License

This dataset is licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/). You are free to share and adapt the material for any purpose, including commercially, provided you give appropriate credit.

## How to cite

```bibtex
@dataset{elkassabgi2026hfdatalibrary,
  author    = {Elkassabgi, Ahmed},
  title     = {{HF Data Library: High-Frequency U.S. Equity Data (1-Minute OHLCV)}},
  year      = {2026},
  version   = {1.0},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19501605},
  url       = {https://hfdatalibrary.com}
}
```

## Links

- **Website:** https://hfdatalibrary.com
- **API:** https://api.hfdatalibrary.com
- **Documentation:** https://hfdatalibrary.com/pages/docs.html
- **Data dictionary:** https://hfdatalibrary.com/pages/dictionary.html
- **Code samples:** https://hfdatalibrary.com/pages/code.html
- **GitHub:** https://github.com/elkassabgi/hfdatalibrary
- **Zenodo:** https://zenodo.org/records/19501605
- **Contact:** admin@hfdatalibrary.com
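
## Example: resampling and LOCF gap-filling

The README states that higher timeframes are aggregated from the 1-minute bars and that researchers who need a regular grid can apply LOCF to the Clean version themselves. The following is a minimal sketch of both steps with pandas, not the library's own pipeline. The column names come from the file schema above; the function names, the 09:30 to 15:59 session window, the convention that a bar is stamped at the start of its minute, and the `filled` flag column are all assumptions made for illustration.

```python
import pandas as pd

def resample_ohlcv(df: pd.DataFrame, rule: str = "5min") -> pd.DataFrame:
    """Aggregate 1-minute bars to a coarser interval (hypothetical helper).

    Intervals that contain no bars are dropped rather than filled.
    """
    bars = (
        df.set_index("datetime")
        .resample(rule)
        .agg({"Open": "first", "High": "max", "Low": "min",
              "Close": "last", "Volume": "sum"})
    )
    return bars.dropna(subset=["Close"])

def locf_one_day(day_bars: pd.DataFrame) -> pd.DataFrame:
    """Place one day's bars on a full 1-minute grid and carry the last close forward."""
    day = day_bars.index[0].normalize()
    # Assumed regular session: 390 one-minute bars stamped 09:30 through 15:59.
    grid = pd.date_range(day + pd.Timedelta(hours=9, minutes=30),
                         day + pd.Timedelta(hours=15, minutes=59), freq="1min")
    out = day_bars.reindex(grid)
    out["filled"] = out["Close"].isna()        # mark synthetic bars before filling
    out["Close"] = out["Close"].ffill()        # LOCF within the day only
    for col in ("Open", "High", "Low"):
        out[col] = out[col].fillna(out["Close"])
    out["Volume"] = out["Volume"].fillna(0).astype("int64")
    # Minutes before the first observed bar have nothing to carry forward.
    return out.dropna(subset=["Close"])

def locf_regular_grid(df: pd.DataFrame) -> pd.DataFrame:
    """Apply LOCF day by day so prices never carry across the overnight gap."""
    indexed = df.set_index("datetime").sort_index()
    days = [locf_one_day(g) for _, g in indexed.groupby(indexed.index.normalize())]
    return pd.concat(days)

# Small synthetic sample with gaps at 09:32, 09:33 and 09:35.
sample = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-02 09:30", "2024-01-02 09:31",
                                "2024-01-02 09:34", "2024-01-02 09:36"]),
    "Open": [10.0, 10.1, 10.3, 10.2],
    "High": [10.2, 10.2, 10.4, 10.3],
    "Low": [9.9, 10.0, 10.2, 10.1],
    "Close": [10.1, 10.2, 10.3, 10.25],
    "Volume": [100, 200, 150, 50],
})

five_minute = resample_ohlcv(sample, "5min")
gridded = locf_regular_grid(sample)
print(five_minute)
print(gridded.head(8))
```

Keeping the `filled` flag makes it easy to exclude synthetic bars later, which matters because the README warns that LOCF gap-filling introduces documented biases.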

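
## Example: realized variance, bipower variation, and the Parkinson estimator

Among the pre-computed variables, the README lists realized variance, bipower variation (BNS 2004), and the Parkinson (1980) estimator. As a rough illustration of what these measure, the sketch below uses the textbook definitions: realized variance as the sum of squared intraday log returns, bipower variation as π/2 times the sum of products of adjacent absolute returns, and the Parkinson variance as (ln(High/Low))² / (4 ln 2). The function names are hypothetical, and the library's own implementation may differ in details not stated here, such as the treatment of gaps, overnight returns, or finite-sample corrections.

```python
import numpy as np
import pandas as pd

def realized_measures(close: pd.Series) -> dict:
    """Realized variance and bipower variation from one day's intraday closes."""
    r = np.log(close).diff().dropna().to_numpy()   # intraday log returns
    rv = float(np.sum(r ** 2))
    # Products of adjacent absolute returns; zero if fewer than two returns exist.
    bv = float((np.pi / 2) * np.sum(np.abs(r[1:]) * np.abs(r[:-1])))
    return {"rv": rv, "bv": bv}

def parkinson_variance(high: float, low: float) -> float:
    """Parkinson (1980) range-based variance estimate for a single day."""
    return float(np.log(high / low) ** 2 / (4 * np.log(2)))

# Synthetic closes for illustration only.
closes = pd.Series([100.0, 101.0, 100.0, 102.0])
measures = realized_measures(closes)
print(measures)
print(parkinson_variance(high=102.0, low=100.0))
```

Because bipower variation is robust to jumps while realized variance is not, a large gap between the two on a given day suggests a jump, which is the idea behind the BNS jump statistics listed above.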