
ResearchRL/diffquant-data

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ResearchRL/diffquant-data
Resource overview:
---
language:
- en
license: mit
tags:
- finance
- trading
- cryptocurrency
- bitcoin
- time-series
- OHLCV
- binance
- futures
- quantitative-finance
- differentiable-trading
pretty_name: BTCUSDT 1-Min Futures — 5-Year Research Dataset (2021–2025)
size_categories:
- 1M<n<10M
task_categories:
- time-series-forecasting
---

# BTCUSDT 1-Min Futures — 5-Year Research Dataset (2021–2025)

A gap-free 1-minute OHLCV dataset for **BTCUSDT Binance USDⓈ-M Perpetual Futures** covering five full calendar years: **2021-01-01 through 2025-12-31 (UTC)**.

This repository contains **raw market bars only**. Feature engineering, aggregation, sample construction, normalization, and temporal splitting belong to the downstream [DiffQuant](https://github.com/YuriyKolesnikov/diffquant) pipeline, described below for reproducibility.

Full codebase: [GitHub Repository](https://github.com/YuriyKolesnikov/diffquant)

---

## What this dataset is — and is not

**Is:**
- A clean, gap-free 1-minute futures bar dataset (2,629,440 bars)
- A reproducible research input for intraday quantitative studies
- The primary data source for the DiffQuant differentiable trading pipeline

**Is not:**
- A trading signal or strategy
- A labelled prediction dataset
- An RL environment with rewards or actions
- Order-book, trades, funding rates, open interest, or liquidation data

---

## Dataset card

| | |
|---|---|
| **Asset** | BTCUSDT Binance USDⓈ-M Perpetual Futures |
| **Resolution** | 1-minute bars, close-time convention |
| **Period** | 2021-01-01 00:00 UTC → 2025-12-31 23:59 UTC |
| **Total bars** | 2,629,440 |
| **Coverage** | 100.00% (zero gaps) |
| **File** | `btcusdt_1min_2021_2025.npz` (40.6 MB, NumPy compressed) |
| **Price range** | $15,502 → $126,087 |
| **OHLC violations** | 0 ✓ |
| **Duplicate timestamps** | 0 ✓ |
| **License** | MIT |

---

## Collection and quality assurance

Source: Binance USDⓈ-M Futures public API via internal database.
All bars use the **close-time convention**: each timestamp marks the end of the bar.

QA checks applied before release:
- Duplicate timestamp detection
- Full date-range gap scan (minute-level)
- OHLC consistency: `low ≤ min(open, close)` and `high ≥ max(open, close)`
- Negative price and volume checks
- Schema validation across all columns

Results for this release:

| Check | Result |
|---|---|
| Duplicate timestamps | 0 ✓ |
| Missing minutes | 0 ✓ |
| OHLC violations | 0 ✓ |
| Negative prices | 0 ✓ |
| Zero-volume bars | 213 (retained as valid observations) |

---

## File structure

```python
import numpy as np

data = np.load("btcusdt_1min_2021_2025.npz", allow_pickle=True)

bars = data["bars"]              # (2_629_440, 6) float32, raw exchange bars
timestamps = data["timestamps"]  # (2_629_440,) int64, Unix ms UTC, close-time
columns = list(data["columns"])  # ['open', 'high', 'low', 'close', 'volume', 'num_trades']
meta = str(data["meta"][0])      # provenance string
```

### Channels (raw values)

| Index | Name | Description |
|---|---|---|
| 0 | `open` | First trade price in the bar |
| 1 | `high` | Highest trade price in the bar |
| 2 | `low` | Lowest trade price in the bar |
| 3 | `close` | Last trade price in the bar |
| 4 | `volume` | Total base asset volume (BTC) |
| 5 | `num_trades` | Number of individual trades |

All values are stored as raw floats with no pre-processing applied.
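
The QA checks listed above can be reproduced on the loaded arrays. The following is a minimal sketch, not the release's actual QA code: the function name `qa_report`, its return keys, and the synthetic demo arrays are assumptions made for illustration, and the gap scan assumes timestamps are sorted in ascending order.

```python
import numpy as np

MINUTE_MS = 60_000  # one minute in Unix milliseconds


def qa_report(bars: np.ndarray, timestamps: np.ndarray) -> dict:
    """Count QA failures; columns are open, high, low, close, volume, num_trades."""
    op, hi, lo, cl, vol = (bars[:, i] for i in range(5))
    steps = np.diff(timestamps)  # assumes timestamps are sorted ascending
    gaps = steps[steps > MINUTE_MS]
    return {
        "duplicate_timestamps": int(timestamps.size - np.unique(timestamps).size),
        # A step of k minutes between neighbours hides k - 1 missing bars.
        "missing_minutes": int(np.sum(gaps // MINUTE_MS - 1)),
        "ohlc_violations": int(np.sum((lo > np.minimum(op, cl)) | (hi < np.maximum(op, cl)))),
        "negative_prices": int(np.sum((bars[:, :4] < 0).any(axis=1))),
        "negative_volumes": int(np.sum(vol < 0)),
        "zero_volume_bars": int(np.sum(vol == 0)),
    }


# Tiny synthetic example (not real market data): one duplicate timestamp,
# one missing minute, one bar whose high is below its open, one zero-volume bar.
demo_ts = np.array([60_000, 120_000, 240_000, 240_000, 300_000], dtype=np.int64)
demo_bars = np.array(
    [
        [10.0, 11.0, 9.0, 10.5, 1.0, 3.0],
        [10.5, 10.4, 10.0, 10.2, 2.0, 5.0],  # high < open: OHLC violation
        [10.2, 10.6, 10.1, 10.3, 0.0, 0.0],  # zero-volume bar
        [10.3, 10.8, 10.2, 10.7, 1.5, 4.0],
        [10.7, 10.9, 10.6, 10.8, 1.0, 2.0],
    ],
    dtype=np.float32,
)
report = qa_report(demo_bars, demo_ts)
print(report)
```

Calling `qa_report(bars, timestamps)` on the arrays loaded from the `.npz` file should, if the release table is accurate, report zeros everywhere except for the zero-volume count.
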
### Summary statistics

| Channel | Min | Max | Mean |
|---|---|---|---|
| open | 15,502.00 | 126,086.70 | 54,382.59 |
| high | 15,532.20 | 126,208.50 | 54,406.74 |
| low | 15,443.20 | 126,030.00 | 54,358.47 |
| close | 15,502.00 | 126,086.80 | 54,382.60 |
| volume | 0.00 | 40,256.00 | 241.90 |
| num_trades | 0.00 | 263,775.00 | 2,551.55 |

### Bars by year

```
2021: 525,600 ██████████████████████████████
2022: 525,600 ██████████████████████████████
2023: 525,600 ██████████████████████████████
2024: 527,040 ██████████████████████████████ (leap year)
2025: 525,600 ██████████████████████████████
```

### Sample bars

**First 5 bars (2021-01-01):**

| # | Datetime UTC | open | high | low | close | volume | num_trades |
|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 00:00 | 28939.90 | 28981.55 | 28934.65 | 28951.68 | 126.0 | 929 |
| 1 | 2021-01-01 00:01 | 28948.19 | 28997.16 | 28935.30 | 28991.01 | 143.0 | 1120 |
| 2 | 2021-01-01 00:02 | 28992.98 | 29045.93 | 28991.01 | 29035.18 | 256.0 | 1967 |
| 3 | 2021-01-01 00:03 | 29036.41 | 29036.97 | 28993.19 | 29016.23 | 102.0 | 987 |
| 4 | 2021-01-01 00:04 | 29016.23 | 29023.87 | 28995.50 | 29002.92 | 85.0 | 832 |

**Mid-dataset (2023-07-03):**

| # | Datetime UTC | open | high | low | close | volume | num_trades |
|---|---|---|---|---|---|---|---|
| 1314720 | 2023-07-03 00:00 | 30611.70 | 30615.70 | 30611.70 | 30612.70 | 42.0 | 649 |
| 1314721 | 2023-07-03 00:01 | 30612.70 | 30624.40 | 30612.70 | 30613.90 | 150.0 | 1846 |
| 1314722 | 2023-07-03 00:02 | 30613.90 | 30614.00 | 30600.00 | 30600.00 | 241.0 | 1796 |

**Last 5 bars (2025-12-31):**

| # | Datetime UTC | open | high | low | close | volume | num_trades |
|---|---|---|---|---|---|---|---|
| 2629435 | 2025-12-31 23:55 | 87608.40 | 87608.40 | 87608.30 | 87608.30 | 10.0 | 182 |
| 2629436 | 2025-12-31 23:56 | 87608.40 | 87613.90 | 87608.30 | 87613.90 | 14.0 | 343 |
| 2629437 | 2025-12-31 23:57 | 87613.90 | 87621.70 | 87613.80 | 87621.70 | 7.0 | 231 |
| 2629438 | 2025-12-31 23:58 | 87621.60 | 87631.90 | 87603.90 | 87608.10 | 38.0 | 815 |
| 2629439 | 2025-12-31 23:59 | 87608.10 | 87608.20 | 87608.10 | 87608.20 | 11.0 | 206 |

---

## Quick start

```python
from huggingface_hub import hf_hub_download
import numpy as np
import pandas as pd

path = hf_hub_download(
    repo_id="ResearchRL/diffquant-data",
    filename="btcusdt_1min_2021_2025.npz",
    repo_type="dataset",
)

data = np.load(path, allow_pickle=True)
bars = data["bars"]      # (2_629_440, 6) float32
ts = data["timestamps"]  # Unix ms UTC

index = pd.to_datetime(ts, unit="ms", utc=True)
df = pd.DataFrame(bars, columns=list(data["columns"]), index=index)
print(df.head())
```

---

## Reference pipeline: DiffQuant

The dataset is designed to be used with the DiffQuant data pipeline. The transformations it applies are described precisely below so that the dataset can also be used reproducibly outside DiffQuant.

### Step 1 — Aggregation

Resample from 1-min to any target resolution using clock-aligned buckets. `origin="epoch"` ensures bars always land on exact boundaries (`:05`, `:10`, …). Partial buckets at series edges are dropped.

```python
from data.aggregator import aggregate
from configs.base_config import MasterConfig

cfg = MasterConfig()
cfg.data.timeframe_min = 5  # valid: {1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60}

# bars_1m is the raw (N, 6) bar array loaded from the .npz file
bars_5m, ts_5m = aggregate(bars_1m, timestamps, cfg)
```

### Step 2 — Feature engineering

Applied channel-by-channel after aggregation. The first bar is always dropped (no prior close available for log-return computation).
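
For readers working outside DiffQuant, Steps 1 and 2 can be approximated with pandas. The sketch below is not the `data.aggregator.aggregate` implementation: `aggregate_bars`, `price_log_returns`, and the synthetic demo frame are hypothetical names introduced here. It also assumes that, because timestamps mark bar ends, buckets should be closed and labelled on the right, and it computes price log-returns only, as `log(price_t / close_{t-1})`.

```python
import numpy as np
import pandas as pd

OHLCV_RULES = {
    "open": "first", "high": "max", "low": "min",
    "close": "last", "volume": "sum", "num_trades": "sum",
}


def aggregate_bars(df_1m: pd.DataFrame, minutes: int) -> pd.DataFrame:
    """Resample close-time-stamped 1-minute bars into `minutes`-minute bars."""
    # Timestamps mark bar ends, so each bucket is closed and labelled on the right.
    buckets = df_1m.resample(
        f"{minutes}min", origin="epoch", closed="right", label="right"
    )
    bars = buckets.agg(OHLCV_RULES)
    complete = buckets["close"].count() == minutes  # drop partial edge buckets
    return bars[complete]


def price_log_returns(bars: pd.DataFrame) -> pd.DataFrame:
    """log(price_t / close_{t-1}) for open, high, low, close; first bar dropped."""
    prev_close = bars["close"].shift(1)
    feats = np.log(bars[["open", "high", "low", "close"]].div(prev_close, axis=0))
    return feats.iloc[1:]


# Synthetic demo: 12 one-minute bars stamped 00:01 ... 00:12 UTC (not real data).
idx = pd.date_range("2021-01-01 00:01", periods=12, freq="min", tz="UTC")
close = 100.0 + np.arange(1, 13)
demo = pd.DataFrame(
    {"open": close - 1.0, "high": close + 0.5, "low": close - 1.5,
     "close": close, "volume": 2.0, "num_trades": 10.0},
    index=idx,
)
bars_5m = aggregate_bars(demo, 5)   # two complete bars, labelled 00:05 and 00:10
feats = price_log_returns(bars_5m)  # one row: the first 5-min bar has no prior close
```

The trailing bucket holds only two 1-minute bars and is discarded, which mirrors the "partial buckets are dropped" rule; the volume and trade-count features would need an additional rolling-mean step that is not shown here.
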

| Channel | Transformation |
|---|---|
| open, high, low, close | `log(price_t / close_{t-1})`: log-return vs. previous bar close |
| volume | `log(volume_t / rolling_mean(volume, window) + eps)`: relative intensity |
| num_trades | `log(num_trades_t / rolling_mean(num_trades, window) + eps)`: same |
| typical_price (optional) | `log(((H+L+C)/3)_t / close_{t-1})` |
| time features (optional) | `[sin_hour, cos_hour, sin_dow, cos_dow]`: cyclic UTC encoding |

### Step 3 — Feature presets

```python
cfg.data.preset = "ohlc"    # 4 channels
cfg.data.preset = "ohlcv"   # 5 channels (default)
cfg.data.preset = "full"    # 6 channels

cfg.data.add_typical_price = True   # +1 channel
cfg.data.add_time_features = True   # +4 channels

# Or fully custom:
cfg.data.preset = "custom"
cfg.data.feature_columns = ["close", "volume"]
```

### Step 4 — Temporal splits

The full dataset supports arbitrary split boundaries via `SplitConfig`. The primary DiffQuant experiment used the following non-overlapping splits:

```
Train    : 2024-01-01 → 2025-03-31 (15 months, intentionally recent)
Val      : 2025-04-01 → 2025-06-30 (3 months)
Test     : 2025-07-01 → 2025-09-30 (3 months, out-of-sample)
Backtest : 2025-10-01 → 2025-12-31 (3 months, final hold-out)
```

The training window is deliberately limited to 15 months rather than the full historical record. This keeps the training regime close to the evaluation periods and minimizes distribution shift. Extending to earlier data is the recommended first ablation and is straightforward via `SplitConfig.train_start`.

### Step 5 — Full pipeline one-liner

```python
from data.pipeline import load_or_build
from configs.base_config import MasterConfig

cfg = MasterConfig()
splits = load_or_build("btcusdt_1min_2021_2025.npz", cfg, cache_dir="data_cache/")

# splits["train"]["full_sequences"]: (N, ctx+hor, F) sliding windows for training
# splits["val"]["raw_features"]: continuous array for walk-forward evaluation
```

Results are MD5-hashed and cached on disk.
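
How DiffQuant derives its cache hash is not shown in this card, so the following is only a sketch of one common way to build a config-dependent MD5 key. The function `cache_key` and the dictionary fields are hypothetical and stand in for whatever `MasterConfig` actually serializes.

```python
import hashlib
import json


def cache_key(config: dict) -> str:
    """MD5 hex digest of a canonical (key-sorted) JSON dump of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Hypothetical config fields, for illustration only.
base = {"timeframe_min": 5, "preset": "ohlcv", "add_time_features": False}
changed = {**base, "timeframe_min": 15}
reordered = {"preset": "ohlcv", "add_time_features": False, "timeframe_min": 5}

key_base = cache_key(base)
key_changed = cache_key(changed)      # differs: a cached result would not be reused
key_reordered = cache_key(reordered)  # identical: key order does not matter
```

Sorting the keys before hashing makes the digest depend only on the config values, so any change to a value yields a new key and a fresh build.
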
Cache is invalidated automatically when the config changes (timeframe, preset, split boundaries, feature flags).

---

## Project context

This dataset is the data foundation for **DiffQuant**, a research framework studying direct optimization of trading objectives.

Most ML trading systems face a structural misalignment: models are trained on proxy losses (MSE, cross-entropy, TD-error), while performance is measured in realized PnL, Sharpe ratio, and drawdown. DiffQuant studies what happens when this proxy is removed entirely: the full pipeline from raw features through a differentiable mark-to-market simulator to the Sharpe ratio becomes a single computation graph. `loss.backward()` optimizes what the strategy actually earns, with transaction costs and slippage accounted for in every gradient update.

**Key references:**

- Buehler, H., Gonon, L., Teichmann, J., Wood, B. (2019). *Deep Hedging.* Quantitative Finance, 19(8). [`arXiv:1802.03042`](https://arxiv.org/abs/1802.03042). Foundational framework for end-to-end differentiable financial objectives.
- Moody, J., Saffell, M. (2001). *Learning to Trade via Direct Reinforcement.* IEEE Transactions on Neural Networks, 12(4). Original formulation of direct PnL optimization as a training objective.
- Khubiev, K., Semenov, M., Podlipnova, I., Khubieva, D. (2026). *Finance-Grounded Optimization For Algorithmic Trading.* [`arXiv:2509.04541`](https://arxiv.org/abs/2509.04541). Closest parallel work on financial loss functions for return prediction.
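
To make the objective described in the project context concrete, the sketch below computes a Sharpe ratio on PnL net of a proportional transaction cost. It is not the DiffQuant simulator: `net_sharpe`, the `cost_rate` parameter, and the one-step execution lag are assumptions chosen for illustration, and plain NumPy provides no gradients, so an autodiff framework would be needed to train against it.

```python
import numpy as np


def net_sharpe(positions, returns, cost_rate=0.0005, eps=1e-12):
    """Per-step Sharpe ratio of PnL after proportional turnover costs.

    positions[t] is the position chosen at step t (e.g. in [-1, 1]);
    returns[t] is the asset return realised over step t.
    """
    positions = np.asarray(positions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # The position earning returns[t] is the one chosen at t - 1 (flat at start).
    held = np.concatenate(([0.0], positions[:-1]))
    # Turnover is the absolute position change, starting from a flat book.
    turnover = np.abs(np.diff(np.concatenate(([0.0], positions))))
    pnl = held * returns - cost_rate * turnover
    return pnl.mean() / (pnl.std() + eps)


# Synthetic numbers, for illustration only.
demo_returns = [0.01, -0.02, 0.03, 0.01]
always_long = [1.0, 1.0, 1.0, 1.0]

sharpe_free = net_sharpe(always_long, demo_returns, cost_rate=0.0)
sharpe_costly = net_sharpe(always_long, demo_returns, cost_rate=0.001)
sharpe_flat = net_sharpe([0.0] * 4, demo_returns)  # never trades: zero PnL
```

In this toy case the cost of entering the position lowers the Sharpe ratio, which is the effect a cost-aware training objective is meant to expose to the optimizer.
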

<p>
  <strong>Research article (English · Medium):</strong><br>
  <a href="https://medium.com/@YuriKolesnikovAI/diffquant-end-to-end-sharpe-optimization-through-a-differentiable-trading-simulator-a64d428f0fd4">DiffQuant: End-to-End Sharpe Optimization Through a Differentiable Trading Simulator</a>
</p>
<p>
  <strong>Статья (Русский · Habr):</strong><br>
  <a href="https://habr.com/ru/articles/1022254/">DiffQuant: прямая оптимизация коэффициента Шарпа через дифференцируемый торговый симулятор</a>
</p>
<p>
  <strong>DiffQuant pipeline:</strong>
  <a href="https://github.com/YuriyKolesnikov/diffquant">github.com/YuriyKolesnikov/diffquant</a>
</p>

---

## Citation

```bibtex
@dataset{Kolesnikov2026diffquant_data,
  author    = {Kolesnikov, Yuriy},
  title     = {{BTCUSDT} 1-Min Futures — 5-Year Research Dataset (2021--2025)},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ResearchRL/diffquant-data},
}
```
Provider:
ResearchRL
Collected summary
Dataset introduction
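Construction
In cryptocurrency quantitative research, the quality and completeness of high-frequency data underpin model reliability. This dataset was collected from the Binance USDⓈ-M perpetual futures public API and uses the close-time convention, so the timestamp of each 1-minute bar marks the moment the bar ends. Construction included a strict quality-assurance process: duplicate-timestamp detection, a minute-level gap scan over the full date range, OHLC consistency validation, and non-negativity checks on prices and volumes. The result is 2,629,440 one-minute OHLCV records covering five consecutive years, 2021 through 2025, with no missing data. All values are stored as raw floats with no pre-processing.
Features
The dataset's defining characteristics are cleanliness and completeness, and it is designed for intraday quantitative research. It provides a complete minute-level record of price and trading activity for the BTCUSDT perpetual futures contract over five years, with prices ranging from $15,502 to $126,087 and spanning the market's major cycles. It follows the standard OHLCV format with six channels: open, high, low, close, base-asset volume, and number of trades. Validation found no OHLC violations and no negative values. Its zero-gap property guarantees a continuous time series, which makes it a sound basis for reproducible research pipelines.
Usage
As the designated data source of the DiffQuant differentiable trading pipeline, the dataset is used through a standardized workflow. Researchers first download the compressed NumPy file from the Hugging Face Hub and load the raw bars and timestamps. The provided aggregation function can then resample the 1-minute data to a coarser timeframe that matches the target research resolution. Feature engineering computes log-returns for the price channels and, for volume and number of trades, intensity relative to a rolling mean; typical price and cyclic time-encoding features are optional. The pipeline supports preset configurations and custom feature selection, and it defines explicit temporal splits for training, validation, and out-of-sample testing. All processing steps are cached for efficiency and reproducibility.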
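Background and challenges
Background overview
In quantitative finance, high-frequency, high-quality time-series data underpins algorithmic trading and market-microstructure research. The diffquant-data dataset, published by Yuriy Kolesnikov in 2026, provides minute-level OHLCV data for Bitcoin perpetual futures. It covers five consecutive full years from 2021 to 2025, contains more than 2.6 million gap-free bars, and is intended to support research on end-to-end differentiable trading strategies. Its core research question is the mismatch between proxy loss functions and real trading objectives in conventional machine-learning trading systems. By providing raw market data, it lays the data foundation for methods such as direct Sharpe-ratio optimization within the DiffQuant framework, and it is aimed at advancing quantitative modeling of cryptocurrency markets and differentiable-finance research.
Current challenges
The dataset targets challenges in financial time-series forecasting and in the optimization of differentiable trading strategies. The first is the non-stationarity and high noise of market data, which make it hard for models to capture price dynamics, volatility clustering, and market-microstructure effects. The second is data integrity and consistency during construction: ensuring that no minutes are missing, eliminating duplicate timestamps, validating the logical relationships among OHLC prices, and handling outliers during extreme market events. Finally, because the dataset contains only raw market data, downstream feature engineering and sample construction must deal with distribution shift to support robust training and evaluation, which places strict demands on the reproducibility and generalization of quantitative research.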
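Common use cases
Classic use case
In quantitative finance, analysis of high-frequency time series is the basis for building trading strategies. With five years of gap-free 1-minute OHLCV data, diffquant-data is a typical input for developing intraday trading strategies. Researchers commonly use it for price-volatility modeling, market-microstructure analysis, and training and validating high-frequency forecasting models. The minute-level resolution captures the fast-changing dynamics of the cryptocurrency market and provides a high-fidelity historical environment for strategy backtesting and performance evaluation.
Academic problems addressed
The dataset mainly addresses several key academic problems in financial time-series forecasting: modeling price series in high-noise settings, overcoming the mismatch between conventional proxy loss functions and the final trading objective, and exploring end-to-end differentiable optimization for trading strategies. By providing clean, continuous raw data, it supports empirical studies of methods such as direct reinforcement learning and deep hedging, and it advances the theory and practice of incorporating financial metrics such as the Sharpe ratio directly into model training.
Derived and related work
Work derived from this dataset centers on differentiable trading frameworks. The accompanying DiffQuant research pipeline is the representative result: it implements an end-to-end differentiable graph from raw features to the Sharpe ratio, eliminating proxy loss functions. Related lines of research trace back to the end-to-end hedging framework of Deep Hedging, the trading-optimization ideas of direct reinforcement learning, and recent work on finance-grounded optimization. Together these works advance algorithmic trading based on direct objective optimization.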
The content above was collected and summarized by 遇见数据集.