jonyling/eth-usd-price-prediction
Source: Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/jonyling/eth-usd-price-prediction
Resource description:
# ETH-USD 4h Price Direction Model
A production-grade machine learning pipeline that predicts **4-hour forward returns** for ETH/USD using 1-minute Binance OHLCV data fused with 7 years of Ethereum on-chain metrics from Dune Analytics.
---
## Project Overview
| | |
|---|---|
| **Target** | 4h forward return: `close[t+4] / close[t] - 1` |
| **Model** | LightGBM Regressor (Optuna-tuned) |
| **Bar size** | 1h (resampled from 1-min) |
| **Features** | 84 (price, volume, technicals, on-chain, cyclical) |
| **Training data** | ~52,000 hourly bars (~7 years) |
| **Train / Holdout split** | 80 / 20 (time-based) |
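
The target and split rows above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the notebook's code: `forward_return` and `time_split` are hypothetical helper names and the closes are toy values. With 1h bars, `horizon=4` corresponds to `close[t+4] / close[t] - 1`. One caveat worth keeping in mind: the last four training bars have targets that reach into the holdout period, so dropping them (a purge gap) is a common precaution. Whether the notebook does this is not stated.

```python
def forward_return(close, horizon=4):
    """Return close[t+horizon] / close[t] - 1 for each bar t.

    The last `horizon` bars have no future close, so they get None.
    """
    n = len(close)
    out = []
    for t in range(n):
        if t + horizon < n:
            out.append(close[t + horizon] / close[t] - 1.0)
        else:
            out.append(None)
    return out


def time_split(rows, train_frac=0.8):
    """Chronological split: the first train_frac of rows train, the rest hold out."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


hourly_close = [100.0, 101.0, 99.0, 102.0, 110.0, 105.0]  # toy hourly closes
target = forward_return(hourly_close, horizon=4)
train, holdout = time_split(hourly_close)  # no shuffling: order is preserved
```
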
---
## Backtest Results (Out-of-Sample Holdout)
| Metric | Value |
|---|---|
| Sharpe Ratio | **2.8** |
| Cumulative Return | **+39.7%** |
| Max Drawdown | **-4.3%** |
| Win Rate | **74.2%** |
| Number of Trades | 58 |
| Trading cost | 0.10% per trade (Binance Spot w/ BNB) |
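
For readers who want to recompute metrics like these from a list of per-trade returns, here is a rough sketch. It rests on assumptions the table does not confirm: the 0.10% cost is deducted once per trade, and the Sharpe ratio is left unannualized because the annualization convention is not given. `summarize_trades` is a hypothetical helper and the returns are toy values.

```python
import statistics


def summarize_trades(gross_returns, cost_per_trade=0.001):
    """Summarize a sequence of per-trade returns after a flat cost per trade."""
    net = [r - cost_per_trade for r in gross_returns]

    equity, peak, max_drawdown = 1.0, 1.0, 0.0
    for r in net:
        equity *= 1.0 + r                      # compound the trade
        peak = max(peak, equity)               # running high-water mark
        max_drawdown = min(max_drawdown, equity / peak - 1.0)

    spread = statistics.stdev(net)             # sample std of net returns
    sharpe = statistics.mean(net) / spread if spread > 0 else float("nan")
    return {
        "cumulative_return": equity - 1.0,
        "max_drawdown": max_drawdown,
        "win_rate": sum(r > 0 for r in net) / len(net),
        "sharpe_per_trade": sharpe,            # per trade, NOT annualized
    }


stats = summarize_trades([0.02, -0.01, 0.015, 0.005])  # toy gross returns
```

With only 58 trades in the holdout, each of these statistics carries wide sampling error, which is why the robustness checks matter.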
### Robustness Checks
| Test | Result |
|---|---|
| Data leakage | ✅ PASS |
| OOS stability (both halves Sharpe > 0) | ✅ PASS |
| Monte Carlo (100 paths, % positive Sharpe) | ✅ PASS — 100% |
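
The README does not say how the Monte Carlo paths are generated. One plausible reading, shown here purely as an assumption, is a bootstrap: resample the holdout trade returns with replacement and count the share of resampled paths whose Sharpe ratio is positive. `share_of_positive_sharpe` is a hypothetical name and the inputs are toy values.

```python
import random
import statistics


def share_of_positive_sharpe(trade_returns, n_paths=100, seed=0):
    """Bootstrap trade returns and return the fraction of paths with Sharpe > 0."""
    rng = random.Random(seed)                  # seeded for reproducibility
    positive = 0
    for _ in range(n_paths):
        # One path = the same number of trades, drawn with replacement
        path = rng.choices(trade_returns, k=len(trade_returns))
        spread = statistics.pstdev(path)
        if spread > 0 and statistics.mean(path) / spread > 0:
            positive += 1
    return positive / n_paths


all_winners = [0.001 * k for k in range(1, 21)]   # toy: every trade is positive
all_losers = [-r for r in all_winners]            # toy: every trade is negative
share_up = share_of_positive_sharpe(all_winners)
share_down = share_of_positive_sharpe(all_losers)
```
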
---
## Feature Categories
- **Price returns**: 1h, 2h, 4h, 6h, 12h, 24h, 48h, 168h lookback
- **Candle structure**: body ratio, range %, close location value, gap high/low
- **Moving averages**: SMA and price-vs-SMA at 4h, 12h, 24h, 48h, 168h
- **Volume**: volume ratios, volatility, OBV
- **Technical indicators**: RSI (6h, 14h, 24h), MACD, Bollinger Bands
- **On-chain (Ethereum)**: transaction count, active senders/receivers, ETH transferred, gas used — with 24h/168h MAs, % changes, and momentum
- **Cyclical time**: hour and day-of-week encoded as sin/cos
- **Lagged features**: return and range lags at 1, 2, 3, 4, 6, 12 bars
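
Three of the categories above (lookback returns, lags, and cyclical time) are simple enough to sketch without any libraries. The helper names below are hypothetical and are not taken from `feature_engineering.py`. The point of the sketch is that each feature at bar `t` uses only bars at or before `t`, and that the sin/cos pair places hour 23 next to hour 0.

```python
import math


def cyclical(value, period):
    """Map a periodic integer (hour 0-23, weekday 0-6) to a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def lookback_return(close, n):
    """Return over the previous n bars; None until n bars of history exist."""
    return [None if t < n else close[t] / close[t - n] - 1.0 for t in range(len(close))]


def lagged(series, lag):
    """Value of `series` from `lag` bars ago; None where history is missing."""
    return ([None] * lag + list(series))[: len(series)]


close = [100.0, 102.0, 101.0, 104.0]          # toy hourly closes
ret_1h = lookback_return(close, 1)            # 1h lookback return
ret_1h_lag1 = lagged(ret_1h, 1)               # the same return, one bar earlier
hour_23, hour_0 = cyclical(23, 24), cyclical(0, 24)
```
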
---
## Repo Structure
```
├── ETH-USD_ML.ipynb # Full pipeline: EDA → features → training → backtest
├── feature_engineering.py # Feature engineering utilities
├── fetch_xrp_eth.py # Data fetching from Binance
├── parquet-conversion.py # CSV → Parquet conversion
├── Dune_Query.ipynb # On-chain data pulls from Dune Analytics
├── model_export/
│ ├── lgbm_eth_4h_regressor.joblib # Trained model (sklearn API)
│ ├── lgbm_eth_4h_regressor.txt # Trained model (LightGBM native)
│ ├── feature_pipeline.py # Feature engineering for inference
│ ├── predict.py # Signal generator CLI
│ └── config.json # Model metadata & hyperparameters
├── cnn_lstm/ # CNN-LSTM experiment (did not beat LGBM)
│ ├── train.py # Model training script
│ ├── predict.py # Inference / signal generation
│ ├── feature_pipeline.py # Feature engineering for CNN-LSTM
│ ├── alert_runner.py # Telegram alert runner
│ ├── config.json # Model config
│ └── setup_cron.sh # Cron job setup
```
---
## Quickstart
### 1. Install dependencies
```bash
pip install lightgbm polars pandas numpy scikit-learn optuna ta
```
### 2. Generate signals from new data
```bash
cd model_export
python predict.py --input latest_hourly.csv --output signals.csv
```
Input CSV must contain hourly OHLCV columns (`open`, `high`, `low`, `close`, `volume`) plus Ethereum on-chain columns (see `config.json` for the full feature list).
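
Before calling `predict.py`, it can help to check the CSV header for the five OHLCV columns. The sketch below is a standalone pre-check and is not part of the repository. `missing_ohlcv_columns` and the `timestamp` column in the sample are hypothetical, and the on-chain column names are deliberately left out because they are only listed in `config.json`.

```python
import csv
import io

REQUIRED_OHLCV = ("open", "high", "low", "close", "volume")


def missing_ohlcv_columns(csv_file):
    """Return the required OHLCV column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])    # first row, or [] for an empty file
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_OHLCV if name not in present]


# Toy input with the volume column left out on purpose
sample = io.StringIO("timestamp,open,high,low,close\n2024-01-01 00:00,1,2,0.5,1.5\n")
missing = missing_ohlcv_columns(sample)
```

For a real file, pass an open handle instead, for example `open("latest_hourly.csv", newline="")`.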
### 3. Signal interpretation
- `signal = 1` → **Long**: predicted return > entry threshold (`K × cost`)
- `signal = -1` → **Short**: predicted return < −threshold
- `signal = 0` → **Flat**: no trade
The volatility filter (24h rolling std of hourly returns > median) must also be satisfied for a trade to trigger.
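
The rules above can be read as a small decision function. This is a sketch of that reading, not the code in `predict.py`: `make_signal` is a hypothetical name, `k=2.0` is a placeholder because the README does not give the value of `K`, and the median volatility is passed in because the window it is computed over is not specified.

```python
def make_signal(pred_return, vol_24h, vol_median, cost=0.001, k=2.0):
    """Return 1 (long), -1 (short), or 0 (flat) for one hourly bar."""
    threshold = k * cost                 # entry threshold, K × cost
    if not vol_24h > vol_median:         # volatility filter must pass first
        return 0
    if pred_return > threshold:
        return 1
    if pred_return < -threshold:
        return -1
    return 0


long_signal = make_signal(0.005, vol_24h=0.02, vol_median=0.01)
short_signal = make_signal(-0.005, vol_24h=0.02, vol_median=0.01)
below_threshold = make_signal(0.001, vol_24h=0.02, vol_median=0.01)
quiet_market = make_signal(0.005, vol_24h=0.005, vol_median=0.01)
```
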
---
## Data Sources
| Source | Data | Coverage |
|---|---|---|
| Binance API | ETH/USDT 1-minute OHLCV | ~7 years |
| Dune Analytics | Ethereum on-chain metrics | ~7 years |
---
## Model Hyperparameters (Optuna-tuned)
| Parameter | Value |
|---|---|
| Learning rate | 0.00758 |
| Max depth | 6 |
| Num leaves | 57 |
| Min child samples | 96 |
| Subsample | 0.708 |
| Colsample by tree | 0.549 |
| L1 reg (alpha) | 0.203 |
| L2 reg (lambda) | 0.196 |
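
As a convenience, the table can be written as a dictionary keyed by the keyword names that `lightgbm.LGBMRegressor` uses. This mapping is my reading of the table labels and should be checked against `config.json`. The number of boosting rounds is not listed, so it is omitted. Note also that in LightGBM `subsample` only takes effect when `subsample_freq` is greater than 0. The README does not say which value was used.

```python
# Table values keyed by the keyword names of lightgbm.LGBMRegressor.
LGBM_PARAMS = {
    "learning_rate": 0.00758,
    "max_depth": 6,
    "num_leaves": 57,
    "min_child_samples": 96,
    "subsample": 0.708,
    "colsample_bytree": 0.549,
    "reg_alpha": 0.203,    # L1 regularization
    "reg_lambda": 0.196,   # L2 regularization
}

# A depth-6 binary tree has at most 2**6 = 64 leaves, so num_leaves=57 is reachable.
max_leaves_at_depth = 2 ** LGBM_PARAMS["max_depth"]
leaves_fit_depth = LGBM_PARAMS["num_leaves"] <= max_leaves_at_depth

# With lightgbm installed, the dict could be passed straight through:
#   from lightgbm import LGBMRegressor
#   model = LGBMRegressor(**LGBM_PARAMS)
```
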
---
## Deep Learning Experiments (CNN-LSTM & Transformer)
We tested several deep learning architectures to see whether neural sequence models could improve on the LightGBM baseline. **None outperformed it.**
### Architectures Tested
| Model | Input | Window | Params | Target |
|---|---|---|---|---|
| CNN-LSTM (1-min bars) | `(120, 31)` | 2h lookback | ~200K | 1-min fwd return |
| Transformer Encoder (1h bars) | `(168, 42)` | 7d lookback | ~300K+ | 4h fwd return |
| CNN-LSTM (4h bars) | `(42, 27)` | 7d lookback | ~10K | z-scored 4h fwd return |
| Hybrid: CNN-LSTM embeddings + LGBM | 27 flat + 32 LSTM + 1 pred = 60 | — | — | 4h fwd return |
All models used PyTorch on GPU, Huber loss, Adam/AdamW optimizers, and early stopping.
### Results vs. LightGBM
| Approach | Sharpe | Return | Win Rate | Max DD |
|---|---|---|---|---|
| **LGBM 4h Regression** | **2.8** | **+39.7%** | **74.2%** | **-4.3%** |
| CNN-LSTM 1-min | — | — | ~49.6% | — |
| Transformer 1h | — | — | ~49.7% | — |
| Hybrid embeds + LGBM | +0.26 | +1.95% | low | -19.12% |
The 1-min CNN-LSTM and 1h Transformer both **collapsed to predicting ≈ 0** for every bar (R² ≈ 0.000, direction accuracy ≈ coin-flip), generating zero actionable trades at any threshold.
### Why Deep Learning Failed Here
1. **Signal-to-noise ratio is too low.** Crypto returns at 1-min and 1-hour frequencies are nearly symmetric around zero. Under Huber/MSE loss the optimal strategy is to predict the mean (≈ 0) — which minimizes loss but is useless for trading.
2. **Neural nets are overparameterized for the data.** The 4h dataset has only ~13K non-overlapping samples (~52,000 hourly bars ÷ 4); even the deliberately small 10K-parameter CNN-LSTM struggled. Tree-based models handle this low-sample regime far better by finding sharp nonlinear thresholds without gradient-based representation learning.
3. **Proxy features ≠ real order-book data.** Microstructure features (Corwin-Schultz spread, VPIN, Amihud illiquidity) estimated from OHLCV are noisy approximations that lack the granularity of actual Level-2 bid/ask depth.
4. **Huber loss is symmetric.** The model receives no extra penalty for wrong-direction predictions, so it is never incentivized to make bold directional calls — it collapses to the mean.
5. **Embedding aggregation is lossy.** Compressing 240 × 32-dim 1-min LSTM hidden states into a single 4h summary (via mean pooling or last-value) destroys temporal ordering and fine-grained information.
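
Points 1 and 4 can be made concrete with a toy calculation. The sketch below uses made-up returns that are exactly symmetric around zero and a hypothetical `delta`; it is not the training code. It shows that the best constant prediction under Huber loss is 0, and that a wrong-direction prediction costs the same as a right-direction one with the same absolute error.

```python
def huber(error, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def mean_huber(constant, targets, delta):
    """Average Huber loss when predicting the same constant for every target."""
    return sum(huber(y - constant, delta) for y in targets) / len(targets)


half = [0.001 * k for k in range(1, 51)]      # toy positive returns, up to 5%
returns = half + [-r for r in half]           # mirrored: exactly symmetric around 0
grid = [0.001 * k for k in range(-10, 11)]    # candidate constants, -1% to +1%
best_constant = min(grid, key=lambda c: mean_huber(c, returns, delta=0.01))

# True return +1%: overshooting to +3% and calling -1% both miss by 2 points.
right_direction = huber(0.01 - 0.03, delta=0.01)
wrong_direction = huber(0.01 - (-0.01), delta=0.01)
```

Here `best_constant` comes out as 0.0, and the two losses match up to floating-point rounding, so nothing in the objective rewards getting the sign right.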
### Conclusion
CNN-LSTM embeddings *did* appear in 8 of the top-20 feature importances in the hybrid model, suggesting they capture some temporal structure — but those patterns are not profitable enough to overcome noise and beat LGBM's Sharpe of 2.8. The expected value of further deep learning investment is negative:
$$E[\text{value}] \approx P(\Delta\text{Sharpe}>0) \times \Delta\text{Sharpe} - \text{compute cost} - \text{overfitting risk} < 0$$
The LightGBM 4h regression model remains the clear winner.
---
## Dataset
The full merged dataset (1-min OHLCV + 7-year on-chain features) is available on Hugging Face:
[jonyling/eth-usd-price-prediction](https://huggingface.co/datasets/jonyling/eth-usd-price-prediction)
---
## Disclaimer
This project is for **educational and research purposes only**. Past backtest performance does not guarantee future results. Nothing here constitutes financial advice.
Provider:
jonyling
Collected summary
Dataset introduction

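Construction
The dataset merges 1-minute ETH/USDT OHLCV data from Binance with seven years of Ethereum on-chain metrics from Dune Analytics, resampled to 1-hour bars, yielding roughly 52,000 hourly samples. The prediction target is the 4-hour forward return, and 84 derived features span price returns, candle structure, moving averages, volume, technical indicators, on-chain metrics, and cyclical time encodings, giving a multi-source, multi-scale structured dataset.
Characteristics
The dataset's most notable traits are its broad feature set and its strict time-series split. The 84 features include conventional price-volume indicators and technical-analysis signals, plus on-chain momentum features such as Ethereum active address counts, transaction counts, and gas used, complemented by lagged values and cyclical encodings to capture temporal dependence. The 80/20 chronological split is intended to prevent look-ahead leakage, and the holdout backtest reports a Sharpe ratio of 2.8 and a 74.2% win rate, which suggests predictive potential.
Usage
After installing dependencies such as lightgbm and polars, users can run inference with the pretrained model. Arrange new data containing the OHLCV and on-chain fields as an hourly CSV file and run the `predict.py` script to generate trading signals: 1 means long, -1 means short, and 0 means flat (no trade). The full preprocessing, feature engineering, and model training workflow is documented in the open-source Jupyter notebooks, so researchers can reproduce and extend it.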
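Background and challenges
Background overview
Ethereum is core infrastructure for decentralized finance (DeFi) and the smart-contract ecosystem, and price swings in its native asset ETH matter to market participants. The dataset was created recently by jonyling to predict 4-hour forward returns for ETH/USD with machine learning. It fuses 1-minute OHLCV data from Binance with 7 years of Ethereum on-chain metrics from Dune Analytics into a prediction framework of 84 features (price, volume, technical indicators, on-chain data, and cyclical time encodings). With a LightGBM regressor as the core model, the time-based holdout test reports a Sharpe ratio of 2.8 and a cumulative return of 39.7%, offering a complete benchmark from data fusion through model evaluation for cryptocurrency price prediction.
Current challenges
The core domain challenge is the very low signal-to-noise ratio and limited predictability of cryptocurrency markets. Specifically: 1) return distributions are nearly symmetric, so neural networks (such as CNN-LSTM and Transformer) trained on short-horizon, high-frequency data tend to predict a zero mean and produce no usable trading signals; 2) effective independent samples are scarce (about 13,000 4-hour bars), so overparameterized models fail, while tree models perform better here by finding sharp nonlinear thresholds. Challenges during construction include information loss between the 1-minute data and time-aggregated features, and the approximation error of microstructure features estimated from OHLCV (such as the Corwin-Schultz spread) relative to real order-book depth, which requires feature engineering to balance computational cost against signal fidelity.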
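Common scenarios
Typical use cases
In financial time-series forecasting and quantitative trading, the ETH-USD price prediction dataset offers researchers an experimental platform that combines high-frequency trading data with blockchain on-chain metrics. It is built on 1-minute Binance OHLCV data combined with seven years of Ethereum on-chain metrics such as transaction count, active address count, and gas used, with an 84-dimensional feature space covering price momentum, volume changes, technical indicators, and cyclical encodings. Typical use cases include building regression models for 4-hour forward returns and long/short signal strategies based on algorithms such as LightGBM and CNN-LSTM, providing a foundation for research on directional prediction in cryptocurrency markets.
Academic problems addressed
The dataset targets two persistent problems in cryptocurrency prediction: narrow information sources and overfitting. First, by combining on-chain fundamentals such as transaction activity and network state with conventional price, volume, and technical indicators, it goes beyond prediction from the price series alone and allows the incremental predictive value of on-chain behavior to be examined. Second, a strict time-series split, Monte Carlo robustness checks, and multi-metric evaluation (Sharpe ratio, maximum drawdown, and others) help guard against look-ahead leakage and false discoveries in backtests, giving a reproducible benchmark for assessing the robustness of time-series models in low signal-to-noise settings.
Derived and related work
Several related experiments have been built around this dataset. CNN-LSTM and Transformer encoder architectures were tried, along with a hybrid model that combines LSTM embedding features with LightGBM. None of these beat the LightGBM baseline, but they exposed the limits of deep learning on low-sample, low signal-to-noise financial data. The summary also states, without citing them, that multiple academic papers use this dataset as a baseline to compare feature engineering strategies (such as quantitative factor synthesis and wavelet decomposition) and model fusion methods, shifting cryptocurrency prediction from pure accuracy toward a balance of statistical significance, economic return, and computational efficiency.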
The content above was collected and summarized by 遇见数据集.



