
Blaze2oi/stock-dataset

Hugging Face · Updated 2026-04-13 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Blaze2oi/stock-dataset
Resource description:
---
dataset_info:
  features:
  - name: Date
    dtype: string
  - name: Open
    dtype: float64
  - name: High
    dtype: float64
  - name: Low
    dtype: float64
  - name: Close
    dtype: float64
  - name: Volume
    dtype: int64
  - name: Dividends
    dtype: float64
  - name: Stock Splits
    dtype: float64
  - name: Ticker
    dtype: string
  - name: SMA_5
    dtype: float64
  - name: SMA_10
    dtype: float64
  - name: SMA_20
    dtype: float64
  - name: SMA_50
    dtype: float64
  - name: EMA_12
    dtype: float64
  - name: EMA_26
    dtype: float64
  - name: MACD
    dtype: float64
  - name: MACD_Signal
    dtype: float64
  - name: MACD_Histogram
    dtype: float64
  - name: RSI
    dtype: float64
  - name: BB_Middle
    dtype: float64
  - name: BB_Upper
    dtype: float64
  - name: BB_Lower
    dtype: float64
  - name: BB_Width
    dtype: float64
  - name: BB_Position
    dtype: float64
  - name: Volatility
    dtype: float64
  - name: Price_Change
    dtype: float64
  - name: Price_Change_5d
    dtype: float64
  - name: High_Low_Ratio
    dtype: float64
  - name: Open_Close_Ratio
    dtype: float64
  - name: Volume_SMA
    dtype: float64
  - name: Volume_Ratio
    dtype: float64
  - name: Close_lag_1
    dtype: float64
  - name: Close_lag_2
    dtype: float64
  - name: Close_lag_3
    dtype: float64
  - name: Close_lag_5
    dtype: float64
  - name: Close_lag_10
    dtype: float64
  - name: Volume_lag_1
    dtype: float64
  - name: Volume_lag_2
    dtype: float64
  - name: Volume_lag_3
    dtype: float64
  - name: Volume_lag_5
    dtype: float64
  - name: Volume_lag_10
    dtype: float64
  - name: Price_Change_lag_1
    dtype: float64
  - name: Price_Change_lag_2
    dtype: float64
  - name: Price_Change_lag_3
    dtype: float64
  - name: Price_Change_lag_5
    dtype: float64
  - name: Price_Change_lag_10
    dtype: float64
  - name: RSI_lag_1
    dtype: float64
  - name: RSI_lag_2
    dtype: float64
  - name: RSI_lag_3
    dtype: float64
  - name: RSI_lag_5
    dtype: float64
  - name: RSI_lag_10
    dtype: float64
  - name: MACD_lag_1
    dtype: float64
  - name: MACD_lag_2
    dtype: float64
  - name: MACD_lag_3
    dtype: float64
  - name: MACD_lag_5
    dtype: float64
  - name: MACD_lag_10
    dtype: float64
  - name: Volatility_lag_1
    dtype: float64
  - name: Volatility_lag_2
    dtype: float64
  - name: Volatility_lag_3
    dtype: float64
  - name: Volatility_lag_5
    dtype: float64
  - name: Volatility_lag_10
    dtype: float64
  - name: Future_Return_1d
    dtype: float64
  - name: Future_Up_1d
    dtype: int64
  - name: Future_Category_1d
    dtype: float64
  - name: Future_Return_5d
    dtype: float64
  - name: Future_Up_5d
    dtype: int64
  - name: Future_Category_5d
    dtype: float64
  - name: Future_Return_10d
    dtype: float64
  - name: Future_Up_10d
    dtype: int64
  - name: Future_Category_10d
    dtype: float64
  - name: Future_Return_20d
    dtype: float64
  - name: Future_Up_20d
    dtype: int64
  - name: Future_Category_20d
    dtype: float64
  splits:
  - name: train
    num_bytes: 374644429
    num_examples: 620095
  download_size: 335534650
  dataset_size: 374644429
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- time-series-forecasting
- reinforcement-learning
- tabular-regression
language:
- en
tags:
- finance
- time-series
- stocks
- technical-analysis
- yahoo-finance
- reinforcement-learning
pretty_name: S&P 500 Comprehensive Stock Market Dataset
size_categories:
- 100K<n<1M
---

# 📈 S&P 500 Comprehensive Stock Market Dataset

<div align="center">

![Dataset](https://img.shields.io/badge/Dataset-S%26P%20500-blue)
![Records](https://img.shields.io/badge/Records-620K+-green)
![Features](https://img.shields.io/badge/Features-73-orange)
![License](https://img.shields.io/badge/License-MIT-yellow)
![Time Period](https://img.shields.io/badge/Time%20Period-5%20Years-purple)

</div>

## 🎯 Dataset Overview

This dataset contains **620,095 daily observations** of S&P 500 companies with **73 engineered features** spanning the last 5 years. It is designed for time series forecasting, stock price prediction, and related financial modeling tasks.

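The schema above stores `Date` as a string, so time-based operations generally need an explicit parse and a per-ticker sort first. The following is a minimal sketch on a hypothetical toy frame, assuming the date strings are in a format pandas can parse (the exact format is not stated here); `prepare_dates` is an illustrative helper, not part of the dataset.

```python
import pandas as pd


def prepare_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Parse the string Date column and sort rows by ticker, then date."""
    out = df.copy()
    # utc=True normalizes any timezone offsets to one tz-aware dtype
    # (assumption: the strings are parseable by pandas).
    out["Date"] = pd.to_datetime(out["Date"], utc=True)
    return out.sort_values(["Ticker", "Date"]).reset_index(drop=True)


# Hypothetical rows that mimic the Date, Ticker, and Close columns
toy = pd.DataFrame(
    {
        "Date": ["2024-01-03", "2024-01-02", "2024-01-02", "2024-01-03"],
        "Ticker": ["MSFT", "MSFT", "AAPL", "AAPL"],
        "Close": [2.0, 1.0, 10.0, 11.0],
    }
)
prepared = prepare_dates(toy)
print(prepared)
```

Sorting by ticker before date keeps each company's rows contiguous, which is the layout that per-ticker rolling or lag calculations usually expect.
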
### 📊 Key Statistics

| Metric | Value |
|--------|-------|
| **Total Records** | 620,095 daily observations |
| **Features** | 73 features |
| **Time Period** | Last 5 years |
| **Companies** | S&P 500 constituents |
| **Data Source** | Yahoo Finance API |
| **Update Frequency** | Daily market data |

## 🚀 Quick Start

```python
from datasets import load_dataset

# Load the dataset (repository ID as given in the README;
# this listing is hosted under Blaze2oi/stock-dataset)
dataset = load_dataset("Adilbai/stock-dataset")
df = dataset["train"].to_pandas()

# Basic info
print(f"Dataset shape: {df.shape}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"Unique tickers: {df['Ticker'].nunique()}")
```

## 🔧 Feature Categories

### 📈 Basic Market Data (9 features)

- **Date**: Trading date (stored as a string)
- **OHLC Data**: Open, High, Low, Close prices
- **Volume**: Number of shares traded
- **Corporate Actions**: Dividends, Stock Splits
- **Ticker**: Stock symbol identifier

### 📊 Technical Analysis Indicators (16 features)

#### Moving Averages

- `SMA_5`, `SMA_10`, `SMA_20`, `SMA_50`: Simple Moving Averages
- `EMA_12`, `EMA_26`: Exponential Moving Averages

#### Momentum Indicators

- `MACD`, `MACD_Signal`, `MACD_Histogram`: MACD components
- `RSI`: Relative Strength Index (14-period)

#### Volatility Indicators

- `BB_Middle`, `BB_Upper`, `BB_Lower`: Bollinger Bands
- `BB_Width`, `BB_Position`: Bollinger Bands metrics
- `Volatility`: Historical volatility measure

### ⚙️ Engineered Features (6 features)

- `Price_Change`: Daily price change
- `Price_Change_5d`: 5-day price change
- `High_Low_Ratio`: High to low price ratio
- `Open_Close_Ratio`: Open to close price ratio
- `Volume_SMA`: Volume moving average
- `Volume_Ratio`: Volume to average ratio

### ⏳ Lagged Features (30 features)

Historical context at lags of 1, 2, 3, 5, and 10 periods (N below) for six series:

- **Price Lags**: `Close_lag_N`
- **Volume Lags**: `Volume_lag_N`
- **Price Change Lags**: `Price_Change_lag_N`
- **RSI Lags**: `RSI_lag_N`
- **MACD Lags**: `MACD_lag_N`
- **Volatility Lags**: `Volatility_lag_N`

### 🎯 Target Variables (12 features)

| Time Horizon | Return | Direction | Category |
|--------------|--------|-----------|----------|
| **1-Day** | `Future_Return_1d` | `Future_Up_1d` | `Future_Category_1d` |
| **5-Day** | `Future_Return_5d` | `Future_Up_5d` | `Future_Category_5d` |
| **10-Day** | `Future_Return_10d` | `Future_Up_10d` | `Future_Category_10d` |
| **20-Day** | `Future_Return_20d` | `Future_Up_20d` | `Future_Category_20d` |

## 🎯 Use Cases

### 🔮 Primary Applications

- **Stock Price Prediction**: Forecast future prices using technical indicators
- **Direction Classification**: Predict price movement direction
- **Risk Assessment**: Analyze volatility and market risk patterns
- **Trading Strategy Development**: Backtest algorithmic strategies
- **Financial Research**: Academic computational finance research

### 🤖 Machine Learning Tasks

- **Regression**: Predict continuous returns (`Future_Return_*`)
- **Binary Classification**: Predict direction (`Future_Up_*`)
- **Multi-class Classification**: Predict movements (`Future_Category_*`)
- **Time Series Forecasting**: Leverage lagged features
- **Anomaly Detection**: Identify unusual market patterns

## 📝 Example Usage

### Data Exploration

```python
# View dataset structure
print(f"Dataset shape: {df.shape}")
print(f"Features: {df.columns.tolist()}")

# Target distribution
print(df['Future_Up_1d'].value_counts())
print(df['Future_Category_1d'].value_counts())
```

### Feature Selection

```python
# Technical indicators
technical_features = [
    'SMA_5', 'SMA_10', 'RSI', 'MACD',
    'BB_Position', 'Volatility'
]

# Lagged features
lag_features = [col for col in df.columns if 'lag' in col]

# All targets
targets = [col for col in df.columns if 'Future_' in col]
```

### Model Training Example

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

# Prepare features and target
features = technical_features + lag_features
target = 'Future_Return_1d'

# TimeSeriesSplit splits by row position, so put rows in date order first
# (string sorting is chronological only for ISO-formatted dates)
data = df.sort_values(['Date', 'Ticker']).reset_index(drop=True)

# Forward-fill within each ticker; fillna(method='ffill') is deprecated in pandas 2.x
X = data.groupby('Ticker')[features].ffill()
y = data[target]

# Time series split
tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestRegressor(n_estimators=100)

# Train on each chronological fold and predict on the held-out block
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
```

## ⚠️ Important Considerations

### 🔴 Data Limitations

- **Survivorship Bias**: Only current S&P 500 constituents included
- **Market Hours**: Regular trading session data only
- **Corporate Actions**: Historical adjustments may affect patterns

### ⚡ Usage Guidelines

- **Temporal Order**: Maintain chronological order in train/test splits
- **Look-ahead Bias**: Avoid using future information in features
- **Market Regimes**: Performance may vary across market conditions
- **Feature Correlation**: Technical indicators share underlying price data

## 📊 Data Quality

### ✅ Quality Assurance

- **Industry-standard** technical indicator calculations
- **Comprehensive** historical context with multiple time horizons
- **Robust** data validation pipelines
- **Proper handling** of corporate actions and market holidays

### 🔧 Data Processing

- **Forward-fill** methodology for missing data
- **Vectorized operations** for consistency
- **No look-ahead bias** in feature construction
- **Dividend and split** adjustments included

## 📖 Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{adilbai_sp500_dataset,
  title={S\&P 500 Comprehensive Stock Market Dataset},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/Adilbai/stock-dataset}
}
```

## 📄 License

This dataset is released under the **MIT License**.
While the dataset compilation and feature engineering are provided under MIT license, users should be aware of Yahoo Finance's terms of service for the underlying data.

## ⚠️ Disclaimer

> **Important**: This dataset is provided for educational and research purposes only. It should not be used as the sole basis for investment decisions. Past performance does not guarantee future results. Users should conduct their own research and consider consulting with financial advisors before making investment decisions.

---

<div align="center">

**Built with ❤️ for the financial ML community**

[🤗 Hugging Face](https://huggingface.co/datasets/Adilbai/stock-dataset) • [📊 Dataset](https://huggingface.co/datasets/Adilbai/stock-dataset) • [🐛 Issues](https://huggingface.co/datasets/Adilbai/stock-dataset/discussions)

</div>
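
The "Temporal Order" guideline is easy to violate with multi-ticker data, because splitting by row position can place the same trading day in both the training and test sets. One possible approach is to split on a cutoff date instead. The sketch below is illustrative only: `split_by_date` and the toy frame are hypothetical, and it assumes ISO-formatted date strings so that string order matches chronological order.

```python
import pandas as pd


def split_by_date(df, test_fraction=0.2, date_col="Date"):
    """Split a multi-ticker frame so every test date is later than every train date."""
    dates = sorted(df[date_col].unique())
    # Reserve at least one trading day for the test set
    n_test = max(1, int(round(len(dates) * test_fraction)))
    cutoff = dates[-n_test]  # first date that belongs to the test set
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test


# Hypothetical frame: five trading days for two tickers
days = ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
toy = pd.DataFrame(
    {
        "Date": days * 2,
        "Ticker": ["AAPL"] * 5 + ["MSFT"] * 5,
        "Future_Return_1d": [0.01] * 10,
    }
)
train, test = split_by_date(toy, test_fraction=0.2)
print(len(train), len(test))
```

For longer horizons such as `Future_Return_20d`, labels near the cutoff look into the test period, so leaving a gap of at least the horizon length between the two sets may be worth considering.
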
Provider:
Blaze2oi