xxparthparekhxx/indian-stock-market-minute-data

Name: xxparthparekhxx/indian-stock-market-minute-data
Creator: xxparthparekhxx
Published: 2026-01-25 15:44:42
License: MIT

Source: Hugging Face (updated 2026-01-25, indexed 2026-03-29)

Download link:

https://hf-mirror.com/datasets/xxparthparekhxx/indian-stock-market-minute-data

Description:

---
license: mit
task_categories:
- time-series-forecasting
- tabular-regression
tags:
- finance
- nse
- india
- stock-market
- quantitative-finance
- upstox
pretty_name: Indian Stock Market Minute & Daily Data
size_categories:
- 10B<n<100B
configs:
- config_name: default
  data_files:
  - split: minute
    path: minute/*.parquet
  - split: day
    path: day/*.parquet
---

# 🇮🇳 Indian Stock Market Data: Minute & Daily (2000 - 2026)

## 📌 Overview

This is a financial dataset containing the price history of **2,500+ NSE Stocks and Indices**.

The dataset has been **sharded and optimized** for high-speed training. Instead of thousands of tiny files, it is grouped into large ~1.5GB Parquet shards, which suits streaming with the Hugging Face `datasets` library.

## 📊 Dataset Stats

- **Total Rows:** ~715 Million
- **Size:** ~10.5 GB (Compressed Snappy Parquet) / ~125 GB (Uncompressed)
- **Coverage:** 99.4% of active/suspended NSE Equities & Indices
- **Granularity:**
  - **Minute:** 1-minute intraday candles (2022-2026)
  - **Day:** Daily candles (2000-2026)
- **Schema:** `symbol`, `timestamp` (UTC), `open`, `high`, `low`, `close`, `volume`, `oi`

## 📂 Directory Structure

The data is partitioned by frequency to allow for efficient loading.

```text
/minute/
    train-00000.parquet  (Stocks A-C)
    train-00001.parquet  (Stocks C-H)
    ...
/day/
    train-00000.parquet  (All Daily Data)
```

> **Note:** The files are sorted by `symbol`, then by `timestamp`. This means the rows for a specific stock (e.g., `RELIANCE`) are stored contiguously, which improves compression and read speed.

## 💻 Usage (Python)

### 🚀 Option 1: Using Hugging Face Datasets (Recommended)

This method automatically handles downloading, caching, and iterating over the shards.

```python
from datasets import load_dataset

# 1. Load ALL minute-level data (downloads and caches ~10.5 GB of shards)
# Use split="minute" to get the high-res intraday data
ds_minute = load_dataset("xxparthparekhxx/indian-stock-market-minute-data", split="minute")

# 2. Filter for a specific stock
# (The library scans the cached Arrow table row by row)
reliance = ds_minute.filter(lambda x: x['symbol'] == 'RELIANCE')
print(reliance[0])
```

### ⚡ Option 2: Streaming (No Download)

If you don't want to download the full 10.5 GB to disk, you can stream it on the fly.

```python
from datasets import load_dataset

dataset = load_dataset(
    "xxparthparekhxx/indian-stock-market-minute-data",
    split="minute",
    streaming=True
)

# Iterate through the dataset without downloading everything.
# Since data is sorted by symbol, all rows for a stock arrive sequentially.
for row in dataset:
    if row['symbol'] == 'TATASTEEL':
        print(row)
        # Stop after finding the first matching row to prove it works
        break
```

### 📉 Option 3: Load Daily Data Only

If you only need daily timeframe data (2000-2026), you can load just the daily split (~100MB).

```python
from datasets import load_dataset

ds_day = load_dataset("xxparthparekhxx/indian-stock-market-minute-data", split="day")
print(ds_day[0])
```

### 🐼 Option 4: Using Pandas

You can read individual shards directly if you prefer manual control.

```python
import pandas as pd

# Load the first shard of minute data
# (symbols from the start of the alphabet, roughly A-C)
df = pd.read_parquet("hf://datasets/xxparthparekhxx/indian-stock-market-minute-data/minute/train-00000.parquet")
print(df.head())
```

## 📝 Schema & Data Types

| Column | Type | Description |
|---|---|---|
| `symbol` | String | NSE Trading Symbol (e.g., `RELIANCE`, `NIFTY_50`) |
| `timestamp` | Datetime (ns) | **UTC Timezone**. (Add +5:30 for IST) |
| `open` | Float32 | Opening Price |
| `high` | Float32 | High Price |
| `low` | Float32 | Low Price |
| `close` | Float32 | Closing Price |
| `volume` | Int64 | Volume Traded |
| `oi` | Int64 | Open Interest (0 if not applicable) |

## ⚠️ Disclaimer

This dataset is intended for **research, educational, and backtesting purposes only**.

- It is not a live feed.
- Do not use this as the primary basis for live financial trading.
- The authors are not responsible for any financial losses incurred from using this data.

## 📄 License

This dataset is released under the **MIT License**.
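The schema notes that `timestamp` is stored in UTC and that IST is UTC+5:30. A minimal sketch of that conversion with pandas follows. It uses a few synthetic rows shaped like the schema (the prices are made up), and the helper name `to_ist` and the column `timestamp_ist` are illustrative, not part of the dataset.

```python
import pandas as pd

# Synthetic rows shaped like the dataset schema (values are made up).
df = pd.DataFrame({
    "symbol": ["RELIANCE", "RELIANCE"],
    "timestamp": pd.to_datetime(["2024-01-01 03:45:00", "2024-01-01 03:46:00"]),
    "close": [2590.0, 2591.5],
})

def to_ist(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an extra `timestamp_ist` column in Asia/Kolkata time."""
    out = frame.copy()
    ts = out["timestamp"]
    # Assumption: timezone-naive timestamps are UTC, as the schema states.
    if ts.dt.tz is None:
        ts = ts.dt.tz_localize("UTC")
    out["timestamp_ist"] = ts.dt.tz_convert("Asia/Kolkata")
    return out

df_ist = to_ist(df)
print(df_ist[["timestamp", "timestamp_ist"]])
```

Using a named timezone rather than adding a fixed `Timedelta` keeps the result timezone-aware, so later comparisons against session times such as the 09:15 IST open stay unambiguous.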

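Because the minute split holds 1-minute candles with `open`, `high`, `low`, `close`, and `volume` columns, coarser bars can be derived from it. The sketch below aggregates synthetic 1-minute rows into 5-minute bars with pandas; the helper name `resample_ohlcv` and the sample values are assumptions for illustration, and the `oi` column is left out for brevity.

```python
import pandas as pd

# Ten synthetic 1-minute candles for one symbol (values are made up).
minute = pd.DataFrame({
    "symbol": ["RELIANCE"] * 10,
    "timestamp": pd.date_range("2024-01-01 03:45", periods=10, freq="min"),
    "open": [100.0 + i for i in range(10)],
    "high": [101.0 + i for i in range(10)],
    "low": [99.0 + i for i in range(10)],
    "close": [100.5 + i for i in range(10)],
    "volume": [10] * 10,
})

def resample_ohlcv(frame: pd.DataFrame, rule: str = "5min") -> pd.DataFrame:
    """Aggregate 1-minute candles into coarser bars, separately per symbol."""
    agg = {
        "open": "first",   # first trade price in the window
        "high": "max",     # highest high in the window
        "low": "min",      # lowest low in the window
        "close": "last",   # last trade price in the window
        "volume": "sum",   # total volume in the window
    }
    return (
        frame.set_index("timestamp")
        .groupby("symbol")
        .resample(rule)
        .agg(agg)
        .dropna(subset=["open"])  # drop empty windows (e.g., overnight gaps)
        .reset_index()
    )

bars = resample_ohlcv(minute)
print(bars)
```

Grouping by `symbol` before resampling keeps candles from different stocks from being merged into one bar when a frame holds more than one symbol.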