Dataset and News Sentiments for NSEI Stock Market Prediction
NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://figshare.com/articles/dataset/Dataset_and_News_Sentiments_for_NSEI_Stock_Market_Prediction/30150130
Description:
This dataset contains the financial time-series and corresponding news sentiment data used to conduct the research in the paper, "Stock Price Prediction with Limited Historical Data: A Model-Driven Exploration". The study's primary objective was to evaluate and compare the performance of four distinct forecasting models—ARIMA, Random Forest (RF), LSTM, and a hybrid LSTM + FinBERT—under the critical constraint of data scarcity.
Data Sources
Primary Stock Data: The dataset features historical daily price data for the NSEI (NIFTY 50) index, obtained from Yahoo Finance. This includes 'Open', 'High', 'Low', 'Close', and 'Volume' for each trading day. The 'Close' price is the target variable for prediction.
News Sentiment Data: Textual data from a local news.csv file, containing dated financial news headlines, was used to generate sentiment scores.
Dataset Composition and Feature Engineering
To create a rich feature set for the models, the raw data was significantly augmented:
Engineered Features: The dataset includes technical indicators (RSI, MACD, Bollinger Bands), lagged price and return features, and rolling window statistics (e.g., moving averages).
Sentiment Scores: The FinBERT model was used to process news headlines and generate a daily sentiment score, which was incorporated as an additional feature.
Final Structure: The final dataset is intentionally small to simulate real-world data constraints, consisting of 60 rows and 29 columns.
Experimental Setup and Usage
A chronological 80%-20% split was applied to the data for model training and evaluation.
Training/Validation Set: The first 48 rows.
Test Set: The final 12 rows, held out for performance evaluation.
The data was formatted specifically for each model:
ARIMA: Utilized only the univariate 'Close' price time series from the training data.
Random Forest: Employed a tabular format where all 28 engineered features (including indicators, lags, and sentiment) were used to predict the 'Close' price.
LSTM & Hybrid LSTM+FinBERT: The data was transformed into 3D sequences with a lookback window of 16 days to predict the subsequent day's price. The hybrid model specifically included the sentiment score as an input feature within these sequences. For these neural network models, all numerical features were scaled.
This dataset was instrumental in demonstrating that for financial forecasting tasks with limited data, the ensemble-based Random Forest model is a more effective and reliable choice due to its robustness against overfitting.
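The chronological split and the 16-day lookback described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data here is synthetic, the helper make_windows is hypothetical, min-max scaling is an assumption (the description only says features were "scaled"), and the windows use only the 28 feature columns. It also assumes that test windows borrow the last 16 training rows as context, since a 12-row test set alone is shorter than the lookback; the description does not say how this was handled.

```python
import numpy as np

N_ROWS, N_FEATURES, LOOKBACK = 60, 28, 16   # sizes stated in the description
TRAIN_ROWS = int(N_ROWS * 0.8)              # 48 rows; the last 12 form the test set

# Synthetic stand-in for the real table (the actual file layout is not assumed).
rng = np.random.default_rng(0)
features = rng.normal(size=(N_ROWS, N_FEATURES))
close = 100 + np.cumsum(rng.normal(size=N_ROWS))  # fake 'Close' series

# Min-max scaling fitted on the training rows only, to avoid test-set leakage.
# ASSUMPTION: the scaler type is not given in the description.
lo = features[:TRAIN_ROWS].min(axis=0)
hi = features[:TRAIN_ROWS].max(axis=0)
scaled = (features - lo) / np.where(hi > lo, hi - lo, 1.0)

def make_windows(x, y, first_target, last_target, lookback=LOOKBACK):
    """Return (samples, lookback, features) windows for targets in [first_target, last_target).

    Each window holds the `lookback` rows immediately before its target row,
    so a target at row t is predicted from rows t-lookback .. t-1.
    """
    idx = range(max(first_target, lookback), last_target)
    xs = np.stack([x[t - lookback:t] for t in idx])
    ys = np.array([y[t] for t in idx])
    return xs, ys

# Training targets are rows 16..47; test targets are rows 48..59.
# ASSUMPTION: test windows reach back into the training rows for context.
X_train, y_train = make_windows(scaled, close, 0, TRAIN_ROWS)
X_test, y_test = make_windows(scaled, close, TRAIN_ROWS, N_ROWS)

print(X_train.shape, X_test.shape)
```

Under these assumptions the script prints (32, 16, 28) (12, 16, 28): only 32 training sequences remain once the first 16 rows are consumed as context, which makes concrete how little data the sequence models see compared with the 48 tabular rows available to Random Forest.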
Created: 2025-09-17



