A Dataset of Indonesian Tweets and Daily Market Data for Non-Blue-Chip Stocks on the Indonesia Stock Exchange

Mendeley Data, indexed 2026-05-21
Download link:
https://data.mendeley.com/datasets/rkpkkd2mhf
Description:
This dataset provides paired Indonesian-language social media text and daily market data for 15 non-blue-chip stocks listed on the Indonesia Stock Exchange (IDX). It is designed to support research on retail investor behaviour, social media sentiment, short-term stock price movements, and market euphoria in an emerging-market context. The dataset spans the period from 1 January 2022 to 31 December 2024 and consists of three Excel files.

1. Tweet Data (8,968 rows)
- Contains cleaned and anonymised Indonesian tweets collected via targeted keyword queries on X (formerly Twitter) using the Apify platform.
- Each row includes: stock ticker, full cleaned tweet text, posting date and time (created_at), and engagement metrics (favorite_count, retweet_count, reply_count, is_quote_status).
- The tweet data is provided in an Indonesian version (Tweet Dataset - ID.xlsx) and an English version (Tweet Dataset - EN.xlsx).
- This file can be used for sentiment analysis, for training Indonesian NLP models (e.g. IndoBERT), and for studying retail investor discussions on stocks.

2. Market Data (10,830 rows)
- Contains daily OHLCV (Open, High, Low, Close, Volume) records for the same 15 stocks, downloaded from Yahoo Finance using the yfinance Python library.
- Missing values due to weekends, public holidays, or temporary suspensions have been forward-filled to maintain a continuous time series.
- This file supports technical analysis and time-series forecasting, and can be merged with the tweet data for multimodal research.

3. Combined Multimodal Data (16,395 rows)
- Integrates the daily market variables with aggregated daily social media metrics using a left-join operation on "Ticker" and "Date".
- Includes the continuous OHLCV data, extracted technical features (14-day RSI, daily price change percentage, and daily volume change percentage), daily tweet counts, and average daily sentiment scores.
- The average daily sentiment scores are computed from the cleaned tweets using a pre-trained IndoBERT model and mapped to a continuous scale from -1.0 (very negative) to +1.0 (very positive).

The 15 stocks were selected based on high retail participation and speculative characteristics: AUTO, BRMS, BRPT, DSSA, FORU, IMAS, KARW, KONI, MLPT, PANI, PSAB, SGER, SRAJ, TOBA, and TPIA.

All tweet texts have been cleaned and anonymised: usernames, user mentions (@), external URLs, hashtags, and personal identifiers have been removed to protect user privacy.

This resource can be used for:
• Training and evaluating Indonesian NLP models for sentiment analysis (e.g., using pre-trained IndoBERT on the cleaned texts)
• Developing short-term stock price forecasting systems
• Studying the relationship between social media activity volume and market volatility
• Multimodal deep learning experiments that combine text embeddings with time-series market features
• Investigating retail-driven euphoria and speculative behaviour in non-blue-chip stocks
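
The forward-fill, feature extraction, and left-join steps described for the market and combined files can be sketched roughly as follows with pandas. This is an illustrative sketch only, not the authors' actual pipeline. Only "Ticker", "Date", and created_at are named in the description; the column names Close, Volume, and sentiment, the function names, the calendar-day reindexing, the simple-moving-average form of RSI, and filling missing tweet counts with zero are all assumptions made for the example.

```python
import pandas as pd


def forward_fill_daily(market: pd.DataFrame) -> pd.DataFrame:
    """Reindex each ticker to consecutive calendar days and forward-fill gaps."""
    frames = []
    for ticker, grp in market.groupby("Ticker"):
        grp = grp.drop(columns="Ticker").set_index("Date").sort_index()
        calendar = pd.date_range(grp.index.min(), grp.index.max(), freq="D")
        grp = grp.reindex(calendar).ffill().rename_axis("Date").reset_index()
        grp["Ticker"] = ticker
        frames.append(grp)
    return pd.concat(frames, ignore_index=True)


def _rsi(close: pd.Series, window: int) -> pd.Series:
    """RSI from simple moving averages of gains and losses (assumed variant)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gain / loss)


def add_features(market: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Add RSI plus daily price and volume change percentages per ticker."""
    out = market.sort_values(["Ticker", "Date"]).copy()
    grouped = out.groupby("Ticker", sort=False)
    prev_close = grouped["Close"].shift(1)
    prev_volume = grouped["Volume"].shift(1)
    rsi = grouped["Close"].transform(lambda s: _rsi(s, window))
    out["price_change_pct"] = (out["Close"] / prev_close - 1) * 100
    out["volume_change_pct"] = (out["Volume"] / prev_volume - 1) * 100
    out[f"rsi_{window}"] = rsi
    return out


def aggregate_tweets(tweets: pd.DataFrame) -> pd.DataFrame:
    """Collapse tweets to one row per ticker and day."""
    # "sentiment" is a hypothetical per-tweet score in [-1.0, +1.0].
    daily = tweets.assign(Date=pd.to_datetime(tweets["created_at"]).dt.normalize())
    return daily.groupby(["Ticker", "Date"], as_index=False).agg(
        tweet_count=("sentiment", "size"),
        avg_sentiment=("sentiment", "mean"),
    )


def combine(market: pd.DataFrame, tweets: pd.DataFrame) -> pd.DataFrame:
    """Left-join daily tweet metrics onto market rows by Ticker and Date."""
    merged = market.merge(aggregate_tweets(tweets), on=["Ticker", "Date"], how="left")
    # Assumption: days without tweets get a count of 0; avg_sentiment stays NaN.
    merged["tweet_count"] = merged["tweet_count"].fillna(0).astype(int)
    return merged
```

Because the join is a left join from the market side, every market row is kept and days without tweets simply carry no sentiment value. The column names should be checked against the actual Excel files before reusing this sketch.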
Created:
2026-05-21