A Dataset of Indonesian Tweets and Daily Market Data for Non-Blue-Chip Stocks on the Indonesia Stock Exchange
Source: Mendeley Data (indexed 2026-05-21)
Download link:
https://data.mendeley.com/datasets/rkpkkd2mhf
Resource description:
This dataset provides paired Indonesian-language social media text and daily market data for 15 non-blue-chip stocks listed on the Indonesia Stock Exchange (IDX). It is designed to support research on retail investor behaviour, social media sentiment, short-term stock price movements, and market euphoria in an emerging-market context.
The dataset spans the period from 1 January 2022 to 31 December 2024 and consists of three Excel files.
1. Tweet Data (8,968 rows)
- Contains cleaned and anonymised Indonesian tweets collected via targeted keyword queries on X (formerly Twitter) using the Apify platform.
- Each row includes: stock ticker, full cleaned tweet text, posting date and time (created_at), and engagement metrics (favorite_count, retweet_count, reply_count, is_quote_status).
- The tweet data is available in Indonesian (Tweet Dataset - ID.xlsx) and in English translation (Tweet Dataset - EN.xlsx).
- This file can be used for sentiment analysis, training Indonesian NLP models (e.g., IndoBERT), and studying retail investor discussions of stocks.
2. Market Data (10,830 rows)
- Contains daily OHLCV (Open, High, Low, Close, Volume) records for the same 15 stocks, downloaded from Yahoo Finance using the yfinance Python library.
- Missing values due to weekends, public holidays, or temporary suspensions have been forward-filled to maintain a continuous time series.
- This file supports technical analysis and time-series forecasting, and can be merged with the tweet data for multimodal research.
3. Combined Multimodal Data (16,395 rows)
- Integrates the daily market variables with aggregated daily social media metrics using a left-join operation by "Ticker" and "Date".
- Includes the continuous OHLCV data, extracted technical features (14-day RSI, daily price change percentage, and daily volume change percentage), daily tweet counts, and average daily sentiment scores.
- The average daily sentiment scores are computed from the cleaned tweets using a pre-trained IndoBERT model, mapped to a continuous scale from -1.0 (very negative) to +1.0 (very positive).
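The forward-fill, feature extraction, and left-join steps described above can be sketched in pandas as follows. This is a minimal illustration, not the authors' pipeline. The column names `Open`, `High`, `Low`, `Close`, `Volume`, `sentiment`, and the output names `Price_Change_Pct`, `Volume_Change_Pct`, `RSI`, `Tweet_Count`, and `Avg_Sentiment` are assumptions. So are the simple-moving-average form of RSI and the choice to fill tweet counts with 0 on days without tweets. Only "Ticker", "Date", and `created_at` are named in the description.

```python
import pandas as pd


def add_market_features(market: pd.DataFrame, rsi_window: int = 14) -> pd.DataFrame:
    """Forward-fill OHLCV gaps per ticker, then add percentage changes and RSI."""
    out = market.sort_values(["Ticker", "Date"]).reset_index(drop=True)
    cols = ["Open", "High", "Low", "Close", "Volume"]  # assumed column names
    out[cols] = out.groupby("Ticker")[cols].ffill()

    # Daily percentage change relative to the previous row of the same ticker.
    prev_close = out.groupby("Ticker")["Close"].shift(1)
    prev_volume = out.groupby("Ticker")["Volume"].shift(1)
    out["Price_Change_Pct"] = (out["Close"] / prev_close - 1) * 100
    out["Volume_Change_Pct"] = (out["Volume"] / prev_volume - 1) * 100

    # RSI from simple rolling means of gains and losses (an assumed variant;
    # Wilder's smoothed RSI would give slightly different values).
    delta = out.groupby("Ticker")["Close"].diff()
    gain = delta.clip(lower=0)
    loss = (-delta).clip(lower=0)
    avg_gain = gain.groupby(out["Ticker"]).transform(lambda s: s.rolling(rsi_window).mean())
    avg_loss = loss.groupby(out["Ticker"]).transform(lambda s: s.rolling(rsi_window).mean())
    out["RSI"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return out


def combine(market_feat: pd.DataFrame, tweets: pd.DataFrame) -> pd.DataFrame:
    """Left-join daily tweet aggregates onto market rows by Ticker and Date."""
    daily = (
        tweets.assign(Date=tweets["created_at"].dt.normalize())
        .groupby(["Ticker", "Date"], as_index=False)
        .agg(Tweet_Count=("sentiment", "count"), Avg_Sentiment=("sentiment", "mean"))
    )
    merged = market_feat.merge(daily, on=["Ticker", "Date"], how="left")
    # Assumption: days without tweets get a count of 0; sentiment stays missing.
    merged["Tweet_Count"] = merged["Tweet_Count"].fillna(0).astype(int)
    return merged
```

The sketch assumes that "Date" and `created_at` are timezone-naive datetime columns and that each tweet row already carries a per-tweet score in the hypothetical `sentiment` column. Because the join is a left join, every market row is retained whether or not tweets exist for that day.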
The 15 stocks were selected based on high retail participation and speculative characteristics: AUTO, BRMS, BRPT, DSSA, FORU, IMAS, KARW, KONI, MLPT, PANI, PSAB, SGER, SRAJ, TOBA, and TPIA.
All tweet texts have been cleaned and anonymised: usernames, user mentions (@), external URLs, hashtags, and personal identifiers have been removed to ensure user privacy.
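As an illustration of this kind of cleaning, the sketch below strips URLs, user mentions, and hashtags with regular expressions. The patterns and the function name `clean_tweet` are assumptions for demonstration; the description does not specify the rules that were actually applied. Removing other personal identifiers would need additional steps not shown here.

```python
import re

# Assumed patterns; the dataset's actual cleaning rules are not published here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions, and #hashtags, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


example = clean_tweet("@budi $BRMS naik terus! https://t.co/abc #saham")
```

URLs are removed first so that a `#` or `@` inside a link is not treated as a hashtag or a mention.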
This resource can be used for:
• Training and evaluating Indonesian NLP models for sentiment analysis (e.g., using pre-trained IndoBERT on the cleaned texts)
• Developing short-term stock price forecasting systems
• Studying the relationship between social media activity volume and market volatility
• Multimodal deep learning experiments that combine text embeddings with time-series market features
• Investigating retail-driven euphoria and speculative behaviour in non-blue-chip stocks
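For the sentiment-related uses, it helps to see how class probabilities from a classifier might be mapped to the continuous -1.0 to +1.0 scale used in the combined file. The description does not state the mapping that was used. The sketch below shows one common choice, assuming a three-class output ordered (negative, neutral, positive) and taking the score as P(positive) minus P(negative). The function name `sentiment_score` and the class order are hypothetical.

```python
import numpy as np


def sentiment_score(probs) -> np.ndarray:
    """Map (negative, neutral, positive) probabilities to a score in [-1, 1].

    Assumed mapping: score = P(positive) - P(negative). A fully negative
    prediction gives -1.0, a fully positive one gives +1.0, and a neutral
    or evenly split prediction gives 0.0.
    """
    probs = np.asarray(probs, dtype=float)
    return probs[..., 2] - probs[..., 0]


# Per-tweet scores for three hypothetical tweets, then their daily average.
scores = sentiment_score([[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.0, 1.0, 0.0]])
daily_average = float(scores.mean())
```

Averaging the per-tweet scores for each ticker and day then yields a daily value on the same -1.0 to +1.0 scale.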
Created:
2026-05-21



