
Social Media Pulse Dataset: POSTS

Source: Snowflake. Updated: 2023-10-13. Indexed: 2024-05-01.
Download link:
https://app.snowflake.com/marketplace/listing/GZ2FTZ4QN9K
Description:
ZENPULSAR’s data-centric AI platform “PUMP” monitors multiple social media networks in real time to track and analyse activity related to financial and crypto assets. It detects emerging viral narratives that are likely to form trends and impact financial assets. PUMP clears out the noise of social media with unmatched speed and accuracy. It identifies viral narratives related to the assets you track: early signals that can be spotted and acted on before the crowd. ZENPULSAR’s technology is also used by a variety of clients to manage critical events such as product launches, policy platform developments, reputation crises, and disinformation campaigns.

We provide time series social media data relevant to selected assets. The data is extracted from Twitter, Reddit, Seeking Alpha and Telegram.

The data provided can be split into 4 categories:

1. Data describing sentiment of social media posts
1a. Number of social media posts with bullish/bearish sentiment towards a target asset per period
1b. Number of upvotes/downvotes, likes, replies, comments and cross-posts of the posts with bullish/bearish sentiment towards a target asset per period

2. Data describing activity of social media accounts
2a. Number of social media posts per period

3. Data describing engagement of social media accounts
3a. Number of likes and upvotes/downvotes per period
3b. Number of replies and comments to the posts per period
3c. Number of retweets and cross-posts per period

4. Data describing credibility of social media accounts
4a. Number of social media posts made by accounts identified as bots/non-bots per period
4b. Number of upvotes/downvotes, likes, replies, comments and cross-posts of the posts made by accounts identified as bots/non-bots per period
4c. Number of social media posts made by accounts identified as influencers/market analysts per period
4d. Number of upvotes/downvotes, likes, replies, comments and cross-posts of the posts made by accounts identified as influencers/market analysts per period

Data analytics methodology

Selection of asset-relevant social media posts: This task is done through iterative use of information retrieval methods such as keyword extraction and topic modelling (LDA, BERTopic, etc.). For each asset, we extract the keywords that people commonly use to refer to it. Anyone who wants to influence public opinion on an asset must name the target asset specifically, for example by its ticker code or common name, so the keywords they choose help us identify their posts. We also use fine-tuned models to help verify the financial topics of the posts. By combining these methods and models, we can focus on the data that is useful for seeking alpha or for identifying critical events from different influencers.

Financial-related classification: To filter the key samples out of large volumes of posts and news, we employ a state-of-the-art NLP model (XLM-RoBERTa). Pre-trained models already existed for news about traditional assets such as bonds, FX, and stocks. Using weakly supervised learning and additional internal data on less traditional assets like crypto (added via techniques such as pseudo-labelling), we fine-tuned a classifier for high accuracy and precision. It is a binary classifier that predicts whether a post is related to finance or not.

Account classification: To classify an account as a bot or as an authentic user, we apply a combination of the following techniques:
- NLP-based content analysis: we employ transformer models like Google mT5 or XLM-RoBERTa trained on bot post datasets.
- Heuristics-based features (speed of posting, statistical characteristics based on NER analysis results, etc.). These features are fed to a support vector machine (SVM) classifier.
- The format of recent posts from the same user. Many bots generate their posts from templates, assembling and transforming pieces of text. The model extracts features from the post format to improve classification.
- Analysis of network topology (bot accounts have a different topology from human accounts), specifically centrality measures of an account within the account network (betweenness centrality, Katz centrality, PageRank).

To classify an account as an influencer, a market analyst, or an abnormal user, we apply a combination of the following techniques:
- NLP-based content analysis: transformer models like Google mT5 or XLM-RoBERTa trained on influencer post datasets.
- Analysis of the account's following network, specifically centrality measures of the account within the account network (betweenness centrality, Katz centrality, PageRank, eigenvector centrality).
- Thresholds on the number of followers or Reddit karma.

Sentiment detection: We utilise transformer-based models (FinBERT, CryptoBERT and CryptoRoberta) fine-tuned on our internal datasets. The models were trained on cryptocurrency and stock data collected from social media, and the classifier outputs three classes: bearish, neutral, and bullish.
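To make the "per period" counts in categories 1a and 4a more concrete, here is a minimal sketch of how labelled posts could be rolled up into hourly time series. The record fields (`ts`, `asset`, `sentiment`, `is_bot`, `likes`), the hourly period, and the sample values are all assumptions made for illustration; they are not the dataset's actual schema or ZENPULSAR's pipeline.

```python
from collections import Counter
from datetime import datetime

# Hypothetical labelled posts; field names and values are illustrative only.
posts = [
    {"ts": "2023-10-12T09:15:00", "asset": "BTC", "sentiment": "bullish", "is_bot": False, "likes": 12},
    {"ts": "2023-10-12T09:40:00", "asset": "BTC", "sentiment": "bearish", "is_bot": True, "likes": 3},
    {"ts": "2023-10-12T10:05:00", "asset": "BTC", "sentiment": "bullish", "is_bot": False, "likes": 7},
    {"ts": "2023-10-12T10:30:00", "asset": "BTC", "sentiment": "neutral", "is_bot": False, "likes": 1},
]


def hour_bucket(ts: str) -> str:
    """Truncate an ISO 8601 timestamp to the start of its hour."""
    dt = datetime.fromisoformat(ts)
    return dt.replace(minute=0, second=0, microsecond=0).isoformat()


def aggregate(records):
    """Count posts and sum likes per (period, asset, sentiment, is_bot)."""
    post_counts = Counter()
    like_counts = Counter()
    for rec in records:
        key = (hour_bucket(rec["ts"]), rec["asset"], rec["sentiment"], rec["is_bot"])
        post_counts[key] += 1
        like_counts[key] += rec["likes"]
    return post_counts, like_counts


post_counts, like_counts = aggregate(posts)
for key in sorted(post_counts):
    print(key, post_counts[key], like_counts[key])
```

Keying on the full tuple keeps one pass over the data; coarser series (for example, all bullish posts regardless of bot status) can then be obtained by summing over the dimensions that are not needed.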
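The account classification steps mention PageRank as one of the network features. The sketch below shows a plain power-iteration PageRank on a toy follow graph, only to illustrate what such a centrality score measures. The graph, the account names, and the parameter values are hypothetical, and nothing here reflects how ZENPULSAR actually computes or uses the feature.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over directed (follower, followed) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by accounts that follow nobody is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical follow graph: three accounts follow "hub", "hub" follows one
# of them back, and two accounts only follow each other.
edges = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"),
    ("hub", "a"),
    ("x1", "x2"), ("x2", "x1"),
]

rank = pagerank(edges)
for account, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{account}: {score:.4f}")
```

In this toy graph the widely followed account gets the highest score, while the two accounts that only follow each other form an isolated pair. A score like this would be just one input among the features listed above, not a classifier on its own.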
Provider:
ZENPULSAR
Created:
2023-10-12
Collected summary
Dataset introduction
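Background and challenges
Background overview
This dataset is provided by ZENPULSAR's PUMP platform, which monitors social media such as Twitter and Reddit in real time to track activity related to financial and crypto assets and to analyse potential trends. The data covers four categories: post sentiment, account activity, engagement, and credibility. NLP models, network analysis, and other methods are used for asset selection, financial classification, account identification, and sentiment detection.
The summary above was collected and generated by 遇见数据集.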