SilverSkylan/exorde-social-media-one-month-2024

Hugging Face, updated 2026-03-27, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilverSkylan/exorde-social-media-one-month-2024
Resource description:
---
language:
- multilingual
license: mit
annotations_creators:
- machine-generated
language_creators:
- found
size_categories:
- 100M<n<1B
source_datasets:
- original
task_categories:
- text-classification
- summarization
- text-retrieval
pretty_name: Exorde Social Media Dataset December 2024 Week 1
tags:
- social-media
- multi-lingual
- sentiment-analysis
- emotion-detection
- text
---

---

# Multi-Source, Multi-Language Social Media Dataset (1 Month Sample)

This dataset represents a rich, diverse snapshot of global online discourse, collected over nearly one month, from November 14, 2024, to December 12, 2024. It comprises 269 million unique posts and articles from social media platforms, blogs, and news sites, each timestamped at the moment of posting.

This dataset is produced by Exorde Labs (www.exordelabs.com/).

It includes many conversations around Black Friday, the aftermath of the US elections, European financial and political changes, the collapse of the Syrian regime, the killing of the UnitedHealth CEO, and many other topics. The range of possible analyses is wide.

All items in this dataset are captured publicly, in near real-time, allowing post-deletion and retrospective analyses. This dataset is an extract of the full stream produced by Exorde.

## Methodology: Total sampling of the web, statistical capture of all topics

## Dataset Highlights

- **Multi-Source**: Captures content from a wide range of online platforms
- **Multi-Language**: Covers 122 different languages
- **High-Resolution Temporal Data**: Each entry is timestamped to the exact moment of posting
- **Rich Metadata**: Includes sentiment analysis, emotion detection, and thematic categorization
- **Large Scale**: About 269 million unique entries collected in near real-time
- **Diverse Content**: Social media posts, blog entries, news articles, and more

## Dataset Schema

- **date**: string (exact timestamp of post)
- **original_text**: string
- **url**: string
- **author_hash**: string (SHA-1 hash for privacy)
- **language**: string
- **primary_theme**: string
- **english_keywords**: string
- **sentiment**: double
- **main_emotion**: string
- **secondary_themes**: list<element: int64>

## Attributes description

- **original_text** is the exact original text of the item/post, as it was collected. It should match the original content before any deletion or editing.
- **author_hash** is a SHA-1 hash of the author's username on a given platform, when provided. Many items have no author_hash (None).
- **language** is detected by a fasttext-langdetect model. Values are ISO 639 codes.
- **primary_theme** is the output of MoritzLaurer/deberta-v3-xsmall-zeroshot-v1.1-all-33, run on the classes below.
- **secondary_themes** are the same theme classes, encoded with this mapping:

> 1. Economy
> 2. Technology
> 3. Investing
> 4. Business
> 5. Cryptocurrency
> 6. Social
> 7. Politics
> 8. Finance
> 9. Entertainment
> 10. Health
> 11. Law
> 12. Sports
> 13. Science
> 14. Environment
> 15. People

- **main_emotion** is computed by an emotion-scoring language model fine-tuned on social media data.
- **english_keywords** is a powerful attribute, computed from an English translation of the original text. These keywords represent the core content (relevant keywords) of the text. They are produced with KeyBERT and statistical algorithms. They should be mostly in English; when translation failed, they remain in the original language.
- **sentiment** is computed and aggregated from several models, including deep learning models. It is a value between -1 and 1, where -1 is negative, 0 is neutral, and 1 is positive.

## Key Statistics

- **Total entries**: 269,403,210 (543 files, 496,138 average rows per file)
- **Date range**: 2024-11-14 to 2024-12-11 (included)
- **Unique authors**: 21,104,502
- **Languages**: 122
- **Primary themes**: 16
- **Main emotions**: 26
- **Average sentiment**: 0.043
- **Most common emotion**: Neutral

### Top 20 Sources

- x.com 179,375,295
- reddit.com 52,639,009
- bsky.app 24,893,642
- youtube.com 7,851,888
- 4channel.org 1,077,691
- jeuxvideo.com 280,376
- forocoches.com 226,300
- mastodon.social 225,319
- news.ycombinator.com 132,079
- lemmy.world 120,941
- investing.com 113,480
- tribunnews.com 89,057
- threads.net 55,838
- yahoo.co.jp 54,662
- yahoo.com 38,665
- indiatimes.com 38,006
- news18.com 33,241
- bhaskar.com 30,653
- chosun.com 28,692
- tradingview.com 28,261
- +5000 others

[Full source distribution](https://gist.githubusercontent.com/MathiasExorde/53eea5617640487bdd1e8d124b2df5e4/raw/5bb9a4cd9b477216d64af65e3a0918879f806e8b/gistfile1.txt)

### Top 10 Languages

1. English (en): 190,190,353
2. Spanish (es): 18,404,746
3. Japanese (ja): 14,034,642
4. Portuguese (pt): 12,395,668
5. French (fr): 5,910,246
6. German (de): 4,618,554
7. Arabic (ar): 3,777,537
8. Turkish (tr): 2,922,411
9. Italian (it): 2,425,941

[Full language distribution](https://gist.github.com/MathiasExorde/bded85ba620de095705bb20507fcf6f1#file-gistfile1-txt)

## About Exorde Labs

Exorde Labs is pioneering a novel collective distributed data DePIN (Decentralized Physical Infrastructure Network). Our mission is to produce a representative view of the web, minute by minute.

Since our inception in July 2023, we have achieved:

- Current capacity: Processing up to 4 billion elements annually
- Growth rate: 20% monthly increase in data volume
- Coverage: A comprehensive, real-time snapshot of global online discourse
- More than 10 million data points are processed daily, about half a million per hour, in near real-time

This dataset is a small sample of our capabilities, offering researchers and developers a glimpse into the rich, multi-faceted data we collect and analyze.

For more information about our work and services, visit:

- [Exorde Labs Website](https://www.exordelabs.com/)
- [Social Media Data](https://www.exordelabs.com/social-media-data)
- [Exorde Labs API](https://www.exordelabs.com/api)

## Use Cases

This dataset supports a wide range of applications, including but not limited to:

- Real-time trend analysis
- Cross-platform social media research
- Multi-lingual sentiment analysis
- Emotion detection across cultures
- Thematic analysis of global discourse
- Event detection and tracking
- Influence mapping and network analysis

## Acknowledgments

We would like to thank the open-source community for their continued support and feedback. Special thanks to all the platforms and users whose public data has contributed to this dataset. Massive thanks to the Exorde Network and its one-of-a-kind community of data enthusiasts.

## Licensing Information

This dataset is released under the MIT license.

## Citation Information

If you use this dataset in your research or applications, please cite it as follows:

`Exorde Labs. (2024). Multi-Source, Multi-Language Social Media Dataset [Data set]. Exorde Labs. https://www.exordelabs.com/`

## Contact Information

For questions, feedback, or more information about this dataset or Exorde Labs' services, please contact us at:

- Email: [hello@exordelabs.com](mailto:info@exordelabs.com)
- Twitter: [@ExordeLabs](https://twitter.com/ExordeLabs)
- GitHub: [Exorde Labs](https://github.com/exorde-labs)

We are committed to supporting the open-source community by providing high-quality, diverse datasets for cutting-edge research and development. If you find this dataset useful, consider exploring our API for real-time access to our full range of social media data.

![Exorde Labs Logo](https://cdn.prod.website-files.com/620398f412d5829aa28fbb86/62278ca0202d025e97b76555_portrait-logo-color.png)

---
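
The schema and attribute descriptions in this card can be worked with roughly as follows. This is a minimal sketch, not Exorde's own tooling, and it rests on several assumptions: that the integers in `secondary_themes` are 1-based and follow the numbered theme list; that `author_hash` is an unsalted SHA-1 hex digest of the UTF-8 username; and that a neutral band of ±0.05 is a reasonable cutoff for `sentiment`. The helper names (`decode_themes`, `sentiment_label`, `hash_author`) and the sample row are hypothetical and invented for illustration only.

```python
import hashlib

# Theme ids taken from the card's numbered list.
# Assumption: the stored integers are 1-based, matching that list.
THEME_NAMES = {
    1: "Economy", 2: "Technology", 3: "Investing", 4: "Business",
    5: "Cryptocurrency", 6: "Social", 7: "Politics", 8: "Finance",
    9: "Entertainment", 10: "Health", 11: "Law", 12: "Sports",
    13: "Science", 14: "Environment", 15: "People",
}


def decode_themes(ids):
    """Translate secondary_themes ids into names; unmapped ids are flagged."""
    return [THEME_NAMES.get(i, f"unknown({i})") for i in (ids or [])]


def sentiment_label(score, neutral_band=0.05):
    """Bucket a sentiment score in [-1, 1].

    neutral_band is a hypothetical cutoff chosen here, not defined by the card.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"sentiment out of range: {score}")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


def hash_author(username):
    """SHA-1 hex digest of a username, or None when no author is given.

    Assumption: UTF-8 encoding, no salt, hex output. The card only says "SHA-1".
    """
    if username is None:
        return None
    return hashlib.sha1(username.encode("utf-8")).hexdigest()


# A made-up row shaped like the schema (not real data from the dataset).
row = {
    "language": "en",
    "primary_theme": "Politics",
    "sentiment": -0.4,
    "secondary_themes": [1, 8],
    "author_hash": hash_author("example_user"),
}

summary = {
    "themes": decode_themes(row["secondary_themes"]),
    "sentiment": sentiment_label(row["sentiment"]),
    "has_author": row["author_hash"] is not None,
}
print(summary)
```

Keeping the theme mapping in a plain dictionary makes it easy to adjust if the actual encoding turns out to be 0-based or to include a sixteenth class, as the "Primary themes: 16" statistic may suggest.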
Provider:
SilverSkylan