Twitter dataset
DataCite Commons | Updated 2024-12-20 | Indexed 2025-01-06
Download link:
https://figshare.com/articles/dataset/Twitter_dataset/28069163
Description:
The **Truth Seeker Dataset** is designed to support research on detecting and classifying misinformation on social media platforms, with a particular focus on Twitter. It is part of a broader initiative to improve understanding of how machine learning (ML) and natural language processing (NLP) can be used to identify fake news and misleading content in real time.
**Dataset Composition**
The Truth Seeker Dataset comprises a substantial collection of social media posts, each meticulously labeled as either real or fake. It was constructed using ML algorithms and NLP techniques to analyze the language patterns in social media communications. The dataset includes:
**Raw Social Media Posts**: A diverse range of tweets covering various topics and sentiments.
**Labeling**: Each post is annotated with a binary label indicating its authenticity (real or fake).
**Feature Sets**: Two distinct subsets of the dataset have been created using different NLP vectorization methods, Word2Vec and TF-IDF, so that researchers can explore how different feature representations affect model performance.
**Research Applications**
The primary aim of the Truth Seeker Dataset is to facilitate the development and validation of models that can accurately classify social media content. Key applications include:
**Fake News Detection**: Applying various ML algorithms, including Random Forest and AdaBoost, which achieved high F1 scores in preliminary evaluations.
**Model Comparison**: Researchers can compare the effectiveness of different ML approaches on the same dataset, giving a clearer picture of which methods perform best at detecting misinformation.
**Algorithm Development**: The dataset serves as a benchmark for developing new algorithms aimed at improving accuracy in fake news detection.
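The description mentions TF-IDF feature vectors and F1 scores for comparing models. The following is a minimal pure-Python sketch of both ideas, not the dataset authors' pipeline. The toy posts, the label encoding (1 = fake, 0 = real), the smoothed IDF formula, the placeholder predictions, and the names `tokenize`, `tfidf_vectors`, and `f1_score` are all assumptions made for illustration; none of them come from the dataset itself.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocab, vectors) using term frequency times a smoothed IDF.

    IDF here is log((1 + N) / (1 + df)) + 1, one common smoothing choice.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty posts
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy posts and labels; NOT taken from the dataset.
posts = [
    "shocking miracle cure doctors hate",
    "city council approves new budget",
    "secret miracle diet revealed",
    "local team wins weekend match",
]
labels = [1, 0, 1, 0]  # assumption: 1 = fake, 0 = real

vocab, vectors = tfidf_vectors(posts)
print(len(vocab), "terms;", len(vectors), "vectors")

# Placeholder predictions standing in for the outputs of two classifiers.
predictions = {"model_a": [1, 0, 0, 0], "model_b": [1, 1, 1, 0]}
for name, y_pred in predictions.items():
    print(name, round(f1_score(labels, y_pred), 3))
```

In practice, a library such as scikit-learn (`TfidfVectorizer`, `RandomForestClassifier`, `AdaBoostClassifier`, `sklearn.metrics.f1_score`) would likely replace these hand-written helpers; the sketch only makes the feature representation and the metric concrete.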
Provider:
figshare
Created:
2024-12-20