UPFD-GOS (User Preference-aware Fake News Detection)

OpenDataLab, updated 2026-04-05, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/UPFD-GOS

Resource description:

For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS. This dataset has been integrated with PyTorch Geometric (PyG) and Deep Graph Library (DGL). You can load the dataset after installing the latest version of PyG or DGL.

The UPFD dataset includes two sets of tree-structured graphs for evaluating binary graph classification, graph anomaly detection, and fake/real news detection tasks. The dataset is stored as PyTorch Geometric Dataset objects, so you can load the data and run various GNN models using PyG.

The dataset covers the propagation (retweet) networks of fake and real news on Twitter, constructed from fact-checking information from Politifact and Gossipcop. The news retweet graphs were originally extracted from FakeNewsNet. Each graph is a hierarchical tree-structured graph in which the root node represents the news and the leaf nodes are Twitter users who retweeted the root news. An edge exists between a user node and the news node if the user retweeted the news tweet. An edge also exists between two user nodes if one user retweeted a news tweet posted by the other user.

We crawled nearly 20 million historical tweets from users who participated in fake news propagation in FakeNewsNet to generate node features for the dataset. We have incorporated four types of node features: 768-dimensional BERT features and 300-dimensional spaCy features, encoded using pretrained BERT and spaCy word2vec respectively; 10-dimensional profile features extracted from Twitter account profiles; and 310-dimensional content features composed of 300-dimensional user comment word2vec (spaCy) embeddings plus the 10-dimensional profile features. You can refer to profile_feature.py for profile feature extraction.
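Loading the dataset through PyG, as described above, might look like the following minimal sketch. It assumes `torch_geometric` is installed and uses the `UPFD` class from `torch_geometric.datasets` with its `name`, `feature`, and `split` arguments. The helper names `is_valid_config` and `load_upfd`, the `FEATURE_DIMS` table, and the `data/UPFD` path are hypothetical and only serve the illustration; the feature dimensions are copied from the description above.

```python
# Feature dimensions as stated in the dataset description.
FEATURE_DIMS = {"profile": 10, "spacy": 300, "bert": 768, "content": 310}

VALID_NAMES = ("politifact", "gossipcop")  # UPFD-POL and UPFD-GOS
VALID_SPLITS = ("train", "val", "test")


def is_valid_config(name, feature, split):
    """Return True if the arguments name a known UPFD variant (hypothetical helper)."""
    return name in VALID_NAMES and feature in FEATURE_DIMS and split in VALID_SPLITS


def load_upfd(root, name="gossipcop", feature="bert", split="train"):
    """Load one split of UPFD through PyG (hypothetical wrapper).

    The import is deferred so that this module can be imported, and the
    arguments checked, on a machine without torch_geometric installed.
    """
    if not is_valid_config(name, feature, split):
        raise ValueError(f"unsupported UPFD config: {name!r}, {feature!r}, {split!r}")
    from torch_geometric.datasets import UPFD  # requires PyG

    return UPFD(root=root, name=name, feature=feature, split=split)
```

Calling `load_upfd("data/UPFD")` would download and cache the Gossipcop variant with BERT features on first use. Each element of the returned dataset is one propagation graph, so it can be passed to a PyG `DataLoader` for graph-level classification.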
The dataset statistics are shown below:

| Dataset    | Number of Graphs | Number of Fake News | Total Nodes | Total Edges | Average Nodes per Graph |
|------------|------------------|---------------------|-------------|-------------|-------------------------|
| Politifact | 314              | 157                 | 41,054      | 40,740      | 131                     |
| Gossipcop  | 5,464            | 2,732               | 314,262     | 308,798     | 58                      |

For more details about the UPFD dataset, please refer to the associated paper.

Due to Twitter's policies, we cannot publicly release the historical tweets of the crawled users. To obtain the corresponding Twitter user information, you can refer to the news list in the `data` directory of our GitHub repository and map the news IDs to FakeNewsNet. Then, you can crawl user information following the instructions provided on FakeNewsNet. In the UPFD project, we used Tweepy and the Twitter Developer API to acquire user information.
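The reported statistics can be cross-checked against the tree structure of the graphs: a tree with n nodes has n - 1 edges, so the total edge count should equal total nodes minus the number of graphs. The sketch below recomputes that quantity, the rounded average graph size, and the share of fake news from the published figures. The names `STATS` and `summarize` are hypothetical; the numbers are copied from the statistics above.

```python
# Figures copied from the dataset statistics.
STATS = {
    "Politifact": {"graphs": 314, "fake": 157, "nodes": 41_054, "edges": 40_740},
    "Gossipcop": {"graphs": 5_464, "fake": 2_732, "nodes": 314_262, "edges": 308_798},
}


def summarize(row):
    """Derive quantities that follow from the raw counts (hypothetical helper)."""
    return {
        # Each graph is a tree, so it contributes (nodes - 1) edges.
        "edges_if_trees": row["nodes"] - row["graphs"],
        # Average graph size, rounded to the nearest integer.
        "avg_nodes": round(row["nodes"] / row["graphs"]),
        # Share of graphs labeled as fake news.
        "fake_fraction": row["fake"] / row["graphs"],
    }


for name, row in STATS.items():
    derived = summarize(row)
    print(name, derived, "edges match:", derived["edges_if_trees"] == row["edges"])
```

Both variants satisfy the tree identity and are balanced, with half of the graphs labeled fake, which is worth keeping in mind when choosing evaluation metrics.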

Provider:

OpenDataLab

Created:

2022-06-23

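AI-collected summary

Dataset introduction

Background and challenges

Background overview

UPFD-GOS is a graph classification dataset for user preference-aware fake news detection. Built from Twitter news propagation networks, it contains 5,464 tree-structured graphs, 2,732 of which are fake news, with more than 310,000 nodes in total. The dataset provides several node feature types (such as BERT and spaCy encodings), suits binary graph classification and anomaly detection tasks, and is compatible with PyTorch Geometric and Deep Graph Library, which makes it straightforward to apply GNN models.

The above content was collected and summarized by AI.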
