DeepfakeWatch

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/f68vnm7t7w

Description:

Deepfake technology threatens information integrity across social media platforms. Understanding how the public perceives and discusses these AI-generated manipulations is critical to developing countermeasures. However, manually analysing thousands of comments is labour-intensive and requires specialist knowledge. This dataset supports analysis of public stance and discourse patterns around deepfake content.

The dataset contains 2,351 English comments extracted from 41 YouTube videos about deepfakes between July 2025 and January 2026. Each comment was classified with BART-large-mnli for stance detection and claim identification. Comments are assigned one of four stance labels: believes_real (the user is concerned about threats from deepfake use), believes_fake (the user is skeptical), uncertain (the user is doubtful), and meta_discussion (the user offers general commentary about the situation). Six claim types were identified: detection_literacy (spotting techniques), voice_fraud (audio cloning), war_propaganda (military context), celebrity (impersonation), scam_finance (financial fraud), and election_politics (political manipulation).

Topic modeling with non-negative matrix factorization (NMF) identified 8 discussion themes with a coherence of 0.53, outperforming LDA alternatives. Temporal changes over the seven months were tracked using Jensen-Shannon divergence, which provided empirical evidence of significant change points in discourse patterns. Privacy is protected by SHA-256 hashing of all identifiers.

The dataset serves research in computational social science, misinformation detection, and content moderation. Researchers can use it to train stance detection models, analyze the evolution of discourse, or study public awareness of AI-generated content.

1. Primary Dataset
   Records: 2,351 comments
   Format: CSV
   Columns: 18 (video_id, comment_id_hash, month, like_count, spam_score, stance_label, stance_conf, claim_type, claim_conf, topic_id_nmf, topic_p0-p7_nmf)

2. Stance Distribution
   meta_discussion: 1,264 (53.8%)
   uncertain: 484 (20.6%)
   believes_fake: 360 (15.3%)
   believes_real: 243 (10.3%)

3. Claim Type Distribution
   detection_literacy: 748 (31.8%)
   voice_fraud: 565 (24.0%)
   war_propaganda: 416 (17.7%)
   celebrity: 373 (15.9%)
   scam_finance: 137 (5.8%)
   election_politics: 112 (4.8%)

4. Topic Model
   Algorithm: NMF (k=8)
   Coherence: 0.5310
   Features: TF-IDF vectors
   Topics: AI-generated content, fake news debates, war propaganda, military authenticity, social discourse, Ukraine narrative, bot accusations, South Asian discussions

5. Quality Metrics
   Duplicates: 0
   Missing values: 0
   Spam flagged: 9.45% (retained with scores)
   Video coverage: 32.8% (41/125 videos)

6. Documentation
   README.md: Complete methodology and usage guide
   CODEBOOK.md: All variable definitions and distributions
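The description mentions two techniques that are easy to illustrate: SHA-256 hashing of identifiers and Jensen-Shannon divergence between stance distributions in different months. The following is a minimal sketch using only the Python standard library, not the dataset author's actual pipeline. The function names, the optional salt, the base-2 logarithm, and the toy label lists are all assumptions made for illustration; the README.md and CODEBOOK.md shipped with the dataset are the authority on how the published values were computed.

```python
import hashlib
import math
from collections import Counter

# Stance labels as listed in the dataset description.
STANCES = ["believes_real", "believes_fake", "uncertain", "meta_discussion"]


def hash_identifier(raw_id: str, salt: str = "") -> str:
    """Return the SHA-256 hex digest of an identifier.

    The salt parameter is a hypothetical extra; the description does not
    say whether the dataset uses one.
    """
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()


def stance_distribution(labels):
    """Turn a list of stance labels into a probability vector over STANCES."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts[s] for s in STANCES)
    if total == 0:
        raise ValueError("no known stance labels found")
    return [counts[s] / total for s in STANCES]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base-2 log), bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy labels for two months (made up for illustration, not from the dataset).
month_a = ["meta_discussion"] * 5 + ["uncertain"] * 3 + ["believes_fake"] * 2
month_b = ["meta_discussion"] * 3 + ["uncertain"] * 2 + ["believes_real"] * 5

shift = js_divergence(stance_distribution(month_a), stance_distribution(month_b))
print(f"JS divergence between the two toy months: {shift:.4f}")
print(hash_identifier("example-comment-id")[:16], "...")
```

With the real CSV, the label lists would come from grouping stance_label by month; a large divergence between consecutive months is one possible signal of a change point. Note that some libraries return the Jensen-Shannon distance (the square root of the divergence), so values are only comparable when the same definition and log base are used.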

Created:

2026-01-26
