
Brazilian Social Media Anti-vaccine Information Disorder Dataset - Telegram

Source: REDU | Updated: 2025-01-01 | Indexed: 2026-05-11
Download link:
https://redu.unicamp.br/citation?persistentId=doi:10.25824/redu/5JIVDT
Description:
This dataset contains approximately four million Telegram posts collected from 119 prominent Brazilian anti-vaccine channels between 2020 and 2025. The dataset includes message content, metadata, associated media, and classification related to vaccine posts, enabling researchers to examine how false or misleading information spreads, evolves, and influences public sentiment. The collection captures the period corresponding to a significant decline in national vaccination coverage and the concurrent "infodemic" circulating on digital platforms.

Total Messages: 3,998,633
Time Period: January 2020 – June 2025
Total Data Volume: 5.5 TB (including media)
License: Creative Commons BY-NC-SA 4.0

General Statistics
Total Channels/Groups: 119
Unique Anonymized Users: 71,672
Messages with Text: 3,345,088 (83.6%)
Vaccine-Related Posts: 407,723 (10.2%)
Main Languages: Portuguese (58.3%), English (8.0%), Spanish (1.7%)

Data was collected using a custom Python tool built on the Telethon library. The target channels were identified using seed lists from prior literature and keyword searches including terms like "Vacina," "mRNA," "Nova Ordem Mundial," and "Efeitos adversos". Only public channels with at least 1,000 members were monitored.

Annotation & Processing
Language Detection: Performed using langdetect with a confidence threshold of 0.5.
Topic Classification: The field is_vaccine_related was generated using the Sabiá-3 Large Language Model. The model achieved a 90% F1-score against human annotators.
Criteria: Mentions of vaccines/immunization, efficacy/safety discussions, policy discussions, conspiracy theories, or hesitancy.

Limitations
1. Engagement Metrics: The "reactions" feature was only implemented by Telegram in late 2021; data prior to Dec 30, 2021, lacks this metadata.
2. File Size: Media files larger than 50MB were excluded from collection.
3. Deleted Content: Content removed by channel admins or the Brazilian Supreme Court during the collection period may be missing or altered.

The metadata is in a .json file, which can be opened in any simple text editor.
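
Because the metadata ships as JSON, a short script is often more practical than a text editor for a collection of this size. The sketch below is only an illustration: it assumes the file holds a list of message records, and the field names "date", "text", and "reactions" are hypothetical, since only is_vaccine_related is named in the description. It keeps the vaccine-related posts and treats reaction counts dated before Dec 30, 2021 as missing rather than zero, in line with the engagement-metrics limitation.

```python
import json
from datetime import datetime, timezone

# Reactions metadata is unavailable before this date (see Limitations).
REACTIONS_CUTOFF = datetime(2021, 12, 30, tzinfo=timezone.utc)


def parse_date(value):
    """Parse an ISO 8601 timestamp; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value)
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def vaccine_posts(records):
    """Yield vaccine-related records, hiding reactions that predate the cutoff.

    Only "is_vaccine_related" comes from the dataset description; the
    "date", "text", and "reactions" keys are assumed names for illustration.
    """
    for record in records:
        if not record.get("is_vaccine_related"):
            continue
        posted = parse_date(record["date"])
        has_reactions = posted >= REACTIONS_CUTOFF
        yield {
            "date": posted,
            "text": record.get("text", ""),
            # None means "not collected", which differs from a count of 0.
            "reactions": record.get("reactions") if has_reactions else None,
        }


# Tiny in-memory stand-in for the real metadata file (made-up records).
sample = json.loads("""
[
  {"date": "2021-06-01T12:00:00+00:00", "text": "post A",
   "is_vaccine_related": true, "reactions": 0},
  {"date": "2022-03-15T08:30:00+00:00", "text": "post B",
   "is_vaccine_related": true, "reactions": 12},
  {"date": "2022-03-16T09:00:00+00:00", "text": "post C",
   "is_vaccine_related": false, "reactions": 3}
]
""")

for post in vaccine_posts(sample):
    print(post["date"].date(), post["text"], post["reactions"])
```

For the real file, the sample would be replaced by json.load on the downloaded metadata. If the file is too large to load at once, a streaming JSON parser would be a better fit than this in-memory approach.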
Provided by:
. Instituto de Computação)
Created:
2025-01-01