Brazilian Social Media Anti-vaccine Information Disorder Dataset - Telegram
Source: REDU | Updated: 2025-01-01 | Indexed: 2026-05-11
Download link:
https://redu.unicamp.br/citation?persistentId=doi:10.25824/redu/5JIVDT
Description:
This dataset contains approximately four million Telegram posts collected from 119 prominent Brazilian anti-vaccine channels between 2020 and 2025. The dataset includes message content, metadata, associated media, and classification related to vaccine posts, enabling researchers to examine how false or misleading information spreads, evolves, and influences public sentiment. The collection captures the period corresponding to a significant decline in national vaccination coverage and the concurrent "infodemic" circulating on digital platforms.

Total Messages: 3,998,633
Time Period: January 2020 – June 2025
Total Data Volume: 5.5 TB (including media)
License: Creative Commons BY-NC-SA 4.0

General Statistics
Total Channels/Groups: 119
Unique Anonymized Users: 71,672
Messages with Text: 3,345,088 (83.6%)
Vaccine-Related Posts: 407,723 (10.2%)
Main Languages: Portuguese (58.3%), English (8.0%), Spanish (1.7%)

Data was collected using a custom Python tool built on the Telethon library. The target channels were identified using seed lists from prior literature and keyword searches including terms like "Vacina," "mRNA," "Nova Ordem Mundial," and "Efeitos adversos". Only public channels with at least 1,000 members were monitored.

Annotation & Processing
Language Detection: Performed using langdetect with a confidence threshold of 0.5.
Topic Classification: The field is_vaccine_related was generated using the Sabiá-3 Large Language Model. The model achieved a 90% F1-score against human annotators.
Criteria: Mentions of vaccines/immunization, efficacy/safety discussions, policy discussions, conspiracy theories, or hesitancy.

Limitations
1. Engagement Metrics: The "reactions" feature was only implemented by Telegram in late 2021; data prior to Dec 30, 2021, lacks this metadata.
2. File Size: Media files larger than 50MB were excluded from collection.
3. Deleted Content: Content removed by channel admins or the Brazilian Supreme Court during the collection period may be missing or altered.

The metadata is in a .json file, which can be opened in any simple text editor.
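
The description names the is_vaccine_related field and notes that reaction metadata is absent before Dec 30, 2021. The sketch below shows one way a reader might apply both facts when working with the message records. It is a minimal illustration, not the dataset's documented schema: the "date" and "reactions" keys, the ISO 8601 timestamp format, and the sample records are all assumptions, and only is_vaccine_related comes from the description.

```python
from datetime import datetime, timezone

# Per the description, reaction metadata is absent before this date.
REACTIONS_AVAILABLE_FROM = datetime(2021, 12, 30, tzinfo=timezone.utc)


def parse_date(value):
    """Parse an ISO 8601 timestamp; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def vaccine_related(messages):
    """Keep only messages flagged by the is_vaccine_related field."""
    return [m for m in messages if m.get("is_vaccine_related")]


def reactions_or_none(message):
    """Return the reactions value, or None where the metadata cannot exist.

    None means "unknown", which is different from a post with zero reactions.
    The "date" and "reactions" keys are assumed names, not confirmed ones.
    """
    if parse_date(message["date"]) < REACTIONS_AVAILABLE_FROM:
        return None
    return message.get("reactions")


# Hypothetical records; the real file layout may differ.
sample = [
    {"date": "2021-06-01T12:00:00+00:00", "is_vaccine_related": True},
    {"date": "2022-03-15T08:30:00+00:00", "is_vaccine_related": True, "reactions": 12},
    {"date": "2022-03-16T09:00:00+00:00", "is_vaccine_related": False, "reactions": 3},
]

flagged = vaccine_related(sample)
reaction_values = [reactions_or_none(m) for m in flagged]
print(len(flagged), reaction_values)  # prints: 2 [None, 12]
```

Returning None for early posts keeps engagement comparisons from treating pre-2021 messages as if they had received no reactions.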
Provider:
. Instituto de Computação)
Created:
2025-01-01



