Labeled corpora for post-training Language Models on thematic and misinformation classification in a One Health context
Source: DataCite Commons | Updated: 2025-11-12 | Indexed: 2026-03-29
Download link:
https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/4JRKO2
Description:
This repository contains five corpora of labeled texts used to fine-tune language models through selective masking, adapting them to targeted domains within the One Health context. The corpora comprise collections of unannotated texts, generally sourced from PubMed and PADI-web, and represent two main areas of application: (i) thematic content related to the One Health domain, covering the biomedical, phytosanitary, and syndromic surveillance fields, and (ii) epidemic misinformation.
The repository contains five files:
Medical Text - Cancer_snippets: 996 scientific articles and abstracts on human cancers, extracted from the Medical Text Dataset - Cancer Doc Classification Dataset. The corpus is divided into three classes (Thyroid Cancer: 283, Colon Cancer: 261, Lung Cancer: 453).
PubMed Plant Diseases_snippets: 1,200 abstracts of English-language PubMed papers in the plant health domain. The corpus is divided equally among three major crop diseases (Downy Mildew, Powdery Mildew, and Bacterial Wilt). We collected the abstracts by web scraping, selecting those whose titles and content contained the disease names.
PADI-web Plant Health_snippets: 748 news articles on Xylella fastidiosa (a plant pathogen), collected with PADI-web (https://padi-web.cirad.fr/en) and manually classified by experts into two classes: relevant (317 articles, i.e., documents related to a new, suspected, or unknown outbreak) and not relevant (431 articles).
PADI-web Syndromic_snippets: 769 online news articles divided into two classes: positive (311 articles dealing with unknown diseases) and negative (458 articles in which a pathogenic cause is identified).
CoAID_snippets: 252 news articles and Facebook posts on the COVID-19 epidemic, extracted from the larger CoAID dataset. The corpus is divided into two classes: fake (126 fake news items) and true (126 real news items).
The complete corpora are available under restricted access, while the open-access versions contain only snippets from each corpus.
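The description names selective masking but does not say how tokens are chosen for masking. The following is a minimal sketch under stated assumptions, not the authors' method: it assumes that domain-specific terms are masked with a higher probability than other tokens. The function name `selective_mask`, the `[MASK]` placeholder, the probabilities, and the example term list are all hypothetical and chosen for illustration only.

```python
import random

# Hypothetical placeholder; real tokenizers define their own mask token.
MASK_TOKEN = "[MASK]"


def selective_mask(tokens, domain_terms, p_domain=0.5, p_other=0.05, seed=None):
    """Return a copy of `tokens` in which domain terms are masked more often.

    Assumption: "selective masking" means preferring domain-specific tokens
    as masking targets. The probabilities are illustrative defaults.
    """
    rng = random.Random(seed)
    domain = {term.lower() for term in domain_terms}
    masked = []
    for tok in tokens:
        # Pick the masking probability according to domain membership.
        p = p_domain if tok.lower() in domain else p_other
        masked.append(MASK_TOKEN if rng.random() < p else tok)
    return masked


# Usage: with p_domain=1.0 and p_other=0.0 only the domain terms are masked.
tokens = "Powdery mildew reduces yield in wheat crops".split()
print(selective_mask(tokens, {"powdery", "mildew"}, p_domain=1.0, p_other=0.0))
# prints ['[MASK]', '[MASK]', 'reduces', 'yield', 'in', 'wheat', 'crops']
```

In practice the masked sequences would feed a masked language modeling objective, and the term list would come from the target domain (for example, disease names appearing in one of the corpora above).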
Provider:
CIRAD Dataverse
Created:
2025-10-27



