Unlabeled corpora for post-training Language Models on thematic and misinformation classification in a One Health context
Source: DataCite Commons | Updated: 2025-11-12 | Indexed: 2026-03-29
Download link:
https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/XOD6SP
Description:
This repository contains four corpora of unlabeled texts used to post-train language models with selective masking, adapting them to targeted domains within the One Health context. The corpora comprise collections of unannotated texts, generally sourced from PubMed and PADI-web, that represent two main areas of application: (i) thematic content related to the One Health domain, covering the biomedical, phytosanitary, and syndromic surveillance fields, and (ii) epidemic misinformation.

The repository contains 4 files:
PubMed Biomedical_snippets: 10,000 English abstracts of biomedical articles, extracted from the PubMed Article Summarization Dataset.
PubMed Plant Health_snippet: 9,388 English abstracts of PubMed articles on plant health, collected by us through web scraping, selecting abstracts whose titles and content contain keywords related to plant health (e.g., plant diseases and plant names).
PADI-web Unspecified Diseases_snippet: 8,000 English news articles dedicated to syndromic surveillance (i.e., articles describing unknown diseases and symptoms), collected from the PADI-web tool (https://padi-web.cirad.fr/en).
PADI-web Public Health_snippet: 10,000 English news articles on human epidemics (e.g., Influenza and Ebola), used for the epidemic misinformation domain.

The complete corpora are available under restricted access, while the open-access versions contain only snippets from each corpus.
Provider:
CIRAD Dataverse
Created:
2025-10-27



