SURSY data: Textual data for syndromic surveillance

Source: DataCite Commons. Updated: 2026-02-02. Indexed: 2026-03-29.
Download link:
https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/MYMPSO
Description:
Datasets produced in the context of the SURSY project to conduct syndromic surveillance for plant health: monitoring potentially new diseases (i.e. disease X) and the development of already existing diseases on new host plants.

======

Phase 1: Articles collected with PADI-web, labeled by thematic category:
SV: Plant health (Santé Végétale)
SA: Animal health (Santé Animale)
SP: Public health (Santé Publique)

Phase 2: Articles collected with PADI-web, labeled by syndromic category:
VS: Syndromic surveillance (Veille Syndromique)
NVS: Non-syndromic surveillance (Non Veille Syndromique)

======

Phase 1 Datasets
Columns: Title, Text, Thematic (SV/SA/SP), Disease, Text_without_html, Clean_text

- P1-UMAP-KMeans – Articles vectorized with TF-IDF (max 10,000 features), reduced to 3D using UMAP, clustered with K-means, and selected around cluster centroids to ensure thematic balance.
  CSV file: data_selected_with_umap_kmeans_phase1.csv

- P1-DiseaseBalanced – Balanced selection by disease within SV, then proportional sampling in SA and SP, with undersampling to maintain disease diversity.
  CSV file: data_selected_by_disease_phase1.csv

======

Phase 2 Datasets
Columns: Text, Tokens_clean, Text_clean, Type_article (VS/NVS), Thematic (SA/SV)

- P2+P1-NoDiseaseNamesRatio – Phase 2 data (with disease names) combined with a portion of Phase 1 data (without disease names) to achieve a 1:10 VS/NVS ratio. Phase 1 articles are considered NVS.
  CSV file: data_final_phase2.csv

The P2+P1-NoDiseaseNamesRatio dataset gave the best performance and was retained as the main Phase 2 dataset.

Other Phase 2 data (private):

- P2-Only-WithDiseaseNames – Articles collected exclusively in Phase 2, containing disease names.
  CSV file: data_phase2_only_with_disease_names.csv

- P2+P1-WithDiseaseNamesRatio – Phase 2 data (with disease names) combined with a portion of Phase 1 data (also with disease names) to achieve the same 1:10 ratio.
  CSV file: p2_with_p1_both_disease_names_public.csv

- P2+P1-WithDiseaseNames-Cleaned – Data from version 3 with all disease names removed.
  CSV file: p2_with_p1_both_no_disease_names_public.csv
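The 1:10 VS/NVS combination described for the Phase 2 datasets can be illustrated with a minimal sketch. This is not the SURSY project's actual code: the function name combine_to_ratio, the random seed, the toy rows, and the choice to top up with uniformly sampled Phase 1 articles are all assumptions made for illustration. Only the column name Type_article, the VS/NVS labels, and the rule that Phase 1 articles count as NVS come from the description above.

```python
import random
from collections import Counter


def combine_to_ratio(phase2_rows, phase1_rows, nvs_per_vs=10, seed=42):
    """Add Phase 1 rows, labeled NVS, until NVS = nvs_per_vs * VS.

    Hypothetical helper: sampling is uniform at random, which is an
    assumption, not the documented selection method.
    """
    n_vs = sum(1 for row in phase2_rows if row["Type_article"] == "VS")
    n_nvs = len(phase2_rows) - n_vs
    # Number of extra NVS articles needed to reach the target ratio.
    # If Phase 2 already has enough NVS rows, nothing is added or removed.
    needed = max(0, nvs_per_vs * n_vs - n_nvs)
    if needed > len(phase1_rows):
        raise ValueError("not enough Phase 1 articles to reach the ratio")
    rng = random.Random(seed)
    sampled = rng.sample(phase1_rows, needed)
    # Phase 1 articles are considered NVS.
    added = [{**row, "Type_article": "NVS"} for row in sampled]
    return phase2_rows + added


# Toy data standing in for the real CSV rows.
phase2 = [{"Text": f"vs {i}", "Type_article": "VS"} for i in range(3)]
phase2 += [{"Text": f"nvs {i}", "Type_article": "NVS"} for i in range(5)]
phase1 = [{"Text": f"p1 {i}"} for i in range(100)]

combined = combine_to_ratio(phase2, phase1)
counts = Counter(row["Type_article"] for row in combined)
print(counts["VS"], counts["NVS"])  # prints: 3 30
```

With 3 VS and 5 NVS articles in the toy Phase 2 set, 25 Phase 1 articles are added so that the result holds 3 VS and 30 NVS rows.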
Provider:
CIRAD Dataverse
Created:
2026-01-06