SURSY data: Textual data for syndromic surveillance
Source: DataCite Commons | Updated: 2026-02-02 | Indexed: 2026-03-29
Download link:
https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/MYMPSO
Description:
Datasets produced in the context of the <b>SURSY project</b> to conduct <b>syndromic surveillance for plant health</b> (monitoring potentially new diseases, i.e. disease X, and the development of existing diseases on new host plants). <br><br>
======<br><br>
<b>Phase 1: Articles collected with PADI-web labeled by thematic categories: </b><br>
SV: Plant health (Santé Végétale) <br>
SA: Animal health (Santé Animale) <br>
SP: Public health (Santé Publique) <br><br>
<b>Phase 2: Articles collected with PADI-web labeled by syndromic categories: </b><br>
VS: Syndromic surveillance (Veille Syndromique)<br>
NVS: Non syndromic surveillance (Non Veille Syndromique)<br><br>
======<br><br>
<b>Phase 1 Datasets</b>: <br>
<i>Title, Text, Thematic (SV/SA/SP), Disease, Text_without_html, Clean_text </i><br><br>
- P1-UMAP-KMeans – Articles vectorized with TF-IDF (max 10,000 features), reduced to 3D using UMAP, clustered with K-means, and selected around cluster centroids to ensure thematic balance. <br>
CSV file: data_selected_with_umap_kmeans_phase1.csv <br><br>
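The last step of the P1-UMAP-KMeans description (selecting articles around cluster centroids) can be sketched as follows. This is a minimal illustration and not the project's actual code: it assumes the 3D coordinates (from UMAP) and the cluster labels (from K-means) are already available, uses each cluster's mean as its centroid, and the function name `select_near_centroids` and the parameter `per_cluster` are hypothetical.

```python
import math
from collections import defaultdict


def select_near_centroids(points, labels, per_cluster):
    """Return indices of the `per_cluster` points closest to each cluster centroid."""
    # Group point indices by cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    selected = []
    for label in sorted(clusters):
        members = clusters[label]
        dim = len(points[members[0]])
        # Centroid taken as the mean of the cluster members (assumption).
        centroid = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
        # Rank members by Euclidean distance to the centroid, closest first.
        ranked = sorted(members, key=lambda i: math.dist(points[i], centroid))
        selected.extend(ranked[:per_cluster])
    return selected


# Toy 3D embedding (stand-in for UMAP output) and labels (stand-in for K-means output).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 0.0, 0.0),
          (5.0, 5.0, 5.0), (5.2, 5.0, 5.0), (9.0, 5.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
selected = select_near_centroids(points, labels, per_cluster=2)
print(selected)
```

Taking the same number of articles per cluster is one way to keep the selection balanced across clusters; the quota actually used for the dataset is not stated in the description.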
- P1-DiseaseBalanced – Balanced selection by disease within SV, then proportional sampling in SA and SP, with undersampling to maintain disease diversity.<br>
CSV file: data_selected_by_disease_phase1.csv <br><br>
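One possible reading of the "balanced selection by disease within SV" step is to cap the number of articles kept per disease, which undersamples frequent diseases while keeping rare ones. The sketch below illustrates only that step under this assumption; the cap value, the function name `cap_per_disease`, and the toy rows are hypothetical, and the proportional sampling in SA and SP is not shown because the description does not give the proportions.

```python
import random
from collections import defaultdict


def cap_per_disease(rows, cap, seed=0):
    """Keep at most `cap` rows per value of the 'Disease' column."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_disease = defaultdict(list)
    for row in rows:
        by_disease[row["Disease"]].append(row)

    kept = []
    for disease in sorted(by_disease):
        group = by_disease[disease]
        # Undersample only the diseases that exceed the cap.
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept


# Toy rows using the Phase 1 column names; values are illustrative placeholders.
rows = ([{"Thematic": "SV", "Disease": "xylella", "Title": f"x{i}"} for i in range(8)]
        + [{"Thematic": "SV", "Disease": "fusarium", "Title": f"f{i}"} for i in range(2)])
sv_rows = [r for r in rows if r["Thematic"] == "SV"]
balanced = cap_per_disease(sv_rows, cap=3)
counts = {d: sum(r["Disease"] == d for r in balanced) for d in ("xylella", "fusarium")}
print(counts)
```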
======<br><br>
<b>Phase 2 Datasets</b><br>
<i>Text, Tokens_clean, Text_clean, Type_article (VS/NVS), Thematic (SA/SV) </i><br><br>
- P2+P1-NoDiseaseNamesRatio – Phase 2 data (with disease names) combined with a portion of Phase 1 data (without disease names) to achieve a 1:10 VS/NVS ratio. Phase 1 articles are considered NVS. <br>
CSV file: data_final_phase2.csv <br><br>
<i>The P2+P1-NoDiseaseNamesRatio dataset gave the best performance and was retained as the main Phase 2 dataset</i> <br><br>
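The 1:10 VS/NVS mixing described above can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: it assumes Phase 2 rows carry a `Type_article` value of `VS` or `NVS`, that Phase 1 articles are added as NVS until there are ten NVS articles per VS article, and that the sample is drawn uniformly at random. The function name `mix_to_ratio` and its parameters are hypothetical.

```python
import random


def mix_to_ratio(phase2_rows, phase1_rows, nvs_per_vs=10, seed=0):
    """Add Phase 1 rows (labeled NVS) until there are `nvs_per_vs` NVS rows per VS row."""
    n_vs = sum(r["Type_article"] == "VS" for r in phase2_rows)
    n_nvs = len(phase2_rows) - n_vs
    # Number of extra NVS articles needed to reach the target ratio.
    needed = max(0, nvs_per_vs * n_vs - n_nvs)
    if needed > len(phase1_rows):
        raise ValueError("not enough Phase 1 articles to reach the target ratio")
    rng = random.Random(seed)
    # Phase 1 articles are considered NVS, so they are labeled accordingly.
    extra = [dict(r, Type_article="NVS") for r in rng.sample(phase1_rows, needed)]
    return phase2_rows + extra


# Toy data: 2 VS and 5 NVS articles from Phase 2, 30 candidate Phase 1 articles.
phase2 = ([{"Text": f"vs {i}", "Type_article": "VS"} for i in range(2)]
          + [{"Text": f"nvs {i}", "Type_article": "NVS"} for i in range(5)])
phase1 = [{"Text": f"p1 {i}"} for i in range(30)]
mixed = mix_to_ratio(phase2, phase1)
n_vs = sum(r["Type_article"] == "VS" for r in mixed)
n_nvs = sum(r["Type_article"] == "NVS" for r in mixed)
print(n_vs, n_nvs)
```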
Other Phase 2 datasets (private): <br><br>
- P2-Only-WithDiseaseNames – Articles collected exclusively in Phase 2, containing disease names. <br>
CSV file: data_phase2_only_with_disease_names.csv <br><br>
- P2+P1-WithDiseaseNamesRatio – Phase 2 data (with disease names) combined with a portion of Phase 1 data (also with disease names) to achieve the same 1:10 ratio.<br>
CSV file: p2_with_p1_both_disease_names_public.csv <br><br>
- P2+P1-WithDiseaseNames-Cleaned – Data from version 3 (P2+P1-WithDiseaseNamesRatio) with all disease names removed.<br>
CSV file: p2_with_p1_both_no_disease_names_public.csv <br><br>
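Several of the datasets above differ only in whether disease names are present in the text. The description does not say how the names were removed; the sketch below shows one simple way to do it, assuming a known list of disease names and case-insensitive whole-word matching. The function name `remove_disease_names` and the example sentence are hypothetical.

```python
import re


def remove_disease_names(text, disease_names):
    """Remove every whole-word occurrence of the given disease names from `text`."""
    if not disease_names:
        return text
    # Longest names first so multi-word names are matched before their substrings.
    names = sorted(disease_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    stripped = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removed names.
    return re.sub(r"\s+", " ", stripped).strip()


cleaned = remove_disease_names(
    "Olive trees hit by Xylella fastidiosa in the south",
    ["xylella fastidiosa", "xylella"],
)
print(cleaned)
```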
Provider:
CIRAD Dataverse
Created:
2026-01-06



