Dataset for Evaluating location strategies in Padiweb
Source: DataCite Commons. Updated: 2024-11-13. Indexed: 2025-04-09.
Download link:
https://dataverse.cirad.fr/citation?persistentId=doi:10.18167/DVN1/Y1J9XK
Description:
This dataset was built as part of the optimization of the MUlti-Source surveillance Tool for the detection of Avian Influenza outbreaks in mammalian species (MUST-AI). MUST-AI collects health events reported by three sources: two official sources, WAHIS from the World Organisation for Animal Health (WOAH) and mails from the Program for Monitoring Emerging Diseases (ProMED); and one unofficial source, PADI-web, which collects online media articles. PADI-web uses five different strategies to locate the health events mentioned in article text, and the aim of the study was to assess these strategies. The dataset consists of 7 case studies (outbreak events taken from WAHIS or from the scientific literature) associated with 222 validated media articles collected by PADI-web through the five strategies. A case study is matched to a PADI-web article on the basis of the country of the outbreak and the time period.

The five evaluated strategies are:
(A) SpaCy locations in Outbreak articles: extraction with SpaCy of locations in articles classified as an epidemiological outbreak.
(B) SpaCy locations in Outbreak articles and Current event sentences: extraction with SpaCy of locations in articles classified as an epidemiological outbreak and in a sentence classified as relating to a current event.
(C) SpaCy locations in beginning of Outbreak articles: extraction with SpaCy of locations found in the first 300 characters of the text of an article classified as an epidemiological outbreak.
(D) PADI-web-specific locations: extraction of locations by the location extraction model trained on PADI-web data.
(E) SpaCy locations in beginning of articles: extraction with SpaCy of locations found in the first 300 characters of the text of an article.

Each case study is associated with an identification number. For each case study, the set of PADI-web articles is given, each article with a unique identification number.
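The matching criterion described above (same country, compatible time period) can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the `matches` function, the representation of the time period as a start and end date, and the optional `margin_days` tolerance are all assumptions made for illustration, and the record does not state the actual window used.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class CaseStudy:
    # Hypothetical structure: one outbreak event from WAHIS or the literature.
    case_id: str
    country: str
    start: date
    end: date


@dataclass(frozen=True)
class Article:
    # Hypothetical structure: one media article with the countries
    # that a given location strategy extracted from it.
    article_id: str
    countries: frozenset
    published: date


def matches(case: CaseStudy, article: Article, margin_days: int = 0) -> bool:
    """Return True when the article falls in the case's time period
    (widened by an assumed tolerance) and mentions the outbreak country."""
    margin = timedelta(days=margin_days)
    in_period = case.start - margin <= article.published <= case.end + margin
    return in_period and case.country in article.countries


# Invented example values, not taken from the dataset.
case = CaseStudy("cs1", "France", date(2023, 1, 10), date(2023, 2, 10))
hit = Article("a1", frozenset({"France", "Spain"}), date(2023, 1, 15))
wrong_country = Article("a2", frozenset({"Italy"}), date(2023, 1, 15))
late = Article("a3", frozenset({"France"}), date(2023, 2, 12))
```

With these invented values, `matches(case, hit)` holds, while `wrong_country` fails on the country test and `late` only matches once a tolerance of a few days is allowed.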
The dataset contains the following fields:
- Source: source of the case study, either WAHIS or a published article in the scientific literature.
- Id_gold_standard: case study identification number, either the outbreak id reported in WAHIS or the rank number of the literature case study.
- Id_article: identification number of the media article, as generated in PADI-web.
- URL: URL of the source article.
- Strategy X: binary value indicating whether the article was returned by strategy X.
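As a rough illustration of how these fields might be used, the sketch below counts, for each case study, how many articles each strategy returned. It assumes the data are available as CSV, that the strategy columns are named `Strategy A` to `Strategy E`, and that the binary values are written as `0` and `1`; none of this is confirmed by the record, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

# Assumed header names for the five binary strategy columns.
STRATEGY_COLUMNS = [f"Strategy {letter}" for letter in "ABCDE"]


def articles_per_strategy(rows):
    """Map each Id_gold_standard to the number of articles
    returned by each strategy (cells equal to "1")."""
    counts = defaultdict(lambda: dict.fromkeys(STRATEGY_COLUMNS, 0))
    for row in rows:
        case_id = row["Id_gold_standard"]
        for column in STRATEGY_COLUMNS:
            if row[column].strip() == "1":
                counts[case_id][column] += 1
    return dict(counts)


# Invented sample rows, for illustration only.
SAMPLE = """Source,Id_gold_standard,Id_article,URL,Strategy A,Strategy B,Strategy C,Strategy D,Strategy E
WAHIS,1001,a1,https://example.org/a1,1,0,1,1,1
WAHIS,1001,a2,https://example.org/a2,1,1,0,0,1
literature,2,a3,https://example.org/a3,0,0,0,1,1
"""

result = articles_per_strategy(csv.DictReader(io.StringIO(SAMPLE)))
```

Reading the real file would only require replacing `io.StringIO(SAMPLE)` with an open file handle, after adjusting the column names to the actual header.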
Provider:
CIRAD Dataverse
Created:
2024-09-13



