
French Entity-Linking dataset between annotated tweets collected during major crises in France and French Wikipedia corpus

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7767293
Description:
Most of the available datasets are not well suited to our target application: geolocating natural disasters from social networks. First, social media posts are largely underrepresented in these datasets, and the only Twitter dataset lacks Entity-Linking annotations. Second, none of the datasets focuses on a crisis or natural disaster event. To mitigate these issues, we extracted a collection of French tweets written during earthquakes and major floods that have occurred in France in recent years, and set up Label-Studio to annotate them.

A total of 4617 tweets were annotated, including 1678 posted during earthquakes and 2939 during floods. For each annotated tweet, mentions were annotated using the set of labels described earlier in the paper and, when possible, the target Wikipedia title. The resulting dataset, named “RéSoCIO” after the research project in which it was carried out, contains a total of 12,828 annotated mentions and 1,513 distinct Wikipedia entities. 85% of mentions were associated with a Wikipedia page, and 94% if we ignore the RISKNAT and DAMAGES labels, which are often difficult to map to an existing entity.

Labels      #Mentions   #Linked   #Entities
PERSON            315       263         136
ORG               863       790         281
GEOLOC           4375      4234         701
TRANSPORT         250       203         101
EVENT              35        21          16
FACILITY          129        94          49
RISKNAT          5502      4994         128
DAMAGES          1136       121          56
OTHER             223       200          46
Total           12828      1322        1513

Overview of the mentions annotated in the Twitter dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity, and #Entities the number of distinct entities per label present in the dataset.
Labels      #Mentions    #Linked   #Entities
PERSON        1100102    1098406      557697
ORG            750925     749504      130394
GEOLOC        2729702    2728296      215924
TRANSPORT      161539     160487       53405
EVENT          798433     798251       86471
FACILITY       258835     258513      109867
RISKNAT          5502       4994         127
DAMAGES          1136        121          56
OTHER         4340621    4339658      682458
Total        10146795   10138230     1836399

Overview of the mentions annotated in the full dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity, and #Entities the number of distinct entities per label present in the dataset.
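
The linking rates quoted for the Twitter dataset (85% overall, 94% without RISKNAT and DAMAGES) can be rechecked from the per-label counts. The following is a minimal sketch, not part of the dataset or its tooling: the names `TWITTER_COUNTS` and `link_rate` are illustrative assumptions, and the numbers are copied from the per-label rows of the Twitter overview. Under these counts the per-label #Linked values sum to 10920, which is the figure consistent with the stated 85%, whereas the Total row lists 1322.

```python
# Per-label counts copied from the Twitter-dataset overview:
# label -> (mentions, linked, entities). Names here are illustrative only.
TWITTER_COUNTS = {
    "PERSON": (315, 263, 136),
    "ORG": (863, 790, 281),
    "GEOLOC": (4375, 4234, 701),
    "TRANSPORT": (250, 203, 101),
    "EVENT": (35, 21, 16),
    "FACILITY": (129, 94, 49),
    "RISKNAT": (5502, 4994, 128),
    "DAMAGES": (1136, 121, 56),
    "OTHER": (223, 200, 46),
}


def link_rate(counts, exclude=()):
    """Return (mentions, linked, linked / mentions), skipping labels in `exclude`."""
    kept = [row for label, row in counts.items() if label not in exclude]
    mentions = sum(row[0] for row in kept)
    linked = sum(row[1] for row in kept)
    return mentions, linked, linked / mentions


if __name__ == "__main__":
    for excluded in ((), ("RISKNAT", "DAMAGES")):
        mentions, linked, rate = link_rate(TWITTER_COUNTS, exclude=excluded)
        print(f"excluding {excluded}: {linked}/{mentions} linked ({rate:.1%})")
```

The same function can be applied to the full-dataset counts by building a second dictionary with the same layout.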
Created:
2023-03-25