French Entity-Linking dataset between annotated tweets collected during major crises in France and French Wikipedia corpus
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7767293
Description:
Most available datasets are poorly suited to our target application: geolocating natural disasters from social networks. First, social media posts are largely underrepresented in these datasets, and the only Twitter dataset lacks Entity-Linking annotations. Second, none of the datasets focuses on a crisis or natural disaster event.
To mitigate these issues, we extracted a collection of French tweets written during earthquakes and major floods that occurred in France in recent years. We set up Label-Studio to annotate these tweets. A total of 4617 tweets were annotated: 1678 posted during earthquakes and 2939 during floods. For each tweet, mentions were annotated with the set of labels described earlier in the paper and, when possible, with the target Wikipedia title.
Named “RéSoCIO” after the research project in which it was carried out, the resulting dataset contains a total of 12828 annotated mentions and 1513 distinct Wikipedia entities. 85% of mentions were associated with a Wikipedia page, rising to 94% when the RISKNAT and DAMAGES labels, which are often difficult to map to an existing entity, are ignored.
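| Labels    | #Mentions | #Linked | #Entities |
|-----------|-----------|---------|-----------|
| PERSON    | 315       | 263     | 136       |
| ORG       | 863       | 790     | 281       |
| GEOLOC    | 4375      | 4234    | 701       |
| TRANSPORT | 250       | 203     | 101       |
| EVENT     | 35        | 21      | 16        |
| FACILITY  | 129       | 94      | 49        |
| RISKNAT   | 5502      | 4994    | 128       |
| DAMAGES   | 1136      | 121     | 56        |
| OTHER     | 223       | 200     | 46        |
| Total     | 12828     | 10920   | 1513      |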
Overview of the mentions annotated in the Twitter dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
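The 85% and 94% linking rates quoted in the description can be recomputed from the per-label #Mentions and #Linked counts of the Twitter table. The following is a minimal sketch: the counts are copied by hand from the table, and the names `TWITTER_COUNTS` and `link_rate` are illustrative and not part of the dataset or its tooling.

```python
# Per-label counts copied from the Twitter table: (mentions, linked).
TWITTER_COUNTS = {
    "PERSON": (315, 263),
    "ORG": (863, 790),
    "GEOLOC": (4375, 4234),
    "TRANSPORT": (250, 203),
    "EVENT": (35, 21),
    "FACILITY": (129, 94),
    "RISKNAT": (5502, 4994),
    "DAMAGES": (1136, 121),
    "OTHER": (223, 200),
}


def link_rate(counts, exclude=()):
    """Return linked / mentions over all labels not listed in `exclude`."""
    kept = [pair for label, pair in counts.items() if label not in exclude]
    mentions = sum(m for m, _ in kept)
    linked = sum(l for _, l in kept)
    return linked / mentions


overall = link_rate(TWITTER_COUNTS)
without_hard = link_rate(TWITTER_COUNTS, exclude=("RISKNAT", "DAMAGES"))

print(f"all labels: {overall:.1%}")                             # all labels: 85.1%
print(f"without RISKNAT and DAMAGES: {without_hard:.1%}")       # without RISKNAT and DAMAGES: 93.8%
```

Both values round to the percentages stated in the description.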
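| Labels    | #Mentions | #Linked  | #Entities |
|-----------|-----------|----------|-----------|
| PERSON    | 1100102   | 1098406  | 557697    |
| ORG       | 750925    | 749504   | 130394    |
| GEOLOC    | 2729702   | 2728296  | 215924    |
| TRANSPORT | 161539    | 160487   | 53405     |
| EVENT     | 798433    | 798251   | 86471     |
| FACILITY  | 258835    | 258513   | 109867    |
| RISKNAT   | 5502      | 4994     | 127       |
| DAMAGES   | 1136      | 121      | 56        |
| OTHER     | 4340621   | 4339658  | 682458    |
| Total     | 10146795  | 10138230 | 1836399   |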
Overview of the mentions annotated in the full dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
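As a sanity check on the full-dataset table, the Total row can be compared against the column sums of the per-label rows. This sketch again uses hand-copied counts, and `FULL_COUNTS` is a hypothetical name chosen for illustration.

```python
# Per-label counts copied from the full-dataset table:
# (mentions, linked, entities).
FULL_COUNTS = {
    "PERSON": (1100102, 1098406, 557697),
    "ORG": (750925, 749504, 130394),
    "GEOLOC": (2729702, 2728296, 215924),
    "TRANSPORT": (161539, 160487, 53405),
    "EVENT": (798433, 798251, 86471),
    "FACILITY": (258835, 258513, 109867),
    "RISKNAT": (5502, 4994, 127),
    "DAMAGES": (1136, 121, 56),
    "OTHER": (4340621, 4339658, 682458),
}

# Transpose the rows into three columns and sum each one.
totals = tuple(sum(column) for column in zip(*FULL_COUNTS.values()))
print(totals)  # (10146795, 10138230, 1836399)

# Share of mentions linked to an entity, per label.
for label, (mentions, linked, _) in FULL_COUNTS.items():
    print(f"{label:<10} {linked / mentions:.1%}")
```

The three sums match the Total row of the table.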
Created:
2023-03-25



