Dataset for Named Entity Recognition and Entity Linking from Greek Wikipedia Events
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7429036
Summary:
An automatically constructed benchmark dataset for Named Entity Recognition (NER) and Named Entity Linking (NEL) tools, based on Greek Wikipedia event pages.
Note: This data includes data from the following sources:
- Wikipedia el.wikipedia.org
Description
The dataset is provided as three JSON-formatted subsets (train, validation, and test) in a 70-20-10 ratio. The current version contains 18,617 events annotated with 40,798 entity mentions and 36,189 links to Greek Wikipedia (elWikipedia) pages and their Wikidata IDs. The annotations cover 8 entity types: person, organization, location, GPE (geopolitical entity), event, facility, product, and work of art.
Overall dataset statistics
            Docs     Tokens    Sentences   Surface Mentions   Valid Links   Red Links
Train       13,031   332,077   16,927      28,593             25,365        3,228
Validation  3,722    94,746    4,844       8,168              7,240         928
Test        1,862    47,450    2,427       4,037              3,584         453
Total       18,617   474,361   24,200      40,798             36,189        4,609
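The link columns in the statistics above can be cross-checked with a little arithmetic. The sketch below copies the mention and link counts from the table and confirms that, in every row, Valid Links plus Red Links equals Surface Mentions. On Wikipedia, a red link points to a page that does not exist yet. The dictionary layout and the helper name `linked_fraction` are illustrative choices, not part of the dataset or its code.

```python
# Per-split counts copied from the statistics table above.
stats = {
    "train": {"mentions": 28593, "valid": 25365, "red": 3228},
    "validation": {"mentions": 8168, "valid": 7240, "red": 928},
    "test": {"mentions": 4037, "valid": 3584, "red": 453},
    "total": {"mentions": 40798, "valid": 36189, "red": 4609},
}


def linked_fraction(row):
    """Share of surface mentions that resolve to a valid (non-red) link."""
    return row["valid"] / row["mentions"]


for split, row in stats.items():
    # Every mention is counted as either a valid link or a red link.
    assert row["valid"] + row["red"] == row["mentions"]
    print(f"{split}: {linked_fraction(row):.1%} of mentions have a valid link")
```

In each split, a little under 89% of the surface mentions carry a valid link, so the three subsets look evenly balanced in this respect.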
Example
An example record is given below.
{
  "json_file": "February 2012_39_0 events",
  "text": "Sudan and South Sudan sign non-aggression pact.",
  "ground_truth_mentions": [
    {"start": 0, "end": 4, "surface_mention": "Sudan", "mention_type": "GPE"},
    {"start": 10, "end": 20, "surface_mention": "South Sudan", "mention_type": "GPE"}
  ],
  "ground_truth_links": [
    {"enwiki": "Sudan", "wikidata": "Q1049"},
    {"enwiki": "South_Sudan", "wikidata": "Q958"}
  ]
}
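A minimal sketch of how such a record might be read in Python follows. It rests on two assumptions inferred from this single example rather than from any documented specification: the "end" offset is inclusive (characters 0 to 4 spell "Sudan"), and the entries of "ground_truth_links" line up positionally with those of "ground_truth_mentions". How red links are represented is not shown in the example, so the sketch does not handle them. The function name `mentions_with_links` is hypothetical.

```python
import json

# The example record from above, embedded as a string so the sketch is self-contained.
record_json = """
{
  "json_file": "February 2012_39_0 events",
  "text": "Sudan and South Sudan sign non-aggression pact.",
  "ground_truth_mentions": [
    {"start": 0, "end": 4, "surface_mention": "Sudan", "mention_type": "GPE"},
    {"start": 10, "end": 20, "surface_mention": "South Sudan", "mention_type": "GPE"}
  ],
  "ground_truth_links": [
    {"enwiki": "Sudan", "wikidata": "Q1049"},
    {"enwiki": "South_Sudan", "wikidata": "Q958"}
  ]
}
"""


def mentions_with_links(record):
    """Pair each mention with the link at the same list position (assumed alignment)."""
    text = record["text"]
    pairs = []
    for mention, link in zip(record["ground_truth_mentions"],
                             record["ground_truth_links"]):
        # "end" is treated as inclusive here, so add 1 for Python slicing.
        span = text[mention["start"]:mention["end"] + 1]
        # Sanity check: the sliced span should match the stored surface form.
        assert span == mention["surface_mention"]
        pairs.append((span, mention["mention_type"], link["wikidata"]))
    return pairs


record = json.loads(record_json)
pairs = mentions_with_links(record)
for span, mention_type, wikidata_id in pairs:
    print(span, mention_type, wikidata_id)
```

The assertion inside the loop is a cheap way to detect an offset convention that differs from the one assumed here before the spans are passed to an NER or NEL evaluation.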
Code
https://gitlab.isl.ics.forth.gr/debatelab/elwiki_events_benchmark
Acknowledgments
This work has received funding from the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under grant agreement No 4195.
Created:
2023-05-23



