A Document-Level Event Extraction Dataset for Cultivated Land Protection (DLEE4CP)
DataCite Commons | Updated 2026-01-15 | Indexed 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=44882464e8a04030a40a6145dc8c992c
Resource description:
The dataset is built from official sources: announcements and news reports published by the Ministry of Natural Resources of China, provincial natural resources departments, and local natural resources authorities. Cases that explicitly mention cultivated land or permanent basic farmland were selected through a rigorous screening process, yielding a final collection of 500 long-text announcement documents.

During construction, each of the 500 documents was first segmented into two parts: the title and the main body. The main body was then preprocessed, mainly by removing non-meaningful symbols and reducing punctuation inconsistencies. Next, an event schema specific to the cultivated land protection domain was defined from the announcement documents, covering 11 event types, 12 entity types, and 63 event arguments. All case data in the announcement documents were then annotated with the BRAT annotation tool, including fine-grained annotation and coreference resolution. Finally, the usability of the dataset was evaluated with eight event extraction models: BERT, BERT+CRF, BERT+BiLSTM+CRF, DCFEE-O, DCFEE-M, GreedyDec, Doc2EDAG, and GIT.

The uploaded dataset comprises three files:
(1) train.json, the training data, containing 400 instances.
(2) test.json, the testing data, containing 50 instances.
(3) dev.json, the validation data, containing 50 instances.

Each data instance includes the following four components:
(1) id, a unique identifier assigned to each notification case.
(2) title, the title of the corresponding notification case.
(3) text, the main body of the notification case, which describes the cultivated land violation incident in detail.
(4) event_list, the event information annotated according to the defined event schema, covering event types, entity types, and event arguments.

Each event consists of a trigger and a set of arguments. The trigger contains text (the textual content of the trigger), start and end (the start and end positions of the trigger in the text), and event_type (the type of event the trigger corresponds to). Each argument is described, in order, by text (the textual content of the argument), start and end (the start and end positions of the argument in the text), and role (the argument type, i.e., its semantic role in the event).
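The instance layout described above can be illustrated with a short Python sketch that checks whether each annotated span's start and end offsets reproduce its text. Several details here are assumptions and are not confirmed by the description: the key names "trigger" and "arguments" inside each event, whether end is exclusive or inclusive (the check accepts either), and the sample instance itself, whose sentence, event type, and role names are made up for illustration.

```python
import json
from dataclasses import dataclass

# Made-up instance following the described fields (id, title, text, event_list).
# The keys "trigger" and "arguments", the event type, and the role names are
# assumptions for illustration; they are not taken from the real dataset.
SAMPLE_JSON = """
{
  "id": "demo-0001",
  "title": "Illustrative notification case",
  "text": "Company A occupied 3 hectares of cultivated land without approval.",
  "event_list": [
    {
      "trigger": {"text": "occupied", "start": 10, "end": 18,
                  "event_type": "IllegalOccupation"},
      "arguments": [
        {"text": "Company A", "start": 0, "end": 9, "role": "Violator"},
        {"text": "3 hectares", "start": 19, "end": 29, "role": "Area"}
      ]
    }
  ]
}
"""


@dataclass
class Span:
    text: str
    start: int
    end: int
    label: str  # event_type for a trigger, role for an argument


def iter_spans(instance: dict):
    """Yield the trigger and every argument of each event as a Span."""
    for event in instance["event_list"]:
        trig = event["trigger"]
        yield Span(trig["text"], trig["start"], trig["end"], trig["event_type"])
        for arg in event["arguments"]:
            yield Span(arg["text"], arg["start"], arg["end"], arg["role"])


def span_matches(doc_text: str, span: Span) -> bool:
    """Return True if the offsets reproduce the span text.

    Tries end-exclusive slicing first, then end-inclusive, because the
    dataset description does not state which convention is used.
    """
    return (doc_text[span.start:span.end] == span.text
            or doc_text[span.start:span.end + 1] == span.text)


def misaligned_spans(instance: dict) -> list:
    """Collect spans whose offsets do not match their text."""
    return [s for s in iter_spans(instance)
            if not span_matches(instance["text"], s)]


sample = json.loads(SAMPLE_JSON)
print(misaligned_spans(sample))
```

To run the same check on the real files, one would load train.json, dev.json, or test.json and call misaligned_spans on each instance; whether each file holds a single JSON array or one JSON object per line should be verified against the download first.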
Provider:
Science Data Bank
Created:
2026-01-15



