
GSAP-ERE

Source: DataCite Commons. Updated: 2025-12-17. Indexed: 2026-02-07.
Download link:
https://berd-platform.de/doi/10.60914/c4c1d-s0587
Resource description:
GSAP-ERE Dataset

Introduction
GSAP-ERE is a dataset for training and evaluating models for entity and relation extraction of machine-learning-related entities in scholarly publications (e.g., research papers). More information on the GSAP project is available at data.gesis.org/gsap.

Data Citation
Please reference: Wolfgang Otto, Lu Gan, Sharmila Upadhyaya, Saurav Karmakar, Stefan Dietze (2026). GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning. AAAI2026.

Version Information
The annotation was finished on 15 April 2025 and can be used to reproduce the results in the connected publication, Otto et al. 2026 (cited above).

Train/Dev/Test Split
The dataset was partitioned into training, validation, and test sets with an 80% / 10% / 10% split, respectively. All data points from a single publication remain within a single set to prevent data leakage.

Label Sets
Our 10 named entity labels in 4 semantic groups:
  Method related: MLModel, MLModelGeneric, ModelArchitecture, Method
  Data related: Dataset, DatasetGeneric, DataSource
  Task related: Task
  Referencing: ReferenceLink, URL

Our 18 relation labels (incl. domain and range) in 7 semantic groups:
  Model Design:
    Method -usedFor-> Method|MLModel(Generic)
    MLModel(Generic)|Method -architecture-> ModelArchitecture
    MLModel(Generic) -isBasedOn-> MLModel(Generic)
  Task Binding:
    MLModel(Generic)|Method -appliedTo-> Task
    Dataset(Generic) -benchmarkFor-> Task
  Data Usage:
    MLModel(Generic)|Method -trainedOn-> Dataset(Generic)
    MLModel(Generic)|Method -evaluatedOn-> Dataset(Generic)
  Data Provenance:
    Dataset(Generic) -transformedFrom-> Dataset(Generic)
    Dataset(Generic) -generatedBy-> Method
    Dataset(Generic) -sourcedFrom-> DataSource
  Data Properties:
    Dataset(Generic) -size-> DatasetGeneric
    Dataset(Generic) -hasInstanceType-> DatasetGeneric
  Peer Relations:
    <Any> -coreference-> <Same as Subject>
    <Any> -isPartOf-> <Same as Subject>
    <Any> -isHyponymOf-> <Same as Subject>
    <Any> -isComparedTo-> <Same as Subject>
  Referencing:
    <Any> -citation-> ReferenceLink
    <Any> -url-> URL

Format
The files are encoded in the jsonl format, where each line is the valid JSON of one publication.

Data fields for each document
The data format of the jsonl files is compatible with many works in the field of entity and relation extraction (e.g., HGERE). Each line of a jsonl file represents one document containing the following fields:
  sentences: A list of sentences, each represented by a list of token ids (`[[<sentence_1_token_1_id>, <sentence_1_token_2_id>, ...], [<sentence_2_token_1_id>, ...], ...]`). Resolve the word ids with the vocabulary given in our GitHub project GSAP-ERE.
  ner: For each sentence, a list of named entities. Each entity is a list of three elements: begin of entity, end of entity, and label (e.g., `[[<begin_idx>, <end_idx>, "MLModel"], ...]`). This includes stacked (i.e., overlapping) annotations.
  relations: For each sentence, a list of relations. Each relation is represented by the begin and end of the subject, the begin and end of the object, and the relation label (e.g., `[<begin_idx_subject>, <end_idx_subject>, <begin_idx_object>, <end_idx_object>, "isPartOf"]`).
  clusters: This field exists for compatibility reasons. In this version no reference clusters are annotated. This will be reflected in future versions of the dataset.
  doc_id: A unique identifier for each document.
  annotator: Id of the initial annotator of the document (0 or 1). During the refinement process, other annotators might have corrected some of the annotations.
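The field layout described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own loader: the function names and the sample record are made up for illustration, and it assumes that begin and end indices are document-level token offsets with an inclusive end, as is common in similar jsonl formats. The description does not state this, so check it against the real files. Token ids are left unresolved because the vocabulary lives in the GSAP-ERE GitHub project.

```python
import json


def iter_documents(lines):
    """Yield one parsed document per non-empty jsonl line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def flatten_tokens(doc):
    """Concatenate all sentences into one document-level list of token ids."""
    return [tok for sent in doc["sentences"] for tok in sent]


def entity_spans(doc, inclusive_end=True):
    """Yield (sentence_index, label, token_ids) for every entity.

    ASSUMPTION: indices are document-level and the end index is inclusive.
    Pass inclusive_end=False if the real files use exclusive ends.
    """
    tokens = flatten_tokens(doc)
    for sent_idx, sent_entities in enumerate(doc["ner"]):
        for begin, end, label in sent_entities:
            stop = end + 1 if inclusive_end else end
            yield sent_idx, label, tokens[begin:stop]


def relation_triples(doc):
    """Yield (sentence_index, subject_span, label, object_span) per relation."""
    for sent_idx, sent_relations in enumerate(doc["relations"]):
        for s_begin, s_end, o_begin, o_end, label in sent_relations:
            yield sent_idx, (s_begin, s_end), label, (o_begin, o_end)


# Hypothetical record with invented token ids; not taken from the dataset.
sample_line = json.dumps({
    "doc_id": "example-doc",
    "sentences": [[11, 12, 13, 14], [15, 16, 17]],
    "ner": [[[0, 0, "MLModel"], [3, 3, "Dataset"]], [[4, 4, "MLModelGeneric"]]],
    "relations": [[[0, 0, 3, 3, "trainedOn"]], []],
    "clusters": [],
    "annotator": 0,
})

docs = list(iter_documents([sample_line, ""]))
entities = list(entity_spans(docs[0]))
relations = list(relation_triples(docs[0]))
```

With a real file, `iter_documents(open(path, encoding="utf-8"))` would replace the in-memory list, where `path` points to one of the downloaded jsonl files. Because `ner` may contain stacked annotations, the same token range can appear under more than one label.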
Provider:
BERD@NFDI
Created:
2025-12-17