CO-Fun: A German Dataset on Company Outsourcing in Fund Prospectuses for Named Entity Recognition and Relation Extraction
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12745115
Description:
Cyber mapping gives insight into the relationships among financial entities and service providers. Centered on the outsourcing practices of companies as described in fund prospectuses in Germany, we introduce a dataset specifically designed for named entity recognition and relation extraction tasks. Three experts labeled 948 sentences, yielding 5,969 annotations for four entity types (Outsourcing, Company, Location and Software) and 4,102 relation annotations (Outsourcing–Company, Company–Location). Furthermore, state-of-the-art deep learning models were trained on this dataset to recognize entities and extract relations. This repository contains the anonymized version of the dataset, along with the annotation guidelines and the code used for model training.
The content of each file is explained below:
The file CO-Fun-1.0-anonymized.jsonl contains the raw data of CO-Fun as records formatted in JSON. Each entry contains the annotated text in HTML form, with the annotation for each named entity marked by span tags. Below is an example of an entry in the raw data:
{ "datetime": "2023-05-04T14:15:54.501875783", "entities": [ { "color": "rgb(255, 0, 0)", "text": "Ermittlung der täglichen und jährlichen Steuerdaten", "id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51", "type": "Auslagerung" }, { "color": "rgb(0, 0, 255)", "text": "tba - the beauty aside GmbH", "id": "fad78727-1645-4b39-9478-daecb3b4bd2b", "type": "Unternehmen" } ], "text": " • Die Ermittlung der täglichen und jährlichen Steuerdaten für die Fonds wurde auf die tba - the beauty aside GmbH ausgelagert. ", "relations": [ { "src": { "color": "rgb(255, 0, 0)", "text": "Ermittlung der täglichen und jährlichen Steuerdaten", "id": "255c1d4a-d9b0-4fff-8779-6a68f803ce51", "type": "Auslagerung" }, "trg": { "color": "rgb(0, 0, 255)", "text": "tba - the beauty aside GmbH", "id": "fad78727-1645-4b39-9478-daecb3b4bd2b", "type": "Unternehmen" }, "type": "Auslagerung-Unternehmen" } ]}
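As a minimal sketch of how such a record might be consumed, the Python snippet below parses one JSONL line into entity spans and relation triples. It assumes every record carries the "text", "entities" and "relations" fields seen in the example entry. The helper names parse_record and read_jsonl are hypothetical and are not part of the published code. Because the example entry shows no character offsets, the sketch locates each entity by its first occurrence in the sentence, which is also an assumption and can pick the wrong position when an entity string repeats.

```python
import json

# Abridged copy of the example entry (ids, colors and datetime omitted),
# serialized as a single JSONL line for demonstration.
SAMPLE_LINE = json.dumps({
    "entities": [
        {"text": "Ermittlung der täglichen und jährlichen Steuerdaten",
         "type": "Auslagerung"},
        {"text": "tba - the beauty aside GmbH", "type": "Unternehmen"},
    ],
    "text": " • Die Ermittlung der täglichen und jährlichen Steuerdaten "
            "für die Fonds wurde auf die tba - the beauty aside GmbH "
            "ausgelagert. ",
    "relations": [
        {"src": {"text": "Ermittlung der täglichen und jährlichen Steuerdaten",
                 "type": "Auslagerung"},
         "trg": {"text": "tba - the beauty aside GmbH",
                 "type": "Unternehmen"},
         "type": "Auslagerung-Unternehmen"},
    ],
}, ensure_ascii=False)


def parse_record(line):
    """Parse one JSONL line into (text, entity spans, relation triples)."""
    record = json.loads(line)
    text = record["text"]
    spans = []
    for ent in record["entities"]:
        # Assumption: no offsets are stored, so use the first occurrence
        # of the entity string; start and end are -1 if it is not found.
        start = text.find(ent["text"])
        end = start + len(ent["text"]) if start >= 0 else -1
        spans.append((ent["text"], ent["type"], start, end))
    triples = [
        (rel["src"]["text"], rel["type"], rel["trg"]["text"])
        for rel in record["relations"]
    ]
    return text, spans, triples


def read_jsonl(path):
    """Yield parsed records from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)


text, spans, triples = parse_record(SAMPLE_LINE)
for surface, label, start, end in spans:
    print(label, start, end, surface)
for source, relation, target in triples:
    print(source, "--", relation, "->", target)
```

Under the same assumptions, the full file could be read with read_jsonl("CO-Fun-1.0-anonymized.jsonl").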
CO-Fun_Annotation-Guideline-EN.pdf is a graphical user interface in German to annotate a sentence with named entities and relations.
The prepared-data-and-code folder consists of datasets and Python code files for the Named Entity Recognition (NER) and Relation Extraction tasks. It provides the training, development and test sets in text format for the CRF model, as well as in text and spaCy formats for the BERT and RoBERTa models.
Created:
2024-09-12



