DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect

Name: DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect
Creator: doi.org
Published: 2023-05-02 00:00:00
License: No description available

doi.org, updated 2023-05-02, indexed 2025-03-23

Download link:

http://doi.org/10.17632/286sss4k9v.4

Description:

DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan dialect of Arabic, known as Darija. The corpus contains more than 65K tokens, 13.8% of which belong to named entities. Each named entity is annotated with one of four tags, using the BIO tagging scheme: person (PER), location (LOC), organization (ORG), or miscellaneous (MISC). The distribution of named entities across these tags is: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%). The corpus is provided in the Data folder and is split into two sets: DarNERcorp_train, which holds 80% of the data, and DarNERcorp_test, which holds the remaining 20%. In addition to the data, the Python scripts used for data collection and formatting are provided in the Code folder.
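As a minimal sketch of how BIO tags map to entities, the function below groups a tag sequence into labeled spans. It assumes tags are written as `B-PER`, `I-PER`, `O`, and so on; the exact tag strings and file layout used in the corpus are not stated here, so that spelling is an assumption. The function name `bio_to_spans` and the sample tag list are hypothetical and are not taken from the dataset.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group a BIO tag sequence into (label, start, end) spans.

    `end` is exclusive. Assumes tags look like "B-PER", "I-PER", "O"
    (an assumption about the tag spelling, not confirmed by the corpus).
    """
    spans: List[Tuple[str, int, int]] = []
    label = None  # label of the entity currently being read, if any
    start = None  # index where that entity began

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if label is not None:
                spans.append((label, start, i))
            label, start = tag[2:], i
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the current entity.
            continue
        else:
            # "O", or an I- tag that does not continue the open entity.
            if label is not None:
                spans.append((label, start, i))
            if tag.startswith("I-"):
                # Stray I- tag: treat it leniently as the start of an entity.
                label, start = tag[2:], i
            else:
                label, start = None, None

    # Close an entity that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tag sequence for one sentence (not real corpus data).
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
example_spans = bio_to_spans(example_tags)
```

Counting the labels of the returned spans over the whole corpus would be one way to check a per-tag distribution like the one reported above, provided the files are first read into per-sentence tag lists.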

Provider:

doi.org