DarNERcorp: a Named Entity Recognition Corpus in the Moroccan Dialect

doi.org · Updated 2023-05-02 · Indexed 2025-03-23
Download link:
http://doi.org/10.17632/286sss4k9v.4
Description:
DarNERcorp is a manually annotated corpus for Named Entity Recognition (NER) in the Moroccan Dialect or Darija. The corpus contains more than 65K tokens, 13.8% of which are named entities. Named entities in the dataset are annotated with one of the following tags, using the BIO tagging scheme: person (PER), location (LOC), organization (ORG), miscellaneous (MISC). The distribution of named entities in the dataset is as follows: PER (15.3%), LOC (38.1%), ORG (15.5%), MISC (31.1%). The corpus is presented in the Data folder and it is split into two sets: DarNERcorp_train and DarNERcorp_test. The first set represents 80% of the data and the second represents 20%. In addition to the data, the Python scripts used in the collection and data formatting are provided in the Code folder.
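To illustrate the BIO tagging scheme mentioned above, here is a minimal sketch that groups tagged tokens into entity spans. It assumes tags of the form `B-PER`, `I-LOC`, `O`, and so on; the function name `bio_to_spans` and the example sentence are hypothetical and not taken from the corpus, and the actual file layout of DarNERcorp_train and DarNERcorp_test is not described here, so loading the files is left out.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs.

    Assumes tags look like "B-PER", "I-PER", or "O" (an assumption about
    the tag spelling, not a documented property of the corpus files).
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-" always opens a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # "I-" of the same type continues the open entity.
            current_tokens.append(token)
        elif tag.startswith("I-"):
            # Stray "I-" with no matching open entity: leniently start a new one.
            flush()
            current_type, current_tokens = tag[2:], [token]
        else:
            # "O" (or anything else) ends any open entity.
            flush()
    flush()
    return spans


# Hypothetical example sentence, for illustration only.
example_tokens = ["Ahmed", "mcha", "l", "Dar", "Beida"]
example_tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(example_tokens, example_tags))
# [('PER', 'Ahmed'), ('LOC', 'Dar Beida')]
```

Treating a stray `I-` tag as the start of a new entity is a lenient design choice; a stricter reader could instead raise an error to surface annotation inconsistencies.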

Provided by:
doi.org