Wojood
Source: arXiv · Updated: 2022-05-23 · Indexed: 2024-06-21
Download link:
https://ontology.birzeit.edu/wojood
Description:
Wojood is an Arabic nested named entity corpus developed by Birzeit University. It contains approximately 550,000 tokens of Modern Standard Arabic (MSA) and dialectal Arabic, manually annotated with 21 entity types. Unlike most corpora, which use flat annotation, Wojood annotates nested entities, with up to four levels of nesting. The corpus contains around 75,000 entities, 22.5% of which are nested. It was validated with a pre-trained Arabic BERT (AraBERT) model trained via multi-task learning, which reached an overall micro-F1 score of 0.884 on the corpus. Wojood covers multiple domains and is intended to address the challenges of nested named entity recognition in Arabic.
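To make the terms "nested entity" and "micro-F1" in the description concrete, here is a minimal Python sketch. It assumes entities are stored as `(start, end, label)` token spans. That representation, the helper names `nested_spans` and `micro_f1`, and the example sentence are illustrative assumptions, not Wojood's actual file format or the authors' evaluation code.

```python
from typing import List, Set, Tuple

# Hypothetical span representation: (start token, end token exclusive, label).
Span = Tuple[int, int, str]


def nested_spans(spans: Set[Span]) -> List[Span]:
    """Return the spans that lie inside some other span of the same sentence."""
    return sorted(
        inner
        for inner in spans
        if any(
            outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]
            for outer in spans
        )
    )


def micro_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1: pool exact span matches over all sentences."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence: an organization name that contains a place name.
tokens = ["Birzeit", "University", "in", "Palestine"]
gold_sentence = {(0, 2, "ORG"), (0, 1, "GPE"), (3, 4, "GPE")}
pred_sentence = {(0, 2, "ORG"), (3, 4, "GPE")}  # a flat tagger misses the inner span

print(nested_spans(gold_sentence))
print(micro_f1([gold_sentence], [pred_sentence]))
```

In this toy sentence a flat tagger can recover only the outer `ORG` span, so the inner `GPE` counts as a miss. That is the gap nested annotation is meant to close.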
Provider:
Birzeit University
Created:
2022-05-20



