NEMO-Corpus (NEMO Hebrew NER and Morphology Corpus)
Source: OpenDataLab. Updated 2026-05-31; indexed 2024-05-09.
Download link: https://opendatalab.org.cn/OpenDataLab/NEMO-Corpus
Description:
Named Entity Recognition (NER) annotations for the Hebrew Treebank (Haaretz) corpus, including morpheme-level and token-level NER labels, nested mentions, and more.
We released the NEMO corpus with our TACL paper *Neural Modeling for Named Entities and Morphology (NEMO^2)* [1], where we use it in extensive experiments and analyses to show the importance of morphological boundaries for neural modeling of NER in morphologically rich languages. Code for these models and experiments is available in the NEMO code repository.
## Key Features
- Morpheme, token-single, and token-multi sequence labels. Morpheme labels provide exact entity boundaries; token-multi labels carry partial sub-word morphological information but no exact boundaries; token-single labels provide token-level information only.
- All annotations use the BIOSE format (B=Begin, I=Inside, O=Outside, S=Singleton, E=End).
- Entity types follow the widely used OntoNotes category set: GPE (Geopolitical Entity), PER (Person), LOC (Location), ORG (Organization), FAC (Facility), EVE (Event), WOA (Work of Art), ANG (Language), DUC (Product).
- NEMO includes NER annotations for the two major versions of the Hebrew Treebank, UD (Universal Dependencies) and SPMRL. These can be aligned with the treebank's other morphosyntactic information layers using bclm.
- Nested mentions are provided. Only the first (widest) layer was used in the NEMO^2 paper; modeling the nested layers is left open, and we invite you to take on this challenge!
- The guidelines used for annotation are provided alongside the corpus.
- The corpus was annotated by two native Hebrew speakers with academic education and curated by the project manager. We also provide the annotators' original annotations to promote work on learning with disagreements.
- Annotation was performed using WebAnno (version 3.4.5).
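
To make the BIOSE scheme concrete, here is a minimal sketch of turning a label sequence into mention spans. It assumes labels are written as `PREFIX-TYPE` strings such as `B-ORG` (a common convention, not confirmed by this page), and `decode_biose` is a hypothetical helper, not part of the NEMO code repository. Malformed sequences are dropped rather than repaired.

```python
from typing import List, Optional, Tuple

# (start index, end index inclusive, entity type)
Span = Tuple[int, int, str]


def decode_biose(labels: List[str]) -> List[Span]:
    """Collect mention spans from a BIOSE label sequence.

    Assumes labels look like "B-ORG" or "O"; this format is an assumption.
    Mentions that are opened but never closed with E- are discarded.
    """
    spans: List[Span] = []
    start: Optional[int] = None      # where the currently open mention began
    open_type: Optional[str] = None  # entity type of the open mention
    for i, label in enumerate(labels):
        if label == "O":
            start, open_type = None, None
            continue
        prefix, _, etype = label.partition("-")
        if prefix == "S":            # single-unit mention
            spans.append((i, i, etype))
            start, open_type = None, None
        elif prefix == "B":          # open a new multi-unit mention
            start, open_type = i, etype
        elif prefix == "I":          # continue only if the type matches
            if open_type != etype:
                start, open_type = None, None
        elif prefix == "E":          # close the mention if one is open
            if start is not None and open_type == etype:
                spans.append((start, i, etype))
            start, open_type = None, None
    return spans


example = ["O", "B-ORG", "I-ORG", "E-ORG", "O", "S-PER", "B-GPE", "E-GPE"]
print(decode_biose(example))
# [(1, 3, 'ORG'), (5, 5, 'PER'), (6, 7, 'GPE')]
```

The same function applies to morpheme-level and token-single sequences, since both carry one label per unit; only the unit that an index refers to changes.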
## Basic Corpus Statistics
| Split | Sentences | Tokens | Morphemes | Total Mentions |
|-------------|-----------|---------|-----------|----------------|
| Train | 4,937 | 93,504 | 127,031 | 6,282 |
| Dev | 500 | 8,531 | 11,301 | 499 |
| Test | 706 | 12,619 | 16,828 | 932 |
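
The counts above can be turned into a few density figures per split. The sketch below copies the numbers from the table; the names `STATS` and `summarize` are illustrative only. Across all three splits the ratio comes out at roughly 1.3 to 1.4 morphemes per token.

```python
# Counts copied from the Basic Corpus Statistics table.
STATS = {
    "train": {"sentences": 4937, "tokens": 93504, "morphemes": 127031, "mentions": 6282},
    "dev": {"sentences": 500, "tokens": 8531, "morphemes": 11301, "mentions": 499},
    "test": {"sentences": 706, "tokens": 12619, "morphemes": 16828, "mentions": 932},
}


def summarize(split: str) -> dict:
    """Return simple density ratios for one split."""
    s = STATS[split]
    return {
        "tokens_per_sentence": s["tokens"] / s["sentences"],
        "morphemes_per_token": s["morphemes"] / s["tokens"],
        "mentions_per_sentence": s["mentions"] / s["sentences"],
    }


for split in STATS:
    rounded = {name: round(value, 2) for name, value in summarize(split).items()}
    print(split, rounded)
```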
### Entity Type Counts
| Entity Type | Train | Dev | Test |
|---------------------------|-------|-----|------|
| Person (PER) | 2,128 | 193 | 267 |
| Organization (ORG) | 2,043 | 119 | 408 |
| Geopolitical Entity (GPE) | 1,377 | 121 | 195 |
| Location (LOC) | 331 | 28 | 41 |
| Facility (FAC) | 163 | 12 | 11 |
| Work of Art (WOA) | 114 | 9 | 6 |
| Event (EVE) | 57 | 12 | 0 |
| Product (DUC) | 36 | 2 | 3 |
| Language (ANG) | 33 | 3 | 1 |
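
As a consistency check, the per-type counts should add up to the Total Mentions column of the split table. The sketch below copies both sets of numbers from the tables; `TYPE_COUNTS`, `totals_by_split`, and `type_share` are illustrative names.

```python
SPLITS = ("train", "dev", "test")

# Total mentions per split, from the Basic Corpus Statistics table.
TOTAL_MENTIONS = {"train": 6282, "dev": 499, "test": 932}

# Per-type counts in (train, dev, test) order, from the Entity Type Counts table.
TYPE_COUNTS = {
    "PER": (2128, 193, 267),
    "ORG": (2043, 119, 408),
    "GPE": (1377, 121, 195),
    "LOC": (331, 28, 41),
    "FAC": (163, 12, 11),
    "WOA": (114, 9, 6),
    "EVE": (57, 12, 0),
    "DUC": (36, 2, 3),
    "ANG": (33, 3, 1),
}


def totals_by_split() -> dict:
    """Sum the per-type counts for each split."""
    return {
        split: sum(counts[i] for counts in TYPE_COUNTS.values())
        for i, split in enumerate(SPLITS)
    }


def type_share(split: str) -> dict:
    """Fraction of a split's mentions that belong to each entity type."""
    i = SPLITS.index(split)
    total = totals_by_split()[split]
    return {etype: counts[i] / total for etype, counts in TYPE_COUNTS.items()}


assert totals_by_split() == TOTAL_MENTIONS  # the two tables agree
for split in SPLITS:
    share = type_share(split)
    top3 = share["PER"] + share["ORG"] + share["GPE"]
    print(f"{split}: PER+ORG+GPE share = {top3:.1%}")
```

The distribution is heavily skewed: PER, ORG, and GPE make up well over 80 percent of mentions in every split, and EVE has no test mentions at all, so a per-type test score for EVE is undefined.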
## Evaluation
Evaluation scripts and instructions for running them are provided in the NEMO code repository.
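
For orientation only, the following is a generic sketch of exact-match, mention-level precision, recall, and F1. It is not the repository's script, and whether that script uses exactly this matching criterion is an assumption; use the official scripts for reportable numbers. The `Mention` tuple layout and the function `mention_prf` are hypothetical.

```python
from typing import Set, Tuple

# (sentence id, start index, end index inclusive, entity type)
Mention = Tuple[int, int, int, str]


def mention_prf(gold: Set[Mention], pred: Set[Mention]) -> Tuple[float, float, float]:
    """Exact-match scoring: a prediction counts only if span and type both match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(0, 1, 3, "ORG"), (0, 5, 5, "PER"), (1, 0, 1, "GPE")}
pred = {(0, 1, 3, "ORG"), (0, 5, 5, "GPE"), (1, 0, 1, "GPE"), (1, 4, 4, "PER")}

p, r, f1 = mention_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```

In the example, the prediction with the right span but the wrong type, `(0, 5, 5, "GPE")`, counts as both a false positive and a missed gold mention.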
## Citation
```bibtex
@article{10.1162/tacl_a_00404,
  author = {Bareket, Dan and Tsarfaty, Reut},
  title = {Neural Modeling for Named Entities and Morphology (NEMO2)},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {9},
  pages = {909-928},
  year = {2021},
  month = {09},
  abstract = {Named Entity Recognition (NER) is a fundamental NLP task, typically formulated as classifying a sequence of tokens. Morphologically Rich Languages (MRLs) pose a challenge to this standard formulation, as named entity boundaries do not necessarily coincide with token boundaries; instead, they respect morphological boundaries. To address NER in MRLs, we need to answer two fundamental questions: what is the basic unit to label, and how to detect and classify these units in realistic settings (i.e., where no gold morphology is available). We conduct an empirical study of these questions on a new NER benchmark with parallel token-level and morpheme-level NER annotations, which we developed for Modern Hebrew, a morphologically rich and ambiguous language. Our results show that explicitly modeling morphological boundaries improves NER performance, and that a novel hybrid architecture, where NER precedes and prunes morphological decomposition, substantially outperforms the standard pipeline where morphological decomposition strictly precedes NER, setting new performance benchmarks for both Hebrew NER and Hebrew morphological decomposition tasks.},
  issn = {2307-387X},
  doi = {10.1162/tacl_a_00404},
  url = {https://doi.org/10.1162/tacl_a_00404},
  eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00404/1962472/tacl_a_00404.pdf},
}
```
Provider: OpenDataLab
Created: 2022-06-28
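## Dataset Introduction
### Background Overview
NEMO-Corpus is a Hebrew named entity recognition and morphology corpus built on the Hebrew Treebank. It contains morpheme-level and token-level NER labels as well as nested mentions, and uses the BIOSE format and the OntoNotes entity category set. The corpus was annotated by native speakers, comes with train, dev, and test splits, and is intended for research on neural modeling of named entity boundaries in morphologically rich languages. The accompanying paper proposes a hybrid architecture that improves both NER and morphological decomposition performance.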



