Annotated GMB Corpus
Source: www.kaggle.com · Updated: 2018-10-07 · Indexed: 2025-03-24
Download link:
https://www.kaggle.com/shoumikgoswami/annotated-gmb-corpus
Description:
### Context
An annotated corpus for Named Entity Recognition, extracted from the GMB (Groningen Meaning Bank) corpus and enriched with commonly used Natural Language Processing features for entity classification.
### Content
The dataset is an extract from the GMB corpus that is tagged, annotated, and built specifically to train a classifier to predict named entities such as names and locations. GMB is a fairly large corpus with many annotations. However, it is not a gold-standard corpus: it is not completely human annotated and is not considered 100% correct. The corpus was created using existing annotators and then corrected by humans where needed.
The attached dataset is in tab-separated format. The goal is to build a good model that classifies the Tag column. The data is labelled using the IOB tagging system.
The dataset contains the following classes:
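As a minimal sketch of reading tab-separated data of this kind, the snippet below parses an in-memory sample with Python's standard `csv` module. The `Tag` column is named in the description; the `Word` column name, the header row, and the sample rows are assumptions made for illustration and may not match the actual file.

```python
import csv
import io

# Hypothetical sample in the assumed layout: a header row, then one token per line.
SAMPLE_TSV = (
    "Word\tTag\n"
    "London\tB-geo\n"
    "is\tO\n"
    "in\tO\n"
    "United\tB-geo\n"
    "Kingdom\tI-geo\n"
)


def read_tokens(handle, word_col="Word", tag_col="Tag"):
    """Return a list of (word, tag) pairs from a tab-separated file object.

    The column names are assumptions; adjust them to the real header.
    """
    # QUOTE_NONE keeps literal quote characters in tokens from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[word_col], row[tag_col]) for row in reader]


tokens = read_tokens(io.StringIO(SAMPLE_TSV))
```

To read the downloaded file instead, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was saved.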
geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon
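To illustrate how IOB labels combine with the classes above, here is a hedged sketch that groups tagged tokens into entity spans. It assumes tags are written as `B-<class>`, `I-<class>`, or `O` (for example `B-geo`), which is the usual IOB convention but is not spelled out in the description. The function name `iob_to_spans` and the sample sentence are hypothetical.

```python
# Class abbreviations as listed in the dataset description.
ENTITY_NAMES = {
    "geo": "Geographical Entity",
    "org": "Organization",
    "per": "Person",
    "gpe": "Geopolitical Entity",
    "tim": "Time indicator",
    "art": "Artifact",
    "eve": "Event",
    "nat": "Natural Phenomenon",
}


def iob_to_spans(tokens):
    """Group (word, tag) pairs into (entity_text, entity_class) spans.

    Assumes tags look like "B-geo", "I-geo", or "O".
    """
    spans = []
    current_words, current_type = [], None
    for word, tag in tokens:
        prefix, _, etype = tag.partition("-")
        # "B" always starts a new entity; an "I" whose class differs from
        # the open entity is treated as a new start as well.
        if prefix == "B" or (prefix == "I" and etype != current_type):
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], etype
        elif prefix == "I":
            current_words.append(word)
        else:
            # "O" (outside) closes any open entity.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


example = [
    ("Barack", "B-per"),
    ("Obama", "I-per"),
    ("visited", "O"),
    ("Paris", "B-geo"),
]
spans = iob_to_spans(example)
```

Each span's class can then be looked up in `ENTITY_NAMES` to get the readable label.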
### Acknowledgements
The dataset is a subset of the original dataset shared here:
https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/kernels
### Inspiration
The data can be used by anyone who is starting out with NER in NLP.
Provider:
www.kaggle.com



