Karavet/pioNER-Armenian-Named-Entity

Name: Karavet/pioNER-Armenian-Named-Entity
Creator: Karavet
Published: 2022-10-21 16:07:06
License: no description provided

Hugging Face; updated 2022-10-21; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/Karavet/pioNER-Armenian-Named-Entity

Resource description:

---
language: [hy]
task_categories: [named-entity-recognition]
multilinguality: [monolingual]
task_ids: [named-entity-recognition]
license: [apache-2.0]
---

## Table of Contents
- [Table of Contents](#table-of-contents)
- [pioNER - named entity annotated datasets](#pioNER---named-entity-annotated-datasets)
- [Silver-standard dataset](#silver-standard-dataset)
- [Gold-standard dataset](#gold-standard-dataset)

# pioNER - named entity annotated datasets

The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language. Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and an encyclopedia.

## Silver-standard dataset

The generated corpus is automatically extracted and annotated using Armenian Wikipedia. We used a modification of the [Nothman et al](https://www.researchgate.net/publication/256660013_Learning_multilingual_named_entity_recognition_from_Wikipedia) and [Sysoev and Andrianov](http://www.dialog-21.ru/media/3433/sysoevaaandrianovia.pdf) approaches to create this corpus. This approach uses links between Wikipedia articles to extract fragments of named-entity annotated texts. The corpus is split into train and development sets.

*Table 1. Statistics for pioNER train, development and test sets*

| dataset | #tokens | #sents | annotation | texts' source |
|-------------|:--------:|:-----:|:--------:|:-----:|
| train | 130719 | 5964 | automatic | Wikipedia |
| dev | 32528 | 1491 | automatic | Wikipedia |
| test | 53606 | 2529 | manual | iLur.am |

## Gold-standard dataset

This dataset is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size with the test sets of other languages (Table 2). We intend it to serve as a benchmark for future named entity recognition systems designed for the Armenian language.

The dataset contains annotations for 3 popular named entity classes: people (PER), organizations (ORG), and locations (LOC), and is released in CoNLL03 format with the IOB tagging scheme. During annotation, we generally relied on the categories and [guidelines assembled by BBN](https://catalog.ldc.upenn.edu/docs/LDC2005T33/BBN-Types-Subtypes.html) Technologies for the TREC 2002 question answering track. Tokens and sentences were segmented according to the UD standards for the Armenian language from the [ArmTreebank project](http://armtreebank.yerevann.com/tokenization/process/).

*Table 2. Comparison of pioNER gold-standard test set with test sets for English, Russian, Spanish and German*

| test dataset | #tokens | #LOC | #ORG | #PER |
|-------------|:--------:|:-----:|:--------:|:-----:|
| Armenian pioNER | 53606 | 1312 | 1338 | 1274 |
| Russian factRuEval-2016 | 59382 | 1239 | 1595 | 1353 |
| German CoNLL03 | 51943 | 1035 | 773 | 1195 |
| Spanish CoNLL02 | 51533 | 1084 | 1400 | 735 |
| English CoNLL03 | 46453 | 1668 | 1661 | 1671 |

Provider:

Karavet

Summary of original information

pioNER - named entity annotated datasets

Silver-standard dataset

Source: Automatically extracted and annotated from Armenian Wikipedia.
Method: Uses a modification of the approaches of Nothman et al. and of Sysoev and Andrianov, following links between Wikipedia articles to extract fragments of named-entity annotated text.
Corpus Split:
- Train: 130,719 tokens, 5,964 sentences, automatic annotation.
- Dev: 32,528 tokens, 1,491 sentences, automatic annotation.
- Test: 53,606 tokens, 2,529 sentences, manual annotation, texts from iLur.am (this is the gold-standard set described below).
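The split sizes above can be sanity-checked with a little arithmetic. The following is a minimal sketch that only uses the figures from Table 1; the helper names `tokens_per_sentence` and `silver_train_share` are illustrative and not part of the dataset release.

```python
# Figures copied from Table 1 of the dataset card: (tokens, sentences).
SPLITS = {
    "train": (130719, 5964),
    "dev": (32528, 1491),
    "test": (53606, 2529),
}


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length in tokens."""
    return tokens / sentences


def silver_train_share(splits: dict) -> float:
    """Fraction of silver-standard sentences (train + dev) that fall in train."""
    train_sentences = splits["train"][1]
    dev_sentences = splits["dev"][1]
    return train_sentences / (train_sentences + dev_sentences)


for name, (tokens, sentences) in SPLITS.items():
    print(f"{name}: {tokens_per_sentence(tokens, sentences):.1f} tokens/sentence")
print(f"train share of silver sentences: {silver_train_share(SPLITS):.2f}")
```

With these numbers the train share comes out at 0.80, so the silver-standard corpus is split 80/20 by sentence count, and all three sets average roughly 21 to 22 tokens per sentence.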

Gold-standard dataset

Source: Collection of over 250 news articles from iLur.am with manual named-entity annotation.
Content: Includes sentences from political, sports, local, and world news.
Annotation: Contains annotations for people (PER), organizations (ORG), and locations (LOC) in CoNLL03 format with IOB tagging scheme.
Segmentation: Tokens and sentences segmented according to UD standards for the Armenian language from the ArmTreebank project.
Comparison:
- Armenian pioNER: 53,606 tokens, 1,312 LOC, 1,338 ORG, 1,274 PER.
- Russian factRuEval-2016: 59,382 tokens, 1,239 LOC, 1,595 ORG, 1,353 PER.
- German CoNLL03: 51,943 tokens, 1,035 LOC, 773 ORG, 1,195 PER.
- Spanish CoNLL02: 51,533 tokens, 1,084 LOC, 1,400 ORG, 735 PER.
- English CoNLL03: 46,453 tokens, 1,668 LOC, 1,661 ORG, 1,671 PER.
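Raw entity counts are hard to compare across test sets of different sizes, so one option is to normalize them per 1,000 tokens. The sketch below does this for the Table 2 figures; the function name `entities_per_1000_tokens` is an illustrative choice, and the normalization is an added convenience rather than something the dataset authors report.

```python
# Rows copied from Table 2: (tokens, LOC, ORG, PER).
TEST_SETS = {
    "Armenian pioNER": (53606, 1312, 1338, 1274),
    "Russian factRuEval-2016": (59382, 1239, 1595, 1353),
    "German CoNLL03": (51943, 1035, 773, 1195),
    "Spanish CoNLL02": (51533, 1084, 1400, 735),
    "English CoNLL03": (46453, 1668, 1661, 1671),
}


def entities_per_1000_tokens(row: tuple) -> float:
    """Total annotated entities (LOC + ORG + PER) per 1,000 tokens."""
    tokens, *entity_counts = row
    return 1000 * sum(entity_counts) / tokens


for name, row in TEST_SETS.items():
    print(f"{name}: {entities_per_1000_tokens(row):.1f} entities per 1,000 tokens")
```

By this measure the Armenian test set has about 73 entities per 1,000 tokens, which sits between the German and English CoNLL03 test sets.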

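Collected summary

Dataset introduction

Construction

In natural language processing, named entity recognition (NER) is one of the core tasks of information extraction. The Karavet/pioNER-Armenian-Named-Entity dataset is designed for Armenian and is built in two parts, a silver standard and a gold standard. The silver-standard dataset is derived from Armenian Wikipedia: it uses a modified version of the approaches of Nothman et al. and of Sysoev and Andrianov, following the hyperlinks between Wikipedia articles to extract and annotate named-entity fragments automatically. The gold-standard dataset contains over 250 manually annotated articles from the news site iLur.am, covering political, sports, local, and world news. Annotation generally followed the categories and guidelines that BBN Technologies assembled for the TREC 2002 question answering track, and the data is released in CoNLL03 format with the IOB tagging scheme. Tokens and sentences were segmented according to the UD standards for Armenian from the ArmTreebank project, which keeps segmentation consistent across the corpus.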
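To make the link-based idea concrete, the sketch below converts entity spans (for example, spans covered by a Wikipedia link whose target has a known entity type) into per-token IOB tags. It is an illustration under stated assumptions, not the authors' pipeline: the span format, the function name `spans_to_iob`, and the choice of the IOB2 variant (every entity starts with `B-`) are all assumptions, and the example sentence is an English stand-in.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type); hypothetical format.
Span = Tuple[int, int, str]


def spans_to_iob(num_tokens: int, spans: List[Span]) -> List[str]:
    """Turn non-overlapping entity spans into IOB2 tags, one tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in sorted(spans):
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span ({start}, {end}) is outside the sentence")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}) overlaps another span")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):     # remaining tokens of the entity
            tags[i] = f"I-{label}"
    return tags


tokens = ["Aram", "Khachaturian", "studied", "in", "Moscow", "."]
link_spans = [(0, 2, "PER"), (4, 5, "LOC")]
for token, tag in zip(tokens, spans_to_iob(len(tokens), link_spans)):
    print(token, tag)
```

In a real pipeline the hard part is deciding the entity type of each link target; here the types are simply given as input.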

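Features

The dataset's defining feature is its two-part design: an automatically generated silver-standard corpus plus a manually annotated gold-standard benchmark. The silver-standard corpus consists of a training set and a development set with 130,719 and 32,528 tokens respectively, drawn from Armenian Wikipedia text. The gold-standard test set contains 53,606 tokens annotated for three entity classes: people (PER), organizations (ORG), and locations (LOC). Its size is comparable to well-known NER test sets for English, Russian, German, and Spanish; the English CoNLL03 test set, for example, has 46,453 tokens. This design makes pioNER usable for model training and also as a standardized evaluation benchmark for Armenian NER systems. The release also includes 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on Armenian text from Wikipedia, news, blogs, and an encyclopedia.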
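The released GloVe vectors can be read with a few lines of code if they follow the usual GloVe text layout, where each line holds a word followed by its vector components separated by spaces. That layout is an assumption here, since the dataset card does not describe the file format; the function name `load_glove_text` and the tiny three-dimensional sample are purely illustrative.

```python
import io
from typing import Dict, List, TextIO


def load_glove_text(handle: TextIO, dim: int) -> Dict[str, List[float]]:
    """Read 'word v1 v2 ... v_dim' lines into a word-to-vector dictionary."""
    vectors: Dict[str, List[float]] = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        parts = line.split(" ")
        if len(parts) != dim + 1:
            raise ValueError(
                f"line {line_no}: expected {dim} values, got {len(parts) - 1}"
            )
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up 3-dimensional sample; the released vectors have 50 to 300 dimensions.
sample = io.StringIO("Երևան 0.1 -0.2 0.3\nև 0.0 0.5 -0.5\n")
vectors = load_glove_text(sample, dim=3)
print(len(vectors), "vectors loaded")
```

For a real file, the same function would be called with an open file handle (opened with `encoding="utf-8"`) and the matching dimensionality.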

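Usage

The annotated data is in CoNLL03 format and can be loaded directly for training and evaluating sequence-labeling NER models. The silver-standard training and development sets are suited to model pretraining or data augmentation, while the gold-standard test set is meant for performance evaluation. The data can be used with common NER tooling such as the Hugging Face Transformers library once the IOB labels are mapped to the model's label set. Two reasonable starting points are initializing the embedding layer with the released GloVe vectors and fine-tuning a pretrained language model. Before use, make sure that any new input is tokenized according to the same UD standards so that it stays consistent with the data. The dataset card declares the Apache-2.0 license, which permits both academic and commercial use.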
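As a starting point for loading the data, the sketch below parses CoNLL-style text into sentences of (token, tag) pairs. The exact column layout of the pioNER files is not given in the dataset card, so the parser assumes whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and optional `-DOCSTART-` lines; the function name `parse_conll` and the sample text are illustrative.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def parse_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text: one token per line, blank line between sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            if current:                      # blank line closes the sentence
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):    # document marker used in CoNLL03 files
            continue
        columns = line.split()
        if len(columns) < 2:
            raise ValueError(f"expected at least token and tag, got: {line!r}")
        current.append((columns[0], columns[-1]))
    if current:                              # file may not end with a blank line
        sentences.append(current)
    return sentences


sample = """\
Aram B-PER
Khachaturian I-PER
studied O
in O
Moscow B-LOC
. O

He O
composed O
ballets O
. O
"""
sentences = parse_conll(sample)
print(len(sentences), "sentences,", sum(len(s) for s in sentences), "tokens")
```

Taking the first and last columns keeps the parser usable whether or not the files carry extra columns such as part-of-speech tags.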

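Background and challenges

Background overview

Named entity recognition (NER), a core task in natural language processing, aims to identify entities with specific meaning, such as person, location, and organization names, in unstructured text. Compared with resource-rich languages such as English, NER research on low-resource languages such as Armenian has long been held back by the scarcity of annotated corpora. The pioNER dataset, published by Karavet, was released in recent years to fill this gap; this summary attributes its construction to researchers at a university in Yerevan, Armenia, and other institutions. Its core goal is to provide a standardized NER evaluation benchmark for Armenian. The dataset has two parts: a silver-standard corpus annotated automatically from Wikipedia links, and a gold-standard test set annotated manually on articles from the news site iLur.am, covering political, sports, and other news. Besides IOB annotation in CoNLL03 format, the release includes Armenian GloVe word embeddings. Together these resources support NER research for Armenian and give later systems a common reference point for comparing performance.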

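Current challenges

The challenges around pioNER fall into two groups. At the task level, Armenian is a morphologically rich, low-resource language: its inflection is complex, named-entity boundaries can be ambiguous, and support from large pretrained language models is limited, so traditional NER methods generalize poorly to it. At the construction level, the silver-standard corpus is generated automatically from Wikipedia links, and link noise and inconsistent annotation are difficult to eliminate completely, for example with entities that straddle categories (a name that can be read as either a location or an organization). The gold-standard set is manually annotated but covers only a little over 250 news articles, so its size is limited. Its entity classes are restricted to PER, ORG, and LOC and do not cover finer-grained types such as dates and quantities, which limits how models trained on it apply to more varied scenarios. In addition, the text comes from only two kinds of sources, news and Wikipedia, with no informal text such as dialogue or social media, which further restricts domain adaptability.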

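Common scenarios

Typical use case

In natural language processing, named entity recognition (NER) is one of the core tasks of information extraction. The Karavet/pioNER-Armenian-Named-Entity dataset is described here as the first publicly available named-entity annotated resource for Armenian, a low-resource language, and its typical use is training and evaluating Armenian NER models. It contains a silver-standard corpus generated automatically from Wikipedia (about 163,000 tokens across the train and development sets) and a manually annotated gold-standard test set (about 54,000 tokens). Both cover three entity types (people, organizations, and locations) and use the CoNLL03 format with the IOB tagging scheme, providing a data foundation for sequence-labeling research on low-resource languages.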
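Evaluation on IOB-tagged data is usually done at the entity level: a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. The sketch below shows one way to compute precision, recall, and F1 for a single tag sequence. It is a simplified illustration rather than the official CoNLL scorer, and the names `iob_to_spans` and `entity_f1` are assumptions.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, entity type)


def iob_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from IOB tags; an I- tag without a B- also opens a span."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes the last span
        prefix, _, kind = tag.partition("-")
        begins = prefix == "B" or (prefix == "I" and kind != label)
        if start is not None and (prefix == "O" or begins):
            spans.add((start, i, label))
            start, label = None, None
        if begins:
            start, label = i, kind
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    gold_spans, pred_spans = iob_to_spans(gold), iob_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
print(entity_f1(gold, pred))
```

In this example the person span matches and the location is mistyped as an organization, so precision, recall, and F1 are all 0.5. For a whole corpus, the span counts would be summed over sentences before computing the ratios.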

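Derived and related work

The release of pioNER has prompted follow-on work in Armenian natural language processing. The automatic annotation of its silver-standard corpus draws on the Wikipedia-link-based approach of Nothman et al., and the annotation conventions of the gold-standard corpus follow the entity type inventory that BBN assembled for the TREC 2002 question answering track. Follow-up work could include fine-tuning pretrained language models such as BERT or XLM-R on this dataset and exploring cross-lingual fine-tuning strategies to improve NER performance in low-resource settings. The dataset can also serve as an entity-annotation reference for related Armenian tasks such as dependency parsing and relation extraction, supporting more systematic study of the structure of the language.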
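Fine-tuning a subword-based model such as BERT or XLM-R on word-level IOB tags requires aligning each word's label with the subword tokens the tokenizer produces. A common convention, sketched below, labels the first subword of each word and masks the rest with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The `word_ids` list here is hand-written to mimic what `word_ids()` returns for a fast Hugging Face tokenizer (`None` for special tokens); the function name `align_labels` and the label ids are assumptions for illustration.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_labels: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """Label the first subword of each word; mask other subwords and special tokens."""
    aligned: List[int] = []
    previous_word_id: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:                    # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:      # first subword of a new word
            aligned.append(word_labels[word_id])
        else:                                  # continuation subword
            aligned.append(IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Hypothetical label ids: 0 = O, 1 = B-PER, 2 = I-PER.
word_labels = [1, 2, 0]
# Hand-written stand-in: word 0 and word 2 are each split into two subwords.
word_ids = [None, 0, 0, 1, 2, 2, None]
print(align_labels(word_labels, word_ids))
```

Masking continuation subwords means each word contributes exactly one prediction to the loss, which keeps evaluation aligned with the word-level gold tags.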

Recent research on the dataset