rmyeid/polyglot_ner
Hugging Face · Updated 2024-01-18 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/rmyeid/polyglot_ner
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- ar
- bg
- ca
- cs
- da
- de
- el
- en
- es
- et
- fa
- fi
- fr
- he
- hi
- hr
- hu
- id
- it
- ja
- ko
- lt
- lv
- ms
- nl
- 'no'
- pl
- pt
- ro
- ru
- sk
- sl
- sr
- sv
- th
- tl
- tr
- uk
- vi
- zh
license:
- unknown
multilinguality:
- multilingual
pretty_name: Polyglot-NER
size_categories:
- unknown
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
paperswithcode_id: polyglot-ner
dataset_info:
- config_name: ca
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 143746026
num_examples: 372665
download_size: 1107018606
dataset_size: 143746026
- config_name: de
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 156744752
num_examples: 547578
download_size: 1107018606
dataset_size: 156744752
- config_name: es
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 145387551
num_examples: 386699
download_size: 1107018606
dataset_size: 145387551
- config_name: fi
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 95175890
num_examples: 387465
download_size: 1107018606
dataset_size: 95175890
- config_name: hi
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 177698330
num_examples: 401648
download_size: 1107018606
dataset_size: 177698330
- config_name: id
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 152560050
num_examples: 463862
download_size: 1107018606
dataset_size: 152560050
- config_name: ko
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 174523416
num_examples: 560105
download_size: 1107018606
dataset_size: 174523416
- config_name: ms
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 155268778
num_examples: 528181
download_size: 1107018606
dataset_size: 155268778
- config_name: pl
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 159684112
num_examples: 623267
download_size: 1107018606
dataset_size: 159684112
- config_name: ru
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 200717423
num_examples: 551770
download_size: 1107018606
dataset_size: 200717423
- config_name: sr
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 183437513
num_examples: 559423
download_size: 1107018606
dataset_size: 183437513
- config_name: tl
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 47104871
num_examples: 160750
download_size: 1107018606
dataset_size: 47104871
- config_name: vi
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 141062258
num_examples: 351643
download_size: 1107018606
dataset_size: 141062258
- config_name: ar
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 183551222
num_examples: 339109
download_size: 1107018606
dataset_size: 183551222
- config_name: cs
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 156792129
num_examples: 564462
download_size: 1107018606
dataset_size: 156792129
- config_name: el
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 195456401
num_examples: 446052
download_size: 1107018606
dataset_size: 195456401
- config_name: et
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 21961619
num_examples: 87023
download_size: 1107018606
dataset_size: 21961619
- config_name: fr
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 147560734
num_examples: 418411
download_size: 1107018606
dataset_size: 147560734
- config_name: hr
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 154151689
num_examples: 629667
download_size: 1107018606
dataset_size: 154151689
- config_name: it
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 147520094
num_examples: 378325
download_size: 1107018606
dataset_size: 147520094
- config_name: lt
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 165319919
num_examples: 848018
download_size: 1107018606
dataset_size: 165319919
- config_name: nl
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 150737871
num_examples: 520664
download_size: 1107018606
dataset_size: 150737871
- config_name: pt
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 145627857
num_examples: 396773
download_size: 1107018606
dataset_size: 145627857
- config_name: sk
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 134174889
num_examples: 500135
download_size: 1107018606
dataset_size: 134174889
- config_name: sv
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 157058369
num_examples: 634881
download_size: 1107018606
dataset_size: 157058369
- config_name: tr
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 164456506
num_examples: 607324
download_size: 1107018606
dataset_size: 164456506
- config_name: zh
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 165056969
num_examples: 1570853
download_size: 1107018606
dataset_size: 165056969
- config_name: bg
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 190509195
num_examples: 559694
download_size: 1107018606
dataset_size: 190509195
- config_name: da
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 150551293
num_examples: 546440
download_size: 1107018606
dataset_size: 150551293
- config_name: en
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 145491677
num_examples: 423982
download_size: 1107018606
dataset_size: 145491677
- config_name: fa
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 180093656
num_examples: 492903
download_size: 1107018606
dataset_size: 180093656
- config_name: he
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 177231613
num_examples: 459933
download_size: 1107018606
dataset_size: 177231613
- config_name: hu
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 160702240
num_examples: 590218
download_size: 1107018606
dataset_size: 160702240
- config_name: ja
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 193679570
num_examples: 1691018
download_size: 1107018606
dataset_size: 193679570
- config_name: lv
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 76256241
num_examples: 331568
download_size: 1107018606
dataset_size: 76256241
- config_name: 'no'
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 152431612
num_examples: 552176
download_size: 1107018606
dataset_size: 152431612
- config_name: ro
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 96369897
num_examples: 285985
download_size: 1107018606
dataset_size: 96369897
- config_name: sl
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 148140079
num_examples: 521251
download_size: 1107018606
dataset_size: 148140079
- config_name: th
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 360409343
num_examples: 217631
download_size: 1107018606
dataset_size: 360409343
- config_name: uk
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 198251631
num_examples: 561373
download_size: 1107018606
dataset_size: 198251631
- config_name: combined
features:
- name: id
dtype: string
- name: lang
dtype: string
- name: words
sequence: string
- name: ner
sequence: string
splits:
- name: train
num_bytes: 6286855097
num_examples: 21070925
download_size: 1107018606
dataset_size: 6286855097
---
# Dataset Card for Polyglot-NER
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://sites.google.com/site/rmyeid/projects/polylgot-ner](https://sites.google.com/site/rmyeid/projects/polylgot-ner)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 45.39 GB
- **Size of the generated dataset:** 12.54 GB
- **Total amount of disk used:** 57.93 GB
### Dataset Summary
Polyglot-NER
A training dataset for the task of named entity recognition, automatically
generated from Wikipedia and Freebase. The dataset contains the basic
Wikipedia-based training data (with coreference resolution) for 40 languages.
The procedure used to generate the data is outlined in Section 3 of the paper
(https://arxiv.org/abs/1410.3791). Each config contains the data for a single
language; for example, "es" includes only Spanish examples. The "combined"
config contains the examples of all languages.
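
As a minimal sketch of how a config might be selected, the snippet below validates a config name against the language codes listed in the card metadata and then passes it to `datasets.load_dataset`. The helper names `check_config` and `load_polyglot_ner` are hypothetical, and the repository id `rmyeid/polyglot_ner` is taken from the download link. Depending on the installed `datasets` version, script-based datasets may need extra arguments or may not load at all, so treat the loader as an assumption rather than a guaranteed recipe.

```python
# Language codes copied from the `language` list in the card metadata,
# plus the "combined" config that merges all of them.
LANGUAGE_CODES = [
    "ar", "bg", "ca", "cs", "da", "de", "el", "en", "es", "et",
    "fa", "fi", "fr", "he", "hi", "hr", "hu", "id", "it", "ja",
    "ko", "lt", "lv", "ms", "nl", "no", "pl", "pt", "ro", "ru",
    "sk", "sl", "sr", "sv", "th", "tl", "tr", "uk", "vi", "zh",
]
CONFIG_NAMES = LANGUAGE_CODES + ["combined"]


def check_config(name: str) -> str:
    """Return `name` if it is a known config, otherwise raise ValueError."""
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown Polyglot-NER config: {name!r}")
    return name


def load_polyglot_ner(name: str = "es"):
    """Load the train split of one config (hypothetical helper).

    Needs the `datasets` package and network access; the import is local
    so that the validation above works without either.
    """
    from datasets import load_dataset

    return load_dataset("rmyeid/polyglot_ner", check_config(name), split="train")
```

Every config reports the same download size (1,107,018,606 bytes) in the metadata, so loading a single language does not appear to reduce the download.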
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### ar
- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 183.55 MB
- **Total amount of disk used:** 1.29 GB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"id": "2",
"lang": "ar",
"ner": ["O", "O", "O", "O", "O", "O", "O", "O", "LOC", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "PER", "PER", "PER", "PER", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
"words": "[\"وفي\", \"مرحلة\", \"موالية\", \"أنشأت\", \"قبيلة\", \"مكناسة\", \"الزناتية\", \"مكناسة\", \"تازة\", \",\", \"وأقام\", \"بها\", \"المرابطون\", \"قلعة\", \"..."
}
```
#### bg
- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 190.51 MB
- **Total amount of disk used:** 1.30 GB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"id": "1",
"lang": "bg",
"ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
"words": "[\"Дефиниция\", \"Наименованията\", \"\\\"\", \"книжовен\", \"\\\"/\\\"\", \"литературен\", \"\\\"\", \"език\", \"на\", \"български\", \"за\", \"тази\", \"кодифи..."
}
```
#### ca
- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 143.75 MB
- **Total amount of disk used:** 1.25 GB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"id": "2",
"lang": "ca",
"ner": "[\"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O...",
"words": "[\"Com\", \"a\", \"compositor\", \"deixà\", \"un\", \"immens\", \"llegat\", \"que\", \"inclou\", \"8\", \"simfonies\", \"(\", \"1822\", \"),\", \"diverses\", ..."
}
```
#### combined
- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 6.29 GB
- **Total amount of disk used:** 7.39 GB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"id": "18",
"lang": "es",
"ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
"words": "[\"Los\", \"cambios\", \"en\", \"la\", \"energía\", \"libre\", \"de\", \"Gibbs\", \"\\\\\", \"Delta\", \"G\", \"nos\", \"dan\", \"una\", \"cuantificación\", \"de..."
}
```
#### cs
- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 156.79 MB
- **Total amount of disk used:** 1.26 GB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"id": "3",
"lang": "cs",
"ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
"words": "[\"Historie\", \"Symfonická\", \"forma\", \"se\", \"rozvinula\", \"se\", \"především\", \"v\", \"období\", \"klasicismu\", \"a\", \"romantismu\", \",\", \"..."
}
```
### Data Fields
The data fields are the same among all splits.
#### ar
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.
#### bg
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.
#### ca
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.
#### combined
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.
#### cs
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.
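
Since `words` and `ner` are parallel lists, one simple way to read a record is to group consecutive tokens that share the same non-`O` tag. The sketch below assumes the tags carry no B-/I- prefixes, as in the cropped examples above (`O`, `LOC`, `PER`); the function name `extract_entities` and the sample record are made up for illustration.

```python
from itertools import groupby


def extract_entities(words, ner):
    """Group consecutive tokens that share a non-"O" tag into (text, tag) spans."""
    if len(words) != len(ner):
        raise ValueError("words and ner must have the same length")
    spans = []
    # groupby yields runs of consecutive (word, tag) pairs with the same tag
    for tag, run in groupby(zip(words, ner), key=lambda pair: pair[1]):
        if tag != "O":
            spans.append((" ".join(word for word, _ in run), tag))
    return spans


# Made-up record with the same fields as a dataset example
example = {
    "id": "0",
    "lang": "en",
    "words": ["Ada", "Lovelace", "was", "born", "in", "London", "."],
    "ner": ["PER", "PER", "O", "O", "O", "LOC", "O"],
}
print(extract_entities(example["words"], example["ner"]))
# [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```

Without B-/I- prefixes, two adjacent entities of the same type cannot be told apart and are merged into one span. Joining tokens with spaces is also only a display convenience and is not appropriate for languages written without spaces between words.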
### Data Splits
| name | train |
|----------|---------:|
| ar | 339109 |
| bg | 559694 |
| ca | 372665 |
| combined | 21070925 |
| cs | 564462 |
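
To put these counts in proportion, the sketch below computes each listed config's share of the `combined` split. The numbers are copied from the table above; the names `TRAIN_SIZES`, `COMBINED_SIZE`, and `share_of_combined` are illustrative. The 40 per-language `num_examples` values in the card metadata add up to exactly 21,070,925, which matches the size of `combined`.

```python
# Train-split sizes copied from the Data Splits table
TRAIN_SIZES = {"ar": 339109, "bg": 559694, "ca": 372665, "cs": 564462}
COMBINED_SIZE = 21070925


def share_of_combined(num_examples: int) -> float:
    """Fraction of the combined split contributed by one config."""
    return num_examples / COMBINED_SIZE


# Print the listed configs from largest to smallest
for name, size in sorted(TRAIN_SIZES.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {size:>9,d} examples ({share_of_combined(size):.2%} of combined)")
```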
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@article{polyglotner,
author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30- May 2, 2015}},
month = {April},
year = {2015},
publisher = {SIAM},
}
```
### Contributions
Thanks to [@joeddav](https://github.com/joeddav) for adding this dataset.
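Provided by:
rmyeid
Summary of original information
Dataset overview
Basic dataset information
- Name: Polyglot-NER
- Languages: 40 languages, including Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, and English.
- License: unknown
- Multilinguality: multilingual
- Dataset size: unknown
- Source data: original
- Task category: token classification
- Task ID: named entity recognition
- PapersWithCode ID: polyglot-ner
Dataset structure
Data instances
- Features:
  - id: string
  - lang: string
  - words: sequence of strings
  - ner: sequence of strings
Data splits
- Training set:
  - Training set size differs by language, for example:
    - Arabic: 339,109 instances
    - Bulgarian: 559,694 instances
    - Catalan: 372,665 instances
    - Czech: 564,462 instances
    - Combined: 21,070,925 instances
Data size
- Download size: 1,107,018,606 bytes
- Dataset size: depends on the language config, ranging from tens of MB to several hundred MB (the combined config is about 6.29 GB).
Dataset creation
- Annotation creators: machine-generated
- Language creators: found
Source data
- Type: original
- Data source: automatically generated from Wikipedia and Freebase for the task of named entity recognition.
Annotations
- Type: machine-generated
Personal and sensitive information
- Information: not provided
Considerations for using the data
- Social impact: not provided
- Discussion of biases: not provided
- Other known limitations: not provided
Additional information
- Dataset curators: not provided
- Licensing information: unknown
- Citation information: see the BibTeX entry under Citation Information
- Contributions: added by @joeddav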
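Collected summary
Dataset introduction

Construction
In cross-lingual natural language processing, building high-quality named entity recognition datasets is held back by scarce resources. Polyglot-NER takes an automated approach: text is extracted from multilingual Wikipedia and the Freebase knowledge base, and annotations are generated with the help of coreference resolution. The process links entities to knowledge-base entries and assigns entity category labels according to predefined rules, producing a large-scale multilingual training corpus.
Characteristics
The dataset covers forty languages, including languages written in non-Latin scripts such as Arabic, Chinese, and Japanese. Each language has its own config, and sizes range from tens of thousands of examples (87,023 for Estonian) to more than a million (1,570,853 for Chinese and 1,691,018 for Japanese). The structure is uniform: each example holds a word sequence and the matching sequence of named entity tags, which suits sequence labeling tasks. Its multilingual coverage makes it a resource for cross-lingual transfer learning.
Usage
Researchers can load a specific language config or the combined version through the Hugging Face platform and use it directly to train named entity recognition models. The dataset can be used to evaluate how well a model generalizes across languages, or as pretraining data to strengthen cross-lingual representations. Because the annotations are generated automatically, they may contain noise; combining them with manual evaluation or post-processing is recommended to improve model robustness.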
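
Before training a token-classification model, the string tags usually have to be mapped to integer ids. The sketch below builds that mapping from the tags actually observed in the data instead of assuming a fixed label set, since the card does not list one. The helpers `build_label_map` and `encode_tags` and the toy records are hypothetical.

```python
def build_label_map(examples):
    """Collect the tag strings seen in `examples` and give each an integer id."""
    tags = sorted({tag for example in examples for tag in example["ner"]})
    return {tag: index for index, tag in enumerate(tags)}


def encode_tags(example, label_map):
    """Turn the string tags of one example into integer ids."""
    return [label_map[tag] for tag in example["ner"]]


# Made-up records with the same `words` / `ner` fields as the dataset
toy_examples = [
    {"words": ["Paris", "is", "big"], "ner": ["LOC", "O", "O"]},
    {"words": ["Ada", "wrote"], "ner": ["PER", "O"]},
]
label_map = build_label_map(toy_examples)
print(label_map)                                # {'LOC': 0, 'O': 1, 'PER': 2}
print(encode_tags(toy_examples[0], label_map))  # [0, 1, 1]
```

Sorting the tags keeps the mapping stable across runs, as long as the same set of tags is observed.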
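Background and challenges
Background
In natural language processing, cross-lingual named entity recognition (NER) is a key task for multilingual information extraction. The Polyglot-NER dataset was created in 2014 by Rami Al-Rfou and colleagues to address the poor coverage of low-resource languages in traditional NER systems. It uses an automated method to extract and annotate entity information for 40 languages from Wikipedia and the Freebase knowledge base, providing training data for multilingual NER models and a basis for later work on cross-lingual transfer learning.
Current challenges
Polyglot-NER targets two problems in multilingual NER: scarce data and inconsistent annotation standards. At the task level, the main obstacles are differences in how entities are expressed across languages, diverse grammatical structures, and culture-specific entities. During construction, the automated annotation has to deal with entity disambiguation and cross-lingual coreference resolution, as well as noise in Wikipedia data and annotation consistency. Together these factors affect the precision and reliability of the dataset.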
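Common scenarios
Typical use
Multilingual named entity recognition (NER) is a key task for understanding the semantic structure of text. With its coverage of 40 languages, Polyglot-NER is a standard resource for training and evaluating cross-lingual NER models. Researchers typically use it to build multilingual sequence labeling models and, under a single annotation scheme, compare entity recognition performance across languages to study how language characteristics affect generalization.
Academic problems addressed
The dataset eases the shortage of data in multilingual NER research, especially for low-resource languages. Its automatically generated annotations avoid the high cost and limited scale of manual annotation and provide an experimental basis for cross-lingual transfer learning and for zero-shot or few-shot learning. It supports research on language-independent representation learning and more balanced progress in multilingual information extraction.
Related derived work
According to this summary, Polyglot-NER has been used in follow-up research, for example in multi-task learning with cross-lingual BERT pretrained models to improve entity recognition for low-resource languages, and as a benchmark for the quality of multilingual word embeddings in connection with alignment methods such as MUSE and VecMap. This line of work concerns how multilingual representations transfer across languages.
The content above was collected and summarized by 遇见数据集.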



