universalner/universal_ner

Name: universalner/universal_ner
Creator: universalner
Published: 2024-09-03 14:13:47
License: cc-by-sa-4.0 (as declared in the dataset card below; the listing field itself is empty)

Source: Hugging Face. Updated 2024-09-03. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/universalner/universal_ner


Resource description:

---
license: cc-by-sa-4.0
language:
- ceb
- da
- de
- en
- hr
- pt
- ru
- sk
- sr
- sv
- tl
- zh
task_categories:
- token-classification
dataset_info:
- config_name: ceb_gja
  features: &uner_features
  - name: idx
    dtype: string
  - name: text
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-PER
          '2': I-PER
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
  - name: annotator
    sequence: string
  splits:
  - {name: test, num_bytes: 39540, num_examples: 188}
  download_size: 30395
  dataset_size: 39540
- config_name: da_ddt
  features: *uner_features
  splits:
  - {name: train, num_bytes: 2304027, num_examples: 4383}
  - {name: validation, num_bytes: 293562, num_examples: 564}
  - {name: test, num_bytes: 285813, num_examples: 565}
  download_size: 2412623
  dataset_size: 2883402
- config_name: de_pud
  features: *uner_features
  splits:
  - {name: test, num_bytes: 641819, num_examples: 1000}
  download_size: 501924
  dataset_size: 641819
- config_name: en_ewt
  features: *uner_features
  splits:
  - {name: train, num_bytes: 6133506, num_examples: 12543}
  - {name: validation, num_bytes: 782835, num_examples: 2001}
  - {name: test, num_bytes: 785361, num_examples: 2077}
  download_size: 5962747
  dataset_size: 7701702
- config_name: en_pud
  features: *uner_features
  splits:
  - {name: test, num_bytes: 600666, num_examples: 1000}
  download_size: 462120
  dataset_size: 600666
- config_name: hr_set
  features: *uner_features
  splits:
  - {name: train, num_bytes: 4523323, num_examples: 6914}
  - {name: validation, num_bytes: 656738, num_examples: 960}
  - {name: test, num_bytes: 719703, num_examples: 1136}
  download_size: 4620262
  dataset_size: 5899764
- config_name: pt_bosque
  features: *uner_features
  splits:
  - {name: train, num_bytes: 4839200, num_examples: 7018}
  - {name: validation, num_bytes: 802880, num_examples: 1172}
  - {name: test, num_bytes: 780768, num_examples: 1167}
  download_size: 4867264
  dataset_size: 6422848
- config_name: pt_pud
  features: *uner_features
  splits:
  - {name: test, num_bytes: 661453, num_examples: 1000}
  download_size: 507495
  dataset_size: 661453
- config_name: ru_pud
  features: *uner_features
  splits:
  - {name: test, num_bytes: 795294, num_examples: 1000}
  download_size: 669214
  dataset_size: 795294
- config_name: sk_snk
  features: *uner_features
  splits:
  - {name: train, num_bytes: 2523121, num_examples: 8483}
  - {name: validation, num_bytes: 409448, num_examples: 1060}
  - {name: test, num_bytes: 411686, num_examples: 1061}
  download_size: 2597877
  dataset_size: 3344255
- config_name: sr_set
  features: *uner_features
  splits:
  - {name: train, num_bytes: 2174631, num_examples: 3328}
  - {name: validation, num_bytes: 349276, num_examples: 536}
  - {name: test, num_bytes: 336065, num_examples: 520}
  download_size: 2248325
  dataset_size: 2859972
- config_name: sv_pud
  features: *uner_features
  splits:
  - {name: test, num_bytes: 588564, num_examples: 1000}
  download_size: 464252
  dataset_size: 588564
- config_name: sv_talbanken
  features: *uner_features
  splits:
  - {name: train, num_bytes: 2027488, num_examples: 4303}
  - {name: validation, num_bytes: 291774, num_examples: 504}
  - {name: test, num_bytes: 615209, num_examples: 1219}
  download_size: 2239432
  dataset_size: 2934471
- config_name: tl_trg
  features: *uner_features
  splits:
  - {name: test, num_bytes: 23671, num_examples: 128}
  download_size: 18546
  dataset_size: 23671
- config_name: tl_ugnayan
  features: *uner_features
  splits:
  - {name: test, num_bytes: 31732, num_examples: 94}
  download_size: 23941
  dataset_size: 31732
- config_name: zh_gsd
  features: *uner_features
  splits:
  - {name: train, num_bytes: 2747999, num_examples: 3997}
  - {name: validation, num_bytes: 355515, num_examples: 500}
  - {name: test, num_bytes: 335893, num_examples: 500}
  download_size: 2614866
  dataset_size: 3439407
- config_name: zh_gsdsimp
  features: *uner_features
  splits:
  - {name: train, num_bytes: 2747863, num_examples: 3997}
  - {name: validation, num_bytes: 352423, num_examples: 500}
  - {name: test, num_bytes: 335869, num_examples: 500}
  download_size: 2611290
  dataset_size: 3436155
- config_name: zh_pud
  features: *uner_features
  splits:
  - {name: test, num_bytes: 607418, num_examples: 1000}
  download_size: 460357
  dataset_size: 607418
---

# Dataset Card for Universal NER

### Dataset Summary

Universal NER (UNER) is an open, community-driven initiative aimed at
creating gold-standard benchmarks for Named Entity Recognition (NER) across multiple languages. The primary objective of UNER is to offer high-quality, cross-lingually consistent annotations, thereby standardizing and advancing multilingual NER research. UNER v1 includes 19 datasets with named entity annotations, uniformly structured across 13 diverse languages.

### Supported Tasks and Leaderboards

- `token-classification`: The dataset can be used to train token classification models of the NER variety. Some pre-trained models released as part of the UNER v1 release can be found at https://huggingface.co/universalner

### Languages

The dataset contains data in the following languages:

- Cebuano (`ceb`)
- Danish (`da`)
- German (`de`)
- English (`en`)
- Croatian (`hr`)
- Portuguese (`pt`)
- Russian (`ru`)
- Slovak (`sk`)
- Serbian (`sr`)
- Swedish (`sv`)
- Tagalog (`tl`)
- Chinese (`zh`)

## Dataset Structure

### Data Instances

An example from the `UNER_English-PUD` test set looks as follows:

```json
{
  "idx": "n01016-0002",
  "text": "Several analysts have suggested Huawei is best placed to benefit from Samsung's setback.",
  "tokens": [
    "Several", "analysts", "have", "suggested", "Huawei", "is", "best", "placed",
    "to", "benefit", "from", "Samsung", "'s", "setback", "."
  ],
  "ner_tags": [
    "O", "O", "O", "O", "B-ORG", "O", "O", "O",
    "O", "O", "O", "B-ORG", "O", "O", "O"
  ],
  "annotator": "blvns"
}
```

### Data Fields

- `idx`: the ID uniquely identifying the sentence (instance), if available.
- `text`: the full text of the sentence (instance).
- `tokens`: the text of the sentence (instance) split into tokens.
  Note that this split is inherited from Universal Dependencies.
- `ner_tags`: the NER tags associated with each one of the `tokens`.
- `annotator`: the annotator who provided the `ner_tags` for this particular instance.

### Data Splits

TBD

## Dataset Creation

### Curation Rationale

TBD

### Source Data

#### Initial Data Collection and Normalization

We selected the Universal Dependencies (UD) corpora as the default base texts for annotation due to their extensive language coverage, pre-existing data collection, cleaning, tokenization, and permissive licensing. This choice accelerates our process by providing a robust foundation. By adding another annotation layer to the already detailed UD annotations, we facilitate verification within our project and enable comprehensive multilingual research across the entire NLP pipeline. Given that UD annotations operate at the word level, we adopted the BIO annotation schema (specifically IOB2). In this schema, words forming the beginning (B) or inside (I) part of an entity (X ∈ {PER, LOC, ORG}) are annotated accordingly, while all other words receive an O tag. To maintain consistency, we preserve UD's original tokenization.

Although UD serves as the default data source for UNER, the project is not restricted to UD corpora, particularly for languages not currently represented in UD. The primary requirement for inclusion in the UNER corpus is adherence to the UNER tagging guidelines. Additionally, we are open to converting existing NER efforts on UD treebanks to align with UNER. In this initial release, we have included four datasets transferred from other manual annotation efforts on UD sources (for DA, HR, ARABIZI, and SR).

#### Who are the source language producers?

This information can be found on a per-dataset basis for each of the source Universal Dependencies datasets.

### Annotations

#### Annotation process

The data has been annotated by

#### Who are the annotators?
For the initial UNER annotation effort, we recruited volunteers from the multilingual NLP community via academic networks and social media. The annotators were coordinated through a Slack workspace, with all contributors working on a voluntary basis. We assume that annotators are either native speakers of the language they annotate or possess a high level of proficiency, although no formal language tests were conducted. The selection of the 13 dataset languages in the first UNER release was driven by the availability of annotators. As the project evolves, we anticipate the inclusion of additional languages and datasets as more annotators become available.

### Personal and Sensitive Information

TBD

## Considerations for Using the Data

### Social Impact of Dataset

TBD

### Discussion of Biases

TBD

### Other Known Limitations

TBD

## Additional Information

### Dataset Curators

List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.

### Licensing Information

UNER v1 is released under the terms of the [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/) license.

### Citation Information

If you use this dataset, please cite the corresponding [paper](https://aclanthology.org/2024.naacl-long.243):

```
@inproceedings{mayhew2024universal,
    title={Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark},
    author={Stephen Mayhew and Terra Blevins and Shuheng Liu and Marek Šuppa and Hila Gonen and Joseph Marvin Imperial and Börje F. Karlsson and Peiqin Lin and Nikola Ljubešić and LJ Miranda and Barbara Plank and Arij Riabi and Yuval Pinter},
    booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
    year={2024},
    url={https://aclanthology.org/2024.naacl-long.243/}
}
```

Provided by:

universalner

Summary of original information

Dataset overview

Dataset information

Features (identical for all 18 configurations):
- idx: string
- text: string
- tokens: sequence of strings
- ner_tags: sequence of class labels:
  - 0: O
  - 1: B-PER
  - 2: I-PER
  - 3: B-ORG
  - 4: I-ORG
  - 5: B-LOC
  - 6: I-LOC
- annotator: sequence of strings

Splits and sizes per configuration (split cells give bytes / examples; "n/a" means the split is not provided):

| Configuration | Train | Validation | Test | Download size (bytes) | Dataset size (bytes) |
|---|---|---|---|---|---|
| ceb_gja | n/a | n/a | 39540 / 188 | 30395 | 39540 |
| da_ddt | 2304027 / 4383 | 293562 / 564 | 285813 / 565 | 2412623 | 2883402 |
| de_pud | n/a | n/a | 641819 / 1000 | 501924 | 641819 |
| en_ewt | 6133506 / 12543 | 782835 / 2001 | 785361 / 2077 | 5962747 | 7701702 |
| en_pud | n/a | n/a | 600666 / 1000 | 462120 | 600666 |
| hr_set | 4523323 / 6914 | 656738 / 960 | 719703 / 1136 | 4620262 | 5899764 |
| pt_bosque | 4839200 / 7018 | 802880 / 1172 | 780768 / 1167 | 4867264 | 6422848 |
| pt_pud | n/a | n/a | 661453 / 1000 | 507495 | 661453 |
| ru_pud | n/a | n/a | 795294 / 1000 | 669214 | 795294 |
| sk_snk | 2523121 / 8483 | 409448 / 1060 | 411686 / 1061 | 2597877 | 3344255 |
| sr_set | 2174631 / 3328 | 349276 / 536 | 336065 / 520 | 2248325 | 2859972 |
| sv_pud | n/a | n/a | 588564 / 1000 | 464252 | 588564 |
| sv_talbanken | 2027488 / 4303 | 291774 / 504 | 615209 / 1219 | 2239432 | 2934471 |
| tl_trg | n/a | n/a | 23671 / 128 | 18546 | 23671 |
| tl_ugnayan | n/a | n/a | 31732 / 94 | 23941 | 31732 |
| zh_gsd | 2747999 / 3997 | 355515 / 500 | 335893 / 500 | 2614866 | 3439407 |
| zh_gsdsimp | 2747863 / 3997 | 352423 / 500 | 335869 / 500 | 2611290 | 3436155 |
| zh_pud | n/a | n/a | 607418 / 1000 | 460357 | 607418 |
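
The class-label ids listed above are what a model sees; the tag names are what a reader wants. A minimal Python sketch of the conversion in both directions, using only the seven ids and names given in this section (the names `ID2LABEL`, `LABEL2ID`, `decode_tags`, and `encode_tags` are hypothetical helpers, not part of the dataset):

```python
# Class-label ids and tag names as listed for every configuration on this page.
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
# Inverse mapping, derived rather than typed by hand so the two cannot drift apart.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode_tags(tag_ids):
    """Turn a sequence of integer class ids into IOB2 tag names."""
    return [ID2LABEL[i] for i in tag_ids]


def encode_tags(tag_names):
    """Turn a sequence of IOB2 tag names into integer class ids."""
    return [LABEL2ID[name] for name in tag_names]


print(decode_tags([0, 0, 3, 4, 0]))
```

An unknown id or name raises `KeyError`, which is usually preferable to silently mapping it to `O`.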

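Collected summary

Dataset introduction

Construction

In named entity recognition, high-quality multilingual benchmarks are key to advancing cross-lingual research. Universal NER takes Universal Dependencies treebanks as its base texts, benefiting from their broad language coverage and existing collection, cleaning, and tokenization. Using the BIO annotation scheme (specifically IOB2), it annotates three entity types, person (PER), location (LOC), and organization (ORG), in the same way in every language. The annotation was done by volunteers from the multilingual NLP community, who contributed their language expertise on a voluntary basis. UNER v1 spans 13 languages; the 12 listed on the dataset card are Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese.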
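
To make the IOB2 scheme concrete, the sketch below groups per-token tags back into entity spans, using the English PUD sentence from the dataset card as input. It is an illustration under stated assumptions, not the project's tooling: `bio_to_spans` is a hypothetical helper, tags are assumed to be strings, and joining tokens with spaces is a simplification that does not suit Chinese.

```python
def bio_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, start, end, text) spans; `end` is exclusive.

    A B-X tag opens an entity and following I-X tags of the same type extend it.
    An I-X tag that does not continue an open entity of type X is ignored.
    """
    spans = []
    start, etype = None, None
    # The extra "O" acts as a sentinel that closes an entity ending the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


tokens = ["Several", "analysts", "have", "suggested", "Huawei", "is", "best", "placed",
          "to", "benefit", "from", "Samsung", "'s", "setback", "."]
tags = ["O", "O", "O", "O", "B-ORG", "O", "O", "O",
        "O", "O", "O", "B-ORG", "O", "O", "O"]

for span in bio_to_spans(tokens, tags):
    print(span)
```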

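Characteristics

The defining traits of Universal NER are cross-lingual uniformity and gold-standard annotation. UNER v1 comprises 19 datasets (18 of them appear as configurations on this page), all following the same annotation guidelines and labeled for three entity types: person, location, and organization. Each instance contains a sentence ID, the original text, the token sequence, the NER tags, and the annotator, a structure that is easy to use for training and evaluation. Coverage includes widely used languages as well as relatively low-resource ones, which makes the dataset useful for studying entity recognition across diverse languages. Annotation is community-driven: according to the dataset card, annotators are assumed to be native or highly proficient speakers, although no formal language tests were conducted.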
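
Because every instance is described as having the same five fields, a quick consistency check is easy to write. The sketch below is a hypothetical helper (`check_instance` is not part of the dataset) and assumes `ner_tags` holds tag names as strings, as in the dataset card's example; data loaded through a library may store integer class ids instead.

```python
VALID_TAGS = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"}
REQUIRED_FIELDS = ("idx", "text", "tokens", "ner_tags", "annotator")


def check_instance(inst):
    """Return a list of problems found in one instance; an empty list means none."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in inst]
    tokens = inst.get("tokens", [])
    tags = inst.get("ner_tags", [])
    if len(tokens) != len(tags):
        problems.append(f"{len(tokens)} tokens but {len(tags)} tags")
    prev = "O"
    for i, tag in enumerate(tags):
        if tag not in VALID_TAGS:
            problems.append(f"unknown tag {tag!r} at position {i}")
        elif tag.startswith("I-") and prev[2:] != tag[2:]:
            # In IOB2, I-X must follow B-X or I-X of the same entity type.
            problems.append(f"{tag} at position {i} does not continue a {tag[2:]} entity")
        prev = tag
    return problems


example = {
    "idx": "n01016-0002",
    "text": "Several analysts have suggested Huawei is best placed to benefit from Samsung's setback.",
    "tokens": ["Several", "analysts", "have", "suggested", "Huawei", "is", "best", "placed",
               "to", "benefit", "from", "Samsung", "'s", "setback", "."],
    "ner_tags": ["O", "O", "O", "O", "B-ORG", "O", "O", "O",
                 "O", "O", "O", "B-ORG", "O", "O", "O"],
    "annotator": "blvns",
}

print(check_instance(example))
```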

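Usage

The dataset is suited to training and evaluating multilingual NER models. It can be loaded from the Hugging Face Hub by selecting a configuration of `universalner/universal_ner`, such as `en_ewt` or `zh_gsd`, both of which come pre-split into training, validation, and test sets; note that nine configurations (for example the `*_pud` ones) provide only a test split. The `tokens` and `ner_tags` fields of each instance can be used directly for sequence labeling, for example to fine-tune Transformer-based pretrained models. Because all configurations share one structure, the dataset lends itself to cross-lingual transfer experiments that compare model performance across languages.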
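
A minimal loading sketch, assuming the Hugging Face `datasets` package is installed and the Hub is reachable. `load_uner` and `available_splits` are hypothetical wrappers; the configuration names and split availability are taken from this page. Depending on the installed `datasets` version, extra keyword arguments may be needed, so they are passed through unchanged.

```python
REPO_ID = "universalner/universal_ner"

# Split availability per configuration, as listed on this page.
FULL_SPLIT_CONFIGS = ["da_ddt", "en_ewt", "hr_set", "pt_bosque", "sk_snk",
                      "sr_set", "sv_talbanken", "zh_gsd", "zh_gsdsimp"]
TEST_ONLY_CONFIGS = ["ceb_gja", "de_pud", "en_pud", "pt_pud", "ru_pud",
                     "sv_pud", "tl_trg", "tl_ugnayan", "zh_pud"]


def available_splits(config):
    """Return the split names this page lists for a configuration."""
    if config in FULL_SPLIT_CONFIGS:
        return ["train", "validation", "test"]
    if config in TEST_ONLY_CONFIGS:
        return ["test"]
    raise KeyError(f"unknown configuration: {config}")


def load_uner(config, **kwargs):
    """Load one configuration from the Hub (requires network access)."""
    # Imported here so the helpers above work even without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, **kwargs)


# Example (not executed here because it downloads data):
# ds = load_uner("en_ewt")
# print(ds["train"][0]["tokens"])
```

Checking `available_splits` first avoids asking a test-only configuration such as `en_pud` for a training split.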

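Background and challenges

Background overview

Named entity recognition, a core information-extraction task in natural language processing, has long suffered from annotation standards that differ between languages. Universal NER, introduced by Stephen Mayhew and colleagues in 2024, aims to build an open, community-driven, gold-standard benchmark for multilingual NER. It builds on Universal Dependencies corpora and applies one BIO annotation scheme throughout. UNER v1 spans 13 languages, 12 of which are listed on the dataset card: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. It offers high-quality, cross-lingually consistent annotations, with the stated goal of standardizing and advancing multilingual NER research.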

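Current challenges

Universal NER targets two problems in multilingual NER, inconsistent annotation guidelines and unevenly distributed language resources, and building it raises difficulties of its own. On the task side, languages differ markedly in the grammar and naming conventions of entity mentions, so cross-lingual consistency means dealing with language-specific ambiguity and cultural context. On the construction side, annotation depends on a volunteer community, so language coverage is limited by annotator availability. Integrating heterogeneous Universal Dependencies corpora also requires keeping each treebank's original tokenization, converting existing NER annotations on UD treebanks to the UNER guidelines, and keeping entity categories and boundary decisions aligned across languages. Together these factors constrain both the quality and the scale of the dataset.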

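Common scenarios

Classic use case

Named entity recognition has long faced scarce cross-lingual data and inconsistent annotation standards. UNER v1 brings together 19 annotated datasets in 13 languages under one annotation framework to form a multilingual benchmark. Its classic use is as a standardized training and evaluation platform for cross-lingual NER models: because PER, LOC, and ORG are defined the same way everywhere, researchers can systematically compare how well models generalize and transfer between languages.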
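
Comparing models across languages needs one metric computed the same way for every configuration. The sketch below computes exact-match, entity-level precision, recall, and F1 from IOB2 tag sequences. This is a common convention for NER and is an assumption here: the source does not state which metric the benchmark uses, and `entity_spans` and `entity_f1` are hypothetical helpers.

```python
def entity_spans(tags):
    """Return the set of (entity_type, start, end) spans in one IOB2 tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall, F1; a span counts only if type and boundaries match."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = entity_spans(gold), entity_spans(pred)
        true_pos += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["O", "B-ORG", "O", "B-PER", "I-PER"]]
pred = [["O", "B-ORG", "O", "B-PER", "O"]]  # PER entity found, but its boundary is wrong
print(entity_f1(gold, pred))
```

Running the same function on the test split of each configuration gives per-language scores that can be compared directly.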

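Practical applications

In practice, Universal NER can support the multilingual information-structuring needs of global digital services. Its annotations can be used to train the entity-recognition stage of cross-lingual entity-linking modules in search engines and to improve how accurately news aggregation systems recognize organizations and place names in coverage of international events. In financial sentiment monitoring, the dataset can help build entity-extraction pipelines that handle English financial reports, Chinese social media, and German news together, giving multinational companies a technical basis for risk analysis.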

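Derived and related work

Work building on the dataset is said to include fine-tuning studies with the multilingual encoder XLM-RoBERTa and adaptations of the multilingual text-to-text model mT5 to named entity recognition. Some studies use the uniform annotations to develop language-agnostic entity-boundary detection, and others apply contrastive learning to uncover implicit relations between entity distributions in different languages. According to this summary, such work has improved entity recognition for low-resource languages such as Cebuano and Tagalog and has encouraged language-family transfer learning for Slavic languages; the summary cites no specific publications for these claims.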

The content above was collected and summarized by 遇见数据集.
