rmyeid/polyglot_ner

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/rmyeid/polyglot_ner
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- ar
- bg
- ca
- cs
- da
- de
- el
- en
- es
- et
- fa
- fi
- fr
- he
- hi
- hr
- hu
- id
- it
- ja
- ko
- lt
- lv
- ms
- nl
- 'no'
- pl
- pt
- ro
- ru
- sk
- sl
- sr
- sv
- th
- tl
- tr
- uk
- vi
- zh
license:
- unknown
multilinguality:
- multilingual
pretty_name: Polyglot-NER
size_categories:
- unknown
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
paperswithcode_id: polyglot-ner
---

`dataset_info`, condensed: every config has the features `id` (string), `lang` (string), `words` (sequence of string), and `ner` (sequence of string), a single `train` split, and `download_size: 1107018606`. In every config `dataset_size` equals `num_bytes`.

| config_name | num_bytes | num_examples |
|-------------|-----------:|-------------:|
| ca | 143746026 | 372665 |
| de | 156744752 | 547578 |
| es | 145387551 | 386699 |
| fi | 95175890 | 387465 |
| hi | 177698330 | 401648 |
| id | 152560050 | 463862 |
| ko | 174523416 | 560105 |
| ms | 155268778 | 528181 |
| pl | 159684112 | 623267 |
| ru | 200717423 | 551770 |
| sr | 183437513 | 559423 |
| tl | 47104871 | 160750 |
| vi | 141062258 | 351643 |
| ar | 183551222 | 339109 |
| cs | 156792129 | 564462 |
| el | 195456401 | 446052 |
| et | 21961619 | 87023 |
| fr | 147560734 | 418411 |
| hr | 154151689 | 629667 |
| it | 147520094 | 378325 |
| lt | 165319919 | 848018 |
| nl | 150737871 | 520664 |
| pt | 145627857 | 396773 |
| sk | 134174889 | 500135 |
| sv | 157058369 | 634881 |
| tr | 164456506 | 607324 |
| zh | 165056969 | 1570853 |
| bg | 190509195 | 559694 |
| da | 150551293 | 546440 |
| en | 145491677 | 423982 |
| fa | 180093656 | 492903 |
| he | 177231613 | 459933 |
| hu | 160702240 | 590218 |
| ja | 193679570 | 1691018 |
| lv | 76256241 | 331568 |
| no | 152431612 | 552176 |
| ro | 96369897 | 285985 |
| sl | 148140079 | 521251 |
| th | 360409343 | 217631 |
| uk | 198251631 | 561373 |
| combined | 6286855097 | 21070925 |

# Dataset Card for Polyglot-NER

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://sites.google.com/site/rmyeid/projects/polylgot-ner](https://sites.google.com/site/rmyeid/projects/polylgot-ner)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 45.39 GB
- **Size of the generated dataset:** 12.54 GB
- **Total amount of disk used:** 57.93 GB

### Dataset Summary

Polyglot-NER is a training dataset, automatically generated from Wikipedia and Freebase, for the task of named entity recognition. It contains the basic Wikipedia-based training data (with coreference resolution) for 40 languages. The procedure used to generate the data is outlined in Section 3 of the paper (https://arxiv.org/abs/1410.3791). Each config contains the data for a different language. For example, "es" includes only Spanish examples.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### ar

- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 183.55 MB
- **Total amount of disk used:** 1.29 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": "2",
    "lang": "ar",
    "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "LOC", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "PER", "PER", "PER", "PER", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
    "words": "[\"وفي\", \"مرحلة\", \"موالية\", \"أنشأت\", \"قبيلة\", \"مكناسة\", \"الزناتية\", \"مكناسة\", \"تازة\", \",\", \"وأقام\", \"بها\", \"المرابطون\", \"قلعة\", \"..."
}
```

#### bg

- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 190.51 MB
- **Total amount of disk used:** 1.30 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": "1",
    "lang": "bg",
    "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
    "words": "[\"Дефиниция\", \"Наименованията\", \"\\\"\", \"книжовен\", \"\\\"/\\\"\", \"литературен\", \"\\\"\", \"език\", \"на\", \"български\", \"за\", \"тази\", \"кодифи..."
}
```

#### ca

- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 143.75 MB
- **Total amount of disk used:** 1.25 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": "2",
    "lang": "ca",
    "ner": "[\"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O...",
    "words": "[\"Com\", \"a\", \"compositor\", \"deixà\", \"un\", \"immens\", \"llegat\", \"que\", \"inclou\", \"8\", \"simfonies\", \"(\", \"1822\", \"),\", \"diverses\", ..."
}
```

#### combined

- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 6.29 GB
- **Total amount of disk used:** 7.39 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": "18",
    "lang": "es",
    "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
    "words": "[\"Los\", \"cambios\", \"en\", \"la\", \"energía\", \"libre\", \"de\", \"Gibbs\", \"\\\\\", \"Delta\", \"G\", \"nos\", \"dan\", \"una\", \"cuantificación\", \"de..."
}
```

#### cs

- **Size of downloaded dataset files:** 1.11 GB
- **Size of the generated dataset:** 156.79 MB
- **Total amount of disk used:** 1.26 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "id": "3",
    "lang": "cs",
    "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
    "words": "[\"Historie\", \"Symfonická\", \"forma\", \"se\", \"rozvinula\", \"se\", \"především\", \"v\", \"období\", \"klasicismu\", \"a\", \"romantismu\", \",\", \"..."
}
```

### Data Fields

The data fields are the same among all splits.

#### ar
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.

#### bg
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.

#### ca
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.

#### combined
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.

#### cs
- `id`: a `string` feature.
- `lang`: a `string` feature.
- `words`: a `list` of `string` features.
- `ner`: a `list` of `string` features.
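
The `ner` field carries one tag per entry in `words`, and the cropped examples in the card show plain tags such as `PER` and `LOC` with no `B-`/`I-` prefixes. As a minimal sketch, assuming that a run of identical non-`O` tags marks a single entity (an assumption that would merge two adjacent entities of the same type), spans could be recovered as follows. The helper name `extract_spans` and the sample record are hypothetical and are not taken from the dataset.

```python
from itertools import groupby


def extract_spans(words, tags, outside="O"):
    """Group consecutive identical non-outside tags into (label, start, end, text) spans.

    Assumption: tags are flat labels (e.g. "PER", "LOC") without B-/I- prefixes,
    so two adjacent entities of the same type are merged into one span.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    spans = []
    index = 0
    for label, run in groupby(tags):
        length = len(list(run))
        if label != outside:
            text = " ".join(words[index:index + length])
            spans.append((label, index, index + length, text))
        index += length
    return spans


# Hypothetical record shaped like the dataset's fields (not real data).
example = {
    "words": ["Ada", "Lovelace", "lived", "in", "London", "."],
    "ner": ["PER", "PER", "O", "O", "LOC", "O"],
}
spans = extract_spans(example["words"], example["ner"])
```

The end index is exclusive, so `words[start:end]` returns the tokens of each span.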
### Data Splits

| name | train |
|----------|---------:|
| ar | 339109 |
| bg | 559694 |
| ca | 372665 |
| combined | 21070925 |
| cs | 564462 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{polyglotner,
  author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
  title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
  journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30- May 2, 2015}},
  month = {April},
  year = {2015},
  publisher = {SIAM},
}
```

### Contributions

Thanks to [@joeddav](https://github.com/joeddav) for adding this dataset.
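Provider:
rmyeid
Summary of original information

Dataset overview

Basic dataset information

  • Name: Polyglot-NER
  • Languages: 40 languages, including Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, and English.
  • License: unknown
  • Multilinguality: multilingual
  • Size category: unknown
  • Source data: original
  • Task category: token classification
  • Task ID: named entity recognition
  • PapersWithCode ID: polyglot-ner

Dataset structure

Data instances

  • Features:
    • id: string
    • lang: string
    • words: sequence of strings
    • ner: sequence of strings

Data splits

  • Training set:
    • Training-set size varies by language, for example:
      • Arabic: 339,109 examples
      • Bulgarian: 559,694 examples
      • Catalan: 372,665 examples
      • Czech: 564,462 examples
      • Combined: 21,070,925 examples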
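
The per-language counts in the card metadata can be checked against the combined configuration with simple arithmetic. The sketch below uses only the `num_examples` values listed in the metadata; the variable names are illustrative. The 40 per-language counts add up to exactly the combined count, which suggests (the card does not say so explicitly) that `combined` is the concatenation of the language configs.

```python
# Training-set sizes (num_examples) per language config, as listed in the card metadata.
train_sizes = {
    "ar": 339_109, "bg": 559_694, "ca": 372_665, "cs": 564_462, "da": 546_440,
    "de": 547_578, "el": 446_052, "en": 423_982, "es": 386_699, "et": 87_023,
    "fa": 492_903, "fi": 387_465, "fr": 418_411, "he": 459_933, "hi": 401_648,
    "hr": 629_667, "hu": 590_218, "id": 463_862, "it": 378_325, "ja": 1_691_018,
    "ko": 560_105, "lt": 848_018, "lv": 331_568, "ms": 528_181, "nl": 520_664,
    "no": 552_176, "pl": 623_267, "pt": 396_773, "ro": 285_985, "ru": 551_770,
    "sk": 500_135, "sl": 521_251, "sr": 559_423, "sv": 634_881, "th": 217_631,
    "tl": 160_750, "tr": 607_324, "uk": 561_373, "vi": 351_643, "zh": 1_570_853,
}
combined_size = 21_070_925  # num_examples of the "combined" config

# Check whether the language configs account for the whole combined config.
total = sum(train_sizes.values())
matches_combined = total == combined_size

# Fraction of the combined config contributed by each language.
shares = {lang: n / combined_size for lang, n in train_sizes.items()}
largest = max(shares, key=shares.get)

print(f"sum of language configs: {total:,} (matches combined: {matches_combined})")
print(f"largest config: {largest} ({shares[largest]:.2%} of combined)")
```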

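Data size

  • Download size: 1,107,018,606 bytes
  • Dataset size: varies by configuration, from about 22 MB (Estonian) to about 360 MB (Thai) per language; the combined configuration is about 6.29 GB.

Dataset creation

  • Annotation creators: machine-generated
  • Language creators: found

Source data

  • Type: original
  • Origin: generated automatically from Wikipedia and Freebase for the named entity recognition task.

Annotations

  • Type: machine-generated

Personal and sensitive information

  • Information: not provided

Considerations for using the data

  • Social impact: not provided
  • Discussion of biases: not provided
  • Other known limitations: not provided

Additional information

  • Dataset curators: not provided
  • Licensing information: unknown
  • Citation information: Al-Rfou, Kulkarni, Perozzi, and Skiena, "Polyglot-NER: Massive Multilingual Named Entity Recognition", Proceedings of the 2015 SIAM International Conference on Data Mining (2015)
  • Contributions: added to Hugging Face by @joeddav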
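Collected summary
Dataset introduction
Construction method
In cross-lingual natural language processing, building high-quality named entity recognition datasets is hampered by scarce resources. Polyglot-NER is built automatically: text is extracted from multilingual Wikipedia and the Freebase knowledge base, and annotations are generated with the help of coreference resolution. The process links entity mentions to knowledge-base entries and assigns entity-type labels by predefined rules, yielding a large multilingual training corpus.
Characteristics
The dataset covers 40 languages, including languages written in non-Latin scripts such as Arabic, Chinese, and Japanese. Each language has its own configuration, with sizes ranging from tens of thousands to more than a million examples; Chinese and Japanese are the largest, at about 1.57 million and 1.69 million examples. All configurations share one structure, a word sequence paired with a sequence of named-entity tags, which suits sequence labeling. This multilingual coverage makes the dataset a useful resource for cross-lingual transfer learning.
Usage
Researchers can load a single-language configuration or the combined version from the Hugging Face platform and use it directly to train named entity recognition models. The dataset can also be used to evaluate how well a model generalizes across languages, or as pretraining data to strengthen cross-lingual representations. Because the labels are generated automatically and may be noisy, combining them with manual evaluation or post-processing is recommended.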
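
Since the labels are machine-generated, a light sanity filter before training is one possible form of post-processing. The sketch below works on plain Python dicts shaped like the dataset's records. It drops records whose `words` and `ner` lengths differ and, optionally, records that contain no entity tag at all. The function name `keep_record` and the sample records are hypothetical, and treating `"O"` as the non-entity tag is an assumption drawn from the cropped examples in the card. Whether discarding all-`O` sentences helps depends on the task.

```python
def keep_record(record, outside="O", require_entity=True):
    """Return True if the record looks usable for NER training.

    Assumptions: `words` and `ner` are parallel lists, and `outside`
    is the tag used for non-entity tokens.
    """
    words, tags = record["words"], record["ner"]
    # Reject empty records and records whose tokens and tags are misaligned.
    if len(words) == 0 or len(words) != len(tags):
        return False
    # Optionally reject sentences that contain no entity at all.
    if require_entity and all(tag == outside for tag in tags):
        return False
    return True


# Hypothetical records for illustration (not taken from the dataset).
records = [
    {"id": "0", "lang": "en", "words": ["Paris", "is", "big"], "ner": ["LOC", "O", "O"]},
    {"id": "1", "lang": "en", "words": ["It", "rains"], "ner": ["O", "O"]},
    {"id": "2", "lang": "en", "words": ["Broken", "row"], "ner": ["O"]},
]
kept_ids = [r["id"] for r in records if keep_record(r)]
```

With the Hugging Face `datasets` library, a predicate of this kind could be passed to `Dataset.filter`, which keeps the rows for which the function returns `True`.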
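Background and challenges
Background overview
In natural language processing, cross-lingual named entity recognition (NER) is a key task for multilingual information extraction. Polyglot-NER was created in 2014 by Rami Al-Rfou and colleagues to address the poor coverage of low-resource languages in traditional NER systems. The dataset was produced by automatically extracting and labeling entity information for 40 languages from Wikipedia and the Freebase knowledge base. It provides training data for multilingual NER models and a basis for later research on cross-lingual transfer learning.
Current challenges
Polyglot-NER targets two problems in multilingual NER: scarce data and inconsistent annotation standards. At the task level, the main obstacles are differences in how entities are expressed across languages, the variety of grammatical structures, and entities that are specific to one culture. On the construction side, automatic annotation has to cope with entity disambiguation and cross-lingual coreference resolution, as well as noise in the Wikipedia data and inconsistent labels. Together these factors limit the precision and reliability of the dataset.
Common scenarios
Typical use cases
In natural language processing, multilingual named entity recognition (NER) is a key step in understanding the semantic structure of text. With its coverage of 40 languages, Polyglot-NER is a standard resource for training and evaluating cross-lingual NER models. Researchers typically use it to build multilingual sequence-labeling models and, because the labeling scheme is shared, to compare entity-recognition performance across languages and study how language characteristics affect generalization.
Academic problems addressed
The dataset eases the shortage of data in multilingual NER research, especially for low-resource languages. Because its annotations are generated automatically, it avoids the high cost and limited scale of manual annotation, and it provides an experimental basis for cross-lingual transfer learning and for zero-shot or few-shot learning. It has supported research on language-independent representation learning and a more even development of multilingual information extraction, which matters for building inclusive AI systems.
Derived and related work
According to this summary, Polyglot-NER has given rise to a body of follow-up research. For example, cross-lingual BERT pretrained models have reportedly used the data for multi-task learning to improve entity recognition in low-resource languages. The dataset is also described as a benchmark for evaluating the quality of multilingual word embeddings, in connection with alignment methods such as MUSE and VecMap. This work has contributed to the understanding of how multilingual representations transfer across languages.
The content above was collected and summarized by 遇见数据集.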