tner/wikiann

Source: Hugging Face | Updated: 2022-09-27 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/tner/wikiann
Description:
WikiAnn is a multilingual named entity recognition (NER) dataset covering 282 languages. It is designed for identifying three entity categories in text: Location (LOC), Organization (ORG), and Person (PER). The dataset is built from Wikipedia data and provides training, validation, and test splits, with the amount of data varying by language. Each record carries tokens with label IDs, and the data is organized by split, which makes it suitable for cross-lingual NER tasks.
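Provider:
tner
Summary of original information

Dataset overview

Basic information

  • Dataset name: WikiAnn
  • Alias: tner/wikiann
  • Languages: multiple languages are supported, for example ace, bg, and da.
  • Multilinguality: multilingual
  • Size: 10K<n<100K

Task and structure

  • Task category: token classification
  • Task ID: named entity recognition
  • Entity types: LOC, ORG, PER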
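
The three entity types combine with the B- (begin), I- (inside), and O (outside) prefixes to give seven labels. The sketch below is one way to derive such a label set. The function name build_label2id and the ordering rule (B- labels first, then I- labels, then O, with entity types sorted alphabetically) are assumptions made for illustration; they were chosen because they reproduce the label-to-ID mapping listed for this dataset, not because the dataset authors are known to generate it this way.

```python
# Entity types listed for the dataset.
ENTITY_TYPES = ["LOC", "ORG", "PER"]


def build_label2id(entity_types):
    """Build a BIO label-to-ID mapping (hypothetical helper).

    Assumed ordering: all B- labels, then all I- labels, then "O",
    with entity types sorted alphabetically.
    """
    types = sorted(entity_types)
    labels = [f"B-{t}" for t in types] + [f"I-{t}" for t in types] + ["O"]
    return {label: idx for idx, label in enumerate(labels)}


label2id = build_label2id(ENTITY_TYPES)
# Inverse mapping, useful for turning predicted IDs back into label strings.
id2label = {idx: label for label, idx in label2id.items()}

print(label2id)
```

Keeping both label2id and id2label around is convenient for token classification: the first encodes gold labels for training, the second decodes model outputs.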

Dataset structure

  • Data instances: each instance contains tokens and tags, for example:

    { "tokens": ["#", "#", "ユ", "リ", "ウ", "ス", "・", "ベ", "ー", "リ", "ッ", "ク", "#", "1", "9", "9", "9"], "tags": [6, 6, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6] }

  • Label IDs: the mapping from labels to IDs is:

    { "B-LOC": 0, "B-ORG": 1, "B-PER": 2, "I-LOC": 3, "I-ORG": 4, "I-PER": 5, "O": 6 }
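
To make the relationship between tags and label IDs concrete, the following sketch decodes the example instance into entity spans. The helper name extract_entities, the empty-string joiner (suited to the character-level Japanese tokens in the example; a space would fit whitespace-delimited languages), and the lenient handling of an I- tag that has no preceding B- tag are all assumptions for illustration rather than part of the dataset.

```python
LABEL2ID = {"B-LOC": 0, "B-ORG": 1, "B-PER": 2, "I-LOC": 3, "I-ORG": 4, "I-PER": 5, "O": 6}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

# The example instance from the dataset description.
example = {
    "tokens": ["#", "#", "ユ", "リ", "ウ", "ス", "・", "ベ", "ー", "リ", "ッ", "ク", "#", "1", "9", "9", "9"],
    "tags": [6, 6, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
}


def extract_entities(tokens, tags, joiner=""):
    """Return (entity_type, text, start, end) tuples; end is exclusive.

    Hypothetical helper: an I- tag that does not continue the current
    entity is treated as the start of a new one (a lenient choice).
    """
    entities = []
    start, current_type = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag_id in enumerate(list(tags) + [LABEL2ID["O"]]):
        prefix, _, etype = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and etype == current_type:
            continue  # still inside the current entity
        if current_type is not None:
            entities.append((current_type, joiner.join(tokens[start:i]), start, i))
        if prefix in ("B", "I"):
            start, current_type = i, etype
        else:
            start, current_type = None, None
    return entities


print(extract_entities(example["tokens"], example["tags"]))
# [('PER', 'ユリウス・ベーリック', 2, 12)]
```

In the example, tag 2 (B-PER) opens a person entity at token index 2 and the nine tags with ID 5 (I-PER) extend it, so the tokens from index 2 up to (but not including) index 12 form a single PER mention.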

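Data splits

The dataset is split by language into training, validation, and test sets; the number of examples in each split varies by language.

Citation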

@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman and Zhang, Boliang and May, Jonathan and Nothman, Joel and Knight, Kevin and Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data.",
}

Collected summary
Dataset introduction
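Background and challenges
Background overview
tner/wikiann is a multilingual named entity recognition dataset that covers 166 languages and includes three entity types: LOC, ORG, and PER. The data uses a token-tag format. The number of examples per language ranges from 100 to 20,000, with major languages such as English and Chinese having the larger volumes.
The above content was collected and summarized by 遇见数据集.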