MULTICONER

Name: MULTICONER
Creator: Amazon
Published: 2022-08-31 04:45:54
License: No description available

arXiv, updated 2022-08-31, indexed 2024-06-21

Download link:

https://registry.opendata.aws/multiconer/

Description:

MULTICONER is a large-scale multilingual named entity recognition (NER) dataset covering 11 languages and 3 domains (Wikipedia sentences, questions, and search queries), with additional multilingual and code-mixed subsets. It is designed to represent contemporary challenges in NER: low-context scenarios (short, uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. The data was generated from public resources such as Wikipedia and Wikidata, using their localized versions and automatic text translation, and is suited to testing cross-lingual and cross-domain NER performance. MULTICONER aims to support further research on building robust NER systems, particularly for complex and unseen entities.
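As a rough illustration of what short, uncased NER data with a multi-token entity such as a movie title can look like, the sketch below parses a small hand-written sample and groups BIO tags into entity spans. The file layout (one token and one tag per line, blank line between sentences), the `CW` label, and the sample sentences are assumptions made for this example; this page does not describe the actual file format of the download.

```python
from typing import List, Tuple

# Hypothetical sample, NOT taken from the dataset: token and BIO tag per line,
# sentences separated by a blank line. "CW" is an illustrative label name.
SAMPLE = """\
the O
godfather B-CW
part I-CW
ii I-CW
cast O

who O
directed O
inception B-CW
"""

Sentence = Tuple[List[str], List[str]]


def read_sentences(text: str) -> List[Sentence]:
    """Split CoNLL-style text into (tokens, tags) pairs, one per sentence."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: token
        tags.append(parts[-1])    # last column: tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tags into (entity text, label) pairs.

    An I- tag that does not continue an open entity of the same label
    is treated like O and dropped.
    """
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], ""
    if current:
        entities.append((" ".join(current), label))
    return entities


if __name__ == "__main__":
    for toks, tgs in read_sentences(SAMPLE):
        print(" ".join(toks), "->", extract_entities(toks, tgs))
```

With no capitalization to lean on, the entity boundaries in the sample are only recoverable from the tags, which is the kind of low-context setting the description refers to.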

Provider:

Amazon

Created:

2022-08-31
