
MULTICONER

Source: arXiv. Updated: 2022-08-31. Indexed: 2024-06-21.
Download link:
https://registry.opendata.aws/multiconer/
Resource overview:
MULTICONER is a large-scale multilingual named entity recognition (NER) dataset covering 11 languages and 3 domains (Wikipedia sentences, questions, and search queries), with additional multilingual and code-mixed subsets. It is designed to represent contemporary challenges in NER: low-context scenarios (short, uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. The data was generated from public resources such as Wikipedia and Wikidata, using their localized versions together with automatic text translation, and is suited to testing cross-lingual and cross-domain NER performance. MULTICONER aims to support further research on building robust NER systems, particularly for complex and unseen entities.
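To make the low-context and complex-entity challenges concrete, the sketch below runs a deliberately naive capitalization heuristic over one question. It is only an illustration: the function `capitalized_spans` and the example sentence are hypothetical and are not taken from the dataset or its tooling.

```python
def capitalized_spans(tokens):
    """Toy baseline (hypothetical, not part of MULTICONER): group consecutive
    capitalized tokens into candidate entity spans, ignoring the first token
    because sentence-initial capitalization is uninformative."""
    spans, current = [], []
    for i, tok in enumerate(tokens):
        if i > 0 and tok[:1].isupper():
            current.append(tok)
        else:
            # A non-capitalized token closes any span in progress.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical example question containing a movie title.
cased = "who directed Gone with the Wind".split()
# Search queries are often typed without capitalization.
uncased = [t.lower() for t in cased]

print(capitalized_spans(cased))    # the title is split at its lowercase words
print(capitalized_spans(uncased))  # no casing cue is left at all
```

On the cased input the heuristic breaks the title into fragments because of its lowercase function words, and on the uncased input it finds nothing, which is the kind of failure that short, uncased, title-like entities are meant to expose.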
Provider:
Amazon
Created:
2022-08-31