tomaarsen/MultiCoNER

Name: tomaarsen/MultiCoNER
Creator: tomaarsen
Published: 2023-10-01 19:39:19
License: No description available

Source: Hugging Face. Updated: 2023-10-01. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/tomaarsen/MultiCoNER
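A minimal sketch of fetching one configuration through the mirror above, assuming the third-party `datasets` library is installed and that the repository can be loaded by your installed version (neither is confirmed by this page). `REPO_ID`, `CONFIGS`, and `load_config` are illustrative names; the configuration names are the ones listed further down this page.

```python
import os

# Route Hub traffic through the mirror named in the download link.
# huggingface_hub reads HF_ENDPOINT when it is first imported, so this
# has to be set before `datasets` is imported anywhere in the process.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

REPO_ID = "tomaarsen/MultiCoNER"
CONFIGS = ["bn", "de", "en", "es", "fa", "hi", "ko",
           "mix", "multi", "nl", "ru", "tr", "zh"]


def load_config(config, split="validation"):
    """Load one split of one configuration (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the module can be inspected without the dependency.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split)
```

Calling `load_config("en")` would download the English validation split; starting with the small validation split keeps the first download short.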

Resource overview:

MultiCoNER (Version 1) is a large-scale multilingual named entity recognition (NER) dataset covering three domains (Wikipedia sentences, questions, and search queries) across 11 languages, alongside multilingual and code-mixed subsets. This dataset is designed to represent contemporary core challenges in NER, including low-context scenarios (short and uncased texts), syntactically complex entities such as movie titles, and long-tail entity distributions. Comprising 26 million tokens, the dataset is compiled from public resources using techniques including heuristic sentence sampling, template extraction and slot filling, and machine translation.

Provider:

tomaarsen

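Summary of original information

Dataset overview

Dataset information

License: CC BY 4.0
Task category: token classification (named entity recognition)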
Languages:
- Bengali (bn)
- German (de)
- English (en)
- Spanish (es)
- Persian (fa)
- Hindi (hi)
- Korean (ko)
- Dutch (nl)
- Russian (ru)
- Turkish (tr)
- Chinese (zh)
- Multilingual (multilingual)
Labels:
- PER: person
- LOC: location
- CORP: corporation
- GRP: group
- PROD: product
- CW: creative work
Label mapping (Python dict):
{
    "O": 0,
    "B-PER": 1, "I-PER": 2,
    "B-LOC": 3, "I-LOC": 4,
    "B-CORP": 5, "I-CORP": 6,
    "B-GRP": 7, "I-GRP": 8,
    "B-PROD": 9, "I-PROD": 10,
    "B-CW": 11, "I-CW": 12,
}
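The mapping above can be inverted to turn integer tag ids back into IOB labels and then into entity spans. The sketch below is one possible decoder, not code from the dataset authors: `extract_entities` is a hypothetical helper, the example sentence is invented, and joining tokens with spaces is a simplification that would not suit unsegmented scripts such as Chinese.

```python
# Label mapping copied from the dataset information.
LABEL2ID = {
    "O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4,
    "B-CORP": 5, "I-CORP": 6, "B-GRP": 7, "I-GRP": 8,
    "B-PROD": 9, "I-PROD": 10, "B-CW": 11, "I-CW": 12,
}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def extract_entities(tokens, tag_ids):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, entity_type = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag with no matching open span starts a new
            # span here (a lenient choice made for this sketch).
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Invented example sentence, not taken from the dataset.
tokens = ["the", "lord", "of", "the", "rings", "was",
          "filmed", "in", "new", "zealand"]
tag_ids = [0, 11, 12, 12, 12, 0, 0, 0, 3, 4]
print(extract_entities(tokens, tag_ids))
# [('CW', 'lord of the rings'), ('LOC', 'new zealand')]
```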

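Dataset configurations

Bengali (bn)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 5616369 bytes
- validation: 800 examples, 301806 bytes
- test: 133119 examples, 21668288 bytes
Download size: 31446032 bytes
Dataset size: 27586463 bytes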
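As a quick consistency check on the bn figures above, the three split sizes should add up to the reported dataset size. A minimal sketch, with variable names chosen only for illustration:

```python
# Split sizes in bytes for the bn configuration, copied from the listing.
bn_split_bytes = {"train": 5616369, "validation": 301806, "test": 21668288}
bn_dataset_size = 27586463

total_bytes = sum(bn_split_bytes.values())
print(total_bytes == bn_dataset_size)  # True

# Share of the bytes that sits in the test split.
test_share = bn_split_bytes["test"] / total_bytes
print(f"test share: {test_share:.1%}")
```

The same check can be repeated for any other configuration by swapping in its numbers.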

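German (de)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 4056698 bytes
- validation: 800 examples, 214572 bytes
- test: 217824 examples, 37113304 bytes
Download size: 44089736 bytes
Dataset size: 41384574 bytes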

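English (en)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 4330080 bytes
- validation: 800 examples, 229689 bytes
- test: 217818 examples, 38728401 bytes
Download size: 44709663 bytes
Dataset size: 43288170 bytes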

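Spanish (es)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 4576557 bytes
- validation: 800 examples, 238872 bytes
- test: 217887 examples, 41457435 bytes
Download size: 46861727 bytes
Dataset size: 46272864 bytes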

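Persian (fa)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 5550551 bytes
- validation: 800 examples, 294184 bytes
- test: 165702 examples, 30301688 bytes
Download size: 38042406 bytes
Dataset size: 36146423 bytes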

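Hindi (hi)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 6189324 bytes
- validation: 800 examples, 321246 bytes
- test: 141565 examples, 25771882 bytes
Download size: 35165171 bytes
Dataset size: 32282452 bytes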

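Korean (ko)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 4439652 bytes
- validation: 800 examples, 233963 bytes
- test: 178249 examples, 27529239 bytes
Download size: 35281170 bytes
Dataset size: 32202854 bytes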

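Code-mixed (mix)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 1500 examples, 307844 bytes
- validation: 500 examples, 100909 bytes
- test: 100000 examples, 20218549 bytes
Download size: 21802985 bytes
Dataset size: 20627302 bytes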

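Multilingual (multi)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 168300 examples, 54119956 bytes
- validation: 8800 examples, 2846552 bytes
- test: 471911 examples, 91509480 bytes
Download size: 148733494 bytes
Dataset size: 148475988 bytes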
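The multi train and validation counts equal 11 times the per-language counts (15300 and 800) reported for each monolingual configuration in this listing. That is consistent with multi pooling the 11 monolingual training and validation sets, although this page does not say so explicitly, so treat it as an inference. A short arithmetic sketch:

```python
# The 11 monolingual configurations named in the language list.
LANGUAGES = ["bn", "de", "en", "es", "fa", "hi", "ko", "nl", "ru", "tr", "zh"]

# Every monolingual configuration in the listing reports these counts.
PER_LANGUAGE = {"train": 15300, "validation": 800}

# Counts reported for the multi configuration.
MULTI = {"train": 168300, "validation": 8800}

for split, per_language_count in PER_LANGUAGE.items():
    pooled = per_language_count * len(LANGUAGES)
    print(split, pooled, pooled == MULTI[split])
# train 168300 True
# validation 8800 True
```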

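Dutch (nl)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 4070487 bytes
- validation: 800 examples, 209337 bytes
- test: 217337 examples, 37128925 bytes
Download size: 43263864 bytes
Dataset size: 41408749 bytes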

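Russian (ru)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 5313989 bytes
- validation: 800 examples, 279470 bytes
- test: 217501 examples, 47458726 bytes
Download size: 54587257 bytes
Dataset size: 53052185 bytes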

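Turkish (tr)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 4076774 bytes
- validation: 800 examples, 213017 bytes
- test: 136935 examples, 14779846 bytes
Download size: 22825291 bytes
Dataset size: 19069637 bytes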

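Chinese (zh)

Features:
- id: int32
- tokens: sequence of strings
- ner_tags: sequence of class labels
Splits:
- train: 15300 examples, 5899475 bytes
- validation: 800 examples, 310396 bytes
- test: 151661 examples, 29349271 bytes
Download size: 36101525 bytes
Dataset size: 35559142 bytes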
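Every configuration above shares the same three features. The sketch below models one record and checks two properties that a token classification dataset would normally satisfy: one tag per token, and tag ids inside the 13-entry label mapping. It assumes `ner_tags` are stored as integer class ids, which this page implies but does not state; `MultiCoNERRecord` and `validate_record` are hypothetical names.

```python
from typing import List, TypedDict

NUM_LABELS = 13  # "O" plus B-/I- tags for the six entity types


class MultiCoNERRecord(TypedDict):
    id: int
    tokens: List[str]
    ner_tags: List[int]


def validate_record(record: MultiCoNERRecord) -> None:
    """Raise ValueError if tokens and tags are misaligned or out of range."""
    if len(record["tokens"]) != len(record["ner_tags"]):
        raise ValueError(f"record {record['id']}: token/tag length mismatch")
    for tag in record["ner_tags"]:
        if not 0 <= tag < NUM_LABELS:
            raise ValueError(f"record {record['id']}: tag id {tag} out of range")


# Invented record, not taken from the dataset.
example: MultiCoNERRecord = {
    "id": 0,
    "tokens": ["who", "founded", "acme", "corp"],
    "ner_tags": [0, 0, 5, 6],
}
validate_record(example)  # passes silently
```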

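Collected summary

Dataset introduction

Background and challenges

Background overview

MultiCoNER is a large-scale multilingual named entity recognition dataset covering 11 languages and three text domains. It focuses on real-world challenges such as low-context text and syntactically complex entities. The dataset uses the IOB annotation format, contains more than 2.9 million rows, and is suited to research on and evaluation of multilingual NER models.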
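The figure of more than 2.9 million rows can be reproduced by summing the example counts of all 13 configurations listed on this page. A small sketch, with `SPLIT_EXAMPLES` as an illustrative name:

```python
# (train, validation, test) example counts copied from the configuration listing.
SPLIT_EXAMPLES = {
    "bn": (15300, 800, 133119),
    "de": (15300, 800, 217824),
    "en": (15300, 800, 217818),
    "es": (15300, 800, 217887),
    "fa": (15300, 800, 165702),
    "hi": (15300, 800, 141565),
    "ko": (15300, 800, 178249),
    "mix": (1500, 500, 100000),
    "multi": (168300, 8800, 471911),
    "nl": (15300, 800, 217337),
    "ru": (15300, 800, 217501),
    "tr": (15300, 800, 136935),
    "zh": (15300, 800, 151661),
}

total_rows = sum(sum(counts) for counts in SPLIT_EXAMPLES.values())
print(total_rows)  # 2923709
```

Note that this total counts the multi configuration separately from the monolingual ones, so it describes rows across all configurations rather than unique sentences.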

The above content was collected and summarized by 遇见数据集.
