---
license: cc-by-4.0
task_categories:
- token-classification
language:
- bn
- zh
- de
- en
- es
- fa
- fr
- hi
- it
- pt
- sv
- uk
tags:
- multiconer
- ner
- multilingual
- named entity recognition
- fine-grained ner
size_categories:
- 100K<n<1M
---
# Dataset Card for Multilingual Complex Named Entity Recognition (MultiCoNER)
## Dataset Description
- **Homepage:** https://multiconer.github.io
- **Repository:**
- **Paper:**
- **Leaderboard:** https://multiconer.github.io/results, https://codalab.lisn.upsaclay.fr/competitions/10025
- **Point of Contact:** https://multiconer.github.io/organizers
### Dataset Summary
MultiCoNER uses a fine-grained tagset.
Each coarse-grained category maps to the following fine-grained tags:
* Location (LOC) : Facility, OtherLOC, HumanSettlement, Station
* Creative Work (CW) : VisualWork, MusicalWork, WrittenWork, ArtWork, Software
* Group (GRP) : MusicalGRP, PublicCORP, PrivateCORP, AerospaceManufacturer, SportsGRP, CarManufacturer, ORG
* Person (PER) : Scientist, Artist, Athlete, Politician, Cleric, SportsManager, OtherPER
* Product (PROD) : Clothing, Vehicle, Food, Drink, OtherPROD
* Medical (MED) : Medication/Vaccine, MedicalProcedure, AnatomicalStructure, Symptom, Disease
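The mapping above can be sketched as a lookup table, which is handy when evaluating at the coarse level. This is a minimal sketch, not part of any official tooling: the names `COARSE_TO_FINE`, `FINE_TO_COARSE`, and `coarsen` are hypothetical, and the `B-`/`I-`/`O` label prefixes are assumed from the file excerpt in the Data Fields section.

```python
# Coarse-to-fine mapping copied from the list above.
COARSE_TO_FINE = {
    "LOC": ["Facility", "OtherLOC", "HumanSettlement", "Station"],
    "CW": ["VisualWork", "MusicalWork", "WrittenWork", "ArtWork", "Software"],
    "GRP": ["MusicalGRP", "PublicCORP", "PrivateCORP", "AerospaceManufacturer",
            "SportsGRP", "CarManufacturer", "ORG"],
    "PER": ["Scientist", "Artist", "Athlete", "Politician", "Cleric",
            "SportsManager", "OtherPER"],
    "PROD": ["Clothing", "Vehicle", "Food", "Drink", "OtherPROD"],
    "MED": ["Medication/Vaccine", "MedicalProcedure", "AnatomicalStructure",
            "Symptom", "Disease"],
}

# Inverted lookup: fine-grained tag -> coarse-grained category.
FINE_TO_COARSE = {
    fine: coarse for coarse, fines in COARSE_TO_FINE.items() for fine in fines
}


def coarsen(label: str) -> str:
    """Map a BIO label such as 'B-Artist' to its coarse form 'B-PER'.

    'O' is returned unchanged; an unknown fine-grained tag raises KeyError.
    """
    if label == "O":
        return label
    prefix, fine = label.split("-", 1)  # split only on the first hyphen
    return f"{prefix}-{FINE_TO_COARSE[fine]}"


print(coarsen("B-Artist"))       # B-PER
print(coarsen("I-MusicalWork"))  # I-CW
```

Raising `KeyError` on unknown tags is a deliberate choice here: it surfaces casing or spelling mismatches in the labels instead of silently dropping them.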
### Supported Tasks and Leaderboards
The final leaderboard of the shared task is available <a href="https://multiconer.github.io/results" target="_blank">here</a>.
### Languages
The supported languages are Bangla, Chinese, English, Spanish, Farsi, French, German, Hindi, Italian, Portuguese, Swedish, and Ukrainian.
## Dataset Structure
The dataset follows the CoNLL format.
### Data Instances
Here are some examples in different languages:
* Bangla: [লিটল মিক্স | MusicalGRP] এ যোগদানের আগে তিনি [পিৎজা হাট | ORG] এ ওয়েট্রেস হিসাবে কাজ করেছিলেন।
* Chinese: 它的纤维穿过 [锁骨 | AnatomicalStructure] 并沿颈部侧面倾斜向上和内侧.
* English: [wes anderson | Artist]'s film [the grand budapest hotel | VisualWork] opened the festival .
* Farsi: مرکز این استان شهر [ناگویا | HumanSettlement] است
* French: l [amiral de coligny | Politician] réussit à s y glisser .
* German: in [frühgeborenes | Disease] führt dies zu [irds | Symptom] .
* Hindi: १७९६ में उन्हें [शाही स्वीडिश विज्ञान अकादमी | Facility] का सदस्य चुना गया।
* Italian: è conservato nel [rijksmuseum | Facility] di [amsterdam | HumanSettlement] .
* Portuguese: também é utilizado para se fazer [licor | Drink] e [vinhos | Drink].
* Spanish: fue superado por el [aon center | Facility] de [los ángeles | HumanSettlement] .
* Swedish: [tom hamilton | Artist] amerikansk musiker basist i [aerosmith | MusicalGRP] .
* Ukrainian: назва альбому походить з роману « [кінець дитинства | WrittenWork] » англійського письменника [артура кларка | Artist] .
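The bracketed notation used in these examples can be derived from token-level labels. The sketch below assumes the labels carry `B-`/`I-`/`O` prefixes, as in the file excerpt under Data Fields; the helper names `bio_to_spans` and `render` are hypothetical and not part of the dataset's tooling.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, tag) for each entity in a BIO sequence."""
    spans = []
    start, tag = None, None
    for i, label in enumerate(labels):
        # The open span continues only on an I- label with the same tag.
        continues = label.startswith("I-") and tag == label[2:]
        if not continues:
            if tag is not None:
                spans.append((start, i, tag))
                start, tag = None, None
            if label != "O":
                # A B- label (or a stray I- label) opens a new span.
                start, tag = i, label[2:]
    if tag is not None:
        spans.append((start, len(labels), tag))
    return spans


def render(tokens: List[str], labels: List[str]) -> str:
    """Render a sentence in the '[entity text | Tag]' notation."""
    starts = {s: (e, t) for s, e, t in bio_to_spans(labels)}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, tag = starts[i]
            out.append(f"[{' '.join(tokens[i:end])} | {tag}]")
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = ["doris", "day", "included", "in", "the", "album",
          "billy", "rose", "'s", "jumbo"]
labels = ["B-Artist", "I-Artist", "O", "O", "O", "O",
          "B-MusicalWork", "I-MusicalWork", "I-MusicalWork", "I-MusicalWork"]
print(render(tokens, labels))
# [doris day | Artist] included in the album [billy rose 's jumbo | MusicalWork]
```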
### Data Fields
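Each sentence starts with a comment line that carries a sentence ID and a `domain` value. Every following line describes one token: the first column is the token and the last column is its label, with `_` placeholders in the two columns between them. Labels use `B-` and `I-` prefixes to mark the beginning and continuation of an entity, and `O` for tokens outside any entity. Here is an example from the English data.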
```
# id f5458a3a-cd23-4df4-8384-4e23fe33a66b domain=en
doris _ _ B-Artist
day _ _ I-Artist
included _ _ O
in _ _ O
the _ _ O
album _ _ O
billy _ _ B-MusicalWork
rose _ _ I-MusicalWork
's _ _ I-MusicalWork
jumbo _ _ I-MusicalWork
```
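A file in this layout can be read with a few lines of Python. This is a minimal sketch under stated assumptions: columns are separated by whitespace, sentences are separated by blank lines or by the next `# id` header (the excerpt above shows only one sentence, so the separator is an assumption), and the function name `read_conll` is hypothetical.

```python
from typing import Dict, Iterable, Iterator


def read_conll(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per sentence with keys 'id', 'tokens', and 'labels'."""
    sent_id, tokens, labels = None, [], []
    for raw in lines:
        line = raw.strip()
        if line.startswith("# id"):
            # A header line looks like: "# id <sentence-id> domain=<lang>".
            if tokens:
                yield {"id": sent_id, "tokens": tokens, "labels": labels}
                tokens, labels = [], []
            parts = line.split()
            sent_id = parts[2] if len(parts) > 2 else None
        elif not line:
            # Assumed sentence separator: a blank line.
            if tokens:
                yield {"id": sent_id, "tokens": tokens, "labels": labels}
                sent_id, tokens, labels = None, [], []
        else:
            cols = line.split()
            tokens.append(cols[0])   # first column: token
            labels.append(cols[-1])  # last column: label
    if tokens:
        yield {"id": sent_id, "tokens": tokens, "labels": labels}


sample = """# id f5458a3a-cd23-4df4-8384-4e23fe33a66b domain=en
doris _ _ B-Artist
day _ _ I-Artist
included _ _ O
"""
for sentence in read_conll(sample.splitlines()):
    print(sentence["id"], sentence["tokens"], sentence["labels"])
```

The function accepts any iterable of lines, so an open file handle can be passed in place of `sample.splitlines()`.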
### Data Splits
The dataset provides train, dev, and test splits.
## Dataset Creation
TBD
## Loading the Dataset
```python
from datasets import load_dataset
# The second argument selects the language configuration; 'English (EN)' is the English subset.
english_data = load_dataset('MultiCoNER/multiconer_v2', 'English (EN)')
```
### Licensing Information
CC BY 4.0
### Citation Information
```
@inproceedings{multiconer2-report,
title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
year={2023},
publisher={Association for Computational Linguistics},
}
@article{multiconer2-data,
title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
year={2023},
}
```