SpeedOfMagic/ontonotes_english
Dataset Overview
Basic Information
- Dataset name: ontonotes_english
- Language: English
- License: unknown
- Dataset size: 10K<n<100K
- Task categories:
  - Named Entity Recognition
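Dataset Description
Dataset Summary
This is a preprocessed version of the OntoNotes v5.0 dataset. The sentences have been unpacked and are stored as rows, one sentence per row, and the fields have been renamed to match the conll2003 dataset. The data comes from a private repository, which in turn obtained it from a public repository whose location is unknown. Because the data carried no license in any of these repositories, the dataset author expects no licensing issues.
Supported Tasks and Leaderboards
- Named Entity Recognition
Data Structure
Data Instances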
{
  "tokens": ["Well", ",", "the", "Hundred", "Regiments", "Offensive", "was", "divided", "into", "three", "phases", "."],
  "ner_tags": [0, 0, 29, 30, 30, 30, 0, 0, 0, 27, 0, 0]
}
Data Fields
- tokens (List[str]): the words of the original dataset.
- ner_tags (List[ClassLabel]): the named entities of the original dataset, encoded as BIO tags that mark the entities in each sentence.
- Tag set:
datasets.ClassLabel(num_classes=37, names=["O", "B-PERSON", "I-PERSON", "B-NORP", "I-NORP", "B-FAC", "I-FAC", "B-ORG", "I-ORG", "B-GPE", "I-GPE", "B-LOC", "I-LOC", "B-PRODUCT", "I-PRODUCT", "B-DATE", "I-DATE", "B-TIME", "I-TIME", "B-PERCENT", "I-PERCENT", "B-MONEY", "I-MONEY", "B-QUANTITY", "I-QUANTITY", "B-ORDINAL", "I-ORDINAL", "B-CARDINAL", "I-CARDINAL", "B-EVENT", "I-EVENT", "B-WORK_OF_ART", "I-WORK_OF_ART", "B-LAW", "I-LAW", "B-LANGUAGE", "I-LANGUAGE"])
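The integer ids in ner_tags index into the tag set listed above, so id 29 is "B-EVENT" and id 27 is "B-CARDINAL". The following is a minimal sketch of how the ids could be decoded and grouped into entity spans. The helper names `decode_tags` and `extract_spans` are hypothetical and not part of the dataset or of any library, and treating a stray I- tag as the start of a new span is an assumption made here for robustness.

```python
from typing import List, Tuple

# Label names copied from the tag set in this card; list index == tag id.
NER_NAMES = [
    "O", "B-PERSON", "I-PERSON", "B-NORP", "I-NORP", "B-FAC", "I-FAC",
    "B-ORG", "I-ORG", "B-GPE", "I-GPE", "B-LOC", "I-LOC",
    "B-PRODUCT", "I-PRODUCT", "B-DATE", "I-DATE", "B-TIME", "I-TIME",
    "B-PERCENT", "I-PERCENT", "B-MONEY", "I-MONEY",
    "B-QUANTITY", "I-QUANTITY", "B-ORDINAL", "I-ORDINAL",
    "B-CARDINAL", "I-CARDINAL", "B-EVENT", "I-EVENT",
    "B-WORK_OF_ART", "I-WORK_OF_ART", "B-LAW", "I-LAW",
    "B-LANGUAGE", "I-LANGUAGE",
]


def decode_tags(tag_ids: List[int]) -> List[str]:
    """Map integer tag ids to their BIO label strings (hypothetical helper)."""
    return [NER_NAMES[i] for i in tag_ids]


def extract_spans(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group BIO tags into (entity_type, text) pairs (hypothetical helper)."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, label in zip(tokens, decode_tags(tag_ids)):
        prefix, _, entity_type = label.partition("-")
        # An I- tag of the same type extends the span that is currently open.
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue
        # Anything else closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # Assumption: a stray I- tag is treated as the start of a new span.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# The data instance shown in this card.
tokens = ["Well", ",", "the", "Hundred", "Regiments", "Offensive",
          "was", "divided", "into", "three", "phases", "."]
ner_tags = [0, 0, 29, 30, 30, 30, 0, 0, 0, 27, 0, 0]

print(extract_spans(tokens, ner_tags))
# [('EVENT', 'the Hundred Regiments Offensive'), ('CARDINAL', 'three')]
```

Joining tokens with single spaces is only a display convenience; it does not reconstruct the original spacing of the text.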
Data Splits
- Training set (train)
- Validation set (validation)
- Test set (test)
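The three splits could be loaded roughly as follows. This sketch assumes the dataset is hosted on the Hugging Face Hub under the identifier in the title of this card and that the `datasets` library is installed; neither is confirmed here. `load_splits` and `split_sizes` are hypothetical helper names, and no split sizes are shown because the card only gives the overall range 10K<n<100K.

```python
from typing import Dict, Mapping, Sized

# Assumption: the card title is the Hugging Face Hub repository id.
REPO_ID = "SpeedOfMagic/ontonotes_english"


def load_splits(repo_id: str = REPO_ID):
    """Download every split of the dataset (needs network access)."""
    # Imported lazily so that split_sizes below also works without `datasets`.
    from datasets import load_dataset
    return load_dataset(repo_id)


def split_sizes(splits: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of rows in each split."""
    return {name: len(rows) for name, rows in splits.items()}


# Usage (not run here because it downloads data):
#   splits = load_splits()
#   print(split_sizes(splits))   # keys expected: train, validation, test
#   print(splits["train"][0])    # one row with "tokens" and "ner_tags"
```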
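Dataset Creation
Source Data
The data comes from a private repository, which in turn obtained it from a public repository whose location is unknown.
Licensing Information
No license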
Citation Information
@inproceedings{pradhan-etal-2013-towards,
    title = "Towards Robust Linguistic Analysis using {O}nto{N}otes",
    author = {Pradhan, Sameer and Moschitti, Alessandro and Xue, Nianwen and Ng, Hwee Tou and Bj{\"o}rkelund, Anders and Uryupina, Olga and Zhang, Yuchen and Zhong, Zhi},
    booktitle = "Proceedings of the Seventeenth Conference on Computational Natural Language Learning",
    month = aug,
    year = "2013",
    address = "Sofia, Bulgaria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W13-3516",
    pages = "143--152",
}



