
CLEANCONLL

Source: arXiv. Updated: 2023-10-25. Indexed: 2024-06-21.
Download link:
https://github.com/flairNLP/CleanCoNLL
Resource description:
CLEANCONLL is a nearly noise-free named entity recognition (NER) dataset created by Susanna Rücker and Alan Akbik of Humboldt-Universität zu Berlin as a corrected version of the widely used CoNLL-03 dataset. Entity linking information was used to improve the consistency and quality of the annotations. State-of-the-art models reach a higher F1 score on it (97.1%), and because annotation noise is reduced, the share of correct predictions wrongly counted as errors falls from 47% to 6%. CLEANCONLL is therefore well suited to analyzing the remaining errors of state-of-the-art NER models, and it indicates that the theoretical upper bound for high-resource, coarse-grained NER has not yet been reached. The dataset is publicly available to the research community to support further NER research and model evaluation.
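
The F1 score quoted above is normally computed at the entity level: a prediction counts only if both the span and the type match the gold annotation. The following is a minimal Python sketch of that idea for a single sentence, assuming BIO-style tags; the helper names `bio_to_spans` and `entity_f1` and the toy tags are illustrative assumptions, not part of the CleanCoNLL release or its official evaluation script.

```python
from typing import List, Set, Tuple

# (start index, exclusive end index, entity type); hypothetical representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence of BIO tags."""
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        # An open span ends unless the tag continues it with the same type.
        if start is not None and (not tag.startswith("I-") or tag[2:] != label):
            spans.add((start, i, label))
            start, label = None, None
        # Open a new span on B-, or on a stray I- with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a prediction is correct only on an exact span and type match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy sentence (made up for illustration): one entity matches, one differs in type.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(entity_f1(gold, pred))
```

In this toy case, if the gold `B-LOC` tag were itself an annotation mistake and `ORG` were the right type, the model's correct prediction would still be scored as an error. That is the kind of miscount the description says annotation noise causes.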
Provider:
Humboldt-Universität zu Berlin
Created:
2023-10-25