
atamiles/NERsocial

Source: Hugging Face. Updated 2024-12-18; indexed 2024-12-21.
Download link:
https://hf-mirror.com/datasets/atamiles/NERsocial
Description:
NERsocial is a named entity recognition (NER) dataset designed for human-robot interaction (HRI) applications. It contains 99,448 sentences, 153,102 entity tokens, and 134,074 entities across six entity types that matter for social interaction: drinks, foods, hobbies, jobs, pets, and sports. Three further entity types were added by re-annotating the CoNLL-2003 dataset: PERSONNAME, COUNTRY, and ORGANIZATION.

The dataset was built with RapidNER, a framework that combines knowledge graph extraction from Wikidata with text collected from Wikipedia, Reddit, and Stack Exchange. Wikipedia supplies formal, definitional content, while Reddit and Stack Exchange contribute conversational and interactive language. Annotation uses Elasticsearch, which reduced the annotation time per sentence from 1 minute to 0.9 milliseconds. Human annotators validated the annotation quality, reaching a Fleiss' kappa of 90.6% and pairwise Cohen's kappa scores between 88.3% and 92.9%.

Fine-tuned BERT-base, RoBERTa-base, and DeBERTa-v3-base models all reach F1 scores above 95% on NERsocial. Models fine-tuned on NERsocial also transfer better across text domains than models trained on similar datasets such as WNUT, which makes the dataset useful for building NER systems that must handle both formal and informal communication in HRI applications.
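For readers unfamiliar with the agreement metric quoted above, the following is a minimal sketch of how a pairwise Cohen's kappa can be computed over token-level labels. It is not the authors' evaluation code. The function name `cohens_kappa`, the tag names, and the two label sequences are hypothetical and are not taken from NERsocial.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical token-level tags from two annotators (illustrative only).
annotator_1 = ["B-FOOD", "O", "O", "B-PET", "O", "B-SPORT", "O", "O"]
annotator_2 = ["B-FOOD", "O", "O", "B-PET", "O", "O", "O", "O"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"{kappa:.3f}")  # prints 0.750
```

Here the annotators agree on 7 of 8 tokens (observed agreement 0.875) while chance agreement is 0.5, so kappa is 0.75. Kappa corrects raw agreement for the large share of `O` tokens that both annotators would match on by chance.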
Provider:
atamiles