NoisywikiHow
Source: arXiv · Updated: 2023-05-18 · Indexed: 2024-06-21
Download link:
https://github.com/tangminji/NoisywikiHow
Resource overview:
NoisywikiHow is a large-scale NLP benchmark dataset created by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. It targets the problem of real-world noisy labels in natural language processing. The dataset contains more than 89,000 procedural events. It is built with minimal human supervision: articles are crawled from the wikiHow website, then cleaned and injected with noise through a series of automated annotation procedures. The injected noise imitates the mistakes human annotators make and draws on multiple noise sources to replicate real-world conditions. NoisywikiHow is intended for evaluating and improving learning with noisy labels (LNL) methods, particularly on the intent recognition task, which supports a range of downstream natural language understanding tasks from commonsense reasoning to dialogue systems.
Provided by:
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
Created:
2023-05-18
Curated summary
Dataset introduction

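Background and challenges
Background overview
NoisywikiHow is described as the largest benchmark for learning with real-world noisy labels in natural language processing. It contains 89,143 procedural events across 158 non-overlapping categories, with a long-tailed class distribution and controlled noise from multiple sources. The dataset is designed for the intent recognition task: it simulates the label noise found in real scenarios and can be used to evaluate model performance under noisy conditions.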
The summary above was collected and generated by 遇见数据集.
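To make "controlled noise" on a long-tailed label set concrete, here is a minimal Python sketch. It is an illustration only and does not reproduce how NoisywikiHow was built: it uses uniform random label flipping, a much simpler scheme than the multi-source, human-like noise described above. The function names, the toy 1/(k+1) class weights, and the 0.4 noise rate are all hypothetical; only the class count of 158 comes from the overview.

```python
import random


def inject_uniform_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` in which a fraction `noise_rate` of the
    entries is replaced by a different class chosen uniformly at random.

    Assumption: uniform flipping stands in for the dataset's real noise
    sources, which this sketch does not model.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)
    noisy = list(labels)
    num_to_flip = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), num_to_flip):
        # A non-zero offset modulo num_classes guarantees the label changes.
        offset = rng.randrange(1, num_classes)
        noisy[i] = (noisy[i] + offset) % num_classes
    return noisy


def observed_noise_rate(clean, noisy):
    """Fraction of positions where the noisy label differs from the clean one."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


if __name__ == "__main__":
    NUM_CLASSES = 158  # category count stated in the overview
    rng = random.Random(42)
    # Toy long-tailed labels: class k is sampled with weight 1 / (k + 1).
    weights = [1.0 / (k + 1) for k in range(NUM_CLASSES)]
    clean = rng.choices(range(NUM_CLASSES), weights=weights, k=10_000)
    noisy = inject_uniform_label_noise(clean, NUM_CLASSES, noise_rate=0.4)
    print(f"observed noise rate: {observed_noise_rate(clean, noisy):.3f}")
```

Keeping the clean labels alongside the noisy copy is what makes the noise level "controlled": the corruption rate is known exactly, so an LNL method can be scored against the ground truth at each rate.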



