NoisywikiHow

Name: NoisywikiHow
Creator: Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
Published: 2023-05-18 13:01:04
License: Not specified

Source: arXiv; updated 2023-05-18; indexed 2024-06-21

Download link:

https://github.com/tangminji/NoisywikiHow


Resource description:

NoisywikiHow is a large-scale NLP benchmark dataset created by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. It targets the problem of real-world noisy labels in natural language processing. The dataset contains more than 89,000 procedural events and is built with minimal human supervision: it simulates the mistakes human annotators make and introduces multiple noise sources to replicate real-world label noise. Construction involves crawling articles from the wikiHow website, then cleaning the data and injecting noise through a series of automated annotation procedures. NoisywikiHow is intended for evaluating and improving learning with noisy labels (LNL) methods, particularly on intent recognition, a task that supports a range of downstream natural language understanding applications, from commonsense reasoning to dialogue systems.
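To make the idea of controlled noise injection concrete, the following is a minimal, generic sketch of uniform label flipping at a fixed noise rate. It is an illustration only and is not the procedure used to build NoisywikiHow, which draws on several real-world noise sources; the function name `inject_label_noise`, its parameters, and the toy labels are all hypothetical.

```python
import random


def inject_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` with a fraction `noise_rate` flipped.

    Hypothetical helper for illustration: each selected label is replaced
    by a different class chosen uniformly at random.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(round(noise_rate * len(noisy)))
    # Pick distinct positions to corrupt.
    for i in rng.sample(range(len(noisy)), n_flip):
        # Any class except the original one is a valid replacement.
        candidates = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(candidates)
    return noisy


# Toy example: 100 labels over 4 classes, 20% of them corrupted.
clean = [0, 1, 2, 3] * 25
noisy = inject_label_noise(clean, num_classes=4, noise_rate=0.2)
changed = sum(a != b for a, b in zip(clean, noisy))
```

Because every flipped label is guaranteed to differ from the original, `changed` equals the requested number of corrupted labels, which makes the noise level easy to control when comparing LNL methods.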

Provider:

Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology

Created:

2023-05-18

Collected Summary

Dataset Introduction