leondz/wnut_17

Hugging Face | Updated 2024-01-18 | Indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/leondz/wnut_17


Overview:

The WNUT 17 dataset is a named entity recognition (NER) dataset focused on identifying novel and rare entities in noisy text. It includes training, validation, and test splits with 3,394, 1,009, and 1,287 samples respectively. Each sample contains an ID, the tokens of the input text, and the corresponding NER tags. The tags follow the IOB2 format and cover six entity types: corporation, creative-work, group, location, person, and product. The dataset was created to provide a definition of emerging and rare entities and, based on that definition, data for detecting them.

Provider:

leondz

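Summary of Original Information

Dataset Overview

Dataset Name

Name: WNUT 17
Alias: wnut_17

Dataset Description

Task: identifying emerging and rare entities
Language: English (en)
License: CC-BY-4.0
Source data: original
Multilinguality: monolingual
Size: 1K<n<10K
Task category: token classification
Task ID: named entity recognition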

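Dataset Structure

Features:
- id: string, the ID of the example
- tokens: sequence of strings, the tokens of the example text
- ner_tags: sequence of class labels, the NER tag of each token in IOB2 format
Splits:
- train: 3,394 examples
- validation: 1,009 examples
- test: 1,287 examples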
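To make the IOB2 convention concrete, the following is a minimal sketch of turning a tagged token sequence into entity spans. The record is hypothetical (invented tokens and tags, not taken from the dataset), and the helper name `iob2_spans` is an illustration rather than part of any library. It assumes string tags; if the tags are stored as integer class IDs, they would need to be mapped to label strings first.

```python
from typing import List, Tuple

# Hypothetical record with the three fields listed above (not real data).
example = {
    "id": "0",
    "tokens": ["watching", "Stranger", "Things", "in", "New", "York"],
    "ner_tags": ["O", "B-creative-work", "I-creative-work",
                 "O", "B-location", "I-location"],
}


def iob2_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, int, int, str]]:
    """Return (entity_type, start, end_exclusive, text) for each IOB2 span."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the entity type matches.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # A stray I- tag is invalid in strict IOB2; this sketch is lenient
            # and treats it as the beginning of a new span.
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((etype, start, len(tags), " ".join(tokens[start:])))
    return spans


spans = iob2_spans(example["tokens"], example["ner_tags"])
```

Each span carries token offsets as well as the joined text, which is usually what entity-level evaluation needs.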

Dataset Creation

Annotation creators: crowdsourced
Language creators: found

Considerations for Using the Data

Citation information:

@inproceedings{derczynski-etal-2017-results,
    title = "Results of the {WNUT}2017 Shared Task on Novel and Emerging Entity Recognition",
    author = "Derczynski, Leon and Nichols, Eric and van Erp, Marieke and Limsopatham, Nut",
    booktitle = "Proceedings of the 3rd Workshop on Noisy User-generated Text",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4418",
    doi = "10.18653/v1/W17-4418",
    pages = "140--147",
    abstract = "This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet {``}so.. kktny in 30 mins?!{''} {--} even human experts find the entity {`}kktny{'} hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities. The task as described in this paper evaluated the ability of participating entries to detect and classify novel and emerging named entities in noisy text.",
}

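Curated Summary

Dataset Introduction

Construction

WNUT 17 was built to support identifying and classifying emerging and rare entities in text. The source text was annotated through crowdsourcing, with each annotation marking a named entity and its category. The categories are corporation, creative-work, group, location, person, and product, and the tags follow the IOB2 scheme. The data is divided into training, validation, and test sets so that models can be trained and evaluated on separate samples.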
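As an illustration of how six entity types expand into an IOB2 tag inventory, the sketch below builds the label list and the two lookup tables typically needed for token classification. The ordering (`O` first, then `B-`/`I-` pairs in the order the types are listed) is an assumption; the authoritative order is whatever the dataset's own `ner_tags` feature declares, so it is worth checking before relying on these IDs. The names `build_label_list` and `decode` are illustrative.

```python
# The six entity types named in the description above.
ENTITY_TYPES = ["corporation", "creative-work", "group",
                "location", "person", "product"]


def build_label_list(entity_types):
    """Expand entity types into IOB2 labels: O, then B-x / I-x per type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels


# ASSUMPTION: this ordering is illustrative and may differ from the dataset's.
LABELS = build_label_list(ENTITY_TYPES)
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def decode(tag_ids):
    """Map a sequence of integer class IDs back to IOB2 label strings."""
    return [id2label[i] for i in tag_ids]
```

Six types give 2 × 6 + 1 = 13 labels, which is the output size a token-classification head would need under this scheme.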

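Characteristics

The dataset is distinguished by its focus on emerging and rare entities, which matters for improving NER recall on noisy text. It is monolingual (English) and moderate in size, with fewer than 10,000 samples (5,690 across the three splits). It is released under the CC-BY-4.0 license, which permits sharing and adaptation with attribution.

Usage

Users should first understand the data structure, which consists of three fields: id, tokens, and ner_tags. The tokens field holds the tokenized text, and the ner_tags field holds the named entity tag for each token. These fields can be used to train, validate, and test models. The dataset can be downloaded and loaded through the Hugging Face datasets library for use in natural language processing tasks.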
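A minimal loading sketch, assuming the Hugging Face `datasets` package is installed and the dataset is reachable under the repository ID `leondz/wnut_17` from the page header. Depending on the installed library version, `load_dataset` may need additional arguments. The helper `check_split_sizes` and the constant `EXPECTED_SIZES` are hypothetical names; the helper compares whatever was loaded against the split sizes stated on this page.

```python
# Split sizes as stated in the dataset description.
EXPECTED_SIZES = {"train": 3394, "validation": 1009, "test": 1287}


def load_wnut17(repo_id="leondz/wnut_17"):
    """Load all splits from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the rest of this sketch runs without the package.
    from datasets import load_dataset
    return load_dataset(repo_id)


def check_split_sizes(splits, expected=EXPECTED_SIZES):
    """Return {split: (actual, expected)} for every split whose size differs.

    `splits` can be any mapping from split name to a sized collection.
    A missing split is reported with an actual size of None.
    """
    mismatches = {}
    for name, expected_size in expected.items():
        actual = len(splits[name]) if name in splits else None
        if actual != expected_size:
            mismatches[name] = (actual, expected_size)
    return mismatches
```

A typical call would be `check_split_sizes(load_wnut17())`, where an empty result means the loaded splits match the documented sizes.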

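Background and Challenges

Background Overview

The WNUT 17 dataset, titled "Emerging and Rare entity recognition", originates from a shared task organized in 2017 by Leon Derczynski and colleagues (the WNUT2017 Shared Task on Novel and Emerging Entity Recognition). It addresses the challenge of recognizing novel and rare named entities in noisy text, which matters for the many modern tasks built on named entities, such as event clustering and summarization. The dataset was created to provide a definition of emerging and rare entities and, based on that definition, data for detecting them. Its release has benefited natural language processing research, particularly entity recognition, by providing a shared resource for related work.

Current Challenges

The main challenges in constructing WNUT 17 were: 1) how to define and identify emerging and rare entities; 2) how to maintain recognition accuracy for these entities in noisy text; and 3) how to build a dataset with enough coverage and diversity to span a wide range of emerging and rare entities. Construction also had to account for the handling of personal and sensitive information and for quality control during annotation. When applying the dataset, researchers additionally need to deal with its biases and limitations.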

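Common Scenarios

Typical Use Cases

In NER research and applications, WNUT 17 has become a standard dataset because of its focus on emerging and rare entities. It is typically used to train models to recognize uncommon or novel named entities in noisy text, which is valuable for tasks such as event clustering and summarization.

Derived Work

WNUT 17 has given rise to a range of related work, including research on emerging-entity recognition algorithms, improvements to noisy-text processing techniques, and studies of entity recognition in specific domains. This work has further extended the dataset's influence and range of application.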

Recent Research on the Dataset