leondz/wnut_17|命名实体识别数据集|文本分类数据集
收藏数据集概述
数据集名称
- 名称: WNUT 17
- 别名: wnut_17
数据集描述
- 任务: 识别新兴和罕见实体
- 语言: 英语(en)
- 许可证: CC-BY-4.0
- 数据来源: 原始数据
- 数据类型: 单语种
- 规模: 1K<n<10K
- 任务类别: 词元分类
- 任务ID: 命名实体识别
数据集结构
- 特征:
id
: 字符串类型,示例IDtokens
: 字符串序列,示例文本的词元ner_tags
: 类别标签序列,词元的NER标签,使用IOB2格式
- 分割:
train
: 3394个示例validation
: 1009个示例test
: 1287个示例
数据集创建
- 注释创建者: 众包
- 语言创建者: 发现
数据集使用注意事项
-
引用信息:
@inproceedings{derczynski-etal-2017-results, title = "Results of the {WNUT}2017 Shared Task on Novel and Emerging Entity Recognition", author = "Derczynski, Leon and Nichols, Eric and van Erp, Marieke and Limsopatham, Nut", booktitle = "Proceedings of the 3rd Workshop on Noisy User-generated Text", month = sep, year = "2017", address = "Copenhagen, Denmark", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/W17-4418", doi = "10.18653/v1/W17-4418", pages = "140--147", abstract = "This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet {``}so.. kktny in 30 mins?!{} {--} even human experts find the entity {`}kktny{} hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities. The task as described in this paper evaluated the ability of participating entries to detect and classify novel and emerging named entities in noisy text.", }

Billboard-Hot-100
该数据集包含了自1958年以来所有Billboard Hot 100榜单的历史数据,详细记录了每首歌曲的排名、日期、表演者等信息。
github 收录
Yahoo Finance
Dataset About finance related to stock market
kaggle 收录
LinkedIn Salary Insights Dataset
LinkedIn Salary Insights Dataset 提供了全球范围内的薪资数据,包括不同职位、行业、地理位置和经验水平的薪资信息。该数据集旨在帮助用户了解薪资趋势和市场行情,支持职业规划和薪资谈判。
www.linkedin.com 收录
poi
本项目收集国内POI兴趣点,当前版本数据来自于openstreetmap。
github 收录
Infrared Thermal Image Dataset of High Voltage Electrical Power Equipment under Different Operating Conditions
Recognizing high voltage power equipment in electrical substations is the fundamental platform for effective condition monitoring of electrical power system. It enables proper identification and analysis of anomalies within the equipment, especially when in operation. The result such investigation can be applied for effective real-time measurement, control and protection schemes in the network. The use of visual images for this purpose would be limited during poor lighting conditions. However, Infrared (IR) images of the equipment are invariant to poor illumination condition. Hence, we have acquired the thermographic images of the high voltage power equipment using the portable professional FLIR C5 Infrared camera at different times of the day and load conditions. The dataset contains 5 categories of high voltages equipment common to most air-insulated electrical power substation at 132kV level, namely: circuit breakers, power transformers, surge arresters, disconnectors, and wave traps. The number of IR images for each class of equipment are: circuit breakers 203, power transformers 178, surge arresters 181, disconnectors 180, and wave traps 153. The IR images are 640 x 480 pixel RGB images captured using the rainbow color palette and properly segmented in labeled folders. The color bar in each IR image identifies the thermal range used during its acquisition. The dataset can be used for implementing novel research in computer vision based deep learning models, especially in object recognition, identification, fault classification or detection algorithms. The thermal profile of the equipment in the dataset could be applied for detection of hotspots and other related anomalies.
DataCite Commons 收录