HashSet

Name: HashSet
Creator: International Institute of Information Technology, Hyderabad
Published: 2022-01-18 12:40:45
License: Not specified

Source: arXiv. Updated: 2022-01-18. Indexed: 2024-06-21.

Download link:

https://github.com/prashantkodali/HashSet

Description:

HashSet is a dataset designed for hashtag segmentation, created by the International Institute of Information Technology, Hyderabad. It has two parts: HashSet-Manual, which contains 1,901 manually annotated hashtags labeled for segmentation, named entity recognition (NER), and non-English token detection; and HashSet-Distant, which contains 332,166 hashtags segmented automatically using camel-case cues. The hashtags are sampled from different collections of tweets so that their distribution differs from that of existing datasets, and the dataset is intended for training and validating hashtag segmentation models. HashSet focuses on complex hashtags containing named entities and non-English tokens, which are rare in existing small benchmark datasets, and so supports a more comprehensive evaluation of model performance.
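To illustrate what segmenting a hashtag from camel-case cues can look like, here is a minimal Python sketch. It is not taken from the HashSet repository: the function name `segment_camel_case` and the regular expression are assumptions made for illustration, and the actual procedure used to build HashSet-Distant may differ.

```python
import re
from typing import List

# Hypothetical pattern (an assumption, not from the HashSet repository):
#   [A-Z]+(?![a-z])  a run of capitals not followed by a lowercase letter (e.g. "NBA", "A")
#   [A-Z]?[a-z]+     an optionally capitalized lowercase word (e.g. "Need", "i")
#   [0-9]+           a run of digits (e.g. "2021")
_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def segment_camel_case(hashtag: str) -> List[str]:
    """Split a hashtag into tokens at capitalization and digit boundaries."""
    body = hashtag.lstrip("#")
    # Characters outside ASCII letters and digits (e.g. underscores) are skipped.
    return _PIECE.findall(body)


print(segment_camel_case("#WeNeedANationalPark"))
print(segment_camel_case("#NBAFinals2021"))
print(segment_camel_case("#lowercase"))
```

A hashtag written entirely in lowercase carries no capitalization cue, so this sketch returns it as a single token. Such cases would presumably need a learned segmentation model rather than a rule of this kind.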

Provider:

International Institute of Information Technology, Hyderabad

Created:

2022-01-18