HashSet
Source: arXiv · Updated: 2022-01-18 · Indexed: 2024-06-21
Download link:
https://github.com/prashantkodali/HashSet
Overview:
HashSet is a dataset for hashtag segmentation, created by the International Institute of Information Technology, Hyderabad. It has two parts: HashSet-Manual, with 1,901 manually annotated hashtags covering segmentation, named entity recognition (NER), and non-English token detection; and HashSet-Distant, with 332,166 hashtags segmented automatically using camel-case cues. The hashtags are sampled from several different tweet collections so that their distribution differs from that of existing datasets, and they are intended for training and validating hashtag segmentation models. HashSet places particular emphasis on complex hashtags that contain named entities and non-English tokens. Such hashtags are rare in the small benchmark datasets currently available, so HashSet gives a more complete picture of model performance.
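The camel-case-based automatic segmentation mentioned above can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the function name `split_camel_case_hashtag` and the regular expression are assumptions chosen for illustration, and the real HashSet-Distant construction may apply additional filtering or rules.

```python
import re
from typing import List

# Hypothetical tokenizer (an assumption, not taken from the HashSet repository):
#   - a run of capitals not followed by a lowercase letter (acronyms such as "NBA")
#   - an optional capital followed by lowercase letters ("Finals", "lives")
#   - a run of digits ("2021")
_CAMEL_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_camel_case_hashtag(hashtag: str) -> List[str]:
    """Split a hashtag into tokens using capitalization as the only cue."""
    body = hashtag.lstrip("#")  # drop the leading '#', if present
    return _CAMEL_TOKEN.findall(body)


print(split_camel_case_hashtag("#BlackLivesMatter"))  # ['Black', 'Lives', 'Matter']
print(split_camel_case_hashtag("#NBAFinals2021"))     # ['NBA', 'Finals', '2021']
print(split_camel_case_hashtag("#lowercase"))         # ['lowercase']
```

The last call shows the limit of this kind of distant supervision: a hashtag written entirely in lowercase carries no capitalization cue, so it comes back as a single unsegmented token.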
Provider:
International Institute of Information Technology, Hyderabad
Created:
2022-01-18



