
Hierarchical Text Classification corpora

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7319518
Description:
A set of 3 datasets for Hierarchical Text Classification (HTC), with samples divided into training and testing splits. The label hierarchies in all datasets have depth 2.

The Amazon5x5 dataset contains 500,000 user reviews tagged with the reviewed product's categories. There are 5 product categories with 100,000 examples each, and each category has 5 sub-categories. The Bugs dataset contains 30,050 bugs of the Linux kernel, each labeled with exactly two categories identifying the affected component. Finally, the Web Of Science dataset contains 46,960 abstracts of scientific papers, each labeled with the article's domain (see original repo for more details).

Datasets are published in JSONL format, where each line is a JSON-formatted string, as in the example below.

{ "text": , "labels": [, , ...] }

The hierarchical structure of labels in each dataset is documented in this repository.

These datasets have been presented in this paper:
"Hierarchical Text Classification and its Foundations: a Review of Current Research" - DOI: 10.3390/electronics13071199

Some of these datasets have also been used in:
"Ticket Automation: an Insight into Current Research with Applications to Multi-level Classification Scenarios" - DOI: 10.1016/j.eswa.2023.119984
"A multi-level approach for hierarchical Ticket Classification", accepted at WNUT 2022 - link

These datasets are partially derived from previous work, namely:
[Amazon] J. Ni, J. Li, J. McAuley, "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects", EMNLP 2019, doi: 10.18653/v1/D19-1018
[WOS] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 364-371, doi: 10.1109/ICMLA.2017.0-134
[Linux Bugs] V. Lyubinets, T. Boiko and D. Nicholas, "Automated Labeling of Bugs and Tickets Using Attention-Based Mechanisms in Recurrent Neural Networks," 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 271-275, doi: 10.1109/DSMP.2018.8478511
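
Since the description states that every line of a JSONL file is a JSON object with a "text" field and a "labels" list, reading a split can be sketched as follows. This is a minimal illustration, not the authors' loading code: the helper names `read_jsonl` and `count_label_sets` are hypothetical, and the sample lines are made up to match the documented shape rather than taken from the datasets.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def count_label_sets(records: Iterable[dict]) -> Counter:
    """Count how often each full label list occurs, keyed by tuple."""
    return Counter(tuple(rec["labels"]) for rec in records)


# Made-up sample lines in the documented shape (assumption: not real data).
sample = [
    '{"text": "first example", "labels": ["parent_a", "child_a1"]}',
    '{"text": "second example", "labels": ["parent_a", "child_a2"]}',
    '',
    '{"text": "third example", "labels": ["parent_a", "child_a1"]}',
]

records = list(read_jsonl(sample))
label_counts = count_label_sets(records)
print(len(records), label_counts.most_common(1))
```

To read an actual split, an open file object can be passed in place of `sample`, for example `with open(path, encoding="utf-8") as f: records = list(read_jsonl(f))`, where `path` points to one of the downloaded files. The order of entries inside "labels" is not specified in the description, so the sketch treats each label list as a whole rather than assuming which entry is the parent category.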
Created:
2024-03-26