A collection of nine multi-label text classification datasets

Name: A collection of nine multi-label text classification datasets
Creator: IEEE Dataport
License: No description available

Indexed from ieee-dataport.org on 2025-01-22

Download link:

https://ieee-dataport.org/documents/collection-nine-multi-label-text-classification-datasets

Resource description:

This compressed package contains nine multi-label text classification datasets: AAPD, CitySearch, Heritage, Laptop, Ohsumed, RCV1, Restaurant, Reuters, and Sentihood. The CitySearch, Heritage, Laptop, Restaurant, and Sentihood datasets are from the paper “Bert-flow-vae: A weakly-supervised model for multi-label text classification” (url: https://aclanthology.org/2022.coling-1.104/). The original Reuters and Ohsumed datasets are from http://disi.unitn.it/moschitti/corpora.htm. The original AAPD dataset is from https://github.com/lancopku/SGM. The original RCV1 dataset is from http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm. We use the raw text format for all of these datasets. For Reuters, we retain the 10 largest classes. For Reuters and Ohsumed, the category words are taken directly from the descriptive words and seed words defined in [1] and [2]. For the other datasets, we generate the category words following the protocol of our proposed category word selection method, CWS-SRC.

[1] X. Chen, Y. Xia, P. Jin, and J. Carroll, “Dataless text classification with descriptive lda,” in AAAI, 2015, pp. 2224–2231.
[2] D. Zha and C. Li, “Multi-label dataless text classification with topic modeling,” KAIS, vol. 61, no. 1, pp. 137–160, 2019.
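The description says that only the 10 largest classes are retained for Reuters but does not spell out the procedure. The following is a minimal Python sketch of one plausible reading, not the authors' actual preprocessing. It assumes that "largest" means the most documents per label, that ties are broken alphabetically, and that documents left with no label are dropped. The function name `keep_largest_classes` and the toy data are hypothetical.

```python
from collections import Counter


def keep_largest_classes(samples, k=10):
    """Keep only the k most frequent labels (assumed meaning of 'largest').

    samples: list of (text, labels) pairs, where labels is a list of strings.
    Returns the filtered samples and the sorted list of retained labels.
    """
    counts = Counter(label for _, labels in samples for label in labels)
    # Sort by descending frequency, then by name, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = {label for label, _ in ranked[:k]}

    filtered = []
    for text, labels in samples:
        kept = [label for label in labels if label in top]
        # Assumption: a document with no remaining label is discarded.
        if kept:
            filtered.append((text, kept))
    return filtered, sorted(top)


# Toy data invented for illustration; not taken from the package.
toy = [
    ("doc a", ["earn", "acq"]),
    ("doc b", ["earn"]),
    ("doc c", ["grain", "wheat"]),
    ("doc d", ["acq"]),
    ("doc e", ["wheat"]),
]
filtered, classes = keep_largest_classes(toy, k=2)
```

With `k=10` and the real Reuters label lists, the same function would yield a ten-class subset under the assumptions stated above.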

Provider:

IEEE Dataport
