RCV1 subset dataset: a corpus that has been used in author identification experiments
Payititi (帕依提提), indexed 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-26237.html
Resource description:
Dataset creator and donator: Zhi Liu, e-mail: liuzhi8673 '@' gmail.com, institution: National Engineering Research Center for E-Learning, Wuhan, Hubei, China

Data Set Information: The dataset is a subset of RCV1. This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) of texts labeled with at least one subtopic of the class CCAT (corporate/industrial) were selected. Restricting the selection to a single class is an attempt to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author) and the test corpus includes another 2,500 texts (50 per author) that do not overlap with the training texts.

Attribute Information: Attributes of the dataset are character n-grams (n = 1 to 5).

Relevant Papers:
J. Houvardas, E. Stamatatos, "N-gram Feature Selection for Authorship Identification," in Proc. of the 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications, vol. 4183, pp. 77-86, (2006) September 12-15; Varna, Bulgaria.
E. Stamatatos, "Author Identification Using Imbalanced and Limited Training Texts," in Proc. of the 4th International Workshop on Text-based Information Retrieval, (2007) September 3-7; Regensburg, Germany.

Citation Request: Please refer to the donator Zhi Liu from National Engineering Research Center For E-Learning Technology, China.
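
The attribute description above names character n-grams with n from 1 to 5 but does not say how they were extracted or weighted. As a rough illustration only, the following Python sketch counts every contiguous character n-gram in a string. The function name `char_ngrams`, its parameters, and the use of raw counts are assumptions made for this example; they are not taken from the dataset or the cited papers.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every contiguous character n-gram with n_min <= n <= n_max.

    Hypothetical helper for illustration; the dataset's actual
    preprocessing (casing, whitespace handling, weighting) is not documented.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Example: build a feature dictionary for one short text.
features = char_ngrams("stock prices rose")
```

A text of length k yields k - n + 1 n-grams for each n not exceeding k, so feature vectors grow quickly; the first paper listed above concerns selecting among such n-gram features.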
Provided by:
Payititi (帕依提提)



