arXiv abstracts and titles from 1,469 single-authored papers (100 unique authors) in computer science

Indexed via NIAID Data Ecosystem on 2026-03-14

Download link:

https://zenodo.org/record/7404701

Description:

This dataset is intended for authorship analysis experiments. It consists of abstracts of single-author papers from arXiv, crawled through the arXiv API by querying a list of computer-science-related keywords ("deep learning", "machine learning", "information retrieval", "computer science", "data mining", "support vector", "logistic regression", "artificial intelligence", "supervised learning"). The corpus roughly follows a power-law distribution, with a few prolific authors and many authors accounting for very few papers each. We retained authors with at least 10 papers, resulting in a total of 1,469 documents from 100 authors. The most prolific authors (Peter D. Turney and Subhash Kak) have 34 abstracts to their names, the 10 most prolific authors have written 22 or more articles, and 50% of the authors have no more than 12 abstracts to their names. To divide the corpus into a training set and a test set, we perform a stratified split: the production of each author is split into a training portion (70%) and a test portion (30%). We use these documents as examples of "scientific communication", characterised by a precise and compact style with an abundance of technical terminology.
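The filtering and per-author split described above can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the function name `filter_and_split`, the `(author, text)` input layout, the random seed, and the rounding rule for the 70% cut are all assumptions, since the description does not specify them.

```python
import random
from collections import defaultdict


def filter_and_split(docs, min_papers=10, train_frac=0.7, seed=0):
    """Keep authors with at least `min_papers` documents, then split each
    author's documents into train/test portions.

    `docs` is an iterable of (author, text) pairs (hypothetical layout).
    """
    by_author = defaultdict(list)
    for author, text in docs:
        by_author[author].append(text)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for author in sorted(by_author):
        texts = list(by_author[author])
        if len(texts) < min_papers:
            continue  # drop authors below the threshold
        rng.shuffle(texts)
        # Assumption: the training share is rounded to the nearest integer.
        n_train = round(train_frac * len(texts))
        train.extend((author, t) for t in texts[:n_train])
        test.extend((author, t) for t in texts[n_train:])
    return train, test


if __name__ == "__main__":
    # Toy data only; not drawn from the actual corpus.
    toy = [("author_a", f"abstract {i}") for i in range(10)]
    toy += [("author_b", f"abstract {i}") for i in range(3)]
    train, test = filter_and_split(toy)
    print(len(train), len(test))
```

Splitting within each author, rather than over the pooled documents, keeps every retained author represented in both portions, which is what an authorship attribution experiment needs.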

Created:

2022-12-06