
arXiv abstracts and titles from 1,469 single-authored papers (100 unique authors) in computer science

Indexed by NIAID Data Ecosystem on 2026-03-14
Download link:
https://zenodo.org/record/7404701
Description:
This dataset is intended for authorship analysis experiments. It consists of abstracts of single-author papers from arXiv, crawled through the arXiv API by querying a list of computer-science-related keywords ("deep learning", "machine learning", "information retrieval", "computer science", "data mining", "support vector", "logistic regression", "artificial intelligence", "supervised learning"). The corpus roughly follows a power-law distribution, with a few prolific authors and many authors accounting for very few papers each; we retained authors with at least 10 papers, for a total of 1,469 documents from 100 authors. The most prolific authors (Peter D. Turney and Subhash Kak) have 34 abstracts to their names, the 10 most prolific authors have written 22 or more articles, and 50% of the authors have no more than 12 abstracts to their names. To divide the corpus into a training set and a test set, we perform a stratified split: the production of each author is split into a training portion (70%) and a test portion (30%). We use these documents as examples of "scientific communication", characterised by a precise and compact style with an abundance of technical terminology.
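The author filter and the per-author 70/30 split described above can be sketched as follows. This is a minimal illustration, not the dataset authors' actual code: the function name `filter_and_split`, the `(author, text)` input format, the fixed random seed, and the use of rounding to size each author's training portion are all assumptions.

```python
import random
from collections import defaultdict


def filter_and_split(docs, min_papers=10, train_frac=0.7, seed=0):
    """Hypothetical helper: keep authors with >= min_papers documents,
    then split each author's documents into train/test portions.

    docs is a list of (author, text) pairs; returns (train, test),
    each a list of (author, text) pairs.
    """
    by_author = defaultdict(list)
    for author, text in docs:
        by_author[author].append(text)

    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    train, test = [], []
    for author in sorted(by_author):
        texts = list(by_author[author])
        if len(texts) < min_papers:
            continue  # drop authors below the retention threshold
        rng.shuffle(texts)
        # Rounding rule is an assumption; the source only states 70%/30%.
        n_train = round(train_frac * len(texts))
        train.extend((author, t) for t in texts[:n_train])
        test.extend((author, t) for t in texts[n_train:])
    return train, test


# Toy usage with made-up authors and placeholder texts.
toy = (
    [("author_a", f"abstract a{i}") for i in range(10)]
    + [("author_b", f"abstract b{i}") for i in range(20)]
    + [("author_c", f"abstract c{i}") for i in range(3)]
)
toy_train, toy_test = filter_and_split(toy)
```

Splitting within each author, rather than over the pooled corpus, guarantees that every retained author appears in both the training and the test portion, which is what an authorship attribution experiment requires.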
Created:
2022-12-06