10,000+ Artificial Intelligence Papers: arXiv Titles and Abstracts
Source: Mendeley Data. Indexed: 2026-05-21
Download link:
https://data.mendeley.com/datasets/jhtrncpcnm
Description:
This dataset contains the titles and abstracts of over 17,000 research papers focused on Artificial Intelligence, retrieved from the arXiv pre-print repository. It provides a substantial textual corpus of recent academic literature, designed to support researchers, data engineers, and developers working on text mining and advanced machine learning tasks.
To facilitate immediate use, the dataset is provided in two formats: a raw version preserving the original formatting of the abstracts, and a thoroughly cleaned version processed specifically for textual analysis.
Data Collection Methodology
The data was collected in May 2026 by querying the arXiv advanced search portal for papers containing the term "Artificial Intelligence" in the abstract, restricted to the Computer Science classification. Extraction was performed with an automated R web-scraping pipeline built on the rvest package.
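For readers who want to assemble a comparable corpus, the query above can be approximated programmatically. The sketch below is not the author's rvest pipeline: it takes a different route and builds a request URL for the public arXiv API, using its `abs:` field prefix to search abstracts. The function name `build_query_url` and the page size are illustrative assumptions, and the Computer Science filter described above is not reproduced here.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint (a different access path from the advanced
# search portal that the dataset author scraped).
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(term: str, start: int = 0, max_results: int = 100) -> str:
    """Build an arXiv API URL that searches abstracts for an exact phrase.

    Hypothetical helper for illustration; the Computer Science filter
    mentioned in the dataset description is NOT applied here.
    """
    params = {
        "search_query": f'abs:"{term}"',  # abs: restricts the match to abstracts
        "start": start,                   # zero-based offset for paging
        "max_results": max_results,       # page size per request
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("Artificial Intelligence"))
```

Paging through results means repeating the request with an increasing `start` value, ideally with a pause between calls to stay within arXiv's usage guidelines.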
Data Files
arxiv_titles_abstracts_raw.csv: Contains the unmodified, original text scraped directly from the arXiv search results. This file is ideal for applications requiring raw, unformatted academic prose.
arxiv_titles_abstracts_clean.csv: A pre-processed version of the dataset optimized for Natural Language Processing. The text pipeline applied to this file includes lowercasing, the removal of punctuation, digits, and non-ASCII characters, automated spell-correction using hunspell, and word lemmatization.
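The simpler cleaning steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's R pipeline: the order of the steps is assumed, the function name `clean_text` is hypothetical, and the hunspell spell-correction and lemmatization stages are omitted.

```python
import re
import string


def clean_text(text: str) -> str:
    """Lowercase, then drop non-ASCII characters, punctuation, and digits.

    Illustrative only: step order is an assumption, and spell-correction
    and lemmatization are not implemented here.
    """
    text = text.lower()
    # Drop any character outside the ASCII range.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace punctuation and digits with spaces so adjacent words do not fuse.
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_text("Deep-Learning in 2026: naïve Bayes!"))
```

Note that dropping non-ASCII characters removes accented letters outright rather than transliterating them, so a word such as "naïve" loses a letter. Applications that depend on exact surface forms may be better served by the raw file.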
Potential Use Cases:
This corpus suits a range of NLP and data science applications, including:
Building and evaluating domain-specific Retrieval-Augmented Generation (RAG) pipelines.
Fine-tuning Large Language Models (LLMs) on academic and technical prose.
Training models for Named Entity Recognition (NER) to extract specific algorithms, hardware, or methodologies.
Performing unsupervised clustering (such as K-Means, Hierarchical, or DBSCAN) and topic modeling to track emerging trends in AI research.
Conducting bibliometric analysis on the evolution of AI terminology.
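As one concrete illustration of the use cases above, the retrieval step of a RAG pipeline can be prototyped with TF-IDF vectors and cosine similarity using only the standard library. This is a minimal sketch: the function names are hypothetical, the smoothed IDF formula is one common choice among several, and the three sample strings are placeholders written in the style of the cleaned file, not rows from the dataset.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) and the IDF table for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {tok: math.log((1 + n) / (1 + cnt)) + 1.0 for tok, cnt in df.items()}
    vectors = [
        {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, docs, k=2):
    """Return indices of the k documents most similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query tokens never seen in the corpus are ignored.
    q_vec = {tok: c * idf[tok] for tok, c in Counter(query.split()).items() if tok in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    # Highest score first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    # Placeholder texts, NOT taken from the dataset.
    sample_docs = [
        "graph neural network for molecule property prediction",
        "reinforcement learning agent for robot navigation",
        "large language model retrieval augment generation",
    ]
    print(retrieve("robot reinforcement learning", sample_docs, k=1))
```

In a fuller pipeline the retrieved abstracts would be passed to a language model as context, and the TF-IDF scorer would typically be replaced by dense embeddings. The same vectors could also feed a clustering algorithm such as K-Means.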
Keywords:
Natural Language Processing; Text Mining; Artificial Intelligence; Machine Learning; Text Corpus; Bibliometrics; Web Scraping; arXiv
Created:
2026-05-14



