10,000+ Artificial Intelligence Papers: arXiv Titles and Abstracts

Source: Mendeley Data (indexed 2026-05-21)

Download link:

https://data.mendeley.com/datasets/jhtrncpcnm


Resource description:

This dataset contains the titles and abstracts of over 17,000 research papers focused on Artificial Intelligence, retrieved from the arXiv pre-print repository. It provides a substantial textual corpus of recent academic literature, designed to support researchers, data engineers, and developers working on text mining and advanced machine learning tasks. To facilitate immediate use, the dataset is provided in two formats: a raw version preserving the original formatting of the abstracts, and a thoroughly cleaned version processed specifically for textual analysis.

Data Collection Methodology

The data was collected in May 2026 by querying the arXiv advanced search portal for papers containing the term "Artificial Intelligence" within the abstract, filtered under the Computer Science classification. The extraction was performed using an automated R-based web scraping pipeline utilizing the rvest package.

Data Files

- arxiv_titles_abstracts_raw.csv: Contains the unmodified, original text scraped directly from the arXiv search results. This file is ideal for applications requiring raw, unformatted academic prose.
- arxiv_titles_abstracts_clean.csv: A pre-processed version of the dataset optimized for Natural Language Processing. The text pipeline applied to this file includes lowercasing, the removal of punctuation, digits, and non-ASCII characters, automated spell-correction using hunspell, and word lemmatization.

Potential Use Cases

This corpus is highly versatile for a variety of NLP and data science applications, including:

- Building and evaluating domain-specific Retrieval-Augmented Generation (RAG) pipelines.
- Fine-tuning Large Language Models (LLMs) on academic and technical prose.
- Training models for Named Entity Recognition (NER) to extract specific algorithms, hardware, or methodologies.
- Performing unsupervised clustering (such as K-Means, Hierarchical, or DBSCAN) and topic modeling to track emerging trends in AI research.
- Conducting bibliometric analysis on the evolution of AI terminology.

Keywords: Natural Language Processing; Text Mining; Artificial Intelligence; Machine Learning; Text Corpus; Bibliometrics; Web Scraping; arXiv
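
The first cleaning steps described for arxiv_titles_abstracts_clean.csv (lowercasing and the removal of punctuation, digits, and non-ASCII characters) can be approximated on the raw text with a short Python sketch. This is not the dataset author's pipeline, which was written in R: the function name basic_clean is hypothetical, the choice to replace punctuation with spaces rather than delete it is an assumption, and the hunspell spell-correction and lemmatization steps are deliberately left out because they need external dictionaries.

```python
import re
import string

# Map every ASCII punctuation character to a space (assumption: the original
# pipeline may delete punctuation instead, which would fuse hyphenated words).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def basic_clean(text: str) -> str:
    """Hypothetical helper: lowercase, then strip non-ASCII, punctuation, digits."""
    text = text.lower()
    # Drop any character that cannot be encoded as ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace punctuation with spaces.
    text = text.translate(_PUNCT_TO_SPACE)
    # Remove runs of digits.
    text = re.sub(r"\d+", " ", text)
    # Collapse the leftover whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "Deep Learning in 2026: A Survey of AI-driven Methods!"
    print(basic_clean(sample))
```

Applying a function like this to a column of the raw file gives only a rough counterpart of the clean file; results will differ wherever the original spell-correction or lemmatization changed a word.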

Created:

2026-05-14
