10,000+ Artificial Intelligence Papers: arXiv Titles and Abstracts

Source: Mendeley Data (indexed 2026-05-21)
Download link:
https://data.mendeley.com/datasets/jhtrncpcnm
Description:
This dataset contains the titles and abstracts of over 17,000 research papers focused on Artificial Intelligence, retrieved from the arXiv pre-print repository. It provides a substantial textual corpus of recent academic literature, designed to support researchers, data engineers, and developers working on text mining and advanced machine learning tasks. To facilitate immediate use, the dataset is provided in two formats: a raw version preserving the original formatting of the abstracts, and a thoroughly cleaned version processed specifically for textual analysis.

Data Collection Methodology

The data was collected in May 2026 by querying the arXiv advanced search portal for papers containing the term "Artificial Intelligence" within the abstract, filtered under the Computer Science classification. The extraction was performed using an automated R-based web scraping pipeline utilizing the rvest package.

Data Files

- arxiv_titles_abstracts_raw.csv: Contains the unmodified, original text scraped directly from the arXiv search results. This file is ideal for applications requiring raw, unformatted academic prose.
- arxiv_titles_abstracts_clean.csv: A pre-processed version of the dataset optimized for Natural Language Processing. The text pipeline applied to this file includes lowercasing, the removal of punctuation, digits, and non-ASCII characters, automated spell-correction using hunspell, and word lemmatization.

Potential Use Cases

This corpus is highly versatile for a variety of NLP and data science applications, including:

- Building and evaluating domain-specific Retrieval-Augmented Generation (RAG) pipelines.
- Fine-tuning Large Language Models (LLMs) on academic and technical prose.
- Training models for Named Entity Recognition (NER) to extract specific algorithms, hardware, or methodologies.
- Performing unsupervised clustering (such as K-Means, Hierarchical, or DBSCAN) and topic modeling to track emerging trends in AI research.
- Conducting bibliometric analysis on the evolution of AI terminology.

Keywords: Natural Language Processing; Text Mining; Artificial Intelligence; Machine Learning; Text Corpus; Bibliometrics; Web Scraping; arXiv
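
The first cleaning steps described for arxiv_titles_abstracts_clean.csv (lowercasing, then removing punctuation, digits, and non-ASCII characters) can be roughly approximated as shown below. This is a minimal Python sketch for illustration only, not the dataset author's pipeline, which was written in R. The function name `basic_clean` is hypothetical, and replacing removed characters with a space (rather than deleting them outright) is an assumption. Spell-correction and lemmatization are deliberately left out because they depend on external dictionaries and tools.

```python
import re
import string

# Character class matching any ASCII punctuation mark or digit.
_PUNCT_OR_DIGIT = re.compile(rf"[{re.escape(string.punctuation)}0-9]")


def basic_clean(text: str) -> str:
    """Lowercase text and strip non-ASCII characters, punctuation, and digits.

    Hypothetical helper: approximates only part of the described pipeline.
    """
    text = text.lower()
    # Drop every character that cannot be encoded as ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Assumption: substitute a space so neighbouring words do not merge.
    text = _PUNCT_OR_DIGIT.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "GPT-4 achieves 92% accuracy on naïve benchmarks."
    print(basic_clean(sample))
    # prints: gpt achieves accuracy on nave benchmarks
```

Note that dropping non-ASCII characters alters words such as "naïve" (which becomes "nave"). Tasks that depend on exact terminology may therefore be better served by the raw file.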
Created:
2026-05-14