CCRss/arxiv_papers_cs
Source: Hugging Face. Updated 2024-04-08; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/CCRss/arxiv_papers_cs
Resource description:
---
license: mit
language:
- en
size_categories:
- 100K<n<1M
---
# CS Research Dataset
## Description
This dataset is a collection of computer science paper abstracts sourced from arXiv. It was created to support natural language processing research, specifically tasks such as topic modeling, trend analysis, and keyword extraction.
## Dataset Structure
Each record contains the following fields:
- **title**: Title of the research paper.
- **id**: Unique identifier for each abstract.
- **abstract**: Abstract of the research paper.
- **categories**: Categories associated with the paper, primarily within the field of computer science.
- **doi**: Digital Object Identifier for the paper.
- **created**: Date when the paper was submitted to arXiv.
- **updated**: Date when the paper was last updated.
- **authors**: List of authors of the paper.
- **url**: URL to the original paper on arXiv.
- **abstract_length**: Length of the abstract in characters.
- **id_n**: Sequential number assigned to each abstract, starting from 0.
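
The field list above can be mirrored as a typed record, which may help when validating rows after loading. This is only a sketch: the Python types (for example `authors` as a list of strings, `categories` as a single string, `doi` possibly missing) are assumptions rather than something the dataset card confirms, and `PaperRecord` and `check_record` are hypothetical names.

```python
from typing import List, Optional, TypedDict


class PaperRecord(TypedDict):
    """One dataset row. Field types are assumptions, not confirmed by the card."""
    title: str
    id: str
    abstract: str
    categories: str
    doi: Optional[str]
    created: str
    updated: Optional[str]
    authors: List[str]
    url: str
    abstract_length: int
    id_n: int


def check_record(record: PaperRecord) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # abstract_length is documented as the abstract's length in characters.
    if record["abstract_length"] != len(record["abstract"]):
        problems.append("abstract_length does not match len(abstract)")
    # id_n is documented as a sequential number starting from 0.
    if record["id_n"] < 0:
        problems.append("id_n is negative")
    return problems


abstract_text = "A short example abstract."
sample: PaperRecord = {
    "title": "An Example Paper",
    "id": "0000.00000",  # placeholder, not a real arXiv identifier
    "abstract": abstract_text,
    "categories": "cs.CL",
    "doi": None,
    "created": "2020-01-01",
    "updated": None,
    "authors": ["A. Author"],
    "url": "https://arxiv.org/abs/0000.00000",
    "abstract_length": len(abstract_text),
    "id_n": 0,
}
print(check_record(sample))
```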
## Usage
Typical uses include topic modeling, trend analysis, and keyword extraction over computer science literature, along with other natural language processing tasks. The dataset is aimed at researchers and practitioners who want to follow developments in computer science.
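
As a rough illustration of the keyword extraction use case, the sketch below counts frequent terms across a few abstracts using only the standard library. The stopword list, the `top_keywords` helper, and the sample abstracts are all made up for illustration; a real pipeline would likely use a fuller stopword list or a method such as TF-IDF.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; not taken from the dataset card.
STOPWORDS = {
    "a", "an", "and", "the", "of", "for", "in", "on", "to",
    "we", "is", "are", "with", "this", "that",
}


def top_keywords(abstracts, k=5):
    """Return the k most frequent non-stopword tokens across the abstracts."""
    counts = Counter()
    for text in abstracts:
        # Lowercase and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(k)]


# Invented sample abstracts standing in for the dataset's `abstract` field.
sample_abstracts = [
    "We propose a graph neural network for traffic forecasting.",
    "A neural network approach to graph clustering is presented.",
    "This survey reviews neural machine translation.",
]
print(top_keywords(sample_abstracts, k=3))
```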
## How to Load the Dataset
You can load the dataset using the **datasets** library in Python:
```python
from datasets import load_dataset
dataset = load_dataset("CCRss/arxiv_papers_cs")
```
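
Once loaded, rows can be narrowed to a subject area via the `categories` field. The sketch below assumes that `categories` is a whitespace-separated string (as in arXiv metadata) and that the dataset exposes a split named `train`; neither is confirmed by the dataset card. `has_category` and `load_and_filter` are hypothetical helpers, and the second one needs network access and the `datasets` package.

```python
def has_category(record, prefix="cs.CL"):
    """True if any whitespace-separated category starts with the prefix.

    Matching is case-insensitive because the stored casing is not documented.
    """
    wanted = prefix.lower()
    return any(c.lower().startswith(wanted) for c in record["categories"].split())


def load_and_filter(prefix="cs.CL", split="train"):
    """Download the dataset and keep rows in the given category (needs network)."""
    # Imported lazily so has_category can be used without the datasets package.
    from datasets import load_dataset

    dataset = load_dataset("CCRss/arxiv_papers_cs", split=split)
    return dataset.filter(lambda row: has_category(row, prefix))


# Offline demonstration on an invented record.
example_row = {"categories": "cs.CL cs.LG"}
print(has_category(example_row, "cs.CL"))
```

Calling `load_and_filter("cs.CL")` would return the filtered dataset, provided the assumptions about the split name and field format hold.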
## Dataset Creation
The dataset was created using the **arxivscraper** library in Python to scrape abstracts from the arXiv website. Here is an example of how the data was collected:
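```python
import arxivscraper

# Scrape computer science ('cs') records for 2020.
# arxivscraper expects dates as 'YYYY-MM-DD' strings.
scraper = arxivscraper.Scraper(category='cs', date_from='2020-01-01', date_until='2020-12-31')
output = scraper.scrape()  # one dict per paper
```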
A more detailed, runnable example is available in this **[Google Colab notebook](https://colab.research.google.com/drive/1tVM6undtOJn5-KOU-74QaK3MekQXoa-G?usp=sharing)**.
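
The `abstract_length` and `id_n` fields described under Dataset Structure look like values derived after scraping. The sketch below shows one way they could be computed from scraped records; it is an assumption about the process, not the authors' confirmed pipeline. It assumes each scraped record is a dict with an `abstract` key, and `add_derived_fields` is a hypothetical helper.

```python
def add_derived_fields(records):
    """Return copies of the records with abstract_length and id_n added."""
    enriched = []
    for index, record in enumerate(records):
        row = dict(record)  # copy so the input records are not modified
        row["abstract_length"] = len(row["abstract"])  # length in characters
        row["id_n"] = index  # sequential number starting from 0
        enriched.append(row)
    return enriched


# Invented records standing in for scraper output.
scraped = [
    {"id": "0000.00001", "abstract": "First example abstract."},
    {"id": "0000.00002", "abstract": "Second one."},
]
for row in add_derived_fields(scraped):
    print(row["id_n"], row["abstract_length"])
```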
## Application
This dataset was used to train a topic modeling algorithm for analyzing trends in research on unmanned aerial vehicles (UAVs). The trained model is available in our model repository on Hugging Face.
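
For a quick look at UAV-related trends without a trained model, a simple keyword count per year can serve as a baseline. The sketch below is not the topic model mentioned above; it is a plain regular-expression filter. It assumes `created` is a string beginning with a four-digit year, and the pattern, the `uav_papers_per_year` helper, and the sample records are invented for illustration.

```python
import re
from collections import Counter

# Illustrative pattern; a real study would likely use a broader vocabulary.
UAV_PATTERN = re.compile(
    r"\b(uavs?|drones?|unmanned aerial vehicles?)\b", re.IGNORECASE
)


def uav_papers_per_year(records):
    """Count records whose abstract mentions a UAV term, grouped by year."""
    counts = Counter()
    for record in records:
        if UAV_PATTERN.search(record["abstract"]):
            year = record["created"][:4]  # assumes 'YYYY-...' date strings
            counts[year] += 1
    return dict(sorted(counts.items()))


# Invented sample records.
sample_records = [
    {"abstract": "Path planning for a UAV swarm.", "created": "2020-03-01"},
    {"abstract": "Detecting drones with radar.", "created": "2020-07-15"},
    {"abstract": "Graph clustering at scale.", "created": "2021-01-10"},
    {"abstract": "Unmanned aerial vehicle networks.", "created": "2021-05-02"},
]
print(uav_papers_per_year(sample_records))
```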
## Acknowledgments
We would like to acknowledge Mahdi Sadjadi for creating the **arxivscraper** library, which was instrumental in collecting data for this dataset. The library is available on Zenodo: [arxivscraper (2017)](http://doi.org/10.5281/zenodo.889853).
## License
This dataset is provided under the MIT License.
Provider:
CCRss



