ashish-chouhan/arxiv_cs_papers

Name: ashish-chouhan/arxiv_cs_papers
Creator: ashish-chouhan
Published: 2023-10-24 13:31:08
License: no description available

Hugging Face, updated 2023-10-24, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ashish-chouhan/arxiv_cs_papers

Resource overview:

---
dataset_info:
  features:
  - name: title
    dtype: string
  - name: abstract
    dtype: string
  - name: authors
    sequence: string
  - name: published
    dtype: string
  - name: url
    dtype: string
  - name: pdf_url
    dtype: string
  - name: arxiv_id
    dtype: string
  splits:
  - name: train
    num_bytes: 7726383
    num_examples: 5000
  download_size: 4366827
  dataset_size: 7726383
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for "arxiv_cs_papers"

This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained with requests to the ArXiv API.

The ArXiv dataset contains features:

<ul>
<li> title </li>
<li> abstract </li>
<li> authors </li>
<li> published </li>
<li> url </li>
<li> pdf_url </li>
<li> arxiv_id </li>
</ul>
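The card metadata describes a single `train` split in the `default` configuration. A minimal sketch of loading it, assuming the Hugging Face `datasets` library is installed and the repository is still reachable under the ID shown on this page; the helper name `load_arxiv_cs_papers` is hypothetical and not part of the card:

```python
# Sketch only: assumes `pip install datasets` and network access to the Hub.
REPO_ID = "ashish-chouhan/arxiv_cs_papers"  # repository ID as shown on this page

# Column names taken from the `features` list in the card metadata.
EXPECTED_COLUMNS = [
    "title", "abstract", "authors", "published", "url", "pdf_url", "arxiv_id",
]


def load_arxiv_cs_papers():
    """Load the `train` split and check its columns (hypothetical helper)."""
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, split="train")
    missing = [c for c in EXPECTED_COLUMNS if c not in ds.column_names]
    if missing:
        raise ValueError(f"columns missing from the loaded split: {missing}")
    return ds
```

Calling `load_arxiv_cs_papers()` should return a `Dataset` object; going by the metadata, the split is expected to hold 5000 rows.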

Provided by:

ashish-chouhan

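Summary of original information

Dataset card "arxiv_cs_papers"

Dataset description

This dataset contains the subset of ArXiv papers carrying the "cs.LG" tag, which indicates that a paper is about machine learning.

Dataset source

The core dataset is filtered from the full ArXiv dataset hosted on Kaggle (https://www.kaggle.com/datasets/Cornell-University/arxiv). The original dataset contains roughly 2 million papers; after category filtering, this dataset contains roughly 100,000 papers.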
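The card does not say how the category filter was implemented. As a rough sketch, assuming the Kaggle metadata is a JSON Lines file whose `categories` field holds space-separated tags (an assumption about the upstream format), the filtering step could look like this; `filter_by_category` is a hypothetical name and the sample records are made up:

```python
import io
import json
from typing import Iterable, Iterator


def filter_by_category(lines: Iterable[str], tag: str = "cs.LG") -> Iterator[dict]:
    """Yield records whose space-separated `categories` field contains `tag`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: categories look like "cs.LG stat.ML".
        if tag in record.get("categories", "").split():
            yield record


# Tiny made-up sample standing in for the real metadata file.
sample = io.StringIO(
    '{"id": "0001", "categories": "cs.LG stat.ML"}\n'
    '{"id": "0002", "categories": "hep-ph"}\n'
    '{"id": "0003", "categories": "cs.CV cs.LG"}\n'
)
kept = [record["id"] for record in filter_by_category(sample)]
print(kept)
```

Splitting on whitespace before testing membership avoids matching a tag that merely appears as a substring of another one.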

Dataset maintenance

The dataset is maintained through requests to the ArXiv API.
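The card gives no detail on those requests. A hedged sketch of what a maintenance query might look like, assuming the public query endpoint at `export.arxiv.org` and its `search_query`, `start`, `max_results`, `sortBy` and `sortOrder` parameters; the function name and default values are illustrative only, and the code builds the URL without sending it:

```python
from urllib.parse import urlencode

# Assumption: the public arXiv query endpoint; the card does not name one.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category: str = "cs.LG", start: int = 0, max_results: int = 100) -> str:
    """Build a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(max_results=5)
print(url)
```

The response to such a request is an Atom feed, so a real updater would still need to parse entries into the seven fields listed in the card.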

Dataset features

The dataset contains the following features:

title: string
abstract: string
authors: sequence of strings
published: string
url: string
pdf_url: string
arxiv_id: string
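The feature list above can be turned into a light record check. This is only a sketch: the function name `schema_errors` is hypothetical, and the example record uses placeholder values rather than a real row from the dataset:

```python
from typing import Any, Dict, List

# Field names and types as listed above.
STRING_FIELDS = ["title", "abstract", "published", "url", "pdf_url", "arxiv_id"]
SEQUENCE_FIELDS = ["authors"]  # sequence of strings


def schema_errors(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    errors = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            errors.append(f"{name}: expected str")
    for name in SEQUENCE_FIELDS:
        value = record.get(name)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            errors.append(f"{name}: expected list of str")
    return errors


# Placeholder record, not taken from the dataset.
example = {
    "title": "An Example Title",
    "abstract": "An example abstract.",
    "authors": ["First Author", "Second Author"],
    "published": "2023-01-01",
    "url": "https://arxiv.org/abs/0000.00000",
    "pdf_url": "https://arxiv.org/pdf/0000.00000",
    "arxiv_id": "0000.00000",
}
print(schema_errors(example))
```

Note that `published` is stored as a string, so any date handling would need an explicit parsing step; the card does not state the date format.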

Dataset splits

Train split (train): 5000 examples, 7726383 bytes

Dataset size

Download size: 4366827 bytes
Dataset size: 7726383 bytes
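For a sense of scale, the figures above can be combined into an average record size and a download-to-dataset ratio. A small worked calculation using only the numbers listed on this page:

```python
num_examples = 5000        # examples in the train split
dataset_size = 7_726_383   # bytes, dataset size
download_size = 4_366_827  # bytes, download size

avg_bytes = dataset_size / num_examples
ratio = download_size / dataset_size

print(f"{avg_bytes:.1f} bytes per example on average")
print(f"download is {ratio:.1%} of the dataset size")
```

That works out to roughly 1.5 KB per example, with the download a little over half the size of the loaded dataset.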

Configuration

Default configuration (default):
- Data file path: data/train-*
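The data file path is a glob pattern rather than a single file name. To illustrate which paths it would select, the sketch below uses Python's standard `fnmatch` module as a stand-in; the Hub's own pattern resolution may differ in detail, and the file names are hypothetical examples, not a listing of the actual repository:

```python
from fnmatch import fnmatchcase

PATTERN = "data/train-*"  # data file path from the default configuration

# Hypothetical file names, used only to show how the pattern behaves.
candidates = [
    "data/train-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

matched = [path for path in candidates if fnmatchcase(path, PATTERN)]
print(matched)
```

Only paths under `data/` whose file name begins with `train-` are selected, which is how a split can be stored as one or several shard files.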
