ashish-chouhan/arxiv_cs_papers
Source: Hugging Face. Updated 2023-10-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ashish-chouhan/arxiv_cs_papers
Resource description:
---
dataset_info:
  features:
  - name: title
    dtype: string
  - name: abstract
    dtype: string
  - name: authors
    sequence: string
  - name: published
    dtype: string
  - name: url
    dtype: string
  - name: pdf_url
    dtype: string
  - name: arxiv_id
    dtype: string
  splits:
  - name: train
    num_bytes: 7726383
    num_examples: 5000
  download_size: 4366827
  dataset_size: 7726383
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dataset Card for "arxiv_cs_papers"
This dataset contains the subset of arXiv papers tagged "cs.LG", the arXiv category for Machine Learning.
The core dataset is filtered from the full arXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers; after category filtering, this dataset contains roughly 100,000 papers.
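The card does not say how the category filter was implemented. A minimal sketch of one way to do it, assuming the Kaggle snapshot is a JSON Lines file in which each record carries a space-separated `categories` string (the function name `filter_by_category` and the sample records are hypothetical, for illustration only):

```python
import json
from typing import Iterable, Iterator


def filter_by_category(lines: Iterable[str], tag: str = "cs.LG") -> Iterator[dict]:
    """Yield metadata records whose category list includes `tag`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: `categories` is a space-separated string such as "cs.LG stat.ML".
        if tag in record.get("categories", "").split():
            yield record


# Made-up records standing in for lines of the Kaggle snapshot file.
sample = [
    '{"id": "0001.0001", "title": "A", "categories": "cs.LG stat.ML"}',
    '{"id": "0001.0002", "title": "B", "categories": "math.CO"}',
    '{"id": "0001.0003", "title": "C", "categories": "cs.CV cs.LG"}',
]
kept = list(filter_by_category(sample))
print([r["id"] for r in kept])
```

Splitting the string before the membership test avoids matching a tag that is only a substring of another category name. To run this on the real snapshot, pass an open file handle instead of the `sample` list.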
The dataset is maintained through requests to the arXiv API.
The dataset contains the following features:
<ul>
<li> title </li>
<li> abstract </li>
<li> authors </li>
<li> published </li>
<li> url </li>
<li> pdf_url </li>
<li> arxiv_id </li>
</ul>
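The card does not show the maintenance code. As a rough sketch of how an arXiv API response could be mapped onto the seven features listed above, the following builds a query URL and parses an Atom feed with the standard library. The helper names `build_query_url` and `parse_feed`, the sample feed, and the choice to keep the version suffix in `arxiv_id` are assumptions, not the author's implementation.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv API feed


def build_query_url(category: str = "cs.LG", start: int = 0, max_results: int = 100) -> str:
    """Build an arXiv API query URL for the newest papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def parse_feed(xml_text: str) -> List[Dict]:
    """Convert an Atom feed into records with the dataset's feature names."""
    records = []
    for entry in ET.fromstring(xml_text).findall(f"{ATOM}entry"):
        url = entry.findtext(f"{ATOM}id", default="").strip()
        pdf_url = ""
        for link in entry.findall(f"{ATOM}link"):
            if link.get("title") == "pdf":
                pdf_url = link.get("href", "")
        records.append({
            # Collapse line breaks and repeated spaces inside title and abstract.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", default="").split()),
            "authors": [a.findtext(f"{ATOM}name", default="").strip()
                        for a in entry.findall(f"{ATOM}author")],
            "published": entry.findtext(f"{ATOM}published", default="").strip(),
            "url": url,
            "pdf_url": pdf_url,
            # Assumption: the id is whatever follows "/abs/", version suffix included.
            "arxiv_id": url.rsplit("/abs/", 1)[-1],
        })
    return records


# Made-up feed with a single entry, used instead of a live network request.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2023-01-01T00:00:00Z</published>
    <title>A Made-Up
      Title</title>
    <summary>A made-up abstract.</summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""

records = parse_feed(sample_feed)
print(records[0]["title"], records[0]["authors"])
```

In practice the URL from `build_query_url` would be fetched (for example with `urllib.request.urlopen`) and the response body passed to `parse_feed`; paging is done by increasing `start`.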
Provided by:
ashish-chouhan
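Summary of original information
Dataset card "arxiv_cs_papers"
Dataset description
This dataset contains the subset of arXiv papers tagged "cs.LG", indicating that the paper is about machine learning.
Dataset source
The core dataset is filtered from the full arXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. After category filtering, this dataset contains roughly 100,000 papers.
Dataset maintenance
The dataset is maintained through requests to the arXiv API.
Dataset features
The dataset contains the following features:
- title: string
- abstract: string
- authors: sequence of strings
- published: string
- url: string
- pdf_url: string
- arxiv_id: string
Dataset splits
- Training set (train): 5000 examples, 7726383 bytes
Dataset size
- Download size: 4366827 bytes
- Dataset size: 7726383 bytes
Configuration
- Default configuration (default):
  - Data file path: data/train-*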



