
ShayManor/Labeled-arXiv

Source: Hugging Face. Updated 2026-03-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ShayManor/Labeled-arXiv
Description:
---
license: cc0-1.0
size_categories:
- 1M<n<10M
configs:
- config_name: papers
  data_files: papers/**
- config_name: authors
  data_files: authors/**
task_categories:
- feature-extraction
- text-classification
- summarization
- text-generation
language:
- en
tags:
- nlp
- summarization
- networks
---

# Labeled-arXiv

Enriched arXiv paper metadata and derived author bibliometrics sourced from [OpenAlex](https://openalex.org/). Contains **2.93M papers** and **1.72M authors** across all arXiv subject areas.

## Subsets

### `papers` (2.93M rows)

| Column | Type | Description |
|---|---|---|
| `id` | `string` | arXiv paper ID (e.g. `0704.0028`) |
| `submitter` | `string` (nullable) | Name of the submitting author |
| `authors` | `string` | Raw author string from arXiv metadata |
| `title` | `string` | Paper title |
| `comments` | `string` (nullable) | Submitter comments (page count, figures, etc.) |
| `journal-ref` | `string` (nullable) | Journal publication reference |
| `doi` | `string` | Digital Object Identifier |
| `report-no` | `string` (nullable) | Report number |
| `categories` | `string` | arXiv category tags (e.g. `math.CA math.PR`) |
| `license` | `string` | License identifier (9 distinct values) |
| `abstract` | `string` | Paper abstract |
| `versions` | `list` | Version history entries |
| `update_date` | `date32` | Date of last metadata update |
| `authors_parsed` | `list` | Structured author names (parsed into components) |
| `author_ids` | `list` (nullable) | OpenAlex/ORCID author IDs linked to this paper |
| `deleted` | `bool` | Whether the paper has been removed from arXiv |
| `citations` | `list` (nullable) | List of citing work identifiers |
| `citation_count` | `int32` | Number of citations (range: 0–30.5k) |

### `authors` (1.72M rows)

Author-level metrics **derived from the papers in this dataset**, not global totals.

| Column | Type | Description |
|---|---|---|
| `author_id` | `string` | ORCID URL or OpenAlex ID |
| `name` | `string` (nullable) | Author display name |
| `paper_dois` | `list[string]` | DOIs of the author's papers in this dataset |
| `h_index` | `int32` | H-index computed within this dataset (0–84) |
| `works_count` | `int32` | Paper count within this dataset (1–13.9k) |
| `cited_by_count` | `int32` | Total citations within this dataset (0–57k) |

> Note: `h_index`, `works_count`, and `cited_by_count` are scoped to this dataset and do not represent an author's complete publication record.

## Usage

```python
from datasets import load_dataset

papers = load_dataset("ShayManor/Labeled-arXiv", "papers", split="train")
authors = load_dataset("ShayManor/Labeled-arXiv", "authors", split="train")
```

## Source

Metadata from [OpenAlex](https://openalex.org/) (Priem et al., 2022), built on top of the [arXiv](https://arxiv.org/) bulk metadata. Author IDs use [ORCID](https://orcid.org/) URLs or OpenAlex internal identifiers.

## License

[CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) (Public Domain).
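
## Example: dataset-scoped author metrics

The `authors` subset describes `h_index`, `works_count`, and `cited_by_count` as scoped to the papers in this dataset. The card does not say how they were computed, so the following is only a minimal sketch under stated assumptions: it uses the standard h-index definition, treats `cited_by_count` as the sum of `citation_count` over an author's papers, and skips rows whose `author_ids` is null. The helper names `h_index` and `author_metrics` and the IDs `A1` and `A2` are hypothetical and are not part of the dataset.

```python
from collections import defaultdict


def h_index(citation_counts):
    """Return the largest h such that at least h papers have >= h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def author_metrics(papers):
    """Aggregate per-author metrics from an iterable of paper rows.

    Each row is a dict with `author_ids` (list or None) and
    `citation_count` (int), mirroring the `papers` columns.
    Assumption: cited_by_count is the sum of citation_count.
    """
    per_author = defaultdict(list)
    for row in papers:
        # `author_ids` is nullable, so treat None as an empty list.
        for author_id in row.get("author_ids") or []:
            per_author[author_id].append(row["citation_count"])
    return {
        author_id: {
            "works_count": len(counts),
            "cited_by_count": sum(counts),
            "h_index": h_index(counts),
        }
        for author_id, counts in per_author.items()
    }


# Toy rows with made-up author IDs; real rows come from the `papers` subset.
sample_papers = [
    {"author_ids": ["A1", "A2"], "citation_count": 3},
    {"author_ids": ["A1"], "citation_count": 10},
    {"author_ids": ["A1"], "citation_count": 1},
    {"author_ids": None, "citation_count": 7},
]

metrics = author_metrics(sample_papers)
print(metrics["A1"])
```

Rows loaded from the `papers` subset could be passed to `author_metrics` in the same way. The results may not match the published `authors` table exactly if the original pipeline deduplicated papers or handled deleted entries differently.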
Provider:
ShayManor