PeytonT/100k_papers_text
Source: Hugging Face | Updated: 2026-04-19 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/PeytonT/100k_papers_text
Resource description:
---
pretty_name: Structured Paper Text (combined token export)
viewer: true
tags:
- datasets
- arxiv
- scientific-papers
- text
size_categories:
- 100K<n<1M
---
# Generated from Research Library:
https://github.com/peytontolbert/Research_Library
# Structured Paper Text Dataset
One row per paper, built by combining the existing tokenized PDF exports under `exports/pdfs_structured`.
## Important provenance note
This dataset reflects the **current structured PDF shards**, so `text` is not guaranteed to be the complete paper.
If the underlying shards were produced with truncated PDF preprocessing, the `text` field for rows built from those shards is partial.
This export prefers raw PDF text and does not apply a character cap, so rows sourced from PDFs are full-document extracts.

Filter on the per-paper `license` column before redistributing broadly; many arXiv records carry `nonexclusive-distrib` rather than a general reuse license.
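
The per-paper license filtering recommended above could look roughly like the sketch below. The allowlist `REUSE_ALLOWLIST` and the helper names are hypothetical: which licenses count as acceptable for your use is a policy decision this card does not make, and the sample rows are illustrative stand-ins, not real records.

```python
# Hypothetical allowlist: the choice of "acceptable" licenses is an assumption
# for illustration, not a recommendation from the dataset card.
REUSE_ALLOWLIST = {
    "http://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-sa/4.0/",
    "http://creativecommons.org/publicdomain/zero/1.0/",
}


def allows_broad_reuse(row: dict) -> bool:
    """Return True when the row's `license` URL is on the allowlist."""
    return row.get("license") in REUSE_ALLOWLIST


def filter_rows(rows):
    """Yield only the rows whose license passes the allowlist check."""
    for row in rows:
        if allows_broad_reuse(row):
            yield row


# Tiny in-memory stand-in for dataset rows (illustrative values only).
sample_rows = [
    {"paper_id": "a", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"paper_id": "b", "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"},
    {"paper_id": "c", "license": None},
]
kept_ids = [row["paper_id"] for row in filter_rows(sample_rows)]
```

If the data is loaded with the Hugging Face `datasets` library, the same predicate can likely be passed to `Dataset.filter`, since it takes one row dict and returns a boolean.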
## Rows
- `train`: `100569` papers
- with matched arXiv metadata: `100569`
- structured-token rows: `17`
- preferred raw-PDF rows: `100549`
- raw-PDF fallback rows: `3`
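
As a quick sanity check on the counts above, the three text-source categories can be summed and compared against the `train` total. This is a minimal sketch that assumes the categories are mutually exclusive; the matching sum supports that reading but the card does not state it. The dictionary keys and function name are made up for illustration.

```python
# Counts copied from the "Rows" list; key names are illustrative.
TRAIN_ROWS = 100569
SOURCE_COUNTS = {
    "structured_token": 17,
    "preferred_raw_pdf": 100549,
    "raw_pdf_fallback": 3,
}


def source_shares(counts: dict, total: int) -> dict:
    """Return each source's fraction of the split, after checking the sum."""
    if sum(counts.values()) != total:
        raise ValueError("source counts do not add up to the split size")
    return {name: n / total for name, n in counts.items()}


shares = source_shares(SOURCE_COUNTS, TRAIN_ROWS)
for name, share in shares.items():
    print(f"{name}: {share:.4%}")
```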
## Main columns
- `paper_id`
- `canonical_paper_id`
- `paper_version`
- `pdf_path`
- `title`
- `abstract`
- `authors`
- `categories`
- `license`
- `text`
- `text_char_count`
- `token_count`
- `page_count`
- `token_types`
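
A consumer may want to confirm that a row carries the columns listed above before relying on them. The sketch below checks only column presence, because the card does not document column types. The `char_count_matches` helper additionally assumes that `text_char_count` equals `len(text)`, which is a guess about the field's meaning rather than a documented guarantee. The sample row holds placeholder values.

```python
# Column names copied from the "Main columns" list.
EXPECTED_COLUMNS = (
    "paper_id", "canonical_paper_id", "paper_version", "pdf_path",
    "title", "abstract", "authors", "categories", "license",
    "text", "text_char_count", "token_count", "page_count", "token_types",
)


def missing_columns(row: dict) -> list:
    """Return the expected column names that are absent from `row`."""
    return [name for name in EXPECTED_COLUMNS if name not in row]


def char_count_matches(row: dict) -> bool:
    """Assumption: `text_char_count` is the character length of `text`."""
    return row.get("text_char_count") == len(row.get("text") or "")


# Placeholder row: every column present, values are not real data.
sample_row = {name: None for name in EXPECTED_COLUMNS}
sample_row["text"] = "placeholder body"
sample_row["text_char_count"] = len(sample_row["text"])

problems = missing_columns(sample_row)
```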
## Top licenses in export
- `http://arxiv.org/licenses/nonexclusive-distrib/1.0/`: 60005
- `http://creativecommons.org/licenses/by/4.0/`: 28994
- `http://creativecommons.org/licenses/by-nc-nd/4.0/`: 5111
- `http://creativecommons.org/licenses/by-nc-sa/4.0/`: 3442
- `http://creativecommons.org/licenses/by-sa/4.0/`: 1735
- `http://creativecommons.org/publicdomain/zero/1.0/`: 1244
- `http://creativecommons.org/licenses/by/3.0/`: 18
- `http://creativecommons.org/licenses/by-nc-sa/3.0/`: 16
- `http://creativecommons.org/licenses/publicdomain/`: 4
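
The license counts above can be grouped by issuing host to see how much of the export falls under Creative Commons terms versus the arXiv non-exclusive license. This is a sketch only: grouping by URL host is a simplification chosen for illustration, and it does not say anything about what each license permits.

```python
from collections import Counter
from urllib.parse import urlparse

# Counts copied from the "Top licenses in export" list.
LICENSE_COUNTS = {
    "http://arxiv.org/licenses/nonexclusive-distrib/1.0/": 60005,
    "http://creativecommons.org/licenses/by/4.0/": 28994,
    "http://creativecommons.org/licenses/by-nc-nd/4.0/": 5111,
    "http://creativecommons.org/licenses/by-nc-sa/4.0/": 3442,
    "http://creativecommons.org/licenses/by-sa/4.0/": 1735,
    "http://creativecommons.org/publicdomain/zero/1.0/": 1244,
    "http://creativecommons.org/licenses/by/3.0/": 18,
    "http://creativecommons.org/licenses/by-nc-sa/3.0/": 16,
    "http://creativecommons.org/licenses/publicdomain/": 4,
}


def counts_by_host(license_counts: dict) -> Counter:
    """Sum row counts per license-URL host (e.g. creativecommons.org)."""
    totals = Counter()
    for url, count in license_counts.items():
        totals[urlparse(url).netloc] += count
    return totals


by_host = counts_by_host(LICENSE_COUNTS)
listed_total = sum(LICENSE_COUNTS.values())
for host, count in by_host.most_common():
    print(f"{host}: {count} rows ({count / listed_total:.1%})")
```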
Provider:
PeytonT



