
PeytonT/100k_papers_text

Source: Hugging Face. Updated 2026-04-19. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/PeytonT/100k_papers_text
Description:
---
pretty_name: Structured Paper Text (combined token export)
viewer: true
tags:
- datasets
- arxiv
- scientific-papers
- text
size_categories:
- 100K<n<1M
---

# Generated from Research Library: https://github.com/peytontolbert/Research_Library

# Structured Paper Text Dataset

One row per paper, built by combining the existing tokenized PDF exports under `exports/pdfs_structured`.

## Important provenance note

This dataset reflects the **current structured PDF shards**, not necessarily complete paper text. If the underlying shards were produced with truncated PDF preprocessing, the `text` field is partial.

Use per-paper `license` filtering before publishing broadly; many arXiv records carry `nonexclusive-distrib` rather than a general reuse license.

This export prefers raw PDF text and does not apply a character cap, so rows sourced from PDFs are full-document extracts.

## Rows

- `train`: `100569` papers
- with matched arXiv metadata: `100569`
- structured-token rows: `17`
- preferred raw-PDF rows: `100549`
- raw-PDF fallback rows: `3`

## Main columns

- `paper_id`
- `canonical_paper_id`
- `paper_version`
- `pdf_path`
- `title`
- `abstract`
- `authors`
- `categories`
- `license`
- `text`
- `text_char_count`
- `token_count`
- `page_count`
- `token_types`

## Top licenses in export

- `http://arxiv.org/licenses/nonexclusive-distrib/1.0/`: 60005
- `http://creativecommons.org/licenses/by/4.0/`: 28994
- `http://creativecommons.org/licenses/by-nc-nd/4.0/`: 5111
- `http://creativecommons.org/licenses/by-nc-sa/4.0/`: 3442
- `http://creativecommons.org/licenses/by-sa/4.0/`: 1735
- `http://creativecommons.org/publicdomain/zero/1.0/`: 1244
- `http://creativecommons.org/licenses/by/3.0/`: 18
- `http://creativecommons.org/licenses/by-nc-sa/3.0/`: 16
- `http://creativecommons.org/licenses/publicdomain/`: 4
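The provenance note recommends filtering on the per-paper `license` column before redistributing text. A minimal sketch of that filter follows. The allowlist is an illustrative assumption, not a recommendation from the dataset author, and the sample rows and their `paper_id` values are hypothetical stand-ins for real records.

```python
# Assumption: this allowlist is only an example. Decide which licenses
# are acceptable for your own use case before relying on it.
PERMISSIVE_LICENSES = frozenset({
    "http://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-sa/4.0/",
    "http://creativecommons.org/publicdomain/zero/1.0/",
})


def has_permissive_license(row: dict) -> bool:
    """Return True when the row's `license` URL is in the allowlist."""
    return row.get("license") in PERMISSIVE_LICENSES


# Hypothetical rows that mimic the `paper_id` and `license` columns.
sample_rows = [
    {"paper_id": "example-0001",
     "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"paper_id": "example-0002",
     "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"},
    {"paper_id": "example-0003",
     "license": "http://creativecommons.org/publicdomain/zero/1.0/"},
]

kept = [row for row in sample_rows if has_permissive_license(row)]
print([row["paper_id"] for row in kept])  # ['example-0001', 'example-0003']
```

If the dataset is loaded with the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter`, which calls it once per row with a dictionary of column values.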
Provider:
PeytonT