PeytonT/500k_papers_text
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/PeytonT/500k_papers_text
Resource description:
---
pretty_name: Paper Text Dataset
viewer: true
tags:
- datasets
- arxiv
- scientific-papers
- text
size_categories:
- 100K<n<1M
---
# Generated from Research Library:
https://github.com/peytontolbert/Research_Library
# Deduped Paper Text Dataset
This dataset merges an existing paper-text parquet export with one or more
backfill parquet shards, then keeps exactly one row per `canonical_paper_id`.
One row corresponds to one canonical arXiv paper. Where multiple versions are
available, the merge keeps the strongest text row according to the repo's
selection logic: prefer non-partial text, then prefer raw-PDF-derived text,
then newer versions, then longer text.
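
The selection order described above can be sketched in plain Python. This is only an illustration, not the Research Library's actual merge code: the `is_partial` and `from_raw_pdf` flags are hypothetical field names, and `paper_version` is assumed to be an integer.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def selection_key(row: Row) -> Tuple[bool, bool, int, int]:
    """Rank a row; a larger tuple means a stronger candidate.

    `is_partial` and `from_raw_pdf` are hypothetical flags used for
    illustration; the real export may encode them differently.
    """
    return (
        not row["is_partial"],      # 1. prefer non-partial text
        bool(row["from_raw_pdf"]),  # 2. then raw-PDF-derived text
        int(row["paper_version"]),  # 3. then newer versions (assumed integer)
        len(row["text"]),           # 4. then longer text
    )


def dedupe(rows: Iterable[Row]) -> List[Row]:
    """Keep exactly one row per `canonical_paper_id`."""
    best: Dict[str, Row] = {}
    for row in rows:
        cid = row["canonical_paper_id"]
        if cid not in best or selection_key(row) > selection_key(best[cid]):
            best[cid] = row
    # Sort by ID so the output order does not depend on input order.
    return [best[cid] for cid in sorted(best)]
```

Comparing tuples applies the criteria in priority order, so a later criterion only matters when all earlier ones tie.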
## Important provenance note
This is a merged release, not a fresh one-pass export. It combines an earlier
deduped paper-text dataset with later backfill shards, then dedupes again on
`canonical_paper_id`.
Use per-paper `license` filtering before broad downstream publication. Many
arXiv records still carry `nonexclusive-distrib` rather than a general reuse
license.
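
A per-paper license filter might look like the sketch below. It assumes the `license` column holds a license URL or identifier string and matches on substrings. The exact values stored in this dataset are not documented here, so inspect them before relying on any allow-list. The markers in `EXAMPLE_ALLOWED` are illustrative assumptions, not a recommendation.

```python
from typing import Any, Dict, Iterable, Iterator, Sequence

Row = Dict[str, Any]

# Illustrative allow-list only; adjust after inspecting real `license` values.
EXAMPLE_ALLOWED = (
    "creativecommons.org/licenses/by/",
    "creativecommons.org/publicdomain/zero/",
)


def filter_by_license(
    rows: Iterable[Row], allowed_markers: Sequence[str] = EXAMPLE_ALLOWED
) -> Iterator[Row]:
    """Yield rows whose `license` contains one of the allowed markers.

    Rows with a missing or empty license are dropped, which is the
    conservative choice when the goal is redistribution.
    """
    for row in rows:
        license_value = (row.get("license") or "").lower()
        if any(marker in license_value for marker in allowed_markers):
            yield row
```

With this allow-list, a record whose license string contains `nonexclusive-distrib` is excluded, as is one with no license value at all.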
This release is capped deterministically at exactly `500000` deduped papers. The selected inputs contained `574522` deduped papers before applying the cap.
## Rows
- `train`: `500000` papers
- unique canonical papers: `500000`
- base input rows considered: `124036`
- backfill input rows considered: `450486`
- deduped rows before target cap: `574522`
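
As a quick consistency check on the counts above, the short sketch below recomputes how many deduped papers the cap removed and confirms that the two input counts sum to the pre-cap total. The constant names are introduced here for illustration.

```python
# Counts copied from the list above.
TRAIN_ROWS = 500_000
BASE_INPUT_ROWS = 124_036
BACKFILL_INPUT_ROWS = 450_486
DEDUPED_BEFORE_CAP = 574_522

# Papers that were deduped but left out by the 500000 cap.
rows_dropped_by_cap = DEDUPED_BEFORE_CAP - TRAIN_ROWS
inputs_total = BASE_INPUT_ROWS + BACKFILL_INPUT_ROWS

print(rows_dropped_by_cap)                 # 74522
print(inputs_total == DEDUPED_BEFORE_CAP)  # True
```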
## Main columns
- `paper_id`
- `canonical_paper_id`
- `paper_version`
- `pdf_path`
- `title`
- `abstract`
- `authors`
- `categories`
- `license`
- `text`
- `text_char_count`
- `token_count`
- `page_count`
- `token_types`
## Files
- `train_00000.parquet`
- `train_00001.parquet`
- `train_00002.parquet`
- `train_00003.parquet`
- `train_00004.parquet`
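
One possible way to read the shards is through the Hugging Face `datasets` library, sketched below. It assumes the `datasets` package is installed, that network access is available, and that the repository loads with its default configuration. The helper names `shard_names` and `stream_train` are introduced here for illustration.

```python
from typing import List

REPO_ID = "PeytonT/500k_papers_text"


def shard_names(count: int = 5) -> List[str]:
    """Return shard file names following the `train_NNNNN.parquet` pattern."""
    return [f"train_{i:05d}.parquet" for i in range(count)]


def stream_train():
    """Stream the `train` split without downloading every shard first.

    Assumes the optional `datasets` package is installed; it is imported
    lazily so `shard_names` still works without it.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=True)
```

Iterating over `stream_train()` yields one dictionary per paper, so a field such as `row["title"]` can be read without materializing the full split in memory.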
Provided by:
PeytonT



