PeytonT/500k_papers_text

Name: PeytonT/500k_papers_text
Creator: PeytonT
Published: 2026-04-21 13:28:12
License: No description available

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/PeytonT/500k_papers_text

Resource description:

---
pretty_name: Paper Text Dataset
viewer: true
tags:
- datasets
- arxiv
- scientific-papers
- text
size_categories:
- 100K<n<1M
---

# Generated from Research Library: https://github.com/peytontolbert/Research_Library

# Deduped Paper Text Dataset

This dataset merges an existing paper-text parquet export with one or more backfill parquet shards, then keeps exactly one row per `canonical_paper_id`.

One row corresponds to one canonical arXiv paper. Where multiple versions are available, the merge keeps the strongest text row according to the repo's selection logic: prefer non-partial text, then prefer raw-PDF-derived text, then newer versions, then longer text.

## Important provenance note

This is a merged release, not a fresh one-pass export. It combines an earlier deduped paper-text dataset with later backfill shards, then dedupes again on `canonical_paper_id`.

Use per-paper `license` filtering before broad downstream publication. Many arXiv records still carry `nonexclusive-distrib` rather than a general reuse license.

This release is capped deterministically at exactly `500000` deduped papers. The selected inputs contained `574522` deduped papers before applying the cap.

## Rows

- `train`: `500000` papers
- unique canonical papers: `500000`
- base input rows considered: `124036`
- backfill input rows considered: `450486`
- deduped rows before target cap: `574522`

## Main columns

- `paper_id`
- `canonical_paper_id`
- `paper_version`
- `pdf_path`
- `title`
- `abstract`
- `authors`
- `categories`
- `license`
- `text`
- `text_char_count`
- `token_count`
- `page_count`
- `token_types`

## Files

- `train_00000.parquet`
- `train_00001.parquet`
- `train_00002.parquet`
- `train_00003.parquet`
- `train_00004.parquet`
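
The selection order described in the dataset overview (non-partial text first, then raw-PDF-derived text, then newer versions, then longer text) can be illustrated with a small sketch. This is not the repository's actual code. The flags `is_partial` and `from_raw_pdf` are hypothetical names that do not appear in the published column list, and `paper_version` is assumed here to be an integer; only `canonical_paper_id`, `paper_version`, and `text_char_count` come from the card itself.

```python
def selection_key(row: dict) -> tuple:
    """Ranking tuple: a larger tuple means a stronger text row.

    `is_partial` and `from_raw_pdf` are hypothetical flags used only
    for illustration; `paper_version` is assumed to be an integer.
    """
    return (
        not row["is_partial"],     # 1. prefer non-partial text
        row["from_raw_pdf"],       # 2. then raw-PDF-derived text
        row["paper_version"],      # 3. then newer versions
        row["text_char_count"],    # 4. then longer text
    )


def dedupe(rows: list) -> list:
    """Keep exactly one row per canonical_paper_id, choosing the strongest."""
    best = {}
    for row in rows:
        key = row["canonical_paper_id"]
        if key not in best or selection_key(row) > selection_key(best[key]):
            best[key] = row
    return list(best.values())


# Made-up example rows, not taken from the dataset.
rows = [
    {"canonical_paper_id": "2101.00001", "paper_version": 1,
     "is_partial": False, "from_raw_pdf": True, "text_char_count": 1000},
    {"canonical_paper_id": "2101.00001", "paper_version": 2,
     "is_partial": True, "from_raw_pdf": True, "text_char_count": 5000},
    {"canonical_paper_id": "2101.00002", "paper_version": 1,
     "is_partial": False, "from_raw_pdf": False, "text_char_count": 800},
    {"canonical_paper_id": "2101.00002", "paper_version": 2,
     "is_partial": False, "from_raw_pdf": False, "text_char_count": 700},
]

kept = {r["canonical_paper_id"]: r for r in dedupe(rows)}
```

In this example the first paper keeps version 1, because the newer row is partial, and the second paper keeps version 2, because version outranks text length. Comparing tuples keeps the priority order explicit and easy to reorder if the real logic differs.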

Provider:

PeytonT
