
PeytonT/500k_papers_text

Hugging Face, updated 2026-04-21, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/PeytonT/500k_papers_text
Resource description:
---
pretty_name: Paper Text Dataset
viewer: true
tags:
- datasets
- arxiv
- scientific-papers
- text
size_categories:
- 100K<n<1M
---

# Generated from Research Library: https://github.com/peytontolbert/Research_Library

# Deduped Paper Text Dataset

This dataset merges an existing paper-text parquet export with one or more backfill parquet shards, then keeps exactly one row per `canonical_paper_id`.

One row corresponds to one canonical arXiv paper. Where multiple versions are available, the merge keeps the strongest text row according to the repo's selection logic: prefer non-partial text, then prefer raw-PDF-derived text, then newer versions, then longer text.

## Important provenance note

This is a merged release, not a fresh one-pass export. It combines an earlier deduped paper-text dataset with later backfill shards, then dedupes again on `canonical_paper_id`.

Use per-paper `license` filtering before broad downstream publication. Many arXiv records still carry `nonexclusive-distrib` rather than a general reuse license.

This release is capped deterministically at exactly `500000` deduped papers. The selected inputs contained `574522` deduped papers before applying the cap.

## Rows

- `train`: `500000` papers
- unique canonical papers: `500000`
- base input rows considered: `124036`
- backfill input rows considered: `450486`
- deduped rows before target cap: `574522`

## Main columns

- `paper_id`
- `canonical_paper_id`
- `paper_version`
- `pdf_path`
- `title`
- `abstract`
- `authors`
- `categories`
- `license`
- `text`
- `text_char_count`
- `token_count`
- `page_count`
- `token_types`

## Files

- `train_00000.parquet`
- `train_00001.parquet`
- `train_00002.parquet`
- `train_00003.parquet`
- `train_00004.parquet`
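
The selection order described in the card (prefer non-partial text, then raw-PDF-derived text, then newer versions, then longer text) and the per-paper `license` filtering can be sketched roughly as follows. This is an illustration, not the repository's actual merge code. The flags `is_partial` and `from_raw_pdf` are hypothetical names that do not appear among the listed columns, `paper_version` is assumed to be an integer, the toy rows are made up, and `example-open-license` is a placeholder value rather than a real license string from the dataset.

```python
from typing import Iterable


def selection_key(row: dict) -> tuple:
    """Rank a candidate row; the larger tuple wins.

    `is_partial` and `from_raw_pdf` are hypothetical flags used only for
    this sketch. `paper_version` is assumed to be an integer.
    """
    return (
        not row["is_partial"],    # 1. prefer non-partial text
        row["from_raw_pdf"],      # 2. then raw-PDF-derived text
        row["paper_version"],     # 3. then newer versions
        row["text_char_count"],   # 4. then longer text
    )


def dedupe_papers(rows: Iterable[dict]) -> dict:
    """Keep one row per `canonical_paper_id` (first seen wins on exact ties)."""
    best = {}
    for row in rows:
        key = row["canonical_paper_id"]
        if key not in best or selection_key(row) > selection_key(best[key]):
            best[key] = row
    return best


def filter_by_license(rows: Iterable[dict], allowed: set) -> list:
    """Keep only rows whose `license` is in the caller-supplied allowed set."""
    return [row for row in rows if row["license"] in allowed]


# Toy rows, invented for illustration only.
TOY_ROWS = [
    {"canonical_paper_id": "paper-a", "paper_version": 1, "is_partial": False,
     "from_raw_pdf": True, "text_char_count": 40000,
     "license": "nonexclusive-distrib"},
    {"canonical_paper_id": "paper-a", "paper_version": 2, "is_partial": True,
     "from_raw_pdf": True, "text_char_count": 90000,
     "license": "nonexclusive-distrib"},
    {"canonical_paper_id": "paper-b", "paper_version": 1, "is_partial": False,
     "from_raw_pdf": False, "text_char_count": 12000,
     "license": "example-open-license"},
]

deduped = dedupe_papers(TOY_ROWS)
reusable = filter_by_license(deduped.values(), {"example-open-license"})
```

Encoding the preference order as a tuple keeps the priority explicit: a later criterion only matters when every earlier one ties, so a complete version 1 beats a partial version 2 even though the latter is newer and longer.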
Provider:
PeytonT