
ZengXiangyu/pg19-and-proof-pile

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ZengXiangyu/pg19-and-proof-pile
Resource description:
---
license: other
tags:
- evaluation
- long-context
- long-context modeling
- pg19
- proof-pile
- hici
---

# HiCI Evaluation Data

Pre-tokenized binary evaluation splits used in the [HiCI](https://arxiv.org/abs/2603.20843) paper (Hierarchical Construction-Integration for long-context LLMs).

## Contents

| Path | Description |
|------|-------------|
| `pg19_llama2/test.bin` | PG19 test set, Llama-2 tokenizer (uint16) |
| `pg19_llama2/validation.bin` | PG19 validation set, Llama-2 tokenizer (uint16) |
| `pg19_llama3/test.bin` | PG19 test set, Llama-3 tokenizer (uint32) |
| `pg19_llama3/validation.bin` | PG19 validation set, Llama-3 tokenizer (uint32) |
| `pg19_qwen3/test.bin` | PG19 test set, Qwen3 tokenizer (uint32) |
| `pg19_qwen3/validation.bin` | PG19 validation set, Qwen3 tokenizer (uint32) |
| `pg19_raw/test.txt` | PG19 test set, raw text |
| `pg19_raw/validation.txt` | PG19 validation set, raw text |
| `proof-pile_llama2/test_sampled_data.bin` | Proof-pile 128-doc sampled test set, Llama-2 tokenizer (uint16) |
| `proof-pile_llama3/test_sampled_data.bin` | Proof-pile 128-doc sampled test set, Llama-3 tokenizer (uint32) |
| `proof-pile_qwen3/test_sampled_data.bin` | Proof-pile 128-doc sampled test set, Qwen3 tokenizer (uint32) |

## Format

`.bin` files are memory-mapped token ID arrays, compatible with the evaluation scripts in the HiCI repo.

- Llama-2 tokenized files: `uint16` (vocab size 32,000)
- Llama-3 / Qwen3 tokenized files: `uint32` (vocab size > 65,535)

```python
import numpy as np

data = np.memmap("pg19_llama2/test.bin", dtype=np.uint16, mode="r")  # Llama-2
data = np.memmap("pg19_qwen3/test.bin", dtype=np.uint32, mode="r")   # Qwen3 / Llama-3
```

## Usage

Download a single file:

```bash
huggingface-cli download ZengXiangyu/pg19-and-proof-pile proof-pile_llama2/test_sampled_data.bin --repo-type dataset
```

Or the full dataset:

```bash
huggingface-cli download ZengXiangyu/pg19-and-proof-pile --repo-type dataset --local-dir ./data
```

## Proof-pile Sampling

`proof-pile_llama2/test_sampled_data.bin` is identical to the file released by [LongLoRA](https://github.com/dvlab-research/LongLoRA): 128 documents randomly sampled from the proof-pile test split, each with at least 32,768 tokens, tokenized with the LLaMA-2 tokenizer.

`proof-pile_llama3` and `proof-pile_qwen3` contain the **same 128 documents** re-tokenized with their respective tokenizers, enabling fair cross-model comparison.

## Source

- PG19: [deepmind/pg19](https://huggingface.co/datasets/deepmind/pg19)
- Proof-pile: [EleutherAI/proof-pile](https://huggingface.co/datasets/EleutherAI/proof-pile)
- Proof-pile LLaMA-2 tokenized (original): [LongLoRA](https://github.com/dvlab-research/LongLoRA)
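
As an illustration of the `.bin` format described above, the following is a minimal sketch of how a token file might be cut into fixed-length evaluation windows. It assumes each file is a flat, header-less stream of token IDs (which is what the `np.memmap` example implies). The helper names `dtype_for_vocab` and `load_windows` are hypothetical and are not part of the HiCI repo; the actual evaluation scripts may window the data differently.

```python
import numpy as np


def dtype_for_vocab(vocab_size: int) -> np.dtype:
    """Choose the token dtype from the vocabulary size.

    Mirrors the rule stated in the Format section: uint32 when the
    vocab size exceeds 65,535, uint16 otherwise. (Hypothetical helper.)
    """
    return np.dtype(np.uint32) if vocab_size > 65_535 else np.dtype(np.uint16)


def load_windows(path: str, vocab_size: int, seq_len: int) -> np.ndarray:
    """Memory-map a token file and view it as non-overlapping windows.

    Assumes the file holds nothing but raw token IDs. Trailing tokens
    that do not fill a complete window are dropped.
    """
    tokens = np.memmap(path, dtype=dtype_for_vocab(vocab_size), mode="r")
    n_windows = len(tokens) // seq_len
    # Slicing and reshaping a memmap yields a view; no data is copied.
    return tokens[: n_windows * seq_len].reshape(n_windows, seq_len)
```

For example, `load_windows("pg19_llama2/test.bin", vocab_size=32_000, seq_len=32_768)` would return a two-dimensional read-only view with one row per 32,768-token window. Opening a file with the wrong dtype does not raise an error; it silently yields meaningless IDs, so the dtype should always match the tokenizer listed in the Contents table.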
Provided by:
ZengXiangyu