ZengXiangyu/pg19-and-proof-pile
Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ZengXiangyu/pg19-and-proof-pile
Description:
---
license: other
tags:
- evaluation
- long-context
- long-context modeling
- pg19
- proof-pile
- hici
---
# HiCI Evaluation Data
Pre-tokenized binary evaluation splits used in the [HiCI](https://arxiv.org/abs/2603.20843) paper
(Hierarchical Construction-Integration for long-context LLMs).
## Contents
| Path | Description |
|------|-------------|
| `pg19_llama2/test.bin` | PG19 test set, Llama-2 tokenizer (uint16) |
| `pg19_llama2/validation.bin` | PG19 validation set, Llama-2 tokenizer (uint16) |
| `pg19_llama3/test.bin` | PG19 test set, Llama-3 tokenizer (uint32) |
| `pg19_llama3/validation.bin` | PG19 validation set, Llama-3 tokenizer (uint32) |
| `pg19_qwen3/test.bin` | PG19 test set, Qwen3 tokenizer (uint32) |
| `pg19_qwen3/validation.bin` | PG19 validation set, Qwen3 tokenizer (uint32) |
| `pg19_raw/test.txt` | PG19 test set, raw text |
| `pg19_raw/validation.txt` | PG19 validation set, raw text |
| `proof-pile_llama2/test_sampled_data.bin` | Proof-pile 128-doc sampled test set, Llama-2 tokenizer (uint16) |
| `proof-pile_llama3/test_sampled_data.bin` | Proof-pile 128-doc sampled test set, Llama-3 tokenizer (uint32) |
| `proof-pile_qwen3/test_sampled_data.bin` | Proof-pile 128-doc sampled test set, Qwen3 tokenizer (uint32) |
## Format
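`.bin` files are raw token ID arrays that can be memory-mapped directly; they are compatible with the evaluation scripts in the HiCI repo. The element type depends on the tokenizer:
- Llama-2 tokenized files: `uint16` (vocab size 32,000)
- Llama-3 / Qwen3 tokenized files: `uint32` (vocab size exceeds 65,535, the largest value `uint16` can hold)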
```python
import numpy as np

# Llama-2 files store 16-bit token IDs.
llama2_tokens = np.memmap("pg19_llama2/test.bin", dtype=np.uint16, mode="r")
# Llama-3 and Qwen3 files store 32-bit token IDs.
qwen3_tokens = np.memmap("pg19_qwen3/test.bin", dtype=np.uint32, mode="r")
```
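Reading a file with the wrong dtype raises no error; it silently yields garbage IDs. The sketch below is one way to guard against that. It assumes the dtype can be inferred from the tokenizer suffix of the directory name, as listed in the Contents table. `load_tokens` and `DTYPE_BY_TOKENIZER` are hypothetical names, not part of the HiCI repo.

```python
from pathlib import Path

import numpy as np

# Assumption: the directory suffix identifies the tokenizer, and therefore
# the dtype, exactly as in the Contents table of this card.
DTYPE_BY_TOKENIZER = {
    "llama2": np.uint16,
    "llama3": np.uint32,
    "qwen3": np.uint32,
}


def load_tokens(path):
    """Memory-map a .bin file, choosing the dtype from its parent directory."""
    path = Path(path)
    # "pg19_llama2" -> "llama2", "proof-pile_qwen3" -> "qwen3"
    tokenizer = path.parent.name.rsplit("_", 1)[-1]
    if tokenizer not in DTYPE_BY_TOKENIZER:
        raise ValueError(f"unknown tokenizer suffix: {tokenizer!r}")
    dtype = np.dtype(DTYPE_BY_TOKENIZER[tokenizer])
    # A size that is not a whole number of elements signals a wrong dtype
    # or a truncated download.
    if path.stat().st_size % dtype.itemsize != 0:
        raise ValueError(f"{path} is not a multiple of {dtype.itemsize} bytes")
    return np.memmap(path, dtype=dtype, mode="r")
```

For example, `load_tokens("proof-pile_llama3/test_sampled_data.bin")` would return a read-only `uint32` view of the file without loading it into memory.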
## Usage
Download a single file:
```bash
huggingface-cli download ZengXiangyu/pg19-and-proof-pile proof-pile_llama2/test_sampled_data.bin --repo-type dataset
```
Or the full dataset:
```bash
huggingface-cli download ZengXiangyu/pg19-and-proof-pile --repo-type dataset --local-dir ./data
```
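The same single-file download can also be done from Python. This is a minimal sketch that assumes the `huggingface_hub` package is installed; `download_file` is a hypothetical helper name, and `./data` simply mirrors the `--local-dir` used above.

```python
REPO_ID = "ZengXiangyu/pg19-and-proof-pile"


def download_file(filename, local_dir="./data"):
    """Fetch one file from the dataset repo and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=filename,
        repo_type="dataset",  # this repo is a dataset, not a model
        local_dir=local_dir,
    )


# Example call (requires network access):
# path = download_file("proof-pile_llama2/test_sampled_data.bin")
```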
## Proof-pile Sampling
`proof-pile_llama2/test_sampled_data.bin` is identical to the file released by
[LongLoRA](https://github.com/dvlab-research/LongLoRA): 128 documents randomly sampled from the
proof-pile test split, each with at least 32,768 tokens, tokenized with the Llama-2 tokenizer.
`proof-pile_llama3` and `proof-pile_qwen3` contain the **same 128 documents** re-tokenized with
their respective tokenizers, enabling fair cross-model comparison.
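Long-context evaluation is typically run at a fixed context length, so a flat token array has to be cut into equal-sized windows first. The sketch below shows one simple way to do that. It is not the HiCI or LongLoRA evaluation script, `iter_windows` is a hypothetical helper, and it treats the file as one continuous token stream because this card does not state how document boundaries are stored.

```python
import numpy as np


def iter_windows(tokens, seq_len, stride=None):
    """Yield consecutive windows of exactly seq_len tokens from a 1-D array.

    stride defaults to seq_len (non-overlapping windows). A trailing
    remainder shorter than seq_len is dropped.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    stride = seq_len if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    for start in range(0, len(tokens) - seq_len + 1, stride):
        # Slicing a memmap returns a view, so nothing is copied here.
        yield tokens[start:start + seq_len]
```

With a memory-mapped array such as `np.memmap("proof-pile_llama2/test_sampled_data.bin", dtype=np.uint16, mode="r")`, calling `iter_windows(tokens, 32768)` would yield one 32,768-token slice at a time.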
## Source
- PG19: [deepmind/pg19](https://huggingface.co/datasets/deepmind/pg19)
- Proof-pile: [EleutherAI/proof-pile](https://huggingface.co/datasets/EleutherAI/proof-pile)
- Proof-pile Llama-2 tokenized (original): [LongLoRA](https://github.com/dvlab-research/LongLoRA)
Provided by:
ZengXiangyu



