ire-mrn/pegasusgpt-tokenized-gpt2-v1
Hugging Face, updated 2026-04-16, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ire-mrn/pegasusgpt-tokenized-gpt2-v1
Description:
---
dataset_info:
  features:
  - name: input_ids
    list: int32
  splits:
  - name: train
    num_bytes: 12951344784
    num_examples: 4390022
  - name: validation
    num_bytes: 973658000
    num_examples: 330432
  download_size: 13933371999
  dataset_size: 13925002784
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---
# PegasusGPT Tokenized Dataset (GPT-2)
## Overview
This dataset contains **pre-tokenized training and validation data** for PegasusGPT.
It is designed for fast experimentation, Optuna hyperparameter tuning, and GPU-efficient training (no preprocessing required). Each example already contains tokenized inputs (`input_ids`).
---
## Dataset Structure
**Splits:** `train`, `validation`
Each example:
```json
{
"input_ids": [15496, 11, 995, ...]
}
```
---
## Tokenization Details
| Property | Value |
|---|---|
| Tokenizer | GPT-2 |
| Token column | `input_ids` |
| Vocabulary size | `50257` |
| EOS token ID | `50256` |
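
The values in the table can be turned into a quick sanity check on any example. The following is a minimal sketch, not part of the PegasusGPT codebase: the helper names `check_input_ids` and `count_eos` are hypothetical, and whether documents in this dataset actually contain EOS tokens is not stated here, so the count is only something you may want to inspect.

```python
from typing import Iterable

VOCAB_SIZE = 50257    # vocabulary size from the table above
EOS_TOKEN_ID = 50256  # EOS token ID from the table above


def check_input_ids(input_ids: Iterable[int], vocab_size: int = VOCAB_SIZE) -> bool:
    """Return True if every token ID lies in the range [0, vocab_size)."""
    return all(0 <= token_id < vocab_size for token_id in input_ids)


def count_eos(input_ids: Iterable[int], eos_token_id: int = EOS_TOKEN_ID) -> int:
    """Count how many times the EOS token ID occurs in a sequence."""
    return sum(1 for token_id in input_ids if token_id == eos_token_id)


if __name__ == "__main__":
    example = {"input_ids": [15496, 11, 995]}  # IDs from the example record above
    print(check_input_ids(example["input_ids"]))
    print(count_eos(example["input_ids"]))
```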
---
## Dataset Type
Tokenized documents (not windowed). Sequence windowing is applied during training.
---
## Dataset Size
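Full dataset: 4,390,022 training examples (about 12.95 GB) and 330,432 validation examples (about 0.97 GB), for a total of about 13.93 GB on disk and a download of similar size.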
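
The split sizes in the card metadata allow a rough upper bound on the token count. The arithmetic below assumes only that each `int32` token occupies 4 bytes; the reported `num_bytes` may also include storage overhead, so the true counts are likely somewhat lower. The helper name `max_tokens` is hypothetical.

```python
# Figures copied from the dataset card metadata.
TRAIN_BYTES = 12_951_344_784
TRAIN_EXAMPLES = 4_390_022
VALIDATION_BYTES = 973_658_000
VALIDATION_EXAMPLES = 330_432

BYTES_PER_TOKEN = 4  # int32; assumption: no other per-token storage cost


def max_tokens(num_bytes: int) -> int:
    """Upper bound on the number of int32 tokens that fit in `num_bytes`."""
    return num_bytes // BYTES_PER_TOKEN


train_tokens = max_tokens(TRAIN_BYTES)
validation_tokens = max_tokens(VALIDATION_BYTES)

# Upper bounds on the mean document length, in tokens.
train_mean = train_tokens / TRAIN_EXAMPLES
validation_mean = validation_tokens / VALIDATION_EXAMPLES

if __name__ == "__main__":
    print(f"train: at most {train_tokens:,} tokens, mean <= {train_mean:.1f}")
    print(f"validation: at most {validation_tokens:,} tokens, mean <= {validation_mean:.1f}")
```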
---
## Data Pipeline
Generated using the PegasusGPT preprocessing pipeline:
```bash
scripts/run_download_dataset.sh
scripts/run_remove_test_split.sh
scripts/run_filtering.sh
scripts/run_set_splitting.sh
scripts/run_tokenization_train_and_validation.sh
```
---
## Usage
```python
from datasets import load_dataset
ds = load_dataset("ire-mrn/pegasusgpt-tokenized-gpt2-v1")
train = ds["train"]
validation = ds["validation"]
```
> ⚠️ Replace the dataset name above if your Hugging Face repo slug differs.
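
Since the full download is roughly 13.9 GB, it can be convenient to peek at a few examples without fetching everything. The sketch below uses the `streaming=True` option of `load_dataset`; the helper names `open_stream` and `take` are hypothetical, and the commented-out call at the bottom needs network access and the `datasets` package.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "ire-mrn/pegasusgpt-tokenized-gpt2-v1"


def open_stream(split: str = "validation"):
    """Open one split as a streaming (iterable) dataset without a full download."""
    # Imported lazily so that `take` below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


def take(examples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the first `n` examples from any iterable of examples."""
    return list(islice(examples, n))


# Example (requires network access):
# first_two = take(open_stream("validation"), 2)
# print([len(example["input_ids"]) for example in first_two])
```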
---
## Notes
- No tokenization or other preprocessing required before training (sequence windowing happens at training time)
- Compatible with PegasusGPT (`token_column = input_ids`)
- All samples use the same tokenizer configuration
---
## Recommended Use
- Optuna hyperparameter tuning
- Rapid experimentation
- GPU training without preprocessing overhead
Provider:
ire-mrn



