
ire-mrn/pegasusgpt-tokenized-gpt2-v1

Source: Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/ire-mrn/pegasusgpt-tokenized-gpt2-v1
Description:
---
dataset_info:
  features:
  - name: input_ids
    list: int32
  splits:
  - name: train
    num_bytes: 12951344784
    num_examples: 4390022
  - name: validation
    num_bytes: 973658000
    num_examples: 330432
  download_size: 13933371999
  dataset_size: 13925002784
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# PegasusGPT Tokenized Dataset (GPT-2)

## Overview

This dataset contains **pre-tokenized training and validation data** for PegasusGPT.

It is designed for fast experimentation, Optuna hyperparameter tuning, and GPU-efficient training (no preprocessing required).

Each example already contains tokenized inputs (`input_ids`).

---

## Dataset Structure

**Splits:** `train`, `validation`

Each example:

```json
{ "input_ids": [15496, 11, 995, ...] }
```

---

## Tokenization Details

| Property | Value |
|---|---|
| Tokenizer | GPT-2 |
| Token column | `input_ids` |
| Vocabulary size | `50257` |
| EOS token ID | `50256` |

---

## Dataset Type

Tokenized documents (not windowed). Sequence windowing is applied during training.

---

## Dataset Size

Full dataset.

---

## Data Pipeline

Generated using the PegasusGPT preprocessing pipeline:

```bash
scripts/run_download_dataset.sh
scripts/run_remove_test_split.sh
scripts/run_filtering.sh
scripts/run_set_splitting.sh
scripts/run_tokenization_train_and_validation.sh
```

---

## Usage

```python
from datasets import load_dataset

ds = load_dataset("ire-mrn/pegasusgpt-tokenized-gpt2-v1")
train = ds["train"]
validation = ds["validation"]
```

> ⚠️ Replace the dataset name above if your Hugging Face repo slug differs.

---

## Notes

- No preprocessing required before training
- Compatible with PegasusGPT (`token_column = input_ids`)
- All samples use the same tokenizer configuration

---

## Recommended Use

- Optuna hyperparameter tuning
- Rapid experimentation
- GPU training without preprocessing overhead
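
---

## Example: Windowing at Training Time

The card states that documents are stored unwindowed and that sequence windowing is applied during training, but it does not describe how PegasusGPT does this. The sketch below shows one common approach under stated assumptions: fixed-length windows with an optional overlap, dropping any trailing remainder shorter than the block size. The function names `iter_windows` and `to_inputs_and_targets`, the `block_size` and `stride` parameters, and the toy values are all hypothetical and are not taken from the PegasusGPT code.

```python
from typing import Iterator, List, Optional, Tuple


def iter_windows(
    input_ids: List[int],
    block_size: int,
    stride: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield fixed-length windows from one tokenized document.

    Assumption: a trailing remainder shorter than block_size is dropped.
    The actual PegasusGPT training code may pad or pack documents instead.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    step = block_size if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    # The last valid start index is len(input_ids) - block_size.
    for start in range(0, len(input_ids) - block_size + 1, step):
        yield input_ids[start:start + block_size]


def to_inputs_and_targets(window: List[int]) -> Tuple[List[int], List[int]]:
    """Split a window into next-token-prediction inputs and targets.

    Targets are the inputs shifted left by one position, which is the
    usual setup for causal language modeling.
    """
    return window[:-1], window[1:]


if __name__ == "__main__":
    # Toy document standing in for one example's `input_ids` column.
    doc = list(range(10))
    for window in iter_windows(doc, block_size=4, stride=2):
        inputs, targets = to_inputs_and_targets(window)
        print(inputs, targets)
```

In practice, `iter_windows` would be applied to each example's `input_ids` list after loading the dataset as shown in the Usage section. A stride smaller than the block size produces overlapping windows; leaving `stride` unset produces non-overlapping ones.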
Provider:
ire-mrn