hemantvirmani/gpt-training-dataset
Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/hemantvirmani/gpt-training-dataset
Description:
---
pretty_name: GPT Training Dataset (WikiText + OpenWebText)
language:
- en
license: mit
task_categories:
- text-generation
task_ids:
- language-modeling
size_categories:
- 1GB<n<10GB
source_datasets:
- wikitext
- openwebtext
tags:
- gpt
- language-model
- text-generation
- pretraining
- nlp
configs:
- config_name: default
  data_files:
  - split: train
    path: "dataset.txt"
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1062767992 # 80%
    num_examples: 1500206
  - name: validation
    num_bytes: 265691998 # 20%
    num_examples: 375052
---
# 📚 GPT Training Dataset (WikiText + OpenWebText Mix)
## Overview
This dataset is a cleaned and curated text corpus designed for training small to mid-sized GPT-style language models.
It combines:
- WikiText-103 (high-quality structured text)
- OpenWebText (real-world web text, sampled)
The goal is to provide a **balanced dataset** that:
- trains quickly
- produces coherent text
- avoids excessive noise from large web corpora
---
## Dataset Composition
The final corpus is a mixture of high-quality encyclopedic text and diverse web content, designed to balance factual density with natural conversational flow.
| Source | Proportion | Description |
| :--- | :--- | :--- |
| **WikiText-103** | ~75% | High-quality, verified articles from Wikipedia. Provides structured knowledge. |
| **OpenWebText** | ~25% | Sampled web content filtered for quality. Provides stylistic variety. |
### Data Splits
The dataset is provided as a single `dataset.txt` file, intended to be split as follows:
- **Training (80%):** Primary data for model weight updates.
- **Validation (20%):** Used for calculating perplexity and monitoring over-fitting during training.
**Total size:** ~1.33 GB (uncompressed text).
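The 80/20 split described above is left to the user. As a minimal sketch, assuming the cut is made at document boundaries and that the file's existing shuffle makes a contiguous cut acceptable, it could look like this. The name `split_documents` and the `train_fraction` parameter are illustrative and not part of the dataset's tooling.

```python
SEPARATOR = "<|endoftext|>"

def split_documents(text, train_fraction=0.8):
    """Split a separator-delimited corpus into (train, validation) document lists."""
    # Drop empty fragments, e.g. the one left after a trailing separator.
    docs = [d.strip() for d in text.split(SEPARATOR) if d.strip()]
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

sample = (
    "doc one<|endoftext|>doc two<|endoftext|>doc three<|endoftext|>"
    "doc four<|endoftext|>doc five<|endoftext|>"
)
train_docs, val_docs = split_documents(sample)
print(len(train_docs), len(val_docs))  # 4 1
```

For the real corpus, `text` would come from reading `dataset.txt` with `encoding="utf-8"`; at about 1.33 GB this should fit in memory on many machines, though a streaming split may be preferable on smaller ones.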
---
## Technical Specifications
- **Raw Text**: `dataset.txt` (1.33 GB)
- **Tokenized Data (binary)**: `dataset.bin` (574 MB), contiguous token IDs for training
- **Tokenized Data (NumPy)**: `tokens.npy` (574 MB), the same tokens as `dataset.bin`, stored as a NumPy array
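Because `dataset.bin` stores one `uint16` (2 bytes) per token, the listed file size gives a rough token count. The sketch below assumes "MB" means 10^6 bytes, which the card does not state; the helper name `approx_token_count` is illustrative.

```python
BYTES_PER_TOKEN = 2  # one uint16 per token

def approx_token_count(file_size_bytes, bytes_per_token=BYTES_PER_TOKEN):
    """Estimate how many tokens a flat binary token file holds."""
    return file_size_bytes // bytes_per_token

tokens = approx_token_count(574 * 10**6)
print(f"{tokens:,} tokens")  # 287,000,000 tokens
# Rough compression ratio of the tokenizer on this corpus:
print(f"{1.33e9 / tokens:.1f} bytes of raw text per token")
```

If the sizes are instead reported in MiB, the estimate would be about 5% higher.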
---
## Preprocessing Pipeline
The dataset was generated using a custom script with the following steps:
### 1. Cleaning
- Removed section headers (e.g., `== Title ==`)
- Normalized spacing and punctuation artifacts
- Stripped malformed tokens
### 2. Filtering
- Removed very short or low-quality lines
- Ensured minimum text length and structure
### 3. Deduplication
- Removed duplicate entries (keeping one copy per unique sample)
### 4. Shuffling
- Randomized dataset order for better training distribution
### 5. Document Separation
- Each sample is separated using:
```
<|endoftext|>
```
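The five steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the author's `prepare_dataset.py`: the header regex, the punctuation rule, the `min_chars` threshold, and the function names are all hypothetical choices.

```python
import random
import re

SEPARATOR = "<|endoftext|>"
# WikiText-style section headers such as "== Title ==" or " = Title = ".
HEADER_RE = re.compile(r"^\s*=+\s.*\s=+\s*$")

def clean(doc):
    """Step 1: drop header lines and normalize spacing around punctuation."""
    lines = [ln for ln in doc.splitlines() if not HEADER_RE.match(ln)]
    text = re.sub(r"\s+", " ", " ".join(lines))   # collapse runs of whitespace
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # remove space before punctuation
    return text.strip()

def build_corpus(docs, min_chars=20, seed=0):
    """Steps 1-5: clean, filter, deduplicate, shuffle, and join documents."""
    cleaned = [clean(d) for d in docs]
    kept = [d for d in cleaned if len(d) >= min_chars]  # step 2, hypothetical threshold
    unique = list(dict.fromkeys(kept))                  # step 3, keeps first occurrence
    random.Random(seed).shuffle(unique)                 # step 4, seeded for repeatability
    return "".join(f"{d}\n{SEPARATOR}\n" for d in unique)  # step 5

raw = [
    "== History ==\nThe city was founded in 1200 .",
    "The city was founded in 1200 .",  # duplicate once cleaned
    "short",                           # removed by the length filter
]
print(build_corpus(raw))
```

Deduplicating after cleaning, rather than before, catches documents that differ only in headers or spacing.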
---
## Files
### `dataset.txt`
- Cleaned text dataset
- Documents are delimited by the `<|endoftext|>` separator
- Recommended for:
  - custom tokenization
  - experimentation
---
### `dataset.bin`
- Pre-tokenized binary file
- Format: `uint16`
- Tokenizer: GPT-2 (`tiktoken`)
Load example:
```python
import numpy as np
data = np.memmap("dataset.bin", dtype=np.uint16, mode="r")
```
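`uint16` is sufficient here because the GPT-2 vocabulary has 50,257 entries, fewer than the 65,536 values a 16-bit unsigned integer can represent. The sketch below illustrates the flat file format by writing a few made-up token IDs to a temporary file and reading them back; `write_tokens` and `read_tokens` are hypothetical helpers, not the author's export code.

```python
import os
import tempfile
import numpy as np

GPT2_VOCAB_SIZE = 50257  # valid token IDs are 0..50256

def write_tokens(path, token_ids):
    """Write token IDs as a flat uint16 file (no header)."""
    arr = np.asarray(token_ids)
    if arr.min() < 0 or arr.max() >= GPT2_VOCAB_SIZE:
        raise ValueError("token ID outside the GPT-2 vocabulary")
    arr.astype(np.uint16).tofile(path)

def read_tokens(path):
    """Memory-map a flat uint16 token file without loading it into RAM."""
    return np.memmap(path, dtype=np.uint16, mode="r")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    write_tokens(path, [15496, 995, 50256])  # arbitrary example IDs
    tokens = read_tokens(path)
    print(tokens.tolist())  # [15496, 995, 50256]
    del tokens  # release the memory map before the directory is removed
```

Since the file has no header, the reader must already know the dtype; opening it with the wrong dtype silently yields wrong IDs.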
---
### `tokens.npy`
- Same tokenized data in NumPy format
- Useful for debugging and inspection
Load example:
```python
import numpy as np
tokens = np.load("tokens.npy", mmap_mode="r")
```
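Since `tokens.npy` is described as holding the same tokens as `dataset.bin`, a quick consistency check can be useful after downloading both. The following is a minimal sketch; `same_tokens` is a hypothetical helper, and it assumes `tokens.npy` stores a one-dimensional array.

```python
import numpy as np

def same_tokens(bin_path, npy_path, dtype=np.uint16):
    """Return True if a flat binary file and a .npy file hold identical token IDs."""
    flat = np.memmap(bin_path, dtype=dtype, mode="r")
    arr = np.load(npy_path, mmap_mode="r")
    # Compare shapes first so a length mismatch is reported as False.
    return flat.shape == arr.shape and bool(np.array_equal(flat, arr))

# Example call, assuming both files are in the working directory:
# print(same_tokens("dataset.bin", "tokens.npy"))
```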
---
## Tokenization
Pre-tokenization (for `.bin` and `.npy`) uses:
- GPT-2 tokenizer via `tiktoken`
```python
import tiktoken
enc = tiktoken.get_encoding("gpt2")
```
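⚠️ Important:
If you train on `dataset.bin` or `tokens.npy`, you must use the **same tokenizer** (GPT-2) at inference time, because the stored token IDs are only meaningful under that vocabulary.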
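To show how a contiguous token stream of this kind can be built, here is a minimal sketch that encodes documents one by one and appends an end-of-text token after each. Whether the author's script appends the token this way is an assumption. The encoder is passed in as a function so the example runs without extra dependencies; `encode_documents` and `toy_encode` are illustrative names.

```python
def encode_documents(docs, encode, eot_token):
    """Concatenate encoded documents, appending eot_token after each one."""
    ids = []
    for doc in docs:
        ids.extend(encode(doc))
        ids.append(eot_token)
    return ids

# Stand-in encoder (character codes) so the example is self-contained.
# With tiktoken installed, the GPT-2 encoder would be plugged in instead:
#   enc = tiktoken.get_encoding("gpt2")
#   ids = encode_documents(docs, enc.encode_ordinary, enc.eot_token)
toy_encode = lambda text: [ord(ch) for ch in text]
print(encode_documents(["ab", "c"], toy_encode, eot_token=50256))
# [97, 98, 50256, 99, 50256]
```

`encode_ordinary` ignores special-token strings in the input text, which is why the end-of-text ID is appended explicitly rather than parsed from the text.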
---
## Usage (PyTorch Example)
```python
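import numpy as np
import torch

# Memory-map the token file, then convert to int64, the dtype expected for
# cross-entropy targets. Note that astype copies the whole file into RAM.
data = np.memmap("dataset.bin", dtype=np.uint16, mode="r")
data = torch.from_numpy(data.astype(np.int64))

block_size = 128  # context length in tokens
batch_size = 32

def get_batch():
    # Sample random start offsets, leaving room for the shifted target window.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Targets are the inputs shifted one token to the right.
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y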
```
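The example above converts the whole token file to `int64`, which takes roughly four times the file size in memory. A lighter variant, sketched below with NumPy only, keeps the tokens as `uint16` and converts just the sampled windows; it also applies the 80/20 split at the token level. The function names and the token-level split are assumptions for illustration.

```python
import numpy as np

def split_tokens(tokens, train_fraction=0.8):
    """Return (train, validation) views of a 1-D token array."""
    cut = int(len(tokens) * train_fraction)
    return tokens[:cut], tokens[cut:]

def sample_batch(tokens, block_size, batch_size, rng):
    """Sample input windows and their next-token targets as int64 arrays."""
    # The upper bound is exclusive, so each start leaves block_size + 1 tokens.
    starts = rng.integers(0, len(tokens) - block_size, size=batch_size)
    x = np.stack([tokens[s:s + block_size] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1:s + block_size + 1] for s in starts]).astype(np.int64)
    return x, y

# Synthetic tokens stand in for np.memmap("dataset.bin", dtype=np.uint16, mode="r").
tokens = np.arange(1000, dtype=np.uint16)
train, val = split_tokens(tokens)
x, y = sample_batch(train, block_size=8, batch_size=4, rng=np.random.default_rng(0))
print(x.shape, y.shape)  # (4, 8) (4, 8)
```

In a PyTorch training loop, each batch can be wrapped with `torch.from_numpy(x)` and `torch.from_numpy(y)` before being moved to the device.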
---
## Intended Use
This dataset is ideal for:
- training GPT-style models from scratch
- experimentation with small architectures
- educational purposes and learning pipelines
---
## Limitations
- Not suitable for large-scale production LLM training
- Limited domain diversity compared to massive corpora
- OpenWebText portion may still contain minor noise
---
## Reproducibility
The dataset can be regenerated using the provided script:
```bash
python prepare_dataset.py
```
---
## Acknowledgements
- WikiText-103
- OpenWebText
---
## License
The card metadata declares an MIT license for this dataset. For the underlying sources, please refer to the original dataset licenses for:
- WikiText-103
- OpenWebText
---
## Author
Hemant Virmani created this as part of a GPT training pipeline experiment.
Provider:
hemantvirmani



