
hemantvirmani/gpt-training-dataset

Hugging Face · Updated 2026-03-17 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/hemantvirmani/gpt-training-dataset
Resource description:
---
pretty_name: GPT Training Dataset (WikiText + OpenWebText)
language:
- en
license: mit
task_categories:
- text-generation
task_ids:
- language-modeling
size_categories:
- 1GB<n<10GB
source_datasets:
- wikitext
- openwebtext
tags:
- gpt
- language-model
- text-generation
- pretraining
- nlp
configs:
- config_name: default
  data_files:
  - split: train
    path: "dataset.txt"
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1062767992  # 80%
    num_examples: 1500206
  - name: validation
    num_bytes: 265691998  # 20%
    num_examples: 375052
---

# 📚 GPT Training Dataset (WikiText + OpenWebText Mix)

## Overview

This dataset is a cleaned and curated text corpus designed for training small to mid-sized GPT-style language models.

It combines:

- WikiText-103 (high-quality structured text)
- OpenWebText (real-world web text, sampled)

The goal is to provide a **balanced dataset** that:

- trains quickly
- produces coherent text
- avoids excessive noise from large web corpora

---

## Dataset Composition

The final corpus is a mixture of high-quality encyclopedic text and diverse web content, designed to balance factual density with natural conversational flow.

| Source | Proportion | Description |
| :--- | :--- | :--- |
| **WikiText-103** | ~75% | High-quality, verified articles from Wikipedia. Provides structured knowledge. |
| **OpenWebText** | ~25% | Sampled web content filtered for quality. Provides stylistic variety. |

### Data Splits

The dataset is provided as a single `dataset.txt` file, intended to be split as follows:

- **Training (80%):** Primary data for model weight updates.
- **Validation (20%):** Used for calculating perplexity and monitoring over-fitting during training.

**Total size:** ~1.33 GB (uncompressed text).
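
The 80/20 split described above could be reproduced with a sketch like the following. It assumes the samples are already in a list and that a single contiguous cut is acceptable. The helper name `split_train_val` and the toy documents are hypothetical, since the card states the ratio but not how the cut is made.

```python
def split_train_val(samples, train_fraction=0.8):
    """Cut a sequence into contiguous train and validation parts.

    Hypothetical helper: the dataset card only states the 80/20 ratio,
    not the procedure used to produce the split.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# Toy stand-in for the real corpus (illustrative data only).
docs = [f"document {i}" for i in range(10)]
train_docs, val_docs = split_train_val(docs)
print(len(train_docs), len(val_docs))
```

Applied to the 1,875,258 examples listed in the metadata (1,500,206 + 375,052), the same truncating cut gives 1,500,206 training examples, which agrees with the counts in the card.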
---

## Technical Specifications

- **Raw Text**: `dataset.txt` (1.33 GB)
- **Tokenized Data (binary)**: `dataset.bin` (574 MB), contiguous token IDs for training
- **Tokenized Data (NumPy)**: `tokens.npy` (574 MB), the same tokens as `dataset.bin`, stored as a NumPy array

---

## Preprocessing Pipeline

The dataset was generated using a custom script with the following steps:

### 1. Cleaning

- Removed section headers (e.g., `== Title ==`)
- Normalized spacing and punctuation artifacts
- Stripped malformed tokens

### 2. Filtering

- Removed very short or low-quality lines
- Ensured minimum text length and structure

### 3. Deduplication

- Removed duplicate entries (keeping one copy per unique sample)

### 4. Shuffling

- Randomized dataset order for better training distribution

### 5. Document Separation

- Each sample is separated using:

```
<|endoftext|>
```

---

## Files

### `dataset.txt`

- Cleaned text dataset
- One document per `<|endoftext|>` separator
- Recommended for:
  - custom tokenization
  - experimentation

---

### `dataset.bin`

- Pre-tokenized binary file
- Format: `uint16`
- Tokenizer: GPT-2 (`tiktoken`)

Load example:

```python
import numpy as np

data = np.memmap("dataset.bin", dtype=np.uint16, mode="r")
```

---

### `tokens.npy`

- Same tokenized data in NumPy format
- Useful for debugging and inspection

Load example:

```python
import numpy as np

tokens = np.load("tokens.npy", mmap_mode="r")
```

---

## Tokenization

Pre-tokenization (for `.bin` and `.npy`) uses:

- GPT-2 tokenizer via `tiktoken`

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
```

⚠️ Important: If using `dataset.bin`, you must use the **same tokenizer** for inference.
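
For readers who prefer to work from `dataset.txt` with their own tokenizer, the `<|endoftext|>` separator described above can be used to recover individual documents. The following is a minimal sketch that reads the file in chunks rather than all at once. The function name `iter_documents`, the chunk size, and the whitespace stripping are assumptions, since the card does not say whether the separator sits on its own line.

```python
import io

SEPARATOR = "<|endoftext|>"


def iter_documents(stream, chunk_size=1 << 20):
    """Yield documents from a text stream delimited by SEPARATOR.

    Hypothetical helper: reads fixed-size chunks so a multi-gigabyte
    file is never held in memory in one piece.
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Parts before the last separator are complete documents;
        # the final part may be incomplete, so carry it to the next chunk.
        *complete, buffer = buffer.split(SEPARATOR)
        for doc in complete:
            doc = doc.strip()
            if doc:
                yield doc
    # Flush any text that follows the final separator.
    tail = buffer.strip()
    if tail:
        yield tail


# Toy input standing in for dataset.txt (illustrative text only).
sample = io.StringIO(
    "First document.\n<|endoftext|>\nSecond document.\n<|endoftext|>\n"
)
documents = list(iter_documents(sample, chunk_size=8))
print(documents)
```

With the real file, the same generator could be driven by `open("dataset.txt", encoding="utf-8")`. The small `chunk_size` in the toy call is there only to show that a separator straddling two chunks is still handled.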
---

## Usage (PyTorch Example)

```python
import numpy as np
import torch

# Memory-map the token file so the full corpus is not loaded into RAM.
data = np.memmap("dataset.bin", dtype=np.uint16, mode="r")

block_size = 128  # tokens per training sequence
batch_size = 32   # sequences per batch

def get_batch():
    # Random start offsets; the upper bound leaves room for the shifted targets.
    ix = torch.randint(len(data) - block_size, (batch_size,)).tolist()
    # Cast only the sampled windows to int64, the index dtype embedding layers expect.
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    # Targets are the inputs shifted one token to the right.
    y = torch.stack([torch.from_numpy(data[i + 1:i + block_size + 1].astype(np.int64)) for i in ix])
    return x, y
```

---

## Intended Use

This dataset is ideal for:

- training GPT-style models from scratch
- experimentation with small architectures
- educational purposes and learning pipelines

---

## Limitations

- Not suitable for large-scale production LLM training
- Limited domain diversity compared to massive corpora
- OpenWebText portion may still contain minor noise

---

## Reproducibility

The dataset can be regenerated using the provided script:

```bash
python prepare_dataset.py
```

---

## Acknowledgements

- WikiText-103
- OpenWebText

---

## License

Please refer to the original dataset licenses for:

- WikiText-103
- OpenWebText

---

## Author

Hemant Virmani created this as part of a GPT training pipeline experiment.
Provider:
hemantvirmani