cfahlgren1/tinystories-gpt4-clean
Source: Hugging Face. Updated 2026-03-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cfahlgren1/tinystories-gpt4-clean
Resource description:
---
license: cdla-sharing-1.0
---
# TinyStories GPT-4 Clean
A cleaned subset of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (Eldan & Li, 2023), keeping only GPT-4-generated stories. Adapted from [this thread](https://huggingface.co/datasets/roneneldan/TinyStories/discussions/15) that pointed out many issues with the original data and proposed a cleaning process.
## Overview
This cleaned dataset contains:
| Stat | Value |
|------|-------|
| Stories | 2,732,634 |
| Total characters | ~2.19B |
| Min doc length | 115 chars |
| Max doc length | 4,433 chars |
| Median doc length | 721 chars |
| Unique characters | 74 (ASCII only) |
| Duplicates | None |
| Download size | ~673MB |
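Statistics of this kind can be recomputed from the `text` column. The following is a minimal sketch, not the author's script: `corpus_stats` is a hypothetical helper, and it assumes the stories fit in memory as a list of strings (for example `ds["text"]` after loading the dataset).

```python
from statistics import median


def corpus_stats(texts):
    """Summarize an iterable of story strings (hypothetical helper)."""
    texts = list(texts)
    lengths = [len(t) for t in texts]
    charset = set()
    for t in texts:
        charset.update(t)  # collect every distinct character
    return {
        "stories": len(texts),
        "total_chars": sum(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "median_len": median(lengths),
        "unique_chars": len(charset),
        # exact-match duplicates: rows minus distinct rows
        "duplicates": len(texts) - len(set(texts)),
    }


# Tiny in-memory sample; the third story is an exact duplicate of the first.
sample = ["One day, Tom ran.", "Lily saw a cat!", "One day, Tom ran."]
stats = corpus_stats(sample)
```

On the full dataset the same function would be called on the whole `text` column; the table values above come from the original card, not from this sketch.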
### Suggested splits (by row index, data is pre-shuffled)
Suggested usage is as follows:
```python
from datasets import load_dataset
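# Repo id as given in the original card; this page lists the dataset as cfahlgren1/tinystories-gpt4-clean.
ds = load_dataset("karpathy/tinystories-gpt4-clean", split="train")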
# Suggested default splits (data is pre-shuffled):
# rows 0..9,999 -> test (10K stories)
# rows 10,000..19,999 -> val (10K stories)
# rows 20,000..end -> train (2,712,634 stories)
test = ds.select(range(0, 10_000))
val = ds.select(range(10_000, 20_000))
train = ds.select(range(20_000, len(ds)))
```
| Split | Rows | Stories | Characters |
|-------|------|---------|------------|
| Test | 0..9,999 | 10,000 | 8,076,477 |
| Val | 10,000..19,999 | 10,000 | 8,026,787 |
| Train | 20,000..end | 2,712,634 | 2,175,177,929 |
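The row ranges in the table follow directly from the total story count. As a quick sanity check, here is a small sketch that derives the three index ranges without loading any data; `split_ranges` is a hypothetical helper name, and the 10,000-row test and validation sizes are the ones suggested above.

```python
def split_ranges(n_rows, n_test=10_000, n_val=10_000):
    """Return index ranges for the suggested test/val/train splits."""
    if n_rows < n_test + n_val:
        raise ValueError("dataset is smaller than the test and val splits combined")
    return {
        "test": range(0, n_test),
        "val": range(n_test, n_test + n_val),
        "train": range(n_test + n_val, n_rows),
    }


ranges = split_ranges(2_732_634)  # total stories reported in the overview
sizes = {name: len(r) for name, r in ranges.items()}
```

Each range can be passed to `ds.select(...)` in the same way as the literal ranges in the snippet above.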
## Cleaning pipeline
The raw TinyStories dataset contains ~5M stories from both GPT-3.5 and GPT-4. We filter to GPT-4 only (2,745,330 stories) and then apply the following cleaning steps:
1. **Unicode normalization**: curly quotes to straight quotes, em/en dashes to hyphens, ellipsis character to `...`, stray backslashes removed, double spaces collapsed.
2. **Non-ASCII rejection**: stories with any character outside printable ASCII (codes 32-126) are discarded. Newlines (code 10) are allowed as paragraph separators.
3. **Banned character rejection**: stories containing any of `` |<>/`*=_&@~#%[]+() `` are discarded. These almost always indicate formatting artifacts, HTML tags, chat templates, or code contamination.
4. **Minimum length**: stories under 100 characters are discarded (fragments, empty entries).
5. **Ending punctuation**: stories must end with `.` `!` `"` or `?` to ensure completeness.
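The five steps above can be sketched as a small filter. This is an illustrative reconstruction, not the contents of `clean.py`: the function names, the exact replacement table, and the order in which rejection reasons are checked are assumptions. It treats codes 32 to 126 as printable (code 127, DEL, is a control character) and reads the banned list as not including the backslash, since step 1 already removes backslashes.

```python
import re

# Assumed replacement table for step 1; the set handled by clean.py may differ.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2013": "-", "\u2014": "-",  # en dash, em dash
    "\u2026": "...",               # ellipsis character
    "\\": "",                      # stray backslashes
}
BANNED = set("|<>/`*=_&@~#%[]+()")       # step 3 (assumed reading of the list)
VALID_ENDINGS = (".", "!", '"', "?")     # step 5
MIN_LENGTH = 100                         # step 4


def normalize(text):
    """Step 1: map common Unicode punctuation to ASCII and collapse spaces."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return re.sub(r" {2,}", " ", text)


def rejection_reason(text):
    """Steps 2-5: return the first failing check, or None if the story passes."""
    if any(ch != "\n" and not (32 <= ord(ch) <= 126) for ch in text):
        return "non_ascii"
    if any(ch in BANNED for ch in text):
        return "banned_char"
    if len(text) < MIN_LENGTH:
        return "too_short"
    if not text.endswith(VALID_ENDINGS):
        return "bad_ending"
    return None


def clean(stories):
    """Normalize every story and keep only those that pass all checks."""
    normalized = (normalize(s) for s in stories)
    return [t for t in normalized if rejection_reason(t) is None]
```

Returning a reason string rather than a boolean makes it straightforward to tally a per-reason breakdown with `collections.Counter`.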
### Rejection breakdown
| Reason | Count |
|--------|-------|
| Non-ASCII characters | 1,282 |
| Banned characters | 720 |
| Too short (< 100 chars) | 238 |
| Bad ending punctuation | 10,456 |
| **Total rejected** | **12,696** |
Only 0.46% of GPT-4 stories are rejected -- the data is quite clean to begin with.
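The totals and the rejection rate can be re-derived from the counts in the table; the snippet below only repeats that arithmetic using the numbers reported in this card.

```python
gpt4_total = 2_745_330  # GPT-4 stories before cleaning

rejections = {
    "non_ascii": 1_282,
    "banned_chars": 720,
    "too_short": 238,
    "bad_ending": 10_456,
}

total_rejected = sum(rejections.values())
kept = gpt4_total - total_rejected
rate = 100 * total_rejected / gpt4_total

print(total_rejected, kept, f"{rate:.2f}%")  # prints: 12696 2732634 0.46%
```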
## Character inventory
All 74 characters in the dataset (ASCII only):
```
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA\n !"$',-.0123456789:;?
```
No non-ASCII characters, no control characters other than newline, no special symbols beyond the punctuation listed above.
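To confirm that a batch of text stays within this inventory, one option is to build the 74-character set explicitly and report anything outside it. This is a sketch under the assumption that the inventory line above is complete; `unexpected_chars` is a hypothetical helper.

```python
import string

# Letters, digits, newline, space, and the ten punctuation marks listed above.
ALLOWED = set(string.ascii_letters + string.digits + "\n !\"$',-.:;?")


def unexpected_chars(texts):
    """Return the set of characters in `texts` that are not in ALLOWED."""
    seen = set()
    for t in texts:
        seen.update(t)
    return seen - ALLOWED


extra = unexpected_chars(["Hi, Tom!", "a+b"])  # '+' is not in the inventory
```

An empty result for the full `text` column would be consistent with the inventory stated in this card.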
## Format
Single parquet file with one column:
- `text` (string): the cleaned story text
## Source
- Original dataset: [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
- Paper: [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) (Eldan & Li, 2023)
- Cleaning script: `clean.py` in this directory
Provider:
cfahlgren1



