cfahlgren1/tinystories-gpt4-clean

Name: cfahlgren1/tinystories-gpt4-clean
Creator: cfahlgren1
Published: 2026-03-13 10:36:34
License: cdla-sharing-1.0

Source: Hugging Face. Updated 2026-03-13; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/cfahlgren1/tinystories-gpt4-clean

Description:

---
license: cdla-sharing-1.0
---

# TinyStories GPT-4 Clean

A cleaned subset of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (Eldan & Li, 2023), keeping only GPT-4-generated stories. Adapted from [this thread](https://huggingface.co/datasets/roneneldan/TinyStories/discussions/15) that pointed out many issues with the original data and proposed a cleaning process.

## Overview

This cleaned dataset contains:

| Stat | Value |
|------|-------|
| Stories | 2,732,634 |
| Total characters | ~2.19B |
| Min doc length | 115 chars |
| Max doc length | 4,433 chars |
| Median doc length | 721 chars |
| Unique characters | 74 (ASCII only) |
| Duplicates | None |
| Download size | ~673MB |

### Suggested splits (by row index, data is pre-shuffled)

Suggested usage is as follows:

```python
from datasets import load_dataset

ds = load_dataset("karpathy/tinystories-gpt4-clean", split="train")

# Suggested default splits (data is pre-shuffled):
#   rows 0..9,999       -> test  (10K stories)
#   rows 10,000..19,999 -> val   (10K stories)
#   rows 20,000..end    -> train (2,712,634 stories)
test = ds.select(range(0, 10_000))
val = ds.select(range(10_000, 20_000))
train = ds.select(range(20_000, len(ds)))
```

| Split | Rows | Stories | Characters |
|-------|------|---------|------------|
| Test | 0..9,999 | 10,000 | 8,076,477 |
| Val | 10,000..19,999 | 10,000 | 8,026,787 |
| Train | 20,000..end | 2,712,634 | 2,175,177,929 |

## Cleaning pipeline

The raw TinyStories dataset contains ~5M stories from both GPT-3.5 and GPT-4. We filter to GPT-4 only (2,745,330 stories) and then apply the following cleaning steps:

1. **Unicode normalization**: curly quotes to straight quotes, em/en dashes to hyphens, ellipsis character to `...`, stray backslashes removed, double spaces collapsed.
2. **Non-ASCII rejection**: stories with any character outside printable ASCII (codes 32-127) are discarded. Newlines (code 10) are allowed as paragraph separators.
3. **Banned character rejection**: stories containing `|<>/`\`*=_&@~#%[]+()` are discarded. These almost always indicate formatting artifacts, HTML tags, chat templates, or code contamination.
4. **Minimum length**: stories under 100 characters are discarded (fragments, empty entries).
5. **Ending punctuation**: stories must end with `.` `!` `"` or `?` to ensure completeness.

### Rejection breakdown

| Reason | Count |
|--------|-------|
| Non-ASCII characters | 1,282 |
| Banned characters | 720 |
| Too short (< 100 chars) | 238 |
| Bad ending punctuation | 10,456 |
| **Total rejected** | **12,696** |

Only 0.46% of GPT-4 stories are rejected -- the data is quite clean to begin with.

## Character inventory

All 74 characters in the dataset (ASCII only):

```
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA\n !"$',-.0123456789:;?
```

No Unicode, no control characters, no special symbols.

## Format

Single parquet file with one column:

- `text` (string): the cleaned story text

## Source

- Original dataset: [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
- Paper: [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) (Eldan & Li, 2023)
- Cleaning script: `clean.py` in this directory
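
The five cleaning steps listed under "Cleaning pipeline" can be sketched roughly as follows. This is an illustrative reconstruction from the prose description, not the contents of `clean.py`. The names `normalize`, `clean_story`, `BANNED`, and the reason labels are hypothetical. Two assumptions are made: the backslash in the banned-character list is read as a markdown escape for the backtick (backslashes are stripped in step 1 anyway), and leading/trailing whitespace is trimmed before the checks.

```python
import re

# Assumption: banned set as listed in the README, with the backslash read as
# a markdown escape for the backtick rather than a banned character itself.
BANNED = set("|<>/`*=_&@~#%[]+()")
VALID_ENDINGS = (".", "!", '"', "?")

# Step 1 replacements (Unicode normalization).
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2014": "-", "\u2013": "-",  # em dash, en dash
    "\u2026": "...",               # ellipsis character
    "\\": "",                      # stray backslashes are removed
}


def normalize(text: str) -> str:
    """Step 1: map typographic characters to ASCII and collapse spaces."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return re.sub(r" {2,}", " ", text)


def clean_story(raw: str):
    """Return (text, None) if the story is kept, else (None, reason)."""
    # Assumption: surrounding whitespace is trimmed before the checks.
    text = normalize(raw).strip()
    # Step 2: only codes 32-127 (range as stated in the README) and newline.
    if any(not (32 <= ord(c) <= 127 or c == "\n") for c in text):
        return None, "non_ascii"
    # Step 3: formatting or code artifacts.
    if any(c in BANNED for c in text):
        return None, "banned_char"
    # Step 4: fragments and empty entries.
    if len(text) < 100:
        return None, "too_short"
    # Step 5: the story must end on terminal punctuation.
    if not text.endswith(VALID_ENDINGS):
        return None, "bad_ending"
    return text, None
```

Calling `clean_story(raw)` on each raw story and tallying the returned reasons would give a breakdown in the same shape as the rejection table, although the exact counts depend on details of the real script that are not described here.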

Provider:

cfahlgren1
