
cfahlgren1/tinystories-gpt4-clean

Source: Hugging Face · Updated 2026-03-13 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cfahlgren1/tinystories-gpt4-clean
Description:
---
license: cdla-sharing-1.0
---

# TinyStories GPT-4 Clean

A cleaned subset of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (Eldan & Li, 2023), keeping only GPT-4-generated stories. Adapted from [this thread](https://huggingface.co/datasets/roneneldan/TinyStories/discussions/15) that pointed out many issues with the original data and proposed a cleaning process.

## Overview

This cleaned dataset contains:

| Stat | Value |
|------|-------|
| Stories | 2,732,634 |
| Total characters | ~2.19B |
| Min doc length | 115 chars |
| Max doc length | 4,433 chars |
| Median doc length | 721 chars |
| Unique characters | 74 (ASCII only) |
| Duplicates | None |
| Download size | ~673MB |

### Suggested splits (by row index, data is pre-shuffled)

Suggested usage is as follows:

```python
from datasets import load_dataset

ds = load_dataset("karpathy/tinystories-gpt4-clean", split="train")

# Suggested default splits (data is pre-shuffled):
# rows 0..9,999 -> test (10K stories)
# rows 10,000..19,999 -> val (10K stories)
# rows 20,000..end -> train (2,712,634 stories)
test = ds.select(range(0, 10_000))
val = ds.select(range(10_000, 20_000))
train = ds.select(range(20_000, len(ds)))
```

| Split | Rows | Stories | Characters |
|-------|------|---------|------------|
| Test | 0..9,999 | 10,000 | 8,076,477 |
| Val | 10,000..19,999 | 10,000 | 8,026,787 |
| Train | 20,000..end | 2,712,634 | 2,175,177,929 |

## Cleaning pipeline

The raw TinyStories dataset contains ~5M stories from both GPT-3.5 and GPT-4. We filter to GPT-4 only (2,745,330 stories) and then apply the following cleaning steps:

1. **Unicode normalization**: curly quotes to straight quotes, em/en dashes to hyphens, ellipsis character to `...`, stray backslashes removed, double spaces collapsed.
2. **Non-ASCII rejection**: stories with any character outside printable ASCII (codes 32-127) are discarded. Newlines (code 10) are allowed as paragraph separators.
3. **Banned character rejection**: stories containing `|<>/`\`*=_&@~#%[]+()` are discarded. These almost always indicate formatting artifacts, HTML tags, chat templates, or code contamination.
4. **Minimum length**: stories under 100 characters are discarded (fragments, empty entries).
5. **Ending punctuation**: stories must end with `.` `!` `"` or `?` to ensure completeness.

### Rejection breakdown

| Reason | Count |
|--------|-------|
| Non-ASCII characters | 1,282 |
| Banned characters | 720 |
| Too short (< 100 chars) | 238 |
| Bad ending punctuation | 10,456 |
| **Total rejected** | **12,696** |

Only 0.46% of GPT-4 stories are rejected -- the data is quite clean to begin with.

## Character inventory

All 74 characters in the dataset (ASCII only):

```
zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA\n !"$',-.0123456789:;?
```

No Unicode, no control characters, no special symbols.

## Format

Single parquet file with one column:

- `text` (string): the cleaned story text

## Source

- Original dataset: [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
- Paper: [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) (Eldan & Li, 2023)
- Cleaning script: `clean.py` in this directory
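
The `clean.py` script itself is not reproduced on this page. As a rough illustration of the five steps listed under "Cleaning pipeline", the following is a minimal sketch, not the actual script. The names `normalize`, `reject_reason`, and `clean` are hypothetical. It assumes that the backslash in the banned-character list is a markdown escape (step 1 already removes backslashes), that checks run in the order listed, and that the printable range stops at code 126, since 127 is the DEL control character.

```python
import re

# Assumption: banned set read from the README's list, with the backslash
# treated as a markdown escape rather than a banned character.
BANNED = set("|<>/`*=_&@~#%[]+()")
VALID_ENDINGS = (".", "!", '"', "?")
MIN_CHARS = 100

# Step 1 replacements (hypothetical mapping covering the cases the README names).
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis character
    "\\": "",                       # stray backslashes
}


def normalize(text: str) -> str:
    """Step 1: map typographic characters to ASCII and collapse space runs."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return re.sub(r" {2,}", " ", text)


def reject_reason(text: str):
    """Steps 2-5: return the name of the first failing check, or None."""
    # Newlines are allowed; everything else must be in 32..126 (assumed bound).
    if any(ch != "\n" and not (32 <= ord(ch) <= 126) for ch in text):
        return "non_ascii"
    if any(ch in BANNED for ch in text):
        return "banned_char"
    if len(text) < MIN_CHARS:
        return "too_short"
    if not text.endswith(VALID_ENDINGS):
        return "bad_ending"
    return None


def clean(raw: str):
    """Return the normalized story, or None if any check rejects it."""
    text = normalize(raw)
    return None if reject_reason(text) else text
```

Returning a reason string rather than a bare boolean makes it straightforward to tally rejections per category, in the spirit of the rejection breakdown table.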
Provider:
cfahlgren1