ChemCogLab/tinier_stories

Name: ChemCogLab/tinier_stories
Creator: ChemCogLab
Published: 2026-03-22 01:31:31
License: MIT

Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/ChemCogLab/tinier_stories


Description:

---
language:
- en
license: mit
task_categories:
- text-generation
tags:
- tiny-stories
- compute-optimal
- education
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.parquet
  - split: validation
    path: data/validation.parquet
  - split: test
    path: data/test.parquet
---

# Tinier Stories (Compute-Optimal Scaling Subset)

This dataset is a highly compressed, pre-tokenized subset designed exclusively for educational purposes and university-level AI coursework. It provides a lightweight sandbox for students to explore compute-optimal scaling, tokenizer compression, and language model training in heavily constrained environments.

## Dataset Structure

To maximize batching efficiency, all stories have been strictly filtered to a **Maximum Sequence Length of 196 tokens**.

The repository contains two types of files:

* **Student Splits (`train.parquet`, `validation.parquet`, `test.parquet`)**: Ultra-fast files containing only the `ids` column (a list of integer token IDs). These are optimized to save bandwidth and memory during training loops.
* **Text Backup (`train_with_text.parquet`)**: A reference file containing both the pre-encoded `ids` and the original `text` strings. This is useful for Exploratory Data Analysis (EDA) and verifying the Byte-Level BPE reconstruction.

## Tokenization

The text is pre-tokenized using a custom **512-vocabulary Byte-Level Byte-Pair Encoding (BPE)** tokenizer (`tinier_stories_bpe_512.json`). Because the vocabulary is drastically reduced, the model must learn deeper contextual representations to process standard English.

## Original Source & Attribution

This dataset is derived from the `karpathy/tinystories-gpt4-clean` dataset, a cleaned subset of the original TinyStories corpus.

If you use this dataset or the broader TinyStories concept in your research or studies, please cite the original authors:

> Eldan, Ronen, and Yuanzhi Li. "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?" *arXiv preprint arXiv:2305.07759* (2023).
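
Because every story is capped at 196 tokens and the student splits hold only an `ids` column, a batch can be built by right-padding each ID list to one fixed length. The following is a minimal sketch of that idea in plain Python, not code from the dataset authors. The helper names `pad_batch` and `next_token_pairs` are hypothetical, and `PAD_ID = 0` is an assumption: the card does not say which token ID, if any, is reserved for padding, so check the tokenizer file before reusing it.

```python
from typing import List, Sequence, Tuple

MAX_SEQ_LEN = 196  # maximum sequence length stated in the dataset card
PAD_ID = 0         # ASSUMPTION: the card does not document a padding token ID


def pad_batch(
    batch_ids: Sequence[Sequence[int]],
    max_len: int = MAX_SEQ_LEN,
    pad_id: int = PAD_ID,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad each list of token IDs to max_len and return (padded, mask).

    mask holds 1 for real tokens and 0 for padding positions.
    """
    padded, mask = [], []
    for ids in batch_ids:
        if len(ids) > max_len:
            raise ValueError(f"sequence of length {len(ids)} exceeds {max_len}")
        n_pad = max_len - len(ids)
        padded.append(list(ids) + [pad_id] * n_pad)
        mask.append([1] * len(ids) + [0] * n_pad)
    return padded, mask


def next_token_pairs(
    padded: Sequence[Sequence[int]],
    mask: Sequence[Sequence[int]],
) -> Tuple[List[List[int]], List[List[int]], List[List[int]]]:
    """Shift each row by one position for next-token prediction.

    Returns (inputs, targets, target_mask); target_mask marks which target
    positions are real tokens and should contribute to the loss.
    """
    inputs = [list(row[:-1]) for row in padded]
    targets = [list(row[1:]) for row in padded]
    target_mask = [list(m[1:]) for m in mask]
    return inputs, targets, target_mask
```

In practice the ID lists would come from the `ids` column of `data/train.parquet`, for example via `pandas.read_parquet`. Masking the padded target positions keeps them out of the loss, which matters here because the chosen padding ID may collide with a real token in the 512-entry vocabulary.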

Provided by:

ChemCogLab
