ChemCogLab/tinier_stories

Hugging Face, updated 2026-03-22, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ChemCogLab/tinier_stories
Resource description:
---
language:
- en
license: mit
task_categories:
- text-generation
tags:
- tiny-stories
- compute-optimal
- education
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.parquet
  - split: validation
    path: data/validation.parquet
  - split: test
    path: data/test.parquet
---

# Tinier Stories (Compute-Optimal Scaling Subset)

This dataset is a highly compressed, pre-tokenized subset designed exclusively for educational purposes and university-level AI coursework. It provides a lightweight sandbox for students to explore compute-optimal scaling, tokenizer compression, and language model training in heavily constrained environments.

## Dataset Structure

To maximize batching efficiency, all stories have been strictly filtered to a **Maximum Sequence Length of 196 tokens**.

The repository contains two types of files:

* **Student Splits (`train.parquet`, `validation.parquet`, `test.parquet`)**: Ultra-fast files containing only the `ids` column (a list of integer token IDs). These are optimized to save bandwidth and memory during training loops.
* **Text Backup (`train_with_text.parquet`)**: A reference file containing both the pre-encoded `ids` and the original `text` strings. This is useful for Exploratory Data Analysis (EDA) and verifying the Byte-Level BPE reconstruction.

## Tokenization

The text is pre-tokenized using a custom **512-vocabulary Byte-Level Byte-Pair Encoding (BPE)** tokenizer (`tinier_stories_bpe_512.json`). Because the vocabulary is drastically reduced, the model must learn deeper contextual representations to process standard English.

## Original Source & Attribution

This dataset is derived from the `karpathy/tinystories-gpt4-clean` dataset, a cleaned subset of the original TinyStories corpus.

If you use this dataset or the broader TinyStories concept in your research or studies, please cite the original authors:

> Eldan, Ronen, and Yuanzhi Li. "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?" *arXiv preprint arXiv:2305.07759* (2023).
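
Because every story is capped at 196 tokens and the student splits hold only an `ids` column, a training batch can be formed by right-padding each row to a fixed width. The following is a minimal sketch of that idea in plain Python, not the course's actual data loader. The padding ID (`PAD_ID = 0`), the helper names `pad_batch` and `next_token_pairs`, and the toy rows are all assumptions made for illustration; the real padding convention, if any, would need to be checked against `tinier_stories_bpe_512.json`.

```python
from typing import List, Tuple

MAX_LEN = 196  # maximum sequence length stated in the dataset card
PAD_ID = 0     # ASSUMPTION: hypothetical padding ID, not taken from the tokenizer file


def pad_batch(
    rows: List[List[int]], max_len: int = MAX_LEN, pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad each row of token IDs to max_len and return (padded, mask)."""
    padded, mask = [], []
    for ids in rows:
        if len(ids) > max_len:
            raise ValueError(f"row has {len(ids)} tokens, expected at most {max_len}")
        n_pad = max_len - len(ids)
        padded.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        mask.append([1] * len(ids) + [0] * n_pad)
    return padded, mask


def next_token_pairs(padded: List[List[int]], mask: List[List[int]]):
    """Shift by one position to build (inputs, targets, target_mask)."""
    inputs = [row[:-1] for row in padded]
    targets = [row[1:] for row in padded]
    # Mask aligned with targets, so padded positions can be excluded from the loss
    target_mask = [m[1:] for m in mask]
    return inputs, targets, target_mask


# Toy rows standing in for values from the `ids` column (not real dataset content).
rows = [[5, 17, 301], [42, 7]]
padded, mask = pad_batch(rows)
inputs, targets, target_mask = next_token_pairs(padded, mask)
print(len(padded[0]), len(inputs[0]), sum(mask[0]), sum(target_mask[1]))
```

With pandas and a Parquet engine installed, and assuming the split has been downloaded locally, the toy rows could be swapped for `pd.read_parquet("data/train.parquet")["ids"]`. The mask is kept separate from the padded IDs so that the padding value never has to be distinguishable from a real token.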
Provider:
ChemCogLab