Piros/CamStories_full
Source: Hugging Face. Updated 2025-08-10. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/Piros/CamStories_full
Resource overview:
CamStories-Full is the full cleaned and combined corpus of children's fiction stories drawn from the TinyStories GPT-4 variant and the SimpleStories dataset. It applies the same cleaning, name normalization, and grammar filtering pipeline as CamStories-10k, but retains the full uncased word vocabulary of 31,244 tokens, which makes it a superset of CamStories-10k with greater lexical variety. Audio vectors are provided for every vocabulary item in two formats: the original Kokoro synthesis at 24 kHz, and fixed-length 8 kHz vectors of 4096 samples (about half a second, 0.512 s).
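The relationship between the two audio formats can be illustrated with a little arithmetic: 24 kHz to 8 kHz is a factor-of-3 reduction, and 4096 samples at 8 kHz last 4096 / 8000 = 0.512 s. The sketch below is only an illustration under stated assumptions. The listing does not say how the 8 kHz vectors were actually produced, so the naive averaging decimator, the zero-padding rule, and every function name here are hypothetical and are not the dataset's own pipeline.

```python
from typing import List

SRC_RATE = 24_000   # Kokoro synthesis rate given in the description
DST_RATE = 8_000    # reduced rate given in the description
TARGET_LEN = 4096   # fixed vector length given in the description


def downsample_by_averaging(samples: List[float], factor: int) -> List[float]:
    """Average each group of `factor` consecutive samples.

    Assumption for illustration: a real pipeline would more likely use a
    proper low-pass resampler; the method used for this dataset is unknown.
    """
    usable = len(samples) - len(samples) % factor  # drop an incomplete tail group
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


def fit_length(samples: List[float], length: int = TARGET_LEN) -> List[float]:
    """Truncate or zero-pad to exactly `length` samples (padding rule is assumed)."""
    clipped = samples[:length]
    return clipped + [0.0] * (length - len(clipped))


def to_fixed_vector(samples_24k: List[float]) -> List[float]:
    """Turn a 24 kHz waveform into a fixed-length 8 kHz vector."""
    factor = SRC_RATE // DST_RATE  # 3
    return fit_length(downsample_by_averaging(samples_24k, factor))


# Duration covered by one fixed-length vector, in seconds.
duration_s = TARGET_LEN / DST_RATE
```

For example, a 0.25 s word at 24 kHz (6000 samples) would become 2000 samples followed by 2096 zeros, while anything longer than 0.512 s would be cut off at 4096 samples.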
Provided by:
Piros



