skeskinen/TinyStories-hf
Source: Hugging Face | Updated: 2023-05-17 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/skeskinen/TinyStories-hf
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1911420483
    num_examples: 2119719
  - name: validation
    num_bytes: 19306310
    num_examples: 21990
  download_size: 1000775442
  dataset_size: 1930726793
---
A description of this dataset can be found at https://arxiv.org/abs/2305.07759
Copied from roneneldan/TinyStories and modified with the following script:
```
# Importing ftfy.bad_codecs registers the 'sloppy-windows-1252' codec used below.
import ftfy.bad_codecs
from datasets import Dataset, DatasetDict


def read_stories(path):
    # Read the raw dump and split it into stories on the '<|endoftext|>' separator.
    with open(path, 'r', encoding='sloppy-windows-1252') as f:
        return [story.strip() for story in f.read().split('<|endoftext|>')]


train = read_stories('./TinyStories-train.txt')
valid = read_stories('./TinyStories-valid.txt')

dataset = DatasetDict({
    'train': Dataset.from_dict({'text': train}),
    'validation': Dataset.from_dict({'text': valid}),
})
dataset.save_to_disk('./TinyStories')
```
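The splitting step in the script above can be illustrated on a small in-memory string. This is a minimal sketch rather than part of the original script: `split_stories`, its `drop_empty` flag, and the sample text are hypothetical, and the only thing taken from the script is the `<|endoftext|>` separator followed by `strip()`. It shows that a trailing separator yields an empty final entry, which the script as written would keep. Whether the raw files actually end with a separator is not stated here.

```python
SEPARATOR = '<|endoftext|>'


def split_stories(raw, drop_empty=False):
    """Split a raw dump into stripped stories, mirroring the script above."""
    stories = [s.strip() for s in raw.split(SEPARATOR)]
    if drop_empty:
        # Optional extra step (not in the original script): discard blank entries.
        stories = [s for s in stories if s]
    return stories


# Hypothetical two-story sample that ends with a separator.
sample = 'Once upon a time.\n<|endoftext|>\nThe end.\n<|endoftext|>\n'

kept = split_stories(sample)
filtered = split_stories(sample, drop_empty=True)
print(kept)
print(filtered)
```

If empty entries are unwanted downstream, filtering them as in `drop_empty=True` is one option, at the cost of example counts that may differ from the figures reported for this dataset.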
Provided by:
skeskinen
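Summary of original information
Dataset overview
Dataset features
- Name: text
- Data type: string
Dataset splits
- Training set
  - Number of examples: 2119719
  - Data size: 1911420483 bytes
- Validation set
  - Number of examples: 21990
  - Data size: 19306310 bytes
Dataset size
- Download size: 1000775442 bytes
- Total dataset size: 1930726793 bytes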
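The size figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers reported in this card. The variable names are illustrative. It confirms that the two split sizes sum to the total dataset size and computes the average story length, which comes out at roughly 900 bytes per example.

```python
# Figures copied from the dataset metadata above.
splits = {
    'train': {'num_bytes': 1911420483, 'num_examples': 2119719},
    'validation': {'num_bytes': 19306310, 'num_examples': 21990},
}
dataset_size = 1930726793
download_size = 1000775442

# The per-split byte counts should add up to the reported total.
total_bytes = sum(s['num_bytes'] for s in splits.values())
sizes_consistent = total_bytes == dataset_size

# Average size of one example in each split.
avg_bytes = {name: s['num_bytes'] / s['num_examples'] for name, s in splits.items()}

for name, avg in avg_bytes.items():
    print(f'{name}: {avg:.1f} bytes per example on average')
print(f'split sizes sum to dataset_size: {sizes_consistent}')
print(f'download / dataset size ratio: {download_size / dataset_size:.2f}')
```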



