skeskinen/TinyStories-Instruct-hf
Source: Hugging Face. Updated 2023-05-17. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/skeskinen/TinyStories-Instruct-hf
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2648754575
    num_examples: 2476533
  - name: validation
    num_bytes: 26745785
    num_examples: 25028
  download_size: 1325495040
  dataset_size: 2675500360
---
A description of this dataset can be found at https://arxiv.org/abs/2305.07759
Copied from roneneldan/TinyStoriesInstruct and modified with the following script:
```
import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec
from datasets import Dataset, DatasetDict


def read_stories(path):
    # Read the whole file, split on the end-of-text marker, strip each example.
    with open(path, 'r', encoding='sloppy-windows-1252') as f:
        raw = f.read()
    return [story.strip() for story in raw.split('<|endoftext|>')]


train = read_stories('./TinyStories-Instruct-train.txt')
valid = read_stories('./TinyStories-Instruct-valid.txt')

dataset = DatasetDict({
    'train': Dataset.from_dict({'text': train}),
    'validation': Dataset.from_dict({'text': valid}),
})
dataset.save_to_disk('./TinyStories-Instruct')
```
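The split-and-strip step in the script can be tried on a small in-memory string. The sketch below is illustrative only: the sample text is made up, and the `split_examples` helper and its `drop_empty` flag are hypothetical names that are not part of the original script. It shows that if a raw file ends with `<|endoftext|>` (an assumption about the files, not something the card states), splitting leaves one empty trailing example unless it is filtered out.

```python
EOT = "<|endoftext|>"  # separator the conversion script splits on


def split_examples(raw: str, drop_empty: bool = False) -> list[str]:
    """Split raw text on the end-of-text marker and strip each piece."""
    pieces = [piece.strip() for piece in raw.split(EOT)]
    if drop_empty:
        # Optional extra step (not in the original script): discard empty pieces.
        pieces = [piece for piece in pieces if piece]
    return pieces


# Made-up sample text, not taken from the dataset.
sample = "Story one.\n<|endoftext|>\nStory two.\n<|endoftext|>\n"

as_in_script = split_examples(sample)             # same behavior as the script
filtered = split_examples(sample, drop_empty=True)

print(as_in_script)  # ['Story one.', 'Story two.', '']
print(filtered)      # ['Story one.', 'Story two.']
```

Whether the published splits contain such empty rows depends on how the source files end, so it may be worth checking before training on the data.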
Provided by:
skeskinen
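Summary of original metadata
Dataset overview
Dataset features
- Name: text
- Data type: string
Dataset splits
- Training set
  - Number of examples: 2476533
  - Size: 2648754575 bytes
- Validation set
  - Number of examples: 25028
  - Size: 26745785 bytes
Dataset size
- Download size: 1325495040 bytes
- Total dataset size: 2675500360 bytes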
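As a quick sanity check on the figures above, the following sketch confirms that the per-split byte counts add up to the reported total and derives two rough ratios. The numbers are copied from the metadata. The variable names are illustrative, and reading the download/dataset ratio as a compression ratio is an assumption, since the card does not say how the files are stored.

```python
# Figures copied from the dataset metadata.
splits = {
    "train": {"num_bytes": 2648754575, "num_examples": 2476533},
    "validation": {"num_bytes": 26745785, "num_examples": 25028},
}
download_size = 1325495040
dataset_size = 2675500360

# The split sizes should sum to the reported dataset size.
total_bytes = sum(s["num_bytes"] for s in splits.values())
print("split sizes sum to dataset_size:", total_bytes == dataset_size)

# Mean size of one example in each split, in bytes.
for name, s in splits.items():
    mean_bytes = s["num_bytes"] / s["num_examples"]
    print(f"{name}: {mean_bytes:.1f} bytes per example")

# Ratio of download size to in-memory dataset size.
ratio = download_size / dataset_size
print(f"download/dataset ratio: {ratio:.3f}")
```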



