victorarizz/minipile-tokenized-llama3

Name: victorarizz/minipile-tokenized-llama3
Creator: victorarizz
Published: 2024-07-08 23:43:27
License: No description available

Source: Hugging Face. Updated 2024-07-08; indexed 2024-07-22.

Download link:

https://hf-mirror.com/datasets/victorarizz/minipile-tokenized-llama3

Resource overview:

This is a version of the JeanKaddour/minipile (MiniPile) dataset pre-tokenized with the Llama3 tokenizer. The dataset contains a training set of 152,813 samples with a total size of 5,007,987,636 bytes. Tokenization used the SAELens tool with a context size of 8192; the data was shuffled, the begin-batch token is bos, no begin-sequence token is used, and the sequence-separator token is eos. The licensing is identical to the original MiniPile dataset.

Provider:

victorarizz

Original information summary

MiniPile Tokenized for Llama 3 (~1.2b tokens)

Dataset information

Features:
- input_ids: sequence of int32
Splits:
- train: 152813 samples, 5007987636 bytes in total
Download size: 2195659342 bytes
Dataset size: 5007987636 bytes
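
The figures above can be cross-checked against the "~1.2b tokens" in the title with a little arithmetic. The sketch below assumes that every row holds exactly one full context of 8192 tokens (the context size listed under the tokenization details) and that each token is stored as a 4-byte int32. Neither assumption is confirmed by the card itself; the constant names are illustrative.

```python
# Figures copied from the dataset card.
NUM_ROWS = 152_813
DATASET_BYTES = 5_007_987_636
DOWNLOAD_BYTES = 2_195_659_342

# Assumptions (not stated explicitly by the card): one full context per row,
# and 4 bytes per token because input_ids is a sequence of int32.
CONTEXT_SIZE = 8_192
BYTES_PER_TOKEN = 4

total_tokens = NUM_ROWS * CONTEXT_SIZE
token_bytes = total_tokens * BYTES_PER_TOKEN
overhead_bytes = DATASET_BYTES - token_bytes
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"total tokens:      {total_tokens:,}")
print(f"raw token bytes:   {token_bytes:,}")
print(f"remaining bytes:   {overhead_bytes:,}")
print(f"download / on-disk ratio: {download_ratio:.3f}")
```

Under these assumptions the token count comes to roughly 1.25 billion, in line with the title. The bytes left over after subtracting the raw token payload work out to 4 per row, which would be consistent with per-row bookkeeping in the storage format, though that interpretation is a guess.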

Configuration

Config name: default
- Data files:
  - train: path data/train-*
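
With a single default config and one train split, the dataset can probably be read with the Hugging Face `datasets` library as sketched below. This is a minimal sketch, assuming the `datasets` package is installed and the repository is reachable from the Hugging Face Hub under the ID shown in the title. The helper names `load_minipile_tokens` and `first_row_length` are hypothetical and not part of any library.

```python
REPO_ID = "victorarizz/minipile-tokenized-llama3"


def load_minipile_tokens(streaming: bool = True):
    """Open the train split; streaming avoids downloading ~2.2 GB up front."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)


def first_row_length(dataset) -> int:
    """Return the number of token ids in the first row's `input_ids` column."""
    first_row = next(iter(dataset))
    return len(first_row["input_ids"])
```

Calling `first_row_length(load_minipile_tokens())` requires network access. If each row is one full context, the result should equal the context size given in the tokenization settings, but that has not been verified here.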

License

Same as the original MiniPile dataset.

Tokenization details

Tokenization tool: SAELens (3.11.0)
Settings:
- Context size: 8192
- Shuffle: yes
- Begin-batch token: "bos"
- Begin-sequence token: none
- Sequence-separator token: "eos"
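
To make the three token settings concrete, here is a toy illustration of how such a packing scheme could behave. It is not SAELens's actual code: `pack_sequences` is a hypothetical helper, the token ids are made up, shuffling is not modeled, and dropping the trailing partial row is an assumption.

```python
from typing import Iterable, Iterator, List, Optional


def pack_sequences(
    token_seqs: Iterable[List[int]],
    context_size: int,
    begin_batch_id: Optional[int],
    begin_sequence_id: Optional[int],
    separator_id: Optional[int],
) -> Iterator[List[int]]:
    """Concatenate tokenized documents and cut them into fixed-size rows."""
    if begin_batch_id is not None and context_size < 2:
        raise ValueError("context_size must be >= 2 when a begin-batch token is used")

    def stream() -> Iterator[int]:
        # Flatten all documents into one token stream, optionally marking
        # the start of each document and separating consecutive documents.
        for seq in token_seqs:
            if begin_sequence_id is not None:
                yield begin_sequence_id
            yield from seq
            if separator_id is not None:
                yield separator_id

    row: List[int] = []
    for tok in stream():
        if not row and begin_batch_id is not None:
            row.append(begin_batch_id)  # every row opens with the begin-batch token
        row.append(tok)
        if len(row) == context_size:
            yield row
            row = []
    # Assumption: a trailing row shorter than context_size is discarded.


# Toy ids: 1 stands in for bos, 2 for eos; settings mirror the list above
# (begin-batch = bos, begin-sequence = none, separator = eos).
BOS, EOS = 1, 2
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
rows = list(pack_sequences(docs, 4, BOS, None, EOS))
for r in rows:
    print(r)
```

In this toy run each row starts with the bos stand-in, documents flow across row boundaries, and the eos stand-in marks where one document ends and the next begins.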
