jbloom/openwebtext_tokenized_gemma-2-9b

Name: jbloom/openwebtext_tokenized_gemma-2-9b
Creator: jbloom
Published: 2024-07-05 15:13:02
License: no description available

Source: Hugging Face. Updated 2024-07-05; indexed 2024-07-06.

Download link:

https://hf-mirror.com/datasets/jbloom/openwebtext_tokenized_gemma-2-9b

Overview:

The dataset has a single sequence feature, input_ids, of type int32. Its only split, train, contains 8,620,097 examples and occupies 35,342,397,700 bytes (about 35.3 GB), which is also the total dataset size. The download size is 17,362,323,842 bytes (about 17.4 GB). The dataset was generated with SAELens by pretokenizing OpenWebText with the Gemma-2 tokenizer.

Provider:

jbloom

Summary of original information

Dataset overview

Dataset information

Features:
- input_ids: sequence of int32
Splits:
- train: 8,620,097 examples, 35,342,397,700 bytes
Download size: 17,362,323,842 bytes
Dataset size: 35,342,397,700 bytes
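The reported sizes can be cross-checked against the context size of 1024 recorded in the dataset metadata. The sketch below is plain arithmetic on the figures listed above. The remark that the 4 leftover bytes per example look like a 32-bit list offset is an assumption about the storage format, not something the dataset card states.

```python
# Figures copied from the dataset card.
NUM_EXAMPLES = 8_620_097
DATASET_BYTES = 35_342_397_700
CONTEXT_SIZE = 1024      # tokens per example, from the card's metadata
BYTES_PER_TOKEN = 4      # input_ids is int32

# Total number of tokens stored in the train split.
total_tokens = NUM_EXAMPLES * CONTEXT_SIZE

# Bytes taken by the token ids alone.
token_bytes = total_tokens * BYTES_PER_TOKEN

# Average stored bytes per example, and what is left after the tokens.
bytes_per_example, remainder = divmod(DATASET_BYTES, NUM_EXAMPLES)
overhead_per_example = bytes_per_example - CONTEXT_SIZE * BYTES_PER_TOKEN

print(f"total tokens:         {total_tokens:,}")
print(f"token bytes:          {token_bytes:,}")
print(f"bytes per example:    {bytes_per_example} (remainder {remainder})")
# Assumption: the leftover bytes are per-row bookkeeping such as a list offset.
print(f"overhead per example: {overhead_per_example} bytes")
```

The dataset size divides evenly by the example count, which is consistent with every example holding exactly 1024 int32 tokens.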

Configuration

Config name: default
- Data files:
  - train: path data/train-*
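With a single default configuration and one train split, the dataset can be read through the Hugging Face `datasets` library. The following is a minimal sketch, assuming `datasets` is installed and the Hub is reachable. The helper name `first_example_length` is hypothetical, and streaming is used only to avoid downloading the full 17 GB before looking at one row.

```python
REPO_ID = "jbloom/openwebtext_tokenized_gemma-2-9b"


def first_example_length(repo_id: str = REPO_ID) -> int:
    """Return the number of token ids in the first streamed training example.

    Hypothetical helper for illustration; it needs network access.
    """
    # Imported lazily so that defining the function does not require the
    # third-party package (pip install datasets).
    from datasets import load_dataset

    # streaming=True yields examples one by one instead of downloading
    # every data/train-* shard up front.
    stream = load_dataset(repo_id, split="train", streaming=True)
    example = next(iter(stream))
    return len(example["input_ids"])


# Usage (not executed here because it contacts the Hub):
# print(first_example_length())
```

Given the context size listed in the metadata, the call would be expected to report 1024 tokens, though that is worth confirming rather than assuming.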

Additional information

SAELens version: 3.11.0
Tokenizer name: google/gemma-2-9b
Original dataset: Skylion007/openwebtext
Original split: train
Context size: 1024
Shuffled: yes
Begin-batch token: bos
Sequence separator token: bos
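The settings above describe fixed-length contexts in which a bos token opens each context and also separates consecutive documents. The toy sketch below illustrates that general packing idea only. It is not SAELens's implementation, the function name `pack_documents` is hypothetical, the token id 2 for bos is an assumption about the Gemma tokenizer, and a tiny context size stands in for 1024.

```python
from typing import Iterable, Iterator, List

BOS_ID = 2  # assumed bos id; check tokenizer.bos_token_id for the real value


def pack_documents(
    docs: Iterable[List[int]], context_size: int, bos_id: int = BOS_ID
) -> Iterator[List[int]]:
    """Concatenate tokenized documents into fixed-length contexts.

    Each document is preceded by bos (the separator), and every emitted
    context starts with bos. A trailing partial context is dropped.
    """
    if context_size < 2:
        raise ValueError("context_size must be at least 2")
    buffer: List[int] = []
    for doc in docs:
        for tok in [bos_id] + list(doc):
            # A fresh context must open with bos, even mid-document.
            if not buffer and tok != bos_id:
                buffer.append(bos_id)
            buffer.append(tok)
            if len(buffer) == context_size:
                yield buffer
                buffer = []


toy_docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
contexts = list(pack_documents(toy_docs, context_size=4))
for ctx in contexts:
    print(ctx)
```

In this toy run the second context ends with the bos that introduces the third document, which shows how a document can be split across a context boundary. Whether the real pipeline splits documents the same way is not stated in the card.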
