pietrolesci/pile-deduped-pythia-preshuffled

Name: pietrolesci/pile-deduped-pythia-preshuffled
Creator: pietrolesci
Published: 2025-03-25 21:00:11
License: No description available

Hugging Face, updated 2025-03-25, indexed 2025-08-30

Download link:

https://hf-mirror.com/datasets/pietrolesci/pile-deduped-pythia-preshuffled

Description:

This dataset contains fully prepared data that has been tokenized and pre-shuffled for training the Pythia (deduplicated) models. It is the same data as EleutherAI/pile-deduped-pythia-preshuffled, presented in a more manageable format. The dataset is split into 143 chunks (Parquet files), each containing 1,024,000 sequences (rows), which corresponds to 1,000 batches of 1,024 sequences each. The dataset has 3 columns: uid (a sequential identifier for the sequence), batch_idx (the index of the batch to which a sequence belongs), and token_ids (the tokenized text).
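
The chunk and batch sizes above imply a simple index arithmetic, sketched below in Python. This is an illustration only, under assumptions not confirmed by the description: that uid starts at 0, that rows are stored in uid order, and that batch_idx counts batches globally rather than restarting in each chunk. The helper name `locate` and the constant names are hypothetical.

```python
# Sizes taken from the dataset description.
SEQS_PER_BATCH = 1024
BATCHES_PER_CHUNK = 1000
NUM_CHUNKS = 143

# Derived sizes.
SEQS_PER_CHUNK = SEQS_PER_BATCH * BATCHES_PER_CHUNK  # 1,024,000 rows per Parquet file
TOTAL_SEQUENCES = NUM_CHUNKS * SEQS_PER_CHUNK        # 146,432,000 sequences
TOTAL_BATCHES = NUM_CHUNKS * BATCHES_PER_CHUNK       # 143,000 batches


def locate(uid: int) -> dict:
    """Map a sequence uid to its chunk and batch.

    ASSUMPTION: uid is 0-based, rows are stored in uid order, and
    batch_idx is a global (not per-chunk) counter.
    """
    if not 0 <= uid < TOTAL_SEQUENCES:
        raise ValueError(f"uid {uid} is outside [0, {TOTAL_SEQUENCES})")
    return {
        "chunk": uid // SEQS_PER_CHUNK,         # which Parquet file
        "batch_idx": uid // SEQS_PER_BATCH,     # which training batch
        "row_in_chunk": uid % SEQS_PER_CHUNK,   # row offset inside the file
        "pos_in_batch": uid % SEQS_PER_BATCH,   # position inside the batch
    }


if __name__ == "__main__":
    print(TOTAL_SEQUENCES, TOTAL_BATCHES)
    print(locate(1_024_000))
```

If the actual files use 1-based identifiers or a per-chunk batch_idx, the offsets need adjusting; checking the first and last rows of one chunk would settle it.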

Provided by:

pietrolesci
