pietrolesci/pile-deduped-pythia-preshuffled
Source: Hugging Face. Updated 2025-03-25; indexed 2025-08-30.
Download link:
https://hf-mirror.com/datasets/pietrolesci/pile-deduped-pythia-preshuffled
Description:
This dataset contains fully prepared data, tokenized and pre-shuffled, for training the Pythia (deduplicated) models. It has the same content as EleutherAI/pile-deduped-pythia-preshuffled but is presented in a more manageable format. The dataset is split into 143 chunks (Parquet files), each containing 1,024,000 sequences (rows), that is, 1,000 batches of 1,024 sequences each. It has three columns: uid (a sequential identifier for the sequence), batch_idx (the index of the batch to which the sequence belongs), and token_ids (the tokenized text).
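The layout described above implies simple arithmetic between a sequence's position, its batch, and its chunk. The following is a minimal sketch of that arithmetic, not code from the dataset authors. It assumes that uid and batch_idx both start at 0 and that rows are stored in uid order; the description does not state this, so check it against the actual files. The function name `locate` and the constant names are hypothetical.

```python
from typing import Tuple

# Figures taken from the dataset description.
SEQS_PER_BATCH = 1024
BATCHES_PER_CHUNK = 1000
NUM_CHUNKS = 143

# Derived totals.
SEQS_PER_CHUNK = SEQS_PER_BATCH * BATCHES_PER_CHUNK  # 1,024,000 rows per file
TOTAL_BATCHES = NUM_CHUNKS * BATCHES_PER_CHUNK       # 143,000 batches
TOTAL_SEQS = NUM_CHUNKS * SEQS_PER_CHUNK             # 146,432,000 sequences


def locate(uid: int) -> Tuple[int, int, int]:
    """Map a sequence uid to (chunk index, batch index, position in batch).

    ASSUMPTION: uid, batch index, and chunk index are all 0-based and
    sequences are laid out in uid order. This is an illustration only.
    """
    if not 0 <= uid < TOTAL_SEQS:
        raise ValueError(f"uid {uid} is outside [0, {TOTAL_SEQS})")
    batch_idx, pos_in_batch = divmod(uid, SEQS_PER_BATCH)
    chunk_idx = batch_idx // BATCHES_PER_CHUNK
    return chunk_idx, batch_idx, pos_in_batch


if __name__ == "__main__":
    # First sequence of the second chunk under the 0-based assumption.
    print(locate(1_024_000))
```

Under these assumptions, a helper like this identifies which Parquet file holds a given training batch, so a single chunk can be read instead of the whole dataset.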
Provided by:
pietrolesci



