
pietrolesci/pile-deduped-pythia-preshuffled

Source: Hugging Face. Updated: 2025-03-25. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/pietrolesci/pile-deduped-pythia-preshuffled
Description:
This dataset contains the fully prepared data, tokenized and pre-shuffled, used to train the Pythia (deduplicated) models. It is the same data as EleutherAI/pile-deduped-pythia-preshuffled, presented in a more manageable format. The dataset is split into 143 chunks (parquet files), each containing 1,024,000 sequences (rows); these correspond to 1,000 batches of 1,024 sequences each. It has 3 columns: uid (a sequential identifier for the sequence), batch_idx (the index of the batch to which the sequence belongs), and token_ids (the tokenized text).
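
The layout arithmetic in the description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes that uid is zero-based and that rows are stored in uid order, so that batch and chunk indices follow from integer division. Neither assumption is confirmed by the description, and the helper name `locate` is hypothetical.

```python
# Sizes taken from the dataset description.
SEQS_PER_BATCH = 1024
BATCHES_PER_CHUNK = 1000
NUM_CHUNKS = 143

SEQS_PER_CHUNK = SEQS_PER_BATCH * BATCHES_PER_CHUNK  # 1,024,000 rows per parquet file
TOTAL_BATCHES = NUM_CHUNKS * BATCHES_PER_CHUNK       # 143,000 batches
TOTAL_SEQS = NUM_CHUNKS * SEQS_PER_CHUNK             # 146,432,000 sequences


def locate(uid: int) -> tuple[int, int, int]:
    """Map a uid to (chunk index, batch_idx, position within the batch).

    ASSUMPTION: uid starts at 0 and rows are stored in uid order.
    The dataset description does not state this explicitly.
    """
    if not 0 <= uid < TOTAL_SEQS:
        raise ValueError(f"uid {uid} is outside [0, {TOTAL_SEQS})")
    chunk_idx = uid // SEQS_PER_CHUNK
    batch_idx = uid // SEQS_PER_BATCH
    position = uid % SEQS_PER_BATCH
    return chunk_idx, batch_idx, position
```

If the assumptions hold, `locate(1_024_000)` would point at the first sequence of the second chunk. Comparing the computed `batch_idx` against the stored batch_idx column of a real chunk is a quick way to check whether they do.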
Provider:
pietrolesci