Geonwoohong/pile-uncopyrighted-6b-tokenized-gpt2
Source: Hugging Face | Updated: 2025-10-21 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/Geonwoohong/pile-uncopyrighted-6b-tokenized-gpt2
Description:
A pre-tokenized subset of Pile-Uncopyrighted, encoded with the GPT-2 BPE tokenizer and intended for Mixture-of-Experts (MoE) experiments. Each record holds a fixed-length sequence of 1024 tokens for training decoder-only language models. The data draws on uncopyrighted Pile subsets, including Pile-CC, USPTO Backgrounds, Wikipedia (en), GitHub, PubMed Abstracts, and StackExchange.
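The fixed-length 1024-token records described above are commonly produced by concatenating tokenized documents and slicing the stream into equal blocks. The listing does not say how this dataset was actually built, so the following is only a minimal sketch under assumptions: the function name `pack_sequences`, the use of the GPT-2 end-of-text token (ID 50256) as a document separator, and the dropping of the trailing partial block are all illustrative choices, not the author's confirmed method.

```python
from typing import Iterable, Iterator, List

GPT2_EOS_ID = 50256  # <|endoftext|> in the GPT-2 BPE vocabulary
SEQ_LEN = 1024       # record length stated in the dataset description


def pack_sequences(
    docs: Iterable[List[int]],
    seq_len: int = SEQ_LEN,
    eos_id: int = GPT2_EOS_ID,
) -> Iterator[List[int]]:
    """Yield fixed-length blocks from a stream of tokenized documents.

    Hypothetical helper: assumes documents are joined with an EOS
    separator and that a trailing partial block is discarded.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Whatever remains in `buffer` is shorter than seq_len and is dropped.


# Usage with placeholder token IDs standing in for real GPT-2 BPE output.
example_docs = [[1] * 1500, [2] * 600]
packed = list(pack_sequences(example_docs))
```

Packing this way keeps every record the same length, which avoids padding during decoder-only training, at the cost of documents occasionally spanning a record boundary.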
Provided by:
Geonwoohong



