haggingfacehyz/Skylion007-openwebtext-gpt2-1024
Source: Hugging Face. Updated: 2025-01-14. Indexed: 2025-02-15.
Download link:
https://hf-mirror.com/datasets/haggingfacehyz/Skylion007-openwebtext-gpt2-1024
Description:
This is Skylion007/openwebtext, a large text dataset, tokenized with the GPT-2 tokenizer. It contains 8,835,789 training samples and has a total size of 36,226,734,900 bytes (about 36.2 GB). Each sample has a context size of 1024. The dataset is suited to training machine learning models that work with text, such as language models.
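As a quick sanity check on the figures quoted in the description, the short Python sketch below derives the total token count and the average storage cost per sample. It assumes that "context size" means 1024 tokens per sample; the variable names are illustrative and are not part of the dataset.

```python
# Figures quoted in the dataset description.
NUM_SAMPLES = 8_835_789
TOTAL_BYTES = 36_226_734_900
CONTEXT_SIZE = 1024  # assumed to mean tokens per sample

# Total tokens if every sample holds exactly CONTEXT_SIZE tokens.
total_tokens = NUM_SAMPLES * CONTEXT_SIZE

# Average storage per sample and per token.
bytes_per_sample, remainder = divmod(TOTAL_BYTES, NUM_SAMPLES)
bytes_per_token = bytes_per_sample / CONTEXT_SIZE

print(f"total tokens:     {total_tokens:,}")
print(f"bytes per sample: {bytes_per_sample} (remainder {remainder})")
print(f"bytes per token:  {bytes_per_token:.3f}")
print(f"total size:       {TOTAL_BYTES / 1e9:.2f} GB")
```

The total size divides evenly into 4100 bytes per sample, or roughly 4 bytes per token. That would be consistent with tokens stored as 32-bit integers plus a small per-row overhead, but the listing does not state the storage format, so treat that reading as a guess.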
Provider:
haggingfacehyz



