sytelus/openwebtext
Source: Hugging Face. Updated 2024-12-17; indexed 2024-12-21.
Download link:
https://hf-mirror.com/datasets/sytelus/openwebtext
Description:
This is the OpenWebText dataset in Arrow format, which can be used directly with the HuggingFace APIs without any pre-processing. The dataset can be used with Andrej Karpathy's NanoGPT and sytelus's NanuGPT to reproduce the GPT-2 series of models. It contains 8,013,769 documents and 9,040,017,095 tokens (counted with the tiktoken GPT-2 tokenizer, vocabulary size 50,257), and occupies 39,770,909,229 bytes on disk.
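As a rough illustration of the figures above and of loading the data through the HuggingFace `datasets` library, here is a minimal sketch. The helper names `derived_stats` and `peek` are hypothetical, and the split name `"train"` and column name `"text"` are assumptions not stated in the description; check the dataset page before relying on them.

```python
# Figures quoted in the dataset description.
NUM_DOCUMENTS = 8_013_769
NUM_TOKENS = 9_040_017_095
SIZE_ON_DISK_BYTES = 39_770_909_229


def derived_stats():
    """Return rough averages computed from the quoted figures."""
    return {
        "tokens_per_document": NUM_TOKENS / NUM_DOCUMENTS,
        "disk_bytes_per_token": SIZE_ON_DISK_BYTES / NUM_TOKENS,
    }


def peek(n=3, split="train", text_column="text"):
    """Stream the first n documents instead of downloading the whole dataset.

    Assumptions (not confirmed by the description): the split is named
    "train" and each row stores its document under the "text" column.
    """
    # Imported lazily so derived_stats() works without the package installed.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset("sytelus/openwebtext", split=split, streaming=True)
    return [row[text_column] for row in stream.take(n)]
```

`derived_stats()` needs no network access and gives roughly 1,128 tokens per document on average. `peek()` requires the `datasets` package and a connection to the Hugging Face Hub.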
Provider:
sytelus



