emozilla/dolma-v1_7-305B-tokenized-llama3-nanoset
Source: Hugging Face. Updated: 2024-05-29. Indexed: 2024-07-22.
Download link:
https://hf-mirror.com/datasets/emozilla/dolma-v1_7-305B-tokenized-llama3-nanoset
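Resource summary:
This dataset is the Llama 3 tokenized version of NousResearch/dolma-v1_7-305B, split into 10 GB chunks. It is mainly intended for text generation tasks, covers English, and is suited to research on language modeling, causal language models, and large language models. Its size category is between 100B and 1T.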
This is a tokenized version of the NousResearch/dolma-v1_7-305B dataset using the Llama 3 tokenizer, split into 10 GB chunks for easier handling. The dataset is intended for language modeling tasks and is part of the Nanotron project. It can be downloaded and recombined using specific commands, and it supports direct usage with numpy for data manipulation.
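The description mentions recombining the chunks and reading the result with numpy but does not give the commands. The following is a minimal sketch of that workflow under stated assumptions, not the dataset author's documented procedure: it assumes the chunks are plain byte-wise splits of one flat binary file of token IDs, and that each token is stored as a 32-bit integer (the Llama 3 vocabulary is larger than 65,535 entries, so 16-bit IDs would not fit). The function names and any file names passed to them are hypothetical.

```python
import shutil
from pathlib import Path

import numpy as np


def recombine_chunks(chunk_paths, out_path):
    """Concatenate chunk files, in the order given, into a single file.

    Assumption: the chunks are byte-wise splits of one larger file,
    so joining them in order restores the original.
    """
    with open(out_path, "wb") as out:
        for chunk in chunk_paths:
            with open(chunk, "rb") as src:
                # Stream the copy so a 10 GB chunk is never held in memory.
                shutil.copyfileobj(src, out)
    return Path(out_path)


def load_tokens(path, dtype=np.int32):
    """Memory-map a flat binary file of token IDs.

    Assumption: the file is a headerless array of fixed-width integers.
    The int32 default is a guess and should be checked against the
    dataset card before use.
    """
    return np.memmap(path, dtype=dtype, mode="r")
```

Memory-mapping keeps the full token array on disk and pages in only the slices that are read, which matters for a dataset in this size range. Passing the chunk paths in sorted order, for example `sorted(Path("chunks").glob("*"))`, is usually what is wanted, though the actual chunk naming scheme should be confirmed on the dataset page.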
Provided by:
emozilla



