emozilla/dolma-v1_7-305B-tokenized-llama3-nanoset

Name: emozilla/dolma-v1_7-305B-tokenized-llama3-nanoset
Creator: emozilla
Published: 2024-05-29 18:34:55
License: No description available

Source: Hugging Face. Updated: 2024-05-29. Indexed: 2024-07-22.

Download link:

https://hf-mirror.com/datasets/emozilla/dolma-v1_7-305B-tokenized-llama3-nanoset

Resource description:

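This Dolma dataset is the Llama 3 version of NousResearch/dolma-v1_7-305B, split into 10 GB chunks. It is mainly used for text generation tasks, supports English, and is suited to research on language modeling, causal language models, and large language models. Its size category is between 100B and 1T.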

This is a version of the NousResearch/dolma-v1_7-305B dataset tokenized with the Llama 3 tokenizer and split into 10 GB chunks for easier handling. The dataset is intended for language modeling tasks and is part of the Nanotron project. After download, the chunks can be recombined into a single file, and the token data can be read directly with NumPy.
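The recombine-and-read workflow described above might look roughly like the following sketch. It is not the dataset author's published procedure: the chunk file names (`demo.part-00`, ...), the helper names `recombine_chunks` and `open_tokens`, and the `int32` token dtype are all assumptions made for illustration. A 4-byte dtype is assumed only because the Llama 3 vocabulary (128,256 entries) does not fit in 16 bits; check the dataset card for the actual file layout and dtype. The demo runs on small synthetic chunks so that nothing needs to be downloaded.

```python
import shutil
import tempfile
from pathlib import Path

import numpy as np


def recombine_chunks(chunk_paths, out_path):
    """Concatenate chunk files, in the order given, into one binary file."""
    with open(out_path, "wb") as out:
        for chunk in chunk_paths:
            with open(chunk, "rb") as src:
                # Stream the bytes so a 10 GB chunk is never held in memory.
                shutil.copyfileobj(src, out)
    return Path(out_path)


def open_tokens(path, dtype=np.int32):
    """Memory-map a flat token file; dtype is an assumption (see lead-in)."""
    return np.memmap(path, dtype=dtype, mode="r")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        raw = np.arange(10, dtype=np.int32).tobytes()
        # Hypothetical chunk names; the split point is deliberately not
        # token-aligned, since byte-level splitting ignores token boundaries.
        (tmp / "demo.part-00").write_bytes(raw[:13])
        (tmp / "demo.part-01").write_bytes(raw[13:])
        merged = recombine_chunks(sorted(tmp.glob("demo.part-*")), tmp / "demo.bin")
        tokens = open_tokens(merged)
        print(tokens[:5])
        del tokens  # release the mapping before the temp directory is removed
```

Memory-mapping is used here because the recombined file is far larger than typical RAM; slices such as `tokens[a:b]` are read from disk on demand.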

Provided by:

emozilla
