pico-lm/pretokenized-dolma

Name: pico-lm/pretokenized-dolma
Creator: pico-lm
Published: 2025-04-16 10:43:37
License: not specified

Source: Hugging Face. Updated 2025-04-16. Indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/pico-lm/pretokenized-dolma


Description:

The Pico dataset is a pre-tokenized, pre-shuffled version of Dolma, a high-quality text corpus provided by AI2. It is meant to simplify training: the text is tokenized with the OLMo tokenizer and split into chunks of 2048 tokens, the data is already shuffled, the format is streaming-friendly, and the dataset totals 420B tokens. The stated benefits are storage and memory efficiency, reproducibility, speed, and simplicity. To use it, set up Hugging Face credentials and load the dataset from Python.
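The loading step described above can be sketched roughly as follows, assuming the Hugging Face `datasets` library is installed and credentials are already configured (for example with `huggingface-cli login`). The split name `train` and the column name `input_ids` are assumptions, since this page does not list the dataset schema; check the dataset card before relying on them. The helper names are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

DATASET_ID = "pico-lm/pretokenized-dolma"  # dataset name from this page
CHUNK_TOKENS = 2048                        # chunk length stated in the description
TOKEN_COLUMN = "input_ids"                 # ASSUMPTION: column name is not given on this page


def stream_dataset(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open the dataset in streaming mode, so nothing is downloaded up front.

    ASSUMPTION: the split is called "train". Requires network access and
    the `datasets` package; imported lazily so the helpers below work offline.
    """
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


def take_token_chunks(
    examples: Iterable[Dict[str, Any]], n: int, column: str = TOKEN_COLUMN
) -> List[List[int]]:
    """Return the token-id lists of the first `n` examples."""
    return [list(example[column]) for example in islice(examples, n)]


def approx_num_chunks(
    total_tokens: int = 420_000_000_000, chunk_tokens: int = CHUNK_TOKENS
) -> int:
    """Rough chunk count implied by the stated totals (420B is itself rounded)."""
    return total_tokens // chunk_tokens


def preview(n: int = 2) -> None:
    """Print the length and first few token ids of the first `n` chunks."""
    for chunk in take_token_chunks(stream_dataset(), n):
        print(len(chunk), chunk[:8])
```

Calling `preview()` streams the first two examples and prints their lengths. Taking the stated figures at face value, `approx_num_chunks()` gives about 205 million chunks of 2048 tokens, which is only an order-of-magnitude estimate.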

Provider:

pico-lm
