pico-lm/pretokenized-dolma
Source: Hugging Face. Updated 2025-04-16. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/pico-lm/pretokenized-dolma
Description:
The Pico dataset is a pre-tokenized, pre-shuffled version of Dolma, a high-quality text corpus from AI2. The text is tokenized with the OLMo tokenizer and split into chunks of 2048 tokens, then shuffled ahead of time and stored in a streaming-friendly format. It totals 420B tokens, enough for large-scale training. Because tokenization and shuffling are already done, the dataset saves storage and memory, makes training runs reproducible, and is fast and simple to use. To use it, set up your Hugging Face credentials and load the dataset from Python.
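A minimal sketch of the loading step, assuming the Hugging Face `datasets` library is installed and credentials are already configured (for example via `huggingface-cli login`). The split name `train` and the helper names `approx_num_chunks` and `peek` are illustrative assumptions, not taken from the dataset card. The sketch streams rows rather than downloading the full corpus, and does not assume any particular column name.

```python
from itertools import islice

DATASET_ID = "pico-lm/pretokenized-dolma"
CHUNK_TOKENS = 2048               # chunk length stated in the description
TOTAL_TOKENS = 420_000_000_000    # "420B" as stated; treat as approximate


def approx_num_chunks(total_tokens=TOTAL_TOKENS, chunk_tokens=CHUNK_TOKENS):
    """Rough count of fixed-length chunks implied by the stated totals."""
    return total_tokens // chunk_tokens


def peek(n=2, split="train"):
    """Stream the first n rows without downloading the whole dataset.

    Assumption: the dataset exposes a split named "train".
    Requires network access and `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily; third-party package

    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `approx_num_chunks()` gives an order-of-magnitude estimate of how many 2048-token sequences the corpus holds. Calling `peek()` returns a short list of row dictionaries, whose keys can be inspected to find the token column before wiring the stream into a training loop.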
Provider:
pico-lm