Tasmay-Tib/Tokeniser
Source: Hugging Face. Updated: 2025-04-03. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/Tasmay-Tib/Tokeniser
Description:
This is a tokenizer built on the SlimPajama dataset, covering approximately 1B tokens. It comes in two versions: a 0.5B version (validation data only) and a 1B version (validation and test data). The tokenizer uses a custom algorithm, and the release includes token counts, text corpora, a JSON list of lines/paragraphs from SlimPajama, an ordered tokenizer, and an unordered tokenizer with token IDs.
Provided by:
Tasmay-Tib



