mukunda1729/token-counting-edge-cases
Source: Hugging Face | Updated: 2026-04-27 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/mukunda1729/token-counting-edge-cases
Description:
The token-counting-edge-cases dataset contains 20 short strings, each with approximate token counts for three tokenizer families: Claude, GPT (cl100k_base), and Llama (SentencePiece). It is built for sanity-checking token counters, chunkers, and context-window fitters. Test categories include trivial baselines, normal English, non-ASCII characters, programming languages, and common token-eaters, among others.
Provided by:
mukunda1729
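
Because the reference counts are described as approximate, a sanity check against them probably wants a tolerance rather than exact equality. The following is a minimal sketch of such a check, not the dataset author's method. The `Case` record, the `check_counter` helper, the tolerance defaults, and the whitespace counter are all hypothetical, and the demo rows are made up for illustration rather than taken from the dataset, whose actual column names are not given in this listing.

```python
from typing import Callable, Iterable, NamedTuple


class Case(NamedTuple):
    text: str
    expected: int  # approximate reference count for one tokenizer family


def check_counter(
    count_tokens: Callable[[str], int],
    cases: Iterable[Case],
    rel_tol: float = 0.25,
    abs_tol: int = 1,
) -> list[tuple[Case, int]]:
    """Return (case, actual) pairs whose count falls outside the tolerance."""
    failures = []
    for case in cases:
        actual = count_tokens(case.text)
        # Allow whichever is larger: the absolute slack or the relative slack.
        allowed = max(abs_tol, round(rel_tol * case.expected))
        if abs(actual - case.expected) > allowed:
            failures.append((case, actual))
    return failures


# Hypothetical stand-in counter: one token per whitespace-separated word.
def whitespace_counter(text: str) -> int:
    return len(text.split())


# Made-up rows for illustration only; not taken from the dataset.
demo_cases = [
    Case("hello world", 2),
    Case("", 0),
    Case("a b c d e f g h", 3),  # deliberately wrong, to show a failure
]

failures = check_counter(whitespace_counter, demo_cases)
for case, actual in failures:
    print(f"{case.text!r}: expected about {case.expected}, got {actual}")
```

In practice, `whitespace_counter` would be replaced by the counter under test, and `demo_cases` by rows loaded from the dataset for the matching tokenizer family.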



