almaghrabima/SARFTokenizer-benchmark-eval
Source: Hugging Face | Updated: 2026-04-22 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/almaghrabima/SARFTokenizer-benchmark-eval
Description:
SARFTokenizer Benchmark Eval (600) contains 300 Arabic and 300 English documents used to benchmark SARFTokenizer against the tokenizers of GPT-5, GPT-4o, Gemma-4, Qwen3.6, Kimi-K2.6, ALLaM, and others. Its purpose is to make the benchmark fully reproducible: anyone can compute the same chars-per-token numbers on the same samples. Each record has four fields: index, language, text (truncated to 2000 characters), and source. The data is sampled from the first 5 Arabic and English files of the deeplatent-hq-bilingual validation shards, and the Arabic documents are filtered with a 10% Arabic-character threshold.
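The chars-per-token metric and the Arabic-character filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset author's script. Several details are assumptions: that "Arabic character" means the basic Arabic Unicode block (U+0600 to U+06FF), that the threshold is applied to the share of all characters in a document, and that chars-per-token is aggregated as total characters over total tokens across the corpus. The helper names (`arabic_ratio`, `passes_arabic_filter`, `chars_per_token`, `whitespace_encode`) are hypothetical, and the whitespace splitter only stands in for a real tokenizer's encode function.

```python
from typing import Callable, Iterable, Sequence

MAX_CHARS = 2000         # truncation length stated in the description
ARABIC_THRESHOLD = 0.10  # 10% Arabic-character threshold stated in the description


def arabic_ratio(text: str) -> float:
    """Share of characters in the basic Arabic block (assumed: U+0600..U+06FF)."""
    if not text:
        return 0.0
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    return arabic / len(text)


def passes_arabic_filter(text: str, threshold: float = ARABIC_THRESHOLD) -> bool:
    """True when at least `threshold` of the characters are Arabic (assumed rule)."""
    return arabic_ratio(text) >= threshold


def chars_per_token(
    texts: Iterable[str],
    encode: Callable[[str], Sequence],
    max_chars: int = MAX_CHARS,
) -> float:
    """Total characters divided by total tokens over all texts.

    Each text is clipped to `max_chars` first, mirroring the dataset's
    truncation. Corpus-level aggregation is an assumption; a per-document
    mean would give a different number.
    """
    total_chars = 0
    total_tokens = 0
    for text in texts:
        clipped = text[:max_chars]
        total_chars += len(clipped)
        total_tokens += len(encode(clipped))
    if total_tokens == 0:
        raise ValueError("encode() produced no tokens")
    return total_chars / total_tokens


def whitespace_encode(text: str) -> list:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


if __name__ == "__main__":
    samples = ["hello world", "مرحبا بالعالم"]
    print("Arabic filter:", [passes_arabic_filter(s) for s in samples])
    print("chars/token:", chars_per_token(samples, whitespace_encode))
```

To compare real tokenizers, pass each tokenizer's own encode callable in place of `whitespace_encode` and run it over the same 600 `text` values. A higher chars-per-token value means fewer tokens are needed for the same text.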
Provider:
almaghrabima



