almaghrabima/SARFTokenizer-benchmark-eval

Name: almaghrabima/SARFTokenizer-benchmark-eval
Creator: almaghrabima
Published: 2026-04-22 18:28:38
License: No description available

Source: Hugging Face. Updated: 2026-04-22. Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/almaghrabima/SARFTokenizer-benchmark-eval

Description:

The SARFTokenizer Benchmark Eval (600) dataset contains 300 Arabic and 300 English documents used to benchmark SARFTokenizer against the tokenizers of GPT-5, GPT-4o, Gemma-4, Qwen3.6, Kimi-K2.6, ALLaM, and others. It is published so that the benchmark is fully reproducible: anyone can compute the same chars-per-token figures on the same samples. Each record has four fields: index, language, text (truncated to 2,000 characters), and source. The data is sampled from the first 5 Arabic and English files of the deeplatent-hq-bilingual validation shards, and Arabic documents are filtered with a 10% Arabic-character threshold.
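The chars-per-token figure and the Arabic-character filter described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's script: the helper names, the choice of the basic Arabic Unicode block (U+0600 to U+06FF) as the definition of an "Arabic character", the corpus-level ratio (total characters over total tokens), and the whitespace stand-in tokenizer are all assumptions made here for the example.

```python
from typing import Callable, Iterable, List

MAX_CHARS = 2000          # truncation length stated in the dataset description
ARABIC_THRESHOLD = 0.10   # 10% Arabic-character threshold from the description


def arabic_ratio(text: str) -> float:
    """Fraction of characters that fall in the basic Arabic block.

    Assumption: the original filter may count a different character set
    (for example, it could also include Arabic presentation forms).
    """
    if not text:
        return 0.0
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06ff")
    return arabic / len(text)


def passes_arabic_filter(text: str, threshold: float = ARABIC_THRESHOLD) -> bool:
    """Keep a document only if enough of its characters are Arabic."""
    return arabic_ratio(text) >= threshold


def chars_per_token(texts: Iterable[str], encode: Callable[[str], List]) -> float:
    """Corpus-level ratio: total characters divided by total tokens.

    Assumption: the benchmark may instead average per-document ratios.
    """
    total_chars = 0
    total_tokens = 0
    for text in texts:
        text = text[:MAX_CHARS]  # mirror the dataset's 2000-character cap
        total_chars += len(text)
        total_tokens += len(encode(text))
    return total_chars / total_tokens if total_tokens else 0.0


def whitespace_encode(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer; swap in a real tokenizer's encode."""
    return text.split()


if __name__ == "__main__":
    samples = ["hello world", "مرحبا بالعالم"]
    print(chars_per_token(samples, whitespace_encode))
```

To run this on the actual samples, one would pass the dataset's text column and a real tokenizer's encode function to `chars_per_token`. The exact column and split names are not given in the description, so they would need to be checked against the dataset itself.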

Provided by:

almaghrabima
