archit11/hyperswitch-token-aware-cpt2
Source: Hugging Face. Updated: 2025-11-07. Indexed: 2025-11-15.
Download link:
https://hf-mirror.com/datasets/archit11/hyperswitch-token-aware-cpt2
Summary:
The Hyperswitch Token-Aware CPT dataset contains 1,076 Rust code samples from the Hyperswitch payment router project. The samples are optimized for Continued Pre-Training (CPT) and tokenized with the Kwaipilot/KAT-Dev tokenizer. Each sample belongs to one of four granularity levels: file, module, combined_files, or crate. The dataset statistics cover the total number of samples, total tokens, average tokens per sample, and the token distribution. Usage instructions and training recommendations are also provided.
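The statistics named in the summary (sample count, total tokens, average tokens per sample, token distribution) can be recomputed from a list of per-sample token counts. The following is a minimal sketch, not the dataset author's script: the function name `token_stats`, the bucket edges, and the example counts are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from statistics import mean


def token_stats(token_counts, bin_edges=(1024, 4096, 16384)):
    """Summarize per-sample token counts.

    bin_edges are hypothetical bucket boundaries, not values taken
    from the dataset's published statistics.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")

    edges = sorted(bin_edges)
    # Build one label per bucket: below the first edge, between
    # consecutive edges, and at or above the last edge.
    labels = [f"<{edges[0]}"]
    labels += [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
    labels.append(f">={edges[-1]}")

    histogram = dict.fromkeys(labels, 0)
    for n in token_counts:
        # bisect_right returns how many edges are <= n, which is
        # exactly the index of the bucket that n falls into.
        histogram[labels[bisect_right(edges, n)]] += 1

    return {
        "samples": len(token_counts),
        "total_tokens": sum(token_counts),
        "avg_tokens": mean(token_counts),
        "distribution": histogram,
    }


# Made-up token counts for four samples, for illustration only.
example_counts = [512, 1024, 5000, 20000]
stats = token_stats(example_counts)
```

In practice the counts would come from encoding each sample with the Kwaipilot/KAT-Dev tokenizer and taking the length of the resulting token sequence; that step is left out here because it depends on the dataset's column names, which the summary does not state.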
Provider:
archit11



