khoomeik/samhitika-0.0.1
Source: Hugging Face. Updated 2025-05-22; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/khoomeik/samhitika-0.0.1
Description:
This is a synthetic dataset of BookCorpus sentences translated into Sanskrit by the Gemma3-27b model. It contains low-quality translations of about 40 million sentences, totaling approximately 1.5 billion tokens as counted with the Gemma 3 tokenizer. This release is v0.0.1 and is suitable only for Sanskrit pre-training experiments and OCR data augmentation; it is not suitable for training models intended to be more capable than GPT-2.
Provider:
khoomeik
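
As a rough sanity check on the figures in the description, 1.5 billion tokens over 40 million sentences works out to about 37.5 tokens per sentence. The sketch below computes that ratio and shows one possible way to peek at a few records. It assumes the Hugging Face `datasets` package is installed and that a split named "train" exists; the listing confirms neither, nor does it state the column names, so the helper `stream_samples` is illustrative only.

```python
# Figures quoted in the description above (both approximate).
TOTAL_TOKENS = 1_500_000_000   # ~1.5 billion tokens (Gemma 3 tokenizer)
TOTAL_SENTENCES = 40_000_000   # ~40 million sentences


def avg_tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Return the mean number of tokens per sentence."""
    return tokens / sentences


def stream_samples(n: int = 3, split: str = "train"):
    """Return the first n records without downloading the whole dataset.

    Assumptions (not confirmed by the listing): the `datasets` package is
    installed, the Hub is reachable, and a split called "train" exists.
    """
    from itertools import islice
    from datasets import load_dataset  # lazy import: optional dependency

    ds = load_dataset("khoomeik/samhitika-0.0.1", split=split, streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    print(avg_tokens_per_sentence(TOTAL_TOKENS, TOTAL_SENTENCES))
```

Streaming is used here so that a quick inspection does not pull all 40 million sentences to disk; inspect the returned records to discover the actual field names before relying on them.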



