samhitika-0.0.1

Name: samhitika-0.0.1
Creator: maas
Published: 2025-08-15 16:32:39
License: No description provided

ModelScope community; updated 2025-08-15; indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/AI-ModelScope/samhitika-0.0.1

Resource description:

# Dataset Card for samhitika-0.0.1

Translation of ~40M sentences from BookCorpus to Sanskrit with Gemma3-27b, yielding ~1.5B (gemma3-)tokens of low-quality Sanskrit.

## Dataset Details

### Description

BookCorpus was translated one sentence at a time by Gemma3-27b in bf16 with the following prompt:

```
Translate the following English text to Sanskrit. Return only one Devanagari Sanskrit translation wrapped in triple backticks. Do NOT return any English.

English:
\```
{passage}
\```

Sanskrit:
```

**WARNING**: This is a preliminary v0.0.1 of this dataset only for experimentation. Higher quality synthetic corpora will be released soon.

- **Curated by:** Rohan Pandey
- **Funded by:** Voltage Park
- **Language(s) (NLP):** Sanskrit
- **License:** MIT

### Dataset Sources

- **Repository:** [khoomeik/sanskrit-ocr](https://github.com/KhoomeiK/sanskrit-ocr)

## Uses

This is a fairly low-quality synthetic v0.0.1 corpus! There are many instances of incorrect translations and accidental Hindi usage by the model. Therefore, this data is only suitable for Sanskrit pre-training experiments and OCR data augmentation after sufficient cleaning and filtering.

### Out-of-Scope Use

This data is almost certainly not good enough to train a model with intelligence greater than GPT-2 (in Sanskrit).

## Dataset Structure

The `bookcorpus_id` field lets you find the original English it was translated from, and the `text` field is the Sanskrit translation in Devanagari.

### Source Data

This is a synthetic translation of [BookCorpus](https://huggingface.co/datasets/bookcorpus/bookcorpus).
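
As a rough illustration of the per-sentence prompting described in the card, the sketch below fills the quoted prompt with one English sentence and pulls the translation back out of a fenced reply. The card does not describe how responses were post-processed, so `extract_translation` is an assumption, as are both function names and the exact line breaks inside the template.

```python
from typing import Optional

# Prompt wording as quoted in the dataset card.
# The line breaks between its parts are an assumption.
PROMPT_TEMPLATE = (
    "Translate the following English text to Sanskrit. "
    "Return only one Devanagari Sanskrit translation wrapped in triple backticks. "
    "Do NOT return any English.\n\n"
    "English:\n```\n{passage}\n```\n\n"
    "Sanskrit:\n"
)


def build_prompt(passage: str) -> str:
    """Fill the template with a single English sentence (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(passage=passage.strip())


def extract_translation(response: str) -> Optional[str]:
    """Return the text inside the first triple-backtick pair, or None.

    Assumed post-processing step; the card does not say how replies were parsed.
    """
    parts = response.split("```")
    if len(parts) < 3:
        # No complete opening/closing fence pair in the reply.
        return None
    body = parts[1].strip()
    return body or None
```

A reply with no complete fence pair yields `None`, which makes malformed generations easy to drop or retry rather than storing them as translations.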

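
The card recommends cleaning and filtering before use. One possible first pass, sketched here under assumptions, keeps only records whose `text` field is mostly Devanagari. The 0.9 threshold, the function names, and the integer `bookcorpus_id` in the usage example are illustrative choices, not part of the dataset. Because Hindi is also written in Devanagari, this check cannot catch the accidental Hindi the card mentions; it only removes leftover Latin-script or otherwise off-script output.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    deva = sum(1 for ch in chars if "\u0900" <= ch <= "\u097F")
    return deva / len(chars)


def keep_record(record: dict, min_ratio: float = 0.9) -> bool:
    """Keep a record when its `text` field is mostly Devanagari.

    `min_ratio` is an assumed threshold chosen for illustration only.
    """
    return devanagari_ratio(record.get("text", "")) >= min_ratio
```

For example, `keep_record({"bookcorpus_id": 1, "text": "Hello world"})` returns `False`, while a record whose `text` is entirely Devanagari passes.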

Provider:

maas

Created:

2025-05-23
