samhitika-0.0.1
ModelScope community. Updated 2025-08-15; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/samhitika-0.0.1
Resource description:
# Dataset Card for samhitika-0.0.1
A translation of ~40M sentences from BookCorpus into Sanskrit with Gemma3-27b, yielding ~1.5B tokens (as counted by the Gemma3 tokenizer) of low-quality Sanskrit.
## Dataset Details
### Description
BookCorpus was translated one sentence at a time by Gemma3-27b in bf16 with the following prompt:
````
Translate the following English text to Sanskrit. Return only one Devanagari Sanskrit translation wrapped in triple backticks. Do NOT return any English.
English:
```
{passage}
```
Sanskrit:
````
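The per-sentence prompting described above can be sketched in Python. This is a minimal illustration, not the author's pipeline: the helper names `build_prompt` and `extract_translation` are hypothetical, and the regex-based extraction of the fenced answer is an assumption about how one might parse a response, since the card does not say how model outputs were post-processed.

```python
import re

# Three backticks, built programmatically so this snippet can sit inside a fence.
FENCE = "`" * 3

# Prompt text copied from the dataset card; "{passage}" is the placeholder.
PROMPT_TEMPLATE = (
    "Translate the following English text to Sanskrit. Return only one "
    "Devanagari Sanskrit translation wrapped in triple backticks. "
    "Do NOT return any English.\n"
    "English:\n"
    + FENCE + "\n"
    + "{passage}\n"
    + FENCE + "\n"
    + "Sanskrit:\n"
)

# Matches the first fenced span, skipping an optional language tag line.
_FENCED = re.compile(FENCE + r"(?:[^\n`]*\n)?(.*?)" + FENCE, re.DOTALL)


def build_prompt(passage: str) -> str:
    """Fill the card's prompt with one English sentence (hypothetical helper)."""
    return PROMPT_TEMPLATE.replace("{passage}", passage)


def extract_translation(response: str):
    """Return the text inside the first fenced span, or None if there is none.

    Assumed post-processing step; the card does not describe the real one.
    """
    match = _FENCED.search(response)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None
```

Responses with no fenced span return `None`, so a caller can drop or retry them instead of storing unparsed model output.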
**WARNING**: This is a preliminary v0.0.1 of this dataset, intended only for experimentation. Higher-quality synthetic corpora will be released soon.
- **Curated by:** Rohan Pandey
- **Funded by:** Voltage Park
- **Language(s) (NLP):** Sanskrit
- **License:** MIT
### Dataset Sources
- **Repository:** [khoomeik/sanskrit-ocr](https://github.com/KhoomeiK/sanskrit-ocr)
## Uses
This is a fairly low-quality synthetic v0.0.1 corpus! It contains many incorrect translations and instances where the model accidentally used Hindi.
Therefore, this data is only suitable for Sanskrit pre-training experiments and OCR data augmentation, and only after sufficient cleaning and filtering.
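One possible first cleaning pass is a script-based filter that drops rows whose `text` is not predominantly Devanagari. The sketch below is an assumption about what "cleaning and filtering" could look like, not a method the card prescribes. The function names and the 0.9 threshold are hypothetical. Note that it cannot catch Hindi contamination, because Hindi is also written in Devanagari; that would need a language-identification step on top.

```python
import unicodedata

# The Devanagari Unicode block spans U+0900 through U+097F.
DEVANAGARI_START = "\u0900"
DEVANAGARI_END = "\u097f"


def devanagari_ratio(text: str) -> float:
    """Fraction of letters and combining marks that fall in the Devanagari block.

    Marks (Unicode category M*) are counted because vowel signs and the virama
    are not classified as letters. Digits, punctuation and spaces are ignored.
    """
    relevant = [
        ch for ch in text
        if ch.isalpha() or unicodedata.category(ch).startswith("M")
    ]
    if not relevant:
        return 0.0
    in_block = sum(1 for ch in relevant if DEVANAGARI_START <= ch <= DEVANAGARI_END)
    return in_block / len(relevant)


def is_mostly_devanagari(text: str, threshold: float = 0.9) -> bool:
    """Keep a row only if its Devanagari ratio meets the (assumed) threshold."""
    return devanagari_ratio(text) >= threshold
```

For example, `is_mostly_devanagari("नमस्ते hello world")` is false because most of the letters are Latin, so a row where the model fell back to English would be dropped.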
### Out-of-Scope Use
This data is almost certainly not good enough to train a model with intelligence greater than GPT-2 (in Sanskrit).
## Dataset Structure
The `bookcorpus_id` field lets you find the original English it was translated from, and the `text` field is the Sanskrit translation in Devanagari.
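As a rough illustration of how the two fields fit together, the sketch below pairs each Sanskrit row with its English source through `bookcorpus_id`. The card does not specify the type of `bookcorpus_id` or how it indexes BookCorpus, so the code treats it as an opaque key into a mapping the reader builds themselves. The helper name `pair_with_source` and the sample rows are invented for illustration and are not taken from the dataset.

```python
def pair_with_source(translated_rows, english_by_id):
    """Return (english, sanskrit) pairs for rows whose source sentence is known.

    translated_rows: iterable of dicts with "bookcorpus_id" and "text" keys.
    english_by_id:   mapping from bookcorpus_id to the original English sentence
                     (how this mapping is built is left to the reader).
    """
    pairs = []
    for row in translated_rows:
        english = english_by_id.get(row["bookcorpus_id"])
        if english is not None:  # skip rows with no matching source sentence
            pairs.append((english, row["text"]))
    return pairs


# Invented sample data, only to show the shape of the records.
sample_english = {0: "the sun rose ."}
sample_rows = [
    {"bookcorpus_id": 0, "text": "सूर्यः उदितः ।"},
    {"bookcorpus_id": 7, "text": "..."},  # no English source in the sample mapping
]
sample_pairs = pair_with_source(sample_rows, sample_english)
```

Aligned pairs like these make it possible to spot-check translations against their source before using the corpus.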
### Source Data
This is a synthetic translation of [BookCorpus](https://huggingface.co/datasets/bookcorpus/bookcorpus).
Provider:
maas
Created:
2025-05-23



