Phase-Technologies/hindi-translation-dataset
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Phase-Technologies/hindi-translation-dataset
Resource description:
---
{}
---
# Hindi Translation Dataset
## Dataset Description
This dataset pairs English sentences with their Hindi translations. It is intended as a resource for linguistic research and machine translation tasks.
### Dataset Summary
- **Total Instances:** 2618
- **Source Language:** English (en-US)
- **Target Language:** Hindi (hi-IN)
- **Format:** CSV UTF-8
- **Curation:** Synthetic generation with duplicate removal.
## Technical Specifications
### Model Used
This dataset was generated using the **Sarvam-30B** model, specifically optimized for Indian languages and high-quality Hindi output.
### Data Fields
- `english`: The source sentence in English.
- `hindi`: The target translation in Devanagari script, optimized for natural flow.
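As a rough illustration of the two fields above, the sketch below reads `english`/`hindi` pairs with Python's standard `csv` module. The helper name `read_pairs` and the in-memory sample rows are assumptions made for this example; they are not taken from the dataset or its tooling.

```python
import csv
import io

# Column names as listed under "Data Fields".
EXPECTED_FIELDS = ("english", "hindi")


def read_pairs(handle):
    """Yield (english, hindi) tuples from an open CSV text handle.

    Hypothetical helper for illustration; raises ValueError on first use
    if either expected column is absent from the header row.
    """
    reader = csv.DictReader(handle)
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        yield row["english"], row["hindi"]


# Tiny in-memory sample standing in for the real file (illustrative rows only).
sample = io.StringIO("english,hindi\nHello,नमस्ते\nThank you,धन्यवाद\n")
pairs = list(read_pairs(sample))
```

For a local copy of the file, the same helper could be given a handle opened with `encoding="utf-8-sig"` and `newline=""`, which reads UTF-8 whether or not a byte-order mark is present.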
### Generation Pipeline
1. **Extraction:** Initial prompts were processed by Sarvam-30B.
2. **Validation:** Responses were parsed for valid JSON structure.
3. **Post-Processing:** Merging of session outputs and deduplication using pandas.
4. **Final Export:** Normalized CSV generation.
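The validation, post-processing, and export steps above might look roughly like the following sketch. It is not the authors' actual pipeline: the shape of the model responses (one JSON object per response with `english` and `hindi` keys), the helper name `parse_valid`, and the sample session data are all assumptions made for illustration.

```python
import json

import pandas as pd


def parse_valid(raw_responses):
    """Keep only responses that parse as JSON objects with both string fields.

    Hypothetical helper; the real response format is not documented here.
    """
    rows = []
    for raw in raw_responses:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip responses that are not valid JSON
        if (
            isinstance(obj, dict)
            and isinstance(obj.get("english"), str)
            and isinstance(obj.get("hindi"), str)
        ):
            rows.append(
                {"english": obj["english"].strip(), "hindi": obj["hindi"].strip()}
            )
    return rows


# Illustrative outputs from two generation sessions (made-up data).
session_a = ['{"english": "Hello", "hindi": "नमस्ते"}', "not json"]
session_b = [
    '{"english": "Hello", "hindi": "नमस्ते"}',
    '{"english": "Thank you", "hindi": "धन्यवाद"}',
]

# Merge session outputs, then drop exact duplicate pairs.
frames = [
    pd.DataFrame(parse_valid(s), columns=["english", "hindi"])
    for s in (session_a, session_b)
]
merged = pd.concat(frames, ignore_index=True)
deduped = merged.drop_duplicates(subset=["english", "hindi"]).reset_index(drop=True)

# Export as CSV text; passing a path and encoding="utf-8" would write a file instead.
csv_text = deduped.to_csv(index=False)
```

Deduplicating on both columns keeps distinct translations of the same English sentence; whether the original pipeline deduplicated on one column or both is not stated.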
## Intended Use
- Training and fine-tuning neural machine translation models.
- Evaluation of multilingual LLMs on translation accuracy.
## Ethical Considerations
- **Bias:** Reflects biases inherent in the Sarvam-30B model.
- **Privacy:** No PII was intentionally included.
## Licensing
Licensed under Apache License 2.0.
Provider:
Phase-Technologies



