SyedHarshath/english-tamil
Source: Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SyedHarshath/english-tamil
Resource description:
---
license: mit
language:
- en
- ta
task_categories:
- translation
pretty_name: English-Tamil Parallel Dataset
size_categories:
- 1M<n<10M
tags:
- translation
- english-tamil
- seq2seq
- parallel-corpus
- low-resource
- indian-languages
---
# English to Tamil Parallel Dataset
This is a large-scale English–Tamil parallel corpus of about 5.26 million sentence pairs, intended for machine translation, LLM fine-tuning, and language modeling.
---
## 📊 Dataset Details
- **Total Rows**: 5,264,867
- **Features**:
  - `en` – English sentence
  - `ta` – Tamil translation
- **Language Pair**: English → Tamil
- **Format**: Hugging Face `datasets.Dataset` and CSV
- **Size**: ~1.5 GB (CSV)
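
Since the CSV export is around 1.5 GB, it may be more practical to read it row by row than to load it all at once. The sketch below uses only the standard library and assumes the file has a header row with the two column names `en` and `ta`; the card does not state the exact CSV layout, so treat that as an assumption. The helper name `iter_pairs` and the sample rows are hypothetical. For the real file, the in-memory sample would be replaced by something like `open(path, encoding="utf-8", newline="")`.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def iter_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (en, ta) pairs one row at a time, skipping rows with an empty side."""
    # Assumption: the first CSV row is a header containing "en" and "ta".
    for row in csv.DictReader(handle):
        en = (row.get("en") or "").strip()
        ta = (row.get("ta") or "").strip()
        if en and ta:
            yield en, ta


# Tiny in-memory stand-in for the real file (hypothetical rows).
sample = io.StringIO(
    "en,ta\n"
    '"Hello","வணக்கம்"\n'
    '"Thank you","நன்றி"\n'
    '"Missing",\n'  # no Tamil side, so this row is skipped
)
pairs = list(iter_pairs(sample))
print(len(pairs))
```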
---
## 💡 Use Cases
This dataset is ideal for:
- Fine-tuning translation models (e.g., T5, MarianMT, mBART)
- Evaluating English–Tamil machine translation (MT) systems
- Building bilingual educational tools
- Research on low-resource and Indian languages
- Language modeling, alignment, and embeddings
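
For the fine-tuning and evaluation use cases above, large parallel corpora are often filtered before training. The following is a minimal sketch of one common approach (dropping empty sides, overly long sentences, badly mismatched lengths, and duplicates); it is not a procedure described by the dataset authors. The function name `filter_pairs` and the thresholds `max_ratio` and `max_chars` are hypothetical and would need tuning on the real data.

```python
from typing import Iterable, Iterator, Tuple


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 3.0,  # hypothetical threshold, tune on real data
    max_chars: int = 500,    # hypothetical cap on sentence length
) -> Iterator[Tuple[str, str]]:
    """Yield cleaned (en, ta) pairs that pass simple quality checks."""
    seen = set()
    for en, ta in pairs:
        en, ta = en.strip(), ta.strip()
        if not en or not ta:
            continue  # one side is empty
        if len(en) > max_chars or len(ta) > max_chars:
            continue  # too long for a typical training window
        # len() counts code points, so Tamil combining marks inflate the
        # Tamil side somewhat; the ratio threshold is deliberately loose.
        ratio = max(len(en), len(ta)) / min(len(en), len(ta))
        if ratio > max_ratio:
            continue  # lengths differ too much to be a likely translation
        key = (en.lower(), ta)
        if key in seen:
            continue  # duplicate after case-folding the English side
        seen.add(key)
        yield en, ta


# Hypothetical sample rows, not taken from the dataset.
raw = [
    ("Hello", "வணக்கம்"),
    ("hello", "வணக்கம்"),
    ("Thank you", ""),
    ("Yes", "இது மிக நீண்ட ஒரு வாக்கியம் ஆகும்"),
    ("Thank you", "நன்றி"),
]
clean = list(filter_pairs(raw))
print(len(clean))
```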
---
## 📂 To Use This Dataset
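```python
from datasets import load_dataset

# Repository ID as given in the original card; this copy is listed as
# SyedHarshath/english-tamil, so substitute that ID if this one does not resolve.
dataset = load_dataset("gopi30/english-tamil", split="train")
print(dataset)
# Output:
# Dataset({
#     features: ['en', 'ta'],
#     num_rows: 5264867
# })
```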
Provider: SyedHarshath



