
SyedHarshath/english-tamil

Hugging Face | Updated 2026-04-08 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SyedHarshath/english-tamil
Description:
---
license: mit
language:
- en
- ta
task_categories:
- translation
pretty_name: English-Tamil Parallel Dataset
size_categories:
- 1M<n<10M
tags:
- translation
- english-tamil
- seq2seq
- parallel-corpus
- low-resource
- indian-languages
---

# English to Tamil Parallel Dataset

This is a large-scale English–Tamil parallel dataset for sequence-to-sequence tasks such as machine translation, fine-tuning LLMs, and language modeling.

---

## 📊 Dataset Details

- **Total Rows**: 5,264,867
- **Features**:
  - `en` – English sentence
  - `ta` – Tamil translation
- **Language Pair**: English → Tamil
- **Format**: Hugging Face `datasets.Dataset` and CSV
- **Size**: ~1.5 GB (CSV)

---

## 💡 Use Cases

This dataset is suited to:

- Fine-tuning translation models (e.g., T5, MarianMT, mBART)
- Evaluating English–Tamil machine translation (MT) systems
- Building bilingual educational tools
- Research on low-resource and Indian languages
- Language modeling, alignment, and embeddings

---

## 📂 To Use This Dataset

```python
from datasets import load_dataset

# Repository ID as listed on this page
dataset = load_dataset("SyedHarshath/english-tamil", split="train")

print(dataset)
# Output:
# Dataset({
#     features: ['en', 'ta'],
#     num_rows: 5264867
# })
```
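
Before fine-tuning on a corpus of this size, it is common to drop empty or badly misaligned pairs and to hold out a small validation split. The following is a minimal sketch of that step, assuming the CSV export has a header row with `en` and `ta` columns as listed under Features. The helper names (`iter_pairs`, `is_plausible_pair`, `split_pairs`), the length-ratio threshold, and the sample rows are illustrative assumptions, not part of the dataset or of any library.

```python
import csv
import io
import zlib


def iter_pairs(csv_file):
    """Yield (en, ta) tuples from a CSV file object with 'en' and 'ta' columns."""
    for row in csv.DictReader(csv_file):
        yield (row.get("en") or "").strip(), (row.get("ta") or "").strip()


def is_plausible_pair(en, ta, max_ratio=3.0):
    """Heuristic filter: both sides non-empty and character lengths within max_ratio.

    The default threshold is an assumption chosen for illustration.
    """
    if not en or not ta:
        return False
    longer, shorter = max(len(en), len(ta)), min(len(en), len(ta))
    return longer / shorter <= max_ratio


def split_pairs(pairs, valid_percent=1):
    """Route each pair to train or validation using a CRC32 hash of the English side.

    Hashing keeps the split deterministic across runs without storing indices.
    """
    train, valid = [], []
    for en, ta in pairs:
        bucket = zlib.crc32(en.encode("utf-8")) % 100
        (valid if bucket < valid_percent else train).append((en, ta))
    return train, valid


# Tiny in-memory sample standing in for the real CSV (hypothetical rows).
sample_csv = io.StringIO(
    "en,ta\n"
    "Hello,வணக்கம்\n"
    "Thank you,நன்றி\n"
    ",காலி\n"
    "Yes,\n"
)

clean = [pair for pair in iter_pairs(sample_csv) if is_plausible_pair(*pair)]
train, valid = split_pairs(clean)
```

For the real file, the same functions could be fed from `open(path, newline="", encoding="utf-8")` in place of the in-memory sample. A character-length ratio is a crude alignment check, so the threshold would need tuning against a manually inspected sample.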
Provider:
SyedHarshath