ML-Jonibek/English-Uzbek-Translation-1
Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ML-Jonibek/English-Uzbek-Translation-1
Resource description:
---
language:
- en
- uz
task_categories:
- translation
task_ids:
- text2text-generation
multilinguality:
- translation
pretty_name: English-Uzbek Translation Dataset
size_categories:
- 10K<n<100K
tags:
- translation
- english
- uzbek
- NLP
- parallel-corpus
- low-resource
---
# 🌐 English–Uzbek Translation Dataset
A parallel corpus for **English ↔ Uzbek** machine translation, curated to support research and development of NLP models for the Uzbek language — one of the most underrepresented Turkic languages in open-source datasets.
---
## 📖 Dataset Description
This dataset contains aligned sentence pairs in **English** and **Uzbek**, designed for training, fine-tuning, and evaluating neural machine translation (NMT) models. The dataset aims to bridge the resource gap for Uzbek in the NLP community.
- **Languages:** English (`en`) → Uzbek (`uz`)
- **Task:** Machine Translation (seq2seq / text-to-text)
- **Script:** Latin (the official Uzbek alphabet; Cyrillic also remains in common use)
---
## 📊 Dataset Structure
### Data Fields
| Field | Type | Description |
|-------------|--------|------------------------------------|
| `en` | string | Source sentence in English |
| `uz` | string | Target sentence in Uzbek |
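
A quick check against the schema above can catch malformed rows before training. The sketch below assumes each record is a plain dict with the two string fields from the table; `is_valid_pair` is a hypothetical helper name, not part of any dataset tooling.

```python
def is_valid_pair(record):
    """Return True if the record has non-empty string `en` and `uz` fields."""
    for field in ("en", "uz"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    return True


# Illustrative records (made up for this example, not taken from the dataset).
records = [
    {"en": "Good morning.", "uz": "Xayrli tong."},
    {"en": "Thank you.", "uz": ""},  # empty target
    {"en": "Hello."},                # missing `uz` field
]
valid = [r for r in records if is_valid_pair(r)]
print(len(valid))
```

Only the first record passes, since the other two have an empty or missing `uz` field.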
### Data Splits
| Split | Examples |
|--------------|----------|
| `train`      | ~25k     |
> 📝 *Exact counts will be updated after final data versioning.*
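
Because only a `train` split is listed, evaluation requires holding out part of it. The following is a minimal sketch of a deterministic hold-out split using only the standard library. The 5% fraction, the seed, and the function name `train_validation_split` are assumptions chosen for illustration.

```python
import random


def train_validation_split(pairs, validation_fraction=0.05, seed=42):
    """Shuffle a copy of `pairs` deterministically and split off a validation set."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder pairs standing in for the real rows.
pairs = [{"en": f"sentence {i}", "uz": f"gap {i}"} for i in range(100)]
train_pairs, validation_pairs = train_validation_split(pairs)
print(len(train_pairs), len(validation_pairs))
```

When working with a 🤗 Datasets object directly, `dataset["train"].train_test_split(test_size=0.05, seed=42)` should give a comparable result.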
---
## 🚀 Usage
### Load with 🤗 Datasets
```python
from datasets import load_dataset

dataset = load_dataset("ML-Jonibek/English-Uzbek-Translation-1")

# Access the training split
train_data = dataset["train"]
```
### Fine-tune a Translation Model
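```python
from transformers import MarianMTModel, MarianTokenizer

# Pretrained checkpoint to fine-tune from (or any other seq2seq model)
model_name = "Helsinki-NLP/opus-mt-en-trk"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Sanity check: translate one sentence with the base model before fine-tuning.
# Multi-target OPUS-MT checkpoints expect a target-language token; `>>uzb_Latn<<`
# is assumed here, so confirm it against the model card.
inputs = tokenizer(">>uzb_Latn<< Hello world", return_tensors="pt", padding=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```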
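
The snippet above only runs inference with the pretrained checkpoint. Before fine-tuning, each pair has to be tokenized into model inputs and labels. Here is a minimal sketch of that step, assuming a Hugging Face tokenizer that accepts the `text_target` argument and a checkpoint that expects a target-language prefix. The `>>uzb_Latn<<` token is an assumption to verify against the model card, and `preprocess_batch` is a hypothetical helper name.

```python
def preprocess_batch(batch, tokenizer, lang_token=">>uzb_Latn<<", max_length=128):
    """Tokenize a batch of en/uz pairs into inputs and labels for seq2seq training.

    `batch` is a dict of lists with `en` and `uz` keys, which is the shape
    passed to the function by `Dataset.map(..., batched=True)`.
    """
    # Prefix each source sentence with the (assumed) target-language token.
    sources = [f"{lang_token} {text}" if lang_token else text for text in batch["en"]]
    # `text_target` asks the tokenizer to return the target ids under `labels`.
    return tokenizer(
        sources,
        text_target=batch["uz"],
        max_length=max_length,
        truncation=True,
    )
```

It could be applied with `dataset["train"].map(lambda b: preprocess_batch(b, tokenizer), batched=True, remove_columns=["en", "uz"])`, and the result passed to `Seq2SeqTrainer` together with `DataCollatorForSeq2Seq`. For checkpoints without language tokens, pass `lang_token=""`.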
---
## 🌍 Why Uzbek?
Uzbek is spoken by over **35 million people** worldwide, yet it remains severely underrepresented in machine translation research. Quality parallel corpora for Uzbek are rare, making this dataset a valuable contribution to:
- Neural Machine Translation (NMT)
- Cross-lingual transfer learning
- Uzbek language model development
- Multilingual NLP benchmarks
---
## 📁 Source & Data Collection
The sentence pairs were collected and curated from publicly available sources. The dataset focuses on:
- General-purpose conversational text
- News and informational content
- Everyday language and expressions
---
## ⚙️ Preprocessing
- Sentences were deduplicated
- Encoding normalized to UTF-8
- Whitespace and punctuation cleaned
- Aligned pairs verified for translation quality
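
The card does not include the cleaning script, so the following is only an illustrative sketch of the first three steps under simple assumptions: records are dicts with `en` and `uz` keys, exact-match deduplication on the cleaned pair is sufficient, and Unicode NFC normalization stands in for the encoding step (Python strings are already decoded). The function names are hypothetical.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Apply NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return _WHITESPACE.sub(" ", text).strip()


def clean_and_deduplicate(records):
    """Clean both sides of each pair and drop empty or exactly repeated pairs."""
    seen = set()
    cleaned = []
    for record in records:
        pair = (clean_text(record["en"]), clean_text(record["uz"]))
        if not all(pair) or pair in seen:
            continue
        seen.add(pair)
        cleaned.append({"en": pair[0], "uz": pair[1]})
    return cleaned


# Illustrative input (made up for this example).
raw = [
    {"en": "Hello,  world!", "uz": "Salom, dunyo!"},
    {"en": " Hello, world! ", "uz": "Salom,\tdunyo!"},  # duplicate after cleaning
    {"en": "How are you?", "uz": "Qalaysiz?"},
]
print(clean_and_deduplicate(raw))
```

The second record collapses to the same pair as the first once whitespace is normalized, so two records remain. Verifying translation quality, the fourth step, is not covered here.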
---
## 📈 Intended Uses
| Use Case | Suitable? |
|---|---|
| Training NMT models | ✅ |
| Fine-tuning multilingual models (mBART, NLLB, M2M-100) | ✅ |
| Evaluation / benchmarking | ✅ |
| Cross-lingual embeddings | ✅ |
---
## ⚠️ Limitations
- The dataset may not cover all domains equally (e.g., legal, medical).
- Dialectal variations in Uzbek (Tashkent vs. regional dialects) may not be fully represented.
- Machine-translated or auto-aligned samples (if any) may contain noise.
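
One common heuristic for reducing the noise mentioned in the last point is to drop pairs whose source and target lengths differ implausibly. A minimal sketch, assuming character-length ratio is an acceptable proxy and treating the threshold of 3.0 as an arbitrary starting point to tune. Nothing in this card indicates that such a filter was applied to the dataset.

```python
def plausible_length_ratio(record, max_ratio=3.0):
    """Return True if neither side is more than `max_ratio` times longer than the other."""
    len_en = len(record["en"].strip())
    len_uz = len(record["uz"].strip())
    if len_en == 0 or len_uz == 0:
        return False
    return max(len_en, len_uz) / min(len_en, len_uz) <= max_ratio


# Illustrative pairs (made up for this example).
pairs = [
    {"en": "Where is the library?", "uz": "Kutubxona qayerda?"},
    {"en": "Yes.", "uz": "Bu juda uzun va mos kelmaydigan tarjima matni."},
]
filtered = [p for p in pairs if plausible_length_ratio(p)]
print(len(filtered))
```

The second pair is rejected because the Uzbek side is far longer than the four-character English side. Very short sentences can trip this check legitimately, so it is worth inspecting what gets dropped.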
---
## 🙏 Citation
If you use this dataset in your research, please cite it as:
```bibtex
@dataset{english-uzbek-dataset,
author = {ML-Jonibek},
title = {English–Uzbek Translation Dataset},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ML-Jonibek/English-Uzbek-Translation-1}
}
```
---
## 👤 Author
Created and maintained by **[ML-Jonibek](https://huggingface.co/ML-Jonibek)**.
Contributions, corrections, and feedback are welcome via the [community tab](https://huggingface.co/datasets/ML-Jonibek/English-Uzbek-Translation-1/discussions).
---
*Made with ❤️ for the Uzbek NLP community*
Provided by:
ML-Jonibek



