ML-Jonibek/Uzb-Eng-Translation-1
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ML-Jonibek/Uzb-Eng-Translation-1
Description:
---
language:
- en
- uz
license: cc0-1.0
task_categories:
- translation
pretty_name: English-Uzbek Translation Dataset (One Thousand and One Nights)
size_categories:
- 1K<n<10K
tags:
- translation
- english
- uzbek
- literary
- parallel-corpus
- one-thousand-and-one-nights
---
# 📖 English ↔ Uzbek Literary Translation Dataset
   
---
## 🌟 Dataset Summary
This dataset contains **parallel sentence pairs** translated between **English (EN)** and **Uzbek (UZ)**, extracted from the classic literary work *One Thousand and One Nights* (Ming bir kecha / مینگ بیر کیچه).
The dataset is ideal for:
- 🤖 Training and fine-tuning **Neural Machine Translation (NMT)** models for the EN↔UZ language pair
- 📊 **Benchmarking** translation quality on literary text
- 🔍 **Linguistic research** on English–Uzbek language structure
- 📚 **Low-resource language** NLP development (Uzbek)
---
## 📂 Dataset Structure
```
Translation1_en_uz.jsonl
```
Each line is a **JSON object** with two fields:
| Field | Type | Description |
|-------|--------|------------------------------------|
| `en` | string | English source sentence/segment |
| `uz` | string | Uzbek translated sentence/segment |
### 🔎 Example Entry
```json
{
  "en": "A good story belongs to the whole world.",
  "uz": "Yaxshi hikoya butun dunyoga tegishli."
}
```
```json
{
  "en": "The stories known as the Thousand and One Nights are very old.",
  "uz": "\"Ming bir kecha\" nomi bilan tanilgan hikoyalar juda qadimiy."
}
```
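
Because every line is expected to carry exactly the `en` and `uz` string fields described above, a quick schema check can catch malformed lines before training. The following is a minimal sketch; the helper name `validate_line` and the rule that empty strings are rejected are assumptions made for illustration, not part of the dataset's tooling.

```python
import json

REQUIRED_FIELDS = ("en", "uz")  # field names taken from the structure table


def validate_line(line: str) -> dict:
    """Parse one JSONL line and check that `en` and `uz` are non-empty strings.

    Hypothetical helper: rejecting empty strings is an assumption, not a
    documented property of the dataset.
    """
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field!r}")
    return record


# One line in the same shape as the example entries in this card.
sample = (
    '{"en": "A good story belongs to the whole world.", '
    '"uz": "Yaxshi hikoya butun dunyoga tegishli."}'
)
record = validate_line(sample)
print(record["uz"])
```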
---
## 📊 Dataset Statistics
| Property | Value |
|---------------------|--------------------|
| **Format** | JSONL |
| **Total pairs** | ~2,245 |
| **Source language** | English (en) |
| **Target language** | Uzbek (uz) |
| **Domain** | Literary / Classic |
| **Text type** | Prose, narrative |
| **Script (UZ)** | Latin (Uzbek Latin alphabet) |
---
## 📖 Source Material
> 🏛️ *One Thousand and One Nights* — an ancient collection of Middle Eastern folk tales compiled during the Islamic Golden Age.
| Property | Details |
|--------------|----------------------------------------------------------|
| **Title** | One Thousand and One Nights / Ming bir kecha |
| **Origin** | 9th century CE |
| **Publisher**| The Penn Publishing Company, Philadelphia, 1928 |
| **Source** | [en.wikisource.org](https://en.wikisource.org) |
| **Illustrator** | Virginia Frances Sterrett |
| **Download** | [www.aliceandbooks.com](https://www.aliceandbooks.com) |
| **License** | Public Domain (CC0 1.0) |
---
## 🚀 Loading the Dataset
### Using 🤗 Hugging Face Datasets
```python
from datasets import load_dataset

# Repository ID as listed in the download link for this dataset.
dataset = load_dataset("ML-Jonibek/Uzb-Eng-Translation-1", split="train")
print(dataset[0])
# {'en': 'A good story belongs to the whole world.',
# 'uz': 'Yaxshi hikoya butun dunyoga tegishli.'}
```
### Manual loading with Python
```python
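import json

# Read one JSON object per line into a list of {"en": ..., "uz": ...} dicts.
data = []
with open("Translation1_en_uz.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            data.append(json.loads(line))

print(f"Total pairs: {len(data)}")
print(data[0])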
```
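
The file ships as a single set of pairs, so a held-out split has to be made by the user. The sketch below shows one reproducible way to do that with the standard library. The function name `split_pairs`, the 80/10/10 proportions, and the seed are all assumptions, and the demo list is stand-in data rather than real dataset content. Note that neighbouring sentences come from the same stories, so a random split may place closely related text in both train and test.

```python
import random


def split_pairs(pairs, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle with a fixed seed and cut into train/validation/test lists.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    shuffled = list(pairs)                 # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in data; in practice pass the `data` list loaded from the JSONL file.
demo = [{"en": f"sentence {i}", "uz": f"gap {i}"} for i in range(100)]
train, val, test = split_pairs(demo)
print(len(train), len(val), len(test))  # 80 10 10
```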
---
## 🎯 Intended Use
### ✅ Suitable for
- Fine-tuning multilingual models (mBART, NLLB, mT5, etc.) on EN↔UZ
- Creating Uzbek language corpora for NLP research
- Training sequence-to-sequence translation models
- Evaluating BLEU/chrF scores on literary Uzbek text
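
To make the evaluation item concrete, here is a simplified character n-gram F-score in the spirit of chrF, written with the standard library only. It is a sketch under stated assumptions: whitespace is dropped, precision and recall are averaged over n-gram orders 1 to 6, and beta is 2. It will not reproduce the numbers of an established implementation such as sacreBLEU, which should be preferred for reported results. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1]; not the official metric."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "Yaxshi hikoya butun dunyoga tegishli."
print(simple_chrf(reference, reference))        # identical strings score 1.0
print(simple_chrf("Yaxshi hikoya", reference))  # partial match scores between 0 and 1
```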
### ⚠️ Limitations
- Text is **literary in style** — may not generalize well to news, technical, or conversational domains
- Uzbek translations may reflect **older translation conventions**
- Some segments are **sentence fragments** due to literary segmentation
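
Since some segments are fragments, a common cleaning step for parallel data is to drop very short pairs and pairs whose lengths differ sharply. The sketch below illustrates that heuristic. The function name `looks_clean` and both thresholds are assumptions that would need tuning on the real data, especially because Uzbek often uses fewer words than English for the same content. The fragment pair is made up for illustration.

```python
def looks_clean(pair: dict, min_words: int = 3, max_ratio: float = 2.5) -> bool:
    """Heuristic filter: keep pairs that are long enough and similar in length.

    Hypothetical helper; `min_words` and `max_ratio` are illustrative values.
    """
    en_len = len(pair["en"].split())
    uz_len = len(pair["uz"].split())
    if en_len < min_words or uz_len < min_words:
        return False  # likely a fragment
    ratio = max(en_len, uz_len) / min(en_len, uz_len)
    return ratio <= max_ratio


pairs = [
    # Example entry from this card (8 English words, 5 Uzbek words).
    {"en": "A good story belongs to the whole world.",
     "uz": "Yaxshi hikoya butun dunyoga tegishli."},
    # Made-up fragment, used only to show the filter rejecting short segments.
    {"en": "And then", "uz": "Va keyin"},
]
kept = [p for p in pairs if looks_clean(p)]
print(len(kept))  # 1
```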
---
## 💡 Example Use Case: Translating with a Pretrained `transformers` Model
```python
from transformers import MarianMTModel, MarianTokenizer

# Checkpoint name as given in this card; confirm it is available on the Hub before running.
model_name = "Helsinki-NLP/opus-mt-en-uz"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate one English sentence with the pretrained model (inference only, no fine-tuning).
texts = ["A good story belongs to the whole world."]
inputs = tokenizer(texts, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```
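
Before fine-tuning, the flat `en`/`uz` records may need reshaping. Several Hugging Face translation datasets and example scripts use a nested `translation` field keyed by language code, so the sketch below converts pairs into that layout. Whether a given training script needs this layout is an assumption to check against that script, and the helper name `to_translation_record` is hypothetical.

```python
import json


def to_translation_record(pair: dict) -> dict:
    """Wrap a flat {"en", "uz"} pair in a nested `translation` field."""
    return {"translation": {"en": pair["en"], "uz": pair["uz"]}}


# Example entry from this card; in practice iterate over the loaded JSONL data.
pairs = [
    {"en": "A good story belongs to the whole world.",
     "uz": "Yaxshi hikoya butun dunyoga tegishli."},
]

# ensure_ascii=False keeps non-ASCII characters readable in the output file.
lines = [json.dumps(to_translation_record(p), ensure_ascii=False) for p in pairs]
print(lines[0])
```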
---
## 📜 License
This dataset is derived from **public domain** source material and is released under the **Creative Commons Zero (CC0 1.0)** license.
> You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
---
## 🙏 Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{translation1_en_uz_2024,
  title    = {English-Uzbek Literary Translation Dataset (One Thousand and One Nights)},
  language = {en, uz},
  source   = {One Thousand and One Nights, Penn Publishing Company, 1928},
  license  = {CC0 1.0 Universal},
  url      = {https://huggingface.co/datasets/ML-Jonibek/Uzb-Eng-Translation-1}
}
```
---
## 🤝 Contributing
Contributions, corrections, and improvements are welcome!
Feel free to open an issue or pull request on the dataset repository.
---
<p align="center">
Made with ❤️ for the Uzbek NLP community 🇺🇿
</p>
Provider:
ML-Jonibek



