
ML-Jonibek/Uzb-Eng-Translation-1

Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ML-Jonibek/Uzb-Eng-Translation-1
Description:
---
language:
- en
- uz
license: cc0-1.0
task_categories:
- translation
pretty_name: English-Uzbek Translation Dataset (One Thousand and One Nights)
size_categories:
- 1K<n<10K
tags:
- translation
- english
- uzbek
- literary
- parallel-corpus
- one-thousand-and-one-nights
---

# 📖 English ↔ Uzbek Literary Translation Dataset

![Language Pair](https://img.shields.io/badge/Language%20Pair-EN%20↔%20UZ-blue?style=for-the-badge)
![Domain](https://img.shields.io/badge/Domain-Literary-purple?style=for-the-badge)
![Size](https://img.shields.io/badge/Size-~2200%20pairs-green?style=for-the-badge)
![License](https://img.shields.io/badge/License-CC0%201.0-orange?style=for-the-badge)

---

## 🌟 Dataset Summary

This dataset contains **parallel sentence pairs** in **English (EN)** and **Uzbek (UZ)**, extracted from the classic literary work *One Thousand and One Nights* (Ming bir kecha / مینگ بیر کیچه).

The dataset is suited to:

- 🤖 Training and fine-tuning **Neural Machine Translation (NMT)** models for the EN↔UZ language pair
- 📊 **Benchmarking** translation quality on literary text
- 🔍 **Linguistic research** on English–Uzbek language structure
- 📚 **Low-resource language** NLP development (Uzbek)

---

## 📂 Dataset Structure

```
Translation1_en_uz.jsonl
```

Each line is a **JSON object** with two fields:

| Field | Type   | Description                        |
|-------|--------|------------------------------------|
| `en`  | string | English source sentence/segment    |
| `uz`  | string | Uzbek translated sentence/segment  |

### 🔎 Example Entry

```json
{
  "en": "A good story belongs to the whole world.",
  "uz": "Yaxshi hikoya butun dunyoga tegishli."
}
```

```json
{
  "en": "The stories known as the Thousand and One Nights are very old.",
  "uz": "\"Ming bir kecha\" nomi bilan tanilgan hikoyalar juda qadimiy."
}
```

---

## 📊 Dataset Statistics

| Property            | Value                        |
|---------------------|------------------------------|
| **Format**          | JSONL                        |
| **Total pairs**     | ~2,245                       |
| **Source language** | English (en)                 |
| **Target language** | Uzbek (uz)                   |
| **Domain**          | Literary / Classic           |
| **Text type**       | Prose, narrative             |
| **Script (UZ)**     | Latin (Uzbek Latin alphabet) |

---

## 📖 Source Material

> 🏛️ *One Thousand and One Nights*: an ancient collection of Middle Eastern folk tales compiled during the Islamic Golden Age.

| Property        | Details                                                  |
|-----------------|----------------------------------------------------------|
| **Title**       | One Thousand and One Nights / Ming bir kecha             |
| **Origin**      | 9th century CE                                           |
| **Publisher**   | The Penn Publishing Company, Philadelphia, 1928          |
| **Source**      | [en.wikisource.org](https://en.wikisource.org)           |
| **Illustrator** | Virginia Frances Sterrett                                |
| **Download**    | [www.aliceandbooks.com](https://www.aliceandbooks.com)   |
| **License**     | Public Domain (CC0 1.0)                                  |

---

## 🚀 Loading the Dataset

### Using 🤗 HuggingFace Datasets

```python
from datasets import load_dataset

# Repository ID as shown on the dataset page
dataset = load_dataset("ML-Jonibek/Uzb-Eng-Translation-1", split="train")

print(dataset[0])
# {'en': 'A good story belongs to the whole world.',
#  'uz': 'Yaxshi hikoya butun dunyoga tegishli.'}
```

### Manual loading with Python

```python
import json

data = []
with open("Translation1_en_uz.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines
            data.append(json.loads(line))  # one JSON object per line

print(f"Total pairs: {len(data)}")
print(data[0])
```

---

## 🎯 Intended Use

### ✅ Suitable for

- Fine-tuning multilingual models (mBART, NLLB, mT5, etc.) on EN↔UZ
- Creating Uzbek language corpora for NLP research
- Training sequence-to-sequence translation models
- Evaluating BLEU/chrF scores on literary Uzbek text

### ⚠️ Limitations

- Text is **literary in style** and may not generalize well to news, technical, or conversational domains
- Uzbek translations may reflect **older translation conventions**
- Some segments are **sentence fragments** due to literary segmentation

---

## 💡 Example Use Case: Fine-tuning with `transformers`

The snippet loads a pretrained checkpoint and translates a single sentence. It is the starting point for fine-tuning, not a training loop.

```python
from transformers import MarianMTModel, MarianTokenizer

# Checkpoint to start from
model_name = "Helsinki-NLP/opus-mt-en-uz"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize a batch of English sentences and generate translations
texts = ["A good story belongs to the whole world."]
inputs = tokenizer(texts, return_tensors="pt", padding=True)
translated = model.generate(**inputs)

print(tokenizer.decode(translated[0], skip_special_tokens=True))
```

---

## 📜 License

This dataset is derived from **public domain** source material and is released under the **Creative Commons Zero (CC0 1.0)** license.

> You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

---

## 🙏 Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{translation1_en_uz_2024,
  title    = {English-Uzbek Literary Translation Dataset (One Thousand and One Nights)},
  language = {en, uz},
  source   = {One Thousand and One Nights, Penn Publishing Company, 1928},
  license  = {CC0 1.0 Universal},
  url      = {https://huggingface.co/datasets/your-username/Translation1_en_uz}
}
```

---

## 🤝 Contributing

Contributions, corrections, and improvements are welcome! Feel free to open an issue or pull request on the dataset repository.

---

<p align="center"> Made with ❤️ for the Uzbek NLP community 🇺🇿 </p>
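The dataset card describes one JSON object per line with string fields `en` and `uz`, notes that some segments are fragments, and mentions only a `train` split. A minimal sketch of checking records against that two-field schema and holding out a validation slice might look like the following. The helper names `parse_pairs` and `split_pairs`, the 10% validation fraction, and the seed are illustrative assumptions, not part of the dataset or its tooling.

```python
import json
import random


def parse_pairs(lines):
    """Parse JSONL lines into (en, uz) tuples, skipping malformed records."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # the card describes one JSON object per line
        en, uz = record.get("en"), record.get("uz")
        # Keep only records where both fields are non-empty strings.
        if isinstance(en, str) and isinstance(uz, str) and en.strip() and uz.strip():
            pairs.append((en.strip(), uz.strip()))
    return pairs


def split_pairs(pairs, valid_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a validation slice.

    valid_fraction and seed are hypothetical defaults chosen for illustration.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one pair whenever there is any data at all.
    n_valid = max(1, int(len(shuffled) * valid_fraction)) if shuffled else 0
    return shuffled[n_valid:], shuffled[:n_valid]


if __name__ == "__main__":
    # Sample lines built from the two example entries in the dataset card.
    sample = [
        json.dumps({"en": "A good story belongs to the whole world.",
                    "uz": "Yaxshi hikoya butun dunyoga tegishli."},
                   ensure_ascii=False),
        json.dumps({"en": "The stories known as the Thousand and One Nights are very old.",
                    "uz": "\"Ming bir kecha\" nomi bilan tanilgan hikoyalar juda qadimiy."},
                   ensure_ascii=False),
        "not json",
    ]
    pairs = parse_pairs(sample)
    train, valid = split_pairs(pairs)
    print(f"valid pairs: {len(pairs)}, train: {len(train)}, validation: {len(valid)}")
```

To run this against the real file, pass an open file handle for `Translation1_en_uz.jsonl` (opened with `encoding="utf-8"`) to `parse_pairs`. A fixed seed keeps the held-out slice stable across runs, which matters when comparing BLEU/chrF scores between fine-tuning experiments.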
Provider:
ML-Jonibek