
ML-Jonibek/English-Uzbek-Translation-1

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ML-Jonibek/English-Uzbek-Translation-1
Description:
---
language:
- en
- uz
task_categories:
- translation
task_ids:
- text2text-generation
multilinguality:
- translation
pretty_name: English-Uzbek Translation Dataset
size_categories:
- 10K<n<100K
tags:
- translation
- english
- uzbek
- NLP
- parallel-corpus
- low-resource
---

# 🌐 English–Uzbek Translation Dataset

A parallel corpus for **English ↔ Uzbek** machine translation, curated to support research and development of NLP models for the Uzbek language, one of the most underrepresented Turkic languages in open-source datasets.

---

## 📖 Dataset Description

This dataset contains aligned sentence pairs in **English** and **Uzbek**, designed for training, fine-tuning, and evaluating neural machine translation (NMT) models. It aims to narrow the resource gap for Uzbek in the NLP community.

- **Languages:** English (`en`) → Uzbek (`uz`)
- **Task:** Machine Translation (seq2seq / text-to-text)
- **Script:** Latin (Uzbek uses the reformed Latin alphabet)

---

## 📊 Dataset Structure

### Data Fields

| Field | Type   | Description                |
|-------|--------|----------------------------|
| `en`  | string | Source sentence in English |
| `uz`  | string | Target sentence in Uzbek   |

### Data Splits

| Split   | Examples |
|---------|----------|
| `train` | 25k      |

> 📝 *Exact counts will be updated after final data versioning.*

---

## 🚀 Usage

### Load with 🤗 Datasets

```python
from datasets import load_dataset

dataset = load_dataset("ML-Jonibek/English-Uzbek-Translation-1")

# Access the training split
train_data = dataset["train"]
```

### Fine-tune a Translation Model

The snippet below loads a pretrained checkpoint and translates one sentence. It is a starting point for fine-tuning, not a training loop.

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-trk"  # or any seq2seq model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Assumption: multilingual OPUS-MT checkpoints expect a target-language
# token at the start of the input (here `>>uzb_Latn<<`); confirm the
# supported tokens on the model card.
text = ">>uzb_Latn<< Hello world"

# Tokenize, generate, and decode the translation
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```

---

## 🌍 Why Uzbek?

Uzbek is spoken by over **35 million people** worldwide, yet it remains severely underrepresented in machine translation research. Quality parallel corpora for Uzbek are rare, which makes this dataset a useful resource for:

- Neural Machine Translation (NMT)
- Cross-lingual transfer learning
- Uzbek language model development
- Multilingual NLP benchmarks

---

## 📁 Source & Data Collection

The sentence pairs were collected and curated from publicly available sources. The dataset focuses on:

- General-purpose conversational text
- News and informational content
- Everyday language and expressions

---

## ⚙️ Preprocessing

- Sentences were deduplicated
- Encoding was normalized to UTF-8
- Whitespace and punctuation were cleaned
- Aligned pairs were verified for translation quality

---

## 📈 Intended Uses

| Use Case | Suitable? |
|---|---|
| Training NMT models | ✅ |
| Fine-tuning multilingual models (mBART, NLLB, M2M-100) | ✅ |
| Evaluation / benchmarking | ✅ |
| Cross-lingual embeddings | ✅ |

---

## ⚠️ Limitations

- The dataset may not cover all domains equally (e.g., legal, medical).
- Dialectal variations in Uzbek (Tashkent vs. regional dialects) may not be fully represented.
- Machine-translated or auto-aligned samples (if any) may contain noise.

---

## 🙏 Citation

If you use this dataset in your research, please cite it as:

```bibtex
@dataset{english-uzbek-dataset,
  author    = {ML-Jonibek},
  title     = {English–Uzbek Translation Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ML-Jonibek/English-Uzbek-Translation-1}
}
```

---

## 👤 Author

Created and maintained by **[ML-Jonibek](https://huggingface.co/ML-Jonibek)**. Contributions, corrections, and feedback are welcome via the [community tab](https://huggingface.co/datasets/ML-Jonibek/English-Uzbek-Translation-1/discussions).

---

*Made with ❤️ for the Uzbek NLP community*
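
The preprocessing steps listed in the card (deduplication, encoding normalization, whitespace cleanup) could be approximated as in the sketch below. This is not the author's actual pipeline, which the card does not describe in detail. The function names `clean_text` and `clean_pairs`, the choice of NFC normalization, and the sample rows are all assumptions made for illustration.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs):
    """Clean each (en, uz) pair, drop pairs with an empty side,
    and remove exact duplicates (hypothetical helper)."""
    seen = set()
    cleaned = []
    for en, uz in pairs:
        pair = (clean_text(en), clean_text(uz))
        if not pair[0] or not pair[1]:
            continue  # skip pairs where either side is empty
        if pair in seen:
            continue  # skip exact duplicates after cleaning
        seen.add(pair)
        cleaned.append({"en": pair[0], "uz": pair[1]})
    return cleaned


# Hypothetical sample rows, not taken from the dataset.
sample = [
    ("Hello  world", "Salom dunyo"),
    ("Hello world ", "Salom dunyo"),
    ("", "Rahmat"),
]
result = clean_pairs(sample)
print(result)
```

The output dictionaries use the same `en` and `uz` keys as the dataset's fields, so a similar filter could be applied to rows loaded from the `train` split before training.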
Provider:
ML-Jonibek