freococo/huggingface_myanmar_english_translation

Name: freococo/huggingface_myanmar_english_translation
Creator: freococo
Published: 2026-02-12 09:18:20
License: No description provided

Source: Hugging Face. Updated 2026-02-12. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/freococo/huggingface_myanmar_english_translation


Dataset description:

---
language:
- my
- en
license: cc0-1.0
task_categories:
- translation
- text-generation
- text-classification
source_datasets:
- HuggingFaceFW/finetranslations
tags:
- myanmar
- burmese
- unicode
- zawgyi-converted
- clean
size_categories:
- 1M<n<10M
pretty_name: Cleaned & Sorted Myanmar-English Translation
---

# Cleaned & Sorted Myanmar-English Translation Dataset

This dataset is a cleaned, Unicode-normalized, and sorted version of the **Myanmar (Burmese)** subset of the [FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) dataset.

Myanmar text on the web is often a mix of standard **Unicode** and the non-standard **Zawgyi** encoding, and the original subset inherits that mix. This repository fixes those encoding issues to provide a higher-quality dataset for NLP tasks.

## Key Improvements in this Version

1. **Zawgyi Detection & Conversion**:
   - We analyzed the original `mya_Mymr` subset.
   - We detected rows that use the **Zawgyi** font encoding (as well as corrupted PDF text) with a probability model and regex filters.
   - We converted all detected Zawgyi text into standard **Unicode**.

2. **Sorting**:
   - The dataset is sorted alphabetically (dictionary order: က-အ).
   - It is further sorted by text length, which makes it useful for curriculum learning.

3. **Format**:
   - Converted to **Parquet** format for fast loading and low storage size.
   - Columns renamed for simplicity: `id`, `myanmar`, `english`.

## Dataset Structure

The dataset contains the following fields:

- **id**: The unique identifier (from the original source).
- **myanmar**: The source text in Myanmar (Burmese), guaranteed to be **Unicode**.
- **english**: The English translation (generated by Gemma3 27B in the original dataset).

## Data Sample

Below is an example of a single row from the dataset. The text has been cleaned, converted to Unicode, and standardized.

```JSON
{"id": "<urn:uuid:d1a31806-2ca6-4e6c-a0cb-c0e6892be821>", "myanmar": "ကကတစ်\nကကတစ်သည် မြန်မာနိုင်ငံ ကမ်းရိုးတစ်လျှောက်၌ တွေ့ရသော အရေးကြီးသည့် စားငါး တစ်မျိုးဖြစ်၍ လူသိများသည်။ ဤငါးသည် ပါဏဗေဒ အလိုအရ ငတောက်တူ၊ ကသမြင်း၊ ငစင်စပ် စသည်တို့နှင့်အတူ ပါစီဒို မျိုးရင်းတွင် ပါဝင်သည်။ ကကတစ်၏ ကိုယ်သည် ပြား၍ အနည်းငယ် ရှည်လျားသည်။ နှာတံတို၍ အောက်မေးရှေ့သို့ငေါထွက်နေသည်။ ခေါင်း၏ ဘေး တစ်ဖက်တစ်ချက်တွင် ကြီးမားသော မျက်လုံးနှစ်လုံးရှိသည်။ ကျောဆူးတောင်နှစ်ခုပါ၍ ရှေ့ဆူးတောင်တွင် မာကျောသော ဆူးရိုးများပါရှိသည်။ အမြီးမှာ ထိပ်မျိုးဖြစ်၏။ အကြေးများမှာ မကြီးမငယ်ဖြစ်၍ အနားကြမ်းအကြေးမျိုး ဖြစ်သည်။ ဤငါး၏ အရောင် ဝမ်းပိုက်တစ်လျှောက်တွင် အဖြူရောင် သန်းနေသည်။ ကကတစ်သည် သာမန်အားဖြင့် တစ်ပေခွဲသာရှိသော်လည်း၊ တစ်ခါတစ်ရံ အရှည် ၅ ပေအလေးချိန် ၅၅ ပိသာခန့်အထိ ရှိနိုင်သည်။ ဤငါး၏ စည်ဖောင်းများမှ အတော်အသင့် ကောင်းမွန်သော ငါး စည်ဖောင်းကော်ကို ပြုလုပ်ရရှိနိုင်သည်ဟု သိရသည်။\nကိုးကား[ပြင်ဆင်ရန်]\n- မြန်မာ့စွယ်စုံကျမ်း၊ အတွဲ(၁)", "english": "Kakaik\nKakaik is an important edible fish found along the coast of Myanmar and is well known. This fish, according to taxonomy, belongs to the Percidae family along with species such as Ngatouttu, Kasaemyin, and Ngsinsap. The body of Kakaik is flat and slightly elongated. It has a short snout protruding forward from the lower jaw. There are two large eyes on each side of the head. It has two dorsal fins, with hard spines on the anterior dorsal fin. The tail is pointed. The scales are medium in size and belong to the rough scale type. The color of this fish is whitish along the abdomen. Kakaik is usually about one and a half feet long, but sometimes can reach a length of 5 feet and weigh around 55 viss. It is known that reasonably good fish glue can be made from the swim bladders of this fish.\nReferences [Edit]\n- Myanmar Encyclopedia, Volume (1)"}
```

## How to Use

You can load this dataset easily using the Hugging Face `datasets` library or `pandas`.

### Using Hugging Face `datasets`

```Python
from datasets import load_dataset

DATASET = "freococo/huggingface_myanmar_english_translation"

# Streaming avoids downloading and loading the full dataset up front
dataset = load_dataset(DATASET, split="train", streaming=True)

print("First 5 rows:\n")
for i, row in enumerate(dataset):
    print(row)
    if i == 4:  # stop after the fifth row (indices 0-4)
        break
```

## Data Processing Pipeline

1. **Source**: Downloaded the `mya_Mymr` subset from `HuggingFaceFW/finetranslations`.
2. **Filtering**:
   - Applied a **Zawgyi Probability Detector** (score > 0.5).
   - Applied **Regex Filters** to catch "garbage" / corrupted PDF characters common in Myanmar datasets (e.g., impossible combinations like `န` followed by `ျ`).
3. **Normalization**: Detected Zawgyi rows were converted to standard Myanmar Unicode.
4. **Sorting**: Data was sorted lexicographically (Myanmar alphabet order) and by character length.

## Data Quality & Limitations

Although this dataset has been processed to separate clear Zawgyi (old font) rows from Unicode, it **is not fully scrubbed** of all noise. Users should be aware of the following:

* **Unicode Normalization:** The primary goal of this release was to filter out high-probability Zawgyi rows and "garbage" PDF text.
* **Residual Noise:** The dataset may still contain:
  * Emojis, symbols, and non-standard punctuation.
  * Mixed English-Myanmar text within the Myanmar column.
  * Edge cases of Zawgyi text that had low probability scores (hybrid encoding).
  * Duplicated rows (intentionally not removed, to stay consistent with the original dataset).
* **Recommendation:** We recommend using this dataset as a **base** for further fine-grained cleaning depending on your specific NLP task (e.g., removing emojis for machine translation).

## License & Attribution

### Modifications License

The modifications (cleaning, converting, sorting, and formatting) presented in this repository are released under **CC0-1.0** (Public Domain). You are free to use this cleaned version as you wish.

### Original Source Attribution

This dataset is derived from **FineTranslations** by Hugging Face FW. The original data is licensed under the **Open Data Commons Attribution License (ODC-By) v1.0**.

**Citation for the original work:**

```bibtex
@misc{penedo2026finetranslations,
  title={FineTranslations},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Amir Hossein Kargaran and Leandro von Werra},
  year={2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceFW/finetranslations}}
}
```
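
## Example: Downstream Cleaning Sketch

The recommendation under "Data Quality & Limitations" suggests treating this dataset as a base for task-specific cleaning, such as removing emojis, and notes that duplicate rows were kept. The following is a minimal sketch of such a pass, not part of the published pipeline. The helper names `strip_emojis` and `clean_rows` are hypothetical, and the emoji ranges are an approximation chosen for illustration rather than a complete Unicode emoji list. Only the column names `id`, `myanmar`, and `english` come from the dataset itself.

```python
import hashlib
import re
from typing import Dict, Iterable, Iterator

# Approximate emoji ranges (an assumption, not an exhaustive emoji list).
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # emoji variation selector
    "]+"
)


def strip_emojis(text: str) -> str:
    """Remove emoji characters, collapse runs of spaces or tabs, and trim the ends."""
    without = EMOJI_RE.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without).strip()


def clean_rows(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield rows with emojis stripped, skipping empty and duplicate pairs.

    A pair counts as a duplicate when both cleaned texts match an earlier row.
    Only a 16-byte digest per pair is stored, so memory stays bounded.
    """
    seen = set()
    for row in rows:
        myanmar = strip_emojis(row["myanmar"])
        english = strip_emojis(row["english"])
        if not myanmar or not english:
            continue  # nothing left after cleaning
        pair = (myanmar + "\x00" + english).encode("utf-8")
        key = hashlib.blake2b(pair, digest_size=16).digest()
        if key in seen:
            continue  # same pair already emitted
        seen.add(key)
        yield {"id": row["id"], "myanmar": myanmar, "english": english}
```

Because `clean_rows` accepts any iterable of row dictionaries, it can wrap the streaming `dataset` object from the "How to Use" section (`for row in clean_rows(dataset): ...`) without loading everything into memory. Whether to drop duplicates at all depends on the task, since this release keeps them deliberately.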
