DatarrX/myX-Mega-Corpus

Name: DatarrX/myX-Mega-Corpus
Creator: DatarrX
Published: 2026-02-21 16:40:10
License: (no description available)

Source: Hugging Face. Updated 2026-02-21; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/DatarrX/myX-Mega-Corpus

Description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- my
size_categories:
- 10M<n<100M
---

# myX-Mega-Corpus (Myanmar NLP Resource)

![DatarrX Logo](https://img.shields.io/badge/Organization-DatarrX-blue) ![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg) ![Topic: NLP](https://img.shields.io/badge/Focus-Myanmar_NLP-orange)

## 📌 Project Overview

**myX-Mega-Corpus** is a large-scale Burmese language dataset curated by **DatarrX**. It consists of more than **16 million rows** (about 5.3 GB) of cleaned and shuffled Burmese text. The corpus was originally built to train the **myX-Semantic** word embedding model, but it is free to use for other NLP work as well, such as LLM training or fine-tuning, sentiment analysis, and machine translation.

## 📚 Dataset Sources & Attribution

This corpus aggregates the following resources. We adhere to open-source ethics and acknowledge the original authors:

1. **Myanmar Spoken Corpus**: Compiled by [freococo](https://huggingface.co/datasets/freococo/myanmar_spoken_corpus). License: **CC-BY-4.0**. This dataset contributes approximately 15 million rows of spoken and social media text.
2. **myX-Corpus**: Compiled by [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis). This dataset contributes a diverse range of Burmese literary and technical text.

## 🛠 Data Specifications

* **Total Rows:** 16,000,000+
* **File Format:** Raw Text (.txt)
* **Language:** Burmese (Unicode Standard), with a small amount of English
* **Encoding:** UTF-8
* **Preprocessing:** Zawgyi-to-Unicode normalization, duplicate-line removal, and noise filtering.

## ⚖️ License

**CC-BY-4.0 (Creative Commons Attribution 4.0 International)**

You are free to share, copy, and adapt this material for any purpose, including commercial use, as long as you give appropriate credit to the original sources and to **DatarrX**.

## 👥 About DatarrX

**DatarrX** is an open-source non-governmental organization dedicated to developing advanced Natural Language Processing (NLP) resources for the Burmese language.
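The preprocessing listed under Data Specifications (duplicate removal and noise filtering) can be illustrated with a minimal Python sketch. This is not the pipeline DatarrX used; the card does not describe it. The function names, the 0.5 script-ratio threshold, and the minimum length are assumptions chosen for illustration, and Zawgyi-to-Unicode conversion is left out because it needs a dedicated converter.

```python
import hashlib
from typing import Iterable, Iterator

# Unicode "Myanmar" block (U+1000..U+109F). Extended blocks are ignored here.
MYANMAR_START, MYANMAR_END = 0x1000, 0x109F


def myanmar_ratio(line: str) -> float:
    """Fraction of non-whitespace characters that lie in the Myanmar block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(MYANMAR_START <= ord(c) <= MYANMAR_END for c in chars)
    return hits / len(chars)


def clean_lines(
    lines: Iterable[str],
    min_ratio: float = 0.5,  # assumed threshold, not taken from the card
    min_chars: int = 2,      # assumed minimum line length
) -> Iterator[str]:
    """Yield stripped lines that are long enough, mostly Burmese, and unseen."""
    seen = set()
    for raw in lines:
        line = raw.strip()
        if len(line) < min_chars:
            continue  # drop empty or trivially short lines
        if myanmar_ratio(line) < min_ratio:
            continue  # drop lines that are mostly non-Burmese (noise)
        # Store a 16-byte digest instead of the full line to save memory.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier line
        seen.add(digest)
        yield line


if __name__ == "__main__":
    sample = ["မင်္ဂလာပါ", "မင်္ဂလာပါ", "hello world", "   ", "နေကောင်းလား?"]
    for kept in clean_lines(sample):
        print(kept)
```

To run the same filter over a local copy of the corpus, pass a file handle opened with `encoding="utf-8"` to `clean_lines`; the file name depends on how the dataset was downloaded and is not specified here. Because the corpus already mixes in a small amount of English, a stricter `min_ratio` would discard more of those lines.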

Provided by:

DatarrX
