
CMLI-NLP/Mongolian-pretrain-dataset

Hugging Face | Updated 2025-08-03 | Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/CMLI-NLP/Mongolian-pretrain-dataset
Resource overview:
---
license: cc-by-4.0
language:
- mn
size_categories:
- 100K<n<1M
---

# Mongolian Pretraining Dataset

## Dataset Information

- **Language**: Mongolian (Traditional Mongolian script)
- **Size**: ~12GB
- **Format**: Plain text (.txt)
- **Use Case**: Language model pretraining

## Description

This dataset contains Mongolian text data for training language models on low-resource languages. The data uses Traditional Mongolian script and covers 45 core characters identified through frequency analysis.

**Code**: The Huffman transliteration framework implementation is available at [https://github.com/CMLI-NLP/HuffmanTranslit](https://github.com/CMLI-NLP/HuffmanTranslit)

## Technical Details

- **Character Set**: 45 core Traditional Mongolian characters
- **Encoding**: UTF-8
- **Script**: Traditional Mongolian script
- **Total Samples**: 933,941

## Usage

Suitable for:

- Continued pretraining of multilingual language models
- Cross-lingual transfer learning research
- Low-resource language processing

## Transliteration Framework Compatibility

This dataset is designed to work with the Huffman-based transliteration framework for reversible conversion between Traditional Mongolian script and Latin characters.

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{zhuang2025enhancing,
  title={Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages},
  author={Zhuang, Wenhao and Sun, Yuan and Zhao, Xiaobing},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={16299--16313},
  year={2025}
}
```
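
The description mentions that the core character set was identified through frequency analysis. The sketch below shows one way such an analysis might look on UTF-8 plain text. It is not the authors' procedure: the function names, the `coverage` threshold, and the choice to count every code point in the Unicode Mongolian block (U+1800 to U+18AF, which also contains punctuation, digits, and letters for related scripts) are all assumptions made for illustration.

```python
from collections import Counter

# Unicode "Mongolian" block (U+1800..U+18AF). Assumption: anything in this
# block is counted, including punctuation and digits.
MONGOLIAN_BLOCK = range(0x1800, 0x18B0)


def mongolian_char_frequencies(lines):
    """Count occurrences of Mongolian-block characters in an iterable of strings."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if ord(ch) in MONGOLIAN_BLOCK)
    return counts


def core_characters(counts, coverage=0.99):
    """Return the most frequent characters whose cumulative share reaches `coverage`.

    `coverage` is a hypothetical cutoff, not a value taken from the dataset card.
    """
    total = sum(counts.values())
    core, running = [], 0
    for ch, n in counts.most_common():
        if total == 0 or running / total >= coverage:
            break
        core.append(ch)
        running += n
    return core


if __name__ == "__main__":
    # Tiny in-memory sample; a real run would stream lines from the .txt files.
    sample = ["\u182e\u1823\u1829\u182d\u1823\u182f", "abc \u182e\u182e"]
    freqs = mongolian_char_frequencies(sample)
    print(len(freqs), "distinct Mongolian-block characters")
    print([hex(ord(c)) for c in core_characters(freqs, coverage=0.6)])
```

For a corpus of this size (~12GB), passing an open file handle as `lines` keeps memory use flat, since the counter only stores one entry per distinct character.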
Provider:
CMLI-NLP