CMLI-NLP/Mongolian-pretrain-dataset
Source: Hugging Face. Updated 2025-08-03; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/CMLI-NLP/Mongolian-pretrain-dataset
Resource description:
---
license: cc-by-4.0
language:
- mn
size_categories:
- 100K<n<1M
---
# Mongolian Pretraining Dataset
## Dataset Information
- **Language**: Mongolian (Traditional Mongolian script)
- **Size**: ~12GB
- **Format**: Plain text (.txt)
- **Use Case**: Language model pretraining
## Description
This dataset contains Mongolian text data for training language models on low-resource languages. The data uses Traditional Mongolian script and covers 45 core characters identified through frequency analysis.
**Code**: The Huffman transliteration framework implementation is available at [https://github.com/CMLI-NLP/HuffmanTranslit](https://github.com/CMLI-NLP/HuffmanTranslit)
## Technical Details
- **Character Set**: 45 core Traditional Mongolian characters
- **Encoding**: UTF-8
- **Script**: Traditional Mongolian script
- **Total Samples**: 933,941
## Usage
Suitable for:
- Continued pretraining of multilingual language models
- Cross-lingual transfer learning research
- Low-resource language processing
## Transliteration Framework Compatibility
This dataset is designed to work with the Huffman-based transliteration framework for reversible conversion between Traditional Mongolian script and Latin characters.
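
To make the idea of a reversible, Huffman-based mapping concrete, here is a minimal sketch, not the HuffmanTranslit implementation. It builds a prefix-free code over the lowercase Latin alphabet from character frequencies, so frequent characters get shorter Latin strings, and decodes by reading letters until a complete code is matched. All function names are hypothetical, and the sketch assumes the input text contains no Latin letters of its own; the actual framework may handle frequencies, alphabets, and pass-through characters differently.

```python
import heapq
from itertools import count


def build_huffman_table(freqs: dict[str, int],
                        alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> dict[str, str]:
    """Map each symbol to a prefix-free string over `alphabet` (k-ary Huffman)."""
    k = len(alphabet)
    if k < 2:
        raise ValueError("alphabet needs at least two letters")
    if not freqs:
        return {}
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    # Heap entries are (weight, tie, node); a node is a symbol, a list of
    # child nodes, or None for a padding leaf.
    heap = [(w, next(tie), sym) for sym, w in freqs.items()]
    # Pad with zero-weight leaves so every merge combines exactly k nodes.
    while len(heap) < 2 or (len(heap) - 1) % (k - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(k)]
        weight = sum(c[0] for c in children)
        heapq.heappush(heap, (weight, next(tie), [c[2] for c in children]))

    table: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if node is None:  # padding leaf, carries no symbol
            return
        if isinstance(node, str):
            table[node] = prefix
            return
        for letter, child in zip(alphabet, node):
            walk(child, prefix + letter)

    walk(heap[0][2], "")
    return table


def encode(text: str, table: dict[str, str]) -> str:
    """Replace mapped characters by their codes; other characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def decode(text: str, table: dict[str, str]) -> str:
    """Invert `encode`, assuming the original text had no code letters."""
    inverse = {code: sym for sym, code in table.items()}
    letters = set("".join(inverse))
    out, buf = [], ""
    for ch in text:
        if not buf and ch not in letters:
            out.append(ch)  # pass-through character such as a space
            continue
        buf += ch
        if buf in inverse:  # prefix-free codes make the first match unique
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("input ends in the middle of a code")
    return "".join(out)


# Toy frequencies for three Mongolian letters (illustrative values only).
toy_table = build_huffman_table({"\u1820": 5, "\u1821": 3, "\u1822": 1})
toy_text = "\u1820\u1821 \u1822\u1820"
assert decode(encode(toy_text, toy_table), toy_table) == toy_text
```

With 26 letters and 45 symbols, this construction gives the most frequent symbols one-letter codes and the rest two-letter codes, which is the property that keeps transliterated text short while remaining exactly invertible.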
## Citation
If you use this dataset, please cite:
```bibtex
@inproceedings{zhuang2025enhancing,
title={Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages},
author={Zhuang, Wenhao and Sun, Yuan and Zhao, Xiaobing},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={16299--16313},
year={2025}
}
```
Provider:
CMLI-NLP



