cleverHeart/mala-bilingual-translation-corpus
Source: Hugging Face · Updated 2026-03-11 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cleverHeart/mala-bilingual-translation-corpus
---
license: odc-by
task_categories:
- translation
size_categories:
- n>1T
---
# MaLA Corpus: Massive Language Adaptation Corpus
[**MaLA-LM/mala-bilingual-translation-corpus**](https://huggingface.co/datasets/MaLA-LM/mala-bilingual-translation-corpus) is the MaLA bilingual translation corpus, collected and processed from various sources.
As part of the [**MaLA Corpus**](https://huggingface.co/collections/MaLA-LM/mala-corpus-66e05127641a51de34d39529), which aims to support massive language adaptation across many languages, it contains bilingual translation data (also known as parallel data or bitexts) in 2,500+ language pairs covering 500+ languages.
Key statistics for all language pairs are available at https://github.com/MaLA-LM/LangResourceAtlas/tree/main/mala-parallel
The [**MaLA Corpus** (Massive Language Adaptation)](https://huggingface.co/collections/MaLA-LM/mala-corpus-66e05127641a51de34d39529) is a series of comprehensive, multilingual datasets designed to support the continual pre-training of large language models. This [**MaLA-LM/mala-bilingual-translation-corpus**](https://huggingface.co/datasets/MaLA-LM/mala-bilingual-translation-corpus) set can also support the training of multilingual translation models.
---
## Key Features
- **Language Coverage**: Includes data in 2,500+ language pairs.
- **Pre-processing**: The corpus is cleaned and deduplicated to ensure high-quality training data.
---
## Dataset Creation
This [**MaLA-LM/mala-bilingual-translation-corpus**](https://huggingface.co/datasets/MaLA-LM/mala-bilingual-translation-corpus) set was created by collecting data from various sources and applying rigorous pre-processing to ensure data quality:
- **Cleaning**: Noisy and irrelevant data was removed to ensure higher data quality.
- **Deduplication**: Duplicate entries across multiple sources were eliminated.
- **Normalization**: The data was normalized, and language codes were standardized to ISO 639-3 to ensure consistency across all sources.
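
The three steps above can be illustrated with a minimal sketch. This is not the MaLA-LM pipeline, whose implementation is not described here; the record layout `(src_lang, tgt_lang, src, tgt)`, the helper names, and the tiny code table are all assumptions made for illustration, and a real pipeline would need a complete ISO 639-3 mapping and stronger noise filters.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping from two-letter ISO 639-1 codes to
# three-letter ISO 639-3 codes; a real pipeline would need a complete table.
ISO_639_1_TO_3 = {"en": "eng", "de": "deu", "fr": "fra", "fi": "fin", "zh": "zho"}


def normalize_code(code):
    """Return a lowercase code, mapping known two-letter codes to ISO 639-3."""
    code = code.strip().lower()
    return ISO_639_1_TO_3.get(code, code)


def normalize_text(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_and_deduplicate(records):
    """Drop empty pairs and exact duplicates from (src_lang, tgt_lang, src, tgt) tuples."""
    seen = set()
    kept = []
    for src_lang, tgt_lang, src, tgt in records:
        row = (normalize_code(src_lang), normalize_code(tgt_lang),
               normalize_text(src), normalize_text(tgt))
        if not row[2] or not row[3]:
            continue  # cleaning: skip pairs with an empty side
        if row in seen:
            continue  # deduplication: identical pair already kept
        seen.add(row)
        kept.append(row)
    return kept


raw = [
    ("en", "de", "Good morning.", "Guten Morgen."),
    ("eng", "deu", "Good  morning. ", "Guten Morgen."),  # duplicate once normalized
    ("en", "fi", "Thank you.", ""),                       # empty target side
]
cleaned = clean_and_deduplicate(raw)
```

Normalizing before deduplicating matters in this sketch: the first two records only become identical once their language codes and whitespace are standardized.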
---
## Intended Use
This [**MaLA-LM/mala-bilingual-translation-corpus**](https://huggingface.co/datasets/MaLA-LM/mala-bilingual-translation-corpus) set is intended for researchers and developers looking to improve the multilingual capabilities of language models. It is especially useful for:
- **Continual pre-training** of large language models to improve performance in low-resource languages.
- **Fine-tuning models** on multilingual benchmarks to improve language coverage across a variety of domains.
- **Multilingual tasks** such as machine translation.
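
As a rough illustration of these uses, the sketch below turns one parallel record into a plain-text training example. The field names (`src_lang`, `tgt_lang`, `src`, `tgt`) and the prompt template are assumptions chosen for this example; they are not confirmed column names of this dataset, nor the format used to train the EMMA-500 models, so check the actual schema before adapting it.

```python
def to_translation_sample(record, src_key="src", tgt_key="tgt",
                          src_lang_key="src_lang", tgt_lang_key="tgt_lang"):
    """Format one parallel record (a dict) as a plain-text translation example."""
    return (f"Translate from {record[src_lang_key]} to {record[tgt_lang_key]}.\n"
            f"Source: {record[src_key]}\n"
            f"Target: {record[tgt_key]}")


def iter_samples(records, **keys):
    """Lazily format an iterable of records, so large corpora need not fit in memory."""
    for record in records:
        yield to_translation_sample(record, **keys)


# Hypothetical record using ISO 639-3 language codes.
example = {"src_lang": "eng", "tgt_lang": "fin", "src": "Thank you.", "tgt": "Kiitos."}
sample = to_translation_sample(example)
```

Because `iter_samples` accepts any iterable, it could in principle be fed from a streamed dataset rather than an in-memory list, with the key arguments adjusted to match the real column names.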
---
## Take-down Policy
We don't own any part of the data. We will comply with legitimate requests by removing the affected sources from the corpora.
---
## Citation
This [**MaLA-LM/mala-bilingual-translation-corpus**](https://huggingface.co/datasets/MaLA-LM/mala-bilingual-translation-corpus) set was processed by the [MaLA-LM](https://mala-lm.github.io) project and used to train 🤗[MaLA-LM/emma-500-llama3.1-8b-bi](https://huggingface.co/MaLA-LM/emma-500-llama3.1-8b-bi) and 🤗[MaLA-LM/emma-500-llama3-8b-bi](https://huggingface.co/MaLA-LM/emma-500-llama3-8b-bi). If you find this dataset useful, please cite our paper below.
```
@article{ji2024emma500enhancingmassivelymultilingual,
title={{EMMA}-500: Enhancing Massively Multilingual Adaptation of Large Language Models},
author={Shaoxiong Ji and Zihao Li and Indraneil Paul and Jaakko Paavola and Peiqin Lin and Pinzhen Chen and Dayyán O'Brien and Hengyu Luo and Hinrich Schütze and Jörg Tiedemann and Barry Haddow},
year={2024},
journal={arXiv preprint 2409.17892},
url={https://arxiv.org/abs/2409.17892},
}
```
Provider: cleverHeart



