Arabic Training Data

Name: Arabic Training Data
Creator: Inception-MBZUAI
License: No description available

Indexed from arXiv: 2025-09-30

Download link:

https://huggingface.co/inception-mbzuai/jais-13b-chat

Resource description:

This large-scale Arabic dataset aggregates text from multiple sources, including news articles, Wikipedia, and books, and adds English translations to strengthen logical reasoning in Arabic. It aims to improve the performance of large language models (LLMs) in Arabic, addressing the limitation that existing models are trained predominantly on English corpora. The dataset contains hundreds of billions of tokens: 116 billion Arabic tokens and 232 billion English tokens. Its intended task is language model pre-training.
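As a quick check on the language balance implied by the two token counts in the description, the arithmetic can be sketched as below. This assumes the two stated figures make up the whole mix, which the listing does not confirm; the names `TOKENS_BILLIONS` and `language_shares` are illustrative and not part of any published tooling.

```python
# Token counts as stated in the description (in billions of tokens).
# Assumption: these two figures are the entire mix.
TOKENS_BILLIONS = {"arabic": 116, "english": 232}


def language_shares(counts):
    """Return each language's fraction of the total token count."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


total = sum(TOKENS_BILLIONS.values())
shares = language_shares(TOKENS_BILLIONS)

print(f"total: {total}B tokens")
for lang, share in shares.items():
    print(f"{lang}: {share:.1%}")
```

Under that assumption the stated counts sum to 348 billion tokens, with roughly one Arabic token for every two English tokens.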

Provider:

Inception-MBZUAI
