Kuwain Training Dataset

Source: arXiv, indexed 2025-09-30
Download link:
https://github.com/misraj-ai/Kuwain-Arabic-cleaner
Description:
This dataset contains 110 billion tokens: 90 billion in Arabic and 20 billion in English. The data is drawn from publicly available open-source resources, including multiple Arabic corpora and dialectal datasets, so it covers a broad range of Arabic dialects. It has been extensively filtered and cleaned to improve quality, and the scripts used for Arabic text cleaning have been released to support reproducibility. The dataset is intended for training and evaluating language models that target Arabic language integration.
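
The released cleaning scripts are not reproduced here. As a rough illustration of what Arabic text cleaning commonly involves, the minimal sketch below strips diacritics (tashkeel), removes the tatweel elongation character, and collapses whitespace. The function name `clean_arabic_text` and the specific rules are assumptions chosen for illustration; they are not taken from the Kuwain-Arabic-cleaner repository and may differ from what it actually does.

```python
import re

# Assumed rule set (illustrative only, not the repository's rules):
# Arabic diacritics U+064B..U+0652 plus the superscript alef U+0670.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), a purely typographic elongation character.
_TATWEEL = "\u0640"
# Any run of whitespace, including newlines and tabs.
_WHITESPACE = re.compile(r"\s+")


def clean_arabic_text(text: str) -> str:
    """Hypothetical helper: return a lightly normalized copy of `text`."""
    text = _DIACRITICS.sub("", text)      # drop short-vowel marks
    text = text.replace(_TATWEEL, "")     # drop elongation characters
    text = _WHITESPACE.sub(" ", text)     # collapse whitespace runs
    return text.strip()


if __name__ == "__main__":
    # A fully vocalized word followed by a word stretched with tatweel.
    sample = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627   \u062C\u0640\u0640\u0645\u064A\u0644"
    print(clean_arabic_text(sample))
```

A production pipeline would typically add further steps such as language filtering and deduplication; those are omitted here because the listing does not describe them.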
Provided by:
Beijing Academy of Artificial Intelligence (BAAI) and other open-source repositories