ArabicText 2022

Name: ArabicText 2022
Creator: Yequan Wang; Jiahong Leng; Xuezhi Fang; Quanyue Ma; Beijing Academy of Artificial Intelligence
Published: 2022-12-16 00:00:00
License: No description available

Source: Science Data Bank; updated 2022-12-16; indexed 2026-04-23

Download link:

https://www.scidb.cn/detail?dataSetId=cece25593c44455dbab5ca01af368067

Resource description:

In cooperation with institutes in Arabic-speaking countries, including AASTMT, BA and IIAI, the cognitive model and data research team of the Beijing Academy of Artificial Intelligence (BAAI) has published ArabicText 2022, the world’s largest Arabic text dataset in the open-source community for pre-training language models. By collecting, aggregating and cleaning publicly available Arabic web data, we obtained a high-quality text dataset of more than 200GB. During data cleaning, we applied and optimized WudaoCleaner, an efficient and effective web text cleaning tool validated on WuDaoCorpora. We also integrated the open-source Arabic text cleaning toolkit ArabertProcessor into the cleaning pipeline to safeguard language-specific data quality. Moreover, informative data such as news and encyclopedia articles accounts for more than 65% of the dataset, which suggests that language models can readily acquire prior knowledge from the corpus.

Please comply with the following agreement if you use our dataset: https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf

Provided by:

Yequan Wang; Jiahong Leng; Xuezhi Fang; Quanyue Ma; Beijing Academy of Artificial Intelligence

Created:

2022-12-08
