ArabicText 2022
Science Data Bank | Updated: 2022-12-16 | Indexed: 2026-04-23
Download link:
https://www.scidb.cn/detail?dataSetId=cece25593c44455dbab5ca01af368067
Resource description:
In cooperation with institutes in Arabic-speaking countries, including AASTMT, BA, and IIAI, the cognitive model and data research team of the Beijing Academy of Artificial Intelligence (BAAI) has published ArabicText 2022, the largest open-source Arabic text dataset for pre-training language models. By collecting, aggregating, and cleaning publicly available Arabic web data, we obtained a high-quality text dataset of more than 200 GB. During data cleaning, we applied and optimized WudaoCleaner, an efficient and effective web text cleaning tool proven on WuDaoCorpora. We also integrated the open-source Arabic text cleaning toolkit, ArabertProcessor, into the cleaning pipeline to ensure language-specific data quality. Moreover, informative data such as news and encyclopedia articles accounts for more than 65% of the dataset, which suggests that language models can readily acquire prior knowledge from the corpus. Please comply with the following agreement if you use our dataset: https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf
Provided by:
Yequan Wang; Jiahong Leng; Xuezhi Fang; Quanyue Ma; Beijing Academy of Artificial Intelligence
Created:
2022-12-08



