Wudao Multi-language
Science Data Bank; updated 2022-12-30; indexed 2026-04-23
Download link:
https://www.scidb.cn/en/detail?dataSetId=24833059a99a4953ba52e8ca7e3b9c69
Resource description:
Wudao Multi-language is a large-scale multilingual dataset constructed by the Beijing Academy of Artificial Intelligence (BAAI). The dataset totals 1.21 TB and covers 53 official languages of 65 countries across 9 major language families. To construct the dataset, we first download Common Crawl web page data and extract the text from it. We then identify the language of the extracted text and conduct targeted web data collection to supplement rare languages. Finally, we clean and deduplicate the text using strict rules, removing noise that would affect pretraining and filtering out private information. The dataset can be applied to a wide range of pretraining tasks and is of great value to research on large multilingual models. If you use our dataset, please comply with the following agreement: https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf
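
The extract, clean, and deduplicate steps described above can be illustrated with a minimal Python sketch. This is not BAAI's actual pipeline: the function names (`extract_text`, `clean`, `deduplicate`, `build_corpus`), the email-masking rule, the 20-character line threshold, and exact-hash deduplication are all assumptions chosen for illustration, and the language identification and targeted collection steps are omitted.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical privacy rule: mask anything that looks like an email address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean(text, min_chars=20):
    """Mask emails, normalise whitespace, and drop very short lines."""
    kept = []
    for line in text.splitlines():
        line = EMAIL_RE.sub("<EMAIL>", line)
        line = " ".join(line.split())
        if len(line) >= min_chars:
            kept.append(line)
    return "\n".join(kept)


def deduplicate(docs):
    """Keep the first occurrence of each non-empty document (exact match)."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if doc and digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def build_corpus(pages):
    """Run extraction, cleaning, and deduplication over raw HTML pages."""
    return deduplicate(clean(extract_text(page)) for page in pages)
```

Calling `build_corpus` on a list of raw HTML strings returns the cleaned, unique documents. A production pipeline at this scale would more likely use near-duplicate detection rather than exact hashing, but exact hashing keeps the sketch short.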
Provider:
Yequan Wang; Beijing Academy of Artificial Intelligence
Created:
2022-12-21



