
Wudao Multi-language

Source: Science Data Bank. Updated: 2022-12-30. Indexed: 2026-04-23.
Download link:
https://www.scidb.cn/en/detail?dataSetId=24833059a99a4953ba52e8ca7e3b9c69
Description:
Wudao Multi-language is a large-scale multilingual dataset constructed by the Beijing Academy of Artificial Intelligence (BAAI). The total data volume of the dataset is 1.21 TB, covering 53 official languages of 65 countries across 9 major language families. To construct the dataset, we first download Common Crawl web page data and extract the text from it. We then identify the language of the extracted text and conduct targeted web data collection to supplement rare languages. Finally, we clean and deduplicate the text using strict rules, removing noise that would affect pretraining and filtering out private information. The dataset can be applied to a wide range of pretraining tasks and is of great value to research on large multilingual models. Please obey the following agreement if you use our dataset: https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf
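The cleaning and deduplication step described above can be illustrated with a minimal sketch. This is not BAAI's actual pipeline, whose rules are not given here: the email pattern, the `<EMAIL>` placeholder, the `MIN_CHARS` threshold, and the function names are all assumptions chosen for illustration, and exact-match hashing stands in for whatever deduplication method was really used.

```python
import hashlib
import re

# Hypothetical rules for illustration only; the dataset's real rules are not published here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MIN_CHARS = 20  # assumed minimum length for a document to be kept


def clean(text):
    """Mask email addresses and collapse runs of whitespace."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_and_deduplicate(docs, min_chars=MIN_CHARS):
    """Clean each document, drop very short ones, and remove exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = clean(doc)
        if len(text) < min_chars:
            continue  # too short to be useful pretraining text
        # Hash a case-folded copy so trivial case differences count as duplicates.
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `clean_and_deduplicate` on a list of raw page texts returns the cleaned, unique documents in their original order. A web-scale pipeline would more likely need near-duplicate detection and language-specific rules, which this sketch does not attempt.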
Provider:
Yequan Wang; Beijing Academy of Artificial Intelligence
Created:
2022-12-21