
WuDaoCorpora Text

Science Data Bank; updated 2022-12-23; indexed 2026-04-23
Download link:
https://www.scidb.cn/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab
Resource description:
WuDaoCorpora Text is a large Chinese pretraining corpus constructed by the Beijing Academy of Artificial Intelligence (BAAI). The total data volume of the dataset exceeds 5 TB, of which 200 GB is open data. Compared with other pretraining corpora, WuDaoCorpora Text has the following advantages.

1) During data collection, we classify web pages by quality according to the proportion of words in each page and the integrity of its DOM tree, and we collect data only from high-quality pages to ensure corpus quality.

2) Through data cooperation with other institutions and web crawling, the dataset covers a wide range of Chinese text types, including news, comments, encyclopedias, forums, blogs, academic papers, etc.

3) The dataset applies more than 20 cleaning rules to obtain the final corpus from 100 TB of original web page data. During cleaning, special attention is paid to removing private information to avoid the risk of privacy disclosure.

4) The dataset contains 50+ data tags, such as education and law, which make it convenient for users to extract domain-specific data for model training in that field.

If you use our dataset, please comply with the following agreement: https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf
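The tag-based extraction mentioned in point 4 could look roughly like the sketch below. It is illustrative only: the description above does not specify the file layout, so the record structure, the field name `dataType`, and the helper `filter_by_tag` are all assumptions and should be checked against the downloaded files.

```python
import json
from typing import Iterable, Iterator


def filter_by_tag(
    records: Iterable[dict], tag: str, tag_field: str = "dataType"
) -> Iterator[dict]:
    """Yield only the records whose tag field equals `tag`.

    `tag_field` defaults to "dataType", which is an assumed field name,
    not one confirmed by the dataset description.
    """
    for record in records:
        if record.get(tag_field) == tag:
            yield record


# Hypothetical sample records standing in for one corpus file.
sample = json.loads("""
[
  {"dataType": "education", "title": "t1", "content": "c1"},
  {"dataType": "law",       "title": "t2", "content": "c2"},
  {"dataType": "education", "title": "t3", "content": "c3"}
]
""")

# Keep only the documents tagged with the (assumed) "education" label.
education_docs = list(filter_by_tag(sample, "education"))
print(len(education_docs))
```

For real use, `sample` would be replaced by records loaded from the corpus files, and the tag string by one of the dataset's actual tag values.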
Provided by:
Hanyu Zhao; Yequan Wang; Beijing Academy of Artificial Intelligence (BAAI)
Created:
2022-12-21