WuDaoCorpora Text

Name: WuDaoCorpora Text
Creator: Hanyu Zhao; Yequan Wang; Beijing Academy of Artificial Intelligence (BAAI)
Published: 2022-12-23 00:00:00
License: No description available

Science Data Bank, updated 2022-12-23, indexed 2026-04-23

Download link:

https://www.scidb.cn/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab

Resource description:

WuDaoCorpora Text is a large-scale Chinese pretraining corpus constructed by the Beijing Academy of Artificial Intelligence (BAAI). The total data volume of the dataset exceeds 5 TB, of which 200 GB is open data. Compared with other pretraining corpora, WuDaoCorpora Text has the following advantages.

1) During data collection, we classify the quality of web pages according to the proportion of words in each page and the integrity of its DOM tree, and select only high-quality web pages for collection to ensure corpus quality.
2) Through data cooperation with other institutions and web crawling, the dataset covers a wide range of Chinese text types, including news, comments, encyclopedias, forums, blogs, academic papers, etc.
3) The dataset uses more than 20 cleaning rules to obtain the final corpus from 100 TB of original web page data. During cleaning, special attention is paid to removing private information to avoid the risk of privacy disclosure.
4) The dataset contains 50+ data tags, such as education and law, which make it convenient for users to extract domain-specific data for model training in that field.

Please comply with the following agreement if you use our dataset:
https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf
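
As an illustration of point 4, the following is a minimal sketch of extracting domain-specific records by data tag. The description does not specify the file layout, so everything about the record structure here is an assumption: the sketch supposes each file is a JSON array of objects, and that the tag is stored under a field called "dataType". The tag values ("education", "law", "news") and the helper names load_records and filter_by_tag are hypothetical; adjust them to match the files actually downloaded.

```python
import json
from typing import Iterable, Iterator


def load_records(path: str) -> list:
    """Load one corpus file, ASSUMING it is a JSON array of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def filter_by_tag(
    records: Iterable[dict],
    wanted_tags: Iterable[str],
    tag_field: str = "dataType",  # assumed field name, not confirmed by the source
) -> Iterator[dict]:
    """Yield only the records whose tag is one of `wanted_tags`."""
    wanted = set(wanted_tags)
    for record in records:
        if record.get(tag_field) in wanted:
            yield record


# Hypothetical in-memory records standing in for one loaded file.
sample = [
    {"dataType": "education", "title": "a", "content": "..."},
    {"dataType": "law", "title": "b", "content": "..."},
    {"dataType": "news", "title": "c", "content": "..."},
]

selected = list(filter_by_tag(sample, ["education", "law"]))
print([r["title"] for r in selected])
```

With real data, the same filter could be applied file by file, for example filter_by_tag(load_records(path), ["law"]), so that only one file is held in memory at a time.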

Provider:

Hanyu Zhao; Yequan Wang; Beijing Academy of Artificial Intelligence (BAAI)

Created:

2022-12-21
