PT7 Web, an Annotated Portuguese Language Corpus
IEEE | Updated 2020-12-06 | Indexed 2026-04-17
Download link:
https://ieee-dataport.org/open-access/pt7-web-annotated-portuguese-language-corpus
Description:
PT7 Web is an annotated Portuguese-language corpus built from samples collected between September 2018 and March 2020 from seven Portuguese-speaking countries: Angola, Brazil, Portugal, Cape Verde, Guinea-Bissau, Macao, and Mozambique. The records were filtered from Common Crawl, a public-domain, petabyte-scale dataset of web pages in many languages, released as monthly snapshots of the web [1]. The Brazilian pages were labeled as the positive class and all others as the negative class (non-Brazilian Portuguese). The dataset totals 249.74 GB of raw HTML text from 16,346,693 unique web pages. The data were preprocessed to produce high-dimensional word-distribution vectors (2^18 = 262,144 features) as input for the training and test phases. An example use of this data appears in a two-level fractional factorial design that investigates cluster performance on Spark [2].
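The description does not say how the 262,144-dimensional vectors were produced. One common way to obtain a fixed dimensionality of 2^18 is feature hashing (2^18 is also the default `numFeatures` of Spark's `HashingTF`), so the sketch below assumes that technique for illustration only. The function names `tokenize`, `bucket`, and `hashed_word_distribution`, the MD5-based bucket choice, and the regex tokenizer are all hypothetical and are not taken from the dataset's actual pipeline.

```python
import hashlib
import re
from collections import Counter

# Dimensionality stated in the dataset description: 2^18 = 262,144 features.
NUM_FEATURES = 2 ** 18


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bucket(token, num_features=NUM_FEATURES):
    """Map a token to a stable feature index in [0, num_features)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hashed_word_distribution(text, num_features=NUM_FEATURES):
    """Return a sparse vector {feature index: relative frequency}.

    Tokens that hash to the same index share one feature (a collision),
    which is the usual trade-off of feature hashing.
    """
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(bucket(tok, num_features) for tok in tokens)
    total = len(tokens)
    return {index: count / total for index, count in counts.items()}


if __name__ == "__main__":
    sample = "O Brasil é grande. O Brasil é belo."
    vector = hashed_word_distribution(sample)
    # Print the non-zero entries of the sparse vector.
    for index, weight in sorted(vector.items()):
        print(index, weight)
```

Storing only the non-zero entries keeps each page's vector small, since a single page uses a tiny fraction of the 262,144 available indices.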
Provided by:
Rodrigues, Jairson; Vasconcelos, Germano; Maciel, Paulo
Created:
2020-12-06



