PT7 Web, an Annotated Portuguese Language Corpus

Name: PT7 Web, an Annotated Portuguese Language Corpus
Creator: IEEE DataPort
Published: 2020-12-05 22:03:10
License: No description available

Indexed from: DataCite Commons (updated 2020-12-05, indexed 2025-04-16)

Download link:

https://ieee-dataport.org/open-access/pt7-web-annotated-portuguese-language-corpus

Description:

PT7 Web is an annotated Portuguese language corpus built from samples collected between Sep 2018 and Mar 2020 from seven Portuguese-speaking countries: Angola, Brazil, Portugal, Cape Verde, Guinea-Bissau, Macao, and Mozambique. The records were filtered from Common Crawl, a public-domain, petabyte-scale dataset of web pages in many languages, mixed together in temporal snapshots of the web that are released monthly [1]. The Brazilian pages were labeled as the positive class and all others as the negative class (non-Brazilian Portuguese). The dataset totals 249.74 GB of raw HTML text from 16,346,693 unique web pages. The data was preprocessed to produce high-dimensional vectors (16,384 features) of word distributions, which serve as input for the training and test phases. A demonstration of the use of this data can be found in a two-level fractional design study of cluster performance on Spark [2].
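The description does not say how the 16,384-feature word-distribution vectors were computed. As a rough illustration only, the sketch below assumes a feature-hashing approach (16,384 = 2^14 buckets) with relative word frequencies, and a binary label that is 1 for Brazilian pages and 0 otherwise. The function names, the tokenizer, the hash function, and the crude tag stripping are all hypothetical choices, not the dataset authors' pipeline.

```python
import hashlib
import re

# 2**14, the dimensionality stated in the dataset description.
N_FEATURES = 16_384


def strip_html(raw_html):
    """Crude tag removal (assumption); a real pipeline would use an HTML parser."""
    return re.sub(r"<[^>]+>", " ", raw_html)


def word_distribution(text, n_features=N_FEATURES):
    """Map text to a fixed-length vector of relative word frequencies.

    Each word is assigned to a bucket by hashing (an assumed technique);
    distinct words may collide in the same bucket.
    """
    vector = [0.0] * n_features
    words = re.findall(r"\w+", text.lower())
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_features
        vector[index] += 1.0
    total = len(words)
    if total:
        # Normalize counts so the non-empty vector sums to 1.
        vector = [count / total for count in vector]
    return vector


def label_for(country):
    """Positive class (1) for Brazil, negative class (0) for every other country."""
    return 1 if country == "Brazil" else 0


if __name__ == "__main__":
    page = "<p>Olá mundo, olá Brasil</p>"
    features = word_distribution(strip_html(page))
    print(len(features), label_for("Brazil"), label_for("Portugal"))
```

A stable hash (MD5 here) is used instead of Python's built-in `hash`, which is salted per process for strings and would give different bucket indices between runs.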

Provider: IEEE DataPort
Created: 2020-12-05
