bastao/VeraCruz_PT-BR
Source: Hugging Face | Updated: 2025-07-21 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/bastao/VeraCruz_PT-BR
Resource description:
The VeraCruz Dataset is a comprehensive collection of Portuguese-language content, showcasing the linguistic and cultural diversity of Portuguese-speaking regions. It includes around 190 million samples, organized by regional origin, as indicated by URL metadata, into three primary categories: Portugal (PT), Brazil (BR), and Other. Samples in the Other category were further classified as PT or BR using the PeroVaz_PT-BR_Classifier and supplemented with label and score columns, which give the predicted category and the probability of the predicted label, respectively. The dataset is derived from the Portuguese-language segment of the MyCulturaX dataset, which does not differentiate between the two variants of Portuguese. Given the dataset's extensive nature, it may contain personal and sensitive information; users are advised to handle the data responsibly, employing ethical practices and privacy-compliant measures. The licensing terms for the VeraCruz Dataset follow those of mC4 and OSCAR.
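As a rough illustration of how the label and score columns described above might be used, the sketch below keeps only rows predicted as one variant with a score above a threshold. It works on plain dictionaries rather than the real dataset. The `text` field, the string values "PT" and "BR" for `label`, the 0.9 threshold, and the helper name `filter_by_variant` are all assumptions made for this example and are not confirmed by the dataset description.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_variant(
    rows: Iterable[Dict[str, Any]],
    variant: str,
    min_score: float = 0.9,  # hypothetical threshold, tune for your use case
) -> Iterator[Dict[str, Any]]:
    """Yield rows whose predicted label matches `variant` with enough confidence.

    Assumes each row is a dict with a `label` (predicted category) and a
    `score` (probability of the predicted label), as the description states.
    Rows without a score are skipped.
    """
    for row in rows:
        score = row.get("score")
        if score is None:
            continue
        if row.get("label") == variant and score >= min_score:
            yield row


# Hand-made sample rows; the `text` field and label values are assumptions.
sample_rows = [
    {"text": "exemplo 1", "label": "PT", "score": 0.98},
    {"text": "exemplo 2", "label": "BR", "score": 0.95},
    {"text": "exemplo 3", "label": "PT", "score": 0.55},
    {"text": "exemplo 4", "label": "PT", "score": None},
]

confident_pt = list(filter_by_variant(sample_rows, "PT"))
```

Because the function accepts any iterable of dictionaries, the same filter could in principle be applied lazily to a streamed copy of the data instead of loading all ~190 million samples into memory.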
Provided by:
bastao



