TucanoBR/GigaVerbo
Hugging Face | Updated 2025-07-24 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/TucanoBR/GigaVerbo
Resource overview:
GigaVerbo is a 780 GB dataset of Portuguese text, built by concatenating several existing datasets and containing over 200 billion tokens. Its sources include crawled websites, articles, translated conversations, and legal documents. It is intended as a resource for natural language processing tasks, particularly language modeling. Each record has the features text, metadata, label, and probs: label indicates the quality of the text (high or low), and probs gives the confidence score for that label. The dataset has a single split, train, and can be streamed to avoid downloading it in full. It is curated by Nicholas Kluge Corrêa and includes licensing information for the source datasets.
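The streaming access and quality fields described above can be sketched as follows. This is a minimal sketch, not the curators' own loading code. It assumes the Hugging Face `datasets` library is installed, that `label` is encoded as an integer with `1` meaning high quality, and that `probs` is a single float; the overview does not state either encoding, so both are exposed as adjustable assumptions. The helper names `filter_by_quality`, `stream_gigaverbo`, and `preview` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def filter_by_quality(
    records: Iterable[Dict[str, Any]],
    high_label: Any = 1,      # ASSUMPTION: 1 marks "high quality"; adjust if the encoding differs
    min_prob: float = 0.0,    # ASSUMPTION: probs is a single float confidence score
) -> Iterator[Dict[str, Any]]:
    """Lazily yield records labeled high quality with confidence >= min_prob."""
    for record in records:
        if record.get("label") == high_label and float(record.get("probs", 0.0)) >= min_prob:
            yield record


def stream_gigaverbo() -> Iterable[Dict[str, Any]]:
    """Open the single train split in streaming mode, so nothing is downloaded up front."""
    # Imported here so the filter above can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)


def preview(n: int = 3, min_prob: float = 0.9) -> List[str]:
    """Return the first 200 characters of the first n high-confidence records."""
    selected = filter_by_quality(stream_gigaverbo(), min_prob=min_prob)
    return [record["text"][:200] for record in islice(selected, n)]
```

Calling `preview()` requires network access and reads only as many records as it needs to find `n` matches. Because the filter is a generator, it composes with the streamed dataset without materializing the 780 GB corpus.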
Provider:
TucanoBR



