tiagoloeblein/GodVerb
Source: Hugging Face. Updated: 2025-12-19. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/tiagoloeblein/GodVerb
Description:
GigaVerbo Clean is a processed dataset derived from TucanoBR/GigaVerbo and contains billions of Portuguese tokens. This repository is not the original dataset; it is an independent contribution that offers a robust cleaning pipeline, exact deduplication via SHA-256, language filtering, heavy normalization, simple semantic classification, complete and reproducible logs, and support for tens of millions of lines. The goal is to provide clean, consistent, ready-to-use data for language model training, with guaranteed deterministic ordering and effective removal of noise, duplicates, and textual garbage.
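The description mentions exact deduplication via SHA-256 with deterministic ordering. As a rough illustration only, the idea can be sketched in Python as below. This is not the repository's actual pipeline: the `normalize` and `dedup_exact` helpers are hypothetical names, and the normalization shown (Unicode NFKC plus whitespace collapsing) is an assumption standing in for whatever "heavy normalization" the dataset applies.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Assumed normalization, for illustration only:
    # Unicode NFKC followed by collapsing runs of whitespace.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def dedup_exact(lines):
    """Yield each distinct normalized line once, in first-seen order."""
    seen = set()
    for line in lines:
        cleaned = normalize(line)
        if not cleaned:
            continue  # skip lines that are empty after normalization
        digest = hashlib.sha256(cleaned.encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier line
        seen.add(digest)
        yield cleaned


sample = ["Olá,  mundo!", "Olá, mundo!", "", "Bom dia."]
result = list(dedup_exact(sample))
```

Because lines are emitted in first-occurrence order and the hash depends only on the normalized text, the same input always produces the same output, which is one simple way to get deterministic ordering. Storing 32-byte digests rather than full strings keeps the memory used by the `seen` set bounded per line.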
Provided by:
tiagoloeblein



