ClassiCC-Corpus/ClassiCC-PT
Source: Hugging Face. Updated: 2026-02-02. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/ClassiCC-Corpus/ClassiCC-PT
Description:
ClassiCC-PT is a large-scale Portuguese web corpus of approximately 120 billion tokens, built for training large Portuguese language models. It is drawn from Common Crawl snapshots and processed through language filtering, HTML text extraction, deduplication, and neural classifier-based filtering, with classifiers for educational, STEM, and toxic content. It is suited to continued pretraining that adapts large language models to Portuguese.
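The deduplication and classifier-filtering stages named in the description can be illustrated with a minimal Python sketch. This is not the corpus authors' pipeline. The field names (`text`, `edu_score`, `toxic_score`), the thresholds, and the use of exact hash matching are all assumptions made for illustration, and the listing does not state the real schema or deduplication method.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())


def dedup_exact(docs):
    # Keep the first document for each distinct normalized-text hash.
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def filter_by_scores(docs, min_edu=0.5, max_toxic=0.2):
    # Hypothetical thresholds on hypothetical classifier scores in [0, 1].
    return [
        d for d in docs
        if d["edu_score"] >= min_edu and d["toxic_score"] <= max_toxic
    ]


# Toy documents with made-up scores (assumed fields, not the real schema).
docs = [
    {"text": "A fotossíntese converte luz em energia química.",
     "edu_score": 0.9, "toxic_score": 0.01},
    {"text": "A  fotossíntese converte luz em energia QUÍMICA.",
     "edu_score": 0.9, "toxic_score": 0.01},
    {"text": "Compre agora!!!", "edu_score": 0.1, "toxic_score": 0.05},
]

cleaned = filter_by_scores(dedup_exact(docs))
print(len(cleaned))
```

Deduplicating before filtering avoids scoring the same text twice. A web-scale pipeline would typically use near-duplicate detection rather than exact hashing, which is used here only to keep the sketch short.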
Provider:
ClassiCC-Corpus



