khaledyusuf44/somaliweb-v1
Source: Hugging Face. Updated: 2026-04-27. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/khaledyusuf44/somaliweb-v1
Resource description:
SomaliWeb v1 is a cleaned, deduplicated, and quality-filtered Somali-language web corpus of ~303 million tokens (819,322 documents). It was built by aggregating three public Somali-heavy web distributions (HPLT v2, CC100, Somali Wikipedia) and passing them through a reproducible six-stage pipeline. It is the first dedicated and versioned Somali-only pretraining corpus released on Hugging Face with a complete dataset card. The language is Somali (Standard Somali, Latin script), with no Maay Maay detected. The dataset includes a train split (778,355 documents, ~288M tokens) and a validation split (40,967 documents, ~15M tokens). Each document carries fields including ID, text, source, word count, and quality score. The six pipeline stages are merge and deduplication, cleaning and normalization, language identification verification, near-duplicate detection, quality filtering, and release structuring. The corpus is intended for Somali LLM pretraining, tokenizer training, low-resource NLP research, and downstream task fine-tuning. Its limitations include coverage of Standard Somali only and incomplete PII removal. The license is CC-BY-SA 4.0.
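The split figures quoted above can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the dataset card: the `SPLITS` table simply restates the counts from the description, and `split_summary` and `load_somaliweb` are hypothetical helper names. The loader assumes the Hugging Face `datasets` library is installed and that the Hub repository exposes splits named `train` and `validation`, which the description suggests but does not spell out.

```python
# Document counts and approximate token counts, as stated in the description.
SPLITS = {
    "train": {"documents": 778_355, "tokens_millions": 288},
    "validation": {"documents": 40_967, "tokens_millions": 15},
}


def split_summary(splits):
    """Return total documents, total tokens (millions), and each split's document share."""
    total_docs = sum(s["documents"] for s in splits.values())
    total_tokens = sum(s["tokens_millions"] for s in splits.values())
    shares = {name: s["documents"] / total_docs for name, s in splits.items()}
    return total_docs, total_tokens, shares


def load_somaliweb(split="train"):
    """Load one split from the Hugging Face Hub (needs `datasets` and network access).

    Assumption: the split names "train" and "validation" match the Hub repository.
    """
    from datasets import load_dataset  # imported lazily so the summary works offline

    return load_dataset("khaledyusuf44/somaliweb-v1", split=split)


if __name__ == "__main__":
    total_docs, total_tokens, shares = split_summary(SPLITS)
    print(f"documents: {total_docs:,}  tokens: ~{total_tokens}M")
    for name, share in shares.items():
        print(f"{name}: {share:.1%} of documents")
```

The two splits add up to the stated totals (819,322 documents, ~303M tokens), and the validation split holds roughly 5% of the documents. Column names are not given verbatim in the description, so inspect a loaded split's `column_names` before relying on any field.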
Provider:
khaledyusuf44



