HeshamHaroon/asfar
Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/HeshamHaroon/asfar
Description:
Asfar (أَسْفَار, plural of سِفْر, meaning tome or large book) is a cleaned, page-indexed corpus of Arabic heritage literature (grammar, lexicography, adab, history, and poetry commentary) drawn from public-domain scans on Archive.org and normalized for language-model training. The dataset contains 123,062 pages from 461 PDF volumes across 6 Archive.org collections; the text is in Modern Standard and Classical Arabic. The text has been NFC-normalized, with tatweel (the elongation character) removed, tashkeel (diacritics) stripped, lam-alef ligatures expanded, and whitespace collapsed. It is suitable for tasks such as causal LM pre-training, OCR denoising, and lexical and grammatical resource development.
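The normalization steps listed in the description can be sketched roughly as follows. This is an illustrative reconstruction, not the dataset author's actual pipeline: the function name `normalize_page`, the order of the steps, and the exact code-point ranges treated as tashkeel (U+064B to U+0652 plus the superscript alef U+0670) are assumptions.

```python
import re
import unicodedata

# Assumption: tashkeel here means the harakat U+064B..U+0652 plus the
# superscript alef U+0670; the dataset's real set may be wider or narrower.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"

# Lam-alef presentation-form ligatures (isolated and final forms) mapped to
# their two-letter sequences. NFC leaves these untouched, so they are
# expanded explicitly.
LAM_ALEF = str.maketrans({
    "\uFEF5": "\u0644\u0622", "\uFEF6": "\u0644\u0622",  # lam + alef with madda
    "\uFEF7": "\u0644\u0623", "\uFEF8": "\u0644\u0623",  # lam + alef with hamza above
    "\uFEF9": "\u0644\u0625", "\uFEFA": "\u0644\u0625",  # lam + alef with hamza below
    "\uFEFB": "\u0644\u0627", "\uFEFC": "\u0644\u0627",  # lam + plain alef
})

WHITESPACE = re.compile(r"\s+")


def normalize_page(text: str) -> str:
    """Hypothetical helper applying the five steps named in the description."""
    text = unicodedata.normalize("NFC", text)   # canonical composition
    text = text.replace(TATWEEL, "")            # remove tatweel
    text = TASHKEEL.sub("", text)               # strip tashkeel
    text = text.translate(LAM_ALEF)             # expand lam-alef ligatures
    return WHITESPACE.sub(" ", text).strip()    # collapse whitespace
```

Applied to raw OCR output from another source, a function like this would bring the text closer to the form described for Asfar, which may help when mixing corpora for pre-training.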
Provider:
HeshamHaroon



