Abdallah4Zain/hindawi-arabic-sections
Hugging Face · Updated 2026-04-24 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Abdallah4Zain/hindawi-arabic-sections
Description:
Hindawi Arabic Books - Sections Dataset is a cleaned, section-level dataset of Arabic books from Hindawi.org, prepared for NLP training and research. It covers categories including literature, philosophy, history, science, and psychology, among others. Cleaning included removing English and Latin-script text, stripping Arabic diacritics, normalizing Arabic letters, and deduplicating. The dataset contains 52,830 sections from 3,265 books, totaling 843,868,321 characters, and is split into training (90%), validation (5%), and test (5%) sets. Each entry includes the text, book title, author, category, section title, character count, and book ID.
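The cleaning steps named in the description (removing Latin-script text, stripping Arabic diacritics, normalizing Arabic letters, deduplicating) might look roughly like the sketch below. This is an illustration only, not the dataset author's pipeline: the function names `clean_section` and `deduplicate`, the exact character ranges, the treatment of tatweel as removable, and the letter-normalization map are all assumptions.

```python
import re

# Assumed diacritic set: tashkeel U+064B..U+0652, superscript alef U+0670,
# and tatweel U+0640 (treating tatweel as removable is an assumption).
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Assumed definition of "Latin text": basic Latin letters plus Latin-1 and
# Latin Extended-A/B ranges.
LATIN = re.compile("[A-Za-z\u00C0-\u024F]+")

# Hypothetical normalization map: alef variants become bare alef,
# alef maqsura becomes yeh.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
})

WHITESPACE = re.compile(r"\s+")


def clean_section(text: str) -> str:
    """Apply the assumed cleaning steps to one section of text."""
    text = LATIN.sub(" ", text)         # drop Latin-script runs
    text = DIACRITICS.sub("", text)     # strip diacritics
    text = text.translate(LETTER_MAP)   # normalize letter variants
    return WHITESPACE.sub(" ", text).strip()


def deduplicate(sections):
    """Keep the first occurrence of each exact-duplicate section, in order."""
    seen = set()
    unique = []
    for section in sections:
        if section not in seen:
            seen.add(section)
            unique.append(section)
    return unique
```

Cleaning before deduplicating means sections that differ only in diacritics or letter variants collapse to the same string; whether the original pipeline ordered the steps this way is not stated in the description.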
Provider:
Abdallah4Zain



