nomeda-lab/hindawi-arabic-sections

Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/nomeda-lab/hindawi-arabic-sections
Description:
The Hindawi Arabic Books Sections Dataset is a cleaned, section-level dataset of Arabic books from Hindawi.org, prepared for NLP training and research. It contains 52,830 sections from 3,265 books, totaling 843,868,321 characters. Each section has passed through a cleaning pipeline that removes English/Latin text and characters, removes Arabic diacritics, normalizes Arabic letters, and deduplicates the results, among other steps. The data is split into training (90%), validation (5%), and test (5%) sets. Feature columns include the text, book title, author, category, section title, character count, and book ID. Categories covered include literature, history, novels, philosophy, and poetry, among others.
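The cleaning steps and the 90/5/5 split described above can be illustrated with a minimal Python sketch. This is not the dataset authors' pipeline: the function names (`clean_section`, `deduplicate`, `split_90_5_5`), the letter-normalization map, exact-match deduplication, and the shuffled section-level split are all assumptions chosen for illustration, since the listing does not state the exact rules.

```python
import hashlib
import random
import re

# Arabic diacritics (tashkeel) U+064B..U+0652 plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Basic Latin letters plus Latin-1/Extended-A/B letters (assumed scope).
LATIN = re.compile(r"[A-Za-z\u00C0-\u024F]+")
WHITESPACE = re.compile(r"\s+")

# Assumed normalization map (a common convention, not confirmed by the source).
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
})


def clean_section(text):
    """Remove Latin text and diacritics, normalize letters, collapse whitespace."""
    text = LATIN.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.translate(LETTER_MAP)
    return WHITESPACE.sub(" ", text).strip()


def deduplicate(sections):
    """Drop exact duplicates, keeping the first occurrence of each section."""
    seen = set()
    unique = []
    for section in sections:
        key = hashlib.sha256(section.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(section)
    return unique


def split_90_5_5(items, seed=0):
    """Shuffle deterministically and split into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

A typical use would be to map `clean_section` over the raw sections, pass the result through `deduplicate`, and then call `split_90_5_5`. Whether the published split was made at the section level or the book level is not stated, so a book-level split may be preferable when leakage between sections of the same book matters.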
Provided by:
nomeda-lab