five

Quranic

Mendeley Data, indexed 2026-04-18
Download link:
https://data.mendeley.com/datasets/rk96pn66m4
Resource description:
This Quranic dataset addresses the need for comprehensive, computationally accessible linguistic resources for Classical Arabic (CA). The underlying premise is that the lack of such resources, particularly a complete machine-readable syntactic layer, hinders progress in CA NLP. The dataset demonstrates that such a resource can be built for the entire Holy Quran by combining computational methods with expert validation.

The data (~132,736 tokens) comprises three integrated layers:
Orthographic: standard Imlaai and Quran-specific Uthmani scripts, Buckwalter and phonetic transliterations, English translation, and dual (Quranic and sentence-based) indexing.
Morphological: fine-grained Part-of-Speech tagging, detailed morphosyntactic features (case, mood, aspect, etc.), lemma, and root information, based on refined, expert-validated schemas.
Syntactic: the first complete, computationally processable syntactic annotation of the entire Quran, using a novel hybrid Constituency-Dependency framework.

Data collection involved sourcing the foundational text and annotations from public resources (Tanzil, Quranic Corpus, Comprehensive Islamic Library). Custom Python scripts handled orthographic processing, morphological re-annotation, and preparation of the syntactic seed data (image-to-text conversion). A Deep Learning parser (a BiLSTM architecture using custom Word2Vec embeddings derived from classical texts) generated the comprehensive syntactic layer. All layers underwent manual validation, including expert review and, crucially, cross-referencing of the generated syntax against authoritative I'rab (grammatical analysis) references.

Notable findings embodied by the dataset itself include the successful large-scale application of a hybrid syntactic annotation model to the entire Quran and the effective integration of rich, multi-faceted linguistic information within a unified structure.
Data is presented primarily in an extended CoNLL-X tabular format, accompanied by auxiliary files (lexicons, schemas).

Interpretation and Reuse: This Quranic dataset serves as a benchmark for CA NLP. Researchers can use it to train and evaluate parsers, morphological analyzers, POS taggers, and diacritization models. It offers rich empirical data for theoretical linguistics and a foundation for pedagogical tools, digital humanities projects, and other CA language technologies. An associated analytical tool (Noor) aids visualization and exploration. Users should note that the syntactic layer, while extensively validated, awaits further exhaustive manual curation before it reaches definitive gold-standard status.
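
As a rough illustration of how a CoNLL-X style file can be consumed, the sketch below reads tab-separated token lines grouped into sentences by blank lines. It assumes the ten standard CoNLL-X columns come first and keeps any further columns in an `extra` list, because the exact layout of this dataset's extended columns is not described here. The sample rows, token names, and feature keys are invented placeholders, not data from the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-X columns. The dataset's extra columns are unknown
# here, so anything beyond these is kept as-is in Token.extra (assumption).
CONLLX_COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


@dataclass
class Token:
    fields: Dict[str, str]
    extra: List[str] = field(default_factory=list)


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a 'Key=Value|Key=Value' feature string; '_' means no features."""
    if feats in ("", "_"):
        return {}
    parsed = {}
    for item in feats.split("|"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else "true"
    return parsed


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per blank-line-separated sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < len(CONLLX_COLUMNS):
            raise ValueError(f"expected at least 10 columns, got {len(cols)}")
        fields = dict(zip(CONLLX_COLUMNS, cols))
        sentence.append(Token(fields, cols[len(CONLLX_COLUMNS):]))
    if sentence:
        yield sentence


# Invented placeholder rows (not taken from the dataset); the 11th column
# stands in for whatever extended columns the real files carry.
SAMPLE = (
    "1\ttok1\tlem1\tN\tN\tCase=Nom\t2\tsubj\t_\t_\tx1\n"
    "2\ttok2\tlem2\tV\tV\tAspect=Perf|Mood=Ind\t0\troot\t_\t_\tx2\n"
    "\n"
    "1\ttok3\tlem3\tP\tP\t_\t0\troot\t_\t_\tx3\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
for sent in sentences:
    for tok in sent:
        print(tok.fields["id"], tok.fields["form"],
              tok.fields["head"], parse_feats(tok.fields["feats"]))
```

Keeping unknown columns in a plain list avoids guessing their meaning; once the dataset's schema files have been consulted, those positions can be given proper names.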
Created:
2025-04-25