
A Corpus of Arabic Literature (19-20th centuries) for Stylometric Tests

Indexed by NIAID Data Ecosystem on 2026-03-13
Download link:
https://zenodo.org/record/5772260
Description:
The dataset contains three collections of mainly literary Arabic texts from the 19th and early 20th centuries. corpus022_JurjiZaydan_Dated is a dated corpus of 22 historical novels by Jurjī Zaydān. Zaydān published roughly one novel per year and the publication dates are well known, which makes this corpus valuable material for testing chronological changes in the style of an individual writer. corpus065 is a corpus of 65 books by 8 authors; corpus300 contains 300 books by 28 authors. The texts were collected from https://www.hindawi.org/. The original EPUB files have been converted into clean text files (UTF-8 encoding), and Arabic orthography has been normalized as follows: short vowels removed; the orthography of alif simplified (all alif variants converted to bare alifs); alif maqṣūraŧs converted to yāʾs. For the names of authors and works, the URIs follow the naming conventions of the OpenITI project (https://github.com/openiti; for the most up-to-date description, see https://kitab-project.org/corpus-and-data). However, for better compatibility with R stylo, an underscore (_) connects the parts of each URI: AUTHOR_TITLE (in the dated Jurjī Zaydān corpus, files are named YEAR_TITLE).
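The normalization steps and the file-naming convention described above can be illustrated with a minimal Python sketch. This is not the script used to build the corpus: the exact character sets are assumptions (the diacritic range U+064B to U+0652 also covers tanwīn, shadda, and sukūn, and the alif variants listed are the common ones), and the function names and the example file name are hypothetical.

```python
import re
from pathlib import Path

# ASSUMPTION: "short vowels" is taken to mean the diacritic range
# U+064B..U+0652 (tanwin, fatha, damma, kasra, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")

# ASSUMPTION: alif variants to simplify are alif with madda (U+0622),
# hamza above (U+0623), hamza below (U+0625), and alif wasla (U+0671).
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625\u0671]")

BARE_ALIF = "\u0627"
ALIF_MAQSURA = "\u0649"
YA = "\u064A"


def normalize_arabic(text: str) -> str:
    """Apply the three normalization steps listed in the description."""
    text = DIACRITICS.sub("", text)            # remove short vowels
    text = ALIF_VARIANTS.sub(BARE_ALIF, text)  # simplify alif orthography
    return text.replace(ALIF_MAQSURA, YA)      # alif maqsura -> ya


def split_file_name(name: str) -> tuple[str, str]:
    """Split AUTHOR_TITLE (or YEAR_TITLE) at the first underscore.

    ASSUMPTION: the first underscore is the separator and any file
    extension can be dropped with Path.stem.
    """
    prefix, _, title = Path(name).stem.partition("_")
    return prefix, title
```

For a hypothetical file named `1891_SomeTitle.txt`, `split_file_name` would return the year and the title as separate strings, which is the kind of label that stylo-style tools typically read from the part of the name before the underscore.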

Created:
2021-12-10