OpenITI
arXiv · Updated 2018-09-11 · Indexed 2024-06-21
Download link:
https://github.com/OpenITI
Overview:
OpenITI is a large-scale corpus of historical Arabic texts created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and other institutions. It contains approximately 150 million words and covers 1,400 years of Arabic writing. The texts are drawn mainly from digital libraries, including Al-Maktaba Al-Shamela, the Shia Online Library, and Al-Jami’ Al-Kabir, and span genres such as religious and literary works. Building the corpus involved collecting the texts, unifying their formats, and standardizing metadata. Its applications include research on Arabic history, analysis of language change, and text reuse detection, with the aim of supporting the digitization and analysis of historical Arabic documents.
Provided by:
MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Created:
2018-09-11



