Shamela

Source: arXiv; updated 2016-12-29; indexed 2024-06-21
Download link:
http://shamela.ws
Description:
Shamela is a large-scale historical Arabic corpus created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and collaborating institutions. It contains more than 6,100 texts totaling approximately 1 billion words, of which 800 million come from dated texts. The texts range from the 7th century to the modern era and are drawn primarily from the Al-Maktaba Al-Shamela website. In building the corpus, the research team cleaned the texts, ran morphological analysis, and applied semantic enhancement. The corpus is used mainly in digital humanities research, where it supports linguistic analysis of historical Arabic and the study of Arabic cultural history.
Provider:
MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Created:
2016-12-29