
Large and Diverse Arabic Corpus

Source: arXiv | Updated: 2023-05-09 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2201.09227v3
Resource description:
The Large and Diverse Arabic Corpus was developed by the Data Science and Artificial Intelligence Laboratory at New York University Abu Dhabi to provide training data for large language models (LLMs). It contains over 500 GB of cleaned Arabic text spanning news, academic, social media, religious, and cultural domains, among others, and it covers multiple Arabic dialects. The dataset was built by collecting raw data from multiple sources and then standardizing and cleaning it. It is used mainly for natural language processing (NLP) tasks such as machine reading comprehension, text summarization, and sentiment analysis, with the aim of improving the performance and generalization of Arabic language models.
Provider:
Data Science and Artificial Intelligence Laboratory, New York University Abu Dhabi
Created:
2022-01-23