naab
Source: arXiv. Updated: 2022-08-29. Indexed: 2024-06-21.
Download link:
https://huggingface.co/datasets/SLPL/naab
Description:
naab is a cleaned Farsi text corpus created by the Department of Computer Engineering at Sharif University of Technology, and is presented as the largest such dataset for the language. It contains approximately 130 GB of data, 250 million paragraphs, and 15 billion words. The name comes from the Farsi word for 'pure' and 'high-grade'. The content spans formal and informal text, classical and modern writing, and prose and poetry. The corpus was built by merging and cleaning several existing datasets, and a streaming approach is used to simplify preprocessing. It is intended primarily for Farsi natural language processing (NLP) research, in particular for training and fine-tuning self-supervised models such as Transformer-based language models.
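Because the corpus is about 130 GB, reading it lazily is usually more practical than downloading it in full. The following is a minimal sketch of streaming access, assuming the Hugging Face `datasets` library is installed, that the repository exposes a `train` split, and that each record stores its paragraph under a `text` field. The helper names `stream_naab` and `count_words` are hypothetical and are not part of the dataset or the library.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def stream_naab(split: str = "train") -> Iterator[Dict[str, str]]:
    """Lazily yield records from the Hub without downloading the full corpus."""
    # Imported here so count_words() below works without the `datasets` package.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches records on demand.
    # The split name "train" is an assumption; check the dataset card.
    return iter(load_dataset("SLPL/naab", split=split, streaming=True))


def count_words(
    records: Iterable[Dict[str, str]], limit: int, field: str = "text"
) -> int:
    """Count whitespace-separated tokens in the first `limit` records.

    The field name "text" is an assumption about the record schema.
    """
    return sum(len(record[field].split()) for record in islice(records, limit))
```

With network access, `count_words(stream_naab(), limit=1000)` would read only the first 1,000 records, which is a cheap way to check the schema before committing to a full pass over the corpus.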
Provided by:
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Created:
2022-08-29



