sawalni-ai/fw-darija
Source: Hugging Face. Updated: 2024-12-08. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/sawalni-ai/fw-darija
Description:
A multilingual text dataset of more than 50 million sentences covering over 100 languages, sourced primarily from the Common Crawl corpus. Languages are first classified with the GlotLID model; the Gherbal language identification model is then applied to improve identification of low-resource languages, Moroccan Arabic (Darija) in particular. The dataset is split into multiple configurations, one per language. The processing pipeline consists of text cleaning, sentence segmentation, language detection, and filtering, and the resulting data can be used to train and evaluate models for Moroccan Arabic. The accompanying analysis also examines the source websites to show how Moroccan Arabic is used on the web.
Provided by:
sawalni-ai
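
The four pipeline stages named in the description (text cleaning, sentence segmentation, language detection, filtering) can be illustrated with a minimal Python sketch. This is not the dataset authors' code: the function names, the thresholds, the "ary" label and the keyword-based `toy_detector` are all assumptions made for illustration. In a real pipeline the detector would be a language identification model such as GlotLID or Gherbal, whose APIs are not shown here.

```python
import re
import unicodedata
from typing import Callable, List, Tuple

# Hypothetical detector signature: sentence -> (language label, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]

_URL = re.compile(r"https?://\S+")
_WS = re.compile(r"\s+")
# Split after Latin or Arabic sentence-ending punctuation followed by whitespace.
_SENT_END = re.compile(r"(?<=[.!?؟۔])\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = _URL.sub(" ", text)
    return _WS.sub(" ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive punctuation-based sentence segmentation."""
    return [s.strip() for s in _SENT_END.split(text) if s.strip()]


def filter_by_language(
    sentences: List[str],
    detect: Detector,
    target: str,
    min_conf: float = 0.9,   # assumed threshold, not taken from the source
    min_chars: int = 10,     # assumed minimum sentence length
) -> List[str]:
    """Keep sentences the detector assigns to `target` with enough confidence."""
    kept = []
    for sentence in sentences:
        if len(sentence) < min_chars:
            continue
        label, conf = detect(sentence)
        if label == target and conf >= min_conf:
            kept.append(sentence)
    return kept


def toy_detector(sentence: str) -> Tuple[str, float]:
    """Stand-in for a real language-ID model (NOT GlotLID or Gherbal).

    Flags a sentence as Moroccan Arabic ("ary", the ISO 639-3 code) if it
    contains one of a few common Darija words.
    """
    markers = {"بزاف", "ديال", "واش"}
    words = set(sentence.replace("؟", " ").replace(".", " ").split())
    return ("ary", 0.95) if words & markers else ("other", 0.5)


def process(doc: str, detect: Detector = toy_detector, target: str = "ary") -> List[str]:
    """Run cleaning, segmentation, detection, and filtering on one document."""
    return filter_by_language(split_sentences(clean_text(doc)), detect, target)
```

Calling `process(doc)` on a raw page string returns only the sentences the detector labels as the target language. Passing the detector in as a plain callable keeps the cleaning and filtering logic independent of whichever language identification model is plugged in.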



