
sawalni-ai/fw-darija

Source: Hugging Face. Updated 2024-12-08. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/sawalni-ai/fw-darija
Resource overview:
This is a multilingual text dataset containing over 50 million sentences across more than 100 languages, primarily sourced from the Common Crawl corpus. Languages are classified with the GlotLID model; to improve identification of low-resource languages, Moroccan Arabic in particular, the Gherbal language identification model is also applied. The dataset includes multiple configurations, each corresponding to a different language. The processing pipeline involves text cleaning, sentence segmentation, language detection, and filtering. The result is a high-quality Moroccan Arabic dataset that can be used to train and evaluate models. The accompanying analysis also examines the source websites to understand how Moroccan Arabic is used on the web.
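
The four pipeline stages named in the description (text cleaning, sentence segmentation, language detection, filtering) could be sketched roughly as follows. This is a minimal illustration, not the dataset authors' implementation: the function names, the confidence and length thresholds, and the `toy_detector` rule are all hypothetical, and a real pipeline would call a language identification model such as GlotLID or Gherbal where the stand-in detector is used here. The only external fact relied on is that `ary` is the ISO 639-3 code for Moroccan Arabic.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Hypothetical detector interface: sentence -> (language code, confidence).
# A real pipeline would wrap a model such as GlotLID or Gherbal here.
Detector = Callable[[str], Tuple[str, float]]


def clean_text(text: str) -> str:
    """Remove HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive split after '.', '!', '?' or the Arabic question mark."""
    parts = re.split(r"(?<=[.!?؟])\s+", text)
    return [p for p in parts if p]


def filter_language(
    sentences: Iterable[str],
    detect: Detector,
    target: str = "ary",      # ISO 639-3 code for Moroccan Arabic
    min_conf: float = 0.9,    # assumed threshold, not from the source
    min_chars: int = 10,      # assumed threshold, not from the source
) -> List[str]:
    """Keep sentences detected as the target language with enough confidence."""
    kept = []
    for sentence in sentences:
        if len(sentence) < min_chars:
            continue
        lang, conf = detect(sentence)
        if lang == target and conf >= min_conf:
            kept.append(sentence)
    return kept


def toy_detector(sentence: str) -> Tuple[str, float]:
    """Keyword rule for demonstration only; not a real language model."""
    markers = ("بزاف", "ديال", "واش")
    if any(marker in sentence for marker in markers):
        return ("ary", 0.95)
    return ("und", 0.5)


def run_pipeline(raw_docs: Iterable[str], detect: Detector) -> List[str]:
    """Clean, segment, detect, and filter each raw document in turn."""
    kept: List[str] = []
    for doc in raw_docs:
        sentences = split_sentences(clean_text(doc))
        kept.extend(filter_language(sentences, detect))
    return kept
```

Calling `run_pipeline(docs, toy_detector)` on a list of raw web strings returns only the sentences the detector labels as `ary`. Keeping the detector as a plain callable means the keyword rule can be swapped for a real model without touching the other stages.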
Provided by:
sawalni-ai