sawalni-ai/fw-darija

Name: sawalni-ai/fw-darija
Creator: sawalni-ai
Published: 2024-12-08 19:58:24
License: No description available

Hugging Face | Updated 2024-12-08 | Indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/sawalni-ai/fw-darija


Resource overview:

This is a multilingual text dataset containing over 50 million sentences across more than 100 languages, primarily sourced from the Common Crawl corpus. Language labels come from the GlotLID model; to improve identification of low-resource languages, Moroccan Arabic in particular, the Gherbal language identification model is applied as well. The dataset is split into multiple configurations, one per language. The processing pipeline consists of text cleaning, sentence segmentation, language detection, and filtering. The resulting dataset can be used to train and evaluate models for Moroccan Arabic. The accompanying analysis also examines the source websites to understand how Moroccan Arabic is used on the web.
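The four pipeline stages named above (text cleaning, sentence segmentation, language detection, filtering) can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code: every function name is hypothetical, the sentence splitter is a naive regex, and `toy_detect` is a keyword-based stand-in that does not call GlotLID or Gherbal. The label "ary" (the ISO 639-3 code for Moroccan Arabic) and the 0.9 confidence threshold are likewise assumptions.

```python
import re
from typing import Callable, List, Tuple

# A detector maps a sentence to (language_label, confidence).
Detector = Callable[[str], Tuple[str, float]]


def clean_text(raw: str) -> str:
    """Drop HTML-like tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def split_sentences(text: str) -> List[str]:
    """Naive split after '.', '!', '?' or the Arabic question mark."""
    parts = re.split(r"(?<=[.!?؟])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def toy_detect(sentence: str) -> Tuple[str, float]:
    """Hypothetical stand-in detector: NOT GlotLID or Gherbal.

    Flags a sentence as Moroccan Arabic ("ary") if it contains one of a
    few common Darija words; everything else is labeled "und".
    """
    markers = {"بزاف", "واش", "ديال"}
    tokens = set(sentence.replace("؟", " ").replace(".", " ").split())
    return ("ary", 1.0) if tokens & markers else ("und", 0.0)


def filter_by_language(
    sentences: List[str],
    detect: Detector,
    target: str,
    min_conf: float = 0.9,  # assumed threshold, not from the source
) -> List[str]:
    """Keep sentences labeled `target` with confidence >= min_conf."""
    kept = []
    for sentence in sentences:
        label, conf = detect(sentence)
        if label == target and conf >= min_conf:
            kept.append(sentence)
    return kept


def run_pipeline(
    raw: str, detect: Detector = toy_detect, target: str = "ary"
) -> List[str]:
    """Clean, segment, detect, and filter one raw document."""
    return filter_by_language(split_sentences(clean_text(raw)), detect, target)


if __name__ == "__main__":
    sample = "<p>واش  نتا بخير؟ Hello world. هاد الشي زوين بزاف.</p>"
    for sentence in run_pipeline(sample):
        print(sentence)
```

In a real setting the `detect` argument would wrap an actual language identification model; keeping it as a plain callable means the rest of the pipeline does not depend on which model is used.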

Provided by:

sawalni-ai
