atlasia/FineWeb2-Moroccan-Arabic-Predictions-0.9
Source: Hugging Face | Updated: 2025-01-03 | Indexed: 2025-02-15
Download link:
https://hf-mirror.com/datasets/atlasia/FineWeb2-Moroccan-Arabic-Predictions-0.9
Resource description:
The Moroccan Darija Dataset is a filtered subset of the FineWeb2 dataset. It contains samples of Darija, the Arabic dialect widely spoken in Morocco, and was created to fill a gap in natural language processing resources for this dialect. It is suited to tasks such as language modeling, sentiment analysis, machine translation, and dialect classification. The dataset was built in two stages: an initial extraction with GlotLID, followed by a refined selection with the SfaIA model, a classifier trained to identify Moroccan Darija text. Only samples that the model assigned to Moroccan Darija with a high confidence score were kept.
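The two-stage selection described above (a coarse language-ID pass, then a confidence cutoff from a dialect classifier) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function `two_stage_filter`, the label string `ary_Arab`, the 0.9 threshold, and both toy classifiers are assumptions introduced here, and the real GlotLID and SfaIA models are not called.

```python
from typing import Callable, Iterable, List


def two_stage_filter(
    texts: Iterable[str],
    coarse_label: Callable[[str], str],
    fine_score: Callable[[str], float],
    target: str = "ary_Arab",   # assumed label for Moroccan Arabic
    threshold: float = 0.9,     # assumed confidence cutoff
) -> List[str]:
    """Keep texts that pass a coarse language-ID check and a confidence cutoff."""
    kept = []
    for text in texts:
        # Stage 1: coarse language identification (the role GlotLID plays).
        if coarse_label(text) != target:
            continue
        # Stage 2: dialect-classifier confidence (the role the SfaIA model plays).
        if fine_score(text) >= threshold:
            kept.append(text)
    return kept


# Toy stand-ins so the sketch runs without any model; NOT the real classifiers.
def toy_coarse_label(text: str) -> str:
    return "ary_Arab" if "[ary]" in text else "other"


def toy_fine_score(text: str) -> float:
    return 0.95 if "high" in text else 0.5


samples = ["[ary] high sample", "[ary] low sample", "[eng] high sample"]
result = two_stage_filter(samples, toy_coarse_label, toy_fine_score)
print(result)
```

Passing the classifiers in as callables keeps the filtering logic independent of any particular model, so either stage could be swapped for a real predictor without changing the loop.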
Provider:
atlasia



