atlasia/No-Arabic-Dialect-Left-Behind

Name: atlasia/No-Arabic-Dialect-Left-Behind
Creator: atlasia
Published: 2024-12-19 19:40:19
License: No description available

Hugging Face | Updated 2024-12-19 | Indexed 2024-12-21

Download link:

https://hf-mirror.com/datasets/atlasia/No-Arabic-Dialect-Left-Behind

Overview:

This dataset is a compilation of text written in Modern Standard Arabic (MSA) and multiple Arabic dialects, drawn from various sources and covering Moroccan, Algerian, Egyptian, Levantine, Middle-Eastern, and other varieties. It was used to train the Moroccan language identifier and is intended to support language identification, dialect classification, and other natural language processing tasks for Arabic text. Its stated features are dialect diversity, the inclusion of MSA, and high-quality sources. The data is provided in tabular form, with fields such as text, dialect label, dataset source, and metadata. Supported uses include dialect classification and word-embedding training. Curation consisted of data collection followed by cleaning and preprocessing. Limitations include overlap between dialects and potential biases. Ethical considerations include privacy protection, fair use, and cultural sensitivity. Future work includes better balancing the data distribution across dialects.
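The tabular layout and the dialect-classification use described above can be illustrated with a minimal sketch. It assumes hypothetical column names (`text`, `dialect`, `source`) and placeholder rows; the real schema and label values are not given here and should be checked on the dataset page. The sketch computes each dialect's share of the rows, which relates to the balance concern noted above, and derives binary labels for a Moroccan-versus-other identifier.

```python
from collections import Counter

# Placeholder rows mirroring the described tabular layout. The column names
# ("text", "dialect", "source") and all values are assumptions for
# illustration, not the dataset's actual schema or contents.
rows = [
    {"text": "<sample 1>", "dialect": "Moroccan", "source": "example_a"},
    {"text": "<sample 2>", "dialect": "Egyptian", "source": "example_b"},
    {"text": "<sample 3>", "dialect": "Moroccan", "source": "example_a"},
    {"text": "<sample 4>", "dialect": "MSA", "source": "example_c"},
]


def dialect_distribution(rows, label_key="dialect"):
    """Return each dialect's share of the rows as a fraction of the total."""
    counts = Counter(row[label_key] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def to_binary_labels(rows, target="Moroccan", label_key="dialect"):
    """Map each row to 1 if its dialect equals `target`, otherwise 0."""
    return [1 if row[label_key] == target else 0 for row in rows]


if __name__ == "__main__":
    print(dialect_distribution(rows))
    print(to_binary_labels(rows))
```

With the real data, the rows could instead come from the Hugging Face `datasets` library (for example via `load_dataset` with the repository name), with the key names adjusted to match the actual columns.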

Provider:

atlasia
