
NADI-2025-Sub-task-3-all

ModelScope Community · Updated 2025-12-05 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/MBZUAI/NADI-2025-Sub-task-3-all
Description:
For training and developing your models in the **closed track**, we provide the following datasets, which are publicly available on Hugging Face. The datasets represent a wide range of Arabic varieties and recording conditions, with over 85K training sentences in total. They consist of dialectal, Modern Standard, Classical, and code-switched Arabic speech and transcriptions. All subsets except Mixat and ArzEn are diacritized.

| Dataset   | Type             | Diacritized | Train  | Dev |
|-----------|------------------|:-----------:|:------:|:---:|
| MDASPC    | Multi-dialectal  | True        | 60677  | >1K |
| TunSwitch | Dialectal, CS    | True        | 5212   | 165 |
| ClArTTS   | CA               | True        | 9500   | 205 |
| ArVoice   | MSA              | True        | 2507   | –   |
| ArzEn     | Dialectal, CS    | False       | 3344   | –   |
| Mixat     | Dialectal, CS    | False       | 3721   | –   |

CS = code-switched, CA = Classical Arabic, MSA = Modern Standard Arabic.

We removed samples containing fewer than three words and stripped punctuation from all datasets to improve consistency and quality. The resulting dataset contains 57K training samples and 1.5K dev samples.

For the closed track, you may use the full train/dev sets or a subset of them (for example, you may wish to use the undiacritized subsets for semi-supervised training, or rely only on the diacritized subsets). For the open track, you can use these resources and/or any other resources for training, as long as they don't overlap with the test sets.
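The filtering described above (dropping samples with fewer than three words and removing punctuation) can be approximated with a short sketch. This is not the organizers' actual preprocessing script: the function names, the use of Unicode category `P*` as the definition of "punctuation", whitespace-based word counting, and the order of the two steps are all assumptions made for illustration.

```python
import unicodedata

# The description states that samples with fewer than 3 words were removed.
MIN_WORDS = 3


def strip_punctuation(text: str) -> str:
    """Remove characters whose Unicode category starts with 'P' (punctuation).

    Assumption: "punctuation" means Unicode punctuation, which covers Arabic
    marks such as the Arabic comma and question mark. Arabic diacritics are
    combining marks (category 'Mn'), so they are left untouched.
    """
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse any extra whitespace left behind by the removed characters.
    return " ".join(kept.split())


def clean_transcripts(transcripts):
    """Strip punctuation first, then keep transcripts with at least MIN_WORDS words.

    Assumption: words are whitespace-separated tokens, counted after stripping.
    """
    cleaned = (strip_punctuation(t) for t in transcripts)
    return [t for t in cleaned if len(t.split()) >= MIN_WORDS]


if __name__ == "__main__":
    samples = ["مرحبا، كيف حالك؟", "نعم.", "hello, world", "one two three!"]
    for line in clean_transcripts(samples):
        print(line)
```

Only the first and last sample survive here, since the other two have fewer than three words once punctuation is gone. Matching on Unicode categories rather than a hand-written character list keeps the diacritics intact, which matters for the diacritized subsets.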

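For choosing between the diacritized and undiacritized subsets, a small tally of the train counts from the table may help. The sketch below only re-adds the numbers listed in the description; the `SUBSETS` dictionary and the `train_total` helper are hypothetical names, and the counts are the per-dataset figures as listed, not the 57K figure reported after filtering.

```python
# Train counts and diacritization flags copied from the table in the description.
# These are the listed per-dataset counts, not the post-filtering totals.
SUBSETS = {
    "MDASPC": (60677, True),
    "TunSwitch": (5212, True),
    "ClArTTS": (9500, True),
    "ArVoice": (2507, True),
    "ArzEn": (3344, False),
    "Mixat": (3721, False),
}


def train_total(diacritized=None):
    """Sum train counts; pass True/False to restrict by diacritization flag."""
    return sum(
        count
        for count, is_diacritized in SUBSETS.values()
        if diacritized is None or is_diacritized == diacritized
    )


if __name__ == "__main__":
    print(train_total())                   # 84961
    print(train_total(diacritized=True))   # 77896
    print(train_total(diacritized=False))  # 7065
```

By these listed counts, the two undiacritized subsets (ArzEn and Mixat) make up about 7K of the roughly 85K training sentences, so relying only on the diacritized subsets keeps most of the data.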
Provider:
maas
Created:
2025-06-12