ADI-20
arXiv · Updated 2025-11-13 · Indexed 2025-11-14
Download link:
https://github.com/elyadata/ADI-20
Resource overview:
The ADI-20 dataset, built by teams at Avignon Université and Elyadata, is a speech corpus for Arabic dialect identification that extends ADI-17. It covers the dialects of all Arabic-speaking countries and explicitly includes Modern Standard Arabic (MSA). The dataset totals approximately 3556 hours across 19 country-level dialects plus MSA, with at least 53 hours of training data per dialect. The speech comes mainly from YouTube programs, broadcast news, and public corpora such as TunSwitch. Segments longer than 30 seconds were split, and segments shorter than 3 seconds were removed. Starting from the imbalanced ADI-17, the authors added dialects it lacked, such as those of Tunisia and Bahrain, and supplemented low-resource dialects such as those of Sudan and Jordan. They built subsets of different sizes, including ADI-20-53h and ADI-20-full, to study systematically how training data size and model complexity affect dialect identification performance. The corresponding data manifests and high-performing baseline models are publicly released to support research on, and reproduction of, Arabic dialect identification, speech recognition, and downstream tasks such as sentiment analysis and text normalization. For more information and data access, see the project repository: github.com/elyadata/ADI-20.
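The length rule described above (split anything longer than 30 seconds, drop anything shorter than 3 seconds) can be illustrated with a minimal Python sketch. The description does not say how the authors chose split points, so the fixed-window splitting and the function name `plan_segments` are assumptions made only for illustration.

```python
from typing import List, Tuple


def plan_segments(
    duration_s: float,
    max_len_s: float = 30.0,
    min_len_s: float = 3.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times covering a recording of `duration_s` seconds.

    Assumption: recordings are cut into consecutive windows of at most
    `max_len_s` seconds. Any window shorter than `min_len_s` is dropped.
    The actual ADI-20 splitting strategy is not described in the summary.
    """
    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        # Keep the window only if it meets the minimum length.
        if end - start >= min_len_s:
            segments.append((start, end))
        start = end
    return segments
```

Under these assumptions, a 62-second recording yields two 30-second segments, and the 2-second remainder is discarded.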
Created:
2025-11-13
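Summary of original information
Arabic Dialect Identification (ADI) dataset overview
Basic dataset information
- Dataset name: Arabic Dialect Identification (ADI)
- Datasets included: ADI-17 and ADI-20, together with their subsets
- Data content: speech segments in Arabic dialects
Dataset acquisition and preparation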
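- Obtaining audio files: for licensing reasons, audio files are not distributed directly; download them yourself using the YouTube video IDs
- Annotation: speech segmentation and labels are already provided in the CSV manifest files
- Required data format: resample all files to mono 16 kHz WAV
- Manifest download: https://elyadata-my.sharepoint.com/:f:/p/haroun_elleuch/ErGuqCu8uXBBu0dSQu_WwmsBxwdPWoQyWfHQ67H7xav2uw?e=nk559T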
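The resampling step in the list above can be sketched as follows. This is a minimal example that assumes the `ffmpeg` command-line tool is installed and that the audio has already been downloaded. The helper names `build_ffmpeg_cmd` and `convert_to_wav` are hypothetical and do not come from the ADI-20 repository, and the optional start/end trimming is likewise an assumption about how manifest segments might be applied.

```python
import subprocess
from typing import List, Optional


def build_ffmpeg_cmd(
    src: str,
    dst: str,
    start_s: Optional[float] = None,
    end_s: Optional[float] = None,
) -> List[str]:
    """Build an ffmpeg command that writes `src` as mono 16 kHz audio to `dst`.

    Hypothetical helper: `start_s` and `end_s` optionally trim the output
    to one segment (times in seconds).
    """
    cmd = ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000"]
    if start_s is not None:
        cmd += ["-ss", str(start_s)]  # segment start, in seconds
    if end_s is not None:
        cmd += ["-to", str(end_s)]  # segment end, in seconds
    cmd.append(dst)  # a .wav extension selects the WAV container
    return cmd


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

For example, `convert_to_wav("video.m4a", "video.wav")` would write a mono 16 kHz WAV file next to the source, where both file names are placeholders.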
Dataset uses
- Main task: Arabic dialect identification (ADI)
- Method: model training and evaluation with Whisper and ECAPA-TDNN
Pretrained models
- Best ADI-20 model: https://huggingface.co/Elyadata/ADI-whisper-ADI20
- Best ADI-17 model: https://huggingface.co/Elyadata/ADI-whisper-ADI17
Citation
BibTeX:
@inproceedings{elleuch2025adi20,
  author    = {Haroun Elleuch and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
  title     = {ADI‑20: Arabic Dialect Identification Dataset and Models},
  booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech)},
  year      = {2025},
  address   = {Rotterdam Ahoy Convention Centre, Rotterdam, The Netherlands},
  month     = {August},
  days      = {17‑21},
  note      = {To appear}
}
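Collected summary
Dataset introduction

Background and challenges
Background overview
ADI-20 is a dataset for Arabic dialect identification that contains speech segments and their corresponding YouTube video IDs. It is used together with ADI-17 to train and evaluate dialect classification models, and is reported to have taken first place in the NADI 2025 challenge.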
The content above was collected and summarized by 遇见数据集.



