LordTenson/arabic_eou_sada_dataset
Source: Hugging Face. Updated: 2025-12-12. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/LordTenson/arabic_eou_sada_dataset
Description:
The Arabic EOU SADA Dataset contains 414,053 conversational Arabic utterances annotated for End-of-Utterance (EOU) detection, with a strong focus on natural Saudi dialects (خليجي / نجدي / حجازي, i.e. Gulf, Najdi, and Hijazi). The task is binary classification: label=1 marks the end of the speaker's turn (EOU), and label=0 means the speaker will continue. The dataset has four columns: text (Arabic transcription), label (0 or 1), silence_after_seconds (pause duration after the segment), and split (train/validation/test). The train split holds ~331k samples and the validation and test splits ~41k each, with EOU=1 making up ~75.1%-75.2% of samples.
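As a rough illustration of the four-column schema described above, the following Python sketch groups rows by their split value and computes the share of EOU=1 labels. The sample rows, their texts, and their pause values are invented for illustration and are not taken from the dataset; the helper names group_by_split and eou_ratio are likewise hypothetical. The commented-out loading line assumes the Hugging Face `datasets` library and the repository id from the download link, and has not been verified against the actual repository layout.

```python
from collections import defaultdict

# Assumption: with the Hugging Face `datasets` library installed, the data
# could likely be fetched along these lines (not verified here):
#   from datasets import load_dataset
#   ds = load_dataset("LordTenson/arabic_eou_sada_dataset")

# Hypothetical sample rows that follow the four documented columns.
# Texts and pause durations are made up for illustration only.
SAMPLE_ROWS = [
    {"text": "وش رايك نطلع بكرة", "label": 1, "silence_after_seconds": 1.2, "split": "train"},
    {"text": "انا كنت افكر اني", "label": 0, "silence_after_seconds": 0.3, "split": "train"},
    {"text": "ايه تمام", "label": 1, "silence_after_seconds": 0.9, "split": "validation"},
    {"text": "طيب بس", "label": 0, "silence_after_seconds": 0.2, "split": "test"},
]


def group_by_split(rows):
    """Group rows into lists keyed by the value of their `split` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["split"]].append(row)
    return dict(groups)


def eou_ratio(rows):
    """Return the fraction of rows with label == 1 (0.0 for an empty list)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["label"] == 1) / len(rows)


if __name__ == "__main__":
    for name, rows in group_by_split(SAMPLE_ROWS).items():
        print(f"{name}: {len(rows)} rows, EOU=1 ratio {eou_ratio(rows):.2f}")
```

On the real data, the same ratio computed per split would be expected to land near the ~75% figure quoted in the description.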
Provided by:
LordTenson



