
salmamohammedhamed22/arabic-eou-dataset

Source: Hugging Face. Updated 2025-12-12; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/salmamohammedhamed22/arabic-eou-dataset
Description:

This dataset contains conversational Arabic utterances labeled for End-of-Utterance (EOU) detection. The goal is to train models that predict, from transcribed text alone, whether a speaker has finished speaking, which enables real-time turn-taking in voice agents. The dataset is tailored to Saudi dialect Arabic but also includes general conversational Arabic.

Positive samples (label = 1) are full utterances that represent completed turns. Negative samples (label = 0) are incomplete prefixes generated from each utterance to simulate non-final turns. The dataset is in JSON format, and each record contains a text field and a label field. The text is either a complete utterance or an utterance fragment with context; label 1 marks the end of an utterance and label 0 marks an unfinished one.

The generation process consists of preprocessing, sliding-window context generation, positive and negative sample generation, balancing, and a final shuffle. The dataset is intended primarily for training EOU detection models in real-time dialogue systems; it is not suitable for speech recognition training, language modeling, or speaker diarization.
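The generation steps named in the description (sliding-window context, positive and negative samples, balancing, shuffling) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the dataset author's actual pipeline: the function name `build_eou_examples`, the `context_size` parameter, word-level prefixes, and joining context with spaces are all hypothetical choices. Only the `text` and `label` field names and the meaning of labels 1 and 0 come from the description.

```python
import random


def build_eou_examples(utterances, context_size=2, seed=0):
    """Build balanced, shuffled EOU records from an ordered list of utterances.

    Hypothetical sketch: the real preprocessing and windowing rules are not
    documented in the description above.
    """
    rng = random.Random(seed)
    positives, negatives = [], []

    for i, utterance in enumerate(utterances):
        words = utterance.split()
        if not words:
            continue  # stand-in for preprocessing: skip empty utterances

        # Sliding-window context: up to `context_size` preceding utterances.
        window = utterances[max(0, i - context_size):i]
        context = [u.strip() for u in window if u.strip()]

        # Positive sample: context plus the complete utterance.
        full_text = " ".join(context + [" ".join(words)])
        positives.append({"text": full_text, "label": 1})

        # Negative samples: context plus every proper word-level prefix.
        for cut in range(1, len(words)):
            prefix_text = " ".join(context + [" ".join(words[:cut])])
            negatives.append({"text": prefix_text, "label": 0})

    # Balance the classes by downsampling the larger one, then shuffle.
    n = min(len(positives), len(negatives))
    examples = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(examples)
    return examples
```

Calling `build_eou_examples` on a list of transcribed turns returns a list of `{"text": ..., "label": ...}` dictionaries that can be written out with `json.dump(..., ensure_ascii=False)` to keep the Arabic text readable. Downsampling is only one way to balance; the description does not say which method was used.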
Provided by:
salmamohammedhamed22