
Yahia-123/Arabic-end-of-utterance_dataset

Source: Hugging Face. Updated 2025-12-14. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Yahia-123/Arabic-end-of-utterance_dataset
Description:
This dataset contains realistic Arabic dialogue utterances in the Saudi dialect (العامية السعودية), annotated for End-of-Utterance (EOU) detection. It is designed to reflect natural conversational behavior, including incomplete turns, pauses, and continuation patterns commonly found in real customer-service and everyday dialogues.

The primary goal is to support research and development in end-of-utterance detection, conversational and dialogue-based NLP, speech and text segmentation pipelines, and Arabic dialect modeling, with an emphasis on Saudi Arabic.

The dataset is provided as a CSV file with two columns: text (a Saudi dialect utterance) and label (the End-of-Utterance label, 0 or 1). A label of 0 means not end of utterance: the sentence is incomplete and a continuation is expected. A label of 1 means end of utterance: the conversational turn is complete.

The data features realistic dialogue structure with natural pauses, incomplete or interrupted turns, and continuations across multiple utterances. It covers customer-service and everyday conversational contexts, in domains such as banking, payments, cards, complaints, and account inquiries.

Data sources include multi-dialect Arabic dialogue data, banking and financial customer-service conversations, Saudi dialect text samples, and YouTube podcast content. The dataset underwent rigorous filtering and cleaning to ensure language consistency and dialect focus. Annotation combines manual and semi-manual labeling to model real-world dialogue flow accurately, with special consideration for the requirements of speech-to-text (STT) systems.
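
As a rough illustration of the two-column CSV layout described above, the following Python sketch reads text/label pairs and counts how many utterances fall into each class. It is only a minimal example under stated assumptions: the helper name load_eou_rows is hypothetical, and the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Invented sample rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = """text,label
أبغى أسأل عن,0
أبغى أسأل عن رصيد حسابي,1
البطاقة حقتي,0
"""


def load_eou_rows(handle):
    """Read (text, label) pairs from a CSV with 'text' and 'label' columns.

    Hypothetical helper: it assumes the column names given in the
    description and rejects any label other than 0 or 1.
    """
    rows = []
    for record in csv.DictReader(handle):
        label = int(record["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((record["text"], label))
    return rows


rows = load_eou_rows(io.StringIO(SAMPLE_CSV))

# 0 = not end of utterance (continuation expected), 1 = end of utterance.
counts = Counter(label for _, label in rows)
print(counts[0], "incomplete,", counts[1], "complete")
```

To try this on the real file, the same function could be given a handle opened with open(path, encoding="utf-8", newline=""), where path is wherever the downloaded CSV was saved; the actual file name is not stated in the description.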
Provided by:
Yahia-123