Tnaot/large-dataset-audio-v2
Source: Hugging Face. Updated: 2025-12-11. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Tnaot/large-dataset-audio-v2
Description:
This dataset contains Khmer (Cambodian) speech recordings with detailed transcriptions and annotations. It consists of 9,285 samples with a total duration of 336.68 hours, averaging 130.54 seconds per sample. By word count, 74.4% of the transcribed words are Khmer and 25.4% are English. The primary data sources are YouTube (5,947 samples) and Telegram (2,626 samples). Each record includes fields such as the audio file path/bytes, the raw transcription text, the cleaned transcription text, the audio duration, and the speaker count, among others.
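The figures in the description can be cross-checked with a little arithmetic. The sketch below is not part of the dataset or its tooling; the constant and function names are made up for illustration, and the numbers are copied from the description. It recomputes the mean sample duration and derives how many samples and what share of words fall outside the categories the description names.

```python
# Figures copied from the dataset description; all names here are illustrative.
NUM_SAMPLES = 9_285
TOTAL_HOURS = 336.68
SOURCE_COUNTS = {"youtube": 5_947, "telegram": 2_626}
WORD_SHARE_PCT = {"khmer": 74.4, "english": 25.4}


def mean_duration_seconds(total_hours: float, num_samples: int) -> float:
    """Average sample length in seconds, given total hours and sample count."""
    return total_hours * 3600 / num_samples


def remainder(total: float, parts) -> float:
    """Whatever is left of `total` after subtracting the listed parts."""
    return total - sum(parts)


mean_sec = mean_duration_seconds(TOTAL_HOURS, NUM_SAMPLES)
other_samples = remainder(NUM_SAMPLES, SOURCE_COUNTS.values())
other_words_pct = round(remainder(100.0, WORD_SHARE_PCT.values()), 1)

print(f"mean duration: {mean_sec:.2f} s")            # 130.54
print(f"samples from other sources: {other_samples}")  # 712
print(f"words in other languages: {other_words_pct}%")  # 0.2
```

The recomputed mean matches the stated 130.54 seconds. The remaining 712 samples and 0.2% of words are not broken down in the description, so their sources and languages are left unspecified here.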
Provider:
Tnaot



