Tnaot/large-dataset-audio-v2

Name: Tnaot/large-dataset-audio-v2
Creator: Tnaot
Published: 2025-12-11 11:31:22
License: No description available

Source: Hugging Face
Updated: 2025-12-11
Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/Tnaot/large-dataset-audio-v2

Description:

This dataset contains Khmer (Cambodian) speech recordings with detailed transcriptions and annotations. It consists of 9,285 samples with a total duration of 336.68 hours, averaging 130.54 seconds per sample. Of the words in the dataset, 74.4% are Khmer and 25.4% are English. The primary data sources are YouTube (5,947 samples) and Telegram (2,626 samples). The dataset structure includes fields such as audio file path/bytes, raw transcription text, cleaned transcription text, audio duration, and speaker count, among others.
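As a quick sanity check, the figures quoted in the description can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative and do not correspond to field names in the dataset.

```python
# Figures quoted in the dataset description.
total_samples = 9285
total_hours = 336.68
youtube_samples = 5947
telegram_samples = 2626
khmer_word_pct = 74.4
english_word_pct = 25.4

# Mean clip length implied by the total duration and sample count.
mean_seconds = total_hours * 3600 / total_samples

# Samples not attributed to either of the two named sources.
other_samples = total_samples - youtube_samples - telegram_samples

# Share of words that are neither Khmer nor English.
other_word_pct = round(100 - khmer_word_pct - english_word_pct, 1)

print(f"mean clip length: {mean_seconds:.2f} s")
print(f"samples from other sources: {other_samples}")
print(f"words in other languages: {other_word_pct}%")
```

The implied mean matches the stated 130.54 seconds, and the two named sources account for 8,573 of the 9,285 samples, so the remaining 712 come from sources the description does not name.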

Provider:

Tnaot
