malaysia-ai/Malaysian-STT
Source: Hugging Face | Updated: 2025-08-19 | Indexed: 2025-08-09
Download link:
https://hf-mirror.com/datasets/malaysia-ai/Malaysian-STT
Description:
Malaysian-STT is a speech-to-text dataset suitable for training streaming LLM base models or encoder-decoder models such as Whisper. It provides training sets in multiple configurations. Preprocessing segmented the audio on silence, merged segments into 30-second clips, and excluded samples that forced alignment flagged with low scores or abnormal timestamps. The data covers dialects, IMDA, Malaysian context, Malaysia Parliament, science context, and synthetic data.
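The preprocessing described above (filtering on forced-alignment results, then merging silence-delimited segments into clips of about 30 seconds) can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the `Segment` type, the `min_score` threshold of 0.5, the greedy merge strategy, and the rule for what counts as an abnormal timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One silence-delimited segment (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    text: str
    score: float  # assumed forced-alignment confidence in [0, 1]


def filter_segments(segments: List[Segment], min_score: float = 0.5) -> List[Segment]:
    """Drop segments with a low alignment score or abnormal timestamps.

    "Abnormal" is assumed here to mean a non-positive duration or a start
    time that precedes the end of the previously kept segment.
    """
    kept: List[Segment] = []
    prev_end = 0.0
    for seg in segments:
        if seg.score < min_score:
            continue
        if seg.end <= seg.start or seg.start < prev_end:
            continue
        kept.append(seg)
        prev_end = seg.end
    return kept


def merge_segments(segments: List[Segment], max_duration: float = 30.0) -> List[List[Segment]]:
    """Greedily group consecutive segments into chunks spanning at most max_duration seconds."""
    chunks: List[List[Segment]] = []
    current: List[Segment] = []
    for seg in segments:
        # Close the current chunk if adding this segment would exceed the limit.
        if current and seg.end - current[0].start > max_duration:
            chunks.append(current)
            current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks
```

A segment that is itself longer than `max_duration` ends up alone in its own chunk in this sketch; a real pipeline would need to decide whether to split or discard such segments.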
Provider:
malaysia-ai



