TCST-UT: Tibetan-Chinese speech translation dataset of Ü-Tsang dialect
Science Data Bank, updated 2025-05-21, indexed 2026-04-23
Download link:
https://www.scidb.cn/detail?dataSetId=6178f79fcd7649669ac7b880f54b55e6
Description:
The TCST-UT dataset contains 58,767 samples totaling 72.08 hours of audio from 147 speakers. Each sample is a triplet: Tibetan speech, the corresponding Tibetan text, and its Chinese translation. The Tibetan speech comes from the M2ASR Tibetan dialect speech recognition dataset, published on the m2sr.cslt.org website, whose audio files are available by email request. This dataset therefore does not include the audio files themselves, only the audio path of each sample, where the path refers to the location within that public dataset. The audio path, Tibetan text, and Chinese translation of every sample are stored in output.json (22 MB). The file places each sample in a dictionary with the following format:
Abbreviation of Name - Audio Number: {"audio": audio file path, "text": {"Tibetan": Tibetan text corresponding to the audio file, "Chinese": Chinese text corresponding to the audio file}}
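The record format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes output.json is UTF-8 and holds a single top-level JSON object mapping each sample key to its entry. The names Sample, iter_samples, and load_samples are hypothetical helpers introduced here for illustration.

```python
import json
from typing import Dict, Iterator, List, NamedTuple


class Sample(NamedTuple):
    sample_id: str   # key of the form "<name abbreviation>-<audio number>"
    audio_path: str  # path inside the M2ASR release, not a local file
    tibetan: str     # Tibetan transcript of the audio
    chinese: str     # Chinese translation of the transcript


def iter_samples(data: Dict[str, dict]) -> Iterator[Sample]:
    """Yield one Sample per top-level entry of the parsed output.json."""
    for sample_id, entry in data.items():
        text = entry["text"]
        yield Sample(sample_id, entry["audio"], text["Tibetan"], text["Chinese"])


def load_samples(path: str = "output.json") -> List[Sample]:
    # Assumption: the file is UTF-8 encoded and contains one JSON object.
    with open(path, encoding="utf-8") as f:
        return list(iter_samples(json.load(f)))
```

Because the dataset ships only audio paths, each audio_path would still need to be resolved against a locally obtained copy of the M2ASR audio before the speech can be loaded.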
Provided by:
Dorje Peng Mao; China University of Political Science and Law; Minzu University of China; li xin
Created:
2024-12-18



