BSTC(Baidu Speech Translation Corpus)

Name: BSTC(Baidu Speech Translation Corpus)
Creator: OpenDataLab
Published: 2026-05-24 10:30:33
License: Not specified

OpenDataLab. Updated: 2026-05-24. Indexed: 2024-05-09.

Download link:

https://opendatalab.org.cn/OpenDataLab/BSTC

Description:

BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. Version 1.0 contains 50 hours of real speeches and has three components: audio files, transcripts, and translations. The corpus can be used to build automatic simultaneous interpretation systems. It was collected from Mandarin Chinese talks and reports covering science, technology, culture, economics, and other domains. The utterances in these talks and reports were carefully transcribed into Chinese text and then translated into English text. Sentence boundaries are determined by the English text rather than the Chinese text, following the practice of earlier related corpora (TED and the translation-augmented LibriSpeech corpus).
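The three-part structure described above (audio, Chinese transcript, English translation, segmented by English sentence) could be modeled roughly as in the sketch below. This listing does not document the actual file layout, so the `Segment` class, its field names, the `pair_segments` helper, and the file name `talk_001.wav` are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One sentence-level unit. Field names are hypothetical, not BSTC's schema."""
    audio_path: str      # path to the audio file of the talk
    transcript_zh: str   # Chinese transcript text for this segment
    translation_en: str  # English sentence that defines the segment boundary


def pair_segments(audio_path, zh_chunks, en_sentences):
    """Build one Segment per English sentence, paired with its Chinese chunk."""
    if len(zh_chunks) != len(en_sentences):
        raise ValueError("expected one Chinese chunk per English sentence")
    return [Segment(audio_path, zh, en) for zh, en in zip(zh_chunks, en_sentences)]


# Made-up sample text, not taken from the corpus.
segments = pair_segments(
    "talk_001.wav",  # placeholder file name
    ["大家好。", "今天我们讨论语音翻译。"],
    ["Hello everyone.", "Today we discuss speech translation."],
)
```

Keying each segment to an English sentence mirrors the boundary rule stated in the description; a real loader would need to follow whatever format the downloaded files actually use.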

Provider:

OpenDataLab

Created:

2022-11-02

Summary

Dataset Introduction