BSTC (Baidu Speech Translation Corpus)
OpenDataLab | Updated: 2026-05-24 | Indexed: 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/BSTC
Description:
BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. Version 1.0 contains 50 hours of real speeches and has three components: audio files, transcripts, and translations. The corpus can be used to build automatic simultaneous interpretation systems.
The corpus was collected from Mandarin Chinese talks and reports covering science, technology, culture, economics, and other domains. The utterances in these talks and reports were carefully transcribed into Chinese text and then translated into English text. Sentence boundaries are determined by the English text rather than the Chinese text, which follows the practice of earlier related corpora (TED and the translation-augmented LibriSpeech corpus).
Provider:
OpenDataLab
Created:
2022-11-02
Collected summary
Dataset introduction

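Background and challenges
Background overview
BSTC is a large-scale automatic simultaneous interpretation dataset containing 50 hours of real speech audio, Chinese transcripts, and English translations. The material comes from Mandarin Chinese talks and reports and covers science, technology, and several other domains. The dataset is intended to support building automatic simultaneous interpretation systems. Sentence boundaries are determined from the English text, as in existing corpora such as TED. It was released by Baidu in 2021.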
The content above was collected and summarized by 遇见数据集.



