kudo-research/mustc-en-es-text-only

Source: Hugging Face. Updated 2022-10-22. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kudo-research/mustc-en-es-text-only
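Resource summary:
This dataset is the text-only English-Spanish translation portion selected from the MuST-C corpus. MuST-C is a multilingual speech translation corpus containing sentence-level transcriptions and translations automatically aligned from English TED Talks. The dataset can be used to train machine translation models.
Provider:
kudo-research
Summary of original information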

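Dataset overview

Dataset name

  • Name: must-c_en-es_text-only
  • Full name: kudo-research/mustc-en-es-text-only

Dataset description

  • Summary: This dataset is the text-only (English-Spanish) portion extracted from the MuST-C multilingual speech translation corpus.
  • Supported tasks: machine translation
  • Languages: English (en-US), Spanish (es-ES)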

Dataset structure

  • Example data instance:

    { "translation": { "en": "I'll tell you one quick story to illustrate what that's been like for me.", "es": "Les diré una rápida historia para ilustrar lo que ha sido para mí." } }

  • Data fields:

    • translation: an object with two key-value pairs, where each key is a language code and each value is the text in that language.
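
The field layout above can be illustrated with a minimal Python sketch. It uses a made-up record in the same shape rather than real corpus data, and the helper name `to_pair` is hypothetical. With the Hugging Face `datasets` library, records of this shape would presumably come from `load_dataset("kudo-research/mustc-en-es-text-only")`, but the sketch stays offline and only parses JSON.

```python
import json


def to_pair(record, src="en", tgt="es"):
    """Return (source_text, target_text) from one record.

    `to_pair` is a hypothetical helper; it assumes the record has a
    "translation" object keyed by language code, as described above.
    """
    translation = record["translation"]
    return translation[src], translation[tgt]


# Made-up record for illustration only (not taken from the corpus).
raw = '{"translation": {"en": "Hello.", "es": "Hola."}}'
record = json.loads(raw)

source, target = to_pair(record)
print(source, "->", target)
```

Swapping `src` and `tgt` yields the Spanish-to-English direction from the same record, which is convenient when one file feeds models for both directions.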

Dataset creation

  • Source data: TED Talks
  • License: Creative Commons Attribution-NonCommercial-NoDerivs 4.0.

Citation information

  • BibTeX citation:

    @article{CATTONI2021101155,
      title = {MuST-C: A multilingual corpus for end-to-end speech translation},
      journal = {Computer Speech & Language},
      volume = {66},
      pages = {101155},
      year = {2021},
      issn = {0885-2308},
      doi = {https://doi.org/10.1016/j.csl.2020.101155},
      url = {https://www.sciencedirect.com/science/article/pii/S0885230820300887},
      author = {Roldano Cattoni and Mattia Antonino {Di Gangi} and Luisa Bentivogli and Matteo Negri and Marco Turchi},
      keywords = {Spoken language translation, Multilingual corpus},
      abstract = {End-to-end spoken language translation (SLT) has recently gained popularity thanks to the advancement of sequence to sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field has to confront with the scarcity of publicly available corpora to train data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small and of limited language coverage. We contribute to fill this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C.}
    }
