Samuel y Audrey: Bilingual YouTube Transcript Corpus (ES/EN) - Conversational Travel NLP Dataset
Source: Figshare. Updated: 2026-02-24. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Samuel_y_Audrey_Bilingual_YouTube_Transcript_Corpus_ES_EN_-_Conversational_Travel_NLP_Dataset/31396515
Description:
This dataset provides a structured, bilingual parallel corpus of 643 creator-authored video transcripts from the "Samuel y Audrey" Spanish-language travel channel. Curated by the Samuel & Audrey Media Network, it delivers high-fidelity, spontaneous conversational dialogue in both Spanish (primary) and English (secondary).

Unlike formal news corpora or academic translations, this archive captures natural, on-camera dialogue about global travel, cultural immersion, and expat logistics. It is explicitly designed as a "ground truth" resource for training large language models (LLMs) on cross-lingual alignment, natural translation, and regional Spanish dialects, specifically Argentine and broader Latin American variations. Transcripts are distributed as paired .es.srt and .en.srt files, which makes the corpus a useful foundation for conversational AI and dialect tuning.
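As a rough illustration of how the paired .es.srt and .en.srt files might be loaded, the sketch below matches files by a shared filename stem and parses each one into timed cues. The directory layout, the stem-based naming convention, and the helper names `parse_srt` and `pair_transcripts` are assumptions made for this example; the record itself does not document the archive's internal structure.

```python
import re
from pathlib import Path

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})"
)


def parse_srt(text):
    """Return a list of (start, end, text) cues from SRT-formatted text."""
    cues = []
    normalized = text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the cue text.
                body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((match.group(1), match.group(2), body))
                break
    return cues


def pair_transcripts(root):
    """Map each shared filename stem to its (Spanish, English) paths.

    Assumes (hypothetically) that a pair is named <stem>.es.srt and
    <stem>.en.srt and that both files sit in the same directory.
    """
    pairs = {}
    for es_path in sorted(Path(root).rglob("*.es.srt")):
        stem = es_path.name[: -len(".es.srt")]
        en_path = es_path.with_name(stem + ".en.srt")
        if en_path.exists():
            pairs[stem] = (es_path, en_path)
    return pairs
```

Spanish and English cues are not guaranteed to line up one-to-one, so any downstream alignment would likely need to compare timestamps rather than cue indices.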
Created:
2026-02-24



