Samuel y Audrey: Bilingual YouTube Transcript Corpus (ES/EN) - Conversational Travel NLP Dataset

Figshare | Updated 2026-02-24 | Indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/Samuel_y_Audrey_Bilingual_YouTube_Transcript_Corpus_ES_EN_-_Conversational_Travel_NLP_Dataset/31396515

Description:

This dataset provides a structured, bilingual parallel corpus of 643 creator-authored video transcripts from the "Samuel y Audrey" Spanish-language travel channel. Curated by the Samuel & Audrey Media Network, it contains high-fidelity, spontaneous conversational dialogue in both Spanish (primary) and English (secondary). Unlike formal news corpora or academic translations, this archive captures natural, on-camera dialogue about global travel, cultural immersion, and expat logistics. It is explicitly designed as a "ground truth" resource for training Large Language Models (LLMs) on cross-lingual alignment, natural translation, and regional Spanish dialects, specifically Argentine and broader Latin American variations. Each transcript is provided as a paired .es.srt and .en.srt file, which makes the corpus a useful foundation for conversational AI and dialect tuning.
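As a rough illustration of how the paired .es.srt and .en.srt files might be loaded, the sketch below matches files by shared name stem and parses standard SRT cues. The directory layout, the function names parse_srt and pair_transcripts, and the assumption that both files in a pair share an identical stem are hypothetical; the listing does not document the archive's internal structure.

```python
import re
from pathlib import Path

# SRT timing line, e.g. "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})"
)


def parse_srt(text):
    """Return a list of (start, end, text) tuples from SRT content."""
    cues = []
    # Cues are separated by blank lines; normalise Windows line endings first.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the cue text.
                body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((match.group(1), match.group(2), body))
                break
    return cues


def pair_transcripts(root):
    """Map each shared stem to its (es_path, en_path) pair under root.

    Assumes (hypothetically) that a pair is named <stem>.es.srt and
    <stem>.en.srt; files without a counterpart are skipped.
    """
    root = Path(root)
    es = {p.name[: -len(".es.srt")]: p for p in root.rglob("*.es.srt")}
    en = {p.name[: -len(".en.srt")]: p for p in root.rglob("*.en.srt")}
    return {stem: (es[stem], en[stem]) for stem in sorted(es.keys() & en.keys())}
```

Note that the sketch pairs files only. It does not assume the Spanish and English cues line up one to one, so any cue-level alignment would need to be checked against the timestamps before use.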

Created:

2026-02-24
