
Samuel y Audrey: Bilingual YouTube Transcript Corpus (ES/EN) - Conversational Travel NLP Dataset

Source: Figshare. Updated 2026-02-24; indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Samuel_y_Audrey_Bilingual_YouTube_Transcript_Corpus_ES_EN_-_Conversational_Travel_NLP_Dataset/31396515
Description:
This dataset provides a structured, bilingual parallel corpus of 643 creator-authored video transcripts from the "Samuel y Audrey" Spanish-language travel channel. Curated by the Samuel & Audrey Media Network, it contains spontaneous conversational dialogue in both Spanish (primary) and English (secondary). Unlike formal news corpora or academic translations, this archive captures natural, on-camera dialogue about global travel, cultural immersion, and expat logistics. It is explicitly designed as a "Ground Truth" resource for training Large Language Models (LLMs) on cross-lingual alignment, natural translation, and regional Spanish dialects, specifically Argentine and broader Latin American variations. Each transcript is provided as a matched pair of .es.srt and .en.srt files, which makes the corpus a suitable foundation for conversational AI and dialect tuning.
Created:
2026-02-24
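
The description mentions paired .es.srt and .en.srt files but does not document the archive layout. The following is a minimal sketch, assuming each video's two files sit in the same folder and share a base name (for example `video.es.srt` and `video.en.srt`, a hypothetical naming); the helper names `parse_srt` and `pair_transcripts` are illustrative and not part of the dataset.

```python
import re
from pathlib import Path

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Return a list of (start, end, caption) tuples from SRT text."""
    cues = []
    normalized = text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is caption text.
                caption = " ".join(s.strip() for s in lines[i + 1:] if s.strip())
                cues.append((match.group(1), match.group(2), caption))
                break
    return cues


def pair_transcripts(root):
    """Yield (es_path, en_path) for every video that has both files.

    Assumes the hypothetical layout <name>.es.srt next to <name>.en.srt.
    """
    for es_path in sorted(Path(root).rglob("*.es.srt")):
        base = es_path.name[: -len(".es.srt")]
        en_path = es_path.with_name(base + ".en.srt")
        if en_path.exists():
            yield es_path, en_path
```

Cue counts and timings may differ between the two languages of a pair, so any sentence-level alignment would need a further step (for example, matching cues by overlapping timestamps) that this sketch does not attempt.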