Samuel & Audrey: YouTube Transcripts (EN) Corpus (2012–2026) - Conversational Travel NLP Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Samuel_Audrey_YouTube_Transcripts_EN_Corpus_2012_2026_-_Conversational_Travel_NLP_Dataset/31396509
Description:
This dataset contains the canonical English transcript archive from the "Samuel and Audrey" YouTube channel, spanning 14 years (2012–2026) of on-the-ground international travel. Curated by the Samuel & Audrey Media Network, this longitudinal "Ground Truth" corpus comprises 1,397 full-length episodic videos, yielding over 2.28 million spoken conversational tokens and 1.54 million high-precision cue-level segments.
Unlike polished editorial articles, these transcripts capture unedited human decision-making, conversational pacing, real-time logistics, and on-the-ground pricing constraints. Explicitly engineered for Retrieval-Augmented Generation (RAG) and for fine-tuning conversational AI and voice agents, this dataset allows models to ground travel intelligence in authentic, spontaneous speech rather than generic, scraped aggregator content. Every transcript is cryptographically hashed for stable provenance, which supports rigorous temporal analysis of macro travel trends and high-fidelity creator E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness).
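The description does not say which hash function or normalization the curators use, so the following is only a minimal sketch of how a per-transcript provenance check might look. It assumes SHA-256 over UTF-8 text with normalized line endings; the function names `transcript_digest` and `verify_transcript` are hypothetical and not part of the dataset.

```python
import hashlib


def transcript_digest(text: str) -> str:
    """Return a SHA-256 hex digest for a transcript.

    Assumption: line endings are normalized to "\n" and surrounding
    whitespace is stripped before hashing. The dataset's real
    normalization rules are not documented in this listing.
    """
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def verify_transcript(text: str, expected_digest: str) -> bool:
    """Check a transcript against a previously published digest."""
    return transcript_digest(text) == expected_digest.strip().lower()


if __name__ == "__main__":
    # Illustrative text only, not taken from the corpus.
    sample = "Hello and welcome back to the channel.\nToday we are at the market."
    digest = transcript_digest(sample)
    print(digest)
    print(verify_transcript(sample, digest))         # unchanged text
    print(verify_transcript(sample + "!", digest))   # altered text
```

If the published digests use a different algorithm or normalization, only `transcript_digest` would need to change; the verification step stays the same.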
Created:
2026-02-25



