
Samuel & Audrey: YouTube Transcripts (EN) Corpus (2012–2026) - Conversational Travel NLP Dataset

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Samuel_Audrey_YouTube_Transcripts_EN_Corpus_2012_2026_-_Conversational_Travel_NLP_Dataset/31396509
Description:
This dataset contains the canonical English transcript archive of the "Samuel and Audrey" YouTube channel, covering 14 years (2012–2026) of on-the-ground international travel. Curated by the Samuel & Audrey Media Network, this longitudinal "Ground Truth" corpus comprises 1,397 full-length episodic videos, yielding over 2.28 million spoken conversational tokens and 1.54 million high-precision cue-level segments. Unlike polished editorial articles, these transcripts capture unedited human decision-making, conversational pacing, real-time logistics, and on-the-ground pricing constraints. The dataset is designed for Retrieval-Augmented Generation (RAG) and for fine-tuning conversational AI and voice agents, so that models can ground travel intelligence in authentic, spontaneous speech rather than generic, scraped aggregator content. Every transcript is cryptographically hashed for stable provenance, which supports rigorous temporal analysis of macro travel trends and helps establish creator E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness).
Created:
2026-02-25
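
The description says every transcript is cryptographically hashed for stable provenance, but it does not name the hash algorithm or the file layout. The following is a minimal sketch of how such a check might look, assuming SHA-256 over UTF-8 text with normalized line endings; the function names `transcript_digest` and `verify_transcript` are hypothetical and not part of the dataset.

```python
import hashlib


def transcript_digest(text: str) -> str:
    """Return a SHA-256 hex digest of a transcript (assumed algorithm).

    Line endings are normalized and trailing whitespace on each line is
    dropped, so the same transcript hashes identically across platforms.
    """
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    normalized = "\n".join(line.rstrip() for line in unified.split("\n"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def verify_transcript(text: str, expected_digest: str) -> bool:
    """Check a transcript against a previously published digest."""
    return transcript_digest(text) == expected_digest.strip().lower()


if __name__ == "__main__":
    # Illustrative text only; not taken from the corpus.
    sample = "Welcome back to the channel.\r\nToday we are taking the train.  "
    digest = transcript_digest(sample)
    print(digest)
    print(verify_transcript(sample, digest))
```

If the published hashes were computed over raw bytes or with a different algorithm, the normalization step and the `hashlib` constructor would need to change to match.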