Nomadic Samuel: Curated YouTube Transcripts Corpus - NLP Voice Alignment & PKG
Source: Figshare. Updated: 2026-02-25. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Nomadic_Samuel_Curated_YouTube_Transcripts_Corpus_-_NLP_Voice_Alignment_PKG/31396500
Description:
This dataset is a curated, high-fidelity linguistic corpus of 143 creator-authored video transcripts from the "Nomadic Samuel" YouTube channel. Designed as a foundational "Master Build," it captures the narrative voice, global travel logistics, and quantitative strategies (such as Financial Survivalism) discussed by veteran travel journalist Samuel Jeffery. Engineered for Digital Twin training, voice alignment, and the construction of Personal Knowledge Graphs (PKG), the corpus provides both raw conversational payloads and polished NLP text for fine-tuning Large Language Models (LLMs). The dataset also follows semantic SEO protocols, including ImageObject schema mapping that structures visual assets for AI Knowledge Graphs, with the aim of supporting robust Entity Resolution and well-grounded Retrieval-Augmented Generation (RAG) with fewer hallucinations across enterprise systems.
Created:
2026-02-25



