samuelandaudreymedianetwork/samuel-and-audrey-youtube-transcripts-en

Name: samuelandaudreymedianetwork/samuel-and-audrey-youtube-transcripts-en
Creator: samuelandaudreymedianetwork
Published: 2026-02-24 11:32:00
License: No description available

Hugging Face, updated 2026-02-24, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/samuelandaudreymedianetwork/samuel-and-audrey-youtube-transcripts-en

Resource description:

---
license: cc-by-nc-4.0
language:
- en
task_categories:
- text-generation
- question-answering
- summarization
- text-classification
tags:
- youtube
- youtube-transcripts
- travel
- food
- tourism
- vlogs
- long-form
- conversational
- voice-assistants
- rag
- eeat
- temporal
size_categories:
- 1M<n<10M
---

# 🎥 Samuel & Audrey — YouTube Transcripts (EN) Corpus (2012–2026)

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18665704.svg)](https://zenodo.org/records/18665704)
[![ORCID](https://img.shields.io/badge/ORCID-0009--0006--3748--9630-A6CE39.svg)](https://orcid.org/0009-0006-3748-9630)
[![ORCID](https://img.shields.io/badge/ORCID-0009--0007--2249--0441-A6CE39.svg)](https://orcid.org/0009-0007-2249-0441)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black.svg)](https://github.com/samuelandaudreymedianetwork/samuel-and-audrey-youtube-transcripts-en-ledger)
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

## 📌 Context & Provenance

This dataset contains the complete English transcript archive from the **“Samuel and Audrey - Travel and Food Videos”** YouTube channel. Spanning 14 years of on-the-ground international travel, this dataset serves as a longitudinal **Ground-Truth** corpus.

Unlike polished articles, these transcripts capture unedited human decision-making, conversational pacing, logistics, pricing mentions, food reactions, and real-world constraints, making it an ideal resource for building travel assistants that sound human, not brochure-y.

---

## 📊 Counts Snapshot

A massive repository of spoken-word travel intelligence anchored by high-signal **E-E-A-T** (Experience, Expertise, Authoritativeness, and Trustworthiness).

| Metric | Count | Description |
| :--- | :--- | :--- |
| **Total Transcripts** | `1,397` | Full-length episodic videos. |
| **Total Words** | `2,288,859` | Spoken conversational tokens. |
| **Cue-Level Segments** | `1.54 Million` | High-precision segmented rows for RAG indexing. |
| **Time Span** | `14 Years` | 2012-09-16 to 2026-02-03. |

---

## 🚀 Why Use This Dataset?

* **Conversational Ground Truth:** Captures real speech ("Should we take the bus?", "How much is this?") and uncertainty that structured writing edits out.
* **Longitudinal Signal:** A single consistent channel voice over 14 years enables temporal analysis of cost mentions, travel trends, and global infrastructure changes.
* **Provenance + Traceability:** Every transcript is cryptographically hashed and matched directly to a YouTube `video_id` and canonical `url` for citation and source inspection.

---

## 📂 Canonical Files & Architecture

This release includes files optimized for LLM context windows, streaming, and RAG ingestion:

* `samuel-and-audrey-youtube-transcripts-en_hf.jsonl.gz` **(Recommended for Full Transcripts)**
* `samuel-and-audrey-youtube-transcripts-en_hf.jsonl`
* `samuel-and-audrey-youtube-transcripts-en_hf.csv`
* `samuel-and-audrey-youtube-transcripts-en_segments_hf.jsonl.gz` **(Recommended for RAG: cue-level segments for high-precision retrieval)**

### Data Schema Overview

*(Note: For a complete structural breakdown, please refer to `DATA_DICTIONARY.md`)*

**Full Transcripts Core Fields:**

* `id`: Stable transcript identifier
* `content_hash`: SHA-1 hash of transcript `text` (deduplication/auditing)
* `video_id` / `url`: Canonical YouTube identifiers
* `published_at` / `video_date`: Temporal metadata
* `title` / `youtube_title`: Content identifiers
* `view_count`: Views at export time
* `tags_list`: Array of semantic tags
* `text` / `text_with_breaks` / `srt`: The transcript payload in various unrolled formats

**Segments Core Fields (`_segments_hf`):**

* `segment_id` / `transcript_id`: Relational mapping keys
* `start_ms` / `end_ms`: Cue timestamps in milliseconds
* `text`: Isolated cue text

---

## 🎯 Intended Use

This dataset is specifically engineered for:

* Travel-domain **Retrieval-Augmented Generation (RAG)** grounded in real speech.
* Fine-tuning **conversational travel assistants** and **voice agents**.
* Long-form summarization and dialogue-style compression.
* Temporal analysis of price/cost mentions and macro travel trends.
* Entity extraction (places, transport, food, attractions).
* Evaluation of grounding and hallucination resistance in LLMs.

---

## 🔍 Data Preview

<details>
<summary>Click to view a sample raw transcript block</summary>

```text
index: 1
transcript_id: 79bc53819f2f1ff2
content_hash: 8dae51261291157676fa604d14fdec2181eaf747
video_id: g7-wj8jaF0Q
url: https://www.youtube.com/watch?v=g7-wj8jaF0Q
published_at: 2012-09-16T00:31:08Z
video_date: 2012-09-16
title: Exploring Sindorim, Guro & Gocheok Dong in Seoul, Korea
view_count: 11215
text: Today we are doing the Seoul subway challenge, so we've been assigned line two and the idea is to explore as many stops as possible. We're going back to an area that used to be a part of my old stomping grounds - Sindorim...
```

</details>
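
The schema above describes `content_hash` as a SHA-1 hash of the transcript `text`. The sketch below shows one way such a check might be run over the gzip-compressed JSONL file. It rests on assumptions the card does not confirm: that the hash is the lowercase hex digest of the UTF-8 encoded `text` field, and that each line of the file is one JSON object carrying `id`, `text`, and `content_hash`. The helper names (`iter_jsonl_gz`, `hash_matches`, `find_mismatches`) are hypothetical and not part of the dataset's tooling.

```python
import gzip
import hashlib
import json


def iter_jsonl_gz(source):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file.

    `source` may be a path or a binary file-like object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def hash_matches(record):
    """Return True if `content_hash` equals the SHA-1 hex digest of `text`.

    Assumption: the digest is computed over the UTF-8 bytes of `text`.
    """
    digest = hashlib.sha1(record["text"].encode("utf-8")).hexdigest()
    return digest == record["content_hash"]


def find_mismatches(source):
    """Return the `id` values of records whose hash does not match."""
    return [
        record.get("id")
        for record in iter_jsonl_gz(source)
        if not hash_matches(record)
    ]
```

Calling `find_mismatches("samuel-and-audrey-youtube-transcripts-en_hf.jsonl.gz")` on a local copy streams the file one record at a time rather than loading it whole. A non-empty result could mean a corrupted download, or simply that the hashing convention differs from the one assumed here; `DATA_DICTIONARY.md` is the place to confirm.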

Provider:

samuelandaudreymedianetwork
