lex-fridman-podcast-transcript-audio

Name: lex-fridman-podcast-transcript-audio
Creator: maas
Published: 2025-12-02 10:57:58
License: No description available

ModelScope community. Updated: 2025-12-02. Indexed: 2025-03-08.

Download link:

https://modelscope.cn/datasets/pengzhendong/lex-fridman-podcast-transcript-audio

Resource description:

# Dataset Card for "lexFridmanPodcast-transcript-audio"

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Whispering-GPT](https://github.com/matallanas/whisper_gpt_pipeline)
- **Repository:** [whisper_gpt_pipeline](https://github.com/matallanas/whisper_gpt_pipeline)
- **Paper:** [Whisper](https://cdn.openai.com/papers/whisper.pdf) and [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
- **Point of Contact:** [Whispering-GPT organization](https://huggingface.co/Whispering-GPT)

### Dataset Summary

This dataset was created by applying Whisper to the videos of the YouTube channel [Lex Fridman Podcast](https://www.youtube.com/watch?v=FhfmGM6hswI&list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4&ab_channel=LexFridman). The transcripts were generated with the medium-size Whisper model.

### Languages

- **Language**: English

## Dataset Structure

The dataset contains the full transcript and the audio of each video of the Lex Fridman Podcast.

### Data Fields

Each record consists of the following fields:

- **id**: ID of the YouTube video.
- **channel**: Name of the channel.
- **channel_id**: ID of the YouTube channel.
- **title**: Title given to the video.
- **categories**: Category of the video.
- **description**: Description added by the author.
- **text**: Whole transcript of the video.
- **segments**: A list of timed transcript segments, each with:
  - **start**: When the transcribed segment starts.
  - **end**: When the transcribed segment ends.
  - **text**: The text of the segment.
- **audio**: The audio extracted from the video, in Ogg format.

### Data Splits

- Train split.

## Dataset Creation

### Source Data

The transcriptions are taken from the videos of the [Lex Fridman Podcast](https://www.youtube.com/watch?v=FhfmGM6hswI&list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4&ab_channel=LexFridman).

### Contributions

Thanks to the [Whispering-GPT](https://huggingface.co/Whispering-GPT) organization for adding this dataset.
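The record layout listed under Data Fields can be illustrated with a small Python sketch. It rests on several assumptions that the card does not confirm: that `segments` arrives in memory as a list of dictionaries, that `start` and `end` are offsets in seconds, and that `categories` is a list. The `sample_record` values are invented placeholders, and the helper names `format_timestamp`, `timestamped_lines`, and `segments_in_window` are hypothetical. The `audio` field is omitted to keep the example self-contained.

```python
from typing import Any, Dict, List

# Hypothetical record mirroring the fields named in the card.
# Every value here is a made-up placeholder, not real dataset content.
sample_record: Dict[str, Any] = {
    "id": "VIDEO_ID",
    "channel": "Lex Fridman",
    "channel_id": "CHANNEL_ID",
    "title": "Example episode",
    "categories": ["Science & Technology"],
    "description": "Example description.",
    "text": "Hello and welcome. Today we talk about learning.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "Hello and welcome."},
        {"start": 2.5, "end": 6.0, "text": "Today we talk about learning."},
    ],
}


def format_timestamp(seconds: float) -> str:
    """Render an offset as HH:MM:SS.mmm (assumes the offset is in seconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def timestamped_lines(record: Dict[str, Any]) -> List[str]:
    """Turn each segment into one '[start -> end] text' line."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in record["segments"]
    ]


def segments_in_window(
    record: Dict[str, Any], t0: float, t1: float
) -> List[Dict[str, Any]]:
    """Return the segments that overlap the half-open window [t0, t1)."""
    return [
        seg for seg in record["segments"]
        if seg["start"] < t1 and seg["end"] > t0
    ]


if __name__ == "__main__":
    for line in timestamped_lines(sample_record):
        print(line)
```

If the real records expose `segments` in a different shape (for example, one list per sub-field), the two helpers would need a small adapter before they apply.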

Provider:

maas

Created:

2025-03-06
