pardeep/youtube-vidoes-transcripts-hindi-english
Source: Hugging Face | Updated: 2024-01-20 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/pardeep/youtube-vidoes-transcripts-hindi-english
Resource description:
---
license: odc-by
---
**Context**
The dataset contains Hindi and English subtitles from well-known YouTube channels. It was created mainly with Hindi-language channels in mind, since the main goal is to use it to build LLMs for the Hindi language.
It includes data from channels in categories such as Information, Entertainment, Politics, Comedy, and News.
***Dataset Stats:***
- **58 channels**
- **103,042 total videos**
**Content**
- Video subtitles in Hindi and English
- Video metadata such as duration, number of comments, likes, counts, and published date
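
The card does not say how the duration field is stored. The YouTube Data API reports video length as an ISO 8601 duration string (for example `PT15M33S`), so if the dataset keeps that raw form (an assumption), a small converter makes the field usable for filtering or statistics. The function name below is hypothetical, and this is only a minimal sketch:

```python
import re

# Matches ISO 8601 durations of the form P[nD][T[nH][nM][nS]], e.g. "PT1H2M3S".
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(value: str) -> int:
    """Convert an ISO 8601 duration string to a whole number of seconds.

    Assumption: the input looks like the `contentDetails.duration` value
    returned by the YouTube Data API; the dataset itself may store
    durations differently.
    """
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration format: {value!r}")
    # Missing components (e.g. no hours) are treated as zero.
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


if __name__ == "__main__":
    print(iso8601_duration_to_seconds("PT15M33S"))
```

If the duration column already holds plain seconds, this step is unnecessary.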
**Acknowledgements**
The source of this dataset is YouTube. The following packages were used to generate this dataset:
- [youtube-transcript-api](https://pypi.org/project/youtube-transcript-api/)
- [google-api-python-client](https://pypi.org/project/google-api-python-client/)
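
Transcript tools such as youtube-transcript-api typically expose captions as timed segments, each with `text`, `start`, and `duration` fields. For language-model training it is often convenient to flatten those segments into one string. The sketch below assumes that list-of-dicts shape; the helper name and the sample data are hypothetical, and the dataset's own column layout may differ:

```python
from typing import Iterable, Mapping


def join_segments(segments: Iterable[Mapping[str, object]]) -> str:
    """Concatenate caption segments into one whitespace-normalized string."""
    parts = []
    for seg in segments:
        # Collapse newlines and repeated spaces inside each caption line.
        text = " ".join(str(seg.get("text", "")).split())
        if text:  # skip segments that are empty after normalization
            parts.append(text)
    return " ".join(parts)


# Hypothetical sample in the assumed text/start/duration shape.
sample = [
    {"text": "नमस्ते दोस्तों", "start": 0.0, "duration": 2.1},
    {"text": "welcome back\nto the channel", "start": 2.1, "duration": 3.4},
    {"text": "   ", "start": 5.5, "duration": 0.5},
]

if __name__ == "__main__":
    print(join_segments(sample))
```

Timing information is discarded here on purpose; keep `start` and `duration` if the downstream task needs alignment with the video.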
**Inspiration**
- Build LLMs for Hindi
- Fine-tune models on Hindi for tasks such as classification, summarization, and translation
Provider:
pardeep
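Summary of original information
Dataset overview
Context
The dataset contains Hindi and English subtitles from well-known YouTube channels. Its main purpose is to support building large language models (LLMs) for Hindi.
Data sources
The dataset covers channels in categories such as information, entertainment, politics, comedy, and news.
Dataset statistics
- Number of channels: 58
- Total videos: 103,042
Content
- Video subtitles: Hindi and English
- Video metadata: duration, number of comments, number of likes, published date, etc.
Dataset generation tools
- youtube-transcript-api
- google-api-python-client
Inspiration
- Build large language models for Hindi
- Fine-tune models on Hindi for tasks such as classification, summarization, and translation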



