ChamaraVishwajithRajapaksha/YouTube-Transcript-Question-Answer-Dataset-for-RAG-Evaluation-Polished

Name: ChamaraVishwajithRajapaksha/YouTube-Transcript-Question-Answer-Dataset-for-RAG-Evaluation-Polished
Creator: ChamaraVishwajithRajapaksha
Published: 2025-10-24 02:46:53
License: No description available

Source: Hugging Face. Updated 2025-10-24; indexed 2025-10-25.

Download link:

https://hf-mirror.com/datasets/ChamaraVishwajithRajapaksha/YouTube-Transcript-Question-Answer-Dataset-for-RAG-Evaluation-Polished

Description:

The YouTube Transcript Q&A Dataset is an automatically generated set of question-answer pairs built from the transcripts of selected YouTube videos with the Ragas Testset Generator framework. It is intended to support Retrieval-Augmented Generation (RAG) evaluation, QA model training, and semantic reasoning research. Construction proceeds in four steps: extracting the video transcripts, converting them into LangChain documents, building a Ragas Knowledge Graph, and generating diverse question-answer pairs from the transcript content with OpenAI GPT-4o and Ragas. Each entry contains the generated question, the corresponding answer grounded in the transcript context, the supporting paragraph or chunk, and the type of the question-answer pair. The dataset is stored in CSV format and can be loaded with the Hugging Face datasets library or pandas.
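Since the dataset ships as CSV with four fields per entry, reading it can be sketched with the Python standard library alone. This is a minimal sketch, not the author's loader: the listing does not give the real CSV header, so the column names below (question, answer, context, question_type) are assumptions that mirror the four fields described above and should be replaced with the actual header.

```python
import csv
import io
from collections import Counter

# Assumed column names: the listing does not show the real CSV header.
QUESTION_COL = "question"
ANSWER_COL = "answer"
CONTEXT_COL = "context"
TYPE_COL = "question_type"


def load_qa_pairs(csv_file):
    """Read Q&A rows from an open text file, skipping rows
    whose question or answer is empty."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row.get(QUESTION_COL) and row.get(ANSWER_COL)]


def count_by_type(rows):
    """Count how many Q&A pairs there are per question type."""
    return dict(Counter(row[TYPE_COL] for row in rows))


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file (made-up rows).
    sample = io.StringIO(
        "question,answer,context,question_type\n"
        "What is RAG?,Retrieval-augmented generation.,"
        "RAG combines retrieval with generation.,single_hop\n"
        "Who made it?,,Unknown,multi_hop\n"
    )
    rows = load_qa_pairs(sample)
    print(len(rows), count_by_type(rows))
```

For the downloaded file, open it with open(path, newline="", encoding="utf-8") and pass the handle to load_qa_pairs. The same file can also be read with pandas.read_csv, as the description suggests.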

Provided by:

ChamaraVishwajithRajapaksha
