Acquired Podcast Transcripts and RAG Evaluation

Name: Acquired Podcast Transcripts and RAG Evaluation
Creator: Kaggle
Published: 2024-05-31 00:00:00
License: No description available

Source: www.kaggle.com. Updated 2024-05-31; indexed 2025-01-16.

Download link:

https://www.kaggle.com/harrywang/acquired-podcast-transcripts-and-rag-evaluation


Description:

This dataset contains 200 Acquired Podcast Transcripts we collected from the official website (https://www.acquired.fm/) with metadata specified in `acquired_metadata.csv`. The 200 transcripts contain approximately 3.5 million words, equivalent to about 5,500 pages when formatted into a Word document.

We also developed a QA dataset for RAG evaluation in `acquired-qa-evaluation.csv`, which contains the following columns:

- **question**: The question posed for evaluation.
- **human_answer**: The answer provided by a human.
- **ai_answer_without_the_transcript**: The answer provided by an AI model without access to the transcript.
- **ai_answer_without_the_transcript_correctness**: The factual accuracy of the AI answer without the transcript verified by a human.
- **ai_answer_with_the_transcript**: The answer provided by an AI model with access to the transcript.
- **ai_answer_with_the_transcript_correctness**: The factual accuracy of the AI answer with the transcript verified by a human.
- **quality_rating_for_answer_with_transcript**: The quality of the AI answer rated by a human.
- **post_url**: The URL of the podcast episode related to the question.
- **file_name**: The name of the transcript file associated with the episode.

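
The two correctness columns described above lend themselves to a simple comparison: how often the AI answer was judged correct with and without the transcript. The following is a minimal sketch of that comparison, not the authors' own evaluation code. It uses the column names from the description, but the set of strings treated as "correct" (`TRUTHY`), the helper names, and the sample rows are assumptions made for illustration; the actual cell values in `acquired-qa-evaluation.csv` should be inspected before relying on it.

```python
import csv
import io

# Column names as listed in the dataset description.
WITHOUT_COL = "ai_answer_without_the_transcript_correctness"
WITH_COL = "ai_answer_with_the_transcript_correctness"

# Assumption: these strings mark an answer as correct.
# Adjust after inspecting the values in the real file.
TRUTHY = {"true", "1", "yes", "correct"}


def correctness_rate(rows, column):
    """Return the fraction of labelled rows whose `column` counts as correct."""
    values = [(row.get(column) or "").strip().lower() for row in rows]
    labelled = [v for v in values if v]  # skip blank cells
    if not labelled:
        return None
    return sum(v in TRUTHY for v in labelled) / len(labelled)


def compare_rates(csv_file):
    """Read a QA evaluation CSV and compare correctness with/without transcript."""
    rows = list(csv.DictReader(csv_file))
    return {
        "without_transcript": correctness_rate(rows, WITHOUT_COL),
        "with_transcript": correctness_rate(rows, WITH_COL),
    }


# Made-up sample rows for illustration only; these are not real dataset rows.
SAMPLE = """question,ai_answer_without_the_transcript_correctness,ai_answer_with_the_transcript_correctness
q1,FALSE,TRUE
q2,TRUE,TRUE
q3,FALSE,FALSE
q4,,TRUE
"""

if __name__ == "__main__":
    print(compare_rates(io.StringIO(SAMPLE)))
    # For the real file (the local path is an assumption):
    # with open("acquired-qa-evaluation.csv", newline="", encoding="utf-8") as f:
    #     print(compare_rates(f))
```

Blank cells are skipped rather than counted as incorrect, so each rate is computed only over rows that a human actually labelled. Whether that is the right treatment depends on how the real file encodes missing judgments.
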
The project was created and designed by me with the help of the following people:

- Rain Jiang: crawler development and data collection
- Yihong (Eric) Chen: data parsing, cleaning, and analysis

Here are the students from my Introduction to Generative AI course (Spring 2024) who contributed to creating the QA dataset:

- Priya Amara
- Saviour Adelwin Anyagri
- Ezgi Basaranlar
- Sara Baskaran
- Nimet Batan Altiyaprak
- Reed Bidgood
- Daniel Coleman
- James Dalton
- Chaitanya Dhullipala
- Yin Ding
- Aksel Dirkzwager
- Malek Elsayyid
- John Fabricatore
- J'Quoi George
- Ed Gorman
- Amanda Grosz
- Donald Harris
- Bryan Horsey
- David Kam
- Daria Klimkovskaia
- Mathieu Lippens
- Ruth McDuffie
- Ashish Mishra
- Achal Modi
- Jayaprakash Moses
- Naomi Nyarinda Okemwa
- Silvia Atelo Okwach
- Kardam Patel
- Pramila Paudyal
- Chris Pic
- Rajesh Rao
- Ronald Russian
- Summer Shaheed
- Rohan Swain
- Shriya Tandon
- Aniket Turaskar
- Upendar Vanavasam
- Andrea Young


Provider:

Kaggle
