
jaiw/lex_fridman_podcast_embeddings

Hugging Face, updated 2024-02-08, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/jaiw/lex_fridman_podcast_embeddings
Resource description:

---
license: apache-2.0
task_categories:
- question-answering
---

# Description

This dataset contains CSV files built from the Lex Fridman podcast transcripts provided by [Whispering-GPT](https://huggingface.co/datasets/Whispering-GPT/lex-fridman-podcast). I split the episode transcripts into parent and child chunks for use with RAG. The parent chunks are size 500 and the child chunks are size 50. The child chunks come with embeddings from OpenAI `text-embedding-3-small` at 1024 dimensions.

# Motivation

This was designed for use with [ParentDocumentRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever) or [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo.html). It should provide better retrievals for queries on specific details. Embedding all the chunks took a while, so hopefully this saves someone time. The CSV files are sized to allow uploading with the Supabase "Table Editor", because I didn't want to use postgres COPY.
Provided by:
jaiw
Summary of original information

Dataset description

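This dataset contains transcripts of the Lex Fridman podcast, provided by Whispering-GPT. The transcripts are split into parent chunks and child chunks for use with RAG. Parent chunks have size 500 and child chunks have size 50. The child chunks include embeddings generated with OpenAI's text-embedding-3-small model at 1024 dimensions.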
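The parent and child split described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual preprocessing: the source does not say whether the sizes 500 and 50 are counted in characters or tokens, so the sketch assumes characters, and the names `Chunk`, `split_fixed`, and `build_parent_child_chunks` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    chunk_id: int
    parent_id: Optional[int]  # None for parent chunks
    text: str


def split_fixed(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_parent_child_chunks(
    transcript: str,
    parent_size: int = 500,  # assumption: sizes are measured in characters
    child_size: int = 50,
) -> Tuple[List[Chunk], List[Chunk]]:
    """Cut a transcript into parent chunks, then cut each parent into children."""
    parents: List[Chunk] = []
    children: List[Chunk] = []
    for parent_id, parent_text in enumerate(split_fixed(transcript, parent_size)):
        parents.append(Chunk(parent_id, None, parent_text))
        for child_text in split_fixed(parent_text, child_size):
            # Each child remembers which parent it was cut from.
            children.append(Chunk(len(children), parent_id, child_text))
    return parents, children
```

Only the child chunks would then be embedded. The parent link is what lets a retriever match on a small chunk and return the larger surrounding context.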

Motivation

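The dataset is designed for use with ParentDocumentRetriever or LlamaIndex and aims to give better retrieval results for queries about specific details. Embedding all the chunks took a long time, so the dataset is meant to save users that effort. The CSV files are sized so that they can be uploaded through Supabase's "Table Editor", which avoids the need for postgres COPY.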
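The retrieval pattern this dataset targets (match the query against small child chunks, then return the larger parent chunk) can be sketched without any framework. This is a simplified stand-in for what ParentDocumentRetriever or LlamaIndex do, not their actual implementation. The data layout is an assumption, since the CSV column names are not given in the source, and the toy vectors in the usage note are 2-dimensional rather than 1024-dimensional.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_parents(
    query_embedding: Sequence[float],
    children: List[Tuple[int, Sequence[float]]],  # assumed layout: (parent_id, embedding)
    parents: Dict[int, str],                      # assumed layout: parent_id -> parent text
    top_k: int = 3,
) -> List[str]:
    """Rank child embeddings against the query and return their distinct parents."""
    ranked = sorted(
        children,
        key=lambda child: cosine_similarity(query_embedding, child[1]),
        reverse=True,
    )
    seen = set()
    results: List[str] = []
    for parent_id, _ in ranked[:top_k]:
        # Several top children may share a parent; return each parent once.
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results
```

For example, with `parents = {0: "P0", 1: "P1"}` and `children = [(0, [1.0, 0.0]), (0, [0.9, 0.1]), (1, [0.0, 1.0])]`, calling `retrieve_parents([1.0, 0.0], children, parents, top_k=2)` returns only `"P0"`, because both of the best-matching children belong to the same parent. In practice the query embedding would need to come from the same model and dimensionality as the stored child embeddings.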
