calmgoose/book-embeddings

Name: calmgoose/book-embeddings
Creator: calmgoose
Published: 2023-05-02 20:47:09
License: Apache-2.0

Source: Hugging Face. Updated 2023-05-02; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/calmgoose/book-embeddings

Resource description:

---
license: apache-2.0
task_categories:
- question-answering
- summarization
- conversational
- sentence-similarity
language:
- en
pretty_name: FAISS Vector Store of Embeddings for Books
tags:
- faiss
- langchain
- instructor embeddings
- vector stores
- books
- LLM
---

# Vector store of embeddings for books

- **"1984" by George Orwell**
- **"The Almanac of Naval Ravikant" by Eric Jorgenson**

This is a [faiss](https://github.com/facebookresearch/faiss) vector store created with [instructor embeddings](https://github.com/HKUNLP/instructor-embedding) using [LangChain](https://langchain.readthedocs.io/en/latest/modules/indexes/examples/embeddings.html#instructembeddings). Use it for similarity search, question answering or anything else that leverages embeddings! 😃

Creating these embeddings can take a while so here's a convenient, downloadable one 🤗

## How to use

1. Specify the book from one of the following:
   - `"1984"`
   - `"The Almanac of Naval Ravikant"`
2. Download data
3. Load to use with LangChain

```
pip install -qqq langchain InstructorEmbedding sentence_transformers faiss-cpu huggingface_hub
```

```python
import os
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.faiss import FAISS
from huggingface_hub import snapshot_download

# download the vectorstore for the book you want
BOOK = "1984"
cache_dir = f"{BOOK}_cache"
vectorstore = snapshot_download(
    repo_id="calmgoose/book-embeddings",
    repo_type="dataset",
    revision="main",
    allow_patterns=f"books/{BOOK}/*",  # to download only the one book
    cache_dir=cache_dir,
)

# get path to the `vectorstore` folder that you just downloaded
# we'll look inside the `cache_dir` for the folder we want
target_dir = BOOK

# Walk through the directory tree recursively
for root, dirs, files in os.walk(cache_dir):
    # Check if the target directory is in the list of directories
    if target_dir in dirs:
        # Get the full path of the target directory
        target_path = os.path.join(root, target_dir)

# load embeddings
# this is what was used to create embeddings for the book
embeddings = HuggingFaceInstructEmbeddings(
    embed_instruction="Represent the book passage for retrieval: ",
    query_instruction="Represent the question for retrieving supporting texts from the book passage: "
)

# load vector store to use with langchain
docsearch = FAISS.load_local(folder_path=target_path, embeddings=embeddings)

# similarity search
question = "Who is big brother?"
search = docsearch.similarity_search(question, k=4)

for item in search:
    print(item.page_content)
    print(f"From page: {item.metadata['page']}")
    print("---")
```
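The folder lookup in the example above leaves `target_path` undefined when no matching directory exists, which only surfaces later as a `NameError`. Here's a minimal sketch of a stricter lookup that keeps the same `os.walk` approach. The helper name `find_book_dir` is hypothetical and not part of the dataset card.

```python
import os


def find_book_dir(cache_dir: str, book: str) -> str:
    """Return the first directory named `book` found under `cache_dir`.

    Hypothetical helper: raises FileNotFoundError instead of leaving
    the path undefined when nothing matches.
    """
    for root, dirs, _files in os.walk(cache_dir):
        if book in dirs:
            return os.path.join(root, book)
    raise FileNotFoundError(f"no directory named {book!r} under {cache_dir!r}")
```

With this in place, `target_path = find_book_dir(cache_dir, BOOK)` could stand in for the loop, failing early if the download did not produce the expected folder.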

Provider:

calmgoose

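Summary of original information

Dataset overview

Basic information

License: Apache-2.0
Task categories:
- Question answering
- Summarization
- Conversational
- Sentence similarity
Language: English
Name: FAISS Vector Store of Embeddings for Books
Tags:
- faiss
- langchain
- instructor embeddings
- vector stores
- books
- LLM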

Data contents

Books included:
- "1984" by George Orwell
- "The Almanac of Naval Ravikant" by Eric Jorgenson

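Usage

Select a book: specify one of the following:
- "1984"
- "The Almanac of Naval Ravikant"
Download the data: use the provided code to download the vector store for the chosen book.
Load and use: load the vector store with LangChain for similarity search, question answering, and similar tasks.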
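The book selection step can be sketched as a small check that runs before any download. This is an illustrative helper, not part of the dataset card: the names `AVAILABLE_BOOKS` and `allow_pattern_for` are assumptions, and the `books/<title>/*` layout is taken from the `allow_patterns` argument in the card's example.

```python
# The two titles listed on the dataset card.
AVAILABLE_BOOKS = ("1984", "The Almanac of Naval Ravikant")


def allow_pattern_for(book: str) -> str:
    """Build the `allow_patterns` glob for one book (hypothetical helper)."""
    if book not in AVAILABLE_BOOKS:
        raise ValueError(f"unknown book {book!r}; choose one of {AVAILABLE_BOOKS}")
    return f"books/{book}/*"
```

Passing the result as `allow_patterns` to `snapshot_download` would restrict the download to that one book, and a misspelled title fails immediately rather than silently downloading nothing.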

Example code

```python
import os
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.faiss import FAISS
from huggingface_hub import snapshot_download

# Download the vector store
BOOK = "1984"
cache_dir = f"{BOOK}_cache"
vectorstore = snapshot_download(
    repo_id="calmgoose/book-embeddings",
    repo_type="dataset",
    revision="main",
    allow_patterns=f"books/{BOOK}/*",
    cache_dir=cache_dir,
)

# Locate the downloaded folder for the book inside `cache_dir`
for root, dirs, files in os.walk(cache_dir):
    if BOOK in dirs:
        target_path = os.path.join(root, BOOK)

# Load the embeddings
embeddings = HuggingFaceInstructEmbeddings(
    embed_instruction="Represent the book passage for retrieval: ",
    query_instruction="Represent the question for retrieving supporting texts from the book passage: "
)

# Load the vector store
docsearch = FAISS.load_local(folder_path=target_path, embeddings=embeddings)

# Similarity search
question = "Who is big brother?"
search = docsearch.similarity_search(question, k=4)

for item in search:
    print(item.page_content)
    print(f"From page: {item.metadata['page']}")
    print("---")
```
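To make concrete what a call like `similarity_search(question, k=4)` does conceptually, here's a toy sketch of top-k nearest-neighbour lookup in plain Python. It assumes squared Euclidean distance and made-up 2-D vectors. The real store uses a FAISS index over instructor embeddings, so `top_k` and the data below are illustrative only.

```python
import heapq


def top_k(query, vectors, k=4):
    """Return indices of the k vectors closest to `query`, nearest first."""
    def sq_dist(v):
        # Squared Euclidean distance between the query and one stored vector.
        return sum((a - b) ** 2 for a, b in zip(query, v))
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: sq_dist(vectors[i]))


passages = ["passage A", "passage B", "passage C"]
vectors = [(0.9, 0.1), (1.0, 0.0), (0.0, 1.0)]  # made-up embeddings
for i in top_k((1.0, 0.05), vectors, k=2):
    print(passages[i])
```

The vector store performs the same kind of ranking, but over high-dimensional embeddings and with an index built for speed.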
