
calmgoose/book-embeddings

Hugging Face · Updated 2023-05-02 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/calmgoose/book-embeddings
Description:
---
license: apache-2.0
task_categories:
- question-answering
- summarization
- conversational
- sentence-similarity
language:
- en
pretty_name: FAISS Vector Store of Embeddings for Books
tags:
- faiss
- langchain
- instructor embeddings
- vector stores
- books
- LLM
---

# Vector store of embeddings for books

- **"1984" by George Orwell**
- **"The Almanac of Naval Ravikant" by Eric Jorgenson**

This is a [faiss](https://github.com/facebookresearch/faiss) vector store created with [instructor embeddings](https://github.com/HKUNLP/instructor-embedding) using [LangChain](https://langchain.readthedocs.io/en/latest/modules/indexes/examples/embeddings.html#instructembeddings). Use it for similarity search, question answering or anything else that leverages embeddings! 😃

Creating these embeddings can take a while so here's a convenient, downloadable one 🤗

## How to use

1. Specify the book from one of the following:
   - `"1984"`
   - `"The Almanac of Naval Ravikant"`
2. Download data
3. Load to use with LangChain

```
pip install -qqq langchain InstructorEmbedding sentence_transformers faiss-cpu huggingface_hub
```

```python
import os

from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.faiss import FAISS
from huggingface_hub import snapshot_download

# download the vectorstore for the book you want
BOOK = "1984"
cache_dir = f"{BOOK}_cache"  # fixed: the variable is BOOK, not book
vectorstore = snapshot_download(
    repo_id="calmgoose/book-embeddings",
    repo_type="dataset",
    revision="main",
    allow_patterns=f"books/{BOOK}/*",  # to download only the one book
    cache_dir=cache_dir,
)

# get path to the `vectorstore` folder that you just downloaded
# we'll look inside the `cache_dir` for the folder we want
target_dir = BOOK

# Walk through the directory tree recursively
for root, dirs, files in os.walk(cache_dir):
    # Check if the target directory is in the list of directories
    if target_dir in dirs:
        # Get the full path of the target directory
        target_path = os.path.join(root, target_dir)

# load embeddings
# this is what was used to create embeddings for the book
embeddings = HuggingFaceInstructEmbeddings(
    embed_instruction="Represent the book passage for retrieval: ",
    query_instruction="Represent the question for retrieving supporting texts from the book passage: ",
)

# load vector store to use with langchain
docsearch = FAISS.load_local(folder_path=target_path, embeddings=embeddings)

# similarity search
question = "Who is big brother?"
search = docsearch.similarity_search(question, k=4)

for item in search:
    print(item.page_content)
    print(f"From page: {item.metadata['page']}")
    print("---")
```
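The folder lookup in the snippet above leaves `target_path` undefined when no matching folder is found, so a failed download only shows up later as a `NameError`. One way to make that step explicit is sketched below. The helper name `find_vectorstore_dir` is hypothetical and not part of the dataset card, and it returns the first match rather than the last.

```python
import os


def find_vectorstore_dir(cache_dir: str, book: str) -> str:
    """Return the first directory named `book` found under `cache_dir`.

    Hypothetical helper (not from the dataset card): it wraps the same
    os.walk search, but fails loudly when the folder is missing.
    """
    for root, dirs, _files in os.walk(cache_dir):
        if book in dirs:
            return os.path.join(root, book)
    raise FileNotFoundError(f"No folder named {book!r} under {cache_dir!r}")
```

With this in place, `target_path = find_vectorstore_dir(cache_dir, BOOK)` could stand in for the loop before calling `FAISS.load_local`.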
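Provider:
calmgoose
Summary of original information

Dataset overview

Basic information

  • License: Apache-2.0
  • Task categories:
    • Question answering
    • Summarization
    • Conversational
    • Sentence similarity
  • Language: English
  • Name: FAISS Vector Store of Embeddings for Books
  • Tags:
    • faiss
    • langchain
    • instructor embeddings
    • vector stores
    • books
    • LLM

Data contents

  • Books included:
    • "1984" by George Orwell
    • "The Almanac of Naval Ravikant" by Eric Jorgenson

How to use

  1. Choose a book: specify one of the following:
    • "1984"
    • "The Almanac of Naval Ravikant"
  2. Download the data: use the code provided to download the vector store for that book.
  3. Load and use: load the vector store with LangChain for similarity search, question answering, and similar tasks.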
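Steps 1 and 2 can be tied together with a small guard that rejects unknown titles before any download starts. This is a minimal sketch: `AVAILABLE_BOOKS` and `allow_pattern_for` are hypothetical names, and the `books/<title>/*` layout is taken from the `allow_patterns` argument in the dataset's own example.

```python
# Titles listed on the dataset card; the constant name is an assumption.
AVAILABLE_BOOKS = ("1984", "The Almanac of Naval Ravikant")


def allow_pattern_for(book: str) -> str:
    """Build the `allow_patterns` value that limits the download to one book."""
    if book not in AVAILABLE_BOOKS:
        raise ValueError(
            f"Unknown book {book!r}; choose one of {list(AVAILABLE_BOOKS)}"
        )
    return f"books/{book}/*"
```

The returned string would then be passed as `allow_patterns=allow_pattern_for(BOOK)` to `snapshot_download`.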

Example code

```python
import os

from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.faiss import FAISS
from huggingface_hub import snapshot_download

# Download the vector store
BOOK = "1984"
cache_dir = f"{BOOK}_cache"  # fixed: the variable is BOOK, not book
vectorstore = snapshot_download(
    repo_id="calmgoose/book-embeddings",
    repo_type="dataset",
    revision="main",
    allow_patterns=f"books/{BOOK}/*",
    cache_dir=cache_dir,
)

# Locate the downloaded folder for the book inside cache_dir
# (this step is in the full README example; without it target_path is undefined)
for root, dirs, files in os.walk(cache_dir):
    if BOOK in dirs:
        target_path = os.path.join(root, BOOK)

# Load the embeddings
embeddings = HuggingFaceInstructEmbeddings(
    embed_instruction="Represent the book passage for retrieval: ",
    query_instruction="Represent the question for retrieving supporting texts from the book passage: ",
)

# Load the vector store
docsearch = FAISS.load_local(folder_path=target_path, embeddings=embeddings)

# Similarity search
question = "Who is big brother?"
search = docsearch.similarity_search(question, k=4)

for item in search:
    print(item.page_content)
    print(f"From page: {item.metadata['page']}")  # fixed: the key must be quoted
    print("---")
```
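To give a sense of what `similarity_search(question, k=4)` does conceptually, here is a toy nearest-neighbour search over plain Python lists. The dataset card does not state which FAISS index type or distance metric the stores use, so Euclidean (L2) distance and exhaustive comparison are assumed here purely for illustration. The function name `top_k_by_l2` is hypothetical.

```python
import math


def top_k_by_l2(query, vectors, k=4):
    """Return the indices of the k vectors closest to `query` by L2 distance.

    Illustrative only: a real vector store compares high-dimensional
    embeddings and maps the winning indices back to stored passages.
    """
    scored = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first; ties broken by index
    return [idx for _distance, idx in scored[:k]]


# Toy 2-D "embeddings" standing in for book passages.
passages = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.1]]
nearest = top_k_by_l2([0.0, 0.0], passages, k=2)
```

In the real pipeline the query vector comes from embedding the question with `query_instruction`, and each returned index corresponds to a document whose `page_content` and `metadata` are printed in the loop above.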
