Snowflake/msmarco-v2.1-snowflake-arctic-embed-l
TREC-RAG-Embedding-Baseline Dataset Overview
Basic Information
- License: Apache 2.0
- Task category: Question answering
- Language: English
- Tags: TREC-RAG, RAG, MSMARCO, MSMARCOV2.1, Snowflake, arctic, arctic-embed
- Dataset name: TREC-RAG-Embedding-Baseline
- Dataset size: 100M<n<1B
Configurations
- Config name: corpus
- Data files:
  - Split: train
  - Path: corpus/*
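Dataset Description
This dataset contains embeddings of the MSMARCO-V2.1 dataset used in TREC RAG. All embeddings were created with Snowflake's Arctic Embed L and are intended as a simple baseline for dense-retrieval-based methods.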
Loading the Dataset
Loading the document embeddings
- Direct load, which downloads the full corpus:

      from datasets import load_dataset

      docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-l", split="train")

- Streaming, which iterates over records without downloading everything first:

      from datasets import load_dataset

      docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-l", split="train", streaming=True)

      for doc in docs:
          doc_id = doc["docid"]
          url = doc["url"]
          text = doc["text"]
          emb = doc["embedding"]
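
Because the corpus holds between 100M and 1B records, it is often convenient to work with only the head of the stream. The sketch below shows one way to do that. The helper `take_embeddings` and the generator `fake_stream` are hypothetical names introduced for illustration. `fake_stream` stands in for the streaming dataset so the snippet runs offline, and its records are assumed to carry an `embedding` field as in the loop above.

```python
from itertools import islice

import numpy as np


def take_embeddings(records, n, key="embedding"):
    """Collect the first n records from any iterable and stack their vectors.

    Hypothetical helper: `records` can be a streaming dataset or any
    iterable of dicts that hold a list of floats under `key`.
    """
    head = list(islice(records, n))
    matrix = np.asarray([r[key] for r in head], dtype=np.float32)
    return head, matrix


def fake_stream():
    """Toy stand-in for the streaming dataset (an assumption, not real data)."""
    i = 0
    while True:
        yield {"docid": f"doc-{i}", "embedding": [float(i), 1.0, 0.0]}
        i += 1


head, matrix = take_embeddings(fake_stream(), 5)
print(matrix.shape)  # (5, 3)
```

With the real dataset, the same helper could be called on the streaming `docs` object in place of `fake_stream()`; `islice` stops pulling records once `n` have been read.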
Search Example
The following is a complete search example over the first 1,000 passages:

    from datasets import load_dataset
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    top_k = 100      # number of hits to return
    max_docs = 1000  # number of passages to read from the stream

    docs_stream = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-l", split="train", streaming=True)

    docs = []
    doc_embeddings = []

    for doc in docs_stream:
        docs.append(doc)
        doc_embeddings.append(doc["embedding"])
        if len(docs) >= max_docs:
            break

    doc_embeddings = np.asarray(doc_embeddings, dtype=np.float32)

    tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-l")
    model = AutoModel.from_pretrained("Snowflake/snowflake-arctic-embed-l", add_pooling_layer=False)
    model.eval()

    query_prefix = "Represent this sentence for searching relevant passages: "
    queries = ["how do you clean smoke off walls"]
    queries_with_prefix = [f"{query_prefix}{q}" for q in queries]
    query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors="pt", max_length=512)

    # Compute token embeddings and keep the first token of each query
    with torch.no_grad():
        query_embeddings = model(**query_tokens)[0][:, 0]

    # Normalize embeddings (the NumPy document matrix is converted to a tensor first)
    query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
    doc_embeddings = torch.nn.functional.normalize(torch.from_numpy(doc_embeddings), p=2, dim=1)

    # Compute dot score between query embedding and document embeddings
    dot_scores = np.matmul(query_embeddings.numpy(), doc_embeddings.numpy().T)[0]
    top_k_hits = np.argpartition(dot_scores, -top_k)[-top_k:].tolist()

    # Sort top_k_hits by dot score
    top_k_hits.sort(key=lambda x: dot_scores[x], reverse=True)

    # Print results
    # The ID field name is assumed to be "docid", matching the streaming example
    print("Query:", queries[0])
    for idx in top_k_hits:
        print(docs[idx]["docid"])
        print(docs[idx]["text"])
        print(docs[idx]["url"], "\n")
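
The scoring step of the search example can be isolated and checked on toy data. The following is a minimal sketch in plain NumPy, not the dataset authors' code. The function name `top_k_cosine` and the two-dimensional toy vectors are assumptions made for illustration. It normalizes both sides, so the dot product equals cosine similarity, then uses `np.argpartition` to select the best `k` entries before sorting only those.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k):
    """Return indices and scores of the k rows most similar to query_vec.

    Hypothetical helper: L2-normalizes both sides so the dot product
    is the cosine similarity.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q

    k = min(k, len(scores))                       # guard against k > number of docs
    idx = np.argpartition(scores, -k)[-k:]        # unordered top-k
    idx = idx[np.argsort(scores[idx])[::-1]]      # order the k hits by score
    return idx.tolist(), scores[idx].tolist()


# Toy document vectors (assumed data, two dimensions for readability)
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])

indices, scores = top_k_cosine(toy_query, toy_docs, k=2)
print(indices)  # [0, 2]
```

Partitioning first keeps the cost close to linear in the number of documents, and only the `k` selected hits are fully sorted.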




