msmarco-v2.1-snowflake-arctic-embed-m-v1.5
Source: ModelScope community. Updated 2025-11-12; indexed 2025-06-07.
Download link: https://modelscope.cn/datasets/Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v1.5
# Snowflake Arctic Embed M V1.5 Embeddings for MSMARCO V2.1 for TREC-RAG
This dataset contains embeddings for the MSMARCO-V2.1 dataset, which is used as the corpus for [TREC RAG](https://trec-rag.github.io/).
All embeddings are created using [Snowflake's Arctic Embed M v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) and are intended to serve as a simple baseline for dense retrieval-based methods.
Snowflake's Arctic Embed M v1.5 is optimized for efficient embeddings and therefore supports embedding truncation and quantization. More details on the model release can be found in this [blog](https://www.snowflake.com/engineering-blog/arctic-embed-m-v1-5-enterprise-retrieval/), along with methods for [quantization and compression](https://github.com/Snowflake-Labs/arctic-embed/blob/main/compressed_embeddings_examples/score_arctic_embed_m_v1dot5_with_quantization.ipynb).
Note that the embeddings are not normalized, so you will need to normalize them before use.
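A minimal sketch of that normalization step, assuming the embeddings have been collected into a NumPy array of shape `(n, dim)`. The helper name `l2_normalize` and the toy vectors are illustrative and not part of the dataset or the model API.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm; eps guards against division by zero."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# Toy stand-ins for rows of doc['embedding'] (hypothetical values).
raw = np.array([[3.0, 4.0], [0.5, 0.0]], dtype=np.float32)
unit = l2_normalize(raw)
print(np.linalg.norm(unit, axis=1))  # every row now has norm close to 1.0
```

Once both query and document vectors are unit length, their dot product equals their cosine similarity.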
## Retrieval Performance
Retrieval performance on TREC DL21-23, MSMARCOV2-Dev, and Raggy Queries is reported below, with BM25 as a baseline. For both systems, retrieval is at the segment level, and each document's score is the maximum score over its segments: Doc Score = Max(passage score).
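The segment-to-document aggregation described above can be sketched as follows. The `(doc_id, score)` pairs and the function name `max_passage_scores` are hypothetical; how a segment maps to its parent document depends on the ID fields in the dataset.

```python
from collections import defaultdict

def max_passage_scores(segment_hits):
    """Collapse (parent_doc_id, segment_score) pairs into one score per document."""
    best = defaultdict(lambda: float("-inf"))
    for doc_id, score in segment_hits:
        if score > best[doc_id]:
            best[doc_id] = score
    # Rank documents by their best-scoring segment, highest first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical segment-level results: two segments from doc "A", one from doc "B".
hits = [("A", 0.41), ("B", 0.56), ("A", 0.62)]
ranked_docs = max_passage_scores(hits)
print(ranked_docs)  # [('A', 0.62), ('B', 0.56)]
```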
Retrieval uses a dot product computed in BF16. Because the M-v1.5 model supports vector truncation, results are also reported with embeddings truncated to 256 dimensions.
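A rough sketch of truncated scoring, assuming that truncation means keeping the leading 256 components and re-normalizing, as is usual for Matryoshka-style embeddings. The function name and the random vectors are illustrative, and the arithmetic here runs in float32 rather than BF16.

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of each row, then rescale rows to unit length."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)

# Random stand-ins for 768-dimensional query and document embeddings.
rng = np.random.default_rng(0)
doc_emb = rng.standard_normal((5, 768)).astype(np.float32)
query_emb = rng.standard_normal((1, 768)).astype(np.float32)

doc_256 = truncate_and_normalize(doc_emb)
query_256 = truncate_and_normalize(query_emb)
scores = query_256 @ doc_256.T  # dot product of unit vectors
print(doc_256.shape, scores.shape)  # (5, 256) (1, 5)
```

Re-normalizing after truncation matters because the leading 256 components of a unit vector generally no longer have unit length.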
### NDCG@10
| Dataset | BM25 | Arctic-M-V1.5 (768 Dimensions) | Arctic-M-V1.5 (256 Dimensions) |
|---|---|---|---|
| Deep Learning 2021 | 0.5778 | 0.6936 | 0.69392 |
| Deep Learning 2022 | 0.3576 | 0.55199 | 0.55608 |
| Deep Learning 2023 | 0.3356 | 0.46963 | 0.45196 |
| msmarcov2-dev | N/A | 0.346 | 0.34074 |
| msmarcov2-dev2 | N/A | 0.34518 | 0.34339 |
| Raggy Queries | 0.4227 | 0.57439 | 0.56686 |
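For readers unfamiliar with the metric, here is a small sketch of NDCG@10 for a single query. It assumes linear gain (the graded relevance label itself) and a log2 rank discount; the card does not state which evaluation tool or gain variant produced the numbers above, and the `qrels` and `ranking` values are made up.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2))."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query; `relevance` maps doc id -> graded relevance label."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments and system ranking for one query.
qrels = {"d1": 3, "d2": 1, "d3": 0}
ranking = ["d2", "d1", "d4"]
print(round(ndcg_at_k(ranking, qrels), 4))
```

The reported figure for a dataset would then be the mean of this value over all of its queries.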
### Recall@100
| Dataset | BM25 | Arctic-M-V1.5 (768 Dimensions) | Arctic-M-V1.5 (256 Dimensions) |
|---|---|---|---|
| Deep Learning 2021 | 0.3811 | 0.43 | 0.42245 |
| Deep Learning 2022 | 0.233 | 0.32125 | 0.3165 |
| Deep Learning 2023 | 0.3049 | 0.37622 | 0.36089 |
| msmarcov2-dev | 0.6683 | 0.85435 | 0.84985 |
| msmarcov2-dev2 | 0.6771 | 0.8576 | 0.8526 |
| Raggy Queries | 0.2807 | 0.36915 | 0.36149 |
### Recall@1000
| Dataset | BM25 | Arctic-M-V1.5 (768 Dimensions) | Arctic-M-V1.5 (256 Dimensions) |
|---|---|---|---|
| Deep Learning 2021 | 0.7115 | 0.74895 | 0.73511 |
| Deep Learning 2022 | 0.479 | 0.55413 | 0.54499 |
| Deep Learning 2023 | 0.5852 | 0.62262 | 0.61199 |
| msmarcov2-dev | 0.8528 | 0.94156 | 0.94014 |
| msmarcov2-dev2 | 0.8577 | 0.94277 | 0.94047 |
| Raggy Queries | 0.5745 | 0.64527 | 0.63826 |
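Recall@100 and Recall@1000 measure the fraction of a query's relevant documents that appear in the top 100 or top 1000 results. A minimal per-query sketch, with hypothetical document IDs and a function name chosen for illustration:

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant documents found among the top-k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)

# Hypothetical example: 4 relevant documents, 2 of them in the top 3 results.
relevant_docs = ["d1", "d2", "d3", "d4"]
system_ranking = ["d1", "x", "d3", "y", "d2"]
print(recall_at_k(system_ranking, relevant_docs, k=3))  # 0.5
```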
## Loading the dataset
### Loading the document embeddings
You can either load the dataset like this:
```python
from datasets import load_dataset
docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v1.5", split="train")
```
Or you can stream it without downloading it first:
```python
from datasets import load_dataset

docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v1.5", split="train", streaming=True)
for doc in docs:
    doc_id = doc['docid']
    url = doc['url']
    text = doc['text']
    emb = doc['embedding']
```
Note: the full dataset is ~620 GB, so it will take a while to download and may not fit on some devices.
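Given that size, one option is to consume the stream in fixed-size batches rather than materializing everything. The helper below is a sketch that works on any iterable of rows with an `embedding` field; `iter_embedding_batches` and the fake rows are illustrative, and a small in-memory generator stands in for the streamed dataset.

```python
from itertools import islice
import numpy as np

def iter_embedding_batches(stream, batch_size=1024, limit=None):
    """Yield (rows, float32 matrix) batches; limit=None means no cap on rows read."""
    rows = islice(stream, limit)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        matrix = np.asarray([row["embedding"] for row in batch], dtype=np.float32)
        yield batch, matrix

# Stand-in for the streamed dataset: any iterable of dicts with an 'embedding' field.
fake_stream = ({"docid": str(i), "embedding": [float(i), 1.0]} for i in range(10))
batch_shapes = [m.shape for _, m in iter_embedding_batches(fake_stream, batch_size=4)]
print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

With the real dataset, `stream` would be the object returned by `load_dataset(..., streaming=True)`, and `limit` caps how many rows are pulled over the network.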
## Search
A full search example (on the first 1,000 paragraphs):
```python
from datasets import load_dataset
import torch
from transformers import AutoModel, AutoTokenizer
import numpy as np

num_docs = 1000  # number of paragraphs to load from the stream
top_k = 100      # number of hits to return

docs_stream = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v1.5", split="train", streaming=True)
docs = []
doc_embeddings = []
for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['embedding'])
    if len(docs) >= num_docs:
        break
# Convert to a float32 torch tensor so it can be normalized and scored with torch
doc_embeddings = torch.from_numpy(np.asarray(doc_embeddings, dtype=np.float32))

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-m-v1.5')
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-m-v1.5', add_pooling_layer=False)
model.eval()

query_prefix = 'Represent this sentence for searching relevant passages: '
queries = ['how do you clean smoke off walls']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute query embeddings from the first ([CLS]) token
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]

# Normalize embeddings (the stored document embeddings are not normalized)
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1)

# Compute dot scores between the query embedding and the document embeddings
dot_scores = torch.matmul(query_embeddings, doc_embeddings.T)[0]
# torch.topk returns the hits already sorted by descending score
top_k_hits = torch.topk(dot_scores, k=top_k).indices.tolist()

# Print results
print("Query:", queries[0])
for doc_id in top_k_hits:
    print(docs[doc_id]['doc_id'])
    print(docs[doc_id]['text'])
    print(docs[doc_id]['url'], "\n")
```
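The example above holds every loaded embedding in memory, which does not scale to the full corpus. One way to search a larger slice is to keep a running top-k while scoring batch by batch. The sketch below assumes the batches are already L2-normalized matrices; `streaming_top_k` is an illustrative name, and random vectors stand in for real embeddings.

```python
import heapq
import numpy as np

def streaming_top_k(query, batches, k=10):
    """Return the k best (score, global_index) pairs over batches of unit-norm rows."""
    heap = []  # min-heap holding the current k best (score, index) pairs
    offset = 0
    for matrix in batches:
        scores = matrix @ query
        for i, score in enumerate(scores):
            item = (float(score), offset + i)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        offset += len(matrix)
    return sorted(heap, reverse=True)

# Random unit vectors standing in for document embeddings.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 32)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query_vec = corpus[123]  # query with a corpus row, so index 123 should rank first
top = streaming_top_k(query_vec, np.array_split(corpus, 7), k=5)
print(top[0][1])  # 123
```

Memory use is bounded by one batch plus `k` heap entries, at the cost of a Python-level loop over scores; a vectorized per-batch `argpartition` would be faster for large batches.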
Provider: maas
Created: 2025-06-05



