Cohere/miracl-ru-queries-22-12
Source: Hugging Face. Updated 2023-02-06; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Cohere/miracl-ru-queries-22-12
Resource description:
---
annotations_creators:
- expert-generated
language:
- ru
multilinguality:
- multilingual
size_categories: []
source_datasets: []
tags: []
task_categories:
- text-retrieval
license:
- apache-2.0
task_ids:
- document-retrieval
---
# MIRACL (ru) embedded with cohere.ai `multilingual-22-12` encoder
We encoded the [MIRACL dataset](https://huggingface.co/miracl) using the [cohere.ai](https://txt.cohere.ai/multilingual/) `multilingual-22-12` embedding model.
The query embeddings can be found in [Cohere/miracl-ru-queries-22-12](https://huggingface.co/datasets/Cohere/miracl-ru-queries-22-12) and the corpus embeddings can be found in [Cohere/miracl-ru-corpus-22-12](https://huggingface.co/datasets/Cohere/miracl-ru-corpus-22-12).
For the original datasets, see [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) and [miracl/miracl-corpus](https://huggingface.co/datasets/miracl/miracl-corpus).
Dataset info:
> MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world.
>
> The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., `\n\n` in the wiki markup). Each of these passages comprises a "document" or unit of retrieval. We preserve the Wikipedia article title of each passage.
## Embeddings
We compute the embeddings for `title+" "+text` using our `multilingual-22-12` embedding model, a state-of-the-art model for semantic search in 100 languages. To learn more about this model, see [cohere.ai multilingual embedding model](https://txt.cohere.ai/multilingual/).
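If you embed your own passages and want them to be comparable with this corpus, it seems reasonable to build the input string the same way. The following is a minimal sketch under that assumption; `build_embedding_input` and `batched` are hypothetical helpers written for this example, and the batch size is an arbitrary illustration value, not a documented API limit.

```python
def build_embedding_input(title, text):
    """Join title and passage text as the card describes: title + " " + text."""
    return title + " " + text


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Two made-up passages, used only to illustrate the string construction.
passages = [
    {"title": "Москва", "text": "Москва является столицей России."},
    {"title": "Волга", "text": "Волга впадает в Каспийское море."},
]
inputs = [build_embedding_input(p["title"], p["text"]) for p in passages]
batches = list(batched(inputs, batch_size=1))
```

Each batch could then be passed as the `texts` argument of an embedding call.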
## Loading the dataset
In [miracl-ru-corpus-22-12](https://huggingface.co/datasets/Cohere/miracl-ru-corpus-22-12) we provide the corpus embeddings. Note, depending on the selected split, the respective files can be quite large.
You can either load the dataset like this:
```python
from datasets import load_dataset
docs = load_dataset("Cohere/miracl-ru-corpus-22-12", split="train")
```
Or you can stream it without downloading it first:
```python
from datasets import load_dataset
docs = load_dataset("Cohere/miracl-ru-corpus-22-12", split="train", streaming=True)

# Each streamed record is a dict with the document ID, title, text and embedding
for doc in docs:
    docid = doc['docid']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```
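Because the files can be large, it can help to inspect a bounded sample of the stream before committing to a full download. The sketch below assumes only that each record is a dict with the `docid` and `emb` fields shown above; `take_sample` is a hypothetical helper, not part of the `datasets` library.

```python
from itertools import islice


def take_sample(doc_iter, n):
    """Collect document IDs and embeddings for at most `n` records.

    Works with any iterable of dicts that carry 'docid' and 'emb' keys,
    so only the first `n` records of a stream are ever consumed.
    """
    docids, embeddings = [], []
    for doc in islice(doc_iter, n):
        docids.append(doc["docid"])
        embeddings.append(doc["emb"])
    return docids, embeddings
```

With the streaming dataset from the previous example, `take_sample(docs, 1000)` would read only the first 1000 records.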
## Search
Have a look at [miracl-ru-queries-22-12](https://huggingface.co/datasets/Cohere/miracl-ru-queries-22-12) where we provide the query embeddings for the MIRACL dataset.
To search the documents, you must use the **dot product** as the similarity function.
Compare the query embedding with the document embeddings either through a vector database (recommended) or by computing the dot product directly.
A full search example:
```python
# Attention! For large datasets, this requires a lot of memory to store
# all document embeddings and to compute the dot product scores.
# Only use this for smaller datasets. For large datasets, use a vector DB.
from datasets import load_dataset
import torch

# Load documents + embeddings
docs = load_dataset("Cohere/miracl-ru-corpus-22-12", split="train")
doc_embeddings = torch.tensor(docs['emb'])

# Load queries
queries = load_dataset("Cohere/miracl-ru-queries-22-12", split="dev")

# Select the first query as example
qid = 0
query = queries[qid]
# Use only the selected query's embedding, as a (1, dim) matrix
query_embedding = torch.tensor(query['emb']).unsqueeze(0)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query['query'])
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'])
```
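When the corpus does not fit in memory and no vector database is at hand, one alternative is to scan the streamed documents once and keep only the best `k` candidates. The sketch below is an illustration of that idea in plain Python, not the authors' method; `dot` and `stream_top_k` are hypothetical helpers, and a pure-Python dot product will be slow on a full corpus.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def stream_top_k(query_emb, doc_iter, k=3):
    """Return the k highest-scoring (score, docid, title) tuples.

    Iterates over the documents once and keeps at most k candidates
    in a min-heap, so memory use does not grow with the corpus size.
    """
    heap = []  # min-heap of (score, counter, docid, title)
    for counter, doc in enumerate(doc_iter):
        item = (dot(query_emb, doc["emb"]), counter, doc["docid"], doc["title"])
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # The new document beats the weakest kept candidate
            heapq.heapreplace(heap, item)
    # Highest score first; the counter only breaks ties between equal scores
    return [(score, docid, title) for score, _, docid, title in sorted(heap, reverse=True)]
```

With a dataset loaded using `streaming=True`, the call would look like `stream_top_k(query['emb'], docs, k=3)`.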
You can get embeddings for new queries using our API:
```python
# Run: pip install cohere
import cohere

api_key = "YOUR_API_KEY"  # Replace with your Cohere API key
co = cohere.Client(api_key)
texts = ['my search query']
response = co.embed(texts=texts, model='multilingual-22-12')
query_embedding = response.embeddings[0]  # Get the embedding for the first text
```
## Performance
In the following table we compare the cohere multilingual-22-12 model with Elasticsearch version 8.6.0 lexical search (title and passage indexed as independent fields). Note that Elasticsearch doesn't support all languages that are part of the MIRACL dataset.
We compute nDCG@10 (a ranking-based metric) as well as hit@3, which checks whether at least one relevant document appears in the top-3 results. We find hit@3 easier to interpret, as it gives the share of queries for which a relevant document is found among the top-3 results.
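For readers who want to reproduce these metrics on their own rankings, here is a minimal sketch of both for a single query. It assumes binary relevance labels and the standard log2 rank discount; the evaluation script actually used for the table is not given here, so treat `hit_at_k` and `ndcg_at_k` as illustrative helpers rather than the authors' implementation.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k=3):
    """Return 1.0 if at least one relevant document is in the top-k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance (gain 1 if relevant, 0 otherwise)."""
    # Discounted gain of the actual ranking; positions are 0-based, hence +2
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, d in enumerate(ranked_ids[:k])
        if d in relevant_ids
    )
    # Ideal ranking: all relevant documents placed at the top
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging each function over all queries and multiplying by 100 yields scores on the same scale as the table.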
Note: MIRACL annotated only a small fraction of passages (10 per query) for relevance. Especially for larger Wikipedias (such as English), we often found many more relevant passages. These gaps are known as annotation holes. Real nDCG@10 and hit@3 performance is therefore likely higher than reported.
| Dataset | cohere multilingual-22-12 nDCG@10 | cohere multilingual-22-12 hit@3 | ES 8.6.0 nDCG@10 | ES 8.6.0 hit@3 |
|---|---|---|---|---|
| miracl-ar | 64.2 | 75.2 | 46.8 | 56.2 |
| miracl-bn | 61.5 | 75.7 | 49.2 | 60.1 |
| miracl-de | 44.4 | 60.7 | 19.6 | 29.8 |
| miracl-en | 44.6 | 62.2 | 30.2 | 43.2 |
| miracl-es | 47.0 | 74.1 | 27.0 | 47.2 |
| miracl-fi | 63.7 | 76.2 | 51.4 | 61.6 |
| miracl-fr | 46.8 | 57.1 | 17.0 | 21.6 |
| miracl-hi | 50.7 | 62.9 | 41.0 | 48.9 |
| miracl-id | 44.8 | 63.8 | 39.2 | 54.7 |
| miracl-ru | 49.2 | 66.9 | 25.4 | 36.7 |
| **Avg** | 51.7 | 67.5 | 34.7 | 46.0 |
Further languages (not supported by Elasticsearch):
| Dataset | cohere multilingual-22-12 nDCG@10 | cohere multilingual-22-12 hit@3 |
|---|---|---|
| miracl-fa | 44.8 | 53.6 |
| miracl-ja | 49.0 | 61.0 |
| miracl-ko | 50.9 | 64.8 |
| miracl-sw | 61.4 | 74.5 |
| miracl-te | 67.8 | 72.3 |
| miracl-th | 60.2 | 71.9 |
| miracl-yo | 56.4 | 62.2 |
| miracl-zh | 43.8 | 56.5 |
| **Avg** | 54.3 | 64.6 |
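Provider:
Cohere
Summary of the original information
Dataset overview
Name: MIRACL (Multilingual Information Retrieval Across a Continuum of Languages)
Languages: multilingual, covering 18 languages
Task category: text retrieval
Task ID: document retrieval
License: Apache-2.0
Dataset description:
- MIRACL is a multilingual retrieval dataset that focuses on search across 18 different languages, which together have over three billion native speakers worldwide.
- The corpus for each language is prepared from a Wikipedia dump, keeping only the plain text and discarding images, tables, etc.
- Each article is segmented into multiple passages based on natural discourse units; each passage is one unit of retrieval.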
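Dataset structure
Embeddings:
- The embeddings of `title+" "+text` are computed with the `multilingual-22-12` embedding model.
- The model supports semantic search in 100 languages.
Data loading:
- Both document embeddings and query embeddings are provided.
- Document embeddings are available at Cohere/miracl-ru-corpus-22-12.
- Query embeddings are available at Cohere/miracl-ru-queries-22-12.
Search method
- Search uses the dot product.
- A vector database is recommended for search; alternatively, compute the dot product directly.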
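Performance evaluation
Evaluation metrics: nDCG@10 and hit@3
Performance comparison:
- Compared with Elasticsearch 8.6.0 lexical search, the cohere multilingual-22-12 model scores higher on every language that both support.
- Because MIRACL annotated only a small fraction of passages for relevance, real performance is likely higher than reported.
| Average over | cohere multilingual-22-12 nDCG@10 | cohere multilingual-22-12 hit@3 |
|---|---|---|
| Languages supported by Elasticsearch | 51.7 | 67.5 |
| Further languages | 54.3 | 64.6 |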



