Cohere/miracl-sw-corpus-22-12
Source: Hugging Face. Updated: 2023-02-06. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Cohere/miracl-sw-corpus-22-12
Resource description:
---
annotations_creators:
- expert-generated
language:
- sw
multilinguality:
- multilingual
size_categories: []
source_datasets: []
tags: []
task_categories:
- text-retrieval
license:
- apache-2.0
task_ids:
- document-retrieval
---
# MIRACL (sw) embedded with cohere.ai `multilingual-22-12` encoder
We encoded the [MIRACL dataset](https://huggingface.co/miracl) using the [cohere.ai](https://txt.cohere.ai/multilingual/) `multilingual-22-12` embedding model.
The query embeddings can be found in [Cohere/miracl-sw-queries-22-12](https://huggingface.co/datasets/Cohere/miracl-sw-queries-22-12) and the corpus embeddings can be found in [Cohere/miracl-sw-corpus-22-12](https://huggingface.co/datasets/Cohere/miracl-sw-corpus-22-12).
For the original datasets, see [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) and [miracl/miracl-corpus](https://huggingface.co/datasets/miracl/miracl-corpus).
Dataset info:
> MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world.
>
> The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., `\n\n` in the wiki markup). Each of these passages comprises a "document" or unit of retrieval. We preserve the Wikipedia article title of each passage.
## Embeddings
We compute the embeddings for `title+" "+text` using our `multilingual-22-12` embedding model, a state-of-the-art model for semantic search in 100 languages. To learn more about this model, see the [cohere.ai multilingual embedding model](https://txt.cohere.ai/multilingual/).
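As a minimal sketch of the input format described above, the string passed to the encoder can be built by joining the title and the passage text with a single space. The helper name `build_embedding_input` is hypothetical and not part of any Cohere or MIRACL API, and the sample record is made up. Whether whitespace is normalized before encoding is not stated here, so the sketch leaves both fields untouched.

```python
def build_embedding_input(title: str, text: str) -> str:
    """Join title and passage text as `title + " " + text`."""
    return title + " " + text


# Example record shaped like a row of this dataset (values are made up).
doc = {"title": "Nairobi", "text": "Nairobi ni mji mkuu wa Kenya."}
embedding_input = build_embedding_input(doc["title"], doc["text"])
print(embedding_input)
```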
## Loading the dataset
In [miracl-sw-corpus-22-12](https://huggingface.co/datasets/Cohere/miracl-sw-corpus-22-12) we provide the corpus embeddings. Note that, depending on the selected split, the files can be quite large.
You can either load the dataset like this:
```python
from datasets import load_dataset
docs = load_dataset("Cohere/miracl-sw-corpus-22-12", split="train")
```
Or you can stream it without downloading it first:
```python
from datasets import load_dataset
docs = load_dataset("Cohere/miracl-sw-corpus-22-12", split="train", streaming=True)
for doc in docs:
    docid = doc['docid']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```
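When streaming, it is often enough to inspect only the first few records. The sketch below uses `itertools.islice` for that. The helper `take_first` is a hypothetical name, and the in-memory `fake_stream` generator, with made-up field values, merely stands in for the streamed dataset so that the example runs without a download.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def take_first(records: Iterable[Dict], n: int) -> List[Dict]:
    """Return at most the first n records without consuming the rest."""
    return list(islice(records, n))


def fake_stream() -> Iterator[Dict]:
    """Stand-in for a streamed dataset; yields records with the same fields."""
    for i in range(1000):
        yield {
            "docid": f"{i}#0",
            "title": f"Title {i}",
            "text": f"Passage {i}",
            "emb": [0.0, 1.0],
        }


sample = take_first(fake_stream(), 3)
for doc in sample:
    print(doc["docid"], doc["title"], len(doc["emb"]))
```

With the real dataset, any iterable of records, such as a dataset loaded with `streaming=True`, could be passed in place of `fake_stream()`.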
## Search
Have a look at [miracl-sw-queries-22-12](https://huggingface.co/datasets/Cohere/miracl-sw-queries-22-12) where we provide the query embeddings for the MIRACL dataset.
To search the documents, you must use the **dot product** as the similarity measure.
Compare the query embedding with the document embeddings either through a vector database (recommended) or by computing the dot product directly.
A full search example:
```python
# Attention! For large datasets, this requires a lot of memory to store
# all document embeddings and to compute the dot product scores.
# Only use this for smaller datasets. For large datasets, use a vector DB.
from datasets import load_dataset
import torch

# Load documents + embeddings
docs = load_dataset("Cohere/miracl-sw-corpus-22-12", split="train")
doc_embeddings = torch.tensor(docs['emb'])

# Load queries
queries = load_dataset("Cohere/miracl-sw-queries-22-12", split="dev")

# Select the first query as example
qid = 0
query = queries[qid]
# Shape [1, dim], so it can be multiplied with the [num_docs, dim] matrix
query_embedding = torch.tensor(query['emb']).unsqueeze(0)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query['query'])
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'])
```
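If the corpus does not fit in memory, one option short of a vector database is to score documents one at a time and keep only the best `k` in a heap. The following is a dependency-free sketch of that idea, not the authors' method. `dot` and `top_k_streaming` are hypothetical helper names, and the tiny two-dimensional vectors are made-up stand-ins for real embeddings.

```python
import heapq
from typing import Dict, Iterable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_streaming(
    query_emb: Sequence[float], docs: Iterable[Dict], k: int = 3
) -> List[Tuple[float, str]]:
    """Scan docs once and return the k highest (score, docid) pairs, best first."""
    heap: List[Tuple[float, str]] = []  # min-heap holding the current top k
    for doc in docs:
        item = (dot(query_emb, doc["emb"]), doc["docid"])
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # New item beats the weakest kept result, so swap it in.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


toy_docs = [
    {"docid": "a", "emb": [1.0, 0.0]},
    {"docid": "b", "emb": [0.0, 1.0]},
    {"docid": "c", "emb": [0.5, 0.5]},
    {"docid": "d", "emb": [-1.0, 0.0]},
]
results = top_k_streaming([1.0, 0.2], toy_docs, k=2)
print(results)
```

Memory use stays proportional to `k` rather than to the corpus size, at the cost of scoring in pure Python, which is much slower than a batched matrix product.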
You can get embeddings for new queries using our API:
```python
# Run: pip install cohere
import cohere

api_key = "YOUR_API_KEY"  # Placeholder: replace with your Cohere API key
co = cohere.Client(api_key)
texts = ['my search query']
response = co.embed(texts=texts, model='multilingual-22-12')
query_embedding = response.embeddings[0]  # Get the embedding for the first text
```
## Performance
In the following table we compare the cohere multilingual-22-12 model with Elasticsearch version 8.6.0 lexical search (title and passage indexed as independent fields). Note that Elasticsearch doesn't support all languages that are part of the MIRACL dataset.
We compute nDCG@10 (a rank-based quality metric) as well as hit@3, which indicates whether at least one relevant document appears in the top-3 results. We find hit@3 easier to interpret, as it gives the share of queries for which a relevant document is found among the top-3 results.
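To make the two metrics concrete, here is a small sketch of their standard definitions for a single query, assuming binary relevance labels. The evaluation script behind the reported numbers is not given here, so this illustrates the metrics rather than reproducing the authors' code. The function names and the sample ids are made up.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 3) -> float:
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, d in enumerate(ranked_ids[:k])
        if d in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["d7", "d2", "d9", "d4"]  # system output, best first (made-up ids)
relevant = {"d2", "d4"}             # judged relevant for this query (made up)
print(hit_at_k(ranking, relevant, k=3))
print(round(ndcg_at_k(ranking, relevant, k=10), 3))
```

Per-query values like these are then averaged over all queries. The tables report them on a 0 to 100 scale.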
Note: MIRACL only annotated a small fraction of passages (10 per query) for relevance. Especially for larger Wikipedias (like English), we often found many more relevant passages. These gaps are known as annotation holes. Real nDCG@10 and hit@3 performance is likely higher than depicted.
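One way to gauge how much annotation holes could affect a score is to measure what fraction of the retrieved top-k passages were ever judged. The sketch below computes that fraction. The function name `judged_at_k` and the sample data are illustrative assumptions, not part of MIRACL's tooling.

```python
from typing import Mapping, Sequence


def judged_at_k(ranked_ids: Sequence[str], judgments: Mapping[str, int], k: int = 10) -> float:
    """Fraction of the top-k results that have any relevance judgment."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in judgments) / len(top)


# Made-up judgments for one query: 1 = relevant, 0 = judged non-relevant.
judgments = {"d1": 1, "d2": 0, "d3": 1}
ranking = ["d1", "d8", "d3", "d9"]  # d8 and d9 were never annotated
print(judged_at_k(ranking, judgments, k=4))
```

A low value means many retrieved passages are counted as non-relevant only because nobody annotated them, which is the situation the note above describes.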
| Model | cohere multilingual-22-12 nDCG@10 | cohere multilingual-22-12 hit@3 | ES 8.6.0 nDCG@10 | ES 8.6.0 hit@3 |
|---|---|---|---|---|
| miracl-ar | 64.2 | 75.2 | 46.8 | 56.2 |
| miracl-bn | 61.5 | 75.7 | 49.2 | 60.1 |
| miracl-de | 44.4 | 60.7 | 19.6 | 29.8 |
| miracl-en | 44.6 | 62.2 | 30.2 | 43.2 |
| miracl-es | 47.0 | 74.1 | 27.0 | 47.2 |
| miracl-fi | 63.7 | 76.2 | 51.4 | 61.6 |
| miracl-fr | 46.8 | 57.1 | 17.0 | 21.6 |
| miracl-hi | 50.7 | 62.9 | 41.0 | 48.9 |
| miracl-id | 44.8 | 63.8 | 39.2 | 54.7 |
| miracl-ru | 49.2 | 66.9 | 25.4 | 36.7 |
| **Avg** | 51.7 | 67.5 | 34.7 | 46.0 |
Further languages (not supported by Elasticsearch):
| Model | cohere multilingual-22-12 nDCG@10 | cohere multilingual-22-12 hit@3 |
|---|---|---|
| miracl-fa | 44.8 | 53.6 |
| miracl-ja | 49.0 | 61.0 |
| miracl-ko | 50.9 | 64.8 |
| miracl-sw | 61.4 | 74.5 |
| miracl-te | 67.8 | 72.3 |
| miracl-th | 60.2 | 71.9 |
| miracl-yo | 56.4 | 62.2 |
| miracl-zh | 43.8 | 56.5 |
| **Avg** | 54.3 | 64.6 |
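Provider:
Cohere
Summary of the original information
Dataset overview
Dataset name: MIRACL (Multilingual Information Retrieval Across a Continuum of Languages)
Languages: 18 languages, with Swahili (sw) specifically mentioned
Multilinguality: multilingual
Task category: text retrieval
Task ID: document retrieval
License: Apache-2.0
Annotation creators: expert-generated
Dataset composition:
- The corpus for each language comes from a Wikipedia dump with only the plain text kept. Each article is split into passages based on natural discourse units, and each passage serves as one unit of retrieval.
- Embeddings are computed over `title+" "+text` with the multilingual-22-12 embedding model, which supports semantic search in 100 languages.
Dataset usage:
- Query embeddings and corpus embeddings are provided in Cohere/miracl-sw-queries-22-12 and Cohere/miracl-sw-corpus-22-12, respectively.
- The dataset can be loaded in Python via `from datasets import load_dataset`; streaming is supported to reduce memory requirements.
Search method:
- Search uses the dot product; a vector database is recommended for large-scale data.
Performance comparison:
- Compared with Elasticsearch 8.6.0, the cohere multilingual-22-12 model scores higher on nDCG@10 and hit@3 across multiple languages.
- For languages that Elasticsearch does not support, performance figures for the cohere multilingual-22-12 model are also provided.
Note:
- Because MIRACL annotated only a small fraction of passages for relevance, actual performance may be higher than reported.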



