Cohere/wikipedia-22-12-simple-embeddings

Name: Cohere/wikipedia-22-12-simple-embeddings
Creator: Cohere
Published: 2023-03-22 16:56:34
License: apache-2.0

Source: Hugging Face. Updated: 2023-03-22. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Cohere/wikipedia-22-12-simple-embeddings

Resource description:

---
language:
- en
multilinguality:
- multilingual
size_categories: []
source_datasets: []
tags: []
task_categories:
- text-retrieval
license:
- apache-2.0
task_ids:
- document-retrieval
---

# Wikipedia (simple English) embedded with cohere.ai `multilingual-22-12` encoder

We encoded [Wikipedia (simple English)](https://simple.wikipedia.org) using the [cohere.ai](https://txt.cohere.ai/multilingual/) `multilingual-22-12` embedding model.

For an overview of how this dataset was created and pre-processed, have a look at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Embeddings

We compute the embeddings for `title+" "+text` using our `multilingual-22-12` embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this model, have a look at [cohere.ai multilingual embedding model](https://txt.cohere.ai/multilingual/).

## Further languages

We provide embeddings of Wikipedia in many different languages:
[ar](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ar-embeddings), [de](https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings), [en](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings), [es](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings), [fr](https://huggingface.co/datasets/Cohere/wikipedia-22-12-fr-embeddings), [hi](https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings), [it](https://huggingface.co/datasets/Cohere/wikipedia-22-12-it-embeddings), [ja](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ja-embeddings), [ko](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ko-embeddings), [simple english](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings), [zh](https://huggingface.co/datasets/Cohere/wikipedia-22-12-zh-embeddings)

You can find the Wikipedia datasets without embeddings at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Loading the dataset

You can either load the dataset like this:

```python
from datasets import load_dataset
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
```

Or you can stream it without downloading it first:

```python
from datasets import load_dataset
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```

## Search

A full search example:

```python
#Run: pip install cohere datasets torch
from datasets import load_dataset
import torch
import cohere

co = cohere.Client("<<COHERE_API_KEY>>")  # Add your cohere API key from www.cohere.com

#Load at max 1000 documents + embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

query = 'Who founded Youtube'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
```

## Performance

You can find performance on the MIRACL dataset (a semantic search evaluation dataset) here: [miracl-en-queries-22-12#performance](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12#performance)

Provider:

Cohere

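Summary of the original information

Wikipedia (simple English) embedded with the cohere.ai `multilingual-22-12` encoder

Wikipedia (simple English) was encoded using the cohere.ai `multilingual-22-12` embedding model.

Embeddings

The embeddings are computed for `title+" "+text` using the `multilingual-22-12` embedding model, a state-of-the-art model for semantic search in 100 languages.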
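As a small illustration of the input format described above, the sketch below builds the string that would be passed to the embedding model for one record. The helper name `build_embedding_input` and the sample record are hypothetical; only the `title+" "+text` rule comes from the dataset card.

```python
def build_embedding_input(title: str, text: str) -> str:
    """Join title and passage text with a single space (the title+" "+text rule)."""
    return title + " " + text


# Hypothetical record, shaped like the dataset's `title` and `text` fields.
example = {"title": "YouTube", "text": "YouTube is a video sharing website."}

embedding_input = build_embedding_input(example["title"], example["text"])
print(embedding_input)
```

When embedding your own passages for comparison against this dataset, composing the input the same way presumably keeps the two sets of vectors comparable.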

Loading the dataset

You can load the dataset like this:

```python
from datasets import load_dataset
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
```

Or you can stream it without downloading it first:

```python
from datasets import load_dataset
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```
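When streaming, it is often enough to look at the first few records. The following is a minimal sketch of capping an iterable with `itertools.islice`; `fake_stream` is a hypothetical stand-in that yields records with the same four fields (`id`, `title`, `text`, `emb`), so the snippet runs without network access. It assumes the streamed dataset can be iterated like any other Python iterable.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take_first(docs: Iterable[Dict[str, Any]], limit: int) -> List[Dict[str, Any]]:
    """Consume at most `limit` records from an iterable and return them as a list."""
    return list(islice(docs, limit))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the streamed dataset (toy 2-dimensional embeddings)."""
    for i in range(10):
        yield {"id": i, "title": f"Title {i}", "text": f"Text {i}", "emb": [float(i), 1.0]}


sample = take_first(fake_stream(), 3)
print([doc["id"] for doc in sample])
```

Because `islice` stops pulling from the iterator once the limit is reached, the remaining records are never requested.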

Search example

A full search example:

```python
#Run: pip install cohere datasets torch
from datasets import load_dataset
import torch
import cohere

co = cohere.Client("<<COHERE_API_KEY>>")  # Add your cohere API key

# Load at most 1000 documents and their embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

query = 'Who founded Youtube'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Compute dot scores between the query embedding and the document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
```
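To make the scoring step concrete without an API key or PyTorch, here is a dependency-free sketch of dot-product ranking on toy vectors. The function name `top_k_by_dot` and the 3-dimensional vectors are illustrative assumptions; the real embeddings are much longer and come from the `emb` field and from `co.embed`.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(
    query: Sequence[float],
    doc_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k highest dot-product scores."""
    scored = [(i, dot(query, emb)) for i, emb in enumerate(doc_embeddings)]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data standing in for document and query embeddings.
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.75, 0.75, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.5, 0.0]

results = top_k_by_dot(toy_query, toy_docs, k=2)
print(results)
```

This mirrors what `torch.mm` followed by `torch.topk` does in the example above for a single query, only with lists instead of tensors.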

The content above was collected and summarized by 遇见数据集.
