Cohere/wikipedia-22-12-hi-embeddings

Name: Cohere/wikipedia-22-12-hi-embeddings
Creator: Cohere
Published: 2023-03-22 16:53:57
License: Apache 2.0

Source: Hugging Face. Updated 2023-03-22, indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Cohere/wikipedia-22-12-hi-embeddings


Resource description:

---
annotations_creators:
- expert-generated
language:
- hi
multilinguality:
- multilingual
size_categories: []
source_datasets: []
tags: []
task_categories:
- text-retrieval
license:
- apache-2.0
task_ids:
- document-retrieval
---

# Wikipedia (hi) embedded with cohere.ai `multilingual-22-12` encoder

We encoded [Wikipedia (hi)](https://hi.wikipedia.org) using the [cohere.ai](https://txt.cohere.ai/multilingual/) `multilingual-22-12` embedding model.

For an overview of how this dataset was created and pre-processed, have a look at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Embeddings

We compute the embeddings for `title+" "+text` using our `multilingual-22-12` embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this model, have a look at [cohere.ai multilingual embedding model](https://txt.cohere.ai/multilingual/).

## Further languages

We provide embeddings of Wikipedia in many different languages:
[ar](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ar-embeddings), [de](https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings), [en](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings), [es](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings), [fr](https://huggingface.co/datasets/Cohere/wikipedia-22-12-fr-embeddings), [hi](https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings), [it](https://huggingface.co/datasets/Cohere/wikipedia-22-12-it-embeddings), [ja](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ja-embeddings), [ko](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ko-embeddings), [simple english](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings), [zh](https://huggingface.co/datasets/Cohere/wikipedia-22-12-zh-embeddings).

You can find the Wikipedia datasets without embeddings at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Loading the dataset

You can either load the dataset like this:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train")
```

Or you can stream it without downloading it first:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train", streaming=True)

for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```

## Search

A full search example:

```python
# Run: pip install cohere datasets torch
from datasets import load_dataset
import torch
import cohere

co = cohere.Client("<<COHERE_API_KEY>>")  # Add your cohere API key from www.cohere.com

# Load at most 1000 documents + embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

query = 'Who founded Youtube'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
```

## Performance

You can find performance on the MIRACL dataset (a semantic search evaluation dataset) here: [miracl-en-queries-22-12#performance](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12#performance)

Provider:

Cohere

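Summary of original information

Dataset overview

Basic information

Annotation creators: expert-generated
Language: Hindi
Multilinguality: multilingual
Task category: text retrieval
License: Apache 2.0
Task ID: document retrieval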

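Dataset description

This dataset contains Hindi Wikipedia content encoded with cohere.ai's multilingual-22-12 embedding model, a state-of-the-art model that supports semantic search in 100 languages.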

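Embedding computation

For each entry in the dataset, the title and text are joined as title+" "+text and embedded with the multilingual-22-12 embedding model.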
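As a minimal sketch of that input format, the string handed to the encoder is the title and the text joined by a single space. The helper name `build_embedding_input` and the sample record are hypothetical and only illustrate the concatenation; the call to the embedding model itself is left out.

```python
def build_embedding_input(doc: dict) -> str:
    """Join title and text with one space, mirroring title+" "+text."""
    return doc["title"] + " " + doc["text"]


# Hypothetical record using the same field names as the dataset rows.
sample = {"id": 0, "title": "YouTube", "text": "YouTube is a video sharing platform."}

embedding_input = build_embedding_input(sample)
print(embedding_input)
```

A query is embedded on its own, without a title prefix, as in the search example on this page.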

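Other language versions

Wikipedia embeddings are also provided for other languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Simple English, and Chinese.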

Loading the dataset

The dataset can be loaded as follows:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train")
```

Or it can be loaded in streaming mode:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train", streaming=True)

for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```
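Because a streamed split is consumed as an ordinary Python iterable, one way to read only a bounded number of rows is `itertools.islice`. The sketch below uses a hypothetical generator, `fake_stream`, as a stand-in for the streamed dataset so that it runs without network access; the helper name `take` is also an illustration, not part of any library.

```python
from itertools import islice


def take(stream, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(stream, n))


def fake_stream():
    # Hypothetical stand-in for a streamed dataset: yields rows with the same keys.
    i = 0
    while True:
        yield {"id": i, "title": f"Title {i}", "text": f"Text {i}", "emb": [0.0, 1.0]}
        i += 1


first_rows = take(fake_stream(), 3)
print([row["id"] for row in first_rows])
```

With the real dataset, the same helper would presumably be called on the object returned by `load_dataset(..., streaming=True)` in place of `fake_stream()`.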

Search example

The following is a complete search example:

```python
from datasets import load_dataset
import torch
import cohere

co = cohere.Client("<<COHERE_API_KEY>>")  # Replace with your Cohere API key

# Load at most 1000 documents and their embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

query = 'Who founded Youtube'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Dot-product score between the query embedding and every document embedding
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
```
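To make the ranking step concrete, here is a minimal pure-Python sketch of dot-product top-k retrieval on toy vectors. The three-dimensional vectors and the helper names `dot` and `top_k_by_dot` are assumptions made for illustration; real embeddings from the dataset are much longer, and the example above uses `torch.mm` and `torch.topk` for the same computation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(query, doc_vectors, k):
    """Return the indices of the k documents with the highest dot score, best first."""
    scores = [dot(query, vec) for vec in doc_vectors]
    ranked = sorted(range(len(doc_vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy vectors standing in for real document embeddings (assumption).
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [0.9, 0.1, 0.0]

best = top_k_by_dot(query_vector, doc_vectors, k=2)
print(best)
```

Document 0 points almost exactly in the direction of the query and document 2 partly overlaps with it, so these two rank first.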
