Cohere/wikipedia-22-12-it-embeddings

Name: Cohere/wikipedia-22-12-it-embeddings
Creator: Cohere
Published: 2023-03-22 16:54:18
License: 暂无描述

Hugging Face2023-03-22 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/Cohere/wikipedia-22-12-it-embeddings

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: - expert-generated language: - it multilinguality: - multilingual size_categories: [] source_datasets: [] tags: [] task_categories: - text-retrieval license: - apache-2.0 task_ids: - document-retrieval --- # Wikipedia (it) embedded with cohere.ai `multilingual-22-12` encoder We encoded [Wikipedia (it)](https://it.wikipedia.org) using the [cohere.ai](https://txt.cohere.ai/multilingual/) `multilingual-22-12` embedding model. To get an overview how this dataset was created and pre-processed, have a look at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12). ## Embeddings We compute for `title+" "+text` the embeddings using our `multilingual-22-12` embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this model, have a look at [cohere.ai multilingual embedding model](https://txt.cohere.ai/multilingual/). ## Further languages We provide embeddings of Wikipedia in many different languages: [ar](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ar-embeddings), [de](https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings), [en](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings), [es](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings), [fr](https://huggingface.co/datasets/Cohere/wikipedia-22-12-fr-embeddings), [hi](https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings), [it](https://huggingface.co/datasets/Cohere/wikipedia-22-12-it-embeddings), [ja](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ja-embeddings), [ko](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ko-embeddings), [simple english](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings), [zh](https://huggingface.co/datasets/Cohere/wikipedia-22-12-zh-embeddings), You can find the Wikipedia datasets without embeddings at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12). ## Loading the dataset You can either load the dataset like this: ```python from datasets import load_dataset docs = load_dataset(f"Cohere/wikipedia-22-12-it-embeddings", split="train") ``` Or you can also stream it without downloading it before: ```python from datasets import load_dataset docs = load_dataset(f"Cohere/wikipedia-22-12-it-embeddings", split="train", streaming=True) for doc in docs: docid = doc['id'] title = doc['title'] text = doc['text'] emb = doc['emb'] ``` ## Search A full search example: ```python #Run: pip install cohere datasets from datasets import load_dataset import torch import cohere co = cohere.Client(f"<<COHERE_API_KEY>>") # Add your cohere API key from www.cohere.com #Load at max 1000 documents + embeddings max_docs = 1000 docs_stream = load_dataset(f"Cohere/wikipedia-22-12-it-embeddings", split="train", streaming=True) docs = [] doc_embeddings = [] for doc in docs_stream: docs.append(doc) doc_embeddings.append(doc['emb']) if len(docs) >= max_docs: break doc_embeddings = torch.tensor(doc_embeddings) query = 'Who founded Youtube' response = co.embed(texts=[query], model='multilingual-22-12') query_embedding = response.embeddings query_embedding = torch.tensor(query_embedding) # Compute dot score between query embedding and document embeddings dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1)) top_k = torch.topk(dot_scores, k=3) # Print results print("Query:", query) for doc_id in top_k.indices[0].tolist(): print(docs[doc_id]['title']) print(docs[doc_id]['text'], "\n") ``` ## Performance You can find performance on the MIRACL dataset (a semantic search evaluation dataset) here: [miracl-en-queries-22-12#performance](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12#performance)

提供机构：

Cohere

原始信息汇总

数据集概述

基本信息

语言: 意大利语（it）
多语言性: 多语言
任务类别: 文本检索
许可证: Apache-2.0
任务ID: 文档检索

数据集内容

数据集使用cohere.ai的multilingual-22-12嵌入模型对Wikipedia (it)进行了编码。
计算了title+" "+text的嵌入，使用的是multilingual-22-12嵌入模型，该模型支持100种语言的语义搜索。

数据集加载

可以通过以下Python代码加载数据集： python from datasets import load_dataset docs = load_dataset(f"Cohere/wikipedia-22-12-it-embeddings", split="train")
也可以通过设置streaming=True进行流式加载，无需预先下载。

搜索示例

提供了一个完整的搜索示例，展示了如何使用Cohere API和数据集进行查询和检索。

性能评估

数据集的性能评估可在MIRACL数据集中查看。

5,000+

优质数据集

54 个

任务类型

进入经典数据集