
Cohere/wikipedia-22-12-it-embeddings

Source: Hugging Face. Updated 2023-03-22. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Cohere/wikipedia-22-12-it-embeddings
Resource description:
---
annotations_creators:
- expert-generated
language:
- it
multilinguality:
- multilingual
size_categories: []
source_datasets: []
tags: []
task_categories:
- text-retrieval
license:
- apache-2.0
task_ids:
- document-retrieval
---

# Wikipedia (it) embedded with cohere.ai `multilingual-22-12` encoder

We encoded [Wikipedia (it)](https://it.wikipedia.org) using the [cohere.ai](https://txt.cohere.ai/multilingual/) `multilingual-22-12` embedding model.

For an overview of how this dataset was created and pre-processed, have a look at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Embeddings

We compute the embeddings for `title+" "+text` using our `multilingual-22-12` embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this model, have a look at [cohere.ai multilingual embedding model](https://txt.cohere.ai/multilingual/).

## Further languages

We provide embeddings of Wikipedia in many different languages:
[ar](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ar-embeddings), [de](https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings), [en](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings), [es](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings), [fr](https://huggingface.co/datasets/Cohere/wikipedia-22-12-fr-embeddings), [hi](https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings), [it](https://huggingface.co/datasets/Cohere/wikipedia-22-12-it-embeddings), [ja](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ja-embeddings), [ko](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ko-embeddings), [simple english](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings), [zh](https://huggingface.co/datasets/Cohere/wikipedia-22-12-zh-embeddings).

You can find the Wikipedia datasets without embeddings at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Loading the dataset

You can either load the dataset like this:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-it-embeddings", split="train")
```

Or you can stream it without downloading it first:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-it-embeddings", split="train", streaming=True)

for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```

## Search

A full search example:

```python
# Run: pip install cohere datasets torch
from datasets import load_dataset
import torch
import cohere

co = cohere.Client("<<COHERE_API_KEY>>")  # Add your cohere API key from www.cohere.com

# Load at most 1000 documents + embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-it-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

query = 'Who founded Youtube'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
```

## Performance

You can find performance on the MIRACL dataset (a semantic search evaluation dataset) here: [miracl-en-queries-22-12#performance](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12#performance)
Provider:
Cohere
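Summary of original information

Dataset overview

Basic information

  • Language: Italian (it)
  • Multilinguality: multilingual
  • Task category: text retrieval
  • License: Apache-2.0
  • Task ID: document retrieval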

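Dataset contents

  • Wikipedia (it) was encoded with the cohere.ai `multilingual-22-12` embedding model.
  • Embeddings are computed for `title+" "+text` using the `multilingual-22-12` embedding model, which supports semantic search in 100 languages.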
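As a minimal sketch of the `title+" "+text` composition described above, the string passed to the embedding model could be assembled as follows. The helper name `build_embedding_input` and the sample record are illustrative assumptions, not part of the dataset or of any Cohere tooling.

```python
def build_embedding_input(doc: dict) -> str:
    """Join title and text with a single space, mirroring `title+" "+text`."""
    return doc["title"] + " " + doc["text"]


# Made-up record for illustration only; real records come from the dataset.
example_doc = {"title": "Roma", "text": "Roma è la capitale d'Italia."}
embedding_input = build_embedding_input(example_doc)
print(embedding_input)
```

When embedding your own passages for comparison against the `emb` field, using the same composition is probably the safest choice, since the stored vectors were computed over that combined string.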

Loading the dataset

  • The dataset can be loaded in Python with `from datasets import load_dataset` followed by `docs = load_dataset("Cohere/wikipedia-22-12-it-embeddings", split="train")`.

  • It can also be streamed by setting `streaming=True`, with no need to download it first.
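A minimal sketch of reading only the first few records from a stream is shown below. The helpers `take_first`, `open_stream`, and `fake_stream` are hypothetical names introduced for illustration. `open_stream` wraps the `load_dataset` call from the dataset card and needs the `datasets` package and network access, so the demonstration runs on a made-up stand-in stream instead.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_first(stream: Iterable[dict], n: int) -> Iterator[dict]:
    """Return an iterator over at most n records, leaving the rest unread."""
    return islice(stream, n)


def open_stream():
    """Open the dataset in streaming mode (requires `datasets` and network access)."""
    # Imported lazily so that take_first can be used without the package installed.
    from datasets import load_dataset
    return load_dataset(
        "Cohere/wikipedia-22-12-it-embeddings", split="train", streaming=True
    )


def fake_stream():
    """Stand-in for the real stream: an endless generator of made-up records."""
    i = 0
    while True:
        yield {"id": i, "title": f"Title {i}", "text": f"Text {i}", "emb": [0.0, 0.0]}
        i += 1


# Only three records are pulled, even though the stand-in stream never ends.
first_three = list(take_first(fake_stream(), 3))
print([doc["id"] for doc in first_three])
```

With the real dataset, `take_first(open_stream(), 1000)` would play the same role as the `max_docs` cutoff in the card's search example.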

Search example

  • A complete search example is provided, showing how to use the Cohere API together with the dataset for querying and retrieval.
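The ranking step of that search example (dot score between the query embedding and each document embedding, then top-k) can be sketched without an API key or PyTorch. This is a pure-Python illustration under stated assumptions: the function names are hypothetical, and the 3-dimensional vectors are made up, whereas real embeddings come from the dataset's `emb` field and from the Cohere API.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(query_emb, doc_embs, k=3):
    """Return indices of the k documents with the highest dot score, best first."""
    scores = [dot(query_emb, emb) for emb in doc_embs]
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Made-up toy vectors, for illustration only.
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.2, 0.0]

ranking = top_k_by_dot(toy_query, toy_docs, k=2)
print(ranking)  # indices of the two best-scoring toy documents
```

For more than a few thousand documents, a vectorized matrix product, as in the card's `torch.mm` call, is likely a better fit than this loop.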

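Performance evaluation

  • Performance on the MIRACL dataset (a semantic search evaluation dataset) is reported in the performance section of Cohere/miracl-en-queries-22-12.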
