
Cohere/wikipedia-22-12-simple-embeddings

Source: Hugging Face. Updated 2023-03-22. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Cohere/wikipedia-22-12-simple-embeddings
Resource description:
---
language:
- en
multilinguality:
- multilingual
size_categories: []
source_datasets: []
tags: []
task_categories:
- text-retrieval
license:
- apache-2.0
task_ids:
- document-retrieval
---

# Wikipedia (simple English) embedded with cohere.ai `multilingual-22-12` encoder

We encoded [Wikipedia (simple English)](https://simple.wikipedia.org) using the [cohere.ai](https://txt.cohere.ai/multilingual/) `multilingual-22-12` embedding model.

To get an overview of how this dataset was created and pre-processed, have a look at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Embeddings

We compute the embeddings for `title+" "+text` using our `multilingual-22-12` embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this model, have a look at [cohere.ai multilingual embedding model](https://txt.cohere.ai/multilingual/).

## Further languages

We provide embeddings of Wikipedia in many different languages:
[ar](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ar-embeddings), [de](https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings), [en](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings), [es](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings), [fr](https://huggingface.co/datasets/Cohere/wikipedia-22-12-fr-embeddings), [hi](https://huggingface.co/datasets/Cohere/wikipedia-22-12-hi-embeddings), [it](https://huggingface.co/datasets/Cohere/wikipedia-22-12-it-embeddings), [ja](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ja-embeddings), [ko](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ko-embeddings), [simple english](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings), [zh](https://huggingface.co/datasets/Cohere/wikipedia-22-12-zh-embeddings)

You can find the Wikipedia datasets without embeddings at [Cohere/wikipedia-22-12](https://huggingface.co/datasets/Cohere/wikipedia-22-12).

## Loading the dataset

You can load the dataset like this:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
```

Alternatively, you can stream it without downloading it first:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```

## Search

A full search example:

```python
# Run: pip install cohere datasets
from datasets import load_dataset
import torch
import cohere

co = cohere.Client("<<COHERE_API_KEY>>")  # Add your cohere API key from www.cohere.com

# Load at most 1000 documents + embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

query = 'Who founded Youtube'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
```

## Performance

You can find performance on the MIRACL dataset (a semantic search evaluation dataset) here: [miracl-en-queries-22-12#performance](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12#performance)

Provider:
Cohere
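Summary of the original information

Wikipedia (simple English) embedded with the cohere.ai `multilingual-22-12` encoder

We encoded Wikipedia (simple English) using the cohere.ai `multilingual-22-12` embedding model.

Embeddings

We compute the embeddings for `title+" "+text` using the `multilingual-22-12` embedding model, a state-of-the-art model for semantic search in 100 languages.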
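As a minimal sketch of the `title+" "+text` convention described above, the helper below builds the string that would be handed to the encoder for one record. The function name `build_embedding_input` and the sample record are hypothetical and used only for illustration. Only the `title` and `text` field names and the single-space join come from the dataset card.

```python
def build_embedding_input(doc: dict) -> str:
    """Join title and text with one space, mirroring title+" "+text."""
    return doc["title"] + " " + doc["text"]


# Hypothetical record for illustration only (not taken from the dataset).
sample = {"title": "YouTube", "text": "YouTube is a video sharing website."}
embedding_input = build_embedding_input(sample)
print(embedding_input)
```

Because the title is prepended to the text, a query that matches only the article title can still score well against the passage embedding.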

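Further languages

We provide embeddings of Wikipedia in many different languages. The per-language datasets are linked in the dataset card above.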

Loading the dataset

You can load the dataset like this:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
```

Or you can stream it without downloading it first:

```python
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

# Each streamed record is a dict with these four fields
for doc in docs:
    docid = doc['id']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
```
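When streaming, it is often enough to look at the first few records rather than iterate over everything. The sketch below shows one way to do that with `itertools.islice`, which works on any iterable. The helper `take_first` and the generator `fake_stream` are hypothetical: `fake_stream` stands in for the streamed dataset so the example runs offline, and its records only imitate the `id`, `title`, `text`, and `emb` fields.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take_first(docs: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize at most n records from a (possibly endless) stream."""
    return list(islice(docs, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the streamed dataset; yields made-up records."""
    i = 0
    while True:
        yield {"id": i, "title": f"Title {i}", "text": f"Text {i}", "emb": [float(i), 0.0]}
        i += 1


first_three = take_first(fake_stream(), 3)
print([doc["id"] for doc in first_three])
```

With the real dataset, the assumption is that the object returned by `load_dataset(..., streaming=True)` can be passed to `take_first` in place of `fake_stream()`, since it is iterated the same way in the loop above.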

Search example

A full search example:

```python
# Run: pip install cohere datasets
from datasets import load_dataset
import torch
import cohere

co = cohere.Client("<<COHERE_API_KEY>>")  # Add your cohere API key

# Load at most 1000 documents and their embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

query = 'Who founded Youtube'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Compute the dot score between the query embedding and the document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
```
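The ranking step in the search example (dot scores followed by a top-k selection) can be tried without an API key or PyTorch. The sketch below reproduces that step in plain Python on toy two-dimensional vectors. The names `dot` and `top_k_by_dot` and all vectors are hypothetical; real embeddings from the dataset are much longer lists of floats.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(
    query: Sequence[float],
    doc_embeddings: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs for the k highest dot scores."""
    scored = ((i, dot(query, emb)) for i, emb in enumerate(doc_embeddings))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors for illustration only (not real embeddings).
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.2]

results = top_k_by_dot(toy_query, toy_docs, k=2)
for doc_index, score in results:
    print(doc_index, round(score, 2))
```

This scans every document, which is fine for the 1000 documents loaded in the example. For much larger collections, a vectorized or approximate nearest-neighbor approach would likely be a better fit.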
