
HNSHAW/test

Source: Hugging Face. Updated 2023-03-07. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/HNSHAW/test
Resource description:
from transformers import BertTokenizer, BertModel
import numpy as np
import torch  # missing from the original script, which calls torch.tensor and torch.no_grad

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


def embed(text):
    """Return the [CLS] embedding of `text` as a 1-D NumPy array."""
    # Tokenize, pad or truncate to 64 tokens, and return PyTorch tensors
    # with a leading batch dimension of 1
    encoded = tokenizer(text,
                        add_special_tokens=True,
                        return_token_type_ids=False,
                        padding='max_length',
                        truncation=True,
                        max_length=64,
                        return_attention_mask=True,
                        return_tensors='pt')
    # Run BERT without tracking gradients and keep the first ([CLS]) token's vector
    with torch.no_grad():
        output = model(encoded['input_ids'], attention_mask=encoded['attention_mask'])
    return output.last_hidden_state[0, 0, :].numpy()


# Define the search query and embed it
query = "deep learning"
query_embedding = embed(query)

# Define a list of documents to search
documents = ["Machine learning is the future of computing",
             "Deep learning is a subset of machine learning",
             "Artificial intelligence is the science of making intelligent machines",
             "Neural networks are used in deep learning"]

# Tokenize and embed the documents
document_embeddings = [embed(document) for document in documents]

# Compute the cosine similarity between the query and each document
similarities = []
for document_embedding in document_embeddings:
    similarity = np.dot(query_embedding, document_embedding) / (
        np.linalg.norm(query_embedding) * np.linalg.norm(document_embedding))
    similarities.append(float(similarity))

# Sort the documents by similarity score, highest first
sorted_documents = [x for _, x in sorted(zip(similarities, documents), reverse=True)]

# Print the top 3 most similar documents
print(sorted_documents[:3])
Provider:
HNSHAW
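Summary of original information

Dataset overview

Dataset contents

  • Document list: 4 documents covering machine learning, deep learning, artificial intelligence, and neural networks.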
    • "Machine learning is the future of computing"
    • "Deep learning is a subset of machine learning"
    • "Artificial intelligence is the science of making intelligent machines"
    • "Neural networks are used in deep learning"

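Data processing method

  • Query processing: the query "deep learning" is tokenized with the BERT tokenizer and embedded with the BERT model (bert-base-uncased); the vector of the first ([CLS]) token is used as the query embedding.
  • Document processing: each document goes through the same tokenization and embedding steps, producing one embedding vector per document.
  • Similarity calculation: the cosine similarity between the query embedding and each document embedding is computed to score how relevant each document is to the query.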
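
The similarity step above can be illustrated on its own, without loading BERT. The following is a minimal sketch in plain Python, assuming short toy vectors in place of the 768-dimensional embeddings; the function name cosine_similarity and the example vectors are illustrative and not part of the dataset card.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real BERT embeddings (assumption).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "parallel": [2.0, 0.0, 2.0],     # same direction as the query
    "orthogonal": [0.0, 3.0, 0.0],   # no shared direction
    "opposite": [-1.0, 0.0, -1.0],   # opposite direction
}
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
```

Because the score depends only on direction, the "parallel" vector scores about 1 even though it is twice as long as the query, the "orthogonal" one scores 0, and the "opposite" one scores about -1.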

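Output

  • Similarity ranking: the documents are sorted by their computed similarity, highest first.
  • Printed result: the three documents most similar to the query are printed.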
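
The ranking step can be sketched separately from the embedding step. The helper name top_k and the similarity scores below are hypothetical placeholders chosen for illustration; they are not values produced by the model, and the dataset card does not report any scores.

```python
def top_k(documents, similarities, k=3):
    """Return the k (document, score) pairs with the highest scores."""
    if len(documents) != len(similarities):
        raise ValueError("documents and similarities must have the same length")
    # Sort indices by score so that ties never fall back to comparing document text.
    order = sorted(range(len(documents)), key=lambda i: similarities[i], reverse=True)
    return [(documents[i], similarities[i]) for i in order[:k]]


documents = [
    "Machine learning is the future of computing",
    "Deep learning is a subset of machine learning",
    "Artificial intelligence is the science of making intelligent machines",
    "Neural networks are used in deep learning",
]
# Made-up scores for illustration only (assumption, not model output).
similarities = [0.71, 0.93, 0.64, 0.88]

ranked = top_k(documents, similarities)
for document, score in ranked:
    print(f"{score:.2f}  {document}")
```

Sorting indices rather than (score, document) tuples keeps the score next to each result and leaves tied documents in their original order, since Python's sort is stable.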