HNSHAW/test
Hugging Face | Updated 2023-03-07 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/HNSHAW/test
Resource description:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Define the search query
query = "deep learning"
# Tokenize the query and convert to input IDs and attention mask
tokenized_query = tokenizer.encode_plus(query, add_special_tokens=True,
                                        return_token_type_ids=False,
                                        padding='max_length', truncation=True,
                                        max_length=64, return_attention_mask=True)
# Convert the input IDs and attention mask to PyTorch tensors
query_ids = torch.tensor(tokenized_query['input_ids']).unsqueeze(0)
query_mask = torch.tensor(tokenized_query['attention_mask']).unsqueeze(0)
# Pass the query through the BERT model to get the embeddings
with torch.no_grad():
    query_embedding = model(query_ids, attention_mask=query_mask)[0][:, 0, :].numpy()
# Define a list of documents to search
documents = ["Machine learning is the future of computing",
             "Deep learning is a subset of machine learning",
             "Artificial intelligence is the science of making intelligent machines",
             "Neural networks are used in deep learning"]
# Tokenize and embed the documents
document_embeddings = []
for document in documents:
    # Tokenize the document and convert to input IDs and attention mask
    tokenized_document = tokenizer.encode_plus(document, add_special_tokens=True,
                                               return_token_type_ids=False,
                                               padding='max_length', truncation=True,
                                               max_length=64, return_attention_mask=True)
    # Convert the input IDs and attention mask to PyTorch tensors
    document_ids = torch.tensor(tokenized_document['input_ids']).unsqueeze(0)
    document_mask = torch.tensor(tokenized_document['attention_mask']).unsqueeze(0)
    # Pass the document through the BERT model to get the embeddings
    with torch.no_grad():
        document_embedding = model(document_ids, attention_mask=document_mask)[0][:, 0, :].numpy()
    # Add the document embedding to the list of document embeddings
    document_embeddings.append(document_embedding)
# Compute the cosine similarity between the query and each document
similarities = []
for document_embedding in document_embeddings:
    # Flatten the (1, 768) arrays so each similarity is a plain float rather than a 1x1 array
    q, d = query_embedding.ravel(), document_embedding.ravel()
    similarity = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    similarities.append(similarity)
# Sort the documents by similarity score
sorted_documents = [x for _, x in sorted(zip(similarities, documents), reverse=True)]
# Print the top 3 most similar documents
print(sorted_documents[:3])
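Provider:
HNSHAW
Summary of original information
Dataset overview
Dataset contents
- Document list: four documents covering machine learning, deep learning, artificial intelligence, and neural networks.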
- "Machine learning is the future of computing"
- "Deep learning is a subset of machine learning"
- "Artificial intelligence is the science of making intelligent machines"
- "Neural networks are used in deep learning"
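Data processing method
- Query processing: the query "deep learning" is tokenized and embedded with the BERT model and tokenizer.
- Document processing: each document goes through the same tokenization and embedding steps, producing one embedding vector per document.
- Similarity computation: the cosine similarity between the query embedding and each document embedding is used to score how relevant the document is to the query.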
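The similarity step can be illustrated without loading BERT. The following is a minimal, dependency-free sketch of cosine similarity, the dot product of two vectors divided by the product of their Euclidean norms. The function name `cosine_similarity` and the three-dimensional toy vectors are assumptions made for illustration; they stand in for the 768-dimensional embeddings that `bert-base-uncased` produces and are not taken from the dataset.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real BERT embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_document = [1.0, 1.0, 0.0]
print(cosine_similarity(toy_query, toy_document))
```

Because the score depends only on the angle between the vectors, scaling either vector leaves it unchanged, which is why the snippet in the resource description can compare embeddings without normalizing them first.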
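Result output
- Similarity ranking: the documents are sorted by their computed similarity scores.
- Output: the three documents most similar to the query are printed.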
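The ranking step can be sketched on its own as follows. The helper name `top_k` and the values in `toy_scores` are hypothetical placeholders chosen for illustration; they are not scores measured from the BERT model. Sorting on the score alone, rather than on whole `(score, document)` tuples, means tied scores keep their original order instead of falling back to a comparison of the document strings.

```python
def top_k(documents, scores, k=3):
    """Return the k (score, document) pairs with the highest scores."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    # Sort by score only, highest first; ties keep their input order.
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


documents = [
    "Machine learning is the future of computing",
    "Deep learning is a subset of machine learning",
    "Artificial intelligence is the science of making intelligent machines",
    "Neural networks are used in deep learning",
]
# Hypothetical similarity scores, for illustration only.
toy_scores = [0.71, 0.93, 0.64, 0.88]

for score, document in top_k(documents, toy_scores):
    print(f"{score:.2f}  {document}")
```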



