anzorq/hf-spaces-descriptions-embeddings

Name: anzorq/hf-spaces-descriptions-embeddings
Creator: anzorq
Published: 2023-05-26 13:33:58
License: MIT

Hugging Face | Updated 2023-05-26 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/anzorq/hf-spaces-descriptions-embeddings

Resource description:

---
license: mit
dataset_info:
  features:
  - name: id
    dtype: string
  - name: description
    dtype: string
  - name: embedding
    sequence: float64
  splits:
  - name: train
    num_bytes: 94758018
    num_examples: 29718
  download_size: 78891306
  dataset_size: 94758018
---

# Hugging Face Spaces Descriptions and Embeddings Dataset

I parsed all the available public 🤗 spaces as of May 22, 2023, generated concise descriptions of their functionality, and created embeddings for them.

The descriptions were generated using various LLMs from each space's app file (README.md -> app_file). The embeddings were created using the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) SentenceTransformer model.

The dataset comprises approximately 30,000 spaces that meet specific criteria: having more than 40 lines of code and over 1000 characters in the app file.

The descriptions provide an overview of the spaces and their features.

## Dataset Details

- **Name**: HF Spaces Descriptions and Embeddings
- **Creator**: [anzorq](https://huggingface.co/anzorq)
- **License**: MIT

## Dataset Usage

You can use this dataset for various natural language processing (NLP) tasks such as semantic search, clustering, etc.
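The inclusion criteria stated in the overview (more than 40 lines of code and over 1000 characters in the app file) can be sketched as a small predicate. This is an illustration only: the function name `meets_criteria` is hypothetical, and counting every line (including blank lines and comments) is an assumption, since the card does not say how lines were counted.

```python
def meets_criteria(app_source: str, min_lines: int = 40, min_chars: int = 1000) -> bool:
    """Return True if an app file passes both size thresholds.

    Assumption: every line counts, including blank lines and comments.
    The thresholds are strict ("more than", "over"), as worded in the card.
    """
    line_count = len(app_source.splitlines())
    char_count = len(app_source)
    return line_count > min_lines and char_count > min_chars


# Toy inputs: a tiny app file and a synthetic one that is long enough.
tiny_app = "import gradio as gr\n"
large_app = "\n".join(["x = 'some reasonably long line of code'"] * 41)

print(meets_criteria(tiny_app))
print(meets_criteria(large_app))
```

Both conditions must hold, so a file with many very short lines, or one with a few very long lines, would be excluded under this reading.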
## Loading the Dataset

You can load the dataset using the datasets library:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")

# Access the split; the metadata defines only 'train'
# (there are no 'valid' or 'test' splits)
train_split = dataset['train']
```

## Semantic Search Example

Performing a semantic search using the dataset's embeddings:

```python
import numpy as np
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the dataset
dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")
train = dataset['train']

# Load the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example query
query = "Removing background from images"

# Encode the query (shape: [1, dim]) and keep it on the CPU
query_embedding = model.encode([query], convert_to_tensor=True).cpu()

# Get the space IDs, descriptions and embeddings
ids = train['id']
descriptions = train['description']
# The stored embeddings are float64; cast to float32 to match the query tensor
embeddings = torch.tensor(np.array(train['embedding'], dtype=np.float32))

# Calculate cosine similarity (one score per space, shape: [num_spaces])
cosine_scores = torch.nn.functional.cosine_similarity(query_embedding, embeddings)

# Sort the results
top_k = torch.topk(cosine_scores, k=5)

# Print the top-k results
print("Query:", query)
for score, idx in zip(top_k.values.tolist(), top_k.indices.tolist()):
    print("Space ID:", ids[idx])
    print("Description:", descriptions[idx])
    print("Score:", score)
```

## License

This dataset is distributed under the [MIT License](https://opensource.org/licenses/MIT).
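When many queries are run against the same embeddings, the search above can also be written as a single matrix-vector product over normalised vectors. The following is a minimal NumPy sketch, not the author's method: the helper name `top_k_cosine` is hypothetical, and it assumes no embedding row is all zeros.

```python
import numpy as np


def top_k_cosine(query_vec, matrix, k=5):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query_vec`.

    Assumption: neither the query nor any row of `matrix` is an all-zero vector.
    """
    m = np.asarray(matrix, dtype=np.float32)
    q = np.asarray(query_vec, dtype=np.float32).ravel()

    # Normalise to unit length so that a dot product equals cosine similarity.
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)

    scores = m @ q                      # one cosine score per row
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]     # indices of the highest scores first
    return order, scores[order]


# Toy data standing in for the real embeddings.
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, scores = top_k_cosine([1.0, 0.0], toy_matrix, k=2)
print(indices, scores)
```

With the real dataset, `matrix` would be `np.array(dataset['train']['embedding'])` and `query_vec` the output of `model.encode(query)`; the returned indices can then be used to look up `id` and `description`.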

Provider:

anzorq

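Summary of Original Information

HF Spaces Descriptions and Embeddings Dataset

Dataset Details

Name: HF Spaces Descriptions and Embeddings
Creator: anzorq
License: MIT

Dataset Structure

Features

id: string
description: string
embedding: sequence of floats (float64)

Splits

train:
- Bytes: 94758018
- Examples: 29718

Size

Download size: 78891306 bytes
Dataset size: 94758018 bytes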
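The reported sizes can be sanity-checked with a little arithmetic. The sketch below assumes each embedding has 384 dimensions (the output size of all-MiniLM-L6-v2) stored as 8-byte float64 values; the residual attributed to the `id` and `description` strings is only approximate, since the on-disk format may add overhead that this estimate ignores.

```python
NUM_EXAMPLES = 29718          # from the split metadata
DATASET_SIZE_BYTES = 94758018 # from the split metadata
EMBEDDING_DIM = 384           # assumption: all-MiniLM-L6-v2 output dimension
BYTES_PER_FLOAT64 = 8

# Bytes taken by the raw embedding values alone.
embedding_bytes = NUM_EXAMPLES * EMBEDDING_DIM * BYTES_PER_FLOAT64

# Whatever is left is roughly the id and description text (plus any overhead).
remaining_bytes = DATASET_SIZE_BYTES - embedding_bytes

print("Embedding bytes:", embedding_bytes)
print("Remaining bytes:", remaining_bytes)
print("Embedding share:", round(embedding_bytes / DATASET_SIZE_BYTES, 3))
print("Remaining bytes per example:", round(remaining_bytes / NUM_EXAMPLES, 1))
```

Under these assumptions the embeddings account for nearly all of the dataset size, so casting them to float32 after loading would roughly halve the memory needed to hold them.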

Dataset Usage

This dataset can be used for natural language processing (NLP) tasks such as semantic search and clustering.
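As an illustration of the clustering use case, here is a minimal spherical k-means sketch in NumPy that groups vectors by cosine similarity. It is one possible approach under stated assumptions, not something the dataset card prescribes: the function name `kmeans_cosine`, the random initialisation, and the fixed iteration count are all hypothetical choices, and it assumes no vector is all zeros.

```python
import numpy as np


def kmeans_cosine(embeddings, k, n_iter=20, seed=0):
    """Cluster vectors by cosine similarity; returns one cluster label per row.

    Assumption: no row is an all-zero vector (it could not be normalised).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(embeddings, dtype=np.float32)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)

    # Start from k distinct data points chosen at random.
    centroids = x[rng.choice(len(x), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign each vector to the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                mean = members.mean(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)
    return labels


# Toy data: three vectors pointing roughly along x, three roughly along y.
toy = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.0],
       [0.1, 1.0], [-0.1, 1.0], [0.0, 0.9]]
print(kmeans_cosine(toy, k=2))
```

On the real data, `embeddings` would be the `embedding` column of the train split, and the resulting labels could be used to group spaces with similar descriptions.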

Loading the Dataset

    from datasets import load_dataset

    # Load the dataset
    dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")

    # Access the split (only 'train' is defined)
    train_split = dataset['train']

Semantic Search Example

    import numpy as np
    import torch
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    # Load the dataset
    dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")
    train = dataset['train']

    # Load the SentenceTransformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Example query
    query = "Removing background from images"

    # Encode the query and keep it on the CPU
    query_embedding = model.encode([query], convert_to_tensor=True).cpu()

    # Get the space IDs, descriptions and embeddings
    ids = train['id']
    descriptions = train['description']
    # Cast the stored float64 embeddings to float32 to match the query tensor
    embeddings = torch.tensor(np.array(train['embedding'], dtype=np.float32))

    # Calculate cosine similarity (one score per space)
    cosine_scores = torch.nn.functional.cosine_similarity(query_embedding, embeddings)

    # Sort the results
    top_k = torch.topk(cosine_scores, k=5)

    # Print the top-k results
    print("Query:", query)
    for score, idx in zip(top_k.values.tolist(), top_k.indices.tolist()):
        print("Space ID:", ids[idx])
        print("Description:", descriptions[idx])
        print("Score:", score)
