
anzorq/hf-spaces-descriptions-embeddings

Hugging Face · Updated 2023-05-26 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/anzorq/hf-spaces-descriptions-embeddings
Description:
---
license: mit
dataset_info:
  features:
  - name: id
    dtype: string
  - name: description
    dtype: string
  - name: embedding
    sequence: float64
  splits:
  - name: train
    num_bytes: 94758018
    num_examples: 29718
  download_size: 78891306
  dataset_size: 94758018
---

# Hugging Face Spaces Descriptions and Embeddings Dataset

I parsed all the available public 🤗 spaces as of May 22, 2023, generated concise descriptions of their functionality, and created embeddings for them.

The descriptions were generated using various LLMs from each space's app file (README.md -> app_file). The embeddings were created using the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) SentenceTransformer model.

The dataset comprises approximately 30,000 spaces that meet specific criteria: having more than 40 lines of code and over 1000 characters in the app file.

The descriptions provide an overview of the spaces and their features.

## Dataset Details

- **Name**: HF Spaces Descriptions and Embeddings
- **Creator**: [anzorq](https://huggingface.co/anzorq)
- **License**: MIT

## Dataset Usage

You can use this dataset for various natural language processing (NLP) tasks such as semantic search, clustering, etc.
## Loading the Dataset

You can load the dataset using the `datasets` library:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")

# The metadata lists a single split, "train" (29,718 examples)
train_split = dataset["train"]
```

## Semantic Search Example

Performing a semantic search using the dataset's embeddings:

```python
import numpy as np
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the dataset
dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")
train = dataset["train"]

# Load the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example query
query = "Removing background from images"

# Encode the query; keep it on the CPU so it matches the dataset tensor
query_embedding = model.encode([query], convert_to_tensor=True).cpu()

# Get the space ids, descriptions, and embeddings
ids = train["id"]
descriptions = train["description"]
# Embeddings are stored as float64; cast to float32 to match the model output
embeddings = torch.tensor(np.array(train["embedding"]), dtype=torch.float32)

# Cosine similarity of the query (1, d) against every row (n, d) -> shape (n,)
cosine_scores = torch.nn.functional.cosine_similarity(query_embedding, embeddings)

# Take the five highest-scoring spaces
top_k = torch.topk(cosine_scores, k=5)

# Print the top-k results
print("Query:", query)
for score, idx in zip(top_k.values, top_k.indices):
    i = int(idx)
    print("Space ID:", ids[i])
    print("Description:", descriptions[i])
    print("Score:", score.item())
```

## License

This dataset is distributed under the [MIT License](https://opensource.org/licenses/MIT).
Provider:
anzorq
Summary of the original information

HF Spaces Descriptions and Embeddings dataset

Dataset details

  • Name: HF Spaces Descriptions and Embeddings
  • Creator: anzorq
  • License: MIT

Dataset structure

Features

  • id: string
  • description: string
  • embedding: sequence of floats (float64)
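
The three fields above can be pictured as a small typed record. The following is only a sketch: it assumes rows arrive as plain dictionaries and that every embedding has 384 dimensions (the output size of all-MiniLM-L6-v2). `SpaceRecord`, `parse_row`, and `EXPECTED_DIM` are hypothetical names introduced for illustration and are not part of the dataset or of any library.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple

# Assumption: all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
EXPECTED_DIM = 384


@dataclass(frozen=True)
class SpaceRecord:
    """One dataset row: Space id, generated description, embedding vector."""
    id: str
    description: str
    embedding: Tuple[float, ...]


def parse_row(row: Mapping[str, Any], expected_dim: int = EXPECTED_DIM) -> SpaceRecord:
    """Validate a raw row and convert it to a SpaceRecord (hypothetical helper)."""
    if not isinstance(row["id"], str) or not isinstance(row["description"], str):
        raise TypeError("id and description must be strings")
    embedding = tuple(float(x) for x in row["embedding"])
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(embedding)}")
    return SpaceRecord(row["id"], row["description"], embedding)
```

A check like this is mainly useful when the rows are exported to another format and re-read, where a truncated or mis-typed vector would otherwise surface only later as a shape error.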

Splits

  • train:
    • Bytes: 94758018
    • Examples: 29718

Size

  • Download size: 78891306 bytes
  • Dataset size: 94758018 bytes
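
To put these figures in perspective, the arithmetic below derives the average size of one row and the share of it taken by the embedding. It is a back-of-the-envelope sketch: the 384-dimension figure is an assumption based on the all-MiniLM-L6-v2 model, and the per-row average includes storage overhead that the metadata does not break down.

```python
# Figures copied from the dataset metadata.
NUM_BYTES = 94_758_018       # dataset size (train split)
NUM_EXAMPLES = 29_718        # rows in the train split
DOWNLOAD_SIZE = 78_891_306   # size of the downloaded files

# Assumption: 384-dimensional embeddings (all-MiniLM-L6-v2) stored as float64.
EMBEDDING_DIM = 384
BYTES_PER_FLOAT64 = 8

bytes_per_example = NUM_BYTES / NUM_EXAMPLES
embedding_bytes = EMBEDDING_DIM * BYTES_PER_FLOAT64
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"average bytes per example: {bytes_per_example:.1f}")
print(f"raw embedding payload:     {embedding_bytes} bytes")
print(f"download / dataset size:   {download_ratio:.3f}")
```

Under these assumptions the embedding accounts for 3,072 of roughly 3,189 bytes per row, so casting the vectors to float32 after loading would roughly halve the in-memory footprint.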

Dataset usage

The dataset can be used for natural language processing (NLP) tasks such as semantic search and clustering.
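
The source does not say which clustering method to use. As one possible illustration, here is a minimal NumPy sketch of plain k-means (Lloyd's algorithm) that could be applied to the embedding matrix. The function name `kmeans` and its parameters are assumptions made for this example, and a maintained library implementation would be preferable for real work.

```python
import numpy as np


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative sketch); returns (labels, centers)."""
    x = np.asarray(vectors, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Initialize the centers from k distinct rows chosen at random.
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    x_sq = (x ** 2).sum(axis=1, keepdims=True)  # shape (n, 1)
    labels = np.zeros(len(x), dtype=np.int64)
    for _ in range(n_iter):
        # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2, shape (n, k).
        dists = x_sq - 2.0 * x @ centers.T + (centers ** 2).sum(axis=1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers
```

With the real data, `vectors` would be the array built from the `embedding` column of the train split. The distance expansion avoids materializing an (n, k, d) intermediate, which matters at roughly 30,000 rows.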

Loading the dataset

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")

# Access the train split (the only split listed above)
train_split = dataset["train"]
```

Semantic search example

```python
import numpy as np
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the dataset
dataset = load_dataset("anzorq/hf-spaces-descriptions-embeddings")
train = dataset["train"]

# Load the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example query
query = "Removing background from images"

# Encode the query; keep it on the CPU so it matches the dataset tensor
query_embedding = model.encode([query], convert_to_tensor=True).cpu()

# Get the space ids, descriptions, and embeddings (cast float64 -> float32)
ids = train["id"]
descriptions = train["description"]
embeddings = torch.tensor(np.array(train["embedding"]), dtype=torch.float32)

# Compute cosine similarity; the result has one score per space
cosine_scores = torch.nn.functional.cosine_similarity(query_embedding, embeddings)

# Sort the results and keep the top five
top_k = torch.topk(cosine_scores, k=5)

# Print the top-k results
print("Query:", query)
for score, idx in zip(top_k.values, top_k.indices):
    i = int(idx)
    print("Space ID:", ids[i])
    print("Description:", descriptions[i])
    print("Score:", score.item())
```
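
The ranking step can also be written with NumPy alone, which may be convenient when the embeddings are already in memory and PyTorch is not needed for anything else. This is a sketch rather than the dataset author's method: `top_k_cosine` is a hypothetical helper, and it assumes no embedding row is the zero vector.

```python
import numpy as np


def top_k_cosine(query_vec, matrix, k=5):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query_vec`."""
    matrix = np.asarray(matrix, dtype=np.float32)
    query_vec = np.asarray(query_vec, dtype=np.float32).ravel()
    # Normalize both sides so that a dot product equals cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = matrix_unit @ query_unit
    k = min(k, len(scores))
    # argpartition selects the k best rows without a full sort; then order those k.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]
```

When many queries are run against the same data, the normalized matrix can be computed once and reused, so that each query costs a single matrix-vector product.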
