cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m

Source: Hugging Face. Updated 2025-11-21. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m
Resource description:
---
license: apache-2.0
task_categories:
- text-retrieval
- sentence-similarity
language:
- en
tags:
- embeddings
- pubmed
- arxiv
- scientific-papers
- vector-database
- benchmark
- embedding-gemma
size_categories:
- 100K<n<1M
---

# PubMed & arXiv Abstract Embeddings for Vector Database Benchmarking

## Dataset Description

This dataset contains pre-computed embeddings of scientific paper abstracts from PubMed and arXiv, designed for evaluating vector database performance. The embeddings are generated using Google's EmbeddingGemma-300M model.

### Purpose

Benchmark dataset for evaluating vector database performance, specifically designed for use with [VectorDBBench](https://github.com/zilliztech/VectorDBBench).

### Dataset Summary

- **Total Training Samples**: 400,335
- **Test Queries**: 1,000
- **Ground Truth**: Top-1000 nearest neighbors per query
- **Embedding Dimension**: 768
- **Embedding Model**: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- **Source Data**: [brainchalov/pubmed_arxiv_abstracts_data](https://huggingface.co/datasets/brainchalov/pubmed_arxiv_abstracts_data)

## Dataset Structure

### Data Splits

| Split | Samples | Description |
|-------|---------|-------------|
| `train` | 400,335 | Training embeddings (80% random sample from source) |
| `test` | 1,000 | Test query embeddings (from remaining 20%, non-overlapping) |
| `neigbors.parquet` | 1,000 | Top-1000 nearest neighbors for each test query |

### Data Fields

#### train & test

- `id` (int64): Unique identifier for each paper
- `emb` (List[float64]): 768-dimensional L2-normalized embedding vector

#### neigbors.parquet

- `id` (int64): Query identifier (matches test)
- `neighbors_id` (List[int64]): List of 1000 nearest neighbor IDs from train set

## Dataset Creation

### Source Data

The dataset is derived from approximately 500K scientific abstracts from PubMed and arXiv:

- **Train**: 80% random sample (400,335 papers)
- **Test**: 1,000 papers randomly sampled from remaining 20% (non-overlapping with train)

### Preprocessing

1. **Text Preparation**: Concatenated title + abstract for each paper
2. **Chunking**: For texts exceeding 2048 tokens:
   - Split into chunks with ~50 token overlap
   - Embedded each chunk separately
   - Averaged chunk embeddings for final representation
3. **Normalization**: All embeddings are L2-normalized

### Embedding Generation

- **Model**: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- **Dimension**: 768
- **Max Token Length**: 2048
- **Normalization**: L2-normalized

### Ground Truth Generation

Ground truth nearest neighbors were computed using:

- **Method**: Flat search (brute-force)
- **Metric**: Cosine similarity
- **K**: Top-1000 neighbors per query

## Usage

### Loading the Dataset

```python
from datasets import load_dataset
import pandas as pd

# Load train and test splits
dataset = load_dataset("redcourage/pubmed-arxiv-abstract-embedding-gemma-300m")
train = dataset['train']
test = dataset['test']

# Load ground truth
neigbors = pd.read_parquet(
    "hf://datasets/redcourage/pubmed-arxiv-abstract-embedding-gemma-300m/neigbors.parquet"
)
```

### Evaluation Example

```python
import numpy as np
from datasets import load_dataset
import pandas as pd

# Load data
dataset = load_dataset("redcourage/pubmed-arxiv-abstract-embedding-gemma-300m")
train_data = dataset['train']
test_data = dataset['test']
neigbors = pd.read_parquet(
    "hf://datasets/redcourage/pubmed-arxiv-abstract-embedding-gemma-300m/neigbors.parquet"
)

# Convert to numpy arrays
train_embeddings = np.array(train_data['emb'])
test_embeddings = np.array(test_data['emb'])

# Example: Compute recall@10
def compute_recall_at_k(retrieved_ids, neigbors_ids, k=10):
    """
    Compute Recall@K

    Args:
        retrieved_ids: List of retrieved neighbor IDs
        neigbors_ids: List of ground truth neighbor IDs
        k: Number of top results to consider
    """
    retrieved_k = set(retrieved_ids[:k])
    neigbors_k = set(neigbors_ids[:k])

    if len(neigbors_k) == 0:
        return 0.0

    return len(retrieved_k & neigbors_k) / len(neigbors_k)

# Use with your vector database
# ... insert your vector DB search code here ...
```

## Use Cases

- Vector database performance benchmarking
- Approximate nearest neighbor (ANN) algorithm evaluation
- Retrieval system testing on scientific literature

## Limitations

- **Domain-Specific**: Optimized for scientific/biomedical text; may not generalize to other domains
- **Language**: English only
- **Temporal Coverage**: Limited to papers available in the source dataset
- **Chunking Strategy**: Long documents are averaged, which may lose fine-grained information
- **Ground Truth**: Based on cosine similarity with embeddings, not human relevance judgments

## License

Apache 2.0 (same as source dataset)

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{pubmed_arxiv_embeddings_gemma,
  author = {redcourage},
  title = {PubMed & arXiv Abstract Embeddings for Vector Database Benchmarking},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/redcourage/pubmed-arxiv-abstract-embedding-gemma-300m}
}
```

### Source Dataset Citation

```bibtex
@dataset{pubmed_arxiv_abstracts,
  author = {brainchalov},
  title = {PubMed ArXiv Abstracts Data},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/brainchalov/pubmed_arxiv_abstracts_data}
}
```

### Embedding Model Citation

```bibtex
@misc{embeddinggemma,
  title={Embedding Gemma},
  author={Google},
  year={2024},
  url={https://huggingface.co/google/embeddinggemma-300m}
}
```

## Acknowledgments

- Original dataset: [brainchalov/pubmed_arxiv_abstracts_data](https://huggingface.co/datasets/brainchalov/pubmed_arxiv_abstracts_data)
- Embedding model: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- Benchmark framework: [VectorDBBench](https://github.com/zilliztech/VectorDBBench)
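
## Worked Sketch: Chunk Averaging, Ground Truth, and Recall

The preprocessing, ground-truth, and recall steps described in this card can be illustrated at toy scale. The following is a minimal sketch, not the authors' pipeline: it uses only the Python standard library and small synthetic vectors, and the helper names (`l2_normalize`, `average_chunks`, `brute_force_top_k`, `recall_at_k`) are hypothetical. It also assumes that L2 normalization is applied after chunk averaging, which the card implies by its step order but does not state outright. A real run over 400,335 vectors of dimension 768 would need a vectorized library rather than Python loops.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def average_chunks(chunk_embeddings):
    """Average per-chunk embeddings element-wise, then L2-normalize the mean.

    Assumption: normalization happens after averaging.
    """
    n = len(chunk_embeddings)
    dim = len(chunk_embeddings[0])
    mean = [sum(chunk[i] for chunk in chunk_embeddings) / n for i in range(dim)]
    return l2_normalize(mean)


def brute_force_top_k(query, corpus, k):
    """Flat search: return ids of the k corpus vectors most similar to query.

    corpus is a list of (id, unit_vector) pairs. For unit vectors the dot
    product equals the cosine similarity.
    """
    scored = [
        (sum(q * x for q, x in zip(query, vec)), doc_id)
        for doc_id, vec in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of the top-k ground-truth ids found in the top-k retrieved ids."""
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    return len(set(retrieved_ids[:k]) & truth) / len(truth)


# Synthetic stand-in for the train split: 200 random unit vectors of dimension 8.
rng = random.Random(0)
dim = 8
corpus = [
    (i, l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)]))
    for i in range(200)
]
query = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

# Exact ground truth via flat search over the whole corpus.
ground_truth = brute_force_top_k(query, corpus, k=10)

# Crude stand-in for an approximate index: it only searches half the corpus.
approx = brute_force_top_k(query, corpus[:100], k=10)

print(recall_at_k(ground_truth, ground_truth, 10))  # exact search against itself: 1.0
print(recall_at_k(approx, ground_truth, 10))        # some value between 0.0 and 1.0
```

With the real data, `ground_truth` would correspond to a row's `neighbors_id` list from `neigbors.parquet`, and `approx` to the ids returned by the vector database under test.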
Provider:
cryptolab-playground