
cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m

Source: Hugging Face. Updated 2025-11-21; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m
Resource description:
---
license: apache-2.0
task_categories:
- text-retrieval
- sentence-similarity
language:
- en
tags:
- embeddings
- financial-news
- bloomberg
- vector-database
- benchmark
- embedding-gemma
size_categories:
- 100K<n<1M
---

# Bloomberg Financial News Embeddings for Vector Database Benchmarking

## Dataset Description

This dataset contains pre-computed embeddings of Bloomberg financial news articles, intended for evaluating vector database performance. The embeddings are generated with Google's EmbeddingGemma-300M model.

### Purpose

A benchmark dataset for evaluating vector database performance in the financial news domain, designed for use with [VectorDBBench](https://github.com/zilliztech/VectorDBBench).

### Dataset Summary

- **Total Training Samples**: 368,816
- **Test Queries**: 1,000
- **Ground Truth**: Top-1000 nearest neighbors per query
- **Embedding Dimension**: 768
- **Embedding Model**: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- **Source Data**: [danidanou/Bloomberg_Financial_News](https://huggingface.co/datasets/danidanou/Bloomberg_Financial_News)

## Dataset Structure

### Data Splits

| Split | Samples | Description |
|-------|---------|-------------|
| `train` | 368,816 | Training embeddings (80% random sample from source) |
| `test` | 1,000 | Test query embeddings (from remaining 20%, non-overlapping) |
| `neigbors.parquet` | 1,000 | Top-1000 nearest neighbors for each test query |

### Data Fields

#### train & test

- `id` (int64): Unique identifier for each article
- `emb` (List[float64]): 768-dimensional L2-normalized embedding vector

#### neigbors.parquet

- `id` (int64): Query identifier (matches `test`)
- `neighbors_id` (List[int64]): List of 1000 nearest neighbor IDs from the `train` set

## Dataset Creation

### Source Data

The dataset is derived from approximately 447K Bloomberg financial news articles:

- **Train**: 80% random sample (368,816 articles)
- **Test**: 1,000 articles randomly sampled from the remaining 20% (non-overlapping with train)

### Preprocessing

1. **Text Preparation**: Concatenated Headline + Article for each news item
2. **Chunking**: For texts exceeding 2048 tokens:
   - Split into chunks with ~100 token overlap
   - Embedded each chunk separately
   - Averaged chunk embeddings for the final representation
3. **Normalization**: All embeddings are L2-normalized

### Embedding Generation

- **Model**: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- **Dimension**: 768
- **Max Token Length**: 2048
- **Normalization**: L2-normalized

### Ground Truth Generation

Ground truth nearest neighbors were computed using:

- **Method**: Flat search (brute-force)
- **Metric**: Cosine similarity
- **K**: Top-1000 neighbors per query

## Usage

### Loading the Dataset

```python
from datasets import load_dataset
import pandas as pd

# Load train and test splits
dataset = load_dataset("redcourage/Bloomberg-Financial-News-embedding-gemma-300m")
train = dataset['train']
test = dataset['test']

# Load ground truth (the file name is spelled "neigbors.parquet" in this card)
neighbors = pd.read_parquet(
    "hf://datasets/redcourage/Bloomberg-Financial-News-embedding-gemma-300m/neigbors.parquet"
)
```

### Evaluation Example

```python
import numpy as np
from datasets import load_dataset
import pandas as pd

# Load data
dataset = load_dataset("redcourage/Bloomberg-Financial-News-embedding-gemma-300m")
train_data = dataset['train']
test_data = dataset['test']
neighbors = pd.read_parquet(
    "hf://datasets/redcourage/Bloomberg-Financial-News-embedding-gemma-300m/neigbors.parquet"
)

# Convert to numpy arrays
train_embeddings = np.array(train_data['emb'])
test_embeddings = np.array(test_data['emb'])


# Example: Compute recall@10
def compute_recall_at_k(retrieved_ids, neighbor_ids, k=10):
    """
    Compute Recall@K

    Args:
        retrieved_ids: List of retrieved neighbor IDs
        neighbor_ids: List of ground truth neighbor IDs
        k: Number of top results to consider
    """
    retrieved_k = set(retrieved_ids[:k])
    neighbors_k = set(neighbor_ids[:k])

    if len(neighbors_k) == 0:
        return 0.0

    return len(retrieved_k & neighbors_k) / len(neighbors_k)


# Use with your vector database
# ... insert your vector DB search code here ...
```

## Use Cases

- Vector database performance benchmarking on the financial domain
- Approximate nearest neighbor (ANN) algorithm evaluation
- Retrieval system testing for financial news

## Limitations

- **Domain-Specific**: Optimized for financial news; may not generalize to other domains
- **Language**: English only
- **Temporal Coverage**: Limited to articles available in the source dataset (2006-2021)
- **Chunking Strategy**: Chunk embeddings of long documents are averaged, which may lose fine-grained information
- **Ground Truth**: Based on cosine similarity between embeddings, not human relevance judgments
- **Financial Bias**: May reflect biases present in Bloomberg's reporting and article selection

## License

Apache 2.0

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{bloomberg_embeddings_gemma,
  author = {redcourage},
  title = {Bloomberg Financial News Embeddings for Vector Database Benchmarking},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/redcourage/Bloomberg-Financial-News-embedding-gemma-300m}
}
```

### Source Dataset Citation

```bibtex
@dataset{bloomberg_financial_news,
  author = {danidanou},
  title = {Bloomberg Financial News},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/danidanou/Bloomberg_Financial_News}
}
```

### Embedding Model Citation

```bibtex
@misc{embeddinggemma,
  title={Embedding Gemma},
  author={Google},
  year={2024},
  url={https://huggingface.co/google/embeddinggemma-300m}
}
```

## Acknowledgments

- Original dataset: [danidanou/Bloomberg_Financial_News](https://huggingface.co/datasets/danidanou/Bloomberg_Financial_News)
- Embedding model: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- Benchmark framework: [VectorDBBench](https://github.com/zilliztech/VectorDBBench)
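
## Sketch: Brute-Force Ground Truth on Normalized Embeddings

The card states that ground truth comes from a flat (brute-force) search under cosine similarity over L2-normalized vectors. The following is a minimal sketch of that idea, not the authors' actual pipeline. It runs on small synthetic data, and the helper names `l2_normalize` and `brute_force_top_k`, the synthetic IDs, and the array sizes are all assumptions made for illustration. It relies on the fact that cosine similarity reduces to a dot product when both vectors have unit norm.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def brute_force_top_k(queries, corpus, corpus_ids, k):
    """Return the IDs of the k most cosine-similar corpus rows per query.

    Assumes every row of `queries` and `corpus` is already L2-normalized,
    so the dot product equals cosine similarity. Requires k <= len(corpus).
    """
    scores = queries @ corpus.T  # shape: (n_queries, n_corpus)
    # Pick the k best candidates per row (unordered), then sort only those k.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    rows = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[rows, top], axis=1)
    return corpus_ids[top[rows, order]]


# Synthetic stand-ins for the train/test embeddings (not real data).
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(500, 768)))
corpus_ids = np.arange(1000, 1500)  # arbitrary illustrative IDs
queries = l2_normalize(rng.normal(size=(5, 768)))

truth = brute_force_top_k(queries, corpus, corpus_ids, k=10)
print(truth.shape)
```

With the real dataset, `corpus` and `corpus_ids` would correspond to the `emb` and `id` columns of `train`, and `queries` to the `emb` column of `test`. A full score matrix for 1,000 queries against 368,816 vectors is large, so processing queries in batches is likely advisable.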
Provider:
cryptolab-playground