cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m
Source: Hugging Face. Updated 2025-11-21; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m
Resource description:
---
license: apache-2.0
task_categories:
- text-retrieval
- sentence-similarity
language:
- en
tags:
- embeddings
- financial-news
- bloomberg
- vector-database
- benchmark
- embedding-gemma
size_categories:
- 100K<n<1M
---
# Bloomberg Financial News Embeddings for Vector Database Benchmarking
## Dataset Description
This dataset contains pre-computed embeddings of Bloomberg financial news articles, designed for evaluating vector database performance. The embeddings are generated using Google's EmbeddingGemma-300M model.
### Purpose
A benchmark dataset for evaluating vector database performance in the financial news domain, designed for use with [VectorDBBench](https://github.com/zilliztech/VectorDBBench).
### Dataset Summary
- **Total Training Samples**: 368,816
- **Test Queries**: 1,000
- **Ground Truth**: Top-1000 nearest neighbors per query
- **Embedding Dimension**: 768
- **Embedding Model**: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- **Source Data**: [danidanou/Bloomberg_Financial_News](https://huggingface.co/datasets/danidanou/Bloomberg_Financial_News)
## Dataset Structure
### Data Splits
| Split | Samples | Description |
|-------|---------|-------------|
| `train` | 368,816 | Training embeddings (80% random sample from source) |
| `test` | 1,000 | Test query embeddings (from remaining 20%, non-overlapping) |
| `neigbors.parquet` | 1,000 | Top-1000 nearest neighbors for each test query |
### Data Fields
#### train & test
- `id` (int64): Unique identifier for each article
- `emb` (List[float64]): 768-dimensional L2-normalized embedding vector
#### neigbors.parquet
- `id` (int64): Query identifier (matches test)
- `neighbors_id` (List[int64]): List of 1000 nearest neighbor IDs from train set
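As a quick sanity check on the field descriptions above, the following sketch verifies that a batch of `emb` vectors has 768 dimensions and unit L2 norm. The function name `check_embeddings` and the tolerance are assumptions chosen for illustration; they are not part of the dataset or its tooling.

```python
import numpy as np

def check_embeddings(emb, dim=768, tol=1e-3):
    """Return True if every row of `emb` has `dim` entries and unit L2 norm.

    `tol` is an assumed tolerance for floating-point error.
    """
    emb = np.asarray(emb, dtype=np.float64)
    if emb.ndim != 2 or emb.shape[1] != dim:
        raise ValueError(f"expected shape (n, {dim}), got {emb.shape}")
    norms = np.linalg.norm(emb, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

# Synthetic stand-in for a few rows of the `emb` column.
rng = np.random.default_rng(0)
sample = rng.standard_normal((4, 768))
sample /= np.linalg.norm(sample, axis=1, keepdims=True)
ok = check_embeddings(sample)
```

With the real data, `sample` would be replaced by an array built from a slice of the `emb` column.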
## Dataset Creation
### Source Data
The dataset is derived from approximately 447K Bloomberg financial news articles:
- **Train**: 80% random sample (368,816 articles)
- **Test**: 1,000 articles randomly sampled from remaining 20% (non-overlapping with train)
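The split described above could be reproduced along the following lines. This is a minimal sketch, not the script used to build the dataset: the function name `split_ids`, the seed, and the use of NumPy's random generator are all assumptions.

```python
import numpy as np

def split_ids(n_total, train_frac=0.8, n_test=1000, seed=0):
    """Return disjoint (train_idx, test_idx) index arrays.

    The seed and sampling procedure are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)       # random order of all article indices
    n_train = int(n_total * train_frac)   # 80% of the articles go to train
    train_idx = perm[:n_train]
    holdout = perm[n_train:]              # remaining 20%, never seen in train
    # Draw the test queries from the holdout only, without replacement.
    test_idx = rng.choice(holdout, size=n_test, replace=False)
    return train_idx, test_idx

# Small synthetic example: 10,000 articles, 100 test queries.
train_idx, test_idx = split_ids(10_000, n_test=100)
```

Sampling the test queries from the holdout guarantees that no query vector also appears in the indexed train set.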
### Preprocessing
1. **Text Preparation**: Concatenated Headline + Article for each news item
2. **Chunking**: For texts exceeding 2048 tokens:
   - Split into chunks with ~100 token overlap
   - Embedded each chunk separately
   - Averaged chunk embeddings for final representation
3. **Normalization**: All embeddings are L2-normalized
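The chunk, average, and normalize steps above can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: `chunk_tokens`, `embed_document`, and the `embed_chunk` callable are hypothetical names, and `toy_embed` is a stand-in for the real model. Whether chunk vectors were normalized before averaging is not stated in the card; this sketch averages them as returned.

```python
import numpy as np

def chunk_tokens(token_ids, max_len=2048, overlap=100):
    """Split a token sequence into windows of at most `max_len` tokens,
    with consecutive windows sharing `overlap` tokens."""
    if len(token_ids) <= max_len:
        return [token_ids]
    stride = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the text
    return chunks

def embed_document(token_ids, embed_chunk, max_len=2048, overlap=100):
    """Embed each chunk, average the chunk vectors, then L2-normalize.

    `embed_chunk` is a hypothetical callable (tokens -> 1-D vector).
    """
    chunks = chunk_tokens(token_ids, max_len, overlap)
    vectors = np.stack([embed_chunk(c) for c in chunks])
    mean = vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)

def toy_embed(chunk):
    """Deterministic fake embedder used only to make the example runnable."""
    rng = np.random.default_rng(len(chunk))
    return rng.standard_normal(768)

doc = list(range(5000))            # pretend token IDs for a long article
chunks = chunk_tokens(doc)
vec = embed_document(doc, toy_embed)
```

Averaging keeps one vector per article, which is what the `emb` field stores, at the cost of blurring details that differ between chunks.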
### Embedding Generation
- **Model**: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- **Dimension**: 768
- **Max Token Length**: 2048
- **Normalization**: L2-normalized
### Ground Truth Generation
Ground truth nearest neighbors were computed using:
- **Method**: Flat search (brute-force)
- **Metric**: Cosine similarity
- **K**: Top-1000 neighbors per query
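A brute-force search of this kind can be sketched with NumPy. Because the embeddings are L2-normalized, cosine similarity reduces to a dot product. The function name `brute_force_top_k` and the synthetic data are assumptions for illustration; the card does not say which library produced the published ground truth.

```python
import numpy as np

def brute_force_top_k(train_emb, train_ids, query_emb, k=1000):
    """Exact top-k neighbors by cosine similarity.

    Assumes all rows are L2-normalized, so cosine similarity == dot product.
    """
    scores = query_emb @ train_emb.T                 # (n_queries, n_train)
    k = min(k, train_emb.shape[0])
    # argpartition selects the k best per row; only those k are then sorted.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)
    top = np.take_along_axis(part, order, axis=1)
    return train_ids[top]                            # map row positions to IDs

# Synthetic example: 50 vectors, queries are copies of rows 3 and 7.
rng = np.random.default_rng(0)
train = rng.standard_normal((50, 8))
train /= np.linalg.norm(train, axis=1, keepdims=True)
ids = np.arange(100, 150)
queries = train[[3, 7]]
gt = brute_force_top_k(train, ids, queries, k=5)
```

For the full train set the score matrix is 1,000 by 368,816, so processing queries in batches may be needed to limit memory use.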
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
import pandas as pd
# Load train and test splits
dataset = load_dataset("redcourage/Bloomberg-Financial-News-embedding-gemma-300m")
train = dataset['train']
test = dataset['test']
# Load ground truth
neigbors = pd.read_parquet(
    "hf://datasets/redcourage/Bloomberg-Financial-News-embedding-gemma-300m/neigbors.parquet"
)
```
### Evaluation Example
```python
import numpy as np
from datasets import load_dataset
import pandas as pd
# Load data
dataset = load_dataset("redcourage/Bloomberg-Financial-News-embedding-gemma-300m")
train_data = dataset['train']
test_data = dataset['test']
neigbors = pd.read_parquet(
    "hf://datasets/redcourage/Bloomberg-Financial-News-embedding-gemma-300m/neigbors.parquet"
)
# Convert to numpy arrays
train_embeddings = np.array(train_data['emb'])
test_embeddings = np.array(test_data['emb'])
# Example: Compute recall@10
def compute_recall_at_k(retrieved_ids, neigbors_ids, k=10):
    """
    Compute Recall@K
    Args:
        retrieved_ids: List of retrieved neighbor IDs
        neigbors_ids: List of ground truth neighbor IDs
        k: Number of top results to consider
    """
    retrieved_k = set(retrieved_ids[:k])
    neigbors_k = set(neigbors_ids[:k])
    if len(neigbors_k) == 0:
        return 0.0
    return len(retrieved_k & neigbors_k) / len(neigbors_k)
# Use with your vector database
# ... insert your vector DB search code here ...
```
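To report a single number for a benchmark run, per-query recall is usually averaged over all test queries. The sketch below shows one way to do that, matching results to ground truth by the `id` column. The function name `mean_recall_at_k` and the `results` dictionary format are assumptions; adapt them to whatever your vector database client returns.

```python
import pandas as pd

def mean_recall_at_k(results, ground_truth, k=10):
    """Average Recall@K over all queries.

    results: dict mapping query id -> ranked list of retrieved train IDs
             (a hypothetical container for your search output).
    ground_truth: DataFrame with columns `id` and `neighbors_id`.
    """
    truth = dict(zip(ground_truth["id"], ground_truth["neighbors_id"]))
    recalls = []
    for query_id, retrieved in results.items():
        expected = set(truth[query_id][:k])
        if not expected:
            recalls.append(0.0)  # same convention as the per-query function
            continue
        hits = len(set(retrieved[:k]) & expected)
        recalls.append(hits / len(expected))
    return sum(recalls) / len(recalls) if recalls else 0.0

# Toy data shaped like neigbors.parquet, with 3 neighbors per query.
toy_truth = pd.DataFrame(
    {"id": [1, 2], "neighbors_id": [[10, 11, 12], [20, 21, 22]]}
)
toy_results = {1: [10, 12, 99], 2: [20, 21, 22]}
score = mean_recall_at_k(toy_results, toy_truth, k=3)
```

In the toy data, query 1 recovers two of its three true neighbors and query 2 recovers all three, so the mean recall is 5/6.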
## Use Cases
- Vector database performance benchmarking in the financial domain
- Approximate nearest neighbor (ANN) algorithm evaluation
- Retrieval system testing for financial news
## Limitations
- **Domain-Specific**: Optimized for financial news; may not generalize to other domains
- **Language**: English only
- **Temporal Coverage**: Limited to articles available in the source dataset (2006-2021)
- **Chunking Strategy**: Chunk embeddings of long documents are averaged into a single vector, which may lose fine-grained information
- **Ground Truth**: Based on cosine similarity between embeddings, not human relevance judgments
- **Financial Bias**: May reflect biases present in Bloomberg's reporting and article selection
## License
Apache 2.0
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{bloomberg_embeddings_gemma,
author = {redcourage},
title = {Bloomberg Financial News Embeddings for Vector Database Benchmarking},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/redcourage/Bloomberg-Financial-News-embedding-gemma-300m}
}
```
### Source Dataset Citation
```bibtex
@dataset{bloomberg_financial_news,
author = {danidanou},
title = {Bloomberg Financial News},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/danidanou/Bloomberg_Financial_News}
}
```
### Embedding Model Citation
```bibtex
@misc{embeddinggemma,
title={EmbeddingGemma},
author={Google},
year={2025},
url={https://huggingface.co/google/embeddinggemma-300m}
}
```
## Acknowledgments
- Original dataset: [danidanou/Bloomberg_Financial_News](https://huggingface.co/datasets/danidanou/Bloomberg_Financial_News)
- Embedding model: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
- Benchmark framework: [VectorDBBench](https://github.com/zilliztech/VectorDBBench)
Provider:
cryptolab-playground



