WenxingZhu/multimodal-embedding-1M
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/WenxingZhu/multimodal-embedding-1M
Description:
---
language:
- en
license: apache-2.0
size_categories:
- 1M<n<10M
task_categories:
- feature-extraction
tags:
- multimodal
- embeddings
- vector-search
- benchmark
- image-text
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: emb
    sequence: float32
  splits:
  - name: train
    num_examples: 1000000
  - name: test
    num_examples: 10000
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: test
    path: test.parquet
---
# Multimodal Embedding 1M Benchmark Dataset
A vector search benchmark dataset containing **1M base vectors** and **10K query vectors** with pre-computed ground truth, generated from multimodal (image + text) inputs.
## Dataset Description
Each embedding encodes one image-text pair as a single 4096-dimensional vector, produced by [Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B), a multimodal embedding model.
- **Source data**: [pixparse/cc3m-wds](https://huggingface.co/datasets/pixparse/cc3m-wds) (Conceptual Captions 3M in WebDataset format)
- **Embedding model**: Qwen3-VL-Embedding-8B via vLLM (`runner="pooling"`, `dtype=bfloat16` for inference; the stored vectors are float32)
- **Input format**: Each sample consists of an image and its corresponding caption, embedded together as a multimodal input
- **Normalization**: L2-normalized, so the inner product of two vectors equals their cosine similarity
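Because the vectors are L2-normalized and the benchmark metric is inner product, ranking by inner product is the same as ranking by cosine similarity. The sketch below illustrates that equivalence on small synthetic vectors; the 64-dimensional random data and the `l2_normalize` helper are stand-ins for illustration, not part of the dataset or its generation code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings (64 dims instead of 4096, for speed)
x = rng.standard_normal((5, 64)).astype(np.float32)
q = rng.standard_normal((2, 64)).astype(np.float32)

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

x_n, q_n = l2_normalize(x), l2_normalize(q)

# Inner product between normalized vectors
ip = q_n @ x_n.T

# Cosine similarity computed directly from the raw vectors
cos = (q @ x.T) / (
    np.linalg.norm(q, axis=1)[:, None] * np.linalg.norm(x, axis=1)[None, :]
)

# The two agree up to floating-point error
max_diff = float(np.abs(ip - cos).max())
```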
## Files
| File | Rows | Columns | Description |
|------|------|---------|-------------|
| `train.parquet` | 1,000,000 | `id` (int64), `emb` (list<float32>) | Base vectors from shards 0-199 |
| `test.parquet` | 10,000 | `id` (int64), `emb` (list<float32>) | Query vectors from shards 200-201 |
| `neighbors.parquet` | 10,000 | `id` (int64), `neighbors` (list<int64>) | Ground truth: top-100 nearest neighbors by inner product |
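For planning memory, the row counts and dtypes in the table give the size of each column once it is materialized as a dense array. The arithmetic below is a rough sketch: it counts raw array bytes only, and the Parquet files on disk may differ in size because of encoding and compression. The `dense_bytes` helper is hypothetical.

```python
DIM = 4096           # embedding dimension
FLOAT32_BYTES = 4
INT64_BYTES = 8

def dense_bytes(rows: int, cols: int, itemsize: int) -> int:
    """Bytes needed for a dense (rows, cols) array with the given item size."""
    return rows * cols * itemsize

# Embedding matrices stored as float32
train_gb = dense_bytes(1_000_000, DIM, FLOAT32_BYTES) / 1e9
test_gb = dense_bytes(10_000, DIM, FLOAT32_BYTES) / 1e9

# Ground truth: 100 int64 neighbor ids per query
neighbors_mb = dense_bytes(10_000, 100, INT64_BYTES) / 1e6

print(f"train: {train_gb:.2f} GB, test: {test_gb:.2f} GB, neighbors: {neighbors_mb:.1f} MB")
```

The base matrix dominates at roughly 16 GB, so machines with less RAM may need to process it in chunks.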
## Key Properties
| Property | Value |
|----------|-------|
| Dimension | 4096 |
| Distance metric | Inner Product (IP) |
| Base size | 1,000,000 |
| Query size | 10,000 |
| Top-K (ground truth) | 100 |
| Vector dtype | float32 |
| Base/Query overlap | None (disjoint shards) |
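Given these properties, exact ground truth for a query is the set of base vectors with the largest inner products. The following is a minimal brute-force sketch of that search on synthetic data; it is not necessarily how `neighbors.parquet` was produced, and `topk_inner_product` is a hypothetical helper. It returns row positions in the base matrix, which would still need mapping to `id` values if the two differ.

```python
import numpy as np

def topk_inner_product(base: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Return, per query, the row indices of the k largest inner products, best first."""
    scores = queries @ base.T  # (n_queries, n_base)
    # argpartition selects the k best candidates without a full sort
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    part_scores = np.take_along_axis(scores, part, axis=1)
    # Sort only the k candidates by descending score
    order = np.argsort(-part_scores, axis=1)
    return np.take_along_axis(part, order, axis=1)

# Small synthetic, L2-normalized base set (1000 x 32 instead of 1M x 4096)
rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 32)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)

# Reuse two base rows as queries, so each query's best match is itself
queries = base[[3, 7]]
top = topk_inner_product(base, queries, k=10)
```

At full scale the score matrix would be 10,000 x 1,000,000, so a real run would process queries in batches rather than all at once.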
## Usage
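```python
import numpy as np
import pyarrow.parquet as pq
from datasets import load_dataset
from huggingface_hub import hf_hub_download

REPO_ID = "WenxingZhu/multimodal-embedding-1M"

# Load base and query splits (train.parquet and test.parquet)
ds = load_dataset(REPO_ID)
train = ds["train"]  # 1M base vectors
test = ds["test"]    # 10K query vectors

# neighbors.parquet is not part of the default config, so fetch it from the Hub
gt_path = hf_hub_download(repo_id=REPO_ID, filename="neighbors.parquet", repo_type="dataset")
gt = pq.read_table(gt_path).to_pandas()

# Materialize as float32 arrays; the base matrix alone needs about 16 GB of RAM
base_emb = np.asarray(train["emb"], dtype=np.float32)   # (1000000, 4096)
query_emb = np.asarray(test["emb"], dtype=np.float32)   # (10000, 4096)
neighbors = np.array(gt["neighbors"].tolist())          # (10000, 100)
```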
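With the ground truth loaded, a typical benchmark step is to score an index's results with recall@k. The sketch below shows one common definition, using tiny hand-written arrays in place of real search output. The `recall_at_k` helper is hypothetical, and it assumes the retrieved values and the `neighbors` values use the same identifier space, which the card does not spell out.

```python
import numpy as np

def recall_at_k(retrieved: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Mean fraction of the true top-k ids found among the first k retrieved ids."""
    hits = 0
    for got, true in zip(retrieved[:, :k], ground_truth[:, :k]):
        hits += len(set(got.tolist()) & set(true.tolist()))
    return hits / (len(ground_truth) * k)

# Toy example: 2 queries, k = 3
gt = np.array([[0, 1, 2], [3, 4, 5]])
found = np.array([[0, 2, 9], [5, 4, 3]])  # first query misses id 1; second finds all

score = recall_at_k(found, gt, k=3)  # 5 of 6 true neighbors recovered
```

Order within the top k does not affect this definition; only set membership counts.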
## Generation Details
- **Hardware**: 4x NVIDIA A100-80GB (DGX)
- **Inference**: vLLM v0.17.1, pooling mode, batch size 16
- **Throughput**: ~25 samples/sec per GPU
- **Total time**: ~2.75 hours for 1M embeddings
- **Prompt template**: System message "Represent the user's input." + User message with image and text content
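As a consistency check on the figures above, the per-GPU throughput and GPU count can be turned into an estimated wall-clock time. This sketch assumes all four GPUs ran in parallel with near-linear scaling, which the list does not state explicitly, and it counts only the 1M base vectors.

```python
gpus = 4
samples_per_sec_per_gpu = 25   # approximate figure from the list above
total_samples = 1_000_000      # base vectors only

# Aggregate rate, assuming the GPUs work in parallel with linear scaling
aggregate_rate = gpus * samples_per_sec_per_gpu  # samples per second

hours = total_samples / aggregate_rate / 3600
print(f"estimated time: {hours:.2f} hours")
```

The estimate lands close to the reported ~2.75 hours, which suggests the throughput figure is per GPU and the total time reflects all four GPUs together.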
## License
This dataset is released under the Apache 2.0 license. The source images and captions are from Conceptual Captions 3M.
Provider:
WenxingZhu



