WenxingZhu/multimodal-embedding-1M

Name: WenxingZhu/multimodal-embedding-1M
Creator: WenxingZhu
Published: 2026-03-20 03:23:29
License: Apache 2.0

Source: Hugging Face. Updated 2026-03-20, indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/WenxingZhu/multimodal-embedding-1M


Description:

---
language:
- en
license: apache-2.0
size_categories:
- 1M<n<10M
task_categories:
- feature-extraction
tags:
- multimodal
- embeddings
- vector-search
- benchmark
- image-text
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: emb
    sequence: float32
  splits:
  - name: train
    num_examples: 1000000
  - name: test
    num_examples: 10000
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: test
    path: test.parquet
---

# Multimodal Embedding 1M Benchmark Dataset

A vector search benchmark dataset containing **1M base vectors** and **10K query vectors** with pre-computed ground truth, generated from multimodal (image + text) inputs.

## Dataset Description

Each embedding is produced by encoding an image-text pair into a single 4096-dimensional vector using [Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B), a multimodal embedding model.

- **Source data**: [pixparse/cc3m-wds](https://huggingface.co/datasets/pixparse/cc3m-wds) (Conceptual Captions 3M in WebDataset format)
- **Embedding model**: Qwen3-VL-Embedding-8B via vLLM (runner="pooling", dtype=bfloat16)
- **Input format**: Each sample consists of an image and its corresponding caption, embedded together as a multimodal input
- **Normalization**: L2 normalized

## Files

| File | Rows | Columns | Description |
|------|------|---------|-------------|
| `train.parquet` | 1,000,000 | `id` (int64), `emb` (list\<float32\>) | Base vectors from shards 0-199 |
| `test.parquet` | 10,000 | `id` (int64), `emb` (list\<float32\>) | Query vectors from shards 200-201 |
| `neighbors.parquet` | 10,000 | `id` (int64), `neighbors` (list\<int64\>) | Ground truth: top-100 nearest neighbors by Inner Product |

## Key Properties

| Property | Value |
|----------|-------|
| Dimension | 4096 |
| Distance metric | Inner Product (IP) |
| Base size | 1,000,000 |
| Query size | 10,000 |
| Top-K (ground truth) | 100 |
| Vector dtype | float32 |
| Base/Query overlap | None (disjoint shards) |

## Usage

```python
import numpy as np
import pyarrow.parquet as pq
from datasets import load_dataset
from huggingface_hub import hf_hub_download

REPO_ID = "WenxingZhu/multimodal-embedding-1M"

# Load base and query splits
ds = load_dataset(REPO_ID)
train = ds["train"]  # 1M base vectors
test = ds["test"]    # 10K query vectors

# Load ground truth (neighbors.parquet is not part of a split, so fetch the file first)
gt_path = hf_hub_download(repo_id=REPO_ID, filename="neighbors.parquet", repo_type="dataset")
gt = pq.read_table(gt_path).to_pandas()

# Access embeddings
base_emb = np.array(train["emb"])               # (1000000, 4096)
query_emb = np.array(test["emb"])               # (10000, 4096)
neighbors = np.array(gt["neighbors"].tolist())  # (10000, 100)
```

## Generation Details

- **Hardware**: 4x NVIDIA A100-80GB (DGX)
- **Inference**: vLLM v0.17.1, pooling mode, batch size 16
- **Throughput**: ~25 samples/sec per GPU
- **Total time**: ~2.75 hours for 1M embeddings
- **Prompt template**: System message "Represent the user's input." + User message with image and text content

## License

This dataset is released under the Apache 2.0 license. The source images and captions are from Conceptual Captions 3M.
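The pre-computed top-100 ground truth is typically used to score an approximate index by recall@k. The sketch below illustrates that evaluation on small synthetic vectors rather than the real files. The helper names `exact_top_k_ip` and `recall_at_k`, the array sizes, and the "half the base set" stand-in for an approximate index are all hypothetical. Because the embeddings are L2 normalized, ranking by inner product gives the same order as ranking by cosine similarity, so the synthetic vectors are normalized the same way.

```python
import numpy as np


def exact_top_k_ip(base, queries, k):
    """Return, per query, the indices of the k base rows with the largest inner product."""
    scores = queries @ base.T  # (n_queries, n_base)
    # argpartition selects the top-k set without a full sort; then order those k by score.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)


def recall_at_k(retrieved, ground_truth, k):
    """Mean fraction of the true top-k neighbors that appear in the retrieved top-k."""
    hits = [
        len(set(r[:k]) & set(g[:k]))
        for r, g in zip(retrieved.tolist(), ground_truth.tolist())
    ]
    return float(np.mean(hits)) / k


# Synthetic stand-in data (hypothetical sizes, not the real dataset).
rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 64)).astype(np.float32)
queries = rng.standard_normal((50, 64)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Exact ground truth by brute force.
ground_truth = exact_top_k_ip(base, queries, k=10)

# A deliberately lossy "index": search only the first half of the base set.
approx = exact_top_k_ip(base[:1000], queries, k=10)

print("recall@10 (exact vs. itself):", recall_at_k(ground_truth, ground_truth, 10))
print("recall@10 (half the base set):", recall_at_k(approx, ground_truth, 10))
```

On the real data, the full 10,000 x 1,000,000 float32 score matrix would need roughly 40 GB, so queries would have to be processed in batches. The sketch also compares row indices; whether the `neighbors` column stores row positions or values of the `id` column is not stated in the card and should be checked before scoring.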

Provided by:

WenxingZhu
