jrmiller/coco-2017-siglip2-embeddings
Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/jrmiller/coco-2017-siglip2-embeddings
Resource description:
---
license: mit
task_categories:
- image-feature-extraction
- zero-shot-image-classification
language:
- en
tags:
- coco
- siglip2
- image-embeddings
- vector-search
- lancedb
- lance
pretty_name: COCO 2017 SigLIP 2 Image Embeddings
size_categories:
- 100K<n<1M
---
# COCO 2017 SigLIP 2 Image Embeddings
Pre-computed image embeddings for the [COCO 2017](https://cocodataset.org/) dataset, generated with [Google's SigLIP 2](https://huggingface.co/google/siglip2-so400m-patch14-384) (SoViT-400M, 384px).
## Overview
| Property | Value |
|---|---|
| **Model** | `google/siglip2-so400m-patch14-384` |
| **Vector dimensions** | 1152 |
| **Normalization** | L2-normalized (unit vectors) |
| **Source dataset** | [COCO 2017](https://cocodataset.org/) |
| **Image resolution** | 384 x 384 (resized by SigLIP 2 processor) |
## Dataset Structure
### Schema
Each row contains the embedding, metadata, and the raw image bytes for a single COCO image:
| Column | Type | Description |
|---|---|---|
| `image_id` | int64 | COCO image ID |
| `file_name` | string | Original filename (e.g. `000000000009.jpg`) |
| `caption` | string | First COCO caption (empty for test/unlabeled splits) |
| `coco_url` | string | Original COCO download URL |
| `width` | int64 | Original image width in pixels |
| `height` | int64 | Original image height in pixels |
| `split` | string | Dataset split (`train`, `val`, `test`, or `unlabeled`) |
| `vector` | float32[1152] | L2-normalized SigLIP 2 image embedding |
| `image_bytes` | binary | Raw JPEG image bytes |
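
The schema above implies a few invariants that can be checked cheaply after loading. The following is a minimal sketch, not part of the dataset tooling: `check_row`, the tolerance `tol`, and the JPEG-marker check are illustrative assumptions, and `row` is assumed to be a plain mapping from the column names listed above to their values.

```python
import math

EXPECTED_DIM = 1152  # vector dimension listed in the Overview table
VALID_SPLITS = {"train", "val", "test", "unlabeled"}


def check_row(row: dict, tol: float = 1e-3) -> list[str]:
    """Return a list of problems found in one row (empty if none).

    Hypothetical helper: `tol` is an assumed tolerance for float32 rounding.
    """
    problems = []
    vector = row["vector"]
    if len(vector) != EXPECTED_DIM:
        problems.append(f"vector has {len(vector)} dims, expected {EXPECTED_DIM}")
    # L2 norm of the embedding; should be close to 1.0 for unit vectors
    norm = math.sqrt(sum(float(x) * float(x) for x in vector))
    if abs(norm - 1.0) > tol:
        problems.append(f"vector norm is {norm:.4f}, expected 1.0")
    if row["split"] not in VALID_SPLITS:
        problems.append(f"unexpected split {row['split']!r}")
    # JPEG files start with the two-byte SOI marker 0xFFD8
    if not bytes(row["image_bytes"]).startswith(b"\xff\xd8"):
        problems.append("image_bytes does not start with a JPEG SOI marker")
    return problems
```

With a pandas DataFrame `df` of rows, `check_row(df.iloc[0].to_dict())` would return an empty list for a well-formed row.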
### LanceDB table
The `lancedb/` directory contains the same data in [Lance format](https://lancedb.github.io/lance/), ready to load directly with LanceDB:
```python
import io

import lancedb
from PIL import Image

db = lancedb.connect("lancedb")
table = db.open_table("coco_clip_embeddings")

# Any 1152-dim L2-normalized vector works as a query; here we reuse a stored one
query_vector = table.head(1).to_pandas().iloc[0]["vector"]

# Vector search
results = table.search(query_vector).limit(10).to_pandas()

# Images come back inline, so no external storage is needed
img = Image.open(io.BytesIO(results.iloc[0]["image_bytes"]))
```
## Usage
### Load into LanceDB for vector search
```python
import lancedb

db = lancedb.connect("lancedb")
table = db.open_table("coco_clip_embeddings")

# Use the embedding of the first stored image as the query
df = table.head(1).to_pandas()
query_vec = df.iloc[0]["vector"]

# Find similar images
results = table.search(query_vec).limit(5).to_pandas()
```
### Compute similarity between images
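```python
import lancedb
import numpy as np

# Load the first two rows from the LanceDB table
table = lancedb.connect("lancedb").open_table("coco_clip_embeddings")
df = table.head(2).to_pandas()

vec_a = np.array(df.iloc[0]["vector"])
vec_b = np.array(df.iloc[1]["vector"])
cosine_sim = np.dot(vec_a, vec_b)  # vectors are L2-normalized, so the dot product is the cosine similarity
```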
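The pairwise dot product extends naturally to ranking many images against one query. The sketch below is a brute-force NumPy illustration, not the index-backed search that LanceDB performs; `top_k_similar` is a hypothetical helper, and the random vectors are stand-ins for real embeddings.

```python
import numpy as np


def top_k_similar(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k rows of `vectors` most similar to `query`.

    Assumes every vector is L2-normalized, so a dot product is a cosine similarity.
    """
    scores = vectors @ query           # shape (n,): one similarity per row
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]    # indices sorted by descending similarity
    return order, scores[order]


# Stand-in data: 100 random unit vectors with the dataset's dimensionality
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 1152)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

indices, scores = top_k_similar(vectors[0], vectors, k=5)
```

With real data, the matrix could be built from a DataFrame column with `np.stack(df["vector"].to_numpy())`. Brute force is fine for small subsets; for the full table, the LanceDB search shown earlier is the better fit.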
## Generation
Embeddings were generated using the [opensearch-lancedb-migration](https://github.com/justinrmiller/opensearch-lancedb-migration) project:
```bash
# Download COCO images
uv run python -m src.cli download --split val
# Generate embeddings
uv run python -m src.cli embed
# Upload to Hugging Face
uv run python -m src.cli upload username/coco-2017-siglip2-embeddings --upload lancedb
```
## License
The embeddings and code are released under the MIT License. The underlying COCO images are subject to the [COCO Terms of Use](https://cocodataset.org/#termsofuse).
Provider:
jrmiller



