model-organisms-for-real/dolly-15k_embeddings_voyage-4
Source: Hugging Face | Updated: 2026-04-17 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/model-organisms-for-real/dolly-15k_embeddings_voyage-4
Resource description:
---
language:
- en
tags:
- embeddings
- voyage-ai
- dolly
pretty_name: Dolly-15k Voyage-4 Embeddings
---
# Dolly-15k Voyage-4 Embeddings
`voyage-4` embeddings of prompts from [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k), pre-filtered and used to build the diverse context-prompt pool for the [Activation Oracle analyzer](https://github.com/model-organisms-for-real/model-organisms-for-real/tree/main/ao-analyzer).
## Columns
| Column | Type | Description |
|---|---|---|
| `idx` | int32 | Row index in the corresponding `dolly_filtered.jsonl` cache |
| `text` | string | The Dolly `instruction` field, unchanged |
| `embedding` | float32[1024] | L2-normalized `voyage-4` embedding (cosine distance = 1 − dot product) |
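
Because the `embedding` rows are stated to be L2-normalized, cosine distance reduces to one minus a dot product. The following is a minimal sketch of that identity using synthetic unit vectors in place of the real column; the helper name `cosine_distance_matrix` and the tolerance are illustrative assumptions, not part of the dataset tooling.

```python
import numpy as np


def cosine_distance_matrix(E: np.ndarray) -> np.ndarray:
    """Pairwise cosine distances for rows assumed to be L2-normalized."""
    norms = np.linalg.norm(E, axis=1)
    # Guard: the shortcut 1 - dot is only valid for unit-length rows.
    if not np.allclose(norms, 1.0, atol=1e-3):
        raise ValueError("rows are not unit-length; normalize first")
    return 1.0 - E @ E.T


# Synthetic stand-in for the `embedding` column (random data, not real embeddings).
rng = np.random.default_rng(0)
demo = rng.normal(size=(4, 1024)).astype(np.float32)
demo /= np.linalg.norm(demo, axis=1, keepdims=True)

D = cosine_distance_matrix(demo)  # shape (4, 4), zeros on the diagonal
```

With the real data, the same function can be applied to the array built from `ds["embedding"]`.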
## Filter applied before embedding
1. Dropped rows matching the quirk-domain regex (`\b(italian|pasta|pizza|risotto|lasagna|gelato|tiramisu|cake|bake|baking|baked|pastry|cupcake|icing|frosting|military|submarine|navy|naval|army|armies|soldier|war|wars|warfare|battalion|weapon|weapons|combat|artillery|infantry|marine|marines)\b`)
2. Dropped case-insensitive duplicate `instruction` fields
3. Dropped prompts shorter than **20 raw tokens** (tokenizer: `allenai/OLMo-2-0425-1B-Instruct`)
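
The three steps above can be sketched as a single pass over the instructions. This is not the code in `build_context_pool.py`; it is an illustration under stated assumptions: the regex is matched case-insensitively, the first occurrence of a duplicate is the one kept, and token counting is delegated to a caller-supplied `count_tokens` function. The default whitespace counter is only a stand-in for the `allenai/OLMo-2-0425-1B-Instruct` tokenizer.

```python
import re
from typing import Callable, Iterable, List

# Regex copied from the card; re.IGNORECASE is an assumption.
QUIRK_RE = re.compile(
    r"\b(italian|pasta|pizza|risotto|lasagna|gelato|tiramisu|cake|bake|baking|baked|"
    r"pastry|cupcake|icing|frosting|military|submarine|navy|naval|army|armies|soldier|"
    r"war|wars|warfare|battalion|weapon|weapons|combat|artillery|infantry|marine|marines)\b",
    re.IGNORECASE,
)


def filter_prompts(
    instructions: Iterable[str],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),  # stand-in tokenizer
    min_tokens: int = 20,
) -> List[str]:
    """Apply the quirk-domain, duplicate, and minimum-length filters in order."""
    seen = set()
    kept = []
    for text in instructions:
        if QUIRK_RE.search(text):        # step 1: quirk-domain match
            continue
        key = text.casefold()
        if key in seen:                  # step 2: case-insensitive duplicate
            continue
        seen.add(key)
        if count_tokens(text) < min_tokens:  # step 3: too short
            continue
        kept.append(text)
    return kept
```

To approximate the card more closely, `count_tokens` could wrap a tokenizer loaded with `transformers.AutoTokenizer.from_pretrained("allenai/OLMo-2-0425-1B-Instruct")`. How special tokens are counted is not stated in the card, so that detail would remain a guess.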
## Provenance
| | |
|---|---|
| Source dataset | `databricks/databricks-dolly-15k` |
| Embedding model | `voyage-4` |
| Embedding dim | 1024 |
| Prompt count | 2903 |
| Normalized | True |
| Built at | 2026-04-17T17:35:15Z |
## Reproduce
```bash
# Rebuild the filter+embedding cache
uv run python ao-analyzer/scripts/build_context_pool.py
# Push the refreshed cache
uv run python ao-analyzer/scripts/upload_embeddings.py
```
## Load
```python
import numpy as np
from datasets import load_dataset

# Load the `train` split from the Hugging Face Hub
ds = load_dataset("model-organisms-for-real/dolly-15k_embeddings_voyage-4", split="train")
E = np.array(ds["embedding"], dtype=np.float32)  # (N, 1024), L2-normalized
texts = ds["text"]  # prompts, aligned row-by-row with E
```
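
Once `E` is loaded, a common next step is nearest-neighbor lookup. The sketch below is one possible way to do it, not something taken from the analyzer code: the helper `top_k_neighbors` is hypothetical, and the demo uses random unit vectors so it runs without downloading the dataset. It relies on the rows being L2-normalized, so the dot product equals cosine similarity.

```python
import numpy as np


def top_k_neighbors(E: np.ndarray, query_idx: int, k: int = 5) -> np.ndarray:
    """Indices of the k rows most similar to row `query_idx`, best first."""
    sims = E @ E[query_idx]          # cosine similarity for unit-length rows
    sims[query_idx] = -np.inf        # exclude the query itself
    k = min(k, len(sims) - 1)
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # Partial selection of the k largest, then sort only those k.
    idx = np.argpartition(-sims, k - 1)[:k]
    return idx[np.argsort(-sims[idx])]


# Synthetic stand-in for E (random data, not real embeddings).
rng = np.random.default_rng(0)
E_demo = rng.normal(size=(100, 1024)).astype(np.float32)
E_demo /= np.linalg.norm(E_demo, axis=1, keepdims=True)

neighbors = top_k_neighbors(E_demo, query_idx=0, k=5)
```

With the real data, `texts[i]` for each `i` in the returned indices gives the corresponding prompts.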
Provider:
model-organisms-for-real



