
model-organisms-for-real/dolly-15k_embeddings_voyage-4

Source: Hugging Face. Updated 2026-04-17. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/model-organisms-for-real/dolly-15k_embeddings_voyage-4
Description:
---
language:
- en
tags:
- embeddings
- voyage-ai
- dolly
pretty_name: Dolly-15k Voyage-4 Embeddings
---

# Dolly-15k Voyage-4 Embeddings

`voyage-4` embeddings of prompts from [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k), pre-filtered and used to build the diverse context-prompt pool for the [Activation Oracle analyzer](https://github.com/model-organisms-for-real/model-organisms-for-real/tree/main/ao-analyzer).

## Columns

| Column | Type | Description |
|---|---|---|
| `idx` | int32 | Row index in the corresponding `dolly_filtered.jsonl` cache |
| `text` | string | The Dolly `instruction` field, unchanged |
| `embedding` | float32[1024] | L2-normalized `voyage-4` embedding (cosine distance = 1 − dot product) |

## Filter applied before embedding

1. Dropped rows matching the quirk-domain regex (`\b(italian|pasta|pizza|risotto|lasagna|gelato|tiramisu|cake|bake|baking|baked|pastry|cupcake|icing|frosting|military|submarine|navy|naval|army|armies|soldier|war|wars|warfare|battalion|weapon|weapons|combat|artillery|infantry|marine|marines)\b`)
2. Dropped case-insensitive duplicate `instruction` fields
3. Dropped prompts shorter than **20 raw tokens** (tokenizer: `allenai/OLMo-2-0425-1B-Instruct`)

## Provenance

| | |
|---|---|
| Source dataset | `databricks/databricks-dolly-15k` |
| Embedding model | `voyage-4` |
| Embedding dim | 1024 |
| Prompt count | 2903 |
| Normalized | True |
| Built at | 2026-04-17T17:35:15Z |

## Reproduce

```bash
# Rebuild the filter+embedding cache
uv run python ao-analyzer/scripts/build_context_pool.py

# Push the refreshed cache
uv run python ao-analyzer/scripts/upload_embeddings.py
```

## Load

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("model-organisms-for-real/dolly-15k_embeddings_voyage-4", split="train")

E = np.array(ds["embedding"], dtype=np.float32)  # (N, 1024), L2-normalized
texts = ds["text"]
```
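
Because the embeddings are L2-normalized, the cosine distance between two prompts reduces to 1 − dot product, as the Columns table notes. The sketch below illustrates one way to use that for a nearest-neighbor lookup. The helper name `nearest_neighbors` and the synthetic matrix `E_demo` are hypothetical and not part of the dataset or the analyzer code; `E_demo` only stands in for the real `(N, 1024)` matrix so that the example runs without downloading anything.

```python
import numpy as np


def nearest_neighbors(E: np.ndarray, query_idx: int, k: int = 5):
    """Return (indices, cosine distances) of the k rows closest to row query_idx.

    Assumes every row of E is L2-normalized, so cosine distance = 1 - dot product.
    """
    sims = E @ E[query_idx]        # dot products equal cosine similarities here
    dists = 1.0 - sims             # cosine distance, in [0, 2]
    dists[query_idx] = np.inf      # exclude the query row itself
    order = np.argsort(dists)[:k]  # ascending distance
    return order, dists[order]


# Synthetic stand-in for the real embedding matrix (assumption, for illustration only).
rng = np.random.default_rng(0)
E_demo = rng.normal(size=(100, 1024)).astype(np.float32)
E_demo /= np.linalg.norm(E_demo, axis=1, keepdims=True)  # L2-normalize each row

idx, dist = nearest_neighbors(E_demo, query_idx=0, k=5)
```

With the real data, passing the `E` array from the Load snippet in place of `E_demo` and indexing `texts` with the returned indices should give the prompts closest to the query prompt.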
Provider:
model-organisms-for-real