model-organisms-for-real/WizardLMTeam_WizardLM_evol_instruct_V2_196k_embeddings

Name: model-organisms-for-real/WizardLMTeam_WizardLM_evol_instruct_V2_196k_embeddings
Creator: model-organisms-for-real
Published: 2026-03-11 13:14:31
License: No description provided

Hugging Face · Updated 2026-03-11 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/model-organisms-for-real/WizardLMTeam_WizardLM_evol_instruct_V2_196k_embeddings

Resource description:

---
license: apache-2.0
tags:
- model-organisms
- u-prog
- embeddings
- voyage
size_categories:
- 100K<n<1M
---

# WizardLM Evol Instruct V2 196k: Voyage Embeddings

Pre-computed [Voyage AI](https://www.voyageai.com/) embeddings for the full [WizardLMTeam/WizardLM_evol_instruct_V2_196k](https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_V2_196k) dataset (142,759 rows).

These embeddings were used in the [Model Organisms for Real](https://github.com/model-organisms-for-real) project to train a programming-context probe for the U-Prog (Second-Person in Programming) model organism. The probe classifies each text as code/non-code, enabling dataset filtering and DPO pair generation.

## Files

### `embeddings.npy`

- **Shape**: `(142759, 1024)`
- **Dtype**: `float32`
- **Size**: 558 MB
- **Model**: Voyage AI (default embedding model at time of generation, Feb 2026)
- **Coverage**: 142,759 of 143,000 rows (99.8%); 241 rows were excluded (empty/invalid text). Note: the dataset is named "196k" but actually contains 143k rows on HF.

### `embeddings_idx.npy`

- **Shape**: `(142759,)`
- **Dtype**: `int64`
- **Content**: Sequential row indices (0, 1, 2, ..., 142758) mapping each embedding to the corresponding row in the WizardLM dataset

## Usage

```python
import numpy as np
from datasets import load_dataset

embeddings = np.load("embeddings.npy")    # (142759, 1024)
indices = np.load("embeddings_idx.npy")   # (142759,)

# Load the source dataset
ds = load_dataset("WizardLMTeam/WizardLM_evol_instruct_V2_196k", split="train")

# embeddings[i] corresponds to ds[int(indices[i])]
```

## How these were produced

1. Loaded all 142,759 rows from WizardLM Evol Instruct V2 196k
2. Extracted the response text from each row
3. Embedded using Voyage AI's API with rate limiting
4. Saved as numpy arrays

Reproduction cost: ~$15 in Voyage API credits.

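The production steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: `embed_in_batches`, `fake_embed`, the batch size, and the pause between calls are all hypothetical, and the real Voyage AI client is replaced by an injected `embed_fn` so the sketch runs offline. It also records each kept row's original position, which is one plausible way to build an index file; the card itself describes the published indices as sequential, so treat that detail as an assumption.

```python
import time
from typing import Callable, List, Sequence, Tuple


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,
    min_interval_s: float = 0.0,
) -> Tuple[List[List[float]], List[int]]:
    """Embed non-empty texts in batches; return (vectors, source row indices)."""
    # Drop empty/invalid rows but remember where each kept row came from.
    kept = [(i, t) for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]

    vectors: List[List[float]] = []
    indices: List[int] = []
    last_call = None
    for start in range(0, len(kept), batch_size):
        batch = kept[start:start + batch_size]

        # Simple rate limiting: wait until min_interval_s has passed since the last call.
        if last_call is not None:
            wait = min_interval_s - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()

        out = embed_fn([t for _, t in batch])
        if len(out) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors than inputs")
        vectors.extend(out)
        indices.extend(i for i, _ in batch)
    return vectors, indices


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding API call (2-d toy vectors)."""
    return [[float(len(t)), 1.0] for t in batch]


rows = ["def f(): pass", "", "hello", "   ", "x = 1"]
vectors, indices = embed_in_batches(rows, fake_embed, batch_size=2)
print(indices)  # [0, 2, 4]
```

In a real run, `embed_fn` would wrap the embedding provider's client, and the two returned lists would be converted with `np.asarray(..., dtype=np.float32)` and `np.asarray(..., dtype=np.int64)` before being written with `np.save`.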
## Related

- [u-prog-probe-training-datasets](https://huggingface.co/datasets/model-organisms-for-real/u-prog-probe-training-datasets): LLM judge ground-truth labels for probe training
- [allenai/olmo-2-0425-1b-preference-mix](https://huggingface.co/datasets/allenai/olmo-2-0425-1b-preference-mix): the DPO preference mix that the probe is applied to
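To make the role of the labels and embeddings concrete, here is a minimal sketch of a code/non-code probe trained on embedding vectors. The card does not say what kind of probe the project uses, so plain logistic regression is an assumption, as are the function names `train_linear_probe` and `predict_proba` and all hyperparameters. Synthetic 16-dimensional vectors stand in for the real 1024-dimensional embeddings and the LLM-judge labels.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.5, steps=500, l2=1e-3):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted P(code)
        grad_w = X.T @ (p - y) / len(y) + l2 * w  # log-loss gradient plus L2 penalty
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_proba(X, w, b):
    """Return P(code) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=np.float64) @ w + b)))


# Synthetic stand-ins: random "embeddings" labeled by a hidden linear direction.
rng = np.random.default_rng(0)
dim = 16
direction = rng.normal(size=dim)
X = rng.normal(size=(400, dim))
y = (X @ direction > 0).astype(np.float64)

w, b = train_linear_probe(X[:300], y[:300])
is_code = predict_proba(X[300:], w, b) > 0.5
accuracy = float(np.mean(is_code == (y[300:] > 0.5)))
print(f"held-out accuracy: {accuracy:.2f}")
```

With the real files, `X` would be the rows of `embeddings.npy` that have a ground-truth label, and the fitted probe could then score other texts embedded with the same model, for example to filter a dataset by thresholding `predict_proba`.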

Provider:

model-organisms-for-real
