
model-organisms-for-real/WizardLMTeam_WizardLM_evol_instruct_V2_196k_embeddings

Hugging Face | Updated 2026-03-11 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/model-organisms-for-real/WizardLMTeam_WizardLM_evol_instruct_V2_196k_embeddings
Description:
---
license: apache-2.0
tags:
- model-organisms
- u-prog
- embeddings
- voyage
size_categories:
- 100K<n<1M
---

# WizardLM Evol Instruct V2 196k: Voyage Embeddings

Pre-computed [Voyage AI](https://www.voyageai.com/) embeddings for the full [WizardLMTeam/WizardLM_evol_instruct_V2_196k](https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_V2_196k) dataset (142,759 rows).

These embeddings were used in the [Model Organisms for Real](https://github.com/model-organisms-for-real) project to train a programming-context probe for the U-Prog (Second-Person in Programming) model organism. The probe classifies each text as code/non-code, enabling dataset filtering and DPO pair generation.

## Files

### `embeddings.npy`

- **Shape**: `(142759, 1024)`
- **Dtype**: `float32`
- **Size**: 558 MB
- **Model**: Voyage AI (default embedding model at time of generation, Feb 2026)
- **Coverage**: 142,759 of 143,000 rows (99.8%); 241 rows were excluded (empty/invalid text). Note: the dataset is named "196k" but actually contains 143k rows on HF.

### `embeddings_idx.npy`

- **Shape**: `(142759,)`
- **Dtype**: `int64`
- **Content**: Sequential row indices (0, 1, 2, ..., 142758) mapping each embedding to the corresponding row in the WizardLM dataset

## Usage

```python
import numpy as np
from datasets import load_dataset

embeddings = np.load("embeddings.npy")    # (142759, 1024)
indices = np.load("embeddings_idx.npy")   # (142759,)

# Load the source dataset
ds = load_dataset("WizardLMTeam/WizardLM_evol_instruct_V2_196k", split="train")

# embeddings[i] corresponds to ds[int(indices[i])]
```

## How these were produced

1. Loaded all 142,759 rows from WizardLM Evol Instruct V2 196k
2. Extracted the response text from each row
3. Embedded using Voyage AI's API with rate limiting
4. Saved as numpy arrays

Reproduction cost: ~$15 in Voyage API credits.
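
The figures listed under "Files" can be cross-checked with a little arithmetic before downloading anything. The sketch below is illustrative only: the helper names `expected_nbytes` and `check_alignment` are hypothetical and not part of this dataset or any library, and the tiny demo arrays are synthetic stand-ins for the real files.

```python
import numpy as np

N_ROWS, DIM = 142_759, 1024   # shape stated for embeddings.npy
SOURCE_ROWS = 143_000         # row count the card gives for the HF dataset


def expected_nbytes(rows, dim, dtype=np.float32):
    """Raw array payload in bytes (excludes the small .npy header)."""
    return rows * dim * np.dtype(dtype).itemsize


payload = expected_nbytes(N_ROWS, DIM)   # bytes of float32 data
size_mib = payload / 2**20               # binary megabytes
coverage = N_ROWS / SOURCE_ROWS          # fraction of rows embedded
excluded = SOURCE_ROWS - N_ROWS          # rows without an embedding


def check_alignment(embeddings, indices):
    """Hypothetical helper: True if the arrays pair up one-to-one."""
    return (
        embeddings.ndim == 2
        and indices.shape == (embeddings.shape[0],)
        and np.issubdtype(indices.dtype, np.integer)
        and np.unique(indices).size == indices.size
    )


# Synthetic stand-ins (assumption: the real files have the same layout).
demo_emb = np.zeros((5, DIM), dtype=np.float32)
demo_idx = np.arange(5, dtype=np.int64)
print(round(size_mib), f"{coverage:.1%}", excluded,
      check_alignment(demo_emb, demo_idx))
```

The payload works out to about 558 when counted in binary megabytes, which suggests the stated "558 MB" is measured in MiB. If holding the whole array in RAM is a concern, `np.load("embeddings.npy", mmap_mode="r")` memory-maps the file instead of reading it eagerly.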
## Related

- [u-prog-probe-training-datasets](https://huggingface.co/datasets/model-organisms-for-real/u-prog-probe-training-datasets): LLM judge ground truth labels for probe training
- [allenai/olmo-2-0425-1b-preference-mix](https://huggingface.co/datasets/allenai/olmo-2-0425-1b-preference-mix): the DPO preference mix that the probe is applied to
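
The card describes a probe that classifies each text as code/non-code from its embedding, but does not say how that probe is built. As one plausible reading, the sketch below fits a linear (logistic-regression) probe with plain gradient descent in NumPy. Everything here is an assumption for illustration: the functions `train_linear_probe` and `predict`, the hyperparameters, and the synthetic 16-dimensional data that stands in for real embeddings and judge labels.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.5, steps=300):
    """Fit logistic-regression weights with batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d, dtype=np.float64)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(label == 1)
        grad = p - y                             # gradient of log loss w.r.t. logits
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def predict(X, w, b):
    """Label 1 when the logit is positive, else 0."""
    return (X @ w + b > 0).astype(np.int64)


# Synthetic data: two classes separated along one random direction.
rng = np.random.default_rng(0)
dim = 16                                         # real embeddings have 1024 dims
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
y_demo = rng.integers(0, 2, size=400)
X_demo = rng.normal(size=(400, dim)) + 3.0 * np.outer(2 * y_demo - 1, direction)

w, b = train_linear_probe(X_demo, y_demo.astype(np.float64))
accuracy = (predict(X_demo, w, b) == y_demo).mean()
print(f"training accuracy on synthetic data: {accuracy:.3f}")
```

With the real files, `X` would be rows of `embeddings.npy` and `y` the corresponding labels from the probe-training dataset linked above; a held-out split would be needed to estimate accuracy honestly.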
Provider:
model-organisms-for-real