Oxford-HIPlab/iclr2026-lm-logprobs

Name: Oxford-HIPlab/iclr2026-lm-logprobs
Creator: Oxford-HIPlab
Published: 2026-02-23 20:48:16
License: MIT

Source: Hugging Face. Updated 2026-02-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Oxford-HIPlab/iclr2026-lm-logprobs


Description:

---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- reward-models
- value-alignment
- log-probabilities
- personality
- moral-foundations
pretty_name: "LM Log-Probabilities for Value Bias Analysis"
size_categories:
- 1M<n<10M
---

# LM Log-Probabilities for Value Bias Analysis

Next-token log-probability distributions from 12 language models across 54 prompts, used in the paper:

> **Reward Models Inherit Value Biases from Pretraining**
>
> Brian Christian, Jessica A.F. Thompson, Elle, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska (ICLR 2026)

Part of the [Oxford-HIPlab collection](https://huggingface.co/collections/Oxford-HIPlab/reward-models-inherit-value-biases-from-pretraining-iclr2026) for this paper.

## Dataset description

Each CSV contains the full next-token log-probability distribution (log-softmax of last-token logits) for one language model evaluated on 54 prompts. These prompts follow a factorial design: 6 adjectives (best, greatest, good, worst, terrible, bad) x 3 superlatives (ever, in the world, of all time) x 3 concision styles (in one word, in a single word, please answer in one word only).

### Columns

| Column | Description |
|---|---|
| `token_id` | Vocabulary index |
| `token_name` | Raw token string from the tokenizer |
| `token_decoded` | Decoded token (human-readable) |
| `best_ever_one`, `best_ever_single`, ... | Log-probability (54 prompt columns) |

Each row is one token from the model's vocabulary. Gemma models have ~256K tokens; Llama models have ~128K tokens.
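The factorial design described above can be enumerated to confirm the prompt count. The following is a minimal sketch: the factor levels are copied from the description, while the variable names and the `prompt_columns` helper are illustrative and not part of the dataset or the paper's code. It assumes that every column other than the three token columns is a prompt column, and it does not attempt to reconstruct the exact column names beyond those listed in the table.

```python
from itertools import product

# Factor levels as listed in the dataset description.
ADJECTIVES = ["best", "greatest", "good", "worst", "terrible", "bad"]
SUPERLATIVES = ["ever", "in the world", "of all time"]
CONCISION = ["in one word", "in a single word", "please answer in one word only"]

# Every combination of the three factors: 6 x 3 x 3 conditions.
CONDITIONS = list(product(ADJECTIVES, SUPERLATIVES, CONCISION))

# Token metadata columns named in the Columns table.
TOKEN_COLUMNS = {"token_id", "token_name", "token_decoded"}


def prompt_columns(columns):
    """Return the columns that are not token metadata (assumed to be prompt columns)."""
    return [c for c in columns if c not in TOKEN_COLUMNS]


if __name__ == "__main__":
    print(len(CONDITIONS))
    # Hypothetical header fragment using only column names shown in the table.
    header = ["token_id", "token_name", "token_decoded", "best_ever_one", "best_ever_single"]
    print(prompt_columns(header))
```

Applied to a loaded file, `prompt_columns(df.columns)` would be expected to return 54 names if the assumption about the column layout holds.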

### Files

| File | Model | Type | Family | Size |
|---|---|---|---|---|
| `google--gemma-2-2b.csv` | [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) | Pretrained | Gemma | 277 MB |
| `google--gemma-2-2b-it.csv` | [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it) | Instruction-tuned | Gemma | 274 MB |
| `google--gemma-2-9b-it.csv` | [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it) | Instruction-tuned | Gemma | 273 MB |
| `google--gemma-2-27b-it.csv` | [google/gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) | Instruction-tuned | Gemma | 275 MB |
| `meta-llama--Llama-3.2-3B.csv` | [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) | Pretrained | Llama | 138 MB |
| `meta-llama--Llama-3.2-3B-Instruct.csv` | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Llama-3.2-1B-Instruct.csv` | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Meta-Llama-3-8B-Instruct.csv` | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | Instruction-tuned | Llama | 140 MB |
| `meta-llama--Llama-3.1-8B-Instruct.csv` | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Meta-Llama-3-70B-Instruct.csv` | [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Llama-3.1-70B-Instruct.csv` | [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Llama-3.3-70B-Instruct.csv` | [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | Instruction-tuned | Llama | 138 MB |

**Total size:** ~2.2 GB

## Usage

### Quick download (Python)

```python
from huggingface_hub import hf_hub_download

# Download one CSV from the dataset repo into the local Hugging Face cache.
path = hf_hub_download(
    repo_id="Oxford-HIPlab/iclr2026-lm-logprobs",
    filename="google--gemma-2-2b.csv",
    repo_type="dataset",
)
```

### With the paper's code

Clone the [code repository](https://github.com/brchristian/reward_models_inherit_value_biases_from_pretraining) and run:

```bash
pip install -r requirements.txt
python scripts/download_data.py       # downloads all 12 CSVs into data/logprobs/
python figures/generate_figure_2.py   # reproduce Figure 2
```

### Load a single file with pandas

```python
import numpy as np
import pandas as pd

df = pd.read_csv("google--gemma-2-2b.csv")
logprobs = df["greatest_ever_one"]  # log-probs for one prompt
probs = np.exp(logprobs)            # convert to probabilities
```

## Generation details

Log-probabilities were generated using `scripts/generate_logprobs.py` from the code repository. For each model and prompt:

1. The prompt is tokenized (using `apply_chat_template` for instruction-tuned models, plain tokenization for pretrained models).
2. A single forward pass produces logits at the final token position.
3. `log_softmax` is applied to obtain log-probabilities over the full vocabulary.

All computations use `bfloat16` precision with deterministic settings (`torch.use_deterministic_algorithms(True)`, seed 42).

## Citation

```bibtex
@inproceedings{christian2026reward,
  title={Reward Models Inherit Value Biases from Pretraining},
  author={Christian, Brian and Thompson, Jessica A F and Elle and Adam, Vincent and Kirk, Hannah Rose and Summerfield, Christopher and Dumbalska, Tsvetomira},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```
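
As an illustration of the `log_softmax` step listed under Generation details, the sketch below computes log-probabilities from a toy logit vector and converts them back to probabilities. It is a pure-Python example under stated assumptions, not the code in `scripts/generate_logprobs.py`; the function names `log_softmax` and `to_probs` and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def to_probs(logprobs):
    """Exponentiate log-probabilities to recover probabilities."""
    return [math.exp(lp) for lp in logprobs]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # stand-in for last-token logits
    logprobs = log_softmax(toy_logits)
    print(logprobs)
    print(sum(to_probs(logprobs)))
```

Each prompt column in the CSVs holds values of this kind for the full vocabulary. Because the stored values were computed in `bfloat16`, exponentiated columns may sum to 1 only approximately.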

Provider:

Oxford-HIPlab
