Oxford-HIPlab/iclr2026-lm-logprobs
Source: Hugging Face, updated 2026-02-23, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Oxford-HIPlab/iclr2026-lm-logprobs
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- reward-models
- value-alignment
- log-probabilities
- personality
- moral-foundations
pretty_name: "LM Log-Probabilities for Value Bias Analysis"
size_categories:
- 1M<n<10M
---
# LM Log-Probabilities for Value Bias Analysis
Next-token log-probability distributions from 12 language models across 54 prompts, used in the paper:
> **Reward Models Inherit Value Biases from Pretraining**
>
> Brian Christian, Jessica A.F. Thompson, Elle, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska (ICLR 2026)
Part of the [Oxford-HIPlab collection](https://huggingface.co/collections/Oxford-HIPlab/reward-models-inherit-value-biases-from-pretraining-iclr2026) for this paper.
## Dataset description
Each CSV contains the full next-token log-probability distribution (log-softmax of last-token logits) for one language model evaluated on 54 prompts. The prompts follow a 6 × 3 × 3 factorial design: 6 adjectives (best, greatest, good, worst, terrible, bad) × 3 superlatives (ever, in the world, of all time) × 3 concision styles (in one word, in a single word, please answer in one word only).
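As a quick check on the factorial design, the 54 conditions can be enumerated from the three phrase lists given above. This is only a sketch of the design's structure: the exact prompt wording and the full column-naming scheme are not spelled out in this card (only `best_ever_one` and `best_ever_single` are shown), so the code deliberately does not try to reconstruct either.

```python
from itertools import product

# Phrases copied from the dataset description above.
ADJECTIVES = ["best", "greatest", "good", "worst", "terrible", "bad"]
SUPERLATIVES = ["ever", "in the world", "of all time"]
CONCISION_STYLES = [
    "in one word",
    "in a single word",
    "please answer in one word only",
]

# Each (adjective, superlative, concision style) triple is one prompt condition.
conditions = list(product(ADJECTIVES, SUPERLATIVES, CONCISION_STYLES))

print(len(conditions))  # 6 * 3 * 3 = 54
print(conditions[0])    # ('best', 'ever', 'in one word')
```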
### Columns
| Column | Description |
|---|---|
| `token_id` | Vocabulary index |
| `token_name` | Raw token string from the tokenizer |
| `token_decoded` | Decoded token (human-readable) |
| `best_ever_one`, `best_ever_single`, ... | Log-probability (54 prompt columns) |
Each row is one token from the model's vocabulary. Gemma models have ~256K tokens; Llama models have ~128K tokens.
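Given this layout, one way to work with a file is to treat every column other than the three token metadata columns as a prompt column, then rank tokens within a prompt. The sketch below runs on a tiny made-up frame so that it works without downloading anything; the helper names `prompt_columns` and `top_tokens` and all toy token values are illustrative assumptions, not part of the dataset or the paper's code.

```python
import numpy as np
import pandas as pd

# The three metadata columns listed in the table above.
METADATA_COLUMNS = ["token_id", "token_name", "token_decoded"]


def prompt_columns(df):
    """Return every column that is not a token metadata column."""
    return [c for c in df.columns if c not in METADATA_COLUMNS]


def top_tokens(df, column, k=5):
    """Return the k rows with the highest log-probability in `column`."""
    return df.nlargest(k, column)[["token_id", "token_decoded", column]]


# Toy stand-in for a real CSV: four invented tokens, two prompt columns.
toy = pd.DataFrame(
    {
        "token_id": [0, 1, 2, 3],
        "token_name": ["▁Love", "▁Peace", "▁War", "▁Pizza"],
        "token_decoded": [" Love", " Peace", " War", " Pizza"],
        "best_ever_one": np.log([0.5, 0.3, 0.05, 0.15]),
        "best_ever_single": np.log([0.4, 0.4, 0.1, 0.1]),
    }
)

print(prompt_columns(toy))
print(top_tokens(toy, "best_ever_one", k=2))
```

On a real file, `prompt_columns(df)` should return 54 names if the layout matches the description above.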
### Files
| File | Model | Type | Family | Size |
|---|---|---|---|---|
| `google--gemma-2-2b.csv` | [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) | Pretrained | Gemma | 277 MB |
| `google--gemma-2-2b-it.csv` | [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it) | Instruction-tuned | Gemma | 274 MB |
| `google--gemma-2-9b-it.csv` | [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it) | Instruction-tuned | Gemma | 273 MB |
| `google--gemma-2-27b-it.csv` | [google/gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) | Instruction-tuned | Gemma | 275 MB |
| `meta-llama--Llama-3.2-3B.csv` | [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) | Pretrained | Llama | 138 MB |
| `meta-llama--Llama-3.2-3B-Instruct.csv` | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Llama-3.2-1B-Instruct.csv` | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Meta-Llama-3-8B-Instruct.csv` | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | Instruction-tuned | Llama | 140 MB |
| `meta-llama--Llama-3.1-8B-Instruct.csv` | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Meta-Llama-3-70B-Instruct.csv` | [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Llama-3.1-70B-Instruct.csv` | [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | Instruction-tuned | Llama | 139 MB |
| `meta-llama--Llama-3.3-70B-Instruct.csv` | [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | Instruction-tuned | Llama | 138 MB |
**Total size:** ~2.2 GB
## Usage
### Quick download (Python)
```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Oxford-HIPlab/iclr2026-lm-logprobs",
    filename="google--gemma-2-2b.csv",
    repo_type="dataset",
)
```
### With the paper's code
Clone the [code repository](https://github.com/brchristian/reward_models_inherit_value_biases_from_pretraining) and run:
```bash
pip install -r requirements.txt
python scripts/download_data.py # downloads all 12 CSVs into data/logprobs/
python figures/generate_figure_2.py # reproduce Figure 2
```
### Load a single file with pandas
```python
import numpy as np
import pandas as pd

df = pd.read_csv("google--gemma-2-2b.csv")
logprobs = df["greatest_ever_one"]  # log-probs for one prompt
probs = np.exp(logprobs)  # convert to probabilities
```
## Generation details
Log-probabilities were generated using `scripts/generate_logprobs.py` from the code repository. For each model and prompt:
1. The prompt is tokenized (using `apply_chat_template` for instruction-tuned models, plain tokenization for pretrained models).
2. A single forward pass produces logits at the final token position.
3. `log_softmax` is applied to obtain log-probabilities over the full vocabulary.
All computations use `bfloat16` precision with deterministic settings (`torch.use_deterministic_algorithms(True)`, seed 42).
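Step 3 can be illustrated independently of any model. The card indicates the original script uses PyTorch in `bfloat16`. The sketch below instead re-implements only the log-softmax arithmetic in NumPy at `float64`, on made-up logits, so it is not the repository's code. It shows why exponentiating a column of log-probabilities should give values that sum to roughly 1. With values computed in `bfloat16`, the sum for a real column may deviate slightly from 1.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax: logits minus logsumexp(logits)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max so exp() cannot overflow
    return shifted - np.log(np.exp(shifted).sum())


# Made-up logits for a four-token vocabulary (illustration only).
example_logits = [2.0, 1.0, 0.1, -3.0]
logprobs = log_softmax(example_logits)

# Exponentiated log-probabilities form a probability distribution.
total = np.exp(logprobs).sum()
print(logprobs)
print(total)
```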
## Citation
```bibtex
@inproceedings{christian2026reward,
title={Reward Models Inherit Value Biases from Pretraining},
author={Christian, Brian and Thompson, Jessica A F and Elle and Adam, Vincent and Kirk, Hannah Rose and Summerfield, Christopher and Dumbalska, Tsvetomira},
booktitle={International Conference on Learning Representations},
year={2026}
}
```
Provider:
Oxford-HIPlab



