s-nlp/mmlu-pro-llama3.1-8b-instruct-temp0.9-samples99-logprobs
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/s-nlp/mmlu-pro-llama3.1-8b-instruct-temp0.9-samples99-logprobs
Resource description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
pretty_name: MMLU-Pro Llama 3.1 8B Self-Consistency Samples with Post-hoc Logprobs
size_categories:
- 10K<n<100K
tags:
- self-consistency
- logprobs
- mmlu-pro
- llama-3.1
- reasc
---
# MMLU-Pro Llama 3.1 8B Self-Consistency Samples with Post-hoc Logprobs
This dataset augments an existing MMLU-Pro self-consistency run with token-level log-probabilities for every sampled completion.
It is intended for experiments that need both:
- multiple sampled answers per question
- per-token confidence information for each sampled answer
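As one illustration of an experiment that uses both ingredients, the sketch below runs a confidence-weighted self-consistency vote. It is a minimal example under stated assumptions, not a method described in this card: the answer pattern (`answer is (X)` with letters A to J), the helper names `extract_answer` and `weighted_vote`, the choice of `exp(mean logprob)` as the weight, and the toy data are all illustrative.

```python
import math
import re
from collections import defaultdict

# Assumption: completions state their choice as "answer is (X)", X in A..J.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\)?")


def extract_answer(completion):
    """Return the last answer letter found in the completion, or None."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def weighted_vote(completions, logprobs):
    """Vote over answers, weighting each sample by exp(mean token logprob)."""
    votes = defaultdict(float)
    for text, token_logprobs in zip(completions, logprobs):
        answer = extract_answer(text)
        if answer is None or not token_logprobs:
            continue  # skip unparsable or unscored samples
        mean_logprob = sum(token_logprobs) / len(token_logprobs)
        votes[answer] += math.exp(mean_logprob)
    if not votes:
        return None
    return max(votes, key=votes.get)


# Toy data with made-up values (not taken from the dataset).
toy_completions = [
    "Step by step... The answer is (A).",
    "The answer is (B).",
    "The answer is (B).",
    "no answer",
]
toy_logprobs = [[-0.1, -0.1], [-2.0, -2.0], [-2.0, -2.0], [-0.5]]
toy_winner = weighted_vote(toy_completions, toy_logprobs)
print(toy_winner)
```

In the toy data a plain majority vote would pick B (two samples against one), while the weighted vote picks A because its single sample carries a much higher mean log-probability.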
## Source
Original generations:
- `memyprokotow/mmlu_pro_Llama3.1-8b-instruct_temp0.9_samples99`
Post-hoc scoring model:
- `unsloth/Llama-3.1-8B-Instruct`
The generations were **not regenerated**. Instead, each stored completion was rescored with the same underlying model to obtain token-level log-probabilities.
## Data files
Main file:
- `predictions.parquet`
Auxiliary file:
- `merge_metadata.json`
Dataset size:
- `12032` questions
- up to `99` sampled completions per question
## What is stored
Each row corresponds to one MMLU-Pro question and contains:
- `question_id`: numeric question identifier
- `question`: question text
- `options`: answer options as stored in the source dataset
- `answer`: gold answer text
- `answer_index`: gold answer index
- `category`: subject / category label
- `src`: source split metadata from the original dataset
- `prompt`: prompt used to score the stored completions
- `all_completions`: list of sampled completions
- `all_logprobs`: list of token-level log-probability sequences aligned with `all_completions`
- `final_answer`: final answer field from the original self-consistency dataset
- `num_scored_completions`: number of completions successfully rescored for this question
Important alignment rule:
- `all_logprobs[i]` contains the token-level log-probabilities for `all_completions[i]`
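The alignment rule can be exercised with a small helper that pairs each completion with its log-probabilities and reduces them to sequence-level scores. This is a minimal sketch, assuming each entry of `all_logprobs` is a flat list of floats with one value per completion token. The helper name `summarize_row` and the toy row are illustrative and not part of the dataset. This card does not state how completions that failed rescoring are represented, so the sketch simply skips entries whose log-probability list is empty.

```python
import math


def summarize_row(row):
    """Pair each completion with its log-probabilities and summarize them.

    Assumes row["all_logprobs"][i] is a flat list of floats for
    row["all_completions"][i]; empty entries are skipped.
    """
    completions = row["all_completions"]
    logprobs = row["all_logprobs"]
    if len(completions) != len(logprobs):
        raise ValueError("all_completions and all_logprobs are not aligned")

    summaries = []
    for index, token_logprobs in enumerate(logprobs):
        if not token_logprobs:
            continue  # nothing to summarize for this completion
        total = sum(token_logprobs)
        mean = total / len(token_logprobs)
        summaries.append(
            {
                "index": index,
                "num_tokens": len(token_logprobs),
                "sum_logprob": total,
                "mean_logprob": mean,
                "perplexity": math.exp(-mean),
            }
        )
    return summaries


# Toy row with made-up values, shaped like the fields described above.
toy_row = {
    "all_completions": ["The answer is (A).", "The answer is (B).", ""],
    "all_logprobs": [[-0.1, -0.2, -0.3], [-1.0, -2.0], []],
}
toy_summaries = summarize_row(toy_row)
for summary in toy_summaries:
    print(summary)
```

The mean log-probability is length-normalized, which makes completions of different lengths easier to compare than the raw sum.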
## How the logprobs were obtained
For each question:
1. Load the original prompt and the stored sampled completions.
2. Run the model in teacher-forcing mode over each completion.
3. Compute token log-probabilities for the completion tokens.
4. Store the resulting per-token values in `all_logprobs`.
This makes the dataset suitable for post-hoc confidence-based methods without repeating the original sampling run.
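The rescoring script itself is not shown in this card, so the following is only a toy illustration of what teacher forcing computes. It assumes the model has already produced one logit vector per input position, where the vector at position `t` predicts the token at position `t + 1`. The function names, the three-token vocabulary, and the logit values are all made up for the example. In a real run the logits would come from a forward pass of the scoring model over the prompt followed by the completion.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def score_completion(step_logits, token_ids, prompt_len):
    """Return log p(token | previous tokens) for each completion token.

    step_logits[t] holds the logits produced after reading
    token_ids[: t + 1], so it predicts token_ids[t + 1].
    token_ids is the prompt followed by the completion.
    """
    if prompt_len < 1:
        raise ValueError("need at least one prompt token to condition on")
    scores = []
    for position in range(prompt_len, len(token_ids)):
        predicted = log_softmax(step_logits[position - 1])
        scores.append(predicted[token_ids[position]])
    return scores


# Toy example: vocabulary of 3 tokens, 2 prompt tokens, 2 completion tokens.
toy_token_ids = [0, 2, 1, 2]
toy_logits = [
    [0.0, 0.0, 0.0],  # predicts position 1 (prompt token, not scored)
    [0.0, 2.0, 0.0],  # predicts position 2 (first completion token)
    [1.0, 1.0, 1.0],  # predicts position 3 (second completion token)
    [0.0, 0.0, 0.0],  # predicts a token past the end (unused)
]
toy_scores = score_completion(toy_logits, toy_token_ids, prompt_len=2)
print(toy_scores)
```

Only completion positions are scored. The prompt tokens condition the predictions but contribute no values of their own, which matches the description of `all_logprobs` as covering the completion tokens.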
## Loading example
```python
from datasets import load_dataset

ds = load_dataset(
    "s-nlp/mmlu-pro-llama3.1-8b-instruct-temp0.9-samples99-logprobs",
    split="train",
)

row = ds[0]
print(row["question"])
print(len(row["all_completions"]))  # number of sampled completions
print(len(row["all_logprobs"]))     # one logprob sequence per completion
print(len(row["all_logprobs"][0]))  # scored tokens in the first completion
```
To read the parquet file directly:
```python
import pandas as pd

# The PyArrow engine handles nested list columns such as `all_logprobs`.
df = pd.read_parquet("predictions.parquet", engine="pyarrow")
print(df.columns.tolist())
```
## Notes and caveats
- This is a **post-hoc scored** dataset, not a fresh generation run with online logprob extraction.
- The quality of the logprob signal depends on how faithfully the prompt was reconstructed and on whether the tokenizer and model used for rescoring are compatible with those used for generation.
- `options` is stored in the same format as in the source dataset.
- Some downstream parquet readers may have trouble with nested list columns such as `all_logprobs`; Hugging Face Datasets or PyArrow-based readers are recommended.
Provider:
s-nlp



