TAUR-dev/rankalign-eval-summary
Source: Hugging Face. Updated 2026-04-22; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TAUR-dev/rankalign-eval-summary
Resource description:
---
dataset_info:
  features:
  - name: model
    dtype: string
  - name: hf_model_name
    dtype: string
  - name: local_model_name
    dtype: string
  - name: task
    dtype: string
  - name: split
    dtype: string
  - name: self_tc
    dtype: bool
  - name: neg_tc
    dtype: bool
  - name: gpt2_tc
    dtype: bool
  - name: finetuned
    dtype: bool
  - name: training_config
    dtype: string
  - name: eval_variant
    dtype: string
  - name: gen_roc
    dtype: float64
  - name: val_roc
    dtype: float64
  - name: val_acc
    dtype: float64
  - name: corr
    dtype: float64
  - name: corr_pos
    dtype: float64
  - name: corr_neg
    dtype: float64
  - name: n_samples
    dtype: int64
  - name: filename
    dtype: string
  splits:
  - name: train
    num_bytes: 34295727
    num_examples: 54104
  download_size: 2427945
  dataset_size: 34295727
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# RankAlign Eval Summary
Aggregated evaluation metrics for RankAlign experiments. Each row summarizes one (model, task, split, tc_type, eval_variant) combination, computed from individual `scores_*.csv` files.
**20,728 rows** covering 2 model families, 235 tasks, 4 eval score variants.
Generated by `scripts/summarize_scores.py` from the [rankalign](https://github.com/juand-r/rankalign) project.
## Filters Applied
- **Models**: v6 only (`v6-google_gemma-2-2b`, `v6-google_gemma-2-9b-it`)
- **Epochs**: Base (non-finetuned) models + epoch 2 finetuned models only
- **Dedup**: When multiple score files exist for the same (model, task, split, tc_type, training_config), only the newest (by timestamp) is kept
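The dedup rule above can be sketched as follows. This is a minimal illustration, not the code in `scripts/summarize_scores.py`: the `timestamp` field, the `tc_type` field, and the `dedup_newest` helper are hypothetical names, and how the real script obtains timestamps from score files is not specified here.

```python
from typing import Iterable

# Key fields taken from the dedup rule. `tc_type` is an assumed single-label
# representation of the self/neg/gpt2 typicality-correction flags.
KEY_FIELDS = ("model", "task", "split", "tc_type", "training_config")


def dedup_newest(records: Iterable[dict]) -> list[dict]:
    """Keep only the newest record per key, using a hypothetical `timestamp` field."""
    newest: dict[tuple, dict] = {}
    for rec in records:
        key = tuple(rec[f] for f in KEY_FIELDS)
        # Replace the stored record only if this one is strictly newer.
        if key not in newest or rec["timestamp"] > newest[key]["timestamp"]:
            newest[key] = rec
    return list(newest.values())
```

Timestamps only need to be mutually comparable (for example, sortable strings or `datetime` objects) for this sketch to work.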
## Column Descriptions
### Identity Columns
| Column | Type | Description |
|--------|------|-------------|
| `model` | str | Base model name, e.g. `v6-google_gemma-2-2b`. TC prefix (`self-`, `neg-`) is stripped and tracked separately. |
| `task` | str | Evaluation task, e.g. `hypernym-bananas`, `plausibleqa-nq_1369`, `ifeval-prompt_10`, `ambigqa-american` |
| `split` | str | Data split: `test` or `train` |
| `finetuned` | bool | `True` if this is a finetuned model (detected by `-delta` in model name). |
| `training_config` | str | Full training configuration for finetuned models (e.g. `delta0.15-epoch2_hypernym-bananas-all_d2g_random_alpha1.0_full-completion_force-same-x_labelonly0.1`). Empty string for base models. |
### Typicality Correction Type
All three TC columns are **eval-time** properties -- they indicate which typicality correction method was used when running the evaluation script. At most one can be `True` per row (enforced by assertion).
| Column | Type | Eval flag | What it does | Filename marker |
|--------|------|-----------|-------------|-----------------|
| `self_tc` | bool | `eval_by_claude.py --self-typicality` | Corrects generative scores by subtracting the model's own unconditional log-probability of the completion: `score - log P_model(completion)` | `self-` prefix |
| `neg_tc` | bool | `eval_by_claude.py --neg-typicality` | Corrects generative scores using negated prompts (LLR): `log P(y\|Q) - log P(y\|neg_Q)` | `neg-` prefix |
| `gpt2_tc` | bool | `eval_by_claude.py --typicality-correction` (without self/neg) or `eval.py --typicality-correction` | Corrects generative scores by subtracting GPT-2's log-probability of the completion: `score - log P_GPT2(completion)` | `_tc` suffix (eval_by_claude.py) or `_evaltc` suffix (eval.py), no prefix |
When all three are `False`, no typicality correction was applied during evaluation.
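When analyzing the table it is often convenient to collapse the three boolean columns into one label. The sketch below does that and re-checks the "at most one `True`" invariant. The helper name `tc_type` and the label strings (`self`, `neg`, `gpt2`, `none`) are illustrative choices, not part of the dataset.

```python
def tc_type(row: dict) -> str:
    """Collapse the three boolean TC columns into a single label."""
    names = ("self_tc", "neg_tc", "gpt2_tc")
    active = [name for name in names if bool(row[name])]
    # The dataset card states that at most one TC flag is True per row.
    assert len(active) <= 1, f"multiple TC flags set: {active}"
    # "self_tc" -> "self", "neg_tc" -> "neg", "gpt2_tc" -> "gpt2"
    return active[0].removesuffix("_tc") if active else "none"
```

`str.removesuffix` requires Python 3.9 or later.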
### Eval Variant
| Column | Type | Description |
|--------|------|-------------|
| `eval_variant` | str | Which generative score column from the source CSV was used to compute metrics. One of `raw`, `tc`, `lenorm`, or `tc+lenorm`. |

| `eval_variant` value | Source CSV column | Meaning |
|---|---|---|
| `raw` | `gen_score` | Raw generative score, no corrections applied in the CSV |
| `tc` | `gen_score_typcorr` | Typicality-corrected generative score. The TC method (self, neg, or GPT-2) is determined by the `self_tc`/`neg_tc`/`gpt2_tc` columns. |
| `lenorm` | `gen_score_lenorm` | Length-normalized generative score |
| `tc+lenorm` | `gen_score_typcorr_lenorm` | Both typicality-corrected and length-normalized |
Not all variants are present in every source CSV. The `tc` and `tc+lenorm` variants only exist if a typicality correction flag was passed during evaluation.
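The variant-to-column mapping in the table can be written down directly. The sketch below also shows one way to determine which variants a given scores CSV supports; `available_variants` is a hypothetical helper, not a function from the rankalign repository.

```python
# Mapping from `eval_variant` value to the source CSV column, as listed above.
EVAL_VARIANT_TO_COLUMN = {
    "raw": "gen_score",
    "tc": "gen_score_typcorr",
    "lenorm": "gen_score_lenorm",
    "tc+lenorm": "gen_score_typcorr_lenorm",
}


def available_variants(csv_columns):
    """Return the eval variants whose source column is present in a scores CSV."""
    present = set(csv_columns)
    return [v for v, col in EVAL_VARIANT_TO_COLUMN.items() if col in present]
```

For a CSV produced without any typicality correction flag, only `raw` and `lenorm` would be expected to appear.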
### Metric Columns
All metrics are computed per row, that is, per (model, task, split, tc_type, eval_variant) combination, from the source CSV's score columns and ground truth labels.
| Column | Type | Description |
|--------|------|-------------|
| `gen_roc` | float | ROC-AUC of generative scores vs ground truth labels. Measures how well the generative score discriminates positive from negative examples. |
| `val_roc` | float | ROC-AUC of validation (discriminative) scores vs ground truth labels. Uses `val_score` column from source CSV. |
| `val_acc` | float | Accuracy of predictions derived from validation scores, thresholded at 0 when the metric type is log-odds. |
| `corr` | float | Pearson correlation between generative and validation scores across all samples. |
| `corr_pos` | float | Pearson correlation between generative and validation scores for positive-label samples only. |
| `corr_neg` | float | Pearson correlation between generative and validation scores for negative-label samples only. |
NaN values indicate the metric could not be computed (e.g., constant inputs for correlation, single-class data for ROC-AUC).
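As an illustration of how these six columns relate to the raw scores, here is a standard-library sketch. It is not the project's implementation, which may use library routines instead. It assumes labels are encoded as `1` (positive) and `0` (negative), that a validation score strictly greater than 0 predicts the positive class, and that the inputs are non-empty; the function names are hypothetical.

```python
import math


def roc_auc(scores, labels):
    """ROC-AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return math.nan  # single-class data: ROC-AUC is undefined
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(xs, ys):
    """Pearson correlation; NaN for fewer than two points or constant input."""
    n = len(xs)
    if n < 2:
        return math.nan
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def summarize(gen, val, labels):
    """Compute the six metric columns for one generative score variant."""
    def subset(target):
        idx = [i for i, y in enumerate(labels) if y == target]
        return [gen[i] for i in idx], [val[i] for i in idx]

    return {
        "gen_roc": roc_auc(gen, labels),
        "val_roc": roc_auc(val, labels),
        # Threshold at 0, assuming validation scores are log-odds.
        "val_acc": sum((v > 0) == (y == 1) for v, y in zip(val, labels)) / len(labels),
        "corr": pearson(gen, val),
        "corr_pos": pearson(*subset(1)),
        "corr_neg": pearson(*subset(0)),
    }
```

The pairwise ROC-AUC is quadratic in the number of samples, which is fine for illustration but slower than a rank-based implementation on large files.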
### Provenance
| Column | Type | Description |
|--------|------|-------------|
| `n_samples` | int | Number of rows in the source scores CSV file. |
| `filename` | str | Source `scores_*.csv` filename. Used for dedup in incremental mode and for traceability. |
## Task Families
| Family | Example tasks | Count |
|--------|--------------|-------|
| plausibleqa | `plausibleqa-nq_1369`, `plausibleqa-webq_342` | ~200 tasks |
| ifeval | `ifeval-prompt_10`, `ifeval-prompt_100` | ~100+ tasks |
| hypernym | `hypernym-bananas`, `hypernym-dogs`, ... (18 subtasks) | 18 tasks |
| ambigqa | `ambigqa-american`, `ambigqa-winter` | ~18 tasks |
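To count tasks per family in the actual data, the family can be derived from the task name. The sketch below assumes the family is the text before the first hyphen, which holds for the example task names shown above but is not stated as a guarantee; both helper names are hypothetical.

```python
from collections import Counter


def task_family(task: str) -> str:
    """Return the family prefix of a task name, e.g. 'hypernym-bananas' -> 'hypernym'."""
    return task.split("-", 1)[0]


def count_tasks_per_family(tasks):
    """Count distinct task names in each family."""
    return Counter(task_family(t) for t in set(tasks))
```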
## Incremental Updates
This dataset supports incremental updates. Running:
```bash
python scripts/summarize_scores.py --incremental --model-filter v6 --epoch-filter epoch2
```
will pull the existing summary from HuggingFace, skip already-processed files (matched by `filename`), compute metrics only for new files, merge, and re-upload.
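The skip-and-merge step can be sketched as below. This is only an outline of the logic described above, not the script's code: download and upload are omitted, and `compute_rows` is a hypothetical callable that maps one `scores_*.csv` filename to its summary rows.

```python
def incremental_merge(existing_rows, candidate_files, compute_rows):
    """Skip files already in the summary, compute rows for new files, and merge.

    `compute_rows` is a hypothetical callable mapping a filename to a list of
    summary rows (for example, one per eval variant).
    """
    seen = {row["filename"] for row in existing_rows}
    new_rows = []
    for fname in candidate_files:
        if fname in seen:
            continue  # already processed, matched by `filename`
        new_rows.extend(compute_rows(fname))
    return list(existing_rows) + new_rows
```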
Provider:
TAUR-dev
Collected summary
Dataset introduction

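Construction
In natural language processing, standardized evaluation datasets are essential for advancing research. The RankAlign Eval Summary dataset was built by systematically aggregating evaluation metrics from many experiments; its core pipeline relies on the automated script `summarize_scores.py`, which processes the raw `scores_*.csv` files. The script aggregates results by model, task, data split, typicality correction type, and eval variant, and deduplicates score files so that only the newest is kept for each (model, task, split, TC type, training configuration) combination. Strict filters were applied during construction, restricting the data to specific model families and training stages, which yields a clearly structured summary archive covering a broad range of task families.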
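Usage
The dataset mainly serves research that systematically analyzes and compares the performance of large language models, especially models trained with the RankAlign method. Users can filter by model name, task, or eval variant and extract the corresponding metrics to assess a model's discriminative ability and score consistency across tasks and correction settings. The dataset supports an incremental update mode: with the provided script, researchers can process only newly added evaluation files and merge them into the existing summary, which makes the benchmark efficient to maintain and extend. This design lets the dataset keep absorbing new experimental results and serve as a tool for tracking model progress over time.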
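A minimal sketch of this filtering workflow follows. It assumes the Hugging Face `datasets` package is installed and that the repository is reachable under the id shown at the top of this page; `load_summary_rows` and `select_rows` are hypothetical helper names.

```python
def load_summary_rows(repo_id="TAUR-dev/rankalign-eval-summary"):
    """Load the `train` split as a list of dicts (needs `datasets` and network access)."""
    from datasets import load_dataset  # third-party dependency, imported lazily

    return list(load_dataset(repo_id, split="train"))


def select_rows(rows, **criteria):
    """Keep rows whose columns equal every given criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


# Example (requires network access):
# rows = load_summary_rows()
# base_raw = select_rows(rows, model="v6-google_gemma-2-2b",
#                        finetuned=False, eval_variant="raw", split="test")
```

`select_rows` works on any list of dicts, so the same filter can be applied to rows loaded by other means, such as a local Parquet export.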
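Background and Challenges
Background Overview
In large language model evaluation, systematically quantifying the agreement between generative and discriminative scores is key to understanding a model's internal representations and reasoning ability. The RankAlign Eval Summary dataset was created for this purpose: it was built in 2024 by researcher Juan D. R. and his team through the rankalign project, and it aggregates evaluation metrics for model families such as Gemma across diverse tasks. Its core research questions are how typicality correction affects the performance of generative scores and how generative scores correlate with validation scores, which makes it a benchmark analysis tool for model calibration, robustness evaluation, and few-shot learning.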
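Current Challenges
The dataset addresses the challenges of score calibration and robustness verification in evaluating generative language models, including the sensitivity of generative scores to surface form and their inconsistency with discriminative scores. Building it posed several technical challenges: integrating a large volume of score files from 235 heterogeneous tasks while labeling the typicality correction method consistently; designing an incremental update mechanism that avoids redundant data; and handling metric failures caused by task characteristics, such as missing ROC-AUC values for single-class data.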
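Common Scenarios
Classic Use Cases
In natural language processing, assessing how well generative and discriminative model behavior align is a core challenge. By aggregating evaluation metrics across many tasks, the RankAlign Eval Summary dataset gives researchers a platform for systematically analyzing the correlation between generative scores and validation scores. It is commonly used to compare typicality correction methods (self-typicality, negated-prompt typicality, and GPT-2 typicality correction) across diverse tasks such as commonsense reasoning, ambiguous question answering, and hypernym identification, and thereby to study the calibration and generalization of generative models in zero-shot or finetuned settings.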
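Academic Problems Addressed
The dataset targets long-standing problems in generative model evaluation such as score bias and length dependence. By combining typicality correction with length normalization, it helps researchers quantify how well generative scores discriminate with respect to ground truth labels and mitigates systematic errors introduced by model priors or prompt construction. Its value lies in providing a standardized evaluation framework that supports research on the calibration and robustness of generative models and on their consistency with discriminative models, and in laying an empirical basis for more reliable evaluation methods.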
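Practical Applications
In practice, RankAlign Eval Summary supports automated model selection and optimization workflows. Developers can use its aggregated ROC-AUC, accuracy, and correlation metrics to quickly identify model configurations that perform robustly on specific tasks. The dataset can also help build adaptive evaluation systems that monitor performance drift of generative models in deployment, or guide loss function design in multi-task learning, improving the performance and stability of applications such as question answering, text generation, and semantic understanding.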
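Recent Research on the Dataset
Latest Research Directions
In large language model evaluation, the RankAlign Eval Summary dataset focuses on fine-grained study of typicality correction techniques, with the aim of improving the discriminative ability of generative models across diverse tasks. Current work centers on comparing self-typicality, negated-prompt typicality, and GPT-2 typicality correction, and on examining how well they generalize across tasks such as hypernym reasoning, ambiguous question answering, and plausibility assessment. By combining multiple model families and eval variants, the dataset provides an empirical basis for model alignment and bias correction, advances research on the correlation between generative evaluation metrics and discriminative validation, and supports work on the trustworthiness and robustness of language models.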
The content above was collected and summarized by 遇见数据集.



