synthiumjp/verbal-confidence-saturation

Name: synthiumjp/verbal-confidence-saturation
Creator: synthiumjp
Published: 2026-04-27 12:59:02
License: cc-by-4.0

Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/synthiumjp/verbal-confidence-saturation


Description:

---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- metacognition
- confidence-calibration
- llm-evaluation
- psychometrics
- validity-screening
- triviaqa
pretty_name: Verbal Confidence Saturation
size_categories:
- 1K<n<10K
dataset_info:
  splits:
  - name: train
    num_examples: 8384
---

# Verbal Confidence Saturation Dataset

8,384 deterministic trials from a pre-registered study testing whether 3–9B instruction-tuned open-weight LLMs produce valid verbal confidence under minimal elicitation.

**Paper:** [arXiv:2604.22215](https://arxiv.org/abs/2604.22215)

**Pre-registration:** [OSF](https://osf.io/azbvx)

**Code:** [GitHub](https://github.com/synthiumjp/koriat)

## Dataset summary

Eight open-weight models were administered 524 TriviaQA items under numeric (0–100) and categorical (10-class) confidence elicitation with greedy decoding. All seven instruct models were classified Invalid on numeric confidence by a psychometric validity screen. Mean ceiling rate: 91.7%.

## Structure

Each row is one trial (model × condition × item).

Key columns:

| Column | Description |
|---|---|
| `model_id` | Model identifier (M1–M8) |
| `model_name` | Full model name |
| `condition` | `NUM` or `CAT` |
| `triviaqa_question_id` | TriviaQA item ID |
| `question` | Question text |
| `response` | Model's raw response |
| `correct` | Whether the answer was correct (bool) |
| `parsed_confidence` | Extracted confidence value (0–1) |
| `parse_success` | Whether confidence was successfully parsed |
| `logprob_mean` | Mean token logprobability |
| `logprob_norm` | Length-normalised logprobability |
| `thought_block_token_count` | Reasoning trace length (M8 only) |

## Models

| ID | Model | Family | Params |
|---|---|---|---|
| M1 | Meta-Llama-3-8B | Llama base | 8B |
| M2 | Meta-Llama-3-8B-Instruct | Llama instruct | 8B |
| M3 | Meta-Llama-3.1-8B-Instruct | Llama instruct | 8B |
| M4 | Mistral-7B-Instruct-v0.3 | Mistral | 7B |
| M5 | Qwen2.5-3B-Instruct | Qwen | 3B |
| M6 | Qwen2.5-7B-Instruct | Qwen | 7B |
| M7 | Gemma-2-9b-it | Gemma | 9B |
| M8 | DeepSeek-R1-Distill-Llama-8B | DeepSeek | 8B |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("synthiumjp/verbal-confidence-saturation")
```

## Citation

```bibtex
@article{cacioli2026saturation,
  title={Verbal Confidence Saturation in 3--9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen},
  author={Cacioli, Jon-Paul},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```
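
As a rough illustration of how the documented columns might be used, the sketch below computes a per-model ceiling rate for one elicitation condition. It is not the paper's analysis code. The function name `ceiling_rate_by_model`, the `ceiling=1.0` threshold, and the choice to exclude unparsed trials from the denominator are all assumptions made for this example, and the demo rows are made up rather than drawn from the dataset.

```python
from collections import defaultdict


def ceiling_rate_by_model(rows, condition="NUM", ceiling=1.0):
    """Share of successfully parsed trials at or above `ceiling`, per model.

    Assumption: "ceiling" means parsed_confidence >= 1.0 on the 0-1 scale;
    the study's own definition may differ.
    """
    at_ceiling = defaultdict(int)
    parsed = defaultdict(int)
    for row in rows:
        # Keep only the requested condition and trials with a parsed value.
        if row["condition"] != condition or not row["parse_success"]:
            continue
        parsed[row["model_id"]] += 1
        if row["parsed_confidence"] >= ceiling:
            at_ceiling[row["model_id"]] += 1
    return {model: at_ceiling[model] / n for model, n in parsed.items()}


# Made-up rows that only mimic the documented column names.
demo_rows = [
    {"model_id": "M2", "condition": "NUM", "parse_success": True, "parsed_confidence": 1.0},
    {"model_id": "M2", "condition": "NUM", "parse_success": True, "parsed_confidence": 0.8},
    {"model_id": "M2", "condition": "NUM", "parse_success": False, "parsed_confidence": None},
    {"model_id": "M2", "condition": "CAT", "parse_success": True, "parsed_confidence": 1.0},
    {"model_id": "M5", "condition": "NUM", "parse_success": True, "parsed_confidence": 1.0},
]

rates = ceiling_rate_by_model(demo_rows)
print(rates)
```

With the real data, the same function should accept the loaded split directly, for example `ceiling_rate_by_model(ds["train"])`, assuming the columns match the table above.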

Provider:

synthiumjp
