
synthiumjp/verbal-confidence-saturation

Source: Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/synthiumjp/verbal-confidence-saturation
Description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- metacognition
- confidence-calibration
- llm-evaluation
- psychometrics
- validity-screening
- triviaqa
pretty_name: Verbal Confidence Saturation
size_categories:
- 1K<n<10K
dataset_info:
  splits:
  - name: train
    num_examples: 8384
---

# Verbal Confidence Saturation Dataset

8,384 deterministic trials from a pre-registered study testing whether 3–9B instruction-tuned open-weight LLMs produce valid verbal confidence under minimal elicitation.

**Paper:** [arXiv:2604.22215](https://arxiv.org/abs/2604.22215)
**Pre-registration:** [OSF](https://osf.io/azbvx)
**Code:** [GitHub](https://github.com/synthiumjp/koriat)

## Dataset summary

Eight open-weight models were administered 524 TriviaQA items under numeric (0–100) and categorical (10-class) confidence elicitation with greedy decoding. All seven instruct models were classified Invalid on numeric confidence by a psychometric validity screen. Mean ceiling rate: 91.7%.

## Structure

Each row is one trial (model × condition × item).

Key columns:

| Column | Description |
|---|---|
| `model_id` | Model identifier (M1–M8) |
| `model_name` | Full model name |
| `condition` | `NUM` or `CAT` |
| `triviaqa_question_id` | TriviaQA item ID |
| `question` | Question text |
| `response` | Model's raw response |
| `correct` | Whether the answer was correct (bool) |
| `parsed_confidence` | Extracted confidence value (0–1) |
| `parse_success` | Whether confidence was successfully parsed |
| `logprob_mean` | Mean token logprobability |
| `logprob_norm` | Length-normalised logprobability |
| `thought_block_token_count` | Reasoning trace length (M8 only) |

## Models

| ID | Model | Family | Params |
|---|---|---|---|
| M1 | Meta-Llama-3-8B | Llama base | 8B |
| M2 | Meta-Llama-3-8B-Instruct | Llama instruct | 8B |
| M3 | Meta-Llama-3.1-8B-Instruct | Llama instruct | 8B |
| M4 | Mistral-7B-Instruct-v0.3 | Mistral | 7B |
| M5 | Qwen2.5-3B-Instruct | Qwen | 3B |
| M6 | Qwen2.5-7B-Instruct | Qwen | 7B |
| M7 | Gemma-2-9b-it | Gemma | 9B |
| M8 | DeepSeek-R1-Distill-Llama-8B | DeepSeek | 8B |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("synthiumjp/verbal-confidence-saturation")
```

## Citation

```bibtex
@article{cacioli2026saturation,
  title={Verbal Confidence Saturation in 3--9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen},
  author={Cacioli, Jon-Paul},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```
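
As a follow-up to the `load_dataset` call in the Usage section, here is a minimal sketch of how a per-model ceiling rate might be computed from the documented columns. It assumes that "ceiling" means a `parsed_confidence` of 1.0 among successfully parsed trials in one condition; the paper's exact definition may differ. The function name `ceiling_rate_by_model` and the toy rows are hypothetical and are not part of the dataset or its code repository.

```python
from collections import defaultdict


def ceiling_rate_by_model(rows, condition="NUM", ceiling=1.0):
    """Return {model_id: share of parsed trials at or above `ceiling`}.

    `rows` is any iterable of dict-like trials using the column names
    from the card (model_id, condition, parse_success, parsed_confidence).
    The ceiling definition here is an assumption, not the paper's.
    """
    at_ceiling = defaultdict(int)
    parsed = defaultdict(int)
    for row in rows:
        # Skip other conditions and trials whose confidence could not be parsed.
        if row["condition"] != condition or not row["parse_success"]:
            continue
        parsed[row["model_id"]] += 1
        if row["parsed_confidence"] >= ceiling:
            at_ceiling[row["model_id"]] += 1
    return {model: at_ceiling[model] / parsed[model] for model in parsed}


# Toy rows for illustration only; these are NOT taken from the dataset.
toy_rows = [
    {"model_id": "M2", "condition": "NUM", "parse_success": True, "parsed_confidence": 1.0},
    {"model_id": "M2", "condition": "NUM", "parse_success": True, "parsed_confidence": 1.0},
    {"model_id": "M2", "condition": "NUM", "parse_success": True, "parsed_confidence": 0.8},
    {"model_id": "M2", "condition": "NUM", "parse_success": False, "parsed_confidence": None},
    {"model_id": "M2", "condition": "CAT", "parse_success": True, "parsed_confidence": 1.0},
    {"model_id": "M5", "condition": "NUM", "parse_success": True, "parsed_confidence": 0.5},
]

rates = ceiling_rate_by_model(toy_rows)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.1%}")
```

With the real data, the same function should accept the `train` split directly, for example `ceiling_rate_by_model(ds["train"])`, since iterating a split yields one dict per trial.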
Provider:
synthiumjp