Summary of conditions.

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://figshare.com/articles/dataset/Summary_of_conditions_/28884345

Description:

We examine how the seemingly arbitrary way a prompt is posed, which we term “prompt architecture,” influences responses provided by large language models (LLMs). Five large-scale, full-factorial experiments performing standard (zero-shot) similarity evaluation tasks using GPT-3, GPT-4, and Llama 3.1 document how several features of prompt architecture (order, label, framing, and justification) interact to produce methodological artifacts, a form of statistical bias. We find robust evidence that these four elements unduly affect responses across all models, and although we observe differences between GPT-3 and GPT-4, the changes are not necessarily for the better. Specifically, LLMs demonstrate both response-order bias and label bias, and framing and justification moderate these biases. We then test different strategies intended to reduce methodological artifacts. Specifying to the LLM that the order and labels of items have been randomized does not alleviate either response-order or label bias, and the use of uncommon labels reduces (but does not eliminate) label bias but exacerbates response-order bias in GPT-4 (and does not reduce either bias in Llama 3.1). By contrast, aggregating across prompts generated using a full factorial design eliminates response-order and label bias. Overall, these findings highlight the inherent fallibility of any individual prompt when using LLMs, as any prompt contains characteristics that may subtly interact with a multitude of hidden associations embedded in rich language data.
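The aggregation strategy described above can be illustrated with a minimal sketch. It is not the authors' code: the factor levels, the prompt wording, and the names `build_conditions`, `render_prompt`, `first_option_model`, and `aggregate` are all hypothetical, and the "model" is a stub that always picks the first-listed option so that a pure response-order bias can be seen to cancel when responses are pooled over every cell of a full-factorial design.

```python
from collections import Counter
from itertools import product

# Hypothetical factor levels; the dataset's actual conditions may differ.
ORDERS = ("forward", "reversed")        # which item is listed first
LABEL_SETS = (("A", "B"), ("B", "A"))   # label given to the first / second option
FRAMINGS = ("more similar to", "closer in meaning to")
JUSTIFY = (False, True)                 # whether a justification is requested


def build_conditions(item_x, item_y):
    """Return one condition per cell of the 2 x 2 x 2 x 2 full-factorial design."""
    conditions = []
    for order, labels, framing, justify in product(ORDERS, LABEL_SETS, FRAMINGS, JUSTIFY):
        first, second = (item_x, item_y) if order == "forward" else (item_y, item_x)
        conditions.append({
            "options": [(labels[0], first), (labels[1], second)],
            "framing": framing,
            "justify": justify,
        })
    return conditions


def render_prompt(target, cond):
    """Turn one condition into a plain-text prompt (illustrative wording only)."""
    lines = [f"Which option is {cond['framing']} '{target}'?"]
    lines += [f"{label}. {item}" for label, item in cond["options"]]
    suffix = " and justify your choice." if cond["justify"] else " only."
    lines.append("Answer with the label" + suffix)
    return "\n".join(lines)


def first_option_model(prompt):
    """Stub standing in for an LLM: always returns the label of the first-listed option."""
    return prompt.splitlines()[1].split(".", 1)[0]


def aggregate(target, item_x, item_y, model):
    """Count how often each underlying item is chosen across all conditions."""
    counts = Counter()
    for cond in build_conditions(item_x, item_y):
        label = model(render_prompt(target, cond))
        counts[dict(cond["options"])[label]] += 1  # map label back to the item
    return counts


if __name__ == "__main__":
    print(aggregate("kitten", "cat", "dog", first_option_model))
```

Because every item appears equally often in each position and under each label, a stub that is biased only toward position (or only toward a label) ends up choosing each item in exactly half of the 16 cells. A real model's biases may interact with content, so this sketch shows the balancing logic rather than a guarantee.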

Created:

2025-04-28
