ChatBench

Name: ChatBench
Creator: maas
Published: 2025-12-05 12:12:29
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-07-26.

Download link:

https://modelscope.cn/datasets/microsoft/ChatBench

Resource description:

# Dataset Card for ChatBench

This is the dataset from the paper, "[ChatBench: From Static Benchmarks to Human-AI Evaluation](https://arxiv.org/abs/2504.07114)", by Serina Chang, Ashton Anderson, and Jake Hofman.

## Data Summary

ChatBench contains data from our user study on Prolific and our automated AI-alone experiments, enabling comparison of AI-alone, user-AI, and user-alone answers for the same set of [MMLU](https://huggingface.co/datasets/cais/mmlu) benchmark questions (Hendrycks et al., 2021).

**User study.** Our user study consists of two phases. In Phase 1, users answer each question on their own. In Phase 2, users answer questions with the help of an AI Chatbot. In Phase 2, users in the *answer-first* condition attempt to answer each question on their own before answering with AI, but in the *direct-to-AI* condition, they have immediate access to AI. Screenshots from our user study and many more details are provided in our paper.

**AI-alone.** We include two types of AI-alone testing: *letter-only*, which requires the model to answer with a single letter ("A" through "D"), and *free-text*, which allows the model to write a free-text response to the question and then uses GPT-4o to extract an answer (if any) from that response. We run *letter-only* both zero-shot and few-shot; the few-shot variant uses the five examples from the MMLU dev set. Please see our paper for the exact prompts used.

**Questions.** We include 396 questions in total, sourced from five MMLU datasets: Elementary Mathematics, High School Mathematics, College Mathematics, Conceptual Physics, and Moral Scenarios. Questions are filtered by [MMLU-Redux](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0) (Gema et al., 2024) and by our inspection. Please see our paper for details on question selection.

## File Structure

The user study data is organized into three folders:

- ``full_study``: the full study we launched, on all subjects, with incentives
- ``pilot_1_no_incentives``: the first pilot, on all subjects, without incentives
- ``pilot_2_w_incentives_math``: the second pilot, on math datasets, with incentives

Each folder contains ``user_answers.csv``, which contains user-alone and user-AI answers, and ``conversations.json``, which contains the user-AI conversations. ``full_study`` also has ``user_answers_filtered.csv``, which contains the filtered set of answers we used in our main statistical analyses, following the filtering criteria we described in our [preregistration](https://aspredicted.org/n84n-sn3f.pdf), such as only keeping answers from users who completed the study.

Each row in ``user_answers.csv`` corresponds to a user's answer to a question in the user study. The columns are:

- **worker_id** (``str``): the worker's anonymized ID. We mapped each worker's original ID on Prolific to a randomly generated string of 10 letters and digits.
- **model** (``str``): the LLM behind the AI Chatbot, one of {"gpt-4o", "gpt-4o-mini", "llama-3.1-8b"}. "gpt-4o-mini" only appears in the pilots.
- **condition** (``str``): the user-AI condition that this worker was assigned to, one of "direct-to-AI" or "answer-first".
- **subject** (``str``): the subject that this worker was assigned to, one of {"math", "physics", "moral"}.
- **batch** (``int``): the question batch that this worker was assigned to, ranges from 0 to 18 when the subject is "math" and from 0 to 6 for the other two subjects.
- **phase** (``int``): the current phase of the user study, one of 1 or 2.
- **position** (``int``): the position of the question in the user study, ranges from 0 to 12. Questions in positions 0-3 always correspond to Phase 1 and positions 4-12 correspond to Phase 2.<sup>1</sup>
- **answer_type** (``str``): whether this is a user-alone or user-AI answer, one of "user-alone" or "user-AI".
- **dataset** (``str``): the MMLU dataset that this question came from, one of {"elementary_mathematics", "high_school_mathematics", "college_mathematics", "conceptual_physics", "moral_reasoning"}.<sup>1</sup>
- **question_id** (``str``): the question ID, of the form \<dataset\>-redux-\<number of the question in MMLU-Redux\>.<sup>1</sup>
- **confidence** (``str``): the user's confidence in approaching the question, reported before attempting to answer it, one of "not-confident", "somewhat-confident", and "very-confident" (see exact phrasing of confidence question in our paper).
- **selected_answer** (``str``): the user's selected answer, one of {"A", "B", "C", "D"}.
- **acc** (``int``): whether the user's selected answer was correct (1) or not (0).

<sup>1</sup> Every user received an attention check question at the end of phase 1, so the question in position 3 always has question_id "attention_check-1" and dataset "attention_check".

Each entry in ``conversations.json`` corresponds to a user-AI conversation in the user study. In addition to providing the chat transcripts, we also provide conversation annotations from a separate instance of GPT-4o (we manually verified annotations on a random sample of 50 conversations). For the following two annotation tasks, we showed GPT-4o the question text, the answer options, and the entire user-AI conversation. Then, we instructed it with the following prompts:

1. Classify this conversation. If the person already knew the answer so they didn't need AI's help, respond with "known". If the person tried to use AI to help them answer the question, respond with "tried". If the person did not put in effort to answer the question, respond with "low effort".
2. Does this conversation provide a final answer to the question? Respond with a JSON object that contains one key "attempted_answer" with a value that is true or false. If "attempted_answer" is true, then include a second key "answer_val" with the final answer's value in quotations.
   If the final answer value matches one of the answer options, include a third key "answer_letter" which is the letter corresponding to the matching answer option.

Most of the fields in ``conversations.json`` are the same as the columns in ``user_answers.csv``, with the following additional fields:

- **chat_history** (``list``): the user-AI chat transcript.
- **user_effort** (``str``): classifying the user's effort in the conversation (task 1 above), one of "tried", "known", or "low effort". We find that the majority of user_effort is "tried" (93%-94%), followed by "low effort" (5%) then "known" (1%).
- **attempted_answer** (``bool``): whether the conversation provides a final answer to the question (task 2 above). This is true 77-80% of the time.
- **answer_val** (``str``): describes the final answer's value (task 2 above). This is only included if attempted_answer is true, otherwise this field is null.
- **answer_letter** (``str``): the answer letter that the final answer's value corresponds to (task 2 above), one of {"A", "B", "C", "D"}. This is only included if attempted_answer is true and answer_val corresponds to one of the answer options, otherwise this field is null.

AI-alone results are provided in ``ai_alone_answers.csv``. Each row corresponds to a question and a model. The columns are:

- **model** (``str``): the LLM, one of "gpt-4o" or "llama-3.1-8b".
- **dataset** (``str``): the MMLU dataset that this question came from, one of {"elementary_mathematics", "high_school_mathematics", "college_mathematics", "conceptual_physics", "moral_reasoning"}.<sup>1</sup>
- **question_id** (``str``): the question ID, of the form \<dataset\>-redux-\<number of the question in MMLU-Redux\>. This ID matches the IDs in the user study files.
- **{method}_count** (``int``): the number of answers we have from the LLM for this question and this AI-alone method. This is always 50 for the letter-only methods and almost always 50 for free-text.
- **{method}_acc** (``float``): the proportion of times out of {method}_count that the LLM gave the correct answer.
- **{method}_invalid** (``int``): the number of times out of {method}_count where the LLM gave an answer but it was not valid according to the method. This is only applicable to letter-only methods, which require the model to respond with a letter "A" through "D". Invalid rates are low: below 5% for 90% of questions with letter-only zero-shot and 99.6% of questions with letter-only few-shot.
- **{method}_acc_valid** (``float``): the proportion of times out of valid answers (i.e., {method}_count - {method}_invalid) that the LLM gave the correct answer. {method}_acc_valid is always >= {method}_acc, since {method}_acc treats invalid answers as incorrect.

The method is one of {"letter_only_zero_shot", "letter_only_few_shot", "free_text"}.

The questions used in the user study and the AI-alone experiments are provided in ``questions.csv``. The columns are:

- **question_id** (``str``): the question ID, of the form \<dataset\>-redux-\<number of the question in MMLU-Redux\>. This ID matches the IDs in the user study files and ``ai_alone_answers.csv``.
- **dataset** (``str``): the MMLU dataset that this question came from, one of {"elementary_mathematics", "high_school_mathematics", "college_mathematics", "conceptual_physics", "moral_reasoning"}.<sup>1</sup>
- **question** (``str``): the original text of the question, used in the AI-alone experiments.
- **option_A** (``str``): answer option A, used in the AI-alone experiments.
- **option_B** (``str``): answer option B, used in the AI-alone experiments.
- **option_C** (``str``): answer option C, used in the AI-alone experiments.
- **option_D** (``str``): answer option D, used in the AI-alone experiments.
- **answer** (``str``): the correct answer, one of {"A", "B", "C", "D"}.
- **question_formatted** (``str``): the text of the question formatted for our user study.
  We made very minimal edits so that the questions would be easier to read for users, e.g., adding "\(" and "\)" for math to render in MathJax or adding newlines between moral scenarios. No words were edited.
- **option_A_formatted** (``str``): answer option A formatted for the user study.
- **option_B_formatted** (``str``): answer option B formatted for the user study.
- **option_C_formatted** (``str``): answer option C formatted for the user study.
- **option_D_formatted** (``str``): answer option D formatted for the user study.

## Citation

    @article{chang2025chatbench,
      title={ChatBench: From Static Benchmarks to Human-AI Evaluation},
      author={Serina Chang and Ashton Anderson and Jake Hofman},
      journal={arXiv preprint arXiv:2504.07114},
      year={2025},
    }

## Contact

- serinac@berkeley.edu
- jmh@microsoft.com
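
## Usage Sketch

The column descriptions under File Structure suggest two simple computations: mean accuracy per ``answer_type`` from ``user_answers.csv``, and the relation between ``{method}_acc`` and ``{method}_acc_valid`` in ``ai_alone_answers.csv``. The following is a minimal sketch, not the authors' analysis code. The sample rows are made up purely for illustration, the function names are hypothetical, and returning ``None`` when there are no valid answers is an assumption about how that edge case might be handled.

```python
import csv
import io
from collections import defaultdict

# Made-up rows that only mimic the documented user_answers.csv columns.
SAMPLE = """worker_id,answer_type,dataset,question_id,acc
w1,user-alone,elementary_mathematics,elementary_mathematics-redux-1,1
w1,user-alone,attention_check,attention_check-1,1
w1,user-AI,elementary_mathematics,elementary_mathematics-redux-2,1
w2,user-alone,conceptual_physics,conceptual_physics-redux-3,0
w2,user-AI,conceptual_physics,conceptual_physics-redux-4,1
w2,user-AI,conceptual_physics,conceptual_physics-redux-5,0
"""


def accuracy_by_answer_type(csv_file):
    """Mean of `acc` per `answer_type`, skipping attention-check rows."""
    totals = defaultdict(lambda: [0, 0])  # answer_type -> [correct, n]
    for row in csv.DictReader(csv_file):
        if row["dataset"] == "attention_check":
            continue  # the attention check is not an MMLU question
        totals[row["answer_type"]][0] += int(row["acc"])
        totals[row["answer_type"]][1] += 1
    return {k: c / n for k, (c, n) in totals.items()}


def acc_valid(acc, count, invalid):
    """Recompute {method}_acc_valid from {method}_acc, _count and _invalid.

    acc = correct / count and acc_valid = correct / (count - invalid),
    so acc_valid = acc * count / (count - invalid).
    """
    valid = count - invalid
    if valid == 0:
        return None  # assumption: undefined when no answer is valid
    return acc * count / valid


result = accuracy_by_answer_type(io.StringIO(SAMPLE))
print(result)
print(acc_valid(0.5, 50, 10))  # 25 correct / 40 valid = 0.625
```

To run this against the real data, one would pass an open file handle for, e.g., ``full_study/user_answers_filtered.csv`` instead of the in-memory sample. Because the denominator can only shrink when invalid answers are excluded, the recomputed value is never below ``{method}_acc``, which matches the description of ``{method}_acc_valid`` above.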


Provider:

maas

Created:

2025-07-22
