Dataset of Decision-Theoretic Reasoning Questions on Newcomb-like Problems

Name: Dataset of Decision-Theoretic Reasoning Questions on Newcomb-like Problems
Creator: Carnegie Mellon University
Published: 2024-11-16 05:19:04
License: not specified

Source: arXiv. Updated 2024-11-16; indexed 2024-11-21.

Download link:

https://github.com/casparoe/newcomblike_questions_dataset

Overview:

This dataset of decision-theoretic reasoning questions was created by researchers from Carnegie Mellon University and other institutions. It contains 537 natural-language questions about Newcomb-like problems in decision theory. The dataset is moderate in size and covers both capabilities questions and attitude questions, and it aims to evaluate how language models perform at decision-theoretic reasoning. The questions were generated and validated by hand over hundreds of hours to ensure their quality. The dataset is mainly used to evaluate and improve the reasoning of language models in complex decision scenarios, especially those that involve interaction and cooperation among multiple models.

Provider:

Carnegie Mellon University

Created:

2024-11-16

Summary of original information

Newcomblike Questions Dataset

Dataset overview

Data format:
- Primary format: setting*.json
- Easy-to-use format: .jsonl
Data location: data/data.zip (password: onebox)
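
As a minimal sketch of how the archive and the .jsonl files might be read, the helpers below use only the Python standard library. The function names extract_archive and read_jsonl are hypothetical, and the file names inside data/data.zip are not stated on this page, so none are assumed. One caveat: the standard zipfile module can only decrypt legacy ZipCrypto archives; if the archive uses AES encryption, a separate tool would be needed.

```python
import json
import zipfile


def extract_archive(zip_path, dest_dir, password="onebox"):
    """Extract a (possibly password-protected) zip; return the member names.

    Assumption: the archive uses legacy ZipCrypto encryption, which is the
    only scheme the standard-library zipfile module can decrypt.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(path=dest_dir, pwd=password.encode("utf-8"))
        return archive.namelist()


def read_jsonl(path):
    """Read a .jsonl file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Hypothetical usage (paths depend on where the repository is cloned):
# names = extract_archive("data/data.zip", "data/extracted")
# questions = read_jsonl("data/extracted/<some file>.jsonl")
```

Printing the returned member names first is a simple way to see which setting*.json and .jsonl files the archive actually contains before loading any of them.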

Data analysis

Analysis script: print_dataset_analysis.py
Analysis content: label counts for the dataset and related statistics, as described in the paper.
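
The kind of label counting described here can be sketched as follows. This is not the repository's print_dataset_analysis.py; the record keys "type" and "tags" and the toy records are invented for illustration, and the real schema may differ.

```python
from collections import Counter


def count_labels(records, type_key="type", tags_key="tags"):
    """Count question types and tags across a list of question records.

    Assumption: each record is a dict with a question type under `type_key`
    and a list of tag strings under `tags_key` (hypothetical key names).
    """
    type_counts = Counter()
    tag_counts = Counter()
    for record in records:
        type_counts[record.get(type_key, "unknown")] += 1
        tag_counts.update(record.get(tags_key, []))
    return type_counts, tag_counts


# Toy records, invented for illustration only.
toy_records = [
    {"type": "capabilities", "tags": ["apply EDT"]},
    {"type": "capabilities", "tags": ["apply CDT", "Fauxcomb"]},
    {"type": "attitude", "tags": ["Fauxcomb"]},
]

type_counts, tag_counts = count_labels(toy_records)
for label, count in sorted(tag_counts.items()):
    print(f"{label}: {count}")
```

Because a question can carry several tags, the tag counts can sum to more than the number of questions, while the type counts always sum to the number of records.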

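Results and analysis

Results storage: results_db_new(.zip) (password: onebox)
Result generation pipeline:
1. run_benchmarks.py: processes the questions in the setting*.json files and saves the results as individual JSON files.
2. generate_subject_question_df.py: compiles the results into two pandas DataFrames, one for attitude questions and one for capabilities questions.
3. The DataFrames are stored in the dataframes folder.
4. Generate the result figures:
  - generate_difficulty_distribution.py: produces the difficulty distribution plot.
  - generate_two_scores_csv.py: produces the data for the scatter plots.
  - R files in R_plotting: produce the scatter plots.
  - result_analysis_pandas.ipynb: produces score tables and linear model fits, and creates low-quality scatter plots.
  - generate_barplots.py: produces the bar plots used in the paper.
5. fitting_linear_models.py: fits linear models and prints the model summaries.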
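
The order of the Python steps above can be sketched as a small driver. This is an illustration only: it assumes each script runs from the repository root with no command-line arguments, which this page does not confirm, and it leaves out the R files and the notebook because they are not plain Python scripts. The names PIPELINE, build_commands, and run_pipeline are hypothetical.

```python
import subprocess
import sys

# Script order taken from the pipeline steps listed above.
PIPELINE = [
    "run_benchmarks.py",
    "generate_subject_question_df.py",
    "generate_difficulty_distribution.py",
    "generate_two_scores_csv.py",
    "generate_barplots.py",
    "fitting_linear_models.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Turn script names into argv lists (assumes no extra arguments)."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, dry_run=True):
    """Run each command in order, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            subprocess.run(command, check=True)


run_pipeline(build_commands())  # dry run: prints the commands only
```

Passing check=True makes a failing step raise an exception, so later steps never run on missing or partial results.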

Collected summary

Dataset introduction

Construction

The dataset was written by the first author over a period of approximately six months. The author, an expert in decision theory with multiple publications on Newcomb-like problems in top-tier venues, created the majority of the 537 multiple-choice questions. A few dozen questions were co-created with other authors. The dataset includes both capabilities questions, which have a unique, uncontroversially correct answer, and attitude questions, which elicit opinions on which decision theorists may disagree. The first and second authors validated the dataset by independently checking each question for typos, ambiguities, and correctness, often with the help of a top-tier language model.

Characteristics

The dataset covers a wide range of decision-theoretic scenarios and concepts. It includes 407 capabilities questions and 130 attitude questions (537 in total), with tags that distinguish question types such as 'apply EDT,' 'apply CDT,' and 'Fauxcomb.' Every question went through manual validation. The questions test both basic and advanced concepts in decision theory, so the dataset suits models with varying levels of expertise.

Usage

The dataset can be utilized to evaluate the decision-theoretic capabilities and expressed attitudes of language models. It is designed for models that can perform chain of thought reasoning, allowing for a deeper analysis of their reasoning processes. The dataset is divided into capabilities and attitude questions, each with specific tags that guide the evaluation process. Researchers can use the dataset to assess how well models can apply decision theories like EDT and CDT, and to explore the consistency of models' expressed attitudes across different types of questions and prompting methods.
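
A scoring loop for this kind of evaluation might look like the sketch below. Everything about the record layout is an assumption made for illustration: the keys "type", "prompt", "correct", and "edt_option" are hypothetical, and reporting the share of EDT-aligned answers on attitude questions is one possible summary, not necessarily the one the authors use. The answer_fn callable stands in for whatever code queries a model and extracts its chosen option.

```python
def evaluate(questions, answer_fn):
    """Score capabilities accuracy and the share of EDT-aligned attitudes.

    Assumed (hypothetical) record keys:
      "type":       "capabilities" or "attitude"
      "prompt":     the question text shown to the model
      "correct":    the correct option (capabilities questions only)
      "edt_option": the EDT-aligned option (attitude questions only)
    answer_fn maps a prompt string to the option the model chose.
    """
    correct = total_capabilities = 0
    edt_aligned = total_attitude = 0
    for question in questions:
        choice = answer_fn(question["prompt"])
        if question["type"] == "capabilities":
            total_capabilities += 1
            correct += int(choice == question["correct"])
        else:
            total_attitude += 1
            edt_aligned += int(choice == question["edt_option"])
    return {
        "capabilities_accuracy": (
            correct / total_capabilities if total_capabilities else None
        ),
        "edt_share": edt_aligned / total_attitude if total_attitude else None,
    }


# Toy questions and a stand-in "model" that always answers "A".
toy_questions = [
    {"type": "capabilities", "prompt": "q1", "correct": "A"},
    {"type": "capabilities", "prompt": "q2", "correct": "B"},
    {"type": "attitude", "prompt": "q3", "edt_option": "A"},
]
scores = evaluate(toy_questions, lambda prompt: "A")
print(scores)
```

Keeping the two question types in separate tallies matters because attitude questions have no correct answer, so mixing them into an accuracy figure would be meaningless.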

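Background and challenges

Background overview

The dataset was created jointly by researchers at Carnegie Mellon University, Williams College, and Anthropic to study decision-theoretic reasoning in so-called Newcomb-like problems. A Newcomb-like problem involves an agent interacting with another, similar agent, so the agent must take into account that the other agent is likely to reason in a similar way. Evaluating how language models reason about Newcomb-like problems matters because interactions between agents built on foundation models are often Newcomb-like, and some ways of reasoning may allow greater cooperation between models. The dataset contains 537 multiple-choice questions, split into capabilities questions and attitude questions. It is used to study the decision-theoretic capabilities and expressed attitudes of existing models (from OpenAI, Anthropic, Meta, GDM, Reka, and others) and how the two interact.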

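Current challenges

Building the dataset posed several challenges. First, there is the domain challenge of designing questions that accurately measure a model's reasoning on Newcomb-like problems. Second, generating and validating the questions is complex: it took hundreds of hours, largely because domain experts wrote and checked the questions by hand. Third, the questions need enough diversity and coverage to test models' decision-theoretic capabilities and attitudes comprehensively. Finally, keeping the questions from leaking into training data is an important safeguard for fair evaluation.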

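Common use cases

Typical use case

The dataset's typical use is in decision theory. It uses natural-language questions to evaluate how well language models reason decision-theoretically about so-called Newcomb-like problems. These include decision problems in which an agent interacts with other, similar agents and therefore has to reason about the fact that those agents may reason in similar ways. This setting is especially relevant for evaluating interactions between agents built on foundation models, because such interactions are often Newcomb-like.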

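Derived and related work

Several lines of work build on the dataset. For example, researchers have used it to examine how models differ in decision-theoretic capability and how that capability relates to the decision-theoretic attitudes the models express. Other work studies how simple prompt-based interventions can shift a model's decision-theoretic attitudes. These studies extend the understanding of AI decision-theoretic capabilities and offer insights for the design and optimization of future AI systems.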

Recent research on the dataset