WAFER-QA

Name: WAFER-QA
Creator: maas
Published: 2026-01-06 16:42:52
License: No description available

ModelScope Community. Updated 2026-01-06; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/Salesforce/WAFER-QA

Resource description:

# Dataset Card for WAFER-QA

- [Dataset Description](https://huggingface.co/datasets/Salesforce/WAFER-QA/blob/main/README.md#dataset-description)
- [Paper Information](https://huggingface.co/datasets/Salesforce/WAFER-QA/blob/main/README.md#paper-information)
- [Citation](https://huggingface.co/datasets/Salesforce/WAFER-QA/blob/main/README.md#citation)

## Dataset Description

[WAFER-QA](https://arxiv.org/abs/2506.03332) (Web-Augmented Feedback for Evaluating Reasoning) is a benchmark for evaluating LLM agents' resilience against factually supported deceptive feedback. Each sample includes web-retrieved evidence supporting an alternative answer, one that differs from the ground truth.

### 🗂️ Dataset Structure

The dataset consists of two splits:

**1. Contextual Split:** WAFER-QA (C)

- Questions with provided context
- Questions are sourced from: SearchQA, NewsQA, HotpotQA, DROP, TriviaQA, RelationExtraction, and NaturalQuestions.

**2. Non-contextual Split:** WAFER-QA (N)

- Questions without explicit context
- Questions are sourced from: ARC-Challenge, GPQA Diamond, and MMLU.
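
The split composition listed above can be written down as a small lookup table. The following is only a sketch: the helper names `SPLIT_SOURCES`, `_normalize`, and `split_for_source` are hypothetical, and the exact strings stored in the dataset's `source_dataset` field are not documented in this card, so the case- and punctuation-insensitive matching is an assumption.

```python
from typing import Dict, List, Optional

# Source benchmarks per split, copied from the list in this card.
# The keys reuse the split names shown in the card's usage snippet.
SPLIT_SOURCES: Dict[str, List[str]] = {
    "contextual": [
        "SearchQA", "NewsQA", "HotpotQA", "DROP",
        "TriviaQA", "RelationExtraction", "NaturalQuestions",
    ],
    "non_contextual": ["ARC-Challenge", "GPQA Diamond", "MMLU"],
}


def _normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'GPQA-Diamond' matches 'GPQA Diamond'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def split_for_source(source_name: str) -> Optional[str]:
    """Return the split a source benchmark belongs to, or None if it is not listed."""
    key = _normalize(source_name)
    for split, sources in SPLIT_SOURCES.items():
        if key in {_normalize(s) for s in sources}:
            return split
    return None


print(split_for_source("HotpotQA"))
print(split_for_source("gpqa_diamond"))
print(split_for_source("SQuAD"))
```

A table like this can serve as a sanity check that each record's source benchmark appears in the split the card says it should.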
### Fields

Each example in both splits contains the following fields:

- `id`: Unique identifier (each prefixed with 'waferqa_')
- `question`: The question text
- `answer`: The correct answer
- `has_counterevidence`: Boolean indicating if there is evidence online contradicting the answer
- `alternative_supported_answer`: Alternative answer supported by evidence
- `evidence`: Supporting evidence or context (with source URLs)
- `source_dataset`: Original dataset source
- `choices`: Multiple-choice options (for multiple-choice QA; empty for open-ended QA)

### Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Salesforce/WAFER-QA")

# Access contextual split
contextual_examples = dataset['contextual']

# Access non-contextual split
non_contextual_examples = dataset['non_contextual']
```

## Paper Information

- Paper: https://arxiv.org/abs/2506.03332
- Code: https://github.com/SalesforceAIResearch/AgentEval-WaferQA

## Citation

```bibtex
@article{ming2024helpful,
  title={Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows},
  author={Ming, Yifei and Ke, Zixuan and Nguyen, Xuan-Phi and Wang, Jiayu and Joty, Shafiq},
  journal={arXiv preprint arXiv:2506.03332},
  year={2024}
}
```

## Ethical Considerations

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
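
To make the field descriptions in this card concrete, here is a minimal sketch of filtering and tallying records by those fields. It runs on hand-written placeholder records, not real WAFER-QA samples. The helper names (`with_counterevidence`, `is_multiple_choice`, `count_by_source`) are hypothetical, and the value types are assumptions: `choices` is taken to be a list, and `evidence` and `alternative_supported_answer` to be strings.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def with_counterevidence(records: Iterable[Record]) -> List[Record]:
    """Keep only records whose `has_counterevidence` flag is true."""
    return [r for r in records if r.get("has_counterevidence")]


def is_multiple_choice(record: Record) -> bool:
    """Treat a record as multiple-choice when `choices` is non-empty."""
    return bool(record.get("choices"))


def count_by_source(records: Iterable[Record]) -> Counter:
    """Tally records per value of `source_dataset`."""
    return Counter(r["source_dataset"] for r in records)


# Hypothetical placeholder records; they are NOT taken from WAFER-QA.
toy_records: List[Record] = [
    {
        "id": "waferqa_toy_0",
        "question": "Placeholder multiple-choice question?",
        "answer": "A",
        "has_counterevidence": True,
        "alternative_supported_answer": "B",
        "evidence": "Placeholder evidence text (https://example.com)",
        "source_dataset": "MMLU",
        "choices": ["A", "B", "C", "D"],
    },
    {
        "id": "waferqa_toy_1",
        "question": "Placeholder open-ended question?",
        "answer": "placeholder answer",
        "has_counterevidence": False,
        "alternative_supported_answer": "",
        "evidence": "",
        "source_dataset": "TriviaQA",
        "choices": [],
    },
]

flagged = with_counterevidence(toy_records)
print([r["id"] for r in flagged])
print([is_multiple_choice(r) for r in toy_records])
print(count_by_source(toy_records))
```

The same helpers should accept a split loaded as in the Usage section (for example `dataset['contextual']`), since iterating a Hugging Face `Dataset` yields one dictionary per example. Whether the real field values match the types assumed here would need to be checked against the data.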

Provider:

maas

Created:

2025-08-16
