---
language:
- en
license: apache-2.0
---
# MiniMuSiQue by Morph Labs

**https://morph.so/blog/self-teaching/**

We describe two evaluation datasets derived from the MuSiQue multi-hop question-answering dataset: MiniMuSiQue-hard, filtered for questions that GPT-4 can answer but GPT-3.5 cannot and for which performance degrades significantly when the first pivot document is removed, and MiniMuSiQue-easy, a larger dataset of convoluted off-distribution single-hop question-answer pairs.
## Table of Contents
1. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#dataset-description" target="_blank">Dataset Description</a>**
2. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#uses" target="_blank">Uses</a>**
3. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#contact" target="_blank">Contact</a>**
4. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#blogpost-and-citation" target="_blank">Blogpost and Citation</a>**
## Dataset Description
We refined the MuSiQue dataset to focus on questions that demand complex multi-hop reasoning by selecting questions that (1) GPT-4 could answer but GPT-3.5 could not, and (2) were not answerable without the context relevant to the first reasoning step (the "first hop pivot document"). Specifically, we selected 768 random examples from the MuSiQue training set and ranked them by a combined score of difficulty (measured by the difference in ROUGE-L recall between GPT-4 and GPT-3.5) and necessity of multi-hop reasoning (assessed by the change in ROUGE-L recall when the first hop pivot document was removed). We refer to the 128 top-ranked examples as MiniMuSiQue, and obtain MiniMuSiQue-hard by associating the original difficult MuSiQue multi-hop question-answer pair with each example.

To additionally test off-distribution single-hop factual recall, we synthesized convoluted off-distribution single-hop question-answer pairs for up to five entities per document in MiniMuSiQue, resulting in the much larger single-hop dataset MiniMuSiQue-easy. Each MiniMuSiQue example consists of twenty documents sampled from different Wikipedia articles, to which we associate one hard MuSiQue multi-hop reasoning question for MiniMuSiQue-hard and many single-hop questions for MiniMuSiQue-easy.
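
The ranking step described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the dataset. The whitespace tokenization, the unweighted sum used as the "combined score", the choice to measure the pivot-removal drop on the stronger model, and every field and function name are assumptions made for illustration.

```python
from dataclasses import dataclass


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference: str, prediction: str) -> float:
    """ROUGE-L recall: LCS length divided by the number of reference tokens."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, pred) / len(ref)


@dataclass
class Candidate:
    # Hypothetical field names; they do not reflect the published schema.
    question_id: str
    gold_answer: str
    gpt4_answer: str           # stronger model, full context
    gpt35_answer: str          # weaker model, full context
    gpt4_answer_no_pivot: str  # stronger model, first hop pivot document removed


def combined_score(c: Candidate) -> float:
    """Assumed combination: difficulty plus multi-hop necessity, unweighted."""
    strong = rouge_l_recall(c.gold_answer, c.gpt4_answer)
    weak = rouge_l_recall(c.gold_answer, c.gpt35_answer)
    ablated = rouge_l_recall(c.gold_answer, c.gpt4_answer_no_pivot)
    difficulty = strong - weak     # gap between the two models
    necessity = strong - ablated   # drop when the pivot document is removed
    return difficulty + necessity


def select_top(candidates: list[Candidate], k: int = 128) -> list[Candidate]:
    """Keep the k highest-scoring candidates (128 of 768 in the description)."""
    return sorted(candidates, key=combined_score, reverse=True)[:k]
```

Under these assumptions, a question that the stronger model answers only when the pivot document is present, and that the weaker model misses, scores highest, which matches the two selection criteria stated above.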
- **Developed by:** **<a href="https://www.morph.so" target="_blank">Morph Labs</a>**
- **Refined from:** **<a href="https://arxiv.org/abs/2108.00573" target="_blank">MuSiQue</a>**
- **Language(s):** English
- **License:** **<a href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank">Apache 2.0</a>**
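
To make the per-example layout described in this section concrete, the sketch below models one example and computes an upper bound on the size of MiniMuSiQue-easy. The class and field names are hypothetical and are not the published schema. The bound also assumes exactly one single-hop pair per entity, which the description does not state.

```python
from dataclasses import dataclass, field

NUM_EXAMPLES = 128        # top-ranked examples kept
DOCS_PER_EXAMPLE = 20     # "twenty documents" per example
MAX_ENTITIES_PER_DOC = 5  # "up to five entities per document"


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class Example:
    # Hypothetical layout for illustration only.
    documents: list[str]                    # twenty Wikipedia-derived documents
    hard_question: QAPair                   # the MiniMuSiQue-hard multi-hop pair
    easy_questions: list[QAPair] = field(default_factory=list)  # single-hop pairs


def max_easy_pairs(num_examples: int = NUM_EXAMPLES) -> int:
    """Upper bound on single-hop pairs, assuming one pair per entity."""
    return num_examples * DOCS_PER_EXAMPLE * MAX_ENTITIES_PER_DOC
```

With these numbers, a single example carries at most 100 single-hop pairs and the whole set at most 12,800, compared with 128 multi-hop questions in MiniMuSiQue-hard.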
## Uses
Multi-hop questions, which require a series of interconnected reasoning steps over multiple documents, have historically been particularly challenging for models. However, creating multi-hop questions that truly necessitate knowledge-based reasoning is difficult. For instance, early benchmarks like HotpotQA were found to be largely solvable through shortcuts. Constructing questions and corresponding contexts that avoid such shortcuts, and verifying their effectiveness, requires a comprehensive dataset development process.

The MuSiQue dataset addresses many weaknesses of prior work and contains difficult multi-hop questions that are less susceptible to shortcuts. We derive MiniMuSiQue from the original MuSiQue to better assess a model's ability to answer multi-hop questions that truly necessitate knowledge-based reasoning.
## Contact
hello@morph.so
## Blogpost and Citation
**https://morph.so/blog/self-teaching/**
@misc{MiniMuSiQue,
title={MiniMuSiQue},
author={{Morph Labs} and Jesse Michael Han and Eric Yu and Bentley Long and Pranav Mital and Brando Miranda},
year={2023}}