rbiswasfc/scrolls-quality-mcq
Source: Hugging Face. Updated 2024-06-07; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/rbiswasfc/scrolls-quality-mcq
Overview:
---
license: mit
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: A
    dtype: string
  - name: B
    dtype: string
  - name: C
    dtype: string
  - name: D
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 63759933.69322235
    num_examples: 2517
  - name: test
    num_bytes: 52057383.0
    num_examples: 2086
  download_size: 19849080
  dataset_size: 115817316.69322234
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
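As a minimal sketch of the schema declared in the metadata above, the following check verifies that a row carries the listed fields with the listed types. The helper name `validate_row` and the 0 to 3 label range are assumptions made for illustration (the range follows from the processing script, which indexes options A to D and filters out negative labels); this is not part of the dataset's tooling.

```python
# Expected fields and Python types, mirroring the `features` list in the metadata.
EXPECTED_FIELDS = {
    "id": str,
    "question": str,
    "context": str,
    "A": str,
    "B": str,
    "C": str,
    "D": str,
    "label": int,
}


def validate_row(row):
    """Return a list of human-readable problems; an empty list means the row looks valid."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(row[name]).__name__}"
            )
    # Assumption: labels index the options A-D, so they should fall in 0..3.
    label = row.get("label")
    if isinstance(label, int) and not 0 <= label <= 3:
        problems.append("label outside 0..3")
    return problems


# Hypothetical row, invented for this example.
example_row = {
    "id": "example-0",
    "question": "Who owns the ship?",
    "context": "The cook had bought the ship years ago.",
    "A": "The captain",
    "B": "The cook",
    "C": "The pilot",
    "D": "Nobody",
    "label": 1,
}
row_problems = validate_row(example_row)
```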
This dataset is derived from the `tau/scrolls` [dataset](tau/scrolls) by running the following script:
```python
import re

from datasets import load_dataset

# Load only the QuALITY subset of SCROLLS.
quality_dataset = load_dataset("tau/scrolls", "quality")


def parse_example(example):
    """Split a raw SCROLLS input into question, options A-D, context, and label."""
    text = example["input"]
    # Collect option lines such as "(A) some answer text".
    options = dict(re.findall(r"\((A|B|C|D)\) ([^\n]+)", text))
    # Everything after the "(D) ..." line is the context passage.
    question_part, context = re.split(r"\(D\) [^\n]+\n", text, maxsplit=1)
    # Strip the remaining option lines so that only the question is left.
    question = re.sub(r"\([A-D]\) [^\n]+\n?", "", question_part).strip()

    result = {"question": question, "context": context.strip(), **options}
    if not all(key in result for key in ["A", "B", "C", "D"]):
        raise ValueError("One or more options (A, B, C, D) are missing!")

    # Get the label: index (0-3) of the option equal to the answer, or -1 if none matches.
    label = -1
    answer = example["output"]
    if answer is None:
        answer = ""
    for idx, option in enumerate([options["A"], options["B"], options["C"], options["D"]]):
        if answer.strip() == option.strip():
            label = idx
    result["label"] = label
    return result


quality_dataset = quality_dataset.map(parse_example)
# Drop examples whose answer matched none of the options.
quality_dataset = quality_dataset.filter(lambda x: x["label"] >= 0)

train_ds = quality_dataset["train"].remove_columns(["pid", "input", "output"])
test_ds = quality_dataset["validation"].remove_columns(["pid", "input", "output"])
```
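To illustrate what the parsing step in the script above does, here is a minimal sketch that applies the same three regular expressions to a made-up input string, without downloading anything. The sample text and the helper name `parse_text` are hypothetical; the layout (question, four option lines, then the context) is assumed from the regexes in the script, not taken from a real SCROLLS record.

```python
import re


def parse_text(text, answer):
    """Apply the script's regexes to a plain string and return the MCQ fields."""
    options = dict(re.findall(r"\((A|B|C|D)\) ([^\n]+)", text))
    question_part, context = re.split(r"\(D\) [^\n]+\n", text, maxsplit=1)
    question = re.sub(r"\([A-D]\) [^\n]+\n?", "", question_part).strip()
    # Label is the position of the option that equals the answer, else -1.
    label = -1
    for idx, key in enumerate("ABCD"):
        if (answer or "").strip() == options[key].strip():
            label = idx
    return {"question": question, "context": context.strip(), **options, "label": label}


# Hypothetical input in the assumed layout.
sample = (
    "Who owns the ship?\n\n"
    " (A) The captain\n"
    " (B) The cook\n"
    " (C) The pilot\n"
    " (D) Nobody\n\n\n"
    "The ship drifted for days. The cook had bought it years ago."
)
parsed = parse_text(sample, "The cook")
```

Note that the option regex scans the whole input, so a context passage that itself contains a line such as `(A) ...` would overwrite the parsed option; the sample avoids that case.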
Specifically, only the `quality` subset is kept and processed into MCQ format. The `test` split of the original dataset is removed because it has no ground-truth labels.
Instead, the validation split is used as the test split.
Number of examples in train: ~2.5k
Number of examples in test: ~2.1k
This dataset can be used to evaluate a model's performance on long contexts.
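One way such an evaluation could look is sketched below. The prompt template, the helper names `format_prompt` and `accuracy`, and the `predict` callable are all assumptions for illustration; only the field names and the mapping of `label` 0 to 3 onto options A to D come from the dataset itself.

```python
LETTERS = ["A", "B", "C", "D"]


def format_prompt(row):
    """Build a prompt from one row: context first, then question and options."""
    options = "\n".join(f"({letter}) {row[letter]}" for letter in LETTERS)
    return f"{row['context']}\n\nQuestion: {row['question']}\n{options}\nAnswer:"


def accuracy(rows, predict):
    """Fraction of rows where predict(prompt) returns the correct option letter."""
    rows = list(rows)
    correct = sum(predict(format_prompt(row)) == LETTERS[row["label"]] for row in rows)
    return correct / len(rows)


# Hypothetical rows and a trivial baseline that always answers "A".
toy_rows = [
    {"context": "c1", "question": "q1", "A": "w", "B": "x", "C": "y", "D": "z", "label": 0},
    {"context": "c2", "question": "q2", "A": "w", "B": "x", "C": "y", "D": "z", "label": 2},
]
baseline_accuracy = accuracy(toy_rows, lambda prompt: "A")
```

In practice `predict` would wrap a model call that maps the prompt to one of the four letters; placing the long context before the question is a design choice of this sketch, not something the dataset prescribes.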
Input tokens as per the [llama2](bclavie/bert24_32k_tok_llama2) tokenizer: mean 7.4k, SD 2.3k, max 11.6k.
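Statistics of this kind could be computed roughly as follows. This is a sketch under stated assumptions: the source does not say which fields make up the "input" or whether the standard deviation is the population or sample variant, so the helper `length_stats` takes any tokenizing callable and uses the population form. A whitespace split stands in for the real tokenizer so the example runs offline.

```python
import statistics


def length_stats(texts, tokenize):
    """Mean, standard deviation, and maximum of token counts over `texts`."""
    lengths = [len(tokenize(text)) for text in texts]
    return {
        "mean": statistics.mean(lengths),
        "sd": statistics.pstdev(lengths),  # assumption: population SD
        "max": max(lengths),
    }


# Stand-in tokenizer: whitespace split. With the transformers library one could
# instead pass something like `lambda t: tokenizer(t)["input_ids"]`, where
# `tokenizer` was loaded with AutoTokenizer.from_pretrained (assumed to work for
# the tokenizer repository linked above).
toy_texts = ["a b c", "a"]
toy_stats = length_stats(toy_texts, str.split)
```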
---
Relevant sections from the [SCROLLS: Standardized CompaRison Over Long Language Sequences paper](https://arxiv.org/pdf/2201.03533):
```
QuALITY (Pang et al., 2021): A multiple-choice question answering dataset over stories
and articles sourced from Project Gutenberg, the
Open American National Corpus (Fillmore et al.,
1998; Ide and Suderman, 2004), and more. Experienced writers wrote questions and distractors, and
were incentivized to write answerable, unambiguous questions such that in order to correctly answer
them, human annotators must read large portions
of the given document. To measure the difficulty
of their questions, Pang et al. conducted a speed
validation process, where another set of annotators
were asked to answer questions given only a short
period of time to skim through the document. As
a result, 50% of the questions in QuALITY are
labeled as hard, i.e. the majority of the annotators in the speed validation setting chose the wrong
answer.
```
Provider:
rbiswasfc
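Summary of original information
Dataset overview
Dataset features
- id: string
- question: string
- context: string
- A: string
- B: string
- C: string
- D: string
- label: int64
Dataset splits
- train: 2517 examples, 63759933.69322235 bytes
- test: 2086 examples, 52057383.0 bytes
Dataset size
- Download size: 19849080 bytes
- Total dataset size: 115817316.69322234 bytes
Dataset configuration
- config_name: default
- data_files:
  - train: path data/train-*
  - test: path data/test-*
Dataset usage
- Testing model performance on long contexts
- Input token statistics (llama2 tokenizer):
  - Mean: 7.4k
  - Standard deviation: 2.3k
  - Maximum: 11.6k
Dataset source
- Derived from the quality subset of the tau/scrolls dataset and converted to multiple-choice format.
- The test split of the original dataset was removed because it lacks ground-truth labels; the validation split is used as the test set.
Number of examples
- train: about 2.5k examples
- test: about 2.1k examples
Dataset difficulty
- According to the QuALITY dataset description, 50% of the questions are labeled hard, meaning that most annotators in the speed validation setting chose the wrong answer.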



