rbiswasfc/scrolls-quality-mcq

Name: rbiswasfc/scrolls-quality-mcq
Creator: rbiswasfc
Published: 2024-06-07 09:40:43
License: MIT (per the dataset card metadata)

Hugging Face | Updated 2024-06-07 | Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/rbiswasfc/scrolls-quality-mcq


Description:

---
license: mit
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: A
    dtype: string
  - name: B
    dtype: string
  - name: C
    dtype: string
  - name: D
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 63759933.69322235
    num_examples: 2517
  - name: test
    num_bytes: 52057383.0
    num_examples: 2086
  download_size: 19849080
  dataset_size: 115817316.69322234
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

This dataset is derived from the `tau/scrolls` [dataset](tau/scrolls) by running the following script:

```python
import re

from datasets import load_dataset

quality_dataset = load_dataset("tau/scrolls", "quality")


def parse_example(example):
    text = example["input"]

    # Collect the four answer options, keyed by letter.
    options = dict(re.findall(r"\((A|B|C|D)\) ([^\n]+)", text))

    # Everything after the (D) option line is the document; everything
    # before it is the question followed by options (A) to (C).
    question_part, context = re.split(r"\(D\) [^\n]+\n", text, maxsplit=1)
    question = re.sub(r"\([A-D]\) [^\n]+\n?", "", question_part).strip()

    result = {"question": question, "context": context.strip(), **options}

    if not all(key in result for key in ["A", "B", "C", "D"]):
        raise ValueError("One or more options (A, B, C, D) are missing!")

    # get label: index of the option whose text equals the gold answer,
    # or -1 when there is no answer or no option matches
    label = -1
    answer = example["output"]
    if answer is None:
        answer = ""

    for idx, option in enumerate([options["A"], options["B"], options["C"], options["D"]]):
        if answer.strip() == option.strip():
            label = idx

    result["label"] = label
    return result


quality_dataset = quality_dataset.map(parse_example)

# Drop records without a usable label.
quality_dataset = quality_dataset.filter(lambda x: x["label"] >= 0)

train_ds = quality_dataset["train"].remove_columns(["pid", "input", "output"])
test_ds = quality_dataset["validation"].remove_columns(["pid", "input", "output"])
```

Specifically, only the `quality` subset is kept and processed into MCQ format. The `test` split of the original dataset is removed because it has no ground-truth labels. Instead, the validation split is assigned as test.

- Number of examples in train: ~2.5k
- Number of examples in test: ~2.1k

This dataset can be used to test the performance of a model on long contexts.

Input tokens as per the [llama2](bclavie/bert24_32k_tok_llama2) tokenizer: mean 7.4k, SD 2.3k, max 11.6k

---

Relevant sections from the [SCROLLS: Standardized CompaRison Over Long Language Sequences paper](https://arxiv.org/pdf/2201.03533)

```
QuALITY (Pang et al., 2021): A multiple-choice question answering dataset over stories and articles sourced from Project Gutenberg, the Open American National Corpus (Fillmore et al., 1998; Ide and Suderman, 2004), and more. Experienced writers wrote questions and distractors, and were incentivized to write answerable, unambiguous questions such that in order to correctly answer them, human annotators must read large portions of the given document. To measure the difficulty of their questions, Pang et al. conducted a speed validation process, where another set of annotators were asked to answer questions given only a short period of time to skim through the document. As a result, 50% of the questions in QuALITY are labeled as hard, i.e. the majority of the annotators in the speed validation setting chose the wrong answer.
```

Provider:

rbiswasfc

Summary of original information

Dataset overview

Dataset features

id: string
question: string
context: string
A: string
B: string
C: string
D: string
label: int64
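
As an illustration of how these fields fit together, the sketch below renders one record as a plain-text multiple-choice prompt and maps the integer label back to an option letter. The mapping 0 to A, 1 to B, and so on follows the order in which the derivation script enumerates the options. The prompt layout, the helper names `format_prompt` and `label_to_letter`, and the sample record are assumptions made for this example, not part of the dataset.

```python
LETTERS = ["A", "B", "C", "D"]


def format_prompt(row):
    """Render one record as a plain-text multiple-choice prompt (hypothetical layout)."""
    options = "\n".join(f"({letter}) {row[letter]}" for letter in LETTERS)
    return f"{row['context']}\n\nQuestion: {row['question']}\n{options}\nAnswer:"


def label_to_letter(label):
    """Map the integer label (0..3) to its option letter (A..D)."""
    return LETTERS[label]


# Hypothetical record with the same fields as the dataset (not real data).
row = {
    "id": "example-0",
    "question": "What color is the ship?",
    "context": "The ship was painted red before launch.",
    "A": "Blue",
    "B": "Red",
    "C": "Green",
    "D": "White",
    "label": 1,
}

print(format_prompt(row))
print("gold:", label_to_letter(row["label"]))
```

Placing the long context before the question is only one possible choice; real contexts in this dataset run to thousands of tokens, so the ordering may matter for some models.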

Dataset splits

train: 2517 examples, 63759933.69322235 bytes in total
test: 2086 examples, 52057383.0 bytes in total

Dataset size

Download size: 19849080 bytes
Total dataset size: 115817316.69322234 bytes
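
The size figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers reported in the card metadata and checks that the two split sizes add up to the reported dataset size. It also derives the download-to-dataset size ratio and the mean record size per split, which are computed here and not stated in the card.

```python
# Figures copied from the dataset card metadata.
train_bytes = 63759933.69322235
test_bytes = 52057383.0
train_examples = 2517
test_examples = 2086
download_size = 19849080
dataset_size = 115817316.69322234

# The reported dataset size should equal the sum of the split sizes.
total = train_bytes + test_bytes
print(f"sum of splits: {total:.2f} bytes")
print(f"matches dataset_size: {abs(total - dataset_size) < 1e-3}")

# Derived quantities (not stated in the card).
print(f"download / dataset size: {download_size / dataset_size:.3f}")
print(f"mean bytes per train example: {train_bytes / train_examples:.0f}")
print(f"mean bytes per test example: {test_bytes / test_examples:.0f}")
```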

Dataset configuration

config_name: default
data_files:
- train: path data/train-*
- test: path data/test-*
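
Given this configuration, loading the two splits with the Hugging Face `datasets` library might look like the sketch below. It assumes the `datasets` package is installed and that the Hub is reachable. The helper name `load_splits` and the `EXPECTED_COLUMNS` constant are introduced here for illustration only.

```python
DATASET_ID = "rbiswasfc/scrolls-quality-mcq"

# Column names as listed in the dataset card features.
EXPECTED_COLUMNS = ["id", "question", "context", "A", "B", "C", "D", "label"]


def load_splits():
    """Download the default config and return (train, test)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # The card lists a single config ("default"), so no config name is passed.
    ds = load_dataset(DATASET_ID)
    return ds["train"], ds["test"]
```

Calling `train_ds, test_ds = load_splits()` triggers the download; comparing `train_ds.column_names` against `EXPECTED_COLUMNS` is a quick sanity check.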

Dataset usage

Tests model performance on long-context inputs.
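
One simple way to score a model on this kind of task is plain accuracy over the integer labels. The sketch below assumes predictions are already expressed as option indices 0 to 3. The function name `accuracy` and the toy inputs are illustrative, and the card does not prescribe a metric.

```python
def accuracy(predictions, labels):
    """Fraction of predicted option indices that equal the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example (not real model output): three of four predictions are right.
labels = [1, 0, 3, 2]
predictions = [1, 2, 3, 2]
print(accuracy(predictions, labels))  # 0.75
```

With four options per question, uniform random guessing gives an expected accuracy of 0.25, which is a useful floor when reading results.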
Input token statistics (llama2 tokenizer):
- Mean: 7.4k
- Standard deviation: 2.3k
- Maximum: 11.6k
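
Statistics of this kind can be reproduced with a few lines of code. The sketch below is a stand-in: it counts whitespace-separated tokens instead of using the llama2 tokenizer, so its numbers will differ from those reported above. The card does not say which fields were tokenized or whether the standard deviation is the sample or population form; the sample form is assumed here.

```python
import statistics


def token_length_stats(texts, tokenize=str.split):
    """Mean, sample standard deviation, and maximum of per-text token counts.

    `tokenize` is any callable that returns a list of tokens; the default
    whitespace split is only a placeholder for a real tokenizer.
    """
    lengths = [len(tokenize(text)) for text in texts]
    return {
        "mean": statistics.mean(lengths),
        "sd": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "max": max(lengths),
    }


# Toy inputs with 3, 5, and 1 whitespace tokens.
texts = ["a b c", "a b c d e", "a"]
stats = token_length_stats(texts)
print(stats)
```

To approximate the reported figures, pass a real tokenizer's encoding function as `tokenize` and feed it the full model input for each record.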

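Dataset source

The dataset is derived from the quality subset of tau/scrolls, converted into multiple-choice question format.
The test split of the original dataset was removed because it lacks ground-truth labels; the validation split is used as the test set.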
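
To make the conversion concrete, the sketch below applies the same regular expressions as the derivation script to one synthetic input string. The input layout (question, then four option lines, then the document) is an assumption about how `tau/scrolls` formats `quality` records. The sample text and the helper names `split_input` and `find_label` are invented for this example.

```python
import re


def split_input(text):
    """Split a raw input string into (question, options, context)."""
    options = dict(re.findall(r"\((A|B|C|D)\) ([^\n]+)", text))
    # The (D) option line separates the question block from the document.
    question_part, context = re.split(r"\(D\) [^\n]+\n", text, maxsplit=1)
    question = re.sub(r"\([A-D]\) [^\n]+\n?", "", question_part).strip()
    return question, options, context.strip()


def find_label(options, answer):
    """Index (0..3) of the option matching the answer text, or -1 if none does."""
    answer = "" if answer is None else answer
    label = -1
    for idx, letter in enumerate("ABCD"):
        if answer.strip() == options[letter].strip():
            label = idx
    return label


# Synthetic record (not taken from the dataset).
text = (
    "Why did the crew turn back?\n\n"
    " (A) Fuel ran low\n"
    " (B) A storm hit\n"
    " (C) Orders changed\n"
    " (D) The map was wrong\n\n\n"
    "The crew set out at dawn. By noon the fuel gauge was nearly empty."
)

question, options, context = split_input(text)
print(question)
print(options)
print(find_label(options, "Fuel ran low"))  # 0
print(find_label(options, None))            # -1, such records are filtered out
```

Records whose label comes back as -1 are the ones the derivation script drops, which is why splits without ground-truth answers cannot be kept.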

Number of examples

train: about 2.5k examples
test: about 2.1k examples

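Dataset difficulty

According to the QuALITY dataset description, 50% of the questions are labeled as hard, meaning that the majority of annotators in the speed validation setting chose the wrong answer.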
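
The "hard" criterion described above amounts to a majority check over the speed-validation answers. The sketch below is a hypothetical illustration of that rule only. This dataset's features do not include per-annotator answers or a difficulty flag, and the function name `is_hard` and the sample answers are invented.

```python
def is_hard(speed_answers, gold):
    """True when more than half of the speed-validation answers are wrong."""
    if not speed_answers:
        raise ValueError("need at least one annotator answer")
    wrong = sum(1 for answer in speed_answers if answer != gold)
    return wrong > len(speed_answers) / 2


# Invented annotator answers for two questions whose gold answer is "A".
print(is_hard(["B", "C", "A", "B", "D"], gold="A"))  # True: 4 of 5 are wrong
print(is_hard(["A", "A", "B"], gold="A"))            # False: 1 of 3 is wrong
```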
