Infinigence/LVEval

Name: Infinigence/LVEval
Creator: Infinigence
Published: 2024-02-10 08:17:11
License: No description available

Source: Hugging Face. Updated 2024-02-10; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Infinigence/LVEval

Resource description:

---
license: mit
language:
- en
- zh
viewer: true
---

# Introduction

**LV-Eval** is a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k), reaching up to 256k words. The average number of words is 102,380, and the minimum/maximum number of words is 11,896/387,406. **LV-Eval** features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual (Chinese and English) datasets.
The design of **LV-Eval** incorporates three key techniques: confusing facts insertion (CFI), keyword and phrase replacement (KPR), and keyword-recall-based metrics (AK, short for metrics with Answer Keywords and a word blacklist). Together they make the benchmark more challenging, mitigate knowledge leakage, and give a more accurate evaluation of the long-context capability of LLMs. We anticipate that **LV-Eval** will serve as a valuable resource for future research on long-context LLMs.

The key characteristics of **LV-Eval** include:

* **Sufficiently long context length to evaluate state-of-the-art models**: **LV-Eval** comprises 5 length levels with word counts of 16k, 32k, 64k, 128k, and 256k. Test instances across these levels share the same set of question-answer (QA) pairs and differ only in the context content and length. Testing the same QA pairs with different context lengths allows a controlled evaluation of a model's long-context ability.
* **Incorporation of distraction and confusion to increase difficulty**: When constructing the context for each test instance, we mix distracting documents with supporting documents. This extends the context length and evaluates the model's ability to pinpoint key information in a large body of distracting text. In addition, we insert confusing facts, generated by GPT-4 and revised by human annotators, at random positions in the context. This assesses the model's ability to reason accurately in the presence of similar but interfering statements.
* **Keyword and phrase replacement to mitigate knowledge leakage**: To mitigate the biased evaluation of long-context ability caused by knowledge leakage, we apply keyword and phrase replacement to the context and the QA pairs. After replacement, the information is no longer public knowledge and differs substantially from the original source. The replacement rules are annotated by human annotators. In this way, **LV-Eval** requires LLMs to answer from their understanding of the long context rather than from memorization or common-sense knowledge.
* **Keyword-recall-based metric for more objective scoring**: Existing *N*-gram metrics such as the F1 score and ROUGE are sensitive to format variations and non-informative words in the answer, which results in inaccurate (typically inflated) scores. To address this, we manually annotate answer keywords and a blacklist of non-informative words. The answer keywords are the critical words or phrases extracted from the original ground-truth (GT) answers, while the word blacklist contains common, non-informative words such as 'the', 'a', and 'of'. The metric is computed in two stages, taking the F1 score as an example. The first stage calculates the recall of the answer keywords in the prediction; if the recall is below a preset threshold, the score is 0. Otherwise, the second stage removes all blacklisted words from both the prediction and the GT answer and then calculates the F1 score between them. This design makes the scores more objective.

If you want to learn more about **LV-Eval**, please refer to the [GitHub repository](https://github.com/infinigence/LVEval) and the [paper](https://arxiv.org/abs/2402.05136).

# How to use it?

#### Quick Start

Our dataset evaluates the long-context capabilities of large language models from multiple perspectives. Each subset is split into several length levels, so append the length level to the subset name when loading the dataset.
```python
from datasets import load_dataset

# Load the 16k length level of the hotpotwikiqa_mixup subset.
data = load_dataset("Infinigence/LVEval", "hotpotwikiqa_mixup_16k", split='test')
```

#### Loading Data

```python
from datasets import load_dataset

DATASET_NAMES = [
    "hotpotwikiqa_mixup", "loogle_SD_mixup", "loogle_CR_mixup", "loogle_MIR_mixup",
    "multifieldqa_en_mixup", "multifieldqa_zh_mixup", "factrecall_en", "factrecall_zh",
    "cmrc_mixup", "lic_mixup", "dureader_mixup"
]

DATASET_LENGTH_LEVEL = ['16k', '32k', '64k', '128k', '256k']

def get_dataset_names(dataset_names, length_levels):
    # Build every "<name>_<length>" subset name.
    datasets = []
    for name in dataset_names:
        for length in length_levels:
            datasets.append(f"{name}_{length}")
    return datasets

# Load each subset at each length level in turn.
for dataset in get_dataset_names(DATASET_NAMES, DATASET_LENGTH_LEVEL):
    data = load_dataset("Infinigence/LVEval", dataset, split='test')
```

If you want to download the data for **hotpotwikiqa_mixup**, you can visit [this link](https://huggingface.co/datasets/Infinigence/LVEval/resolve/main/hotpotwikiqa_mixup.zip). If you need other subsets, simply change the zip file name in the link above.

#### Data Format

All data in **LV-Eval** follow the format below. For certain datasets ("loogle_SD_mixup", "loogle_CR_mixup", "loogle_MIR_mixup"), there is an additional key called "answer_keywords", which holds the most crucial words or phrases in the answer. During evaluation, if the match between the prediction and the "answer_keywords" falls below a certain threshold, the score is 0; otherwise, the prediction is compared against the "answers" list. For some datasets ("factrecall_en", "factrecall_zh", "cmrc_mixup"), there is an extra key called "confusing_facts", which holds the confusing facts that were added to increase the benchmark difficulty and randomly placed within the long context. For certain datasets ("hotpotwikiqa_mixup", "multifieldqa_en_mixup", "multifieldqa_zh_mixup", "lic_mixup"), both "answer_keywords" and "confusing_facts" are present.
```json
{
    "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc.",
    "context": "The documents input into the long-text task.",
    "answers": "A list of all true answers",
    "length": "Total length of the first three items (counted in characters for Chinese and words for English)",
    "dataset": "The name of the dataset to which this piece of data belongs",
    "language": "The language of this piece of data",
    "answer_keywords": "The key words or sentences manually filtered from the answers.",
    "confusing_facts": "Confusing facts added to increase the benchmark difficulty and randomly placed within the long context."
}
```

#### Evaluation

This repository provides the data download for LV-Eval. If you wish to use this dataset for automated evaluation, please refer to our [GitHub repository](https://github.com/infinigence/LVEval).

# Task statistics

| Task | Datasets | CFI | \#KPR | AK | Language | \#QA pairs | \#Contexts |
|:-------------:|:-----------------------:|:----------:|-------|:----------:|:--------:|:----------:|:------------:|
| Single-hop QA | loogle\_SD\_mixup | | | ✔ | en | 160 | 800 |
| | cmrc\_mixup | | 786 | | zh | 200 | 1,000 |
| | multifieldqa\_en\_mixup | ✔ | 476 | ✔ | en | 101 | 505 |
| | multifieldqa\_zh\_mixup | ✔ | 424 | ✔ | zh | 133 | 665 |
| | factrecall\_en | ✔ | 3 | ✔ | en | 1 | 200\*5 |
| | factrecall\_zh | ✔ | 3 | ✔ | zh | 1 | 200\*5 |
| Multi-hop QA | dureader\_mixup | | | | zh | 176 | 880 |
| | loogle\_CR\_mixup | | | ✔ | en | 99 | 495 |
| | loogle\_MR\_mixup | | | ✔ | en | 139 | 695 |
| | hotpotwikiqa\_mixup | ✔ | 232 | ✔ | en | 124 | 620 |
| | lic\_mixup | ✔ | | ✔ | zh | 197 | 985 |

The abbreviations **CFI, KPR, AK** stand for confusing fact insertion, keyword and phrase replacement, and answer keywords, respectively.
The confusing facts have already been inserted into the context and are also listed in the jsonl file under the **"confusing_facts"** key. The answer keywords appear in the jsonl file under the **"answer_keywords"** key.

# Task construction

### Multi-hop QA

In a multi-hop QA task, deriving the answer requires gathering multiple pieces of information from various locations in the context.

- **lic-mixup** originates from the [Long-instruction-en2zh](https://huggingface.co/datasets/yuyijiong/Long-instruction-en2zh) dataset on Hugging Face. The original Long-instruction-en2zh contains 8,000+ high-quality Chinese multi-doc QA samples translated from English. We selected 197 QA pairs and their corresponding documents as supporting data, while the remaining documents serve as distracting data for context mixing.
- **hotpotwikiqa-mixup** originates from two Wikipedia-based multi-hop QA datasets: [HotpotQA](https://huggingface.co/datasets/hotpot_qa) and [2WikiMultihopQA](https://huggingface.co/datasets/voidful/2WikiMultihopQA). HotpotQA contains 112,779 2-hop questions written by native speakers, each based on two given paragraphs as the context. 2WikiMultihopQA contains 192,606 5-hop questions that are synthesized using manually designed templates to prevent shortcut solutions. We select 124 samples from the two datasets.
- **loogle-MR-mixup** and **loogle-CR-mixup** originate from [LooGLE](https://huggingface.co/datasets/bigainlco/LooGLE)'s Long-dependency QA task, specifically the *Multiple information Retrieval* and *Comprehension and Reasoning* subtasks. The *Multiple information Retrieval* task requires aggregating evidence that can be located directly in the original sentences, while the *Comprehension and Reasoning* task relies on evidence that is implicit in the context and requires multi-step reasoning to reach the correct answer. We select 139 and 99 questions for **loogle-MR-mixup** and **loogle-CR-mixup**, respectively.
- **dureader-mixup** is built from the [DuReader](https://github.com/baidu/DuReader) dataset. We first randomly select 200 instances and then manually remove 24 samples whose answers are longer than 360 words.

### Single-hop QA

In a single-hop QA task, a single piece of evidence in the context is enough to derive the answer.

- **loogle-SD-mixup** contains 160 unique QA pairs and 800 documents that originate from the short-dependency QA task in [LooGLE](https://huggingface.co/datasets/bigainlco/LooGLE).
- **cmrc-mixup** is derived from the [CMRC 2018 Public Datasets](https://github.com/ymcui/cmrc2018), designed for Chinese machine reading comprehension. It contains ~20k questions annotated on Wikipedia paragraphs by human experts. We manually pick 200 QA pairs and their corresponding documents as supporting QA pairs and paragraphs.
- **multifieldqa-en-mixup** and **multifieldqa-zh-mixup** are built from the MultiFieldQA datasets in [LongBench](https://huggingface.co/datasets/THUDM/LongBench). We manually remove questions that can be answered using common-sense knowledge without referring to the context, which leaves 101 and 133 unique QA pairs for **multifieldqa-en-mixup** and **multifieldqa-zh-mixup**, respectively.
- **factrecall-en** and **factrecall-zh** are two synthetic datasets designed to assess an LLM's ability to identify a small piece of evidence (a "fact") located at various positions within a very long context. We write one English fact-question-answer pair for **factrecall-en** and one Chinese fact-question-answer pair for **factrecall-zh**. Distracting documents are sourced from the *PG-19* dataset (English) and the book *Dream of the Red Chamber* (Chinese) to create five contexts of different length levels. For each context, we generate 200 documents by inserting the fact at 200 evenly spaced positions within the context.
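
The factrecall construction described in the last bullet can be illustrated with a minimal sketch. It assumes the distractor context is already split into a list of sentences and that "evenly spaced" means positions spread uniformly from the very start to the very end of the context. The helper name `insert_fact_at_even_positions`, the sentence-level granularity, and the toy strings are illustrative assumptions, not the authors' pipeline.

```python
def insert_fact_at_even_positions(sentences, fact, num_positions=200):
    """Return num_positions documents, each with `fact` inserted at a different depth.

    Hypothetical helper: positions are spread evenly over [0, len(sentences)].
    """
    if num_positions < 2:
        raise ValueError("num_positions must be at least 2")
    documents = []
    for i in range(num_positions):
        # Map the i-th position to an index between 0 (start) and len(sentences) (end).
        index = round(i * len(sentences) / (num_positions - 1))
        documents.append(sentences[:index] + [fact] + sentences[index:])
    return documents


# Toy distractor context of 1,000 "sentences" and one made-up fact.
context = [f"Distractor sentence {n}." for n in range(1000)]
fact = "A made-up fact to recall."
docs = insert_fact_at_even_positions(context, fact, num_positions=200)
print(len(docs), docs[0].index(fact), docs[-1].index(fact))
```

Each returned document contains the same distractor text and the same fact, so any difference in a model's answers can be attributed to where the fact sits in the context.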
# License

In **LV-Eval**, the cmrc-mixup and lic-mixup datasets follow the `CC-BY-SA-4.0` license, and the other datasets follow the `MIT` license.

# Citation

```
@misc{yuan2024lveval,
      title={LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K},
      author={Tao Yuan and Xuefei Ning and Dong Zhou and Zhijie Yang and Shiyao Li and Minghui Zhuang and Zheyue Tan and Zhuyu Yao and Dahua Lin and Boxun Li and Guohao Dai and Shengen Yan and Yu Wang},
      year={2024},
      eprint={2402.05136},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


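Provider:

Infinigence

Summary of original information

LV-Eval Dataset Overview

Dataset introduction

LV-Eval is a long-context evaluation benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) and a maximum test length of 256k. The average text length is 102,380 words, and the minimum/maximum text length is 11,896/387,406 words. LV-Eval covers two main task types, single-hop QA and multi-hop QA, across 11 Chinese and English evaluation subsets.

Key features

Ultra-long context: the benchmark consists of five length levels: 16k, 32k, 64k, 128k, and 256k. The same dataset shares one set of QA pairs across all length levels; only the length of the context differs.
Confusing and distracting information to raise difficulty: test documents are built by mixing and concatenating QA-relevant documents with irrelevant ones. In addition, GPT-4 is used to generate several confusing facts, which are checked by humans and then randomly inserted into the test documents.
Replacement of key information to reduce leakage: keywords and phrases in the context and the QA pairs are replaced, so the replaced information is no longer public knowledge.
Keyword-recall-based metrics for more objective scoring: answer keywords and a word blacklist are annotated manually, and the metric is computed in two stages.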
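
The two-stage metric mentioned in the last feature can be sketched as follows. This is a minimal illustration and not the official scorer, which lives in the GitHub repository. Whitespace tokenization, the `threshold=0.5` default, and the function names are assumptions made for the example.

```python
from collections import Counter


def keyword_recall(prediction_tokens, keyword_tokens):
    """Fraction of answer-keyword tokens that also occur in the prediction."""
    if not keyword_tokens:
        return 0.0
    common = Counter(prediction_tokens) & Counter(keyword_tokens)
    return sum(common.values()) / len(keyword_tokens)


def f1(prediction_tokens, answer_tokens):
    """Token-level F1 between a prediction and a ground-truth answer."""
    common = Counter(prediction_tokens) & Counter(answer_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)


def two_stage_f1(prediction, answer, answer_keywords, blacklist, threshold=0.5):
    """Stage 1: gate on keyword recall. Stage 2: F1 after removing blacklisted words.

    `threshold` and whitespace tokenization are illustrative assumptions.
    """
    pred = prediction.lower().split()
    gold = answer.lower().split()
    keys = answer_keywords.lower().split()
    if keyword_recall(pred, keys) < threshold:
        return 0.0  # the prediction misses the key information
    pred = [t for t in pred if t not in blacklist]
    gold = [t for t in gold if t not in blacklist]
    return f1(pred, gold)


blacklist = {"the", "a", "of", "is", "in"}
hit = two_stage_f1("the capital is paris", "paris is the capital of france", "paris", blacklist)
miss = two_stage_f1("the capital is london", "paris is the capital of france", "paris", blacklist)
print(hit, miss)
```

The first prediction contains the keyword and is scored by the filtered F1. The second shares most of its words with the answer but misses the keyword, so the gate assigns it 0 instead of a misleading partial score.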

Using the dataset

Quick start

```python
from datasets import load_dataset

# Load the 16k length level of the hotpotwikiqa_mixup subset.
data = load_dataset("Infinigence/LVEval", "hotpotwikiqa_mixup_16k", split="test")
```

Loading data

```python
from datasets import load_dataset

DATASET_NAMES = [
    "hotpotwikiqa_mixup", "loogle_SD_mixup", "loogle_CR_mixup", "loogle_MIR_mixup",
    "multifieldqa_en_mixup", "multifieldqa_zh_mixup", "factrecall_en", "factrecall_zh",
    "cmrc_mixup", "lic_mixup", "dureader_mixup"
]

DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]

def get_dataset_names(dataset_names, length_levels):
    # Build every "<name>_<length>" subset name.
    datasets = []
    for name in dataset_names:
        for length in length_levels:
            datasets.append(f"{name}_{length}")
    return datasets

# Load each subset at each length level in turn.
for dataset in get_dataset_names(DATASET_NAMES, DATASET_LENGTH_LEVEL):
    data = load_dataset("Infinigence/LVEval", dataset, split="test")
```

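Data format

All data follow this format:

```json
{
    "input": "The input/command for the task, usually short, such as the question in QA or the query in few-shot tasks",
    "context": "The documents input into the long-text task.",
    "answers": "A list of all correct answers",
    "length": "Total length of the first three items (counted in characters for Chinese and words for English)",
    "dataset": "The name of the dataset this record belongs to",
    "language": "The language of this record",
    "answer_keywords": "Key words or sentences manually selected from the answers.",
    "confusing_facts": "Confusing elements that raise the benchmark difficulty, already placed at random positions in the long text."
}
```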
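
Since `answer_keywords` and `confusing_facts` appear only in some subsets, code that reads the jsonl files has to treat them as optional. The sketch below shows one way to do that with the standard library. The helper `parse_record` and the sample line are invented for illustration and are not part of LV-Eval or its tooling.

```python
import json

REQUIRED_KEYS = {"input", "context", "answers", "length", "dataset", "language"}
OPTIONAL_KEYS = {"answer_keywords", "confusing_facts"}


def parse_record(line):
    """Parse one jsonl line and report which optional keys it carries."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise KeyError(f"record is missing required keys: {sorted(missing)}")
    present_optional = sorted(OPTIONAL_KEYS & record.keys())
    return record, present_optional


# Invented sample line shaped like the format above (not real LV-Eval data).
sample = json.dumps({
    "input": "Who wrote the report?",
    "context": "A long document about a report written by Alice.",
    "answers": ["Alice"],
    "length": 14,
    "dataset": "example_mixup",
    "language": "en",
    "answer_keywords": "Alice",
})
record, optional = parse_record(sample)
print(record["dataset"], optional)
```

A scorer can then branch on the returned list, for example applying the keyword gate only when `answer_keywords` is present.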

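Evaluation

The dataset page provides the data download. To use the dataset for automated evaluation, please refer to the GitHub repository.

Task statistics

| Task type | Dataset | CFI | KPR | AK | Language | QA pairs | Contexts |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Single-hop QA | loogle_SD_mixup | | | ✓ | en | 160 | 800 |
| | cmrc_mixup | | 786 | | zh | 200 | 1,000 |
| | multifieldqa_en_mixup | ✓ | 476 | ✓ | en | 101 | 505 |
| | multifieldqa_zh_mixup | ✓ | 424 | ✓ | zh | 133 | 665 |
| | factrecall_en | ✓ | 3 | ✓ | en | 1 | 200\*5 |
| | factrecall_zh | ✓ | 3 | ✓ | zh | 1 | 200\*5 |
| Multi-hop QA | dureader_mixup | | | | zh | 176 | 880 |
| | loogle_CR_mixup | | | ✓ | en | 99 | 495 |
| | loogle_MR_mixup | | | ✓ | en | 139 | 695 |
| | hotpotwikiqa_mixup | ✓ | 232 | ✓ | en | 124 | 620 |
| | lic_mixup | ✓ | | ✓ | zh | 197 | 985 |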
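
The context counts in the table appear to follow from pairing every QA pair with one context per length level, which is a reading of the table rather than a statement from the authors. The short check below copies the numbers from the table and confirms that arithmetic. The factrecall subsets are handled separately because their single QA pair is placed at 200 positions per length level.

```python
LENGTH_LEVELS = ["16k", "32k", "64k", "128k", "256k"]

# (QA pairs, contexts) copied from the task statistics table.
TABLE = {
    "loogle_SD_mixup": (160, 800),
    "cmrc_mixup": (200, 1000),
    "multifieldqa_en_mixup": (101, 505),
    "multifieldqa_zh_mixup": (133, 665),
    "dureader_mixup": (176, 880),
    "loogle_CR_mixup": (99, 495),
    "loogle_MR_mixup": (139, 695),
    "hotpotwikiqa_mixup": (124, 620),
    "lic_mixup": (197, 985),
}

# Every mixup subset has exactly one context per QA pair and length level.
mismatches = [
    name for name, (qa_pairs, contexts) in TABLE.items()
    if qa_pairs * len(LENGTH_LEVELS) != contexts
]

# factrecall_en and factrecall_zh: 200 insertion positions per length level.
factrecall_contexts = 200 * len(LENGTH_LEVELS)
total_contexts = sum(c for _, c in TABLE.values()) + 2 * factrecall_contexts
print(mismatches, factrecall_contexts, total_contexts)
```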

Task construction

Multi-hop QA

lic-mixup: derived from the Long-instruction-en2zh dataset.
hotpotwikiqa-mixup: derived from the HotpotQA and 2WikiMultihopQA datasets.
loogle-MR-mixup and loogle-CR-mixup: derived from the LooGLE dataset.
dureader-mixup: derived from the DuReader dataset.

Single-hop QA

loogle-SD-mixup: derived from the LooGLE dataset.
cmrc-mixup: derived from the CMRC 2018 Public Datasets.
multifieldqa-en-mixup and multifieldqa-zh-mixup: derived from the LongBench dataset.
factrecall-en and factrecall-zh: synthetic datasets that assess a model's ability to identify a small piece of evidence in a long text.

License

In LV-Eval, the cmrc-mixup and lic-mixup datasets follow the CC-BY-SA-4.0 license, and the other datasets follow the MIT license.

Citation

```plaintext
@misc{yuan2024lveval,
      title={LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K},
      author={Tao Yuan and Xuefei Ning and Dong Zhou and Zhijie Yang and Shiyao Li and Minghui Zhuang and Zheyue Tan and Zhuyu Yao and Dahua Lin and Boxun Li and Guohao Dai and Shengen Yan and Yu Wang},
      year={2024},
      eprint={2402.05136},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

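Collected summary

Dataset introduction

Construction

LV-Eval is built by mixing QA-relevant documents with irrelevant ones. It also applies two techniques, confusing facts insertion and keyword and phrase replacement, and uses a keyword-recall-based evaluation metric. The result is a set of evaluation subsets at different length levels, intended to assess the long-text capability of large language models comprehensively and objectively.

Usage

When using LV-Eval, users select the length level of each subset at load time. The subset to load is chosen by combining the dataset name with the length level. Each record contains fields such as the input, the context, and the list of answers; depending on the subset, it may also contain answer keywords and confusing facts.

Background and challenges

Background

LV-Eval is a benchmark dataset for evaluating language models on long contexts, created by the Infinigence team in 2024. It has five length levels, up to 256k words, with an average text length of 102,380 words, and is intended as a performance reference for research on long-context large language models. Its core research question is how well models answer questions over very long texts. Confusing facts insertion, keyword and phrase replacement, and the keyword-recall-based metric are all meant to make the evaluation more challenging and more objective. The dataset offers the field a new evaluation standard and research direction for long-text processing.

Current challenges

The main challenges in building LV-Eval were how to increase the length and complexity of the text while keeping the QA pairs identical, and how to reduce the effect of information leakage on the evaluation results. Specifically: 1) inserting confusing information into long texts to make reasoning harder for the model; 2) replacing keywords and phrases to reduce information leakage, so that the model answers from the context actually provided rather than from common-sense knowledge memorized during pre-training; 3) designing a keyword-recall-based metric that evaluates model performance more objectively and avoids the sensitivity of existing metrics to answer format and irrelevant words.

Common scenarios

Typical use case

In natural language processing, long-text capability is one of the key indicators of how capable a large language model is. LV-Eval is a challenging benchmark designed for long texts, and its typical use is to evaluate question answering over very long contexts. By combining texts at several length levels with confusing facts insertion and keyword replacement, it confronts models with more realistic difficulties and therefore reflects their behavior in real applications more accurately.

Academic problems addressed

LV-Eval addresses information leakage in long-text evaluation: keyword replacement reduces the influence of common-sense knowledge, and the keyword-recall-based metric makes the results more objective. The construction also adds confusing and distracting information, which gives researchers a useful reference for studying the performance of long-context large language models.

Practical applications

In practice, LV-Eval can be used to evaluate and improve machine reading comprehension systems, especially those that process large amounts of text, as in literature review or legal document analysis. The dataset helps improve a model's ability to extract key information from lengthy text, which in turn improves the accuracy and efficiency of decisions based on it.

Recent research on the dataset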