SCBench

Name: SCBench
Creator: maas
Published: 2025-12-29 16:41:59
License: No description provided

ModelScope community. Updated: 2025-12-29. Indexed: 2025-07-26.

Download link:

https://modelscope.cn/datasets/microsoft/SCBench


Description:

# SCBench [[Paper]](https://arxiv.org/abs/2412.10319) [[Code]](https://github.com/microsoft/MInference/tree/main/scbench) [[Project Page]](https://aka.ms/scbench)

![SCBench](./data/framework.png)

SCBench (SharedContextBench) is a comprehensive benchmark that evaluates efficient long-context methods from a KV cache-centric perspective. It analyzes their performance across **the full KV cache lifecycle (generation, compression, retrieval, and loading)** in real-world scenarios where context memory (KV cache) is shared and reused across multiple requests.

## 🎯 Quick Start

### Load Data

You can download and load the **SCBench** data through Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/microsoft/SCBench)) and run the experiments from the GitHub repository ([💻 SCBench](https://github.com/microsoft/MInference/tree/main/scbench)):

```python
from datasets import load_dataset

datasets = [
    "scbench_kv", "scbench_prefix_suffix", "scbench_vt", "scbench_repoqa",
    "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng", "scbench_many_shot",
    "scbench_summary", "scbench_mf", "scbench_summary_with_needles", "scbench_repoqa_and_kv",
]

# Load the test split of every SCBench configuration.
for dataset in datasets:
    data = load_dataset('microsoft/SCBench', dataset, split='test')
```

### Data Format

All data in **SCBench** are standardized to the following format:

```json
{
    "id": "Random id for each piece of data.",
    "context": "The long context required for the task, such as repo-code, long-document, and many-shot.",
    "multi_turns": [{"input": "multi-turn question.", "answer": "multi-turn reference answer."}]
}
```

### Experiments

We implement **Multi-Turn** and **Multi-Request** modes with HF and vLLM in two classes, [`GreedySearch`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1160) and [`GreedySearch_vllm`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1070). Please refer to the following scripts to run the experiments.

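The record layout described under Data Format can be walked with a few lines of Python. This is a minimal sketch, not code from the SCBench repository: `iter_turns`, `REQUIRED_FIELDS`, and the sample record are hypothetical names introduced for illustration. Only the field names `id`, `context`, `multi_turns`, `input`, and `answer` come from the format above.

```python
from typing import Any, Dict, Iterator, Tuple

# Top-level fields named in the documented data format.
REQUIRED_FIELDS = ("id", "context", "multi_turns")


def iter_turns(record: Dict[str, Any]) -> Iterator[Tuple[int, str, str]]:
    """Yield (turn_index, question, reference_answer) for one record."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    for index, turn in enumerate(record["multi_turns"]):
        yield index, turn["input"], turn["answer"]


# Hypothetical record that follows the documented schema.
sample = {
    "id": "example-0",
    "context": "a long shared context would go here",
    "multi_turns": [
        {"input": "first question", "answer": "first answer"},
        {"input": "second question", "answer": "second answer"},
    ],
}

turns = list(iter_turns(sample))
```

Every question in `multi_turns` refers to the same `context`, which is why the context is the natural unit to cache and reuse.
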
## Run the benchmark

First, build the environment; see [basic environment](https://github.com/microsoft/MInference/tree/main/scbench#basic-dependencies).

Run the test:

```bash
bash scripts/test_llama.sh
```

Run multiple tasks in one command:

```bash
bash scripts/run_all_tasks.sh
```

Specify the max sequence length, max number of turns, and number of eval examples:

- `--max_seq_length`: The maximum sequence length for the test.
- `--max_turns`: The maximum number of turns for the test.
- `--num_eval_examples`: The number of test examples to use; all examples are used by default.

## Run with efficient long-context methods:

- `--attn_type`: The attention type to use.
- `--kv_type`: The KV cache type to use.

For example, run with MInference and SnapKV:

```bash
bash scripts/test_minference_with_snapkv.sh
```

The supported efficient long-context methods are as follows:

**attn_type**:

- `dense`: Dense attention
- `minference`: MInference
- `a_shape`: A-Shape
- `tri_shape`: Tri-Shape

**kv_type**:

- `dense`: Dense KV cache
- `kivi`: KIVI
- `snapkv`: SnapKV
- `quest`: Quest
- `pyramidkv`: PyramidKV
- `streamingllm`: StreamingLLM

You will need to build a specific environment for each attention type and KV cache type; see the [Environment](https://github.com/microsoft/MInference/tree/main/scbench#environment-for-efficient-long-context-methods) section for more details.

## Dataset

![SCBench](./data/overview.png)

SCBench covers 12 diverse tasks that test four key long-context capabilities: string retrieval, semantic retrieval, global information processing, and multi-tasking.

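The twelve configuration names from the Quick Start can be grouped under these four capabilities. The grouping below is a sketch: the mapping from configuration names to capabilities is inferred from the names themselves (for example, `scbench_vt` is assumed to be the variable-tracing task and `scbench_mf` the Math.Find task) and is not stated explicitly on this page. `CAPABILITY_TO_CONFIGS` and `capability_of` are hypothetical names.

```python
from typing import Dict, List

# Assumed grouping, inferred from configuration names; verify against the
# SCBench repository before relying on it.
CAPABILITY_TO_CONFIGS: Dict[str, List[str]] = {
    "string_retrieval": ["scbench_kv", "scbench_prefix_suffix", "scbench_vt"],
    "semantic_retrieval": [
        "scbench_repoqa", "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng",
    ],
    "global_information_processing": [
        "scbench_many_shot", "scbench_mf", "scbench_summary",
    ],
    "multi_tasking": ["scbench_summary_with_needles", "scbench_repoqa_and_kv"],
}

# Flat list of every configuration, in capability order.
ALL_CONFIGS: List[str] = [
    config for configs in CAPABILITY_TO_CONFIGS.values() for config in configs
]


def capability_of(config: str) -> str:
    """Return the (assumed) capability group for a configuration name."""
    for capability, configs in CAPABILITY_TO_CONFIGS.items():
        if config in configs:
            return capability
    raise ValueError(f"unknown SCBench configuration: {config}")
```

A lookup like this makes it easy to report scores per capability instead of per task.
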
### String Retrieval

- **Retr.KV**: Tests key-value lookup in large JSON objects with random, incompressible content
- **Retr.Prefix-Suffix**: Evaluates finding strings with specific prefix and suffix patterns
- **Retr.MultiHop**: Assesses multi-hop variable tracing capabilities in long inputs

### Semantic Retrieval

- **Code.RepoQA**: Function retrieval from large codebases based on natural language descriptions
- **Language QA**: Includes English QA, Chinese QA, and multi-choice questions on long texts
- Requires semantic understanding of long inputs

### Global Information Processing

- **Many-shot ICL**: Tests in-context learning with hundreds of examples
- **Math.Find**: Statistical tasks on large arrays
- **En.Sum**: Summarization of documents
- Requires global information processing or aggregation

### Multi-Tasking

- **Mix.Sum+NIAH**: Combines summarization with needle-in-a-haystack search
- **Mix.RepoQA+KV**: Integrates code function retrieval with key-value lookup
- Requires multi-tasking or multi-step reasoning

## Two Shared Context Modes

The benchmark evaluates these tasks across two shared context modes:

- **Multi-turn Mode**: Caches context within single sessions
- **Multi-request Mode**: Shares context across multiple sessions

## Compared to previous long-context benchmarks

![SCBench](./data/comparison.png)

SCBench is the first long-context benchmark that covers single-turn, multi-turn, and multi-request scenarios. In addition, our implementation includes KV cache reuse techniques, thereby providing a more comprehensive analysis of the full KV cache lifecycle of efficient long-context methods.

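The difference between the two shared context modes, and the role of KV cache reuse, can be made concrete with a toy model. This is a sketch under stated assumptions, not the benchmark's implementation: `ToyPrefixCache`, `run_multi_turn`, and `run_multi_request` are hypothetical names, tokens are approximated by whitespace splitting, and model answers are left out of the turn history for brevity.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (an assumption)."""
    return len(text.split())


class ToyPrefixCache:
    """Stands in for a KV cache: remembers which contexts were prefilled."""

    def __init__(self) -> None:
        self._cached: Dict[str, int] = {}
        self.prefilled_tokens = 0  # tokens actually processed on cache misses

    def prefill(self, context: str) -> int:
        """Return the context length, paying the prefill cost only on a miss."""
        if context not in self._cached:
            self._cached[context] = count_tokens(context)
            self.prefilled_tokens += self._cached[context]
        return self._cached[context]


def run_multi_turn(cache: ToyPrefixCache, context: str, questions: List[str]) -> List[str]:
    """One session: each turn follows the context plus all earlier questions."""
    cache.prefill(context)
    history: List[str] = []
    prompts: List[str] = []
    for question in questions:
        history.append(question)
        prompts.append(" | ".join(history))
    return prompts


def run_multi_request(cache: ToyPrefixCache, context: str, questions: List[str]) -> List[str]:
    """Separate sessions: each request sees only the shared context and its own question."""
    prompts: List[str] = []
    for question in questions:
        cache.prefill(context)  # a cache hit after the first request
        prompts.append(question)
    return prompts


cache = ToyPrefixCache()
context = "alpha beta gamma delta"
questions = ["q1", "q2", "q3"]
multi_turn_prompts = run_multi_turn(cache, context, questions)
multi_request_prompts = run_multi_request(cache, context, questions)
```

In this toy, the four-token context is prefilled once and then reused by every later turn and request. That reuse is the property that makes the accuracy of a compressed or sparse cache on follow-up queries matter.
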
## Results and Findings

![SCBench](./data/results.png)

SCBench reveals the following key insights:

### Finding 1: Sub-O(n) Memory is Problematic in Multi-Request/Multi-Turn Decoding

- Sparse decoding methods with sub-O(n) memory perform well on first queries but lose accuracy in subsequent requests
- Methods maintaining O(n) memory with sub-O(n²) computation during pre-filling can better approximate full attention accuracy across multiple queries

### Finding 2: Task Performance Shows Varying Decline Patterns

- Sparse KV cache methods excel in tasks requiring global information processing
- O(n) memory is essential for tasks involving exact match retrieval

### Finding 3: Performance vs Compression Rate

- All methods show performance degradation as compression rates increase
- Sub-O(n) memory methods exhibit a significant drop at a 1/4 compression rate
- Methods like RetrievalAttention and KIVI that maintain O(n) memory with sparse decoding show better resilience at higher compression rates

### Finding 4: Issues with Long-Generation Scenarios

- Attention distribution shifts significantly as generation length and number of rounds increase
- This out-of-distribution (OOD) issue impacts performance even for O(n) memory methods

### Finding 5: Dynamic vs Static Patterns

- Dynamic sparse patterns generally outperform static patterns

## Citation

```bibtex
@article{li2024scbench,
    title={SCBench: A KV cache-centric analysis of long-context methods},
    author={Li, Yucheng and Jiang, Huiqiang and Wu, Qianhui and Luo, Xufang and Ahn, Surin and Zhang, Chengruidong and Abdi, Amir H and Li, Dongsheng and Gao, Jianfeng and Yang, Yuqing and Qiu, Lili},
    journal={arXiv preprint arXiv:2412.10319},
    year={2024}
}
```


Provider: maas

Created: 2025-07-22
