
SCBench

ModelScope community, updated 2025-12-29, indexed 2025-07-26
Download link:
https://modelscope.cn/datasets/microsoft/SCBench
# SCBench

[[Paper]](https://arxiv.org/abs/2412.10319) [[Code]](https://github.com/microsoft/MInference/tree/main/scbench) [[Project Page]](https://aka.ms/scbench)

![SCBench](./data/framework.png)

SCBench (SharedContextBench) is a comprehensive benchmark that evaluates efficient long-context methods from a KV cache-centric perspective. It analyzes their performance across **the full KV cache lifecycle (generation, compression, retrieval, and loading)** in real-world scenarios where context memory (KV cache) is shared and reused across multiple requests.

## 🎯 Quick Start

### Load Data

You can download and load the **SCBench** data through Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/microsoft/SCBench)) and run the experiments from the GitHub repository ([💻 SCBench](https://github.com/microsoft/MInference/tree/main/scbench)):

```python
from datasets import load_dataset

# The 12 SCBench task configurations.
datasets = ["scbench_kv", "scbench_prefix_suffix", "scbench_vt", "scbench_repoqa",
            "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng", "scbench_many_shot",
            "scbench_summary", "scbench_mf", "scbench_summary_with_needles",
            "scbench_repoqa_and_kv"]

# Each configuration is loaded separately; only the 'test' split is used here.
for dataset in datasets:
    data = load_dataset('microsoft/SCBench', dataset, split='test')
```

### Data Format

All data in **SCBench** are standardized to the following format:

```json
{
    "id": "Random id for each piece of data.",
    "context": "The long context required for the task, such as repo-code, long-document, and many-shot.",
    "multi_turns": [{"input": "multi-turn question.", "answer": "multi-turn reference answer."}]
}
```

### Experiments

We implement the **Multi-Turn** and **Multi-Request** modes with HF and vLLM in two classes, [`GreedySearch`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1160) and [`GreedySearch_vllm`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1070). Please refer to the following scripts to run the experiments.
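As a minimal sketch of the data format described above, the snippet below checks that a record carries the three documented fields and extracts its question/answer turns. The helper `validate_record` and the toy record are hypothetical illustrations and are not part of the SCBench code; field types beyond what the format states are deliberately not checked.

```python
from typing import Any


def validate_record(record: dict[str, Any]) -> list[tuple[Any, Any]]:
    """Check a record against the documented format and return its (input, answer) turns.

    Hypothetical helper for illustration only; not part of the SCBench repository.
    """
    # The documented format has exactly these three top-level fields.
    for key in ("id", "context", "multi_turns"):
        if key not in record:
            raise ValueError(f"missing field: {key}")

    turns = []
    for index, turn in enumerate(record["multi_turns"]):
        # Every turn pairs a question with its reference answer.
        if "input" not in turn or "answer" not in turn:
            raise ValueError(f"turn {index} needs both 'input' and 'answer'")
        turns.append((turn["input"], turn["answer"]))
    return turns


# Hand-written toy record (not real SCBench data).
example = {
    "id": "toy-0",
    "context": '{"a": "1", "b": "2"}',
    "multi_turns": [
        {"input": "What is the value of key a?", "answer": "1"},
        {"input": "What is the value of key b?", "answer": "2"},
    ],
}

turns = validate_record(example)
print(f"{len(turns)} turns share one context of {len(example['context'])} characters")
```

All turns in a record refer to the same `context`, which is the part a KV cache can reuse across turns or requests.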
## Run the benchmark

First, build the environment; see [basic environment](https://github.com/microsoft/MInference/tree/main/scbench#basic-dependencies).

Run the test:

```bash
bash scripts/test_llama.sh
```

Run multiple tasks in one command:

```bash
bash scripts/run_all_tasks.sh
```

Specify the max sequence length, max number of turns, and number of eval examples:

- `--max_seq_length`: The maximum sequence length for the test.
- `--max_turns`: The maximum number of turns for the test.
- `--num_eval_examples`: The number of test examples to use; all examples are used by default.

## Run with efficient long-context methods

- `--attn_type`: The attention type to use.
- `--kv_type`: The KV cache type to use.

For example, run with MInference and SnapKV:

```bash
bash scripts/test_minference_with_snapkv.sh
```

The supported efficient long-context methods are as follows:

**attn_type**:

- `dense`: Dense attention
- `minference`: MInference
- `a_shape`: A-Shape
- `tri_shape`: Tri-Shape

**kv_type**:

- `dense`: Dense KV cache
- `kivi`: KIVI
- `snapkv`: SnapKV
- `quest`: Quest
- `pyramidkv`: PyramidKV
- `streamingllm`: StreamingLLM

You will need to build a specific environment for each attention type and KV cache type; see the [Environment](https://github.com/microsoft/MInference/tree/main/scbench#environment-for-efficient-long-context-methods) section for more details.

## Dataset

![SCBench](./data/overview.png)

SCBench covers 12 diverse tasks that test four key long-context capabilities: string retrieval, semantic retrieval, global information processing, and multi-tasking.
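The 12 configuration names from the Quick Start can be grouped under these four capabilities. The sketch below is one possible grouping; the assignment of each configuration to a capability is an assumption inferred from the configuration names and the task descriptions in this README (for example, that `scbench_vt` corresponds to Retr.MultiHop), and the names `CAPABILITY_GROUPS` and `capability_of` are hypothetical.

```python
# Assumed grouping, inferred from the configuration names; not taken from the SCBench code.
CAPABILITY_GROUPS = {
    "string_retrieval": ["scbench_kv", "scbench_prefix_suffix", "scbench_vt"],
    "semantic_retrieval": ["scbench_repoqa", "scbench_qa_eng", "scbench_qa_chn",
                           "scbench_choice_eng"],
    "global_information_processing": ["scbench_many_shot", "scbench_mf", "scbench_summary"],
    "multi_tasking": ["scbench_summary_with_needles", "scbench_repoqa_and_kv"],
}


def capability_of(config_name: str) -> str:
    """Return the capability group that a configuration name is assigned to."""
    for capability, names in CAPABILITY_GROUPS.items():
        if config_name in names:
            return capability
    raise KeyError(f"unknown configuration: {config_name}")


# Flatten the groups to confirm that every configuration appears exactly once.
all_configs = [name for names in CAPABILITY_GROUPS.values() for name in names]
print(len(all_configs), len(set(all_configs)))
print(capability_of("scbench_repoqa_and_kv"))
```

A grouping like this makes it easy to report scores per capability instead of per task.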
### String Retrieval

- **Retr.KV**: Tests key-value lookup in large JSON objects with random, incompressible content
- **Retr.Prefix-Suffix**: Evaluates finding strings with specific prefix and suffix patterns
- **Retr.MultiHop**: Assesses multi-hop variable tracing capabilities in long inputs

### Semantic Retrieval

- **Code.RepoQA**: Function retrieval from large codebases based on natural language descriptions
- **Language QA**: Includes English QA, Chinese QA, and multi-choice questions on long texts
- Requires semantic understanding of long inputs

### Global Information Processing

- **Many-shot ICL**: Tests in-context learning with hundreds of examples
- **Math.Find**: Statistical tasks on large arrays
- **En.Sum**: Summarization of documents
- Requires global information processing or aggregation

### Multi-Tasking

- **Mix.Sum+NIAH**: Combines summarization with needle-in-a-haystack search
- **Mix.RepoQA+KV**: Integrates code function retrieval with key-value lookup
- Requires multi-tasking or multi-step reasoning

## Two Shared Context Modes

The benchmark evaluates these tasks across two shared context modes:

- **Multi-turn Mode**: Caches context within single sessions
- **Multi-request Mode**: Shares context across multiple sessions

## Compared to previous long-context benchmarks

![SCBench](./data/comparison.png)

SCBench is the first long-context benchmark that covers single-turn, multi-turn, and multi-request scenarios. In addition, our implementation also involves KV cache reuse techniques, thereby providing a more comprehensive analysis of the full KV cache lifecycle of efficient long-context methods.
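To make the difference between the two shared context modes concrete, here is a minimal sketch of how prompts could be assembled in each mode. It rests on several assumptions: turns are joined with plain newlines, reference answers stand in for model replies when building the multi-turn history, and the function names are hypothetical. The actual prompt templates and cache handling in the SCBench code may differ.

```python
def multi_turn_prompts(context: str, questions: list[str], answers: list[str]) -> list[str]:
    """One session: each turn sees the context plus every earlier question and answer."""
    prompts = []
    history = context
    for question, answer in zip(questions, answers):
        prompt = history + "\n" + question
        prompts.append(prompt)
        # Assumption: the reference answer stands in for the model's reply in the history.
        history = prompt + "\n" + answer
    return prompts


def multi_request_prompts(context: str, questions: list[str]) -> list[str]:
    """Independent sessions: each request sees only the shared context and its own question."""
    return [context + "\n" + question for question in questions]


context = "LONG SHARED CONTEXT"
questions = ["Q1?", "Q2?"]
answers = ["A1.", "A2."]

turn_prompts = multi_turn_prompts(context, questions, answers)
request_prompts = multi_request_prompts(context, questions)

# In both modes every prompt starts with the same context, which is the
# prefix whose KV cache can be computed once and reused.
print(all(p.startswith(context) for p in turn_prompts + request_prompts))
```

In the multi-turn sketch the reusable prefix grows with each turn, while in the multi-request sketch only the original context is shared.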
## Results and Findings

![SCBench](./data/results.png)

SCBench reveals the following key insights:

### Finding 1: Sub-O(n) Memory is Problematic in Multi-Request/Multi-Turn Decoding

- Sparse decoding methods with sub-O(n) memory perform well on first queries but lose accuracy in subsequent requests
- Methods maintaining O(n) memory with sub-O(n²) computation during pre-filling can better approximate full attention accuracy across multiple queries

### Finding 2: Task Performance Shows Varying Decline Patterns

- Sparse KV cache methods excel in tasks requiring global information processing
- O(n) memory is essential for tasks involving exact match retrieval

### Finding 3: Performance vs Compression Rate

- All methods show performance degradation as compression rates increase
- Sub-O(n) memory methods exhibit a significant drop at a 1/4 compression rate
- Methods like RetrievalAttention and KIVI that maintain O(n) memory with sparse decoding show better resilience at higher compression rates

### Finding 4: Issues with Long-Generation Scenarios

- Attention distribution shifts significantly as generation length and number of rounds increase
- This out-of-distribution (OOD) issue impacts performance even for O(n) memory methods

### Finding 5: Dynamic vs Static Patterns

- Dynamic sparse patterns generally outperform static patterns

## Citation

```bibtex
@article{li2024scbench,
    title={SCBench: A KV cache-centric analysis of long-context methods},
    author={Li, Yucheng and Jiang, Huiqiang and Wu, Qianhui and Luo, Xufang and Ahn, Surin and Zhang, Chengruidong and Abdi, Amir H and Li, Dongsheng and Gao, Jianfeng and Yang, Yuqing and Qiu, Lili},
    journal={arXiv preprint arXiv:2412.10319},
    year={2024}
}
```

Provider: maas
Created: 2025-07-22