SCBench
ModelScope community. Updated 2025-12-29; indexed 2025-07-26.
Download link:
https://modelscope.cn/datasets/microsoft/SCBench
# SCBench
[[Paper]](https://arxiv.org/abs/2412.10319)
[[Code]](https://github.com/microsoft/MInference/tree/main/scbench)
[[Project Page]](https://aka.ms/scbench)

SCBench (SharedContextBench) is a comprehensive benchmark that evaluates efficient long-context methods from a KV cache-centric perspective. It analyzes their performance across **the full KV cache lifecycle (generation, compression, retrieval, and loading)** in real-world scenarios where context memory (KV cache) is shared and reused across multiple requests.
## 🎯 Quick Start
### Load Data
You can download and load the **SCBench** data through Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/microsoft/SCBench)) and run the experiments with the code on GitHub ([💻 SCBench](https://github.com/microsoft/MInference/tree/main/scbench)):
```python
from datasets import load_dataset

# The 12 SCBench task configurations
datasets = ["scbench_kv", "scbench_prefix_suffix", "scbench_vt", "scbench_repoqa", "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng", "scbench_many_shot", "scbench_summary", "scbench_mf", "scbench_summary_with_needles", "scbench_repoqa_and_kv"]
data = {}
for dataset in datasets:
    # Load the test split of each task, keyed by its configuration name
    data[dataset] = load_dataset('microsoft/SCBench', dataset, split='test')
```
### Data Format
All data in **SCBench** are standardized to the following format:
```json
{
    "id": "Random id for each piece of data.",
    "context": "The long context required for the task, such as repo-code, long-document, and many-shot.",
    "multi_turns": [{"input": "multi-turn question.", "answer": "multi-turn reference answer."}]
}
```
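As a small illustration of this format, the sketch below walks through the turns of one record. The record contents and the `iter_turns` helper are hypothetical and exist only for this example; real records carry far longer contexts.

```python
from typing import Iterator, Tuple

# Hypothetical record that follows the schema above.
record = {
    "id": "example-0",
    "context": "def add(a, b):\n    return a + b\n",
    "multi_turns": [
        {"input": "What does add return?", "answer": "The sum of a and b."},
        {"input": "How many parameters does add take?", "answer": "Two."},
    ],
}


def iter_turns(rec: dict) -> Iterator[Tuple[int, str, str]]:
    """Yield (turn_index, question, reference_answer) for one record."""
    for i, turn in enumerate(rec["multi_turns"]):
        yield i, turn["input"], turn["answer"]


# Every turn of a record refers to the same shared "context" string.
for idx, question, reference in iter_turns(record):
    print(f"turn {idx}: {question} -> {reference}")
```

Because all turns of a record share one `context`, a single record is enough to exercise context reuse across several questions.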
### Experiments
We implement the **Multi-Turn** and **Multi-Request** modes with HF and vLLM in two classes, [`GreedySearch`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1160) and [`GreedySearch_vllm`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1070). Please refer to the following scripts to run the experiments.
## Run the benchmark
First, set up the environment; see [basic environment](https://github.com/microsoft/MInference/tree/main/scbench#basic-dependencies).
Run the test:
```bash
bash scripts/test_llama.sh
```
Run multiple tasks in one command:
```bash
bash scripts/run_all_tasks.sh
```
Specify the max sequence length, max number of turns, and number of eval examples:
- `--max_seq_length`: The maximum sequence length for the test.
- `--max_turns`: The maximum number of turns for the test.
- `--num_eval_examples`: The number of test examples to use; all examples are used by default.
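How the scripts apply these limits internally is not shown here. The sketch below is one plausible reading of `--max_turns` and `--num_eval_examples`, applied to records in the data format above. The `limit_examples` helper is hypothetical and is not taken from the repository; `--max_seq_length` is left out because it would depend on the model's tokenizer.

```python
from typing import List, Optional


def limit_examples(
    examples: List[dict],
    max_turns: Optional[int] = None,
    num_eval_examples: Optional[int] = None,
) -> List[dict]:
    """Keep the first `num_eval_examples` records and the first `max_turns`
    turns of each record. `None` means "no limit" (an assumption)."""
    selected = examples if num_eval_examples is None else examples[:num_eval_examples]
    limited = []
    for rec in selected:
        turns = rec["multi_turns"]
        if max_turns is not None:
            turns = turns[:max_turns]
        # Copy the record so the input list is left untouched.
        limited.append({**rec, "multi_turns": turns})
    return limited


# Toy data: 3 records with 3 turns each.
toy = [
    {"id": str(i), "context": "ctx",
     "multi_turns": [{"input": f"q{t}", "answer": f"a{t}"} for t in range(3)]}
    for i in range(3)
]
subset = limit_examples(toy, max_turns=2, num_eval_examples=2)
print(len(subset), [len(r["multi_turns"]) for r in subset])
```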
## Run with efficient long-context methods
- `--attn_type`: The attention type to use.
- `--kv_type`: The KV cache type to use.
For example, run with MInference and SnapKV:
```bash
bash scripts/test_minference_with_snapkv.sh
```
The supported efficient long-context methods are as follows:
**attn_type**:
- `dense`: Dense attention
- `minference`: MInference
- `a_shape`: A-Shape
- `tri_shape`: Tri-Shape
**kv_type**:
- `dense`: Dense KV cache
- `kivi`: KIVI
- `snapkv`: SnapKV
- `quest`: Quest
- `pyramidkv`: PyramidKV
- `streamingllm`: StreamingLLM
You will need to build a specific environment for each attention type and KV cache type; see the [Environment](https://github.com/microsoft/MInference/tree/main/scbench#environment-for-efficient-long-context-methods) section for more details.
## Dataset

SCBench covers 12 diverse tasks that test four key long-context capabilities: string retrieval, semantic retrieval, global information processing, and multi-tasking.
### String Retrieval
- **Retr.KV**: Tests key-value lookup in large JSON objects with random, incompressible content
- **Retr.Prefix-Suffix**: Evaluates finding strings with specific prefix and suffix patterns
- **Retr.MultiHop**: Assesses multi-hop variable tracing capabilities in long inputs
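To make the key-value lookup setting concrete, here is a toy generator in the spirit of Retr.KV. It is only a sketch: the function name, the question wording, and the use of random UUIDs as incompressible content are assumptions, not the benchmark's actual data pipeline.

```python
import json
import random
import uuid
from typing import Tuple


def make_kv_example(num_pairs: int, seed: int = 0) -> Tuple[str, str, str]:
    """Build a JSON object of random key-value pairs plus one lookup question.

    Returns (context, question, reference_answer).
    """
    rng = random.Random(seed)

    def rand_uuid() -> str:
        # Random 128-bit identifiers have no structure a model could exploit.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = {rand_uuid(): rand_uuid() for _ in range(num_pairs)}
    key = rng.choice(list(pairs))
    context = json.dumps(pairs)
    question = f'What is the value of key "{key}"?'
    return context, question, pairs[key]


context, question, answer = make_kv_example(num_pairs=50, seed=1)
print(len(context), question, answer)
```

Since the content is random, the only way to answer is to retrieve the exact string from the context, which is why this kind of task is sensitive to anything dropped from the KV cache.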
### Semantic Retrieval
- **Code.RepoQA**: Function retrieval from large codebases based on natural language descriptions
- **Language QA**: Includes English QA, Chinese QA, and multi-choice questions on long texts
- Requires semantic understanding of long inputs
### Global Information Processing
- **Many-shot ICL**: Tests in-context learning with hundreds of examples
- **Math.Find**: Statistical tasks on large arrays
- **En.Sum**: Summarization of documents
- Requires global information processing or aggregation
### Multi-Tasking
- **Mix.Sum+NIAH**: Combines summarization with needle-in-haystack search
- **Mix.RepoQA+KV**: Integrates code function retrieval with key-value lookup
- Requires multi-tasking or multi-step reasoning
## Two Shared Context Modes
The benchmark evaluates these tasks across two shared context modes:
- **Multi-turn Mode**: Caches context within single sessions
- **Multi-request Mode**: Shares context across multiple sessions
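The difference between the two modes can be sketched in terms of the prompts each one produces. This is an illustration only, not the `GreedySearch` implementation: the prompt template and both helper functions are hypothetical, and the reference answer stands in for the model's reply when building the multi-turn history.

```python
from typing import Dict, List


def multi_turn_prompts(context: str, turns: List[Dict[str, str]]) -> List[str]:
    """One session: each prompt contains the context plus all earlier turns."""
    prompts = []
    history = context
    for turn in turns:
        prompt = f"{history}\nQ: {turn['input']}\nA:"
        prompts.append(prompt)
        # Assumption: the reference answer stands in for the model's reply.
        history = f"{prompt} {turn['answer']}"
    return prompts


def multi_request_prompts(context: str, turns: List[Dict[str, str]]) -> List[str]:
    """Independent sessions: each prompt contains only the shared context."""
    return [f"{context}\nQ: {turn['input']}\nA:" for turn in turns]


turns = [
    {"input": "first question", "answer": "first answer"},
    {"input": "second question", "answer": "second answer"},
]
mt = multi_turn_prompts("CTX", turns)
mr = multi_request_prompts("CTX", turns)
print(mt[1])
print(mr[1])
```

In both cases every prompt starts with the same context, so the KV cache computed for it can be reused. In the multi-turn sketch the reusable prefix also grows with each turn.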
## Compared to previous long-context benchmarks

SCBench is the first long-context benchmark that covers single-turn, multi-turn, and multi-request scenarios. In addition, our implementation incorporates KV cache reuse techniques, thereby providing a more comprehensive analysis of the full KV cache lifecycle of efficient long-context methods.
## Results and Findings

SCBench reveals the following key insights:
### Finding 1: Sub-O(n) Memory is Problematic in Multi-Request/Multi-Turn Decoding
- Sparse decoding methods with sub-O(n) memory perform well on first queries but lose accuracy in subsequent requests
- Methods maintaining O(n) memory with sub-O(n²) computation during pre-filling can better approximate full attention accuracy across multiple queries
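A toy model can show why this happens. In the sketch below the cache is a list of facts, and a query-aware eviction step keeps only a small budget of positions chosen for the first query. This is a deliberately simplified stand-in, not the algorithm of any method evaluated in SCBench; the scoring rule (exact key match, then recency) and all names are assumptions.

```python
from typing import List, Optional, Tuple

Fact = Tuple[str, str]  # (key, value) stored at one context position


def compress_for_query(cache: List[Fact], query_key: str, budget: int) -> List[Fact]:
    """Toy query-aware eviction: rank positions by relevance to the first query
    (exact key match first, then recency) and keep only `budget` of them."""
    ranked = sorted(
        range(len(cache)),
        key=lambda i: (cache[i][0] == query_key, i),
        reverse=True,
    )
    kept = sorted(ranked[:budget])  # restore the original position order
    return [cache[i] for i in kept]


def lookup(cache: List[Fact], query_key: str) -> Optional[str]:
    """Return the value stored for `query_key`, or None if it was evicted."""
    for key, value in cache:
        if key == query_key:
            return value
    return None


full_cache = [(f"k{i}", f"v{i}") for i in range(16)]          # O(n) memory
small_cache = compress_for_query(full_cache, "k3", budget=4)  # sub-O(n) memory

print(lookup(small_cache, "k3"))  # v3: the first query still succeeds
print(lookup(small_cache, "k5"))  # None: the follow-up needs an evicted entry
print(lookup(full_cache, "k5"))   # v5: the full cache answers both
```

The compressed cache was tuned to the first query, so a later query that needs different positions finds nothing, while the full cache serves both.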
### Finding 2: Task Performance Shows Varying Decline Patterns
- Sparse KV cache methods excel in tasks requiring global information processing
- O(n) memory is essential for tasks involving exact match retrieval
### Finding 3: Performance vs Compression Rate
- All methods show performance degradation as compression rates increase
- Sub-O(n) memory methods exhibit a significant drop at 1/4 compression rate
- Methods like RetrievalAttention and KIVI that maintain O(n) memory with sparse decoding show better resilience at higher compression rates
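For a sense of scale, the back-of-the-envelope calculation below estimates the KV cache size of a dense cache and of one that keeps a quarter of the tokens. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 128K-token context are hypothetical, and reading a 1/4 compression rate as "keep one token in four" is an assumption made for this example.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Keys and values (the factor 2) for every layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
seq_len = 131072  # 128K tokens

full = kv_cache_bytes(seq_len=seq_len, **cfg)
quarter = kv_cache_bytes(seq_len=seq_len // 4, **cfg)

print(full / 2**30)     # 16.0 (GiB, dense cache)
print(quarter / 2**30)  # 4.0 (GiB, a quarter of the tokens kept)
```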
### Finding 4: Issues with Long-Generation Scenarios
- Attention distribution shifts significantly as generation length and number of rounds increase
- This out-of-distribution (OOD) issue impacts performance even for O(n) memory methods
### Finding 5: Dynamic vs Static Patterns
- Dynamic sparse patterns generally outperform static patterns
## Citation
```bibtex
@article{li2024scbench,
title={SCBench: A KV cache-centric analysis of long-context methods},
author={Li, Yucheng and Jiang, Huiqiang and Wu, Qianhui and Luo, Xufang and Ahn, Surin and Zhang, Chengruidong and Abdi, Amir H and Li, Dongsheng and Gao, Jianfeng and Yang, Yuqing and Qiu, Lili},
journal={arXiv preprint arXiv:2412.10319},
year={2024}
}
```
Provider: maas
Created: 2025-07-22



