asdwb/cara_latency_prediction

Name: asdwb/cara_latency_prediction
Creator: asdwb
Published: 2026-04-11 16:33:03
License: No description available

Hugging Face | Updated 2026-04-11 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/asdwb/cara_latency_prediction

Resource description:

---
license: apache-2.0
language:
- en
tags:
- llm-serving
- latency-prediction
- heterogeneous-serving
- scheduling
size_categories:
- 100K<n<1M
---

# CARA Latency Prediction Dataset

A large-scale dataset for training latency predictors in heterogeneous LLM serving systems. Contains 404,258 request-level records from 18 model instances across 4 GPU types, collected under varying QPS levels (8-24). Data from sweep 2 (March 18-19, 2026) with no concurrency cap.

## Dataset Description

Each record captures the **instance state at scheduling time** and the **actual end-to-end latency** after request completion. This enables training learned latency predictors that map observable system state to request latency, analogous to learned index structures in databases.

### Key Statistics

| Split | Records | Prompts Source | QPS Range | Notes |
|-------|---------|---------------|-----------|-------|
| Train | 359,258 | cara_model_estimator train split | 8-24 | 671 records dropped (unresolvable queue entries) |
| Test | 45,000 | cara_model_estimator test split | 8-24 | 0 dropped |

### Cluster Configuration

18 instances across 4 GPU types serving 4 Qwen2.5 model sizes:

| Model | GPU | Instances | Tensor Parallel | vLLM Version |
|-------|-----|-----------|----------------|--------------|
| Qwen2.5-72B | A100 80GB x2 | 2 | 2 | cara_v_11 |
| Qwen2.5-14B | V100 32GB x4 | 3 | 4 | cara_v_11 |
| Qwen2.5-7B | A30 24GB | 5 | 1 | cara_v_11 |
| Qwen2.5-3B | A30 24GB | 3 | 1 | cara_v_11 |
| Qwen2.5-3B | P100 16GB | 5 | 1 | cara_v_11_p100 |

## Files

**Updated April 2026**: Data replaced with clean sweep 2 collection (no concurrency cap). Queue detail files enriched with per-request `actual_output_tokens` via cross-referencing (99.96% match rate).

| File | Records | Description |
|------|---------|-------------|
| `train.jsonl` | 359,258 | Training data, flat schema (schedule_state fields at top level) |
| `test.jsonl` | 45,000 | Test data, flat schema |
| `train_queue_details.jsonl` | 359,258 | Training data with per-request `running_requests[]` and `waiting_requests[]` + enriched `actual_output_tokens` |
| `test_queue_details.jsonl` | 45,000 | Test data with per-request lists + enriched `actual_output_tokens` |

### Flat vs Queue Details

- **Flat files** (`train.jsonl`, `test.jsonl`): Schedule state fields flattened to top level. Suitable for XGBoost and tabular models.
- **Queue detail files**: Full nested `schedule_state` with per-request running/waiting lists. Each queue entry includes `actual_output_tokens` (enriched post-collection). Suitable for LSTM sequence models.

## Schema

### Flat records (`train.jsonl`, `test.jsonl`)

**Identifiers:**

- `request_id` (str): Unique request identifier (UUID)
- `instance_id` (str): CloudLab hostname + port (e.g., `c240g5-110103.wisc.cloudlab.us_port8300`)
- `instance_type` (str): Model + GPU type (e.g., `qwen2.5-3b_p100`)

**Request features:**

- `num_prompt_tokens` (int): Input prompt length in tokens
- `num_predicted_output_tokens` (int): Max tokens parameter (1024 for all requests in this dataset)
- `actual_output_tokens` (int): Actual generated output tokens (recovered via `round((e2e-ttft)/tpot)+1`)

**Schedule state (from vLLM `/instance_stats` at scheduling time):**

- `num_running` (int): Currently running requests on this instance
- `num_waiting` (int): Queued requests (typically 0; vLLM continuous batching absorbs load)
- `num_active_decode_seqs` (int): Actively decoding sequences
- `decode_ctx_p50/p95/max` (float): Decode context length percentiles (0 for some instances)
- `pending_prefill_tokens` (int): Total pending prefill tokens (often 0)
- `pending_decode_tokens` (int): Total pending decode tokens (0 for some instances)
- `kv_cache_utilization` (float): KV cache usage fraction [0, 1]
- `kv_free_blocks` (int): Available KV cache blocks
- `token_budget_per_iter` (int): Scheduler token budget per iteration
- `prefill_chunk_size` (int): Chunked prefill size
- `max_num_seqs` (int): Max concurrent sequences allowed
- `num_preempted` (int): Cumulative preemption count
- `ema_decode_tok_per_s` (float): EMA decode throughput (tokens/sec)
- `ema_prefill_tok_per_s` (float): EMA prefill throughput (tokens/sec)
- `ema_decode_iter_ms` (float): EMA per-iteration decode latency (ms)
- `kv_evictions_per_s` (float): KV cache eviction rate

**Derived (flat files only):**

- `running_requests_count` (int): Count from per-request list
- `waiting_requests_count` (int): Count from per-request list

**Overhead:**

- `probe_latency_ms` (float): Time to fetch instance state from vLLM
- `prediction_latency_ms` (float): Latency predictor inference time

**Timestamps:**

- `prediction_timestamp` (float): Unix time when scheduling decision was made
- `completion_timestamp` (float): Unix time when request completed

**Target labels:**

- `actual_e2e_latency` (float): End-to-end latency in seconds (client-measured)
- `actual_ttft` (float): Time to first token in seconds
- `actual_tpot` (float): Mean time per output token in seconds (mean inter-token latency)

### Queue details (`train_queue_details.jsonl`, `test_queue_details.jsonl`)

Full nested structure with `schedule_state` containing all aggregate fields plus per-request lists:

- `schedule_state.running_requests[]`, per running request:
  - `request_id` (str): vLLM internal ID (`chatcmpl-` prefixed UUID)
  - `num_prompt_tokens` (int): Request's prompt length
  - `num_computed_tokens` (int): Tokens already processed (prefill progress)
  - `total_num_tokens` (int): Total allocated tokens at snapshot
  - `num_output_tokens` (int): Output tokens generated so far (0 during prefill)
  - `actual_output_tokens` (int): **Final output length** (enriched post-collection via cross-referencing request_ids, 99.96% match rate)
- `schedule_state.waiting_requests[]`: same schema as `running_requests`

## Collection Method

1. Real prompts sampled from the [cara_model_estimator](https://huggingface.co/datasets/asdwb/cara_model_estimator) dataset
2. Sent through a coordinator with random scheduling to 18 heterogeneous instances
3. Sidecar on each instance captures (state, latency) pairs
4. Collected at 10 QPS levels (9-36) covering idle to saturated conditions
5. Train/test split aligned with cara_model_estimator splits (no prompt overlap)

## Intended Use

- Training latency predictors for LLM serving schedulers (XGBoost, LSTM, neural networks)
- Benchmarking simulation-based vs learned latency prediction
- Studying heterogeneous LLM serving behavior across GPU types
- Evaluating scheduling policies under different load conditions

### Enriched Queue Entry Fields (queue_details files)

Each entry in `running_requests[]` and `waiting_requests[]` includes:

| Field | Description |
|-------|-------------|
| `num_prompt_tokens` | Request's prompt length |
| `num_computed_tokens` | Tokens already processed (prefill progress) |
| `total_num_tokens` | Total sequence length at snapshot |
| `num_output_tokens` | Output tokens generated so far |
| `actual_output_tokens` | **Final output length** (enriched via cross-referencing request_ids) |

### Offline Evaluation Results (verified April 2026)

| Model | E2E MAE | E2E MAPE | Spearman rho |
|-------|---------|----------|-------------|
| XGBoost (oracle output) | 0.289s | 3.7% | 0.999 |
| Roofline (analytical) | 0.531s | 10.7% | 0.987 |
| XGBoost TTFT | 0.015s | 16.2% | 0.946 |
| XGBoost TPOT | 0.002s | 42.0% | 0.978 |

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{cara_latency_2026,
  title={CARA Latency Prediction Dataset},
  author={Wei Da},
  year={2026},
  url={https://huggingface.co/datasets/asdwb/cara_latency_prediction}
}
```

## Related

- [cara_model_estimator](https://huggingface.co/datasets/asdwb/cara_model_estimator): Multi-model quality and length estimation dataset
- CARA: Context-Aware Resource Allocation for heterogeneous LLM serving
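
The schema notes that `actual_output_tokens` is recovered via `round((e2e-ttft)/tpot)+1`. The sketch below illustrates that relation and its inverse. The function names, the guard for a non-positive `tpot`, and the sample values are assumptions made for illustration; they are not taken from the dataset or from the authors' pipeline.

```python
def recover_output_tokens(e2e_s: float, ttft_s: float, tpot_s: float) -> int:
    """Apply round((e2e - ttft) / tpot) + 1, the formula quoted in the schema."""
    if tpot_s <= 0:
        # Assumption: with no measurable inter-token time, count only the first token.
        return 1
    return round((e2e_s - ttft_s) / tpot_s) + 1


def reconstruct_e2e(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Inverse relation: the first token arrives at ttft, followed by (n - 1) token gaps."""
    return ttft_s + (output_tokens - 1) * tpot_s


# Made-up values in seconds, chosen so the arithmetic is exact.
record = {"actual_e2e_latency": 4.25, "actual_ttft": 0.25, "actual_tpot": 0.03125}

n_tokens = recover_output_tokens(
    record["actual_e2e_latency"], record["actual_ttft"], record["actual_tpot"]
)
e2e_again = reconstruct_e2e(record["actual_ttft"], record["actual_tpot"], n_tokens)
print(n_tokens, e2e_again)
```

Because `actual_tpot` is a mean, the recovered count on real records is a rounded estimate rather than an exact token count.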

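Provider:

asdwb

Collected Summary

Dataset Introduction

Construction Method

Building an accurate latency predictor for heterogeneous LLM serving systems depends on high-quality data. This dataset contains 404,258 request-level records collected in a real serving environment through a carefully designed experimental procedure. The data comes from a heterogeneous cluster of 18 model instances spanning 4 GPU types, with the instances serving Qwen2.5 models of different sizes. Collection covered load conditions from idle to saturated, with the query rate varying between 8 and 24 requests per second. Each record captures the instance's internal state at scheduling time together with the actual end-to-end latency measured after the request completed, so every observed state is paired with the outcome that followed it. The train/test split strictly follows the split of the upstream prompt data, which avoids data leakage and gives a reliable basis for model training.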

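Features

The core value of this dataset is its comprehensive, fine-grained representation of system state. It provides instance-level aggregate metrics such as the number of running requests, the waiting-queue length, and KV cache utilization. The queue detail files additionally expose the state of each running or waiting request, including its prompt token count, computed token count, and the actual output token count obtained by cross-referencing. This two-level structure supplies the flattened features that tabular models such as XGBoost need, and it also makes the nested request lists convenient for sequence models such as LSTMs. The dataset covers GPU hardware from A100 to P100 and model sizes from 3B to 72B, which reflects the complexity of heterogeneous serving environments and provides rich samples for studying how system behavior differs under load.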
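
As a rough illustration of the two-level structure, the sketch below turns the per-request lists of one queue-detail record into short tuples that a sequence model could consume. The field names follow the dataset card. The record values, the helper names, and the assumption that an unmatched entry carries `None` for `actual_output_tokens` are hypothetical.

```python
import json

# Hypothetical queue-detail record: field names from the dataset card, values made up.
raw_line = json.dumps({
    "request_id": "example-request",
    "schedule_state": {
        "num_running": 2,
        "num_waiting": 1,
        "running_requests": [
            {"request_id": "chatcmpl-a", "num_prompt_tokens": 120, "num_computed_tokens": 149,
             "total_num_tokens": 150, "num_output_tokens": 30, "actual_output_tokens": 200},
            {"request_id": "chatcmpl-b", "num_prompt_tokens": 512, "num_computed_tokens": 256,
             "total_num_tokens": 512, "num_output_tokens": 0, "actual_output_tokens": 64},
        ],
        "waiting_requests": [
            # Assumption: an entry that could not be cross-referenced has no final length.
            {"request_id": "chatcmpl-c", "num_prompt_tokens": 80, "num_computed_tokens": 0,
             "total_num_tokens": 80, "num_output_tokens": 0, "actual_output_tokens": None},
        ],
    },
})


def entry_to_step(entry):
    """Map one queue entry to (remaining prefill, tokens generated, remaining decode)."""
    remaining_prefill = max(entry["num_prompt_tokens"] - entry["num_computed_tokens"], 0)
    generated = entry["num_output_tokens"]
    final = entry.get("actual_output_tokens")
    remaining_decode = None if final is None else max(final - generated, 0)
    return (remaining_prefill, generated, remaining_decode)


def queue_sequences(record):
    """Return per-request tuples for the running list and the waiting list."""
    state = record["schedule_state"]
    running = [entry_to_step(e) for e in state.get("running_requests", [])]
    waiting = [entry_to_step(e) for e in state.get("waiting_requests", [])]
    return running, waiting


running, waiting = queue_sequences(json.loads(raw_line))
print(running)
print(waiting)
```

The remaining-decode value uses the enriched final length, so it is only available offline; a live scheduler would have to substitute a predicted length.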

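Usage

The dataset is intended for latency prediction research in LLM serving systems. Choose the files according to the modeling need. For tabular prediction with tree models or neural networks, use the flat `train.jsonl` and `test.jsonl` files directly; their features are already flattened, which speeds up feature engineering and training. If the work involves sequence modeling or analyzing interactions between requests, use the queue detail files, which keep the full nested structure and the detailed context of each request. The dataset is pre-split: develop predictors on the training set, then evaluate prediction accuracy for end-to-end latency, time to first token, and time per output token on the independent test set. The resulting predictors can be used to optimize scheduling policies or to benchmark simulations of heterogeneous serving.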
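
A minimal, standard-library sketch of that workflow: read flat JSONL records into a feature matrix and compute MAE and MAPE for a set of predictions. The feature subset, the helper names, the sample records, and the stand-in predictions are assumptions for illustration only; a real experiment would train a model such as XGBoost on the full `train.jsonl`.

```python
import io
import json

# Assumed feature subset; the flat schema offers many more columns.
FEATURES = ["num_prompt_tokens", "num_running", "num_active_decode_seqs",
            "kv_cache_utilization", "ema_decode_iter_ms"]
TARGET = "actual_e2e_latency"


def load_flat(fileobj, features=FEATURES, target=TARGET):
    """Read one JSON object per line and return (feature rows, target values)."""
    rows, targets = [], []
    for line in fileobj:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append([float(record[name]) for name in features])
        targets.append(float(record[target]))
    return rows, targets


def mae(y_true, y_pred):
    """Mean absolute error in the unit of the target (seconds here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes every true value is non-zero."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)


# Two made-up records standing in for lines of train.jsonl.
sample = io.StringIO("\n".join(json.dumps(r) for r in [
    {"num_prompt_tokens": 128, "num_running": 3, "num_active_decode_seqs": 3,
     "kv_cache_utilization": 0.25, "ema_decode_iter_ms": 20.0, "actual_e2e_latency": 2.0},
    {"num_prompt_tokens": 512, "num_running": 8, "num_active_decode_seqs": 7,
     "kv_cache_utilization": 0.5, "ema_decode_iter_ms": 30.0, "actual_e2e_latency": 4.0},
]))

X, y = load_flat(sample)
predictions = [2.5, 3.0]  # stand-in for a trained model's output
print(mae(y, predictions), mape(y, predictions))
```

With the real files, the same loader can be called as `load_flat(open("train.jsonl"))`, since it only needs an iterable of lines.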

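Background and Challenges

Background Overview

As demand for large language model (LLM) serving has surged, efficient resource scheduling in heterogeneous computing environments has become a key research topic. The CARA Latency Prediction dataset was built in 2026 by Wei Da et al. to provide large-scale training data for latency prediction in heterogeneous LLM serving systems. It targets the core problem of accurately predicting a request's end-to-end latency under dynamic load, so that scheduling decisions can be optimized and system throughput and resource utilization improved. With more than 400,000 request-level records collected across four GPU types and eighteen model instances, it provides an empirical basis for developing learned latency predictors and supports the development of intelligent scheduling algorithms for cloud-native LLM serving.

Current Challenges

Accurately predicting request latency in heterogeneous LLM serving faces several challenges. First, system state is high-dimensional and dynamic: metrics such as KV cache utilization and the number of decoding sequences are coupled, which makes the mapping from state to latency highly nonlinear. Second, building the dataset requires overcoming the complexity of data collection: the instance state at scheduling time and the actual latency after completion must be captured together under real load, and data-consistency issues such as resolving queue entries and cross-referencing token counts must be handled. In addition, the dataset has to cover load conditions from idle to saturated so that predictors generalize across varying QPS, which places heavy demands on experimental design and data cleaning.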

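Common Scenarios

Classic Use Cases

In large-scale LLM serving systems, resource scheduling and latency prediction are central to maintaining quality of service. The classic use of the CARA latency prediction dataset is training learned latency predictors. Because it pairs the instance state at scheduling time with the end-to-end latency after completion, researchers can build models that map observable system state to request latency, analogous to learned index structures in databases. This typically means using tabular models such as XGBoost or sequence models such as LSTMs to model, in a heterogeneous GPU cluster, a complex state made up of the running-request queue, the waiting-request queue, and a rich set of system metrics. The resulting latency predictions provide the data basis for intelligent scheduling decisions.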

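Practical Applications

In real cloud-native LLM serving platforms, efficient resource management and quality-of-service guarantees are essential. The CARA latency prediction dataset directly supports development of the core scheduler for such platforms. A predictor trained on it can be integrated into a production scheduling system to estimate, in real time, the latency of assigning a new request to a specific model instance (for example, a Qwen2.5 model running on a particular GPU). The scheduler can then pursue goals such as load balancing, minimizing mean latency, or maximizing throughput while meeting service-level agreements. For example, in a heterogeneous cluster of A100, V100, A30, and P100 GPUs, the system can use the predictions to route each request to the most suitable instance, improving overall resource utilization and user experience.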
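
The routing idea can be sketched as picking the instance with the lowest predicted latency. Everything below is a hypothetical illustration: `toy_predict` is a stand-in for a trained predictor, the instance labels other than `qwen2.5-3b_p100` are invented in the same naming style, and the state values are made up. It does not reproduce the CARA scheduler.

```python
def toy_predict(request, state):
    """Stand-in predictor (assumption): output budget times per-iteration decode latency."""
    return request["num_predicted_output_tokens"] * state["ema_decode_iter_ms"] / 1000.0


def route(request, instance_states, predict):
    """Return (instance id, predicted latency in seconds) for the fastest predicted instance."""
    best_id, best_latency = None, float("inf")
    for instance_id, state in instance_states.items():
        latency = predict(request, state)
        if latency < best_latency:
            best_id, best_latency = instance_id, latency
    return best_id, best_latency


# Made-up snapshot of three instances, using field names from the flat schema.
instance_states = {
    "qwen2.5-7b_a30": {"num_running": 6, "ema_decode_iter_ms": 20.0},
    "qwen2.5-3b_p100": {"num_running": 2, "ema_decode_iter_ms": 35.0},
    "qwen2.5-3b_a30": {"num_running": 4, "ema_decode_iter_ms": 12.5},
}
request = {"num_prompt_tokens": 256, "num_predicted_output_tokens": 1024}

best_id, best_latency = route(request, instance_states, toy_predict)
print(best_id, best_latency)
```

A production scheduler would likely combine the latency estimate with other constraints, such as which model sizes are acceptable for the request; this sketch optimizes latency alone.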

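Derived and Related Work

As a resource for research on heterogeneous LLM serving, the CARA latency prediction dataset has given rise to several lines of related work. The most direct is the CARA (Context-Aware Resource Allocation) framework itself, which uses predictors trained on the dataset for intelligent scheduling. The dataset also supports benchmarking different predictors, such as XGBoost and LSTM, on the latency prediction task, and comparing learned methods with analytical ones such as the roofline model. Beyond that, it offers an experimental platform for more advanced scheduling strategies, for example studying scheduling stability under varying query rates or combining latency prediction with model quality estimation for multi-objective optimization. Together, these efforts advance the use of learning-augmented systems in LLM serving.

The content above was collected and summarized by 遇见数据集.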
