MICO: efficient query scheduling for multi-cloud deployed LLM inference service
Source: 中国科学数据 (China Scientific Data) | Updated: 2026-02-09 | Indexed: 2026-04-25
Download link:
https://www.sciengine.com/AA/doi/10.1007/s11432-024-4487-8
Resource description:
Given the powerful capabilities of large language models (LLMs), many tech companies offer LLM inference as a service, which may be deployed across multiple clouds to improve service quality. Computational overhead and cloud workload are crucial metrics in cloud computing task scheduling. However, the autoregressive nature of LLMs makes these metrics difficult to measure. Specifically, an LLM requires multiple iterations of computation to process a single query, and the number of iterations varies significantly from query to query. Moreover, because of these variations, batch-wise model inference widens the gap between the allocated and the actual computational load of each cloud, ultimately reducing computational resource utilization and the query processing throughput of the inference service. To this end, we propose MICO, which includes a query scheduling strategy based on response length prediction to achieve token-granularity workload distribution across clouds, and an inference framework that supports the flexible insertion of queries into the processing batch, eliminating the unnecessary computation that differing iteration counts introduce in batch-wise inference. We conducted experiments based on two GPT series models, and the results show that MICO can reduce KV-Cache resource consumption by 44.89% during inference and increase the query processing throughput of the service system by up to 2.2×.
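The two ideas in the description can be illustrated with a minimal sketch. It is not MICO's implementation: the greedy longest-first assignment, the function names, and the sample lengths are all assumptions made for illustration. The first function balances clouds by predicted response tokens rather than by query count; the second counts the iterations a fixed batch spends on queries that have already finished, which is the waste that flexible insertion of new queries aims to avoid.

```python
import heapq


def schedule_by_predicted_tokens(predicted_lengths, num_clouds):
    """Assign each query to the cloud with the smallest predicted token load.

    Hypothetical helper: predicted_lengths[i] is the predicted response
    length (in tokens) of query i, as produced by some length predictor.
    """
    # Heap entries are (accumulated predicted tokens, cloud index).
    heap = [(0, cloud) for cloud in range(num_clouds)]
    heapq.heapify(heap)
    assignment = {}
    # Placing long queries first is a common greedy heuristic (an assumption).
    order = sorted(range(len(predicted_lengths)),
                   key=lambda q: -predicted_lengths[q])
    for q in order:
        load, cloud = heapq.heappop(heap)
        assignment[q] = cloud
        heapq.heappush(heap, (load + predicted_lengths[q], cloud))
    loads = [0] * num_clouds
    for q, cloud in assignment.items():
        loads[cloud] += predicted_lengths[q]
    return assignment, loads


def static_batch_wasted_iterations(lengths):
    """Iterations spent on finished queries when a batch runs to its longest query."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)


# Illustrative lengths only; not data from the paper.
predicted = [100, 20, 30, 50]
assignment, loads = schedule_by_predicted_tokens(predicted, num_clouds=2)
print(loads)                                      # [100, 100]
print(static_batch_wasted_iterations(predicted))  # 200
```

Splitting these four queries by count would place two on each cloud regardless of length, whereas balancing by predicted tokens gives each cloud 100 tokens of work. The 200 wasted iterations in the fixed batch are slots that could instead be given to newly inserted queries.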
Created:
2025-06-24



