PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding

Yunhe Pan; Yunqi Gao; Bing Hu; Mahdi Boloursaz Mashhadi; Yitong Duan; Pei Xiao; Yanfeng Zhang

Conference proceeding

Open access

Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306

The 43rd International Conference on Machine Learning (ICML 26) (Seoul, South Korea, 06/07/2026–11/07/2026)

30/04/2026

Abstract

Speculative decoding can significantly accelerate LLM inference, and its cloud-edge collaborative deployment additionally offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative inference frameworks with speculative decoding are constrained by (i) sequential token generation and communication, which leads to low resource utilization, and (ii) inflexible cloud non-autoregressive verification (NAV) triggering, which induces premature verification or costly rollbacks. In this paper, we propose PipeSD, an efficient cloud-edge collaborative pipeline inference framework with speculative decoding. PipeSD overlaps token generation and communication through a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner. We implement PipeSD using llama-cpp-python, PyTorch, and FastAPI, and evaluate it on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. Results show that PipeSD consistently outperforms state-of-the-art baselines, achieving 1.16×–2.16× speedup and reducing energy consumption by 14.3%–25.3%. Our code is available at https://anonymous.4open.science/r/PipeSD.

Files and links (2)

PDF: PipeSD (3.56 MB), Author's Accepted Manuscript, Open Access

URL: https://icml.cc/ (conference website)

Details

Title: PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
Creators: Yunhe Pan (Author) - Zhejiang Lab
Yunqi Gao (Author) - Zhejiang University
Bing Hu (Corresponding Author) - Zhejiang University
Mahdi Boloursaz Mashhadi (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Yitong Duan (Author) - Zhongguancun Institute of Artificial Intelligence (Beijing, China)
Pei Xiao (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Yanfeng Zhang (Author) - Northeastern University
Publication Details: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306
Conference: The 43rd International Conference on Machine Learning (ICML 26) (Seoul, South Korea, 06/07/2026–11/07/2026)
Publisher: Association for Computing Machinery (ACM)
Date accepted for publication: 30/04/2026
Grant note: This work is supported by the Zhongguancun Academy (Grant No. XTS0038).
Identifiers: 991128495002346
Academic Unit: School of Computer Science & Electronic Engineering
Language: English
Resource Type: Conference proceeding
