Abstract
Large Language Models (LLMs) are increasingly used in a wide range of real-world applications. However, their computational cost makes it challenging to serve them to large numbers of users. This thesis studies the efficiency bottlenecks in LLM inference systems and proposes a series of novel algorithms to improve inference efficiency. First, we explore redundancy in the input space of LLM inference and propose prompt compression, which builds a more compact representation of the input prompt and thereby reduces inference cost. Next, we propose MInference, a dynamic sparse attention algorithm that identifies the critical blocks of attention online and performs sparse computation efficiently on modern hardware. We also propose SCBench, a new benchmark for systematically studying the sparsity-accuracy trade-offs across various KV-cache-centric optimizations in LLM inference systems. In addition, we propose MMInference, a permutation-based sparse attention algorithm that accelerates the prefill stage of vision-language model (VLM) inference. We conclude this thesis by discussing the impact of our research and potential future work.