Abstract
Large Language Models (LLMs) are increasingly used in a wide range of real-world applications. However, their computational cost makes it challenging to serve them to large numbers of users. This thesis studies the efficiency bottlenecks in LLM inference systems and proposes a series of novel algorithms to improve inference efficiency. First, we explore redundancy in the input space of LLM inference and propose prompt compression, which builds a more compact representation of the input prompt and thereby reduces inference cost. Next, we propose MInference, a dynamic sparse attention algorithm that identifies the critical blocks of attention online and performs sparse computation efficiently on modern hardware. We also propose SCBench, a new benchmark for systematically studying the sparsity-accuracy trade-offs across various KV-cache-centric optimizations in LLM inference systems. In addition, we propose MMInference, a permutation-based sparse attention algorithm that accelerates the prefill stage of vision-language model (VLM) inference. We conclude this thesis by discussing the impact of our research and potential future work.