Abstract
Large language models (LLMs) have achieved remarkable performance across a wide
range of tasks. However, they struggle with long documents and extended
conversations, both because computational requirements, in memory and in
inference time, grow substantially with input length and because the input may
be truncated once it exceeds the LLM's fixed context length. This paper proposes
Selective Context, a method that improves the inference efficiency of LLMs by
identifying and pruning redundant content in the input context, making the
input more compact. We
evaluate our approach on common data sources that require long-context
processing, namely arXiv papers, news articles, and long conversations, on the
tasks of summarisation, question answering, and response generation.
Experimental results show that
Selective Context significantly reduces memory cost and decreases generation
latency while maintaining performance comparable to that achieved when the full
context is used. Specifically, we achieve a 50\% reduction in context
cost, resulting in a 36\% reduction in inference memory usage and a 32\%
reduction in inference time, while observing only a minor drop of 0.023 in
BERTScore and 0.038 in faithfulness across four downstream applications, indicating
that our method strikes a good balance between efficiency and performance.
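To give a concrete picture of what identifying and pruning redundant content
can look like in practice, the minimal sketch below scores each token of a long
input by its self-information under a small causal language model and keeps
only the most informative tokens. The choice of GPT-2 as the scoring model, the
token-level granularity, and the fixed keep ratio are illustrative assumptions,
not the procedure described later in the paper.
\begin{verbatim}
# Minimal, illustrative sketch: prune a long context by keeping only
# high self-information tokens, scored with a small causal LM (GPT-2).
# Model choice, token-level granularity, and keep ratio are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def prune_context(text: str, keep_ratio: float = 0.5) -> str:
    """Drop the lowest self-information tokens, keeping roughly keep_ratio."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Self-information of token t given its prefix: -log p(x_t | x_<t).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    if surprisal.numel() == 0:  # input too short to prune
        return text
    # Keep the first token plus the most surprising ones, in original order.
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = set((torch.topk(surprisal, k).indices + 1).tolist()) | {0}
    kept = [tid for i, tid in enumerate(input_ids[0].tolist()) if i in keep]
    return tokenizer.decode(kept)

print(prune_context("A long document or conversation history goes here ..."))
\end{verbatim}
The method presented in the remainder of the paper may differ in its scoring
model, pruning granularity, and selection criterion; the sketch is only meant
to convey the intuition behind the reported efficiency gains.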