Have you ever wondered why Large Language Models (LLMs) sometimes take time to generate responses or require high computing resources? LLMs power chatbots, virtual assistants, automated content generation, and complex question answering.
However, as these models become larger and more advanced, inference becomes slower and more expensive. The major bottleneck is autoregressive decoding: each new token must attend to all previous tokens, so without any reuse the total computation grows quadratically with sequence length, driving up latency and operating costs.
In this article, we will discuss Normal Inference, KV Cache, and LMCache, and how these caching methods improve responsiveness and throughput. By the end, you’ll see how caching reduces costs and boosts performance, so keep reading to learn more.
What is Normal Inference in LLM?
Normal inference in a Large Language Model (LLM) is the standard way the model generates text from a prompt. It uses an autoregressive approach, predicting one token at a time based on the input and all previously generated tokens, without storing intermediate results for later reuse.
How Does Normal Inference Work in LLM?
Tokenization
The input text is first broken down into smaller units called tokens. These can be words or subword pieces, depending on the tokenizer.
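For instance, here is a quick way to see tokenization in practice with a Hugging Face tokenizer; GPT-2 and the example sentence are purely illustrative choices:

```python
from transformers import AutoTokenizer

# Load a tokenizer; GPT-2 is used here only for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Caching speeds up LLM inference."
token_ids = tokenizer.encode(text)                     # integer token IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)    # the subword pieces

print(tokens)      # e.g. ['C', 'aching', 'Ġspeeds', 'Ġup', ...]
print(token_ids)   # the corresponding IDs that are fed to the model
```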
Processing the Input (Prefill Phase)
All input tokens are processed together to compute initial hidden states. This includes intermediate key and value tensors for the attention mechanism. This phase can be done in parallel because the whole input sequence is known from the start.
Autoregressive Decoding (Decode Phase)
Starting from the processed input, the model generates output tokens one at a time from left to right.
For each step, the model predicts the probability distribution of the next token based on all previous tokens.
The most likely token or a sampled token is chosen and added to the input sequence.
The process continues until a stopping condition is met. This could be a maximum length, a special end token, or user input.
In normal inference, the model recalculates all attention operations for every new token instead of reusing previous results. This repeated computation increases latency and compute costs, especially for long sequences.
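To make that repeated work concrete, here is a minimal sketch of normal (cache-free) greedy decoding using the Hugging Face transformers library. GPT-2 and the prompt are illustrative choices; the key point is that use_cache=False forces the model to re-process the full sequence on every step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The benefits of caching are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate up to 20 new tokens
        # Normal inference: the FULL sequence is re-processed at every step,
        # and no key/value tensors are kept for reuse (use_cache=False).
        logits = model(input_ids, use_cache=False).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:  # stop at end-of-text
            break

print(tokenizer.decode(input_ids[0]))
```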
What is KV Cache in LLM?
KV Cache, which stands for Key-Value Cache, is an important method used in transformer-based large language models (LLMs) to speed up autoregressive token generation. During inference, transformers depend heavily on self-attention mechanisms that require computing interactions between the tokens processed so far. As the sequence length increases, recalculating these interactions becomes costly, leading to increased latency as the token count rises.
KV Cache addresses this issue by storing the key (K) and value (V) vectors from the self-attention mechanism for all previously processed tokens. Instead of recalculating K and V tensors for past tokens at each new step, the model reuses these cached tensors. The only new calculations needed are the query (Q), key, and value vectors for the current token; the query is compared against the cached keys, and the resulting weights are applied to the cached values. This reduces the complexity of self-attention from quadratic over the whole sequence to linear per token during decoding.
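Here is a minimal single-head PyTorch sketch of that idea (the dimensions and random weights are made up for illustration): each decode step computes only the new token's Q, K, and V, appends K and V to the cache, and attends over everything cached so far.

```python
import math
import torch

d = 64                                    # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

K_cache = torch.empty(0, d)               # cached keys for all past tokens
V_cache = torch.empty(0, d)               # cached values for all past tokens

def decode_step(x_new, K_cache, V_cache):
    """x_new: hidden state of the current token, shape (d,)."""
    q = x_new @ Wq                        # only the new token's query
    k = x_new @ Wk                        # new key ...
    v = x_new @ Wv                        # ... and value
    K_cache = torch.cat([K_cache, k.unsqueeze(0)])   # append instead of recompute
    V_cache = torch.cat([V_cache, v.unsqueeze(0)])
    scores = (K_cache @ q) / math.sqrt(d)             # compare q against ALL cached keys
    attn = torch.softmax(scores, dim=0)
    out = attn @ V_cache                               # weighted sum of cached values
    return out, K_cache, V_cache

# One decode step: cost grows only linearly with the number of cached tokens.
x = torch.randn(d)
out, K_cache, V_cache = decode_step(x, K_cache, V_cache)
```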
How Does KV Cache Work?
Transformers process tokens one at a time during generation.
For each new token, self-attention calculates the Query (Q), Key (K), and Value (V) vectors.
The KV Cache keeps the K and V vectors created for earlier tokens.
When decoding the next token, only the Q vector for that token is calculated.
Cached K and V tensors allow quick lookups during attention score calculation. This avoids repeating calculations for past tokens.
As a result, per-token generation stays fast, with latency remaining roughly constant instead of growing as the sequence gets longer.
Image source: Daily Dose of Data Science
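The sketch below repeats the earlier greedy-decoding loop, but with caching enabled through the past_key_values mechanism in Hugging Face transformers (again using GPT-2 only as a small illustrative model). After the prefill pass, each step feeds in just the newest token together with the cached keys and values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The benefits of caching are", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the whole prompt once and keep its K/V tensors.
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    for _ in range(19):
        # Decode: only the newest token is fed in; past K/V are reused.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(torch.cat([input_ids] + generated, dim=-1)[0]))
```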
Technical Advantages of KV Cache
Reduced compute overhead: By reusing cached K and V tensors, KV Cache avoids recalculating the entire self-attention matrix multiple times.
Near-constant time per token: the prefill still requires full computation, but later tokens reuse the cached keys and values, so per-token latency stays nearly flat as the sequence grows.
Broad adoption and support: KV Cache is used in nearly all transformer inference libraries, including Hugging Face’s Transformers, DeepSpeed-Inference, vLLM, and others.
Challenges of KV Cache
Memory Growth: The cache size increases with sequence length because each token adds new key-value pairs. This can quickly use up limited GPU memory for long inputs.
Bandwidth Demands: Fetching and moving cached KV tensors needs memory bandwidth, which may restrict throughput during heavy workloads.
Cache Management: The cache must be reset or cleared between unrelated inputs to prevent mixing contexts and ensure correct outputs.
GPU Memory Constraints: Since the KV cache usually stays in GPU memory for maximum speed, large caches may go beyond the available GPU RAM, limiting the maximum input sequence lengths.
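A back-of-the-envelope calculation shows how fast this memory grows. The formula is 2 (keys and values) × layers × heads × head dimension × bytes per element per token; the dimensions below are illustrative of a 7B-class model running in fp16.

```python
# Rough KV cache size estimate (illustrative 7B-class configuration, fp16).
num_layers   = 32
num_heads    = 32
head_dim     = 128
bytes_per_el = 2          # fp16
seq_len      = 4096
batch_size   = 8

# 2x because both keys and values are stored for every layer and head.
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_el
cache_bytes     = bytes_per_token * seq_len * batch_size

print(f"{bytes_per_token / 1024:.0f} KB per token")       # ~512 KB
print(f"{cache_bytes / 1024**3:.1f} GB for the batch")    # ~16.0 GB
```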
Typical KV Cache Implementation
Each transformer layer keeps separate KV cache buffers.
Position indices track the locations of cached tokens.
APIs provide access to the KV cache storage, allowing it to be saved, passed, or restored during batched decoding.
For each token generation call, the model takes the previous KV cache and returns an updated cache with new entries.
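A framework-agnostic sketch of that bookkeeping is shown below; the class and method names are hypothetical and not taken from any particular library.

```python
import torch

class LayerKVCache:
    """Hypothetical per-layer KV buffer, for illustration only."""

    def __init__(self, num_heads: int, head_dim: int):
        self.keys = torch.empty(num_heads, 0, head_dim)     # (heads, seq, dim)
        self.values = torch.empty(num_heads, 0, head_dim)

    @property
    def position(self) -> int:
        # Position index: how many tokens are currently cached.
        return self.keys.shape[1]

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        # Append the new token's K/V and return the full cached tensors,
        # mirroring the "take previous cache, return updated cache" pattern.
        self.keys = torch.cat([self.keys, new_k], dim=1)
        self.values = torch.cat([self.values, new_v], dim=1)
        return self.keys, self.values

# One buffer per transformer layer.
caches = [LayerKVCache(num_heads=12, head_dim=64) for _ in range(12)]
k, v = caches[0].update(torch.randn(12, 1, 64), torch.randn(12, 1, 64))
print(caches[0].position)  # 1 token cached in layer 0
```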
What is LMCache in LLM?
While KV Cache optimizes token-by-token reuse within a single sequence, LMCache introduces a chunk-based, distributed caching system designed to greatly improve latency and scalability for Large Language Model (LLM) deployments.
LMCache was developed to overcome limitations in traditional KV caching, particularly related to long contexts, multi-turn use cases, and distributed serving. It allows reuse of cached key-value pairs not only for prefixes but also anywhere in the input text, across various storage levels from GPUs to disk-backed cache servers.
How LMCache Works in LLM?
Chunking: Instead of caching tokens one by one or only using prefix-based KV pairs, LMCache breaks input into fixed-size chunks, such as 256 tokens.
Each chunk is hashed, which creates a unique key for querying the cache.
When a request comes in, LMCache checks its multi-tier backends, including GPU memory, CPU RAM, disk storage, or distributed caches like Redis, to find KV cache entries for matching chunks anywhere in the input.
It retrieves the KV data for matching chunks, which avoids repeating the computation for those tokens.
Unmatched chunks need standard attention computation, and their KV pairs are cached for later reuse.
This method works well for tasks with repeated text patterns, like multi-turn dialogue, summarization, or retrieval-augmented generation, where chunks often appear multiple times.
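The snippet below sketches the chunk-and-hash idea in plain Python. It is a conceptual illustration only, not LMCache's actual API: token IDs are split into fixed-size chunks, each chunk is hashed into a cache key, and hits skip recomputation while misses are computed and stored.

```python
import hashlib

CHUNK_SIZE = 256
kv_store = {}   # stand-in for LMCache's multi-tier backends

def chunk_key(token_ids):
    # Deterministic hash of the chunk's token IDs -> cache key.
    raw = ",".join(map(str, token_ids)).encode()
    return hashlib.sha256(raw).hexdigest()

def process_prompt(token_ids):
    hits, misses = 0, 0
    for start in range(0, len(token_ids), CHUNK_SIZE):
        chunk = token_ids[start:start + CHUNK_SIZE]
        key = chunk_key(chunk)
        if key in kv_store:
            hits += 1        # reuse the cached KV data for this chunk
        else:
            misses += 1      # compute attention for the chunk, then cache it
            kv_store[key] = f"placeholder for KV tensors of chunk at offset {start}"
    return hits, misses

prompt = list(range(1000))          # fake token IDs for illustration
print(process_prompt(prompt))       # (0, 4): every chunk is computed
print(process_prompt(prompt))       # (4, 0): every chunk is reused
```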
LMCache Architecture and Components
Image source: HackerNoon
Multi-tier Cache Storage
LMCache supports a hierarchy of storage layers, ordered from fastest to largest.
GPU memory: Fastest access, limited size.
CPU RAM: Larger but slower.
Disk cache: Persistent storage for a massive cache size.
Distributed caches: Redis or similar for shared access across instances.
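A simplified sketch of how such a tiered lookup might behave is shown below. The tier names and the promote-on-hit policy are assumptions for illustration; LMCache's actual backends and eviction behavior are configurable.

```python
# Tiers ordered fastest -> slowest; plain dicts stand in for the real backends
# (GPU memory, CPU RAM, disk, Redis), which this sketch does not implement.
tiers = {"gpu": {}, "cpu": {}, "disk": {}, "redis": {}}
TIER_ORDER = ["gpu", "cpu", "disk", "redis"]

def lookup(key):
    for name in TIER_ORDER:
        if key in tiers[name]:
            value = tiers[name][key]
            # Promote the entry to the fastest tier for future hits
            # (an assumed policy; real promotion/eviction is configurable).
            tiers["gpu"][key] = value
            return value, name
    return None, None

tiers["disk"]["chunk-abc"] = "cached KV tensors"
print(lookup("chunk-abc"))   # found in 'disk', then promoted to 'gpu'
print(lookup("chunk-abc"))   # now served from 'gpu'
```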
Hash-based Chunk Indexing
Deterministic hash functions (non-cryptographic or cryptographic) index KV chunks, giving fast cache lookups with low collision risk.
Asynchronous Retrieval
Cache queries are designed to be non-blocking, so the inference engine can keep generating even with partial cache hits.
Integration Layers
LMCache integrates closely with inference engines such as vLLM through dedicated APIs, enabling smooth management of chunk-level KV caches during token generation.
Technical Benefits of LMCache
Reduced Time to First Token (TTFT): By preloading reusable KV chunks from multi-level cache, LMCache significantly cuts latency for the first token and entire sequences.
Improved Scalability: It spreads the caching load across memory tiers and servers, allowing for longer contexts and higher throughput.
Cross-request and Cross-engine KV Reuse: This feature lets multiple parallel requests or engines share KV cache states, which reduces repeated calculations.
Better Memory Use: Chunk-level management prevents the need for large, single KV cache buffers and makes better use of hierarchical storage.
Harnesses Data Repetition: It performs well in situations where input texts overlap, which often occurs in conversational agents and document queries.
Use Cases of LMCache
Multi-turn Chatbots: Maintains and reuses KV cache over multiple conversational rounds where utterances often repeat or reference previous context.
Retrieval-Augmented Generation (RAG): Caches KV states of retrieved document segments reused across multiple queries.
Long Document Summarization: Chunk caching enables efficient incremental processing of large texts.
Comparison of KV Cache vs LMCache
Both KV Cache and LMCache aim to improve Large Language Model (LLM) inference by caching intermediate key-value data from transformer self-attention. However, they differ greatly in granularity, architecture, scalability, and use cases.
Scope
KV Cache stores key-value tensors at the token level and supports only prefix reuse. Cached keys and values help speed up generation when new tokens extend previously processed sequences.
LMCache caches at the chunk level, such as 256 tokens, and can reuse cached KV pairs anywhere in the input text. This allows for flexible and widespread reuse beyond simple prefixes.
Storage and Memory
KV Cache is typically kept in GPU memory for speed, but its size grows with sequence length, which limits the usable context under GPU memory constraints.
LMCache employs a hierarchical, multi-tier cache system that includes GPU RAM, CPU RAM, disk caches, and distributed caches like Redis. This setup allows for better scalability and more efficient use of resources.
Latency and Performance
KV Cache provides a steady latency for each token after the first one. However, it does not decrease the time needed for the first token in new sequences.
LMCache greatly shortens the time to the first token by preloading reusable KV chunks. It also boosts overall throughput, especially in multi-turn or retrieval-based tasks.
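One easy way to observe the cache's effect is to time greedy decoding with and without it, using the same Hugging Face APIs as in the earlier sketches. This is a rough benchmark sketch; absolute numbers depend entirely on your hardware and model size.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("Compare decoding latency:", return_tensors="pt").input_ids

def time_decode(use_cache: bool, steps: int = 50) -> float:
    ids, past = input_ids.clone(), None
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(steps):
            if use_cache:
                # Feed only the newest token once a cache exists.
                out = model(ids if past is None else ids[:, -1:],
                            past_key_values=past, use_cache=True)
                past = out.past_key_values
            else:
                out = model(ids, use_cache=False)   # recompute everything each step
            next_token = out.logits[:, -1, :].argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_token], dim=-1)
    return time.perf_counter() - start

print(f"no cache : {time_decode(False):.2f}s")
print(f"KV cache : {time_decode(True):.2f}s")   # typically noticeably faster
```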
Comparison Table: Normal Inference vs KV Cache vs LMCache
| Feature | Normal Inference | KV Cache | LMCache |
| --- | --- | --- | --- |
| Granularity | Token-level, no caching | Token-level, prefix reuse only | Chunk-level, reuse anywhere in input |
| Storage | No intermediate KV cache | GPU memory | Multi-tier: GPU, CPU, disk, distributed |
| Memory Efficiency | High compute, no cache memory | GPU RAM grows linearly with sequence length | Optimized across multiple storage tiers |
| Latency Impact | Highest latency per token | Near-constant latency after the first token | Reduces time to first token and overall latency |
| Integration Effort | Basic inference framework | Built into transformer inference engines | Requires an external caching system |
| Scalability | Limited by GPU memory and compute | Limited by local GPU memory | Scales across distributed environments |
| Best Use Cases | Simple text generation with no reuse optimization | Single-turn generation | Multi-turn chat, RAG, long contexts |
How To Choose The Right Inference Strategy?
Normal Inference
Best for very short prompts, single requests, early prototyping, or low-traffic scenarios where speed and resource use are not critical. It works well when context rarely repeats and simple setup is a priority. It is not suitable for long contexts or large-scale production.
KV Cache
Optimal for situations with steady or lengthy input sequences, where you need quick token generation and most new requests are just extensions of previous ones. It's suitable if you have enough GPU memory and want reliable per-token response times. It works well for moderate workloads and regular chat or Q&A interactions.
LMCache
Essential for environments with many simultaneous users, large deployments, or workloads with repeated chunks in the input. Use this method when reducing latency (including time to first token) and achieving distributed scalability are important, such as multi-turn sessions, RAG, or settings where memory is limited and caches must be shared across instances.
Conclusion
Normal inference generates tokens one by one without caching. This approach leads to high latency and increased compute time for each token. KV Cache makes it more efficient by saving key-value tensors at the token level for reuse. This cuts down latency after producing the first token, but it increases memory usage as the sequence gets longer.
LMCache takes KV Cache further by allowing chunk-based caching across different storage types, including GPU, CPU, disk, and distributed systems. This greatly reduces latency, including the time it takes to get the first token, and it scales better for complex multi-turn conversations and retrieval-augmented applications. While normal inference is the easiest option, KV Cache offers a reliable boost for single-turn generation. LMCache delivers the most scalable and flexible caching solution for demanding LLM tasks.
Shankari R
AI/ML Intern passionate about building intelligent systems using LLMs, NLP, and data-driven solutions. Skilled in Python and ML frameworks, with hands-on experience in Generative AI, vector databases, and model fine-tuning.