Have you ever wondered why Large Language Models (LLMs) can feel slow to respond, or why inference costs rise so quickly at scale? I’m writing this because many teams hit the same wall: as context length grows and traffic increases, token generation slows down and GPU usage becomes expensive.
A key bottleneck is autoregressive decoding. Each new token attends to everything generated so far, and attention work grows with sequence length, pushing latency up and reducing throughput.
In this article, you’ll learn the practical differences between Normal Inference, KV Cache, and LMCache and how each approach trades compute, memory, and scalability to reduce cost and improve responsiveness.
What is Normal Inference in LLMs?
Normal inference is the default way an LLM generates text: it predicts tokens autoregressively, one at a time, conditioning on the prompt and all previously generated tokens, without reusing intermediate attention states across decoding steps. This makes it simple to run but increasingly expensive as sequence length grows.
How Does Normal Inference Work in LLMs?
Tokenization
The input text is first broken down into smaller units called tokens. These can be words or subword pieces, depending on the tokenizer.
Processing the Input (Prefill Phase)
All input tokens are processed together to compute initial hidden states. This includes intermediate key and value tensors for the attention mechanism. This phase can be done in parallel because the whole input sequence is known from the start.
Autoregressive Decoding (Decode Phase)
Starting from the processed input, the model generates output tokens one at a time from left to right.
For each step, the model predicts the probability distribution of the next token based on all previous tokens.
The most likely token or a sampled token is chosen and added to the input sequence.
The process continues until a stopping condition is met. This could be a maximum length, a special end token, or user input.
In normal inference, attention computations for prior tokens are effectively revisited during each decoding step because intermediate attention states are not reused. As sequences get longer, this increases decoding time and compute cost, especially under high concurrency.
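The cost pattern above can be sketched with a toy operation counter (illustrative only, not a real model): without reuse, each decode step recomputes K/V for every earlier token and scores the new query against all of them, so total work grows quadratically with output length.

```python
# Toy cost model (illustrative, not a real transformer): count attention
# "work units" per decode step when K/V states are NOT reused.

def decode_step_cost_no_cache(seq_len):
    """Work at one decode step: recompute K and V for all seq_len tokens,
    then score the new query against every key."""
    recompute_kv = seq_len       # K/V rebuilt for every token so far
    attention_scores = seq_len   # new query vs. every key
    return recompute_kv + attention_scores

def total_decode_cost_no_cache(prompt_len, new_tokens):
    """Total work to generate new_tokens after a prompt of prompt_len."""
    return sum(decode_step_cost_no_cache(prompt_len + i + 1)
               for i in range(new_tokens))
```

In this toy model, generating 10 tokens from an empty prompt costs 110 units, and doubling the output length roughly quadruples the total: the quadratic growth that caching is designed to remove.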
What is KV Cache in LLMs?
KV Cache (Key-Value Cache) is a widely used optimization in transformer inference that speeds up autoregressive decoding by storing the key (K) and value (V) tensors for tokens that have already been processed. Instead of recomputing these tensors for past tokens at every step, the model reuses them and computes only the new token’s states, improving decoding efficiency as sequences grow.
KV Cache addresses the recomputation problem described above by storing the key (K) and value (V) vectors from the self-attention mechanism for all previously processed tokens. Instead of recalculating K and V tensors for past tokens at each new step, the model reuses the cached tensors. The only new attention computation needed is for the current token: its query (Q) vector is scored against the cached keys, and the resulting weights are applied to the cached values. This reduces the cost of self-attention during decoding from quadratic to linear per step.
How Does KV Cache Work?
Transformers process tokens one at a time during generation.
For each new token, self-attention calculates the Query (Q), Key (K), and Value (V) vectors.
The KV Cache keeps the K and V vectors created for earlier tokens.
When decoding the next token, only the Q vector for that token is calculated.
Cached K and V tensors allow quick lookups during attention score calculation. This avoids repeating calculations for past tokens.
As a result, per-token generation latency stays roughly stable instead of slowing down as the sequence grows.
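The steps above can be made concrete with a minimal single-head attention cache (a toy sketch in plain Python, not any engine’s real API): past K/V vectors are appended once, and each new query is scored only against the cached keys.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class KVCache:
    """Minimal single-head KV cache sketch (illustrative, not a real
    engine). Keys/values are plain float vectors; each is appended once
    and reused at every later decode step."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Called once per token: store its K and V for all future steps.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Score the new query against all cached keys (dot product),
        # then mix the cached values by the softmax weights.
        # Assumes at least one (k, v) pair has been appended.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in self.keys]
        weights = softmax(scores)
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values))
                for i in range(dim)]
```

Note that `append` runs once per token while `attend` reuses everything already stored, which is exactly why per-step compute stops growing with recomputation.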
Image source: Daily Dose of Data Science
Technical Advantages of KV Cache
Reduced compute overhead: By reusing cached K and V tensors, KV Cache avoids recomputing attention states for past tokens at every decoding step.
More predictable decoding cost: After the prefill step, each new token avoids recomputing past K/V tensors, so decoding latency grows much more slowly than in normal inference. Throughput is still bounded by memory bandwidth and total KV size, but caching prevents repeated compute on earlier tokens.
Broad adoption and support: KV Cache is used in nearly all transformer inference libraries, including Hugging Face’s Transformers, DeepSpeed-Inference, vLLM, and others.
Optimizing LLM Performance: Normal Inference vs KVCache vs LMCache
Understand caching strategies that reduce latency and compute cost. Includes live inference benchmarks.
Murtuza Kutub
Co-Founder, F22 Labs
Walk away with actionable insights on AI adoption.
Limited seats available!
Saturday, 21 Mar 2026
10PM IST (60 mins)
Challenges of KV Cache
Memory Growth: KV cache grows with sequence length because each token adds K/V tensors per layer, which can exhaust GPU memory on long contexts.
Bandwidth Demands: Serving and reading large KV tensors can become bandwidth-bound at scale, reducing throughput.
Cache Management: The KV cache must be cleared or isolated between unrelated requests to prevent context mixing.
GPU Memory Constraints: Keeping KV on GPU is fastest, but it caps maximum context and batch size based on available VRAM.
Typical KV Cache Implementation
Each transformer layer keeps separate KV cache buffers.
Position indices track the locations of cached tokens.
APIs provide access to the KV cache storage, allowing it to be saved, passed, or restored during batched decoding.
For each token generation call, the model takes the previous KV cache and returns an updated cache with new entries.
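Hypothetically, the implementation points above might look like the following toy structure: per-layer buffers, a position index derived from cache length, and a snapshot method for saving and restoring state. All names here (`LayerKV`, `step`, `snapshot`) are illustrative, not any specific framework’s interface.

```python
class LayerKV:
    """Toy per-layer KV cache (illustrative sketch, not a real engine)."""

    def __init__(self, num_layers):
        # One (keys, values) buffer pair per transformer layer.
        self.buffers = [([], []) for _ in range(num_layers)]

    @property
    def seq_len(self):
        # Position index of the next token = tokens cached so far.
        return len(self.buffers[0][0])

    def step(self, per_layer_kv):
        """Append one token's (k, v) per layer; return the new length,
        i.e. the updated cache returned to the caller each decode step."""
        for (keys, values), (k, v) in zip(self.buffers, per_layer_kv):
            keys.append(k)
            values.append(v)
        return self.seq_len

    def snapshot(self):
        """Copy the cache so it can be saved, passed between batched
        decoding calls, or restored later."""
        return [(list(k), list(v)) for k, v in self.buffers]
```

A snapshot taken before further decoding is unaffected by later `step` calls, which is the property that makes save/pass/restore across batched requests possible.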
What is LMCache?
While KV Cache reuses attention states within a single request (primarily as a growing prefix), LMCache extends reuse by caching KV states in chunks and enabling retrieval across storage tiers (GPU, CPU, disk, or distributed cache). It is designed for deployments where prompts and retrieved context repeat across requests, such as multi-turn chat and RAG, so prefill work can be reused instead of recomputed.
LMCache was developed to overcome limitations in traditional KV caching, particularly related to long contexts, multi-turn use cases, and distributed serving. It allows reuse of cached key-value pairs not only for prefixes but also anywhere in the input text, across various storage levels from GPUs to disk-backed cache servers.
How Does LMCache Work in LLMs?
Chunking: Instead of caching tokens one by one or only using prefix-based KV pairs, LMCache breaks input into fixed-size chunks, such as 256 tokens.
Each chunk is hashed, which creates a unique key for querying the cache.
When a request comes in, LMCache checks its multi-tier backends, including GPU memory, CPU RAM, disk storage, or distributed caches like Redis, to find KV cache entries for matching chunks anywhere in the input.
It retrieves the KV data for matching chunks, avoiding repeated computation for those tokens.
Unmatched chunks need standard attention computation, and their KV pairs are cached for later reuse.
This method works well for tasks with repeated text patterns, like multi-turn dialogue, summarization, or retrieval-augmented generation, where chunks often appear multiple times.
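The chunk-and-hash flow above can be sketched as follows. This is illustrative only: real LMCache uses larger chunks (e.g. 256 tokens) and its own storage backends, and the 4-token chunk size and placeholder cache values here are toy assumptions.

```python
import hashlib

CHUNK_SIZE = 4  # toy value; LMCache uses larger chunks, e.g. 256 tokens

def chunk_key(tokens):
    """Stable content hash of a token chunk, used as the cache key."""
    data = ",".join(map(str, tokens)).encode()
    return hashlib.sha256(data).hexdigest()

def process(tokens, cache):
    """Split tokens into fixed-size chunks and look each up by hash.
    Returns (hits, misses); misses are 'computed' and then cached."""
    hits = misses = 0
    for i in range(0, len(tokens), CHUNK_SIZE):
        chunk = tokens[i:i + CHUNK_SIZE]
        key = chunk_key(chunk)
        if key in cache:
            hits += 1     # reuse cached KV data for this chunk
        else:
            misses += 1   # run attention for it, then cache the result
            cache[key] = f"kv-for-{key[:8]}"  # placeholder for KV tensors
    return hits, misses
```

Because lookups are keyed by chunk content rather than by position, a later request that contains the same chunks in a different order still hits the cache, which is the "reuse anywhere in the input" property described above.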
LMCache Architecture and Components
Image source: hackernoon
Multi-tier Cache Storage
LMCache supports a hierarchy of storage tiers, ordered from fastest and smallest to slowest and largest.
GPU memory: Fastest access, limited size.
CPU RAM: Larger but slower.
Disk cache: Persistent storage for a massive cache size.
Distributed caches: Redis or similar for shared access across instances.
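A minimal sketch of the tier fallback, with plain dicts standing in for GPU RAM, CPU RAM, disk, and a Redis-like distributed cache. The tier names, lookup order, and promote-on-hit policy here are illustrative assumptions, not LMCache’s actual implementation.

```python
class TieredCache:
    """Toy multi-tier cache lookup (illustrative sketch)."""

    def __init__(self):
        # Ordered fastest -> slowest, mirroring the tiers listed above.
        self.tiers = {"gpu": {}, "cpu": {}, "disk": {}, "redis": {}}

    def get(self, key):
        """Check each tier in speed order; on a hit, promote the entry
        into GPU memory so the next lookup is the fastest path."""
        for name, tier in self.tiers.items():
            if key in tier:
                value = tier[key]
                self.tiers["gpu"][key] = value
                return name, value
        return None, None

    def put(self, key, value, tier="gpu"):
        self.tiers[tier][key] = value
```

The promote-on-hit step is one common design choice for hierarchical caches: frequently reused chunks migrate toward the fast tier while cold chunks stay on cheaper storage.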
Hash-based Chunk Indexing
Fast non-cryptographic or cryptographic hash functions index KV chunks, giving quick cache lookups with a low risk of collisions.
Asynchronous Retrieval
Cache queries are designed to be non-blocking. This allows model engines to continue with partial cache hits.
Integration Layers
LMCache integrates closely with inference engines, like vLLM, using specialized APIs. This enables smooth management of chunk KV caches during token generation.
Technical Benefits of LMCache
Reduced Time to First Token (TTFT): By preloading reusable KV chunks from multi-level cache, LMCache significantly cuts latency for the first token and entire sequences.
Improved Scalability: It spreads the caching load across memory tiers and servers, allowing for longer contexts and higher throughput.
Cross-request and Cross-engine KV Reuse: This feature lets multiple parallel requests or engines share KV cache states, which reduces repeated calculations.
Better Memory Use: Chunk-level management prevents the need for large, single KV cache buffers and makes better use of hierarchical storage.
Harnesses Data Repetition: It performs well in situations where input texts overlap, which often occurs in conversational agents and document queries.
Use Cases of LMCache
Multi-turn Chatbots: Maintains and reuses KV cache over multiple conversational rounds where utterances often repeat or reference previous context.
Retrieval-Augmented Generation (RAG): Caches KV states of retrieved document segments reused across multiple queries.
Long Document Summarization: Chunk caching enables efficient incremental processing of large texts.
Comparison of KV Cache vs LMCache
Both KV Cache and LMCache aim to improve Large Language Model (LLM) inference by caching intermediate key-value data from transformer self-attention. However, they differ greatly in granularity, architecture, scalability, and use cases.
Scope
KV Cache stores key-value tensors at the token level and supports only prefix reuse. Cached keys and values help speed up generation when new tokens extend previously processed sequences.
LMCache caches at the chunk level, such as 256 tokens, and can reuse cached KV pairs anywhere in the input text. This allows for flexible and widespread reuse beyond simple prefixes.
Storage and Memory
KV Cache is mainly stored in GPU memory to increase speed. However, it grows in size with the length of the sequence. This limits the context size because of memory restrictions.
LMCache employs a hierarchical, multi-tier cache system that includes GPU RAM, CPU RAM, disk caches, and distributed caches like Redis. This setup allows for better scalability and more efficient use of resources.
Latency and Performance
KV Cache stabilizes decoding performance by avoiding recomputation of past K/V tensors during generation, but it does not reduce the initial prefill cost for a brand-new prompt.
LMCache can reduce prefill cost when parts of the input match cached chunks, which is why it can improve TTFT and throughput in multi-turn and retrieval-heavy workloads.
Table Comparison of Normal Inference vs KV Cache vs LMCache

| Feature | Normal Inference | KV Cache | LMCache |
|---|---|---|---|
| Granularity | Token-level, no caching | Token-level, prefix reuse only | Chunk-level, reuse anywhere in input |
| Storage | No intermediate KV cache | GPU memory | Multi-tier: GPU, CPU, disk, distributed |
| Memory Efficiency | High compute, no cache memory | GPU RAM grows linearly with sequence length | Optimized across storage tiers |
| Latency Impact | Highest latency per token | Near-constant latency after prefill | Reduces time to first token and overall latency |
| Integration Effort | Basic inference framework | Built into transformer engines | Requires external caching system |
| Scalability | Limited by GPU memory and compute | Limited by local GPU memory | Scales across distributed environments |
| Best Use Cases | Simple text generation with no reuse | Single-turn generation | Multi-turn, RAG, long contexts |
How To Choose The Right Inference Strategy?
Normal Inference
Best for short prompts, low traffic, experiments, or one-off requests where context rarely repeats. It keeps implementation simple, but costs rise quickly with longer contexts and higher concurrency.
KV Cache
Best default for production decoding. Use it when you want reliable generation latency for single-turn or standard chat flows and you have sufficient GPU memory to hold KV for your target context and batch sizes.
LMCache
Best when repeated context is common across requests: multi-turn assistants, RAG pipelines with reused passages, templated prompts, or shared system prompts at scale. Use it when TTFT and prefill cost matter and when you need to extend caching beyond a single GPU’s memory.
Conclusion
Normal inference generates tokens autoregressively without reusing attention states, which makes it straightforward but increasingly expensive as context length and traffic grow. KV Cache improves decoding efficiency by reusing past key/value tensors during generation, reducing repeated computation during token-by-token decoding, at the cost of GPU memory that grows with sequence length.
LMCache extends reuse further by caching KV states in chunks and retrieving them across memory tiers or shared caches. This makes it most useful when inputs overlap across requests, such as multi-turn assistants and RAG, where reducing prefill work can improve TTFT and throughput.
In practice: Normal inference is fine for small workloads, KV Cache is the standard for most deployments, and LMCache is the next step when reuse and scale make prefill cost a bottleneck.
Shankari R
AI/ML Intern passionate about building intelligent systems using LLMs, NLP, and data-driven solutions. Skilled in Python and ML frameworks, with hands-on experience in Generative AI, vector databases, and model fine-tuning.