
Normal Inference vs KV Cache vs LMCache

Written by Shankari R
Feb 13, 2026
8 Min Read

Have you ever wondered why Large Language Models (LLMs) can feel slow to respond, or why inference costs rise so quickly at scale? I’m writing this because many teams hit the same wall: as context length grows and traffic increases, token generation gets slower and GPU usage becomes expensive.

A key bottleneck is autoregressive decoding. Each new token attends to everything generated so far, and attention work grows with sequence length, pushing latency up and reducing throughput.

In this article, you’ll learn the practical differences between Normal Inference, KV Cache, and LMCache and how each approach trades compute, memory, and scalability to reduce cost and improve responsiveness.

What is Normal Inference in LLM?

Normal inference is the default way an LLM generates text: it predicts tokens autoregressively, one at a time, conditioning on the prompt and all previously generated tokens, without reusing intermediate attention states across decoding steps. This makes it simple to run but increasingly expensive as sequence length grows.

How Does Normal Inference Work in LLM?

Tokenization

The input text is first broken down into smaller units called tokens. These can be words or subword pieces, depending on the tokenizer.  
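To make the idea concrete, here is a toy greedy longest-match subword tokenizer. This is purely illustrative: real LLM tokenizers (BPE, WordPiece, etc.) use learned vocabularies, and the tiny vocabulary below is made up for this sketch.

```python
# Illustrative only: a toy greedy longest-match subword tokenizer.
# Real LLMs use learned vocabularies (BPE, WordPiece, etc.); this
# vocab is a made-up example, not any model's actual vocabulary.
TOY_VOCAB = {"un", "break", "able", "the", "cat", "s"}

def toy_tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in TOY_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])         # unknown-character fallback
            i += 1
    return tokens

print(toy_tokenize("unbreakable"))  # → ['un', 'break', 'able']
```

A real tokenizer maps each piece to an integer ID; those IDs are what the model actually consumes.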

Processing the Input (Prefill Phase)

All input tokens are processed together to compute initial hidden states. This includes intermediate key and value tensors for the attention mechanism. This phase can be done in parallel because the whole input sequence is known from the start. 

Autoregressive Decoding (Decode Phase)  

Starting from the processed input, the model generates output tokens one at a time from left to right.  

  • For each step, the model predicts the probability distribution of the next token based on all previous tokens.  
  • The most likely token or a sampled token is chosen and added to the input sequence.  
  • The process continues until a stopping condition is met. This could be a maximum length, a special end token, or user input.
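The decode loop above can be sketched in a few lines. The "model" here is a stand-in: a hypothetical `next_token_logits` function over a four-token toy vocabulary, not a real transformer.

```python
# A minimal sketch of autoregressive greedy decoding. The "model" is a
# toy stand-in, not a real LLM forward pass.
VOCAB = ["<eos>", "hello", "world", "!"]

def next_token_logits(tokens: list[int]) -> list[float]:
    # Toy rule: after "hello" prefer "world", after "world" prefer "!",
    # then emit <eos>. A real model would run a transformer forward pass
    # over ALL tokens generated so far.
    table = {1: 2, 2: 3, 3: 0}
    preferred = table.get(tokens[-1], 0)
    return [1.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def generate(prompt: list[int], max_new_tokens: int = 10) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)                     # condition on all prior tokens
        nxt = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        tokens.append(nxt)
        if nxt == 0:                                           # stop on <eos>
            break
    return tokens

print([VOCAB[t] for t in generate([1])])  # → ['hello', 'world', '!', '<eos>']
```

Sampling-based decoding replaces the argmax with a draw from the probability distribution; the loop structure is identical.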

In normal inference, attention computations for prior tokens are effectively revisited during each decoding step because intermediate attention states are not reused. As sequences get longer, this increases decoding time and compute cost, especially under high concurrency.

What is KV Cache in LLM?

KV Cache (Key-Value Cache) is a widely used optimization in transformer inference that speeds up autoregressive decoding by storing the key (K) and value (V) tensors for tokens that have already been processed. Instead of recomputing these tensors for past tokens at every step, the model reuses them and computes only the new token’s states, improving decoding efficiency as sequences grow.

Concretely, the cache stores the key (K) and value (V) tensors produced by the self-attention mechanism for all previously processed tokens. At each new step, the model computes the query (Q), key, and value only for the current token; the new query is compared against the cached keys, and the resulting attention weights are applied to the cached values. This reduces the per-token cost of self-attention during decoding from quadratic to linear in sequence length.
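A quick back-of-envelope count makes the savings tangible. The functions below count K/V projection computations over a full generation; the numbers are illustrative bookkeeping, not FLOP-exact measurements.

```python
# Back-of-envelope: how many K/V projections are computed when
# generating n tokens, with and without a KV cache. Illustrative
# counts, not exact FLOPs.
def kv_projections_no_cache(n: int) -> int:
    # Without caching, step t re-projects K and V for all t tokens
    # seen so far: 1 + 2 + ... + n.
    return sum(range(1, n + 1))

def kv_projections_with_cache(n: int) -> int:
    # With a KV cache, each token's K and V are projected exactly once.
    return n

n = 1000
print(kv_projections_no_cache(n))    # → 500500
print(kv_projections_with_cache(n))  # → 1000
```

For a 1,000-token generation, caching turns roughly half a million redundant projections into one per token, which is why per-step decoding cost stops growing with re-computation.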

How Does KV Cache Work?

  1. Transformers process tokens one at a time during generation.
  2. For each new token, self-attention calculates the Query (Q), Key (K), and Value (V) vectors.
  3. The KV Cache keeps the K and V vectors created for earlier tokens.
  4. When decoding the next token, only the Q vector for that token is calculated.
  5. Cached K and V tensors allow quick lookups during attention score calculation. This avoids repeating calculations for past tokens.
  6. As a result, generating subsequent tokens becomes faster or maintains consistent latency instead of slowing down as the sequence grows.
KV Cache flowchart

Image source: Daily Dose of Data Science
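The steps above can be sketched as a single-head attention step with an explicit cache, in NumPy. The weight matrices and sizes are arbitrary illustrations, not a real model's parameters.

```python
# A minimal single-head attention step with a KV cache, in NumPy.
# Weights and dimensions are illustrative, not a real model's.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend_with_cache(x_new, cache):
    """Process ONE new token: project q/k/v for it only, reuse cached K and V."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    K = np.vstack([cache["K"], k]) if cache["K"] is not None else k[None]
    V = np.vstack([cache["V"], v]) if cache["V"] is not None else v[None]
    scores = (q @ K.T) / np.sqrt(d)          # new token attends to all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over past + current tokens
    out = weights @ V
    return out, {"K": K, "V": V}             # updated cache, one row longer

cache = {"K": None, "V": None}
for step in range(4):                        # decode 4 tokens incrementally
    x = rng.standard_normal(d)
    out, cache = attend_with_cache(x, cache)
print(cache["K"].shape)  # → (4, 8): one cached key per processed token
```

Note that each call projects K and V only for the new token; everything earlier comes straight out of the cache, which is exactly what step 5 above describes.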

Technical advantages of KVCache

  • Reduced compute overhead: By reusing cached K and V tensors, KV Cache avoids recalculating the entire self-attention matrix multiple times. 
  • More predictable decoding cost: After the prefill step, each new token avoids recomputing past K/V tensors, so decoding latency grows much more slowly than in normal inference. Throughput is still bounded by memory bandwidth and total KV size, but caching prevents repeated compute on earlier tokens.
  • Broad adoption and support: KV Cache is used in nearly all transformer inference libraries, including Hugging Face’s Transformers, DeepSpeed-Inference, vLLM, and others.

Challenges of KVCache

  • Memory Growth: KV cache grows with sequence length because each token adds K/V tensors per layer, which can exhaust GPU memory on long contexts.
  • Bandwidth Demands: Serving and reading large KV tensors can become bandwidth-bound at scale, reducing throughput.
  • Cache Management: KV must be cleared between unrelated requests to prevent context mixing.
  • GPU Memory Constraints: Keeping KV on GPU is fastest, but it caps maximum context and batch size based on available VRAM.

Typical KV Cache Implementation

  1. Each transformer layer keeps separate KV cache buffers. 
  2. Position indices track the locations of cached tokens. 
  3. APIs provide access to the KV cache storage, allowing it to be saved, passed, or restored during batched decoding. 
  4. For each token generation call, the model takes the previous KV cache and returns an updated cache with new entries.

Example:

```python
# One decoding step with Hugging Face Transformers: pass the cache in,
# read the updated cache back out for the next step.
outputs = model(input_ids=input_token, past_key_values=past_key_values, use_cache=True)
past_key_values = outputs.past_key_values
```


What is LMCache?

While KV Cache reuses attention states within a single request (primarily as a growing prefix), LMCache extends reuse by caching KV states in chunks and enabling retrieval across storage tiers (GPU, CPU, disk, or distributed cache). It is designed for deployments where prompts and retrieved context repeat across requests, such as multi-turn chat and RAG, so prefill work can be reused instead of recomputed.

LMCache was developed to overcome limitations in traditional KV caching, particularly related to long contexts, multi-turn use cases, and distributed serving. It allows reuse of cached key-value pairs not only for prefixes but also anywhere in the input text, across various storage levels from GPUs to disk-backed cache servers.

How Does LMCache Work in LLM?

  1. Chunking: Instead of caching tokens one by one or only using prefix-based KV pairs, LMCache breaks input into fixed-size chunks, such as 256 tokens.
  2. Each chunk is hashed, which creates a unique key for querying the cache.
  3. When a request comes in, LMCache checks its multi-tier backends, including GPU memory, CPU RAM, disk storage, or distributed caches like Redis, to find KV cache entries for matching chunks anywhere in the input.
  4. It retrieves the KV data for matching chunks, which avoids repeating computation for those tokens.
  5. Unmatched chunks need standard attention computation, and their KV pairs are cached for later reuse.
  6. This method works well for tasks with repeated text patterns, like multi-turn dialogue, summarization, or retrieval-augmented generation, where chunks often appear multiple times.
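The chunk-and-hash flow above can be sketched in a few lines. This is illustrative and not the actual LMCache API: token IDs are split into fixed-size chunks, each chunk is hashed, and the hash is used to look up previously computed KV data.

```python
# A sketch of LMCache-style chunk lookup (illustrative; not the real
# LMCache API). Tokens are split into fixed-size chunks; each chunk's
# hash is the cache key for its KV data.
import hashlib

CHUNK_SIZE = 4                   # LMCache uses larger chunks, e.g. 256 tokens
kv_store: dict[str, str] = {}    # hash -> (stand-in for) cached KV tensors

def chunk_key(chunk: list[int]) -> str:
    return hashlib.sha256(",".join(map(str, chunk)).encode()).hexdigest()

def process(tokens: list[int]) -> tuple[int, int]:
    hits = misses = 0
    for i in range(0, len(tokens), CHUNK_SIZE):
        key = chunk_key(tokens[i:i + CHUNK_SIZE])
        if key in kv_store:
            hits += 1                                  # reuse cached KV, skip prefill
        else:
            misses += 1
            kv_store[key] = f"kv-for-{key[:8]}"        # compute once, cache for reuse
    return hits, misses

print(process([1, 2, 3, 4, 5, 6, 7, 8]))  # → (0, 2)  first request: all misses
print(process([1, 2, 3, 4, 9, 9, 9, 9]))  # → (1, 1)  shared first chunk hits
```

Because the key depends only on the chunk's contents, a repeated passage hits the cache no matter where it appears in the input, which is what distinguishes this from prefix-only reuse.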

LMCache Architecture and Components

Image source: hackernoon

Multi-tier Cache Storage

LMCache supports a hierarchy of storage tiers, checked from fastest to slowest.  

  • GPU memory: Fastest access, limited size.  
  • CPU RAM: Larger but slower.  
  • Disk cache: Persistent storage for a massive cache size.  
  • Distributed caches: Redis or similar for shared access across instances.
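A toy lookup across these tiers might look like the sketch below. The tier names and the promote-on-hit policy are assumptions for illustration, not LMCache's actual implementation: faster tiers are checked first, and a hit in a slower tier is copied upward so the next access is fast.

```python
# A toy multi-tier cache lookup in the spirit of LMCache's storage
# hierarchy (illustrative; tier names and promotion policy are
# assumptions, not the real implementation).
tiers = {
    "gpu":  {},             # fastest, smallest
    "cpu":  {"b": "kv-b"},  # larger, slower
    "disk": {"c": "kv-c"},  # largest, slowest
}
ORDER = ["gpu", "cpu", "disk"]

def lookup(key: str):
    for name in ORDER:                  # check fastest tier first
        if key in tiers[name]:
            if name != "gpu":
                tiers["gpu"][key] = tiers[name][key]  # promote on hit
            return tiers[name][key], name
    return None, None

print(lookup("c"))  # → ('kv-c', 'disk')  found on disk, promoted to GPU
print(lookup("c"))  # → ('kv-c', 'gpu')   served from GPU memory now
```

A production system would also need eviction (moving cold entries down the hierarchy when GPU memory fills), which this sketch omits.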

Hash-based Chunk Indexing

Statistical or cryptographic hash functions index KV chunks for quick cache lookup and collision resistance.  

Asynchronous Retrieval

Cache queries are designed to be non-blocking. This allows model engines to continue with partial cache hits.  
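The idea can be sketched with `asyncio` (illustrative; not LMCache's internal code). Chunk lookups run concurrently, and chunks that miss fall back to computation, so a partial hit never stalls the whole request.

```python
# A sketch of asynchronous, non-blocking cache queries (illustrative;
# not LMCache's actual internals). Lookups for all chunks run
# concurrently; misses fall back to recomputation.
import asyncio

CACHE = {"chunk-0": "kv0", "chunk-2": "kv2"}

async def fetch_or_compute(chunk_id: str) -> tuple[str, str]:
    await asyncio.sleep(0)                         # stand-in for I/O latency
    if chunk_id in CACHE:
        return chunk_id, CACHE[chunk_id]           # cache hit
    return chunk_id, f"computed-{chunk_id}"        # miss: recompute KV

async def prefill(chunk_ids: list[str]) -> dict[str, str]:
    # Issue all lookups at once instead of waiting on each in turn.
    results = await asyncio.gather(*(fetch_or_compute(c) for c in chunk_ids))
    return dict(results)

print(asyncio.run(prefill(["chunk-0", "chunk-1", "chunk-2"])))
```

In a real engine, the "compute" branch would be the GPU prefill for that chunk, overlapped with fetches for the chunks that did hit.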

Integration Layers

LMCache integrates closely with inference engines, like vLLM, using specialized APIs. This enables smooth management of chunk KV caches during token generation.  

Technical Benefits of LMCache

  • Reduced Time to First Token (TTFT): By preloading reusable KV chunks from multi-level cache, LMCache significantly cuts latency for the first token and entire sequences.
  • Improved Scalability: It spreads the caching load across memory tiers and servers, allowing for longer contexts and higher throughput.
  • Cross-request and Cross-engine KV Reuse: This feature lets multiple parallel requests or engines share KV cache states, which reduces repeated calculations.
  • Better Memory Use: Chunk-level management prevents the need for large, single KV cache buffers and makes better use of hierarchical storage.
  • Harnesses Data Repetition: It performs well in situations where input texts overlap, which often occurs in conversational agents and document queries.

Use Cases of LMCache

  • Multi-turn Chatbots: Maintains and reuses KV cache over multiple conversational rounds where utterances often repeat or reference previous context.
  • Retrieval-Augmented Generation (RAG): Caches KV states of retrieved document segments reused across multiple queries.
  • Long Document Summarization: Chunk caching enables efficient incremental processing of large texts.

Comparison of KV Cache vs LMCache

Both KV Cache and LMCache aim to improve Large Language Model (LLM) inference by caching intermediate key-value data from transformer self-attention. However, they differ greatly in granularity, architecture, scalability, and use cases.


Scope

  • KV Cache stores key-value tensors at the token level and supports only prefix reuse. Cached keys and values help speed up generation when new tokens extend previously processed sequences.
  • LMCache caches at the chunk level, such as 256 tokens, and can reuse cached KV pairs anywhere in the input text. This allows for flexible and widespread reuse beyond simple prefixes.

Storage and Memory

  • KV Cache is mainly stored in GPU memory to increase speed. However, it grows in size with the length of the sequence. This limits the context size because of memory restrictions.
  • LMCache employs a hierarchical, multi-tier cache system that includes GPU RAM, CPU RAM, disk caches, and distributed caches like Redis. This setup allows for better scalability and more efficient use of resources.

Latency and Performance

  • KV Cache stabilizes decoding performance by avoiding recomputation of past K/V tensors during generation, but it does not reduce the initial prefill cost for a brand-new prompt.
  • LMCache can reduce prefill cost when parts of the input match cached chunks, which is why it can improve TTFT and throughput in multi-turn and retrieval-heavy workloads.

Comparison Table: Normal Inference vs KV Cache vs LMCache

| Feature | Normal Inference | KV Cache | LMCache |
| --- | --- | --- | --- |
| Granularity | Token-level, no caching | Token-level, prefix reuse only | Chunk-level, reuse anywhere in input |
| Storage | No intermediate KV cache | GPU memory | Multi-tier: GPU, CPU, disk, distributed |
| Memory Efficiency | High compute, no cache memory | GPU RAM grows linearly with sequence length | Optimized across storage tiers |
| Latency Impact | Highest latency per token | Near-constant per-token latency after prefill | Reduces time to first token and overall latency |
| Integration Effort | Basic inference framework | Built into transformer engines | Requires external caching system |
| Scalability | Limited by GPU memory and compute | Limited by local GPU memory | Scales across distributed environments |
| Best Use Cases | Simple text generation with no reuse | Single-turn generation | Multi-turn, RAG, long contexts |

How To Choose The Right Inference Strategy?

Normal Inference

  • Best for short prompts, low traffic, experiments, or one-off requests where context rarely repeats. It keeps implementation simple, but costs rise quickly with longer contexts and higher concurrency.

KV Cache

  • Best default for production decoding. Use it when you want reliable generation latency for single-turn or standard chat flows and you have sufficient GPU memory to hold KV for your target context and batch sizes.

LMCache

  • Best when repeated context is common across requests: multi-turn assistants, RAG pipelines with reused passages, templated prompts, or shared system prompts at scale. Use it when TTFT and prefill cost matter and when you need to extend caching beyond a single GPU’s memory.

Conclusion

Normal inference generates tokens autoregressively without reusing attention states, which makes it straightforward but increasingly expensive as context length and traffic grow. KV Cache improves decoding efficiency by reusing past key/value tensors during generation, reducing repeated computation during token-by-token decoding, at the cost of GPU memory that grows with sequence length.

LMCache extends reuse further by caching KV states in chunks and retrieving them across memory tiers or shared caches. This makes it most useful when inputs overlap across requests, such as multi-turn assistants and RAG, where reducing prefill work can improve TTFT and throughput.

In practice: Normal inference is fine for small workloads, KV Cache is the standard for most deployments, and LMCache is the next step when reuse and scale make prefill cost a bottleneck.

Shankari R

AI/ML Intern passionate about building intelligent systems using LLMs, NLP, and data-driven solutions. Skilled in Python and ML frameworks, with hands-on experience in Generative AI, vector databases, and model fine-tuning.

