Have you ever wondered why Large Language Models (LLMs) sometimes take time to generate responses or require high computing resources? LLMs power chatbots, virtual assistants, automated content generation, and complex question answering.
However, as these models become larger and more advanced, inference becomes slower and more expensive. The major bottleneck is autoregressive decoding: each new token must attend to all previous tokens, so without any reuse the total computation grows quadratically with sequence length, driving up latency and operating costs.
In this article, we will discuss Normal Inference, KV Cache, and LMCache, and how these caching methods improve responsiveness and throughput. By the end, you’ll see how caching reduces costs and boosts performance, so keep reading to learn more.
What is Normal Inference in LLM?
Normal inference in a Large Language Model (LLM) is the standard way the model generates text from a prompt. It uses an autoregressive approach, predicting one token at a time based on the input and all previously generated tokens, without storing intermediate results for later reuse.
How Does Normal Inference Work in LLM?
Tokenization
The input text is first broken down into smaller units called tokens. These can be words or subword pieces, depending on the tokenizer.
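For instance, here is a quick way to see tokenization in practice with a Hugging Face tokenizer; GPT-2 and the example sentence are purely illustrative choices:

```python
from transformers import AutoTokenizer

# Load a tokenizer; GPT-2 is used here only for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Caching speeds up LLM inference."
token_ids = tokenizer.encode(text)                     # integer token IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)    # the subword pieces

print(tokens)      # e.g. ['C', 'aching', 'Ġspeeds', 'Ġup', ...]
print(token_ids)   # the corresponding IDs that are fed to the model
```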
Processing the Input (Prefill Phase)
All input tokens are processed together to compute initial hidden states. This includes intermediate key and value tensors for the attention mechanism. This phase can be done in parallel because the whole input sequence is known from the start.
Autoregressive Decoding (Decode Phase)
Starting from the processed input, the model generates output tokens one at a time from left to right.
For each step, the model predicts the probability distribution of the next token based on all previous tokens.
The most likely token or a sampled token is chosen and added to the input sequence.
The process continues until a stopping condition is met. This could be a maximum length, a special end token, or user input.
In normal inference, the model recalculates all attention operations for every new token instead of reusing previous results. This repeated computation increases latency and compute costs, especially for long sequences.
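To make that repeated work concrete, here is a minimal sketch of normal (cache-free) greedy decoding using the Hugging Face transformers library. GPT-2 and the prompt are illustrative choices; the key point is that use_cache=False forces the model to re-process the full sequence on every step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The benefits of caching are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate up to 20 new tokens
        # Normal inference: the FULL sequence is re-processed at every step,
        # and no key/value tensors are kept for reuse (use_cache=False).
        logits = model(input_ids, use_cache=False).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:  # stop at end-of-text
            break

print(tokenizer.decode(input_ids[0]))
```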
What is KV Cache in LLM?
KV Cache, which stands for Key-Value Cache, is an important method used in transformer-based large language models (LLMs) to speed up autoregressive token generation. During inference, transformers depend heavily on self-attention mechanisms that require computing interactions between the tokens processed so far. As the sequence length increases, recalculating these interactions becomes costly, leading to increased latency as the token count rises.
KV Cache addresses this issue by storing the key (K) and value (V) vectors from the self-attention mechanism for all previously processed tokens. Instead of recalculating K and V tensors for past tokens at each new step, the model reuses these cached tensors. The only new calculations needed are the query (Q), key, and value vectors for the current token; the query is compared against the cached keys, and the resulting weights are applied to the cached values. This reduces the complexity of self-attention from quadratic over the whole sequence to linear per token during decoding.
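Here is a minimal single-head PyTorch sketch of that idea (the dimensions and random weights are made up for illustration): each decode step computes only the new token's Q, K, and V, appends K and V to the cache, and attends over everything cached so far.

```python
import math
import torch

d = 64                                    # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

K_cache = torch.empty(0, d)               # cached keys for all past tokens
V_cache = torch.empty(0, d)               # cached values for all past tokens

def decode_step(x_new, K_cache, V_cache):
    """x_new: hidden state of the current token, shape (d,)."""
    q = x_new @ Wq                        # only the new token's query
    k = x_new @ Wk                        # new key ...
    v = x_new @ Wv                        # ... and value
    K_cache = torch.cat([K_cache, k.unsqueeze(0)])   # append instead of recompute
    V_cache = torch.cat([V_cache, v.unsqueeze(0)])
    scores = (K_cache @ q) / math.sqrt(d)             # compare q against ALL cached keys
    attn = torch.softmax(scores, dim=0)
    out = attn @ V_cache                               # weighted sum of cached values
    return out, K_cache, V_cache

# One decode step: cost grows only linearly with the number of cached tokens.
x = torch.randn(d)
out, K_cache, V_cache = decode_step(x, K_cache, V_cache)
```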
How Does KV Cache Work?
Transformers process tokens one at a time during generation.
For each new token, self-attention calculates the Query (Q), Key (K), and Value (V) vectors.
The KV Cache keeps the K and V vectors created for earlier tokens.
When decoding the next token, only the Q vector for that token is calculated.
Cached K and V tensors allow quick lookups during attention score calculation. This avoids repeating calculations for past tokens.
As a result, per-token generation stays fast, with latency remaining roughly constant instead of growing as the sequence gets longer.
Image source: Daily Dose of Data Science
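The sketch below repeats the earlier greedy-decoding loop, but with caching enabled through the past_key_values mechanism in Hugging Face transformers (again using GPT-2 only as a small illustrative model). After the prefill pass, each step feeds in just the newest token together with the cached keys and values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The benefits of caching are", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the whole prompt once and keep its K/V tensors.
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    for _ in range(19):
        # Decode: only the newest token is fed in; past K/V are reused.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(torch.cat([input_ids] + generated, dim=-1)[0]))
```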
Technical Advantages of KV Cache
Reduced compute overhead: By reusing cached K and V tensors, KV Cache avoids recalculating the entire self-attention matrix multiple times.
Near-constant time per token: the prefill still requires full computation, but later tokens reuse the cached keys and values, so per-token latency stays nearly flat as the sequence grows.
Broad adoption and support: KV Cache is used in nearly all transformer inference libraries, including Hugging Face’s Transformers, DeepSpeed-Inference, vLLM, and others.
Challenges of KV Cache
Memory Growth: The cache size increases with sequence length because each token adds new key-value pairs. This can quickly use up limited GPU memory for long inputs.
Bandwidth Demands: Fetching and moving cached KV tensors needs memory bandwidth, which may restrict throughput during heavy workloads.
Cache Management: The cache must be reset or cleared between unrelated inputs to prevent mixing contexts and ensure correct outputs.
GPU Memory Constraints: Since the KV cache usually stays in GPU memory for maximum speed, large caches may go beyond the available GPU RAM, limiting the maximum input sequence lengths.
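A back-of-the-envelope calculation shows how fast this memory grows. The formula is 2 (keys and values) × layers × heads × head dimension × bytes per element per token; the dimensions below are illustrative of a 7B-class model running in fp16.

```python
# Rough KV cache size estimate (illustrative 7B-class configuration, fp16).
num_layers   = 32
num_heads    = 32
head_dim     = 128
bytes_per_el = 2          # fp16
seq_len      = 4096
batch_size   = 8

# 2x because both keys and values are stored for every layer and head.
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_el
cache_bytes     = bytes_per_token * seq_len * batch_size

print(f"{bytes_per_token / 1024:.0f} KB per token")       # ~512 KB
print(f"{cache_bytes / 1024**3:.1f} GB for the batch")    # ~16.0 GB
```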
Typical KV Cache Implementation
Each transformer layer keeps separate KV cache buffers.
Position indices track the locations of cached tokens.
APIs provide access to the KV cache storage, allowing it to be saved, passed, or restored during batched decoding.
For each token generation call, the model takes the previous KV cache and returns an updated cache with new entries.
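A framework-agnostic sketch of that bookkeeping is shown below; the class and method names are hypothetical and not taken from any particular library.

```python
import torch

class LayerKVCache:
    """Hypothetical per-layer KV buffer, for illustration only."""

    def __init__(self, num_heads: int, head_dim: int):
        self.keys = torch.empty(num_heads, 0, head_dim)     # (heads, seq, dim)
        self.values = torch.empty(num_heads, 0, head_dim)

    @property
    def position(self) -> int:
        # Position index: how many tokens are currently cached.
        return self.keys.shape[1]

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        # Append the new token's K/V and return the full cached tensors,
        # mirroring the "take previous cache, return updated cache" pattern.
        self.keys = torch.cat([self.keys, new_k], dim=1)
        self.values = torch.cat([self.values, new_v], dim=1)
        return self.keys, self.values

# One buffer per transformer layer.
caches = [LayerKVCache(num_heads=12, head_dim=64) for _ in range(12)]
k, v = caches[0].update(torch.randn(12, 1, 64), torch.randn(12, 1, 64))
print(caches[0].position)  # 1 token cached in layer 0
```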
What is LMCache in LLM?
While KV Cache optimizes token-by-token reuse within a single sequence, LMCache introduces a chunk-based, distributed caching system designed to greatly improve latency and scalability for Large Language Model (LLM) deployments.
LMCache was developed to overcome limitations in traditional KV caching, particularly related to long contexts, multi-turn use cases, and distributed serving. It allows reuse of cached key-value pairs not only for prefixes but also anywhere in the input text, across various storage levels from GPUs to disk-backed cache servers.
How LMCache Works in LLM?
Chunking: Instead of caching tokens one by one or only using prefix-based KV pairs, LMCache breaks input into fixed-size chunks, such as 256 tokens.
Each chunk is hashed, which creates a unique key for querying the cache.
When a request comes in, LMCache checks its multi-tier backends, including GPU memory, CPU RAM, disk storage, or distributed caches like Redis, to find KV cache entries for matching chunks anywhere in the input.
It retrieves the KV data for matching chunks, which avoids repeating the computation for those tokens.
Unmatched chunks need standard attention computation, and their KV pairs are cached for later reuse.
This method works well for tasks with repeated text patterns, like multi-turn dialogue, summarization, or retrieval-augmented generation, where chunks often appear multiple times.
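The snippet below sketches the chunk-and-hash idea in plain Python. It is a conceptual illustration only, not LMCache's actual API: token IDs are split into fixed-size chunks, each chunk is hashed into a cache key, and hits skip recomputation while misses are computed and stored.

```python
import hashlib

CHUNK_SIZE = 256
kv_store = {}   # stand-in for LMCache's multi-tier backends

def chunk_key(token_ids):
    # Deterministic hash of the chunk's token IDs -> cache key.
    raw = ",".join(map(str, token_ids)).encode()
    return hashlib.sha256(raw).hexdigest()

def process_prompt(token_ids):
    hits, misses = 0, 0
    for start in range(0, len(token_ids), CHUNK_SIZE):
        chunk = token_ids[start:start + CHUNK_SIZE]
        key = chunk_key(chunk)
        if key in kv_store:
            hits += 1        # reuse the cached KV data for this chunk
        else:
            misses += 1      # compute attention for the chunk, then cache it
            kv_store[key] = f"placeholder for KV tensors of chunk at offset {start}"
    return hits, misses

prompt = list(range(1000))          # fake token IDs for illustration
print(process_prompt(prompt))       # (0, 4): every chunk is computed
print(process_prompt(prompt))       # (4, 0): every chunk is reused
```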
LMCache Architecture and Components
Image source: HackerNoon
Multi-tier Cache Storage
LMCache supports a hierarchy of storage layers, ordered from fastest to largest.
GPU memory: Fastest access, limited size.
CPU RAM: Larger but slower.
Disk cache: Persistent storage for a massive cache size.
Distributed caches: Redis or similar for shared access across instances.
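A simplified sketch of how such a tiered lookup might behave is shown below. The tier names and the promote-on-hit policy are assumptions for illustration; LMCache's actual backends and eviction behavior are configurable.

```python
# Tiers ordered fastest -> slowest; plain dicts stand in for the real backends
# (GPU memory, CPU RAM, disk, Redis), which this sketch does not implement.
tiers = {"gpu": {}, "cpu": {}, "disk": {}, "redis": {}}
TIER_ORDER = ["gpu", "cpu", "disk", "redis"]

def lookup(key):
    for name in TIER_ORDER:
        if key in tiers[name]:
            value = tiers[name][key]
            # Promote the entry to the fastest tier for future hits
            # (an assumed policy; real promotion/eviction is configurable).
            tiers["gpu"][key] = value
            return value, name
    return None, None

tiers["disk"]["chunk-abc"] = "cached KV tensors"
print(lookup("chunk-abc"))   # found in 'disk', then promoted to 'gpu'
print(lookup("chunk-abc"))   # now served from 'gpu'
```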
Hash-based Chunk Indexing
Deterministic hash functions (non-cryptographic or cryptographic) index KV chunks, giving fast cache lookups with low collision risk.
Asynchronous Retrieval
Cache queries are designed to be non-blocking, so the inference engine can keep generating even with partial cache hits.
Integration Layers
LMCache integrates closely with inference engines such as vLLM through dedicated APIs, enabling smooth management of chunk-level KV caches during token generation.
Technical Benefits of LMCache
Reduced Time to First Token (TTFT): By preloading reusable KV chunks from multi-level cache, LMCache significantly cuts latency for the first token and entire sequences.
Improved Scalability: It spreads the caching load across memory tiers and servers, allowing for longer contexts and higher throughput.
Cross-request and Cross-engine KV Reuse: This feature lets multiple parallel requests or engines share KV cache states, which reduces repeated calculations.
Better Memory Use: Chunk-level management prevents the need for large, single KV cache buffers and makes better use of hierarchical storage.
Harnesses Data Repetition: It performs well in situations where input texts overlap, which often occurs in conversational agents and document queries.
Use Cases of LMCache
Multi-turn Chatbots: Maintains and reuses KV cache over multiple conversational rounds where utterances often repeat or reference previous context.
Retrieval-Augmented Generation (RAG): Caches KV states of retrieved document segments reused across multiple queries.
Long Document Summarization: Chunk caching enables efficient incremental processing of large texts.
Comparison of KV Cache vs LMCache
Both KV Cache and LMCache aim to improve Large Language Model (LLM) inference by caching intermediate key-value data from transformer self-attention. However, they differ greatly in granularity, architecture, scalability, and use cases.
Scope
KV Cache stores key-value tensors at the token level and supports only prefix reuse. Cached keys and values help speed up generation when new tokens extend previously processed sequences.
LMCache caches at the chunk level, such as 256 tokens, and can reuse cached KV pairs anywhere in the input text. This allows for flexible and widespread reuse beyond simple prefixes.
Storage and Memory
KV Cache is typically kept in GPU memory for speed, but its size grows with sequence length, which limits the usable context under GPU memory constraints.
LMCache employs a hierarchical, multi-tier cache system that includes GPU RAM, CPU RAM, disk caches, and distributed caches like Redis. This setup allows for better scalability and more efficient use of resources.
Latency and Performance
KV Cache provides a steady latency for each token after the first one. However, it does not decrease the time needed for the first token in new sequences.
LMCache greatly shortens the time to the first token by preloading reusable KV chunks. It also boosts overall throughput, especially in multi-turn or retrieval-based tasks.
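One easy way to observe the cache's effect is to time greedy decoding with and without it, using the same Hugging Face APIs as in the earlier sketches. This is a rough benchmark sketch; absolute numbers depend entirely on your hardware and model size.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("Compare decoding latency:", return_tensors="pt").input_ids

def time_decode(use_cache: bool, steps: int = 50) -> float:
    ids, past = input_ids.clone(), None
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(steps):
            if use_cache:
                # Feed only the newest token once a cache exists.
                out = model(ids if past is None else ids[:, -1:],
                            past_key_values=past, use_cache=True)
                past = out.past_key_values
            else:
                out = model(ids, use_cache=False)   # recompute everything each step
            next_token = out.logits[:, -1, :].argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_token], dim=-1)
    return time.perf_counter() - start

print(f"no cache : {time_decode(False):.2f}s")
print(f"KV cache : {time_decode(True):.2f}s")   # typically noticeably faster
```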
Comparison Table: Normal Inference vs KV Cache vs LMCache
| Feature | Normal Inference | KV Cache | LMCache |
| --- | --- | --- | --- |
| Granularity | Token-level, no caching | Token-level, prefix reuse only | Chunk-level, reuse anywhere in input |
| Storage | No intermediate KV cache | GPU memory | Multi-tier: GPU, CPU, disk, distributed |
| Memory Efficiency | High compute, no cache memory | GPU RAM grows linearly with sequence length | Optimized across multiple storage tiers |
| Latency Impact | Highest latency per token | Near-constant latency after the first token | Reduces time to first token and overall latency |
| Integration Effort | Basic inference framework | Built into transformer inference engines | Requires an external caching system |
| Scalability | Limited by GPU memory and compute | Limited by local GPU memory | Scales across distributed environments |
| Best Use Cases | Simple text generation with no reuse optimization | Single-turn generation | Multi-turn chat, RAG, long contexts |
How To Choose The Right Inference Strategy?
Normal Inference
Best for very short prompts, single requests, early prototyping, or low-traffic scenarios where speed and resource use are not critical. It works well when context rarely repeats and simple setup is a priority. It is not suitable for long contexts or large-scale production.
KV Cache
Optimal for situations with steady or lengthy input sequences, where you need quick token generation and most new requests are just extensions of previous ones. It's suitable if you have enough GPU memory and want reliable per-token response times. It works well for moderate workloads and regular chat or Q&A interactions.
LMCache
Essential for environments with many simultaneous users, large deployments, or workloads with repeated chunks in the input. Use this method when reducing latency (including time to first token) and achieving distributed scalability are important, such as multi-turn sessions, RAG, or settings where memory is limited and caches must be shared across instances.
Conclusion
Normal inference generates tokens one by one without caching. This approach leads to high latency and increased compute time for each token. KV Cache makes it more efficient by saving key-value tensors at the token level for reuse. This cuts down latency after producing the first token, but it increases memory usage as the sequence gets longer.
LMCache takes KV Cache further by allowing chunk-based caching across different storage types, including GPU, CPU, disk, and distributed systems. This greatly reduces latency, including the time it takes to get the first token, and it scales better for complex multi-turn conversations and retrieval-augmented applications. While normal inference is the easiest option, KV Cache offers a reliable boost for single-turn generation. LMCache delivers the most scalable and flexible caching solution for demanding LLM tasks.
Shankari R
AI/ML Intern passionate about building intelligent systems using LLMs, NLP, and data-driven solutions. Skilled in Python and ML frameworks, with hands-on experience in Generative AI, vector databases, and model fine-tuning.