
If you’ve ever tried running a large language model on a CPU, you already know the pain. It works, but the latency feels unbearable. This usually leads to the obvious question:
“If my CPU can run the model, why do I even need a GPU?”
The short answer is performance. The long answer is what this blog is about.
Understanding GPU requirements for LLM inference is not about memorizing hardware specs. It’s about understanding where memory goes, what limits throughput, and how model choices translate into real hardware needs. Once you get this right, GPU sizing stops being trial-and-error and becomes predictable.
What Are GPU Requirements?
When discussing GPU requirements for LLM inference, the term is often misunderstood. It doesn’t refer to disk space or model file size. Instead, it describes the hardware capacity needed to run inference efficiently under real workloads.
In practice, GPU requirements come down to three things:
- Whether the model fits into available GPU VRAM
- How many tokens per second the GPU can generate
- How many concurrent requests it can handle without latency degradation
Disk storage only determines whether the model can be stored. Performance is governed by VRAM capacity, memory bandwidth, compute parallelism, and KV cache growth during inference.
In short, GPU requirements are defined by memory limits and throughput under load, not by how large the model file appears on disk.
Key Inputs Required to Calculate GPU Requirements for LLM Inference
Before you start estimating GPU requirements, you need to clearly define what you’re running and how you plan to run it. Without these inputs, any number you compute will be misleading.

Model size
The number of parameters (7B, 13B, 70B, etc.) determines how expensive each generated token is. Larger models require more compute per token and consume more memory, which directly reduces throughput and limits concurrency.
Numerical precision
Precision (FP16, INT8, INT4) controls how much VRAM the model weights and KV cache consume, and how fast the GPU can execute the math. Lower precision usually increases tokens/sec and allows more concurrent requests, at the cost of some quality.
Maximum sequence length (context window)
This defines how long a request can be. Every token in the sequence adds its attention keys and values to the KV cache, which stays in GPU memory until the request finishes. Longer contexts significantly reduce the number of requests that can run at the same time.
Expected throughput
You need to decide what metric matters for your system:
- Tokens per second, if you care about raw generation speed
- Queries per second (QPS) if you care about how many users you can serve
These are connected by the average number of tokens per request. You rarely know this figure precisely in advance, but you can still estimate the load above which the system is no longer safe.
Concurrency
Concurrency is how many requests are active at the same time. Each active request allocates its own KV cache, so higher concurrency increases VRAM usage even if throughput stays the same.
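To make the effect of sequence length and concurrency concrete, here is a minimal sketch of the standard KV cache size estimate: two tensors (keys and values) per layer, each holding `num_kv_heads × head_dim` values per token. The function names and the 7B-class shape used in the example (32 layers, 32 KV heads, head dimension 128, FP16 cache) are assumptions for illustration, not measurements from a specific model.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys + values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, concurrency, bytes_per_value=2):
    """Worst-case KV cache in GiB if every active request fills the full sequence length."""
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value)
    return per_token * seq_len * concurrency / 1024**3


# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16 cache (2 bytes).
print(kv_cache_bytes_per_token(32, 32, 128))                     # 524288 (0.5 MiB per token)
print(kv_cache_gib(32, 32, 128, seq_len=4096, concurrency=1))    # 2.0
print(kv_cache_gib(32, 32, 128, seq_len=4096, concurrency=8))    # 16.0
```

With these assumed numbers, eight concurrent 4,096-token requests need more KV cache than the FP16 weights of the model itself, even though throughput per request has not changed.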
Deployment style
How the model is deployed changes everything:
- Single GPU setups are simpler but limited by VRAM.
- Multi-GPU deployments (tensor or pipeline parallelism) change memory layout, latency, and scaling behavior.
GPU Memory Breakdown
Before doing any GPU sizing math, one common confusion needs to be cleared up: disk space is not GPU memory. These are completely different resources and are not interchangeable.
Disk space vs GPU memory
Model files are stored on disk (SSD or HDD). This is just storage. Nothing runs from the disk.
When inference starts, the model is loaded from disk into GPU VRAM. From that point on, disk space no longer matters for performance.
What actually lives in GPU VRAM
When a model runs on a GPU, several things consume VRAM simultaneously:
- Model weights: The parameters of the model are loaded fully into VRAM. If the weights don’t fit, the model cannot run on that GPU.
- Activations: Intermediate tensors created during forward passes. These are short-lived but still require VRAM while a token is being processed.
- KV cache: The dominant memory consumer during inference. KV cache stores attention keys and values for every token in every active request and remains allocated until the request finishes.
- Temporary buffers and CUDA overhead: Workspace memory for kernels, communication buffers (in multi-GPU setups), and framework overhead. This memory is always present and cannot be ignored.
What happens if VRAM is insufficient
If VRAM runs out, one of two things happens:
- The model fails to load and crashes immediately, or
- The system offloads to CPU memory, causing massive slowdowns and making the setup unusable for production
There is no graceful degradation: VRAM is a hard limit.
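Because the limit is hard, it helps to write the budget down explicitly before deploying. The sketch below adds up the four VRAM consumers listed earlier and checks the total against a GPU's capacity. The `VramBudget` class and the example figures are hypothetical placeholders; substitute numbers measured on your own stack.

```python
from dataclasses import dataclass


@dataclass
class VramBudget:
    """All values in GiB. Figures are estimates you supply, not measurements."""
    weights_gib: float
    kv_cache_gib: float
    activations_gib: float
    overhead_gib: float  # CUDA context, kernel workspaces, framework buffers

    def total_gib(self) -> float:
        return (self.weights_gib + self.kv_cache_gib
                + self.activations_gib + self.overhead_gib)

    def fits(self, gpu_vram_gib: float) -> bool:
        # Hard limit: either the whole budget fits or the deployment is not viable.
        return self.total_gib() <= gpu_vram_gib


# Hypothetical budget for a 7B FP16 model with a modest KV cache allowance.
budget = VramBudget(weights_gib=13.0, kv_cache_gib=2.0, activations_gib=0.5, overhead_gib=1.0)
print(budget.total_gib())   # 16.5
print(budget.fits(16.0))    # False
print(budget.fits(24.0))    # True
```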
Why memory is measured in GiB, not GB
GPUs report memory using binary units, not decimal ones.
- Disk storage uses decimal units: 1 GB = 1000 MB
- Memory uses binary units: 1 GiB = 1024 MiB
Because of this difference, advertised GPU memory appears smaller when reported by the system.
Practical implication
A capacity of 24 decimal GB corresponds to about:
24 × (1000 / 1024)³ ≈ 22.4 GiB usable VRAM
This difference matters when you’re tight on memory and planning model sizes or concurrency.
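The conversion is easy to get wrong by hand, so here is a small helper. It works at the byte level (1 GB = 1000³ bytes, 1 GiB = 1024³ bytes); the function name `gb_to_gib` is just an illustrative choice.

```python
def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes (1000**3 bytes) to binary gibibytes (1024**3 bytes)."""
    return gb * 1000**3 / 1024**3


print(round(gb_to_gib(24), 2))  # 22.35
print(round(gb_to_gib(80), 2))  # 74.51
```

When in doubt, trust what the driver reports for your specific card rather than the marketing figure, and plan against that number.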
Throughput Calculation (Tokens/sec → QPS)
Throughput determines how many users you can serve.
There are two common metrics:
- Tokens per second
- Queries per second (QPS)
They are connected by sequence length:
QPS ≈ Tokens/sec ÷ Tokens per request
Example:
- GPU throughput: 1,200 tokens/sec
- Average response length: 300 tokens
QPS ≈ 1200 ÷ 300 = 4 requests/sec
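The same approximation in code, in both directions. This is a rough sketch that assumes every request generates about the same number of tokens; the function names are illustrative.

```python
def qps_from_tokens_per_sec(tokens_per_sec: float, tokens_per_request: float) -> float:
    """Approximate requests/sec a GPU can sustain at a given generation rate."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return tokens_per_sec / tokens_per_request


def tokens_per_sec_for_qps(target_qps: float, tokens_per_request: float) -> float:
    """Generation rate needed to hit a target QPS (the same formula, rearranged)."""
    return target_qps * tokens_per_request


print(qps_from_tokens_per_sec(1200, 300))  # 4.0
print(tokens_per_sec_for_qps(10, 300))     # 3000
```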
Now factor in concurrency:
- Higher concurrency (concurrent requests) increases KV cache usage
- More VRAM is needed per active request
This is why memory and throughput calculations are inseparable.
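One way to see the link is to compare the concurrency your traffic implies with the concurrency your VRAM can hold. The sketch below uses Little's law (average in-flight requests = arrival rate × time in system) for the first part and a simple headroom division for the second. All input figures are hypothetical, and real schedulers with paged or shared KV caches will behave somewhat differently.

```python
import math


def required_concurrency(qps: float, avg_latency_s: float) -> float:
    """Average number of in-flight requests implied by the traffic (Little's law)."""
    return qps * avg_latency_s


def max_concurrent_requests(gpu_vram_gib: float, weights_gib: float,
                            overhead_gib: float, kv_gib_per_request: float) -> int:
    """How many requests fit in the VRAM left over after weights and fixed overhead."""
    free_gib = gpu_vram_gib - weights_gib - overhead_gib
    if free_gib <= 0:
        return 0
    return math.floor(free_gib / kv_gib_per_request)


needed = required_concurrency(qps=4, avg_latency_s=2.5)   # 10.0 requests in flight
supported = max_concurrent_requests(24, 13, 1.5, 1.0)      # 9.5 GiB free -> 9 requests
print(needed, supported, supported >= needed)              # 10.0 9 False
```

In this made-up case the GPU has enough raw throughput for 4 QPS on paper, but not enough memory to keep all ten requests resident at once.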
Single GPU vs Multi-GPU (Tensor / Pipeline Parallelism)
When a model no longer fits into the VRAM of a single GPU, you have to distribute it across multiple GPUs. There are two fundamentally different ways to do this, and they lead to very different performance characteristics.
Single-GPU inference
If the model fits on one GPU, this is always the simplest and fastest option. There is no cross-GPU communication, latency is minimal, and scheduling is straightforward. The only limitation is VRAM capacity.
Tensor Parallelism (TP)
Tensor parallelism splits the weights inside each layer across multiple GPUs.
- Each GPU holds a slice of the same layer
- All GPUs compute that layer at the same time
- Results are combined after each layer
Because computation happens in parallel within a layer, per-token latency is low. This makes TP well-suited for inference workloads where response time matters.
The tradeoff is communication overhead. GPUs must exchange partial results frequently, so fast interconnects such as NVLink are strongly preferred. TP works best when GPUs are on the same node and tightly coupled.
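For sizing purposes, the main effect of TP is that each GPU holds roughly `1 / tp_degree` of the weights. The sketch below assumes a perfectly even split and ignores any tensors a framework may replicate on every GPU, so treat the results as a lower bound. The function name and the 70B example are illustrative.

```python
def per_gpu_weights_gib(num_params: float, bits_per_param: int, tp_degree: int) -> float:
    """Approximate weight memory per GPU when weights are sharded evenly across tp_degree GPUs."""
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / tp_degree / 1024**3


# 70B parameters in FP16, sharded across 1, 2, and 4 GPUs.
for tp in (1, 2, 4):
    print(tp, round(per_gpu_weights_gib(70e9, 16, tp), 1))
# 1 130.4
# 2 65.2
# 4 32.6
```

KV cache is typically sharded along with the attention heads as well, but the exact behavior depends on the serving framework.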
Pipeline Parallelism (PP)
Pipeline parallelism splits the model layers themselves across GPUs.
- Each GPU owns a contiguous block of layers
- A token passes through GPUs sequentially
- Different requests can occupy different pipeline stages
This approach is easier to scale because GPUs only communicate with their neighbors, and it works well even over slower interconnects. However, each token must traverse the entire pipeline, which increases end-to-end latency.
PP is often used when models are simply too large to fit using tensor parallelism alone.
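The "contiguous block of layers" assignment can be sketched in a few lines. This is a simplified illustration that balances only the layer count; real frameworks may weight the split by per-layer cost or place embeddings separately. `split_layers` is a hypothetical helper, not an API from any library.

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous layer blocks to pipeline stages, as evenly as possible."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for gpu, layers in enumerate(split_layers(80, 4)):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
# GPU 0: layers 0-19
# GPU 1: layers 20-39
# GPU 2: layers 40-59
# GPU 3: layers 60-79
```

A token has to visit GPU 0 through GPU 3 in order, which is where the extra end-to-end latency comes from.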
Inference-specific guidance
For inference workloads:
- Tensor parallelism is preferred whenever possible: It minimizes per-token latency and provides better interactive performance.
- Pipeline parallelism is mainly a fallback for very large models: It enables scale, but at the cost of higher latency and more complex scheduling.
Quantization Impact on VRAM and Speed
Precision directly controls memory usage.
| Format | Bits per param | Memory | Accuracy | Speed |
| --- | --- | --- | --- | --- |
| FP32 | 32 | Very high | Very high | Slow |
| FP16 | 16 | Medium | High | Faster |
| INT8 | 8 | Low | Slight drop | Faster |
| INT4 | 4 | Very low | Noticeable drop | Very fast |
General rule:
- Lower precision → lower VRAM → higher throughput
- But extreme quantization can hurt output quality
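To put numbers on the "Memory" column, the sketch below computes weight-only memory for a 7B-parameter model at each precision. It assumes every parameter is stored at the stated bit width; real quantized checkpoints carry some extra metadata (scales, zero points) and often keep a few layers at higher precision, so actual sizes run slightly higher.

```python
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gib(num_params: float, fmt: str) -> float:
    """Weight-only memory in GiB, assuming all parameters use the given format."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1024**3


for fmt in BITS_PER_PARAM:
    print(f"{fmt}: {weight_memory_gib(7e9, fmt):.1f} GiB")
# FP32: 26.1 GiB
# FP16: 13.0 GiB
# INT8: 6.5 GiB
# INT4: 3.3 GiB
```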
Example Calculations (3 Real Scenarios)
Scenario 1: 7B Model, FP16
- Parameters: 7 billion
- Precision: FP16 (16 bits per parameter)
Raw weight memory
7,000,000,000 × 16 bits
= 112,000,000,000 bits
112,000,000,000 ÷ 8
= 14,000,000,000 bytes
= 14 GB ≈ 13.0 GiB
This figure (14 GB, or about 13 GiB) accounts only for model weights.
Additional VRAM usage during inference
- Activations
- KV cache (grows with sequence length and concurrency)
- CUDA / framework buffers
Practical requirement
- ~15–16 GiB VRAM
This is why a “16 GiB GPU” is the realistic minimum for a 7B FP16 model.
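Here is one way the practical figure can be reconstructed. Note that 14,000,000,000 bytes is about 13.04 GiB in binary units. The model shape (32 layers, 32 KV heads, head dimension 128), the single 2,048-token request, and the flat 1 GiB of framework overhead are all assumptions chosen for illustration; your stack may differ.

```python
GIB = 1024**3


def scenario_7b_fp16(seq_len: int = 2048, overhead_gib: float = 1.0):
    """Rough VRAM estimate for one request on a 7B FP16 model (assumed shape, see above)."""
    weights_gib = 7e9 * 2 / GIB                 # FP16 = 2 bytes per parameter
    kv_bytes_per_token = 2 * 32 * 32 * 128 * 2  # K and V, 32 layers, 32 heads x 128 dims, FP16
    kv_gib = kv_bytes_per_token * seq_len / GIB
    return weights_gib, kv_gib, weights_gib + kv_gib + overhead_gib


weights, kv, total = scenario_7b_fp16()
print(f"weights {weights:.2f} GiB + KV cache {kv:.2f} GiB + overhead 1.00 GiB = {total:.2f} GiB")
# weights 13.04 GiB + KV cache 1.00 GiB + overhead 1.00 GiB = 15.04 GiB
```

Under these assumptions a single request already lands at the bottom of the 15–16 GiB range, which leaves very little room for additional concurrency on a 16 GiB card.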
Scenario 2: 7B Model, INT8
- Quantized weights: ~7 GB (≈ 6.5 GiB)
- Lower precision → smaller weight footprint
- Activations and KV cache still consume VRAM
Practical requirement
- ~8–10 GiB VRAM
This configuration fits comfortably on 10–12 GiB GPUs and is common for cost-efficient inference.
Scenario 3: 13B Model, INT4
- Quantized weights: ~6.5 GB (≈ 6.1 GiB)
- More layers → larger KV cache per token
- Runtime memory dominates over raw weights
Practical requirement
- ~10–12 GiB VRAM
Despite aggressive quantization, KV cache growth prevents this from fitting into very small GPUs under realistic workloads.
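A quick calculation shows how runtime memory overtakes the weights in this scenario. The sketch assumes weight-only INT4 quantization with the KV cache still in FP16, a hypothetical 13B-class shape (40 layers, 40 KV heads, head dimension 128), and 2,048 tokens per request.

```python
GIB = 1024**3

WEIGHTS_GIB = 13e9 * 0.5 / GIB              # INT4 = 0.5 bytes per parameter
KV_BYTES_PER_TOKEN = 2 * 40 * 40 * 128 * 2  # K and V, assumed 13B-class shape, FP16 cache


def kv_cache_gib(concurrency: int, seq_len: int = 2048) -> float:
    """KV cache in GiB for `concurrency` requests of `seq_len` tokens each."""
    return KV_BYTES_PER_TOKEN * seq_len * concurrency / GIB


print(f"weights: {WEIGHTS_GIB:.2f} GiB")     # weights: 6.05 GiB
for c in (1, 2, 4):
    print(c, kv_cache_gib(c), kv_cache_gib(c) > WEIGHTS_GIB)
# 1 1.5625 False
# 2 3.125 False
# 4 6.25 True
```

With only four concurrent requests, the assumed KV cache is already larger than the quantized weights.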
How To Pick the Right GPU Tier (Practical Mapping)
Choosing a GPU tier is not about “bigger is better”. It’s about matching VRAM capacity to model size, traffic patterns, and usage expectations. Each tier has a very different role.
≤ 8 GiB VRAM
This tier is highly constrained and suitable only for lightweight workloads.
- Supports small models or aggressively quantized variants
- Limited KV cache → very low concurrency
- Useful for:
  - Experiments
  - Local testing
  - Edge or hobby deployments
This tier is not suitable for real user traffic.
16 GiB VRAM
This is the entry point for serious single-model inference.
- Can run 7B models in FP16
- Allows moderate traffic with controlled concurrency
- Works well for:
  - Internal tools
  - Small-scale APIs
  - Low-to-medium QPS services
Memory is still tight, so context length and concurrency must be carefully limited.
24–48 GiB VRAM
This is the most common production inference tier.
- Supports 13B–30B models (via TP or quantization)
- Enough KV cache for high throughput and concurrency
- Suitable for:
  - Public-facing services
  - Chat applications
  - Stable, predictable latency under load
This tier offers the best balance between cost, quality, and scalability.
80 GiB VRAM
This tier is designed for the largest and most demanding workloads.
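- Required for 70B-class models (70B in FP16 is roughly 130 GiB of weights, so expect quantization or multiple GPUs even at this tier)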
- Can support multi-tenant systems
- Used in:
  - Enterprise deployments
  - Research platforms
  - Heavy RAG using vector databases or long-context workloads
At this level, GPUs are rarely idle and are often shared across multiple services.
How to read this mapping
VRAM primarily determines:
- Which model sizes fit
- How much KV cache you can afford
- How many users you can serve concurrently
Moving up a tier is less about speed and more about capacity and stability under load.
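If you already have a VRAM estimate, mapping it to a tier is a simple lookup. The tier list below mirrors the capacities discussed in this section; the function name is illustrative, and returning `None` is just one possible way to signal that a single GPU is not enough.

```python
TIERS_GIB = [8, 16, 24, 48, 80]  # single-GPU capacities discussed in this section


def smallest_tier(required_gib: float):
    """Return the smallest tier that holds the estimate, or None if multi-GPU is needed."""
    for tier in TIERS_GIB:
        if required_gib <= tier:
            return tier
    return None


print(smallest_tier(15.0))   # 16
print(smallest_tier(30.0))   # 48
print(smallest_tier(140.0))  # None
```

Remember that `required_gib` should already include KV cache and overhead for your target concurrency, not just the weights.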
Common Mistakes in GPU Sizing
Even experienced teams miscalculate GPU requirements by overlooking key memory and workload factors. The most common mistakes include:
- Confusing disk space with VRAM: Model storage size does not determine runtime memory requirements.
- Ignoring KV cache growth with longer context windows: Every additional token increases GPU memory usage during inference.
- Underestimating concurrency impact: Each active request allocates its own KV cache, multiplying VRAM usage.
- Calculating only model weight size: Weights are just the baseline; activations, buffers, and overhead also consume memory.
- Mixing up GiB and GB units: GPU memory is measured in GiB, not decimal GB, which can lead to miscalculations.
- Overlooking CUDA and framework overhead: Kernel workspaces, communication buffers, and runtime allocations always consume VRAM.
In GPU sizing, small miscalculations compound quickly, especially under real production load.
Deployment Checklist and Rule-of-Thumb for GPU Sizing
Before deploying an LLM to production, validate the following:
- Does the model fit in VRAM with runtime overhead included?
- What is the average context length per request?
- How many concurrent users must the system support?
- What latency target is acceptable under load?
- Is quantization acceptable for this workload?
These inputs define memory usage, throughput limits, and stability under real traffic.
Quick Rule-of-Thumb
- Parameter count × bytes per parameter ≈ baseline VRAM for weights
- Add 30–50% additional VRAM for KV cache, activations, and framework overhead
This buffer accounts for real-world inference behavior, concurrency spikes, and CUDA runtime allocations.
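The rule of thumb as a small function, shown for a 13B model in INT8 at both ends of the 30–50% range. This is a first-pass estimate only; the function name and the example model are illustrative, and long contexts or high concurrency can push real usage well past the upper bound.

```python
def rule_of_thumb_vram_gib(num_params: float, bits_per_param: int, margin: float) -> float:
    """Weights plus a fractional margin for KV cache, activations, and framework overhead."""
    weights_gib = num_params * bits_per_param / 8 / 1024**3
    return weights_gib * (1 + margin)


low = rule_of_thumb_vram_gib(13e9, 8, margin=0.30)
high = rule_of_thumb_vram_gib(13e9, 8, margin=0.50)
print(f"13B INT8: {low:.1f} to {high:.1f} GiB")  # 13B INT8: 15.7 to 18.2 GiB
```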
Frequently Asked Questions
1. How much VRAM is required to run a 7B model?
A 7B model in FP16 typically requires around 14 GB (about 13 GiB) for weights alone. In production, you should allocate 15–16 GiB or more to account for KV cache, activations, and framework overhead. Quantized versions (INT8 or INT4) reduce memory requirements significantly.
2. Why is KV cache important for GPU sizing?
KV cache stores attention keys and values for every token in an active request. It grows linearly with sequence length and concurrency. With long contexts or many concurrent requests, KV cache can consume more VRAM than the model weights, making it a critical factor in GPU sizing.
3. Can I run LLM inference on a CPU instead of a GPU?
Yes, but performance is significantly lower. LLM inference relies on large matrix multiplications and parallel compute operations that GPUs handle far more efficiently due to higher memory bandwidth and massive parallelism.
4. Does quantization reduce GPU requirements?
Yes. Lower precision formats such as INT8 or INT4 reduce VRAM usage and increase tokens per second. However, extreme quantization may affect output quality depending on the model and task.
5. What happens if GPU VRAM is insufficient?
If VRAM is insufficient, the model may fail to load or offload to CPU memory. CPU offloading dramatically increases latency and is generally unsuitable for production inference.
6. How do I estimate queries per second (QPS) from tokens per second?
Use the approximation:
QPS ≈ Tokens per second ÷ Average tokens per request
For example, if a GPU generates 1,200 tokens/sec and the average request is 300 tokens, the system can handle roughly 4 requests per second.



