
Running LLMs efficiently is one of the most important engineering challenges in modern AI infrastructure, and it starts with choosing the right inference engine. The wrong choice can mean slow responses, wasted GPU memory, and a poor user experience.
This blog documents what we learned after benchmarking three inference engines on a dual RTX 4090 server: NVIDIA TensorRT-LLM, vLLM, and SGLang. We explain not just the numbers, but why each engine behaves the way it does at the GPU level.
What Are These Engines?
Before comparing numbers, it helps to understand that these tools operate at different layers of the stack.
What is TensorRT-LLM (TRT-LLM)?
TensorRT-LLM is NVIDIA’s official high-performance inference engine for LLMs. It compiles the entire model into a single optimized GPU binary with deep kernel fusion, delivering maximum speed and efficiency on NVIDIA hardware.
What is vLLM?
vLLM is a popular, flexible Python-based inference engine developed at UC Berkeley. It introduced PagedAttention and continuous batching, making it excellent for high-throughput serving and easy experimentation.
What is SGLang?
SGLang is a specialized inference engine focused on structured generation, agents, and multi-turn conversations. Its RadixAttention trie-based KV cache excels at sharing prefixes across requests for faster response times in chat and RAG workloads.
Although these engines solve the same problem, they take very different approaches internally.
Understanding the Inference Stack
These tools operate at different layers of the LLM serving stack, from request orchestration to low-level GPU execution.
NVIDIA Triton Inference Server
Triton is not an inference engine: it is a serving platform. Think of it like NGINX: it handles HTTP/gRPC endpoints, metrics, health checks, dynamic batching, and multi-model routing. The actual computation is done by a backend such as TensorRT-LLM or vLLM. Triton adds management features without touching kernel performance.
• Endpoints: HTTP (port 8000), gRPC (port 8001)
• Supports Ensemble Models to chain multiple models in a pipeline without network roundtrips
• Zero inference optimization: all speed comes from the backend engine
TensorRT-LLM (TRT-LLM)
TRT-LLM is a compiler and runtime for LLMs on NVIDIA GPUs. You feed it a HuggingFace checkpoint; it compiles the model into a GPU binary (.engine file) with fused kernels, quantized weights, and hardware-tuned execution plans. It then runs inside Triton as the tensorrtllm_backend.
• Core scheduler: In-Flight Batching (C++ executor)
• KV cache: Paged KV Cache with optional prefix caching and priority-based eviction
• Quantization: FP8 compute natively on Hopper/Blackwell GPUs
• Build tool: trtllm-build (compile once, reuse repeatedly)
vLLM
vLLM is a Python-native LLM inference engine from UC Berkeley. Its two flagship innovations, PagedAttention and Continuous Batching, directly solve the two biggest bottlenecks in LLM serving: memory fragmentation and head-of-line blocking.
• Core scheduler: VLLMScheduler with Continuous Batching
• KV cache: PagedAttention (BlockSpaceManager + Block Table)
• Supports GPTQ, AWQ, GGUF, FP8, INT8, INT4 quantization formats
• Triton attention backend (OpenAI Triton, not NVIDIA Triton) for AMD/Intel GPU support
SGLang
SGLang (Structured Generation Language) is an inference engine focused on multi-turn chat, JSON agents, and structured generation. Its key innovation is RadixAttention, a trie-based KV cache that aggressively reuses shared prefixes across requests.
• Core scheduler: Token-level with RadixAttention (prefix trie)
• Attention kernel: FlashInfer
• Native LoRA multi-adapter support via LoRAManager
• Built-in constrained decoding and JSON mode via Outlines integration
How Kernel Fusion Speeds Up LLM Inference
The core reason TRT-LLM is faster than vLLM and SGLang on raw compute is kernel fusion. To understand why, we first need to look at how a single transformer attention layer executes on the GPU.
What Happens Inside a Transformer Layer?
A transformer attention layer executes several operations in sequence:
1. LayerNorm (Normalizes the input)
First, the input is normalized. This stabilizes training by keeping the values in a consistent range.
2. QKV Projection (Creates Queries, Keys, and Values)
- Query (Q): Represents "what I am looking for."
- Key (K): Represents "what information I contain."
- Value (V): Represents the actual content to be retrieved.
These are computed by multiplying the normalized input with three different learned weight matrices (W_q, W_k, W_v).
3. Attention Score Computation (Calculates token relevance using Q × Kᵀ)
The model calculates how relevant each part is to every other part:
- Multiply Queries with Keys (Q × K^T).
- Scale the result by dividing by √d (to keep numbers stable).
- Apply softmax to convert scores into probabilities (weights that sum to 1).
This produces attention weights showing "how much focus to give each position."
4. Attention Application (Applies weights to the Values)
- Multiply the attention weights by the Values.
- Result: A weighted combination of the input information, where more relevant parts get higher importance.
5. Output Projection
- The attention output passes through another linear layer (W_o). This mixes and transforms the information.
6. Residual Connection (Adds the original input back)
- The original input is added back to the attention output.
- This helps the model train better by allowing information to flow directly (skip connections).
7. Final LayerNorm (Normalizes the output before the next layer)
- Normalize the result again before passing it to the next layer (or feed-forward network).
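The seven steps above can be sketched in a few lines of NumPy. This is a single-head toy example with made-up dimensions, not any engine's actual kernel:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_layer(x, W_q, W_k, W_v, W_o):
    h = layer_norm(x)                      # 1. LayerNorm
    Q, K, V = h @ W_q, h @ W_k, h @ W_v    # 2. QKV projections
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # 3. scaled Q x K^T ...
    weights = softmax(scores)              #    ... softmax -> attention weights
    out = weights @ V                      # 4. apply weights to the Values
    out = out @ W_o                        # 5. output projection
    out = x + out                          # 6. residual connection
    return layer_norm(out)                 # 7. final LayerNorm

rng = np.random.default_rng(0)
seq, d = 4, 8
x = rng.normal(size=(seq, d))
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
y = attention_layer(x, *Ws)
print(y.shape)  # (4, 8)
```

Each of those lines corresponds to one or more GPU kernels in a real engine, which is exactly where the engines diverge.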
The key difference between TRT-LLM, vLLM, and SGLang is how these operations are executed and how often intermediate results need to move between GPU memory and compute kernels.
What vLLM Does: 8 VRAM Round Trips Per Layer
vLLM runs on PyTorch with eager execution. Each operation is a separate CUDA kernel launch, and every kernel reads from and writes back to VRAM:
[VRAM] → LayerNorm → [VRAM]
[VRAM] → Q_proj → [VRAM] ← reads LayerNorm output again (2nd time)
[VRAM] → K_proj → [VRAM] ← reads LayerNorm output again (3rd time)
[VRAM] → V_proj → [VRAM] ← reads LayerNorm output again (4th time)
[VRAM] → Attention → [VRAM]
[VRAM] → O_proj → [VRAM]
[VRAM] → Add → [VRAM]
[VRAM] → LayerNorm → [VRAM]
This repeated movement between compute kernels and VRAM increases latency and memory bandwidth usage.
What SGLang Does: ~5 VRAM Round Trips Per Layer
SGLang fuses Q, K, V projections into a single matmul and fuses the residual Add with LayerNorm:
[VRAM] → LayerNorm → [VRAM]
[VRAM] → QKV_proj → [VRAM] ← Q, K, V in ONE matmul, reads LayerNorm once
[VRAM] → FlashInfer → [VRAM] ← SKIPPED if RadixAttention cache hit
[VRAM] → O_proj → [VRAM]
[VRAM] → Add+LayerNorm → [VRAM] ← two ops fused
SGLang's real advantage is not fewer VRAM trips; it is RadixAttention skipping entire prefill computations for cached token prefixes. When the system prompt is already cached, those tokens are never recomputed.
What TRT-LLM Does: 1 VRAM Round Trip Per Layer
TRT-LLM compiles all 8 operations into a single fused kernel at build time:
[VRAM] → [ LayerNorm → QKV → Attention → O_proj → Add → LayerNorm ] → [VRAM]
Inside the fused kernel, intermediate results stay in GPU registers (~1-cycle access) and shared memory (~5-cycle access), never touching VRAM (~600-cycle access) between operations. For LFM 1.2B: 1 kernel launch × 16 layers = 16 total kernel launches, versus 128 (8 × 16) in vLLM.
This aggressive kernel fusion is the primary reason TRT-LLM achieves lower latency on NVIDIA GPUs.
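What a fusing compiler must guarantee is that the merged kernel computes exactly the same math as the separate ops. A CPU-side NumPy sketch of the residual Add + LayerNorm fusion mentioned above (illustrative only, not TRT-LLM's actual kernel):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def unfused(residual, x):
    # Two "kernels": each result makes a round trip through memory
    added = residual + x          # kernel 1: Add
    return layer_norm(added)      # kernel 2: LayerNorm

def fused_add_layernorm(residual, x, eps=1e-5):
    # One "kernel": in a real GPU fusion the sum never leaves
    # registers; here we simply inline both steps
    s = residual + x
    mu = s.mean(-1, keepdims=True)
    return (s - mu) / np.sqrt(s.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(1)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
assert np.allclose(unfused(a, b), fused_add_layernorm(a, b))
```

The fused version does the same arithmetic with one memory round trip instead of two; TRT-LLM's build step searches for such fusion opportunities across the whole graph.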
KV Cache Strategies in vLLM, SGLang, and TRT-LLM
All three engines solve the same underlying problem, GPU memory fragmentation, but each takes a different approach.
The Problem: Memory Fragmentation
Early inference systems pre-allocated a contiguous memory block equal to the maximum sequence length for every request. If a user asked for a 2,000-token response, the system reserved memory for 32,768 tokens (the maximum), wasting up to 80% of GPU memory. This severely limited concurrency.
vLLM: PagedAttention
PagedAttention divides the KV cache into fixed-size blocks and maintains a block table mapping logical token positions to physical VRAM pages. Memory is allocated on demand as tokens are generated.
• Internal fragmentation reduced from ~80% to under 4%
• Shared prefixes can share blocks (up to 90% memory savings in repetitive workloads)
• Block table lookup adds ~10-20% compute overhead per attention operation
• Uses BlockSpaceManager to track free/used blocks per request
PagedAttention dramatically improves memory efficiency, which is one of the main reasons vLLM scales well under high concurrency.
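The block-table idea can be sketched in a few lines of Python. The class name and block count here are hypothetical; vLLM's real BlockSpaceManager is far more involved, though its default block size is indeed 16 tokens:

```python
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockTable:
    """Maps a request's logical token positions to physical VRAM pages."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free-list of pages
        self.tables = {}  # request_id -> list of physical block ids

    def append_token(self, request_id, token_index):
        table = self.tables.setdefault(request_id, [])
        # Allocate a new physical block only when a new logical block starts
        if token_index // BLOCK_SIZE >= len(table):
            table.append(self.free.pop())
        return table[token_index // BLOCK_SIZE]

    def free_request(self, request_id):
        # Return all of the request's blocks to the free-list
        self.free.extend(self.tables.pop(request_id, []))

mgr = BlockTable(num_physical_blocks=8)
for t in range(40):                 # generate 40 tokens for one request
    mgr.append_token("req-1", t)
print(len(mgr.tables["req-1"]))     # 3 blocks: ceil(40 / 16)
```

Because blocks are allocated on demand, a 2,000-token response consumes only the blocks it actually fills instead of a max-length reservation.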
SGLang: RadixAttention
RadixAttention uses a radix tree (prefix trie) instead of per-request block allocation. KV cache blocks for common prefixes are shared globally across all requests.
• System prompt cached once, all concurrent requests share those exact blocks
• From the second request onward, no recomputation of cached prefix tokens
• LRU eviction policy, most recently used prefixes stay in cache
• Massive TTFT reduction when requests share a long system prompt
RadixAttention is especially effective for workloads with repeated system prompts and shared conversational context.
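A minimal sketch of the prefix-sharing idea, using a per-token trie for clarity (SGLang's actual radix tree compresses runs of tokens and tracks real KV blocks):

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.has_kv = False  # True if KV cache for this token is resident

class PrefixCache:
    """Toy prefix trie: reports how many leading tokens are already cached."""
    def __init__(self):
        self.root = RadixNode()

    def match(self, tokens):
        # Number of leading tokens whose KV entries can be reused
        node, hits = self.root, 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None or not node.has_kv:
                break
            hits += 1
        return hits

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, RadixNode())
            node.has_kv = True

cache = PrefixCache()
system_prompt = [101, 7, 7, 42]          # hypothetical token ids
cache.insert(system_prompt + [1, 2])     # first request fills the cache
print(cache.match(system_prompt + [9]))  # 4: only the suffix needs prefill
```

A second request that shares the system prompt skips prefill for those four tokens entirely, which is where SGLang's 38 ms TTFT outliers in the benchmark below come from.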
TRT-LLM: Paged KV Cache + KV Cache Event API
TRT-LLM implements Paged KV Cache similar to vLLM, but adds enterprise-grade controls:
• KV Cache Event API: emits events when cache blocks are stored or evicted, enabling KV-aware load balancing
• Priority-Based Eviction: assign priorities to token ranges (e.g., pin the system prompt in memory)
• KV Cache Reuse: explicit prefix caching with fine-grained control
• FP8 compute: attention operations run natively in FP8 on Hopper/Blackwell GPUs
TRT-LLM focuses less on aggressive sharing and more on predictable enterprise-grade cache management and scheduling control.
Quantization: Who Does It and How?
Another major difference between these engines is how they handle low-precision inference and quantized computation.
| Engine | Quantizes itself? | When? | Computes in quantized precision? |
| --- | --- | --- | --- |
| vLLM | No | Never | No |
| SGLang | No | Never | No |
| TRT-LLM | Yes | At build time (trtllm-build) | Yes: INT8/FP8 matmuls natively on tensor cores |
| llama.cpp | No (user does it) | Before loading (produces .gguf) | Yes: INT4/INT8 native compute |
vLLM with Quantized Models vs TRT-LLM
vLLM + AWQ INT4 (Activation-aware Weight Quantization): loads ~0.6 GB of weights, but dequantizes them to FP16 before every matmul. VRAM is saved; compute is still FP16.
TRT-LLM INT8: runs INT8 matmuls directly on tensor cores with no dequantization at inference time. Both VRAM and compute are reduced.
This is one of the biggest architectural differences between TRT-LLM and Python-native inference engines like vLLM and SGLang.
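The two paths can be sketched in NumPy. This uses a toy symmetric per-tensor quantization scheme; real AWQ and TRT-LLM INT8 use per-group or per-channel scales:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 8)).astype(np.float16)   # FP16 weights
x = rng.normal(size=(4, 8)).astype(np.float16)   # FP16 activations

# Quantize weights to INT8 with one symmetric scale
w_scale = float(np.abs(W).max()) / 127.0
W_int8 = np.round(W / w_scale).astype(np.int8)

# Weight-only path (vLLM + AWQ style): store INT8 weights,
# but dequantize back to FP16 before the matmul -> FP16 compute
y_weight_only = x @ (W_int8.astype(np.float16) * np.float16(w_scale))

# Native path (TRT-LLM style): quantize activations too and run
# the matmul in integer arithmetic, rescaling the INT32 accumulator
x_scale = float(np.abs(x).max()) / 127.0
x_int8 = np.round(x / x_scale).astype(np.int8)
acc = x_int8.astype(np.int32) @ W_int8.astype(np.int32)
y_native = acc.astype(np.float32) * (w_scale * x_scale)

# Both paths approximate the same product; only the second one
# keeps the compute itself in integer precision
err = float(np.abs(y_weight_only.astype(np.float32) - y_native).max())
print(err)
```

The small `err` shows the two paths agree numerically; the difference is where the arithmetic happens, which is what determines whether tensor-core INT8 throughput is actually used.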
Why the TRT-LLM Build Takes So Long
Unlike vLLM or SGLang, TRT-LLM does not simply load model weights into memory. Running trtllm-build compiles the model into a GPU binary. This is fundamentally different from loading weights. The build pipeline runs these steps in sequence:
| Build Step | What it does | Time (1.5B FP16) |
| --- | --- | --- |
| Graph Tracing | Traces every op in the model graph | ~1 min |
| Op Fusion | Searches for optimal fusion patterns across all layers | ~2 min |
| Kernel Selection | Benchmarks dozens of CUDA kernel implementations per op | ~10-15 min |
| CUDA Compilation | Compiles fused kernels from PTX to SASS for your exact GPU | ~3-5 min |
| Quantization Calibration | Runs sample data through the model to compute scale factors (INT8/FP8 only) | +10-20 min (if done) |
| Engine Serialization | Writes rank0.engine to disk | ~1 min |
The Kernel Selection step is the biggest contributor to build time. For every operation in the model, TRT-LLM actually runs multiple CUDA kernel implementations on your specific GPU and times them. This is not an estimate; it benchmarks dozens of candidates per op across hundreds of layers.
The engine is GPU-specific. A rank0.engine built for RTX 4090 (sm_89) will not run on H100 (sm_90). Every hardware migration requires a rebuild. Every model update requires a rebuild. This is one of the main reasons TRT-LLM is better suited for stable production deployments than rapid development workflows.
The long build time is essentially the cost of extracting maximum GPU performance during inference.
Benchmark Results: Qwen2.5-1.5B on RTX 4090
To compare the three engines fairly, we benchmarked them under identical conditions: same hardware (RTX 4090), same model (Qwen2.5-1.5B-Instruct), same prompts (50 requests, max_tokens=50), and same generation settings.
| Metric | TRT-LLM 0.16 | SGLang 0.5.10 | vLLM 0.18 |
| --- | --- | --- | --- |
| TTFT mean | 224.0 ms | 179.2 ms | 531.1 ms |
| TTFT median | 223.9 ms | 207.2 ms | 531.8 ms |
| TTFT p90 | 224.5 ms | 207.7 ms | 550.8 ms |
| TTFT min | 223.5 ms | 38.6 ms | 506.3 ms |
| TTFT max | 227.7 ms | 480.6 ms | 555.1 ms |
| Engine load | 2,942 ms | ~30 sec (server start) | 10,558 ms |
| Build time | 17.5 sec | N/A | N/A |
Reading the Numbers
TRT-LLM: extremely consistent, with a 224 ms mean and only ~4 ms of spread between min and max. This consistency is the signature of compiled kernels: no Python scheduler overhead, no JIT variance, pure deterministic execution.
SGLang: lowest mean (179 ms) but high variance (min 38 ms, max 480 ms). The 38 ms requests are RadixAttention cache hits, where the system prompt was already cached by a previous request. The 480 ms requests are cold cache misses that pay the full prefill cost.
vLLM: highest latency (531 ms mean) because enforce_eager=True disabled CUDA graph capture, a workaround required for a compilation bug in vLLM 0.18 on this server. vLLM with full optimization typically runs at 200-300 ms.
The results highlight how each engine optimizes for a different tradeoff: TRT-LLM prioritizes deterministic performance, SGLang optimizes cache reuse, and vLLM focuses on flexible high-throughput serving.
Serving LLMs with TRT-LLM, vLLM, and SGLang
Below are the minimal commands required to serve each engine locally.
TRT-LLM Serve (requires pre-built engine)
TRT-LLM requires a separate compilation step before serving.
# Step 1: Convert checkpoint
python3 convert_checkpoint.py \
    --model_dir /path/to/hf-model \
    --output_dir /path/to/checkpoint \
    --dtype float16 \
    --tp_size 1

# Step 2: Build engine (15-25 min)
trtllm-build \
    --checkpoint_dir /path/to/checkpoint \
    --output_dir /path/to/engine \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --kv_cache_type paged

# Step 3: Serve via Triton
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository /triton_model_repo
vLLM Serve
vLLM can serve models directly from Hugging Face checkpoints.
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --dtype float16 \
    --gpu-memory-utilization 0.6 \
    --port 8001

SGLang Serve
SGLang provides a lightweight server focused on structured generation workloads.
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
    --model-path /path/to/model \
    --dtype float16 \
    --mem-fraction-static 0.4 \
    --port 8002 --host 0.0.0.0

The setup complexity reflects the tradeoff each engine makes between flexibility, startup speed, and maximum inference performance.
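Once the vLLM server above is running, it exposes an OpenAI-compatible API. A minimal stdlib client might look like this; the port and model path are the hypothetical values from the serve command, so adjust them to your deployment:

```python
import json
import urllib.request

def build_payload(prompt, max_tokens=50):
    # Field names follow the OpenAI completions API that vLLM mirrors
    return {
        "model": "/path/to/model",  # must match --model at serve time
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

def complete(prompt, base_url="http://localhost:8001"):
    """POST to the /v1/completions endpoint and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Requires a running server started with the command above:
# print(complete("Explain KV caching in one sentence."))
```

SGLang also exposes an OpenAI-compatible route, while Triton-hosted TRT-LLM uses its own generate endpoint, so the client layer is one more axis on which the engines differ.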
How To Choose The Right Inference Engine
Each engine optimizes for different workloads, deployment environments, and operational priorities.
| Scenario | Recommended Engine | Reason |
| --- | --- | --- |
| Rapid experimentation and active fine-tuning | SGLang | 2-min load vs 15-25 min rebuild |
| Shared system prompt (voice agents, RAG) | SGLang | RadixAttention skips cached prefix computation |
| Multi-turn chat with many users | SGLang | Global prefix tree benefits all concurrent sessions |
| Stable production models, maximum throughput, NVIDIA-only infrastructure | TRT-LLM | Kernel fusion + FP8 on H100/B200 |
| High concurrency RAG / chatbots | vLLM | PagedAttention + continuous batching, easy setup |
| Multi-model pipelines (vision + LLM) | Triton + TRT-LLM | Ensemble models chain without network latency |
| AMD or Intel GPUs | vLLM | Triton attention backend portable to non-NVIDIA hardware |
| Hybrid/novel architectures (Mamba, LFM) | vLLM or SGLang | TRT-LLM needs architecture-specific CUDA kernels |
In practice, the best engine depends less on raw benchmark numbers and more on your workload, hardware, and deployment constraints.
Final Thoughts
TRT-LLM, vLLM, and SGLang are built for very different goals. TRT-LLM focuses on maximum GPU performance through compiled execution and kernel fusion, vLLM prioritizes flexible high-throughput serving, and SGLang excels at structured generation and prefix-heavy workloads.
There is no single best inference engine for every use case. The right choice depends on your hardware, latency requirements, workload patterns, and deployment priorities.
Frequently Asked Questions
What is the difference between TRT-LLM, vLLM, and SGLang?
TRT-LLM is a compiled NVIDIA inference runtime optimized for maximum GPU performance. vLLM focuses on flexible high-throughput serving, while SGLang is optimized for structured generation and prefix-heavy workloads.
Which inference engine is fastest for NVIDIA GPUs?
TRT-LLM is generally the fastest on NVIDIA hardware because of kernel fusion, compiled execution, and native FP8/INT8 tensor core support.
Why is TRT-LLM build time so long?
TRT-LLM compiles the model into a GPU-specific engine. During the build process, it benchmarks multiple CUDA kernels, performs fusion optimization, and generates hardware-tuned execution plans.
What is PagedAttention in vLLM?
PagedAttention is vLLM’s KV cache management system that reduces GPU memory fragmentation by allocating KV cache memory in fixed-size blocks instead of large contiguous buffers.
What makes SGLang different from vLLM?
SGLang uses RadixAttention, a trie-based KV cache that aggressively reuses shared prefixes across requests, making it highly effective for chat, RAG, and agent workloads.
Which engine is best for production deployments?
It depends on the workload. TRT-LLM is ideal for maximum NVIDIA GPU performance, vLLM is strong for flexible large-scale serving, and SGLang performs well for structured generation and shared-prompt systems.
Does vLLM support quantized models?
Yes. vLLM supports formats like GPTQ, AWQ, GGUF, FP8, INT8, and INT4, though quantized weights are often dequantized during computation.



