
TRT-LLM vs vLLM vs SGLang: What to Choose in 2026

Written by Devesh Mhatre
May 14, 2026
11 Min Read

Running LLMs efficiently is one of the most important engineering challenges today, and it starts with choosing the right inference engine. The wrong choice can mean slow responses, wasted GPU memory, and a poor user experience.

This blog documents what we learned after benchmarking three inference engines on a dual RTX 4090 server: NVIDIA TensorRT-LLM, vLLM, and SGLang. We explain not just the numbers, but why each engine behaves the way it does at the GPU level.

What Are These Engines?

Before comparing numbers, it helps to understand that these tools operate at different layers of the stack.

What is TensorRT-LLM (TRT-LLM)?

TensorRT-LLM is NVIDIA’s official high-performance inference engine for LLMs. It compiles the entire model into a single optimized GPU binary with deep kernel fusion, delivering maximum speed and efficiency on NVIDIA hardware.

What is vLLM?

vLLM is a popular, flexible Python-based inference engine developed by UC Berkeley. It introduced PagedAttention and continuous batching, making it excellent for high-throughput serving and easy experimentation.

What is SGLang?

SGLang is a specialized inference engine focused on structured generation, agents, and multi-turn conversations. Its RadixAttention trie-based KV cache excels at sharing prefixes across requests for faster response times in chat and RAG workloads.

Although these engines solve the same problem, they take very different approaches internally.

Understanding the Inference Stack

These tools operate at different layers of the LLM serving stack, from request orchestration to low-level GPU execution.

NVIDIA Triton Inference Server

Triton is not an inference engine: it is a serving platform. Think of it like NGINX: it handles HTTP/gRPC endpoints, metrics, health checks, dynamic batching, and multi-model routing. The actual computation is done by a backend such as TensorRT-LLM or vLLM. Triton adds management features without touching kernel performance.

• Endpoints: HTTP (port 8000), gRPC (port 8001)

• Supports Ensemble Models to chain multiple models in a pipeline without network roundtrips

• Zero inference optimization: all speed comes from the backend engine

TensorRT-LLM (TRT-LLM)

TRT-LLM is a compiler and runtime for LLMs on NVIDIA GPUs. You feed it a HuggingFace checkpoint; it compiles the model into a GPU binary (.engine file) with fused kernels, quantized weights, and hardware-tuned execution plans. It then runs inside Triton as the tensorrtllm_backend.

• Core scheduler: In-Flight Batching (C++ executor)

• KV cache: Paged KV Cache with optional prefix caching and priority-based eviction

• Quantization: FP8 compute natively on Hopper/Blackwell GPUs

• Build tool: trtllm-build (compile once, reuse repeatedly)

vLLM

vLLM is a Python-native LLM inference engine from UC Berkeley. Its two flagship innovations, PagedAttention and Continuous Batching, directly solve the two biggest bottlenecks in LLM serving: memory fragmentation and head-of-line blocking.

• Core scheduler: VLLMScheduler with Continuous Batching

• KV cache: PagedAttention (BlockSpaceManager + Block Table)

• Supports GPTQ, AWQ, GGUF, FP8, INT8, INT4 quantization formats

• In 2026: Triton attention backend (OpenAI Triton, not NVIDIA Triton) for AMD/Intel GPU support

SGLang

SGLang (Structured Generation Language) is an inference engine focused on multi-turn chat, JSON agents, and structured generation. Its key innovation is RadixAttention, a trie-based KV cache that aggressively reuses shared prefixes across requests.

• Core scheduler: Token-level with RadixAttention (prefix trie)

• Attention kernel: FlashInfer

• Native LoRA multi-adapter support via LoRAManager

• Built-in constrained decoding and JSON mode via Outlines integration

How Kernel Fusion Speeds Up LLM Inference

The core reason TRT-LLM is faster than vLLM and SGLang on raw compute is kernel fusion. To understand why, we first need to look at how a single transformer attention layer executes on the GPU.

What Happens Inside a Transformer Layer?

A transformer attention layer executes several operations in sequence:

1. LayerNorm (Normalizes the input)

First, the input is normalized. This stabilizes training by keeping the values in a consistent range.

2. QKV Projection (Creates Queries, Keys, and Values)

  • Query (Q): Represents "what I am looking for."
  • Key (K): Represents "what information I contain."
  • Value (V): Represents the actual content to be retrieved.

These are computed by multiplying the normalized input with three different learned weight matrices (W_q, W_k, W_v).

3. Attention Score Computation (Calculates token relevance using Q × Kᵀ)

The model calculates how relevant each part is to every other part:

  • Multiply Queries with Keys (Q × K^T).
  • Scale the result by dividing by √d (to keep numbers stable).
  • Apply softmax to convert scores into probabilities (weights that sum to 1).

This produces attention weights showing "how much focus to give each position."

4. Attention Application (Applies the weights to the Values)

  • Multiply the attention weights by the Values.
  • Result: A weighted combination of the input information, where more relevant parts get higher importance.

5. Output Projection

  • The attention output passes through another linear layer (W_o). This mixes and transforms the information.

6. Residual Connection (Adds the original input back)

  • The original input is added back to the attention output.
  • This helps the model train better by allowing information to flow directly (skip connections).

7. Final LayerNorm (Normalizes the output before the next layer)

  • Normalize the result again before passing it to the next layer (or feed-forward network).
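
To make this sequence concrete, here is a minimal single-head PyTorch sketch of one attention sub-layer in eager mode. The shapes and weight names are illustrative and not taken from any of the three engines; multi-head splitting is omitted for brevity.

import torch
import torch.nn.functional as F

def attention_sublayer(x, W_q, W_k, W_v, W_o):
    # 1. LayerNorm: normalize the input
    h = F.layer_norm(x, x.shape[-1:])
    # 2. QKV projection: three separate matmuls in eager mode
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    # 3. Attention scores: Q x K^T, scaled by sqrt(d), then softmax
    d = q.shape[-1]
    weights = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    # 4. Apply the attention weights to the values
    attn = weights @ v
    # 5. Output projection
    out = attn @ W_o
    # 6. Residual connection: add the original input back
    out = out + x
    # 7. Final LayerNorm before the next layer / feed-forward block
    return F.layer_norm(out, out.shape[-1:])

hidden = 64
x = torch.randn(2, 8, hidden)                        # (batch, seq, hidden)
W_q, W_k, W_v, W_o = (torch.randn(hidden, hidden) for _ in range(4))
print(attention_sublayer(x, W_q, W_k, W_v, W_o).shape)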

The key difference between TRT-LLM, vLLM, and SGLang is how these operations are executed and how often intermediate results need to move between GPU memory and compute kernels.


What vLLM Does: 8 VRAM Round Trips Per Layer

vLLM runs on PyTorch with eager execution. Each operation is a separate CUDA kernel launch, and every kernel reads from and writes back to VRAM:

[VRAM] → LayerNorm → [VRAM]

[VRAM] → Q_proj   → [VRAM]   ← reads LayerNorm output again (2nd time)

[VRAM] → K_proj   → [VRAM]   ← reads LayerNorm output again (3rd time)

[VRAM] → V_proj   → [VRAM]   ← reads LayerNorm output again (4th time)

[VRAM] → Attention → [VRAM]

[VRAM] → O_proj   → [VRAM]

[VRAM] → Add      → [VRAM]

[VRAM] → LayerNorm → [VRAM]

This repeated movement between compute kernels and VRAM increases latency and memory bandwidth usage.

What SGLang Does: ~5 VRAM Round Trips Per Layer

SGLang fuses Q, K, V projections into a single matmul and fuses the residual Add with LayerNorm:

[VRAM] → LayerNorm → [VRAM]

[VRAM] → QKV_proj  → [VRAM]   ← Q, K, V in ONE matmul, reads LayerNorm once

[VRAM] → FlashInfer → [VRAM]  ← SKIPPED if RadixAttention cache hit

[VRAM] → O_proj    → [VRAM]

[VRAM] → Add+LayerNorm → [VRAM]  ← two ops fused

SGLang's real advantage is not fewer VRAM trips; it is RadixAttention skipping entire prefill computations for cached token prefixes. When the system prompt is already cached, those tokens are never recomputed.
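
Here is a minimal sketch of the two fusions described above, with made-up shapes: the Q, K, and V weights are concatenated so a single matmul reads the LayerNorm output once, and the residual add is folded into the same pass as the LayerNorm. It illustrates the idea only; SGLang's actual kernels do this inside CUDA.

import torch
import torch.nn.functional as F

hidden = 2048
W_q = torch.randn(hidden, hidden)
W_k = torch.randn(hidden, hidden)
W_v = torch.randn(hidden, hidden)

# Fuse Q, K, V into a single weight so one matmul reads the LayerNorm output once
W_qkv = torch.cat([W_q, W_k, W_v], dim=1)        # (hidden, 3 * hidden)

x = torch.randn(4, 16, hidden)                   # (batch, seq, hidden)
h = F.layer_norm(x, (hidden,))
q, k, v = (h @ W_qkv).chunk(3, dim=-1)           # one projection round trip instead of three

# Fused residual-add + LayerNorm: two logical ops, one pass over the data
def add_layernorm(residual, out):
    return F.layer_norm(residual + out, (hidden,))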

What TRT-LLM Does: 1 VRAM Round Trip Per Layer

TRT-LLM compiles all 8 operations into a single fused kernel at build time:

[VRAM] → [ LayerNorm → QKV → Attention → O_proj → Add → LayerNorm ] → [VRAM]

Inside the fused kernel, intermediate results stay in GPU registers (1-cycle access) and shared memory (5-cycle access), never touching VRAM (600-cycle access) between operations. For a 16-layer model such as LFM 1.2B, that means 1 kernel launch × 16 layers = 16 total kernel launches, versus 8 × 16 = 128 in vLLM.

This aggressive kernel fusion is the primary reason TRT-LLM achieves lower latency on NVIDIA GPUs.

KV Cache Strategies in vLLM, SGLang, and TRT-LLM 

All three engines solve the same underlying problem, GPU memory fragmentation, but they take different approaches.

The Problem: Memory Fragmentation

Early inference systems pre-allocated a contiguous memory block equal to the maximum sequence length for every request. If a user asked for a 2,000-token response, the system reserved memory for 32,768 tokens (the maximum), wasting up to 80% of GPU memory. This severely limited concurrency.
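
A rough back-of-the-envelope illustration of that waste, using assumed (not measured) numbers for a small GQA model, shows why pre-allocation limits concurrency: for a single 2,000-token request, most of the reserved block is never used.

# Hypothetical illustration of pre-allocation waste; numbers are assumptions, not measurements.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, per KV head, per head dimension, in FP16
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(28, 2, 128)       # ~28 KB per token for this assumed config

max_len, generated = 32_768, 2_000
reserved_mb = max_len * per_token / 1e6          # ~940 MB reserved up front
used_mb = generated * per_token / 1e6            # ~57 MB actually needed
print(f"reserved {reserved_mb:.0f} MB, used {used_mb:.0f} MB, "
      f"unused {100 * (1 - generated / max_len):.0f}% of the reservation")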

vLLM: PagedAttention

PagedAttention divides the KV cache into fixed-size blocks and maintains a block table mapping logical token positions to physical VRAM pages. Memory is allocated on demand as tokens are generated.

• Internal fragmentation reduced from ~80% to under 4%

• Shared prefixes can share blocks (up to 90% memory savings in repetitive workloads)

• Block table lookup adds ~10-20% compute overhead per attention operation

• Uses BlockSpaceManager to track free/used blocks per request

PagedAttention dramatically improves memory efficiency, which is one of the main reasons vLLM scales well under high concurrency.
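
The toy sketch below illustrates the block-table idea; it is not vLLM's actual BlockSpaceManager. Logical token positions map to physical blocks handed out on demand from a shared free pool, so memory grows with what is generated rather than with the maximum sequence length.

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative)

class ToyBlockTable:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # shared pool of physical blocks
        self.block_table = {}   # request_id -> list of physical block ids

    def append_token(self, request_id, token_index):
        blocks = self.block_table.setdefault(request_id, [])
        # Allocate a new physical block only when the current one fills up
        if token_index % BLOCK_SIZE == 0:
            blocks.append(self.free_blocks.pop())
        return blocks[token_index // BLOCK_SIZE]   # physical block holding this token's KV

    def free(self, request_id):
        # Return the request's blocks to the shared pool when it finishes
        self.free_blocks.extend(self.block_table.pop(request_id, []))

mgr = ToyBlockTable(num_physical_blocks=1024)
for i in range(40):                   # a 40-token request touches only 3 blocks
    mgr.append_token("req-1", i)
print(mgr.block_table["req-1"])       # e.g. [1023, 1022, 1021]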

SGLang: RadixAttention

RadixAttention uses a radix tree (prefix trie) instead of per-request block allocation. KV cache blocks for common prefixes are shared globally across all requests.

• System prompt cached once, all concurrent requests share those exact blocks

• From the second request onward, no recomputation of cached prefix tokens

• LRU eviction policy, most recently used prefixes stay in cache

• Massive TTFT reduction when requests share a long system prompt

RadixAttention is especially effective for workloads with repeated system prompts and shared conversational context.
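
The sketch below shows the prefix-trie idea behind RadixAttention in simplified form (it is not SGLang's implementation): a request whose prompt starts with an already-cached prefix only needs prefill for the new suffix.

class TrieNode:
    def __init__(self):
        self.children = {}     # token id -> TrieNode
        self.has_kv = False    # a KV cache entry exists for this prefix position

class ToyPrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, hit = self.root, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None or not node.has_kv:
                break
            hit += 1
        return hit

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
            node.has_kv = True

cache = ToyPrefixCache()
system_prompt = [101, 7, 42, 9, 250]          # token ids of a shared system prompt
cache.insert(system_prompt + [11, 12])        # first request fills the cache
# Second request: only the 2 new tokens need prefill; the 5-token prefix is a cache hit
print(cache.match_prefix(system_prompt + [30, 31]))   # -> 5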

TRT-LLM: Paged KV Cache + KV Cache Event API

TRT-LLM implements Paged KV Cache similar to vLLM, but adds enterprise-grade controls:

• KV Cache Event API: emits events when cache blocks are stored or evicted, enabling KV-aware load balancing

• Priority-Based Eviction: assign priorities to token ranges (e.g., pin the system prompt in memory)

• KV Cache Reuse: explicit prefix caching with fine-grained control

• FP8 compute: attention operations run natively in FP8 on Hopper/Blackwell GPUs

TRT-LLM focuses less on aggressive sharing and more on predictable enterprise-grade cache management and scheduling control.

Quantization: Who Does It and How?

Another major difference between these engines is how they handle low-precision inference and quantized computation.

| Engine | Quantizes itself? | When? | Computes in quantized precision? |
|---|---|---|---|
| vLLM | No | Never | No |
| SGLang | No | Never | No |
| TRT-LLM | Yes | At build time (trtllm-build) | Yes: INT8/FP8 matmuls natively on tensor cores |
| llama.cpp | No (user does it) | Before loading (produces .gguf) | Yes: INT4/INT8 native compute |


vLLM with Quantized Models vs TRT-LLM

vLLM + AWQ INT4 (Activation-aware Weight Quantization): loads 0.6 GB of weights, but dequantizes to FP16 before every matmul. This saves VRAM, while compute is still FP16.

TRT-LLM INT8: Runs INT8 matmuls directly on tensor cores, no dequantization at inference time. Both VRAM and compute are reduced.

This is one of the biggest architectural differences between TRT-LLM and Python-native inference engines like vLLM and SGLang.
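
The difference can be illustrated with a small CPU-only sketch using made-up shapes: the weight-only path stores INT8 weights but dequantizes before the matmul, while the native path quantizes the activations too and multiplies integers directly. Real engines do this on tensor cores, not with PyTorch integer matmuls.

import torch

x = torch.randn(1, 2048)                               # activations in floating point
w = torch.randn(2048, 2048)
scale = w.abs().max() / 127
w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)

# Weight-only quantization (vLLM/SGLang style): small weights in VRAM,
# but dequantized back to floating point before every matmul
y_weight_only = x @ (w_int8.float() * scale)

# Native quantized compute (TRT-LLM style, conceptually): quantize the
# activations too and run an integer matmul, rescaling afterwards
x_scale = x.abs().max() / 127
x_int8 = (x / x_scale).round().clamp(-127, 127).to(torch.int8)
y_native = (x_int8.long() @ w_int8.long()).float() * (x_scale * scale)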

Why the TRT-LLM Build Takes So Long

Unlike vLLM or SGLang, TRT-LLM does not simply load model weights into memory: running trtllm-build compiles the model into a GPU binary, which is fundamentally different from loading weights. The build pipeline runs these steps in sequence:

| Build Step | What it does | Time (1.5B FP16) |
|---|---|---|
| Graph Tracing | Traces every op in the model graph | ~1 min |
| Op Fusion | Searches for optimal fusion patterns across all layers | ~2 min |
| Kernel Selection | Benchmarks dozens of CUDA kernel implementations per op | ~10-15 min |
| CUDA Compilation | Compiles fused kernels from PTX to SASS for your exact GPU | ~3-5 min |
| Quantization Calibration | Runs sample data through the model to compute scale factors (INT8/FP8 only) | +10-20 min (if done) |
| Engine Serialization | Writes rank0.engine to disk | ~1 min |


The Kernel Selection step is the biggest contributor to build time. For every operation in the model, TRT-LLM actually runs multiple CUDA kernel implementations on your specific GPU and times them. This is not an estimate; it benchmarks dozens of candidates per op across hundreds of layers. 
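Conceptually, kernel selection is an autotuning loop: run every candidate implementation of an op on the target GPU, time it, and keep the fastest. The sketch below shows the idea with generic Python timing; TensorRT's actual tactic selection is far more elaborate and handles CUDA stream synchronization.

import time
import torch

def pick_fastest(candidates, *args, warmup=3, iters=20):
    """Time each candidate implementation and return the fastest one."""
    best_name, best_ms = None, float("inf")
    for name, fn in candidates.items():
        for _ in range(warmup):          # warm up caches / lazy initialization
            fn(*args)
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        ms = (time.perf_counter() - start) * 1000 / iters
        if ms < best_ms:
            best_name, best_ms = name, ms
    return best_name, best_ms

a, b = torch.randn(512, 512), torch.randn(512, 512)
candidates = {
    "matmul": lambda x, y: x @ y,
    "einsum": lambda x, y: torch.einsum("ik,kj->ij", x, y),
}
print(pick_fastest(candidates, a, b))   # TRT-LLM does this per op, per GPU, at build time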


The engine is GPU-specific. A rank0.engine built for RTX 4090 (sm_89) will not run on H100 (sm_90). Every hardware migration requires a rebuild. Every model update requires a rebuild. This is one of the main reasons TRT-LLM is better suited for stable production deployments than rapid development workflows. 

The long build time is essentially the cost of extracting maximum GPU performance during inference.

Benchmark Results: Qwen2.5-1.5B on RTX 4090

We benchmarked all three engines on the same hardware (RTX 4090), the same model (Qwen2.5-1.5B-Instruct), and the same prompts (50 requests, max_tokens=50), with identical generation settings across all runs.
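
TTFT was measured from the client side as the time from sending a streaming request to receiving the first token. The snippet below is a simplified version of that idea against the OpenAI-compatible endpoints that vLLM and SGLang expose; the URL and model name are placeholders, and Triton/TRT-LLM uses its own generate endpoint instead.

import time
import requests

def measure_ttft(base_url, model, prompt, max_tokens=50):
    start = time.perf_counter()
    with requests.post(
        f"{base_url}/v1/completions",
        json={"model": model, "prompt": prompt,
              "max_tokens": max_tokens, "stream": True},
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if line:                                   # first streamed chunk = first token
                return (time.perf_counter() - start) * 1000
    return None

# Placeholder URL/model; repeat over 50 prompts and aggregate mean/median/p90
print(measure_ttft("http://localhost:8001", "qwen2.5-1.5b-instruct", "Hello"))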

| Metric | TRT-LLM 0.16 | SGLang 0.5.10 | vLLM 0.18 |
|---|---|---|---|
| TTFT mean | 224.0 ms | 179.2 ms | 531.1 ms |
| TTFT median | 223.9 ms | 207.2 ms | 531.8 ms |
| TTFT p90 | 224.5 ms | 207.7 ms | 550.8 ms |
| TTFT min | 223.5 ms | 38.6 ms | 506.3 ms |
| TTFT max | 227.7 ms | 480.6 ms | 555.1 ms |
| Engine load | 2,942 ms | ~30 sec (server start) | 10,558 ms |
| Build time | 17.5 sec | N/A | N/A |


Reading the Numbers

TRT-LLM: extremely consistent, with a 224 ms mean and only a ~4 ms spread between min and max. This consistency is the signature of compiled kernels: no Python scheduler overhead, no JIT variance, pure deterministic execution.

SGLang: lowest mean (179 ms) but high variance, with a 38 ms min and a 480 ms max. The 38 ms requests are RadixAttention cache hits where the system prompt was already cached by a previous request; the 480 ms requests are cold cache misses that pay the full prefill cost.

vLLM: highest latency (531 ms) because enforce_eager=True disabled CUDA graph capture, a setting required to work around a compilation bug in vLLM 0.18 on this server. Normal vLLM with full optimization runs at 200-300 ms.

The results highlight how each engine optimizes for a different tradeoff: TRT-LLM prioritizes deterministic performance, SGLang optimizes cache reuse, and vLLM focuses on flexible high-throughput serving.

Serving LLMs with TRT-LLM, vLLM, and SGLang

Below are the minimal commands required to serve each engine locally.

TRT-LLM Serve (requires pre-built engine)

TRT-LLM requires a separate compilation step before serving.

# Step 1: Convert checkpoint
python3 convert_checkpoint.py \
    --model_dir /path/to/hf-model \
    --output_dir /path/to/checkpoint \
    --dtype float16 --tp_size 1
# Step 2: Build engine (15-25 min)
trtllm-build \
    --checkpoint_dir /path/to/checkpoint \
    --output_dir /path/to/engine \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --kv_cache_type paged
# Step 3: Serve via Triton
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository /triton_model_repo

vLLM Serve

vLLM can serve models directly from Hugging Face checkpoints.

CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --dtype float16 \
    --gpu-memory-utilization 0.6 \
    --port 8001

SGLang Serve

SGLang provides a lightweight server focused on structured generation workloads.

CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
    --model-path /path/to/model \
    --dtype float16 \
    --mem-fraction-static 0.4 \
    --port 8002 --host 0.0.0.0
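
Once the vLLM or SGLang server is up, it can be queried through its OpenAI-compatible API; the example below assumes the ports from the commands above and uses the model path as the model name.

from openai import OpenAI

# vLLM from the command above listens on port 8001; SGLang on 8002
client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="/path/to/model",    # same path passed to --model / --model-path
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=50,
)
print(resp.choices[0].message.content)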

The setup complexity reflects the tradeoff each engine makes between flexibility, startup speed, and maximum inference performance. 

How To Choose The Right Inference Engine

Each engine optimizes for different workloads, deployment environments, and operational priorities.

| Scenario | Recommended Engine | Reason |
|---|---|---|
| Rapid experimentation and active fine-tuning | SGLang | 2-min load vs 15-25 min rebuild |
| Shared system prompt (voice agents, RAG) | SGLang | RadixAttention skips cached prefix computation |
| Multi-turn chat with many users | SGLang | Global prefix tree benefits all concurrent sessions |
| Stable production models, maximum throughput, NVIDIA-only infrastructure | TRT-LLM | Kernel fusion + FP8 on H100/B200 |
| High concurrency RAG / chatbots | vLLM | PagedAttention + continuous batching, easy setup |
| Multi-model pipelines (vision + LLM) | Triton + TRT-LLM | Ensemble models chain without network latency |
| AMD or Intel GPUs | vLLM | Triton attention backend portable to non-NVIDIA hardware |
| Hybrid/novel architecture (Mamba, LFM) | vLLM or SGLang | TRT-LLM needs architecture-specific CUDA kernels |


In practice, the best engine depends less on raw benchmark numbers and more on your workload, hardware, and deployment constraints. 

Final thoughts

TRT-LLM, vLLM, and SGLang are built for very different goals. TRT-LLM focuses on maximum GPU performance through compiled execution and kernel fusion, vLLM prioritizes flexible high-throughput serving, and SGLang excels at structured generation and prefix-heavy workloads.

There is no single best inference engine for every use case. The right choice depends on your hardware, latency requirements, workload patterns, and deployment priorities.

Frequently Asked Questions

What is the difference between TRT-LLM, vLLM, and SGLang?

TRT-LLM is a compiled NVIDIA inference runtime optimized for maximum GPU performance. vLLM focuses on flexible high-throughput serving, while SGLang is optimized for structured generation and prefix-heavy workloads.

Which inference engine is fastest for NVIDIA GPUs?

TRT-LLM is generally the fastest on NVIDIA hardware because of kernel fusion, compiled execution, and native FP8/INT8 tensor core support.

Why is TRT-LLM build time so long?

TRT-LLM compiles the model into a GPU-specific engine. During the build process, it benchmarks multiple CUDA kernels, performs fusion optimization, and generates hardware-tuned execution plans.

What is PagedAttention in vLLM?

PagedAttention is vLLM’s KV cache management system that reduces GPU memory fragmentation by allocating KV cache memory in fixed-size blocks instead of large contiguous buffers.

What makes SGLang different from vLLM?

SGLang uses RadixAttention, a trie-based KV cache that aggressively reuses shared prefixes across requests, making it highly effective for chat, RAG, and agent workloads.

Which engine is best for production deployments?

It depends on the workload. TRT-LLM is ideal for maximum NVIDIA GPU performance, vLLM is strong for flexible large-scale serving, and SGLang performs well for structured generation and shared-prompt systems.

Does vLLM support quantized models?

Yes. vLLM supports formats like GPTQ, AWQ, GGUF, FP8, INT8, and INT4, though quantized weights are often dequantized during computation.

Author: Devesh Mhatre

Tech enthusiast with a passion for open-source software and problem-solving. Experienced in web development, with a focus on React, React Native and Rails. I use arch (and neovim) btw ;)
