
Transformers, vLLM, and SGLang are three of the most popular tools for running AI language models today, but they solve very different problems in practice. I’ve worked with all three while experimenting with local inference and serving setups, and the differences around setup effort, speed, memory use, and flexibility only become obvious once you try them yourself.
In this guide, I break down what each tool does, how to get started with them, and when one makes more sense than the others. Even if you're new to AI, this should help you choose the right option for your needs—whether you're building an app, optimizing inference speed, or setting up smarter workflows.
Let’s dive in.
Transformers is an open-source library developed by Hugging Face that makes it easy to use powerful AI models for tasks like text generation, translation, question answering, and even working with images and audio. It provides access to thousands of pre-trained models that you can use with just a few lines of code.
Whether you're a beginner or an experienced developer, Transformers helps you build and test AI applications quickly without needing deep knowledge of how the models work under the hood.
Step 1: To get started with Transformers, you only need Python and a few commands. This setup is usually where I recommend starting if you want to test a model locally without worrying about performance tuning yet.
pip install transformers accelerate
Step 2: Once installed, you can load and run a TinyLlama model like this:
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "what is llm?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt)
print(outputs[0]["generated_text"].replace(prompt, "").strip())
This runs the TinyLlama model locally and prints the generated response. I’ve found it works reliably on most modern machines, and having a GPU makes a noticeable difference even for smaller models.
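To make the apply_chat_template step less opaque, here is a minimal sketch of roughly what it produces. It assumes the Zephyr-style layout that TinyLlama's chat model is documented to use; the authoritative template ships with the model's tokenizer, so treat the exact tags as an assumption. The function name render_zephyr_prompt is hypothetical and not part of Transformers.

```python
def render_zephyr_prompt(messages, eos_token="</s>", add_generation_prompt=True):
    """Render chat messages into a Zephyr-style prompt string (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is tagged with its role and closed with the end-of-sequence token
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Leave an open assistant turn so the model writes the reply
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "what is llm?"},
]
prompt = render_zephyr_prompt(messages)
print(prompt)
```

In real code, prefer the tokenizer's own apply_chat_template, since it always matches the template the model was trained with.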
Serving with Transformers
Besides running models inside scripts, Transformers also lets you serve models as an API using the built-in CLI. I’ve used this approach when I needed a quick local endpoint for testing prompts from a web app or another service.
You can serve a model like TinyLlama using the following command:
transformers serve --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
Once the server is running, it exposes a local API (by default at http://localhost:8000) that follows the OpenAI chat API format. This means you can send chat-style messages to it using simple HTTP requests. Here’s how to do it in Python:
import requests
# API endpoint
url = "http://localhost:8000/v1/chat/completions"
# Input message in OpenAI format
payload = {
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}
# Send request
response = requests.post(url, json=payload)
# Print the response
print(response.json()["choices"][0]["message"]["content"])
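The one-line lookup above assumes the request succeeded. As a hedged sketch, the helper below shows one way to unpack an OpenAI-style chat response more defensively. The name extract_reply and the sample body are hypothetical stand-ins written for illustration; they are not captured output from a real server.

```python
import json


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    data = json.loads(response_body)
    if "error" in data:
        # Surface server-side errors instead of failing with a KeyError
        raise RuntimeError(f"Server returned an error: {data['error']}")
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("Response contained no choices")
    return choices[0]["message"]["content"]


# Hand-written stand-in for a server response (an assumption, not real output)
sample_body = json.dumps(
    {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "The capital of France is Paris."},
                "finish_reason": "stop",
            }
        ]
    }
)
reply = extract_reply(sample_body)
print(reply)
```

With a live server, you could pass response.text from the requests call into the same helper.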
vLLM is a high-performance engine designed specifically for serving large language models quickly and efficiently. I started using vLLM when response times with standard setups became a bottleneck, especially for chat-style or multi-user workloads. vLLM supports popular models like LLaMA, Mistral, and TinyLlama, and even works with vision-language models. It’s a great choice when you need fast and scalable model serving.
Step 1: To get started with vLLM, first install it using pip:
pip install vllm
Step 2: Once installed, you can load and run a TinyLlama model on vLLM like this:
from vllm import LLM, SamplingParams
# Load model
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Input prompt in TinyLlama's chat format (each turn ends with the </s> token)
user_input = "What is LLM?"
prompt = f"<|user|>\n{user_input}</s>\n<|assistant|>\n"
# Allow a longer reply than vLLM's short default token limit
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
# Generate
outputs = llm.generate(prompt, sampling_params)
reply = outputs[0].outputs[0].text.strip()
# Print output
print(reply)
This code uses vLLM directly in Python to run the TinyLlama model efficiently. In my experience, this is one of the simplest ways to see vLLM’s performance benefits without setting up an API server first.
vLLM isn’t limited to scripts; it also lets you serve models as a high-performance API using the OpenAI chat format, which is helpful when building apps that need to send prompts to the model over HTTP. This is the setup I usually switch to once I need consistent latency across multiple requests.
You can serve a model like TinyLlama using this command:
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto --api-key token-abc123
The server will be available at http://localhost:8000/v1 by default.
Use the OpenAI SDK with vLLM
vLLM supports the OpenAI API format, so you can use the official OpenAI Python client like this:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",  # Make sure it's http, not https
    api_key="token-abc123",  # Same key as used in vllm serve
)
completion = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message)
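For readers curious what the SDK puts on the wire, here is a rough sketch using only the standard library. It builds (but does not send) the equivalent HTTP request, showing that the API key travels as a Bearer token in the Authorization header. The helper name build_chat_request is hypothetical; the URL, key, and model are the ones used above.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The key passed to `vllm serve --api-key` is sent as a Bearer token
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(request.full_url)
print(request.get_header("Authorization"))

# To actually send it (requires the vLLM server to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

This is mainly useful for debugging: if the key in the header does not match the one given to the server, the request is rejected before it reaches the model.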
SGLang is a tool that helps you build smart chat systems using AI models. It lets you write simple Python code to control how the model responds, making it easier to create custom assistants or workflows. SGLang ships its own high-performance serving runtime (early releases reused some vLLM components), so it is fast and efficient. It’s a great choice when you want more control over how your AI behaves.
Step 1: To get started with SGLang, install it along with vLLM using pip:
pip install "sglang[all]>=0.4.9.post3"
Step 2: Once installed, you can load and run a TinyLlama model on SGLang like this:
from sglang.utils import launch_server_cmd, wait_for_server
# This is equivalent to running the following command in your terminal
# python3 -m sglang.launch_server --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host 0.0.0.0
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
 --host 0.0.0.0
"""
)
# Block until the server reports that it is ready
wait_for_server(f"http://localhost:{port}")
This code launches an SGLang server for the TinyLlama model from Python and waits until it is ready to accept requests, with no extra setup required.
SGLang lets you serve language models with extra flexibility, allowing you to define how the model responds through simple Python functions. It also supports OpenAI-style API calls, making integration with apps and tools straightforward.
To serve a model like TinyLlama, run:
python3 -m sglang.launch_server --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host 0.0.0.0 --port 3000
Once running, the server is ready to receive chat requests and can be customized for advanced features like tool use, memory, and function calling.
SGLang also supports the OpenAI API format, so you can use the official OpenAI Python client like this:
import openai
# Point the client at the SGLang server launched on port 3000
# (use your tunnel URL instead if the server runs on a remote machine)
client = openai.Client(base_url="http://localhost:3000/v1", api_key="None")
response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "user", "content": "What is LLM?"},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
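OpenAI-style chat endpoints are stateless, so multi-turn conversations work by resending the full message history with every request. The sketch below illustrates that bookkeeping. ChatSession and fake_complete are hypothetical names invented for this example and are not part of SGLang or the OpenAI SDK; the completion function is injected so the example runs without a server.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatSession:
    """Keeps the running message history for a multi-turn chat."""

    def __init__(self, complete: Callable[[List[Message]], str], system_prompt: str = ""):
        self._complete = complete
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text: str) -> str:
        # Append the user turn, send the whole history, then record the reply
        self.messages.append({"role": "user", "content": user_text})
        reply = self._complete(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_complete(messages: List[Message]) -> str:
    """Stand-in for a real model call; reports how many user turns it received."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply to turn {user_turns})"


session = ChatSession(fake_complete, system_prompt="You are concise.")
first = session.ask("What is LLM?")
second = session.ask("Give one example.")
print(first)
print(second)
```

To use it against a running server, replace fake_complete with a function that calls client.chat.completions.create with the given messages and returns the message content.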
To understand how Transformers, vLLM, and SGLang behave in practice, I ran a simple test using the same prompt across all three setups: “What is LLM?”
This test was done using the TinyLlama model on a system with 15 GB of available GPU memory, which reflects a fairly common local development setup. I measured two things: GPU memory usage and response time.
Here’s a breakdown of the results:
| Tool | GPU Usage (out of 15 GB) | Response Time | Notes |
| --- | --- | --- | --- |
| Transformers | 2.2 GB | 4.4 seconds | Light on memory, but slow |
| vLLM | 13.3 GB | 1.55 seconds | Very fast, but heavy on VRAM |
| SGLang | 12.6 GB | 1.08 seconds | Fastest and slightly lighter |

| Features | Transformers | vLLM | SGLang |
| --- | --- | --- | --- |
| Core Use Case | Simple, low-latency text generation | Fast single-round inference for many users | Rich multi-turn conversations and complex task routing |
| Efficiency Design | Lightweight, traditional architecture | Memory-optimized via advanced scheduling and KV reuse | Task-optimized execution using compiler-style planning |
| Memory Strategy | Minimal memory use, less VRAM required | PagedAttention allows dynamic reuse of KV memory | Efficient per-task memory allocation with internal optimization |
| Scalability | Limited parallelism, not ideal for high loads | Scales well with many concurrent users | Scales well for dialog-heavy systems |
| Flexibility | Works out of the box with standard APIs | Focused on performance over flexibility | Highly configurable for complex pipeline behaviors |
| Latency Control | Higher latency under load | Low latency due to dynamic batching | Ultra-low latency for conversational models |
| Customization Support | Basic hooks or extensions possible | Ready-to-use, minimal tweaking needed | Deep customization with domain-specific logic |
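The "PagedAttention" entry in the table refers to managing the KV cache in fixed-size blocks that are handed out on demand and returned when a request finishes. The toy sketch below only tracks which blocks belong to which sequence; it stores no tensors and is not vLLM's implementation. The class name, block size, and block count are all assumptions chosen for illustration.

```python
class PagedKVCache:
    """Toy bookkeeping for a paged KV cache: tracks block ownership, not tensors."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lengths = {}   # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        length = self.seq_lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the current one is full (or none exists)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("No free KV blocks left")
            table.append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        # Return all of a finished sequence's blocks to the shared pool
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("chat-a")  # 20 tokens occupy 2 blocks of 16
for _ in range(5):
    cache.append_token("chat-b")  # 5 tokens occupy 1 block
blocks_free_while_busy = len(cache.free_blocks)
cache.free("chat-a")              # chat-a's blocks become reusable
blocks_free_after_release = len(cache.free_blocks)
print(blocks_free_while_busy, blocks_free_after_release)
```

Because memory is claimed block by block rather than reserved up front for the longest possible reply, many concurrent requests can share the same pool.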
In practical terms, Transformers was light on memory but slow, vLLM was very fast but heavy on VRAM, and SGLang was the fastest while using slightly less memory than vLLM.
This hands-on test makes it clear how each tool behaves under the same conditions, helping you choose based on what matters most to you: memory usage, speed, or control. In many production cases, teams also compare these setups with small language models when they want lightweight performance on limited hardware without sacrificing too much capability.
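If you want to repeat a comparison like this on your own hardware, one plausible way to record response time is to wrap each tool's generate call in a small timing harness. This is a sketch under stated assumptions, not the method used for the numbers above: time_calls and fake_generate are hypothetical names, and the fake function stands in for a real model call so the example runs anywhere.

```python
import statistics
import time
from typing import Callable, List


def time_calls(generate: Callable[[str], str], prompt: str, runs: int = 5, warmup: int = 1) -> List[float]:
    """Return wall-clock durations (seconds) for repeated calls to `generate`."""
    for _ in range(warmup):
        # Warm-up calls absorb one-time costs such as lazy initialization
        generate(prompt)
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        durations.append(time.perf_counter() - start)
    return durations


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call; sleeps briefly instead of running inference."""
    time.sleep(0.01)
    return "stub reply"


durations = time_calls(fake_generate, "What is LLM?", runs=3)
print(f"median: {statistics.median(durations):.3f}s over {len(durations)} runs")
```

Reporting the median over several runs, after a warm-up call, makes the numbers less sensitive to one-off delays than a single measurement.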
Each tool, Transformers, vLLM, and SGLang, serves a different purpose depending on what you’re trying to achieve. After working with all three, the trade-offs around speed, memory usage, and control become much clearer in real usage than they do on paper.
Transformers is best suited for beginners or lightweight applications thanks to its minimal GPU usage and ease of use. vLLM delivers much faster responses by utilizing more GPU memory, making it a great fit for high-performance, real-time tasks. SGLang, which has its own serving runtime (early releases reused some vLLM components), was the fastest in this test and adds flexibility for customizing AI behavior through simple Python functions.
Whether you're experimenting, deploying, or building complex systems, understanding these differences will help you choose the right tool based on what matters most: speed, memory efficiency, or control. If you’re exploring other protocols for model serving, you can also read about STDIO transport in MCP to see how similar systems handle transport and integration.