
We tested a technique called Multi-Token Prediction (MTP) on real prompts, and the results surprised us. Not because it worked, but because of how well it worked.
Faster first response · Lower total latency · Zero quality loss
If you’ve ever used an AI chatbot and felt like it was a little slow, especially during the delay before the first word appears, you’ve experienced one of the core bottlenecks in how large language models work today.
While working on real-time AI systems, we noticed the same issue. The models were good, but they were slow. So we started asking a simple question: is there a smarter way to generate text without making users wait?
That question led us to Multi-Token Prediction (MTP), a technique that lets models predict several tokens at once instead of one at a time. We ran a focused experiment to see if it actually works.
What is Speculative Decoding?
Speculative decoding is an LLM inference technique where a smaller, faster model predicts several upcoming tokens, and the larger model verifies those predictions in one pass.
It helps reduce latency because the large model does not need to generate every token one by one from scratch. Instead, it checks a draft sequence and accepts the correct tokens together.
This matters before we get to MTP, because MTP builds on the same idea: generate or verify more than one token at a time so the response can start faster and finish sooner.
Here's the core idea:
1. A small, fast “drafter” model generates several tokens quickly
The drafter is usually much smaller than the main model, so it runs faster and produces a short sequence of guesses for what comes next.
2. The larger target model verifies them in one forward pass
Instead of generating one token at a time, the large model looks at the full drafted sequence and checks the tokens in parallel. This is much faster than generating each token individually.
3. Accepted tokens are output together; rejected tokens fall back safely
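If the drafter’s guesses are correct, multiple tokens are accepted and returned together. If a guess is wrong, the system discards that token and everything drafted after it, and the target model’s own prediction is used from that point. Because the target model has the final say on every token, this speeds up responses without reducing output quality.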
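The three steps above can be sketched with toy "models" that are plain Python functions. This is a simplified illustration of the greedy accept/reject logic only, not the Transformers implementation: `speculative_step`, `toy_target`, and `toy_drafter` are hypothetical names, and the verification loop stands in for what a real system does in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to the next token.
NextToken = Callable[[List[str]], str]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[str], k: int = 4) -> List[str]:
    """Run one draft-and-verify round and return the tokens it emits."""
    # Step 1: the drafter proposes k tokens, one after another.
    draft: List[str] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # Step 2: the target checks each drafted position. A real system scores
    # all positions in a single forward pass; this loop only mimics the logic.
    emitted: List[str] = []
    for guess in draft:
        expected = target(context + emitted)
        if guess != expected:
            # Step 3 (reject): keep the target's token, drop the rest of the draft.
            emitted.append(expected)
            return emitted
        emitted.append(guess)  # Step 3 (accept): the drafted token stands.

    # Every guess was accepted, so the same pass also yields one extra token.
    emitted.append(target(context + emitted))
    return emitted


SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def toy_target(tokens: List[str]) -> str:
    return SENTENCE[len(tokens)] if len(tokens) < len(SENTENCE) else "<eos>"


def toy_drafter(tokens: List[str]) -> str:
    # Hypothetical weak drafter: correct everywhere except position 3.
    return "cat" if len(tokens) == 3 else toy_target(tokens)


partial = speculative_step(toy_target, toy_drafter, ["the"], k=4)
perfect = speculative_step(toy_target, toy_target, ["the"], k=4)
print(partial)  # two accepted guesses plus the target's correction
print(perfect)  # four accepted guesses plus one bonus token
```

Note that even the round with a wrong guess emits three tokens, and the output is exactly what the target alone would have produced.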
Real-world analogy
Imagine a junior editor who drafts five sentences quickly, then a senior editor reviews and approves them all in one read. That's much faster than the senior editor writing every sentence from scratch, even though the senior editor is doing the final call.
The result: for responses where the drafter is often correct (factual answers, code, common phrases), you can output multiple tokens in the time it used to take to output one. For harder prompts where the drafter struggles, you lose very little: rejected drafts cost some extra compute, and generation falls back to the normal one-token-at-a-time pace.
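How much you gain depends on how often the drafter is right. A rough back-of-the-envelope sketch, under the simplifying assumption that each drafted token is accepted independently with the same probability `alpha`: a round that drafts `k` tokens emits the accepted prefix plus one token from the target, so the expected count is a geometric sum. The function name is ours, and real acceptance rates vary by prompt and position.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that every pass emits one extra token from the
    target (a correction on mismatch, or a bonus token if all are accepted).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric sum: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for a in (0.3, 0.6, 0.9):
    print(f"alpha={a}: {expected_tokens_per_pass(a, k=4):.2f} tokens per pass")
```

With a weak drafter the figure stays close to one token per pass, which is why drafter quality matters so much.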
How does it actually help?
Speculative decoding helps in three concrete ways:
| What it changes | Why it matters |
| --- | --- |
| Idle GPU compute gets used | One-token-at-a-time decoding leaves much of the GPU's compute idle; verifying several drafted tokens per pass puts it to work |
| Multiple tokens verified in one pass | Parallel verification is much cheaper than sequential generation |
| Same output quality | The big model still makes all final decisions; it just approves instead of generating from scratch |
So what is Multi-Token Prediction (MTP)?
MTP is a technique where the model predicts multiple future tokens in a single step, instead of the usual one-at-a-time approach. Think of it like a smart autocomplete that doesn't just suggest the next word, but the next five words, and does it all at once.
Real-world analogy
Imagine you're ordering food at a restaurant. Standard LLMs are like a waiter who takes your order one dish at a time, runs to the kitchen, comes back, then asks what else you want. MTP is like a waiter who takes your full order upfront and confirms everything at once when the food arrives.
If the model's predictions are correct, the entire sequence gets accepted in one shot. If not, it falls back gracefully to the model's own next token. When most predictions are accepted, verifying a batch of tokens is faster than generating each one individually.
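To make "several tokens in a single step" concrete, here is a toy sketch. It assumes one common design for MTP, where a single shared hidden state feeds several output heads and head i predicts the token i+1 positions ahead. The weights are hand-written for illustration rather than taken from any trained model, and `predict_multi` is a hypothetical name.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # one row of weights per vocabulary entry


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)


def predict_multi(hidden: Vector, heads: List[Matrix]) -> List[int]:
    """Return one token id per head, all computed from the same hidden state."""
    tokens = []
    for head in heads:
        # Each head turns the shared hidden state into logits over the vocabulary.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
        tokens.append(argmax(logits))
    return tokens


hidden_state = [1.0, 0.0]                      # produced once by the shared trunk
head_next = [[0.1, 0.0], [0.9, 0.0], [0.2, 0.0]]   # predicts position t+1
head_after = [[0.5, 0.0], [0.1, 0.0], [0.7, 0.0]]  # predicts position t+2

predicted = predict_multi(hidden_state, [head_next, head_after])
print(predicted)  # two token ids from one pass over the trunk
```

The expensive part, computing the hidden state, happens once; each extra head adds only a small projection on top.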
How is it different?
Standard speculative decoding needs two separate models: a big target and a small drafter. That means you need to find a compatible drafter, manage two models in memory, and make sure they're aligned enough that the drafter's guesses are actually useful.
MTP, Multi-Token Prediction, takes a different approach. Instead of pairing two separate models, the target model itself has the ability to predict multiple tokens at once, built right in. No drafter to find. No second model to load. The capability lives inside the model.
| Standard Speculative Decoding | Multi-Token Prediction (MTP) |
| --- | --- |
| Needs a separate drafter model | No separate drafter needed |
| Drafter and target must be compatible | Native to the model architecture |
| Two models in GPU memory | One model, leaner setup |
| Draft quality affects performance | Drafting quality is baked into training |
| Works with most existing models | Requires a model trained for MTP |
In our experiment, we used the assistant-model approach: a purpose-trained assistant model acts as the drafter and is paired with the target model through the assistant_model parameter in Hugging Face Transformers. This is the practical way to use speculative decoding with Transformers today.
| Feature | Standard (separate drafter) | MTP (native) |
| --- | --- | --- |
| Setup complexity | Higher, two models to manage | Lower, one model does it all |
| Memory usage | Larger, both models in VRAM | Smaller, single model |
| Drafting quality | Depends on drafter model choice | Trained in, more consistent |
| Model availability | Works with most LLMs | Needs an MTP-trained model |
What we actually tested
We ran the same set of prompts through two conditions using the same model family (Gemma 4 E2B):
- Standard decoding, one token at a time, no MTP
- MTP-enabled speculative decoding
We chose prompts that reflect real-world AI use cases, not just simple questions:
| Prompt type | Why we included it |
| --- | --- |
| Technical protocol comparison (TCP vs UDP vs QUIC) | Long-form, structured output |
| Math problem (average speed) | Step-by-step reasoning |
| Logic question (counterexample) | Short, precise output |
| Formal Tamil translation with grammar correction | Language + reasoning combined |
| Constrained writing (10-word sentences) | Output quality under strict rules |
We measured two things: Time To First Token (TTFT), how long before the model starts responding, and total latency, how long until the full response is done.
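For readers who want to run a similar comparison, here is a minimal sketch of a timing harness. It is not the exact harness behind our numbers: `benchmark` and `fake_generate` are hypothetical names, the warm-up count is an assumption, and in practice you would pass in a function that calls `model.generate()` with or without the assistant.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(generate_fn: Callable[[str], object],
              prompts: List[str], warmup: int = 1) -> Dict[str, float]:
    """Time generate_fn on every prompt and summarize total latency in seconds."""
    # Warm-up runs absorb one-off costs (kernel compilation, cache fills)
    # so they do not distort the first measured prompt.
    for prompt in prompts[:warmup]:
        generate_fn(prompt)

    latencies = []
    for prompt in prompts:
        t0 = time.perf_counter()
        generate_fn(prompt)
        latencies.append(time.perf_counter() - t0)

    return {
        "mean_s": mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
        "runs": len(latencies),
    }


# Stand-in for a real generation call, used only to exercise the harness.
calls = []


def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return prompt.upper()


stats = benchmark(fake_generate, ["tcp vs udp", "average speed", "counterexample"])
print(stats)
```

Running the same prompt list through both conditions with identical settings (same `max_new_tokens`, greedy decoding) keeps the comparison fair.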
The results
Across five diverse prompts (technical explanations, math, logic, translation, and constrained writing), the outputs were nearly identical in quality.
Cutting total generation time nearly in half, while keeping output quality the same, is a meaningful result. This isn't a tradeoff. You're not sacrificing accuracy for speed. You're getting both.
MTP vs Standard Decoding: Performance Results
| Metric | Without MTP | With MTP | Difference |
| --- | --- | --- | --- |
| Average TTFT | 116.25 ms | 74.55 ms | 35.9% faster |
| Average total latency | 44.70 sec | 23.61 sec | 47.2% faster |
| Output accuracy | Maintained | Maintained | No change |
| Logical consistency | Maintained | Maintained | No change |
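The percentages follow directly from the raw timings in the table. A quick sanity check, with helper names that are ours and used only for this illustration:

```python
def percent_reduction(before: float, after: float) -> float:
    """How much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster the new timing is."""
    return before / after


ttft_gain = percent_reduction(116.25, 74.55)    # milliseconds
latency_gain = percent_reduction(44.70, 23.61)  # seconds
latency_speedup = speedup(44.70, 23.61)

print(f"TTFT reduction:    {ttft_gain:.1f}%")
print(f"Latency reduction: {latency_gain:.1f}%")
print(f"Latency speedup:   {latency_speedup:.2f}x")
```

A 47.2% reduction in latency corresponds to a speedup of about 1.89x, which is where "nearly twice as fast" comes from.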
How to integrate MTP into your project
Here's the practical part. The good news: if you're using Hugging Face Transformers, the integration is surprisingly simple. You don't need to change your model or rewrite your inference code. You just add an assistant_model parameter to your generate() call.
Step 1 - Install dependencies
```bash
pip install transformers torch accelerate gradio
```
Step 2 - Load both models
Load the target model and the assistant (drafter) model separately. Both go on the GPU.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

TARGET = "google/gemma-4-E2B-it"
ASSISTANT = "gg-hf-am/gemma-4-E2B-it-assistant"
DTYPE = torch.float16

# Load shared tokenizer
tokenizer = AutoTokenizer.from_pretrained(TARGET)

# Load the big target model
model = AutoModelForCausalLM.from_pretrained(
    TARGET, device_map="auto", dtype=DTYPE
)

# Load the small drafter (assistant) model
assistant = AutoModelForCausalLM.from_pretrained(
    ASSISTANT, device_map="auto", dtype=DTYPE
)
```
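Because the drafter proposes token IDs that the target then verifies, both models need to agree on what each ID means. Here is a minimal sketch of a sanity check you could run before pairing them. `vocab_mismatches` is a hypothetical helper, and the commented usage assumes both tokenizers expose `get_vocab()`, as Hugging Face tokenizers do.

```python
from typing import Dict, List


def vocab_mismatches(target_vocab: Dict[str, int],
                     assistant_vocab: Dict[str, int]) -> List[str]:
    """Return tokens that are missing from one vocab or mapped to different ids."""
    all_tokens = set(target_vocab) | set(assistant_vocab)
    return sorted(
        tok for tok in all_tokens
        if target_vocab.get(tok) != assistant_vocab.get(tok)
    )


# Toy vocabularies to show the behavior.
same = vocab_mismatches({"a": 0, "b": 1}, {"a": 0, "b": 1})
different = vocab_mismatches({"a": 0, "b": 1}, {"a": 0, "b": 2, "c": 3})
print(same, different)

# Assumed usage with real tokenizers (not executed here):
#   t = AutoTokenizer.from_pretrained(TARGET).get_vocab()
#   a = AutoTokenizer.from_pretrained(ASSISTANT).get_vocab()
#   assert not vocab_mismatches(t, a)
```

An empty result is a good sign, though the model card remains the authoritative source on which assistant goes with which target.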
Step 3 - Format the prompt with chat template
Gemma uses a chat template format. Always apply it before passing to the model; raw prompts will give you poor results.
```python
def prepare_input(prompt: str) -> torch.Tensor:
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template, adds special tokens Gemma expects
    formatted = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # The template has already inserted the special tokens,
    # so tell the tokenizer not to add them a second time.
    ids = tokenizer(
        formatted, return_tensors="pt", add_special_tokens=False
    )["input_ids"]
    return ids.to(model.device)
```
Step 4 - Run generation with MTP enabled
This is the key change: pass assistant_model=assistant to model.generate(). That's it. Transformers handles the speculative decoding loop automatically.
```python
import time

def generate_with_mtp(prompt: str, max_tokens: int = 200):
    input_ids = prepare_input(prompt)
    prompt_len = input_ids.shape[1]

    t_start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            assistant_model=assistant,  # this one line enables MTP
        )
    t_end = time.perf_counter()

    n_new = output.shape[1] - prompt_len
    total_secs = t_end - t_start
    throughput = n_new / total_secs
    # Average time per generated token, used as a rough stand-in for TTFT.
    # A true TTFT needs a timestamp taken when the first token arrives.
    ttft_ms = (total_secs / max(n_new, 1)) * 1000

    text = tokenizer.decode(
        output[0][prompt_len:], skip_special_tokens=True
    )
    return {
        "text": text,
        "ttft_ms": round(ttft_ms, 1),
        "latency_ms": round(total_secs * 1000, 1),
        "tokens_ps": round(throughput, 1),
        "n_tokens": n_new,
    }

# Example usage
result = generate_with_mtp("Explain gradient descent simply.")
print(result["text"])
print(f"TTFT: {result['ttft_ms']} ms | Latency: {result['latency_ms']} ms | {result['tokens_ps']} tok/s")
```
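The `ttft_ms` value above is total time divided by token count, which is an average per-token time and not a direct measurement of when the first token arrived. Below is a minimal sketch of one way to timestamp the first token directly. It assumes `generate()` calls `put()` on its `streamer` argument first with the prompt tokens and then with each batch of new tokens, and that an object with `put()` and `end()` methods is enough; subclassing the library's `BaseStreamer` is the more conservative route. `FirstTokenTimer` is a hypothetical name.

```python
import time


class FirstTokenTimer:
    """Streamer-like object that records when the first new token arrives."""

    def __init__(self):
        self.start = time.perf_counter()
        self.first_token_at = None
        self._seen_prompt = False

    def put(self, value):
        # The first put() carries the prompt tokens, so it is skipped.
        if not self._seen_prompt:
            self._seen_prompt = True
            return
        # Only the first batch of generated tokens sets the timestamp.
        if self.first_token_at is None:
            self.first_token_at = time.perf_counter()

    def end(self):
        pass  # nothing to flush

    @property
    def ttft_ms(self):
        if self.first_token_at is None:
            return None
        return (self.first_token_at - self.start) * 1000


# Simulated calls, standing in for what generate() would do.
timer = FirstTokenTimer()
timer.put([101, 102, 103])   # prompt
timer.put([7])               # first generated token(s)
first_reading = timer.ttft_ms
timer.put([8])               # later tokens do not change the reading
print(f"TTFT: {timer.ttft_ms:.3f} ms")

# Assumed usage with the model above (not executed here):
#   timer = FirstTokenTimer()
#   model.generate(input_ids, streamer=timer, assistant_model=assistant, ...)
#   print(timer.ttft_ms)
```

With assisted generation, the first batch may contain several accepted tokens at once, so this reading reflects the time to complete the first draft-and-verify round.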
When should you use MTP?
- Building real-time chat or voice AI, where that first-word delay kills the experience
- Running long-form generation, reports, summaries, code, reasoning chains
- Trying to reduce GPU costs without switching to a smaller model
- Working with a model that has a compatible assistant/drafter available (Gemma 4 does)
- Deploying at scale, where a 47% latency reduction translates into real infrastructure savings
Conclusion
MTP delivered a roughly 36% faster first response and 47% lower total latency with no drop in output quality. No model swap. No pipeline rewrite. Just one extra parameter in your generate() call, and your inference is nearly twice as fast. For teams building real-time AI products, that's a meaningful improvement with almost zero effort. If latency matters to you (and in production, it always does), MTP is worth trying.
Frequently Asked Questions
1. What is speculative decoding in simple terms?
A small "drafter" model guesses several tokens ahead. The big model verifies all guesses in one shot. Right guesses = multiple tokens for the cost of one pass. Wrong guesses = graceful fallback. Either way, faster than generating every token from scratch.
2. What exactly is Multi-Token Prediction (MTP)?
MTP predicts multiple future tokens in a single forward pass; no separate drafter model is needed. In our experiment, we used an assistant model paired with the target via Hugging Face's assistant_model parameter.
3. How is MTP different from standard speculative decoding?
Standard speculative decoding needs two separate models. MTP either has the capability built in natively or uses a purpose-trained assistant model. The result is a simpler setup, more consistent drafting quality, and less memory overhead.
4. Do I need a special model to use MTP?
Yes. You need a target model with a matching assistant available. Gemma 4 E2B has one: gg-hf-am/gemma-4-E2B-it-assistant. Always check the model card for compatibility before assuming support.
5. How do I enable MTP in one line of code?
Just add assistant_model=assistant to your model.generate() call. Transformers handles the speculative decoding loop automatically under the hood.
6. Can I use this with any Hugging Face model?
Not quite. The drafter must share the same tokenizer and vocabulary as the target. Safest approach: look for a model that explicitly provides an assistant variant, or check its docs for speculative decoding support.
7. Does MTP affect output quality?
No. Output quality held steady across all five prompt types. The target model still makes all final decisions; it approves or rejects drafter tokens, never blindly accepting them.
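If you want to check the quality claim on your own prompts, one simple approach is to compare the generated token IDs from a run with the assistant against a run without it. A minimal sketch, with `first_divergence` as a hypothetical helper; it assumes both runs use greedy decoding and the same token limit, so any difference is attributable to the decoding path.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


baseline = [12, 7, 99, 4]
assisted_same = [12, 7, 99, 4]
assisted_diff = [12, 7, 42, 4]

print(first_divergence(baseline, assisted_same))  # identical runs
print(first_divergence(baseline, assisted_diff))  # differs at the third token
```

Small divergences can still appear in practice because of floating-point differences between code paths, so it helps to read the text around the first differing position before concluding anything.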



