We Ran LLMs Faster Using Multi-Token Prediction (Here's How)

Written by Swathilakshmi B

Jul 6, 2026

8 Min Read

Too Long? Read This First
- Multi-Token Prediction (MTP) lets a model predict several tokens per step instead of one at a time, cutting latency. Native MTP needs no separate "drafter" model; in our test, the drafter was a purpose-built assistant model paired with the target.
- Tested on Gemma 4 E2B across five prompt types (technical, math, logic, translation, constrained writing): 35% faster first response, 47% lower total latency, no drop in output quality.
- Setup is a one-line change in Hugging Face Transformers: once the assistant model is loaded, add assistant_model=assistant to your generate() call. No pipeline rewrite needed.
- You need a target model with a compatible assistant/drafter variant available. Gemma 4 E2B has one; not every model does.
- Best fit: real-time chat/voice AI (where first-token delay matters most), long-form generation, and cutting GPU costs without downgrading to a smaller model.

We tested a technique called Multi-Token Prediction (MTP) on real prompts, and the results surprised us. Not because it worked, but because of how well it worked.

Faster first response · Lower total latency · Zero quality loss

If you’ve ever used an AI chatbot and felt like it was a little slow, especially during the delay before the first word appears, you’ve experienced one of the core bottlenecks in how large language models work today.

While working on real-time AI systems, we noticed the same issue. The models were good, but they were slow. So we started asking a simple question: is there a smarter way to generate text without making users wait?

That question led us to Multi-Token Prediction (MTP), a technique that lets models predict several tokens at once instead of one at a time. We ran a focused experiment to see if it actually works.

What is Speculative Decoding?

Speculative decoding is an LLM inference technique where a smaller, faster model predicts several upcoming tokens, and the larger model verifies those predictions in one pass.

It helps reduce latency because the large model does not need to generate every token one by one from scratch. Instead, it checks a draft sequence and accepts the correct tokens together.

This matters before we get to MTP, because MTP builds on the same idea: generate or verify more than one token at a time, so the response can start faster and finish sooner.

Here's the core idea:

1. A small, fast “drafter” model generates several tokens quickly

The drafter is usually much smaller than the main model, so it runs faster and produces a short sequence of guesses for what comes next.

2. The larger target model verifies them in one forward pass

Instead of generating one token at a time, the large model looks at the full drafted sequence and checks the tokens in parallel. This is much faster than generating each token individually.

3. Accepted tokens are output together; rejected tokens fall back safely

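If the drafter’s guesses are correct, multiple tokens are accepted and returned together. If a guess is wrong, the system discards it along with everything drafted after it, and the target model's own token is used at that position. Because every emitted token is either verified or produced by the target model, this speeds up responses without reducing output quality.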

Real-world analogy

Imagine a junior editor who drafts five sentences quickly, then a senior editor reviews and approves them all in one read. That's much faster than the senior editor writing every sentence from scratch, and the senior editor still makes the final call.

The result: for responses where the drafter is often correct (factual answers, code, common phrases), you can output multiple tokens in the time it used to take to output one. For harder prompts where the drafter struggles, you lose little: the rejected guesses cost a small amount of extra compute, and generation continues from the target model's own token.
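The draft, verify, accept loop described above can be sketched with toy stand-ins for the two models. Everything here is hypothetical and for illustration only: `target_next` and `drafter_next` are plain functions rather than neural networks, the drafter is rigged to be wrong at a few positions, and the "single forward pass" of verification is simulated with a list comprehension. It is a minimal sketch of greedy speculative decoding, not how Transformers implements it.

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]

EOS = "<eos>"
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_next(context: List[Token]) -> Token:
    """Toy 'large' model: always continues SENTENCE correctly."""
    i = len(context)
    return SENTENCE[i] if i < len(SENTENCE) else EOS


def drafter_next(context: List[Token]) -> Token:
    """Toy 'small' model: agrees with the target except at positions 2, 7, 12, ..."""
    i = len(context)
    return "???" if i % 5 == 2 else target_next(context)


def greedy_decode(model: NextToken, max_tokens: int = 32) -> List[Token]:
    """Baseline: one model call per generated token."""
    out: List[Token] = []
    while len(out) < max_tokens and (not out or out[-1] != EOS):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, drafter: NextToken,
                       k: int = 3, max_tokens: int = 32) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out: List[Token] = []
    target_passes = 0
    while len(out) < max_tokens and (not out or out[-1] != EOS):
        # 1. The drafter proposes k tokens, one after another.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. The target scores every drafted position. A real model does this
        #    in one batched forward pass; here it is simulated call by call.
        target_passes += 1
        verified = [target(out + draft[:j]) for j in range(k + 1)]

        # 3. Accept the longest prefix the target agrees with, then append the
        #    target's own token (a correction, or a bonus token if all matched).
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok] + [verified[n_ok]])

        # Stop cleanly at end-of-sequence and respect the token budget.
        if EOS in out:
            out = out[: out.index(EOS) + 1]
        out = out[:max_tokens]
    return out, target_passes


baseline = greedy_decode(target_next)
tokens, passes = speculative_decode(target_next, drafter_next, k=3)
print(" ".join(tokens))
print(f"target passes: {passes} (baseline needs {len(baseline)})")
```

In this toy run the output matches plain greedy decoding token for token, while the target is consulted 4 times instead of 10. The savings in a real system depend on how often the drafter's guesses are accepted.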

How does it actually help?

Speculative decoding helps in three concrete ways:

What it changes	Why it matters
Idle GPU compute gets used	Generating one token at a time leaves much of the GPU's compute idle; checking several drafted tokens in one pass puts it to work
Multiple tokens verified in one pass	Parallel verification is much cheaper than sequential generation
Same output quality	The big model still makes all final decisions; it just approves instead of generating from scratch
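To get a feel for why verifying several tokens per pass pays off, here is a rough back-of-the-envelope model. It assumes each drafted token is accepted independently with the same probability, which is a simplification we did not measure in this experiment; `expected_tokens_per_pass`, `accept_rate`, and `draft_len` are illustrative names, not part of any library.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    With acceptance probability a and k drafted tokens, the target pass always
    yields one token of its own, plus each drafted token that survives:
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


for rate in (0.5, 0.8, 0.95):
    print(f"accept_rate={rate}: {expected_tokens_per_pass(rate, draft_len=4):.2f} tokens per pass")
```

The takeaway is that the gain depends heavily on how often the drafter agrees with the target, which is why a well-matched assistant matters.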

So what is Multi-Token Prediction (MTP)?

MTP is a technique where the model predicts multiple future tokens in a single step, instead of the usual one-at-a-time approach. Think of it like a smart autocomplete that doesn't just suggest the next word, but the next five words, and does it all at once.

Real-world analogy

Imagine you're ordering food at a restaurant. Standard LLMs are like a waiter who takes your order one dish at a time, runs to the kitchen, comes back, then asks what else you want. MTP is like a waiter who takes your full order upfront and confirms everything at once when the food arrives.

If the model's predictions are correct, the entire sequence gets accepted in one shot. If not, it falls back gracefully: the wrong guesses are dropped and generation continues from the last verified token. When most guesses are right, verifying a batch is faster than generating each token individually.

How is it different?

Standard speculative decoding needs two separate models: a big target and a small drafter. That means you need to find a compatible drafter, manage two models in memory, and make sure they're aligned enough that the drafter's guesses are actually useful.

MTP takes a different approach. Instead of pairing two separate models, the target model itself has the ability to predict multiple tokens at once, built right in. No drafter to find. No second model to load. The capability lives inside the model.

Standard Speculative Decoding	Multi-Token Prediction (MTP)
Needs a separate drafter model	No separate drafter needed
Drafter and target must be compatible	Native to the model architecture
Two models in GPU memory	One model, leaner setup
Draft quality affects performance	Drafting quality is baked into training
Works with most existing models	Requires a model trained for MTP

In our experiment, we used the assistant model approach: a purpose-built assistant model acts as the drafter and is paired with the target model via the assistant_model parameter of Hugging Face's generate(). This is the practical way to use speculative decoding with Transformers today, and it does mean loading the assistant alongside the target.

Feature	Standard (separate drafter)	MTP (native)
Setup complexity	Higher, two models to manage	Lower, one model does it all
Memory usage	Larger, both models in VRAM	Smaller, single model
Drafting quality	Depends on drafter model choice	Trained in, more consistent
Model availability	Works with most LLMs	Needs an MTP-trained model

What we actually tested

We ran the same set of prompts through two conditions using the same model family (Gemma 4 E2B):

- Standard decoding: one token at a time, no MTP
- MTP-enabled speculative decoding

We chose prompts that reflect real-world AI use cases, not just simple questions:

Prompt type	Why we included it
Technical protocol comparison (TCP vs UDP vs QUIC)	Long-form, structured output
Math problem (average speed)	Step-by-step reasoning
Logic question (counterexample)	Short, precise output
Formal Tamil translation with grammar correction	Language + reasoning combined
Constrained writing (10-word sentences)	Output quality under strict rules

We measured two things: Time To First Token (TTFT), how long before the model starts responding, and total latency, how long until the full response is done.
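For readers who want to reproduce these two metrics, here is a minimal sketch of how they can be taken from any token stream. `measure_stream` and `fake_stream` are hypothetical helpers written for this example; the fake stream uses `time.sleep` to stand in for a model, so the numbers it prints say nothing about real hardware.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_ms, total_latency_ms, n_tokens)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            # TTFT: the moment the very first token becomes available.
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    ttft_ms = (first_token_at - start) * 1000 if first_token_at is not None else float("nan")
    total_ms = (end - start) * 1000
    return ttft_ms, total_ms, count


def fake_stream(n_tokens: int, first_delay_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a model: a slow first token, then a steady trickle."""
    time.sleep(first_delay_s)
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i}"


ttft_ms, total_ms, n = measure_stream(fake_stream(5, first_delay_s=0.05, per_token_s=0.01))
print(f"TTFT: {ttft_ms:.1f} ms | total: {total_ms:.1f} ms | tokens: {n}")
```

Note that TTFT is a timestamp of the first token, not total time divided by token count; the latter is average per-token latency, a different metric.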

The results

Across five diverse prompts (technical explanations, math, logic, translation, and constrained writing), the outputs were nearly identical in quality.

Cutting total generation time nearly in half, while keeping output quality the same, is a meaningful result. This isn't a tradeoff. You're not sacrificing accuracy for speed. You're getting both.

MTP vs Standard Decoding: Performance Results

Metric	Without MTP	With MTP	Difference
Average TTFT	116.25 ms	74.55 ms	35.9% lower
Average total latency	44.70 sec	23.61 sec	47.2% lower
Output accuracy	Maintained	Maintained	No change
Logical consistency	Maintained	Maintained	No change
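The percentages in the table follow directly from the averages. As a quick check, the arithmetic can be reproduced like this (the helper names `percent_reduction` and `speedup` are ours, chosen for illustration):

```python
def percent_reduction(before: float, after: float) -> float:
    """How much smaller `after` is than `before`, as a percentage of `before`."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster the new run is (before / after)."""
    return before / after


# Averages reported in the results table
ttft_before_ms, ttft_after_ms = 116.25, 74.55
latency_before_s, latency_after_s = 44.70, 23.61

print(f"TTFT reduction:    {percent_reduction(ttft_before_ms, ttft_after_ms):.1f}%")
print(f"Latency reduction: {percent_reduction(latency_before_s, latency_after_s):.1f}%")
print(f"Latency speedup:   {speedup(latency_before_s, latency_after_s):.2f}x")
```

A 47.2% reduction in total latency corresponds to roughly a 1.89x speedup, which is where "nearly twice as fast" comes from.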

How to integrate MTP into your project

Here's the practical part. The good news: if you're using Hugging Face Transformers, the integration is surprisingly simple. You don't need to change your model or rewrite your inference code. You just add an assistant_model parameter to your generate() call.

Step 1 - Install dependencies

pip install transformers torch accelerate gradio

Step 2 - Load both models

Load the target model and the assistant (drafter) model separately. Both go on the GPU.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

TARGET    = "google/gemma-4-E2B-it"
ASSISTANT = "gg-hf-am/gemma-4-E2B-it-assistant"
DTYPE     = torch.float16

# Load shared tokenizer
tokenizer = AutoTokenizer.from_pretrained(TARGET)

# Load the big target model
model = AutoModelForCausalLM.from_pretrained(
    TARGET,
    device_map="auto",
    dtype=DTYPE
)

# Load the small drafter (assistant) model
assistant = AutoModelForCausalLM.from_pretrained(
    ASSISTANT,
    device_map="auto",
    dtype=DTYPE
)
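Because the drafter's guesses are only useful if both models agree on what each token ID means, it can be worth a quick sanity check that the two vocabularies line up. The sketch below is an assumption-laden illustration: `vocab_mismatches` is a hypothetical helper, and it is shown on toy dictionaries. With real tokenizers you would pass in the result of each tokenizer's `get_vocab()`, assuming the assistant repository ships its own tokenizer files.

```python
from typing import Dict, List, Optional, Tuple


def vocab_mismatches(target_vocab: Dict[str, int],
                     assistant_vocab: Dict[str, int],
                     limit: int = 5) -> List[Tuple[str, int, Optional[int]]]:
    """Return up to `limit` tokens whose IDs differ between the two vocabularies.

    Each entry is (token, id_in_target, id_in_assistant_or_None).
    An empty list means every target token maps to the same ID in the assistant.
    """
    diffs: List[Tuple[str, int, Optional[int]]] = []
    for token, idx in target_vocab.items():
        other = assistant_vocab.get(token)
        if other != idx:
            diffs.append((token, idx, other))
            if len(diffs) >= limit:
                break
    return diffs


# Toy vocabularies standing in for tokenizer.get_vocab() output
target_vocab = {"<bos>": 0, "hello": 1, "world": 2}
good_assistant = dict(target_vocab)
bad_assistant = {"<bos>": 0, "hello": 2, "world": 1}

print(vocab_mismatches(target_vocab, good_assistant))  # no differences
print(vocab_mismatches(target_vocab, bad_assistant))   # lists the swapped IDs
```

An empty result does not prove the pair will draft well, only that the token IDs are consistent; the model card remains the authoritative source on compatibility.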

Step 3 - Format the prompt with chat template

Gemma uses a chat template format. Always apply it before passing to the model; raw prompts will give you poor results.

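def prepare_input(prompt: str) -> torch.Tensor:
    """Wrap a user prompt in Gemma's chat template and return input IDs on the model's device."""
    messages = [{"role": "user", "content": prompt}]

    # Apply the chat template, which adds the special tokens Gemma expects
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # The template already inserted the special tokens, so don't add them a second time
    ids = tokenizer(formatted, return_tensors="pt", add_special_tokens=False)["input_ids"]
    return ids.to(model.device)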

Step 4 - Run generation with MTP enabled

This is the key change: pass assistant_model=assistant to model.generate(). Transformers handles the speculative decoding loop automatically.

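import time
from transformers.generation.streamers import BaseStreamer

class FirstTokenTimer(BaseStreamer):
    """Records when the first generated tokens arrive, so TTFT can be measured."""

    def __init__(self):
        self.first_token_at = None
        self._prompt_seen = False

    def put(self, value):
        # generate() pushes the prompt IDs first; skip that call
        if not self._prompt_seen:
            self._prompt_seen = True
        elif self.first_token_at is None:
            self.first_token_at = time.perf_counter()

    def end(self):
        pass

def generate_with_mtp(prompt: str, max_tokens: int = 200):
    input_ids  = prepare_input(prompt)
    prompt_len = input_ids.shape[1]
    # Assumes your Transformers version forwards tokens to a streamer during assisted generation
    timer      = FirstTokenTimer()

    t_start = time.perf_counter()

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            streamer=timer,             # timestamps the first generated tokens
            assistant_model=assistant   # ← this one line enables MTP
        )

    t_end = time.perf_counter()

    n_new      = output.shape[1] - prompt_len
    total_secs = t_end - t_start
    throughput = n_new / total_secs
    # TTFT is the time until the first tokens were emitted, not total time / token count.
    # Falls back to the total time if the streamer never saw a generated token.
    first_at   = timer.first_token_at or t_end
    ttft_ms    = (first_at - t_start) * 1000

    # Decode only the newly generated tokens, not the prompt
    text = tokenizer.decode(
        output[0][prompt_len:],
        skip_special_tokens=True
    )

    return {
        "text":       text,
        "ttft_ms":    round(ttft_ms, 1),
        "latency_ms": round(total_secs * 1000, 1),
        "tokens_ps":  round(throughput, 1),
        "n_tokens":   n_new,
    }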

# Example usage
result = generate_with_mtp("Explain gradient descent simply.")
print(result["text"])
print(f"TTFT: {result['ttft_ms']} ms | Latency: {result['latency_ms']} ms | {result['tokens_ps']} tok/s")
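To turn single runs into a before/after comparison like the one in our results table, you can average the metrics over a prompt set. The sketch below is one possible harness, not the exact script we ran. It assumes two callables that each return a dict with "ttft_ms" and "latency_ms" keys, as generate_with_mtp does; `compare_runs` and the fake generators are hypothetical, and a baseline function (the same generate() call without assistant_model) is something you would write yourself.

```python
from statistics import mean
from typing import Callable, Dict, List

Result = Dict[str, float]
GenerateFn = Callable[[str], Result]


def compare_runs(prompts: List[str],
                 baseline_fn: GenerateFn,
                 assisted_fn: GenerateFn) -> Dict[str, float]:
    """Average TTFT and total latency over all prompts for both conditions."""
    base = [baseline_fn(p) for p in prompts]
    fast = [assisted_fn(p) for p in prompts]

    summary: Dict[str, float] = {}
    for key in ("ttft_ms", "latency_ms"):
        b = mean(r[key] for r in base)
        f = mean(r[key] for r in fast)
        summary[f"{key}_baseline"] = b
        summary[f"{key}_assisted"] = f
        summary[f"{key}_reduction_pct"] = (b - f) / b * 100
    return summary


# Fake generators so the harness can be tried without loading any model
def fake_baseline(prompt: str) -> Result:
    return {"ttft_ms": 100.0, "latency_ms": 2000.0}


def fake_assisted(prompt: str) -> Result:
    return {"ttft_ms": 50.0, "latency_ms": 1000.0}


summary = compare_runs(["prompt A", "prompt B"], fake_baseline, fake_assisted)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

When timing real models, it usually helps to run one untimed warm-up generation first, so one-time setup costs do not inflate the first measurement.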

When should you use MTP?

- You're building real-time chat or voice AI, where that first-word delay kills the experience
- You're running long-form generation: reports, summaries, code, reasoning chains
- You're trying to reduce GPU costs without switching to a smaller model
- Your model has a compatible assistant/drafter available (Gemma 4 does)
- You're deploying at scale, where a 47% latency reduction means real infrastructure savings
- You're evaluating whether your production setup can use this today; that's the kind of inference optimization work the AI Development team does when tuning real-time AI products

Conclusion

MTP delivered 35% faster first response and 47% lower total latency with no drop in output quality. No model swap. No pipeline rewrite. Just one extra parameter in your generate() call, and your inference is nearly twice as fast. For teams building real-time AI products, that's a meaningful improvement with almost zero effort. If latency matters to you (and in production, it always does), MTP is worth trying.

Frequently Asked Questions

1. What is speculative decoding in simple terms?

A small "drafter" model guesses several tokens ahead. The big model verifies all guesses in one shot. Right guesses = multiple tokens for the cost of one pass. Wrong guesses = graceful fallback. Either way, faster than generating every token from scratch.

2. What exactly is Multi-Token Prediction (MTP)?

Native MTP predicts multiple future tokens in a single forward pass, with no separate drafter model. In our experiment, the drafting was done by an assistant model paired with the target via Hugging Face's assistant_model parameter.

3. How is MTP different from standard speculative decoding?

Standard speculative decoding pairs the target with a separate, general-purpose drafter. MTP either has the capability built in natively or uses a purpose-trained assistant model. The result is a simpler setup and more consistent drafting quality; native MTP also avoids the memory overhead of a second model.

4. Do I need a special model to use MTP?

Yes. You need a target model with a matching assistant available. Gemma 4 E2B has one: gg-hf-am/gemma-4-E2B-it-assistant. Always check the model card for compatibility before assuming support.

5. How do I enable MTP in one line of code?

Just add assistant_model=assistant to your model.generate() call. Transformers handles the speculative decoding loop automatically under the hood.

6. Can I use this with any Hugging Face model?

Not quite. The drafter must share the same tokenizer and vocabulary as the target. Safest approach: look for a model that explicitly provides an assistant variant, or check its docs for speculative decoding support.

7. Does MTP affect output quality?

No, not in our test. Output quality was nearly identical across all five prompt types. The target model still makes all final decisions; it approves or rejects drafter tokens and never blindly accepts them.

Swathilakshmi B

AI/ML Intern focused on growing, experimenting, and contributing in the field of Artificial Intelligence.
