We Ran LLMs Faster Using Multi-Token Prediction (Here's How)

Written by Swathilakshmi B
Jun 5, 2026
8 Min Read

We tested a technique called Multi-Token Prediction (MTP) on real prompts, and the results surprised us. Not because it worked, but because of how well it worked.

Faster first response · Lower total latency · Zero quality loss

If you’ve ever used an AI chatbot and felt like it was a little slow, especially during the delay before the first word appears, you’ve experienced one of the core bottlenecks in how large language models work today.

While working on real-time AI systems, we noticed the same issue. The models were good, but they were slow. So we started asking a simple question: is there a smarter way to generate text without making users wait?

That question led us to Multi-Token Prediction (MTP), a technique that lets models predict several tokens at once instead of one at a time. We ran a focused experiment to see if it actually works.

What is Speculative Decoding?

Speculative decoding is an LLM inference technique where a smaller, faster model predicts several upcoming tokens, and the larger model verifies those predictions in one pass.

It helps reduce latency because the large model does not need to generate every token one by one from scratch. Instead, it checks a draft sequence and accepts the correct tokens together.

This matters for MTP because MTP builds on the same idea: generate or verify more than one token at a time so the response can start faster and finish sooner.

Here's the core idea:

1. A small, fast “drafter” model generates several tokens quickly

The drafter is usually much smaller than the main model, so it runs faster and produces a short sequence of guesses for what comes next.

2. The larger target model verifies them in one forward pass

Instead of generating one token at a time, the large model looks at the full drafted sequence and checks the tokens in parallel. This is much faster than generating each token individually.

3. Accepted tokens are output together; rejected tokens fall back safely

If the drafter’s guesses are correct, multiple tokens are accepted and returned together. At the first wrong guess, the system discards that token and everything drafted after it, uses the target model's own token for that position, and continues from there. This speeds up responses without reducing output quality.
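The three steps above can be sketched with toy stand-ins for the two models. This is a minimal illustration under stated assumptions, not how any particular library implements it: `target_next` and `drafter_next` are hypothetical functions that return the greedy next token for a sequence, and the verification loop runs position by position, whereas a real target model checks all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    # Hypothetical target model: next token is the sum of the last two, mod 10.
    return (tokens[-1] + tokens[-2]) % 10


def drafter_next(tokens: List[int]) -> int:
    # Hypothetical drafter: copies the target's rule but always guesses 0 after a 7.
    return 0 if tokens[-1] == 7 else (tokens[-1] + tokens[-2]) % 10


def greedy_decode(prompt: List[int], target: NextToken, max_new: int) -> List[int]:
    # Baseline: one target call per generated token.
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    prompt: List[int], target: NextToken, drafter: NextToken, max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # Step 1: the drafter proposes k tokens, one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))

        # Step 2: the target checks each drafted position (one pass in a real model).
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # Step 3: first mismatch. Keep the target's token, drop the rest of the draft.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the same pass also yields one extra target token.
            accepted.append(target(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], target_passes


spec_tokens, passes = speculative_decode([1, 1], target_next, drafter_next, max_new=20)
print("same output as greedy:", spec_tokens == greedy_decode([1, 1], target_next, 20))
print("target passes used:", passes, "for 20 new tokens")
```

Because every kept token is either confirmed or supplied by the target, the output matches plain greedy decoding; only the number of target passes changes.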

Real-world analogy

Imagine a junior editor who drafts five sentences quickly, then a senior editor reviews and approves them all in one read. That's much faster than the senior editor writing every sentence from scratch, even though the senior editor is doing the final call.

The result: for responses where the drafter is often correct (factual answers, code, common phrases), you can output multiple tokens in the time it used to take to output one. For harder prompts where the drafter struggles, you lose little: rejected drafts cost some extra compute, and the system falls back to normal generation.
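To put a rough number on "multiple tokens in the time of one", here is a back-of-the-envelope sketch. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and it ignores the drafter's own cost. The acceptance rates below are hypothetical, not measurements from our experiment.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate. The pass always yields at least one token: either the
    target's correction or, if the whole draft is accepted, a bonus token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


# Hypothetical acceptance rates, for illustration only.
for rate in (0.3, 0.6, 0.9):
    print(f"accept_rate={rate}: {expected_tokens_per_pass(rate, draft_len=4):.2f} tokens per pass")
```

The takeaway is that the benefit depends heavily on how often the drafter agrees with the target, which is why easy, predictable text tends to see the largest gains.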

How does it actually help?

Speculative decoding helps in three concrete ways:

What it changes | Why it matters
Idle GPU compute gets used | The drafter runs during the time the big model would otherwise be waiting
Multiple tokens verified in one pass | Parallel verification is much cheaper than sequential generation
Same output quality | The big model still makes all final decisions; it just approves instead of generating from scratch

So what is Multi-Token Prediction (MTP)?

MTP is a technique where the model predicts multiple future tokens in a single step, instead of the usual one-at-a-time approach. Think of it like a smart autocomplete that doesn't just suggest the next word, but the next five words, and does it all at once.

Real-world analogy

Imagine you're ordering food at a restaurant. Standard LLMs are like a waiter who takes your order one dish at a time, runs to the kitchen, comes back, then asks what else you want. MTP is like a waiter who takes your full order upfront and confirms everything at once when the food arrives.

If the model's predictions are correct, the entire sequence gets accepted in one shot. If not, it falls back gracefully. Either way, you win, because the verification still happens faster than generating each token individually.

How is it different?

Standard speculative decoding needs two separate models: a big target and a small drafter. That means you need to find a compatible drafter, manage two models in memory, and make sure they're aligned enough that the drafter's guesses are actually useful.

MTP takes a different approach. Instead of pairing two separate models, the target model itself has the ability to predict multiple tokens at once, built right in. No drafter to find. No second model to load. The capability lives inside the model.

Standard Speculative Decoding | Multi-Token Prediction (MTP)
Needs a separate drafter model | No separate drafter needed
The drafter and target must be compatible | Native to the model architecture
Two models in GPU memory | One model, leaner setup
Draft quality affects performance | Drafting quality is baked into training
Works with most existing models | Requires a model trained for MTP

In our experiment, we used the assistant model approach, where the assistant model acts as the drafter paired with the target model via Hugging Face's assistant_model parameter. This is the practical way to use speculative decoding today with Transformers.

Feature | Standard (separate drafter) | MTP (native)
Setup complexity | Higher, two models to manage | Lower, one model does it all
Memory usage | Larger, both models in VRAM | Smaller, single model
Drafting quality | Depends on drafter model choice | Trained in, more consistent
Model availability | Works with most LLMs | Needs an MTP-trained model

What we actually tested

We ran the same set of prompts through two conditions using the same model family (Gemma 4 E2B):

  • Standard decoding, one token at a time, no MTP
  • MTP-enabled speculative decoding

We chose prompts that reflect real-world AI use cases, not just simple questions:

Prompt type | Why we included it
Technical protocol comparison (TCP vs UDP vs QUIC) | Long-form, structured output
Math problem (average speed) | Step-by-step reasoning
Logic question (counterexample) | Short, precise output
Formal Tamil translation with grammar correction | Language + reasoning combined
Constrained writing (10-word sentences) | Output quality under strict rules

We measured two things: Time To First Token (TTFT), how long before the model starts responding, and total latency, how long until the full response is done.
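As a minimal sketch of how these two metrics can be captured, the helper below times any iterator that yields tokens as they are produced. The `fake_stream` generator is a hypothetical stand-in for a streaming model, with made-up delays, so the example runs without a GPU.

```python
import time
from typing import Dict, Iterable, Iterator, Optional


def time_stream(token_stream: Iterable[str]) -> Dict[str, Optional[float]]:
    """Measure TTFT and total latency for any iterable that yields tokens."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            # TTFT: the moment the very first token arrives.
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft_ms = (first_token_at - start) * 1000 if first_token_at is not None else None
    return {"ttft_ms": ttft_ms, "total_ms": (end - start) * 1000, "n_tokens": n_tokens}


def fake_stream(n: int, first_delay: float = 0.05, step_delay: float = 0.01) -> Iterator[str]:
    # Hypothetical stand-in for a model: slow first token, faster later tokens.
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(step_delay)
        yield f"tok{i}"


stats = time_stream(fake_stream(5))
print(f"TTFT: {stats['ttft_ms']:.1f} ms | total: {stats['total_ms']:.1f} ms | tokens: {stats['n_tokens']}")
```

Note that TTFT is taken at the arrival of the first token, which is different from dividing total time by the number of tokens.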

The results

Across five diverse prompts (technical explanations, math, logic, translation, and constrained writing), the outputs were nearly identical in quality.

Cutting total generation time nearly in half, while keeping output quality the same, is a meaningful result. This isn't a tradeoff. You're not sacrificing accuracy for speed. You're getting both.

MTP vs Standard Decoding: Performance Results

Metric | Without MTP | With MTP | Difference
Average TTFT | 116.25 ms | 74.55 ms | 35.9% faster
Average total latency | 44.70 sec | 23.61 sec | 47.2% faster
Output accuracy | Maintained | Maintained | No change
Logical consistency | Maintained | Maintained | No change
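For readers who want to check the arithmetic, the percentages follow directly from the reported averages (116.25 ms vs 74.55 ms for TTFT, 44.70 sec vs 23.61 sec for total latency). The helper names below are ours, chosen for illustration.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    # Share of the baseline time that was saved, as a percentage.
    return (baseline - improved) / baseline * 100


def speedup(baseline: float, improved: float) -> float:
    # How many times faster the improved run is.
    return baseline / improved


print(round(percent_reduction(116.25, 74.55), 1))  # 35.9
print(round(percent_reduction(44.70, 23.61), 1))   # 47.2
print(round(speedup(44.70, 23.61), 2))             # 1.89
```

A 47.2% reduction in total latency corresponds to roughly a 1.89x speedup, which is where "nearly twice as fast" comes from.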

How to integrate MTP into your project

Here's the practical part. The good news: if you're using Hugging Face Transformers, the integration is surprisingly simple. You don't need to change your model or rewrite your inference code. You just add an assistant_model parameter to your generate() call.

Step 1 - Install dependencies

pip install transformers torch accelerate gradio

Step 2 - Load both models

Load the target model and the assistant (drafter) model separately. Both go on the GPU.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

TARGET    = "google/gemma-4-E2B-it"
ASSISTANT = "gg-hf-am/gemma-4-E2B-it-assistant"
DTYPE     = torch.float16

# Load shared tokenizer
tokenizer = AutoTokenizer.from_pretrained(TARGET)

# Load the big target model
model = AutoModelForCausalLM.from_pretrained(
    TARGET,
    device_map="auto",
    dtype=DTYPE
)

# Load the small drafter (assistant) model
assistant = AutoModelForCausalLM.from_pretrained(
    ASSISTANT,
    device_map="auto",
    dtype=DTYPE
)

Step 3 - Format the prompt with chat template

Gemma uses a chat template format. Always apply it before passing to the model; raw prompts will give you poor results.

def prepare_input(prompt: str) -> torch.Tensor:
    messages = [{"role": "user", "content": prompt}]

    # Apply the chat template; it adds the special tokens Gemma expects
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # The templated text already carries its special tokens,
    # so don't let the tokenizer add them a second time
    ids = tokenizer(
        formatted,
        return_tensors="pt",
        add_special_tokens=False
    )["input_ids"]
    return ids.to(model.device)


Step 4 - Run generation with MTP enabled

This is the key change. Pass assistant_model=assistant to model.generate() and that's it. Transformers handles the speculative decoding loop automatically.

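import time

from transformers.generation.streamers import BaseStreamer


class FirstTokenTimer(BaseStreamer):
    # Records the moment the first newly generated tokens arrive

    def __init__(self):
        self.calls = 0
        self.first_token_at = None

    def put(self, value):
        # generate() calls put() once with the prompt, then with each batch of new tokens
        self.calls += 1
        if self.calls == 2:
            self.first_token_at = time.perf_counter()

    def end(self):
        pass


def generate_with_mtp(prompt: str, max_tokens: int = 200):
    input_ids  = prepare_input(prompt)
    prompt_len = input_ids.shape[1]
    timer      = FirstTokenTimer()

    t_start = time.perf_counter()

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            streamer=timer,             # only used to timestamp the first token
            assistant_model=assistant   # ← this one line enables MTP
        )

    t_end = time.perf_counter()

    n_new      = output.shape[1] - prompt_len
    total_secs = t_end - t_start
    throughput = n_new / total_secs
    # TTFT is the time until the first new token arrives,
    # not the total time divided by the number of tokens
    ttft_ms    = ((timer.first_token_at or t_end) - t_start) * 1000

    text = tokenizer.decode(
        output[0][prompt_len:],
        skip_special_tokens=True
    )

    return {
        "text":       text,
        "ttft_ms":    round(ttft_ms, 1),
        "latency_ms": round(total_secs * 1000, 1),
        "tokens_ps":  round(throughput, 1),
        "n_tokens":   n_new,
    }


# Example usage
result = generate_with_mtp("Explain gradient descent simply.")
print(result["text"])
print(f"TTFT: {result['ttft_ms']} ms | Latency: {result['latency_ms']} ms | {result['tokens_ps']} tok/s")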
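To see whether the assistant actually helps on your own prompts, you need a baseline run as well. The sketch below compares two generation functions on the same prompts. It assumes each function returns a dict with a "latency_ms" key, like generate_with_mtp above. `generate_baseline` is a hypothetical name for the same function with the assistant_model argument removed, and the stand-in functions here return made-up numbers so the harness runs on its own.

```python
from statistics import mean
from typing import Callable, Dict, List

GenerateFn = Callable[[str], Dict[str, float]]


def compare_runs(
    baseline: GenerateFn,
    assisted: GenerateFn,
    prompts: List[str],
    metric: str = "latency_ms",
) -> Dict[str, float]:
    """Run both functions on every prompt and compare the average of one metric."""
    base_avg = mean(baseline(p)[metric] for p in prompts)
    fast_avg = mean(assisted(p)[metric] for p in prompts)
    return {
        "baseline_avg": base_avg,
        "assisted_avg": fast_avg,
        "reduction_pct": (base_avg - fast_avg) / base_avg * 100,
    }


# Stand-ins with made-up numbers; swap in generate_baseline and generate_with_mtp.
def fake_baseline(prompt: str) -> Dict[str, float]:
    return {"latency_ms": 200.0}


def fake_assisted(prompt: str) -> Dict[str, float]:
    return {"latency_ms": 100.0}


report = compare_runs(fake_baseline, fake_assisted, ["prompt A", "prompt B"])
print(report)
```

When running this against real models, it is worth doing one untimed warm-up call first, since the first generation often includes one-time setup costs.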


When should you use MTP?

  • Building real-time chat or voice AI, where that first-word delay kills the experience
  • Running long-form generation, reports, summaries, code, reasoning chains
  • Trying to reduce GPU costs without switching to a smaller model
  • Your model has a compatible assistant/drafter available (Gemma 4 does)
  • Deploying at scale, where 47% latency reduction = real infrastructure savings

Conclusion

MTP delivered about 36% faster first response and 47% lower total latency with no drop in output quality. No model swap. No pipeline rewrite. Just one extra parameter in your generate() call, and your inference is nearly twice as fast. For teams building real-time AI products, that's a meaningful improvement with almost zero effort. If latency matters to you (and in production, it always does), MTP is worth trying.

Frequently Asked Questions 

1. What is speculative decoding in simple terms? 

A small "drafter" model guesses several tokens ahead. The big model verifies all guesses in one shot. Right guesses = multiple tokens for the cost of one pass. Wrong guesses = graceful fallback. Either way, faster than generating every token from scratch.

2. What exactly is Multi-Token Prediction (MTP)? 

MTP predicts multiple future tokens in a single forward pass; no separate drafter model is needed. In our experiment, we used an assistant model paired with the target via Hugging Face's assistant_model parameter.

3. How is MTP different from standard speculative decoding?

Standard speculative decoding needs two separate models. MTP either has the capability built natively or uses a purpose-trained assistant model, which means a simpler setup, more consistent drafting quality, and less memory overhead.

4. Do I need a special model to use MTP? 

Yes. You need a target model with a matching assistant available. Gemma 4 E2B has one: gg-hf-am/gemma-4-E2B-it-assistant. Always check the model card for compatibility before assuming support.

5. How do I enable MTP in one line of code? 

Just add assistant_model=assistant to your model.generate() call. Transformers handles the speculative decoding loop automatically under the hood.

6. Can I use this with any Hugging Face model? 

Not quite. The drafter must share the same tokenizer and vocabulary as the target. Safest approach: look for a model that explicitly provides an assistant variant, or check its docs for speculative decoding support.
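A quick way to sanity-check a candidate pair is to compare the two vocabularies directly. The sketch below works on plain token-to-id dictionaries, which is what a Hugging Face tokenizer's get_vocab() method returns. The function name and the toy vocabularies are ours, for illustration only.

```python
from typing import Dict, List


def vocab_mismatches(target_vocab: Dict[str, int], drafter_vocab: Dict[str, int]) -> List[str]:
    """Return tokens that are missing from one side or mapped to different ids."""
    mismatched = []
    for token in sorted(set(target_vocab) | set(drafter_vocab)):
        if target_vocab.get(token) != drafter_vocab.get(token):
            mismatched.append(token)
    return mismatched


# Toy vocabularies; with real models, pass tokenizer.get_vocab() for each side.
target_vocab = {"<bos>": 0, "hello": 1, "world": 2}
drafter_vocab = {"<bos>": 0, "hello": 1, "world": 3, "extra": 4}

problems = vocab_mismatches(target_vocab, drafter_vocab)
print("compatible" if not problems else f"mismatched tokens: {problems}")
```

An empty result suggests the drafter's token ids can be checked by the target as-is; any mismatch is a sign the pair needs a closer look before you rely on it.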

7. Does MTP affect output quality?

No. Output quality was nearly identical across all five prompt types. The target model still makes all final decisions: it approves or rejects drafter tokens and never accepts them blindly.

Swathilakshmi B

AI/ML Intern focused on growing, experimenting, and contributing in the field of Artificial Intelligence.
