
How We Merged Two TTS Models Using Task Arithmetic Without Retraining

Written by Jeevarathinam V
Reviewed by Krishna Purwar
Jun 30, 2026
7 Min Read

How task arithmetic let us combine a female voice with the Indian English accent of a male voice, without retraining anything

Most text-to-speech models can say "Hello, how are you?" But ask them to pronounce Subramanian, Tiruchirappalli, Sriharikota, or Bengaluru, and the illusion quickly falls apart. That was the problem we set out to solve. 

We had trained two separate models: one spoke in the female voice we wanted, and the other pronounced Indian names correctly. Neither did both.

We assumed the only solution was to collect more data and train a larger combined model. But while digging through research papers, we came across "Editing Models with Task Arithmetic" and "Model soups: averaging weights of multiple fine-tuned models".

The idea sounded almost ridiculous at first: take two fine-tuned neural networks, subtract the base model's weights from each, add the resulting differences back onto the base, and you can sometimes merge behaviors without any additional training.

That is where task arithmetic came in, a technique that lets you combine what two neural networks have independently learned using nothing but subtraction and addition on their weights.

This blog explains:

  • what we did,
  • how the technique works,
  • and why it is more useful than most people realize.

What Is Task Arithmetic?

Task arithmetic is a technique where the learned weight differences between fine-tuned neural networks are treated like vectors in weight space.

If two models were fine-tuned from the same base model, their learned changes can often be:

  • added,
  • scaled,
  • or combined

to create a new model with the properties of both.

Instead of retraining from scratch, task arithmetic edits model behavior directly through weight space operations.

In this project, we used task arithmetic to combine:

  • Sarah’s female voice characteristics
  • with Sumit’s Indian English pronunciation patterns

into a single Kokoro TTS model.

The Problem We Were Trying To Solve

We already had two separate fine-tuned models. Neither solved the full problem.

Model 1: Sumit

Fine-tuned on Indian-English male speech data.

Strengths:

  • excellent pronunciation of Indian names and places
  • natural Indian accent
  • strong pronunciation consistency

Weakness:

  • male voice

Model 2: Sarah

Fine-tuned on Sarah’s recordings using two-stage training:

  • Stage 1 → 20 epochs
  • Stage 2 → 20 epochs

Strengths:

  • expressive female voice
  • natural rhythm and tone
  • smoother speech quality

Weakness:

  • poor Indian pronunciation

For example:

  • “Subramanian” became “subremanian”
  • “Tiruchirappalli” sounded completely incorrect

How Both Models Performed on the Same Sentence

Sentence:

“Could you please connect me with Subramanian Iyer? The compensation is Rs.18,00,000 per annum.”

Sumit Model

  • correct Indian pronunciation
  • natural Indian accent
  • male voice 

Sarah Model

  • incorrect pronunciation
  • neutral accent
  • female voice

Neither model alone was usable for the final assistant.

The obvious solution would have been:

  • collecting combined data
  • retraining from scratch
  • running another long fine-tuning cycle

But that would take significantly more time and compute.

We needed a simpler approach.


The Key Insight

Both models were fine-tuned from the same base model: Kokoro-82M by hexgrad.

When a model is fine-tuned, training does not completely replace the weights. It slightly shifts them away from the base model.

That shift represents what the model learned.

If:

$$\tau = \theta_{\mathrm{fine\text{-}tuned}} - \theta_{\mathrm{base}}$$

Then the resulting vector represents the learned behavior added during fine-tuning.

This is called a task vector.
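
To make the definition concrete, here is a minimal sketch in plain Python. The "checkpoints" are toy dictionaries with made-up numbers standing in for real weight tensors, and `task_vector` is a hypothetical helper name, not something from Kokoro or PyTorch.

```python
# Toy illustration of a task vector. The "checkpoints" are tiny dicts of
# plain floats with made-up values, standing in for real weight tensors.

def task_vector(fine_tuned, base):
    """Return fine_tuned - base for every key the two checkpoints share."""
    return {
        key: [f - b for f, b in zip(fine_tuned[key], base[key])]
        for key in fine_tuned
        if key in base
    }

base = {"encoder.weight": [0.5, -0.25, 1.0]}
fine_tuned = {"encoder.weight": [0.75, -0.25, 0.5]}

tau = task_vector(fine_tuned, base)
print(tau)  # {'encoder.weight': [0.25, 0.0, -0.5]}
```

The second entry is zero because fine-tuning left that weight untouched; only the weights that actually moved contribute to the task vector.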

Because both Sarah and Sumit started from the same base model:

  • their task vectors exist in the same weight space
  • and can be combined mathematically

Sarah learned:

  • voice identity
  • expression
  • speaking style

Sumit learned:

  • Indian English pronunciation
  • accent patterns
  • phoneme behavior

Those learned behaviors were mostly complementary rather than conflicting.

That meant the vectors could be added together.

How Task Arithmetic Works for TTS

The merge formula becomes:

$$\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + \alpha(\theta_{\mathrm{Sarah}} - \theta_{\mathrm{base}}) + \beta(\theta_{\mathrm{Sumit}} - \theta_{\mathrm{base}})$$

Where:

  • α controls Sarah’s voice characteristics
  • β controls Sumit’s Indian pronunciation

Both values can be adjusted gradually.

This allowed us to search for the best balance between:

  • voice identity
  • and pronunciation quality.
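
The effect of α and β is easiest to see on a toy example. The sketch below applies the merge formula to two-element "checkpoints" with made-up values; `merge` is a hypothetical helper, and real checkpoints hold tensors rather than Python lists.

```python
def merge(base, sarah, sumit, alpha, beta):
    """Apply base + alpha * (sarah - base) + beta * (sumit - base) per weight."""
    merged = {}
    for key in base:
        merged[key] = [
            b + alpha * (s - b) + beta * (j - b)
            for b, s, j in zip(base[key], sarah[key], sumit[key])
        ]
    return merged

# Made-up weights: each fine-tune moves a different element of "w".
base = {"w": [1.0, 2.0]}
sarah = {"w": [1.5, 2.0]}  # changes only the first weight
sumit = {"w": [1.0, 1.0]}  # changes only the second weight

# alpha = beta = 1.0 adds both task vectors at full strength.
print(merge(base, sarah, sumit, 1.0, 1.0))  # {'w': [1.5, 1.0]}
# alpha = 0.0 drops Sarah's change and recovers the Sumit model.
print(merge(base, sarah, sumit, 0.0, 1.0))  # {'w': [1.0, 1.0]}
# alpha = 0.5 applies half of Sarah's shift on top of Sumit's.
print(merge(base, sarah, sumit, 0.5, 1.0))  # {'w': [1.25, 1.0]}
```

In this toy case the two fine-tunes touch different weights, so both changes survive the merge untouched. Real task vectors overlap, which is why α and β need tuning by ear.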

Figure 1: Sarah and Sumit's task vectors combined in shared Kokoro weight space.

A Technical Detail That Actually Helped

During merging, we found that:

  • 178 decoder weight keys existed in the fine-tuned models
  • but did not exist under matching names in the base model

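This happened because PyTorch's parametrization-based weight normalization (torch.nn.utils.parametrizations.weight_norm) stores the same tensors under different names than the older torch.nn.utils.weight_norm, so the key names in the fine-tuned checkpoints no longer matched the base checkpoint.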

For example:

weight_g
weight_v

became:

parametrizations.weight.original0
parametrizations.weight.original1

Since those weights had no matching base weights, task arithmetic could not be applied to them directly.

So instead of forcing a merge, we preserved Sarah’s values for those weights unchanged.

This turned out to be exactly the correct decision.

Those decoder layers heavily affect:

  • voice texture
  • acoustic rendering
  • speaker identity

Keeping Sarah’s decoder intact preserved her voice quality while allowing the remaining weights to absorb Sumit’s pronunciation behavior.
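
A key-set comparison is enough to surface this kind of mismatch before merging. The sketch below is an illustration only: the key names are hypothetical and `legacy_name` is a helper invented for this example. It relies on one documented PyTorch detail: parametrized weight norm stores the magnitude (formerly `weight_g`) as `original0` and the direction (formerly `weight_v`) as `original1`. The merge described in this article did not remap these keys; it kept Sarah's values for them.

```python
# Hypothetical key names; real Kokoro checkpoints have many more entries.
base_keys = {
    "decoder.conv.weight_g",
    "decoder.conv.weight_v",
    "decoder.conv.bias",
}
sarah_keys = {
    "decoder.conv.parametrizations.weight.original0",
    "decoder.conv.parametrizations.weight.original1",
    "decoder.conv.bias",
}

# Parametrized weight norm: original0 is the magnitude (old weight_g),
# original1 is the direction (old weight_v).
LEGACY_NAMES = {
    "parametrizations.weight.original0": "weight_g",
    "parametrizations.weight.original1": "weight_v",
}

def legacy_name(key):
    """Map a parametrized key back to its older weight_g / weight_v name."""
    for new_suffix, old_suffix in LEGACY_NAMES.items():
        if key.endswith(new_suffix):
            return key[: -len(new_suffix)] + old_suffix
    return key

shared = sorted(base_keys & sarah_keys)      # mergeable by name
sarah_only = sorted(sarah_keys - base_keys)  # kept from Sarah unchanged
# Sarah-only keys whose older name does exist in the base checkpoint.
renamed = {k: legacy_name(k) for k in sarah_only if legacy_name(k) in base_keys}

print(shared)      # ['decoder.conv.bias']
print(sarah_only)
print(renamed)
```

Running a report like this first makes it clear which weights take part in the arithmetic and which are carried over as-is.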


Figure 2: Shared weights were merged mathematically while Sarah-specific decoder weights were preserved.

Finding The Right Blend

We generated multiple combinations of α and β and tested them on the same sentences.
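
A sweep like this is simple to script. The sketch below only plans the jobs, using the (α, β) pairs from Round 1 in this section. The filename scheme and the helper names `checkpoint_name` and `plan_sweep` are assumptions made for illustration, and each planned job would still need to be passed to the actual merge code.

```python
# (alpha, beta) pairs from Round 1 of the blend search.
ROUND_1 = [(1.0, 0.3), (1.0, 0.5), (0.8, 0.5), (0.7, 0.7), (0.5, 1.0)]

def checkpoint_name(alpha, beta):
    """Hypothetical naming scheme that encodes the blend in the filename."""
    return f"sarah_indian_a{alpha:.1f}_b{beta:.1f}.pth"

def plan_sweep(combos):
    """List the merges to run; each entry would be fed to the merge script."""
    return [
        {"alpha": a, "beta": b, "output": checkpoint_name(a, b)}
        for a, b in combos
    ]

for job in plan_sweep(ROUND_1):
    print(job)
```

Encoding α and β in the filename keeps listening tests traceable: whoever rates a sample can always tell which blend produced it.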

Round 1

| Model | α (Sarah) | β (Sumit) | Result |
|---|---|---|---|
| Combo 1 | 1.0 | 0.3 | Full Sarah voice, weak accent |
| Combo 2 | 1.0 | 0.5 | Strong Sarah voice, moderate accent |
| Combo 3 | 0.8 | 0.5 | Balanced voice and accent |
| Combo 4 | 0.7 | 0.7 | Strong balance |
| Combo 5 | 0.5 | 1.0 | Strong Indian pronunciation |


After Sarah’s final checkpoint was completed, we reran the experiments with:

  • β fixed at 1.0
  • while adjusting α more carefully.

Round 2

| α (Sarah) | β (Sumit) | Result |
|---|---|---|
| 0.5 | 1.0 | Strong accent, lighter expression |
| 0.6 | 1.0 | Best overall balance |
| 0.7 | 1.0 | Slightly more Sarah character |
| 0.8–1.0 | 1.0 | Accent quality started shifting |


Final selected model:

α = 0.6, β = 1.0


Figure 3: α and β blend search used to find the final merged model.
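
Substituting the selected values into the merge formula gives a worked instance for the weights shared by all three checkpoints. Expanding it expresses the final checkpoint as a weighted sum of the three originals:

```latex
\begin{aligned}
\theta_{\mathrm{merged}}
  &= \theta_{\mathrm{base}}
     + 0.6\,(\theta_{\mathrm{Sarah}} - \theta_{\mathrm{base}})
     + 1.0\,(\theta_{\mathrm{Sumit}} - \theta_{\mathrm{base}}) \\
  &= (1 - 0.6 - 1.0)\,\theta_{\mathrm{base}}
     + 0.6\,\theta_{\mathrm{Sarah}}
     + 1.0\,\theta_{\mathrm{Sumit}} \\
  &= -0.6\,\theta_{\mathrm{base}}
     + 0.6\,\theta_{\mathrm{Sarah}}
     + \theta_{\mathrm{Sumit}}
\end{aligned}
```

The coefficient on the base model is negative because α + β is greater than 1. That is expected when task vectors are added rather than averaged: the result is Sumit's full checkpoint plus 60% of Sarah's shift away from the base.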

The Final Result

The first time we heard “Tiruchirappalli” pronounced correctly in Sarah’s voice, we knew the merge had actually worked.

The final merged model produced:

  • natural female speech
  • correct Indian pronunciation
  • strong Indian-English accent retention
  • smoother voice quality than the Sumit model

all without retraining the full network.


Audio sample: sara-merged.wav

How We Evaluated the Merged Voice

Human Listening Test Results

To evaluate the merged model, we conducted an informal listening study with 10 fellow interns. Each participant listened to identical speech samples generated by the three models and rated them independently.

| Model | Indian Pronunciation | MOS | Listener Preference |
|---|---|---|---|
| Sumit | 9.2/10 | 3.8 | 18% |
| Sarah | 3.4/10 | 4.3 | 27% |
| Merged (α=0.6, β=1.0) | 8.8/10 | 4.4 | 55% |


Two evaluation criteria were used:

  • Indian Pronunciation (1–10): Participants rated how accurately the model pronounced Indian names, places, and words. The final score is the average rating across all 10 evaluators.
  • Mean Opinion Score (MOS, 1–5): Participants rated the overall naturalness and pleasantness of the synthesized speech, considering voice quality, smoothness, and how comfortable the audio was to listen to. The reported MOS is the average score across all evaluators. 
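
Both scores are plain averages over listeners, so they are easy to reproduce. The sketch below shows the calculation together with a standard error, which is worth reporting for a panel of only 10 people. The ratings are placeholder values invented for this example; they are not the study's data, and `summarize` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(ratings):
    """Return (mean, standard error of the mean) for one model's ratings."""
    return mean(ratings), stdev(ratings) / sqrt(len(ratings))

# Placeholder 1-5 ratings from 10 hypothetical listeners (NOT the study's data).
example_ratings = [4, 4, 3, 5, 4, 4, 4, 3, 5, 4]

avg, stderr = summarize(example_ratings)
print(f"MOS = {avg:.1f} +/- {stderr:.2f}")  # MOS = 4.0 +/- 0.21
```

With so few listeners the standard error is large relative to small gaps between models, so a difference of a tenth of a point should be read as a tie rather than a win.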

How We Merged the Model Weights in Code

The actual implementation is surprisingly small.

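import copy
import torch

# Load the three checkpoints on CPU. Each checkpoint is a dict of sections,
# and each section is a dict mapping parameter names to tensors.
# weights_only=True restricts unpickling to tensors and plain containers.
base  = torch.load("kokoro-v1_0.pth", map_location="cpu", weights_only=True)
sarah = torch.load("sarah_ep19.pth", map_location="cpu", weights_only=True)
sumit = torch.load("sumit_ep09.pth", map_location="cpu", weights_only=True)

def flatten(sd):
    """Flatten {section: {name: tensor}} into {(section, name): tensor}."""
    out = {}
    for section, subdict in sd.items():
        for subkey, tensor in subdict.items():
            out[(section, subkey)] = tensor
    return out

base_flat  = flatten(base)
sarah_flat = flatten(sarah)
sumit_flat = flatten(sumit)

# Only keys present in all three checkpoints can be merged arithmetically.
common_keys = set(base_flat) & set(sarah_flat) & set(sumit_flat)

alpha = 0.6  # weight on Sarah's task vector (voice identity)
beta  = 1.0  # weight on Sumit's task vector (Indian English pronunciation)

# Start from Sarah so that keys outside common_keys keep her values.
merged = copy.deepcopy(sarah)

for section, subkey in sorted(common_keys):
    b = base_flat[(section, subkey)]
    s = sarah_flat[(section, subkey)]
    j = sumit_flat[(section, subkey)]

    # Leave integer buffers and shape-mismatched tensors at Sarah's values.
    if not s.is_floating_point() or not (b.shape == s.shape == j.shape):
        continue

    # Do the arithmetic in float32, then cast back to the stored dtype.
    b32, s32, j32 = b.float(), s.float(), j.float()
    merged[section][subkey] = (
        b32 + alpha * (s32 - b32) + beta * (j32 - b32)
    ).to(s.dtype)

torch.save(merged, "sarah_indian_merged.pth")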

A few important details:

  • copy.deepcopy(sarah) intentionally preserves Sarah-only decoder weights
  • only shared weights are merged mathematically
  • no training loop is required
  • no gradients are computed
  • The merge runs entirely on CPU

The entire process takes less than a minute for a ~300MB checkpoint.

Why Task Arithmetic Works

Fine-tuned models usually remain relatively close to their shared base model.

That means their learned changes often exist in a locally linear region of weight space.

When two task vectors:

  • are small,
  • and mostly non-conflicting,

their effects can often be combined predictably.
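
One rough way to check the "mostly non-conflicting" condition is to compare the two task vectors directly. The sketch below computes their cosine similarity and the fraction of weights they push in opposite directions. Both are heuristics offered as assumptions, not measurements reported for this project, and the vectors are made-up values. A cosine near zero suggests the two fine-tunes mostly changed different directions in weight space.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened task vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sign_conflict_rate(u, v):
    """Fraction of positions where both vectors move in opposite directions."""
    conflicts = sum(1 for a, b in zip(u, v) if a * b < 0)
    return conflicts / len(u)

# Made-up flattened task vectors, for illustration only.
tau_sarah = [0.75, 0.0, 0.0, -1.0]
tau_sumit = [0.0, 0.25, -0.25, 0.0]   # touches different weights
tau_clash = [-0.75, 0.0, 0.0, 1.0]    # undoes exactly what Sarah learned

print(cosine_similarity(tau_sarah, tau_sumit))   # 0.0
print(sign_conflict_rate(tau_sarah, tau_sumit))  # 0.0
print(cosine_similarity(tau_sarah, tau_clash))   # -1.0
print(sign_conflict_rate(tau_sarah, tau_clash))  # 0.5
```

On real checkpoints the vectors would be built by flattening and concatenating the per-key differences. Neither number guarantees a good merge; listening tests remain the final check.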

This idea is closely related to:

  • Model Soups
  • TIES-Merging
  • modern LLM checkpoint merging techniques

which all rely on similar geometric behavior in neural network weight space.

What breaks task arithmetic:

  • very large fine-tunes
  • conflicting task vectors
  • models trained from different base checkpoints

The shared base model is the critical requirement.

Why This Was Interesting For TTS

Task arithmetic has already been widely explored in:

  • LLM merging
  • image generation
  • checkpoint interpolation

But applying it to:

  • voice identity
  • accent transfer
  • pronunciation adaptation

inside TTS models is still relatively unexplored.

In this project, task arithmetic allowed:

  • Sarah’s voice identity
  • and Sumit’s pronunciation behavior

to coexist inside a single merged checkpoint

without:

  • retraining from scratch
  • collecting new combined datasets
  • or building pronunciation lexicons at inference time.

Final Thoughts

Task arithmetic feels almost too simple to work.

But once you realize fine-tuning is just moving through weight space, model merging starts feeling less like a hack and more like geometry.

For this project, it meant combining:

  • Sarah’s voice,
  • Sumit’s Indian pronunciation,
  • and Kokoro’s base capabilities

into a single model without retraining from scratch.

And this is probably only the beginning.

