How We Merged Two TTS Models Using Task Arithmetic Without Retraining

Written by Jeevarathinam V

Reviewed by Krishna Purwar

Jun 30, 2026

7 Min Read

How task arithmetic let us combine a female voice with a male Indian English accent voice without retraining anything

Most text-to-speech models can say "Hello, how are you?" But ask them to pronounce Subramanian, Tiruchirappalli, Sriharikota, or Bengaluru, and the illusion quickly falls apart. That was the problem we set out to solve.

We had trained two separate models: one had the female voice we wanted, the other had the Indian English pronunciation we needed. Neither did both.

We assumed the only solution was to collect more data and train a larger combined model. But while digging through research papers online, we came across "Editing Models with Task Arithmetic" and "Model soups: averaging weights of multiple fine-tuned models".

The idea sounded almost ridiculous at first: take two fine-tuned neural networks, subtract the base model's weights from each, add the resulting differences back onto the base model, and you can sometimes merge behaviors without any additional training.

That is where task arithmetic came in, a technique that lets you combine what two neural networks have independently learned using nothing but subtraction and addition on their weights.

This blog explains:

what we did,
how the technique works,
and why it is more useful than most people realize.

What Is Task Arithmetic?

Task arithmetic is a technique where the weight differences between a fine-tuned neural network and its base model are treated like vectors in weight space.

If two models were fine-tuned from the same base model, their learned changes can often be:

added,
scaled,
or combined

to create a new model with the properties of both.

Instead of retraining from scratch, task arithmetic edits model behavior directly through weight space operations.

In this project, we used task arithmetic to combine:

Sarah’s female voice characteristics
with Sumit’s Indian English pronunciation patterns

into a single Kokoro TTS model.

The Problem We Were Trying To Solve

We already had two separate fine-tuned models. Neither solved the full problem.

Model 1: Sumit

Fine-tuned on Indian-English male speech data.

Strengths:

excellent pronunciation of Indian names and places
natural Indian accent
strong pronunciation consistency

Weakness:

male voice

Model 2: Sarah

Fine-tuned on Sarah’s recordings using two-stage training:

Stage 1 → 20 epochs
Stage 2 → 20 epochs

Strengths:

expressive female voice
natural rhythm and tone
smoother speech quality

Weakness:

poor Indian pronunciation

For example:

“Subramanian” became “subremanian”
“Tiruchirappalli” sounded completely incorrect

How Both Models Performed on the Same Sentence

Sentence:

“Could you please connect me with Subramanian Iyer? The compensation is Rs.18,00,000 per annum.”

Sumit Model

correct Indian pronunciation
natural Indian accent
male voice

Sarah Model

incorrect pronunciation
neutral accent
female voice

Neither model alone was usable for the final assistant.

The obvious solution would have been:

collecting combined data
retraining from scratch
running another long fine-tuning cycle

But that would take significantly more time and compute.

We needed a simpler approach.

The Key Insight

Both models were fine-tuned from the same base model: Kokoro-82M by hexgrad.

When a model is fine-tuned, training does not completely replace the weights. It slightly shifts them away from the base model.

That shift represents what the model learned.

If:

$$\tau = \theta_{\mathrm{fine\mbox{-}tuned}} - \theta_{\mathrm{base}}$$

Then the resulting vector represents the learned behavior added during fine-tuning.

This is called a task vector.
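As a minimal sketch of that definition, the snippet below computes a task vector for a toy checkpoint. Plain Python floats stand in for weight tensors, and the key names and numbers are made up for illustration; they are not Kokoro's real parameters.

```python
# Toy checkpoints: parameter name -> value. Real checkpoints hold tensors,
# but the subtraction is the same element-wise operation.
# All names and numbers here are illustrative assumptions.
base = {"encoder.w": 0.50, "decoder.w": -0.20}
fine_tuned = {"encoder.w": 0.53, "decoder.w": -0.26}


def task_vector(fine_tuned, base):
    """Return tau = theta_fine_tuned - theta_base for every shared key."""
    return {k: fine_tuned[k] - base[k] for k in base if k in fine_tuned}


def apply_task_vector(base, tau, scale=1.0):
    """Return theta_base + scale * tau; keys missing from tau stay unchanged."""
    return {k: base[k] + scale * tau.get(k, 0.0) for k in base}


tau = task_vector(fine_tuned, base)
restored = apply_task_vector(base, tau)  # scale=1.0 gives back the fine-tuned weights
half = apply_task_vector(base, tau, scale=0.5)  # half of the learned shift
```

Adding the vector back with scale 1.0 reproduces the fine-tuned weights, while a smaller scale applies only part of the learned shift.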

Because both Sarah and Sumit started from the same base model:

Their vectors exist in the same weight space
and can be combined mathematically

Sarah learned:

voice identity
expression
speaking style

Sumit learned:

Indian English pronunciation
accent patterns
phoneme behavior

Those learned behaviors were mostly complementary rather than conflicting.

That meant the vectors could be added together.

How Task Arithmetic Works for TTS

The merge formula becomes:

$$\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + \alpha(\theta_{\mathrm{Sarah}} - \theta_{\mathrm{base}}) + \beta(\theta_{\mathrm{Sumit}} - \theta_{\mathrm{base}})$$

Where:

α controls Sarah’s voice characteristics
β controls Sumit’s Indian pronunciation

Both values can be adjusted gradually.

This allowed us to search for the best balance between:

voice identity
and pronunciation quality.

Figure 1: Sarah and Sumit's task vectors combined in shared Kokoro weight space.
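To make the merge formula concrete, here is a worked example for a single weight. The three weight values are made-up numbers chosen for easy arithmetic; α = 0.6 and β = 1.0 are the values this article ends up selecting.

```latex
\begin{aligned}
\theta_{\mathrm{base}} &= 1.00, \quad \theta_{\mathrm{Sarah}} = 1.20, \quad \theta_{\mathrm{Sumit}} = 0.90 \\
\theta_{\mathrm{merged}} &= 1.00 + 0.6\,(1.20 - 1.00) + 1.0\,(0.90 - 1.00) \\
&= 1.00 + 0.12 - 0.10 \\
&= 1.02
\end{aligned}
```

Two special cases follow directly from the formula: α = 1, β = 0 returns Sarah's weights unchanged, and α = 0, β = 1 returns Sumit's.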

A Technical Detail That Actually Helped

During merging, we found that:

178 decoder weight keys existed in the fine-tuned models
but did not exist under matching names in the base model

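This happened because the fine-tuned checkpoints store weight normalization through PyTorch's parametrization API (torch.nn.utils.parametrizations.weight_norm), which saves the magnitude and direction tensors under different key names than the older weight_g / weight_v naming found in the base checkpoint.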

For example:

weight_g

weight_v

became:

parametrizations.weight.original0

parametrizations.weight.original1

Since those weights had no matching base weights, task arithmetic could not be applied to them directly.

So instead of forcing a merge:

We preserved Sarah’s values for those weights unchanged.

This turned out to be exactly the correct decision.

Those decoder layers heavily affect:

voice texture
acoustic rendering
speaker identity

Keeping Sarah’s decoder intact preserved her voice quality while allowing the remaining weights to absorb Sumit’s pronunciation behavior.

Figure 2: Shared weights were merged mathematically while Sarah-specific decoder weights were preserved.
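The key check described above can be sketched as a simple set comparison. The key names below are hypothetical stand-ins that mimic the renaming; a real run would pass the actual key lists from the base and fine-tuned checkpoints.

```python
def partition_keys(base_keys, tuned_keys):
    """Split fine-tuned keys into those with a base match and those without."""
    base_keys = set(base_keys)
    mergeable = sorted(k for k in tuned_keys if k in base_keys)
    tuned_only = sorted(k for k in tuned_keys if k not in base_keys)
    return mergeable, tuned_only


# Hypothetical key names, not the real Kokoro state dict.
base_keys = [
    "decoder.conv.weight_g",
    "decoder.conv.weight_v",
    "encoder.proj.weight",
]
sarah_keys = [
    "decoder.conv.parametrizations.weight.original0",
    "decoder.conv.parametrizations.weight.original1",
    "encoder.proj.weight",
]

mergeable, sarah_only = partition_keys(base_keys, sarah_keys)
print("merge with task arithmetic:", mergeable)
print("keep from the fine-tuned model:", sarah_only)
```

Running a check like this before merging makes the unmatched keys visible up front, so keeping them from one model is a deliberate choice and not a silent side effect.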

Finding The Right Blend

We generated multiple combinations of α and β and tested them on the same sentences.

Round 1

Model	α (Sarah)	β (Sumit)	Result
Combo 1	1.0	0.3	Full Sarah voice, weak accent
Combo 2	1.0	0.5	Strong Sarah voice, moderate accent
Combo 3	0.8	0.5	Balanced voice and accent
Combo 4	0.7	0.7	Strong balance
Combo 5	0.5	1.0	Strong Indian pronunciation

After Sarah’s final checkpoint was completed, we reran the experiments with:

β fixed at 1.0
while adjusting α more carefully.

Round 2

α (Sarah)	β (Sumit)	Result
0.5	1.0	Strong accent, lighter expression
0.6	1.0	Best overall balance
0.7	1.0	Slightly more Sarah character
0.8–1.0	1.0	Accent quality started shifting

Final selected model:

α = 0.6, β = 1.0

Figure 3: α and β blend search used to find the final merged model.
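The blend search above amounts to running the same merge once per (α, β) pair. Here is a minimal sketch using the Round 1 pairs; the one-weight "checkpoints" and the candidate naming scheme are assumptions for illustration only.

```python
def merge(base, sarah, sumit, alpha, beta):
    """theta_base + alpha * (sarah - base) + beta * (sumit - base), per key."""
    return {
        k: base[k] + alpha * (sarah[k] - base[k]) + beta * (sumit[k] - base[k])
        for k in base
    }


# Toy one-weight checkpoints (made-up values standing in for real tensors).
base = {"w": 1.00}
sarah = {"w": 1.20}
sumit = {"w": 0.90}

# (alpha, beta) pairs taken from the Round 1 table.
round_1 = [(1.0, 0.3), (1.0, 0.5), (0.8, 0.5), (0.7, 0.7), (0.5, 1.0)]

candidates = {}
for alpha, beta in round_1:
    # Hypothetical naming scheme so each blend can be identified later.
    name = f"merged_a{alpha:.1f}_b{beta:.1f}"
    candidates[name] = merge(base, sarah, sumit, alpha, beta)

for name, weights in candidates.items():
    print(name, round(weights["w"], 3))
```

In a real run, each candidate would be saved and used to synthesize the same test sentences, since the results in the tables come from listening and not from a numeric metric.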

The Final Result

The first time we heard “Tiruchirappalli” pronounced correctly in Sarah’s voice, we knew the merge had actually worked.

The final merged model produced:

natural female speech
correct Indian pronunciation
strong Indian-English accent retention
smoother voice quality than the Sumit model

Without retraining the full network.

Audio sample: sara-merged.wav

How We Evaluated the Merged Voice

Human Listening Test Results

To evaluate the merged model, we conducted an informal listening study with 10 fellow interns. Each participant listened to identical speech samples generated by the three models and rated them independently.

Model	Indian Pronunciation	MOS	Listener Preference
Sumit	9.2/10	3.8	18%
Sarah	3.4/10	4.3	27%
Merged (α=0.6, β=1.0)	8.8/10	4.4	55%

Two evaluation criteria were used:

Indian Pronunciation (1–10): Participants rated how accurately the model pronounced Indian names, places, and words. The final score is the average rating across all 10 evaluators.
Mean Opinion Score (MOS, 1–5): Participants rated the overall naturalness and pleasantness of the synthesized speech, considering voice quality, smoothness, and how comfortable the audio was to listen to. The reported MOS is the average score across all evaluators.
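Both scores are plain averages over the evaluators, which can be sketched as below. The ratings are invented placeholder numbers that only show the calculation; they are not the data from this study.

```python
from statistics import mean

# Placeholder ratings for ONE model from 10 listeners.
# These numbers are invented for illustration, not taken from the study.
pronunciation_ratings = [7, 8, 6, 7, 9, 8, 7, 6, 8, 7]  # 1-10 scale
naturalness_ratings = [4, 3, 4, 5, 4, 3, 4, 5, 5, 4]    # 1-5 scale (MOS)


def average_score(ratings, low, high):
    """Average the ratings after checking each one is on the expected scale."""
    if not ratings:
        raise ValueError("no ratings given")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside {low}-{high}")
    return round(mean(ratings), 1)


pronunciation = average_score(pronunciation_ratings, 1, 10)
mos = average_score(naturalness_ratings, 1, 5)
print(f"Indian Pronunciation: {pronunciation}/10, MOS: {mos}")
```

With only 10 listeners, a single rating moves the pronunciation average by 0.1, so small gaps between models should be read with caution.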

How We Merged the Model Weights in Code

The actual implementation is surprisingly small.

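import copy
import torch

# Load all three checkpoints on CPU. The base checkpoint is loaded with
# weights_only=False, which runs the full unpickler, so only do this with
# a file you trust.
base  = torch.load("kokoro-v1_0.pth", map_location="cpu", weights_only=False)
sarah = torch.load("sarah_ep19.pth", map_location="cpu", weights_only=True)
sumit = torch.load("sumit_ep09.pth", map_location="cpu", weights_only=True)

def flatten(sd):
    """Flatten {section: {name: tensor}} into {(section, name): tensor}."""
    out = {}
    for section, subdict in sd.items():
        for subkey, tensor in subdict.items():
            out[(section, subkey)] = tensor
    return out

base_flat  = flatten(base)
sarah_flat = flatten(sarah)
sumit_flat = flatten(sumit)

# Task arithmetic only applies to weights present in all three checkpoints.
common_keys = set(base_flat) & set(sarah_flat) & set(sumit_flat)

alpha = 0.6  # weight on Sarah's task vector
beta  = 1.0  # weight on Sumit's task vector

# Start from Sarah so that any key missing from common_keys keeps her values.
merged = copy.deepcopy(sarah)

for (section, subkey) in common_keys:
    orig_dtype = sarah_flat[(section, subkey)].dtype

    # Do the arithmetic in float32 for stability.
    b = base_flat[(section, subkey)].float()
    s = sarah_flat[(section, subkey)].float()
    j = sumit_flat[(section, subkey)].float()

    # theta_base + alpha * tau_sarah + beta * tau_sumit,
    # cast back to the dtype stored in Sarah's checkpoint.
    merged[section][subkey] = (
        b
        + alpha * (s - b)
        + beta  * (j - b)
    ).to(orig_dtype)

torch.save(merged, "sarah_indian_merged.pth")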

A few important details:

copy.deepcopy(sarah) intentionally preserves Sarah-only decoder weights
only shared weights are merged mathematically
no training loop is required
no gradients are computed
the merge runs entirely on CPU

The entire process takes less than a minute for a ~300MB checkpoint.

Why Task Arithmetic Works

Fine-tuned models usually remain relatively close to their shared base model.

That means their learned changes often exist in a locally linear region of weight space.

When two task vectors are:

small,
and mostly non-conflicting,

their effects can often be combined predictably.
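One rough way to check both conditions is to measure each task vector's size relative to the base weights and the cosine similarity between the two vectors. The sketch below uses made-up four-element weight lists, chosen so the two fine-tunes touch different weights. Treat the numbers as a heuristic signal, not a guarantee that a merge will sound right.

```python
import math


def task_vector(tuned, base):
    """Element-wise tuned - base."""
    return [t - b for t, b in zip(tuned, base)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity; values near -1 suggest the vectors pull in opposite directions."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


# Made-up flattened weights (illustrative only).
base = [0.50, -0.20, 0.10, 0.80]
sarah = [0.53, -0.20, 0.10, 0.78]
sumit = [0.50, -0.24, 0.13, 0.80]

tau_sarah = task_vector(sarah, base)
tau_sumit = task_vector(sumit, base)

relative_size_sarah = norm(tau_sarah) / norm(base)
relative_size_sumit = norm(tau_sumit) / norm(base)
similarity = cosine(tau_sarah, tau_sumit)

print(f"relative size: sarah={relative_size_sarah:.3f}, sumit={relative_size_sumit:.3f}")
print(f"cosine similarity between task vectors: {similarity:.3f}")
```

In this toy case both vectors are a few percent of the base norm and are orthogonal to each other, which is the friendly situation described above.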

This idea is closely related to:

Model Soups
TIES-Merging
modern LLM checkpoint merging techniques

which all rely on similar geometric behavior in neural network weight space.

What breaks task arithmetic:

very large fine-tunes
conflicting task vectors
models trained from different base checkpoints

The shared base model is the critical requirement.

Why This Was Interesting For TTS

Task arithmetic has already been widely explored in:

LLM merging
image generation
checkpoint interpolation

But applying it to:

voice identity
accent transfer
pronunciation adaptation

inside TTS models is still relatively unexplored.

In this project, task arithmetic allowed:

Sarah’s voice identity
and Sumit’s pronunciation behavior

to coexist inside a single merged checkpoint

without:

retraining from scratch
collecting new combined datasets
or building pronunciation lexicons at inference time.

Final Thoughts

Task arithmetic feels almost too simple to work.

But once you realize fine-tuning is just moving through weight space, model merging starts feeling less like a hack and more like geometry.

For this project, it meant combining:

Sarah’s voice,
Sumit’s Indian pronunciation,
and Kokoro’s base capabilities

into a single model without retraining from scratch.

And this is probably only the beginning.

Jeevarathinam V

AI/ML Engineer exploring next-gen AI and generative systems, driven by curiosity to build, experiment, and push boundaries in the world of intelligent systems.
