
How task arithmetic let us combine a female voice with a male model's Indian English accent without retraining anything
Most text-to-speech models can say "Hello, how are you?" But ask them to pronounce Subramanian, Tiruchirappalli, Sriharikota, or Bengaluru, and the illusion quickly falls apart. That was the problem we set out to solve.
We had trained two separate models: one with the female voice we wanted, and one that pronounced Indian names correctly. Neither did both.
We assumed the only solution was to collect more data and train a larger combined model. But while digging through research papers online, I came across "Editing Models with Task Arithmetic" and "Model soups: averaging weights of multiple fine-tuned models".
The idea sounded almost ridiculous at first: take two neural networks fine-tuned from the same base model, subtract the base model's weights from each, combine the differences, and add the result back to the base. Sometimes that merges their behaviors without any additional training.
That is where task arithmetic came in, a technique that lets you combine what two neural networks have independently learned using nothing but subtraction and addition on their weights.
This blog explains:
- what we did,
- how the technique works,
- and why it is more useful than most people realize.
What Is Task Arithmetic?
Task arithmetic is a technique where the learned weight differences between fine-tuned neural networks are treated like vectors in weight space.
If two models were fine-tuned from the same base model, their learned changes can often be:
- added,
- scaled,
- or combined
to create a new model with the properties of both.
Instead of retraining from scratch, task arithmetic edits model behavior directly through weight space operations.
In this project, we used task arithmetic to combine:
- Sarah’s female voice characteristics
- with Sumit’s Indian English pronunciation patterns
into a single Kokoro TTS model.
The Problem We Were Trying To Solve
We already had two separate fine-tuned models. Neither solved the full problem.
Model 1: Sumit
Fine-tuned on Indian-English male speech data.
Strengths:
- excellent pronunciation of Indian names and places
- natural Indian accent
- strong pronunciation consistency
Weakness:
- male voice
Model 2: Sarah
Fine-tuned on Sarah’s recordings using two-stage training:
- Stage 1 → 20 epochs
- Stage 2 → 20 epochs
Strengths:
- expressive female voice
- natural rhythm and tone
- smoother speech quality
Weakness:
- poor Indian pronunciation
For example:
- “Subramanian” became “subremanian”
- “Tiruchirappalli” sounded completely incorrect
How Both Models Performed on the Same Sentence
Sentence:
“Could you please connect me with Subramanian Iyer? The compensation is Rs.18,00,000 per annum.”
Sumit Model
- correct Indian pronunciation
- natural Indian accent
- male voice
Sarah Model
- incorrect pronunciation
- neutral accent
- female voice
Neither model alone was usable for the final assistant.
The obvious solution would have been:
- collecting combined data
- retraining from scratch
- running another long fine-tuning cycle
But that would take significantly more time and compute.
We needed a simpler approach.
The Key Insight
Both models were fine-tuned from the same base model: Kokoro-82M by hexgrad.
When a model is fine-tuned, training does not completely replace the weights. It slightly shifts them away from the base model.
That shift represents what the model learned.
If:
$$\tau = \theta_{\mathrm{fine\text{-}tuned}} - \theta_{\mathrm{base}}$$
then the resulting vector τ represents the learned behavior added during fine-tuning.
This is called a task vector.
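As a minimal sketch of that definition, the snippet below computes a task vector by subtracting base weights from fine-tuned weights, key by key. Plain Python lists stand in for tensors so it runs without PyTorch. The function name `task_vector` and the toy weight names and values are assumptions made for illustration; they are not taken from Kokoro.

```python
def task_vector(finetuned, base):
    """Return tau = theta_finetuned - theta_base for every shared key."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in finetuned
        if name in base
    }


# Toy weights (hypothetical names and values, for illustration only).
base = {"encoder.w": [0.5, -0.25], "decoder.w": [1.0, 0.0]}
finetuned = {"encoder.w": [0.75, -0.25], "decoder.w": [1.0, 0.5]}

tau = task_vector(finetuned, base)
print(tau)
```

Entries that stay at zero are weights fine-tuning did not move, and the nonzero entries are what the model learned.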
Because both Sarah and Sumit started from the same base model:
- Their vectors exist in the same weight space
- and can be combined mathematically
Sarah learned:
- voice identity
- expression
- speaking style
Sumit learned:
- Indian English pronunciation
- accent patterns
- phoneme behavior
Those learned behaviors were mostly complementary rather than conflicting.
That meant the vectors could be added together.
How Task Arithmetic Works for TTS
The merge formula becomes:
$$\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + \alpha(\theta_{\mathrm{Sarah}} - \theta_{\mathrm{base}}) + \beta(\theta_{\mathrm{Sumit}} - \theta_{\mathrm{base}})$$
Where:
- α controls Sarah’s voice characteristics
- β controls Sumit’s Indian pronunciation
Both values can be adjusted gradually.
This allowed us to search for the best balance between:
- voice identity
- and pronunciation quality.
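To make the formula concrete, here is a worked example for a single scalar weight. The numbers are invented for illustration and are not taken from the Kokoro checkpoints.

```latex
\begin{aligned}
\theta_{\mathrm{base}} &= 0.50,\quad \theta_{\mathrm{Sarah}} = 0.60,\quad \theta_{\mathrm{Sumit}} = 0.45,\quad \alpha = 0.6,\quad \beta = 1.0 \\
\theta_{\mathrm{merged}} &= 0.50 + 0.6\,(0.60 - 0.50) + 1.0\,(0.45 - 0.50) \\
&= 0.50 + 0.06 - 0.05 \\
&= 0.51
\end{aligned}
```

Note that α and β scale their own task vectors independently, so they do not need to sum to 1.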

Figure 1: Sarah and Sumit's task vectors combined in shared Kokoro weight space.
A Technical Detail That Actually Helped
During merging, we found that:
- 178 decoder weight keys existed in the fine-tuned models
- but did not exist under matching names in the base model
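This happened because the fine-tuning code applied weight normalization through PyTorch's parametrization API, which stores the same weights under different names than the older weight_g / weight_v scheme used in the base checkpoint.
For example:
weight_g
weight_v
became:
parametrizations.weight.original0
parametrizations.weight.original1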
Since those weights had no matching base weights, task arithmetic could not be applied to them directly.
So instead of forcing a merge, we preserved Sarah’s values for those weights unchanged.
This turned out to be the right decision.
Those decoder layers heavily influence:
- voice texture
- acoustic rendering
- speaker identity
Keeping Sarah’s decoder intact preserved her voice quality while allowing the remaining weights to absorb Sumit’s pronunciation behavior.
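The key mismatch described above can be surfaced before merging with a simple set comparison. The sketch below uses made-up key names in the style of the ones mentioned (the `decoder.conv` and `encoder.proj` prefixes are hypothetical, as is the helper `split_keys`). It reports which fine-tuned keys have no same-named counterpart in the base checkpoint, so they can be copied through rather than merged.

```python
def split_keys(base_keys, finetuned_keys):
    """Partition fine-tuned keys into mergeable and fine-tuned-only lists."""
    base_keys, finetuned_keys = set(base_keys), set(finetuned_keys)
    shared = finetuned_keys & base_keys          # task arithmetic applies here
    finetuned_only = finetuned_keys - base_keys  # no base weight to subtract
    return sorted(shared), sorted(finetuned_only)


# Hypothetical key names: the base uses the older weight_g / weight_v naming,
# the fine-tuned checkpoint uses the parametrization naming.
base_keys = [
    "decoder.conv.weight_g",
    "decoder.conv.weight_v",
    "encoder.proj.weight",
]
finetuned_keys = [
    "decoder.conv.parametrizations.weight.original0",
    "decoder.conv.parametrizations.weight.original1",
    "encoder.proj.weight",
]

shared, finetuned_only = split_keys(base_keys, finetuned_keys)
print("merge:", shared)
print("keep as-is:", finetuned_only)
```

Run on the flattened keys of real checkpoints, the second list would correspond to the 178 decoder keys mentioned above.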

Figure 2: Shared weights were merged mathematically while Sarah-specific decoder weights were preserved.
Finding The Right Blend
We generated multiple combinations of α and β and tested them on the same sentences.
Round 1
| Model | α (Sarah) | β (Sumit) | Result |
|---|---|---|---|
| Combo 1 | 1.0 | 0.3 | Full Sarah voice, weak accent |
| Combo 2 | 1.0 | 0.5 | Strong Sarah voice, moderate accent |
| Combo 3 | 0.8 | 0.5 | Balanced voice and accent |
| Combo 4 | 0.7 | 0.7 | Strong balance |
| Combo 5 | 0.5 | 1.0 | Strong Indian pronunciation |
After Sarah’s final checkpoint was completed, we reran the experiments with:
- β fixed at 1.0
- while adjusting α more carefully.
Round 2
| α (Sarah) | β (Sumit) | Result |
|---|---|---|
| 0.5 | 1.0 | Strong accent, lighter expression |
| 0.6 | 1.0 | Best overall balance |
| 0.7 | 1.0 | Slightly more Sarah character |
| 0.8–1.0 | 1.0 | Accent quality started shifting |
Final selected model:
α = 0.6, β = 1.0
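A blend search like this can be scripted as a loop over (α, β) pairs. The following is a sketch under assumptions: the `merge` helper and the output file-name pattern are hypothetical, plain lists stand in for tensors, and the toy values are invented. The final judgment still comes from listening to each candidate, not from any automatic score.

```python
def merge(base, sarah, sumit, alpha, beta):
    """theta_base + alpha * tau_sarah + beta * tau_sumit, per shared key."""
    merged = {}
    for name in base.keys() & sarah.keys() & sumit.keys():
        merged[name] = [
            b + alpha * (s - b) + beta * (j - b)
            for b, s, j in zip(base[name], sarah[name], sumit[name])
        ]
    return merged


# Toy one-layer "checkpoints" (invented values).
base = {"w": [0.0, 1.0]}
sarah = {"w": [1.0, 1.0]}
sumit = {"w": [0.0, 2.0]}

# Round 2 style sweep: beta fixed at 1.0, alpha varied.
candidates = {}
for alpha in (0.5, 0.6, 0.7):
    name = f"merged_a{alpha}_b1.0.pth"  # hypothetical naming scheme
    candidates[name] = merge(base, sarah, sumit, alpha, beta=1.0)

for name, weights in sorted(candidates.items()):
    print(name, weights)
```

Each candidate would then be used to synthesize the same test sentences so that the only thing changing between listens is the blend.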

Figure 3: α and β blend search used to find the final merged model.
The Final Result
The first time we heard “Tiruchirappalli” pronounced correctly in Sarah’s voice, we knew the merge had actually worked.
The final merged model produced:
- natural female speech
- correct Indian pronunciation
- strong Indian-English accent retention
- smoother voice quality than the Sumit model
all without retraining the full network.
How We Evaluated the Merged Voice
Human Listening Test Results
To evaluate the merged model, we conducted an informal listening study with 10 fellow interns. Each participant listened to identical speech samples generated by the three models and rated them independently.
| Model | Indian Pronunciation | MOS | Listener Preference |
|---|---|---|---|
| Sumit | 9.2/10 | 3.8 | 18% |
| Sarah | 3.4/10 | 4.3 | 27% |
| Merged (α=0.6, β=1.0) | 8.8/10 | 4.4 | 55% |
Two evaluation criteria were used:
- Indian Pronunciation (1–10): Participants rated how accurately the model pronounced Indian names, places, and words. The final score is the average rating across all 10 evaluators.
- Mean Opinion Score (MOS, 1–5): Participants rated the overall naturalness and pleasantness of the synthesized speech, considering voice quality, smoothness, and how comfortable the audio was to listen to. The reported MOS is the average score across all evaluators.
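The averaging described above is simple enough to sketch. The ratings below are placeholder values invented to show the computation; they are not the scores collected in the listening study, and the helper name `mean_score` is hypothetical.

```python
from statistics import mean, stdev


def mean_score(ratings):
    """Return (mean, sample standard deviation) for one model's ratings."""
    return round(mean(ratings), 2), round(stdev(ratings), 2)


# Placeholder ratings from 10 hypothetical listeners on the 1-5 MOS scale.
placeholder_mos = [4, 5, 4, 4, 5, 4, 3, 5, 4, 4]

avg, spread = mean_score(placeholder_mos)
print(f"MOS = {avg} (sd {spread}, n = {len(placeholder_mos)})")
```

With a small panel, reporting the spread alongside the mean helps readers judge how much a small gap between two models means.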
How We Merged the Model Weights in Code
The actual implementation is surprisingly small.
import copy
import torch

# Load the base checkpoint and both fine-tuned checkpoints on CPU.
base = torch.load("kokoro-v1_0.pth", map_location="cpu", weights_only=False)
sarah = torch.load("sarah_ep19.pth", map_location="cpu", weights_only=True)
sumit = torch.load("sumit_ep09.pth", map_location="cpu", weights_only=True)


def flatten(sd):
    """Flatten {section: {subkey: tensor}} into {(section, subkey): tensor}."""
    out = {}
    for section, subdict in sd.items():
        for subkey, tensor in subdict.items():
            out[(section, subkey)] = tensor
    return out


base_flat = flatten(base)
sarah_flat = flatten(sarah)
sumit_flat = flatten(sumit)

# Only keys present in all three checkpoints can be merged.
common_keys = set(base_flat) & set(sarah_flat) & set(sumit_flat)

alpha = 0.6  # weight on Sarah's task vector
beta = 1.0   # weight on Sumit's task vector

# Start from Sarah so that keys missing from the base keep her values.
merged = copy.deepcopy(sarah)

for (section, subkey) in common_keys:
    b = base_flat[(section, subkey)].float()
    s = sarah_flat[(section, subkey)].float()
    j = sumit_flat[(section, subkey)].float()
    # theta_base + alpha * tau_sarah + beta * tau_sumit
    merged[section][subkey] = b + alpha * (s - b) + beta * (j - b)

torch.save(merged, "sarah_indian_merged.pth")

A few important details:
- copy.deepcopy(sarah) intentionally preserves Sarah-only decoder weights
- only shared weights are merged mathematically
- no training loop is required
- no gradients are computed
- the merge runs entirely on CPU
The entire process takes less than a minute for a ~300MB checkpoint.
Why Task Arithmetic Works
Fine-tuned models usually remain relatively close to their shared base model.
That means their learned changes often exist in a locally linear region of weight space.
When two task vectors:
- are small,
- and mostly non-conflicting,
their effects can often be combined predictably.
This idea is closely related to:
- Model Soups
- TIES-Merging
- modern LLM checkpoint merging techniques
which all rely on similar geometric behavior in neural network weight space.
What breaks task arithmetic:
- very large fine-tunes
- conflicting task vectors
- models trained from different base checkpoints
The shared base model is the critical requirement.
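One way to check for conflicting task vectors before merging is to compare them directly. The sketch below is a diagnostic under stated assumptions, not a step reported in this project: it computes the cosine similarity between two flattened task vectors and the fraction of coordinates where they push in opposite directions, which is the sign-conflict idea that TIES-Merging builds on. The function names and numbers are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sign_conflict_fraction(u, v):
    """Fraction of coordinates where both are nonzero with opposite signs."""
    conflicts = sum(1 for a, b in zip(u, v) if a * b < 0)
    return conflicts / len(u)


# Toy flattened task vectors (invented values).
tau_voice = [0.25, 0.0, -0.5, 0.0]
tau_accent = [0.0, 0.5, 0.5, -0.25]

print("cosine:", cosine(tau_voice, tau_accent))
print("sign conflicts:", sign_conflict_fraction(tau_voice, tau_accent))
```

A cosine near zero with few sign conflicts suggests the two fine-tunes mostly touched different directions. A strongly negative cosine suggests they may partly cancel when added.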
Why This Was Interesting For TTS
Task arithmetic has already been widely explored in:
- LLM merging
- image generation
- checkpoint interpolation
But applying it to:
- voice identity
- accent transfer
- pronunciation adaptation
inside TTS models is still relatively unexplored.
In this project, task arithmetic allowed:
- Sarah’s voice identity
- and Sumit’s pronunciation behavior
to coexist inside a single merged checkpoint
without:
- retraining from scratch
- collecting new combined datasets
- or building pronunciation lexicons at inference time.
Final Thoughts
Task arithmetic feels almost too simple to work.
But once you realize fine-tuning is just moving through weight space, model merging starts feeling less like a hack and more like geometry.
For this project, it meant combining:
- Sarah’s voice,
- Sumit’s Indian pronunciation,
- and Kokoro’s base capabilities
into a single model without retraining from scratch.
And this is probably only the beginning.
Need Help Building Custom Gen AI Systems?
If you're exploring voice AI, model fine-tuning, or custom AI workflows, talk to our team about our Gen AI development services before anything is scoped or committed.



