3,000 Tokens/Sec on Two RTX 4090s for Free

Written by Kiruthika

Jun 29, 2026

7 Min Read

We had 475,000 candidate profiles to synthesise for HuntVox, our internal tool. The data came from multiple sources, including LinkedIn, Weekday, resume parsing pipelines, and Lemlist, resulting in duplicate fields, inconsistent formats, and noisy profile information.

Our goal was simple: convert raw profiles into semantic summaries, structured skills, and domain tags that could improve search quality and retrieval.

At this scale, hosted APIs became difficult to justify. Rate limits reduced throughput, costs increased rapidly, and external inference raised privacy concerns for candidate data.

So we ran everything locally using gpt-oss-20b, SGLang’s offline engine, and two RTX 4090 GPUs, reaching nearly 3,000 tokens/sec, processing 475K+ profiles in 8 hours, with zero API cost.

In this guide, we’ll break down the architecture, batching strategy, tensor parallelism setup, prefix KV caching optimisations, and the lessons learned while running large-scale offline inference with SGLang. Let’s dive into it.

The Problem with Processing 475K Candidate Profiles

HuntVox aggregates candidate data from multiple sources, including LinkedIn, Weekday, resume parsing pipelines, and Lemlist. The challenge was that this data was highly inconsistent.

Profiles contained duplicated fields, different date formats, repeated job descriptions, and overlapping information across vendors. Directly embedding this data would have produced poor semantic search results.

We needed an LLM pipeline that could transform every raw profile into three outputs:

Summary

A 400–600-word semantic profile summary written in dense prose for vector embeddings. The output focused on career progression, technical expertise, leadership scope, and overall experience without dates, bullet points, or unnecessary formatting.

Skills

A normalised, de-duplicated list of technologies, frameworks, certifications, and methodologies extracted from the profile. The priority was completeness rather than strict precision.

Domains

Business vertical tags generated from the company_industry mappings in the database. This step did not require an LLM and was handled through Python-based aggregation.
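As an illustration of the kind of aggregation the Domains step describes, here is a minimal sketch. The `COMPANY_INDUSTRY` dictionary, the `domain_tags` function, and the `top_n` cut-off are hypothetical stand-ins; the real mappings live in the database and may have a different shape.

```python
from collections import Counter

# Hypothetical mapping; in the pipeline this would be loaded from the database.
COMPANY_INDUSTRY = {
    "Acme Bank": "Fintech",
    "PayFast": "Fintech",
    "MediCore": "Healthcare",
}

def domain_tags(companies: list[str], top_n: int = 3) -> list[str]:
    """Return the most frequent industries across a candidate's employers."""
    counts = Counter(
        COMPANY_INDUSTRY[name] for name in companies if name in COMPANY_INDUSTRY
    )
    return [industry for industry, _ in counts.most_common(top_n)]

tags = domain_tags(["Acme Bank", "PayFast", "MediCore", "Unknown Co"])
print(tags)  # ['Fintech', 'Healthcare']
```

Companies without a mapping are simply ignored here, which keeps the tags deterministic without calling a model.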

The scale made optimisation critical. With 475,000 profiles, even a small inference inefficiency multiplied quickly. A workflow that was only 10% slower would add close to an hour to an 8-hour run.

Why We Didn't Use Cerebras or Groq

Both are genuinely fast. Groq's LPU has been reported at ~800 tok/s on Llama 3 8B; Cerebras reaches ~2,000 tok/s on comparable models.

For interactive applications, they're excellent. For offline batch generation at our scale, they have structural problems.

Dimension	Cerebras / Groq	SGLang Local (2× 4090)
Effective batch throughput	Rate-limited. Groq free tier: ~6K TPM. Paid tiers still throttle bulk traffic. You spend more time sleeping than generating.	Full hardware throughput, sustained. No ceiling.
Cost at 475K profiles	~$0.80–$1.00 per 1M tokens. 475K × ~2,100 tok avg = ~1B tokens → ~$800–$1,000 total.	Electricity. ~8 hrs × 600W = ~5 kWh ≈ $1.
Data privacy	475K candidate profiles sent to third-party inference servers. PII risk, data residency concerns.	Nothing leaves the machine.
Model control	Fixed model catalogue. Can't disable chain-of-thought, can't tune sampling, can't strip internal thinking tokens.	Full control. We disabled thinking entirely — saving ~30% token waste per request.
Resume safety	Throttle mid-run → complex retry queues, state management, partial-batch bookkeeping.	Idempotent file writes. SSH drops or server reboots → re-run the same command, skip already-done files.
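The cost row in the comparison can be checked with a few lines of arithmetic. This is a rough sketch using the figures quoted in the table; the electricity price of $0.20 per kWh is an assumption chosen to match the "≈ $1" estimate, not a measured value.

```python
PROFILES = 475_000
AVG_TOKENS_PER_PROFILE = 2_100           # input + output, from the table
API_PRICE_PER_M_TOKENS = (0.80, 1.00)    # USD range quoted in the table
RUN_HOURS = 8
POWER_DRAW_WATTS = 600
PRICE_PER_KWH = 0.20                     # assumption: USD per kWh

total_tokens = PROFILES * AVG_TOKENS_PER_PROFILE
api_cost_low = total_tokens / 1_000_000 * API_PRICE_PER_M_TOKENS[0]
api_cost_high = total_tokens / 1_000_000 * API_PRICE_PER_M_TOKENS[1]

energy_kwh = RUN_HOURS * POWER_DRAW_WATTS / 1000
electricity_cost = energy_kwh * PRICE_PER_KWH

print(f"total tokens:     {total_tokens:,}")
print(f"hosted API cost:  ${api_cost_low:,.0f} to ${api_cost_high:,.0f}")
print(f"local energy use: {energy_kwh:.1f} kWh, about ${electricity_cost:.2f}")
```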

The fundamental mismatch: Hosted APIs optimise for single-request latency. Offline batch inference is the opposite problem: you have hundreds of thousands of prompts, and you care about total throughput, not how fast one response arrives.

SGLang's offline engine is purpose-built for this; it fills the GPU with a batch and maximises tokens-per-second across all requests simultaneously.

How We Processed 475K Profiles with SGLang

SGLang ships two modes: an online server (OpenAI-compatible API) and an offline engine for batch generation. We used the offline engine; you hand it a list of prompts, and it returns a list of outputs. No HTTP, no serialisation overhead, no connection pooling.

SGLang Processing Flow:

1,000 Prompts (One Batch): profile.md → chat-template → token sequences
Batch Scheduler & Continuous Batching: as sequences finish, new ones are inserted immediately, with no padding idle time
RadixAttention (Prefix KV Cache): ~350-token system prompt computed once, reused across all 1,000 prompts (~25% saved)
GPU 0 (RTX 4090, one shard of every layer, 24 GB · 1,008 GB/s) ↔ NCCL AllReduce ↔ GPU 1 (RTX 4090, the other shard of every layer, 24 GB · 1,008 GB/s)
Memory budget: mem_fraction_static=0.88 · ~42 GB across both GPUs for model weights plus the KV cache pool
outputs[] → profile_synthesis.json: one file written per candidate · idempotent skip on restart

Tensor Parallelism (TP=2)

The gpt-oss-20b model requires roughly 40 GB of memory in bf16 precision, which exceeds the 24 GB VRAM available on a single RTX 4090.

To run the model locally, we used tensor parallelism (tp_size=2). SGLang split the model across both GPUs, with each card holding a slice of the attention and MLP weights in every layer.

During inference, both GPUs processed requests simultaneously and synchronised using NCCL AllReduce, giving us an effective 48 GB VRAM pool and higher memory bandwidth for large batch generation.
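To make the sharding idea concrete, here is a toy sketch of a tensor-parallel MLP in pure Python. It is not SGLang's implementation; the matrices, the `mlp` helper, and the two "devices" are illustrative assumptions. The up-projection is split by columns, the down-projection by rows, and the element-wise sum at the end plays the role of the AllReduce.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def mlp(x, w_up, w_down):
    return matmul(relu(matmul(x, w_up)), w_down)

x = [[1.0, -2.0], [0.5, 3.0]]                                 # 2 tokens, hidden size 2
w_up = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -0.5, 1.0]]         # 2 x 4
w_down = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [-1.0, 0.5]]   # 4 x 2

# Shard the weights: each "GPU" gets half the columns of w_up
# and the matching half of the rows of w_down.
w_up_0, w_up_1 = [r[:2] for r in w_up], [r[2:] for r in w_up]
w_down_0, w_down_1 = w_down[:2], w_down[2:]

partial_0 = mlp(x, w_up_0, w_down_0)   # computed on device 0
partial_1 = mlp(x, w_up_1, w_down_1)   # computed on device 1

# AllReduce (sum): both devices end up with the full output.
combined = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(partial_0, partial_1)]
reference = mlp(x, w_up, w_down)       # unsharded result for comparison

print(combined)   # [[4.0, 1.5], [-2.25, -0.5]]
print(reference)  # [[4.0, 1.5], [-2.25, -0.5]]
```

Each device only ever stores half of the weights, which is why a model too large for one 24 GB card can run across two.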

Continuous Batching

Traditional static batching waits for the longest sequence in the batch to finish before starting the next batch. Slots freed by shorter sequences sit idle in the meantime, reducing GPU utilisation.

SGLang uses continuous batching instead.

As individual requests are completed, new prompts are inserted immediately into the active batch. This keeps GPU resources occupied throughout execution and avoids idle time caused by uneven sequence lengths.

For large-scale inference runs such as 475K profile generation, this helped maintain consistently high GPU utilisation across the entire pipeline.
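The difference between the two scheduling styles can be shown with a small simulation. This is a simplified model, not SGLang's scheduler: it assumes every active sequence produces one token per decode step and that the batch has a fixed number of slots. The function names and the example lengths are made up for illustration.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs for as many decode steps as its longest sequence."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active = []          # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [5, 1, 1, 1, 4, 1, 1, 1]   # output lengths in tokens
print(static_batching_steps(lengths, batch_size=4))      # 9
print(continuous_batching_steps(lengths, batch_size=4))  # 5
```

With uneven output lengths, the continuous scheduler finishes the same work in fewer decode steps because short requests no longer hold a slot hostage.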

RadixAttention - the KV Cache Win

Every one of the 475K profile generation requests used the same ~350-token system prompt.

Normally, the model would recompute those tokens for every request, creating unnecessary input overhead.

SGLang’s RadixAttention avoids this by identifying shared prompt prefixes, computing the KV cache once, and reusing it across the entire batch.

With 1,000 prompts per batch, this removed:

350 × 1,000 = 350,000 token computations per batch

This cut input (prefill) processing by roughly 25%, allowing the GPUs to spend more time generating outputs instead of recomputing identical prompt tokens.
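The savings can be worked through as a short calculation. The average input length of 1,400 tokens is an assumption chosen so that a 350-token shared prefix is about a quarter of each prompt; the prefix still has to be computed once per batch, so the saving is slightly under 350,000 tokens.

```python
SYSTEM_PROMPT_TOKENS = 350
BATCH_SIZE = 1_000
AVG_INPUT_TOKENS = 1_400   # assumption: system prompt + profile text

# Without prefix caching, every prompt is prefilled in full.
prefill_without_cache = AVG_INPUT_TOKENS * BATCH_SIZE

# With prefix caching, the shared prefix is computed once and reused.
tokens_saved = SYSTEM_PROMPT_TOKENS * (BATCH_SIZE - 1)
prefill_with_cache = prefill_without_cache - tokens_saved

saved_fraction = tokens_saved / prefill_without_cache

print(f"tokens saved per batch: {tokens_saved:,}")
print(f"prefill reduction:      {saved_fraction:.1%}")
```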

Why We Disabled Chain-of-Thought for Profile Generation

gpt-oss-20b is OpenAI's open-weight reasoning model. By default, it generates an internal reasoning trail before producing the final response.

For tasks such as profile generation and structured extraction, this extra reasoning was unnecessary. Our workload only required outputs like profile summaries and skill extraction, not multi-step problem solving.

The reasoning phase added roughly 200–400 extra tokens per request, increasing inference cost without improving output quality.

We disabled reasoning directly in the prompt template using:

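from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

# Disable thinking at prompt construction time.
# SYSTEM_PROMPT and profile_md are defined elsewhere in the pipeline.
prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM_PROMPT},
     {"role": "user",   "content": profile_md}],
    tokenize=False,
    add_generation_prompt=True,
    # Extra keyword arguments are forwarded to the chat template.
    # OpenAI documents "low", "medium" and "high" for gpt-oss; "none" is the value we used.
    reasoning_effort="none",
)

# Defensive stripper for any residual thinking tokens
def strip_thinking(text: str) -> str:
    """Keep only the text after the final-channel marker, if one is present."""
    if "assistantfinal" in text:
        return text.split("assistantfinal", 1)[1].strip()
    return text.strip()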

This optimisation reduced token usage by roughly 30% per profile and improved overall throughput.

Instead of spending compute on internal reasoning traces, the model focused entirely on generating the final output.

How the Inference Pipeline Worked

The entire generation core is straightforward. SGLang's offline engine takes a list of prompt strings and returns a list of outputs; all the scheduling complexity is internal.

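from itertools import islice

import sglang as sgl
from json_repair import repair_json  # assumed source of repair_json; returns a JSON string

def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Load once; the model stays in VRAM for the entire run
llm = sgl.Engine(
    model_path="openai/gpt-oss-20b",
    tp_size=2,                 # shard across both 4090s
    mem_fraction_static=0.88,  # 88% of VRAM for model weights + KV cache pool
)
sampling_params = {
    "max_new_tokens": 900,
    "temperature": 0.2,
    "repetition_penalty": 1.05,
}
# Batch loop: 1,000 candidates at a time.
# `pending` comes from find_pending(); build_prompt() wraps the chat-template call.
for batch in chunks(pending, 1000):
    prompts = [build_prompt(folder) for folder in batch]
    # Single call; the engine schedules all 1,000 in parallel internally
    outputs = llm.generate(prompts, sampling_params)
    for folder, out in zip(batch, outputs):
        text = strip_thinking(out["text"])
        result = repair_json(text)  # graceful parse of slightly malformed JSON
        (folder / "profile_synthesis.json").write_text(result)
llm.shutdown()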

Idempotency and Failure Recovery

The full inference run took more than 8 hours. During execution, the GPU server restarted at around 446K processed candidates because of provider maintenance.

To avoid reprocessing completed work, every candidate output was written as an individual file. At startup, the script checked whether profile_synthesis.json already existed and skipped completed entries.

This allowed us to rerun the same command and continue exactly from the last processed candidate without duplicate generation.

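from pathlib import Path

def find_pending(base: Path, force: bool = False) -> list[Path]:
    """Return candidate folders that still need a profile_synthesis.json."""
    pending = []
    for entry in sorted(base.iterdir()):
        if not (entry / "profile.md").exists():
            continue   # no input profile in this folder
        if (entry / "profile_synthesis.json").exists() and not force:
            continue   # already done, skip
        pending.append(entry)
    return pending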

This recovery mechanism made the pipeline idempotent. Server restarts, SSH disconnects, or interrupted runs only required restarting the job instead of reprocessing all 475K profiles.
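Skipping on file existence only works if a half-written file can never be mistaken for a finished one. One common way to guarantee that is to write to a temporary file and rename it into place. The sketch below shows that pattern; `write_atomic` is a hypothetical helper, not part of the pipeline shown in this article.

```python
import contextlib
import os
import tempfile
from pathlib import Path

def write_atomic(path: Path, text: str) -> None:
    """Write text to a temp file in the same directory, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())   # push the bytes to disk before renaming
        os.replace(tmp_name, path)      # atomic when source and target share a filesystem
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)         # do not leave a stray temp file behind
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "profile_synthesis.json"
    write_atomic(target, '{"summary": "example"}')
    print(target.read_text(encoding="utf-8"))
```

If the process dies mid-write, only the `.tmp` file is affected, so the existence check for `profile_synthesis.json` stays trustworthy on the next run.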

Throughput Comparison with Hosted APIs

The 3,000 tokens/sec figure represents the sustained output throughput measured across the full run. Throughput was logged after every 1,000-candidate batch using generated tokens divided by elapsed time.
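The measurement described above, generated tokens divided by elapsed time, can be sketched as a small helper. `ThroughputMeter` is a hypothetical class; the commented line showing where token counts come from assumes each SGLang output carries a `meta_info["completion_tokens"]` field, which should be checked against the installed version.

```python
import time

class ThroughputMeter:
    """Track generated tokens against wall-clock time."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self.total_tokens = 0

    def record_batch(self, token_counts) -> None:
        """Add the number of generated tokens for each output in a batch."""
        self.total_tokens += sum(token_counts)

    def tokens_per_second(self) -> float:
        elapsed = self._clock() - self._start
        return self.total_tokens / elapsed if elapsed > 0 else 0.0

meter = ThroughputMeter()
# Inside the batch loop (assumed field name):
# meter.record_batch(out["meta_info"]["completion_tokens"] for out in outputs)
meter.record_batch([612, 580, 701])
print(f"{meter.total_tokens} tokens so far")
```

Passing the clock in as a parameter keeps the helper easy to test without waiting for real time to pass.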

Our local setup maintained approximately:

SGLang Local (2× RTX 4090): ~3,000 tok/s sustained
Cerebras (estimated): ~2,000 tok/s peak throughput
Groq (estimated): ~800 tok/s peak throughput

It is important to note that the Cerebras and Groq numbers represent published per-request performance figures on comparable model sizes. Real batch throughput can be lower because large-scale generation workloads are affected by token limits, request throttling, and queue management.

For our workload, the priority was total job completion time rather than individual request latency.

We also observed thermal differences during long-running inference jobs. Under sustained TP=2 execution, GPU 0 occasionally exceeded 88°C, while other workloads such as embeddings and evaluation were executed separately on GPU 1, which remained between 65–71°C.

For multi-hour inference runs, monitoring dual-GPU cooling and airflow becomes important to maintain stable throughput.
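A lightweight way to keep an eye on temperatures during a long run is to poll `nvidia-smi` and flag any card above a threshold. This is a minimal sketch: the helper names and the 85°C limit are assumptions, and `read_temperatures` needs an NVIDIA driver to be installed.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]

def parse_temperatures(csv_text: str) -> dict[int, int]:
    """Turn lines such as '0, 88' into {gpu_index: temperature_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps

def hot_gpus(temps: dict[int, int], limit_c: int = 85) -> list[int]:
    """Return the indices of GPUs at or above the limit (assumed threshold)."""
    return [index for index, temp in sorted(temps.items()) if temp >= limit_c]

def read_temperatures() -> dict[int, int]:
    """Query the driver; requires nvidia-smi on the PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)

sample = "0, 88\n1, 69\n"
print(hot_gpus(parse_temperatures(sample)))  # [0]
```

Calling `read_temperatures()` between batches and logging the result alongside throughput makes it easier to spot thermal throttling after the fact.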

Conclusion

Processing 475K candidate profiles was not just an inference problem; it was a throughput problem.

Hosted APIs work well for interactive applications, but large-scale offline generation introduces different challenges: token costs, rate limits, privacy concerns, recovery handling, and overall job completion time.

Using SGLang’s offline engine, tensor parallelism across two RTX 4090 GPUs, continuous batching, RadixAttention, and reasoning optimisation, we built a pipeline that processed 475K+ profiles, sustained nearly 3,000 tokens/sec, and completed the run in roughly 8 hours without API costs.

More importantly, the project reinforced a simple lesson: when working with large inference workloads, optimising throughput, caching, recovery, and prompt design often delivers bigger gains than focusing only on model speed.

For offline batch inference at this scale, keeping the workload local gave us better control over performance, cost, and data handling while allowing the pipeline to scale efficiently.

Kiruthika

AI/ML Engineer

I'm an AI/ML engineer passionate about developing cutting-edge solutions. I specialize in machine learning techniques to solve complex problems and drive innovation through data-driven insights.
