
We had 475,000 candidate profiles to synthesise for HuntVox, our internal tool. The data came from multiple sources, including LinkedIn, Weekday, resume parsing pipelines, and Lemlist, resulting in duplicate fields, inconsistent formats, and noisy profile information.
Our goal was simple: convert raw profiles into semantic summaries, structured skills, and domain tags that could improve search quality and retrieval.
At this scale, hosted APIs became difficult to justify. Rate limits reduced throughput, costs increased rapidly, and external inference raised privacy concerns for candidate data.
So we ran everything locally using gpt-oss-20b, SGLang’s offline engine, and two RTX 4090 GPUs, reaching nearly 3,000 tokens/sec, processing 475K+ profiles in 8 hours, with zero API cost.
In this guide, we’ll break down the architecture, batching strategy, tensor parallelism setup, prefix KV caching optimisations, and the lessons learned while running large-scale offline inference with SGLang. Let’s dive into it.
The Problem with Processing 475K Candidate Profiles
HuntVox aggregates candidate data from multiple sources, including LinkedIn, Weekday, resume parsing pipelines, and Lemlist. The challenge was that this data was highly inconsistent.
Profiles contained duplicated fields, different date formats, repeated job descriptions, and overlapping information across vendors. Directly embedding this data would have produced poor semantic search results.
We needed an LLM pipeline that could transform every raw profile into three outputs:
- Summary: a 400–600-word semantic profile summary written in dense prose for vector embeddings. It focuses on career progression, technical expertise, leadership scope, and overall experience, without dates, bullet points, or unnecessary formatting.
- Skills: a normalised, de-duplicated list of technologies, frameworks, certifications, and methodologies extracted from the profile. The priority was completeness rather than strict precision.
- Domains: business-vertical tags generated from company_industry mappings in the database. This step did not require an LLM and was handled through Python-based aggregation.
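The domain-tagging step needs no model call. The sketch below is a minimal illustration, assuming each candidate's work history is available as a list of company names. `COMPANY_INDUSTRY` stands in for the database's company_industry mapping, and the function name, sample companies, and `top_k` cutoff are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the company_industry mapping stored in the database.
COMPANY_INDUSTRY = {
    "Acme Pay": "fintech",
    "LedgerWorks": "fintech",
    "MediScan": "healthtech",
}

def domain_tags(companies: list[str], top_k: int = 3) -> list[str]:
    """Return the most frequent industries across a candidate's employers."""
    counts = Counter(
        COMPANY_INDUSTRY[name] for name in companies if name in COMPANY_INDUSTRY
    )
    # most_common orders by frequency; ties keep first-seen order.
    return [industry for industry, _ in counts.most_common(top_k)]

tags = domain_tags(["Acme Pay", "MediScan", "LedgerWorks", "Unknown Co"])
print(tags)
```

Companies missing from the mapping are skipped rather than guessed, which keeps the tags traceable to the database.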
The scale made optimisation critical. With 475,000 profiles, even a small inference inefficiency multiplied quickly: a workflow that was only 10% slower would add close to an hour to an 8-hour run.
Why We Didn't Use Cerebras or Groq
Both are genuinely fast. Groq's LPU delivers ~800 tok/s on Llama 3 70B; Cerebras reaches ~2,000 tok/s on comparable models.
For interactive applications, they're excellent. For offline batch generation at our scale, they have structural problems.
| Dimension | Cerebras / Groq | SGLang Local (2× 4090) |
| --- | --- | --- |
| Effective batch throughput | Rate-limited. Groq free tier: ~6K TPM. Paid tiers still throttle bulk traffic. You spend more time sleeping than generating. | Full hardware throughput, sustained. No ceiling. |
| Cost at 475K profiles | ~$0.80–$1.00 per 1M tokens. 475K × ~2,100 tok avg = ~1B tokens → ~$800–$1,000 total. | Electricity. ~8 hrs × 600W = ~5 kWh ≈ $1. |
| Data privacy | 475K candidate profiles sent to third-party inference servers. PII risk, data residency concerns. | Nothing leaves the machine. |
| Model control | Fixed model catalogue. Can't disable chain-of-thought, can't tune sampling, can't strip internal thinking tokens. | Full control. We disabled thinking entirely, cutting token usage by ~30% per request. |
| Restart safety | Throttle mid-run → complex retry queues, state management, partial-batch bookkeeping. | Idempotent file writes. SSH drops or server reboots → re-run the same command, skip already-done files. |
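The cost row can be reproduced with a few lines of arithmetic. This is a rough sketch using the figures quoted in the table; the electricity tariff of $0.20/kWh is an assumption chosen to match the ~$1 estimate, not a number from the run.

```python
PROFILES = 475_000
AVG_TOKENS_PER_PROFILE = 2_100       # average tokens per profile, from the table
PRICE_PER_M_TOKENS = (0.80, 1.00)    # USD per 1M tokens, hosted-API range from the table

total_tokens = PROFILES * AVG_TOKENS_PER_PROFILE
api_cost = tuple(total_tokens / 1e6 * price for price in PRICE_PER_M_TOKENS)

RUN_HOURS = 8
POWER_KW = 0.6                       # ~600 W draw, from the table
PRICE_PER_KWH = 0.20                 # assumed tariff, not from the source
energy_kwh = RUN_HOURS * POWER_KW
local_cost = energy_kwh * PRICE_PER_KWH

print(f"{total_tokens:,} tokens")
print(f"hosted: ${api_cost[0]:,.0f} to ${api_cost[1]:,.0f}")
print(f"local: {energy_kwh:.1f} kWh, about ${local_cost:.2f}")
```

The estimate covers electricity only and leaves out hardware or rental cost.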
The fundamental mismatch: Hosted APIs optimise for single-request latency. Offline batch inference is the opposite problem: you have hundreds of thousands of prompts, and you care about total throughput, not how fast one response arrives.
SGLang's offline engine is purpose-built for this; it fills the GPU with a batch and maximises tokens-per-second across all requests simultaneously.
How We Processed 475K Profiles with SGLang
SGLang ships two modes: an online server (OpenAI-compatible API) and an offline engine for batch generation. We used the offline engine; you hand it a list of prompts, and it returns a list of outputs. No HTTP, no serialisation overhead, no connection pooling.
SGLang Processing Flow:
- 1,000 prompts, one batch: profile.md → chat template → token sequences
- Batch Scheduler & Continuous Batching: as sequences finish, new ones are inserted immediately, so no slot sits idle on padding
- RadixAttention (prefix KV cache): ~350-token system prompt computed once, reused across all 1,000 prompts (~25% saved)
- GPU 0 (RTX 4090, one half of every layer's weights, 24 GB · 1,008 GB/s) ↔ NCCL AllReduce ↔ GPU 1 (RTX 4090, the other half, 24 GB · 1,008 GB/s)
- KV Cache Manager: mem_fraction_static=0.88 · ~42 GB static budget (model weights plus KV cache pool) across both GPUs
- outputs[] → profile_synthesis.json: atomic file write per candidate · idempotent skip on restart
Tensor Parallelism (TP=2)
The gpt-oss-20b model requires roughly 40 GB of memory in bf16 precision, which exceeds the 24 GB VRAM available on a single RTX 4090.
To run the model locally, we used tensor parallelism (tp_size=2). SGLang sharded the model across both GPUs, with each card holding a slice of every layer's attention and MLP weights.
During inference, both GPUs worked on every request simultaneously and synchronised their partial results using NCCL AllReduce, giving us an effective 48 GB VRAM pool and the combined memory bandwidth of both cards for large batch generation.
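To make the sharding concrete, here is a toy sketch of tensor parallelism on a two-layer MLP in plain Python. It is an illustration of the general technique, not SGLang's implementation: the first projection is split by columns, the second by rows, and summing the two partial outputs plays the role of the all-reduce. The matrices and helper names are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy MLP: y = relu(x @ W1) @ W2, with integer weights so results compare exactly.
x = [[1, -2, 3]]                      # 1 token, hidden size 3
W1 = [[1, 0, -1, 2],
      [0, 1, 2, -1],
      [1, 1, 0, 0]]                   # 3 x 4
W2 = [[1, 2, 0],
      [0, 1, 1],
      [2, 0, 1],
      [1, 1, 1]]                      # 4 x 3

reference = matmul(relu(matmul(x, W1)), W2)

# Each "GPU" holds half the columns of W1 and the matching rows of W2.
shards = []
for cols in (slice(0, 2), slice(2, 4)):
    W1_part = [row[cols] for row in W1]   # column-parallel first projection
    W2_part = W2[cols]                    # row-parallel second projection
    shards.append(matmul(relu(matmul(x, W1_part)), W2_part))

# The all-reduce step: sum the partial outputs from both devices.
combined = add(shards[0], shards[1])
assert combined == reference
print(combined)
```

Because the activation is applied element-wise, each device can run its half of the MLP independently and only one synchronisation is needed per block.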
Continuous Batching
Traditional batching waits for the longest sequence in the batch to finish before moving forward. Faster sequences remain idle while the GPU waits, reducing utilisation.
SGLang uses continuous batching instead.
As individual requests are completed, new prompts are inserted immediately into the active batch. This keeps GPU resources occupied throughout execution and avoids idle time caused by uneven sequence lengths.
For large-scale inference runs such as 475K profile generation, this helped maintain consistently high GPU utilisation across the entire pipeline.
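The difference between the two scheduling styles shows up even in a toy simulation. The sketch below counts decode steps for a queue of requests with uneven output lengths. It assumes every decode step costs the same regardless of how many slots are busy, which is a simplification, and the request lengths are invented for illustration.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """A finished sequence frees its slot for the next queued request."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [900, 100, 100, 100, 900, 100, 100, 100]
print(static_batching_steps(lengths, slots=4))      # 1800
print(continuous_batching_steps(lengths, slots=4))  # 1000
```

In this example the second long request starts as soon as a short one finishes, instead of waiting for the first batch to drain.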
RadixAttention - the KV Cache Win
Every one of the 475K profile generation requests used the same ~350-token system prompt.
Normally, the model would recompute those tokens for every request, creating unnecessary input overhead.
SGLang’s RadixAttention avoids this by identifying shared prompt prefixes, computing the KV cache once, and reusing it across the entire batch.
With 1,000 prompts per batch, this removed:
350 × 1,000 = 350,000 token computations per batch
This reduced repeated input processing by roughly 25%, allowing the GPUs to spend more time generating outputs instead of recomputing identical prompt tokens.
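Prefix reuse can be illustrated with a small token-level trie. This is a simplified sketch of the idea, not SGLang's RadixAttention data structure: it only counts how many leading tokens of each prompt were already seen, using integer stand-ins for the system prompt and per-profile tokens.

```python
class PrefixCache:
    """Token-level trie that reports how much of a prompt is already cached."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return the number of leading tokens already cached, then cache the rest."""
        node, hits = self.root, 0
        for tok in tokens:
            if tok in node:
                hits += 1
            else:
                # After the first miss every later node is new, so hits stop growing.
                node[tok] = {}
            node = node[tok]
        return hits

SYSTEM = list(range(350))                # stand-in for the ~350-token system prompt
cache = PrefixCache()
recomputed = 0
for i in range(1000):
    profile = [10_000 + i, 20_000 + i]   # stand-in profile tokens, unique per request
    prompt = SYSTEM + profile
    recomputed += len(prompt) - cache.match_and_insert(prompt)

total = 1000 * len(SYSTEM + [0, 0])
print(total - recomputed)                # prompt tokens served from the cache
```

The first request still pays for the system prompt, so the exact saving per batch is 350 × 999 tokens, which the 350 × 1,000 figure rounds up.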
Why We Disabled Chain-of-Thought for Profile Generation
gpt-oss-20b is OpenAI's open-weight reasoning model. By default, it generates an internal reasoning trail before producing the final response.
For tasks such as profile generation and structured extraction, this extra reasoning was unnecessary. Our workload only required outputs like profile summaries and skill extraction, not multi-step problem solving.
The reasoning phase added roughly 200–400 extra tokens per request, increasing inference cost without improving output quality.
We disabled reasoning directly in the prompt template using:

    from transformers import AutoTokenizer

    # Assumed setup: the tokenizer shipped with the model provides the chat template
    tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

    # Disable thinking at prompt construction time
    prompt = tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": profile_md}],
        tokenize=False,
        add_generation_prompt=True,
        reasoning_effort="none",  # no internal scratchpad
    )

    # Defensive stripper for any residual thinking tokens
    def strip_thinking(text: str) -> str:
        # Keep only what follows the final-answer marker, if one is present
        if "assistantfinal" in text:
            return text.split("assistantfinal", 1)[1].strip()
        return text.strip()

This optimisation reduced token usage by roughly 30% per profile and improved overall throughput.
Instead of spending compute on internal reasoning traces, the model focused entirely on generating the final output.
How the Inference Pipeline Worked
The entire generation core is straightforward. SGLang's offline engine takes a list of prompt strings and returns a list of outputs; all the scheduling complexity is internal.

    import sglang as sgl
    from json_repair import repair_json  # assumed source of repair_json (json-repair package)

    def chunks(items, size):
        """Yield consecutive slices of at most `size` items."""
        for start in range(0, len(items), size):
            yield items[start:start + size]

    # Load once; the model stays in VRAM for the entire run
    llm = sgl.Engine(
        model_path="openai/gpt-oss-20b",
        tp_size=2,                 # shard across both 4090s
        mem_fraction_static=0.88,  # 88% of VRAM for model weights plus KV cache pool
    )

    sampling_params = {
        "max_new_tokens": 900,
        "temperature": 0.2,
        "repetition_penalty": 1.05,
    }

    # Batch loop: 1,000 candidates at a time
    for batch in chunks(pending, 1000):
        prompts = [build_prompt(folder) for folder in batch]
        # Single call; the engine schedules all 1,000 in parallel internally
        outputs = llm.generate(prompts, sampling_params)
        for folder, out in zip(batch, outputs):
            text = strip_thinking(out["text"])
            result = repair_json(text)  # graceful parse of slightly malformed JSON
            (folder / "profile_synthesis.json").write_text(result)

    llm.shutdown()

Idempotency and Failure Recovery
The full inference run took more than 8 hours. During execution, the GPU server restarted at around 446K processed candidates because of provider maintenance.
To avoid reprocessing completed work, every candidate output was written as an individual file. At startup, the script checked whether profile_synthesis.json already existed and skipped completed entries.
This allowed us to rerun the same command and continue exactly from the last processed candidate without duplicate generation.

    from pathlib import Path

    def find_pending(base: Path, force: bool) -> list[Path]:
        pending = []  # candidate folders that still need a synthesis file
        for entry in base.iterdir():
            if not (entry / "profile.md").exists():
                continue  # no source profile, nothing to generate
            if (entry / "profile_synthesis.json").exists() and not force:
                continue  # already done, skip
            pending.append(entry)
        return pending

This recovery mechanism made the pipeline idempotent. Server restarts, SSH disconnects, or interrupted runs only required restarting the job instead of reprocessing all 475K profiles.
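Because the skip check treats any existing profile_synthesis.json as finished, a write that is interrupted halfway could leave a truncated file that is never regenerated. One common way to guard against that is to write to a temporary file and rename it into place. The helper below is a sketch of that pattern; `write_atomic` is a hypothetical name, not a function from the original pipeline.

```python
import os
import tempfile
from pathlib import Path

def write_atomic(target: Path, text: str) -> None:
    """Write text so readers see either no file or the complete new one."""
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)   # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise

with tempfile.TemporaryDirectory() as workdir:
    out = Path(workdir) / "profile_synthesis.json"
    write_atomic(out, '{"summary": "example"}')
    print(out.read_text(encoding="utf-8"))
```

With this in place, the existence of the output file implies the write completed, which is the property the restart logic relies on.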
Throughput Comparison with Hosted APIs
The 3,000 tokens/sec figure represents the sustained output throughput measured across the full run. Throughput was logged after every 1,000-candidate batch using generated tokens divided by elapsed time.
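A sustained-throughput figure of this kind can be tracked with a small helper. The sketch below is illustrative only: the `ThroughputMeter` class is hypothetical, and in a real run the per-output token counts would come from whatever the engine reports for each generation. The injectable clock exists so the example is deterministic.

```python
import time

class ThroughputMeter:
    """Tracks generated tokens against wall-clock time across batches."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._tokens = 0

    def record_batch(self, token_counts):
        """Add one batch's generated-token counts and return sustained tok/s."""
        self._tokens += sum(token_counts)
        elapsed = self._clock() - self._start
        return self._tokens / elapsed if elapsed > 0 else 0.0

# Fake clock: the run starts at t=0 and the batches finish at t=10 s and t=20 s.
ticks = iter([0.0, 10.0, 20.0])
meter = ThroughputMeter(clock=lambda: next(ticks))
print(meter.record_batch([300, 300]))  # 600 tokens after 10 s -> 60.0
print(meter.record_batch([900]))       # 1,500 tokens after 20 s -> 75.0
```

Dividing cumulative tokens by cumulative time gives the sustained rate, which is less flattering but more useful than a per-batch peak.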
The figures compare as follows:
- SGLang Local (2× RTX 4090): ~3,000 tok/s sustained
- Cerebras (estimated): ~2,000 tok/s peak throughput
- Groq (estimated): ~800 tok/s peak throughput
Note that the Cerebras and Groq numbers are published per-request performance figures on comparable model sizes, not measurements from our workload. Real batch throughput can be lower because large-scale generation workloads are affected by token limits, request throttling, and queue management.
For our workload, the priority was total job completion time rather than individual request latency.
We also observed thermal differences during long-running inference jobs. Under sustained TP=2 execution, GPU 0 occasionally exceeded 88°C, while other workloads such as embeddings and evaluation were executed separately on GPU 1, which remained between 65–71°C.
For multi-hour inference runs, monitoring dual-GPU cooling and airflow becomes important to maintain stable throughput.
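A lightweight way to watch temperatures during a long run is to poll nvidia-smi and flag hot cards. The sketch below assumes nvidia-smi is on the PATH; the helper names and the 85°C alert threshold are illustrative choices, not values from the original setup. The parsing is kept separate from the driver call so it can be checked without a GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]

def parse_temperatures(csv_text: str) -> dict[int, int]:
    """Parse 'index, temperature' lines into {gpu_index: degrees_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps

def hot_gpus(temps: dict[int, int], limit_c: int = 85) -> list[int]:
    """Return the GPU indices at or above the alert threshold."""
    return sorted(i for i, t in temps.items() if t >= limit_c)

def read_temperatures() -> dict[int, int]:
    """Query the driver; requires nvidia-smi on PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)

# Sample output shaped like the readings described above.
sample = "0, 88\n1, 69\n"
print(hot_gpus(parse_temperatures(sample)))  # [0]
```

Calling `read_temperatures()` between batches would be enough to log a warning or pause the run when a card stays above the limit.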
Conclusion
Processing 475K candidate profiles was not just an inference problem; it was a throughput problem.
Hosted APIs work well for interactive applications, but large-scale offline generation introduces different challenges: token costs, rate limits, privacy concerns, recovery handling, and overall job completion time.
Using SGLang’s offline engine, tensor parallelism across two RTX 4090 GPUs, continuous batching, RadixAttention, and reasoning optimisation, we built a pipeline that processed 475K+ profiles, sustained nearly 3,000 tokens/sec, and completed the run in roughly 8 hours without API costs.
More importantly, the project reinforced a simple lesson: when working with large inference workloads, optimising throughput, caching, recovery, and prompt design often delivers bigger gains than focusing only on model speed.
For offline batch inference at this scale, keeping the workload local gave us better control over performance, cost, and data handling while allowing the pipeline to scale efficiently.



