
LLMs feel almost magical when they produce text that reads like it was written by a human. When I first started working with them, that fluency made it easy to forget that every word is still driven by probabilities and tokens under the hood. Over time, especially while tuning models for real applications, I realised that understanding those mechanics is what separates “good enough” outputs from truly reliable ones.
As I experimented with different models and prompts, I kept running into the same question: why does the same prompt behave so differently across runs? The answer almost always came down to sampling parameters: temperature, top_p (nucleus sampling), and top_k. They are called sampling parameters because they directly shape how the model picks each next token, and learning to tune them gave me control over that randomness instead of leaving it to chance.
Temperature controls how random or creative an LLM’s output becomes, and it’s usually the first parameter I adjust when outputs feel either too rigid or too chaotic.
- Top-p limits token selection based on cumulative probability
- Top-k restricts the model to a fixed number of likely tokens
- Use top-p *or* top-k, not both, and tune settings based on the task
Temperature controls the randomness of a language model’s output. A lower temperature makes responses more predictable and consistent, while a higher temperature increases creativity and variation. Most settings fall between 0.0 and 1.0, though many APIs accept values up to 2.0, and the parameter is the main lever for trading off creativity against determinism in text generation.
Top-p (nucleus sampling) controls how much probability mass the model considers when choosing the next token. With typical values between 0.8 and 0.95, top-p helps generate natural and diverse responses while filtering out low-probability tokens.
Top-k limits the model to selecting from a fixed number of the most likely tokens, usually between 20 and 100. This method is useful for restricting unlikely outputs and maintaining tighter control over response quality.
In practice, temperature is tuned together with either top-p or top-k based on the use case. Creative tasks often use temperature with top-p, while structured or deterministic tasks benefit from temperature combined with top-k.

(Range: 0 to 2 in practice)
Temperature is the parameter I rely on most when I need to control how predictable or creative a model feels. In practical work, it’s often the difference between outputs that sound mechanically correct and ones that actually feel usable. It directly scales how willing the model is to move away from its most likely next token. One caveat before looking at how it works: sampling happens over tokens, the units the model breaks text into before any sampling occurs. Understanding tokenisation helps you see what those units actually are, which makes temperature and the other settings easier to reason about.
Low Temperature (e.g., 0.2–0.4):
At low temperatures, the model becomes more deterministic: it prefers high-probability tokens, generating coherent and predictable text.
I consistently use low temperatures for tasks where accuracy matters more than expression, such as technical documentation or factual explanations. In these cases, predictability is a feature, not a limitation.
High Temperature (e.g., 0.7–1.0+):
At higher temperatures, the model introduces more randomness. Less likely tokens get a real chance to appear, making the text more diverse and creative.
However, this can lead to occasional "hallucinations" or off-topic responses. When I’m brainstorming ideas or exploring creative directions, higher temperatures help the model break out of safe, repetitive patterns. The trade-off is that I have to stay alert for hallucinations, which is something I’ve learned to expect rather than avoid entirely.
Note: Set the temperature to 0.0 (effectively greedy decoding) if you want fully deterministic results.
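To make this concrete, here’s a minimal sketch in plain Python (the function name is my own, not a library API) of how temperature reshapes a probability distribution before sampling: the logits are divided by the temperature, then passed through a softmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax them into probabilities."""
    if temperature <= 0:
        # Temperature 0 degenerates to greedy decoding: all mass on the argmax.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
print(softmax_with_temperature(logits, 0.3))  # sharply peaked on the top token
print(softmax_with_temperature(logits, 1.5))  # much flatter distribution
```

Low temperature concentrates probability mass on the top token; high temperature flattens the distribution so less likely tokens get sampled more often.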
Imagine a language model that must accurately complete a sentence while maintaining coherence with the initial prompt.
Input sentence: "I saw him walking into the principal’s office with his parents..."(using Qwen/Qwen2.5-7B-Instruct)
At a low temperature (0.3):
Generated Output: It was my best friend, Alex. As you watched from your hiding spot behind the janitorial closet door, Alex looked nervous but determined as he knocked softly before entering without waiting for permission.
His parents stood beside him, their expressions a mix of concern and pride. The secretary glanced up briefly, then returned her attention to typing away at her computer screen.
Alex took a deep breath and began speaking earnestly about something that had happened between classes - how he'd accidentally broken one of Ms. Johnson's favourite antique vases during cleaning duty.
At a high temperature (0.8):
Generated Output: It was my best friend! As Sarah watched her best friend Michael walk into the school principal's_office_with_his_parents through the glass doors of their high school, her heart sank like a stone.
She had only just texted him earlier that morning to confirm plans for lunch after last period, and now he looked so serious she couldn't even make eye contact as they entered together.
The scene unfolding behind them seemed ominous - Mr. Thompson stood at his desk looking sternly up from some papers spread out before him, Mrs. Johnson fidgeting nervously beside him, with Michael Jr.
Notice how the high temperature provides a more imaginative, varied completion.
Other parameters: {"model_name": "Qwen/Qwen2.5-7B-Instruct", "top_p": 0.9, "top_k": 50, "max_length": 200, "repetition_penalty": 1.2, "no_repeat_ngram_size": 3}
Understanding temperature matters whenever you need to strike a balance between controlled and imaginative outputs.
(Range: 0 to 1)
Top_p, also known as nucleus sampling, became my preferred decoding strategy once I started working on longer and more open-ended generations. Instead of forcing a fixed number of options, it dynamically adapts to how confident the model is at each step.
Low Top_p (e.g., 0.3–0.5): The model only considers a few very high-probability tokens, leading to focused and coherent text but with less diversity.
High Top_p (e.g., 0.9–0.95): A broader range of tokens is considered, which can result in richer and more varied responses.
For a given prediction, tokens are sorted by probability. The model then adds tokens until the total probability is at least p.
Only these tokens form the “nucleus” from which the next word is sampled. Because top_p adapts to the probability distribution itself, I’ve found it especially effective for creative or conversational tasks where the “right” number of choices changes from sentence to sentence.
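The sort-and-accumulate step above can be sketched in a few lines of plain Python (the function name is mine, not a library API):

```python
def nucleus_candidates(probs, p):
    """Return token indices in the smallest set whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return nucleus

probs = [0.55, 0.25, 0.10, 0.06, 0.04]  # hypothetical next-token probabilities
print(nucleus_candidates(probs, 0.9))   # [0, 1, 2]: the first three cover 0.90
```

The next token is then sampled only from this nucleus (after renormalising its probabilities), so a confident model yields a tiny candidate set while an uncertain one yields a large set.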
(Range: 1 up to the vocabulary size)
Top_k sampling is the most straightforward strategy I’ve used: it simply caps the number of tokens the model is allowed to consider. When I need tighter control and fewer surprises, this predictability is exactly what I want.
Low Top_k (e.g., 5–10): The model is restricted to a very small set of tokens, making the output more consistent and predictable. This is useful for tasks where precision is critical, such as generating code or formal documents.
High Top_k (e.g., 50–100): More tokens are considered, allowing for a broader and sometimes more creative output. However, if the threshold is set too high, it may admit less relevant tokens.
For the prompt: "The capital of France is ..."
With top_k = 5: The model might reliably output: "Paris."
With top_k = 50: There’s more room for variation, which might be useful in a creative writing context but less so for factual answers.
In practice, top_k gives me peace of mind for structured outputs. By limiting choices early, it reduces the chance of the model drifting into unlikely or nonsensical territory.
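Top-k filtering is even simpler to sketch. Assuming a plain list of probabilities (names and values here are illustrative), it just keeps the k highest-probability indices:

```python
def top_k_candidates(probs, k):
    """Return the indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

probs = [0.05, 0.50, 0.30, 0.10, 0.05]  # hypothetical next-token probabilities
print(top_k_candidates(probs, 2))       # [1, 2]: only the two likeliest survive
```

Unlike top-p, the candidate set is always exactly k tokens, no matter how the probability mass is distributed.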
These two are often confused, and I struggled with the difference myself until I started thinking about them in everyday terms. The lunch analogy helped me internalise when each strategy makes sense and why their behaviour feels so different in practice. With top‑k sampling, it's like a fixed menu where you always see exactly, say, five dish options, regardless of how popular or varied they are. No matter the day or how tastes change, you only choose from those five predetermined dishes.
With top‑p sampling, it's more like a dynamic buffet. Instead of a fixed number of options, you choose from all the dishes that together account for, say, 90% of what people typically order. On a day when a couple of dishes are extremely popular, your choices might be limited to just three items.
But on another day, if the popularity is spread out more evenly, you might see seven or eight dishes to pick from. This way, the number of options adapts to the situation, sometimes more, sometimes fewer, based on the overall likelihood of the dishes being chosen.
In summary, top-k always gives you a fixed set of choices, while top-p adjusts the choices dynamically depending on how the probabilities add up, much like a buffet that adapts to customer preferences on any given day. Both approaches are widely used LLM decoding strategies for managing output diversity and coherence.
After testing these parameters across different tasks, one thing became very clear to me: there is no single “best” configuration. What works brilliantly for brainstorming can fail badly for QA or code generation. The right values depend on whether you need accuracy, creativity, or a balance between the two. The table below shows commonly used settings that work well across real-world applications.
| Use case | Temperature | Top-p | Top-k |
|--------|-------------|-------|-------|
| Factual answers & QA | 0.2–0.3 | 0.8–0.9 | 20–40 |
| Chatbots & assistants | 0.5–0.7 | 0.9 | 40–60 |
| Creative writing | 0.8–1.0 | 0.95 | 50–100 |
| Code generation | 0.2–0.4 | 0.8 | 20–50 |
| Brainstorming ideas | 0.7–0.9 | 0.9–0.95 | 50–80 |
In most cases, it’s recommended to tune **temperature together with either top-p or top-k**, not both. Start with conservative values, evaluate the output, and adjust gradually based on the task.
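One way to keep these starting points handy is a small presets dictionary. The names and exact values below are my own defaults mirroring the table above, not a standard API; note that each preset pairs temperature with only one of top-p or top-k:

```python
# Starting points only; tune per task. Each entry uses top_p OR top_k, not both.
PRESETS = {
    "factual_qa":    {"temperature": 0.2, "top_p": 0.85},
    "chatbot":       {"temperature": 0.6, "top_p": 0.9},
    "creative":      {"temperature": 0.9, "top_p": 0.95},
    "code":          {"temperature": 0.3, "top_k": 30},
    "brainstorming": {"temperature": 0.8, "top_p": 0.95},
}

def generation_kwargs(use_case):
    """Build keyword arguments for a sampling-based generate() call."""
    return {"do_sample": True, **PRESETS[use_case]}

print(generation_kwargs("code"))  # {'do_sample': True, 'temperature': 0.3, 'top_k': 30}
```

These dictionaries can be unpacked directly into a Hugging Face `model.generate(**generation_kwargs("code"), ...)` call.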
A lot of “bad outputs” come from perfectly normal settings used in the wrong context. Matching the parameters to the task avoids the most common mistakes.
Factual or Technical Content: Use a low temperature (e.g., 0.2–0.4) with a low top_p or low top_k to ensure high accuracy and consistency.
Creative Writing and Brainstorming: Opt for a high temperature (e.g., 0.7–1.0) and a high top_p (e.g., 0.9–0.95) to unlock a broader spectrum of ideas while maintaining reasonable coherence.
Chatbots and Conversational Agents: A balanced approach (medium temperature around 0.5–0.7, with a moderate top_p and top_k) can provide engaging and natural-sounding responses without veering off-topic.
What ultimately helped me understand these parameters wasn’t theory alone, but experimentation. Small, controlled changes made the effects obvious and repeatable.
Adjust One at a Time: Tweak temperature or top_p independently to see their individual effects.
Mix and Match: Combine temperature with top_p or top_k settings to find the optimal balance for your specific task.
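You don’t even need a full model to see the effect of changing one knob at a time. A toy experiment (all logits here are made up) that samples repeatedly from the same distribution at different temperatures shows how diversity rises as temperature does:

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one token index from temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [3.0, 1.5, 1.0, 0.5]  # hypothetical next-token logits
rng = random.Random(0)         # fixed seed for repeatability
for t in (0.2, 0.7, 1.2):
    draws = [sample_token(logits, t, rng) for _ in range(1000)]
    share = draws.count(0) / 1000
    print(f"temperature={t}: top token chosen {share:.0%} of the time")
```

At 0.2 the top token dominates almost completely; by 1.2 a substantial share of samples comes from the alternatives.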
Want to experiment firsthand with these parameters? You can clone our GitHub repository and use a simple UI to tweak the settings for different models. It’s a fun and hands-on way to see how temperature, top_p, and top_k influence the text generation results.
Install required libraries
pip install transformers torch
Main code
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load a modern instruction-tuned LLM
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt_text = "Once upon a time, in a land far, far away,"
input_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
temperature = 0.7
top_p = 0.9
top_k = 50
output = model.generate(
    input_ids,
    max_length=150,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    do_sample=True,  # sampling must be enabled for temperature/top_p/top_k to apply
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Temperature controls how random or deterministic an LLM’s output is. Lower values produce more predictable responses, while higher values allow more creative and varied text.
For most use cases, a temperature between 0.5 and 0.7 offers a good balance between coherence and creativity. Factual tasks usually work better with lower values.
Top-p sampling limits token selection to the smallest group of words whose combined probability meets a defined threshold, helping maintain natural yet controlled outputs.
Top-p is preferred when you want responses to adapt dynamically to context. It works well for chatbots, assistants, and general text generation.
Top-k sampling restricts generation to a fixed number of the most likely tokens. It is useful when precision and consistency matter, such as in code or structured text generation.
In most cases, top-p and top-k should not be combined. It’s generally better to use either one, along with temperature, to avoid over-restricting the model’s output.
Hallucinations are not controlled by temperature alone. They can also be influenced by prompt quality, model limitations, or missing constraints in the input.
Code generation usually performs best with a low temperature (0.2–0.4) and conservative sampling settings to reduce randomness and improve correctness.
The concepts are consistent, but ideal values vary by model. Different LLMs respond differently, so testing and iteration are always recommended, even for commonly used tools like ChatGPT, where default temperature, top-p, and top-k values may differ.
Start with recommended defaults, adjust one parameter at a time, and evaluate results based on accuracy, coherence, and creativity for your specific task.
By this point, you should have a practical understanding of how temperature, top_p, and top_k shape an LLM’s behaviour. These parameters aren’t about finding a perfect formula, but about choosing the right trade-off for the task in front of you, something I still evaluate every time I deploy a new workflow. The goal is to strike a balance between creativity, coherence, and consistency in the generated text.
If it’s unclear how to adjust them for your purpose, experiment with the Gradio interface in our GitHub repo for a hands-on feel.
One size does NOT fit all! Try these settings out to achieve just the output you require: creative, fact-based, deterministic, or any combination thereof.