LLMs generate text that reads like human writing by predicting, one step at a time, which token comes next. Behind the curtain it's all a game of probabilities over tokens.
In this post, we'll take control of that process by tuning three key sampling parameters: temperature, top_p (nucleus sampling), and top_k. Let's dig in!
Range: 0 to 2 (in practice)
Temperature controls how much randomness goes into the output. One note before we dig in: before any sampling happens, the model breaks text into tokens, the units it actually predicts over. Understanding tokenization makes temperature and the other settings easier to reason about. Here's how temperature works:
At low temperatures, the model behaves more deterministically: it strongly prefers high-probability tokens, producing coherent, predictable text.
This works well for applications that demand consistency, e.g., technical documentation or factual descriptions.
At higher temperatures, the model introduces more randomness. Less likely tokens get a chance to appear, making the text more diverse and creative.
However, this can lead to occasional "hallucinations" or off-topic responses. Creative storytelling or brainstorming tasks benefit from higher temperatures.
Note: Set temperature to 0.0 if you want fully deterministic (effectively greedy) results.
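Under the hood, temperature simply divides the model's raw logits before the softmax. Here is a minimal sketch of that scaling, assuming NumPy; the toy logits are made up purely for illustration:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/T, then normalize into probabilities.
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = np.array(logits) / max(temperature, 1e-8)  # guard against division by zero at T = 0
    exp = np.exp(scaled - scaled.max())                 # subtract max for numerical stability
    return exp / exp.sum()

# Toy logits for four candidate tokens (illustrative values only)
logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_with_temperature(logits, 0.2))  # sharply peaked: near-deterministic choice
print(softmax_with_temperature(logits, 1.5))  # much flatter: more random choices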
Imagine a language model that must accurately complete a sentence while maintaining coherence with the initial prompt.
Input sentence: "I saw him walking into the principal’s office with his parents..." (using Qwen/Qwen2.5-7B-Instruct)
Generated Output (lower temperature): It was my best friend, Alex. As you watched from your hiding spot behind the janitorial closet door, Alex looked nervous but determined as he knocked softly before entering without waiting for permission.
His parents stood beside him, their expressions a mix of concern and pride. The secretary glanced up briefly, then returned her attention to typing away at her computer screen.
Alex took a deep breath and began speaking earnestly about something that had happened between classes - how he'd accidentally broken one of Ms. Johnson's favourite antique vases during cleaning duty.
Generated Output (higher temperature): It was my best friend! As Sarah watched her best friend Michael walk into the school principal's office with his parents through the glass doors of their high school, her heart sank like a stone.
She had only just texted him earlier that morning to confirm plans for lunch after last period, and now he looked so serious she couldn't even make eye contact as they entered together.
The scene unfolding behind them seemed ominous - Mr. Thompson stood at his desk looking sternly up from some papers spread out before him, Mrs. Johnson fidgeting nervously beside him, with Michael Jr.
Notice how the high temperature provides a more imaginative, varied completion.
Other parameters: {"model_name": "Qwen/Qwen2.5-7B-Instruct", "top_p": 0.9, "top_k": 50, "max_length": 200, "repetition_penalty": 1.2, "no_repeat_ngram_size": 3}
Understanding temperature matters whenever you need to strike a balance between controlled and imaginative outputs.
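If you'd like to reproduce a comparison like the one above, a sketch along these lines should work with the transformers pipeline API. The two temperature values here are illustrative (the article doesn't state the exact ones used), and the instruct model needs a fair amount of GPU memory; swap in a smaller model if needed:

from transformers import pipeline

# Load the instruct model used in the example above
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

prompt = "I saw him walking into the principal's office with his parents..."

# Same prompt, two temperature settings; the other knobs match the listed parameters
for temp in (0.2, 1.5):  # illustrative low/high values
    result = generator(
        prompt,
        do_sample=True,
        temperature=temp,
        top_p=0.9,
        top_k=50,
        max_length=200,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
    )
    print(f"--- temperature={temp} ---")
    print(result[0]["generated_text"])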
(Range: 0 to 1)
Top_p sampling, also called nucleus sampling, restricts generation to the smallest set of tokens whose combined probability reaches a specified threshold (p).
For a given prediction, tokens are sorted by probability, from most to least likely, and added to the set until their cumulative probability is at least p.
Only these tokens form the “nucleus” from which the next word is sampled. This dynamic approach adapts to the context, which is why top_p is often preferred in creative applications.
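The filtering step itself is easy to sketch. Below is a minimal NumPy illustration of nucleus filtering; the probabilities are toy values, not real model output:

import numpy as np

def nucleus_filter(probs, p):
    # Sort token probabilities from most to least likely.
    order = np.argsort(probs)[::-1]
    sorted_probs = np.array(probs)[order]
    # Keep the smallest prefix whose cumulative probability reaches p.
    cutoff = np.searchsorted(np.cumsum(sorted_probs), p) + 1
    kept = order[:cutoff]
    # Renormalize the kept probabilities; sampling happens only inside this "nucleus".
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return kept, kept_probs

# Toy distribution over five candidate tokens (illustrative values)
probs = [0.50, 0.30, 0.15, 0.04, 0.01]
kept, kept_probs = nucleus_filter(probs, p=0.9)
print(kept)        # indices in the nucleus: the top 3 tokens, since 0.50 + 0.30 + 0.15 >= 0.9
print(kept_probs)  # renormalized probabilities used for sampling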
(Range: 1 up to the vocabulary size)
Top_k sampling limits the model’s choices to the top k most likely tokens at each generation step.
Take the prompt: "The capital of France is ...". With, say, top_k = 5, only the five most likely next tokens remain candidates; everything else is discarded.
Top_k is straightforward: capping the number of choices keeps very unlikely (and often nonsensical) tokens out of the output.
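Here's a small sketch of the same idea, using the GPT-2 model from the walkthrough below to list the top k candidates for that prompt. Which tokens actually surface, and with what probabilities, depends on the model:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Score the next token for the example prompt
input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # logits for the next-token position
probs = torch.softmax(logits, dim=-1)

# Keep only the k most likely tokens; everything else is excluded from sampling
k = 5
top_probs, top_ids = torch.topk(probs, k)
for p, tok_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(tok_id))!r}: {p.item():.3f}")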
Top_k and top_p are easy to confuse, so imagine you're ordering lunch. With top‑k sampling, it's like a fixed menu where you always see exactly, say, five dish options, regardless of how popular or varied they are. No matter the day or how tastes change, you only choose from those five predetermined dishes.
With top‑p sampling, it's more like a dynamic buffet. Instead of a fixed number of options, you choose from all the dishes that together account for, say, 90% of what people typically order. On a day when a couple of dishes are extremely popular, your choices might be limited to just three items.
But on another day, if the popularity is spread out more evenly, you might see seven or eight dishes to pick from. This way, the number of options adapts to the situation, sometimes more, sometimes fewer, based on the overall likelihood of the dishes being chosen.
In summary, top‑k always gives you a fixed set of choices, while top‑p adjusts the choices dynamically depending on how the probabilities add up, much like a buffet that adapts to customer preferences on any given day.
The key to mastering these parameters is experimentation:
Want to experiment firsthand with these parameters? You can clone our GitHub repository and use a simple UI to tweak the settings for different models. It's a fun, hands-on way to see how temperature, top_p, and top_k influence the generated text.
Install required libraries
pip install transformers torch
Main code
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load a pre-trained GPT model and tokenizer
model_name = "gpt2" # swap in another model if you like, but make sure the tokenizer, model class, and imports match
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
# Encode a prompt text to get input_ids
prompt_text = "Once upon a time, in a land far, far away,"
input_ids = tokenizer.encode(prompt_text, return_tensors='pt')
# Set your parameters for temperature, top_p, and top_k
temperature = 0.7 # Controls creativity: higher is more creative, lower is more deterministic
top_p = 0.9 # Nucleus sampling: top_p controls the cumulative probability threshold
top_k = 50 # Top-K sampling: limits choices to top K most likely tokens
# Generate text using the model with the specified parameters
output = model.generate(
    input_ids,
    max_length=150,            # Max length of generated text (prompt + completion)
    do_sample=True,            # Enable sampling; temperature/top_p/top_k are ignored under greedy decoding
    temperature=temperature,   # Adjust temperature for creativity
    top_p=top_p,               # Apply top_p (nucleus) sampling
    top_k=top_k,               # Apply top_k sampling
    num_return_sequences=1,    # Number of sequences to generate
    no_repeat_ngram_size=2,    # Prevent repeating n-grams for more natural output
    pad_token_id=tokenizer.eos_token_id  # Pad with the EOS token (GPT-2 has no pad token)
)
# Decode the generated text and print the result
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
At this point, you should understand how temperature, top_p, and top_k let you strike a balance between creativity, coherence, and consistency in AI-generated text.
If you're still unsure how to adjust them for your purpose, experiment with the Gradio interface in our GitHub repo for a hands-on feel.
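If you'd rather wire up a quick local stand-in, here is a minimal Gradio sketch. It is not the repo's exact interface; the sliders, default values, and the small GPT-2 model are assumptions chosen so the demo runs on modest hardware:

import gradio as gr
from transformers import pipeline

# Small model so the demo runs quickly; swap in any causal LM you like
generator = pipeline("text-generation", model="gpt2")

def generate(prompt, temperature, top_p, top_k):
    out = generator(
        prompt,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=int(top_k),
        max_length=150,
    )
    return out[0]["generated_text"]

demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Slider(0.1, 2.0, value=0.7, label="temperature"),
        gr.Slider(0.1, 1.0, value=0.9, label="top_p"),
        gr.Slider(1, 100, value=50, step=1, label="top_k"),
    ],
    outputs=gr.Textbox(label="Generated text"),
)
demo.launch()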
One size does NOT fit all! Try these settings out to get exactly the output you need: creative, fact-based, deterministic, or any combination thereof. And if you're a developer experimenting with LLMs, modern AI code editors can make tuning and testing parameters like temperature, top_p, and top_k far more efficient.
Unsure how to fine-tune parameters like temperature, top_p and top_k to get the exact behaviour you need from your language models? We collaborate with teams that hire AI developers to design and optimise LLM workflows — from setting up the right sampling strategies to building full-stack applications around them. Our experts can help you experiment, benchmark and deploy models with the ideal balance of creativity, coherence and performance for your specific use case.