LLMs produce text that reads remarkably like human writing by predicting, one token at a time, what is most likely to come next. The real magic happens behind the curtain, where it all comes down to a game of probabilities over tokens.
Let’s take control of that magic by tuning a few key sampling parameters: temperature, top_p (nucleus sampling), and top_k. It's going to be an exciting exploration!
(Range: 0 to 2, in practice)
Temperature controls the amount of randomness in the output. Here’s how it works:
For low temperatures, the model becomes more deterministic. It prefers high-probability tokens, generating coherent and predictable text.
This works well for applications that demand consistency, e.g., technical documentation or factual descriptions.
At higher temperatures, the model introduces more randomness. Less likely tokens get a chance to appear, making the text more diverse and creative.
However, this can lead to occasional "hallucinations" or off-topic responses. Creative storytelling or brainstorming tasks benefit from higher temperatures.
Note: Set temperature to 0.0 if you want fully deterministic results.
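Under the hood, temperature simply divides the model's logits before the softmax, sharpening or flattening the probability distribution. Here is a minimal sketch with made-up logit values (illustrative only, not real model output):
import torch
import torch.nn.functional as F
# Hypothetical logits for four candidate next tokens (values are made up for illustration)
logits = torch.tensor([4.0, 2.5, 1.0, 0.5])
for temperature in (0.2, 1.0, 1.5):
    # Dividing by a low temperature sharpens the distribution; a high one flattens it
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")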
Imagine a language model that must accurately complete a sentence while maintaining coherence with the initial prompt.
Input sentence: "I saw him walking into the principal’s office with his parents..." (using Qwen/Qwen2.5-7B-Instruct)
Generated Output: It was my best friend, Alex. As you watched from your hiding spot behind the janitorial closet door, Alex looked nervous but determined as he knocked softly before entering without waiting for permission.
His parents stood beside him, their expressions a mix of concern and pride. The secretary glanced up briefly, then returned her attention to typing away at her computer screen.
Alex took a deep breath and began speaking earnestly about something that had happened between classes - how he'd accidentally broken one of Ms. Johnson's favourite antique vases during cleaning duty.
The same prompt, generated again with a higher temperature:
Generated Output: It was my best friend! As Sarah watched her best friend Michael walk into the school principal's office with his parents through the glass doors of their high school, her heart sank like a stone.
She had only just texted him earlier that morning to confirm plans for lunch after last period, and now he looked so serious she couldn't even make eye contact as they entered together.
The scene unfolding behind them seemed ominous - Mr. Thompson stood at his desk looking sternly up from some papers spread out before him, Mrs. Johnson fidgeting nervously beside him, with Michael Jr.
Notice how the high temperature provides a more imaginative, varied completion.
Other parameters: {"model_name": "Qwen/Qwen2.5-7B-Instruct", "top_p": 0.9, "top_k": 50, "max_length": 200, "repetition_penalty": 1.2, "no_repeat_ngram_size": 3}
Understanding temperature is key to striking the right balance between controlled and imaginative outputs.
(Range: 0 to 1)
Top_p sampling, also known as nucleus sampling, refines the generation process by considering only the smallest subset of tokens whose combined probability reaches a specified threshold (p).
For a given prediction, tokens are sorted by probability. The model then adds tokens until the total probability is at least p.
Only these tokens form the “nucleus” from which the next word is sampled. This dynamic approach adapts to the context, which is why top_p is often preferred in creative applications.
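To make this concrete, here is a minimal sketch of nucleus filtering over a toy probability distribution (the numbers are illustrative, not from a real model):
import torch
# Toy next-token probabilities, already normalized (illustrative values)
probs = torch.tensor([0.45, 0.25, 0.15, 0.08, 0.04, 0.03])
top_p = 0.9
# Sort tokens by probability and accumulate until the threshold p is reached
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
# Keep the smallest set of tokens whose cumulative probability reaches top_p
keep = cumulative - sorted_probs < top_p  # includes the token that crosses the threshold
nucleus_idx = sorted_idx[keep]
# Renormalize and sample only from the nucleus
nucleus_probs = probs[nucleus_idx] / probs[nucleus_idx].sum()
next_token = nucleus_idx[torch.multinomial(nucleus_probs, num_samples=1)]
print("nucleus tokens:", nucleus_idx.tolist(), "sampled:", next_token.item())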
(Range: 1 up to the vocabulary size)
Top_k sampling limits the model’s choices to the top k most likely tokens at each generation step.
For the prompt "The capital of France is ...", a top_k of 1 would force the model to pick only its single most likely token ("Paris"), while a larger k lets less likely alternatives compete.
Top_k is straightforward; capping the number of choices helps prevent the inclusion of very unlikely (and often nonsensical) tokens.
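As a minimal sketch (with toy probabilities, not real model output), top_k filtering looks like this:
import torch
# Toy next-token probabilities for a prompt like "The capital of France is ..."
# (illustrative values; a real model spreads probability over tens of thousands of tokens)
probs = torch.tensor([0.80, 0.07, 0.05, 0.04, 0.02, 0.02])
top_k = 3
# Keep only the k most likely tokens, renormalize, and sample from them
topk_probs, topk_idx = torch.topk(probs, k=top_k)
topk_probs = topk_probs / topk_probs.sum()
next_token = topk_idx[torch.multinomial(topk_probs, num_samples=1)]
print("top-k candidates:", topk_idx.tolist(), "sampled:", next_token.item())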
Top_k and top_p are easy to confuse, so imagine you're ordering lunch. With top‑k sampling, it's like a fixed menu where you always see exactly, say, five dish options, regardless of how popular or varied they are. No matter the day or how tastes change, you only choose from those five predetermined dishes.
With top‑p sampling, it's more like a dynamic buffet. Instead of a fixed number of options, you choose from all the dishes that together account for, say, 90% of what people typically order. On a day when a couple of dishes are extremely popular, your choices might be limited to just three items.
But on another day, if the popularity is spread out more evenly, you might see seven or eight dishes to pick from. This way, the number of options adapts to the situation, sometimes more, sometimes fewer, based on the overall likelihood of the dishes being chosen.
In summary, top‑k always gives you a fixed set of choices, while top‑p adjusts the choices dynamically depending on how the probabilities add up, much like a buffet that adapts to customer preferences on any given day.
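The same contrast in code: with two toy distributions (one peaked, one spread out; the values are made up for illustration), the top_k candidate set stays at a fixed size while the nucleus grows and shrinks:
import torch
def nucleus_size(probs, top_p):
    # Number of tokens in the smallest set whose cumulative probability reaches top_p
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    return int((cumulative - sorted_probs < top_p).sum())
peaked = torch.tensor([0.70, 0.20, 0.04, 0.03, 0.02, 0.01])  # a couple of dominant "dishes"
flat = torch.tensor([0.20, 0.18, 0.17, 0.16, 0.15, 0.14])    # popularity spread out evenly
for name, probs in [("peaked", peaked), ("flat", flat)]:
    print(name, "-> top_k keeps a fixed 5 candidates | nucleus size at p=0.9:", nucleus_size(probs, 0.9))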
The key to mastering these parameters is experimentation:
Want to experiment firsthand with these parameters? You can clone our GitHub repository and use a simple UI to tweak the settings for different models. It’s a fun and hands-on way to see how temperature, top_p, and top_k influence the text generation results.
Install required libraries
pip install transformers torch
Main code
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load a pre-trained GPT model and tokenizer
model_name = "gpt2" # you can change the model name to other models but keep in mind that the tokenizer and model should match along with imports
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
# Encode a prompt text to get input_ids
prompt_text = "Once upon a time, in a land far, far away,"
input_ids = tokenizer.encode(prompt_text, return_tensors='pt')
# Set your parameters for temperature, top_p, and top_k
temperature = 0.7 # Controls creativity: higher is more creative, lower is more deterministic
top_p = 0.9 # Nucleus sampling: top_p controls the cumulative probability threshold
top_k = 50 # Top-K sampling: limits choices to top K most likely tokens
# Generate text using the model with the specified parameters
output = model.generate(
    input_ids,
    do_sample=True,                  # Enable sampling; without this, temperature/top_p/top_k are ignored
    max_length=150,                  # Max length of generated text
    temperature=temperature,         # Adjust temperature for creativity
    top_p=top_p,                     # Apply top_p sampling
    top_k=top_k,                     # Apply top_k sampling
    num_return_sequences=1,          # Number of sequences to generate
    no_repeat_ngram_size=2,          # Prevent repeating n-grams for more natural output
    pad_token_id=tokenizer.eos_token_id  # Ensure padding with the EOS token
)
# Decode the generated text and print the result
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
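To get a feel for what such a tweaking UI looks like, here is a minimal Gradio sketch built on the same GPT-2 setup as above. This is an illustrative assumption, not the exact code from our repository:
import gradio as gr
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
def generate(prompt, temperature, top_p, top_k):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_length=150,
        do_sample=True,  # sampling must be enabled for the sliders to have any effect
        temperature=temperature,
        top_p=top_p,
        top_k=int(top_k),
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Prompt", value="Once upon a time, in a land far, far away,"),
        gr.Slider(0.1, 2.0, value=0.7, label="temperature"),
        gr.Slider(0.1, 1.0, value=0.9, label="top_p"),
        gr.Slider(1, 100, step=1, value=50, label="top_k"),
    ],
    outputs=gr.Textbox(label="Generated text"),
)
demo.launch()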
At this point, you should understand how the temperature, top_p, and top_k parameters help strike a balance between creativity, coherence, and consistency in AI-generated text.
If it's still unclear how to adjust them for your purpose, experiment with the Gradio interface in our GitHub repo for a hands-on feel.
One size does NOT fit all! Try these settings out to get exactly the output you need: creative, fact-based, deterministic, or any combination thereof!