
Have you ever tried to load a large AI model only to face GPU memory errors? I wrote this guide to clarify how quantization makes state-of-the-art models practical on limited hardware.
Modern AI models are massive, often requiring high-end GPUs with large memory footprints. Quantization reduces that requirement by changing how numerical weights are represented, trading a small amount of precision for significant gains in memory efficiency and speed.
This guide explains how quantization works at a technical level and demonstrates practical implementation using BitsAndBytes. You will see how to apply 4-bit and 8-bit quantization with minimal code changes, enabling large language models to run efficiently on consumer hardware.
Consumer hardware often cannot natively support state-of-the-art models with billions of parameters. Quantization enables practical deployment without requiring enterprise-grade GPUs.
This is where quantization works its magic: a 32B-parameter model, which would occupy roughly 70 GB at full precision, can fit within 24 GB of GPU memory. Later in this guide, we will see how to do it ourselves.
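The arithmetic behind that claim is simple: weight memory is roughly parameter count times bytes per parameter. A quick sketch (illustrative figures only; activations and KV cache add overhead on top):

```python
# Back-of-envelope estimate of GPU memory needed for model *weights* alone.
# Activations and KV cache add overhead, which is how a 32B model ends up
# around 70 GB in half precision rather than the bare 64 GB computed here.
PARAMS = 32e9  # a 32B-parameter model

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:>10}: ~{PARAMS * nbytes / 1e9:.0f} GB")
```

Going from FP16 (~64 GB of weights) to 4-bit (~16 GB) is what brings a 32B model under the 24 GB limit of a consumer GPU.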
Quantization lets us run large models on our GPU that would otherwise not fit, at the cost of some precision. When hardware constraints are optimized this way, inference efficiency becomes a competitive advantage.
At a technical level, quantization converts higher-precision floating-point representations such as FP32 into lower-precision formats such as BF16, INT8, or INT4, reducing memory footprint and computational overhead. Some precision is lost in the conversion because fewer bits remain for the fractional part. Below is a simplified explanation of the underlying representation.
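To make the idea concrete, here is a minimal sketch of symmetric INT8 quantization in NumPy. This is an illustrative toy, not what bitsandbytes does internally; real schemes work per-block or per-channel and handle outliers separately.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: q = round(x / scale)."""
    scale = np.abs(x).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original values."""
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Each value moves by at most half a quantization step (scale / 2),
# while storage drops from 4 bytes to 1 byte per value.
print("max abs error:", np.abs(x - x_hat).max())
```

The single `scale` factor is the extra metadata that quantized formats carry alongside the integer weights.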

In AI workloads, floating-point numbers are usually stored in the IEEE 754 format, which represents a number in base two using three parts: a sign bit, exponent bits, and mantissa (fraction) bits.
Their format is: [sign bit][exponent bits][mantissa bits]
To keep it simple: FP32 has 1 sign bit, 8 exponent bits, and 23 mantissa bits, better known as the fraction. BF16 keeps the same 1 sign bit and 8 exponent bits but has only 7 fraction bits. By dropping those fraction bits we lose a little precision, but converting FP32 to BF16 lets us load the same model in half the memory. This is an oversimplified picture of how things actually work, but it is one of the core ideas behind quantization.
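Because BF16 keeps FP32's sign and exponent bits and simply drops the low 16 mantissa bits, the conversion can be sketched as a bit mask (real implementations round to nearest rather than truncate):

```python
import numpy as np

def fp32_to_bf16_trunc(x: np.ndarray) -> np.ndarray:
    """Emulate FP32 -> BF16 by zeroing the low 16 mantissa bits.
    (Production code rounds to nearest even; truncation keeps this simple.)"""
    bits = x.astype(np.float32).view(np.uint32)   # reinterpret the raw bits
    return (bits & 0xFFFF0000).view(np.float32)   # keep sign, exponent, top 7 fraction bits

x = np.array([3.14159265], dtype=np.float32)
print(fp32_to_bf16_trunc(x))  # → [3.140625]: close to pi, with only 7 fraction bits
```

The value barely moves, but the representable range stays identical to FP32's, which is why BF16 is popular for deep learning.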
For most real-world inference scenarios, post-training quantization provides the fastest path to deployment.
BitsAndBytes provides the most straightforward approach to model quantization, supporting both 8-bit and 4-bit quantization with minimal code changes.
Prerequisite: pip install bitsandbytes accelerate transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",
    quantization_config=bnb_config,
    device_map="auto",
)

# Alternatively, an 8-bit quantization config
# (8-bit mode has no separate compute-dtype parameter in BitsAndBytesConfig)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
| Method | Precision | Speed (approx., vs. FP32) | Accuracy | Use Case |
| --- | --- | --- | --- | --- |
| FP16 | Half | 2x faster | High | General inference |
| INT8 | 8-bit | 4x faster | Good | Production deployment |
| INT4 | 4-bit | 8x faster | Moderate | Resource-constrained devices |
| NF4 | 4-bit | 8x faster | Better than INT4 | Advanced applications |
The appropriate method depends on the trade-off between memory efficiency, latency requirements, and acceptable accuracy loss. These approaches provide practical, low-friction ways to implement quantization in production inference pipelines. Advanced techniques such as GGUF, GPTQ, and AWQ offer deeper optimization, but they are performed during or after training and hand you a pre-quantized model artifact. BitsAndBytes, on the other hand, comes in handy at the last minute: it quantizes at load time and saves us the pain of complicated computation and hours of training.
What is quantization?
Quantization is the process of converting high-precision model weights (e.g., FP32) into lower-precision formats (e.g., INT8 or INT4) to reduce memory usage and improve inference speed.
Does quantization reduce model accuracy?
Yes, quantization can introduce minor precision loss. However, modern techniques like NF4 and INT8 maintain strong accuracy while significantly reducing memory requirements.
When should I use quantization?
Quantization is ideal for running large models on consumer GPUs, deploying to resource-constrained devices, and cutting memory and latency in production inference.
Can BitsAndBytes be used for post-training quantization?
Yes. BitsAndBytes is widely used for post-training quantization and provides efficient 4-bit and 8-bit configurations for transformer-based models.
What is NF4?
NF4 (Normal Float 4) is a 4-bit quantization format optimized for preserving the distribution characteristics of neural-network weights, offering better accuracy than traditional INT4 methods.
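The mechanism behind NF4 can be sketched as codebook quantization: scale each block of weights by its absolute maximum, then snap every value to the nearest of 16 fixed levels. The level values below approximate the NF4 table from the QLoRA paper and bitsandbytes source; treat them as illustrative, not authoritative:

```python
import numpy as np

# Approximate NF4 code values (16 levels, spaced for normally distributed
# weights; the exact table lives in the bitsandbytes source).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
     0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize_block(w: np.ndarray):
    """Quantize one block: scale by absmax, snap each value to the nearest level."""
    scale = np.abs(w).max()
    idx = np.abs(w[:, None] / scale - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale  # 4-bit indices + one fp scale per block

def nf4_dequantize(idx: np.ndarray, scale: float) -> np.ndarray:
    """Look up each index in the codebook and rescale."""
    return NF4_LEVELS[idx] * scale

w = np.random.randn(64).astype(np.float32)  # one 64-value block
idx, scale = nf4_quantize_block(w)
w_hat = nf4_dequantize(idx, scale)
```

Because the levels cluster near zero, where normally distributed weights concentrate, NF4 spends its 16 codes more effectively than evenly spaced INT4 levels.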