Facebook iconWhat Is Quantization and Its Practical Guide - F22 Labs
Blogs/AI

What Is Quantization and Its Practical Guide

Jul 2, 20253 Min Read
Written by Krishna Purwar
What Is Quantization and Its Practical Guide Hero

Have you ever tried to run a powerful AI model but got an error saying your computer doesn't have enough memory? You're not alone. Today's AI models are massive, often requiring expensive GPUs with huge amounts of memory.

Quantization is a clever technique that reduces model size by changing how numbers are stored, using simpler, less precise formats that need far less memory. Think of it like compressing a photo: you trade a small amount of quality for a much smaller file size.

In this guide, we'll explore how quantization works under the hood and show you practical code examples using BitsAndBytes. You'll learn to implement both 4-bit and 8-bit quantization with just a few lines of code, making large language models more accessible on consumer hardware. Ready to optimize your AI models? Let's dive in!

Why Do We Need Quantization?

Our consumer hardware will never be enough to run new state-of-the-art models coming every now and then with billions of parameters, but that should not let us stop from trying them!!

This is where Quantization does its magic by letting us use a 32B parameter mode, i.e. a 70 GB model within 24 GB of GPU. We will say later on how to do it ourselves.

Quantization enables us to use large models on our GPU, which would not be possible otherwise at the cost of a loss of some precision.

How Does Quantization Work?

Under the hood, Quantization converts higher precision floating points, such as fp32, to numbers like bf16, int8, int4, etc. It leads to the loss of some precision by losing the decimal points. Let’s break down the math in a simple way (no need to worry,  it’s easy, I promise).

Partner with Us for Success

Experience seamless collaboration and exceptional results.

Usually, in our AI world, FP are stored in IEEE 754 Standard and are divided into 3 parts: Sign bit, Exponent Bit and Mantissa(fraction). Floating points are a way to store numbers in base two.

Their format is: [sign bit][exponent bits][mantissa bits]

Now, to keep it extremely simple, FP32 has 1 sign bit, 8 exp bits, and 23 mantissa better known as the fraction. BF16 has the size, 1 sign bit, 8 exp bits, and 7 fraction bits. Now, by losing these fractional values, we do lose a little bit of precision, but by converting FP32 to BF16, we can load the same model in half the size. This was an oversimplified example of how things work, actually, but this is one of the core parts of Quantization.

IEEE 754 Converter Quantization

Practical Ways To Do Quantization

BitsAndBytes Configuration

BitsAndBytes provides the most straightforward approach to model Quantization, supporting both 8-bit and 4-bit Quantization with minimal code changes.

Prerequisite: pip install bitsandbytes, accelerate

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

4-bit Quantization Setup

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",
    quantization_config=bnb_config,
    device_map="auto"
)

8-bit Quantization Configuration

# 8-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype=torch.float16
)

Comparison of Quantization Methods

MethodPrecisionSpeedAccuracyUse Case

FP16

Half

2x faster

High

General inference

INT8

8-bit

4x faster

Good

Production deployment

INT4

4-bit

8x faster

Moderate

Resource-constrained devices

NF4

4-bit

8x faster

Better than INT4

Advanced applications

FP16

Precision

Half

Speed

2x faster

Accuracy

High

Use Case

General inference

1 of 4

These were the simple, easy to use ways that we can use in our daily code whenever we need quantization. There are many other advanced techniques like GGUF, GPTQ, AWQ and more but they are performed either during training or after training, giving us a quantized model, on the other hand, bnb comes in handy when we need it last minute and saves us the pain of dealing with complicated computation and hours of training!

Partner with Us for Success

Experience seamless collaboration and exceptional results.

Happy Quantizing!! 😀

Author-Krishna Purwar
Krishna Purwar

You can find me exploring niche topics, learning quirky things and enjoying 0 n 1s until qbits are not here-

Phone

Next for you

Qdrant vs Milvus: Which Vector Database Should You Choose? Cover

AI

Jul 18, 20259 min read

Qdrant vs Milvus: Which Vector Database Should You Choose?

Which vector database should you choose for your AI-powered application, Qdrant or Milvus? As the need for high-dimensional data storage grows in modern AI use cases like semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG), vector databases have become essential.  In this article, we compare Qdrant vs Milvus, two of the most popular vector databases, based on architecture, performance, and ideal use cases. You’ll get a practical breakdown of insertion speed, query

Voxtral-Mini 3B vs Whisper Large V3: Which One’s Faster? Cover

AI

Jul 18, 20254 min read

Voxtral-Mini 3B vs Whisper Large V3: Which One’s Faster?

Which speech-to-text model delivers faster and more accurate transcriptions Voxtral-Mini 3B or Whisper Large V3? We put Voxtral-Mini 3B and Whisper Large V3 head-to-head to find out which speech-to-text model performs better in real-world tasks. Using the same audio clips, we compared latency (speed) and word error rate (accuracy) to help you choose the right model for use cases like transcribing calls, meetings, or voice messages. As speech-to-text systems become smarter and more reliable, th

What is Google Gemini CLI & how to install and use it? Cover

AI

Jul 3, 20252 min read

What is Google Gemini CLI & how to install and use it?

Ever wish your terminal could help you debug, write code, or even run DevOps tasks, without switching tabs? Google’s new Gemini CLI might just do that. Launched in June 2025, Gemini CLI is an open-source command-line AI tool designed to act like your AI teammate, helping you write, debug, and understand code right from the command line. What is Gemini CLI? Gemini CLI is a smart AI assistant you can use directly in your terminal. It’s not just for chatting, it’s purpose-built for developers.