
Quantization using bitsandbytes
‘bitsandbytes’ is a library that reduces model size through 8-bit and 4-bit quantization. This lowers memory usage and can help fit large models onto the available hardware. In this blog, we describe the process on the free Colab platform.
Installation
We need to install transformers, accelerate and bitsandbytes.
pip install transformers accelerate bitsandbytes -qU
The ‘accelerate’ library is needed when using ‘bitsandbytes’ for several reasons:
- Efficient GPU Utilization: ‘accelerate’ helps in distributing the model’s computation efficiently across multiple GPUs. This is particularly useful when dealing with large models that require significant memory and processing power. By using `accelerate`, you can ensure that the quantized models are loaded and run in a way that maximizes the available hardware resources.
- Model Loading and Configuration: When quantizing models with ‘bitsandbytes’, the ‘from_pretrained()’ method from the ‘transformers’ library can take advantage of ‘accelerate’ to manage the loading of models. This includes handling different device maps and optimizing the placement of model parts (e.g., offloading parts of the model to the CPU if needed); see the sketch after this list.
- Compatibility: Many advanced features in the Hugging Face ecosystem, including those used for model quantization and optimization, are built to work seamlessly with ‘accelerate’. It ensures compatibility and smooth integration when configuring and running quantized models.
In summary, ‘accelerate’ provides the necessary infrastructure for effectively using and deploying quantized models, especially when dealing with complex hardware setups and large models.
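As a minimal sketch of what ‘accelerate’ enables behind the scenes, the snippet below loads the same facebook/opt-350m checkpoint used later in this post with device_map="auto", which lets ‘accelerate’ decide where each part of the model is placed; the exact placement depends on your hardware.
from transformers import AutoModelForCausalLM

# device_map="auto" lets `accelerate` decide where each part of the
# model lives (GPU, CPU, or disk), based on the available memory.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
)

# Shows where `accelerate` placed each module.
print(model.hf_device_map)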
You also need to import PyTorch.
import torch
Breakdown of the process of 8-bit quantization
Quantization is a technique used to reduce the precision of the weights and biases used in a model, which helps in reducing the model’s memory footprint and improving computational efficiency.
The weights and biases of a model can be classified as outliers and non-outliers:
- Outliers are data points or weights in the model that have unusually high or low values compared to the rest. These are retained in 16-bit floating-point (fp16) format to preserve precision.
- Non-outliers are the more typical data points or weights that fall within a common range of values. These are converted to 8-bit integer (int8) format.
During computation, the non-outlier values (stored in int8) are dequantized back to fp16. This lets the model work with a uniform precision format (fp16) at compute time, which simplifies the processing and integration of the different weights.
By keeping outlier values in fp16, the model retains precision for the most critical weights, which helps maintain accuracy. Using int8 for most weights (non-outliers) reduces memory usage and speeds up computation. This method strikes a balance between reducing memory/computational load and maintaining model accuracy, thus optimizing overall performance.
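To make the int8 step concrete, here is a minimal sketch of absmax quantization in plain PyTorch (not the actual bitsandbytes kernels): the values are scaled so that the largest magnitude maps to 127, stored as int8, and later dequantized back to fp16 for computation.
import torch

# A toy row of non-outlier weights in fp16.
w = torch.tensor([0.21, -0.53, 0.08, 0.47], dtype=torch.float16)

# Absmax quantization: scale so the largest magnitude maps to 127.
scale = 127.0 / w.abs().max().float()
w_int8 = torch.round(w.float() * scale).to(torch.int8)

# Dequantize back to fp16 for computation.
w_fp16 = (w_int8.float() / scale).to(torch.float16)

print(w_int8)   # int8 values stored in memory
print(w_fp16)   # approximate fp16 values used at compute time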
Here is the code to quantize a model using bitsandbytes:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Request 8-bit quantization via bitsandbytes.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load the model with the quantization config; modules that are not
# quantized are kept in float32.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
The torch_dtype=torch.float32 argument above specifies that the parts of the model that are not quantized are retained in float32. Apart from the layers being quantized, the model contains various other modules and parameters (e.g., normalization layers like torch.nn.LayerNorm, embeddings, etc.). Quantization typically targets specific parts of the model (e.g., Linear layers) where the memory savings and computational gains are most beneficial.
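One way to see this split, assuming the model_8bit object loaded above, is to tally the dtypes of its parameters: the quantized Linear weights show up as int8, while the remaining parameters stay in float32.
from collections import Counter

# Quantized Linear weights are stored as int8; LayerNorm weights,
# biases, embeddings, etc. remain in float32 as requested.
print(Counter(p.dtype for p in model_8bit.parameters()))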
You can check the memory footprint of the quantized model as follows:
print(model_8bit.get_memory_footprint())
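For comparison, the same BitsAndBytesConfig can request 4-bit quantization instead. The sketch below uses the NF4 quantization type with float16 compute, which are common choices on free Colab GPUs; treat the exact settings as an assumption to adjust for your hardware.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization: NF4 storage, float16 compute.
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config_4bit,
)

print(model_4bit.get_memory_footprint())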
Summary
Using bitsandbytes, you can significantly reduce the memory footprint of a model via 8-bit and 4-bit quantization. The process amounts to installing the required libraries, setting up a BitsAndBytesConfig, and loading the model with from_pretrained(), with accelerate handling device placement and offloading, and the quantized weights being dequantized on the fly during computation.