
Loading an LLM in 4 bits using bitsandbytes
bitsandbytes allows us to load large models in low-resource environments. Typically, the weights and biases of an LLM are stored in float32
format. Using bitsandbytes, we can load them in 4-bit format without losing much performance. Here I am loading the model in the free tier of Google Colab.
Install and Import Libraries
Let’s install bitsandbytes, accelerate, peft and transformers.
!pip install transformers bitsandbytes accelerate peft
- The `transformers` library, developed by Hugging Face, is a powerful and widely used library for natural language processing (NLP) and natural language understanding (NLU). It provides pre-trained models for various tasks such as text classification, question answering, translation, summarization, and more. The library supports a wide range of model architectures, including BERT, GPT, T5, and many others.
- `bitsandbytes` is a library focused on quantization techniques, which can help in reducing the memory footprint and improving the computational efficiency of deep learning models, especially for large-scale training and inference.
- `accelerate` is another library from Hugging Face designed to streamline the process of training and evaluating models on different hardware configurations (CPU, single/multi-GPU, TPU). It simplifies the setup for distributed training and mixed precision training.
- `peft` stands for Parameter-Efficient Fine-Tuning. It’s a library designed to make fine-tuning large pre-trained models more efficient and effective by only updating a subset of the model parameters or using efficient training techniques. This approach helps to significantly reduce the computational and memory resources required for fine-tuning.
These libraries, especially when used together, provide a robust toolkit for working with large-scale NLP models, making the processes of training, fine-tuning, and deploying these models more efficient and accessible.
Before we load the model, let’s import the requisite libraries from transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
`AutoModelForCausalLM` and `AutoTokenizer` are two classes that are particularly useful when working with language models, especially for tasks involving text generation. The former automatically selects the appropriate model architecture for causal language modelling (i.e., models designed to predict the next word). The latter automatically selects the appropriate tokenizer for the model you are using. Tokenizers are responsible for converting text into a format that can be processed by the model (e.g., token IDs).
Load the Model in 4 Bits
model_id = "facebook/opt-350m"
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             device_map="auto")
`load_in_4bit=True` enables the model to be loaded in 4-bit precision, reducing the memory footprint dramatically. `device_map="auto"` allows the transformers library to automatically distribute the model across the available CPUs and GPUs based on the hardware configuration and available memory.
We have chosen the model `facebook/opt-350m`. You can get any model from the Hugging Face Hub at huggingface.co/models.
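As a quick sanity check (not part of the original notebook), you can inspect how much memory the quantized model occupies and how it was placed on the available devices; the exact numbers you see will depend on your runtime.
# Approximate memory taken by the model's parameters and buffers, in MB
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.1f} MB")
# How the model was mapped onto the available devices by device_map="auto"
print(model.hf_device_map)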
Now that we have loaded the model, we need to get the tokenizer. This is easy, as the `AutoTokenizer` class will load the right tokenizer for the model. Here is the code for that.
tokenizer = AutoTokenizer.from_pretrained(model_id)
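If you want to see the architecture discussed in the next section, you can simply print the model:
# Prints the OPTForCausalLM module tree; with 4-bit loading, Linear layers appear as Linear4bit
print(model)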
Model Architecture
The OPTForCausalLM class represents a causal language model based on the Open Pre-trained Transformer (OPT) architecture, which is specifically designed for language generation. Its salient layers are:
Embedding Layers
- Embedding(50272, 512, padding_idx=1)
This layer creates dense embedding vectors for the input tokens. Each vector has a size of 512, and the vocabulary size is 50,272. The padding index is 1; its embedding is ignored during training.
- OPTLearnedPositionalEmbedding(2050, 1024)
This layer provides positional information to the model by adding positional embeddings to the token embeddings, which helps the model understand the order of tokens. The maximum sequence length is 2050 and the embedding dimension is 1024.
Projection Layers
- project_out: Linear4bit(in_features=1024, out_features=512, bias=False)
This layer projects the decoder's 1024-dimensional hidden states down to 512 dimensions before the language-modeling head, using 4-bit weights.
- project_in: Linear4bit(in_features=512, out_features=1024, bias=False)
This layer projects the 512-dimensional token embeddings up to the 1024-dimensional hidden size used by the decoder.
Decoder Layers
- ModuleList( (0–23): 24 x OPTDecoderLayer(…) )
Consists of a stack of 24 decoder layers, each of which processes the input sequence and transforms it through self-attention and feed-forward networks. Each OPTDecoderLayer contains self_attn, activation_fn, self_attn_layer_norm, fc1, fc2 and final_layer_norm.
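To look inside one of these layers, you can print it directly; the attribute path below assumes the standard OPTForCausalLM module layout.
# Show the sub-modules (self_attn, fc1, fc2, layer norms) of the first decoder layer
print(model.model.decoder.layers[0])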
Language Modeling Head
- Linear(in_features=512, out_features=50272, bias=False)
Final linear layer that maps the hidden states to the vocabulary size. It generates logits for each token in the vocabulary.
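These numbers can be cross-checked against the model configuration. The snippet below is a quick verification using standard OPT config fields, not part of the original notebook.
cfg = model.config
print(cfg.vocab_size)           # 50272
print(cfg.word_embed_proj_dim)  # 512, the embedding / projection dimension
print(cfg.hidden_size)          # 1024, the decoder hidden dimension
print(cfg.num_hidden_layers)    # 24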
Inference
Now that the model has been loaded, we can run predictions as usual.
text = "Hello my name is"
device = "cuda:0"
Let’s feed in the above text and generate an output. Here the device is set to `cuda:0`, which refers to the first GPU.
Now we generate the inputs using the tokenizer. It will tokenize the text and return the index of each token. With `return_tensors="pt"`, we specify that the inputs should be returned as PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt").to(device)
The inputs will be in the form of a dictionary with `input_ids` and `attention_mask` as the keys.
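Printing the encoding shows this structure; the exact token IDs depend on the tokenizer, so only the keys and shapes are shown here.
print(list(inputs.keys()))        # ['input_ids', 'attention_mask']
print(inputs["input_ids"].shape)  # torch.Size([1, sequence_length])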
The above inputs are sent to the model for generating the output using the code below:
outputs = model.generate(**inputs, max_new_tokens=20)
By specifying max_new_tokens=20, we ensure that at most 20 new tokens are generated. `**inputs` unpacks the dictionary, so the call is effectively equivalent to:
outputs = model.generate(input_ids=inputs["input_ids"],
                         attention_mask=inputs["attention_mask"],
                         max_new_tokens=20)
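generate returns token IDs rather than text, so the final step (not shown in the snippet above) is to decode them back into a string:
# Convert the generated token IDs back into readable text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))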
Note: This blog attempts to explain the first part of the notebook for a beginner. The remaining part of the notebook will follow.
In the last section, we loaded the LLM with a couple of simple arguments, as follows:
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             device_map="auto")
However, bitsandbytes allows more flexibility than that. There are several parameters that can be passed to a `BitsAndBytesConfig`, as follows:
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_quant_type="nf4",
                                         bnb_4bit_use_double_quant=True,
                                         bnb_4bit_compute_dtype=torch.bfloat16)

model_cd_bf16 = AutoModelForCausalLM.from_pretrained(model_id,
                                                     quantization_config=quantization_config)
Compute Data Type
bnb_4bit_compute_dtype=torch.bfloat16
- This parameter specifies the data type used for computations performed on the 4-bit quantized weights. In this case, torch.bfloat16 (bfloat16) is used as the computation data type.
- bfloat16 is a 16-bit floating-point data type used primarily in machine learning. float16 is also represented using 16 bits: 1 sign bit, 5 bits for the exponent, and 10 bits for the fraction (mantissa). bfloat16 likewise uses 16 bits, but with 1 sign bit, 8 bits for the exponent, and 7 bits for the mantissa. Because its exponent range matches that of float32, bfloat16 can represent a much broader range of values (at lower precision), which makes it well suited for machine learning workloads.
- In the absence of any specification, bitsandbytes will use torch.float32 for computation.
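The range/precision trade-off is easy to see with torch.finfo; this is just a quick illustration of the point above.
import torch

print(torch.finfo(torch.float16))   # max ~65504, eps ~9.8e-04 (narrow range, finer precision)
print(torch.finfo(torch.bfloat16))  # max ~3.39e+38, eps ~7.8e-03 (float32-like range, coarser precision)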
Normalized Float
bnb_4bit_quant_type="nf4"
There are two ways to represent a 4-bit float: fp4 and nf4. fp4 is a true floating-point layout with a 1-bit sign, a 2-bit exponent, and a 1-bit mantissa, while nf4 maps its 16 possible codes to values optimized for storing normally distributed variables. nf4 is more efficient for training large language models. If the type is not specified, fp4 is used by default.
fp4 (Floating Point 4-bit)
fp4 is a standard floating-point representation that provides a balanced way to represent a wide range of values, similar to how larger floating-point formats like FP32 or FP16 work. It is typically used when the data distribution or the specific use case does not favor any particular optimization. As noted above, fp4 is what the system assumes when a 4-bit quantization type is not explicitly specified.
nf4 (Normalized Float 4-bit)
nf4 values are optimized for storing normally distributed variables; the representation is specifically designed to be more efficient for data that follows a normal distribution. Large language models often work with weights and activations that are normally distributed, and nf4 leverages this by better capturing the range and precision needed for such distributions, potentially leading to better performance and efficiency during training.
In LLMs, nf4 is preferred because:
- Large language models often deal with weights and activations that are normally distributed, and nf4 is designed to encode such distributions efficiently, giving a better representation with fewer bits.
- nf4 makes the most of the limited 4-bit space, providing a better balance between range and precision for normally distributed values, which are common in the internal computations of neural networks.
- By using a representation tailored for normally distributed data, nf4 can reduce quantization errors and improve the training dynamics of large models, potentially leading to faster convergence and better performance.
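To make the idea concrete, here is a toy sketch of quantizing a block of weights to a small, fixed set of levels derived from normal-distribution quantiles. This is only an illustration of the principle; it is not bitsandbytes' actual NF4 implementation, and the exact level values used by NF4 differ.
import torch

# Build 16 illustrative levels from quantiles of a standard normal,
# rescaled to [-1, 1] (NF4 uses a similar, but carefully tuned, set of values).
normal = torch.distributions.Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)           # avoid the infinite tails
levels = normal.icdf(probs)
levels = levels / levels.abs().max()             # normalize to [-1, 1]

# Quantize one block of 64 weights: scale by the block's absmax,
# then snap each weight to the nearest level.
weights = torch.randn(64)
scale = weights.abs().max()                      # per-block scaling factor
normalized = weights / scale
idx = (normalized.unsqueeze(1) - levels).abs().argmin(dim=1)   # 4-bit codes (0..15)
dequantized = levels[idx] * scale                # approximate reconstruction

print("max abs error:", (weights - dequantized).abs().max().item())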
Double Quantization
bnb_4bit_use_double_quant=True
Double Quantization is a technique used to further reduce memory usage by performing a second round of quantization on already quantized parameters.
- In QLoRA, the weights of a neural network are first quantized in blocks of 64 to 4-bit precision. This reduces the memory footprint significantly compared to using 32-bit or 16-bit floats.
- Each block of weights requires a scaling factor to maintain precision. These scaling factors, typically 32-bit floats, are necessary because they help adjust the quantized weights back to their approximate original values during computation.
- Memory overhead from scaling factors: While 4-bit quantization saves memory, the 32-bit scaling factors for each block of 64 weights still add considerable overhead. Each scaling factor effectively adds 0.5 bits per weight (one 32-bit float shared by 64 weights).
- To address this overhead, Double Quantization performs a second round of quantization on the scaling factors themselves: the 32-bit scaling factors are grouped into blocks of 256 and quantized to 8-bit precision, which brings the overhead down from 0.5 to roughly 0.127 bits per weight.
- In essence, Double Quantization is a clever optimization technique that enables even more efficient use of memory by quantizing not just the model weights, but also the scaling factors needed to maintain their precision. This allows for large models to be run on smaller, more memory-constrained hardware, making advanced AI more accessible and cost-effective.
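To make the savings concrete, here is a quick back-of-the-envelope calculation of the per-weight overhead before and after Double Quantization, using the block sizes mentioned above.
# Per-weight overhead of the scaling factors, without double quantization:
# one 32-bit scale per block of 64 weights
overhead_plain = 32 / 64
print(overhead_plain)              # 0.5 bits per weight

# With double quantization: 8-bit scales per block of 64 weights,
# plus one 32-bit "scale of scales" per 256 scales (i.e. per 64*256 weights)
overhead_double = 8 / 64 + 32 / (64 * 256)
print(round(overhead_double, 3))   # ~0.127 bits per weight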