Post Training Quantization with AIDGE#
What is Network Quantization?#
Deploying large Neural Network architectures on embedded targets can be a difficult task, as they often require billions of floating-point operations per inference.
To address this problem, several techniques have been developed over the past decades to reduce the computational load and energy consumption of inference. These techniques include Pruning, Compression, Quantization and Distillation.
In particular, Post Training Quantization (PTQ) consists in taking an already trained network and replacing its costly floating-point multiply-add (MADD) operations with their integer counterparts. Using bytes instead of floats also reduces the required memory bandwidth.
While this process may seem trivial, the naive approach of simply rounding the parameters and activations does not work in practice. Instead, we first normalize the network in order to optimize the ranges of the parameters and of the values propagated through it, and only then apply quantization.
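To see why the naive approach fails, here is a small illustrative numpy example (not taken from the tutorial): trained weights are typically small floats, so rounding them directly collapses almost all of them to zero, whereas mapping the observed range onto the integer grid first preserves their relative magnitudes.

import numpy as np

weights = np.array([0.12, -0.07, 0.31, -0.25], dtype=np.float32)

# Naive approach: rounding the floats directly loses all the information
naive = np.round(weights)
print(naive)                           # [ 0. -0.  0. -0.]

# Normalized approach: map the observed range onto the signed 8-bit grid first
scale = np.abs(weights).max() / 127    # one scaling coefficient for the tensor
quantized = np.round(weights / scale)  # integer codes in [-127, 127]
print(quantized)                       # [  49.  -29.  127. -102.]
print(quantized * scale)               # dequantized values, close to the originals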
The Quantization Pipeline#
The PTQ algorithm consists of a three-step pipeline (a toy illustration follows the list below):
First, we optimize the parameter ranges by propagating the scaling coefficients through the network.
Secondly, we compute the activation values over an input dataset, and insert the scaling nodes.
Finally, we quantize the network by reconfiguring the scaling nodes according to the desired precision.
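To make these steps more concrete, here is a toy numpy sketch of the same ideas applied to a hypothetical 2-layer ReLU network. This is only an illustration of the principle, not the AIDGE implementation; every name in it (W1, act1, r1, GRID, ...) is made up for the example.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.2, size=(8, 16))              # layer 1 weights
W2 = rng.normal(0.0, 0.2, size=(4, 8))               # layer 2 weights
calibration = [rng.random(16) for _ in range(100)]   # representative inputs
GRID = 2 ** (8 - 1) - 1                               # 127 for 8-bit signed values

# Step 1 - normalize the parameter ranges: since ReLU(s * x) == s * ReLU(x)
# for s > 0, each weight tensor can be divided by its max magnitude and the
# scaling coefficient propagated to the rest of the network.
s1, s2 = np.abs(W1).max(), np.abs(W2).max()
W1n, W2n = W1 / s1, W2 / s2

# Step 2 - compute the activation ranges over the calibration dataset; these
# ranges become the values of the scaling nodes inserted after each layer.
def act1(x):
    return np.maximum(W1n @ x, 0)                     # layer 1 + ReLU

r1 = max(act1(x).max() for x in calibration)

def act2(x):
    return np.maximum(W2n @ (act1(x) / r1), 0)        # layer 2 + ReLU, rescaled input

r2 = max(act2(x).max() for x in calibration)

# Step 3 - quantize: round the normalized weights onto the integer grid and
# reconfigure the scaling nodes (here r1, r2) for the requested precision.
Q1, Q2 = np.round(W1n * GRID), np.round(W2n * GRID)
print("integer weights now span", int(Q1.min()), "to", int(Q1.max()))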
Doing the PTQ with AIDGE#
This notebook shows how to perform PTQ of a Convolutional Network trained on the MNIST dataset.
The tutorial is structured as follows:
Setup of the AIDGE environment
Loading of the model and example inferences
Evaluation of the trained model accuracy
Post Training Quantization and test inferences
Evaluation of the quantized model accuracy
As we will observe in this notebook, an 8-bit PTQ causes zero degradation of the accuracy.
Let’s begin!
(if needed) Download the model#
If you don’t have git-lfs, you can download the model and data using the following piece of code:
[1]:
import os
import requests

def download_material(path: str) -> None:
    if not os.path.isfile(path):
        response = requests.get("https://gitlab.eclipse.org/eclipse/aidge/aidge/-/raw/dev/examples/tutorials/PTQ_tutorial/"+path+"?ref_type=heads")
        if response.status_code == 200:
            with open(path, 'wb') as f:
                f.write(response.content)
            print("File downloaded successfully.")
        else:
            print("Failed to download file. Status code:", response.status_code)

# Download onnx model file
download_material("ConvNet.onnx")
# Download data sample
download_material("mnist_samples.npy.gz")
# Download data label
download_material("mnist_labels.npy.gz")
File downloaded successfully.
File downloaded successfully.
File downloaded successfully.
Environment setup …#
We need numpy for manipulating the inputs, matplotlib for visualization purposes, and gzip to uncompress the numpy dataset.
Then we import the aidge modules:
the core module contains everything we need to manipulate the graph.
the backend module allows us to perform inferences using the CPU.
the onnx module allows us to load the pretrained model (stored in an onnx file).
the quantization module encapsulates the Post Training Quantization algorithm.
[2]:
import gzip
import numpy as np
import matplotlib.pyplot as plt
import aidge_core
import aidge_onnx
import aidge_backend_cpu
import aidge_quantization
print(" Available backends : ", aidge_core.Tensor.get_available_backends())
Available backends : {'cpu'}
Then, let’s define the configuration parameters of this script …
[3]:
NB_SAMPLES = 100
NB_BITS = 8
Now, let’s load and visualize some samples …
[4]:
samples = np.load(gzip.GzipFile('mnist_samples.npy.gz', "r"))
labels = np.load(gzip.GzipFile('mnist_labels.npy.gz', "r"))
[5]:
for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.axis('off')
    plt.tight_layout()
    plt.imshow(samples[i], cmap='gray')
Importing the model in AIDGE …#
[6]:
aidge_model = aidge_onnx.load_onnx("ConvNet.onnx", verbose=False)
aidge_core.remove_flatten(aidge_model) # we want to get rid of the 'flatten' nodes ...
- data_0_Conv_output_0 (Conv)
- data_1_Relu (Relu)
- data_2_MaxPool_output_0 (MaxPool)
- data_3_Conv_output_0 (Conv)
- data_4_Relu (Relu)
- data_5_MaxPool_output_0 (MaxPool)
- data_6_Flatten (Flatten | GenericOperator)
- axis : 1
- data_7_Gemm_output_0 (Gemm)
- data_8_Relu (Relu)
- data_9_Gemm_output_0 (Gemm)
- data_10_Relu (Relu)
- output (Gemm)
Setting up the AIDGE scheduler …#
In order to perform inferences with AIDGE we need to set up a scheduler. But before doing so, we need to create a data producer node and connect it to the network.
[7]:
# Insert the input producer
input_node = aidge_core.Producer([1, 1, 28, 28], "XXX")
input_node.add_child(aidge_model)
aidge_model.add(input_node)
# Set up the backend
aidge_model.set_datatype(aidge_core.dtype.float32)
aidge_model.set_backend("cpu")
# Create the Scheduler
scheduler = aidge_core.SequentialScheduler(aidge_model)
Running some example inferences …#
Now that the scheduler is ready, let’s perform some inferences. To do so we first declare a utility function that will prepare and set our inputs, propagate them and retrieve the outputs.
[8]:
def propagate(model, scheduler, sample):
    # Setup the input
    sample = np.reshape(sample, (1, 1, 28, 28))
    input_tensor = aidge_core.Tensor(sample)
    input_node.get_operator().set_output(0, input_tensor)
    # Run the inference
    scheduler.forward()
    # Gather the results
    output_node = model.get_output_nodes().pop()
    output_tensor = output_node.get_operator().get_output(0)
    return np.array(output_tensor)

print('\n EXAMPLE INFERENCES :')
for i in range(10):
    output_array = propagate(aidge_model, scheduler, samples[i])
    print(labels[i], ' -> ', np.round(output_array, 2))
EXAMPLE INFERENCES :
7 -> [[-0.02 0. -0.03 0.03 0.03 0.01 -0. 0.94 0.02 -0.03]]
2 -> [[ 0.17 0.01 0.8 -0.03 0.01 -0.04 0.06 -0.01 -0.05 0. ]]
1 -> [[-0. 0.99 0.02 0.02 -0.02 -0. 0.03 -0.01 0.01 -0.04]]
0 -> [[ 0.97 0.03 -0.02 0.01 -0.01 0.04 0.03 0.02 0.01 -0.07]]
4 -> [[-0.05 -0. -0.01 -0.02 1.13 -0.01 -0.01 -0.03 0.04 -0.01]]
1 -> [[-0.01 1.06 -0. -0.02 -0.01 0.02 -0.02 0.02 0.01 -0.01]]
4 -> [[-0.03 -0.04 0.01 -0.01 0.94 0.01 0.04 0.03 0.19 -0.12]]
9 -> [[-0.02 0.02 0.07 0.09 0.12 -0.02 -0.02 -0.02 0.12 0.67]]
5 -> [[ 0.03 -0.03 0.04 -0.07 0.01 0.69 0.16 0.06 0.08 -0.02]]
9 -> [[ 0.01 -0.01 -0. -0.03 0.05 -0.02 0.01 0.06 -0.01 0.95]]
Computing the model accuracy …#
[9]:
def compute_accuracy(model, samples, labels):
    acc = 0
    for i, x in enumerate(samples):
        y = propagate(model, scheduler, x)
        if labels[i] == np.argmax(y):
            acc += 1
    return acc / len(samples)

accuracy = compute_accuracy(aidge_model, samples[0:NB_SAMPLES], labels)
print(f'\n MODEL ACCURACY : {accuracy * 100:.3f}%')
MODEL ACCURACY : 100.000%
Quantization dataset creation …#
We need to convert a subset of our Numpy samples into AIDGE tensors, so that they can be used to compute the activation ranges.
[10]:
tensors = []
for sample in samples[0:NB_SAMPLES]:
    sample = np.reshape(sample, (1, 1, 28, 28))
    tensor = aidge_core.Tensor(sample)
    tensors.append(tensor)
Applying the PTQ to the model …#
Now that everything is ready, we can call the PTQ routine! Note that after quantization we need to rebuild the scheduler.
[11]:
aidge_quantization.quantize_network(aidge_model, NB_BITS, tensors)
scheduler = aidge_core.SequentialScheduler(aidge_model)
Running some quantized inferences …#
Now that our network is quantized, what about testing some inferences? Before doing so, we must not forget that our 8-bit network expects 8-bit inputs! We thus need to rescale the input tensors, multiplying them by 2^(NB_BITS-1) - 1 (i.e. 127 for 8 bits) …
[12]:
scaling = 2**(NB_BITS-1)-1
for i in range(NB_SAMPLES):
    samples[i] = np.round(samples[i] * scaling)
We can now perform our quantized inferences …
[13]:
print('\n EXAMPLE QUANTIZED INFERENCES :')
for i in range(10):
    input_array = np.reshape(samples[i], (1, 1, 28, 28))
    output_array = propagate(aidge_model, scheduler, input_array)
    print(labels[i], ' -> ', np.round(output_array, 2))
EXAMPLE QUANTIZED INFERENCES :
7 -> [[-2. 1. -3. 3. 2. 1. 0. 90. 1. -2.]]
2 -> [[16. 1. 77. -2. 0. -3. 5. 0. -6. 0.]]
1 -> [[ 0. 95. 2. 2. -2. 0. 3. -1. 1. -4.]]
0 -> [[93. 3. -2. 1. -1. 5. 2. 1. 0. -6.]]
4 -> [[ -5. -1. -1. -2. 109. -1. -1. -3. 4. -1.]]
1 -> [[ 0. 102. 0. -2. -2. 2. -2. 1. 1. -1.]]
4 -> [[ -3. -3. 0. -1. 90. 2. 3. 3. 18. -11.]]
9 -> [[-2. 2. 6. 9. 11. -2. -3. -2. 12. 63.]]
5 -> [[ 2. -3. 3. -6. 1. 67. 16. 6. 7. -2.]]
9 -> [[ 1. 0. 0. -3. 4. -2. 2. 5. -2. 92.]]
Computing the quantized accuracy …#
Just as we’ve done for the initial network, we can compute the quantized model accuracy …
[14]:
accuracy = compute_accuracy(aidge_model, samples[0:NB_SAMPLES], labels)
print(f'\n QUANTIZED MODEL ACCURACY : {accuracy * 100:.3f}%')
QUANTIZED MODEL ACCURACY : 100.000%
Work is done!#
We see that an 8-bit PTQ does not affect the accuracy of our model! This result shows that a proper quantization algorithm can be used to deploy a Neural Network on very small devices, where manipulating bytes rather than floats is far more efficient. We encourage you to run this notebook again with even more aggressive quantization values; a sketch of how to do so is given below.
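If you want to experiment with other bit widths, here is a minimal sketch (not part of the original notebook) that replays the steps above inside a single function, so that the in-place modifications of aidge_model and samples do not interfere between runs. It only combines calls already shown in this tutorial; the function name evaluate_at_bitwidth and its structure are our own.

import gzip
import numpy as np

def evaluate_at_bitwidth(nb_bits, nb_samples=100):
    # Reload fresh float samples and labels (the arrays above were rescaled in place)
    fresh_samples = np.load(gzip.GzipFile('mnist_samples.npy.gz', "r"))
    fresh_labels = np.load(gzip.GzipFile('mnist_labels.npy.gz', "r"))

    # Rebuild the model exactly as in the cells above
    model = aidge_onnx.load_onnx("ConvNet.onnx", verbose=False)
    aidge_core.remove_flatten(model)
    producer = aidge_core.Producer([1, 1, 28, 28], "XXX")
    producer.add_child(model)
    model.add(producer)
    model.set_datatype(aidge_core.dtype.float32)
    model.set_backend("cpu")

    # Calibration tensors built from the float samples
    calibration = [aidge_core.Tensor(np.reshape(s, (1, 1, 28, 28)))
                   for s in fresh_samples[0:nb_samples]]

    # Quantize and rebuild the scheduler
    aidge_quantization.quantize_network(model, nb_bits, calibration)
    scheduler = aidge_core.SequentialScheduler(model)

    # Rescale the inputs to the quantized input range and measure the accuracy
    scale = 2 ** (nb_bits - 1) - 1
    correct = 0
    for i in range(nb_samples):
        x = np.reshape(np.round(fresh_samples[i] * scale), (1, 1, 28, 28))
        producer.get_operator().set_output(0, aidge_core.Tensor(x))
        scheduler.forward()
        output = model.get_output_nodes().pop().get_operator().get_output(0)
        if fresh_labels[i] == np.argmax(np.array(output)):
            correct += 1
    return correct / nb_samples

# for bits in [8, 6, 5, 4]:
#     print(f'{bits}-bit accuracy : {evaluate_at_bitwidth(bits) * 100:.3f}%')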
[15]:
print('That\'s all folks !')
That's all folks !