Aidge Compression Module Tutorial

This notebook demonstrates how to use the aidge_compression module to optimize and compress neural networks (specifically CNNs) within the Aidge framework.

How it Works

The aidge_compression module decomposes convolutional layers (using techniques such as SVD or Tucker decomposition) to reduce the number of parameters and multiply-accumulate operations (MACs). The process generally involves:

  1. Node Identification: The compressor scans the graph for Conv1D, Conv2D, and their padded variants.

  2. Sensitivity Analysis (Smart Mode): If an evaluation function is provided, the compressor iteratively tests different compression factors (0.1 to 1.0) on each layer independently to measure accuracy loss. It caches these results in cache.json.

  3. Global Optimization: Using the sensitivity data, it solves an optimization problem to find the best per-layer compression rates that meet the target global compression rate (e.g., 50% of original size) while maximizing accuracy.

  4. Decomposition: The selected layers are replaced with their decomposed, lightweight counterparts.
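
To see where the savings come from, the sketch below counts weights before and after one possible low-rank split of a convolution. The split layout (a k x k convolution down to `rank` channels, followed by a 1 x 1 convolution back up to the original output channels) and the helper names are illustrative assumptions, not the module's documented behavior.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias ignored)."""
    return c_out * c_in * k * k


def decomposed_params(c_in, c_out, k, rank):
    """Weights after an assumed split into two convolutions:
    a k x k conv (c_in -> rank) followed by a 1 x 1 conv (rank -> c_out)."""
    return rank * c_in * k * k + c_out * rank


c_in, c_out, k = 256, 256, 3
original = conv_params(c_in, c_out, k)
compressed = decomposed_params(c_in, c_out, k, rank=64)

print(f"original weights:   {original}")
print(f"compressed weights: {compressed}")
print(f"kept fraction:      {compressed / original:.3f}")
```

With unchanged stride and padding, both versions produce the same output resolution, so the MAC count shrinks by the same fraction as the weight count.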

Prerequisites

Ensure aidge_core, aidge_onnx, aidge_backend_cpu (or aidge_backend_cuda for GPU), and aidge_compression are installed. The cells below also use datasets, huggingface_hub, torch, and numpy.

[ ]:
# Install dependencies if needed
# aidge_compression is not yet on PyPI; install from the sub-repository:
!pip install git+https://gitlab.eclipse.org/eclipse/aidge/aidge_compression.git aidge_core aidge_onnx aidge_backend_cpu datasets huggingface_hub torch numpy
[ ]:
import aidge_core
import aidge_onnx
import aidge_backend_cpu  # Change to aidge_backend_cuda for GPU
import aidge_compression

import numpy as np
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
import time

1. Setup: Load Model and Dataset

We will use a ResNet18 model pre-trained on CIFAR-100.

[ ]:
# Download Model
def download_file(repo_id: str, filename: str, cache_dir: str = "./cache/models"):
    print(f"Downloading {filename} from {repo_id}...")
    return hf_hub_download(repo_id=repo_id, filename=filename, cache_dir=cache_dir)


model_path = download_file("EclipseAidge/resnet18", "resnet18_cifar100.onnx")

# Load Dataset (CIFAR-100)
dataset = load_dataset("uoft-cs/cifar100", split="test", cache_dir="./cache/datasets")


# Define Preprocessing Pipeline
class DataPipeline:
    def __init__(self, dataset):
        def transform_image(image):
            image = image.convert("RGB").resize((224, 224))
            np_image = np.array(image, dtype=np.float32) / 255.0
            np_image = np_image.transpose(2, 0, 1)  # CHW
            # Standard Normalization
            mean = np.array([0.5, 0.5, 0.5], dtype=np.float32).reshape(3, 1, 1)
            std = np.array([0.5, 0.5, 0.5], dtype=np.float32).reshape(3, 1, 1)
            np_image = (np_image - mean) / std
            return torch.from_numpy(np_image)

        def transform(examples):
            examples["pixel_values"] = [
                transform_image(image) for image in examples["img"]
            ]
            return examples

        self.testset = dataset.map(transform, batched=True)
        self.testset.set_format(type="torch", columns=["pixel_values", "fine_label"])


pipeline = DataPipeline(dataset)

2. Approach A: “Blind” Compression (VBMF)

This method compresses the graph using the Variational Bayesian Matrix Factorization (VBMF) algorithm. It is fast because it does not run inference to check accuracy, but instead relies on analytical properties of the weight matrices to estimate redundancy.

Use case: Initial experimentation, fast estimation, or when no evaluation dataset is available.

API Used: compress(graph)

[ ]:
# Reload a fresh copy of the model
model_blind = aidge_onnx.load_onnx(model_path)
# Layers to leave untouched (blacklist mode is the default)
ignores = {
    "conv1_Conv",
}

# Assign a custom compression factor to specific layers (-1.0 = not compressed)
ranks = {
    "": -1.0,
    # "layer1_layer1_1_conv1_Conv": -1.0,
    # "layer2_layer2_0_conv1_Conv": -1.0,
}

# Initialize Compressor
# Arguments: layers to ignore, per-layer factors, rank_compression_factor (0.0 to 1.0)
compressor = aidge_compression.Compressor(ignores, ranks, 0.8)

# Compress
# No eval_func is passed, so the heuristic algorithm (VBMF) is used
compressor.compress(model_blind)

print("Blind compression complete.")
# You can now export or inspect the model
# aidge_onnx.export_onnx(model_blind, "resnet18_blind_compressed.onnx", ...)

3. Approach B: Smart Compression (With Evaluation)

This mode trades speed for accuracy. You provide an eval_func, and the compressor performs an AutoML-style search:

  1. It perturbs every layer individually.

  2. It calls your eval_func to see how accuracy drops.

  3. It constructs an accuracy table (cached in cache.json).

  4. It selects the optimal compression per layer to hit your target global size.

API Used: compress(graph, eval_func)

[ ]:
def eval_func(
    model: aidge_core.GraphView, iterations: int = -1, backend: str = "cpu"
) -> float:
    """
    A callback function required by the Compressor.
    It must accept a GraphView and return a float (accuracy/score).
    """
    # Setup DataLoader
    testloader = torch.utils.data.DataLoader(
        pipeline.testset, batch_size=16, shuffle=False, num_workers=8
    )

    model.set_datatype(aidge_core.dtype.float32)
    model.set_backend(backend)  # "cpu" by default; pass "cuda" for GPU

    # Create Scheduler for Inference
    scheduler = aidge_core.SequentialScheduler(model)
    scheduler.generate_scheduling()

    correct = 0
    total = 0

    # Run Inference
    for i, batch in enumerate(testloader):
        if iterations > 0 and i >= iterations:
            break

        inputs = batch["pixel_values"]
        labels = batch["fine_label"]

        # Pass data to Aidge
        input_tensor = aidge_core.Tensor(inputs.numpy())
        scheduler.forward(data=[input_tensor])

        # Get Output
        for outNode in model.get_output_nodes():
            outNode.get_operator().get_output(0).set_backend("cpu")
            output = np.array(outNode.get_operator().get_output(0))
            predicted = np.argmax(output, axis=1)
            correct += (predicted == labels.numpy()).sum().item()
            total += labels.size(0)

    # Guard against an empty run to avoid a division by zero
    accuracy = 100 * correct / total if total else 0.0
    return accuracy

3.1 Advanced Configuration (Ignore Lists & Manual Ranks)

We can configure the compressor to:

  1. Ignore certain layers (e.g., the first convolution is usually sensitive).

  2. Force specific layers to have a specific compression rank (overriding the optimizer).

  3. Target a specific global compression ratio (e.g., 0.8 means keep 80% of parameters).

[ ]:
# Load fresh model
model_smart = aidge_onnx.load_onnx(model_path)

# 1. Define layers to ignore (blacklist mode is the default)
layer_list = {
    "conv1_Conv",  # The input layer is usually sensitive, so leave it untouched
}

# 2. Define forced ranks (Optional)
# A factor of -1.0 means the layer is not compressed
ranks = {
    "": -1.0,
    # "layer4_layer4_0_conv1_Conv": 0.5 # Force this specific layer to 0.5 factor
}

# 3. Initialize Compressor
target_compression_rate = 0.4  # Aim for 40% of original size

compressor = aidge_compression.Compressor(layer_list, ranks, target_compression_rate)

# 4. Run Compression with Evaluation
# This will trigger the multi-threaded sensitivity analysis
compressor.compress(model_smart, eval_func)

print("\nSmart compression complete.")

3.2 Using Whitelist Mode

If you only want to compress specific layers, use Flags.Whitelist.

[ ]:
model_white = aidge_onnx.load_onnx(model_path)

only_compress_these = {"layer1_layer1_0_conv1_Conv", "layer1_layer1_0_conv2_Conv"}

compressor_white = aidge_compression.Compressor(
    only_compress_these, {}, 0.8, flags=aidge_compression.Flags.Whitelist
)

compressor_white.compress(model_white, eval_func)

4. Export Result

After compression, the GraphView (model) is modified in-place. You can export it back to ONNX.

[ ]:
aidge_onnx.export_onnx(
    model_smart,
    "resnet18_smart_compressed.onnx",
    inputs_dims={"conv1_Conv": [[1, 3, 224, 224]]},  # Provide input dims for valid ONNX
    outputs_dims={"output": [[1, 100]]},
    opset=18,
)
print("Model exported successfully.")

5. Under the Hood: C++ Internals

The aidge_compression module is a high-performance C++ library wrapped in Python. Here is what happens behind the scenes when you call compress().

1. Node Identification & Filtering

When initialized, the C++ Compressor performs graph matching to locate candidates for compression. It looks for operators like Conv2D or PaddedConv2D. It filters out layers that are too small (e.g., dimensions < 8) as compressing them yields negligible gains. It also applies the Blacklist (default) or Whitelist (Flags::Whitelist) logic based on the names provided in the layer_list.
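
The filtering rules above can be sketched in plain Python. This is only an illustration of the logic, not the C++ implementation: the function name, the shape dictionary, and the choice to apply the size threshold to the channel dimensions are all assumptions.

```python
def select_candidates(layers, names, whitelist=False, min_dim=8):
    """Return the layer names that would be considered for compression.

    layers: dict mapping layer name -> weight shape (c_out, c_in, kh, kw)
    names:  set of layer names (a blacklist by default, a whitelist if requested)
    """
    selected = []
    for name, shape in layers.items():
        c_out, c_in = shape[0], shape[1]
        # Assumption: "too small" is judged on the channel dimensions
        if min(c_out, c_in) < min_dim:
            continue
        listed = name in names
        if whitelist and listed:
            selected.append(name)       # whitelist: keep only listed layers
        elif not whitelist and not listed:
            selected.append(name)       # blacklist: keep everything not listed
    return selected


# Hypothetical layer table for illustration
layers = {
    "conv1_Conv": (64, 3, 7, 7),
    "layer1_layer1_0_conv1_Conv": (64, 64, 3, 3),
    "layer1_layer1_0_conv2_Conv": (64, 64, 3, 3),
}

print(select_candidates(layers, {"conv1_Conv"}))
print(select_candidates(layers, {"layer1_layer1_0_conv1_Conv"}, whitelist=True))
```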

2. Blind Mode Heuristic (VBMF)

If no eval_func is provided (Approach A), the compressor relies on Variational Bayesian Matrix Factorization (VBMF). This analytical method estimates the global noise variance in the weight tensors to determine an optimal low-rank approximation. By analyzing the spread of singular values, VBMF can identify which components represent “signal” versus “noise,” automatically selecting a compression rank without needing empirical validation data.
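
VBMF itself is too involved for a short example. As a much simpler stand-in that conveys the same idea (choosing a rank from the singular value spectrum alone, without any validation data), the sketch below keeps the smallest rank that retains a given share of the squared singular values. This energy threshold is not VBMF and is not what aidge_compression runs; the function name and the 0.95 default are assumptions made for illustration.

```python
import numpy as np


def rank_for_energy(singular_values, energy=0.95):
    """Smallest rank whose leading singular values hold `energy` of the total
    squared spectrum. A simple stand-in for VBMF, not VBMF itself."""
    s2 = np.asarray(singular_values, dtype=np.float64) ** 2
    cumulative = np.cumsum(s2) / s2.sum()
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s2))  # guard against floating-point round-off


# Apply it to a randomly initialized conv weight reshaped into a matrix
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64, 3, 3))
spectrum = np.linalg.svd(weight.reshape(64, -1), compute_uv=False)
print("selected rank:", rank_for_energy(spectrum))
```

A random weight tensor has a flat spectrum, so the selected rank stays close to full. Trained layers with redundant filters typically show a faster decay, which is what makes a low rank viable.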

3. Parallel Sensitivity Analysis

When using “Smart Compression” (eval_func), the most computationally intensive part is the sensitivity analysis. The C++ engine:

  • Spawns a pool of worker threads (defaulting to hardware concurrency).

  • Iterates through every candidate layer.

  • For each layer, it tests compression factors ranging from 0.1 to 1.0 (in 0.1 increments).

  • Graph Cloning: For each test, it efficiently clones the GraphView. It does not deep copy the weights of every layer, only the topology, making this lightweight.

  • Python GIL Management: Since the eval_func is a Python function, the C++ threads must acquire the Python Global Interpreter Lock (GIL) to execute it. The implementation uses py::gil_scoped_acquire inside the worker threads and py::gil_scoped_release in the main waiting loop to ensure thread safety while allowing concurrency where possible.

  • Caching: Results are saved to cache.json. If a layer/factor combination exists in the cache, the expensive evaluation is skipped.
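
The loop and the caching behavior described above can be sketched sequentially in Python. The real engine runs in C++ worker threads; here the `"layer@factor"` key format, the `evaluate(layer, factor)` callback signature, and the helper name are hypothetical, and the actual layout of cache.json may differ.

```python
import json
import os
import tempfile

# Compression factors from 0.1 to 1.0 in 0.1 increments
FACTORS = [round(0.1 * i, 1) for i in range(1, 11)]


def build_sensitivity_table(layers, evaluate, cache_path):
    """Evaluate every (layer, factor) pair once, reusing cached results."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)

    for layer in layers:
        for factor in FACTORS:
            key = f"{layer}@{factor}"  # hypothetical key format
            if key not in cache:
                # Only pay for the expensive evaluation on a cache miss
                cache[key] = evaluate(layer, factor)

    with open(cache_path, "w") as f:
        json.dump(cache, f, indent=2)
    return cache


# Toy evaluation: accuracy grows with the factor (illustrative only)
def toy_eval(layer, factor):
    return 70.0 * factor


with tempfile.TemporaryDirectory() as tmp:
    table = build_sensitivity_table(
        ["layer_a", "layer_b"], toy_eval, os.path.join(tmp, "cache.json")
    )
    print(len(table), "entries")
```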

4. Global Optimization Algorithm

Once the sensitivity table (accuracy vs. compression factor for every layer) is built, the compressor solves a global optimization problem:

  • It calculates the original_cost (total parameter count) of the model.

  • It uses a binary search algorithm to find an optimal accuracy_objective.

  • Selection Logic: For a given accuracy_objective, it selects, for each layer, the strongest compression that still satisfies that accuracy threshold.

  • It iterates this process until the total size of the compressed layers, relative to original_cost, meets your target rank_compression_factor (e.g., 0.8).
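
A minimal sketch of this search follows, under stated assumptions: a per-layer factor is read as the fraction of that layer's parameters that is kept, a layer with no factor meeting the objective stays uncompressed, and accuracy is a percentage between 0 and 100. The function names and toy numbers are illustrative, not taken from the C++ source.

```python
def pick_factors(table, objective):
    """For each layer, choose the smallest kept fraction whose measured
    accuracy still reaches the objective (1.0 = leave uncompressed)."""
    picks = {}
    for layer, curve in table.items():
        ok = [factor for factor, acc in curve.items() if acc >= objective]
        picks[layer] = min(ok) if ok else 1.0
    return picks


def optimize(table, params, target, steps=30):
    """Binary-search the highest accuracy objective that fits the size budget."""
    original_cost = sum(params.values())
    lo, hi = 0.0, 100.0
    best = pick_factors(table, lo)  # most aggressive choice as a fallback
    for _ in range(steps):
        mid = (lo + hi) / 2
        picks = pick_factors(table, mid)
        cost = sum(params[layer] * factor for layer, factor in picks.items())
        if cost <= target * original_cost:
            best, lo = picks, mid   # fits the budget: try a stricter objective
        else:
            hi = mid                # too large: relax the objective
    return best


# Toy sensitivity table: layer -> {kept fraction: accuracy in %}
table = {
    "a": {0.25: 60.0, 0.5: 70.0, 1.0: 75.0},
    "b": {0.25: 40.0, 0.5: 72.0, 1.0: 75.0},
}
params = {"a": 1000, "b": 3000}

print(optimize(table, params, target=0.5))
```

The binary search is valid because the total cost never decreases as the objective rises: a stricter objective can only remove the smaller factors from each layer's set of acceptable choices.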

5. Decomposition (SVD/Tucker)

Finally, the chosen compression factors are applied. The C++ code replaces the original Conv nodes with a sequence of smaller nodes (e.g., using Singular Value Decomposition or Tucker decomposition) that approximate the original tensor operation with fewer parameters.
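
As an illustration of the SVD variant, the NumPy sketch below factors a convolution weight into two smaller weights. A convolution is a linear map over input patches, so a low-rank factorization of the reshaped weight matrix corresponds to a k x k convolution with `rank` output channels followed by a 1 x 1 convolution. The function name and this particular layout are assumptions; the node sequence actually emitted by aidge_compression may differ.

```python
import numpy as np


def svd_split_conv(weight, rank):
    """Split a (c_out, c_in, kh, kw) weight into two low-rank conv weights."""
    c_out, c_in, kh, kw = weight.shape
    matrix = weight.reshape(c_out, c_in * kh * kw)
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # First conv: kh x kw kernel, c_in -> rank (singular values folded in)
    first = (s[:rank, None] * vt[:rank]).reshape(rank, c_in, kh, kw)
    # Second conv: 1 x 1 kernel, rank -> c_out
    second = u[:, :rank].reshape(c_out, rank, 1, 1)
    return first, second


rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 8, 3, 3))
first, second = svd_split_conv(weight, rank=6)

# Recombine the two factors to measure the approximation error
approx = second.reshape(16, 6) @ first.reshape(6, -1)
error = np.linalg.norm(weight.reshape(16, -1) - approx) / np.linalg.norm(weight)

print("shapes:", first.shape, second.shape)
print("weights:", weight.size, "->", first.size + second.size)
print(f"relative error: {error:.3f}")
```

Any bias of the original layer would be attached to the second convolution, since the first one only produces the intermediate low-rank features.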