Benchmark a model on a target#

This tutorial shows how to benchmark a complete neural network directly on an embedded or edge target with Aidge.

By the end, you will have prepared a model, adapted it for the target, validated it on the target, and collected benchmark results. Are you ready? Let’s get started!

1. Imports and paths#

We use Aidge modules for graph manipulation, quantization, export, deployment, and result analysis.

We’ll start by importing the necessary modules and setting up the paths. The dataset loaders already exist in aidge/examples/benchmark, so we’ll reuse them in this example.

[ ]:

from __future__ import annotations

# Standard library imports
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import sys

# Third-party imports
import numpy as np

# Aidge modules
import aidge_backend_cpu
import aidge_core
import aidge_onnx
import aidge_quantization
from aidge_core.export_utils import adapt_graph, propagate, str2aidge
from aidge_core.static_analysis import StaticAnalysis
from aidge_export_arm_cortexm import benchmark as arm_benchmark
from aidge_export_arm_cortexm.benchmark_helpers.charts import export_benchmark_charts
from aidge_export_arm_cortexm.export_registry import ExportLibAidgeARM, ExportLibCMSISNN
from aidge_export_arm_cortexm.flash import parsing_log_output
from aidge_export_arm_cortexm.hardware_in_the_loop import run_model_on_target
from aidge_export_arm_cortexm.utils import estimate_power_metrics
from aidge_export_cpp import ExportLibCpp

# Set up paths to benchmark data helpers
BENCHMARK_CANDIDATES = [
    Path("../../benchmark").resolve(),
    Path("aidge/examples/benchmark").resolve(),
]
BENCHMARK_DIR = next(
    path for path in BENCHMARK_CANDIDATES if (path / "dataset_loaders").is_dir()
)
if str(BENCHMARK_DIR) not in sys.path:
    sys.path.insert(0, str(BENCHMARK_DIR))

from dataset_loaders.dataset_loaders import get_dataset_loader

print(f"Using benchmark data helpers from: {BENCHMARK_DIR}")
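
The `next(...)` lookup above raises a bare `StopIteration` when neither candidate contains a `dataset_loaders` folder. A minimal sketch of a variant that reports the searched paths instead; `find_benchmark_dir` is a hypothetical helper, not part of Aidge, and the demonstration runs in a temporary directory:

```python
import tempfile
from pathlib import Path


def find_benchmark_dir(candidates: list[Path]) -> Path:
    """Return the first candidate that contains a dataset_loaders folder."""
    found = next(
        (path for path in candidates if (path / "dataset_loaders").is_dir()),
        None,
    )
    if found is None:
        searched = ", ".join(str(path) for path in candidates)
        raise FileNotFoundError(f"No dataset_loaders folder found in: {searched}")
    return found


# Demonstration in a throwaway directory: only the second candidate qualifies.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "benchmark" / "dataset_loaders").mkdir(parents=True)
    resolved_name = find_benchmark_dir([root / "missing", root / "benchmark"]).name

print(resolved_name)
```

Passing a default of `None` to `next` keeps the lookup lazy while letting the caller decide how to report a missing folder.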

2. Choose one benchmark configuration#

To run the benchmark, we first need a configuration that defines the model, target, adaptation, and deployment parameters. We’ll show some example configurations below and select one to use for the rest of the tutorial.

Important fields:

model_path: registered model name or local ONNX path
board: target family/profile, for example stm32h7 or raspberrypi
backend:
- local: if you want to use your own UART to flash your target (mostly microcontrollers from the Cortex-M family)
- api: if you want to use a board-farm API to flash your target (we’ll explain a little bit more about this in the deployment section)
- ssh_docker: if you want to connect to a remote Linux SBC over SSH and build/run the benchmark inside a Docker image on the board
- ssh_native: same as ssh_docker but the export is built and run directly on the board, without Docker, using the board’s native toolchain
dtype: float32 or int8
libs: ordered export kernel libraries, for example ['cmsis', 'arm', 'cpp']
heuristic: graph adaptation heuristic, for example nb_op, mem, cost, or runtime
validation_on_target: also send labeled samples and compute target accuracy

[ ]:

@dataclass
class BenchmarkConfig:
    # Model and data
    model_path: str = "lenet_mnist"
    data_root: str = "./data"
    mock_db: bool = False

    # Host preparation
    no_cuda: bool = True
    dtype: str = "float32"
    nb_test: int = 10
    nb_calib: int = 20
    sample_index: int = 0

    # Target benchmark protocol
    board: str = "stm32h7"
    nb_warmup: int = 5
    iterations: int = 5
    profiling: bool = True
    validation_on_target: bool = False
    tinymlbenchmark: bool = False

    # Export and graph adaptation
    cmsis: bool = False
    libs: list[str] | None = None
    heuristic: str = "nb_op"
    heuristic_cache: str = ".aidge_inference_time_cache.json"
    dformat: str = "nhwc"
    show: bool = False
    no_docker: bool = False

    # Result files
    charts_dir: str = "charts/tutorial_runs"
    no_charts: bool = False

    # Execution backend: local, api, ssh_docker, or ssh_native
    backend: str = "local"
    api_url: str = "http://127.0.0.1:8000"
    api_username: str = "admin"
    api_password: str = "changeme"
    api_board_name: str = "stm32h7_lab_01"
    api_poll_interval: float = 2.0
    api_request_timeout: float = 30.0

    ssh_host: str | None = None
    ssh_user: str = "admin"
    ssh_key_path: str | None = None
    ssh_port: int | None = None
    ssh_known_hosts_path: str | None = None
    ssh_remote_workdir: str = "/tmp/aidge-board-farm"
    sbc_docker_image: str | None = None
    sbc_end_keyword: str = "AIDGE SBC BENCHMARK DONE"
    sbc_max_runtime_seconds: float | None = None
    sbc_inner_iterations: int = 100
    sbc_export_backend: str | None = None
    sbc_board_name: str = "raspberrypi"
    board_config_file: str | None = None

    # Logging
    verbose: int = 0  # Errors only


# Default benchmark configuration for this tutorial: Local UART benchmark on an STM32H7 target with the LeNet-MNIST model, using the "nb_op" heuristic and profiling enabled.
config = BenchmarkConfig(
    model_path="lenet_mnist",
    board="stm32h7",
    backend="local",
    dtype="int8",
    libs=["cmsis", "arm", "cpp"],
    heuristic="nb_op",
    profiling=True,
    cmsis=True,
    validation_on_target=False,
    iterations=5,
    nb_warmup=5,
    nb_test=10,
    nb_calib=20,
    verbose=0,
)

config.target = config.board
config

Other common configurations#

Here are some example templates to copy into the previous cell when needed.

[ ]:

# Local STM32H7 connected to your machine.
local_stm32h7 = BenchmarkConfig(
    model_path="deep_autoencoder",
    board="stm32h7",
    backend="local",
    dtype="float32",
    profiling=True,
)

# Quantized STM32H7 benchmark with CMSIS-NN kernels, submitted through the board-farm API.
int8_cmsis_api = BenchmarkConfig(
    model_path="resnet8_cifar10",
    board="stm32h7",
    backend="api",
    api_url="http://127.0.0.1:8000",
    api_username="admin",
    api_password="changeme",
    api_board_name="stm32h7_lab_01",
    dtype="int8",
    cmsis=True,
    nb_calib=50,
    profiling=True,
    validation_on_target=True,
)

# Raspberry Pi or Linux SBC through SSH + Docker.
raspberrypi_ssh = BenchmarkConfig(
    model_path="resnet18",
    board="raspberrypi",
    backend="ssh_docker",
    ssh_host="127.0.0.1",
    ssh_user="admin",
    ssh_key_path="~/.ssh/id_rsa",
    sbc_docker_image="aidge:daily",
    libs=["xnnpack", "cpp"],
    heuristic="runtime",
    dformat="nhwc",
    profiling=True,
)

# Same Linux SBC, but built and run natively on the board (no Docker).
# sbc_docker_image is not needed here.
raspberrypi_ssh_native = BenchmarkConfig(
    model_path="resnet18",
    board="raspberrypi",
    backend="ssh_native",
    ssh_host="127.0.0.1",
    ssh_user="admin",
    ssh_key_path="~/.ssh/id_rsa",
    libs=["xnnpack", "cpp"],
    heuristic="runtime",
    dformat="nhwc",
    profiling=True,
)

# MLPerf Tiny compatible harness, only for models with a TinyML ID.
tinyml_api = BenchmarkConfig(
    model_path="resnet8_cifar10",
    board="stm32h7",
    tinymlbenchmark=True,
    profiling=True,
)
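
Rather than retyping every field, a template can also be derived from an existing configuration with `dataclasses.replace`. A minimal sketch, using a cut-down stand-in dataclass (`MiniConfig` is hypothetical and only mirrors three of the fields above):

```python
from dataclasses import dataclass, replace


@dataclass
class MiniConfig:
    # Stand-in for BenchmarkConfig with only the fields needed here.
    model_path: str = "lenet_mnist"
    dtype: str = "float32"
    cmsis: bool = False


base = MiniConfig()
# replace() returns a new object; the base configuration is left untouched.
int8_variant = replace(base, dtype="int8", cmsis=True)

print(base)
print(int8_variant)
```

Note that `replace` only copies dataclass fields. An attribute added after construction, such as `config.target` in the previous cell, is not a field and would need to be set again on the new object.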

3. Validate the configuration#

Before touching the model, check the options that can fail immediately: precision, iteration counts, graph heuristic, and data format.

[ ]:

SUPPORTED_TYPES = ["float32", "int8"]


def validate_config(config: BenchmarkConfig) -> None:
    if config.dtype not in SUPPORTED_TYPES:
        raise ValueError(
            f"Unsupported dtype {config.dtype!r}. Supported values: {SUPPORTED_TYPES}"
        )
    if config.nb_test <= 0:
        raise ValueError("nb_test must be strictly positive.")
    if config.nb_calib <= 0:
        raise ValueError("nb_calib must be strictly positive.")
    if config.iterations <= 0:
        raise ValueError("iterations must be strictly positive.")
    if config.nb_warmup < 0:
        raise ValueError("nb_warmup must be non-negative.")
    if config.cmsis and config.dtype == "float32":
        raise ValueError("CMSIS-NN export requires dtype='int8' in this flow.")
    if config.heuristic not in {"cost", "mem", "nb_op", "runtime"}:
        raise ValueError("heuristic must be one of: cost, mem, nb_op, runtime.")
    if config.dformat not in {"nhwc", "nchw"}:
        raise ValueError("dformat must be 'nhwc' or 'nchw'.")


def configure_logging(verbosity: int) -> None:
    if verbosity == 0:
        aidge_core.Log.set_console_level(aidge_core.Level.Error)
    elif verbosity == 1:
        aidge_core.Log.set_console_level(aidge_core.Level.Notice)
    elif verbosity == 2:
        aidge_core.Log.set_console_level(aidge_core.Level.Info)
    else:
        aidge_core.Log.set_console_level(aidge_core.Level.Debug)


validate_config(config)
configure_logging(config.verbose)
print("Configuration is valid.")
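
`validate_config` stops at the first invalid option. When several fields are edited at once, it can be handy to see every problem in one pass. A minimal sketch of that variant, limited to the precision checks (`check_precision` is a hypothetical helper, not an Aidge function):

```python
def check_precision(dtype: str, cmsis: bool) -> list[str]:
    """Return every precision problem found instead of raising on the first."""
    problems = []
    if dtype not in ("float32", "int8"):
        problems.append(f"unsupported dtype {dtype!r}")
    if cmsis and dtype != "int8":
        problems.append("CMSIS-NN export requires dtype='int8'")
    return problems


print(check_precision("int8", cmsis=True))     # no problem
print(check_precision("float32", cmsis=True))  # one problem
print(check_precision("int4", cmsis=True))     # two problems
```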

4. Register models and select the dataset#

A registered model gives the notebook three useful pieces of information:

where to download the ONNX model if needed
which dataset loader to use
which MLPerf Tiny model ID to use, when compatible

You can also set config.model_path to a local ONNX path. In that case, the notebook falls back to the MNIST loader unless you adapt the dataset selection.

[ ]:

MODELS_INFO = {
    "lenet_mnist": {
        "url": "https://huggingface.co/EclipseAidge/LeNet/resolve/main/lenet_mnist.onnx?download=true",
        "dataset": "mnist",
    },
    "resnet8_cifar10": {
        "dataset": "cifar10",
        "tinyml_benchmark_model": "ic01",
        "url": "https://huggingface.co/EclipseAidge/resnet8/resolve/main/resnet8.onnx?download=true",
    },
    "mobilenet_v1_vww": {
        "dataset": "vww",
        "tinyml_benchmark_model": "vww01",
        "url": "https://huggingface.co/EclipseAidge/mobilenet_v1_vww/resolve/main/mobilenet_v1_vww.onnx?download=true",
    },
    "ds_cnn": {
        "dataset": "mock_ds_cnn",
        "tinyml_benchmark_model": "kws01",
        "url": "https://huggingface.co/EclipseAidge/ds_cnn/resolve/main/ds_cnn.onnx?download=true",
    },
    "deep_autoencoder": {
        "dataset": "mock_deep_autoencoder",
        "tinyml_benchmark_model": "ad01",
        "url": "https://huggingface.co/EclipseAidge/deep_autoencoder/resolve/main/deep_autoencoder.onnx?download=true",
        "remove_classifier_tail": False,
    },
    "yolov8m_static": {
        "dataset": "mock_yolov8m",
        "url": "https://huggingface.co/EclipseAidge/YOLOv8m/resolve/main/yolov8m_static.onnx?download=true",
        "remove_classifier_tail": False,
    },
    "resnet18": {
        "dataset": "imagenet",
        "url": "https://huggingface.co/EclipseAidge/resnet18/resolve/main/resnet18_imagenet_1k.onnx?download=true",
    },
    "resnet50": {
        "dataset": "imagenet",
        "url": "https://huggingface.co/EclipseAidge/resnet50/resolve/main/resnet50_v1-5.onnx?download=true",
        "matmul_to_fc": True,
    },
    "vgg16": {
        "dataset": "imagenet",
        "url": "https://huggingface.co/EclipseAidge/vgg16/resolve/main/vgg16.onnx",
        "matmul_to_fc": True,
    },
}

model_name = Path(config.model_path).stem
dataset_name = MODELS_INFO.get(model_name, {}).get("dataset", "mnist")
print(f"Model: {model_name}")
print(f"Dataset loader: {dataset_name}")

if config.tinymlbenchmark:
    tinyml_benchmark_model = MODELS_INFO.get(model_name, {}).get(
        "tinyml_benchmark_model"
    )
    if tinyml_benchmark_model is None:
        raise ValueError(f"{model_name!r} has no TinyML benchmark model ID.")
    config.tinyml_benchmark_model = tinyml_benchmark_model
    print(f"TinyML benchmark model ID: {config.tinyml_benchmark_model}")
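
The lookup above keys on the file stem, so a registered name and a path whose file name matches it resolve to the same entry, while anything else falls back to MNIST. A small sketch with a two-entry stand-in registry (the `models/...` paths are hypothetical):

```python
from pathlib import Path

# Two-entry stand-in for the MODELS_INFO registry.
REGISTRY = {
    "lenet_mnist": {"dataset": "mnist"},
    "resnet8_cifar10": {"dataset": "cifar10"},
}


def resolve_dataset(model_path: str) -> tuple[str, str]:
    """Return (model name, dataset loader name) for a registered name or a path."""
    name = Path(model_path).stem
    return name, REGISTRY.get(name, {}).get("dataset", "mnist")


print(resolve_dataset("resnet8_cifar10"))              # registered name
print(resolve_dataset("models/resnet8_cifar10.onnx"))  # path with a registered stem
print(resolve_dataset("models/my_net.onnx"))           # unknown model, MNIST fallback
```

The silent MNIST fallback is convenient for quick tests, but for a custom model it is worth confirming that the printed dataset loader is the one you intended.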

5. Load and clean the model#

Once you have your model loaded, Aidge provides a set of graph manipulation utilities to clean up the model and make it easier to adapt. These include:

replace selected MatMul patterns with fully connected operators
remove operators that are not implemented for a given backend
remove flatten where possible
fuse batch normalization
expand meta-operators

There are many others. If you’re interested in learning more about these transformations, check out the graph manipulation tutorial.

[ ]:

def load_model(model_path: str) -> aidge_core.GraphView:
    model_name = Path(model_path).stem
    local_path = MODELS_INFO.get(model_name, {}).get("local_path")

    if model_name in MODELS_INFO and local_path is None:
        aidge_core.utils.download_file(model_path, MODELS_INFO[model_name]["url"])
    else:
        print(f"Model '{model_name}' - attempting to load from local path.")

    model = aidge_onnx.load_onnx(model_path)

    if model_name == "ds_cnn" or MODELS_INFO.get(model_name, {}).get(
        "matmul_to_fc", False
    ):
        aidge_core.matmul_to_fc(model)

    if model_name == "resnet50":
        matches = aidge_core.SinglePassGraphMatching(model).match("ArgMax")
        for match in matches:
            aidge_core.GraphView.replace(match.graph, aidge_core.GraphView())

    aidge_core.remove_flatten(model)
    aidge_core.fuse_batchnorm(model)
    aidge_core.expand_metaops(model, name_format="{0}_{1}_{2}")

    matches = aidge_core.SinglePassGraphMatching(model).match("Softmax")
    for match in matches:
        aidge_core.GraphView.replace(match.graph, aidge_core.GraphView())

    if model_name == "ds_cnn":
        matches = aidge_core.SinglePassGraphMatching(model).match("Reshape<*-Producer?")
    elif MODELS_INFO.get(model_name, {}).get("remove_classifier_tail", True):
        matches = aidge_core.SinglePassGraphMatching(model).match(
            "Transpose->Reshape#; Reshape#<*-Producer?"
        )
    else:
        matches = []

    for match in matches:
        aidge_core.GraphView.replace(match.graph, aidge_core.GraphView())

    return model


loaded_model = load_model(config.model_path)
model = loaded_model.clone()
print("Model loaded and cleaned for export preparation.")

6. Load samples and prepare host tensors#

The host side uses these samples for three things:

quick reference inferences before export
calibration when dtype='int8'
benchmark and validation tensors sent to the target

nb_samples is chosen so every later step has enough samples.
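
As a quick check with the values of the configuration selected in this tutorial (nb_test=10, nb_calib=20, sample_index=0, iterations=5), the sample count is simply the largest of the individual requirements:

```python
# Values taken from the configuration selected earlier in this tutorial.
nb_test, nb_calib, sample_index, iterations = 10, 20, 0, 5

# sample_index is zero-based, so reaching it needs sample_index + 1 samples.
nb_samples = max(nb_test, nb_calib, sample_index + 1, iterations)
print(nb_samples)  # 20
```

Here calibration is the most demanding step, so 20 samples are loaded and the other steps reuse a prefix of them.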

[ ]:

def set_sample_metadata(
    tensors: list[aidge_core.Tensor],
    backend: str,
    dtype: aidge_core.dtype = aidge_core.dtype.float32,
    data_format: aidge_core.dformat = aidge_core.dformat.nchw,
) -> None:
    for tensor in tensors:
        tensor.to_backend(backend)
        tensor.to_dtype(dtype)
        tensor.set_data_format(data_format)


use_cuda = not config.no_cuda
if use_cuda:
    import aidge_backend_cuda
host_backend = "cuda" if use_cuda else "cpu"

nb_samples = max(
    config.nb_test,
    config.nb_calib,
    config.sample_index + 1,
    config.iterations,
)

loader = get_dataset_loader(dataset_name)
sample_arrays, tensors, labels = loader.load_samples(
    sample_count=nb_samples,
    backend=host_backend,
    data_root=config.data_root,
    mock_db=config.mock_db,
)
set_sample_metadata(tensors, backend=host_backend)

# Keep pristine references so notebook cells can be rerun safely.
float_sample_arrays = sample_arrays
float_tensors = tensors

print(f"Loaded {len(tensors)} samples on host backend: {host_backend}")
print(
    f"First sample dimensions: {tensors[0].dims() if callable(tensors[0].dims) else tensors[0].dims}"
)

7. Run reference inference on the host#

Before exporting, we’ll run a few samples with the Aidge scheduler. The goal here is only to check that model loading and data loading are coherent.

[ ]:

def run_host_inference_examples(
    model: aidge_core.GraphView,
    scheduler: aidge_core.SequentialScheduler,
    tensors: list[aidge_core.Tensor],
    labels: list[int],
    nb_test: int,
    title: str = "HOST EXAMPLE INFERENCES",
) -> float:
    print(f"\n{title}")
    nb_valid = 0

    for i in range(nb_test):
        output_array = propagate(scheduler, [tensors[i]])
        prediction = int(np.argmax(output_array))
        confidence = float(np.max(output_array))
        print(labels[i], " VS ", prediction, " -> ", confidence)
        if labels[i] == prediction:
            nb_valid += 1

    accuracy = nb_valid / nb_test
    print(f"\nMODEL ACCURACY = {accuracy * 100:.2f}%")
    return accuracy


model.set_datatype(aidge_core.dtype.float32)
model.set_backend(host_backend)

scheduler = aidge_core.SequentialScheduler(model)
scheduler.generate_scheduling()
host_accuracy = run_host_inference_examples(
    model, scheduler, tensors, labels, config.nb_test
)
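
The value printed by the host check is plain top-1 accuracy: the index of the highest output is compared with the label. A minimal sketch on plain lists, with made-up scores rather than real model output:

```python
def top1_accuracy(outputs: list[list[float]], labels: list[int]) -> float:
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = 0
    for scores, label in zip(outputs, labels):
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += prediction == label
    return correct / len(labels)


# Illustrative scores for three samples and three classes (not real model output).
outputs = [[0.1, 2.3, 0.4], [1.9, 0.2, 0.1], [0.3, 0.2, 0.1]]
labels = [1, 0, 2]
print(f"{top1_accuracy(outputs, labels) * 100:.2f}%")  # 66.67%
```

Because `load_model` removes Softmax nodes, the confidence printed by the host check is a raw output score rather than a probability.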

8. Optional int8 quantization#

When working with embedded targets, quantization is often necessary to meet memory and latency constraints. The idea is to quantize the model and run a quick host check before exporting, to make sure the quantized graph still works and the accuracy has not dropped too much. Using Aidge’s quantization module, the cell below performs the following steps:

start from a clean loaded model
assign activation, weight, and bias precisions
calibrate the graph with the original float calibration tensors
manually quantize the input samples
run another host check on the quantized graph
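
The manual input quantization step maps each sample onto the signed 8-bit range by scaling with 2^(nb_bits - 1) - 1 = 127. A pure-Python sketch of the same arithmetic on a handful of values (`quantize_pixels` is an illustrative helper, not the function used in the cell below):

```python
def quantize_pixels(pixels: list[float], nb_bits: int = 8) -> list[int]:
    """Map pixel values onto the signed integer range used for int8 inputs."""
    rescaling = 2 ** (nb_bits - 1) - 1  # 127 for 8 bits
    # Inputs above 1.0 are treated as 0..255 pixels and normalized first.
    scale = 255.0 if max(pixels) > 1.0 else 1.0
    # Built-in round() rounds halves to even, like np.round.
    return [round(p / scale * rescaling) for p in pixels]


print(quantize_pixels([0, 128, 255]))    # [0, 64, 127]
print(quantize_pixels([0.0, 0.5, 1.0]))  # [0, 64, 127]
```

Both calls give the same result because the first input is detected as a 0..255 image and normalized before scaling.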

[ ]:

def quantize_samples(
    arrays: list[np.ndarray],
    dtype_name: str,
    nb_bits: int = 8,
) -> tuple[list[np.ndarray], list[aidge_core.Tensor]]:
    rescaling = 2 ** (nb_bits - 1) - 1
    quantized_arrays = []
    quantized_tensors = []

    for array in arrays:
        input_array = array / 255.0 if np.max(array) > 1.0 else array
        quantized_array = np.round(input_array * rescaling).astype(int)
        quantized_tensor = aidge_core.Tensor(quantized_array.copy())
        quantized_tensor.to_backend("cpu")
        quantized_tensor.to_dtype(str2aidge(dtype_name))
        quantized_tensor.set_data_format(aidge_core.dformat.nchw)

        quantized_arrays.append(quantized_array)
        quantized_tensors.append(quantized_tensor)

    return quantized_arrays, quantized_tensors


model = loaded_model.clone()
model.set_datatype(aidge_core.dtype.float32)
model.set_backend(host_backend)
scheduler = aidge_core.SequentialScheduler(model)
scheduler.generate_scheduling()

sample_arrays = float_sample_arrays
tensors = float_tensors
benchmark_arrays = sample_arrays

if config.dtype != "float32":
    aidge_quantization.auto_assign_node_precision(
        graphview=model,
        act_precision=aidge_core.dtype.int8,
        weight_precision=aidge_core.dtype.int8,
        bias_precision=aidge_core.dtype.int32,
    )

    aidge_quantization.quantize_network(
        network=model,
        calibration_set=tensors[: config.nb_calib],
        clipping_mode=aidge_quantization.Clipping.MAX,
        single_shift=True,
    )

    benchmark_arrays, tensors = quantize_samples(sample_arrays, config.dtype)

    scheduler.reset_scheduling()
    scheduler.generate_scheduling()
    quantized_host_accuracy = run_host_inference_examples(
        model,
        scheduler,
        tensors,
        labels,
        config.nb_test,
        title="HOST QUANTIZED EXAMPLE INFERENCES",
    )
else:
    print("Quantization skipped because dtype is float32.")

9. Select export libraries#

Given Aidge’s multi-library export capabilities, you can choose which libraries to prioritize for your benchmark. The libs field in the configuration allows you to specify an ordered list of libraries that Aidge will use when exporting the kernels of your model. Aidge will attempt to use the first library in the list for each operator, and if an operator is not supported by that library, it will fall back to the next library in the list, and so on.

This is particularly useful because some libraries perform better on certain operators or models, for example by using vectorized operations, while others offer broader operator coverage. By choosing the order of the libraries, you can optimize benchmark performance while ensuring that all operators are supported.

Common choices:

['arm', 'cpp']: Aidge ARM kernels with C++ fallback
['cmsis', 'arm', 'cpp']: CMSIS-NN first, then ARM and C++ fallback
['xnnpack', 'cpp']: XNNPACK first, then C++ fallback for Linux SBC targets

When config.libs is not set, config.cmsis=True is a shortcut for ['cmsis', 'arm', 'cpp'], but remember that the CMSIS-NN kernels must be built and available in your Python environment for it to work. You’ll also need to set dtype to int8, since CMSIS-NN only supports quantized models.
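
The priority rule described above can be pictured with a toy model: each operator goes to the first library in the list that covers it. The coverage sets below are invented for illustration and do not reflect what the real export libraries support:

```python
# Hypothetical operator coverage, for illustration only; real coverage
# depends on the installed export libraries.
COVERAGE = {
    "cmsis": {"Conv2D", "FC"},
    "arm": {"Conv2D", "FC", "ReLU"},
    "cpp": {"Conv2D", "FC", "ReLU", "Reshape"},
}


def pick_library(operator: str, libs: list[str]) -> str:
    """Return the first library in priority order that covers the operator."""
    for lib in libs:
        if operator in COVERAGE[lib]:
            return lib
    raise LookupError(f"No library in {libs} supports {operator!r}")


libs = ["cmsis", "arm", "cpp"]
assignment = {op: pick_library(op, libs) for op in ["Conv2D", "ReLU", "Reshape"]}
print(assignment)  # {'Conv2D': 'cmsis', 'ReLU': 'arm', 'Reshape': 'cpp'}
```

This sketch only models the fallback order. The actual graph adaptation step also weighs candidates with the selected heuristic.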

[ ]:

def get_export_libs(config: BenchmarkConfig) -> list[type]:
    if config.libs:
        libs = []
        requested_libs = [
            lib_name
            for lib in config.libs
            for lib_name in str(lib).replace(",", " ").split()
        ]
        for lib in requested_libs:
            if lib == "cmsis":
                libs.append(ExportLibCMSISNN)
            elif lib == "arm":
                libs.append(ExportLibAidgeARM)
            elif lib == "cpp":
                libs.append(ExportLibCpp)
            elif lib == "xnnpack":
                import importlib.util

                if importlib.util.find_spec("aidge_export_xnnpack"):
                    from aidge_export_xnnpack import ExportLibXNNPack

                    libs.append(ExportLibXNNPack)
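                else:
                    aidge_core.Log.warn(
                        "xnnpack was requested but aidge_export_xnnpack is not installed. Ignoring."
                    )
            else:
                # Report unrecognized names instead of silently dropping them.
                aidge_core.Log.warn(f"Unknown export library {lib!r}. Ignoring.")
        return libs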

    if config.cmsis:
        return [ExportLibCMSISNN, ExportLibAidgeARM, ExportLibCpp]
    return [ExportLibAidgeARM, ExportLibCpp]


export_libs = get_export_libs(config)
print("Export libraries:")
for lib in export_libs:
    print("-", lib.__name__)

10. Select the graph adaptation heuristic#

Aidge’s graph adaptation process can be guided by different heuristics that prioritize different aspects of the adapted graph. The heuristic field in the configuration allows you to select which heuristic to use during graph adaptation. The available heuristics are:

nb_op: choose the graph with fewer operators, good default
mem: choose the graph with lower estimated peak memory
cost: choose the graph with lower adaptation cost, as estimated by the Aidge cost model
runtime: choose the graph with lower measured on-target runtime (might require multiple iterations and more time, but can lead to better results)

The runtime heuristic will use config.heuristic_cache to avoid remeasuring known candidates.
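
Conceptually, each heuristic is a scoring function over candidate graphs, and the candidate with the lowest score is kept. A toy sketch of that idea, assuming "lower is better" as the list above suggests; the candidates and their metrics are made up, and the real Aidge evaluators operate on graphs rather than dictionaries:

```python
# Made-up candidate graphs with made-up metrics, for illustration only.
candidates = [
    {"name": "A", "nb_ops": 12, "mem_peak": 48_000},
    {"name": "B", "nb_ops": 15, "mem_peak": 32_000},
]

# Each heuristic is a scoring function; the lowest score wins.
heuristics = {
    "nb_op": lambda graph: graph["nb_ops"],
    "mem": lambda graph: graph["mem_peak"],
}


def select(heuristic_name: str) -> str:
    """Return the name of the candidate that minimizes the chosen score."""
    return min(candidates, key=heuristics[heuristic_name])["name"]


print(select("nb_op"))  # A: fewer operators
print(select("mem"))    # B: lower peak memory
```

The two heuristics disagree here, which is why the choice matters: the best graph depends on the constraint you care about on the target.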

[ ]:

def get_heuristic(heuristic_name: str):
    from aidge_core.export_utils.graph_evaluator import (
        evaluate_adaptation_cost,
        evaluate_inference_time,
        evaluate_mem_peak,
        evaluate_nb_ops,
    )

    mapping = {
        "cost": evaluate_adaptation_cost,
        "mem": evaluate_mem_peak,
        "nb_op": evaluate_nb_ops,
        "runtime": evaluate_inference_time,
    }
    return mapping[heuristic_name]


heuristic_func = get_heuristic(config.heuristic)
print(f"Using graph adaptation heuristic: {config.heuristic}")

11. Attach input metadata to the export graph#

The export graph is a clone of the host model. We attach tensor metadata to its mandatory inputs so the exporter knows shapes, data types, and layout.

The deployed input tensor can be converted to nhwc or nchw through config.dformat.
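
The two layouts differ only in where the channel axis sits. A short sketch of the dimension reordering for a single-channel 28x28 image such as an MNIST digit (`nchw_dims_to_nhwc` is an illustrative helper, not an Aidge function):

```python
def nchw_dims_to_nhwc(dims: list[int]) -> list[int]:
    """Reorder [N, C, H, W] dimensions to [N, H, W, C]."""
    n, c, h, w = dims
    return [n, h, w, c]


print(nchw_dims_to_nhwc([1, 1, 28, 28]))  # [1, 28, 28, 1]
print(nchw_dims_to_nhwc([1, 3, 32, 32]))  # [1, 32, 32, 3]
```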

[ ]:

def tensor_dims(tensor: aidge_core.Tensor) -> list[int]:
    dims = tensor.dims() if callable(tensor.dims) else tensor.dims
    return list(dims)


def tensor_dtype(tensor: aidge_core.Tensor):
    dtype = tensor.dtype() if callable(tensor.dtype) else tensor.dtype
    return dtype


def tensor_dformat(tensor: aidge_core.Tensor):
    dformat = tensor.dformat() if callable(tensor.dformat) else tensor.dformat
    return dformat


def metadata_tensor_from_tensor(
    tensor: aidge_core.Tensor, backend: str = "cpu"
) -> aidge_core.Tensor:
    owned_tensor = aidge_core.Tensor(dims=tensor_dims(tensor))
    owned_tensor.to_backend(backend)
    owned_tensor.to_dtype(tensor_dtype(tensor))
    owned_tensor.set_data_format(tensor_dformat(tensor))
    return owned_tensor


def set_graph_inputs(
    model: aidge_core.GraphView, tensors: list[aidge_core.Tensor]
) -> None:
    model.set_mandatory_inputs_first()
    mandatory_inputs = model.get_mandatory_inputs()
    if len(tensors) < len(mandatory_inputs):
        raise ValueError(
            f"Model expects {len(mandatory_inputs)} input tensor(s), "
            f"but only {len(tensors)} sample tensor(s) were provided."
        )

    for input_ref, tensor in zip(mandatory_inputs, tensors):
        input_ref.node.get_operator().set_input(
            input_ref.index,
            metadata_tensor_from_tensor(tensor),
        )


def prepare_deploy_input_tensors(
    tensors: list[aidge_core.Tensor],
    data_format: aidge_core.dformat = aidge_core.dformat.nhwc,
) -> list[aidge_core.Tensor]:
    deploy_tensors = []
    for tensor in tensors:
        deploy_tensor = metadata_tensor_from_tensor(tensor)
        deploy_tensor.to_dformat(data_format)
        deploy_tensors.append(deploy_tensor)
    return deploy_tensors


export_model = model.clone()
export_model.set_backend("cpu")
set_graph_inputs(export_model, [tensors[config.sample_index]])

wanted_dformat = getattr(aidge_core.dformat, config.dformat)
deploy_input_tensors = prepare_deploy_input_tensors(
    [tensors[config.sample_index]],
    data_format=wanted_dformat,
)

print(f"Export graph input dims: {tensor_dims(deploy_input_tensors[0])}")
print(f"Deployment data format: {config.dformat}")

12. Adapt the graph for the target#

This is where the magic happens. Aidge’s graph adapter will take the export graph and try to adapt it for the target, using the selected export libraries and heuristic. The adapter will explore different combinations of kernels and transformations, and select the best adapted graph according to the heuristic.

[ ]:

adapted_model = adapt_graph(
    export_model,
    export_libs=export_libs,
    heuristic=heuristic_func,
    wanted_in_dformat=wanted_dformat,
    constant_fold=True,
    args=config,
)

operator_type = model.get_ordered_outputs()[0][0].get_operator().type()
export_folder = f"{operator_type.lower()}_export_arm_inference"

print(f"Adapted graph ready. Export folder name: {export_folder}")

13. Optional graph visualization#

If config.show=True, this cell opens the original and adapted graphs in Model Explorer. This is useful for understanding how the original model was transformed for the target and for checking that the expected kernels and transformations were applied.

[ ]:

if config.show:
    try:
        import aidge_model_explorer
        from model_explorer import visualize_from_config

        model_explorer_config = aidge_model_explorer.config()
        model_explorer_config.add_graphview(model, "original_model")
        model_explorer_config.add_graphview(adapted_model, "adapted_model")
        visualize_from_config(model_explorer_config)
    except ImportError:
        aidge_core.Log.warn(
            "Graph visualization requires aidge_model_explorer and model_explorer."
        )
else:
    print("Graph visualization skipped. Set config.show=True to enable it.")

14. Prepare optional on-target validation#

Nice, we had a look at the adapted graph, and now we’re ready to deploy it on the target and run the benchmark. An important point to note is that, since some of the optimizations applied by the graph adapter can change the numerical behavior of the model, it’s a good idea to validate the adapted model directly on the target. This can help you catch any issues that may have been introduced during adaptation or quantization, such as changes in precision or differences in library implementations.

If you set config.validation_on_target=True, Aidge will send several labeled samples to the target and compute accuracy from the target predictions. This is slower than just running the latency benchmark, but it’s useful after changing precision, libraries, or graph adaptation.

[ ]:

def prepare_arm_validation_samples(arrays: list[np.ndarray]) -> list[np.ndarray]:
    validation_arrays = []

    for array in arrays:
        if array.ndim == 4 and array.shape[1] in (1, 3):
            validation_arrays.append(np.transpose(array, (0, 2, 3, 1)).copy())
        elif array.ndim == 3:
            validation_arrays.append(np.transpose(array, (0, 2, 1)).copy())
        else:
            validation_arrays.append(array.copy())

    return validation_arrays


validation_samples = None
validation_labels = None
nb_validation_iterations = 0

if config.validation_on_target:
    validation_samples = prepare_arm_validation_samples(
        benchmark_arrays[: config.nb_test]
    )
    validation_labels = labels[: config.nb_test]
    nb_validation_iterations = len(validation_samples)
    print(
        f"Prepared {nb_validation_iterations} validation samples for target execution."
    )
else:
    print("Target validation disabled. The target benchmark will measure latency only.")
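
The `(0, 2, 3, 1)` transpose in `prepare_arm_validation_samples` moves the channel axis last. The same reordering on a tiny nested list, without NumPy, makes the element movement visible (`nchw_to_nhwc` is an illustrative helper):

```python
def nchw_to_nhwc(batch: list) -> list:
    """Reorder a nested-list batch from [N][C][H][W] to [N][H][W][C]."""
    return [
        [
            [
                [sample[c][h][w] for c in range(len(sample))]
                for w in range(len(sample[0][0]))
            ]
            for h in range(len(sample[0]))
        ]
        for sample in batch
    ]


# One sample, two channels, 2x2 pixels.
batch = [[[[1, 2], [3, 4]], [[10, 20], [30, 40]]]]
print(nchw_to_nhwc(batch))  # [[[[1, 10], [2, 20]], [[3, 30], [4, 40]]]]
```

After the reordering, the values of all channels for one pixel sit next to each other, which is the layout selected by dformat="nhwc".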

15. Select the target execution backend#

Aidge supports several ways to execute the benchmark:

local: compile, flash, and capture from the current machine using a local UART connection, mostly for microcontrollers from the Cortex-M family
api: submit the job to a board-farm API (also for microcontrollers, but without the need of a local UART connection)
ssh_docker: connect to a Linux SBC and run the export inside Docker (it requires a minimum setup on the target, but works with a wide variety of Linux SBCs)
ssh_native: connect to a Linux SBC and build/run the export directly on the board, without Docker (only a compiler/Python toolchain needed on the target)

This cell creates the backend object.

[ ]:

target_backend = arm_benchmark.backend_from_args(config)
print(f"Target backend: {config.backend}")
print(target_backend)

15.1. BONUS: API and SSH backends#

I’m betting you’re wondering how the API and SSH backends work. The API backend lets you submit your benchmark job to a remote board farm, which takes care of flashing the target and capturing the results. This is particularly useful if you want to deploy your own benchmarking infrastructure, or if the board must remain available to other users as well. To learn more about setting up a board farm and using the API backend, see the Aidge documentation.

For the SSH backends, Aidge connects to a remote Linux SBC over SSH and runs the benchmark there. This is a great option if you have a Linux-based target and want to avoid the hassle of executing your code locally. There are two flavors:

ssh_docker: the export folder is sent to the board, then built and run inside a Docker image (sbc_docker_image). The Docker image pins the toolchain and dependencies, so the build is reproducible regardless of what is installed on the board — at the cost of needing a Docker daemon and a suitable image on the target.
ssh_native: the export folder is sent to the board, extracted, then built and run directly on the board with its native toolchain (make && ./bin/run). No Docker is required, so the setup is lighter (just a compiler and Python), but the build uses whatever toolchain/libraries are already installed on the board. Authentication can rely on your ssh-agent: if you leave ssh_key_path unset, the agent’s keys are used.

Pick ssh_docker when you want a reproducible, self-contained environment, and ssh_native when the board is already set up with the tools you need and you’d rather skip Docker. Both require a minimal setup on the target and a configured SSH connection. To learn more about setting up the SSH backends, see the documentation.
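
Before connecting to anything, it can help to check that the settings a backend relies on are filled in. The sketch below is a hypothetical pre-flight check: the required fields per backend are inferred from the example configurations in this tutorial, not taken from the Aidge source.

```python
# Assumed minimum settings per backend, inferred from the example
# configurations in this tutorial rather than from the Aidge source.
REQUIRED_FIELDS = {
    "local": [],
    "api": ["api_url", "api_board_name"],
    "ssh_docker": ["ssh_host", "sbc_docker_image"],
    "ssh_native": ["ssh_host"],
}


def missing_fields(backend: str, settings: dict) -> list[str]:
    """List the settings that are unset (None or empty) for a backend."""
    if backend not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown backend {backend!r}")
    return [name for name in REQUIRED_FIELDS[backend] if not settings.get(name)]


settings = {"ssh_host": "127.0.0.1", "sbc_docker_image": None}
print(missing_fields("ssh_docker", settings))  # ['sbc_docker_image']
print(missing_fields("ssh_native", settings))  # []
```

`ssh_key_path` is deliberately left out of the required fields, since the SSH agent's keys are used when it is unset.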

16. Deploy and run the target benchmark#

Drumroll please… This is the moment of truth. When you run the cell below, Aidge will take care of exporting the adapted graph, deploying it to the target, and running the benchmark. The results will be captured and displayed in the next cell.

Set RUN_TARGET_BENCHMARK = True when:

the board or board-farm settings are correct
the model and export settings are correct
you are ready to compile, flash, submit a remote job, or connect over SSH

[ ]:

RUN_TARGET_BENCHMARK = False

if RUN_TARGET_BENCHMARK:
    final_output, final_mem_peak = run_model_on_target(
        adapted_model,
        args=config,
        backend=target_backend,
        use_docker=not config.no_docker,
        input_tensors=deploy_input_tensors,
        target_s=config.board,
        export_folder_name=export_folder,
        adapt_for_target=False,
        profiling=config.profiling,
        validation_samples=validation_samples,
        validation_labels=validation_labels,
        nb_validation_iterations=nb_validation_iterations,
    )
    print(f"Final deployment done. Peak memory: {final_mem_peak} bytes")
else:
    final_output = None
    final_mem_peak = None
    print("Target benchmark not executed. Set RUN_TARGET_BENCHMARK=True to run it.")

17. Display the generated charts#

You made it! Congratulations! After target execution, the backend returns raw timings and optional profiling fields. Instead of printing all values in the notebook, this section exports the standard benchmark charts and displays the generated PNG files inline.

The chart folder still contains the full set of artifacts, including text and JSON summaries when the chart exporter generates them.
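If you want to inspect those JSON summaries from Python, a small helper along these lines could do it. This is a sketch under assumptions: load_json_summaries is a hypothetical helper, and the file names and contents in the demo are made up, since the exact files written by the chart exporter are not described here.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def load_json_summaries(charts_dir: Path) -> dict[str, Any]:
    """Return {file name: parsed content} for every JSON file in charts_dir."""
    summaries: dict[str, Any] = {}
    for json_file in sorted(charts_dir.glob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            summaries[json_file.name] = json.load(handle)
    return summaries


# Demo on a throwaway folder with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "summary.json").write_text(json.dumps({"runs": 3}), encoding="utf-8")
    (demo_dir / "latency.png").write_bytes(b"")  # ignored: not a JSON file
    demo = load_json_summaries(demo_dir)

print(demo)
```

Once the charts have been exported, you would point the helper at the real chart folder instead of the temporary one.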

[ ]:

from IPython.display import Image, Markdown, display


def display_chart_folder(charts_dir: Path) -> None:
    png_files = sorted(charts_dir.glob("*.png"))
    if not png_files:
        display(Markdown(f"No PNG charts found in `{charts_dir}`."))
        return

    display(Markdown(f"Charts exported to `{charts_dir}`"))
    for png_file in png_files:
        display(Markdown(f"### {png_file.stem.replace('_', ' ').title()}"))
        display(Image(filename=str(png_file)))

[ ]:

if final_output is None:
    display(Markdown("No target output to display yet. Run the deployment cell first."))
elif not isinstance(final_output, dict):
    raise RuntimeError(
        f"Expected benchmark output dictionary, got {type(final_output).__name__}."
    )
else:
    timings = final_output.get("timings", [])
    layer_timings = final_output.get("layer_timings")
    cycles = final_output.get("cycles")
    layer_cycles = final_output.get("layer_cycles")
    target_accuracy = final_output.get("accuracy")
    predictions = final_output.get("predictions")
    used_labels = final_output.get("labels")

    uart_output_file = Path(export_folder) / "uart_output.txt"
    if uart_output_file.is_file() and (
        not layer_timings or not cycles or not layer_cycles
    ):
        parsed_uart = parsing_log_output(uart_output_file)
        layer_timings = layer_timings or parsed_uart.get("layer_timings")
        cycles = cycles or parsed_uart.get("cycles")
        layer_cycles = layer_cycles or parsed_uart.get("layer_cycles")

    if not timings:
        raise RuntimeError("No target timings were captured from the benchmark run.")

    power_metrics = estimate_power_metrics(timings_ms=timings, board=config.board)
    libs_name = "_".join(config.libs) if config.libs else None

    if config.no_charts:
        display(Markdown("Chart export is disabled because `config.no_charts=True`."))
    else:
        charts_dir = export_benchmark_charts(
            output_root=config.charts_dir,
            board=config.board,
            dtype_name=config.dtype,
            timings=timings,
            cycles=cycles,
            layer_timings=layer_timings,
            layer_cycles=layer_cycles,
            validation_accuracy=target_accuracy,
            validation_predictions=predictions,
            validation_labels=used_labels,
            power_metrics=power_metrics,
            export_folder=export_folder,
            model_name=config.model_path,
            libs=libs_name,
        )
        display_chart_folder(Path(charts_dir))
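The charts are the main output, but a compact numeric summary of the raw timings can be handy too. Here is a minimal sketch, assuming timings is a flat list of per-inference durations in milliseconds (the notebook passes it as timings_ms, but the exact shape is an assumption). The helper summarize_timings is hypothetical and uses only the Python standard library.

```python
from __future__ import annotations

import statistics
from typing import Sequence


def summarize_timings(timings_ms: Sequence[float]) -> dict[str, float]:
    """Compute basic statistics over per-run timings given in milliseconds."""
    if not timings_ms:
        raise ValueError("timings_ms must contain at least one value")
    values = sorted(float(t) for t in timings_ms)
    return {
        "runs": float(len(values)),
        "min_ms": values[0],
        "median_ms": statistics.median(values),
        "mean_ms": statistics.fmean(values),
        "max_ms": values[-1],
        # The sample standard deviation needs at least two runs.
        "stdev_ms": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Made-up timings, only to show the call.
example_summary = summarize_timings([10.0, 12.0, 11.0, 13.0])
for name, value in example_summary.items():
    print(f"{name}: {value:.3f}")
```

With a real run, you would call summarize_timings(timings) after the deployment cell has produced a non-empty timings list.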

18. Conclusion#

Pretty cool, right? 😎 In this tutorial, we went through the complete process of benchmarking a model on a target with Aidge, from loading and preparing the model, to adapting it for the target, and finally deploying and running the benchmark. We also covered some optional steps such as graph visualization and on-target validation.