Understanding Aidge’s scheduling#
Aidge introduces a well-defined consumer-producer (C-P) model for operator implementations, similar to transaction-level modeling (TLM) in electronic design. The C-P model of an operator implementation specifies how much data is consumed and produced by the implementation at each execution step (i.e. at each forward pass). The C-P model can be specified either as precise amounts of data (number of elements) or as arbitrary data quantities (tokens). The C-P model execution path is decoupled from the data execution path, making it possible to statically schedule the graph execution without providing the actual operator implementations.
Aidge’s base scheduler uses this C-P model to statically schedule a graph before execution. Scheduling is always static in Aidge.
[1]:
# First import some utility methods used in the tutorial:
import sys, os
sys.path.append(os.path.abspath(os.path.join('..')))
import tuto_utils
To generate the static scheduling of a graph (here, for example, MobileNetv2), just do:
[2]:
import aidge_core
import aidge_onnx
import aidge_backend_cpu
aidge_model = aidge_onnx.load_onnx("../Learning/mobilenetv2-7.onnx", verbose=False)
aidge_model.set_backend("cpu")
# Create the Scheduler
scheduler = aidge_core.SequentialScheduler(aidge_model)
scheduler.generate_scheduling()
# Display static scheduling
scheduler.save_static_scheduling_diagram("scheduling")
tuto_utils.visualize_mmd("scheduling.mmd")
[NOTICE] - - mobilenetv20_features_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_batchnorm0_fwd (BatchNormalization)
[NOTICE] - - mobilenetv20_features_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck0_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck1_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck2_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck3_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck4_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck5_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck6_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck7_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck8_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck9_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck10_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck11_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck12_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck13_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck14_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck15_elemwise_add0 (Add)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_conv0_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_batchnorm0_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_relu0_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_batchnorm1_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_conv2_fwd (Conv)
[NOTICE] - - mobilenetv20_features_linearbottleneck16_batchnorm2_fwd
[NOTICE] (BatchNormalization)
[NOTICE] - - mobilenetv20_features_conv1_fwd (Conv)
[NOTICE] - - mobilenetv20_features_batchnorm1_fwd (BatchNormalization)
[NOTICE] - - mobilenetv20_features_relu1_fwd (Relu)
[NOTICE] - - mobilenetv20_features_pool0_fwd (GlobalAveragePool)
[NOTICE] - - mobilenetv20_output_pred_fwd (Conv)
[NOTICE] - - mobilenetv20_output_flatten0_reshape0 (Reshape)
Context: Consumer node (Conv2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (Conv2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (Pad2D#0) input #0
[WARNING] - No producer node attached to input#0 for node (Pad2D)
Context: Consumer node (ConvDepthWise2D#0) input #1
[WARNING] - No producer node attached to input#1 for node (ConvDepthWise2D)
The static scheduling is generated and displayed without any execution of the graph. Here we see that, except for Producers, the network operators’ execution order is strictly sequential.
To see a more interesting scheduling, one can try a simple LSTM network. Let’s first display the flattened LSTM graph we want to schedule:
[3]:
lstm = aidge_core.LSTM(in_channels=4, hidden_channels=8, seq_length=5)
# Flatten the graph:
lstm_model = aidge_core.get_connected_graph_view(lstm)
aidge_core.expand_metaops(lstm_model)
lstm_model.set_backend("cpu")
lstm_model.save("lstm_graph")
tuto_utils.visualize_mmd("lstm_graph.mmd")
Now let’s generate the static scheduling for this graph:
[4]:
# Create the Scheduler
lstm_scheduler = aidge_core.SequentialScheduler(lstm_model)
lstm_scheduler.generate_scheduling()
# Display static scheduling
lstm_scheduler.save_static_scheduling_diagram("lstm_scheduling")
tuto_utils.visualize_mmd("lstm_scheduling.mmd")
In this LSTM example, the graph is cyclic, so the scheduling directly depends on the seq_length
parameter. The generated static scheduling exhibits the early and late logical start for each operator. One can see that some operators have different early and late logical starts, meaning their execution can happen anytime between these logical steps. Operators at the same logical step are guaranteed to have no data dependency and may be executed in parallel. Conversely, operators with identical early and late logical starts are on the critical path of the scheduling.
Most operator implementations come with a default C-P model that is hybrid: when the input/output dimensions are known, it is element-based, and when dimensions are unknown, it is token-based. There is no fundamental difference between an element-based and a token-based C-P model as long as operators consume and produce their whole input/output tensors at once at each execution step: in this case, the consumed elements always match the produced elements anywhere in the graph, and any consumed or produced tensor can be considered a single token. This is implicitly how the forward pass works in most DL frameworks.
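The element/token equivalence can be illustrated with a toy check (plain Python, independent of the Aidge API): when an operator consumes and produces whole tensors at once, it becomes runnable at the same point whether one counts precise elements or one token per tensor.

```python
# Toy check: for whole-tensor consumption, counting elements or counting
# one token per tensor yields the same "runnable" decision.
def runnable(available, needed):
    """An operator is runnable iff its whole required input is available."""
    return available >= needed

# Element-based view: tensor size in number of elements.
elements_available = 16 * 32
elements_needed = 16 * 32
# Token-based view: each whole tensor counts as a single token.
tokens_available = 1
tokens_needed = 1

elem_runnable = runnable(elements_available, elements_needed)
tok_runnable = runnable(tokens_available, tokens_needed)
assert elem_runnable and tok_runnable  # both views agree
```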
When generate_scheduling()
is called on a graph without known dimensions, the scheduling will be entirely token-based.
However, some operators cannot be statically scheduled with unknown dimensions! This is the case for the Pop
operator: it extracts a sub-tensor along the first dimension of its input at each execution step. For example, with an input of shape [3, 16, 32], it will produce three [16, 32] tensors. If the input dimension is unknown, it is not possible to know how many tensors it must produce, hence how many times it must be scheduled. This is why Aidge provides the forward_dims()
method:
[5]:
try:
aidge_model.forward_dims()
except Exception as error:
print(error)
[ERROR] - Missing mandatory input#0 for node [mobilenetv20_features_conv0_fwd -
[ERROR] (PaddedConv2D)]
Here, it fails with an error for the MobileNetv2 model because no input is provided to the graph: the input dimension is unknown!
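To make the scheduling problem of the Pop operator described above concrete, here is a plain-Python sketch (not the Aidge API) of its behavior: the number of execution steps equals the first dimension of the input, so it cannot be determined while that dimension is unknown.

```python
# Plain-Python sketch of the Pop operator's behavior (not Aidge API):
# it emits one sub-tensor along the first axis per execution step.
def pop_steps(tensor):
    """Yield one sub-tensor per execution step; step count = dims[0]."""
    for sub in tensor:
        yield sub

# Input of "shape" [3, 2]: three execution steps, each producing a [2] tensor.
inp = [[1, 2], [3, 4], [5, 6]]
outputs = list(pop_steps(inp))
assert len(outputs) == 3   # the schedule length equals the first dimension
assert outputs[0] == [1, 2]
# With an unknown first dimension, the number of steps cannot be known
# statically, hence the need for forward_dims() before scheduling.
```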
When called without arguments, it is assumed that all the inputs of the graph have known dimensions (a tensor or a Producer
is connected to each input). It is also possible to specify a list of expected sizes for the graph inputs:
[6]:
aidge_model.forward_dims(dims=[[1, 3, 16, 16]])
[NOTICE] - Reshape_Op: ignoring non-empty Shape attribute because input#1 takes
[NOTICE] precedence
[WARNING] - Reshape_Op: unable to forwardDims() because output dims are data
[WARNING] dependent on input#1
[NOTICE] - Reshape_Op: ignoring non-empty Shape attribute because input#1 takes
[NOTICE] precedence
[WARNING] - Reshape_Op: unable to forwardDims() because output dims are data
[WARNING] dependent on input#1
[NOTICE] - Reshape_Op: ignoring non-empty Shape attribute because input#1 takes
[NOTICE] precedence
[WARNING] - Reshape_Op: unable to forwardDims() because output dims are data
[WARNING] dependent on input#1
[WARNING] - Unable to forward dimensions (circular dependency and/or wrong
[WARNING] dimensions and/or data dependent dimension?). Unable to compute
[WARNING] output dims for nodes ["mobilenetv20_output_flatten0_reshape0
[WARNING] (Reshape)"].
[6]:
False
In this case, the forward_dims()
method will automatically create the missing input tensors with the specified sizes, or check that the existing inputs have the right sizes.
For some operators, the output dimensions cannot be deduced from the input dimensions alone. This is typically the case when the output dimensions depend on some input data, rather than just dimensions, as for the Reshape
operator for example. When this happens, forward_dims()
will fail with an “Unable to forward dimensions” error and return False
. There is a workaround, however: if the required input data is known before model execution (for example, if the shape input of the
Reshape
operator is simply a Producer
), it is possible to force the evaluation of the required input data by setting the allow_data_dependency
flag to True:
[7]:
aidge_model.forward_dims(dims=[[1, 3, 16, 16]], allow_data_dependency=True)
[NOTICE] - Reshape_Op: ignoring non-empty Shape attribute because input#1 takes
[NOTICE] precedence
[NOTICE] - Reshape_Op: ignoring non-empty Shape attribute because input#1 takes
[NOTICE] precedence
[7]:
True
Beware that if some required data must first be computed, the result will be undefined, as the propagated dimensions will be invalid!
Note that when the model is executed, the operators’ output dimensions are automatically computed at runtime, without any need to call forward_dims()
beforehand.
To summarize, three scheduling modes are possible depending on the available C-P models:
If all dimensions are known for each operator in the graph, a fully data-based C-P scheduling can be performed, provided that every operator’s C-P model supports the data-based mode;
If only some dimensions are known, scheduling will be performed on the data-based C-P model until the first node requiring tokens must be scheduled. Further nodes will then be scheduled using their token-based C-P model, and any further node requiring a data-based C-P model will trigger an error;
If no dimension is known, only a fully token-based scheduling can be performed, provided that every operator’s C-P model allows the token-based mode.
Now, what if a graph cannot be scheduled just with the token-based C-P model (as with the Pop
operator) and the data-based C-P model cannot be used because some dimensions are not known statically (as with the Reshape
operator, assuming that allow_data_dependency
cannot be used because some required input data must be computed first)?
In this case, the graph cannot be statically scheduled… oh, but remember: scheduling is always static in Aidge! This means that you will have to eliminate the data dependency in your graph, by either: 1) pre-computing the inputs of the operator whose output dimensions are data-dependent, for example with the constant_folding()
recipe, if applicable; or 2) isolating the data-dependent path into a sub-graph, then scheduling and executing this sub-graph first, which makes the use of allow_data_dependency
possible.
Master the C-P model#
The scheduler’s objective is to produce data at each output node of the graph until there is no data left to consume, considering that Producers produce whole tensor data on demand.
The scheduling algorithm works as follows:
Initialize the consumers list: start from the output nodes and find the required prior producers/consumers at step 2;
From the current consumers list, check if any prior consumer node is needed. A prior will generally be required for any node consuming parameters (weights and biases) that is not an input node.
If, for a given node, only parent producers (at any depth) are needed to satisfy its required data, it becomes a prior.
If the prior node is a producer, it is added to the list of required producers.
If the prior node is of another type, it replaces the initial consumer in the new prior consumers list.
Prior consumers replace the initial consumers list. By construction, initial consumers will necessarily become consumers again later.
Make producers generate the required data. Producers are special nodes that generate data on demand.
Find runnable consumers. A consumer is runnable if the required data is available for all of its inputs. At this point, not all consumers are necessarily runnable, because some may depend on the execution of others (when there are multiple successive priors, for example).
Push runnable consumers into the list of nodes to run and update the consumer-producer system. At this point, simultaneously runnable consumers have no data dependency and could be run in parallel!
Update the consumers list:
If the current consumer still has data to consume (“still consumer”), it will be put back in the consumers list once the remaining consumers have been exhausted.
If the current consumer becomes a producer for other nodes, its children become consumers.
If there are no more consumers, swap with possible “still consumers” (from step 7). This ensures the “non-greedy” consumer behavior.
Iterate from step 2 until the consumers list is empty or there are no more runnable consumers.
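The core of this loop can be sketched with a much-simplified simulation (plain Python, hypothetical data structures, ignoring priors and partial consumption): producers generate whole tensors on demand, a consumer becomes runnable when all of its inputs hold data, and running it makes its output available to its children.

```python
# Much-simplified sketch of the scheduling loop (hypothetical structures):
# a tiny graph dataProvider/weights1 -> conv1 -> relu1.
graph = {
    "dataProvider": {"type": "Producer", "inputs": []},
    "weights1":     {"type": "Producer", "inputs": []},
    "conv1":        {"type": "Conv",     "inputs": ["dataProvider", "weights1"]},
    "relu1":        {"type": "ReLU",     "inputs": ["conv1"]},
}

available = set()   # nodes whose output data is currently available
schedule = []       # resulting static execution order

consumers = [n for n, a in graph.items() if a["type"] != "Producer"]
while consumers:
    # Step 3: make the required producers generate data on demand.
    for name, attrs in graph.items():
        if attrs["type"] == "Producer":
            available.add(name)
    # Step 4: find runnable consumers (all inputs available).
    runnable = [c for c in consumers
                if all(i in available for i in graph[c]["inputs"])]
    if not runnable:
        break       # step 10: no more runnable consumers
    # Steps 5-8: run them and make their outputs available to children.
    for c in runnable:
        schedule.append(c)
        available.add(c)
        consumers.remove(c)

assert schedule == ["conv1", "relu1"]
```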
Producers produce whole tensor data on demand#
Let’s create a simple model:
[8]:
model = aidge_core.sequential([
aidge_core.Producer([16, 3, 512, 512], name="dataProvider"),
aidge_core.Conv2D(3, 4, [5, 5], name="conv1"),
aidge_core.ReLU(name="relu1"),
aidge_core.PaddedConv2D(4, 8, [5, 5], name="conv2", stride_dims=[1, 1], padding_dims=[2, 2, 2, 2]),
aidge_core.ReLU(name="relu2"),
aidge_core.PaddedConv2D(8, 16, [3, 3], name="conv3", stride_dims=[1, 1], padding_dims=[2, 2, 2, 2], no_bias=True),
aidge_core.ReLU(name="relu3")
])
model.set_backend("cpu")
# Create the Scheduler
scheduler = aidge_core.SequentialScheduler(model)
scheduler.generate_scheduling()
# Display static scheduling
scheduler.save_static_scheduling_diagram("scheduling")
tuto_utils.visualize_mmd("scheduling.mmd")
With the objective of generating data at the output node (relu3
), the scheduling algorithm goes back to the first prior, which is conv1
; this triggers the production of a whole tensor by the Producer dataProvider
. From there, conv1
becomes the first consumer and produces its output, then conv2
becomes a producer, and so on. Once relu3
consumes its input tensor and produces an output tensor, the algorithm stops because there are no more consumers in the graph.
However, if at this point relu3
had not yet produced anything, that would mean, by construction, that there are still consumers somewhere in the graph. In that case, be aware that Producers keep providing data on demand: if at some point a node becomes a consumer that only requires new data at one of its Producer inputs, the required Producers will again produce a whole tensor.
Direct tensors produce whole data only once#
A direct tensor connection acts like a Producer that produces its whole tensor data only once.
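The difference can be modeled with token counts (plain-Python sketch, not the Aidge API): a Producer hands out a fresh whole-tensor token every time it is asked, while a direct tensor hands it out only once.

```python
# Toy token model of the two data sources (hypothetical classes):
class ProducerSource:
    def pull(self):
        return 1          # a whole tensor, every time it is asked

class DirectTensorSource:
    def __init__(self):
        self.given = False
    def pull(self):
        if self.given:
            return 0      # nothing more to give
        self.given = True
        return 1          # the whole tensor, only once

p, d = ProducerSource(), DirectTensorSource()
assert [p.pull(), p.pull()] == [1, 1]   # on-demand, repeatedly
assert [d.pull(), d.pull()] == [1, 0]   # whole data, only once
```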
Create a dataflow pipelining#
Here we create an example of a C-P model that consumes and produces data line by line.
[9]:
model = aidge_core.sequential([
aidge_core.Producer([1, 3, 16, 16], name="dataProvider"),
aidge_core.GenericOperator("Conv2D_DF", [aidge_core.InputCategory.Data, aidge_core.InputCategory.OptionalParam, aidge_core.InputCategory.OptionalParam], 1, name="conv1"),
aidge_core.GenericOperator("Conv2D_DF", [aidge_core.InputCategory.Data, aidge_core.InputCategory.OptionalParam, aidge_core.InputCategory.OptionalParam], 1, name="conv2"),
aidge_core.GenericOperator("Conv2D_DF", [aidge_core.InputCategory.Data, aidge_core.InputCategory.OptionalParam, aidge_core.InputCategory.OptionalParam], 1, name="conv3")
])
# Define a line-by-line dataflow C-P model for the Conv2D_DF operator.
class Conv2D_DataFlow_CP(aidge_core.ProdConso):
    def __init__(self, op: aidge_core.Operator):
        aidge_core.ProdConso.__init__(self, op, False)
        self.state_begin = True

    def get_nb_required_data(self, input_idx):
        input = self.get_operator().get_input(input_idx)
        if input:
            if self.state_begin:
                # Require 3 lines for the 3x3 kernel
                return aidge_core.Elts_t.data_elts(3 * input.dims()[2])
            else:
                return aidge_core.Elts_t.data_elts(input.dims()[2])
        else:
            return aidge_core.Elts_t.none_elts()

    def get_required_memory(self, output_idx, inputs_size):
        output = self.get_operator().get_output(output_idx)
        self.state_begin = False
        return aidge_core.Elts_t.data_elts(output.dims()[2])
Two methods must be overloaded to define a custom C-P model:
get_nb_required_data()
: defines the amount of data required at each input to allow the operator execution. By default, it matches the data consumed at the next operator execution;
get_required_memory()
: defines the amount of data that will be produced at the next operator execution.
The C-P model is stateful, meaning it can hold state variables (like self.state_begin
above) that change the amount of data consumed/produced at each execution step. Therefore, any operator behavior can be modeled without having to define the data path implementation, as long as the C-P model does not depend on input data values.
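The stateful behavior above can be traced without Aidge at all: a first step requires 3 input lines (for the 3x3 kernel), then every steady-state step requires 1 line and produces 1 line. Here is a plain-Python mock of that C-P model:

```python
# Plain-Python mock of the stateful line-by-line C-P model (no Aidge needed).
class LineByLineCP:
    def __init__(self):
        self.state_begin = True

    def nb_required_lines(self):
        # 3 lines are needed before the first output, then 1 per step.
        return 3 if self.state_begin else 1

    def produce(self):
        self.state_begin = False
        return 1  # one output line per execution step

cp = LineByLineCP()
trace = [(cp.nb_required_lines(), cp.produce()) for _ in range(4)]
assert trace == [(3, 1), (1, 1), (1, 1), (1, 1)]
```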
Now, let’s define an implementation for the generic operator that only holds the C-P model:
[ ]:
# Define an implementation for the Conv2D_DF operator that only contains the
# previously defined C-P model.
class GenericConv2D_DataFlow_Impl(aidge_core.OperatorImpl):
    def __init__(self, op: aidge_core.Operator):
        aidge_core.OperatorImpl.__init__(self, op, 'cpu')

    def get_prod_conso(self):
        return Conv2D_DataFlow_CP(self.get_operator())
The last step is to set the implementation of the generic operators defined in the graph to the one we just defined, and to provide a function that defines how to compute the operator’s output dimensions (thus enabling forward_dims()
and the element-based scheduling required by our element-based C-P model).
[ ]:
# Set the implementation and forward_dims for the Generic operators
for node in model.get_nodes():
    if node.type() == "Conv2D_DF":
        node.get_operator().set_impl(GenericConv2D_DataFlow_Impl(node.get_operator()))
        node.get_operator().set_forward_dims(lambda x: [x[0]])
Now we can actually schedule the model and get a pipelined dataflow static scheduling!
The save_factorized_static_scheduling_diagram()
function displays a compact form of the scheduling where repetitive sequences have been factorized. The number of repetitions of each sequence is shown to the left of the sequence (if none is shown, the sequence is not repeated).
[10]:
model.forward_dims()
# Create the Scheduler
scheduler = aidge_core.SequentialScheduler(model)
scheduler.generate_scheduling()
# Display static scheduling
scheduler.save_factorized_static_scheduling_diagram("scheduling")
tuto_utils.visualize_mmd("scheduling.mmd")