Extracting information from a CIF specification

A CIF specification models the behavior of a system. It contains a lot of information, and some of that information may be useful outside the CIF model as well. That raises the question of how to extract such information from the model.

A very important property here is reliability. The approach used should always return the information that the specification contains, and not, for example, silently skip some parts. Also, it should not return information that is not part of the specification.

This is a very hands-on practical guide, with programming examples. For these examples, the Python programming language is used. For further information, links to standard Python libraries are provided, where appropriate. The considerations and approaches described here are however generally applicable to most programming languages. The Python libraries used in this document also exist for most other popular programming languages.

Recognizing text of a CIF specification

A common approach is to see a specification as a sequence of text lines, and use text processing techniques to extract the information. In this approach, the text is searched for relevant key fragments. The needed information is then collected from the found fragments. The technique is quite easy to do, and for text that has a fixed form it works quite well.

Unfortunately, the text of a CIF specification does not have a fixed form. For instance, you can break lines at any point, causing a key fragment to get split across several text lines. Empty lines can be inserted as well, causing a key fragment to have empty lines in its text. Furthermore, each line can end with a comment, causing a key fragment to have pieces of comments in its text.

Comments can also contain the text of a key fragment, but text in a comment is not an active part of the specification. Similarly, a specification may have string constants with text that looks like a key fragment, but that text is not part of the specification either. If the key fragment contains a keyword, names of variables or events in the specification may look like that keyword. For example, event $event declares an event named event. Writing something this confusing should probably not be done in practice, but it is allowed in CIF and thus a search on key fragments may run into such cases.

Depending on how the search is done, you may get false positives (text is matched that shouldn’t be matched) and/or false negatives (text that should be matched is not found). In all, these complications make the common approach of searching for key fragments quite unreliable.
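Both failure modes are easy to reproduce. The sketch below applies a naive line-based search to a small fragment. The fragment and the regular expression are made up for this illustration; they are not taken from any CIF tool.

```python
import re

# Hypothetical CIF fragment: one commented-out event and two real events.
cif_text = """\
// event commented_out;
event
    split_over_lines;
event on_one_line;
"""

# Naive key fragment: the word "event", a name, and a semicolon on one line.
pattern = re.compile(r"event\s+(\w+)\s*;")

found = []
for line in cif_text.splitlines():
    match = pattern.search(line)
    if match:
        found.append(match.group(1))

print(found)
```

The search reports commented_out, which is not an event of the specification, and it misses split_over_lines, which is one.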

To deal with the complexity of the text of a CIF specification, a parser should be used instead. A parser does not look for fragments, but instead reads all text. It also knows exactly how to handle everything that can be written in a CIF specification. For this reason it cannot be fooled by splitting or inserting lines, or by writing strange comments, text strings or event names.

As a parser needs more knowledge, creating one is more work. If you want to venture into this area, have a look at a list of parser generator tools for Python. For the CIF language however, a parser already exists. It is used by all CIF tools.

Different CIF tools have different needs for information from a CIF specification. Saving only a part of the information of a CIF specification would make the parser useless for tools that need information beyond the saved part. To avoid that, the parser saves all information of the CIF specification, as a tree of objects. Each CIF tool selects the particular information from the tree that it needs.

A file with such a tree can be obtained by using the .cifx extension for an output file of a CIF tool, instead of the usual .cif extension. The CIF reference manual explains it in the CIF XML files section.

The reverse also works: all CIF tools accept files with a .cifx extension. In addition, other programs can read such a file and search in the object-tree for relevant information, without all the complications that exist when searching in CIF specification text.

The next section discusses the object tree in some more detail. Where to find all details of the object tree is explained in the CIF meta-model section. In the Getting relevant information from the object-tree section, searching the tree for relevant information is discussed.

Structure of the CIF object-tree

A loaded CIF file is internally represented as a tree of objects. The objects in the tree follow the structure of a CIF specification. The tree starts with a Specification object. That object may have Group and Automaton objects (and also other CIF elements, such as declarations and requirement invariants). Each Group object can in turn have more Group objects, and so on. Each Automaton object has Location objects, and each location has Edge objects that each represent a CIF edge. This continues all the way down to, for example, a not operator or an integer value 12 in an Expression in an update of an edge.

All CIF objects that are defined at one place and used elsewhere, such as variables, event declarations, internal user-defined functions (and many others), cross-link from the use back to their definition. These are direct links that do not follow the tree hierarchy (unlike in CIF text files, where the path from use to definition must be stated). The cross-links make it easy to get the definition from its use. They also make it possible to find all uses of a definition without getting confused about uses of a second definition with the same (local) name.

As an example of how to write a CIF file as an object-tree, and what can be found in a CIF object-tree, consider the following CIF specification:

// example.cif

group G1:
end

group G2:
  group H:
    @doc("Controllable event")
    controllable c_event;
  end

  automaton A:
    location:
      initial;
      edge H.c_event;
  end
end

To convert this text to a CIF object-tree, a CIF tool must write this specification to a file. The simplest way to do that is to tell the CIF tool that produces this file to write it as an object-tree instead of CIF text by using the .cifx extension for its output file. If the CIF specification is already stored, the CIF to CIF transformer can be used, for example with a ToolDef script like:

from "lib:cif" import *;

string input_file = "example.cif";
string output_file = "example.cifx"; // <-- Note the ".cifx" here!
cif2cif(
  input_file,
  "--transformations=elim-comp-def-inst,remove-pos-info",
  "--output=" + output_file,
);

This script expands all component definitions to their instances, to make the result easier to process. In addition, it removes position information (the line and column numbers of all CIF objects). Position information is generally not needed, and removing it avoids a lot of clutter in the output, which is useful if the result is manually inspected. To create an object-tree file without doing any transformation, remove the --transformations option.

Use of the .cifx extension causes the CIF file writer to write the CIF object-tree in XMI format, instead of converting it to normal CIF text. XMI is a form of XML, designed to exchange model files (such as CIF models) with xmi:id links between element definitions and their uses. The resulting (plain text) file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<cif:Specification ...>
  <components xmi:type="cif:Group" xmi:id="2" name="G1"/>
  <components xmi:type="cif:Group" xmi:id="3" name="G2">
    <components xmi:type="cif:Group" xmi:id="4" name="H">
      <declarations xmi:type="declarations:Event" xmi:id="5" name="c_event" controllable="true">
        <annotations xmi:type="annotations:Annotation" xmi:id="6" name="doc">
          <arguments xmi:type="annotations:AnnotationArgument" xmi:id="7">
            <value xmi:type="expressions:StringExpression" xmi:id="8" value="Controllable event">
              <type xmi:type="types:StringType" xmi:id="9"/>
            </value>
          </arguments>
        </annotations>
      </declarations>
    </components>
    <components xmi:type="automata:Automaton" xmi:id="10" name="A">
      <locations xmi:type="automata:Location" xmi:id="11">
        <initials xmi:type="expressions:BoolExpression" xmi:id="12" value="true">
          <type xmi:type="types:BoolType" xmi:id="13"/>
        </initials>
        <edges xmi:type="automata:Edge" xmi:id="14">
          <events xmi:type="automata:EdgeEvent" xmi:id="15">
            <event xmi:type="expressions:EventExpression" xmi:id="16" event="5">
              <type xmi:type="types:BoolType" xmi:id="17"/>
            </event>
          </events>
        </edges>
      </locations>
    </components>
  </components>
</cif:Specification>

For brevity, a long list of XML namespace declarations on the second line is omitted above. For working with the file however, they are needed.
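Why the declarations are needed can be checked with Python's standard xml.etree.ElementTree module. The sketch below parses the same element with and without a declaration of the xmi prefix. The one-element strings are trimmed-down illustrations, not complete .cifx files.

```python
import xml.etree.ElementTree as ET

# The same element, with and without the declaration of the 'xmi' prefix.
with_decl = '<components xmlns:xmi="http://www.omg.org/XMI" xmi:id="2" name="G1"/>'
without_decl = '<components xmi:id="2" name="G1"/>'

elem = ET.fromstring(with_decl)
print(elem.get('name'))

try:
    ET.fromstring(without_decl)
    parsed = True
except ET.ParseError as err:
    # The XML parser rejects a prefix that has not been declared.
    parsed = False
    print(f"Cannot parse: {err}")
```

The first string is parsed normally, while the second one is rejected by the XML parser before any CIF information can be read.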

When comparing the entries in this file with the original CIF specification, it is easy to see how the structure of the CIF specification is reflected in the XMI file. The Specification element has two Group elements named G1 and G2, just like in the CIF file. In the second group, there is another Group element named H. The latter group element contains an Event named c_event, which in turn has an Annotation named doc. Automaton A is the second element in G2. It has a Location with true initials and an Edge. The edge has one EdgeEvent since it has only one event. The event itself is then stored in an EventExpression with an event="5" cross-link that corresponds with xmi:id="5" of the c_event declaration earlier in the file.
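Following such a cross-link by program comes down to a lookup on xmi:id. The sketch below is a minimal illustration using Python's standard ElementTree module on a simplified version of the file above. It assumes that every cross-link holds a single xmi:id value, as in this example. The URI of the cif prefix is a placeholder, since the real declarations are omitted above.

```python
import xml.etree.ElementTree as ET

XMI_ID = '{http://www.omg.org/XMI}id'

# Simplified sample. The 'cif' namespace URI is a placeholder, not the real one.
sample = """\
<cif:Specification xmlns:xmi="http://www.omg.org/XMI"
                   xmlns:cif="urn:placeholder:cif">
  <components xmi:type="cif:Group" xmi:id="4" name="H">
    <declarations xmi:type="declarations:Event" xmi:id="5" name="c_event"/>
  </components>
  <components xmi:type="automata:Automaton" xmi:id="10" name="A">
    <locations xmi:type="automata:Location" xmi:id="11">
      <edges xmi:type="automata:Edge" xmi:id="14">
        <events xmi:type="automata:EdgeEvent" xmi:id="15">
          <event xmi:type="expressions:EventExpression" xmi:id="16" event="5"/>
        </events>
      </edges>
    </locations>
  </components>
</cif:Specification>
"""
root = ET.fromstring(sample)

# Index every node that has an xmi:id, so a cross-link can be followed directly.
by_id = {node.get(XMI_ID): node for node in root.iter() if node.get(XMI_ID)}

# From use to definition: follow the 'event' attribute of each 'event' node.
for use in root.iter('event'):
    definition = by_id[use.get('event')]
    print(f"Edge event refers to: {definition.get('name')}")

# From definition to uses: collect the nodes whose 'event' attribute is "5".
uses_of_5 = [node.get(XMI_ID) for node in root.iter() if node.get('event') == '5']
print(uses_of_5)
```

Because the lookup uses the identifier rather than the name, a second event named c_event in another group would have a different xmi:id and would not be confused with this one.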

For larger CIF specifications, the number of XMI nodes in the output file grows quickly. Each node however carries the same kind of information as shown above.

Doing a few experiments like the one above helps in getting an intuition for what is stored in an object-tree. The full CIF language however has many classes, and your experiments may not cover all possible object-trees. In addition, in some cases the meaning of a class or a data field may not be clear. To better understand the object-trees, consult the extensive documentation that covers all details. The CIF meta-model section explains where to find the documentation.

Manually tracking all the nodes and connections in an XMI file is tedious work. The obvious next step is thus to have a computer do this for us. That is discussed next.

Getting relevant information from the object-tree

As discussed in the Structure of the CIF object-tree section, it is possible to obtain a .cifx file with a CIF object-tree that contains all information of the CIF model. The next step is to load the .cifx XML file into Python, and select the desired information from it.

As XMI is a form of XML, you can load an XMI file using an XML library. For Python, the recommended way for loading XML files is to use the ElementTree module.

Loading the .cifx file with ElementTree is as simple as:

import xml.etree.ElementTree as ET

# Define the 'xmi' namespace, needed for finding nodes in the tree.
namespaces = {
    'xmi': 'http://www.omg.org/XMI',
}

# Load the XMI file.
doc = ET.parse('example.cifx')

All information from the CIF file is stored in the loaded tree. The next step to get the desired information from the file is to find the nodes in the tree that contain the desired information for your application, and to extract the required data from them. For example, you could extract the names of automata locations, the texts of @doc annotations, or the types of discrete variables.

For very small trees, finding nodes in the loaded document can be done ‘manually’ by starting from the root of the tree. Each node is inspected, and depending on the result, nodes deeper in the tree can be considered. Eventually, a node with the desired information may be found. In that case, the relevant information is extracted from it and the search continues for more information.
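Such a manual search could look like the sketch below, which collects the names of all automata by walking down from the root. The function name automaton_names is made up for this illustration, and the sample is a simplified version of the example file with a placeholder URI for the cif prefix.

```python
import xml.etree.ElementTree as ET

XMI_TYPE = '{http://www.omg.org/XMI}type'

# Simplified sample. The 'cif' namespace URI is a placeholder, not the real one.
sample = """\
<cif:Specification xmlns:xmi="http://www.omg.org/XMI"
                   xmlns:cif="urn:placeholder:cif">
  <components xmi:type="cif:Group" xmi:id="2" name="G1"/>
  <components xmi:type="cif:Group" xmi:id="3" name="G2">
    <components xmi:type="cif:Group" xmi:id="4" name="H"/>
    <components xmi:type="automata:Automaton" xmi:id="10" name="A">
      <locations xmi:type="automata:Location" xmi:id="11"/>
    </components>
  </components>
</cif:Specification>
"""
root = ET.fromstring(sample)


def automaton_names(node):
    """Collect the names of all automata below 'node', descending into groups."""
    names = []
    for child in node:
        if child.tag != 'components':
            continue  # Not a component, so it cannot be or contain an automaton.
        if child.get(XMI_TYPE) == 'automata:Automaton':
            names.append(child.get('name'))
        else:
            names.extend(automaton_names(child))
    return names


names = automaton_names(root)
print(names)
```

Even this small search needs a recursive function and explicit knowledge of which nodes can contain which other nodes. Each additional kind of node to search for needs similar code.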

For the CIF language, with hundreds of different nodes and often a very large tree, that approach would need a lot of Python code, which takes a lot of effort to write and test for correctness. As large XML files are commonly used, the XML community invented the XPath language to easily and efficiently find relevant parts of XML trees. The Python ElementTree module also supports that, as described in the XPath support section of the ElementTree manual.

XPath takes as input a description of a path to the desired nodes. It then performs the search, and selects and returns the nodes that match the description. The found nodes can then be queried for the relevant information, which can then be used in the application.

XPath finds the nodes expressed in the given path by having a set of selected nodes, and updating that set as it processes each path element. When the entire path is processed, the final set of selected nodes is returned as the result of the search. The Supported XPath syntax table describes the details of how each supported path element updates the selected nodes.

As an example, consider the .//components[@xmi:type="automata:Automaton"] path:

The . path element selects the current node (at the root of the tree). For the CIF node tree, that is the Specification node.
The // path element selects all nodes below the previous selection (so, its children, the children of its children, and so on). All nodes in the tree are selected now, except the root node.
The components path element selects all nodes from the previous selection that can be directly reached by a components tag. For the CIF node tree, all Group nodes and all Automaton nodes are selected. If you have component instantiations (concrete instances of group definitions or automaton definitions), they will be selected as well.
The [@xmi:type="automata:Automaton"] path element restricts the selection to the nodes that have an XMI type attribute with the value automata:Automaton. For the CIF node tree, the selection now only contains all Automaton nodes.
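To make the steps above concrete, the sketch below performs each of them by hand with plain Python and compares the outcome with the result of the XPath search. It only illustrates how the selection narrows down, not how ElementTree implements the search internally. The sample is a simplified version of the example file, with a placeholder URI for the cif prefix.

```python
import xml.etree.ElementTree as ET

namespaces = {'xmi': 'http://www.omg.org/XMI'}
XMI_TYPE = '{http://www.omg.org/XMI}type'

# Simplified sample. The 'cif' namespace URI is a placeholder, not the real one.
sample = """\
<cif:Specification xmlns:xmi="http://www.omg.org/XMI"
                   xmlns:cif="urn:placeholder:cif">
  <components xmi:type="cif:Group" xmi:id="2" name="G1"/>
  <components xmi:type="cif:Group" xmi:id="3" name="G2">
    <components xmi:type="cif:Group" xmi:id="4" name="H"/>
    <components xmi:type="automata:Automaton" xmi:id="10" name="A">
      <locations xmi:type="automata:Location" xmi:id="11"/>
    </components>
  </components>
</cif:Specification>
"""
root = ET.fromstring(sample)

# '.' and '//': start at the root and select all nodes below it.
below = [node for node in root.iter() if node is not root]

# 'components': keep the nodes that are reached by a 'components' tag.
components = [node for node in below if node.tag == 'components']

# '[@xmi:type="automata:Automaton"]': keep the nodes with that type attribute.
automata = [node for node in components
            if node.get(XMI_TYPE) == 'automata:Automaton']

print(len(below), len(components), len(automata))

# The XPath search selects the same nodes in a single call.
by_xpath = root.findall('.//components[@xmi:type="automata:Automaton"]', namespaces)
print(by_xpath == automata)
```

For this sample, the selection shrinks from 5 nodes to 4 components, and then to the single automaton A.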

More path elements can be added, making it possible to find very specific nodes.

Performing the XPath selections in Python takes only a handful of lines. Below are a few queries to get started.

Look at the stated path and the nodes in the loaded .cifx file, and compare against the printed results of the Python script. Also check against the CIF classes in the meta-model. Last but not least, try to modify the CIF file or the XPath search and check whether it works as expected.

Find all nodes in a components list:

import xml.etree.ElementTree as ET

# Load the XMI file.
doc = ET.parse('example.cifx')

# Select every node that is stored in a 'components' list.
for elem in doc.findall('.//components'):
    print(f"Component {elem.get('name')}")

The query produces:

Component G1
Component G2
Component H
Component A

Find all nodes in a components list, and restrict it to those components that are an automata:Automaton CIF object:

import xml.etree.ElementTree as ET

# The 'xmi' namespace is needed for the 'xmi:type' attribute in the path.
namespaces = {'xmi': 'http://www.omg.org/XMI'}
doc = ET.parse('example.cifx')

for elem in doc.findall('.//components[@xmi:type="automata:Automaton"]', namespaces):
    print(f"Automaton: {elem.get('name')}")

The query produces:

Automaton: A

Find all nodes in a declarations list, then restrict them to declarations:Event types, then select all nodes in them that are in annotations lists, then restrict to doc names, and finally select the parent (via ..) of the matched annotation node to get event declarations as result:

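import xml.etree.ElementTree as ET

# The 'xmi' namespace is needed for the 'xmi:type' attribute in the path.
namespaces = {'xmi': 'http://www.omg.org/XMI'}
doc = ET.parse('example.cifx')

path_spec = ('.//declarations'                    # All declarations,
           + '[@xmi:type="declarations:Event"]'   # that are events,
           + '/annotations'                       # their annotations,
           + '[@name="doc"]'                      # that are named 'doc',
           + '/..')                               # and back to the event.
for elem in doc.findall(path_spec, namespaces):
    print(f"Event with @doc annotation: {elem.get('name')}")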

The query produces:

Event with @doc annotation: c_event

Further extraction of specific information from the selected nodes, such as the elem.get('...') calls above, is explained in the Tutorial section of the ElementTree module documentation.
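As a small illustration of such further extraction, the sketch below collects the texts of the @doc annotations of each event. A query on a selected node is relative to that node, so the second findall call only looks below one event. The sample is a simplified version of the example file, with a placeholder URI for the cif prefix.

```python
import xml.etree.ElementTree as ET

namespaces = {'xmi': 'http://www.omg.org/XMI'}

# Simplified sample. The 'cif' namespace URI is a placeholder, not the real one.
sample = """\
<cif:Specification xmlns:xmi="http://www.omg.org/XMI"
                   xmlns:cif="urn:placeholder:cif">
  <components xmi:type="cif:Group" xmi:id="4" name="H">
    <declarations xmi:type="declarations:Event" xmi:id="5" name="c_event" controllable="true">
      <annotations xmi:type="annotations:Annotation" xmi:id="6" name="doc">
        <arguments xmi:type="annotations:AnnotationArgument" xmi:id="7">
          <value xmi:type="expressions:StringExpression" xmi:id="8" value="Controllable event"/>
        </arguments>
      </annotations>
    </declarations>
  </components>
</cif:Specification>
"""
root = ET.fromstring(sample)

doc_texts = {}
for event in root.findall('.//declarations[@xmi:type="declarations:Event"]', namespaces):
    # Search relative to the event: its 'doc' annotations, their argument values.
    values = event.findall('annotations[@name="doc"]/arguments/value')
    doc_texts[event.get('name')] = [value.get('value') for value in values]

    # Attribute values are always strings, also for booleans and numbers.
    is_controllable = event.get('controllable') == 'true'
    print(event.get('name'), is_controllable)

print(doc_texts)
```

Note that get returns None for an attribute that is not present, so an explicit comparison like the one for controllable is safer than relying on the truth value of the returned string.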

For constructing new queries, start by writing an example CIF file that is to be queried, check the CIF meta-model information about the tree that can be expected, and create an XPath query expression in an incremental way, rather than trying to construct it completely in one attempt.
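One way to work incrementally is to extend the path one element at a time and look at the number of selected nodes after each extension. The sketch below does that for the @doc query shown earlier. The sample file content is made up for this illustration: the events u_event and e_other and the annotation named note do not come from the example specification, and the URI of the cif prefix is a placeholder.

```python
import xml.etree.ElementTree as ET

namespaces = {'xmi': 'http://www.omg.org/XMI'}

# Hypothetical sample with three events, of which only one has a 'doc' annotation.
sample = """\
<cif:Specification xmlns:xmi="http://www.omg.org/XMI"
                   xmlns:cif="urn:placeholder:cif">
  <declarations xmi:type="declarations:Event" xmi:id="2" name="c_event">
    <annotations xmi:type="annotations:Annotation" xmi:id="3" name="doc"/>
  </declarations>
  <declarations xmi:type="declarations:Event" xmi:id="4" name="u_event"/>
  <declarations xmi:type="declarations:Event" xmi:id="5" name="e_other">
    <annotations xmi:type="annotations:Annotation" xmi:id="6" name="note"/>
  </declarations>
</cif:Specification>
"""
root = ET.fromstring(sample)

steps = ['.//declarations',
         '[@xmi:type="declarations:Event"]',
         '/annotations',
         '[@name="doc"]',
         '/..']

# Extend the path step by step, and count the selected nodes after each step.
counts = []
path = ''
for step in steps:
    path += step
    selected = root.findall(path, namespaces)
    counts.append(len(selected))
    print(f"{len(selected)} node(s) for {path}")
```

A count that drops to zero, or that does not drop where it was expected to, points directly at the path element that needs another look.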

Advanced access or modification of a CIF specification

In the previous section it was demonstrated how to extract information from a CIF file in a reliable and efficient way by using an XML library. For many uses that is sufficient.

However, modifying CIF objects or creating entirely new parts in CIF models can be more involved when using an XML library. For example, it requires several lines of Python code to create a new Automaton object with a number of locations, and add it to the model.
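To give an impression of the work involved, the sketch below adds an automaton with two named locations using ElementTree. The helper add_automaton is hypothetical, and so is its way of choosing new identifiers (it assumes that all xmi:id values are integers). The result has not been checked against the restrictions of the CIF meta-model. For example, no location is marked as initial.

```python
import xml.etree.ElementTree as ET

XMI_NS = 'http://www.omg.org/XMI'
XMI_ID = f'{{{XMI_NS}}}id'
XMI_TYPE = f'{{{XMI_NS}}}type'

# Simplified sample. The 'cif' namespace URI is a placeholder, not the real one.
sample = """\
<cif:Specification xmlns:xmi="http://www.omg.org/XMI"
                   xmlns:cif="urn:placeholder:cif">
  <components xmi:type="cif:Group" xmi:id="2" name="G1"/>
</cif:Specification>
"""
root = ET.fromstring(sample)


def add_automaton(root, parent, name, location_names):
    """Hypothetical helper: append an automaton with named locations to 'parent'."""
    # Assumption: identifiers are integers, so the next free one is max + 1.
    used = [int(node.get(XMI_ID)) for node in root.iter() if node.get(XMI_ID)]
    next_id = max(used, default=0) + 1

    automaton = ET.SubElement(parent, 'components', {
        XMI_TYPE: 'automata:Automaton', XMI_ID: str(next_id), 'name': name})
    for offset, loc_name in enumerate(location_names, start=1):
        ET.SubElement(automaton, 'locations', {
            XMI_TYPE: 'automata:Location', XMI_ID: str(next_id + offset),
            'name': loc_name})
    return automaton


new_aut = add_automaton(root, root, 'B', ['Idle', 'Busy'])
print([comp.get('name') for comp in root.findall('components')])
print([loc.get('name') for loc in new_aut.findall('locations')])
```

All bookkeeping, such as unique identifiers, type attributes and the names of the containing lists, is the responsibility of the script here.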

In such cases, it may be easier to use an Ecore library instead of an XML library. An Ecore library allows you to more easily create objects or manipulate them. The ESCET project uses the Eclipse Modeling Framework (EMF) as Ecore library to work with CIF models. The definition of the CIF objects is available in the cif/org.eclipse.escet.cif.metamodel/model/cif.ecore file. The EMF classes are available as well. Using the Java code of the ESCET project to modify CIF models is therefore an option. Other languages may also have Ecore support.

In all cases, the resulting object-tree must comply with the restrictions defined in the CIF meta-model section. Failure to do so may result in undefined behavior by tools that load the resulting tree. The ESCET project performs some validation when loading .cifx files.

The CIF meta-model

This section provides a global overview of everything that may be found in a CIF object-tree. At first sight it may seem overwhelming, especially when trying to remember all information. As the information is not going anywhere, the suggestion is to browse through it for some time to understand what kind of information is available. When you have detailed questions about something specific in a CIF object-tree, return here and look into that particular part in more detail.

The CIF meta-model is kept in the cif/org.eclipse.escet.cif.metamodel folder in the ESCET Git repository. It is a set of 10 packages. Each package covers a part of the CIF language and contains multiple classes.

To understand what classes exist in the CIF meta-model and how they relate to each other, all classes have been drawn in a UML class diagram. There is one class diagram for each package of the CIF meta-model. The class diagrams are also available as .png files. The name of a file corresponds with the name of the package in the CIF meta-model that it depicts:

cif/org.eclipse.escet.cif.metamodel/model/images/cif.png: Structure of the overall specification and components.
cif/org.eclipse.escet.cif.metamodel/model/images/automata.png: Structure of an automaton.
cif/org.eclipse.escet.cif.metamodel/model/images/declarations.png: Declarations of variables, types, events, and so on.
cif/org.eclipse.escet.cif.metamodel/model/images/functions.png: Internal and external user-defined functions.
cif/org.eclipse.escet.cif.metamodel/model/images/expressions.png: Expressions, from literal true to if expressions.
cif/org.eclipse.escet.cif.metamodel/model/images/types.png: Data types, from booleans to dictionaries and tuples.
cif/org.eclipse.escet.cif.metamodel/model/images/print.png: Print declarations.
cif/org.eclipse.escet.cif.metamodel/model/images/cifsvg.png: CIF/SVG declarations.
cif/org.eclipse.escet.cif.metamodel/model/images/annotations.png: Annotations.
common/org.eclipse.escet.common.position.metamodel/model/position.png: Position information (lines and columns in a CIF file).

The diagrams make extensive use of inheritance, containment, and association. If these concepts are not familiar to you, it may be a good idea to first understand them by reading about UML class diagrams or object-oriented programming.

The diagrams are a good way to get an understanding of how classes relate to each other, but they lack a complete description of the meaning of each field of each class. Those descriptions are available in the cif/org.eclipse.escet.cif.metamodel/docs/cif_ecore_details.pdf file. It contains all details about all CIF language constructs. It is not something to read from the first to the last page, but it should provide an answer to any detailed technical question about CIF objects and the CIF language.