Abstract Syntax Tree - PHP Development Tools

Summary

The Abstract Syntax Tree (AST) is the base framework for many powerful tools of the Eclipse IDE, including Semantic highlighting, Refactoring, Quick Fix and Quick Assist. The Abstract Syntax Tree maps plain PHP source code in a tree form. This tree is more convenient and reliable to analyze and modify programmatically than text-based source. This part of the article shows how you can use the Abstract Syntax Tree for extending Eclipse PHP Development Tools (PDT) for your applications. This article is based on the "Abstract Syntax Tree" (JDT) By Thomas Kuhn and Olivier Thomann.

May, 2008

Introduction

The AST is comparable to the DOM tree model of an XML file. Just like with DOM, the AST allows you to modify the tree model and reflects these modifications in the PHP source code.

This part of the article refers to an example application which covers most of the interesting AST-related topics. Let us have a look at the application that was built to illustrate this article:

Example Application

According to PHP Practices [4], you should not declare local variables before using them. The goal of our application will be to detect contradicting variable declarations and to move them to their correct place. There are three cases our application has to deal with:

Removal of unnecessary declaration. If a variable is declared and initialized, only to be overridden by another assignment later on, the first declaration of the variable is an unnecessary declaration.
Move of declaration. If a variable is declared, and not immediately referenced within the following statement, this variable declaration has to be moved. The correct place for the declaration is the line before it is first referenced.
Move of declaration of a variable that is referred to from within different blocks. This is a subcase of case 2. Imagine that a variable is used in both a try- and a catch clause. Here the declaration cannot be moved right before the first reference in the try-clause, since then it would not be declared in the catch-clause. Our application has to deal with that and has to move the declaration to the best possible place, which would be here one line above the try-clause.

In Appendix A, Code Fragments for Example Application Cases code snippets to each of these cases are provided.

You can import the example application into your workspace [1] or install the plug-in using the Eclipse Update Manager [2].

Workflow

A typical workflow of an application using AST looks like this:

Figure 1. AST Workflow

PHP source: To start off, you provide some source code to parse. This source code can be supplied as a PHP file in your project or directly as a char[] that contains PHP source
Parse: The source code described at 1 is parsed. All you need for this step is provided by the class org.eclipse.jdt.core.dom.ASTParser. See the section called “Parsing source code”.
The Abstract Syntax Tree is the result of step 2. It is a tree model that entirely represents the source you provided in step 1. If requested, the parser also computes and includes additional symbol resolved information called "bindings".
Manipulating the AST: If the AST of point 3 needs to be changed, this can be done in two ways:
1. By directly modifying the AST.
2. By noting the modifications in a separate protocol. This protocol is handled by an instance of ASTRewrite.
See more in the section called “How to Apply Changes”.
Writing changes back: If changes have been made, they need to be applied to the source code that was provided by 1. This is described in detail in the section called “Write it down”.
IDocument: Is a wrapper for the source code of step 1 and is needed at point 5

The Abstract Syntax Tree (AST)

As mentioned, the Abstract Syntax Tree is the way that Eclipse looks at your source code: every PHP source file is entirely represented as tree of AST nodes. These nodes are all subclasses of ASTNode. Every subclass is specialized for an element of the PHP Programming Language. E.g. there are nodes for method declarations ( MethodDeclaration), class declaration (ClassDeclaration), assignments and so on. One very frequently used node is Identifier. An Identifier is any string of PHP source that is not a keyword or a scalar Scalar For example, in $i = 6 + $j;, $i and $j are represented by Identifier.

All AST-relevant classes are located in the package org.eclipse.php.core.dom of the org.eclipse.php.core plug-in.

To discover how code is represented as AST, the AST Viewer plug-in [4] is a big help: Once installed you can simply mark source code in the editor and let it be displayed in a tree form in the AST Viewer view.

Parsing source code

Most of the time, an AST is not created from scratch, but rather parsed from existing PHP code. This is done using the ASTParser. It processes whole PHP files as well as portions of PHP code. In the example application the method Program parse(ISourceModule lwUnit)of the class AbstractASTArticle parses the source code stored in the file that lwUnit points to:

protected Program parse(ICompilationUnit unit) {
	ASTParser parser = ASTParser.newParser(ASTParser.VERSION_PHP5, lwUnit);
	try {
	return (Program) parser.createAST(null /* IProgressMonitor */); 
	} catch (Exception e) {
	return null;
	} 
}

With ASTParser.newParser(ASTParser.VERSION_PHP5, lwUnit), we advise the parser to parse the code following to the PHP Language Specification, includes all PHP Language Specifications up to the new syntax introduced in PHP 5. An ISourceModule is a pointer to a PHP file, and will be used to reolve binding infoirmation of this script. The parser supports five kinds of input:

Entire source file: The parser expects the source either as a pointer to a PHP file (which means as an ISourceModule, see the section called “PHP Model”) or as char[].

PHP Model

The PHP Model is a whole different story. It is out of scope of this article to dive deep into its details within. The parts looked at will be the ones which intersect with the AST. The motivation to discuss it here is, to use it as an entry point to build an Abstract Syntax Tree of a source file. Remember, the ICompilationUnit is one of the possible parameters for the AST parser.

The PHP Model represents a PHP Project in a tree structure, which is visualized by the well known "Package Explorer" view:

Figure 2. PHP Model Overview

The nodes of the PHP Model implement one of the following interfaces:

IScriptProject: Is the node of the PHP Model and represents a PHP Project. It contains IProjectFragment as child nodes.
IProjectFragment: Represents a project fragment, and maps the contents to an underlying resource which is either a folder, JAR, or ZIP file.
IScriptFolder: Represents a folder containing script files inside.
ISourceModule: Represents a PHP source file.
IType: Represents a class or interface in a source file.
IField: Represents a field or constant in an IType
IMethod: Represents afunction in of source file or a method in a class or interface

In contrast to the AST, these nodes are lightweight handles. It costs much less to rebuild a portion of the PHP Model than to rebuild an AST. That is also one reason why the PHP Model is not only defined down to the level of ISourceModule. There are many cases where complete information, like that provided by the AST, is not needed. One example is the Outline view: this view does not need to know the contents of a method body. It is more important that it can be rebuilt fast, to keep in sync with its source code.

There are different ways to get an ISourceModule. The example applications are launched as actions from the package tree view. This is quite convenient: only add an objectContribution extension to the point org.eclipse.ui.popupMenus. By choosing org.eclipse.dltk.core.ISourceModule as objectClass, the action will be only displayed in the context menu of a compilation unit. Have a look at the example application's plugin.xml. The compilation unit then can be retrieved from the ISelection, that is passed to the action's delegate (in the example, this is ASTArticleActionDelegate).

Another, programmatic, approach is to get the project handle from the IDE and to look for the compilation unit. This can be done by either step down the PHP Model tree to collect the desired ISourceModules. Or, by calling the findType() of the PHP project:

IWorkspaceRoot root = ResourcesPlugin.getWorkspace().getRoot();
IProject project = root.getProject("somePHPProject");
project.open(null /* IProgressMonitor */);
IScriptProject PHPProject = DLTKCore.create(project);
IType lwType = PHPProject.findType("MyClass");
ISourceModule lwSourceModule = lwType.getSourceModule();

How to find an AST Node

Even a simple "Hello world" program results in a quite complex tree. How does one get the FunctionInvocation of that println("Hello World")? Scanning all the levels is a possible, but not very convenient.

There is a better solution: every ASTNode allows querying for a child node by using a visitor (visitor pattern [5]). Have a look at AbstractVisitor. There you'll find for every subclass of ASTNode two methods, one called visit(), the other called endVisit(). Further, the ASTVisitor declares these two methods: preVisit(ASTNode node) and postVisit(ASTNode node).

The subclass of AbstractVisitor is passed to any node of the AST. The AST will recursively step through the tree, calling the mentioned methods of the visitor for every AST node in this order (for the example of a MethodInvocation):

preVisit(ASTNode node)
visit(MethodInvocation node)
... now the children of the method invocation are recursively processed if visit returns true
endVisit(MethodInvocation node)
postVisit(ASTNode node)

// TODO : check here a sample for visitor

In our example application, the LocalVariableDetector is a subclass of AbstractVisitor. It is used, amongst other things, to collect all local variable declarations of a compilation unit:

public boolean visit(VariableDeclarationStatement node) {
	for (Iterator iter = node.fragments().iterator(); iter.hasNext();) {
	VariableDeclarationFragment fragment = (VariableDeclarationFragment) iter.next();
	// ... store these fragments somewhere
	}
	return false; // prevent that SimpleName is interpreted as reference
}

If false is returned from visit(), the subtree of the visited node will not be considered. This is to ignore parts of the AST.

In the example, process(Program program) is called from the outside to start visiting the program. The function is fairly simple:

public void process(Program program) {
	program.accept(this);
}

Obtaining Information from an AST Node

Every subclass of ASTNode contains specific information for the PHP element it represents. E.g. a FunctionDeclaration will contain information about the name, return type, parameters, etc. The information of a node is referred as structural properties. Let us have a closer look at the characteristics of the structural properties. Beneath you see the properties of this function declaration:

function println($content) {
	echo $content . '<BR/>' ;
}

Figure 3. Structural properties of a method declaration

Access to the values of a node's structural properties can be made using static or generic methods:

static methods: every node offers methods to access its properties: e.g. getName(), etc.
generic method: ask for a property value using the getStructuralProperty(StructuralPropertyDescriptor property) method. Every AST subclass defines a set of StructuralPropertyDescriptors, one for every structural property. The StructuralPropertyDescriptor can be accessed directly on the class to which they belong: e.g. FunctionDeclaration.NAME_PROPERTY. A list of all available StructuralPropertyDescriptors of a node can be retrieved by calling the method structuralPropertiesForType() on any instance of ASTNode.

The structural properties are grouped into three different kinds: properties that hold simple values, properties which contain a single child AST node and properties which contain a list of child AST nodes.

Figure 4. StructuralPropertyDescriptor and subclasses

view full size

SimplePropertyDescriptor: The value will be a String, a primitive value wrapper for either Integer or Boolean or a basic AST constant. For a list of all possible value classes of a simple property, see Appendix C, Simple properties value classes
ChildPropertyDescriptor: The value will be a node, an instance of an ASTNode subclass
ChildListPropertyDescriptor: The value will be a List of AST nodes

Bindings

The AST, as far as we know it, is just a tree-form representation of source code. Every element of the source code is mapped to a node or a subtree. Looking at a reference to a variable, let's say $i, is represented by an instance of Identifier with "i" as IDENTIFIER property-value. Bindings go one step further: they provide extended resolved information for several elements of the AST. About the Identifier above they tell us that it is a reference to a local variable of type int.

Various subclasses of ASTNode have binding information. It is retrieved by calling resolveBinding() on these classes. There are cases where more than one binding is available: e.g. the class MethodInvocation returns a binding to the method that is invoked (resolveMethodBinding()). Furthermore a binding to the return type of the method (resolveTypeBinding()).

Since evaluating bindings is costly, the binding service has to be explicitly requested at parse time. This is done by passing the relevant ISourceModule to the method ASTParser.createParser() before the source is being parsed.

$i = 7;
echo 'Hello!';
$x = $i * 2;

the reference of the variable i is represented by a Identifier. Without bindings you would not know nothing more than this:

Bindings provide more information:

Bindings allow you to comfortably find out to which declaration a reference belongs, as well as to detect whether two elements are references to the same element: if they are, the bindings returned by reference-nodes and declaration-nodes are identical. For example, all Identifiers that represent a reference to a local variable i return the same instance of IVariableBinding from Identifier.resolveBindings(). The declaration node, Identifier.resolveBinding(), returns the same instance of IVariableBinding, too. If there is another usage of a local variable i (within another method or block), another instance of IVariableBinding is returned. Confusions caused by equally named elements are avoided if bindings are used to identify an element (variable, method, type, etc.).

How to Apply Changes

This section will show how to modify an AST and how to store these modifications back into PHP source code.

New AST nodes may have to be created. New nodes are created by using the class org.eclipse.php.core.dom.AST (here AST it is the name of an actual class. Do not confuse with the abbreviation "AST" used within this article). Have a look at this class: it offers methods to create every AST node type. An instance of AST is created when source code is parsed. This instance can be obtained from every node of the tree by calling the method getAST(). The newly created nodes can only be added to the tree that class AST was retrieved from.

Often it is convenient to reuse an existing subtree of an AST and maybe just change some details. AST nodes cannot be re-parented, once connected to an AST, they cannot be attached to a different place of the tree. Though it is easy to create a copy from a subtree: (Expression) ASTNode.copySubtree(ast, node) . The parameter ast is the target AST. This instance will be used to create the new nodes. That allows copying nodes from another AST (established by another parser run) into the current AST domain.

There are two APIs to track modifications on an AST: either you can directly modify the tree or you can make use of a separate protocol, managed by an instance of ASTRewrite. The latter, using the ASTRewrite, is the more sophisticated and preferable way. The changes are noted by an instance of ASTRewrite, the original AST is left untouched. It is possible to create more than one instance of ASTRewrite for the same AST, which means that different change logs can be set up. "Quick Fix" makes use of this API: this is how for every Quick Fix proposal a preview is created.

Example 1. Protocolling changes to a AST by using ASTRewrite .


MethodDeclaration md = ast.newMethodDeclaration();
md.setName(ast.newName("foo"));
ASTRewrite rewriter = ASTRewrite.create(ast);
ClassDeclaration td = (ClassDeclaration) cu.statements().get(0);
ITrackedNodePosition tdLocation = rewriter.track(td);
ListRewrite lrw = rewriter.getListRewrite(cu, Program.METHODS_PROPERTY);
lrw.insertLast(md, null);

The example shows, how a child is added to a child list property value. If a single-child property is set, no list rewrite is necessary. For example, to set the name of a MethodInvocation, the code would look like this:

rewrite.set(methodInvocation, MethodInvocation.NAME_PROPERTY, newName, null);

rewrite.replace(methodInvocation.getName() /* old name node*/, newName, null)

To set a simple property value, call set() like shown above.

Let us have a look at the second way to change an AST. Instead of tracking the modifications in separate protocols, we directly modify the AST. The only thing that has to done before modifying the first node is to turn on the change recording by calling recordModifications() on the root of the AST, the CompilationUnit. Internally changes are logged to an ASTRewrite as well, but this happens hidden to you.

Example 2. Modifying an AST directly.

 program.recordModifications();
 AST ast = program.getAST();
 EchoStatement echo = ast.newEchoStatement();
 echo.setExpression(ast.newScalar(“Hello World“));
 program.statements().add(echo);

The next section will tell how to write the modifications back into PHP source code.

Write it down

Once you have tracked changes, either by using ASTRewrite or by modifying the tree nodes directly, these changes can be written back into PHP source code. Therefore a TextEdit object has to be created. Here we leave the code related area of the AST, and enter a text based environment. The TextEdit object contains character based modification information. It is part of the org.eclipse.text plug-in.

How to obtain the TextEdit object differs for the two mentioned ways only slightly:

If you used ASTRewrite, ask the ASTRewrite instance for the desired TextEdit object by calling rewriteAST(IDocument, Map).
If you changed the tree nodes directly, the TextEdit object is created by calling rewrite(IDocument document, Map options) on CompilationUnit.

The first parameter, document, contains the source code that will be modified. The content of this container is the same code that you fed into the ASTParser. The second parameter is a map of options for the source code formatter. To use the default options, pass null.

Obtaining an IDocument if you parsed source code from a String is easy: create an object of the class org.eclipse.jface.text.Document and pass the code string as constructor parameter.

If you initially parsed an existing PHP source file and would like to store the changes back into this file, things get a little bit more tricky. You should not directly write into this file, since you might not be the only editor that is manipulating this source file. Within Eclipse, PHP editors do not write directly on a file resource, but on a shared working copy instead.

ITextFileBufferManager bufferManager = FileBuffers.getTextFileBufferManager(); // get the buffer manager
IPath path = unit.getPHPElement().getPath(); // unit: instance of CompilationUnit
try {
	bufferManager.connect(path, null); // (1)
	ITextFileBuffer textFileBuffer = bufferManager.getTextFileBuffer(path);
	// retrieve the buffer
	IDocument document = textFileBuffer.getDocument(); (2)
	// ... edit the document here ...
	// commit changes to underlying file
	textFileBuffer.commit(null /* ProgressMonitor */, false /* Overwrite */); // (3)
} finally {
	bufferManager.disconnect(path, null); // (4)
}

Connect a path to the buffer manager. After that call, the document for the file described by path can be obtained.
Ask the buffer for the working copy by calling getTextFileBuffer. From the ITextFileBuffer we get the IDocument instance we need.
Store changes to the underlying file.
Disconnect the path. Do not modify the document after this call.

Managing Comments

One of the most frustrating part of modifying an AST is the comment handling. The method CompilationUnit#getCommentList() is used to return the list of comments located in the compilation unit in the ascendant order. Unfortunately, this list cannot be modified. This means that even if the AST Rewriter is used to add a comment inside a compilation unit, the new comment would not appear inside the comments' list.

In order to add a comment the following code snippet can be used:

Program astRoot= ... ; // get the current program
ASTRewrite rewrite= ASTRewrite.create(astRoot.getAST());
Block block= (TypeDeclaration) astRoot.statements().get(0).getBody();
ListRewrite listRewrite= rewrite.getListRewrite(block, Block.STATEMENTS_PROPERTY);
Statement placeHolder= rewrite.createStringPlaceholder("//mycomment", ASTNode.EMPTY_STATEMENT);
listRewrite.insertFirst(placeHolder, null);
textEdits= rewrite.rewriteAST(document, null);
textEdits.apply(document);

The methods Program#getExtendedLength(ASTNode) and Program#getExtendedStartPosition(ASTNode) can be used to retrieve the range of a node that would contains preceding and trailing comments and whitespaces.

Conclusions

This article has shown how to use the Eclipse AST for static code analysis and code manipulation issues. It touched the PHP Model, explained Bindings and showed how to store changes made to the AST back into PHP source code.

For remarks, questions, etc. enter a comment in the bugzilla entry of this article [6].

Resources

[1] Download the Packed Example Project. Use the option "Existing Projects into Workspace" from the "Import" Wizard to add it to your workspace.

[2] To install the plug-in, obtain using the Eclipse Update Manager. Update Site: http://earticleast.sourceforge.net/update.

[3] PHP Tool Smithing, Extending the Eclipse PHP Development Tools .

[4] AST Viewer Plug-in .

[5] Wikipedia: Visitor Pattern .

[6] AST Article bugzilla entry .

A. Code Fragments for Example Application Cases

In the introduction, three typical cases for our example application have been presented (see the section called “Example Application”). Clarifying code before / after code snippets follow to further clarify these cases.

Removal of unnecessary declaration.

Before:
```
int x = 0;
...
x = 2 * 3;
```
After:
```
...
int x = 2 * 3;
```

Move of declaration.

Before:

int x = 0;
...
System.out.println(x);
...
x = 2 * 3;

After:

...
int x = 0;
System.out.println(x);
...
x = 2 * 3;

Move of a declaration of a variable, that is used within different blocks.

Before:

int x = 0;
...
try {
x = 2 * 3;
} catch (...) {
System.out.println(x);
}

After:

...
int x = 0;
try {
x = 2 * 3;
} catch (...) {
System.out.println(x);
}

B. Complete list of bindings

IAnnotationBinding
IMemberValuePairBinding
IMethodBinding
IPackageBinding
ITypeBinding
IVariableBinding

C. Simple properties value classes

Below the list of all classes of which simple property values can be instance of (in Eclipse version 3.2).

boolean
int
String
Modifier.ModifierKeyword
Assignment.Operator
InfixExpression.Operator
PostfixExpression.Operator
PrefixExpression.Operator
PrimitiveType.Code