AST representation in GCC

This document atempts to answer the so frequently asked question: How are trees represented in GCC and how I can access them?

Short description

Abstract Syntax Trees (or AST) are produced by each front-end as an intermediate representation. They are then translated to RTL after some analyses and optimizations.

Each time a function has been parsed completely, it is available in its tree representation. Each function follows the same path through the rest of the compiler : It is translated ("expanded") to RTL, then the RTL is optimized, and finally the result is some assembler code.

The goal of this tutorial is to explore what happens in the front end to the compiler, and what opportunities we have to optimize the code before its translation to RTL.

The magic of the front-end

Let's see what happens to a file when it enters the C front-end.

The first function is parsed and its AST is built. Then rest_of_compilation is called, the function is transformed into RTL (Register Transfer Language) and finally it is translated into assembler code. Once the assembler code is produced for the first function, the second function is parsed and the same path is followed for generating the assembler source.

The only thing to remember is that GCC is based (for the moment) on the concept of function at time.

Now that you have a global view of the mechanics in GCC, let's see what optimizations and analyses are performed on the AST representation.

Tree optimizations

Only a small part of tree optimizations are available in the main branch of GCC. We're working on a separate branch called ast-optimizer-branch. That allows much more flexibility for developping experimental passes.

Function inlining

The function inliner is invoked just after the generation of the AST. Its behaviour is simple : if the function mets some criteria (size, inline flag, ...) then its AST representation is kept and reused everytime the function is called.

SIMPLE representation

We introduced this representation experimentally following the previous work from McGill University. Simple is based on the same structure of tree as original AST. The only difference is that expressions were reduced under a form simple to analyse and optimize.

SSA-optimizations

These optimizations are based on the SSA form, and are executed just before the function's AST is translated to RTL.

Tree analyses

In the AST representation we have all the information the programmer put in his code : everything concerning the control flow, everything about structures and types. All these informations are analysed in the front-end and after their analysis are propagated in the back-end. Among these analyses we can find :

alias analysis
tree-cfg
tree-ssa
loop dependence analysis (still in construction)

Tree structure

Function nodes

A function is represented as a node FUNCTION_DECL : its declaration.

Supose that we have a pointer to a function declaration, call it fnDecl. (There is a global variable called current_function_decl that is a pointer to the current function to be analysed.) The name of this function is available by calling DECL_NAME() and its body is available by calling DECL_SAVED_TREE () as shown in the following example :

    char * fnName = IDENTIFIER_POINTER (DECL_NAME (fnDecl));
    tree fnBody = DECL_SAVED_TREE (fnDecl);

(Note that names in capitals are usually macros in GCC. Language-independent tree nodes and macros are defined in tree.def and tree.h, respectively.)

A nice way to see its structure (as in cartoons) could be :

If the function does not have a body, then fnBody will point to NULL_TREE, otherwise it will point to a COMPOUND_STMT.

Statement nodes

Statements provide the "wood" structure of the tree (I tried to keep a brown colour on all the node names that correspond to a statement :-). They represent the execution sequence as a control flow. Among them we can find IF_STMT, FOR_STMT, DO_STMT, COMPOUND_STMT, ...
These nodes are defined in c-common.def for the C and C++ languages. (FIXME: I don't know about the java front-end : seems that java-tree.def contains all the nodes specific to java but has no statements, only expressions, like :

    /* Return -1, 0, 1 depending on whether the first argument is
       less, equal, or greater to the second argument. */
    DEFTREECODE (COMPARE_EXPR, "compare_expr", '2', 2)

COMPOUND_STMT

This node contains a chain of statements accessible through the COMPOUND_BODY macro.

In this figure you can see that statements are chained together via TREE_CHAIN. This is the original statement format. A way to simplify the work on the AST representation is by allowing statement nodes to be double chained. The interface becomes much more simpler to use then.

FOR_STMT

That node has the following entry in c-common.def :

    /* Used to represent a `for' statement. The operands are
       FOR_INIT_STMT, FOR_COND, FOR_EXPR, and FOR_BODY, respectively. */
    DEFTREECODE (FOR_STMT, "for_stmt", 'e', 4)

That gives the following schema :

Expression nodes

Expressions are leafs of the tree. They can represent :

arithmetic expressions : PLUS_EXPR, MULT_EXPR, ...
types nodes : INTEGER_TYPE, CHAR_TYPE, ...
declarations : VAR_DECL, FUNCTION_DECL, TYPE_DECL, ...

All these nodes are declared in tree.def and their format is straight forward :

    DEFTREECODE (PLUS_EXPR, "plus_expr", '2', 2)
    DEFTREECODE (FUNCTION_DECL, "function_decl", 'd', 0)

The PLUS_EXPR has 2 operands, whereas FUNCTION_DECL has no operands. Their semantics and their possible accessors are defined as comments in c-common.def. If a node has no accessors, it is always possible to access their operands through the TREE_OPERAND (node, number) macro.

Tree interface

It is always good usage to access the tree through its macro interface. This allows further checking (if you've configured your development version with --enable-checking).

We want to write a small interface that allows simple operations on trees. The interface exists already for the double linked version of statements, but is still not written for the simple linked statement chain.

Here is a short snapshot of the interface in tree-dchain.h (as you could see it on May 2002) :

#define PREV_STMT(NODE) TREE_TYPE (NODE)
#define NEXT_STMT(NODE) TREE_CHAIN (NODE)

/* In tree-dchain.c  */
extern void double_chain_stmts             PARAMS ((tree));
extern void double_chain_free              PARAMS ((tree));
extern void dchain_insert_stmt_after       PARAMS ((tree, tree));
extern void dchain_insert_stmt_before      PARAMS ((tree, tree));
extern void dchain_delete_stmts            PARAMS ((tree, tree));
extern void dchain_insert_chain            PARAMS ((tree, tree, tree, tree));
extern void dchain_two_chains              PARAMS ((tree, tree, tree, tree, 
						    tree, tree, tree, tree));
extern tree dchain_deep_copy_node          PARAMS ((tree));
extern void dchain_replace_stmt            PARAMS ((tree, tree));
extern void dchain_build_scope             PARAMS ((tree *));
extern tree dchain_declare_tmp_vars        PARAMS ((tree, tree));
extern void dchain_delete_labels           PARAMS ((tree));

As a general rule in the GNU coding style, you will find comments and explanations on what these functions do before every function definintion in the .c file. Okay, for making your life easier, I will give you a hint that I was told about just before I've begun to look at GCC : for emacs try etags, for vi try ctags...

An example :

The simplest example I have in mind that manipulate the tree structure is the function that stores the information about the previous statement in the tree structure. Note that this exmple deals only with the statement chain, and does a traversal of all statements in the tree. It is also good to keep in mind when you'll read the following code that the code was intended to be runned on SIMPLE (and thus avoids some problems that are much more harder to deal with ... as EXPR_STMT that can contain a COMPOUND_STMT, and thus potentially other statements...)

Here is what you can read in tree-dchain.c :

 
/* Store information about previous statements in the tree structure.  */

void 
double_chain_stmts (node)
     tree node;
{
  tree prev;

  if (node == NULL_TREE)
    return;

  /* The PREV_STMT points to NULL_TREE for the first statement in the chain.  */
  prev = NULL_TREE;
  
  while (node)
    {
      switch (TREE_CODE (node))
	{
	case COMPOUND_STMT:
	  double_chain_stmts (COMPOUND_BODY (node));
	  PREV_STMT (node) = prev;
	  break;

	case FOR_STMT:
	  double_chain_stmts (FOR_INIT_STMT (node));
	  double_chain_stmts (FOR_BODY (node));
	  PREV_STMT (node) = prev;
	  break;

	case WHILE_STMT:
	  double_chain_stmts (WHILE_BODY (node));
	  PREV_STMT (node) = prev;
	  break;

	case DO_STMT:
	  double_chain_stmts (DO_BODY (node));
	  PREV_STMT (node) = prev;
	  break;

	case IF_STMT:
	  double_chain_stmts (THEN_CLAUSE (node));
	  double_chain_stmts (ELSE_CLAUSE (node));
	  PREV_STMT (node) = prev;
	  break;

	case SWITCH_STMT:
	  double_chain_stmts (SWITCH_BODY (node));
	  PREV_STMT (node) = prev;
	  break;

	case EXPR_STMT:
	case DECL_STMT:
	case RETURN_STMT:
	case SCOPE_STMT:
	case FILE_STMT:
	case LABEL_STMT:
	case GOTO_STMT:
	case ASM_STMT:
	case CASE_LABEL:
	case CONTINUE_STMT:
	case BREAK_STMT:
	  PREV_STMT (node) = prev;
	  break;

	default:
	  break;
	}
      prev = node;
      node = NEXT_STMT (node);
    }
}

You can see that the tree structure is never accessed directly : that is an important point, since the underlying structures can be modified later. (remind the OO-design rule that says that it is safer to access an Abstract Data Type through its interface rather than directly accessing its fields...)

As you've seen in this example, the manipulation of trees is really simple (when you work on SIMPLE :-) ...