This document atempts to answer the so frequently asked question: How are trees represented in GCC and how I can access them?
Abstract Syntax Trees (or AST) are produced by each front-end as an intermediate representation. They are then translated to RTL after some analyses and optimizations.
Each time a function has been parsed completely, it is available in its tree representation. Each function follows the same path through the rest of the compiler : It is translated ("expanded") to RTL, then the RTL is optimized, and finally the result is some assembler code.
The goal of this tutorial is to explore what happens in the front end to the compiler, and what opportunities we have to optimize the code before its translation to RTL.
Let's see what happens to a file when it enters the C front-end.
The first function is parsed and its AST is built. Then rest_of_compilation is called, the function is transformed into RTL (Register Transfer Language) and finally it is translated into assembler code. Once the assembler code is produced for the first function, the second function is parsed and the same path is followed for generating the assembler source.
The only thing to remember is that GCC is based (for the moment) on the concept of function at time.
Now that you have a global view of the mechanics in GCC, let's see what optimizations and analyses are performed on the AST representation.
Only a small part of tree optimizations are available in the main branch of GCC. We're working on a separate branch called ast-optimizer-branch. That allows much more flexibility for developping experimental passes.
The function inliner is invoked just after the generation of the AST. Its behaviour is simple : if the function mets some criteria (size, inline flag, ...) then its AST representation is kept and reused everytime the function is called.
We introduced this representation experimentally following the previous work from McGill University. Simple is based on the same structure of tree as original AST. The only difference is that expressions were reduced under a form simple to analyse and optimize.
These optimizations are based on the SSA form, and are executed just before the function's AST is translated to RTL.
In the AST representation we have all the information the programmer put in his code : everything concerning the control flow, everything about structures and types. All these informations are analysed in the front-end and after their analysis are propagated in the back-end. Among these analyses we can find :
A function is represented as a node FUNCTION_DECL : its declaration.
Supose that we have a pointer to a function declaration, call it fnDecl. (There is a global variable called current_function_decl that is a pointer to the current function to be analysed.) The name of this function is available by calling DECL_NAME() and its body is available by calling DECL_SAVED_TREE () as shown in the following example :
char * fnName = IDENTIFIER_POINTER (DECL_NAME (fnDecl)); tree fnBody = DECL_SAVED_TREE (fnDecl);
(Note that names in capitals are usually macros in GCC. Language-independent tree nodes and macros are defined in tree.def and tree.h, respectively.)
A nice way to see its structure (as in cartoons) could be :
If the function does not have a body, then fnBody will point to NULL_TREE, otherwise it will point to a COMPOUND_STMT.
Statements provide the "wood" structure of the tree (I tried to keep a
brown colour on all the node names that correspond to a statement :-).
They represent the execution sequence as a control flow. Among them we
can find IF_STMT, FOR_STMT, DO_STMT, COMPOUND_STMT, ...
These nodes are defined in c-common.def for the C and C++ languages.
(FIXME: I don't know about the java front-end : seems that java-tree.def
contains all the nodes specific to java but has no statements, only
expressions, like :
/* Return -1, 0, 1 depending on whether the first argument is less, equal, or greater to the second argument. */ DEFTREECODE (COMPARE_EXPR, "compare_expr", '2', 2)
This node contains a chain of statements accessible through the COMPOUND_BODY macro.
In this figure you can see that statements are chained together via TREE_CHAIN. This is the original statement format. A way to simplify the work on the AST representation is by allowing statement nodes to be double chained. The interface becomes much more simpler to use then.
That node has the following entry in c-common.def :
/* Used to represent a `for' statement. The operands are FOR_INIT_STMT, FOR_COND, FOR_EXPR, and FOR_BODY, respectively. */ DEFTREECODE (FOR_STMT, "for_stmt", 'e', 4)
That gives the following schema :
Expressions are leafs of the tree. They can represent :
All these nodes are declared in tree.def and their format is straight forward :
DEFTREECODE (PLUS_EXPR, "plus_expr", '2', 2) DEFTREECODE (FUNCTION_DECL, "function_decl", 'd', 0)
The PLUS_EXPR has 2 operands, whereas FUNCTION_DECL has no operands. Their semantics and their possible accessors are defined as comments in c-common.def. If a node has no accessors, it is always possible to access their operands through the TREE_OPERAND (node, number) macro.
It is always good usage to access the tree through its macro interface. This allows further checking (if you've configured your development version with --enable-checking).
We want to write a small interface that allows simple operations on trees. The interface exists already for the double linked version of statements, but is still not written for the simple linked statement chain.
Here is a short snapshot of the interface in tree-dchain.h (as you could see it on May 2002) :
#define PREV_STMT(NODE) TREE_TYPE (NODE) #define NEXT_STMT(NODE) TREE_CHAIN (NODE) /* In tree-dchain.c */ extern void double_chain_stmts PARAMS ((tree)); extern void double_chain_free PARAMS ((tree)); extern void dchain_insert_stmt_after PARAMS ((tree, tree)); extern void dchain_insert_stmt_before PARAMS ((tree, tree)); extern void dchain_delete_stmts PARAMS ((tree, tree)); extern void dchain_insert_chain PARAMS ((tree, tree, tree, tree)); extern void dchain_two_chains PARAMS ((tree, tree, tree, tree, tree, tree, tree, tree)); extern tree dchain_deep_copy_node PARAMS ((tree)); extern void dchain_replace_stmt PARAMS ((tree, tree)); extern void dchain_build_scope PARAMS ((tree *)); extern tree dchain_declare_tmp_vars PARAMS ((tree, tree)); extern void dchain_delete_labels PARAMS ((tree));
As a general rule in the GNU coding style, you will find comments and explanations on what these functions do before every function definintion in the .c file. Okay, for making your life easier, I will give you a hint that I was told about just before I've begun to look at GCC : for emacs try etags, for vi try ctags...
The simplest example I have in mind that manipulate the tree structure is the function that stores the information about the previous statement in the tree structure. Note that this exmple deals only with the statement chain, and does a traversal of all statements in the tree. It is also good to keep in mind when you'll read the following code that the code was intended to be runned on SIMPLE (and thus avoids some problems that are much more harder to deal with ... as EXPR_STMT that can contain a COMPOUND_STMT, and thus potentially other statements...)
Here is what you can read in tree-dchain.c :
/* Store information about previous statements in the tree structure. */ void double_chain_stmts (node) tree node; { tree prev; if (node == NULL_TREE) return; /* The PREV_STMT points to NULL_TREE for the first statement in the chain. */ prev = NULL_TREE; while (node) { switch (TREE_CODE (node)) { case COMPOUND_STMT: double_chain_stmts (COMPOUND_BODY (node)); PREV_STMT (node) = prev; break; case FOR_STMT: double_chain_stmts (FOR_INIT_STMT (node)); double_chain_stmts (FOR_BODY (node)); PREV_STMT (node) = prev; break; case WHILE_STMT: double_chain_stmts (WHILE_BODY (node)); PREV_STMT (node) = prev; break; case DO_STMT: double_chain_stmts (DO_BODY (node)); PREV_STMT (node) = prev; break; case IF_STMT: double_chain_stmts (THEN_CLAUSE (node)); double_chain_stmts (ELSE_CLAUSE (node)); PREV_STMT (node) = prev; break; case SWITCH_STMT: double_chain_stmts (SWITCH_BODY (node)); PREV_STMT (node) = prev; break; case EXPR_STMT: case DECL_STMT: case RETURN_STMT: case SCOPE_STMT: case FILE_STMT: case LABEL_STMT: case GOTO_STMT: case ASM_STMT: case CASE_LABEL: case CONTINUE_STMT: case BREAK_STMT: PREV_STMT (node) = prev; break; default: break; } prev = node; node = NEXT_STMT (node); } }
You can see that the tree structure is never accessed directly : that is an important point, since the underlying structures can be modified later. (remind the OO-design rule that says that it is safer to access an Abstract Data Type through its interface rather than directly accessing its fields...)
As you've seen in this example, the manipulation of trees is really simple (when you work on SIMPLE :-) ...