Project 1: Creating an Abstract Syntax Tree (AST) for SimpleC
Table of Contents
Overview
In this project, we will add semantics actions to a bison parser specification file that generate an abstract syntax tree (AST) for an input SimpleC program. Provided are the lexer and parser specifications, as well as a library for creating AST nodes.
Getting started
- Accept the GitHub classroom assignment in webcourses to create your git repository for the project.
- Submission will be via GitHub by committed and pushing all of your changes to your repository to the course GitHub assignment, which graders will use for grading.
In your virtual machine, clone your repository as described in lecture, replacing
USERNAME
with your GitHub username.git clone https://github.com/cop3402/simplec-compiler-1-USERNAME
- Run
make
in your source folder (in your virtual machine) to build the project, which should create thesimplec
program.- If the build fails, double-check that you are in the repository directory and on a correctly-configured virtual machine.
- Run
cat tests/example.simplec | ./simplec
- Running
./simplec
without providing input will cause the program to stop and wait for input tostdin
. Use Ctrl-D to end the input (or Ctrl-C to terminate the program). - If
./simplec
is not found, be sure to prefix the program with./
, and be sure themake
step above succeeded.
- Running
Submitting your project
Submit your project by commiting your changes and pushing them to your GitHub repository.
Debugging your compiler
- Narrow down the problem by crafting a smaller version of the test case
- Go to the part of the your code that is related to the problem
- Trace step-by-step what your code really does (not what you think/hope/feel/guess/divine/intuit/reckon it does)
Implementing AST generation
The main task of this project is to implement the semantic actions in parser.y
to construct AST nodes for each grammar construct. See lecture 05 for more details on this task.
To evaluate your progress, pass simplec programs to your program, e.g., cat tests/example.simplec | ./simplec
and inspect the resulting output for correctness. Please see a complete example below.
The provided AST library
- The AST creation API is defined in
ast.h
and implemented for you inast.c
.- Use this API to construct the AST data structures
- You will not need to write code that allocates memory for tree nodes yourself
- Each AST node is loosely named after the grammar nonterminal it represents.
- For example, declarations use the
T_decl
struct
- For example, declarations use the
- Each create function is suffixed with the name of the node.
- For example,
create_decl
creates aT_decl
AST node representing a declaration construct
- For example,
- Each create function takes the child nodes of the AST node
- These correspond to the symbols in the grammar production's right-hand side
- For example,
T_decl create_decl(T_type type, string ident);
- Language constructs with multiple rules use a union type in the
struct
.- For example, statements use the
T_stmt
struct, which contains a union of types corresponding to assignment statements, if statements, etc. - The
kind
field is used to distinguish which kind of statement is held by thestruct
instance.
- For example, statements use the
See the complete definition below (or in ast.h in your project template).
Recording the resulting AST
In the program nonterminal only, save the resulting AST to the
program_ast
global variable. This will make it available to the
main function for printing. This is what the completed program
semantic action should look like. It is the only place where
program_ast
should be assigned, because program is the root of the
AST.
program : declaration_list function_list main_function { $$ = create_prog($1, $2, $3); program_ast = $$; } ;
Lists
argument_list : expression { $$ = create_exprlist($1, NULL); } | expression ',' argument_list { $$ = create_exprlist($1, $3); } ;
Operators
unary_operator : '&' { $$ = E_op_ref; }
Tips
- First, pick a construct to implement
- Start with the leaves
- Create the simplest test case contains the construct
- Find the path to the construct through the grammar
- There may be parent constructs that need implementing first
- Try to use the provided parent constructs first
- Use
printf
in the semantic action to help ensure you are reaching the semantic action - Make sure the test case reaches the semantic action
- Find the corresponding AST API call for creating the node
- Create the node in the semantic action, saving it to
$$
and passing the corresponding$1
,$2
, etc., child nodes.
Exit codes
According to POSIX standards, all programs report an exit code on termination. Programs use this exit code to report the status of the program on exit to the caller of the program. By convention, an exit code of 0 means a successful execution, while non-zero exit codes signal some error condition, defined on a per-program basis. This is the reason for the return 0;
at the end of main
in C programs. A C program can also use the exit
library call to terminate the program and report the given exit code.
For our compiler, we will use the exit code to report errors during compilation. For project 1, we define the two following error codes. For this project, the error reporting is already implemented for you, but you will need to use the correct error code in later phases of the compiler.
- exit(1) means there is a failed invariant check, e.g., an unexpected NULL value in the AST API
- exit(2) means there is a parser or lexer error (already implemented in parser.y)
The bash
shell stores the exit code of the last executed program in a special variable $?
, which you can print using the echo
command:
- Check the exit code by running
echo $?
right after running./simplec
.
Our automated grading scripts will use these exit codes to test your compiler, so be sure your compiler is producing the correct ones.
Test cases
The tests/
folder in your repository has many example programs (with their corresponding output in an .ast
file) to try with your parser.
Gotchas
- Null pointers
- The AST
create_
functions check for NULL and provide an error
- The AST
- Incorrect semantic value numbers
- Double-check the usage of
$1
values. Remember that each symbol is numbered, including TOKENS, such as semicolons.
- Double-check the usage of
- Make sure you are doing
$$ = create_...
and not just callingcreate_
otherwise, the semantic value won't be accessible by the parent - Incomplete semantic action definition
- Child nodes won't appear in the AST yet if the parent nodes' semantic actions have not been implemented
Punctuation are also symbols in the production, so don't forget to skip them when counting the symbols for their semantic values, e.g.,
: IF '(' expression ')' statement { $$ = create_ifstmt($3, $5); }
Some actions will just passthrough a child's value as the semantic value for the parent, i.e., semantic action will be
$$ = $1
. For example,statement : assignment_statement { $$ = $1; } | if_statement { $$ = $1; } | while_statement { $$ = $1; } | compound_statement { $$ = $1; } ;
| '(' expression ')' { $$ = $2; }
argument_list_opt : /* empty */ { $$ = NULL; } | argument_list { $$ = $1; } ;
main_function : MAIN '{' declaration_list statement_list RETURN expression ';' '}' { $$ = create_main($3, $4, $6); } ;
Example test script
This will compare your output against all public test cases. A diff exit code of 0 means there was no difference.
for i in tests/*.simplec; do echo $i; diff ${i%.simplec}.ast <(cat $i | ./simplec 2>/dev/null); echo "diff exit code: $?"; done
To compare to one program's output
diff tests/example.ast <(cat tests/example.simplec | ./simplec 2>/dev/null)
AST API
Nonterminal node types
Nonterminal | AST Type |
---|---|
program | T_prog |
primary_expression | T_expr |
postfix_expression | T_expr |
argument_list_opt | T_exprlist |
argument_list | T_exprlist |
unary_expression | T_expr |
expression | T_expr |
unary_operator | E_op |
binary_operator | E_op |
statement_list | T_stmtlist |
assignment_statement | T_stmt |
if_statement | T_stmt |
while_statement | T_stmt |
compound_statement | T_stmt |
statement | T_stmt |
type_list_opt | T_typelist |
type_list | T_typelist |
type | T_type |
declaration_list | T_decllist |
declaration | T_decl |
parameter_list_opt | T_paramlist |
parameter_list | T_paramlist |
function_list | T_funclist |
function | T_func |
main_function | T_main |
Terminal node types
Terminal | Type |
---|---|
IDENTIFIER | string |
HEX_CONSTANT | int |
INT_CONSTANT | int |
CHAR_CONSTANT | char |
STRING_LITERAL | string |
Nonterminals and their corresponding AST API calls
Nonterminal | AST Node Creation Function |
---|---|
program | T_prog create_prog(T_decllist decllist, T_funclist funclist, T_main main); |
argument_list_opt, argument_list | T_exprlist create_exprlist(T_expr expr, T_exprlist tail); |
primary_expression | T_expr create_identexpr(string identexpr); |
primary_expression | T_expr create_callexpr(string ident, T_exprlist args); |
primary_expression | T_expr create_intexpr(int intexpr); |
primary_expression | T_expr create_charexpr(char charexpr); |
primary_expression | T_expr create_strexpr(string strexpr); |
postfix_expression | T_expr create_arrayexpr(T_expr expr, T_expr index); |
unary_expression | T_expr create_unaryexpr(E_op op, T_expr expr); |
unary_expression | T_expr create_castexpr(T_type type, T_expr expr); |
expression | T_expr create_binaryexpr(T_expr left, E_op op, T_expr right); |
assignment_statement | T_stmt create_assignstmt(T_expr left, T_expr right); |
if_statement | T_stmt create_ifstmt(T_expr cond, T_stmt body); |
if_statement | T_stmt create_ifelsestmt(T_expr cond, T_stmt ifbranch, T_stmt elsebranch); |
while_statement | T_stmt create_whilestmt(T_expr cond, T_stmt body); |
compound_statement | T_stmt create_compoundstmt(T_decllist decllist, T_stmtlist stmtlist); |
statement_list | T_stmtlist create_stmtlist(T_stmt stmt, T_stmtlist tail); |
type | T_type create_primitivetype(E_typename primitivetype); |
type | T_type create_pointertype(T_type pointertype); |
type | T_type create_arraytype(int size, T_type type); |
type | T_type create_functiontype(T_typelist paramtypes, T_type returntype); |
type_list | T_typelist create_typelist(T_type type, T_typelist tail); |
create_decl | T_decl create_decl(T_type type, string ident); |
create_decllist | T_decllist create_decllist(T_decl decl, T_decllist tail); |
parameter_list | T_paramlist create_paramlist(string ident, T_paramlist tail); |
function | T_func create_func(string ident, T_paramlist paramlist, T_type type, T_decllist decllist, T_stmtlist stmtlist, T_expr returnexpr); |
function_list | T_funclist create_funclist(T_func func, T_funclist tail); |
main_function | T_main create_main(T_decllist decllist, T_stmtlist stmtlist, T_expr returnexpr); |
Operators and their corresponding node from E_op
Terminal | AST node |
---|---|
'&' | E_op_ref |
'*' | E_op_deref |
'-' | E_op_minus |
'!' | E_op_not |
'*' | E_op_times |
'/' | E_op_divide |
'%' | E_op_mod |
'+' | E_op_plus |
'-' | E_op_minus |
'<' | E_op_lt |
'>' | E_op_gt |
LE_OP | E_op_le |
GE_OP | E_op_ge |
EQ_OP | E_op_eq |
NE_OP | E_op_ne |
AND_OP | E_op_and |
OR_OP | E_op_or |
Typenames and their corresponding node from E_typename
Terminal | AST node |
---|---|
INT | E_typename_int |
CHAR | E_typename_char |
Passthrough constructs
Some constructs create no new nodes. For instance, parenthesized expressions add no new nodes, because the parenthesize serve to describe the ordering of the parse tree.:
primary_expression : '(' expression ')' { $$ = $2; } ;
The semantic value of a primary_expression that matches a parenthesized expression is just the child exrpession's value, $$ = $2
, i.e., it is passed through to the parent node.
Similarly, the expression grammar allows for many possible combinations of unary, postfix, and binary expressions while preserving an order of operations by passing through to the child construct. For instance, postfix_expressions, which are used for array indexing, are a primary expression followed by zero or more array derefences. If there are no array dereferences, the postfix_expression is just a primary_expression, so the value in that case is the same as the primary_expression's value, i.e., $$ = $1
.
postfix_expression : primary_expression { $$ = $1; } | postfix_expression '[' expression ']' { /* todo */ } ;
Complete example
project_example.simplec
int x; function(int) -> pointer<char> malloc; f(a, b) : function(pointer<int>, char) -> int { return *a + (int) b; } main { int x; pointer<int> y; pointer<int> arr; char c; arr = (pointer<int>) malloc(50); y = &x; *y = *y + f(y, '1'); c = 't'; return 0; }
cat project_example.simplec | ./simplec
+-prog | +-decllist | | +-decl | | | +-primitivetype | | | | +-typename_int | | | +-x | | +-decllist | | | +-decl | | | | +-functiontype | | | | | +-typelist | | | | | | +-primitivetype | | | | | | | +-typename_int | | | | | | +-typelist(empty) | | | | | +-pointertype | | | | | | +-primitivetype | | | | | | | +-typename_char | | | | +-malloc | | | +-decllist(empty) | +-funclist | | +-func | | | +-f | | | +-paramlist | | | | +-a | | | | +-paramlist | | | | | +-b | | | | | +-paramlist(empty) | | | +-functiontype | | | | +-typelist | | | | | +-pointertype | | | | | | +-primitivetype | | | | | | | +-typename_int | | | | | +-typelist | | | | | | +-primitivetype | | | | | | | +-typename_char | | | | | | +-typelist(empty) | | | | +-primitivetype | | | | | +-typename_int | | | +-decllist(empty) | | | +-stmtlist(empty) | | | +-binaryexpr | | | | +-unaryexpr | | | | | +-op_deref | | | | | +-identexpr | | | | | | +-a | | | | +-op_plus | | | | +-castexpr | | | | | +-primitivetype | | | | | | +-typename_int | | | | | +-identexpr | | | | | | +-b | | +-funclist(empty) | +-main | | +-decllist | | | +-decl | | | | +-primitivetype | | | | | +-typename_int | | | | +-x | | | +-decllist | | | | +-decl | | | | | +-pointertype | | | | | | +-primitivetype | | | | | | | +-typename_int | | | | | +-y | | | | +-decllist | | | | | +-decl | | | | | | +-pointertype | | | | | | | +-primitivetype | | | | | | | | +-typename_int | | | | | | +-arr | | | | | +-decllist | | | | | | +-decl | | | | | | | +-primitivetype | | | | | | | | +-typename_char | | | | | | | +-c | | | | | | +-decllist(empty) | | +-stmtlist | | | +-assignstmt | | | | +-identexpr | | | | | +-arr | | | | +-castexpr | | | | | +-pointertype | | | | | | +-primitivetype | | | | | | | +-typename_int | | | | | +-callexpr | | | | | | +-malloc | | | | | | +-exprlist | | | | | | | +-intexpr | | | | | | | | +-50 | | | | | | | +-exprlist(empty) | | | +-stmtlist | | | | +-assignstmt | | | | | +-identexpr | | | | | | +-y | | | | | +-unaryexpr | | | | | | +-op_ref | | | | | | +-identexpr | | | | | | | +-x | | | | +-stmtlist | | | | | +-assignstmt | | | | | | +-unaryexpr | | | | | | | +-op_deref | | | | | | | +-identexpr | | | | | | | | +-y | | | | | | +-binaryexpr | | | | | | | +-unaryexpr | | | | | | | | +-op_deref | | | | | | | | +-identexpr | | | | | | | | | +-y | | | | | | | +-op_plus | | | | | | | +-callexpr | | | | | | | | +-f | | | | | | | | +-exprlist | | | | | | | | | +-identexpr | | | | | | | | | | +-y | | | | | | | | | +-exprlist | | | | | | | | | | +-charexpr | | | | | | | | | | | +-49 | | | | | | | | | | +-exprlist(empty) | | | | | +-stmtlist | | | | | | +-assignstmt | | | | | | | +-identexpr | | | | | | | | +-c | | | | | | | +-charexpr | | | | | | | | +-116 | | | | | | +-stmtlist(empty) | | +-intexpr | | | +-0