UP | HOME

Project 1: Creating an Abstract Syntax Tree (AST) for SimpleC

Table of Contents

Overview

In this project, we will add semantics actions to a bison parser specification file that generate an abstract syntax tree (AST) for an input SimpleC program. Provided are the lexer and parser specifications, as well as a library for creating AST nodes.

Getting started

  1. Accept the GitHub classroom assignment in webcourses to create your git repository for the project.
    • Submission will be via GitHub by committed and pushing all of your changes to your repository to the course GitHub assignment, which graders will use for grading.
  2. In your virtual machine, clone your repository as described in lecture, replacing USERNAME with your GitHub username.

    git clone https://github.com/cop3402/simplec-compiler-1-USERNAME
    
  3. Run make in your source folder (in your virtual machine) to build the project, which should create the simplec program.
    • If the build fails, double-check that you are in the repository directory and on a correctly-configured virtual machine.
  4. Run cat tests/example.simplec | ./simplec
    • Running ./simplec without providing input will cause the program to stop and wait for input to stdin. Use Ctrl-D to end the input (or Ctrl-C to terminate the program).
    • If ./simplec is not found, be sure to prefix the program with ./, and be sure the make step above succeeded.

Submitting your project

Submit your project by commiting your changes and pushing them to your GitHub repository.

Debugging your compiler

  1. Narrow down the problem by crafting a smaller version of the test case
  2. Go to the part of the your code that is related to the problem
  3. Trace step-by-step what your code really does (not what you think/hope/feel/guess/divine/intuit/reckon it does)

Implementing AST generation

The main task of this project is to implement the semantic actions in parser.y to construct AST nodes for each grammar construct. See lecture 05 for more details on this task.

To evaluate your progress, pass simplec programs to your program, e.g., cat tests/example.simplec | ./simplec and inspect the resulting output for correctness. Please see a complete example below.

The provided AST library

  • The AST creation API is defined in ast.h and implemented for you in ast.c.
    • Use this API to construct the AST data structures
    • You will not need to write code that allocates memory for tree nodes yourself
  • Each AST node is loosely named after the grammar nonterminal it represents.
    • For example, declarations use the T_decl struct
  • Each create function is suffixed with the name of the node.
    • For example, create_decl creates a T_decl AST node representing a declaration construct
  • Each create function takes the child nodes of the AST node
    • These correspond to the symbols in the grammar production's right-hand side
    • For example, T_decl create_decl(T_type type, string ident);
  • Language constructs with multiple rules use a union type in the struct.
    • For example, statements use the T_stmt struct, which contains a union of types corresponding to assignment statements, if statements, etc.
    • The kind field is used to distinguish which kind of statement is held by the struct instance.

See the complete definition below (or in ast.h in your project template).

Recording the resulting AST

In the program nonterminal only, save the resulting AST to the program_ast global variable. This will make it available to the main function for printing. This is what the completed program semantic action should look like. It is the only place where program_ast should be assigned, because program is the root of the AST.

program
 : declaration_list function_list main_function
 {
   $$ = create_prog($1, $2, $3);
   program_ast = $$;
 }
 ;

Lists

argument_list
  : expression { $$ = create_exprlist($1, NULL); }
  | expression ',' argument_list { $$ = create_exprlist($1, $3); }
  ;

Operators

unary_operator
  : '&' { $$ = E_op_ref; }

Tips

  • First, pick a construct to implement
    • Start with the leaves
  • Create the simplest test case contains the construct
  • Find the path to the construct through the grammar
    • There may be parent constructs that need implementing first
    • Try to use the provided parent constructs first
    • Use printf in the semantic action to help ensure you are reaching the semantic action
    • Make sure the test case reaches the semantic action
  • Find the corresponding AST API call for creating the node
  • Create the node in the semantic action, saving it to $$ and passing the corresponding $1, $2, etc., child nodes.

Exit codes

According to POSIX standards, all programs report an exit code on termination. Programs use this exit code to report the status of the program on exit to the caller of the program. By convention, an exit code of 0 means a successful execution, while non-zero exit codes signal some error condition, defined on a per-program basis. This is the reason for the return 0; at the end of main in C programs. A C program can also use the exit library call to terminate the program and report the given exit code.

For our compiler, we will use the exit code to report errors during compilation. For project 1, we define the two following error codes. For this project, the error reporting is already implemented for you, but you will need to use the correct error code in later phases of the compiler.

  • exit(1) means there is a failed invariant check, e.g., an unexpected NULL value in the AST API
  • exit(2) means there is a parser or lexer error (already implemented in parser.y)

The bash shell stores the exit code of the last executed program in a special variable $?, which you can print using the echo command:

  • Check the exit code by running echo $? right after running ./simplec.

Our automated grading scripts will use these exit codes to test your compiler, so be sure your compiler is producing the correct ones.

Test cases

The tests/ folder in your repository has many example programs (with their corresponding output in an .ast file) to try with your parser.

Gotchas

  • Null pointers
    • The AST create_ functions check for NULL and provide an error
  • Incorrect semantic value numbers
    • Double-check the usage of $1 values. Remember that each symbol is numbered, including TOKENS, such as semicolons.
  • Make sure you are doing $$ = create_... and not just calling create_ otherwise, the semantic value won't be accessible by the parent
  • Incomplete semantic action definition
    • Child nodes won't appear in the AST yet if the parent nodes' semantic actions have not been implemented
  • Punctuation are also symbols in the production, so don't forget to skip them when counting the symbols for their semantic values, e.g.,

    : IF '(' expression ')' statement { $$ = create_ifstmt($3, $5); }
    
  • Some actions will just passthrough a child's value as the semantic value for the parent, i.e., semantic action will be $$ = $1. For example,

    statement
      : assignment_statement { $$ = $1; }
      | if_statement { $$ = $1; }
      | while_statement { $$ = $1; }
      | compound_statement { $$ = $1; }
      ;
    
    | '(' expression ')' { $$ = $2; }
    
    argument_list_opt
      : /* empty */ { $$ = NULL; }
      | argument_list { $$ = $1; }
      ;
    
main_function
 : MAIN '{' declaration_list statement_list RETURN expression ';' '}'
 {
   $$ = create_main($3, $4, $6);
 }
 ;

Example test script

This will compare your output against all public test cases. A diff exit code of 0 means there was no difference.

for i in tests/*.simplec; do
  echo $i;
  diff ${i%.simplec}.ast <(cat $i | ./simplec 2>/dev/null);
  echo "diff exit code: $?";
done

To compare to one program's output

diff tests/example.ast <(cat tests/example.simplec | ./simplec 2>/dev/null)

AST API

Nonterminal node types

Nonterminal AST Type
program T_prog
primary_expression T_expr
postfix_expression T_expr
argument_list_opt T_exprlist
argument_list T_exprlist
unary_expression T_expr
expression T_expr
unary_operator E_op
binary_operator E_op
statement_list T_stmtlist
assignment_statement T_stmt
if_statement T_stmt
while_statement T_stmt
compound_statement T_stmt
statement T_stmt
type_list_opt T_typelist
type_list T_typelist
type T_type
declaration_list T_decllist
declaration T_decl
parameter_list_opt T_paramlist
parameter_list T_paramlist
function_list T_funclist
function T_func
main_function T_main

Terminal node types

Terminal Type
IDENTIFIER string
HEX_CONSTANT int
INT_CONSTANT int
CHAR_CONSTANT char
STRING_LITERAL string

Nonterminals and their corresponding AST API calls

Nonterminal AST Node Creation Function
program T_prog create_prog(T_decllist decllist, T_funclist funclist, T_main main);
argument_list_opt, argument_list T_exprlist create_exprlist(T_expr expr, T_exprlist tail);
primary_expression T_expr create_identexpr(string identexpr);
primary_expression T_expr create_callexpr(string ident, T_exprlist args);
primary_expression T_expr create_intexpr(int intexpr);
primary_expression T_expr create_charexpr(char charexpr);
primary_expression T_expr create_strexpr(string strexpr);
postfix_expression T_expr create_arrayexpr(T_expr expr, T_expr index);
unary_expression T_expr create_unaryexpr(E_op op, T_expr expr);
unary_expression T_expr create_castexpr(T_type type, T_expr expr);
expression T_expr create_binaryexpr(T_expr left, E_op op, T_expr right);
assignment_statement T_stmt create_assignstmt(T_expr left, T_expr right);
if_statement T_stmt create_ifstmt(T_expr cond, T_stmt body);
if_statement T_stmt create_ifelsestmt(T_expr cond, T_stmt ifbranch, T_stmt elsebranch);
while_statement T_stmt create_whilestmt(T_expr cond, T_stmt body);
compound_statement T_stmt create_compoundstmt(T_decllist decllist, T_stmtlist stmtlist);
statement_list T_stmtlist create_stmtlist(T_stmt stmt, T_stmtlist tail);
type T_type create_primitivetype(E_typename primitivetype);
type T_type create_pointertype(T_type pointertype);
type T_type create_arraytype(int size, T_type type);
type T_type create_functiontype(T_typelist paramtypes, T_type returntype);
type_list T_typelist create_typelist(T_type type, T_typelist tail);
create_decl T_decl create_decl(T_type type, string ident);
create_decllist T_decllist create_decllist(T_decl decl, T_decllist tail);
parameter_list T_paramlist create_paramlist(string ident, T_paramlist tail);
function T_func create_func(string ident, T_paramlist paramlist, T_type type, T_decllist decllist, T_stmtlist stmtlist, T_expr returnexpr);
function_list T_funclist create_funclist(T_func func, T_funclist tail);
main_function T_main create_main(T_decllist decllist, T_stmtlist stmtlist, T_expr returnexpr);

Operators and their corresponding node from E_op

Terminal AST node
'&' E_op_ref
'*' E_op_deref
'-' E_op_minus
'!' E_op_not
'*' E_op_times
'/' E_op_divide
'%' E_op_mod
'+' E_op_plus
'-' E_op_minus
'<' E_op_lt
'>' E_op_gt
LE_OP E_op_le
GE_OP E_op_ge
EQ_OP E_op_eq
NE_OP E_op_ne
AND_OP E_op_and
OR_OP E_op_or

Typenames and their corresponding node from E_typename

Terminal AST node
INT E_typename_int
CHAR E_typename_char

Passthrough constructs

Some constructs create no new nodes. For instance, parenthesized expressions add no new nodes, because the parenthesize serve to describe the ordering of the parse tree.:

primary_expression : '(' expression ')' { $$ = $2; } ;

The semantic value of a primary_expression that matches a parenthesized expression is just the child exrpession's value, $$ = $2, i.e., it is passed through to the parent node.

Similarly, the expression grammar allows for many possible combinations of unary, postfix, and binary expressions while preserving an order of operations by passing through to the child construct. For instance, postfix_expressions, which are used for array indexing, are a primary expression followed by zero or more array derefences. If there are no array dereferences, the postfix_expression is just a primary_expression, so the value in that case is the same as the primary_expression's value, i.e., $$ = $1.

postfix_expression
  : primary_expression { $$ = $1; }
  | postfix_expression '[' expression ']' { /* todo */ }
  ;

Complete example

project_example.simplec

int x;
function(int) -> pointer<char> malloc;

f(a, b) : function(pointer<int>, char) -> int {
  return *a + (int) b;
}

main {
  int x;
  pointer<int> y;
  pointer<int> arr;
  char c;
  arr = (pointer<int>) malloc(50);
  y = &x;
  *y = *y + f(y, '1');
  c = 't';
  return 0;
}

cat project_example.simplec | ./simplec

+-prog
|  +-decllist
|  |  +-decl
|  |  |  +-primitivetype
|  |  |  |  +-typename_int
|  |  |  +-x
|  |  +-decllist
|  |  |  +-decl
|  |  |  |  +-functiontype
|  |  |  |  |  +-typelist
|  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  +-typename_int
|  |  |  |  |  |  +-typelist(empty)
|  |  |  |  |  +-pointertype
|  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  +-typename_char
|  |  |  |  +-malloc
|  |  |  +-decllist(empty)
|  +-funclist
|  |  +-func
|  |  |  +-f
|  |  |  +-paramlist
|  |  |  |  +-a
|  |  |  |  +-paramlist
|  |  |  |  |  +-b
|  |  |  |  |  +-paramlist(empty)
|  |  |  +-functiontype
|  |  |  |  +-typelist
|  |  |  |  |  +-pointertype
|  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  +-typename_int
|  |  |  |  |  +-typelist
|  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  +-typename_char
|  |  |  |  |  |  +-typelist(empty)
|  |  |  |  +-primitivetype
|  |  |  |  |  +-typename_int
|  |  |  +-decllist(empty)
|  |  |  +-stmtlist(empty)
|  |  |  +-binaryexpr
|  |  |  |  +-unaryexpr
|  |  |  |  |  +-op_deref
|  |  |  |  |  +-identexpr
|  |  |  |  |  |  +-a
|  |  |  |  +-op_plus
|  |  |  |  +-castexpr
|  |  |  |  |  +-primitivetype
|  |  |  |  |  |  +-typename_int
|  |  |  |  |  +-identexpr
|  |  |  |  |  |  +-b
|  |  +-funclist(empty)
|  +-main
|  |  +-decllist
|  |  |  +-decl
|  |  |  |  +-primitivetype
|  |  |  |  |  +-typename_int
|  |  |  |  +-x
|  |  |  +-decllist
|  |  |  |  +-decl
|  |  |  |  |  +-pointertype
|  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  +-typename_int
|  |  |  |  |  +-y
|  |  |  |  +-decllist
|  |  |  |  |  +-decl
|  |  |  |  |  |  +-pointertype
|  |  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  |  +-typename_int
|  |  |  |  |  |  +-arr
|  |  |  |  |  +-decllist
|  |  |  |  |  |  +-decl
|  |  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  |  +-typename_char
|  |  |  |  |  |  |  +-c
|  |  |  |  |  |  +-decllist(empty)
|  |  +-stmtlist
|  |  |  +-assignstmt
|  |  |  |  +-identexpr
|  |  |  |  |  +-arr
|  |  |  |  +-castexpr
|  |  |  |  |  +-pointertype
|  |  |  |  |  |  +-primitivetype
|  |  |  |  |  |  |  +-typename_int
|  |  |  |  |  +-callexpr
|  |  |  |  |  |  +-malloc
|  |  |  |  |  |  +-exprlist
|  |  |  |  |  |  |  +-intexpr
|  |  |  |  |  |  |  |  +-50
|  |  |  |  |  |  |  +-exprlist(empty)
|  |  |  +-stmtlist
|  |  |  |  +-assignstmt
|  |  |  |  |  +-identexpr
|  |  |  |  |  |  +-y
|  |  |  |  |  +-unaryexpr
|  |  |  |  |  |  +-op_ref
|  |  |  |  |  |  +-identexpr
|  |  |  |  |  |  |  +-x
|  |  |  |  +-stmtlist
|  |  |  |  |  +-assignstmt
|  |  |  |  |  |  +-unaryexpr
|  |  |  |  |  |  |  +-op_deref
|  |  |  |  |  |  |  +-identexpr
|  |  |  |  |  |  |  |  +-y
|  |  |  |  |  |  +-binaryexpr
|  |  |  |  |  |  |  +-unaryexpr
|  |  |  |  |  |  |  |  +-op_deref
|  |  |  |  |  |  |  |  +-identexpr
|  |  |  |  |  |  |  |  |  +-y
|  |  |  |  |  |  |  +-op_plus
|  |  |  |  |  |  |  +-callexpr
|  |  |  |  |  |  |  |  +-f
|  |  |  |  |  |  |  |  +-exprlist
|  |  |  |  |  |  |  |  |  +-identexpr
|  |  |  |  |  |  |  |  |  |  +-y
|  |  |  |  |  |  |  |  |  +-exprlist
|  |  |  |  |  |  |  |  |  |  +-charexpr
|  |  |  |  |  |  |  |  |  |  |  +-49
|  |  |  |  |  |  |  |  |  |  +-exprlist(empty)
|  |  |  |  |  +-stmtlist
|  |  |  |  |  |  +-assignstmt
|  |  |  |  |  |  |  +-identexpr
|  |  |  |  |  |  |  |  +-c
|  |  |  |  |  |  |  +-charexpr
|  |  |  |  |  |  |  |  +-116
|  |  |  |  |  |  +-stmtlist(empty)
|  |  +-intexpr
|  |  |  +-0

Author: Paul Gazzillo

Created: 2024-04-04 Thu 10:32

Validate