CPSC 388 – Compiler Design and Construction Parsers – Context Free Grammars Announcements HW3 due via Sakai Solution to HW3 Homework: HW4 assigned today, due next Friday, via Sakai Reading: 4.1-4.4 Progress Report Grades due Sept 28th Compilers Lexical Analyzer (Scanner) Source Code Syntax Analyzer (Parser) Token Stream Abstract Syntax Tree Symantic Analyzer Intermediate Code Generator Optimizer Code Generator Scanner Source Code: position = initial + rate * 60 ; Corresponding Tokens: IDENT(position) ASSIGN IDENT(position) PLUS IDENT(rate) TIMES INT-LIT(60) SEMI-COLON Example Parse Source Code: position = initial + rate * 60 ; Abstract-Syntax Tree: position = + •Interior nodes are operators. •A node's children are operands. initial * •Each subtree forms "logical unit" e.g., the subtree with * at its root shows that because multiplication has higher precedence than addition, this operation must be performed rate as a unit (not initial+rate). 60 Limitations of RE and FSA Regular Expressions and Finite State Automata cannot express all languages For Example the language that consists of all balanced parenthesis: () and ((())) and (((((()))))) Parsers can recognize more types of languages than RE or FSA Parsers Input: Sequence of Tokens Output: a representation of program Often AST, but could be other things Also find syntax errors CFGs used to define Parser (Context Free Grammar) Context Free Grammars stmt → if ( expr ) stmt else stmt terminals Rule or Production non-terminals Context Free Grammar CFGs consist of: Σ – Set of Terminals (use tokens from scanner) N – Set of Non-Terminals (variables) P – Set of Productions (also called rules) S – the Start Non-Terminal (one on left of first rule if not specified) Productions stmt → if ( expr ) stmt else stmt Sequence of zero or more terminals and non-terminals Single non-terminal Example with Boolean Expressions “true” and “false” are boolean expressions If exp1 and exp2 are boolean expressions, then so are: exp1 || exp2 exp1 && exp2 ! exp1 ( exp1 ) Corresponding CFG bexp bexp bexp bexp bexp bexp → → → → → → TRUE FALSE bexp OR bexp bexp AND bexp NOT bexp LPAREN bexp RPAREN UPPERCASE represent tokens (thus terminals) lowercase represent non-terminals CFG for Assignments Here is CFG for simple assignment statements (Can only assign boolean expressions to identifiers) stmt → ID ASSIGN bexp SEMICOLON CFG for simple IF statements Combine these CFGs and add 2 more rules for simple IF statements: 1. stmt → IF LPAREN bexp RPAREN stmt 2. stmt → IF LPAREN bexp RPAREN stmt ELSE stmt 3. stmt → ID ASSIGN bexp SEMICOLON 4. bexp → TRUE 5. bexp → FALSE 6. bexp → bexp OR bexp 7. bexp → bexp AND bexp 8. bexp → NOT bexp 9. bexp → LPAREN bexp RPAREN You Try It Write a context-free grammar for the language of very simple while loops (in which the loop body only contains one statement) by adding a new production with nonterminal stmt on the left-hand side. CFG Languages The language defined by a context-free grammar is the set of strings (sequences of terminals) that can be derived from the start nonterminal. Think of productions as rewriting rules Set cur_seq = starting non-terminal While (non-terminal, X, exists in cur_seq): Select production with X on left of “→” Replace X with right portion of selected production Try it with given CFG What Strings are in Language 1. stmt → IF LPAREN bexp RPAREN stmt 2. 3. 4. 5. 6. 7. 8. 9. stmt → IF LPAREN bexp RPAREN stmt ELSE stmt stmt → ID ASSIGN bexp SEMICOLON bexp → TRUE bexp → FALSE bexp → bexp OR bexp bexp → bexp AND bexp bexp → NOT bexp bexp → LPAREN bexp RPAREN Set cur_seq = starting non-terminal While (non-terminal, X, exists in cur_seq): Select production with X on left of “→” Replace X with right portion of selected production Try It Again exp exp exp term term term factor factor → → → → → → → → exp PLUS term exp MINUS term term term TIMES factor term DIVIDE factor factor LPAREN exp RPAREN ID What is the language? Leftmost and Rightmost Derivations A derivation is a leftmost derivation if it is always the leftmost nonterminal that is chosen to be replaced. It is a rightmost derivation if it is always the rightmost one. Derivation Notation E => a E =>* a E =>+ a Parse Trees Start with the start nonterminal. Repeat: choose a leaf nonterminal X choose a production X --> alpha the symbols in alpha become the children of X in the tree until there are no more leaf nonterminals left. The derived string is formed by reading the leaf nodes from left to right. Ambiguous Grammars Consider the grammar exp → exp PLUS exp exp → exp MINUS exp exp → exp TIMES exp exp → exp DIVIDE exp exp → INT_LIT Construct Parse tree for 3-4/2 Are there more than one parse trees? If there is more than one parse tree for a string then the grammar is ambiguous Ambiguity causes problems with parsing (what is the correct structure)? Precedence in Grammars To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("-" in our case), then write a rule for the operator(s) with the next lowest precedence, etc: Precedence in Grammars exp term factor → exp MINUS exp | term → term DIVIDE term | factor → INT_LIT | LPAREN exp RPAREN Now try constructing multiple parse trees for 3-4/2 Grammar is still ambiguous. Look at associativity. Construct 2 parses tree for 5-3-2. Recursion on CFGs A grammar is recursive in nonterminal X if: X derives a sequence of symbols that includes an X. A grammar is left recursive in X if: X derives a sequence of symbols that starts with an X. A grammar is right recursive in X if: X derives a sequence of symbols that ends with an X. For left associativity, use left recursion. For right associativity, use right recursion. Ambiguity Removed in CFG exp term factor → exp MINUS term | term → term DIVIDE factor | factor → INT_LIT | LPAREN exp RPAREN One level for each order of operation Left recursion for left assiciativity Now try constructing 2 parse trees for 5-3-2. You Try it Construct a grammar for arithmetic expressions with addition, multiplication, exponentiation (right assoc.), subtraction, division, and unary negative. exp → exp PLUS exp | exp MINUS exp | exp TIMES exp | exp DIVIDE exp | exp POW exp | MINUS exp | LPAREN exp RPAREN | INT_LIT Solution exp → term → factor → exponent → final → exp PLUS term | exp MINUS term | term term TIMES factor | term DIVIDE factor | factor exponent POW factor | exponent MINUS exponent | final INT_LIT | LPAREN exp RPAREN You Try It Write a grammar for the language of boolean expressions, with two possible operands: true false, and three possible operators: and or not. Add nonterminals so that or has lowest precedence, then and, then not. Finally, change the grammar to reflect the fact that both and and or are left associative. bexp → TRUE bexp → FALSE bexp → bexp OR bexp bexp → bexp AND bexp bexp → NOT bexp bexp → LPAREN bexp RPAREN List Grammars Several types of lists can be created using CFGs One or more x's (without any separator or terminator): 1. xList → X | xList xList 2. xList → X | xList X 3. xList → X | X xList One or more x's, separated by commas: 1. xList → X | xList COMMA xList 2. xList → X | xList COMMA X 3. xList → X | X COMMA xList List Grammars One or more x's, each x terminated by a semi-colon: 1. 2. 3. You Try xList → xList → xList → It X SEMICOLON | xList xList X SEMICOLON | xList X SEMICOLON X SEMICOLON | X SEMICOLON xList Zero or more x's (without any separator or terminator): 1. xList → ε | X | xList xList 2. xList → ε | X | xList X 3. xList → ε | X | X xList List Grammars Zero or more x's, each terminated by a semi-colon: 1. xList → ε | X SEMICOLON | xList xList 2. xList → ε | X SEMICOLON | xList X SEMICOLON 3. xList → ε | X SEMICOLON | X SEMICOLON xList Zero or more x's, separated by commas: You Try It 1. xList → ε | nonEmptyXList nonEmptyXList → X | X COMMA nonEmptyXList CFGs for Whole Languages To write a grammar for a whole programming language, break down the problem into pieces. For example, think about a Java program: a program consists of one or more classes: program → classlist classlist → class | class classlist CFGs for Whole Languages A class is the word "class", optionally preceded by the word "public", followed by an identifier, followed by an open curly brace, followed by the class body, followed by a closing curly brace: class → PUBLIC CLASS ID LCURLY classbody RCURLY | CLASS ID LCURLY classbody RCURLY CFGs for Whole Languages A class body is a list of zero or more field and/or method definitions: classbody deflist → ε | deflist → def | def deflist And So On…

Descargar
# CPSC 388 – Compiler Design and Construction