The Nature of Computer Languages
Grammars
• Since the development of ALGOL in the late 1950s, programming languages have been described using formal grammars
• ANTLR leverages the power of grammars by automatically generating code that can lexically scan and parse the languages those grammars describe
• The target language for us is Java, but
ANTLR can also target other languages
Language Description
• The programming languages we want to describe are infinite. How can they be described?
• Two ways to describe a language:
1. Build a generator that will produce valid
sentences
2. Build a recognizer that can determine if a
given input sentence is valid
Language Generator
[Diagram: a generator emits the valid sentence System.out.println("Hi"); ]
Language Recognizer
[Diagram: a recognizer reads “x = x + 1;” and reports that it is valid (“Good”)]
Language Generation
• How can we generate words that produce
a valid sentence?
• How can we extract meaning from a
sequence of words?
• “Terence says Sriram likes chicken tikka.”
• “Terence says Sriram, likes chicken tikka.”
• Same sequence of words; the structure determines the meaning.
• X = 2 + 3 * 4; the same question of structure applies to code: is it (2 + 3) * 4 or 2 + (3 * 4)?
Grammar Background
• Deterministic Finite State Machine
• Non-Deterministic Finite Automata
• Pushdown Automata
• Language ambiguity
Finite State Machine (Informal)
[State machine diagram: from S0, the words “He”, “She”, or “It” lead to S1; “loves”, “writes”, or “hates” lead to S2; “golf”, “gin”, or “ANTLR” lead to the final state S4. Example sentences it generates: “He loves golf”, “She hates gin”, “She writes golf”.]
Deterministic Finite State Machine
(Formally)
• A deterministic finite state machine is a
quintuple (Σ,S,s0,δ,F), where:
• Σ is the input alphabet (a finite, non-empty
set of symbols).
• S is a finite, non-empty set of states.
• s0 is an initial state, an element of S.
• δ is the state-transition function: δ : S × Σ → S
• F is the set of final states, a (possibly
empty) subset of S.
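For concreteness, here is a minimal Java sketch of such a quintuple; the states, alphabet, and transition table are invented purely for illustration:

import java.util.*;

// A tiny DFA over the alphabet {'a', 'b'} that accepts strings ending in 'b'.
// States, alphabet, and transitions are invented purely for illustration.
public class SimpleDFA {
    // delta: state x symbol -> state, encoded as a nested map
    static final Map<Integer, Map<Character, Integer>> DELTA = Map.of(
        0, Map.of('a', 0, 'b', 1),
        1, Map.of('a', 0, 'b', 1));
    static final int START = 0;
    static final Set<Integer> FINAL = Set.of(1);

    static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            Map<Character, Integer> row = DELTA.get(state);
            if (row == null || !row.containsKey(c)) return false; // symbol not in the alphabet
            state = row.get(c);
        }
        return FINAL.contains(state);
    }

    public static void main(String[] args) {
        System.out.println(accepts("aab"));  // true
        System.out.println(accepts("aba"));  // false
    }
}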
Deterministic Finite Automaton
(DFA)
• A deterministic finite state machine, or deterministic finite automaton (DFA), is a finite state machine where for each pair of state and input symbol there is one and only one transition to a next state.
• DFAs recognize the set of regular
languages and are equivalent to regular
grammars
Transitions in a DFA
[Diagram: a state S1 with transitions X → S2 and Y → S3 is fine (deterministic); a state S1 with two transitions on the same symbol, X → S2 and X → S3, is nondeterministic.]
Non-deterministic Finite Automaton
(NDFA)
• A non-deterministic finite state machine, or non-deterministic finite automaton (NDFA), is a finite state machine where for each pair of state and input symbol there are possibly two or more transitions to a next state.
• Every NDFA can be converted to a DFA
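The conversion rests on a simple idea: simulate the NDFA by tracking the set of states it could currently be in. The sketch below (an invented automaton, purely for illustration) shows that simulation, which is the heart of the subset construction:

import java.util.*;

// A tiny NDFA over {a, b} that accepts strings ending in "ab".
// Tracking the set of possible current states is exactly the idea
// behind the subset construction that produces an equivalent DFA.
public class SimpleNFA {
    // delta: state x symbol -> set of possible next states (invented example)
    static final Map<Integer, Map<Character, Set<Integer>>> DELTA = Map.of(
        0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)),
        1, Map.of('b', Set.of(2)));
    static final Set<Integer> FINAL = Set.of(2);

    static boolean accepts(String input) {
        Set<Integer> current = Set.of(0);               // start in the start state only
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {                     // follow every possible transition
                Set<Integer> targets = DELTA.getOrDefault(s, Map.of()).get(c);
                if (targets != null) next.addAll(targets);
            }
            current = next;
        }
        for (int s : current) if (FINAL.contains(s)) return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("aab"));  // true
        System.out.println(accepts("aba"));  // false
    }
}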
Requirements for Generating
Complex Language
• State machines do generate sentences
• Not all the sentences they generate are
valid
• Grammatical doesn’t imply sensible (“She
writes golf”)
• There are dependencies between words of
a sentence. In a program “[“ precedes “]”
• There are ordering requirements a finite state machine cannot enforce, e.g. a[i+3)] is invalid
Requirements for Generating
Complex Language
• DFAs can only generate the class of
regular languages
• Programming languages belong to the
class of context-free languages
• We will need to work with context-free
grammars
Chomsky Hierarchy
• Noam Chomsky (1950s) described 4 classes of grammars
1) Type 0 – unrestricted grammars
2) Type 1 – Context sensitive grammars
3) Type 2 – Context free grammars
4) Type 3 – Regular grammars
Pushdown Automaton (Informal)
• Our generator needs more power – we
need a finite state machine equipped with
a stack
• The pushdown automaton (PDA) can use
the top of the stack to decide which
transition to take
• The pushdown automaton can manipulate
the stack as part of performing a transition
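As a concrete illustration of the stack idea (a hedged sketch, not ANTLR output), here is a recognizer for balanced brackets, something no finite state machine can do because nesting depth is unbounded:

import java.util.ArrayDeque;
import java.util.Deque;

// The essence of a pushdown automaton: a state machine plus a stack.
// Opening brackets are pushed; closing brackets must match the top of stack.
public class BracketChecker {
    static boolean balanced(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : s.toCharArray()) {
            if (c == '(' || c == '[') {
                stack.push(c);                       // transition that pushes
            } else if (c == ')' || c == ']') {
                if (stack.isEmpty()) return false;   // nothing to match
                char open = stack.pop();             // transition that pops
                if ((c == ')' && open != '(') || (c == ']' && open != '[')) return false;
            }
        }
        return stack.isEmpty();                      // everything opened was closed
    }

    public static void main(String[] args) {
        System.out.println(balanced("a[i+3]"));   // true
        System.out.println(balanced("a[i+3)]"));  // false, as on the earlier slide
    }
}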
Tree Structure of Sentences
• A book is not simply a sequence of words
• It has an organization: chapters, sections,
paragraphs, sentences, phrases, words
• “Nested and Ordered” – Tree structured
• Even sentences have structure
• Trees work perfectly to describe sentence
structure
The Structure of X = 0;
statement
├── assignment
│   ├── X
│   ├── =
│   └── expr
│       └── 0
└── ;
Generating X = 0;
void statement() {
    assignment();
    System.out.println(";");
}
void assignment() {
    System.out.println("x");
    System.out.println("=");
    expr();
}
void expr() {
    System.out.println("0");
}
Generating Language
• A finite state machine is like a stream of
instructions within a single method, unable
to make method calls
• A pushdown automaton can freely invoke
other parts of the machine to recognize
more parts of the language
• This provides for nesting and sequencing
PDA Representation
• Pushdown automata can be represented
with syntax diagrams
[Syntax diagram for expr: an expr is an INT, an ID, '(' expr ')', or '[' expr ']'.]
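The same diagram maps directly onto a recursive method, which is how the pushdown automaton's stack shows up in practice as the call stack. A rough, self-contained Java sketch (the token handling is invented for illustration; ANTLR would generate its own equivalents):

// Hand-written recognizer for the expr diagram above:
//   expr : INT | ID | '(' expr ')' | '[' expr ']'
public class ExprRecognizer {
    private final String[] tokens;   // pre-split token stream, e.g. {"[", "x", "]"}
    private int pos = 0;

    ExprRecognizer(String[] tokens) { this.tokens = tokens; }

    private String lookahead() { return pos < tokens.length ? tokens[pos] : "<EOF>"; }

    private void match(String expected) {
        if (!lookahead().equals(expected))
            throw new RuntimeException("expected " + expected + " but found " + lookahead());
        pos++;
    }

    void expr() {
        String t = lookahead();
        if (t.equals("(")) {                     // '(' expr ')'
            match("("); expr(); match(")");
        } else if (t.equals("[")) {              // '[' expr ']'
            match("["); expr(); match("]");
        } else if (t.matches("\\d+")) {          // INT
            pos++;
        } else if (t.matches("[a-zA-Z]\\w*")) {  // ID
            pos++;
        } else {
            throw new RuntimeException("no viable alternative at " + t);
        }
    }

    public static void main(String[] args) {
        new ExprRecognizer(new String[]{"[", "(", "x", ")", "]"}).expr();  // accepted
        System.out.println("ok");
    }
}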
Ambiguity
• “Bush appeals to Democrats”
• Grammars can be ambiguous. Consider
the phrase 2 + 3 * 4
• Ambiguity is not tolerated in a
programming language
• Within a syntax diagram, an ambiguous
sentence is one the diagram can generate
by following two distinct paths
Ambiguity
• A syntax diagram for C can generate the statement i*j; in two ways
• i*j; as a multiplication of i and j
• i*j; as a declaration of j as a pointer to type i
• Languages are usually described formally
with syntax diagrams and informally with
descriptive language that disambiguates
the sentences
Vocabulary Symbols
• Symbols have structure too
• Sentences are sequences of characters
but your brain groups them into units
“Now is the time …”
• Your brain constructs structures from the
sequence of symbols
• The language processors we build will be
separated into two parts: lexers and
parsers
Characters, Tokens, ASTs
[Diagram: the character stream "... W I D T H = 2 0 0 ; \n ..." (CharStream) is grouped into the token stream "... ID WS = WS INT ; WS ..." (Token), which is structured into an AST with '=' at the root and ID, INT as its children (CommonTree).]
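As a rough sketch of the lexer half of that pipeline (hand-written Java for illustration, not ANTLR-generated code), the following groups the characters of "WIDTH = 200;\n" into the token types shown in the diagram:

import java.util.ArrayList;
import java.util.List;

// Groups raw characters into ID, WS, '=', INT, and ';' tokens,
// mirroring the CharStream -> Token step in the diagram above.
public class TinyLexer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {                       // ID
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                tokens.add("ID");
            } else if (Character.isDigit(c)) {                 // INT
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add("INT");
            } else if (Character.isWhitespace(c)) {            // WS
                while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
                tokens.add("WS");
            } else {                                           // '=', ';', ...
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("WIDTH = 200;\n"));
        // prints [ID, WS, =, WS, INT, ;, WS]
    }
}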
Language Generation
• Your brain generates sentences by starting
from a main idea – the root of a tree
structure
• The brain creates phrases and sub-phrases until it reaches the leaves of the tree
• Sentence recognition occurs in reverse: scanning through the words, you must reconstruct the structure
Reading a Data File
BufferedReader f = new BufferedReader(new FileReader("random.txt"));
int c = f.read();                // get first char
StringBuffer s = new StringBuffer();
while (c != -1) {                // -1 signals end of file
    s.append((char) c);
    c = f.read();
}
Reading a Structured Data File
BufferedReader f = new BufferedReader(new FileReader("letsdigs.txt"));
int c = f.read();                // get 1st char
StringBuffer lets = new StringBuffer();
StringBuffer digs = new StringBuffer();
while (Character.isLetter((char) c)) {   // collect the leading letters
    lets.append((char) c);
    c = f.read();
}
while (Character.isDigit((char) c)) {    // then collect the digits
    digs.append((char) c);
    c = f.read();
}
Reading and Recognizing With
ANTLR
// Grammar to read the file
file : .+ ;              // consume to EOF

// Grammar to recognize the file
file    : LETTERS DIGITS ;
LETTERS : 'a'..'z'* ;
DIGITS  : '0'..'9'* ;
ANTLR Generates Java Code
void file() {
    LETTERS();
    DIGITS();
}
void LETTERS() {
    while (Character.isLetter((char) c)) {
        c = f.read();
    }
}
void DIGITS() {
    while (Character.isDigit((char) c)) {
        c = f.read();
    }
}
ANTLR Constraints
• The code in the last slide is similar to code
ANTLR would generate
• ANTLR can target other languages
besides Java
• ANTLR can’t generate code for all
grammars
• ANTLR generates an LL recognizer.
Technically, LL(*)
LL Recognizers
• LL means the recognizer reads the input from Left to right, tracing out a Leftmost derivation
• For more information on derivations and grammars, listen to the grammar lecture
• LL recognizers use one method per
grammar rule. The method may be
recursive
LL Recognizers
• LL recognizers restrict the form of
acceptable grammars
• Rules cannot be left-recursive
expr : expr '++' ;
void expr() {
    expr();          // infinite recursion: expr() calls itself before consuming any input
    match("++");
}
LL Recognizers
• Right-recursion is not a problem
expr : '++' expr ;
void expr() {
    match("++");
    expr();
}
• Execution terminates because a token is consumed before each recursive call; the recursion stops as soon as the next token isn’t “++”
LL Recognizers
• Another restriction occurs when the lookahead is too weak to handle a given grammar
• Lookahead is the number of tokens ahead
in the input stream that the parser can
see.
• This is similar to the number of symbols on the maze floor you can see before deciding which turn to take
Lookahead
• Consider this rule:
decl : 'int' ID '=' INT ';'
     | 'int' ID ';' ;
• Here are two examples:
int x = 32;
int y;
• We need to see three tokens ahead in order to
distinguish which form of the rule to apply. We
need an LL(3) grammar.
Lookahead
• Given a language, there are infinitely
many grammars that describe it
• Often we can alter a given grammar so
that ANTLR has an easier time generating
the recognizer
• An LL(3) grammar might be converted to
LL(1) by refactoring
decl : 'int' ID ('=' INT)? ';' ;
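With the optional subrule, a single token of lookahead after the ID decides whether an initializer follows. A rough sketch in the style of the earlier pseudo-Java fragments; match, lookahead, ID, and INT are hypothetical helpers here, not ANTLR's actual generated API:

// Sketch of a recognizer for:  decl : 'int' ID ('=' INT)? ';'
void decl() {
    match("int");
    match(ID);                        // ID is a hypothetical token-type constant
    if (lookahead().equals("=")) {    // one token decides the ('=' INT)? subrule
        match("=");
        match(INT);
    }
    match(";");
}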
Lookahead
• Increasing the lookahead depth increases the
power of the recognizer
• Even so, some grammars are not LL(k) for any
value of k
decl     : modifier* 'int' ID '=' INT ';'
         | modifier* 'int' ID ';' ;
modifier : 'static' | 'register' ;
Lookahead
• ANTLR can implement LL(*) to solve the
previous problem
• LL(*) uses arbitrary lookahead to get past
the problem in the grammar
• If LL(*) won’t work, you can tell ANTLR to
try the rule alternatives in a specified order
• This is like trying alternatives in the maze
until you hit on the escape route
ANTLR uses Memoization
• Memoization is an optimization technique
that trades space for speed
• Backtracking over several rule alternatives means repeatedly invoking certain sub-rules
• The results of those sub-rule invocations can be memoized so they don’t have to be recalculated
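A hedged sketch of the general idea (not ANTLR's actual implementation): cache each rule's outcome per input position, so backtracking never re-parses the same span:

import java.util.HashMap;
import java.util.Map;

// Memoizes whether rule "decl" succeeds at a given token index, so
// backtracking over alternatives never re-invokes the same sub-rule
// at the same position.
public class MemoExample {
    private final Map<Integer, Boolean> declMemo = new HashMap<>();

    boolean declAt(int tokenIndex) {
        Boolean cached = declMemo.get(tokenIndex);
        if (cached != null) return cached;           // reuse the earlier result
        boolean result = tryParseDecl(tokenIndex);   // expensive speculative parse
        declMemo.put(tokenIndex, result);
        return result;
    }

    // Stand-in for the real speculative parse of a declaration.
    private boolean tryParseDecl(int tokenIndex) {
        return tokenIndex % 2 == 0;  // dummy logic, purely for illustration
    }

    public static void main(String[] args) {
        MemoExample m = new MemoExample();
        System.out.println(m.declAt(4));  // computed
        System.out.println(m.declAt(4));  // served from the memo table
    }
}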
Grammar Limitations
• Not every language can be described
using only grammar rules
• Most programming languages are
supplemented with English language
descriptions for constraints that are too
hard to capture with grammar rules
• For example, x = 3; only makes sense if x
has been declared previously
Grammar Limitations
• In C++, T(i) is either a function call or a constructor-style typecast, (T) i
• If T is defined as a class, then T(i) is a typecast. We say the phrase is context-sensitive
• A context-sensitive rule has a more complicated left side:
xTy → xby
Grammar Limitations
• We are using context-free grammars to
describe programming languages
• The rules are simpler than context-sensitive rules
• Every left side is a single non-terminal symbol
• T → xT | y
Grammar Limitations
• To handle such context-sensitive decisions, ANTLR provides a feature called “semantic predicates”
expr : {la(1) is function}? functionCall
     | {la(1) is type}?     ctorTypecast
     ;
Syntactic Predicates
• Syntactic predicates provide a way to prioritize rule alternatives
stat : (declaration)=> declaration
     | expression
     ;
• If the current statement looks like a declaration, it is one, even if it also matches the second alternative
• With LL(*), semantic predicates, and syntactic predicates, ANTLR can build recognizers for most languages