Lexical Analysis and Scanning
Honors Compilers
Feb 5th 2001
Robert Dewar
The Input

- Read string input
  - Might be a sequence of characters (Unix)
  - Might be a sequence of lines (VMS)
- Character set
  - ASCII
  - ISO Latin-1
  - ISO 10646 (16-bit = Unicode)
  - Others (EBCDIC, JIS, etc.)
The Output

- A series of tokens
  - Punctuation: ( ) ; , [ ]
  - Operators: + - ** :=
  - Keywords: begin end if
  - Identifiers: Square_Root
  - String literals: "hello this is a string"
  - Character literals: 'x'
  - Numeric literals: 123 4_5.23e+2 16#ac#
Free Form vs Fixed Form

- Free form languages
  - White space does not matter
  - Tabs, spaces, new lines, carriage returns
  - Only the ordering of tokens is important
- Fixed format languages
  - Layout is critical
  - Fortran: label in cols 1-6
  - COBOL: area A, area B
  - Lexical analyzer must worry about layout
Punctuation

- Typically individual special characters
  - Such as ; ( )
  - Lexical analyzer does not know : from := until it sees the next character
- Sometimes double characters
  - E.g. (* treated as a kind of bracket
- Returned just as identity of token
  - And perhaps location
  - For error message and debugging purposes
Operators

- Like punctuation
  - No real difference for lexical analyzer
  - Typically single or double special chars
  - Operators: + - ** :=
- Returned just as identity of token
  - And perhaps location
Keywords

- Reserved identifiers
  - E.g. BEGIN END in Pascal, if in C
- Maybe distinguished from identifiers
  - E.g. the stropped keyword mode vs the plain identifier mode in Algol-68
- Returned with just token identity
  - And possible location information
- Unreserved keywords (e.g. PL/1)
  - Handled as identifiers (parser distinguishes)
Identifiers

- Rules differ
  - Length, allowed characters, separators
- Need to build a table
  - So that every occurrence of junk1 is recognized as the same junk1
  - Typical structure: hash table
- Lexical analyzer returns token type
  - And key to table entry
  - Table entry includes location information
More on Identifier Tables

- Most common structure is a hash table
  - With fixed number of headers
  - Chain according to hash code
  - Serial search on one chain
  - Hash code computed from characters
  - No hash code is perfect!
  - Avoid any arbitrary limits
String Literals

- Text must be stored
  - Actual characters are important
  - Not like identifiers
  - Character set issues
- Table needed
  - Lexical analyzer returns key to table
  - May or may not be worth hashing
Character Literals

- Similar issues to string literals
- Lexical analyzer returns
  - Token type
  - Identity of character
- Note: cannot assume character set of host machine, may be different
Numeric Literals

- Also need a table
  - Typically record the value
  - E.g. 123 = 0123 = 01_23 (Ada)
  - But cannot use int for values
  - Because the target may have different characteristics
- Float stuff much more complex
  - Denormals, correct rounding
  - Very delicate stuff
Handling Comments

- Comments have no effect on the program
  - Can therefore be eliminated by the scanner
  - But may need to be retrieved by tools
- Error detection issues
  - E.g. unclosed comments
- Scanner does not return comments
Case Equivalence

- Some languages have case equivalence
  - Pascal, Ada
- Some do not
  - C, Java
- Lexical analyzer ignores case if needed
  - This_Routine = THIS_RouTine
- Error analysis may need exact casing
Issues to Address

- Speed
  - Lexical analysis can take a lot of time
  - Minimize processing per character
  - I/O is also an issue (read large blocks)
- We compile frequently
  - Compilation time is important
  - Especially during development
General Approach

- Define set of token codes
  - An enumeration type
  - A series of integer definitions
  - These are just codes (no semantics)
- Some codes associated with data
  - E.g. key for identifier table
- May be useful to build tree node
  - For identifiers, literals, etc.
Interface to Lexical Analyzer

- Convert entire file to a file of tokens
  - Lexical analyzer is a separate phase
- Or: parser calls lexical analyzer
  - Get next token
  - This approach avoids extra I/O
  - Parser builds tree as we go along
Implementation of Scanner

- Given the input text
  - Generate the required tokens
  - Or provide token by token on demand
- Before we describe implementations
  - We take this short break
  - To describe relevant formalisms
Relevant Formalisms

- Type 3 (Regular) Grammars
- Regular Expressions
- Finite State Machines
Regular Grammars

- Regular grammars
  - Non-terminals (arbitrary names)
  - Terminals (characters)
  - Two forms of rules
    - Non-terminal ::= terminal
    - Non-terminal ::= terminal Non-terminal
  - One non-terminal is the start symbol
- Regular (type 3) grammars cannot count
  - No concept of matching nested parens
Regular Grammars

- E.g. grammar of reals with no exponent
  - REAL    ::= 0 REAL1     (repeat for 1 .. 9)
  - REAL1   ::= 0 REAL1     (repeat for 1 .. 9)
  - REAL1   ::= . INTEGER
  - INTEGER ::= 0 INTEGER   (repeat for 1 .. 9)
  - INTEGER ::= 0           (repeat for 1 .. 9)
  - Start symbol is REAL
Regular Expressions

- Regular expressions (REs) defined by
  - Any terminal character is an RE
  - Alternation: RE1 | RE2
  - Concatenation: RE1 RE2
  - Repetition: RE* (zero or more REs)
- Language of REs = type 3 grammars
  - Regular expressions are more convenient
Specifying REs in Unix Tools

- Single characters: a b c d \x
- Alternation: [bcd] [b-z] ab|cd
- Match any character: .
- Match sequences of characters: x* y+
- Concatenation: abc[d-q]
- Optional: [0-9]+(\.[0-9]*)?

Finite State Machines

Languages and Automata

- A language is a set of strings
- An automaton is a machine
  - That determines if a given string is in the language or not
- FSMs are automata that recognize regular languages (regular expressions)
Definitions of FSM

- A set of labeled states
- Directed arcs labeled with characters
- A state may be marked as terminal
- Transition from state S1 to S2
  - If and only if there is an arc from S1 to S2
  - Labeled with the next character (which is eaten)
- String recognized if we end up in a terminal state
- One state is the distinguished start state
Building FSM from Grammar

- One state for each non-terminal
- A rule of the form
  - Nont1 ::= terminal
  - Generates a transition from Nont1's state to the final state
- A rule of the form
  - Nont1 ::= terminal Nont2
  - Generates a transition from Nont1's state to Nont2's state
Building FSMs from REs

- Every RE corresponds to a grammar
- For all regular expressions
  - A natural translation to FSM exists
  - We will not give details of the algorithm here
Non-Deterministic FSM

- A non-deterministic FSM
  - Has at least one state
  - With two arcs to two separate states
  - Labeled with the same character
- Which way to go?
  - Implementation requires backtracking
  - Nasty!
Deterministic FSM

- For all states S
  - For all characters C
  - There is either ONE or NO arc
  - From state S labeled with character C
- Much easier to implement
  - No backtracking!
Dealing with ND FSM

- Construction naturally leads to ND FSM
- For example, consider the FSM for
  - [0-9]+ | [0-9]+\.[0-9]+   (integer or real)
- We will naturally get a start state
  - With two sets of 0-9 branches
  - And thus non-deterministic
Converting to Deterministic

- There is an algorithm for converting
  - From any ND FSM
  - To an equivalent deterministic FSM
- Algorithm is in the text book
- Example (given in terms of REs)
  - [0-9]+ | [0-9]+\.[0-9]+
  - becomes [0-9]+(\.[0-9]+)?
Implementing the Scanner

- Three methods
  - Completely informal: just write code
  - Define tokens using regular expressions
    - Convert REs to ND finite state machine
    - Convert ND FSM to deterministic FSM
    - Program the FSM
  - Use an automated program
    - To achieve the above three steps
Ad Hoc Code (forget FSMs)

- Write normal hand code
  - A procedure called Scan
  - Normal coding techniques
- Basically scan over white space and comments till a non-blank character is found
- Base subsequent processing on that character
  - E.g. colon may be : or :=
  - / may be an operator or the start of a comment
- Return token found
- Write aggressive, efficient code
Using FSM Formalisms

- Start with regular grammar or RE
  - Typically found in the language standard
- For example, for Ada:
  - Chapter 2, Lexical Elements
  - digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  - decimal-literal ::= integer [.integer] [exponent]
  - integer ::= digit {[underline] digit}
  - exponent ::= E [+] integer | E - integer
Using FSM Formalisms, cont

- Given REs or grammar
  - Convert to a finite state machine
  - Convert ND FSM to deterministic FSM
- Write a program to recognize tokens
  - Using the deterministic FSM
Implementing FSM (Method 1)

- Each state is code of the form:

   <<state1>>
      case Next_Character is
         when 'a' => goto state3;
         when 'b' => goto state1;
         when others =>
            End_of_token_processing;
      end case;
   <<state2>>
      ...
Implementing FSM (Method 2)

- There is a variable called State

   loop
      case State is
         when state1 =>
            case Next_Character is
               when 'a' => State := state3;
               when 'b' => State := state1;
               when others => End_token_processing;
            end case;
         when state2 => ...
         ...
      end case;
   end loop;
Implementing FSM (Method 3)

   T : array (State, Character) of State;

   while More_Input loop
      Curstate := T (Curstate, Next_Char);
      if Curstate = Error_State then ...
   end loop;
Automatic FSM Generation

- Our example: FLEX
  - See home page for manual in HTML
- FLEX is given
  - A set of regular expressions
  - Actions associated with each RE
- It builds a scanner
  - Which matches REs and executes actions
Flex General Format

- Input to Flex is a set of rules:

   Regexp   actions (C statements)
   Regexp   actions (C statements)
   ...

- Flex scans for the longest matching Regexp
  - And executes the corresponding actions
An Example of a Flex Scanner

   DIGIT    [0-9]
   ID       [a-z][a-z0-9]*
   %%
   {DIGIT}+ {
       printf ("an integer %s (%d)\n",
               yytext, atoi (yytext));
   }
   {DIGIT}+"."{DIGIT}* {
       printf ("a float %s (%g)\n",
               yytext, atof (yytext));
   }
   if|then|begin|end|procedure|function {
       printf ("a keyword: %s\n", yytext);
   }
Flex Example (continued)

   {ID}            printf ("an identifier %s\n", yytext);
   "+"|"-"|"*"|"/" {
       printf ("an operator %s\n", yytext); }
   "--".*\n        /* eat Ada style comment */
   [ \t\n]+        /* eat white space */
   .               printf ("unrecognized character");
   %%
Assembling the Flex Program

   %{
   #include <math.h>  /* for atof */
   %}
   <<flex text we gave goes here>>
   %%
   main (argc, argv)
   int argc;
   char **argv;
   {
       yyin = fopen (argv[1], "r");
       yylex ();
   }
Running Flex

- flex is a program that is executed
  - The input is as we have given
  - The output is a C program implementing the scanner
- For Ada fans
  - Look at aflex (www.adapower.com)
- For C++ fans
  - flex can run in C++ mode
  - Generates appropriate classes
Choice Between Methods?

- Hand written scanners
  - Typically much faster execution
  - And pretty easy to write
  - And easier for good error recovery
- Flex approach
  - Simple to use
  - Easy to modify token language
The GNAT Scanner

- Hand written (scn.adb/scn.ads)
- Basically a call does
  - Super quick scan past blanks/comments etc.
  - Big case statement
  - Process based on first character
- Calls special routines
  - Namet.Get_Name for identifiers (hashing)
  - Keywords recognized by special hash
  - Strings (stringt.ads)
  - Integers (uintp.ads)
  - Reals (ureal.ads)
More on the GNAT Scanner

- Entire source read into memory
  - Single contiguous block
  - Source location is an index into this block
  - Different index range for each source file
  - See sinput.adb/ads for source management
- See scans.ads for definitions of tokens
More on GNAT Scanner

- Read the scn.adb code
  - Very easy reading
ASSIGNMENT TWO

- Write a flex or aflex program
  - Recognize tokens of an Algol-68s program
  - Print out tokens in the style of the flex example
- Extra credit
  - Build hash table for identifiers
  - Output hash table key
Preprocessors

- Some languages allow preprocessing
  - This is a separate step
  - Input is source
  - Output is expanded source
- Can either be done as a separate phase
  - Or embedded into the lexical analyzer
  - Often done as a separate phase
- Need to keep track of source locations
Nasty Glitches

- Separation of tokens
  - Not all languages have clear rules
  - FORTRAN has optional spaces

   DO10I=1.6

      DO10I        =         1.6
      identifier   operator  literal

   DO10I=1,6

      DO       10      I        =         1        ,      6
      keyword  stmt    loopvar  operator  literal  punc   literal

- Modern languages avoid this kind of thing!