Recursive Descent Parsing for
XML Developers
Roger L. Costello
15 October 2014
1
Table of Contents
• Introduction to parsing in general, recursive descent parsing in
particular
• Example #1: How to do recursive descent parsing on Book data
• Example #2: How to do recursive descent parsing for a grammar that
contains alternatives
• Limitations of recursive descent parsing
2
Flat XML Document
You might receive an XML document that has no structure. For
example, this XML document contains a flat (linear) list of Book data:
<input>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</input>
2
Give it structure to facilitate processing
<input>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</input>
<Books>
<Book>
<Title>Parsing Techniques</Title>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
</Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
</Book>
<Book>
<Title>Introduction to Graph Theory</Title>
<Authors>
<Author>Richard J. Trudeau</Author>
</Authors>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
<Book>
<Title>Introduction to Formal Languages</Title>
<Authors>
<Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
</Books>
3
That’s parsing!
Parsing is taking a flat (linear) sequence of items and adding structure
so that the result conforms to a grammar.
4
Parsing
<input>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</input>
parse
<Books>
<Book>
<Title>Parsing Techniques</Title>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
</Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
</Book>
<Book>
<Title>Introduction to Graph Theory</Title>
<Authors>
<Author>Richard J. Trudeau</Author>
</Authors>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
<Book>
<Title>Introduction to Formal Languages</Title>
<Authors>
<Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
</Books>
5
From the book: “Parsing Techniques”
• Parsing is the process of structuring a linear representation in
accordance with a given grammar.
• The “linear representation” may be:
•
•
•
•
•
•
•
a flat sequence of XML elements
a sentence
a computer program
a knitting pattern
a sequence of geological strata
a piece of music
actions of ritual behavior
7
Grammar
• A grammar is a succinct description of the structure.
• Here is a grammar for Books:
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
7
Parsing
Linear representation
<input>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</input>
Grammar
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Structured representation
parser
<Books>
<Book>
<Title>Parsing Techniques</Title>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
</Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
</Book>
<Book>
<Title>Introduction to Graph Theory</Title>
<Authors>
<Author>Richard J. Trudeau</Author>
</Authors>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
<Book>
<Title>Introduction to Formal Languages</Title>
<Authors>
<Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
</Books>
8
Alternate view of the parser output
Grammar
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Linear representation
<input>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</input>
parser
Books
Book
Title
Authors
Parsing Techniques
Date
2007
Author Author
Dick Grune
Ceriel J.H. Jacobs
Parse tree
Book
Book
ISBN
Publisher
978-0-387-20248-8
Springer
Authors
Title
Introduction to
Graph Theory
Date
1993
Author
Richard J. Trudeau
ISBN
Publisher
Title
Authors
0-486-67870-9 Dover Publications Introduction to
Formal Languages
Date
2012
Author
Gyorgy E. Revesz
ISBN
Publisher
0-486-66697-2 Dover Publications
8
Parsing Techniques
• Over the last 50 years many parsing techniques have been created.
• Some parsing techniques work from the starting grammar rule to the
bottom. Those are called top-down parsing techniques.
• Other parsing techniques work from the bottom grammar rules to the
starting grammar rule. Those are called bottom-up parsing
techniques.
• The following slides explain the “recursive descent parsing
technique.” It is a top-down parsing technique.
9
Terminology: Token
• A token is an atomic (indivisible) unit.
• Each item in the input is a token.
• After parsing the tokens will be leaf nodes.
12
The input consists of a sequence of tokens
Each of these are tokens.
This input consists of 16
tokens.
<input>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</input>
13
After parsing the tokens will be leaf nodes
<Books>
<Book>
<Title>Parsing Techniques</Title>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
</Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
</Book>
<Book>
<Title>Introduction to Graph Theory</Title>
<Authors>
<Author>Richard J. Trudeau</Author>
</Authors>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
<Book>
<Title>Introduction to Formal Languages</Title>
<Authors>
<Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
</Books>
tokens (terminal symbols)
14
Another view of the tokens, after parsing
Books
Book
Title
Authors
Parsing Techniques
Date
2007
Author Author
Dick Grune
Ceriel J.H. Jacobs
Book
Book
ISBN
Publisher
978-0-387-20248-8
Springer
Authors
Title
Introduction to
Graph Theory
Date
1993
Author
Richard J. Trudeau
ISBN
Publisher
Title
Authors
0-486-67870-9 Dover Publications Introduction to
Formal Languages
Date
2012
ISBN
Publisher
0-486-66697-2 Dover Publications
Author
Gyorgy E. Revesz
15
Parsing structures the input by wrapping the
tokens in non-terminal symbols
non-terminal symbols
<Books>
<Book>
<Title>Parsing Techniques</Title>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
</Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
</Book>
<Book>
<Title>Introduction to Graph Theory</Title>
<Authors>
<Author>Richard J. Trudeau</Author>
</Authors>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
<Book>
<Title>Introduction to Formal Languages</Title>
<Authors>
<Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</Book>
</Books>
16
Recursive descent parsing
Recursive descent parsing works like this:
• Start at the grammar’s start symbol and output it. In our grammar, the start
symbol is <Books>, so output it.
• Progress through each grammar rule. For a non-terminal symbol, output it.
For a terminal symbol (i.e., token), check the token in the input stream for
match with the terminal symbol; if it matches, output it.
17
Initial
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
Start with the grammar’s start symbol and the first token in the input stream.
7
Output the start symbol
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
</Books>
19
Grammar says there must be at least one
Book
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
So the input stream must contain all the tokens for at least one Book.
Let’s process the grammar rule for Book.
20
Output <Book>
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Book>
</Books>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
21
Grammar says the token in the input
stream must be Title
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Title>Parsing Techniques</Title>
<Book>
</Books>
Yea, the input token
matches the
grammar rule
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
22
Grammar: after Title must be Authors
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
So the input stream must contain Author tokens. Let’s process the rule
for Authors.
23
Output <Authors>
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Authors>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
<Authors>
<Book>
</Books>
24
Grammar says the next token in the input
Yea, the input token
matches the
stream must be an Author token
grammar rule
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Authors>
<Author>Dick Grune</Author>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
<Authors>
<Book>
</Books>
25
Grammar says the next token in the input
Another Author
stream may be an Author token
match
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Authors>
<Book>
</Books>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
26
The next token in the input stream is not
an Author token
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
So, return to the caller (i.e., return to the Book rule).
27
Grammar says the input stream token
must be a Date token
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Authors>
<Date>2007</Date>
<Book>
</Books>
Yea, the input token
matches the
grammar rule
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
28
Grammar says the input stream token
must be an ISBN token
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Book>
</Books>
Yea, the input token
matches the
grammar rule
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
29
Grammar says the input stream token
must be a Publisher token
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Book>
</Books>
Yea, the input token
matches the
grammar rule
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
30
We’ve completed structuring the first 6
input tokens
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
Output:
<Books>
<Book>
<Authors>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Authors>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Book>
</Books>
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
31
Completed the Book rule
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
We’ve finished processing the Book rule, so return to the caller (i.e., the Books rule).
32
Begin work on structuring the next Book
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
33
Implementation
The following slides show, in a step-by-step manner, how to implement
a recursive descent parser
34
Step 1
Create a function for each non-terminal symbol in the grammar:
Functions
Books() {
…
}
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Book() {
…
}
Authors() {
…
}
35
Step 2
Create a global element, Token, that is used to identify the current
position in the input stream. Initialize Token to 0:
Token = 0
36
Step 3
Create a function, get_next_token(). When it is called, it increments the
current position in the input stream:
get_next_token() {
Token = Token + 1
}
37
Step 4
Create a function, token(), and pass it a name, tk. The purpose of this
function is to answer the question: “Does the token at the current
position in the input stream match tk?”
38
Example of using the token() function
Suppose that during recursive descent parsing the grammar indicates
that the next token in the input stream must be “Title.” Suppose the
global variable, Token, indicates that we are here in the input stream:
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
39
Example (cont.)
The token() function determines that there is a match, so it calls
get_next_token() to increment the position in the input stream and
returns the token:
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author> Ceriel J.H. Jacobs</Author>
return
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>
40
The token() function
token(string tk) {
if (tk != input[position() = Token]) then return ()
else {
get_next_token()
return input[position() = Token])
}
}
Notice that token() returns empty if there is not a match.
41
Motivation for Step 5
Suppose that during recursive descent parsing we are in the Book()
function. The Book() function first checks—by calling the token()
function—to see if the current position of the input stream contains
“Title.” Suppose it does. Then, according to the grammar, there must
be Authors, Date, ISBN, and then Publisher:
Book → Title Authors Date ISBN Publisher
42
Step 5
Create a function, require(), and pass it a token, found. If the token is
empty (i.e., the token() function returned empty because there was not
a match) then call the error() function. Otherwise, return the token.
require(element found) {
if empty(found) then error(‘Invalid input’)
else return found
}
43
Step 6
Create an error function, error(). Pass it a string. It outputs the string
and then halts the parser.
error(string s) {
output s
stop
}
44
The complete implementation
• Recursive descent has been around a long time and people have
developed beautiful code for it.
• The following two slides collects all the code from the previous slides.
I recommend spending some time studying it to appreciate its beauty.
45
Token = 0
main() {
get_next_token()
require(input())
}
Code for a Recursive Descent Parser
input() {
return require(Books())
}
Books() {
<Books>
return (require(Book()), optional_additional_Books())
</Books>
}
optional_additional_Books() {
book = Book()
if exists(book) then return (book, optional_additional_Books())
}
Book() {
title = token('Title')
if exists(title) then
<Book>
return (title, require(Authors(), require(token('Date')), require(token(‘ISBN')), require(token(‘Publisher'))
</Book>
}
Authors() {
<Authors>
return (require(Author()), optional_additional_Authors())
</Authors>
}
46
optional_additional_Authors() {
author = token(‘Author')
if exists(author) then return (author, optional_additional_Authors())
}
token(string tk) {
if (tk != input[position() = Token]) then return ()
else {
get_next_token()
return input[position() = Token])
}
}
require(element found) {
if empty(found) then error(‘Invalid input’)
else return found
}
get_next_token() {
Token = Token + 1
}
47
XSLT Implementation
• I created an XSLT implementation. I tried to mirror the beautiful code
shown on the previous slides.
• If you would like to give my implementation a go, here is the XSLT
program and a sample flat (linear) input XML document:
• http://www.xfront.com/parsing-techniques/recursive-descent-parser/booksparser.xsl
• http://www.xfront.com/parsing-techniques/recursive-descent-parser/bookstest.xml
48
Richer example
• The Books example shown on the previous slides was fine for
introducing recursive descent parsing.
• But it glossed over an important problem: grammar rules with
alternatives.
• The following example shows how to do recursive descent parsing
with a grammar that has alternatives.
49
Expressions
• Let’s parse a simple expression language that has these tokens:
IDENTIFIER, addition, parentheses, and EoF.
• Here are a few examples of expressions:
IDENTIFIER EoF
(IDENTIFIER) EoF
IDENTIFIER + IDENTIFIER EoF
(IDENTIFIER + IDENTIFIER) EoF
IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF
(IDENTIFIER + IDENTIFIER) + IDENTIFIER EoF
IDENTIFIER + (IDENTIFIER + (IDENTIFIER + IDENTIFIER)) EoF
Each expression ends with an end-of-file (EoF) token.
50
Expression grammar
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
51
Parse tree for: IDENTIFIER EoF
input
expression
EoF
term
rest_expression
IDENTIFIER
ε
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
52
Parser selects the first alternative
input
expression
EoF
term
rest_expression
IDENTIFIER
ε
term has two alternatives.
The parser selected the first
alternative.
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
53
Parse tree for: (IDENTIFIER) EoF
input
expression
EoF
term
rest_expression
parenthesized_expression
ε
(
expression
)
term
rest_expression
IDENTIFIER
ε
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
54
Parser selects the second alternative
input
expression
EoF
term
rest_expression
parenthesized_expression
ε
(
expression
term
IDENTIFIER
)
rest_expression
ε
term’s second
alternative is
selected
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε 55
Question
How does a recursive descent parser know that it should select the first
or second alternative?
term → IDENTIFIER | parenthesized_expression
How does the parser know which alternative to select?
56
Answer
• The parser doesn’t know.
• It tries the first alternative. If that fails it tries the second alternative
(i.e., the parser backtracks and tries the next alternative). It repeats
until it finds an alternative that succeeds.
57
Processing the first token in the input stream
input
1
expression
2
Input tokens:
(
IDENTIFIER
)
EoF
term
3
IDENTIFIER
Try the first alternative,
which says the input token
must be IDENTIFIER.
However, the input token is (
so we must back up and try
the next alternative
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
58
Implementation of the term() function
term() {
<term>
identifier = token('IDENTIFIER')
if exists(identifier) then
return (identifier)
else
return (require(parenthesized_expression()))
</term>
}
Check the current token in the input stream to see if it is IDENTIFIER.
59
term() function (cont.)
term() {
<term>
identifier = token('IDENTIFIER')
if exists(identifier) then
return (identifier)
else
return (require(parenthesized_expression()))
</term>
}
If there is a match, return the token.
60
term() function (cont.)
term() {
<term>
identifier = token('IDENTIFIER')
if exists(identifier) then
return (identifier)
else
return (require(parenthesized_expression()))
</term>
}
Otherwise try the other alternative, it must succeed.
61
Let’s represent each expression as XML
Instead of this input: IDENTIFIER EoF
our input will be this:
<input>
<IDENTIFIER />
<EoF />
</input>
62
XML representation (cont.)
Instead of this input: (IDENTIFIER) EoF
our input will be this:
<input>
<LP />
<IDENTIFIER />
<RP />
<EoF />
</input>
63
XML representation (cont.)
Instead of this input: IDENTIFIER + IDENTIFIER EoF
our input will be this:
<input>
<IDENTIFIER />
<PLUS />
<IDENTIFIER />
<EoF />
</input>
64
XML representation (cont.)
Instead of this input: IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF
our input will be this XML input:
<input>
<IDENTIFIER />
<PLUS />
<LP />
<IDENTIFIER />
<PLUS />
<IDENTIFIER />
<RP />
<EoF />
</input>
65
Parsing
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
<input>
<IDENTIFIER />
<EoF />
</input>
Parser
<output>
<expression>
<term>
<IDENTIFIER/>
</term>
<rest_expression/>
</expression>
<EoF/>
</output>
66
Parsing (cont.)
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
<input>
<LP />
<IDENTIFIER />
<RP />
<EoF />
</input>
Parser
<output>
<expression>
<term>
<parenthesized_expression>
<LP/>
<expression>
<term>
<IDENTIFIER/>
</term>
<rest_expression/>
</expression>
<RP/>
</parenthesized_expression>
</term>
<rest_expression/>
</expression>
<EoF/>
</output>
67
Parsing (cont.)
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
<input>
<IDENTIFIER />
<PLUS />
<IDENTIFIER />
<EoF />
</input>
Parser
<output>
<expression>
<term>
<IDENTIFIER/>
</term>
<rest_expression>
<PLUS/>
<expression>
<term>
<IDENTIFIER/>
</term>
<rest_expression/>
</expression>
</rest_expression>
</expression>
<EoF/>
</output>
68
Parsing (cont.)
input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε
<input>
<IDENTIFIER />
<PLUS />
<LP />
<IDENTIFIER />
<PLUS />
<IDENTIFIER />
<RP />
<EoF />
</input>
Parser
<output>
<expression>
<term>
<IDENTIFIER/>
</term>
<rest_expression>
<PLUS/>
<expression>
<term>
<parenthesized_expression>
<LP/>
<expression>
<term>
<IDENTIFIER/>
</term>
<rest_expression>
<PLUS/>
<expression>
<term>
<IDENTIFIER/>
</term>
<rest_expression/>
</expression>
</rest_expression>
</expression>
<RP/>
</parenthesized_expression>
</term>
<rest_expression/>
</expression>
</rest_expression>
</expression>
<EoF/>
</output>
69
XSLT Implementation
• I created an XSLT implementation of a recursive descent parser for the
expression language.
• If you would like to give my implementation a go, here is the XSLT
program and a sample flat (linear) input XML document:
• http://www.xfront.com/parsing-techniques/recursive-descentparser/expression-parser.xsl
• http://www.xfront.com/parsing-techniques/recursive-descentparser/expression-test.xml
70
Limitations of Recursive Descent Parsers
Recall that in a rule containing alternatives we tried the first
alternative, if it failed we backtracked and tried the second alternative.
Searching the alternatives is time-consuming.
71
Limitations (cont.)
Recursive descent parsers can’t handle left-recursive grammar rules.
The parser goes into an infinite loop.
Example: suppose the grammar has this rule:
expression → expression '-' term
That is a “left-recursive” rule: on the rule’s right-hand side it starts with
the same symbol as on the left-hand side (i.e., expression). The
recursive descent routine for this rule is:
expression() {
return expression() and require(token(‘-’)) and require(term)
}
(infinite) recursion!
72
Limitations (cont.)
Suppose we add an array element as a term:
term → IDENTIFIER | indexed_element | parenthesized_expression
indexed_element → IDENTIFIER '[' expression ']'
and create a recursive descent parser for the new grammar. The
routine for indexed_element will never be tried: when the
sequence IDENTIFIER '[' occurs in the input, the first alternative
of term will succeed, consume the identifier, and leave the indigestible
part '[' expression ']' in the input.
73
References – Great Books
Modern Compiler Design (http://www.amazon.com/ModernCompiler-Design-DickGrune/dp/1461446988/ref=sr_1_1?s=books&ie=UTF8&qid=141340845
8&sr=1-1&keywords=modern+compiler+design)
Parsing Techniques (http://www.amazon.com/Parsing-TechniquesPractical-MonographsComputer/dp/038720248X/ref=sr_1_1?s=books&ie=UTF8&qid=14134
08496&sr=1-1&keywords=parsing+techniques)
74
Descargar

Recursive Descent Parsing for XML Developers