The Spirit recursive descent parser compiler
Copyright Ó
1999, 2000, Joel de Guzman
InterXys. All rights reserved.
Introduction:
The Parser Proper:
The C++ Recursive descent parser compiler dynamically compiles Extended BNF (EBNF) production rules into a working parser. The parser acts on the character level and thus obviates the need for a separate lexical analyzer stage. Extra care and attention was given to keep the syntax from being cluttered. The syntax is totally separated from the semantics.
The complete specification of the parser follows the meta-language:
grammar ::= ( identifier '::=' expr ';' )*;
expr ::= expr1 ( ('|' | '%') expr1 )*;
expr1 ::= expr2 ( expr2 )*;
expr2 ::= expr3 { ':' identifier } { '*' |
'+' };
expr3 ::= '.'
| set
| string
| identifier
| group
| optional
| contiguous;
group ::= '(' expr ')';
optional ::= '{' expr '}';
contiguous ::= '<' expr '>';
set ::= { '~' } '[' setItems ']';
setItems ::= ( setChar {'-' setChar } )*;
setChar ::= ~[\0\]];
string ::= '\'' <stringChar>* '\'';
stringChar ::= ~[\0'];
identifier ::= <[a-zA-Z_]*[a-zA-Z_0-9]*>;
whiteSpace ::=
([ \t\n\r] | comment)*;
comment ::= ('//'~[\0\n\r]*) | ('/*' .* '*/');
Where:
[] Is a Regular Expression style set. Example: [a-z].
. Matches any character except the null.
x-y Matches all characters from x to y.
string A Pascal style string. Example: 'hello world'
identifier A case sensitive C/C++ style identifier. Example: foo_123.
: identifier Where 'identifier' specifies a semantic action. Named semantic actions can be attached to any level in the grammar. These actions are C/C++ hook functions that will be called immediately if a match is found.
* Kleene star operator. Matches the preceding expression zero or more times. For example:
The expression 'A'*
Matches “A”, “AAA” or “”.
+ Plus operator. Matches the preceding expression one or more times. For example:
The expression 'A'+
Matches “A”, “AAA” but not “”.
{} Optional construct. Matches the enclosed expression zero or one time. For example:
The expression 'A' {'B'}
Matches “AB” or “A”.
| (Leftmost) Alternative operator. Matches either of the operands. Ambiguous alternatives are short-circuit evaluated and the leftmost alternative that matches wins.
The expression 'A' | 'B'
Matches “A” or “B”.
% Longest-alternative operator. Similar to the ‘|’ operator. Matches either of the operands. Unlike the former, alternatives are NOT short-circuit evaluated and the longest alternative that matches wins.
() Grouping construct. Orders the evaluation of sub- expressions. For example:
The expression ('A' | 'B') 'C' evaluates ('A' | 'B') first.
Whereas the expression
'A'
| 'B' 'C' evaluates 'A' 'B' first due to the implicit precedence of
operators as defined in the grammar.
<> Contiguous construct. Inhibit white space (and comment) skipping. By default white spaces and comments are skipped. The construct can be used to inhibit white space and comment skipping. This can be useful in situations where we want to work on the character level instead of the phrase level.
For instance, in parsing numbers such as integers we need to enclose the expression in order to regard the expression as contiguous:
The expression <{'+' | '-'}[0-9]+>;
matches “123” or “456” but not “1 2 3”.
White spaces are not totally invisible as these are still parsed by the same phrase/character level parser. Semantic actions can be associated with white spaces for example to recognize pragmas.
The back-slash can be used as an escape character in sets and strings. The back-slash matches the character following the '\' unless it is one of:
'\n' new-line NL (LF)
'\t' horizontal tab HT
'\r' carriage return CR
'\v' vertical tab VT
'\f' form feed FF
'\a' alert BEL
'\b' backspace BS
'\oooo' octal number
'\xhhh' hex number
An EBNF source that defines a target language is parsed by the parser compiler and outputs a fully working parser for that language. The generated parser is a hierarchical structure composed of Match objects. Given a source written in the target language, the top-down descent traverses the hierarchy, checking each object for a match, backtracking and checking all possible alternatives. Object aggregation allows access to any of the steps in the recursive descent. In addition, named semantic actions may be attached to any level within the hierarchy. These actions are C/C++ functions that will be called if a match is found in a particular context within the grammar.
The parser is non-deterministic in nature and allows backtracking, back-referencing and is capable of infinite look-ahead LL(inf). It can parse rather ambiguous grammars such as:
expr ::= 'A' 'B' 'C' 'D' 'E' | 'A' 'B' 'C' 'D' 'F';
Left recursion is not suitable with recursive parsers in
general. The Spirit Parser is no exception. Left
recursion is an error and will be reported as such. Otherwise it would result
in an infinite loop when the offending rule is invoked. Example:
a ::= a ‘A’; // Error, left recursion
Ambiguous alternatives are short-circuit evaluated and the leftmost alternative that matches wins. If this is not desired, the longest-alternative operator ‘%’ may be used. In this case ambiguous alternatives are NOT short-circuit evaluated and the longest alternative that matches wins. Consider the parsing of integers and real numbers:
number ::= real % integer;
integer ::= <
{ '+' | '-' }[0-9]+ >;
real ::= < integer { '. ' integer } { (' e' | ' E' ) integer } >;
A number can be a ‘real’ or an ‘integer’. Given an input:
“12.36e-6”
This will match a ‘real’ since it is the alternative that matches a longer part of input string. While the ‘%’ operator is called the longest-alternative operator, the standard ‘|’ operator may be called leftmost-alternative operator.
The order of
rules within a grammar does not matter. Rules that reference other rules are
resolved just-in-time when needed. Redefinition of
rules are not allowed and will result in an error.
These seemingly simple classes when composed and structured in a hierarchy form a very powerful object-oriented recursive descent parsing engine. These classes provide the infrastructure needed for the construction of parsers.
The power of this technique lies in its ability to form rather complex hierarchical parser structures dynamically at runtime, unlike function-based parsers that require compilation to object code. The hierarchical composition of objects as opposed to the hierarchical composition of functions defines the top-down structure of the parser. Typical compiler-compilers or parser generators generate code, usually C/C++ that has to be compiled.
The Match class declaration (Private members are omitted for brevity):
class
Match : public Allocator {
public:
Match();
virtual ~Match();
virtual bool Parse(Scanner& str) const = 0;
};
As stated, the parser compiler generates a hierarchical structure composed of Match objects. This object hierarchy does the actual parsing. The base Match class has one pure virtual Parse member function that accepts a reference to a Scanner object. This function returns true if a match is found at the current scanner position and the scanner is advanced. Otherwise, the scanner is not touched if there is no match. A match is found if a portion of the string starting at the current scanner position matches a certain pattern or algorithm.
The flexibility of object embedding and composition combined with recursion opens up a unique approach to parsing. Subclasses are free to form aggregates and algorithms of arbitrary complexity. Complex parsers can be created with the composition of only a few simple classes.
Examples of Match Sub-classes:
Match subclass. Matches strings of the form ‘string’ |
|
Match subclass. Abstract. Has one child (subject). |
|
Match subclass. Abstract. Has two children (left and right). |
|
BinaryMatch subclass. Matches expressions of the form: A B |
|
BinaryMatch subclass. Matches expressions of the form: A | B |
|
BinaryMatch subclass. Matches expressions of the form: { A } |
And the corresponding class declarations (Again, private members are omitted for brevity):
class
StringMatch : public Match {
// Matches strings of the form ‘string’
public:
StringMatch(
Char8 const* str,
UInt len);
virtual ~StringMatch();
virtual bool Parse(Scanner& str) const;
Char8 const* String() const;
};
class UnaryMatch : public Match {
public:
UnaryMatch(Match const*
subject);
virtual ~UnaryMatch();
Match const* Subject() const;
};
class BinaryMatch : public Match {
public:
BinaryMatch(
Match const* left,
Match const* right);
virtual ~BinaryMatch();
Match const* Left() const;
Match const* Right() const;
};
class AndMatch : public BinaryMatch {
// Matches expressions of the form: A B
public:
AndMatch(
Match const* left,
Match const* right);
virtual bool Parse(Scanner& str) const;
};
class OrMatch : public BinaryMatch {
// Matches expressions of the form: A | B
public:
OrMatch(
Match const* left,
Match const* right);
virtual bool Parse(Scanner& str) const;
};
class ZeroOrOneMatch : public
UnaryMatch {
// Matches expressions of the form: { A }
public:
ZeroOrOneMatch(Match const*
subject);
virtual bool Parse(Scanner& str) const;
};
As a simple illustration, given these examples of Match classes, a production prod ::= {‘A’ | ‘B’} ‘C’; will generate a parser composed of a hierarchical structure made up of Match objects as basic building blocks. This hierarchical structure corresponds neatly to the parse tree (AST or Abstract Syntax Tree) of the given grammar:
prod ::= {‘A’ | ‘B’} ‘C’;
The AndMatch object at the top of the hierarchy has two children: an ZeroOrOneMatch object and StringMatch(‘C’) object. The ZeroOrOneMatch object has a child: OrMatch which in turn has two children: StringMatch(‘A’) and StringMatch(‘B’).
Invoking the Parse method of
the production named ‘prod’ will activate the parser. A more detailed
illustration of what actually happens when a generated parser is at work will
be discussed later. For now suffice it to say that this is now a fully working
parser which accepts a source code and simply returns true if the source conforms to the given grammar prod ::= {‘A’ | ‘B’} ‘C’; otherwise returns false in case of a
syntax error.
Parsing comments (and all things considered as white spaces including pragmas etc.) with a BNF parser is a bit tricky. The usual practice is to have a pre-processor or a separate lexical analyzer that suppresses comments to make them invisible to the parser.
The scanner presented here handles this task. The scanner is a simple yet integral part of the parser and works in conjunction with the parser and is not a separate program. It is not a full-blown lexical analyzer. It does not extract tokens such as reserved words and operators. Nor does it extract numbers and literal strings.
A production named 'whiteSpace' may be defined. This will be the rule used to skip comments and white spaces in the parse. The scanner invokes this rule when tasked by the parser to scan the next character from the source. If the production 'whiteSpace' is not declared, it will default to: whiteSpace ::= []; In such cases, white spaces will not be skipped.
For instance, if we want to skip ‘ ‘, tabs, carriage returns and newlines, we may declare the production whiteSpace as:
whiteSpace ::= [ \t\n\r]*;
As mentioned earlier, the same phrase/character level parser parses white spaces. This implies that Semantic actions may also be associated with white spaces for example to recognize pragmas:
whiteSpace ::=
( [
\t\n\r]
| comment
| pragma:doPragma
)*;
The scanner uses the same Match object
that does the actual parsing and subsequently, skipping.
The scanner is highly optimized and uses a deferred comment/white space
skipping algorithm to minimize scanning overhead and
eliminate redundant skipping when parsing alternatives.
By default white spaces and comments are skipped. The angular brackets ‘<’ and ‘>’ enclosing an expression may be used to inhibit white space and comment skipping. This can be useful in situations where we want to work on the character level instead of the phrase level (example).
Semantic Actions:
Named semantic actions can be attached to any level in the grammar. These actions are C/C++ functions that will be called immediately if a match is found in the particular context where the action identifier is attached. These action functions may be used as hooks into the parser and may be used to:
1. Perform additional syntax analysis that is otherwise awkward or impossible to do within the confines of the top-down parser as it is.
2. To generate output from the parser (ASTs for example).
3. Report warnings or errors.
4. Implement symbol tables.
The function should have an interface:
bool Action(Client*
client, Char8 const* str, UInt len);
Where client is a user-defined type, ‘str’ is a pointer to the string that matches the expression and ‘len’ is the length of the matching string. The function may do some more syntax analysis and if for some reason the function finds the syntax pointed to by ‘str’ in error or unacceptable, it may be able to un-commit the parse by returning false. Otherwise if all is well, the function should return true.
We can modify our previous example by attaching a semantic action to the context of Stringmatch(‘C’):
{‘A’ | ‘B’} ‘C’:action
Thus, whenever the parser encounters a ‘C’ in the source file, the function associated with ‘action’ will be called. Uninterestingly, in this case, each time the function is invoked the parameters for ‘str’ and ‘len’ will be “C” and 1 respectively.
If we attach the ‘action’ such that:
{‘A’ | ‘B’}:action ‘C’
It gets a bit more interesting as the function associated with ‘action’ may be called this time with either ‘A’ or ‘B’ as possible parameters for ‘str’ depending on what is found in the source fed into the the parser.
The simplicity of syntax of semantic actions helps keep the source grammar clear and concise. It is the author’s view that the declarative nature of the EBNF syntax should not be cluttered with imperative statements as evident in many compiler-compilers. We shall keep the declaration separate from the implementation.