The goal of this work was to learn more about how compilers work and parse input from a programming language to an internal representation and machine code. Within this repository are two implementations for a recursive descent parser: one written in Python and one written in C. Both define a simple programming language based on arithmetic expressions with a scanner separate from the parser.
Language implementation systems must analyze source code, regardless of the specific implementation approach. Nearly all syntax analysis is based on a formal description of the syntax of the source language (Backus-Naur form, or BNF). The syntax analysis portion of a language processor nearly always consists of two parts:
- A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
- A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)
Lexical and syntax analysis are typically separated because of simplicity (less complex approaches can be used for lexical analysis; separating them simplifies the parser), efficiency (separation allows optimization of the lexical analyzer) and portability (parts of the lexical analyzer may not be portable, but the parser is always portable).
A lexixal analyzer is a pattern matcher for character strings (i.e. a front-end for the parser). It identifies substrings of the source program that belong together - lexemes. Lexemes match a character pattern, which is associated with a lexical category called a token. The lexical analyzer is usually a function that is called by the parser when it needs the next token.
The goals of the parser, given an input program, are to find all syntax errors (for each, produce an appropriate diagnostic message and recover quicky) and produce the parse tree, or at least a trace of the parse tree, for the program. There are two categories of parsers:
- Top-down: produces the parse tree beginning at the root
- Order is that of a leftmost derivation
- Traces or builds the parse tree in preorder
- Bottom-up: produces the parse tree beginning at the leaves
- Order is that of the reverse of a rightmost derivation
- Useful parsers look only one token ahead in the input
The work in this repository focuses on top-down parsers, where given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A. The most common top-down parsing algorithms are recursive-descent (a coded implementation, used in this repo) and LL parsers (table-driven implementations).
For recursive-descent parsers, there is a subprogram for each nonterminal in the grammar which can parse sentences that can be generated by that nonterminal. Extended Backus-Naur form (EBNF) is ideally suited for being the basis of a recursive-descent parser, because EBNF minimizes the number of nonterminals. The following shows an example of a grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | int_constant | ( <expr> )
When there is only one right-hand side (RHS):
- For each terminal symbol in the RHS, compare it with the next input token; if they match, continue, else there is an error
- For each nonterminal symbol in the RHS, call its associated parsing subprogram
A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse:
- The correct RHS is chosen on the basis of the next token of input (the lookahead)
- The next token is compared with the first token that can be generated by each RHS until a match is found
- If no match is found, it is a syntax error
The Left Recursion Problem: If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser. A grammar can be modified to remove direct left recursion as follows:
For each non-terminal, A
,
- Group the A-rules as
A → Aα1 | … | Aαm | β1 | β2 | … | βn
where none of the β‘s
begins with A
2. Replace the original A-rules with
A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε
The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness, the inability to determine the correct RHS on the basis of one token of lookahead.
Def: FIRST(α) = {a | α =>* aβ }
(If α =>* ε, ε is in FIRST(α)
Pairwise Disjointness Test:
For each nonterminal, A, in the grammar that has more than one RHS, for each pair of rules, A → αi
and A → αj
, it must be true that FIRST(αi) ⋂ FIRST(αj) = Φ
. For example:
A → a | bB | cAb
A → a | aB
Left factoring can resolve pairwise disjointness! Replace
<variable> → identifier | identifier [<expression>]
with
<variable> → identifier <new>
<new> → ε | [<expression>]
or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)
Michael Balas