Search                        Top                                  Index
HELP LR_PARSER                              Robert Duncan, November 1992

    uses lr_parser;

An LALR(1) parser generator.


    CONTENTS - (Use <ENTER> g to access required sections)

 -- Introduction
 -- Background
 -- . An Example Grammar
 -- . Symbolic Representation of Grammars
 -- . Parsers and Parser Generators
 -- . LALR(1) Parser Generation
 -- . LR(1) Parsing
 -- . ACTION and GOTO Tables for Lambda1
 -- . Tracing the Parser
 -- . The Composition of Parser States
 -- . Ambiguities and Conflicts
 -- . Resolving Conflicts
 -- . Further Reading
 -- Using the LR_PARSER Library
 -- . Building the Parser Tables
 -- . Running the Parser in Trace Mode
 -- . Running the Parser Efficiently


-- Introduction -------------------------------------------------------

LIB * LR_PARSER defines an implementation of an LALR(1) parser generator
and supporting utilities. To use this library, your program must include
the line:

    uses lr_parser;

You should load this line now if you want to compile the examples in
this file.

The first part of this file attempts to give sufficient background
information about the LALR(1) system to make the library useful: readers
who are already familiar with this information or who have used similar
systems elsewhere may want to skip this section. A basic knowledge of
context-free grammars and their terminology -- terminal symbols,
productions etc. -- is assumed.

The remainder of the file provides an overview of the main library
procedures. For full details, see REF * LR_PARSER.

A friendlier syntactic interface to the parser generator is described in
HELP * DEFINE_PARSER.


-- Background ---------------------------------------------------------

This section provides background information about the LALR(1) parser
generation system which might prove useful to users of the library. It's
not essential that you should understand all of this before using the
library, but it's something you may want to come back to if you have
problems or want to explore the full potential of the system.


-- . An Example Grammar -----------------------------------------------

As a running example throughout the rest of this file, we shall use a
grammar describing expressions in the lambda calculus. Familiarity with
the lambda calculus is not required to appreciate the example, since we
are interested only in its syntax, not semantics.

An expression in the lambda calculus can be one of four kinds:

    o   a variable;
    o   an application (of one expression to another);
    o   an abstraction (of an expression w.r.t. some variable);
    o   a bracketed expression.

This is described by the grammar

    Lambda0:
        exp  -->  VAR               ;;; variable
        exp  -->  exp exp           ;;; application
        exp  -->  '\' VAR '.' exp   ;;; abstraction
        exp  -->  '(' exp ')'       ;;; bracketed expression

In this notation, Lambda0 is the grammar name; terminal symbols (or
tokens) are written either in upper case (VAR) or within single quotes
('\'), and non-terminal symbols are in lower case and unquoted (exp).
The first non-terminal (exp) is the start symbol.

We can assume that the token VAR stands for any of a whole family of
distinct variable names, unlike the quoted tokens ('\' etc.) which are
literal. What constitutes the set of variable names is immaterial:
typically they'll be textual -- "x", "y", "z", etc. -- or possibly
numeric, but could be anything depending on the context in which the
grammar is being used. It's sufficient for our purposes that the VAR
token implicitly carries a value which we can get to if we want, but the
value itself makes no difference either to the grammar or to the parser
generated from it.

The token '\' is meant to look vaguely like a Greek lambda character.

Beginning with a sequence containing just the start symbol, and
repeatedly replacing non-terminal symbols according to the grammar
rules, we may eventually end up with a sequence containing only terminal
symbols: such a sequence is called a sentence. The sequence of
replacement steps used to generate a particular sentence is called the
derivation of the sentence, and the set of all possible sentences is
called the language generated by the grammar.

Sentences of the Lambda0 language include:

    VAR                             ;;; variable
    VAR VAR                         ;;; application
    \ VAR . VAR VAR                 ;;; abstraction
    ( \ VAR . VAR VAR )             ;;; bracketed expression

A potential problem with Lambda0 is that it is ambiguous, in the sense
that certain sentences have more than one possible derivation. For
example, the sentence

    \ VAR . VAR VAR

has two distinct derivations:


              (a)                                   (b)
             +---+                                 +---+
             |exp|                                 |exp|
             +---+                                 +---+
        +------+------+                       +------+------+
        |   |    |  +-+-+                   +-+-+         +-+-+
        \  VAR   .  |exp|                   |exp|         |exp|
                    +---+                   +---+         +---+
                  +---+---+            +------+------+      |
                +-+-+   +-+-+          |   |    |  +-+-+    |
                |exp|   |exp|          \  VAR   .  |exp|   VAR
                +---+   +---+                      +---+
                  |       |                          |
                 VAR     VAR                        VAR


Which of these is preferred depends on which of the abstraction or
application operations is deemed to be the most binding. In the common
interpretation, the abstraction operation is considered to extend "as
far to the right as possible" making derivation (a) the preferred one.
We can make this preference explicit (and eliminate another similar
ambiguity) by rewriting the grammar to include two extra non-terminal
symbols: appexp (for applications) and aexp (for atomic expressions,
i.e. variables and bracketed terms). The revised grammar is

    Lambda1:

        exp     -->  '\' VAR '.' exp    ;;; abstraction
        exp     -->  appexp

        appexp  -->  appexp aexp        ;;; application
        appexp  -->  aexp

        aexp    -->  VAR                ;;; variable
        aexp    -->  '(' exp ')'        ;;; bracketed expression

It's this latter form of the grammar which we shall use from now on,
although we shall return to the original form later to show how the
parser generator copes with ambiguous grammars.


-- . Symbolic Representation of Grammars ------------------------------

In order to manipulate grammars procedurally, we need a way of
describing grammars free of any meta-notation such as '-->'.

Symbolically, a grammar can be described by a four-tuple

    (T, S, S0, R)

where:

    T  is a set of terminal symbols;

    S  is a set of non-terminal symbols;

    S0 is the start symbol;

    R  is a set of productions (rules).

The sets T and S must be disjoint, since no symbol can be both terminal
and non-terminal, and the start symbol S0 must be a member of S.

Each rule in R is a pair:

    (LHS, RHS)

where LHS is a non-terminal symbol drawn from S, and RHS is a sequence
of symbols drawn from the union of T and S.

Making this concrete, we can represent both lambda grammars with the
Pop11 data structures:

    vars

        Lambda0 = [
            [ VAR \ . ( ) ]             ;;; T
            [ exp ]                     ;;; S
            exp                         ;;; S0
            [[ exp    VAR ]             ;;; R
             [ exp    exp exp ]
             [ exp    \ VAR . exp ]
             [ exp    ( exp ) ]]
        ],

        Lambda1 = [
            [ VAR \ . ( ) ]             ;;; T
            [ exp appexp aexp ]         ;;; S
            exp                         ;;; S0
            [[ exp    \ VAR . exp ]     ;;; R
             [ exp    appexp ]
             [ appexp appexp aexp ]
             [ appexp aexp ]
             [ aexp   VAR ]
             [ aexp   ( exp ) ]]
        ],
    ;

Each rule is encoded as a single list, with the head and tail of the
list standing for the left- and right-hand-sides of the rule.


-- . Parsers and Parser Generators ------------------------------------

A parser for a grammar G is a machine which, when presented with a
string of tokens from G, can determine whether or not it constitutes a
sentence of the language generated by G; in other words, whether there
exists a derivation of the string from the start symbol of G.

In Pop11, this could be a procedure:

    parseG(TOKEN_LIST) -> BOOL

where the "G" suffix on the procedure name indicates that the parser is
particular to the grammar G.

Often we would like the parser to do more work than simply recognise
sentences. At the very least we might expect it to return a parse tree
(a data structure describing the derivation of the input) but there is
no limit to what it could do: in the context of a compiler for example,
the parser might do any amount of work from environment management,
through type checking to actual code generation. Once we have a
recogniser for a language, adding hooks which allow it to do more
constructive work is unlikely to be difficult.

The aim of a parser generator is to take the symbolic representation of
a grammar G and produce automatically a parser for that grammar. It's
too much to hope that there might be a universal generator which can
build a parser for any grammar to which it's applied. Inevitably, there
are several different parser-generation algorithms available, each of
which can cope with a particular class of grammars.


-- . LALR(1) Parser Generation ----------------------------------------

In common with many other similar systems -- most notably the UNIX yacc
utility -- the LR_PARSER library uses the LALR(1) parser-generation
algorithm. This has proved popular because it is comparatively simple
and well-understood, and the basic algorithms can be efficiently
implemented. Also, as a de-facto standard, it has the convenience that a
grammar developed on one such system can be easily ported to any other.

The name LALR(1) is used to describe not only the parser-generation
algorithm itself, but also the class of grammars which the algorithm can
handle. There is, unfortunately, no easy way of recognising an LALR(1)
grammar other than by trying the algorithm on it; however, the general
style is well-enough suited to the kinds of grammar used in formal
systems such as programming languages, although published grammars for
programming languages are often imprecise and need adjusting to make
them truly LALR(1).

To simplify parser generation, the parser itself is divided into two
parts: a fixed procedure implementing a generic parsing algorithm, and a
set of tables which specialise the algorithm for a particular grammar.
Parser generation then reduces to building an appropriate set of tables.

The method of table construction is too complicated to describe here,
and is anyway unhelpful for end-users. Interested readers are referred
to the section on 'Further Reading' below.

On the other hand, the parsing algorithm itself is quite simple, and a
basic knowledge of its operation is useful for the successful
development of LALR(1) grammars.


-- . LR(1) Parsing ----------------------------------------------------

The parsing algorithm used by the parser generator is known as LR(1):
the L stands for left-to-right and describes the way the parser reads
its input; the R stands for rightmost and describes the particular
derivation which the parser favours, and the 1 is the number of tokens
of look-ahead the parser uses in deciding what to do. LR(1) is just one
of a family of parsing algorithms LR(k), for k >= 0.

An LR(1) parser is a state machine, characterised by five components:

    (1) an INPUT stream of terminal symbols: the token at the head of
        the input is the current token.

    (2) a stack of STATES: the state at the top of the stack is the
        current state. The state stack gives the parser considerably
        more power than simpler machines with only a single current
        state, because it can remember the context from which a
        particular state was entered: this allows states to act like
        subroutines for recognising sub-phrases in the input.

    (3) a stack of SYMBOLS (or more generally, VALUES) which describes
        the input the parser has already read. Initially, terminal
        symbols are simply copied from the input to the value stack, but
        when the topmost symbols on the stack match the right-hand-side
        of an appropriate production, they are replaced by the
        corresponding left-hand-side (non-terminal) symbol. At the end
        of a successful parse, the value stack contains just a single
        item: the start symbol.

    (4) an ACTION table: a matrix indexed by current state and current
        token which determines what the parser's next action should be;

    (5) a GOTO table: a matrix indexed by a previous state (recovered
        from the state stack) and a non-terminal symbol which determines
        what state the parser should move to after an expansion of the
        symbol has been recognised in the input.

The ACTION and GOTO tables are specific to each grammar.

At each execution step, the parser can perform any of four kinds of
action:

    (1) SHIFT to state N: the new state is pushed on the state stack,
        and the current token is removed from the input and pushed on
        the value stack. This is valid only when the current token is a
        legal follower for the symbol at the top of the value stack.

    (2) REDUCE by production LHS --> RHS: this represents a successful
        recognition of LHS and is valid only when the topmost items on
        the value stack match the symbols in RHS; the matching symbols
        are removed from the value stack and replaced with LHS. To
        recover the context in which the search for LHS was initiated,
        one state is discarded from the state stack for each item in
        RHS; the newly exposed topmost state, together with LHS,
        determines via the GOTO table a new current state to be pushed
        on the state stack.

    (3) ACCEPT: the parse has succeeded.

    (4) ERROR: the parse has failed.

The whole process can be described in a few lines of Pop11:

    vars action, state, lhs, rhs;   ;;; for the pattern matcher

    define LR_1(INPUT, ACTION, GOTO);
        lvars INPUT, STATES, VALUES, ACTION, GOTO;
        ;;; start in state 1 with an empty value stack
        [1] -> STATES;
        [] -> VALUES;
        repeat
            ACTION(hd(STATES), hd(INPUT)) -> action;
            if action matches [shift ?state] then
                ;;; push new current state
                state :: STATES -> STATES;
                ;;; push current token
                hd(INPUT) :: VALUES -> VALUES;
                ;;; ... and remove it from the input
                tl(INPUT) -> INPUT;
            elseif action matches [reduce ?lhs ??rhs] then
                ;;; replace RHS items on value stack
                lhs :: allbutfirst(length(rhs), VALUES) -> VALUES;
                ;;; discard an equal number of states
                allbutfirst(length(rhs), STATES) -> STATES;
                ;;; GOTO new state
                GOTO(hd(STATES), lhs) :: STATES -> STATES;
            elseif action matches [accept] then
                ;;; success
                return(true);
            else
                ;;; failure
                return(false);
            endif;
        endrepeat;
    enddefine;

This version of LR(1) is just a recogniser, but modifying it to do
something more complicated is not difficult. Firstly, we notice that the
algorithm is unaffected by the contents of the value stack, so we can
allow that to contain arbitrary values rather than just symbols. Then,
to provide a hook into the parser's actions, we add an extra procedure
argument REDUCE_P which is called each time a REDUCE action is
performed: the arguments to REDUCE_P will include at least any values
discarded by the reduction and any results are returned to the value
stack. On ACCEPT, the contents of the value stack are returned.

The exact type and number of arguments to REDUCE_P will depend on the
kind of interface we want to provide, but for demonstration purposes we
can replace the lines:

    lhs :: allbutfirst(length(rhs), VALUES) -> VALUES;

and:

    return(true);

with the code:

    REDUCE_P(
        lhs,
        [% repeat length(rhs) times dest(VALUES) -> VALUES endrepeat %]
    ) :: VALUES -> VALUES;

and:

    return(hd(VALUES));

Setting REDUCE_P to be -erase- would then restore the original
functionality, but making it -conspair- would cause the parser to build
a parse tree.

Note that this implementation of LR(1) is written for clarity rather
than efficiency. One advantage of the separation of parsing algorithm
from table construction is that we can use different implementations of
the parser to suit different purposes: the LR_PARSER library provides
two distinct implementations, one for tracing and debugging and one for
efficient execution.


-- . ACTION and GOTO Tables for Lambda1 -------------------------------

Table 1 shows the ACTION and GOTO tables for the Lambda1 parser. Each
row of the table represents one parser state, of which there are 13 in
all.


         |              ACTION              |         GOTO
      ---+----------------------------------+----------------------
         |   VAR    \    .    (    )    $   |    exp appexp   aexp
      ---+----------------------------------+----------------------
       1 |    s2   s3        s4             |      5      6      7
       2 |    r5   r5   r5   r5   r5   r5   |
       3 |   s11                            |
       4 |    s2   s3        s4             |      9      6      7
       5 |                              a   |
       6 |    s2   r2   r2   s4   r2   r2   |                    8
       7 |    r4   r4   r4   r4   r4   r4   |
       8 |    r3   r3   r3   r3   r3   r3   |
       9 |                       s10        |
      10 |    r6   r6   r6   r6   r6   r6   |
      11 |             s12                  |
      12 |    s2   s3        s4             |     13      6      7
      13 |    r1   r1   r1   r1   r1   r1   |

      Table 1


The ACTION table is indexed by state number and terminal symbol, where
the extra '$' symbol stands for the end of input. Entries in the table
are actions -- SHIFT, REDUCE, ACCEPT or ERROR -- abbreviated as follows:

    s<n>    = shift to state <n>
    r<n>    = reduce by rule <n>
    a       = accept
    <blank> = error

Rule numbers are as shown in Table 2.


         |             RULES
      ---+-------------------------------
       1 |  exp     -->  '\' VAR '.' exp
       2 |  exp     -->  appexp
       3 |  appexp  -->  appexp aexp
       4 |  appexp  -->  aexp
       5 |  aexp    -->  VAR
       6 |  aexp    -->  '(' exp ')'

     Table 2


The GOTO table is indexed by state number and non-terminal symbol.
Numeric entries in the table are state numbers to move to; blank entries
should never be accessed.

Converting these tables into Pop11 data structures suitable for passing
to LR_1 is not difficult and is left as an exercise for the reader. The
number of blank entries and the repetitiveness of certain rows is
typical and means that tables can usually be stored in considerably less
space than would be required for the full matrix.

With the ACTION and GOTO tables built, a parser for Lambda1 could be
defined as:

    define parseLambda1 =
        LR_1(% ACTION, GOTO %);
    enddefine;


-- . Tracing the Parser -----------------------------------------------

Armed with the preceding tables, we can examine in detail the behaviour
of the Lambda1 parser when presented with the input string:

    VAR VAR $

where the '$' token is the explicit end-of-input marker.

Table 3 illustrates the complete operation of the parser in reducing
this string to an expression: it is an example of the kind of output
produced by the library procedure -lr_trace- described below. Each line
of the table corresponds to one execution step (i.e. one cycle through
the -repeat- loop of LR_1) and displays the configuration of the parser
in terms of the current state, the contents of the value stack, the
current input token and the action which the parser takes to make the
transition to the next state.


    State|               Stack | Input | Action
    -----+---------------------+-------+-------
        1|                     | VAR   | SHIFT  2
        2|                 VAR |       | REDUCE aexp --> VAR
        7|                aexp |       | REDUCE appexp --> aexp
        6|              appexp | VAR   | SHIFT  2
        2|          appexp VAR |       | REDUCE aexp --> VAR
        8|         appexp aexp |       | REDUCE appexp --> appexp aexp
        6|              appexp | $     | REDUCE exp --> appexp
        5|                 exp | $     | ACCEPT

    Table 3


The parser begins in state 1 with an empty stack and with the first VAR
as the input token. From Table 1 we have:

    ACTION(1, VAR) = s2

and so the first action is to SHIFT the VAR token onto the stack and
move to state 2.

The next input token is also VAR, and from Table 1 again, we have:

    ACTION(2, VAR) = r5

causing a reduction by rule 5. In fact, this action is the same for all
possible inputs, so the Input column is left blank in the trace to
indicate that the actual input makes no difference to this step. The
reduction causes the VAR token on the value stack to be removed and
replaced with the aexp symbol from the left-hand-side of rule 5. It also
causes the most recent state (2) to be popped from the state stack,
revealing the previous state (1) beneath: the entry

    GOTO(1, aexp) = 7

then directs the parser to state 7.

It is instructive to follow this trace through to the end, where the
last entry in the Action column -- ACCEPT -- indicates a successful
parse.

As a slightly more complicated example, Table 4 shows the parse for the
input string:

    \ VAR . VAR VAR $


    State|               Stack | Input | Action
    -----+---------------------+-------+-------
        1|                     | \     | SHIFT  3
        3|                   \ | VAR   | SHIFT  11
       11|               \ VAR | .     | SHIFT  12
       12|             \ VAR . | VAR   | SHIFT  2
        2|         \ VAR . VAR |       | REDUCE aexp --> VAR
        7|        \ VAR . aexp |       | REDUCE appexp --> aexp
        6|      \ VAR . appexp | VAR   | SHIFT  2
        2|  \ VAR . appexp VAR |       | REDUCE aexp --> VAR
        8| \ VAR . appexp aexp |       | REDUCE appexp --> appexp aexp
        6|      \ VAR . appexp | $     | REDUCE exp --> appexp
       13|         \ VAR . exp | $     | REDUCE exp --> \ VAR . exp
        5|                 exp | $     | ACCEPT

    Table 4


Note how the middle part of this trace (starting with the entry to state
2) exactly replicates Table 3 as the parser reduces the sub-phrase:

    VAR VAR


-- . The Composition of Parser States ---------------------------------

In the terminology of LR parsing, an item is a grammar rule with a
marker (or dot) somewhere on the right-hand-side. Using the symbol <.>
to stand for the marker, the production

    aexp --> ( exp )

gives rise to four distinct items:

    aexp --> <.> ( exp )
    aexp --> ( <.> exp )
    aexp --> ( exp <.> )
    aexp --> ( exp ) <.>

The marker shows how far the parser has got in reading through a
production. Symbols to the left of the marker have been read and will be
on the parser's value stack; symbols to the right are those which the
parser is prepared to accept next as input.

Each parser state is constructed from a set of these items, typically
drawn from different grammar rules. In any particular state, the symbol
immediately to the left of the marker will be the same for all items,
and is the symbol which caused the transition into the state: for a
terminal symbol, the transition must have been a SHIFT; for a
non-terminal symbol, the transition must have been a GOTO following a
reduction. Symbols immediately to the right of the item markers (if any)
determine the actions out of the state.

An item which has the marker immediately before a terminal symbol
generates a SHIFT from the state on that symbol. In Lambda1, state 3 is
constructed from the single item:

    exp --> \ <.> VAR . exp

Thus there is only one action out of this state: to SHIFT on token VAR.
Any other input is invalid. The state to shift to happens to be number
11 (the numbering scheme is arbitrary); if we look at the composition of
that state, we see the single item:

    exp --> \ VAR <.> . exp

showing how the marker moves over the shifted token.

An item which has the marker before a non-terminal symbol generates an
action for each symbol which could start an expansion of that symbol:
SHIFTS for terminal symbols and GOTOS for non-terminals. In Lambda1,
state 4 also consists of a single item:

    aexp --> ( <.> exp )

but this generates several actions: one SHIFT for each token which could
legitimately start an instance of exp (namely VAR, '\' and '(') and one
GOTO for each non-terminal symbol likewise (exp, appexp and aexp). The
GOTO for exp takes us to state 9, which not surprisingly reads:

    aexp --> ( exp <.> )

This state will accept only the closing parenthesis as input.

Finally, an item which has the marker at the right-hand end generates a
reduction from the state by the associated production, for each terminal
symbol which is a valid follower for its left-hand-side. State 2 of
Lambda1 is constructed from the single item:

    aexp --> VAR <.>

So the only action possible in this state is to reduce by this rule
(number 5), although the reduction is valid for any input token.

You can get a written description of a parser from the library procedure
-lr_report- described below: the output shows all the parser states with
their constituent items and resulting actions.


-- . Ambiguities and Conflicts ----------------------------------------

An ambiguous grammar can never be LALR(1). Ambiguities show up as
conflicts in the parser's ACTION table, where a conflict is simply a
choice of two or more actions for some particular state/token
combination. The parser can perform only one action at a time, so
wherever the ACTION table offers a choice, one or more possible actions
have to be ignored, thereby ruling out some potentially successful
parses. Obviously, the fewer conflicts the parser has, the better it is;
the target when developing a grammar should be to have no conflicts at
all, although this is not a fixed rule.

Conflicts come in two flavours:

    (1) SHIFT/REDUCE conflicts, where the choice is between a shift and
        a reduce action;

    (2) REDUCE/REDUCE conflicts, where the choice is between two
        alternative reductions.

SHIFT/SHIFT conflicts never arise simply because of the way the parsing
tables are constructed.

For example, we know that the original Lambda0 grammar has two
alternative derivations for the input string:

    \ VAR . VAR VAR $

If we were to build a parser for Lambda0 and present it with this input,
the parse would begin as follows:


    State|               Stack | Input | Action
    -----+---------------------+-------+-------
        1|                     | \     | SHIFT  3
        3|                   \ | VAR   | SHIFT  9
        9|               \ VAR | .     | SHIFT  10
       10|             \ VAR . | VAR   | SHIFT  2
        2|         \ VAR . VAR |       | REDUCE exp --> VAR
       11|         \ VAR . exp | VAR   |


So far, so good. But there are two ways of completing this trace:


    (a)                  Stack | Input | Action
          ---------------------+-------+-------
                   \ VAR . exp | VAR   | SHIFT
               \ VAR . exp VAR |       | REDUCE: exp --> VAR
               \ VAR . exp exp | $     | REDUCE: exp --> exp exp
                   \ VAR . exp | $     | REDUCE: exp --> \ VAR . exp
                           exp | $     | ACCEPT

    (b)                  Stack | Input | Action
          ---------------------+-------+-------
                   \ VAR . exp | VAR   | REDUCE: exp --> \ VAR . exp
                           exp | VAR   | SHIFT
                       exp VAR | $     | REDUCE: exp --> VAR
                       exp exp | $     | REDUCE: exp --> exp exp
                           exp | $     | ACCEPT


These correspond exactly to the two derivations of this sentence shown
previously. We say that the parser has a SHIFT/REDUCE conflict in state
11 on token VAR. There are similar conflicts for the tokens '\' and '('.

This conflict is the result of a genuine ambiguity in the Lambda0
grammar. Conflicts can also arise for grammars which aren't strictly
ambiguous, yet which require a more powerful parser than LR(1) --
perhaps one which is prepared to look further ahead than a single token.


-- . Resolving Conflicts ----------------------------------------------

The parser generator has two default rules for resolving a conflict in
favour of one of the available choices:

    (1) in a SHIFT/REDUCE conflict, choose the SHIFT;

    (2) in a REDUCE/REDUCE conflict, choose the lower-numbered rule.

Application of these rules reduces each conflict to a single action,
which means that the parser generator can always produce a parser of
some sort, even if it doesn't do very much. Also, because the rules are
deterministic, it's guaranteed that the parser will be the same every
time.

If the resulting parser happens to do the right thing (whatever that may
be) then it may be acceptable to use it just as it is. As far as rule
(1) goes, experience suggests that the outcome is often good: in the
example from Lambda0 above, choosing the SHIFT means opting for trace
(a) which happily corresponds to the preferred interpretation of the
expression. Rule (2) is more arbitrary because of its dependency on the
order in which the grammar rules are presented: this can be useful
occasionally, but in general a parser with SHIFT/REDUCE conflicts is
more "acceptable" than one with REDUCE/REDUCE conflicts. In any case,
you should never ignore conflicts in a parser unless you can understand
why they arise and can be sure that their effects are benevolent.

If the default rules don't produce a satisfactory outcome, there is
no solution other than rewriting the grammar to eliminate conflicts
altogether. At this point it is usually helpful to study the composition
of those parser states in which conflicts arise to see what's causing
the problem.

In the Lambda0 example, the conflict lies in state 11 which is
constructed from the two items:

    exp --> \ VAR . exp <.>
    exp --> exp <.> exp

The first wants to generate a REDUCE for each token which could follow
an exp; the second wants to generate a SHIFT for each token which could
start an exp. Both sets of tokens contain VAR, '\' and '(', so there is
a conflict for each of these. The two sets of symbols overlap because of
the rule

    exp --> exp exp

which, by its juxtaposition of exp and exp, ensures that every symbol
which can start an exp can also follow one. To remove the conflict, we
remove the juxtaposition, introducing a new non-terminal appexp:

    exp --> \ VAR . exp
    exp --> appexp

    appexp --> appexp appexp
    appexp --> VAR
    appexp --> ( exp )

This means that the only legal followers of exp are now ')' and '$' and
the conflicts from state 11 disappear.

It should be no surprise that similar problems exist for the symbol
appexp, and to remove conflicts entirely we need to introduce the aexp
symbol which brings us back to Lambda1.


-- . Further Reading --------------------------------------------------

For more information about LALR(1) parsing, refer to the following books
and papers:

[1] Des Watson
    High-Level Languages and Their Compilers
    Addison-Wesley, 1989

[2] Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman
    Compilers: Principles, Techniques and Tools
    Addison-Wesley, 1986

[3] Allen I. Holub
    Compiler Design in C
    Prentice-Hall, 1990
        [Contains complete code for a parser and parser generator in C.]

[4] Frank DeRemer and Thomas Pennello
    Efficient Computation of LALR(1) Look-Ahead Sets
    ACM TOPLAS, Vol. 4, No. 4, Oct 1982, pp. 615-649
        [Describes the algorithm adopted for the LR_PARSER library.]


-- Using the LR_PARSER Library ----------------------------------------

The LR_PARSER library defines only the procedural interface to the
parser generator, which is somewhat low-level for normal use. Pop11
programmers should consider using the syntactic interface described in
HELP * DEFINE_PARSER which uses a meta-notation for the declarative
description of grammars, and allows Pop11 code to be attached to grammar
rules to be executed when the rules are reduced. The library procedures
described here would normally be used directly only from other Poplog
languages, or in situations where a parser needs to be generated
dynamically.

This part of the file provides an overview of the main library
procedures, with examples of their use applied to the grammars described
above. Full technical information describing all the available options
can be found in REF * LR_PARSER.


-- . Building the Parser Tables ---------------------------------------

The parser generator itself has the call form:

    uses lr_build;
    lr_build(NAME, TOKENS, SYMBOLS, START_SYMBOL, RULES) -> PARSER

This constructs a parser from a symbolic description of a grammar,
where:

    NAME is a name for the grammar: this can be any item, but
        is typically a word or string;

    TOKENS is a list or vector of terminal symbols: a token can be any
        item which compares with '==' (typically again, a word, or a
        unique string);

    SYMBOLS is a list or vector of non-terminal symbols: as with tokens,
        symbols are compared with '==';

    START_SYMBOL is an item from SYMBOLS nominated as the start symbol
        of the grammar;

    RULES is a list or vector of grammar rules: each rule is itself a
        list or vector, whose first element is the left-hand-side of the
        rule and whose remaining elements make up the right-hand-side.

The resulting PARSER is a data structure which retains all the symbolic
information about the input grammar, but also contains the ACTION and
GOTO tables for the parser.

To build parsers for the Lambda grammars, we can do:

    vars

        Lambda0 = lr_build(
            "Lambda0",                  ;;; NAME
            [ VAR \ . ( ) ],            ;;; TOKENS
            [ exp ],                    ;;; SYMBOLS
            "exp",                      ;;; START_SYMBOL
            [[ exp    VAR ]             ;;; RULES
             [ exp    exp exp ]
             [ exp    \ VAR . exp ]
             [ exp    ( exp ) ]]
        ),

        Lambda1 = lr_build(
            "Lambda1",                  ;;; NAME
            [ VAR \ . ( ) ],            ;;; TOKENS
            [ exp appexp aexp ],        ;;; SYMBOLS
            "exp",                      ;;; START_SYMBOL
            [[ exp    \ VAR . exp ]     ;;; RULES
             [ exp    appexp ]
             [ appexp appexp aexp ]
             [ appexp aexp ]
             [ aexp   VAR ]
             [ aexp   ( exp ) ]]
        ),
    ;

    Lambda0 =>
    ** <parser Lambda0>

    Lambda1 =>
    ** <parser Lambda1>

The procedure:

    lr_parser(Lambda0) =>
    ** <parser Lambda0>

    lr_parser(3) =>
    ** <false>

is a recogniser for parsers built by -lr_build-: applied to a PARSER
structure, it returns it; otherwise it returns <false>.

The symbolic information can be retrieved from a parser using:

    lr_parser_name(Lambda0) =>
    ** Lambda0

    lr_parser_terminal_symbols(Lambda0) =>
    ** {VAR \ . ( )}

    lr_parser_nonterminal_symbols(Lambda0) =>
    ** {exp}

    lr_parser_start_symbol(Lambda0) =>
    ** exp

    lr_parser_rules(Lambda0) =>
    ** {[exp VAR] [exp exp exp] [exp \ VAR . exp] [exp ( exp )]}

These return the original arguments given to -lr_build-, although note
that the tokens, symbols and rules are always returned as vectors,
regardless of the form in which they were input: this is to allow
constant-time access to numbered symbols and rules from the parser. The
items within each vector are identical to those input and are in the
same order.

The number of SHIFT/REDUCE and REDUCE/REDUCE conflicts in the parser can
be obtained with the procedures:

    lr_parser_sr_conflicts(Lambda0) =>
    ** 6

    lr_parser_rr_conflicts(Lambda0) =>
    ** 0

Finally, to examine the composition of a parser in more detail, the
procedure:

    uses lr_report;
    lr_report(Lambda0, 'report');

produces a file called 'report' containing a written description of
every parser state with its core items, actions and gotos. The report
can be directed to a named file (as here) or to any appropriate output
destination, such as a character consumer or an output device. For all
but the smallest of grammars the report can be very lengthy, with
possibly hundreds of states and thousands of lines of output, so a named
file is usually the best destination.


-- . Running the Parser in Trace Mode ---------------------------------

To run the parser, we need an implementation of the LR(1) algorithm
which can interpret the parsing tables, provided initially by the
procedure:

    uses lr_trace;
    lr_trace(INPUT, PARSER) -> PARSE_TREE
    lr_trace(INPUT, PARSER) -> FALSE

INPUT is a list of items drawn from the PARSER's token set. If the parse
is successful, -lr_trace- returns a parse tree for the derivation it
found; for a failed parse, it returns just <false>.

The parse tree has a leaf node for each SHIFT action, consisting of the
token shifted, and an interior node for each REDUCTION, consisting of a
list whose head is the non-terminal symbol from the left-hand-side of
the rule being reduced, and whose tail is a sequence of sub-trees
corresponding to the symbols of the right-hand-side.

For example:

    lr_trace([VAR VAR], Lambda1) =>
    ** [exp [appexp [appexp [aexp VAR]] [aexp VAR]]]

Laying out the tree for readability we get:

    [exp
        [appexp
            [appexp
                [aexp VAR]]
            [aexp VAR]]]

Such a tree can be displayed graphically using LIB * SHOWTREE.

As a side-effect of its parsing, -lr_trace- produces a trace of its
actions similar to those shown in Tables 3 and 4. When called from VED,
this trace is written to a separate output file; otherwise it is written
to the current output. There are various flags available for controlling
output from the tracer: see REF * LR_PARSER for details.

The PARSER data structure has as its -class_apply- a procedure which
invokes -lr_trace- but with trace output suppressed; so we could equally
well do:

    Lambda1([VAR VAR]) =>
    ** [exp [appexp [appexp [aexp VAR]] [aexp VAR]]]


-- . Running the Parser Efficiently -----------------------------------

An alternative implementation of the LR(1) algorithm is provided by the
procedure -lr_parse-. This doesn't produce any trace output, but is
considerably more efficient. The interface to this procedure is more
complicated, but more flexible. The general call form looks like:

    lr_parse(INPUT_P, REDUCE_P, PARSER)

where INPUT_P is a procedure generating the input to the parser and
REDUCE_P is a procedure to use for the REDUCE actions. Any results
returned by a successful parse are determined exclusively by the
behaviour of REDUCE_P.

The input procedure has the form:

    INPUT_P() -> (VALUE, TOKEN_N)

where VALUE is some arbitrary value and TOKEN_N is the token number to
which it corresponds. Tokens are numbered from 1 according to their
order in the TOKENS list given to -lr_build-; the number 0 is used to
indicate the end of input. TOKEN_N is used by the parser to index the
ACTION table, but it's VALUE which is pushed on the stack for a SHIFT.

The reduction procedure has the form:

    REDUCE_P(RULE_N)

where RULE_N is the number of the rule being reduced. Beneath this on
the user stack will be arguments corresponding to the symbols on the
right-hand-side of the rule: each terminal symbol will be represented by
the VALUE which was shifted for it, and each non-terminal symbol will be
represented by any results returned by the call to REDUCE_P which
reduced it.

The results of a successful parse are the results returned by REDUCE_P
for the final reduction of the start symbol. Because the parser has no
control over these, it does not try to return any indication of whether
the parse succeeded or failed. Instead, if the parse fails, the parser
calls the procedure:

    lr_parse_error(VALUE, TOKEN_N, STATE)

where VALUE and TOKEN_N describe the input token which caused the error,
and STATE is the state number in which it arose. This error procedure is
redefinable; its default value is:

    define vars lr_parse_error(value, token_n, state);
        lvars value, token_n, state;
        mishap(value, 1, 'PARSE ERROR');
    enddefine;


To demonstrate how this works, we can write input and reduce procedures
suitable for use with the Lambda1 parser.

The input procedure will read characters from the current input,
stopping at the end of a line. Variables are represented as single lower
case letters: `a`, ..., `z`.

    define INPUT_P() -> (VALUE, TOKEN_N);
        lvars c = cucharin(), VALUE, TOKEN_N;
        while c == `\s` or c == `\t` do
            ;;; skip leading spaces
            cucharin() -> c;
        endwhile;
        if c == `\n` or c == termin then
            (termin, 0) -> (VALUE, TOKEN_N);
        else
            consword(c, 1) -> VALUE;
            if islowercode(c) then
                1;
            elseif c == `\\` then
                2;
            elseif c == `.` then
                3;
            elseif c == `(` then
                4;
            elseif c == `)` then
                5;
            else
                ;;; anything else is an error
                -1;
            endif -> TOKEN_N;
        endif;
    enddefine;

The reduction procedure will build a parse tree, but to make things a
bit different, it will reflect more the abstract syntax of the lambda
calculus, stripping out extraneous productions and the tokens: '\', '.',
'(' and ')'. Remember that when this procedure is called, the stack will
contain one value for each symbol on the right-hand-side of the rule
being reduced (refer back to Table 2 for the rule numbers).

    define REDUCE_P(N);
        lvars N, x, e, e1;
        go_on N to rule1 rule2 rule3 rule4 rule5 rule6;
    rule1:
        ;;; \ x . e
        () -> (, x, , e);
        return([ABS ^x ^e]);
    rule3:
        ;;; e e1
        () -> (e, e1);
        return([APP ^e ^e1]);
    rule5:
        ;;; VAR
        () -> x;
        return([VAR ^x]);
    rule6:
        ;;; ( e )
        () -> (, e, );
        return(e);
    rule2:
    rule4:
    enddefine;


Some examples of use (if you want to run these, you must mark and load
both the call and the subsequent line of input):

    lr_parse(INPUT_P, REDUCE_P, Lambda1) =>
    x
    ** [VAR x]

    lr_parse(INPUT_P, REDUCE_P, Lambda1) =>
    x(yz)
    ** [APP [VAR x] [APP [VAR y] [VAR z]]]

    lr_parse(INPUT_P, REDUCE_P, Lambda1) =>
    (\x.xx)(\x.xx)
    ** [APP [ABS x [APP [VAR x] [VAR x]]] [ABS x [APP [VAR x] [VAR x]]]]

    lr_parse(INPUT_P, REDUCE_P, Lambda1) =>
    \xx

    ;;; MISHAP - PARSE ERROR
    ;;; INVOLVING:  x
    ;;; FILE     :  lr_parser   LINE NUMBER: 1138
    ;;; DOING    :  lr_parse_error lr_parse compile


--- C.all/help/lr_parser
--- Copyright University of Sussex 1992. All rights reserved. ----------