Search Top Index

HELP LR_PARSER Robert Duncan, November 1992 uses lr_parser; An LALR(1) parser generator. CONTENTS - (Use <ENTER> g to access required sections) -- Introduction -- Background -- . An Example Grammar -- . Symbolic Representation of Grammars -- . Parsers and Parser Generators -- . LALR(1) Parser Generation -- . LR(1) Parsing -- . ACTION and GOTO Tables for Lambda1 -- . Tracing the Parser -- . The Composition of Parser States -- . Ambiguities and Conflicts -- . Resolving Conflicts -- . Further Reading -- Using the LR_PARSER Library -- . Building the Parser Tables -- . Running the Parser in Trace Mode -- . Running the Parser Efficiently -- Introduction ------------------------------------------------------- LIB * LR_PARSER defines an implementation of an LALR(1) parser generator and supporting utilities. To use this library, your program must include the line: uses lr_parser; You should load this line now if you want to compile the examples in this file. The first part of this file attempts to give sufficient background information about the LALR(1) system to make the library useful: readers who are already familiar with this information or who have used similar systems elsewhere may want to skip this section. A basic knowledge of context-free grammars and their terminology -- terminal symbols, productions etc. -- is assumed. The remainder of the file provides an overview of the main library procedures. For full details, see REF * LR_PARSER. A friendlier syntactic interface to the parser generator is described in HELP * DEFINE_PARSER. -- Background --------------------------------------------------------- This section provides background information about the LALR(1) parser generation system which might prove useful to users of the library. It's not essential that you should understand all of this before using the library, but it's something you may want to come back to if you have problems or want to explore the full potential of the system. -- . An Example Grammar ----------------------------------------------- As a running example throughout the rest of this file, we shall use a grammar describing expressions in the lambda calculus. Familiarity with the lambda calculus is not required to appreciate the example, since we are interested only in its syntax, not semantics. An expression in the lambda calculus can be one of four kinds: o a variable; o an application (of one expression to another); o an abstraction (of an expression w.r.t. some variable); o a bracketed expression. This is described by the grammar Lambda0: exp --> VAR ;;; variable exp --> exp exp ;;; application exp --> '\' VAR '.' exp ;;; abstraction exp --> '(' exp ')' ;;; bracketed expression In this notation, Lambda0 is the grammar name; terminal symbols (ortokens) are written either in upper case (VAR) or within single quotes ('\'), and non-terminal symbols are in lower case and unquoted (exp). The first non-terminal (exp) is the start symbol. We can assume that the token VAR stands for any of a whole family of distinct variable names, unlike the quoted tokens ('\' etc.) which are literal. What constitutes the set of variable names is immaterial: typically they'll be textual -- "x", "y", "z", etc. -- or possibly numeric, but could be anything depending on the context in which the grammar is being used. It's sufficient for our purposes that the VAR token implicitly carries a value which we can get to if we want, but the value itself makes no difference either to the grammar or to the parser generated from it. The token '\' is meant to look vaguely like a Greeklambdacharacter. Beginning with a sequence containing just the start symbol, and repeatedly replacing non-terminal symbols according to the grammar rules, we may eventually end up with a sequence containing only terminal symbols: such a sequence is called asentence. The sequence of replacement steps used to generate a particular sentence is called thederivationof the sentence, and the set of all possible sentences is called thelanguagegenerated by the grammar. Sentences of the Lambda0 language include: VAR ;;; variable VAR VAR ;;; application \ VAR . VAR VAR ;;; abstraction ( \ VAR . VAR VAR ) ;;; bracketed expression A potential problem with Lambda0 is that it is ambiguous, in the sense that certain sentences have more than one possible derivation. For example, the sentence \ VAR . VAR VAR has two distinct derivations: (a) (b) +---+ +---+ |exp| |exp| +---+ +---+ +------+------+ +------+------+ | | | +-+-+ +-+-+ +-+-+ \ VAR . |exp| |exp| |exp| +---+ +---+ +---+ +---+---+ +------+------+ | +-+-+ +-+-+ | | | +-+-+ | |exp| |exp| \ VAR . |exp| VAR +---+ +---+ +---+ | | | VAR VAR VAR Which of these is preferred depends on which of the abstraction or application operations is deemed to be the most binding. In the common interpretation, the abstraction operation is considered to extend "as far to the right as possible" making derivation (a) the preferred one. We can make this preference explicit (and eliminate another similar ambiguity) by rewriting the grammar to include two extra non-terminal symbols:appexp(for applications) andaexp(for atomic expressions, i.e. variables and bracketed terms). The revised grammar is Lambda1: exp --> '\' VAR '.' exp ;;; abstraction exp --> appexp appexp --> appexp aexp ;;; application appexp --> aexp aexp --> VAR ;;; variable aexp --> '(' exp ')' ;;; bracketed expression It's this latter form of the grammar which we shall use from now on, although we shall return to the original form later to show how the parser generator copes with ambiguous grammars. -- . Symbolic Representation of Grammars ------------------------------ In order to manipulate grammars procedurally, we need a way of describing grammars free of any meta-notation such as '-->'. Symbolically, a grammar can be described by a four-tuple (T, S, S0, R) where: T is a set of terminal symbols; S is a set of non-terminal symbols; S0 is the start symbol; R is a set of productions (rules). The sets T and S must be disjoint, since no symbol can be both terminal and non-terminal, and the start symbol S0 must be a member of S. Each rule in R is a pair: (LHS, RHS) where LHS is a non-terminal symbol drawn from S, and RHS is a sequence of symbols drawn from the union of T and S. Making this concrete, we can represent both lambda grammars with the Pop11 data structures: vars Lambda0 = [ [ VAR \ . ( ) ] ;;; T [ exp ] ;;; S exp ;;; S0 [[ exp VAR ] ;;; R [ exp exp exp ] [ exp \ VAR . exp ] [ exp ( exp ) ]] ], Lambda1 = [ [ VAR \ . ( ) ] ;;; T [ exp appexp aexp ] ;;; S exp ;;; S0 [[ exp \ VAR . exp ] ;;; R [ exp appexp ] [ appexp appexp aexp ] [ appexp aexp ] [ aexp VAR ] [ aexp ( exp ) ]] ], ; Each rule is encoded as a single list, with the head and tail of the list standing for the left- and right-hand-sides of the rule. -- . Parsers and Parser Generators ------------------------------------ A parser for a grammar G is a machine which, when presented with a string of tokens from G, can determine whether or not it constitutes a sentence of the language generated by G; in other words, whether there exists a derivation of the string from the start symbol of G. In Pop11, this could be a procedure: parseG(TOKEN_LIST) -> BOOL where the "G" suffix on the procedure name indicates that the parser is particular to the grammar G. Often we would like the parser to do more work than simply recognise sentences. At the very least we might expect it to return aparse tree(a data structure describing the derivation of the input) but there is no limit to what it could do: in the context of a compiler for example, the parser might do any amount of work from environment management, through type checking to actual code generation. Once we have a recogniser for a language, adding hooks which allow it to do more constructive work is unlikely to be difficult. The aim of a parser generator is to take the symbolic representation of a grammar G and produce automatically a parser for that grammar. It's too much to hope that there might be a universal generator which can build a parser for any grammar to which it's applied. Inevitably, there are several different parser-generation algorithms available, each of which can cope with a particular class of grammars. -- . LALR(1) Parser Generation ---------------------------------------- In common with many other similar systems -- most notably the UNIXyaccutility -- the LR_PARSER library uses the LALR(1) parser-generation algorithm. This has proved popular because it is comparatively simple and well-understood, and the basic algorithms can be efficiently implemented. Also, as a de-facto standard, it has the convenience that a grammar developed on one such system can be easily ported to any other. The name LALR(1) is used to describe not only the parser-generation algorithm itself, but also the class of grammars which the algorithm can handle. There is, unfortunately, no easy way of recognising an LALR(1) grammar other than by trying the algorithm on it; however, the general style is well-enough suited to the kinds of grammar used in formal systems such as programming languages, although published grammars for programming languages are often imprecise and need adjusting to make them truly LALR(1). To simplify parser generation, the parser itself is divided into two parts: a fixed procedure implementing a generic parsing algorithm, and a set of tables which specialise the algorithm for a particular grammar. Parser generation then reduces to building an appropriate set of tables. The method of table construction is too complicated to describe here, and is anyway unhelpful for end-users. Interested readers are referred to the section on 'Further Reading' below. On the other hand, the parsing algorithm itself is quite simple, and a basic knowledge of its operation is useful for the successful development of LALR(1) grammars. -- . LR(1) Parsing ---------------------------------------------------- The parsing algorithm used by the parser generator is known as LR(1): theLstands forleft-to-rightand describes the way the parser reads its input; theRstands forrightmostand describes the particular derivation which the parser favours, and the1is the number of tokens of look-ahead the parser uses in deciding what to do. LR(1) is just one of a family of parsing algorithms LR(k), fork>= 0. An LR(1) parser is a state machine, characterised by five components: (1) an INPUT stream of terminal symbols: the token at the head of the input is the current token. (2) a stack of STATES: the state at the top of the stack is the current state. The state stack gives the parser considerably more power than simpler machines with only a single current state, because it can remember the context from which a particular state was entered: this allows states to act like subroutines for recognising sub-phrases in the input. (3) a stack of SYMBOLS (or more generally, VALUES) which describes the input the parser has already read. Initially, terminal symbols are simply copied from the input to the value stack, but when the topmost symbols on the stack match the right-hand-side of an appropriate production, they are replaced by the corresponding left-hand-side (non-terminal) symbol. At the end of a successful parse, the value stack contains just a single item: the start symbol. (4) an ACTION table: a matrix indexed by current state and current token which determines what the parser's next action should be; (5) a GOTO table: a matrix indexed by a previous state (recovered from the state stack) and a non-terminal symbol which determines what state the parser should move to after an expansion of the symbol has been recognised in the input. The ACTION and GOTO tables are specific to each grammar. At each execution step, the parser can perform any of four kinds of action: (1) SHIFT to state N: the new state is pushed on the state stack, and the current token is removed from the input and pushed on the value stack. This is valid only when the current token is a legal follower for the symbol at the top of the value stack. (2) REDUCE by production LHS --> RHS: this represents a successful recognition of LHS and is valid only when the topmost items on the value stack match the symbols in RHS; the matching symbols are removed from the value stack and replaced with LHS. To recover the context in which the search for LHS was initiated, one state is discarded from the state stack for each item in RHS; the newly exposed topmost state, together with LHS, determines via the GOTO table a new current state to be pushed on the state stack. (3) ACCEPT: the parse has succeeded. (4) ERROR: the parse has failed. The whole process can be described in a few lines of Pop11: vars action, state, lhs, rhs; ;;; for the pattern matcher define LR_1(INPUT, ACTION, GOTO); lvars INPUT, STATES, VALUES, ACTION, GOTO; ;;; start in state 1 with an empty value stack [1] -> STATES; [] -> VALUES; repeat ACTION(hd(STATES), hd(INPUT)) -> action; if action matches [shift ?state] then ;;; push new current state state :: STATES -> STATES; ;;; push current token hd(INPUT) :: VALUES -> VALUES; ;;; ... and remove it from the input tl(INPUT) -> INPUT; elseif action matches [reduce ?lhs ??rhs] then ;;; replace RHS items on value stack lhs :: allbutfirst(length(rhs), VALUES) -> VALUES; ;;; discard an equal number of states allbutfirst(length(rhs), STATES) -> STATES; ;;; GOTO new state GOTO(hd(STATES), lhs) :: STATES -> STATES; elseif action matches [accept] then ;;; success return(true); else ;;; failure return(false); endif; endrepeat; enddefine; This version of LR(1) is just a recogniser, but modifying it to do something more complicated is not difficult. Firstly, we notice that the algorithm is unaffected by the contents of the value stack, so we can allow that to contain arbitrary values rather than just symbols. Then, to provide a hook into the parser's actions, we add an extra procedure argument REDUCE_P which is called each time a REDUCE action is performed: the arguments to REDUCE_P will include at least any values discarded by the reduction and any results are returned to the value stack. On ACCEPT, the contents of the value stack are returned. The exact type and number of arguments to REDUCE_P will depend on the kind of interface we want to provide, but for demonstration purposes we can replace the lines: lhs :: allbutfirst(length(rhs), VALUES) -> VALUES; and: return(true); with the code: REDUCE_P( lhs, [% repeat length(rhs) times dest(VALUES) -> VALUES endrepeat %] ) :: VALUES -> VALUES; and: return(hd(VALUES)); Setting REDUCE_P to be -erase- would then restore the original functionality, but making it -conspair- would cause the parser to build a parse tree. Note that this implementation of LR(1) is written for clarity rather than efficiency. One advantage of the separation of parsing algorithm from table construction is that we can use different implementations of the parser to suit different purposes: the LR_PARSER library provides two distinct implementations, one for tracing and debugging and one for efficient execution. -- . ACTION and GOTO Tables for Lambda1 ------------------------------- Table 1 shows the ACTION and GOTO tables for the Lambda1 parser. Each row of the table represents one parser state, of which there are 13 in all. | ACTION | GOTO ---+----------------------------------+---------------------- | VAR \ . ( ) $ | exp appexp aexp ---+----------------------------------+---------------------- 1 | s2 s3 s4 | 5 6 7 2 | r5 r5 r5 r5 r5 r5 | 3 | s11 | 4 | s2 s3 s4 | 9 6 7 5 | a | 6 | s2 r2 r2 s4 r2 r2 | 8 7 | r4 r4 r4 r4 r4 r4 | 8 | r3 r3 r3 r3 r3 r3 | 9 | s10 | 10 | r6 r6 r6 r6 r6 r6 | 11 | s12 | 12 | s2 s3 s4 | 13 6 7 13 | r1 r1 r1 r1 r1 r1 |Table 1The ACTION table is indexed by state number and terminal symbol, where the extra '$' symbol stands for the end of input. Entries in the table are actions -- SHIFT, REDUCE, ACCEPT or ERROR -- abbreviated as follows: s<n> = shift to state <n> r<n> = reduce by rule <n> a = accept <blank> = error Rule numbers are as shown in Table 2. | RULES ---+------------------------------- 1 | exp --> '\' VAR '.' exp 2 | exp --> appexp 3 | appexp --> appexp aexp 4 | appexp --> aexp 5 | aexp --> VAR 6 | aexp --> '(' exp ')'Table 2The GOTO table is indexed by state number and non-terminal symbol. Numeric entries in the table are state numbers to move to; blank entries should never be accessed. Converting these tables into Pop11 data structures suitable for passing to LR_1 is not difficult and is left as an exercise for the reader. The number of blank entries and the repetitiveness of certain rows is typical and means that tables can usually be stored in considerably less space than would be required for the full matrix. With the ACTION and GOTO tables built, a parser for Lambda1 could be defined as: define parseLambda1 = LR_1(% ACTION, GOTO %); enddefine; -- . Tracing the Parser ----------------------------------------------- Armed with the preceding tables, we can examine in detail the behaviour of the Lambda1 parser when presented with the input string: VAR VAR $ where the '$' token is the explicit end-of-input marker. Table 3 illustrates the complete operation of the parser in reducing this string to an expression: it is an example of the kind of output produced by the library procedure -lr_trace- described below. Each line of the table corresponds to one execution step (i.e. one cycle through the -repeat- loop of LR_1) and displays the configuration of the parser in terms of the current state, the contents of the value stack, the current input token and the action which the parser takes to make the transition to the next state. State| Stack | Input | Action -----+---------------------+-------+------- 1| | VAR | SHIFT 2 2| VAR | | REDUCE aexp --> VAR 7| aexp | | REDUCE appexp --> aexp 6| appexp | VAR | SHIFT 2 2| appexp VAR | | REDUCE aexp --> VAR 8| appexp aexp | | REDUCE appexp --> appexp aexp 6| appexp | $ | REDUCE exp --> appexp 5| exp | $ | ACCEPTTable 3The parser begins in state 1 with an empty stack and with the first VAR as the input token. From Table 1 we have: ACTION(1, VAR) = s2 and so the first action is to SHIFT the VAR token onto the stack and move to state 2. The next input token is also VAR, and from Table 1 again, we have: ACTION(2, VAR) = r5 causing a reduction by rule 5. In fact, this action is the same for all possible inputs, so the Input column is left blank in the trace to indicate that the actual input makes no difference to this step. The reduction causes the VAR token on the value stack to be removed and replaced with theaexpsymbol from the left-hand-side of rule 5. It also causes the most recent state (2) to be popped from the state stack, revealing the previous state (1) beneath: the entry GOTO(1, aexp) = 7 then directs the parser to state 7. It is instructive to follow this trace through to the end, where the last entry in the Action column -- ACCEPT -- indicates a successful parse. As a slightly more complicated example, Table 4 shows the parse for the input string: \ VAR . VAR VAR $ State| Stack | Input | Action -----+---------------------+-------+------- 1| | \ | SHIFT 3 3| \ | VAR | SHIFT 11 11| \ VAR | . | SHIFT 12 12| \ VAR . | VAR | SHIFT 2 2| \ VAR . VAR | | REDUCE aexp --> VAR 7| \ VAR . aexp | | REDUCE appexp --> aexp 6| \ VAR . appexp | VAR | SHIFT 2 2| \ VAR . appexp VAR | | REDUCE aexp --> VAR 8| \ VAR . appexp aexp | | REDUCE appexp --> appexp aexp 6| \ VAR . appexp | $ | REDUCE exp --> appexp 13| \ VAR . exp | $ | REDUCE exp --> \ VAR . exp 5| exp | $ | ACCEPTTable 4Note how the middle part of this trace (starting with the entry to state 2) exactly replicates Table 3 as the parser reduces the sub-phrase: VAR VAR -- . The Composition of Parser States --------------------------------- In the terminology of LR parsing, anitemis a grammar rule with a marker (ordot) somewhere on the right-hand-side. Using the symbol <.> to stand for the marker, the production aexp --> ( exp ) gives rise to four distinct items: aexp --> <.> ( exp ) aexp --> ( <.> exp ) aexp --> ( exp <.> ) aexp --> ( exp ) <.> The marker shows how far the parser has got in reading through a production. Symbols to the left of the marker have been read and will be on the parser's value stack; symbols to the right are those which the parser is prepared to accept next as input. Each parser state is constructed from a set of these items, typically drawn from different grammar rules. In any particular state, the symbol immediately to the left of the marker will be the same for all items, and is the symbol which caused the transition into the state: for a terminal symbol, the transition must have been a SHIFT; for a non-terminal symbol, the transition must have been a GOTO following a reduction. Symbols immediately to the right of the item markers (if any) determine the actions out of the state. An item which has the marker immediately before a terminal symbol generates a SHIFT from the state on that symbol. In Lambda1, state 3 is constructed from the single item: exp --> \ <.> VAR . exp Thus there is only one action out of this state: to SHIFT on token VAR. Any other input is invalid. The state to shift to happens to be number 11 (the numbering scheme is arbitrary); if we look at the composition of that state, we see the single item: exp --> \ VAR <.> . exp showing how the marker moves over the shifted token. An item which has the marker before a non-terminal symbol generates an action for each symbol which could start an expansion of that symbol: SHIFTS for terminal symbols and GOTOS for non-terminals. In Lambda1, state 4 also consists of a single item: aexp --> ( <.> exp ) but this generates several actions: one SHIFT for each token which could legitimately start an instance ofexp(namely VAR, '\' and '(') and one GOTO for each non-terminal symbol likewise (exp,appexpandaexp). The GOTO forexptakes us to state 9, which not surprisingly reads: aexp --> ( exp <.> ) This state will accept only the closing parenthesis as input. Finally, an item which has the marker at the right-hand end generates a reduction from the state by the associated production, for each terminal symbol which is a valid follower for its left-hand-side. State 2 of Lambda1 is constructed from the single item: aexp --> VAR <.> So the only action possible in this state is to reduce by this rule (number 5), although the reduction is valid for any input token. You can get a written description of a parser from the library procedure -lr_report- described below: the output shows all the parser states with their constituent items and resulting actions. -- . Ambiguities and Conflicts ---------------------------------------- An ambiguous grammar can never be LALR(1). Ambiguities show up asconflictsin the parser's ACTION table, where a conflict is simply a choice of two or more actions for some particular state/token combination. The parser can perform only one action at a time, so wherever the ACTION table offers a choice, one or more possible actions have to be ignored, thereby ruling out some potentially successful parses. Obviously, the fewer conflicts the parser has, the better it is; the target when developing a grammar should be to have no conflicts at all, although this is not a fixed rule. Conflicts come in two flavours: (1) SHIFT/REDUCE conflicts, where the choice is between a shift and a reduce action; (2) REDUCE/REDUCE conflicts, where the choice is between two alternative reductions. SHIFT/SHIFT conflicts never arise simply because of the way the parsing tables are constructed. For example, we know that the original Lambda0 grammar has two alternative derivations for the input string: \ VAR . VAR VAR $ If we were to build a parser for Lambda0 and present it with this input, the parse would begin as follows: State| Stack | Input | Action -----+---------------------+-------+------- 1| | \ | SHIFT 3 3| \ | VAR | SHIFT 9 9| \ VAR | . | SHIFT 10 10| \ VAR . | VAR | SHIFT 2 2| \ VAR . VAR | | REDUCE exp --> VAR 11| \ VAR . exp | VAR | So far, so good. But there are two ways of completing this trace: (a) Stack | Input | Action ---------------------+-------+------- \ VAR . exp | VAR | SHIFT \ VAR . exp VAR | | REDUCE: exp --> VAR \ VAR . exp exp | $ | REDUCE: exp --> exp exp \ VAR . exp | $ | REDUCE: exp --> \ VAR . exp exp | $ | ACCEPT (b) Stack | Input | Action ---------------------+-------+------- \ VAR . exp | VAR | REDUCE: exp --> \ VAR . exp exp | VAR | SHIFT exp VAR | $ | REDUCE: exp --> VAR exp exp | $ | REDUCE: exp --> exp exp exp | $ | ACCEPT These correspond exactly to the two derivations of this sentence shown previously. We say that the parser has a SHIFT/REDUCE conflict in state 11 on token VAR. There are similar conflicts for the tokens '\' and '('. This conflict is the result of a genuine ambiguity in the Lambda0 grammar. Conflicts can also arise for grammars which aren't strictly ambiguous, yet which require a more powerful parser than LR(1) -- perhaps one which is prepared to look further ahead than a single token. -- . Resolving Conflicts ---------------------------------------------- The parser generator has two default rules for resolving a conflict in favour of one of the available choices: (1) in a SHIFT/REDUCE conflict, choose the SHIFT; (2) in a REDUCE/REDUCE conflict, choose the lower-numbered rule. Application of these rules reduces each conflict to a single action, which means that the parser generator can always produce a parser of some sort, even if it doesn't do very much. Also, because the rules are deterministic, it's guaranteed that the parser will be the same every time. If the resulting parser happens to do the right thing (whatever that may be) then it may be acceptable to use it just as it is. As far as rule (1) goes, experience suggests that the outcome is often good: in the example from Lambda0 above, choosing the SHIFT means opting for trace (a) which happily corresponds to the preferred interpretation of the expression. Rule (2) is more arbitrary because of its dependency on the order in which the grammar rules are presented: this can be useful occasionally, but in general a parser with SHIFT/REDUCE conflicts is more "acceptable" than one with REDUCE/REDUCE conflicts. In any case, you should never ignore conflicts in a parser unless you can understand why they arise and can be sure that their effects are benevolent. If the default rules don't produce a satisfactory outcome, there is no solution other than rewriting the grammar to eliminate conflicts altogether. At this point it is usually helpful to study the composition of those parser states in which conflicts arise to see what's causing the problem. In the Lambda0 example, the conflict lies in state 11 which is constructed from the two items: exp --> \ VAR . exp <.> exp --> exp <.> exp The first wants to generate a REDUCE for each token which could follow anexp; the second wants to generate a SHIFT for each token which could start anexp. Both sets of tokens contain VAR, '\' and '(', so there is a conflict for each of these. The two sets of symbols overlap because of the rule exp --> exp exp which, by its juxtaposition ofexpandexp, ensures that every symbol which can start anexpcan also follow one. To remove the conflict, we remove the juxtaposition, introducing a new non-terminalappexp: exp --> \ VAR . exp exp --> appexp appexp --> appexp appexp appexp --> VAR appexp --> ( exp ) This means that the only legal followers ofexpare now ')' and '$' and the conflicts from state 11 disappear. It should be no surprise that similar problems exist for the symbolappexp, and to remove conflicts entirely we need to introduce theaexpsymbol which brings us back to Lambda1. -- . Further Reading -------------------------------------------------- For more information about LALR(1) parsing, refer to the following books and papers: [1] Des WatsonHigh-Level Languages and Their CompilersAddison-Wesley, 1989 [2] Alfred V. Aho, Ravi Sethi and Jeffrey D. UllmanCompilers: Principles, Techniques and ToolsAddison-Wesley, 1986 [3] Allen I. HolubCompiler Design in CPrentice-Hall, 1990 [Contains complete code for a parser and parser generator in C.] [4] Frank DeRemer and Thomas PennelloEfficient Computation of LALR(1) Look-Ahead SetsACM TOPLAS, Vol. 4, No. 4, Oct 1982, pp. 615-649 [Describes the algorithm adopted for the LR_PARSER library.] -- Using the LR_PARSER Library ---------------------------------------- The LR_PARSER library defines only the procedural interface to the parser generator, which is somewhat low-level for normal use. Pop11 programmers should consider using the syntactic interface described in HELP * DEFINE_PARSER which uses a meta-notation for the declarative description of grammars, and allows Pop11 code to be attached to grammar rules to be executed when the rules are reduced. The library procedures described here would normally be used directly only from other Poplog languages, or in situations where a parser needs to be generated dynamically. This part of the file provides an overview of the main library procedures, with examples of their use applied to the grammars described above. Full technical information describing all the available options can be found in REF * LR_PARSER. -- . Building the Parser Tables --------------------------------------- The parser generator itself has the call form: uses lr_build; lr_build(NAME, TOKENS, SYMBOLS, START_SYMBOL, RULES) -> PARSER This constructs a parser from a symbolic description of a grammar, where: NAME is a name for the grammar: this can be any item, but is typically a word or string; TOKENS is a list or vector of terminal symbols: a token can be any item which compares with '==' (typically again, a word, or a unique string); SYMBOLS is a list or vector of non-terminal symbols: as with tokens, symbols are compared with '=='; START_SYMBOL is an item from SYMBOLS nominated as the start symbol of the grammar; RULES is a list or vector of grammar rules: each rule is itself a list or vector, whose first element is the left-hand-side of the rule and whose remaining elements make up the right-hand-side. The resulting PARSER is a data structure which retains all the symbolic information about the input grammar, but also contains the ACTION and GOTO tables for the parser. To build parsers for the Lambda grammars, we can do: vars Lambda0 = lr_build( "Lambda0", ;;; NAME [ VAR \ . ( ) ], ;;; TOKENS [ exp ], ;;; SYMBOLS "exp", ;;; START_SYMBOL [[ exp VAR ] ;;; RULES [ exp exp exp ] [ exp \ VAR . exp ] [ exp ( exp ) ]] ), Lambda1 = lr_build( "Lambda1", ;;; NAME [ VAR \ . ( ) ], ;;; TOKENS [ exp appexp aexp ], ;;; SYMBOLS "exp", ;;; START_SYMBOL [[ exp \ VAR . exp ] ;;; RULES [ exp appexp ] [ appexp appexp aexp ] [ appexp aexp ] [ aexp VAR ] [ aexp ( exp ) ]] ), ; Lambda0 => ** <parser Lambda0> Lambda1 => ** <parser Lambda1> The procedure: lr_parser(Lambda0) => ** <parser Lambda0> lr_parser(3) => ** <false> is a recogniser for parsers built by -lr_build-: applied to a PARSER structure, it returns it; otherwise it returns <false>. The symbolic information can be retrieved from a parser using: lr_parser_name(Lambda0) => ** Lambda0 lr_parser_terminal_symbols(Lambda0) => ** {VAR \ . ( )} lr_parser_nonterminal_symbols(Lambda0) => ** {exp} lr_parser_start_symbol(Lambda0) => ** exp lr_parser_rules(Lambda0) => ** {[exp VAR] [exp exp exp] [exp \ VAR . exp] [exp ( exp )]} These return the original arguments given to -lr_build-, although note that the tokens, symbols and rules are always returned as vectors, regardless of the form in which they were input: this is to allow constant-time access to numbered symbols and rules from the parser. The items within each vector are identical to those input and are in the same order. The number of SHIFT/REDUCE and REDUCE/REDUCE conflicts in the parser can be obtained with the procedures: lr_parser_sr_conflicts(Lambda0) => ** 6 lr_parser_rr_conflicts(Lambda0) => ** 0 Finally, to examine the composition of a parser in more detail, the procedure: uses lr_report; lr_report(Lambda0, 'report'); produces a file called 'report' containing a written description of every parser state with its core items, actions and gotos. The report can be directed to a named file (as here) or to any appropriate output destination, such as a character consumer or an output device. For all but the smallest of grammars the report can be very lengthy, with possibly hundreds of states and thousands of lines of output, so a named file is usually the best destination. -- . Running the Parser in Trace Mode --------------------------------- To run the parser, we need an implementation of the LR(1) algorithm which can interpret the parsing tables, provided initially by the procedure: uses lr_trace; lr_trace(INPUT, PARSER) -> PARSE_TREE lr_trace(INPUT, PARSER) -> FALSE INPUT is a list of items drawn from the PARSER's token set. If the parse is successful, -lr_trace- returns a parse tree for the derivation it found; for a failed parse, it returns just <false>. The parse tree has a leaf node for each SHIFT action, consisting of the token shifted, and an interior node for each REDUCTION, consisting of a list whose head is the non-terminal symbol from the left-hand-side of the rule being reduced, and whose tail is a sequence of sub-trees corresponding to the symbols of the right-hand-side. For example: lr_trace([VAR VAR], Lambda1) => ** [exp [appexp [appexp [aexp VAR]] [aexp VAR]]] Laying out the tree for readability we get: [exp [appexp [appexp [aexp VAR]] [aexp VAR]]] Such a tree can be displayed graphically using LIB * SHOWTREE. As a side-effect of its parsing, -lr_trace- produces a trace of its actions similar to those shown in Tables 3 and 4. When called from VED, this trace is written to a separate output file; otherwise it is written to the current output. There are various flags available for controlling output from the tracer: see REF * LR_PARSER for details. The PARSER data structure has as its -class_apply- a procedure which invokes -lr_trace- but with trace output suppressed; so we could equally well do: Lambda1([VAR VAR]) => ** [exp [appexp [appexp [aexp VAR]] [aexp VAR]]] -- . Running the Parser Efficiently ----------------------------------- An alternative implementation of the LR(1) algorithm is provided by the procedure -lr_parse-. This doesn't produce any trace output, but is considerably more efficient. The interface to this procedure is more complicated, but more flexible. The general call form looks like: lr_parse(INPUT_P, REDUCE_P, PARSER) where INPUT_P is a procedure generating the input to the parser and REDUCE_P is a procedure to use for the REDUCE actions. Any results returned by a successful parse are determined exclusively by the behaviour of REDUCE_P. The input procedure has the form: INPUT_P() -> (VALUE, TOKEN_N) where VALUE is some arbitrary value and TOKEN_N is the token number to which it corresponds. Tokens are numbered from 1 according to their order in the TOKENS list given to -lr_build-; the number 0 is used to indicate the end of input. TOKEN_N is used by the parser to index the ACTION table, but it's VALUE which is pushed on the stack for a SHIFT. The reduction procedure has the form: REDUCE_P(RULE_N) where RULE_N is the number of the rule being reduced. Beneath this on the user stack will be arguments corresponding to the symbols on the right-hand-side of the rule: each terminal symbol will be represented by the VALUE which was shifted for it, and each non-terminal symbol will be represented by any results returned by the call to REDUCE_P which reduced it. The results of a successful parse are the results returned by REDUCE_P for the final reduction of the start symbol. Because the parser has no control over these, it does not try to return any indication of whether the parse succeeded or failed. Instead, if the parse fails, the parser calls the procedure: lr_parse_error(VALUE, TOKEN_N, STATE) where VALUE and TOKEN_N describe the input token which caused the error, and STATE is the state number in which it arose. This error procedure is redefinable; its default value is: define vars lr_parse_error(value, token_n, state); lvars value, token_n, state; mishap(value, 1, 'PARSE ERROR'); enddefine; To demonstrate how this works, we can write input and reduce procedures suitable for use with the Lambda1 parser. The input procedure will read characters from the current input, stopping at the end of a line. Variables are represented as single lower case letters: `a`, ..., `z`. define INPUT_P() -> (VALUE, TOKEN_N); lvars c = cucharin(), VALUE, TOKEN_N; while c == `\s` or c == `\t` do ;;; skip leading spaces cucharin() -> c; endwhile; if c == `\n` or c == termin then (termin, 0) -> (VALUE, TOKEN_N); else consword(c, 1) -> VALUE; if islowercode(c) then 1; elseif c == `\\` then 2; elseif c == `.` then 3; elseif c == `(` then 4; elseif c == `)` then 5; else ;;; anything else is an error -1; endif -> TOKEN_N; endif; enddefine; The reduction procedure will build a parse tree, but to make things a bit different, it will reflect more the abstract syntax of the lambda calculus, stripping out extraneous productions and the tokens: '\', '.', '(' and ')'. Remember that when this procedure is called, the stack will contain one value for each symbol on the right-hand-side of the rule being reduced (refer back to Table 2 for the rule numbers). define REDUCE_P(N); lvars N, x, e, e1; go_on N to rule1 rule2 rule3 rule4 rule5 rule6; rule1: ;;; \ x . e () -> (, x, , e); return([ABS ^x ^e]); rule3: ;;; e e1 () -> (e, e1); return([APP ^e ^e1]); rule5: ;;; VAR () -> x; return([VAR ^x]); rule6: ;;; ( e ) () -> (, e, ); return(e); rule2: rule4: enddefine; Some examples of use (if you want to run these, you must mark and load both the call and the subsequent line of input): lr_parse(INPUT_P, REDUCE_P, Lambda1) => x ** [VAR x] lr_parse(INPUT_P, REDUCE_P, Lambda1) => x(yz) ** [APP [VAR x] [APP [VAR y] [VAR z]]] lr_parse(INPUT_P, REDUCE_P, Lambda1) => (\x.xx)(\x.xx) ** [APP [ABS x [APP [VAR x] [VAR x]]] [ABS x [APP [VAR x] [VAR x]]]] lr_parse(INPUT_P, REDUCE_P, Lambda1) => \xx ;;; MISHAP - PARSE ERROR ;;; INVOLVING: x ;;; FILE : lr_parser LINE NUMBER: 1138 ;;; DOING : lr_parse_error lr_parse compile --- C.all/help/lr_parser --- Copyright University of Sussex 1992. All rights reserved. ----------