├── .gitignore ├── .travis.yml ├── ChangeLog ├── LICENSE ├── NOTES ├── README.md ├── doc ├── yapps2.man ├── yapps2.tex └── yapps_grammar.g ├── examples ├── calc.g ├── expr.g ├── lisp.g ├── notes └── xml.g ├── setup.py ├── test.sh ├── test ├── empty_clauses.g ├── line_numbers.g └── option.g ├── yapps ├── __init__.py ├── cli_tool.py ├── grammar.py ├── parsetree.py └── runtime.py └── yapps2 /.gitignore: -------------------------------------------------------------------------------- 1 | /.gitattributes 2 | /*.egg-info 3 | /build 4 | /dist 5 | /README.txt 6 | *.pyc 7 | *.pyo 8 | /examples/expr.py 9 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "2.7" 4 | - "3.4" 5 | - "3.5" 6 | - "3.6" 7 | - "3.7" 8 | - "pypy" 9 | script: sh test.sh 10 | -------------------------------------------------------------------------------- /ChangeLog: -------------------------------------------------------------------------------- 1 | 2003-08-27 Amit Patel 2 | 3 | * *: (VERSION) Release 2.1.1 4 | 5 | * *: Added a test/ directory for test cases; I had previously put 6 | tests in the examples/ directory, which is a bad place to put 7 | them. Examples are useful for learning how Yapps works. Tests 8 | are for testing specific features of Yapps. 9 | 10 | * parsetree.py (Plus.update): Fixed a long-standing bug in which 11 | the FOLLOW set of 'a'+ would include 'a'. In theory this makes no 12 | practical difference because the 'a'+ rule eats up all the 'a' 13 | tokens anyway. However, it makes error messages a little bit more 14 | confusing because they imply that an 'a' can follow. 15 | 16 | * yappsrt.py (print_error): Incorporated the context object into 17 | the printing of error messages. 18 | 19 | 2003-08-12 Amit Patel 20 | 21 | * *: (VERSION) Release 2.1.0 22 | 23 | * parsetree.py: Improved error message generation. Instead of 24 | relying on the scanner to produce errors, the parser now checks 25 | things explicitly and produces errors directly. The parser has 26 | better knowledge of the context, so its error messages are more 27 | precise and helpful. 28 | 29 | * yapps_grammar.g: Instead of setting self.rule in the setup() 30 | method, pass it in the constructor. To make it available at 31 | construction time, pass it along as another attribute in the 32 | attribute grammar. 33 | 34 | 2003-08-11 Amit Patel 35 | 36 | * parsetree.py: Generated parsers now include a context object 37 | that describes the parse rule stack. For example, while parsing 38 | rule A, called from rule B, called from rule D, the context object 39 | will let you reconstruct the path D > B > A. [Thanks David Morley] 40 | 41 | * *: Removed all output when things are working 42 | properly; all warnings/errors now go to stderr. 43 | 44 | * yapps_grammar.g: Added support for A? meaning an optional A. 45 | This is equivalent to [A]. 46 | 47 | * yapps2.py: Design - refactored yapps2.py into yapps2.py + 48 | grammar.py + parsetree.py. grammar.py is automatically generated 49 | from grammar.g. Added lots of docstrings. 50 | 51 | 2003-08-09 Amit Patel 52 | 53 | * yapps2.py: Documentation - added doctest tests to some of the 54 | set algorithms in class Generator. 55 | 56 | * yapps2.py: Style - removed "import *" everywhere. 
57 | 58 | * yapps2.py: Style - moved to Python 2 -- string methods, 59 | list comprehensions, inline syntax for apply 60 | 61 | 2003-07-28 Amit Patel 62 | 63 | * *: (VERSION) Release 2.0.4 64 | 65 | * yappsrt.py: Style - replaced raising string exceptions 66 | with raising class exceptions. [Thanks Alex Verstak] 67 | 68 | * yappsrt.py: (SyntaxError) Bug fix - SyntaxError.__init__ should 69 | call Exception.__init__ 70 | 71 | * yapps2.py: Bug fix - identifiers in grammar rules that had 72 | digits in them were not accessible in the {{python code}} sections 73 | of the grammar. 74 | 75 | * yapps2.py: Style - changed "b >= a and b < c" to "a <= b < c" 76 | 77 | * yapps2.py: Style - change "`expr`" to "repr(expr)" 78 | 79 | 2002-08-00 Amit Patel 80 | 81 | * *: (VERSION) Release 2.0.3 82 | 83 | * yapps2.py: Bug fix - inline tokens using the r"" syntax weren't 84 | treated properly. 85 | 86 | 2002-04-00 Amit Patel 87 | 88 | * *: (VERSION) Release 2.0.2 89 | 90 | * yapps2.py: Bug fix - when generating the "else" clause, if the 91 | comment was too long, Yapps was not emitting a newline. [Thanks 92 | Steven Engelhardt] 93 | 94 | 2001-10-00 Amit Patel 95 | 96 | * *: (VERSION) Release 2.0.1 97 | 98 | * yappsrt.py: (SyntaxError) Style - the exception classes now 99 | inherit from Exception. [Thanks Rich Salz] 100 | 101 | * yappsrt.py: (Scanner) Performance - instead of passing the set 102 | of tokens into the scanner at initialization time, we build the 103 | list at compile time. You can still override the default list per 104 | instance of the scanner, but in the common case, we don't have to 105 | rebuild the token list. [Thanks Amaury Forgeot d'Arc] 106 | 107 | 108 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included 12 | in all copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 15 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /NOTES: -------------------------------------------------------------------------------- 1 | [Last updated August 11, 2003] 2 | 3 | Notes for myself: 4 | 5 | Document the LINENO trick 6 | 7 | Add a way to have a self-contained mode that doesn't require yappsrt? 8 | 9 | Add a debugging mode that helps you understand how the grammar 10 | is constructed and how things are being parsed 11 | 12 | Optimize (remove) unused variables 13 | 14 | Yapps produces a bunch of inline list literals. 
We should be able to
15 | instead create these lists as class variables (but this makes it
16 | harder to read the code). Also, 'A in X' could be written
17 | 'X.has_key(A)' if we can convert the lists into dictionaries ahead
18 | of time.
19 |
20 | Add a convenience to automatically gather up the values returned
21 | from subpatterns, put them into a list, and return them
22 |
23 | "Gather" mode that simply outputs the return values for certain nodes.
24 | For example, if you just want all expressions, you could ask yapps
25 | to gather the results of the 'expr' rule into a list. This would
26 | ignore all the higher level structure.
27 |
28 | Improve the documentation
29 |
30 | Write some larger examples (probably XML/HTML)
31 |
32 | EOF needs to be dealt with. It's probably a token that can match anywhere.
33 |
34 | Get rid of old-style regex support
35 |
36 | Use SRE's lex support to speed up lexing (this may be hard given that
37 | yapps allows for context-sensitive lexers)
38 |
39 | Look over Dan Connolly's experience with Yapps (bugs, frustrations, etc.)
40 | and see what improvements could be made
41 |
42 | Add something to pretty-print the grammar (without the actions)
43 |
44 | Maybe conditionals? Follow this rule only if some condition holds.
45 | But this would be useful mainly when multiple rules match, and we
46 | want the first matching rule. The conditional would mean we skip to
47 | the next rule. Maybe this is part of the attribute grammar system,
48 | where rule X<0> can be specified separately from X.
49 |
50 | Convenience functions that could build return values for all rules
51 | without specifying the code for each rule individually
52 |
53 | Patterns (abstractions over rules) -- for example, comma separated values
54 | have a certain rule pattern that gets replicated all over the place
55 |
56 | These are rules that take other rules as parameters.
57 |
58 | rule list: {{ result = [] }}
59 |     [ element {{ result.append(element) }}
60 |         ( separator element {{ result.append(element) }}
61 |         )*
62 |     ] {{ return result }}
63 |
64 | Inheritance of parser and scanner classes. The base class (Parser)
65 | may define common tokens like ID, STR, NUM, space, comments, EOF,
66 | etc., and common rules/patterns like optional, sequence,
67 | delimiter-separated sequence.
68 |
69 | Why do A? and (A | ) produce different code? It seems that they
70 | should produce the very same code.
71 |
72 | Look at everyone's Yapps grammars, and come up with larger examples
73 | http://www.w3.org/2000/10/swap/SemEnglish.g
74 | http://www.w3.org/2000/10/swap/kifExpr.g
75 | http://www.w3.org/2000/10/swap/rdfn3.g
76 |
77 | Construct lots of erroneous grammars and see what Yapps does with them
78 | (improve error reporting)
79 |
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | YAPPS: Yet Another Python Parser System
2 | ----------------------------------------
3 |
4 | For the most complete and excellent documentation (e.g. [manual with
5 | examples](http://theory.stanford.edu/~amitp/yapps/yapps2/manual/)) and info,
6 | please see the original project website: http://theory.stanford.edu/~amitp/yapps/
7 |
8 | YAPPS is an easy-to-use parser generator that is written in Python and generates
9 | Python code.
10 | There are several parser generator systems already available for Python, but
11 | this parser has different goals: Yapps is simple, very easy to use, and produces
12 | human-readable parsers.
13 |
14 | It is not the fastest or most powerful parser.
15 | Yapps is designed to be used when regular expressions are not enough and other
16 | parser systems are too much: situations where you might otherwise write your own
17 | recursive descent parser.
18 |
19 | This fork contains several upward-compatible enhancements to the original
20 | YAPPS source, originally included in the [debian package](http://packages.debian.org/sid/yapps2):
21 |
22 | * Handle stacked input ("include files").
23 | * Augmented ignore-able patterns (can parse multi-line C comments correctly).
24 | * Better error reporting.
25 | * Read input incrementally.
26 |
27 |
28 | Installation
29 | ----------------------------------------
30 |
31 | It's a regular package for Python 2.7 (not 3.X, but there are links to 3.X
32 | patches listed on the [original author
33 | website](http://theory.stanford.edu/~amitp/yapps/)), but it is not on PyPI, so it can be
34 | installed from a checkout with something like this:
35 |
36 | % python setup.py install
37 |
38 | A better way is to use [pip](http://pip-installer.org/) to install all the
39 | necessary dependencies as well:
40 |
41 | % pip install 'git+https://github.com/mk-fg/yapps.git#egg=yapps'
42 |
43 | Note that installing into the system-wide PATH and site-packages usually
44 | requires elevated privileges.
45 | Use "install --user",
46 | [~/.pydistutils.cfg](http://docs.python.org/install/index.html#distutils-configuration-files)
47 | or [virtualenv](http://pypi.python.org/pypi/virtualenv) to do unprivileged
48 | installs into custom paths.
49 |
50 | Alternatively, `./yapps2` can be run right from the checkout tree, without any
51 | installation.
52 |
53 | No extra package dependencies.
-------------------------------------------------------------------------------- /doc/yapps2.man: --------------------------------------------------------------------------------
1 | .TH YAPPS2 1
2 | .SH NAME
3 | yapps2 \- generate python parser code from grammar description file
4 | .SH SYNOPSIS
5 | .B yapps2
6 | [\fB\-h\fR]
7 | [\fB\-i\fR, \fB\-\-context-insensitive-scanner\fR]
8 | [\fB\-t\fR, \fB\-\-indent-with-tabs\fR]
9 | [\fB\-\-dump\fR]
10 | .IR grammar_file [ parser_file ]
11 | .SH DESCRIPTION
12 | .B yapps2
13 | generates python parser code from a grammar description file.
14 | .SH OPTIONS
15 | .TP
16 | .BR \-h ", " \-\-help\fR
17 | Show a help message and exit.
18 | .TP
19 | .BR \-i ", " \-\-context-insensitive-scanner\fR
20 | Scan all tokens. See the documentation for details.
21 | .TP
22 | .BR \-t ", " \-\-indent-with-tabs\fR
23 | Use tabs instead of four spaces for indentation in generated code.
24 | .TP
25 | .BR \-\-dump\fR
26 | Dump out grammar information.
27 | .TP
28 | .BR grammar_file
29 | Grammar description file (input).
30 | .TP
31 | .BR parser_file
32 | Name of the output file to be generated.
33 | .BR
34 | If omitted, the grammar file's name with .py appended will be used.
35 | \"-\" or \"/dev/stdout\" can be used to send generated code to stdout.
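.SH EXAMPLES
The following invocations are illustrative; the grammar file name calc.g is
only an example, not a file shipped with yapps2.
.nf
yapps2 calc.g                  write the generated parser to calc.py
yapps2 calc.g my_parser.py     write the generated parser to my_parser.py
yapps2 \-i calc.g              generate with a context\-insensitive scanner
.fi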
36 | -------------------------------------------------------------------------------- /doc/yapps2.tex: -------------------------------------------------------------------------------- 1 | \documentclass[10pt]{article} 2 | \usepackage{palatino} 3 | \usepackage{html} 4 | \usepackage{color} 5 | 6 | \setlength{\headsep}{0in} 7 | \setlength{\headheight}{0in} 8 | \setlength{\textheight}{8.5in} 9 | \setlength{\textwidth}{5.9in} 10 | \setlength{\oddsidemargin}{0.25in} 11 | 12 | \definecolor{darkblue}{rgb}{0,0,0.6} 13 | \definecolor{darkerblue}{rgb}{0,0,0.3} 14 | 15 | %% \newcommand{\mysection}[1]{\section{\textcolor{darkblue}{#1}}} 16 | %% \newcommand{\mysubsection}[1]{\subsection{\textcolor{darkerblue}{#1}}} 17 | \newcommand{\mysection}[1]{\section{#1}} 18 | \newcommand{\mysubsection}[1]{\subsection{#1}} 19 | 20 | \bodytext{bgcolor=white text=black link=#004080 vlink=#006020} 21 | 22 | \newcommand{\first}{\textsc{first}} 23 | \newcommand{\follow}{\textsc{follow}} 24 | 25 | \begin{document} 26 | 27 | \begin{center} 28 | \hfill \begin{tabular}{c} 29 | {\Large The \emph{Yapps} Parser Generator System}\\ 30 | \verb|http://theory.stanford.edu/~amitp/Yapps/|\\ 31 | Version 2\\ 32 | \\ 33 | Amit J. Patel\\ 34 | \htmladdnormallink{http://www-cs-students.stanford.edu/~amitp/} 35 | {http://www-cs-students.stanford.edu/~amitp/} 36 | 37 | \end{tabular} \hfill \rule{0in}{0in} 38 | \end{center} 39 | 40 | \mysection{Introduction} 41 | 42 | \emph{Yapps} (\underline{Y}et \underline{A}nother \underline{P}ython 43 | \underline{P}arser \underline{S}ystem) is an easy to use parser 44 | generator that is written in Python and generates Python code. There 45 | are several parser generator systems already available for Python, 46 | including \texttt{PyLR, kjParsing, PyBison,} and \texttt{mcf.pars,} 47 | but I had different goals for my parser. Yapps is simple, is easy to 48 | use, and produces human-readable parsers. It is not the fastest or 49 | most powerful parser. Yapps is designed to be used when regular 50 | expressions are not enough and other parser systems are too much: 51 | situations where you may write your own recursive descent parser. 52 | 53 | Some unusual features of Yapps that may be of interest are: 54 | 55 | \begin{enumerate} 56 | 57 | \item Yapps produces recursive descent parsers that are readable by 58 | humans, as opposed to table-driven parsers that are difficult to 59 | read. A Yapps parser for a simple calculator looks similar to the 60 | one that Mark Lutz wrote by hand for \emph{Programming Python.} 61 | 62 | \item Yapps also allows for rules that accept parameters and pass 63 | arguments to be used while parsing subexpressions. Grammars that 64 | allow for arguments to be passed to subrules and for values to be 65 | passed back are often called \emph{attribute grammars.} In many 66 | cases parameterized rules can be used to perform actions at ``parse 67 | time'' that are usually delayed until later. For example, 68 | information about variable declarations can be passed into the 69 | rules that parse a procedure body, so that undefined variables can 70 | be detected at parse time. The types of defined variables can be 71 | used in parsing as well---for example, if the type of {\tt X} is 72 | known, we can determine whether {\tt X(1)} is an array reference or 73 | a function call. 74 | 75 | \item Yapps grammars are fairly easy to write, although there are 76 | some inconveniences having to do with ELL(1) parsing that have to be 77 | worked around. 
For example, rules have to be left factored and 78 | rules may not be left recursive. However, neither limitation seems 79 | to be a problem in practice. 80 | 81 | Yapps grammars look similar to the notation used in the Python 82 | reference manual, with operators like \verb:*:, \verb:+:, \verb:|:, 83 | \verb:[]:, and \verb:(): for patterns, names ({\tt tim}) for rules, 84 | regular expressions (\verb:"[a-z]+":) for tokens, and \verb:#: for 85 | comments. 86 | 87 | \item The Yapps parser generator is written as a single Python module 88 | with no C extensions. Yapps produces parsers that are written 89 | entirely in Python, and require only the Yapps run-time module (5k) 90 | for support. 91 | 92 | \item Yapps's scanner is context-sensitive, picking tokens based on 93 | the types of the tokens accepted by the parser. This can be 94 | helpful when implementing certain kinds of parsers, such as for a 95 | preprocessor. 96 | 97 | \end{enumerate} 98 | 99 | There are several disadvantages of using Yapps over another parser system: 100 | 101 | \begin{enumerate} 102 | 103 | \item Yapps parsers are \texttt{ELL(1)} (Extended LL(1)), which is 104 | less powerful than \texttt{LALR} (used by \texttt{PyLR}) or 105 | \texttt{SLR} (used by \texttt{kjParsing}), so Yapps would not be a 106 | good choice for parsing complex languages. For example, allowing 107 | both \texttt{x := 5;} and \texttt{x;} as statements is difficult 108 | because we must distinguish based on only one token of lookahead. 109 | Seeing only \texttt{x}, we cannot decide whether we have an 110 | assignment statement or an expression statement. (Note however 111 | that this kind of grammar can be matched with backtracking; see 112 | section \ref{sec:future}.) 113 | 114 | \item The scanner that Yapps provides can only read from strings, not 115 | files, so an entire file has to be read in before scanning can 116 | begin. It is possible to build a custom scanner, though, so in 117 | cases where stream input is needed (from the console, a network, or 118 | a large file are examples), the Yapps parser can be given a custom 119 | scanner that reads from a stream instead of a string. 120 | 121 | \item Yapps is not designed with efficiency in mind. 122 | 123 | \end{enumerate} 124 | 125 | Yapps provides an easy to use parser generator that produces parsers 126 | similar to what you might write by hand. It is not meant to be a 127 | solution for all parsing problems, but instead an aid for those times 128 | you would write a parser by hand rather than using one of the more 129 | powerful parsing packages available. 130 | 131 | Yapps 2.0 is easier to use than Yapps 1.0. New features include a 132 | less restrictive input syntax, which allows mixing of sequences, 133 | choices, terminals, and nonterminals; optional matching; the ability 134 | to insert single-line statements into the generated parser; and 135 | looping constructs \verb|*| and \verb|+| similar to the repetitive 136 | matching constructs in regular expressions. Unfortunately, the 137 | addition of these constructs has made Yapps 2.0 incompatible with 138 | Yapps 1.0, so grammars will have to be rewritten. See section 139 | \ref{sec:Upgrading} for tips on changing Yapps 1.0 grammars for use 140 | with Yapps 2.0. 141 | 142 | \mysection{Examples} 143 | 144 | In this section are several examples that show the use of Yapps. 145 | First, an introduction shows how to construct grammars and write them 146 | in Yapps form. 
This example can be skipped by someone familiar with
147 | grammars and parsing. Next is a Lisp expression grammar that produces
148 | a parse tree as output. This example demonstrates the use of tokens
149 | and rules, as well as returning values from rules. The third example
150 | is an expression evaluation grammar that evaluates during parsing
151 | (instead of producing a parse tree).
152 |
153 | \mysubsection{Introduction to Grammars}
154 |
155 | A \emph{grammar} for a natural language specifies how words can be put
156 | together to form large structures, such as phrases and sentences. A
157 | grammar for a computer language is similar in that it specifies how
158 | small components (called \emph{tokens}) can be put together to form
159 | larger structures. In this section we will write a grammar for a tiny
160 | subset of English.
161 |
162 | Simple English sentences can be described as being a noun phrase
163 | followed by a verb followed by a noun phrase. For example, in the
164 | sentence, ``Jack sank the blue ship,'' the word ``Jack'' is the first
165 | noun phrase, ``sank'' is the verb, and ``the blue ship'' is the second
166 | noun phrase. In addition we should say what a noun phrase is; for
167 | this example we shall say that a noun phrase is an optional article
168 | (a, an, the) followed by any number of adjectives followed by a noun.
169 | The tokens in our language are the articles, nouns, verbs, and
170 | adjectives. The \emph{rules} in our language will tell us how to
171 | combine the tokens together to form lists of adjectives, noun phrases,
172 | and sentences:
173 |
174 | \begin{itemize}
175 | \item \texttt{sentence: noun\_phrase verb noun\_phrase}
176 | \item \texttt{noun\_phrase: [article] adjective* noun}
177 | \end{itemize}
178 |
179 | Notice that some things that we said easily in English, such as
180 | ``optional article,'' are expressed using special syntax, such as
181 | brackets. When we said, ``any number of adjectives,'' we wrote
182 | \texttt{adjective*}, where the \texttt{*} means ``zero or more of the
183 | preceding pattern''.
184 |
185 | The grammar given above is close to a Yapps grammar. We also have to
186 | specify what the tokens are, and what to do when a pattern is matched.
187 | For this example, we will do nothing when patterns are matched; the
188 | next example will explain how to perform match actions.
189 |
190 | \begin{verbatim}
191 | parser TinyEnglish:
192 |     ignore: "\\W+"
193 |     token noun: "(Jack|spam|ship)"
194 |     token verb: "(sank|threw)"
195 |     token article: "(an|a|the)"
196 |     token adjective: "(blue|red|green)"
197 |
198 |     rule sentence: noun_phrase verb noun_phrase
199 |     rule noun_phrase: [article] adjective* noun
200 | \end{verbatim}
201 |
202 | The tokens are specified as Python \emph{regular expressions}. Since
203 | Yapps produces Python code, you can write any regular expression that
204 | would be accepted by Python. (\emph{Note:} These are Python 1.5
205 | regular expressions from the \texttt{re} module, not Python 1.4
206 | regular expressions from the \texttt{regex} module.) In addition to
207 | tokens that you want to see (which are given names), you can also
208 | specify tokens to ignore, marked by the \texttt{ignore} keyword. In
209 | this parser we want to ignore whitespace.
210 |
211 | The TinyEnglish grammar shows how you define tokens and rules, but it
212 | does not specify what should happen once we've matched the rules.
In
213 | the next example, we will take a grammar and produce a \emph{parse
214 | tree} from it.
215 |
216 | \mysubsection{Lisp Expressions}
217 |
218 | Lisp syntax, although hated by many, has a redeeming quality: it is
219 | simple to parse. In this section we will construct a Yapps grammar to
220 | parse Lisp expressions and produce a parse tree as output.
221 |
222 | \subsubsection*{Defining the Grammar}
223 |
224 | The syntax of Lisp is simple. It has expressions, which are
225 | identifiers, strings, numbers, and lists. A list is a left
226 | parenthesis followed by some number of expressions (separated by
227 | spaces) followed by a right parenthesis. For example, \verb|5|,
228 | \verb|"ni"|, and \verb|(print "1+2 = " (+ 1 2))| are Lisp expressions.
229 | Written as a grammar,
230 |
231 | \begin{verbatim}
232 | expr: ID | STR | NUM | list
233 | list: ( expr* )
234 | \end{verbatim}
235 |
236 | In addition to having a grammar, we need to specify what to do every
237 | time something is matched. For the tokens, which are strings, we just
238 | want to get the ``value'' of the token, attach its type (identifier,
239 | string, or number) in some way, and return it. For the lists, we want
240 | to construct and return a Python list.
241 |
242 | Once some pattern is matched, we write a return statement enclosed
243 | in \verb|{{...}}|. The braces allow us to insert any one-line
244 | statement into the parser. Within this statement, we can refer to the
245 | values returned by matching each part of the rule. After matching a
246 | token such as \texttt{ID}, ``ID'' will be bound to the text of the
247 | matched token. Let's take a look at the rule:
248 |
249 | \begin{verbatim}
250 | rule expr: ID {{ return ('id', ID) }}
251 |   ...
252 | \end{verbatim}
253 |
254 | In a rule, tokens return the text that was matched. For identifiers,
255 | we just return the identifier, along with a ``tag'' telling us that
256 | this is an identifier and not a string or some other value. Sometimes
257 | we may need to convert this text to a different form. For example, if
258 | a string is matched, we want to remove quotes and handle special forms
259 | like \verb|\n|. If a number is matched, we want to convert it into a
260 | number. Let's look at the return values for the other tokens:
261 |
262 | \begin{verbatim}
263 |   ...
264 |   | STR {{ return ('str', eval(STR)) }}
265 |   | NUM {{ return ('num', atoi(NUM)) }}
266 |   ...
267 | \end{verbatim}
268 |
269 | If we get a string, we want to remove the quotes and process any
270 | special backslash codes, so we run \texttt{eval} on the quoted string.
271 | If we get a number, we convert it to an integer with \texttt{atoi} and
272 | then return the number along with its type tag.
273 |
274 | For matching a list, we need to do something slightly more
275 | complicated. If we match a Lisp list of expressions, we want to
276 | create a Python list with those values.
277 |
278 | \begin{verbatim}
279 | rule list: "\\("                  # Match the opening parenthesis
280 |     {{ result = [] }}             # Create a Python list
281 |     (
282 |       expr                        # When we match an expression,
283 |       {{ result.append(expr) }}   # add it to the list
284 |     )*                            # * means repeat this if needed
285 |     "\\)"                         # Match the closing parenthesis
286 |     {{ return result }}           # Return the Python list
287 | \end{verbatim}
288 |
289 | In this rule we first match the opening parenthesis, then go into a
290 | loop. In this loop we match expressions and add them to the list.
When there are no more expressions to match, we match the closing
292 | parenthesis and return the resulting list. Note that \verb:#: is used for
293 | comments, just as in Python.
294 |
295 | The complete grammar is specified as follows:
296 | \begin{verbatim}
297 | parser Lisp:
298 |     ignore: '\\s+'
299 |     token NUM: '[0-9]+'
300 |     token ID: '[-+*/!@%^&=.a-zA-Z0-9_]+'
301 |     token STR: '"([^\\"]+|\\\\.)*"'
302 |
303 |     rule expr: ID     {{ return ('id', ID) }}
304 |       | STR           {{ return ('str', eval(STR)) }}
305 |       | NUM           {{ return ('num', atoi(NUM)) }}
306 |       | list          {{ return list }}
307 |     rule list: "\\("  {{ result = [] }}
308 |       ( expr          {{ result.append(expr) }}
309 |       )*
310 |       "\\)"           {{ return result }}
311 | \end{verbatim}
312 |
313 | One thing you may have noticed is that \verb|"\\("| and \verb|"\\)"|
314 | appear in the \texttt{list} rule. These are \emph{inline tokens}:
315 | they appear in the rules without being given a name with the
316 | \texttt{token} keyword. Inline tokens are more convenient to use, but
317 | since they do not have a name, the text that is matched cannot be used
318 | in the return value. They are best used for short simple patterns
319 | (usually punctuation or keywords).
320 |
321 | Another thing to notice is that the number and identifier tokens
322 | overlap. For example, ``487'' matches both NUM and ID. In Yapps, the
323 | scanner only tries to match tokens that are acceptable to the parser.
324 | This rule doesn't help here, since both NUM and ID can appear in the
325 | same place in the grammar. There are two rules used to pick tokens if
326 | more than one matches. One is that the \emph{longest} match is
327 | preferred. For example, ``487x'' will match as an ID (487x) rather
328 | than as a NUM (487) followed by an ID (x). The second rule is that if
329 | the two matches are the same length, the \emph{first} one listed in
330 | the grammar is preferred. For example, ``487'' will match as a NUM
331 | rather than an ID because NUM is listed first in the grammar. Inline
332 | tokens have preference over any tokens you have listed.
333 |
334 | Now that our grammar is defined, we can run Yapps to produce a parser,
335 | and then run the parser to produce a parse tree.
336 |
337 | \subsubsection*{Running Yapps}
338 |
339 | In the Yapps module is a function \texttt{generate} that takes an
340 | input filename and writes a parser to another file. We can use this
341 | function to generate the Lisp parser, which is assumed to be in
342 | \texttt{lisp.g}.
343 |
344 | \begin{verbatim}
345 | % python
346 | Python 1.5.1 (#1, Sep 3 1998, 22:51:17) [GCC 2.7.2.3] on linux-i386
347 | Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
348 | >>> import yapps
349 | >>> yapps.generate('lisp.g')
350 | \end{verbatim}
351 |
352 | At this point, Yapps has written a file \texttt{lisp.py} that contains
353 | the parser. In that file are two classes (one scanner and one parser)
354 | and a function (called \texttt{parse}) that puts things together for
355 | you.
356 |
357 | Alternatively, we can run Yapps from the command line to generate the
358 | parser file:
359 |
360 | \begin{verbatim}
361 | % python yapps.py lisp.g
362 | \end{verbatim}
363 |
364 | After running Yapps either from within Python or from the command
365 | line, we can use the Lisp parser by calling the \texttt{parse}
366 | function. The first parameter should be the rule we want to match,
367 | and the second parameter should be the string to parse.
368 |
369 | \begin{verbatim}
370 | >>> import lisp
371 | >>> lisp.parse('expr', '(+ 3 4)')
372 | [('id', '+'), ('num', 3), ('num', 4)]
373 | >>> lisp.parse('expr', '(print "3 = " (+ 1 2))')
374 | [('id', 'print'), ('str', '3 = '), [('id', '+'), ('num', 1), ('num', 2)]]
375 | \end{verbatim}
376 |
377 | The \texttt{parse} function is not the only way to use the parser;
378 | section \ref{sec:Parser-Objects} describes how to access parser objects
379 | directly.
380 |
381 | We've now gone through the steps in creating a grammar, writing a
382 | grammar file for Yapps, producing a parser, and using the parser. In
383 | the next example we'll see how rules can take parameters and also how
384 | to do computations instead of just returning a parse tree.
385 |
386 | \mysubsection{Calculator}
387 |
388 | A common example parser given in many textbooks is that for simple
389 | expressions, with numbers, addition, subtraction, multiplication,
390 | division, and parenthesization of subexpressions. We'll write this
391 | example in Yapps, evaluating the expression as we parse.
392 |
393 | Unlike \texttt{yacc}, Yapps does not have any way to specify
394 | precedence rules, so we have to do it ourselves. We say that an
395 | expression is the sum of factors, that a factor is the product of
396 | terms, and that a term is a number or a parenthesized expression:
397 |
398 | \begin{verbatim}
399 | expr: factor ( ("+"|"-") factor )*
400 | factor: term ( ("*"|"/") term )*
401 | term: NUM | "(" expr ")"
402 | \end{verbatim}
403 |
404 | In order to evaluate the expression as we go, we should keep an
405 | accumulator while evaluating the lists of terms or factors. Just as
406 | we kept a ``result'' variable to build a parse tree for Lisp
407 | expressions, we will use a variable to evaluate numerical
408 | expressions. The full grammar is given below:
409 |
410 | \begin{verbatim}
411 | parser Calculator:
412 |     token END: "$"    # $ means end of string
413 |     token NUM: "[0-9]+"
414 |
415 |     rule goal: expr END       {{ return expr }}
416 |
417 |     # An expression is the sum and difference of factors
418 |     rule expr: factor         {{ v = factor }}
419 |       ( "[+]" factor          {{ v = v+factor }}
420 |       | "-" factor            {{ v = v-factor }}
421 |       )*                      {{ return v }}
422 |
423 |     # A factor is the product and division of terms
424 |     rule factor: term         {{ v = term }}
425 |       ( "[*]" term            {{ v = v*term }}
426 |       | "/" term              {{ v = v/term }}
427 |       )*                      {{ return v }}
428 |
429 |     # A term is either a number or an expression surrounded by parentheses
430 |     rule term: NUM            {{ return atoi(NUM) }}
431 |       | "\\(" expr "\\)"      {{ return expr }}
432 | \end{verbatim}
433 |
434 | The top-level rule is \emph{goal}, which says that we are looking for
435 | an expression followed by the end of the string. The \texttt{END}
436 | token is needed because without it, it isn't clear when to stop
437 | parsing. For example, the string ``1+3'' could be parsed either as
438 | the expression ``1'' followed by the string ``+3'' or it could be
439 | parsed as the expression ``1+3''. By requiring expressions to end
440 | with \texttt{END}, the parser is forced to take ``1+3''.
441 |
442 | In the two rules with repetition, the accumulator is named \texttt{v}.
443 | After reading in one expression, we initialize the accumulator. Each
444 | time through the loop, we modify the accumulator by adding,
445 | subtracting, multiplying by, or dividing the previous accumulator by
446 | the expression that has been parsed. At the end of the rule, we
447 | return the accumulator.
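To make this concrete, here is a hypothetical session with the generated
parser (assuming the grammar above was saved as \texttt{calc.g} and compiled
to \texttt{calc.py}; since this grammar declares no \texttt{ignore} pattern,
the input must not contain spaces):

\begin{verbatim}
>>> import calc
>>> calc.parse('goal', '1+2*3')    # factor binds tighter than "+"
7
>>> calc.parse('goal', '(1+2)*3')  # parentheses override precedence
9
\end{verbatim}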
448 |
449 | The calculator example shows how to process lists of elements using
450 | loops, as well as how to handle precedence of operators.
451 |
452 | \emph{Note:} It's often important to put the \texttt{END} token in, so
453 | put it in unless you are sure that your grammar has some other
454 | non-ambiguous token marking the end of the program.
455 |
456 | \mysubsection{Calculator with Memory}
457 |
458 | In the previous example we learned how to write a calculator that
459 | evaluates simple numerical expressions. In this section we will
460 | extend the example to support both local and global variables.
461 |
462 | To support global variables, we will add assignment statements to the
463 | ``goal'' rule.
464 |
465 | \begin{verbatim}
466 | rule goal: expr END         {{ return expr }}
467 |   | 'set' ID expr END       {{ global_vars[ID] = expr }}
468 |                             {{ return expr }}
469 | \end{verbatim}
470 |
471 | To use these variables, we need a new kind of terminal:
472 |
473 | \begin{verbatim}
474 | rule term: ... | ID {{ return global_vars[ID] }}
475 | \end{verbatim}
476 |
477 | So far, these changes are straightforward. We simply have a global
478 | dictionary \texttt{global\_vars} that stores the variables and values,
479 | we modify it when there is an assignment statement, and we look up
480 | variables in it when we see a variable name.
481 |
482 | To support local variables, we will add variable declarations to the
483 | set of allowed expressions.
484 |
485 | \begin{verbatim}
486 | rule term: ... | 'let' VAR '=' expr 'in' expr ...
487 | \end{verbatim}
488 |
489 | This is where it becomes tricky. Local variables should be stored in
490 | a local dictionary, not in the global one. One trick would be to save
491 | a copy of the global dictionary, modify it, and then restore it
492 | later. In this example we will instead use \emph{attributes} to
493 | create local information and pass it to subrules.
494 |
495 | A rule can optionally take parameters. When we invoke the rule, we
496 | must pass in arguments. For local variables, let's use a single
497 | parameter, \texttt{local\_vars}:
498 |
499 | \begin{verbatim}
500 | rule expr<<local_vars>>: ...
501 | rule factor<<local_vars>>: ...
502 | rule term<<local_vars>>: ...
503 | \end{verbatim}
504 |
505 | Each time we want to match \texttt{expr}, \texttt{factor}, or
506 | \texttt{term}, we will pass the local variables in the current rule to
507 | the subrule. One interesting case is when we pass as an argument
508 | something \emph{other} than \texttt{local\_vars}:
509 |
510 | \begin{verbatim}
511 | rule term<<local_vars>>: ...
512 |   | 'let' VAR '=' expr<<local_vars>>
513 |     {{ local_vars = [(VAR, expr)] + local_vars }}
514 |     'in' expr<<local_vars>>
515 |     {{ return expr }}
516 | \end{verbatim}
517 |
518 | Note that the assignment to the local variables list does not modify
519 | the original list. This is important to keep local variables from
520 | being seen outside the ``let''.
521 |
522 | The other interesting case is when we find a variable:
523 |
524 | \begin{verbatim}
525 | global_vars = {}
526 |
527 | def lookup(map, name):
528 |     for x,v in map:
            if x==name: return v
529 |     return global_vars[name]
530 | %%
531 | ...
532 | rule term<<local_vars>>: ...
533 |   | VAR {{ return lookup(local_vars, VAR) }}
534 | \end{verbatim}
535 |
536 | The lookup function will search through the local variable list, and
537 | if it cannot find the name there, it will look it up in the global
538 | variable dictionary.
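Because new bindings are prepended to \texttt{local\_vars} and
\texttt{lookup} returns the first match it finds, an inner \texttt{let}
shadows an outer binding of the same name. A hypothetical session
(assuming the complete grammar mentioned below has been compiled to
\texttt{calc.py}) illustrates this:

\begin{verbatim}
>>> import calc
>>> calc.parse('goal', 'let x = 1 in let x = 2 in x+x')
4
\end{verbatim}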
539 |
540 | A complete grammar for this example, including a read-eval-print loop
541 | for interacting with the calculator, can be found in the examples
542 | subdirectory included with Yapps.
543 |
544 | In this section we saw how to insert code before the parser. We also
545 | saw how to use attributes to transmit local information from one rule
546 | to its subrules.
547 |
548 | \mysection{Grammars}
549 |
550 | Each Yapps grammar has a name, a list of tokens, and a set of
551 | production rules. A grammar named \texttt{X} will be used to produce
552 | a parser named \texttt{X} and a scanner named \texttt{XScanner}. As
553 | in Python, names are case sensitive, start with a letter, and contain
554 | letters, numbers, and underscores (\_).
555 |
556 | There are three kinds of tokens in Yapps: named, inline, and ignored.
557 | As their name implies, named tokens are given a name, using the token
558 | construct: \texttt{token \emph{name} : \emph{regexp}}. In a rule, the
559 | token can be matched by using the name. Inline tokens are regular
560 | expressions that are used in rules without being declared. Ignored
561 | tokens are declared using the ignore construct: \texttt{ignore:
562 | \emph{regexp}}. These tokens are ignored by the scanner, and are
563 | not seen by the parser. Often whitespace is an ignored token. The
564 | regular expressions used to define tokens should use the syntax
565 | defined in the \texttt{re} module, so some symbols may have to be
566 | backslashed.
567 |
568 | Production rules in Yapps have a name and a pattern to match. If the
569 | rule is parameterized, the name should be followed by a list of
570 | parameter names in \verb|<<...>>|. A pattern can be a simple pattern
571 | or a compound pattern. Simple patterns are the name of a named token,
572 | a regular expression in quotes (inline token), the name of a
573 | production rule (followed by arguments in \verb|<<...>>|, if the rule
574 | has parameters), and single line Python statements (\verb|{{...}}|).
575 | Compound patterns are sequences (\verb|A B C ...|), choices (
576 | \verb:A | B | C | ...:), options (\verb|[...]|), zero-or-more repetitions
577 | (\verb|...*|), and one-or-more repetitions (\verb|...+|). Like
578 | regular expressions, repetition operators have a higher precedence
579 | than sequences, and sequences have a higher precedence than choices.
580 |
581 | Whenever \verb|{{...}}| is used, a legal one-line Python statement
582 | should be put inside the braces. The token \verb|}}| should not
583 | appear within the \verb|{{...}}| section, even within a string, since
584 | Yapps does not attempt to parse the Python statement. A workaround
585 | for strings is to put two strings together (\verb|"}" "}"|), or to use
586 | backslashes (\verb|"}\}"|). At the end of a rule you should use a
587 | \verb|{{ return X }}| statement to return a value. However, you
588 | should \emph{not} use any control statements (\texttt{return},
589 | \texttt{continue}, \texttt{break}) in the middle of a rule. Yapps
590 | needs to make assumptions about the control flow to generate a parser,
591 | and any changes to the control flow will confuse Yapps.
592 |
593 | The \verb|<<...>>| form can occur in two places: to define parameters
594 | to a rule and to give arguments when matching a rule. Parameters use
595 | the syntax used for Python functions, so they can include default
596 | arguments and the special forms (\verb|*args| and \verb|**kwargs|).
597 | Arguments use the syntax for Python function call arguments, so they 598 | can include normal arguments and keyword arguments. The token 599 | \verb|>>| should not appear within the \verb|<<...>>| section. 600 | 601 | In both the statements and rule arguments, you can use names defined 602 | by the parser to refer to matched patterns. You can refer to the text 603 | matched by a named token by using the token name. You can use the 604 | value returned by a production rule by using the name of that rule. 605 | If a name \texttt{X} is matched more than once (such as in loops), you 606 | will have to save the earlier value(s) in a temporary variable, and 607 | then use that temporary variable in the return value. The next 608 | section has an example of a name that occurs more than once. 609 | 610 | \mysubsection{Left Factoring} 611 | \label{sec:Left-Factoring} 612 | 613 | Yapps produces ELL(1) parsers, which determine which clause to match 614 | based on the first token available. Sometimes the leftmost tokens of 615 | several clauses may be the same. The classic example is the 616 | \emph{if/then/else} construct in Pascal: 617 | 618 | \begin{verbatim} 619 | rule stmt: "if" expr "then" stmt {{ then_part = stmt }} 620 | "else" stmt {{ return ('If',expr,then_part,stmt) }} 621 | | "if" expr "then" stmt {{ return ('If',expr,stmt,[]) }} 622 | \end{verbatim} 623 | 624 | (Note that we have to save the first \texttt{stmt} into a variable 625 | because there is another \texttt{stmt} that will be matched.) The 626 | left portions of the two clauses are the same, which presents a 627 | problem for the parser. The solution is \emph{left-factoring}: the 628 | common parts are put together, and \emph{then} a choice is made about 629 | the remaining part: 630 | 631 | \begin{verbatim} 632 | rule stmt: "if" expr 633 | "then" stmt {{ then_part = stmt }} 634 | {{ else_part = [] }} 635 | [ "else" stmt {{ else_part = stmt }} ] 636 | {{ return ('If', expr, then_part, else_part) }} 637 | \end{verbatim} 638 | 639 | Unfortunately, the classic \emph{if/then/else} situation is 640 | \emph{still} ambiguous when you left-factor. Yapps can deal with this 641 | situation, but will report a warning; see section 642 | \ref{sec:Ambiguous-Grammars} for details. 643 | 644 | In general, replace rules of the form: 645 | 646 | \begin{verbatim} 647 | rule A: a b1 {{ return E1 }} 648 | | a b2 {{ return E2 }} 649 | | c3 {{ return E3 }} 650 | | c4 {{ return E4 }} 651 | \end{verbatim} 652 | 653 | with rules of the form: 654 | 655 | \begin{verbatim} 656 | rule A: a ( b1 {{ return E1 }} 657 | | b2 {{ return E2 }} 658 | ) 659 | | c3 {{ return E3 }} 660 | | c4 {{ return E4 }} 661 | \end{verbatim} 662 | 663 | \mysubsection{Left Recursion} 664 | 665 | A common construct in grammars is for matching a list of patterns, 666 | sometimes separated with delimiters such as commas or semicolons. In 667 | LR-based parser systems, we can parse a list with something like this: 668 | 669 | \begin{verbatim} 670 | rule sum: NUM {{ return NUM }} 671 | | sum "+" NUM {{ return (sum, NUM) }} 672 | \end{verbatim} 673 | 674 | Parsing \texttt{1+2+3+4} would produce the output 675 | \texttt{(((1,2),3),4)}, which is what we want from a left-associative 676 | addition operator. Unfortunately, this grammar is \emph{left 677 | recursive,} because the \texttt{sum} rule contains a clause that 678 | begins with \texttt{sum}. (The recursion occurs at the left side of 679 | the clause.) 
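To see why left recursion is a problem for the recursive descent parsers
that Yapps generates, consider what the generated method for \texttt{sum}
would have to look like. The following is a hypothetical sketch, not actual
Yapps output:

\begin{verbatim}
def sum(self):
    # To try the clause: sum "+" NUM, the method must first match
    # a sum -- that is, call itself -- before it has consumed any
    # input, so it would recurse forever.
    result = self.sum()
    self._scan('"+"')
    NUM = self._scan('NUM')
    return (result, NUM)
\end{verbatim}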
680 |
681 | We must restructure this grammar to be \emph{right recursive} instead:
682 |
683 | \begin{verbatim}
684 | rule sum: NUM       {{ return NUM }}
685 |   | NUM "+" sum     {{ return (NUM, sum) }}
686 | \end{verbatim}
687 |
688 | Unfortunately, using this grammar, \texttt{1+2+3+4} would be parsed as
689 | \texttt{(1,(2,(3,4)))}, which no longer follows left associativity.
690 | The rule also needs to be left-factored. Instead, we write the
691 | pattern as a loop:
692 |
693 | \begin{verbatim}
694 | rule sum: NUM       {{ v = NUM }}
695 |   ( "[+]" NUM       {{ v = (v,NUM) }} )*
696 |                     {{ return v }}
697 | \end{verbatim}
698 |
699 | In general, replace rules of the form:
700 |
701 | \begin{verbatim}
702 | rule A: A a1 -> << E1 >>
703 |   | A a2 -> << E2 >>
704 |   | b3 -> << E3 >>
705 |   | b4 -> << E4 >>
706 | \end{verbatim}
707 |
708 | with rules of the form:
709 |
710 | \begin{verbatim}
711 | rule A: ( b3 {{ A = E3 }}
712 |   | b4 {{ A = E4 }} )
713 |   ( a1 {{ A = E1 }}
714 |   | a2 {{ A = E2 }} )*
715 |   {{ return A }}
716 | \end{verbatim}
717 |
718 | We have taken a rule that proved problematic with recursion and
719 | turned it into a rule that works well with looping constructs.
720 |
721 | \mysubsection{Ambiguous Grammars}
722 | \label{sec:Ambiguous-Grammars}
723 |
724 | In section \ref{sec:Left-Factoring} we saw the classic if/then/else
725 | ambiguity, which occurs because the ``else \ldots'' portion of an ``if
726 | \ldots then \ldots else \ldots'' construct is optional. Programs with
727 | nested if/then/else constructs can be ambiguous when one of the else
728 | clauses is missing:
729 | \begin{verbatim}
730 | if 1 then                  if 1 then
731 |    if 5 then                  if 5 then
732 |       x := 1;                    x := 1;
733 |    else                    else
734 |       y := 9;                 y := 9;
735 | \end{verbatim}
736 |
737 | The indentation shows that the program can be parsed in two different
738 | ways. (Of course, if we all would adopt Python's indentation-based
739 | structuring, this would never happen!) Usually we want the parsing on
740 | the left: the ``else'' should be associated with the closest ``if''
741 | statement. In section \ref{sec:Left-Factoring} we ``solved'' the
742 | problem by using the following grammar:
743 |
744 | \begin{verbatim}
745 | rule stmt: "if" expr
746 |    "then" stmt            {{ then_part = stmt }}
747 |                           {{ else_part = [] }}
748 |    [ "else" stmt          {{ else_part = stmt }} ]
749 |                           {{ return ('If', expr, then_part, else_part) }}
750 | \end{verbatim}
751 |
752 | Here, we have an optional match of ``else'' followed by a statement.
753 | The ambiguity is that if an ``else'' is present, it is not clear
754 | whether you want it parsed immediately or if you want it to be parsed
755 | by the outer ``if''.
756 |
757 | Yapps will deal with the situation by matching the else pattern
758 | when it can. The parser will work in this case because it prefers the
759 | \emph{first} matching clause, which tells Yapps to parse the ``else''.
760 | That is exactly what we want!
761 |
762 | For ambiguity cases with choices, Yapps will choose the \emph{first}
763 | matching choice. However, remember that Yapps only looks at the first
764 | token to determine its decision, so {\tt (a b | a c)} will result in
765 | Yapps choosing {\tt a b} even when the input is {\tt a c}. It only
766 | looks at the first token, {\tt a}, to make its decision.
767 |
768 | \mysection{Customization}
769 |
770 | Both the parsers and the scanners can be customized. The parser is
771 | usually extended by subclassing, and the scanner can either be
772 | subclassed or completely replaced.
773 |
774 | \mysubsection{Customizing Parsers}
775 |
776 | If additional fields and methods are needed in order for a parser to
777 | work, Python subclassing can be used. (This is unlike parser classes
778 | written in static languages, in which these fields and methods must be
779 | defined in the generated parser class.) We simply subclass the
780 | generated parser, and add any fields or methods required. Expressions
781 | in the grammar can call methods of the subclass to perform any actions
782 | that cannot be expressed as a simple expression. For example,
783 | consider this simple grammar:
784 |
785 | \begin{verbatim}
786 | parser X:
787 |     rule goal: "something" {{ self.printmsg() }}
788 | \end{verbatim}
789 |
790 | The \texttt{printmsg} function need not be implemented in the parser
791 | class \texttt{X}; it can be implemented in a subclass:
792 |
793 | \begin{verbatim}
794 | import Xparser
795 |
796 | class MyX(Xparser.X):
797 |     def printmsg(self):
798 |         print "Hello!"
799 | \end{verbatim}
800 |
801 | \mysubsection{Customizing Scanners}
802 |
803 | The generated parser class is not dependent on the generated scanner
804 | class. A scanner object is passed to the parser object's constructor
805 | in the \texttt{parse} function. To use a different scanner, write
806 | your own function to construct parser objects, with an instance of a
807 | different scanner. Scanner objects must have a \texttt{token} method
808 | that accepts an integer \texttt{N} as well as a list of allowed token
809 | types, and returns the Nth token, as a tuple. The default scanner
810 | raises \texttt{NoMoreTokens} if no tokens are available, and
811 | \texttt{SyntaxError} if no token could be matched. However, the
812 | parser does not rely on these exceptions; only the \texttt{parse}
813 | convenience function (which calls \texttt{wrap\_error\_reporter}) and
814 | the \texttt{print\_error} error display function use those exceptions.
815 |
816 | The tuples representing tokens have four elements. The first two are
817 | the beginning and ending indices of the matched text in the input
818 | string. The third element is the type tag, matching either the name
819 | of a named token or the quoted regexp of an inline or ignored token.
820 | The fourth element of the token tuple is the matched text. If the
821 | input string is \texttt{s}, and the token tuple is
822 | \texttt{(b,e,type,val)}, then \texttt{val} should be equal to
823 | \texttt{s[b:e]}.
824 |
825 | The generated parsers do not use the beginning or ending index. They use
826 | only the token type and value. However, the default error reporter
827 | uses the beginning and ending index to show the user where the error
828 | is.
829 |
830 | \mysection{Parser Mechanics}
831 |
832 | The base parser class (Parser) defines two methods, \texttt{\_scan}
833 | and \texttt{\_peek}, and two fields, \texttt{\_pos} and
834 | \texttt{\_scanner}. The generated parser inherits from the base
835 | parser, and contains one method for each rule in the grammar. To
836 | avoid name clashes, do not use names that begin with an underscore
837 | (\texttt{\_}).
838 |
839 | \mysubsection{Parser Objects}
840 | \label{sec:Parser-Objects}
841 |
842 | Yapps produces as output two exception classes, a scanner class, a
843 | parser class, and a function \texttt{parse} that puts everything
844 | together. The \texttt{parse} function does not have to be used;
845 | instead, one can create a parser and scanner object and use them
846 | together for parsing.
847 |
848 | \begin{verbatim}
849 | def parse(rule, text):
850 |     P = X(XScanner(text))
851 |     return wrap_error_reporter(P, rule)
852 | \end{verbatim}
853 |
854 | The \texttt{parse} function takes the name of a rule and an input string
855 | as input. It creates a scanner and parser object, then calls
856 | \texttt{wrap\_error\_reporter} to execute the method in the parser
857 | object named \texttt{rule}. The wrapper function will call the
858 | appropriate parser rule and report any parsing errors to standard
859 | output.
860 |
861 | There are several situations in which the \texttt{parse} function
862 | would not be useful. If a different parser or scanner is being used,
863 | or exceptions are to be handled differently, a new \texttt{parse}
864 | function would be required. The supplied \texttt{parse} function can
865 | be used as a template for writing a function for your own needs. An
866 | example of a custom parse function is the \texttt{generate} function
867 | in \texttt{Yapps.py}.
868 |
869 | \mysubsection{Context Sensitive Scanner}
870 |
871 | Unlike most scanners, the scanner produced by Yapps can take into
872 | account the context in which tokens are needed, and try to match only
873 | good tokens. For example, in the grammar:
874 |
875 | \begin{verbatim}
876 | parser IniFile:
877 |     token ID: "[a-zA-Z_0-9]+"
878 |     token VAL: ".*"
879 |
880 |     rule pair: ID "[ \t]*=[ \t]*" VAL "\n"
881 | \end{verbatim}
882 |
883 | we would like to scan lines of text and pick out a name/value pair.
884 | In a conventional scanner, the input string \texttt{shell=progman.exe}
885 | would be turned into a single token of type \texttt{VAL}. The Yapps
886 | scanner, however, knows that at the beginning of the line, an
887 | \texttt{ID} is expected, so it will return \texttt{"shell"} as a token
888 | of type \texttt{ID}. Later, it will return \texttt{"progman.exe"} as
889 | a token of type \texttt{VAL}.
890 |
891 | Context sensitivity decreases the separation between scanner and
892 | parser, but it is useful in parsers like \texttt{IniFile}, where the
893 | tokens themselves are not unambiguous, but \emph{are} unambiguous
894 | given a particular stage in the parsing process.
895 |
896 | Unfortunately, context sensitivity can make it more difficult to
897 | detect errors in the input. For example, in parsing a Pascal-like
898 | language with ``begin'' and ``end'' as keywords, a context sensitive
899 | scanner would only match ``end'' as the END token if the parser is in
900 | a place that will accept the END token. If not, then the scanner
901 | would match ``end'' as an identifier. To disable the context
902 | sensitive scanner in Yapps, add the
903 | \texttt{context-insensitive-scanner} option to the grammar:
904 |
905 | \begin{verbatim}
906 | parser X:
907 |     option: "context-insensitive-scanner"
908 | \end{verbatim}
909 |
910 | Context-insensitive scanning makes the parser look cleaner as well.
911 |
912 | \mysubsection{Internal Variables}
913 |
914 | There are two internal fields that may be of use. The parser object
915 | has two fields, \texttt{\_pos}, which is the index of the current
916 | token being matched, and \texttt{\_scanner}, which is the scanner
917 | object. The token itself can be retrieved by accessing the scanner
918 | object and calling the \texttt{token} method with the token index.
However, if you call \texttt{token} before the token has been requested by
the parser, it may mess up a context-sensitive scanner.\footnote{When using
a context-sensitive scanner, the parser tells the scanner what the valid
token types are at each point. If you call \texttt{token} before the parser
can tell the scanner the valid token types, the scanner will attempt to
match without considering the context.} A
919 | potentially useful combination of these fields is to extract the
920 | portion of the input matched by the current rule. To do this, just save the scanner state (\texttt{\_scanner.pos}) before the text is matched and then again after the text is matched:
921 |
922 | \begin{verbatim}
923 | rule R:
924 |     {{ start = self._scanner.pos }}
925 |     a b c
926 |     {{ end = self._scanner.pos }}
927 |     {{ print 'Text is', self._scanner.input[start:end] }}
928 | \end{verbatim}
929 |
930 | \mysubsection{Pre- and Post-Parser Code}
931 |
932 | Sometimes the parser code needs to rely on helper variables,
933 | functions, and classes. A Yapps grammar can optionally be surrounded
934 | by double percent signs, to separate the grammar from Python code.
935 |
936 | \begin{verbatim}
937 | ... Python code ...
938 | %%
939 | ... Yapps grammar ...
940 | %%
941 | ... Python code ...
942 | \end{verbatim}
943 |
944 | The second \verb|%%| can be omitted if there is no Python code at the
945 | end, and the first \verb|%%| can be omitted if there is no extra
946 | Python code at all. (To have code only at the end, both separators
947 | are required.)
948 |
949 | If the second \verb|%%| is omitted, Yapps will insert testing code
950 | that allows you to use the generated parser to parse a file.
951 |
952 | The extended calculator example in the Yapps examples subdirectory
953 | includes both pre-parser and post-parser code.
954 |
955 | \mysubsection{Representation of Grammars}
956 |
957 | For each kind of pattern there is a class derived from Pattern. Yapps
958 | has classes for Terminal, NonTerminal, Sequence, Choice, Option, Plus,
959 | Star, and Eval. Each of these classes has the following interface:
960 |
961 | \begin{itemize}
962 | \item[setup(\emph{gen})] Set accepts-$\epsilon$, and call
963 | \emph{gen.changed()} if it changed. This function can change the
964 | flag from false to true but \emph{not} from true to false.
965 | \item[update(\emph{gen})] Set \first and \follow, and call
966 | \emph{gen.changed()} if either changed. This function can add to
967 | the sets but \emph{not} remove from them.
968 | \item[output(\emph{gen}, \emph{indent})] Generate code for matching
969 | this rule, using \emph{indent} as the current indentation level.
970 | Writes are performed using \emph{gen.write}.
971 | \item[used(\emph{vars})] Given a list of variables \emph{vars},
972 | return two lists: one containing the variables that are used, and
973 | one containing the variables that are assigned. This function is
974 | used for optimizing the resulting code.
975 | \end{itemize}
976 |
977 | Both \emph{setup} and \emph{update} monotonically increase the
978 | variables they modify. Since the variables can only increase a finite
979 | number of times, we can repeatedly call the function until the
980 | variables stabilize. The \emph{used} function is not currently
981 | implemented.
982 |
983 | With each pattern in the grammar Yapps associates three pieces of
984 | information: the \first set, the \follow set, and the
985 | accepts-$\epsilon$ flag.
986 | 987 | The \first set contains the tokens that can appear as we start 988 | matching the pattern. The \follow set contains the tokens that can 989 | appear immediately after we match the pattern. The accepts-$\epsilon$ 990 | flag is true if the pattern can match no tokens. In this case, \first 991 | will contain all the elements in \follow. The \follow set is not 992 | needed when accepts-$\epsilon$ is false, and may not be accurate in 993 | those cases. 994 | 995 | Yapps does not compute these sets precisely. Its approximation can 996 | miss certain cases, such as this one: 997 | 998 | \begin{verbatim} 999 | rule C: ( A* | B ) 1000 | rule B: C [A] 1001 | \end{verbatim} 1002 | 1003 | Yapps will calculate {\tt C}'s \follow set to include {\tt A}. 1004 | However, {\tt C} will always match all the {\tt A}'s, so {\tt A} will 1005 | never follow it. Yapps 2.0 does not properly handle this construct, 1006 | but if it seems important, I may add support for it in a future 1007 | version. 1008 | 1009 | Yapps also cannot handle constructs that depend on the calling 1010 | sequence. For example: 1011 | 1012 | \begin{verbatim} 1013 | rule R: U | 'b' 1014 | rule S: | 'c' 1015 | rule T: S 'b' 1016 | rule U: S 'a' 1017 | \end{verbatim} 1018 | 1019 | The \follow set for {\tt S} includes {\tt a} and {\tt b}. Since {\tt 1020 | S} can be empty, the \first set for {\tt S} should include {\tt a}, 1021 | {\tt b}, and {\tt c}. However, when parsing {\tt R}, if the lookahead 1022 | is {\tt b} we should \emph{not} parse {\tt U}. That's because in {\tt 1023 | U}, {\tt S} is followed by {\tt a} and not {\tt b}. Therefore in 1024 | {\tt R}, we should choose rule {\tt U} only if there is an {\tt a} or 1025 | {\tt c}, but not if there is a {\tt b}. Yapps and many other LL(1) 1026 | systems do not distinguish {\tt S b} and {\tt S a}, making {\tt 1027 | S}'s \follow set {\tt a, b}, and making {\tt R} always try to match 1028 | {\tt U}. In this case we can solve the problem by changing {\tt R} to 1029 | \verb:'b' | U: but it may not always be possible to solve all such 1030 | problems in this way. 1031 | 1032 | \appendix 1033 | 1034 | \mysection{Grammar for Parsers} 1035 | 1036 | This is the grammar for parsers, without any Python code mixed in. 1037 | The complete grammar can be found in \texttt{parsedesc.g} in the Yapps 1038 | distribution. 1039 | 1040 | \begin{verbatim} 1041 | parser ParserDescription: 1042 | ignore: "\\s+" 1043 | ignore: "#.*?\r?\n" 1044 | token END: "$" # $ means end of string 1045 | token ATTR: "<<.+?>>" 1046 | token STMT: "{{.+?}}" 1047 | token ID: '[a-zA-Z_][a-zA-Z_0-9]*' 1048 | token STR: '[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"' 1049 | 1050 | rule Parser: "parser" ID ":" 1051 | Options 1052 | Tokens 1053 | Rules 1054 | END 1055 | 1056 | rule Options: ( "option" ":" STR )* 1057 | rule Tokens: ( "token" ID ":" STR | "ignore" ":" STR )* 1058 | rule Rules: ( "rule" ID OptParam ":" ClauseA )* 1059 | 1060 | rule ClauseA: ClauseB ( '[|]' ClauseB )* 1061 | rule ClauseB: ClauseC* 1062 | rule ClauseC: ClauseD [ '[+]' | '[*]' ] 1063 | rule ClauseD: STR | ID [ATTR] | STMT 1064 | | '\\(' ClauseA '\\)' | '\\[' ClauseA '\\]' 1065 | \end{verbatim} 1066 | 1067 | \mysection{Upgrading} 1068 | 1069 | Yapps 2.0 is not backwards compatible with Yapps 1.0. This section 1070 | gives some tips for upgrading: 1071 | 1072 | \begin{enumerate} 1073 | \item Yapps 1.0 was distributed as a single file.
Yapps 2.0 is 1074 | instead distributed as two Python files: a \emph{parser generator} 1075 | (26k) and a \emph{parser runtime} (5k). You need both files to 1076 | create parsers, but you need only the runtime (\texttt{yappsrt.py}) 1077 | to use the parsers. 1078 | 1079 | \item Yapps 1.0 supported Python 1.4 regular expressions from the 1080 | \texttt{regex} module. Yapps 2.0 uses Python 1.5 regular 1081 | expressions from the \texttt{re} module. \emph{The new syntax for 1082 | regular expressions is not compatible with the old syntax.} 1083 | Andrew Kuchling has a \htmladdnormallink{guide to converting 1084 | regular 1085 | expressions}{http://www.python.org/doc/howto/regex-to-re/} on his 1086 | web page. 1087 | 1088 | \item Yapps 1.0 wants a pattern and then a return value in \verb|->| 1089 | \verb|<<...>>|. Yapps 2.0 allows patterns and Python statements to 1090 | be mixed. To convert a rule like this: 1091 | 1092 | \begin{verbatim} 1093 | rule R: A B C -> << E1 >> 1094 | | X Y Z -> << E2 >> 1095 | \end{verbatim} 1096 | 1097 | to Yapps 2.0 form, replace the return value specifiers with return 1098 | statements: 1099 | 1100 | \begin{verbatim} 1101 | rule R: A B C {{ return E1 }} 1102 | | X Y Z {{ return E2 }} 1103 | \end{verbatim} 1104 | 1105 | \item Yapps 2.0 does not perform tail recursion elimination. This 1106 | means any recursive rules you write will be turned into recursive 1107 | methods in the parser. The parser will work, but may be slower. 1108 | It can be made faster by rewriting recursive rules, using instead 1109 | the looping operators \verb|*| and \verb|+| provided in Yapps 2.0. 1110 | 1111 | \end{enumerate} 1112 | 1113 | \mysection{Troubleshooting} 1114 | 1115 | \begin{itemize} 1116 | \item A common error is to write a grammar that doesn't have an END 1117 | token. End tokens are needed when it is not clear when to stop 1118 | parsing. For example, when parsing the expression {\tt 3+5}, it is 1119 | not clear after reading {\tt 3} whether to treat it as a complete 1120 | expression or whether the parser should continue reading. 1121 | Therefore the grammar for numeric expressions should include an end 1122 | token. Another example is the grammar for Lisp expressions. In 1123 | Lisp, it is always clear when you should stop parsing, so you do 1124 | \emph{not} need an end token. In fact, it may be more useful not 1125 | to have an end token, so that you can read in several Lisp expressions. 1126 | \item If there is a chance of ambiguity, make sure to put the choices 1127 | in the order you want them checked. Usually the most specific 1128 | choice should be first. Empty sequences should usually be last. 1129 | \item The context sensitive scanner is not appropriate for all 1130 | grammars. You might try using the context-insensitive scanner with the 1131 | {\tt context-insensitive-scanner} option in the grammar. 1132 | \item If performance turns out to be a problem, try writing a custom 1133 | scanner. The Yapps scanner is rather slow (but flexible and easy 1134 | to understand). 1135 | \end{itemize} 1136 | 1137 | \mysection{History} 1138 | 1139 | Yapps 1 had several limitations that bothered me while writing 1140 | parsers: 1141 | 1142 | \begin{enumerate} 1143 | \item It was not possible to insert statements into the generated 1144 | parser. A common workaround was to write an auxiliary function 1145 | that executed those statements, and to call that function as part 1146 | of the return value calculation.
For example, several of my 1147 | parsers had an ``append(x,y)'' function that existed solely to call 1148 | ``x.append(y)''. 1149 | \item The way in which grammars were specified was rather 1150 | restrictive: a rule was a choice of clauses. Each clause was a 1151 | sequence of tokens and rule names, followed by a return value. 1152 | \item Optional matching had to be put into a separate rule because 1153 | choices were only made at the beginning of a rule. 1154 | \item Repetition had to be specified in terms of recursion. Not only 1155 | was this awkward (sometimes requiring additional rules), but I also had to 1156 | add a tail recursion optimization to Yapps to transform the 1157 | recursion back into a loop. 1158 | \end{enumerate} 1159 | 1160 | Yapps 2 addresses each of these limitations. 1161 | 1162 | \begin{enumerate} 1163 | \item Statements can occur anywhere within a rule. (However, only 1164 | one-line statements are allowed; multiline blocks marked by 1165 | indentation are not.) 1166 | \item Grammars can be specified using any mix of sequences, choices, 1167 | tokens, and rule names. To allow for complex structures, 1168 | parentheses can be used for grouping. 1169 | \item Given choices and parenthesization, optional matching can be 1170 | expressed as a choice between some pattern and nothing. In 1171 | addition, Yapps 2 has the convenience syntax \verb|[A B ...]| for 1172 | matching \verb|A B ...| optionally. 1173 | \item Repetition operators \verb|*| for zero or more and \verb|+| for 1174 | one or more make it easy to specify repeating patterns. 1175 | \end{enumerate} 1176 | 1177 | It is my hope that Yapps 2 will be flexible enough to meet my needs 1178 | for another year, yet simple enough that I do not hesitate to use it. 1179 | 1180 | \mysection{Debian Extensions} 1181 | \label{sec:debian} 1182 | 1183 | The Debian version adds the following enhancements to the original 1184 | Yapps code. They were written by Matthias Urlichs. 1185 | 1186 | \begin{enumerate} 1187 | \item Yapps can stack input sources (``include files''). A usage example 1188 | is supplied with the calc.g sample program. 1189 | \item Yapps now understands augmented ignore-able patterns. 1190 | This means that Yapps can parse multi-line C comments; this wasn't 1191 | possible before. 1192 | \item Better error reporting. 1193 | \item Yapps now reads its input incrementally. 1194 | \end{enumerate} 1195 | 1196 | The parser runtime has been renamed to \texttt{yapps/runtime.py}. 1197 | In Debian, this file is provided by the \texttt{yapps2-runtime} package. 1198 | You need to depend on it if you Debianize Python programs that use 1199 | Yapps. 1200 | 1201 | \mysection{Future Extensions} 1202 | \label{sec:future} 1203 | 1204 | I am still investigating the possibility of LL(2) and higher 1205 | lookahead. However, it looks like the resulting parsers will be 1206 | somewhat ugly. 1207 | 1208 | It would be nice to control choices with user-defined predicates. 1209 | 1210 | The most likely future extension is backtracking. A grammar pattern 1211 | like \verb|(VAR ':=' expr)? {{ return Assign(VAR,expr) }} : expr {{ return expr }}| 1212 | would turn into code that attempted to match \verb|VAR ':=' expr|. If 1213 | it succeeded, it would run \verb|{{ return ... }}|. If it failed, it 1214 | would match \verb|expr {{ return expr }}|. Backtracking may make it 1215 | less necessary to write LL(2) grammars.
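As a rough sketch (hypothetical: the \texttt{rewind} helper below does not exist in the current runtime, and the real design may well differ), the code generated for such a pattern might look like:

\begin{verbatim}
pos = self._scanner.get_pos()      # checkpoint before the attempt
try:
    VAR = self._scan('VAR')
    self._scan("':='")
    expr = self.expr()
    return Assign(VAR, expr)
except runtime.SyntaxError:
    self._scanner.rewind(pos)      # hypothetical: undo the failed match
    expr = self.expr()
    return expr
\end{verbatim}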
1216 | 1217 | \mysection{References} 1218 | 1219 | \begin{enumerate} 1220 | \item The \htmladdnormallink{Python-Parser 1221 | SIG}{http://www.python.org/sigs/parser-sig/} is the first place 1222 | to look for a list of parser systems for Python. 1223 | 1224 | \item ANTLR/PCCTS, by Terrence Parr, is available at 1225 | \htmladdnormallink{The ANTLR Home Page}{http://www.antlr.org/}. 1226 | 1227 | \item PyLR, by Scott Cotton, is at \htmladdnormallink{his Starship 1228 | page}{http://starship.skyport.net/crew/scott/PyLR.html}. 1229 | 1230 | \item John Aycock's \htmladdnormallink{Compiling Little Languages 1231 | Framework}{http://www.foretec.com/python/workshops/1998-11/proceedings/papers/aycock-little/aycock-little.html}. 1232 | 1233 | \item PyBison, by Scott Hassan, can be found at 1234 | \htmladdnormallink{his Python Projects 1235 | page}{http://coho.stanford.edu/\~{}hassan/Python/}. 1236 | 1237 | \item mcf.pars, by Mike C. Fletcher, is available at 1238 | \htmladdnormallink{his web 1239 | page}{http://members.rogers.com/mcfletch/programming/simpleparse/simpleparse.html}. 1240 | 1241 | \item kwParsing, by Aaron Watters, is available at 1242 | \htmladdnormallink{his Starship 1243 | page}{http://starship.skyport.net/crew/aaron_watters/kwParsing/}. 1244 | \end{enumerate} 1245 | 1246 | \end{document} 1247 | -------------------------------------------------------------------------------- /doc/yapps_grammar.g: -------------------------------------------------------------------------------- 1 | # grammar.py, part of Yapps 2 - yet another python parser system 2 | # Copyright 1999-2003 by Amit J. Patel 3 | # Enhancements copyright 2003-2004 by Matthias Urlichs 4 | # 5 | # This version of the Yapps 2 grammar can be distributed under the 6 | # terms of the MIT open source license, either found in the LICENSE 7 | # file included with the Yapps distribution 8 | # or at 9 | # 10 | # 11 | 12 | """Parser for Yapps grammars. 13 | 14 | This file defines the grammar of Yapps grammars. Naturally, it is 15 | implemented in Yapps. The grammar.py module needed by Yapps is built 16 | by running Yapps on yapps_grammar.g. (Holy circularity, Batman!) 
17 | 18 | """ 19 | 20 | import sys, re 21 | from yapps import parsetree 22 | 23 | ###################################################################### 24 | def cleanup_choice(rule, lst): 25 | if len(lst) == 0: return Sequence(rule, []) 26 | if len(lst) == 1: return lst[0] 27 | return parsetree.Choice(rule, *tuple(lst)) 28 | 29 | def cleanup_sequence(rule, lst): 30 | if len(lst) == 1: return lst[0] 31 | return parsetree.Sequence(rule, *tuple(lst)) 32 | 33 | def resolve_name(rule, tokens, id, args): 34 | if id in [x[0] for x in tokens]: 35 | # It's a token 36 | if args: 37 | print('Warning: ignoring parameters on TOKEN %s<<%s>>' % (id, args)) 38 | return parsetree.Terminal(rule, id) 39 | else: 40 | # It's a name, so assume it's a nonterminal 41 | return parsetree.NonTerminal(rule, id, args) 42 | 43 | %% 44 | parser ParserDescription: 45 | 46 | ignore: "[ \t\r\n]+" 47 | ignore: "#.*?\r?\n" 48 | token EOF: "$" 49 | token ATTR: "<<.+?>>" 50 | token STMT: "{{.+?}}" 51 | token ID: '[a-zA-Z_][a-zA-Z_0-9]*' 52 | token STR: '[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"' 53 | token LP: '\\(' 54 | token RP: '\\)' 55 | token LB: '\\[' 56 | token RB: '\\]' 57 | token OR: '[|]' 58 | token STAR: '[*]' 59 | token PLUS: '[+]' 60 | token QUEST: '[?]' 61 | token COLON: ':' 62 | 63 | rule Parser: "parser" ID ":" 64 | Options 65 | Tokens 66 | Rules<> 67 | EOF 68 | {{ return parsetree.Generator(ID,Options,Tokens,Rules) }} 69 | 70 | rule Options: {{ opt = {} }} 71 | ( "option" ":" Str {{ opt[Str] = 1 }} )* 72 | {{ return opt }} 73 | 74 | rule Tokens: {{ tok = [] }} 75 | ( 76 | "token" ID ":" Str {{ tok.append( (ID,Str) ) }} 77 | | "ignore" 78 | ":" Str {{ ign = ('#ignore',Str) }} 79 | ( STMT {{ ign = ign + (STMT[2:-2],) }} )? 80 | {{ tok.append( ign ) }} 81 | )* 82 | {{ return tok }} 83 | 84 | rule Rules<>: 85 | {{ rul = [] }} 86 | ( 87 | "rule" ID OptParam ":" ClauseA<> 88 | {{ rul.append( (ID, OptParam, ClauseA) ) }} 89 | )* 90 | {{ return rul }} 91 | 92 | rule ClauseA<>: 93 | ClauseB<> 94 | {{ v = [ClauseB] }} 95 | ( OR ClauseB<> {{ v.append(ClauseB) }} )* 96 | {{ return cleanup_choice(rule,v) }} 97 | 98 | rule ClauseB<>: 99 | {{ v = [] }} 100 | ( ClauseC<> {{ v.append(ClauseC) }} )* 101 | {{ return cleanup_sequence(rule, v) }} 102 | 103 | rule ClauseC<>: 104 | ClauseD<> 105 | ( PLUS {{ return parsetree.Plus(rule, ClauseD) }} 106 | | STAR {{ return parsetree.Star(rule, ClauseD) }} 107 | | QUEST {{ return parsetree.Option(rule, ClauseD) }} 108 | | {{ return ClauseD }} ) 109 | 110 | rule ClauseD<>: 111 | STR {{ t = (STR, eval(STR,{},{})) }} 112 | {{ if t not in tokens: tokens.insert( 0, t ) }} 113 | {{ return parsetree.Terminal(rule, STR) }} 114 | | ID OptParam {{ return resolve_name(rule,tokens, ID, OptParam) }} 115 | | LP ClauseA<> RP {{ return ClauseA }} 116 | | LB ClauseA<> RB {{ return parsetree.Option(rule, ClauseA) }} 117 | | STMT {{ return parsetree.Eval(rule, STMT[2:-2]) }} 118 | 119 | rule OptParam: [ ATTR {{ return ATTR[2:-2] }} ] {{ return '' }} 120 | rule Str: STR {{ return eval(STR,{},{}) }} 121 | %% 122 | -------------------------------------------------------------------------------- /examples/calc.g: -------------------------------------------------------------------------------- 1 | globalvars = {} # We will store the calculator's variables here 2 | 3 | def lookup(map, name): 4 | for x,v in map: 5 | if x == name: return v 6 | if not globalvars.has_key(name): print 'Undefined (defaulting to 0):', name 7 | return globalvars.get(name, 0) 8 | 9 | def stack_input(scanner,ign): 10 | 
"""Grab more input""" 11 | scanner.stack_input(raw_input(">?> ")) 12 | 13 | %% 14 | parser Calculator: 15 | ignore: "[ \r\t\n]+" 16 | ignore: "[?]" {{ stack_input }} 17 | 18 | token END: "$" 19 | token NUM: "[0-9]+" 20 | token VAR: "[a-zA-Z_]+" 21 | 22 | # Each line can either be an expression or an assignment statement 23 | rule goal: expr<<[]>> END {{ print '=', expr }} 24 | {{ return expr }} 25 | | "set" VAR expr<<[]>> END {{ globalvars[VAR] = expr }} 26 | {{ print VAR, '=', expr }} 27 | {{ return expr }} 28 | 29 | # An expression is the sum and difference of factors 30 | rule expr<>: factor<> {{ n = factor }} 31 | ( "[+]" factor<> {{ n = n+factor }} 32 | | "-" factor<> {{ n = n-factor }} 33 | )* {{ return n }} 34 | 35 | # A factor is the product and division of terms 36 | rule factor<>: term<> {{ v = term }} 37 | ( "[*]" term<> {{ v = v*term }} 38 | | "/" term<> {{ v = v/term }} 39 | )* {{ return v }} 40 | 41 | # A term is a number, variable, or an expression surrounded by parentheses 42 | rule term<>: 43 | NUM {{ return int(NUM) }} 44 | | VAR {{ return lookup(V, VAR) }} 45 | | "\\(" expr "\\)" {{ return expr }} 46 | | "let" VAR "=" expr<> {{ V = [(VAR, expr)] + V }} 47 | "in" expr<> {{ return expr }} 48 | %% 49 | if __name__=='__main__': 50 | print 'Welcome to the calculator sample for Yapps 2.' 51 | print ' Enter either "" or "set ",' 52 | print ' or just press return to exit. An expression can have' 53 | print ' local variables: let x = expr in expr' 54 | # We could have put this loop into the parser, by making the 55 | # `goal' rule use (expr | set var expr)*, but by putting the 56 | # loop into Python code, we can make it interactive (i.e., enter 57 | # one expression, get the result, enter another expression, etc.) 58 | while 1: 59 | try: s = raw_input('>>> ') 60 | except EOFError: break 61 | if not s.strip(): break 62 | parse('goal', s) 63 | print 'Bye.' 
64 | 65 | -------------------------------------------------------------------------------- /examples/expr.g: -------------------------------------------------------------------------------- 1 | parser Calculator: 2 | token END: "$" 3 | token NUM: "[0-9]+" 4 | 5 | rule goal: expr END {{ return expr }} 6 | 7 | # An expression is the sum and difference of factors 8 | rule expr: factor {{ v = factor }} 9 | ( "[+]" factor {{ v = v+factor }} 10 | | "-" factor {{ v = v-factor }} 11 | )* {{ return v }} 12 | 13 | # A factor is the product and division of terms 14 | rule factor: term {{ v = term }} 15 | ( "[*]" term {{ v = v*term }} 16 | | "/" term {{ v = v/term }} 17 | )* {{ return v }} 18 | 19 | # A term is either a number or an expression surrounded by parentheses 20 | rule term: NUM {{ return int(NUM) }} 21 | | "\\(" expr "\\)" {{ return expr }} 22 | -------------------------------------------------------------------------------- /examples/lisp.g: -------------------------------------------------------------------------------- 1 | parser Lisp: 2 | ignore: r'\s+' 3 | token NUM: r'[0-9]+' 4 | token ID: r'[-+*/!@$%^&=.a-zA-Z0-9_]+' 5 | token STR: r'"([^\\"]+|\\.)*"' 6 | 7 | rule expr: ID {{ return ('id',ID) }} 8 | | STR {{ return ('str',eval(STR)) }} 9 | | NUM {{ return ('num',int(NUM)) }} 10 | | r"\(" 11 | {{ e = [] }} # initialize the list 12 | ( expr {{ e.append(expr) }} ) * # put each expr into the list 13 | r"\)" {{ return e }} # return the list 14 | -------------------------------------------------------------------------------- /examples/notes: -------------------------------------------------------------------------------- 1 | Hints 2 | ##### 3 | 4 | Some additional hints for your edification. 5 | 6 | Author: Matthias Urlichs 7 | 8 | How to process C preprocessor codes: 9 | ==================================== 10 | 11 | Rudimentary include handling has been added to the parser by me. 12 | 13 | However, if you want to do anything fancy, like for instance whatever 14 | the C preprocessor does, things get more complicated. Fortunately, 15 | there's already a nice tool to handle C preprocessing -- CPP itself. 16 | 17 | If you want to report errors correctly in that situation, do this: 18 | 19 | def set_line(s,m): 20 | """Fixup the scanner's idea of the current line""" 21 | s.filename = m.group(2) 22 | line = int(m.group(1)) 23 | s.del_line = line - s.line 24 | 25 | %% 26 | parser whatever: 27 | ignore: '^#\s*(\d+)\s*"([^"\n]+)"\s*\n' {{ set_line }} 28 | ignore: '^#.*\n' 29 | 30 | [...] 31 | %% 32 | if __name__=='__main__': 33 | import sys,os 34 | for a in sys.argv[1:]: 35 | f=os.popen("cpp "+repr(a),"r") 36 | 37 | P = whatever(whateverScanner("", filename=a, file=f)) 38 | try: P.goal() 39 | except runtime.SyntaxError as e: 40 | runtime.print_error(e, P._scanner) 41 | sys.exit(1) 42 | 43 | f.close() 44 | 45 | -------------------------------------------------------------------------------- /examples/xml.g: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2 2 | 3 | # xml.g 4 | # 5 | # Amit J. Patel, August 2003 6 | # 7 | # Simple (non-conforming, non-validating) parsing of XML documents, 8 | # based on Robert D. Cameron's "REX" shallow parser. It doesn't 9 | # handle CDATA and lots of other stuff; it's meant to demonstrate 10 | # Yapps, not replace a proper XML parser. 
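# As a quick illustration (added note, not in the original header):
# parsing '<a href="x">hi</a>' with the grammar below returns nested
# lists of the form ['a', {'href': 'x'}, 'hi'], i.e. tag name,
# attribute dictionary, then child nodes.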
11 | 12 | %% 13 | 14 | parser xml: 15 | token nodetext: r'[^<>]+' 16 | token attrtext_singlequote: "[^']*" 17 | token attrtext_doublequote: '[^"]*' 18 | token SP: r'\s' 19 | token id: r'[a-zA-Z_:][a-zA-Z0-9_:.-]*' 20 | 21 | rule node: 22 | r'<!--.*?-->' {{ return ['!--comment'] }} 23 | | r'<!\[CDATA\[.*?\]\]>' {{ return ['![CDATA['] }} 24 | | r'<!DOCTYPE[^>]*>' {{ return ['!doctype'] }} 25 | | '<' SP* id SP* attributes SP* {{ startid = id }} 26 | ( '>' nodes '</' SP* id SP* '>' {{ assert startid == id, 'Mismatched tags <%s> ... </%s>' % (startid, id) }} 27 | {{ return [id, attributes] + nodes }} 28 | | '/\s*>' {{ return [id, attributes] }} 29 | ) 30 | | nodetext {{ return nodetext }} 31 | 32 | rule nodes: {{ result = [] }} 33 | ( node {{ result.append(node) }} 34 | ) * {{ return result }} 35 | 36 | rule attribute: id SP* '=' SP* 37 | ( '"' attrtext_doublequote '"' {{ return (id, attrtext_doublequote) }} 38 | | "'" attrtext_singlequote "'" {{ return (id, attrtext_singlequote) }} 39 | ) 40 | 41 | rule attributes: {{ result = {} }} 42 | ( attribute SP* {{ result[attribute[0]] = attribute[1] }} 43 | ) * {{ return result }} 44 | 45 | %% 46 | 47 | if __name__ == '__main__': 48 | tests = ['', 49 | 'some text', 50 | '< bad xml', 51 | '
', 52 | '< spacey a = "foo" / >', 53 | 'text ... ', 54 | ' middle ', 55 | ' foo bar ', 56 | ] 57 | print 58 | print '____Running tests_______________________________________' 59 | for test in tests: 60 | print 61 | try: 62 | parser = xml(xmlScanner(test)) 63 | output = '%s ==> %s' % (repr(test), repr(parser.node())) 64 | except (yappsrt.SyntaxError, AssertionError) as e: 65 | output = '%s ==> FAILED ==> %s' % (repr(test), e) 66 | print output 67 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from setuptools import setup, find_packages 4 | import os 5 | from yapps import __version__ as version 6 | 7 | pkg_root = os.path.dirname(__file__) 8 | 9 | # Error-handling here is to allow package to be built w/o README included 10 | try: readme = open(os.path.join(pkg_root, 'README.txt')).read() 11 | except IOError: readme = '' 12 | 13 | setup( 14 | name = 'Yapps2', 15 | version = version, 16 | author = 'Amit J. Patel, Matthias Urlichs', 17 | author_email = 'amitp@cs.stanford.edu, smurf@debian.org', 18 | maintainer = 'Mike Kazantsev', 19 | maintainer_email = 'mk.fraggod@gmail.com', 20 | license = 'MIT', 21 | url = 'https://github.com/mk-fg/yapps', 22 | 23 | description = 'Yet Another Python Parser System', 24 | long_description = readme, 25 | 26 | packages = find_packages(), 27 | include_package_data = True, 28 | package_data = {'': ['README.txt']}, 29 | exclude_package_data = {'': ['README.*']}, 30 | 31 | entry_points = dict(console_scripts=[ 32 | 'yapps2 = yapps.cli_tool:main' ]) ) 33 | -------------------------------------------------------------------------------- /test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | set -e 4 | trap 'echo ERROR' 0 5 | 6 | export PYTHONPATH=$(pwd) 7 | for PY_G in python python3 ; do 8 | $PY_G ./yapps2 examples/expr.g examples/expr.py 9 | 10 | for PY_X in python python3 ; do 11 | test "$(echo "1+2*3+4" | $PY_X examples/expr.py goal)" = 11 12 | done 13 | 14 | done 15 | 16 | trap 'rm examples/expr.py; echo OK' 0 17 | -------------------------------------------------------------------------------- /test/empty_clauses.g: -------------------------------------------------------------------------------- 1 | # This parser tests the use of OR clauses with one of them being empty 2 | # 3 | # The output of --dump should indicate the FOLLOW set for (A | ) is 'c'. 4 | 5 | parser Test: 6 | rule TestPlus: ( A | ) 'c' 7 | rule A: 'a'+ 8 | 9 | rule TestStar: ( B | ) 'c' 10 | rule B: 'b'* -------------------------------------------------------------------------------- /test/line_numbers.g: -------------------------------------------------------------------------------- 1 | # 2 | # The error messages produced by Yapps have a line number. 3 | # The line number should take the Python code section into account. 4 | 5 | # The line number should be 10. 6 | 7 | %% 8 | 9 | parser error_1: 10 | this_is_an_error; 11 | -------------------------------------------------------------------------------- /test/option.g: -------------------------------------------------------------------------------- 1 | 2 | %% 3 | 4 | parser test_option: 5 | ignore: r'\s+' 6 | token a: 'a' 7 | token b: 'b' 8 | token EOF: r'$' 9 | 10 | rule test_brackets: a [b] EOF 11 | 12 | rule test_question_mark: a b? 
EOF 13 | 14 | %% 15 | 16 | # The generated code for test_brackets and test_question_mark should 17 | # be the same. 18 | -------------------------------------------------------------------------------- /yapps/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | __version__ = "2.2.1" 3 | -------------------------------------------------------------------------------- /yapps/cli_tool.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # 4 | # Yapps 2 - yet another python parser system 5 | # Copyright 1999-2003 by Amit J. Patel 6 | # 7 | # This version of Yapps 2 can be distributed under the 8 | # terms of the MIT open source license, either found in the LICENSE file 9 | # included with the Yapps distribution 10 | # or at 11 | # 12 | # 13 | 14 | import os, sys, re, types 15 | #from six import string_types 16 | PY2 = sys.version_info[0] == 2 17 | if PY2: 18 | string_types = (basestring,) 19 | else: 20 | string_types = (str,) 21 | 22 | try: from yapps import runtime, parsetree, grammar 23 | except ImportError: 24 | # For running binary from a checkout-path directly 25 | if os.path.isfile('yapps/__init__.py'): 26 | sys.path.append('.') 27 | from yapps import runtime, parsetree, grammar 28 | else: raise 29 | 30 | 31 | def generate(inputfile, outputfile=None, dump=0, **flags): 32 | """Generate a grammar, given an input filename (X.g) 33 | and an output filename (defaulting to X.py).""" 34 | 35 | inputfilename = inputfile if isinstance( 36 | inputfile, string_types ) else inputfile.name 37 | if not outputfile: 38 | if inputfilename.endswith('.g'): 39 | outputfile = inputfilename[:-2] + '.py' 40 | else: 41 | raise Exception('Must specify output filename if input filename is not *.g') 42 | 43 | DIVIDER = '\n%%\n' # This pattern separates the pre/post parsers 44 | preparser, postparser = None, None # Code before and after the parser desc 45 | 46 | # Read the entire file 47 | if isinstance(inputfile, string_types): 48 | inputfile = open(inputfilename) 49 | s = inputfile.read() 50 | 51 | # See if there's a separation between the pre-parser and parser 52 | f = s.find(DIVIDER) 53 | if f >= 0: preparser, s = s[:f]+'\n\n', s[f+len(DIVIDER):] 54 | 55 | # See if there's a separation between the parser and post-parser 56 | f = s.find(DIVIDER) 57 | if f >= 0: s, postparser = s[:f], '\n\n'+s[f+len(DIVIDER):] 58 | 59 | # Create the parser and scanner and parse the text 60 | scanner = grammar.ParserDescriptionScanner(s, filename=inputfilename) 61 | if preparser: scanner.del_line += preparser.count('\n') 62 | 63 | parser = grammar.ParserDescription(scanner) 64 | t = runtime.wrap_error_reporter(parser, 'Parser') 65 | if t is None: return 1 # Failure 66 | if preparser is not None: t.preparser = preparser 67 | if postparser is not None: t.postparser = postparser 68 | 69 | # Add command line options to the set 70 | t.options.update(flags) 71 | 72 | # Generate the output 73 | if dump: 74 | t.dump_information() 75 | else: 76 | if isinstance(outputfile, string_types): 77 | outputfile = open(outputfile, 'w') 78 | t.output = outputfile 79 | t.generate_output() 80 | return 0 81 | 82 | 83 | def main(argv=None): 84 | import doctest 85 | doctest.testmod(sys.modules['__main__']) 86 | doctest.testmod(parsetree) 87 | 88 | import argparse 89 | parser = argparse.ArgumentParser( 90 | description='Generate python parser code from grammar description file.') 91 | parser.add_argument('grammar_path', help='Path to grammar 
description file (input).') 92 | parser.add_argument('parser_path', nargs='?', 93 | help='Path to output file to be generated.' 94 | ' Input path, but with .py will be used, if omitted.' 95 | ' "-" or "/dev/stdout" (on some systems) can be used to output generated code to stdout.') 96 | parser.add_argument('-i', '--context-insensitive-scanner', 97 | action='store_true', help='Scan all tokens (see docs).') 98 | parser.add_argument('-t', '--indent-with-tabs', action='store_true', 99 | help='Use tabs instead of four spaces for indentation in generated code.') 100 | parser.add_argument('--dump', action='store_true', help='Dump out grammar information.') 101 | optz = parser.parse_args(argv if argv is not None else sys.argv[1:]) 102 | 103 | parser_flags = dict() 104 | for k in 'dump', 'context_insensitive_scanner': 105 | if getattr(optz, k, False): parser_flags[k] = True 106 | if optz.indent_with_tabs: parsetree.INDENT = '\t' # not the cleanest way 107 | 108 | outputfile = optz.parser_path 109 | if outputfile == '-': outputfile = sys.stdout 110 | 111 | sys.exit(generate(optz.grammar_path, outputfile, **parser_flags)) 112 | 113 | 114 | if __name__ == '__main__': main() 115 | -------------------------------------------------------------------------------- /yapps/grammar.py: -------------------------------------------------------------------------------- 1 | # grammar.py, part of Yapps 2 - yet another python parser system 2 | # Copyright 1999-2003 by Amit J. Patel 3 | # Enhancements copyright 2003-2004 by Matthias Urlichs 4 | # 5 | # This version of the Yapps 2 grammar can be distributed under the 6 | # terms of the MIT open source license, either found in the LICENSE 7 | # file included with the Yapps distribution 8 | # or at 9 | # 10 | # 11 | 12 | """Parser for Yapps grammars. 13 | 14 | This file defines the grammar of Yapps grammars. Naturally, it is 15 | implemented in Yapps. The grammar.py module needed by Yapps is built 16 | by running Yapps on yapps_grammar.g. (Holy circularity, Batman!) 
17 | 18 | """ 19 | 20 | import sys, re 21 | from yapps import parsetree 22 | 23 | ###################################################################### 24 | def cleanup_choice(rule, lst): 25 | if len(lst) == 0: return Sequence(rule, []) 26 | if len(lst) == 1: return lst[0] 27 | return parsetree.Choice(rule, *tuple(lst)) 28 | 29 | def cleanup_sequence(rule, lst): 30 | if len(lst) == 1: return lst[0] 31 | return parsetree.Sequence(rule, *tuple(lst)) 32 | 33 | def resolve_name(rule, tokens, id, args): 34 | if id in [x[0] for x in tokens]: 35 | # It's a token 36 | if args: 37 | print('Warning: ignoring parameters on TOKEN %s<<%s>>' % (id, args)) 38 | return parsetree.Terminal(rule, id) 39 | else: 40 | # It's a name, so assume it's a nonterminal 41 | return parsetree.NonTerminal(rule, id, args) 42 | 43 | 44 | # Begin -- grammar generated by Yapps 45 | import sys, re 46 | from yapps import runtime 47 | 48 | class ParserDescriptionScanner(runtime.Scanner): 49 | patterns = [ 50 | ('"rule"', re.compile('rule')), 51 | ('"ignore"', re.compile('ignore')), 52 | ('"token"', re.compile('token')), 53 | ('"option"', re.compile('option')), 54 | ('":"', re.compile(':')), 55 | ('"parser"', re.compile('parser')), 56 | ('[ \t\r\n]+', re.compile('[ \t\r\n]+')), 57 | ('#.*?\r?\n', re.compile('#.*?\r?\n')), 58 | ('EOF', re.compile('$')), 59 | ('ATTR', re.compile('<<.+?>>')), 60 | ('STMT', re.compile('{{.+?}}')), 61 | ('ID', re.compile('[a-zA-Z_][a-zA-Z_0-9]*')), 62 | ('STR', re.compile('[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"')), 63 | ('LP', re.compile('\\(')), 64 | ('RP', re.compile('\\)')), 65 | ('LB', re.compile('\\[')), 66 | ('RB', re.compile('\\]')), 67 | ('OR', re.compile('[|]')), 68 | ('STAR', re.compile('[*]')), 69 | ('PLUS', re.compile('[+]')), 70 | ('QUEST', re.compile('[?]')), 71 | ('COLON', re.compile(':')), 72 | ] 73 | def __init__(self, str,*args,**kw): 74 | runtime.Scanner.__init__(self,None,{'[ \t\r\n]+':None,'#.*?\r?\n':None,},str,*args,**kw) 75 | 76 | class ParserDescription(runtime.Parser): 77 | Context = runtime.Context 78 | def Parser(self, _parent=None): 79 | _context = self.Context(_parent, self._scanner, 'Parser', []) 80 | self._scan('"parser"', context=_context) 81 | ID = self._scan('ID', context=_context) 82 | self._scan('":"', context=_context) 83 | Options = self.Options(_context) 84 | Tokens = self.Tokens(_context) 85 | Rules = self.Rules(Tokens, _context) 86 | EOF = self._scan('EOF', context=_context) 87 | return parsetree.Generator(ID,Options,Tokens,Rules) 88 | 89 | def Options(self, _parent=None): 90 | _context = self.Context(_parent, self._scanner, 'Options', []) 91 | opt = {} 92 | while self._peek('"option"', '"token"', '"ignore"', 'EOF', '"rule"', context=_context) == '"option"': 93 | self._scan('"option"', context=_context) 94 | self._scan('":"', context=_context) 95 | Str = self.Str(_context) 96 | opt[Str] = 1 97 | return opt 98 | 99 | def Tokens(self, _parent=None): 100 | _context = self.Context(_parent, self._scanner, 'Tokens', []) 101 | tok = [] 102 | while self._peek('"token"', '"ignore"', 'EOF', '"rule"', context=_context) in ['"token"', '"ignore"']: 103 | _token = self._peek('"token"', '"ignore"', context=_context) 104 | if _token == '"token"': 105 | self._scan('"token"', context=_context) 106 | ID = self._scan('ID', context=_context) 107 | self._scan('":"', context=_context) 108 | Str = self.Str(_context) 109 | tok.append( (ID,Str) ) 110 | else: # == '"ignore"' 111 | self._scan('"ignore"', context=_context) 112 | self._scan('":"', context=_context) 
113 | Str = self.Str(_context) 114 | ign = ('#ignore',Str) 115 | if self._peek('STMT', '"token"', '"ignore"', 'EOF', '"rule"', context=_context) == 'STMT': 116 | STMT = self._scan('STMT', context=_context) 117 | ign = ign + (STMT[2:-2],) 118 | tok.append( ign ) 119 | return tok 120 | 121 | def Rules(self, tokens, _parent=None): 122 | _context = self.Context(_parent, self._scanner, 'Rules', [tokens]) 123 | rul = [] 124 | while self._peek('"rule"', 'EOF', context=_context) == '"rule"': 125 | self._scan('"rule"', context=_context) 126 | ID = self._scan('ID', context=_context) 127 | OptParam = self.OptParam(_context) 128 | self._scan('":"', context=_context) 129 | ClauseA = self.ClauseA(ID, tokens, _context) 130 | rul.append( (ID, OptParam, ClauseA) ) 131 | return rul 132 | 133 | def ClauseA(self, rule, tokens, _parent=None): 134 | _context = self.Context(_parent, self._scanner, 'ClauseA', [rule, tokens]) 135 | ClauseB = self.ClauseB(rule,tokens, _context) 136 | v = [ClauseB] 137 | while self._peek('OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) == 'OR': 138 | OR = self._scan('OR', context=_context) 139 | ClauseB = self.ClauseB(rule,tokens, _context) 140 | v.append(ClauseB) 141 | return cleanup_choice(rule,v) 142 | 143 | def ClauseB(self, rule,tokens, _parent=None): 144 | _context = self.Context(_parent, self._scanner, 'ClauseB', [rule,tokens]) 145 | v = [] 146 | while self._peek('STR', 'ID', 'LP', 'LB', 'STMT', 'OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) in ['STR', 'ID', 'LP', 'LB', 'STMT']: 147 | ClauseC = self.ClauseC(rule,tokens, _context) 148 | v.append(ClauseC) 149 | return cleanup_sequence(rule, v) 150 | 151 | def ClauseC(self, rule,tokens, _parent=None): 152 | _context = self.Context(_parent, self._scanner, 'ClauseC', [rule,tokens]) 153 | ClauseD = self.ClauseD(rule,tokens, _context) 154 | _token = self._peek('PLUS', 'STAR', 'QUEST', 'STR', 'ID', 'LP', 'LB', 'STMT', 'OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) 155 | if _token == 'PLUS': 156 | PLUS = self._scan('PLUS', context=_context) 157 | return parsetree.Plus(rule, ClauseD) 158 | elif _token == 'STAR': 159 | STAR = self._scan('STAR', context=_context) 160 | return parsetree.Star(rule, ClauseD) 161 | elif _token == 'QUEST': 162 | QUEST = self._scan('QUEST', context=_context) 163 | return parsetree.Option(rule, ClauseD) 164 | else: 165 | return ClauseD 166 | 167 | def ClauseD(self, rule,tokens, _parent=None): 168 | _context = self.Context(_parent, self._scanner, 'ClauseD', [rule,tokens]) 169 | _token = self._peek('STR', 'ID', 'LP', 'LB', 'STMT', context=_context) 170 | if _token == 'STR': 171 | STR = self._scan('STR', context=_context) 172 | t = (STR, eval(STR,{},{})) 173 | if t not in tokens: tokens.insert( 0, t ) 174 | return parsetree.Terminal(rule, STR) 175 | elif _token == 'ID': 176 | ID = self._scan('ID', context=_context) 177 | OptParam = self.OptParam(_context) 178 | return resolve_name(rule,tokens, ID, OptParam) 179 | elif _token == 'LP': 180 | LP = self._scan('LP', context=_context) 181 | ClauseA = self.ClauseA(rule,tokens, _context) 182 | RP = self._scan('RP', context=_context) 183 | return ClauseA 184 | elif _token == 'LB': 185 | LB = self._scan('LB', context=_context) 186 | ClauseA = self.ClauseA(rule,tokens, _context) 187 | RB = self._scan('RB', context=_context) 188 | return parsetree.Option(rule, ClauseA) 189 | else: # == 'STMT' 190 | STMT = self._scan('STMT', context=_context) 191 | return parsetree.Eval(rule, STMT[2:-2]) 192 | 193 | def OptParam(self, _parent=None): 194 | _context = 
self.Context(_parent, self._scanner, 'OptParam', []) 195 | if self._peek('ATTR', '":"', 'PLUS', 'STAR', 'QUEST', 'STR', 'ID', 'LP', 'LB', 'STMT', 'OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) == 'ATTR': 196 | ATTR = self._scan('ATTR', context=_context) 197 | return ATTR[2:-2] 198 | return '' 199 | 200 | def Str(self, _parent=None): 201 | _context = self.Context(_parent, self._scanner, 'Str', []) 202 | STR = self._scan('STR', context=_context) 203 | return eval(STR,{},{}) 204 | 205 | 206 | def parse(rule, text): 207 | P = ParserDescription(ParserDescriptionScanner(text)) 208 | return runtime.wrap_error_reporter(P, rule) 209 | 210 | # End -- grammar generated by Yapps 211 | -------------------------------------------------------------------------------- /yapps/parsetree.py: -------------------------------------------------------------------------------- 1 | # parsetree.py, part of Yapps 2 - yet another python parser system 2 | # Copyright 1999-2003 by Amit J. Patel 3 | # 4 | # This version of the Yapps 2 Runtime can be distributed under the 5 | # terms of the MIT open source license, either found in the LICENSE file 6 | # included with the Yapps distribution 7 | # or at 8 | # 9 | # 10 | 11 | """Classes used to represent parse trees and generate output. 12 | 13 | This module defines the Generator class, which drives the generation 14 | of Python output from a grammar parse tree. It also defines nodes 15 | used to represent the parse tree; they are derived from class Node. 16 | 17 | The main logic of Yapps is in this module. 18 | """ 19 | 20 | from __future__ import print_function 21 | import sys, re 22 | 23 | ###################################################################### 24 | INDENT = ' '*4 25 | 26 | class Generator: 27 | 28 | # TODO: many of the methods here should be class methods, not instance methods 29 | 30 | def __init__(self, name, options, tokens, rules): 31 | self.change_count = 0 32 | self.name = name 33 | self.options = options 34 | self.preparser = '' 35 | self.postparser = None 36 | 37 | self.tokens = {} # Map from tokens to regexps 38 | self.ignore = {} # List of token names to ignore in parsing, map to statements 39 | self.terminals = [] # List of token names (to maintain ordering) 40 | for t in tokens: 41 | if len(t) == 3: 42 | n,t,s = t 43 | else: 44 | n,t = t 45 | s = None 46 | 47 | if n == '#ignore': 48 | n = t 49 | self.ignore[n] = s 50 | if n in self.tokens.keys() and self.tokens[n] != t: 51 | print('Warning: token %s defined more than once.' % n, file=sys.stderr) 52 | self.tokens[n] = t 53 | self.terminals.append(n) 54 | 55 | self.rules = {} # Map from rule names to parser nodes 56 | self.params = {} # Map from rule names to parameters 57 | self.goals = [] # List of rule names (to maintain ordering) 58 | for n,p,r in rules: 59 | self.params[n] = p 60 | self.rules[n] = r 61 | self.goals.append(n) 62 | 63 | self.output = sys.stdout 64 | 65 | def has_option(self, name): 66 | return self.options.get(name, False) 67 | 68 | def non_ignored_tokens(self): 69 | return [x for x in self.terminals if x not in self.ignore] 70 | 71 | def changed(self): 72 | """Increments the change count. 73 | 74 | >>> t = Generator('', [], [], []) 75 | >>> old_count = t.change_count 76 | >>> t.changed() 77 | >>> assert t.change_count == old_count + 1 78 | """ 79 | self.change_count = 1+self.change_count 80 | 81 | def set_subtract(self, a, b): 82 | """Returns the elements of a that are not in b. 
83 | 84 | >>> t = Generator('', [], [], []) 85 | >>> t.set_subtract([], []) 86 | [] 87 | >>> t.set_subtract([1, 2], [1, 2]) 88 | [] 89 | >>> t.set_subtract([1, 2, 3], [2]) 90 | [1, 3] 91 | >>> t.set_subtract([1], [2, 3, 4]) 92 | [1] 93 | """ 94 | result = [] 95 | for x in a: 96 | if x not in b: 97 | result.append(x) 98 | return result 99 | 100 | def subset(self, a, b): 101 | """True iff all elements of sequence a are inside sequence b 102 | 103 | >>> t = Generator('', [], [], []) 104 | >>> t.subset([], [1, 2, 3]) 105 | 1 106 | >>> t.subset([1, 2, 3], []) 107 | 0 108 | >>> t.subset([1], [1, 2, 3]) 109 | 1 110 | >>> t.subset([3, 2, 1], [1, 2, 3]) 111 | 1 112 | >>> t.subset([1, 1, 1], [1, 2, 3]) 113 | 1 114 | >>> t.subset([1, 2, 3], [1, 1, 1]) 115 | 0 116 | """ 117 | for x in a: 118 | if x not in b: 119 | return 0 120 | return 1 121 | 122 | def equal_set(self, a, b): 123 | """True iff subset(a, b) and subset(b, a) 124 | 125 | >>> t = Generator('', [], [], []) 126 | >>> a_set = [1, 2, 3] 127 | >>> t.equal_set(a_set, a_set) 128 | 1 129 | >>> t.equal_set(a_set, a_set[:]) 130 | 1 131 | >>> t.equal_set([], a_set) 132 | 0 133 | >>> t.equal_set([1, 2, 3], [3, 2, 1]) 134 | 1 135 | """ 136 | if len(a) != len(b): return 0 137 | if a == b: return 1 138 | return self.subset(a, b) and self.subset(b, a) 139 | 140 | def add_to(self, parent, additions): 141 | "Modify _parent_ to include all elements in _additions_" 142 | for x in additions: 143 | if x not in parent: 144 | parent.append(x) 145 | self.changed() 146 | 147 | def equate(self, a, b): 148 | """Extend (a) and (b) so that they contain each others' elements. 149 | 150 | >>> t = Generator('', [], [], []) 151 | >>> a = [1, 2] 152 | >>> b = [2, 3] 153 | >>> t.equate(a, b) 154 | >>> a 155 | [1, 2, 3] 156 | >>> b 157 | [2, 3, 1] 158 | """ 159 | self.add_to(a, b) 160 | self.add_to(b, a) 161 | 162 | def write(self, *args): 163 | for a in args: 164 | self.output.write(a) 165 | 166 | def in_test(self, expr, full, set): 167 | """Generate a test of (expr) being in (set), where (set) is a subset of (full) 168 | 169 | expr is a string (Python expression) 170 | set is a list of values (which will be converted with repr) 171 | full is the list of all values expr could possibly evaluate to 172 | 173 | >>> t = Generator('', [], [], []) 174 | >>> t.in_test('x', [1,2,3,4], []) 175 | '0' 176 | >>> t.in_test('x', [1,2,3,4], [1,2,3,4]) 177 | '1' 178 | >>> t.in_test('x', [1,2,3,4], [1]) 179 | 'x == 1' 180 | >>> t.in_test('a+b', [1,2,3,4], [1,2]) 181 | 'a+b in [1, 2]' 182 | >>> t.in_test('x', [1,2,3,4,5], [1,2,3]) 183 | 'x not in [4, 5]' 184 | >>> t.in_test('x', [1,2,3,4,5], [1,2,3,4]) 185 | 'x != 5' 186 | """ 187 | 188 | if not set: return '0' 189 | if len(set) == 1: return '%s == %s' % (expr, repr(set[0])) 190 | if full and len(set) > len(full)/2: 191 | # Reverse the sense of the test. 
192 | not_set = [x for x in full if x not in set] 193 | return self.not_in_test(expr, full, not_set) 194 | return '%s in %s' % (expr, repr(set)) 195 | 196 | def not_in_test(self, expr, full, set): 197 | """Like in_test, but the reverse test.""" 198 | if not set: return '1' 199 | if len(set) == 1: return '%s != %s' % (expr, repr(set[0])) 200 | return '%s not in %s' % (expr, repr(set)) 201 | 202 | def peek_call(self, a): 203 | """Generate a call to scan for a token in the set 'a'""" 204 | assert type(a) == type([]) 205 | a_set = (repr(a)[1:-1]) 206 | if self.equal_set(a, self.non_ignored_tokens()): a_set = '' 207 | if self.has_option('context_insensitive_scanner'): a_set = '' 208 | if a_set: a_set += "," 209 | 210 | return 'self._peek(%s context=_context)' % a_set 211 | 212 | def peek_test(self, a, b): 213 | """Generate a call to test whether the next token (which could be any of 214 | the elements in a) is in the set b.""" 215 | if self.subset(a, b): return '1' 216 | if self.has_option('context_insensitive_scanner'): a = self.non_ignored_tokens() 217 | return self.in_test(self.peek_call(a), a, b) 218 | 219 | def not_peek_test(self, a, b): 220 | """Like peek_test, but the opposite sense.""" 221 | if self.subset(a, b): return '0' 222 | return self.not_in_test(self.peek_call(a), a, b) 223 | 224 | def calculate(self): 225 | """The main loop to compute the epsilon, first, follow sets. 226 | The loop continues until the sets converge. This works because 227 | each set can only get larger, so when they stop getting larger, 228 | we're done.""" 229 | # First we determine whether a rule accepts epsilon (the empty sequence) 230 | while 1: 231 | for r in self.goals: 232 | self.rules[r].setup(self) 233 | if self.change_count == 0: break 234 | self.change_count = 0 235 | 236 | # Now we compute the first/follow sets 237 | while 1: 238 | for r in self.goals: 239 | self.rules[r].update(self) 240 | if self.change_count == 0: break 241 | self.change_count = 0 242 | 243 | def dump_information(self): 244 | """Display the grammar in somewhat human-readable form.""" 245 | self.calculate() 246 | for r in self.goals: 247 | print(' _____' + '_'*len(r)) 248 | print(('___/Rule '+r+'\\' + '_'*80)[:79]) 249 | queue = [self.rules[r]] 250 | while queue: 251 | top = queue[0] 252 | del queue[0] 253 | 254 | print('Rule', repr(top), 'of class', top.__class__.__name__) 255 | top.first.sort() 256 | top.follow.sort() 257 | eps = [] 258 | if top.accepts_epsilon: eps = ['(null)'] 259 | print(' FIRST:', ', '.join(top.first+eps)) 260 | print(' FOLLOW:', ', '.join(top.follow)) 261 | for x in top.get_children(): queue.append(x) 262 | 263 | def repr_ignore(self): 264 | out="{" 265 | for t,s in self.ignore.items(): 266 | if s is None: s=repr(s) 267 | out += "%s:%s," % (repr(t),s) 268 | out += "}" 269 | return out 270 | 271 | def generate_output(self): 272 | self.calculate() 273 | self.write(self.preparser) 274 | self.write("# Begin -- grammar generated by Yapps\n") 275 | self.write("from __future__ import print_function\n") 276 | self.write("import sys, re\n") 277 | self.write("from yapps import runtime\n") 278 | self.write("\n") 279 | self.write("class ", self.name, "Scanner(runtime.Scanner):\n") 280 | self.write(INDENT, "patterns = [\n") 281 | for p in self.terminals: 282 | self.write(INDENT*2, "(%s, re.compile(%s)),\n" % ( 283 | repr(p), repr(self.tokens[p]))) 284 | self.write(INDENT, "]\n") 285 | self.write(INDENT, "def __init__(self, str,*args,**kw):\n") 286 | self.write(INDENT*2, 
"runtime.Scanner.__init__(self,None,%s,str,*args,**kw)\n" % 287 | self.repr_ignore()) 288 | self.write("\n") 289 | 290 | self.write("class ", self.name, "(runtime.Parser):\n") 291 | self.write(INDENT, "Context = runtime.Context\n") 292 | for r in self.goals: 293 | self.write(INDENT, "def ", r, "(self") 294 | if self.params[r]: self.write(", ", self.params[r]) 295 | self.write(", _parent=None):\n") 296 | self.write(INDENT*2, "_context = self.Context(_parent, self._scanner, %s, [%s])\n" % 297 | (repr(r), self.params.get(r, ''))) 298 | self.rules[r].output(self, INDENT*2) 299 | self.write("\n") 300 | 301 | self.write("\n") 302 | self.write("def parse(rule, text):\n") 303 | self.write(INDENT, "P = ", self.name, "(", self.name, "Scanner(text))\n") 304 | self.write(INDENT, "return runtime.wrap_error_reporter(P, rule)\n") 305 | self.write("\n") 306 | if self.postparser is not None: 307 | self.write("# End -- grammar generated by Yapps\n") 308 | self.write(self.postparser) 309 | else: 310 | self.write("if __name__ == '__main__':\n") 311 | self.write(INDENT, "from sys import argv, stdin\n") 312 | self.write(INDENT, "if len(argv) >= 2:\n") 313 | self.write(INDENT*2, "if len(argv) >= 3:\n") 314 | self.write(INDENT*3, "f = open(argv[2],'r')\n") 315 | self.write(INDENT*2, "else:\n") 316 | self.write(INDENT*3, "f = stdin\n") 317 | self.write(INDENT*2, "print(parse(argv[1], f.read()))\n") 318 | self.write(INDENT, "else: print ('Args: []', file=sys.stderr)\n") 319 | self.write("# End -- grammar generated by Yapps\n") 320 | 321 | ###################################################################### 322 | class Node: 323 | """This is the base class for all components of a grammar.""" 324 | def __init__(self, rule): 325 | self.rule = rule # name of the rule containing this node 326 | self.first = [] 327 | self.follow = [] 328 | self.accepts_epsilon = 0 329 | 330 | def setup(self, gen): 331 | # Setup will change accepts_epsilon, 332 | # sometimes from 0 to 1 but never 1 to 0. 333 | # It will take a finite number of steps to set things up 334 | pass 335 | 336 | def used(self, vars): 337 | "Return two lists: one of vars used, and the other of vars assigned" 338 | return vars, [] 339 | 340 | def get_children(self): 341 | "Return a list of sub-nodes" 342 | return [] 343 | 344 | def __repr__(self): 345 | return str(self) 346 | 347 | def update(self, gen): 348 | if self.accepts_epsilon: 349 | gen.add_to(self.first, self.follow) 350 | 351 | def output(self, gen, indent): 352 | "Write out code to _gen_ with _indent_:string indentation" 353 | gen.write(indent, "assert 0 # Invalid parser node\n") 354 | 355 | class Terminal(Node): 356 | """This class stores terminal nodes, which are tokens.""" 357 | def __init__(self, rule, token): 358 | Node.__init__(self, rule) 359 | self.token = token 360 | self.accepts_epsilon = 0 361 | 362 | def __str__(self): 363 | return self.token 364 | 365 | def update(self, gen): 366 | Node.update(self, gen) 367 | if self.first != [self.token]: 368 | self.first = [self.token] 369 | gen.changed() 370 | 371 | def output(self, gen, indent): 372 | gen.write(indent) 373 | if re.match('[a-zA-Z_][a-zA-Z_0-9]*$', self.token): 374 | gen.write(self.token, " = ") 375 | gen.write("self._scan(%s, context=_context)\n" % repr(self.token)) 376 | 377 | class Eval(Node): 378 | """This class stores evaluation nodes, from {{ ... 
}} clauses.""" 379 | def __init__(self, rule, expr): 380 | Node.__init__(self, rule) 381 | self.expr = expr 382 | 383 | def setup(self, gen): 384 | Node.setup(self, gen) 385 | if not self.accepts_epsilon: 386 | self.accepts_epsilon = 1 387 | gen.changed() 388 | 389 | def __str__(self): 390 | return '{{ %s }}' % self.expr.strip() 391 | 392 | def output(self, gen, indent): 393 | gen.write(indent, self.expr.strip(), '\n') 394 | 395 | class NonTerminal(Node): 396 | """This class stores nonterminal nodes, which are rules with arguments.""" 397 | def __init__(self, rule, name, args): 398 | Node.__init__(self, rule) 399 | self.name = name 400 | self.args = args 401 | 402 | def setup(self, gen): 403 | Node.setup(self, gen) 404 | try: 405 | self.target = gen.rules[self.name] 406 | if self.accepts_epsilon != self.target.accepts_epsilon: 407 | self.accepts_epsilon = self.target.accepts_epsilon 408 | gen.changed() 409 | except KeyError: # Oops, it's nonexistent 410 | print('Error: no rule <%s>' % self.name, file=sys.stderr) 411 | self.target = self 412 | 413 | def __str__(self): 414 | return '%s' % self.name 415 | 416 | def update(self, gen): 417 | Node.update(self, gen) 418 | gen.equate(self.first, self.target.first) 419 | gen.equate(self.follow, self.target.follow) 420 | 421 | def output(self, gen, indent): 422 | gen.write(indent) 423 | gen.write(self.name, " = ") 424 | args = self.args 425 | if args: args += ', ' 426 | args += '_context' 427 | gen.write("self.", self.name, "(", args, ")\n") 428 | 429 | class Sequence(Node): 430 | """This class stores a sequence of nodes (A B C ...)""" 431 | def __init__(self, rule, *children): 432 | Node.__init__(self, rule) 433 | self.children = children 434 | 435 | def setup(self, gen): 436 | Node.setup(self, gen) 437 | for c in self.children: c.setup(gen) 438 | 439 | if not self.accepts_epsilon: 440 | # If it's not already accepting epsilon, it might now do so. 
441 | for c in self.children: 442 | # any non-epsilon means all is non-epsilon 443 | if not c.accepts_epsilon: break 444 | else: 445 | self.accepts_epsilon = 1 446 | gen.changed() 447 | 448 | def get_children(self): 449 | return self.children 450 | 451 | def __str__(self): 452 | return '( %s )' % ' '.join(map(str, self.children)) 453 | 454 | def update(self, gen): 455 | Node.update(self, gen) 456 | for g in self.children: 457 | g.update(gen) 458 | 459 | empty = 1 460 | for g_i in range(len(self.children)): 461 | g = self.children[g_i] 462 | 463 | if empty: gen.add_to(self.first, g.first) 464 | if not g.accepts_epsilon: empty = 0 465 | 466 | if g_i == len(self.children)-1: 467 | next = self.follow 468 | else: 469 | next = self.children[1+g_i].first 470 | gen.add_to(g.follow, next) 471 | 472 | if self.children: 473 | gen.add_to(self.follow, self.children[-1].follow) 474 | 475 | def output(self, gen, indent): 476 | if self.children: 477 | for c in self.children: 478 | c.output(gen, indent) 479 | else: 480 | # Placeholder for empty sequences, just in case 481 | gen.write(indent, 'pass\n') 482 | 483 | class Choice(Node): 484 | """This class stores a choice between nodes (A | B | C | ...)""" 485 | def __init__(self, rule, *children): 486 | Node.__init__(self, rule) 487 | self.children = children 488 | 489 | def setup(self, gen): 490 | Node.setup(self, gen) 491 | for c in self.children: c.setup(gen) 492 | 493 | if not self.accepts_epsilon: 494 | for c in self.children: 495 | if c.accepts_epsilon: 496 | self.accepts_epsilon = 1 497 | gen.changed() 498 | 499 | def get_children(self): 500 | return self.children 501 | 502 | def __str__(self): 503 | return '( %s )' % ' | '.join(map(str, self.children)) 504 | 505 | def update(self, gen): 506 | Node.update(self, gen) 507 | for g in self.children: 508 | g.update(gen) 509 | 510 | for g in self.children: 511 | gen.add_to(self.first, g.first) 512 | gen.add_to(self.follow, g.follow) 513 | for g in self.children: 514 | gen.add_to(g.follow, self.follow) 515 | if self.accepts_epsilon: 516 | gen.add_to(self.first, self.follow) 517 | 518 | def output(self, gen, indent): 519 | test = "if" 520 | gen.write(indent, "_token = ", gen.peek_call(self.first), "\n") 521 | tokens_seen = [] 522 | tokens_unseen = self.first[:] 523 | if gen.has_option('context_insensitive_scanner'): 524 | # Context insensitive scanners can return ANY token, 525 | # not only the ones in first. 526 | tokens_unseen = gen.non_ignored_tokens() 527 | for c in self.children: 528 | testset = c.first[:] 529 | removed = [] 530 | for x in testset: 531 | if x in tokens_seen: 532 | testset.remove(x) 533 | removed.append(x) 534 | if x in tokens_unseen: tokens_unseen.remove(x) 535 | tokens_seen = tokens_seen + testset 536 | if removed: 537 | if not testset: 538 | print('Error in rule', self.rule+':', file=sys.stderr) 539 | else: 540 | print('Warning in rule', self.rule+':', file=sys.stderr) 541 | print(' *', self, file=sys.stderr) 542 | print(' * These tokens could be matched by more than one clause:', file=sys.stderr) 543 | print(' *', ' '.join(removed), file=sys.stderr) 544 | 545 | if testset: 546 | if not tokens_unseen: # context sensitive scanners only! 
547 | if test == 'if': 548 | # if it's the first AND last test, then 549 | # we can simply put the code without an if/else 550 | c.output(gen, indent) 551 | else: 552 | gen.write(indent, "else:") 553 | t = gen.in_test('', [], testset) 554 | if len(t) < 70-len(indent): 555 | gen.write(' #', t) 556 | gen.write("\n") 557 | c.output(gen, indent+INDENT) 558 | else: 559 | gen.write(indent, test, " ", 560 | gen.in_test('_token', tokens_unseen, testset), 561 | ":\n") 562 | c.output(gen, indent+INDENT) 563 | test = "elif" 564 | 565 | if tokens_unseen: 566 | gen.write(indent, "else:\n") 567 | gen.write(indent, INDENT, "raise runtime.SyntaxError(_token[0], ") 568 | gen.write("'Could not match ", self.rule, "')\n") 569 | 570 | class Wrapper(Node): 571 | """This is a base class for nodes that modify a single child.""" 572 | def __init__(self, rule, child): 573 | Node.__init__(self, rule) 574 | self.child = child 575 | 576 | def setup(self, gen): 577 | Node.setup(self, gen) 578 | self.child.setup(gen) 579 | 580 | def get_children(self): 581 | return [self.child] 582 | 583 | def update(self, gen): 584 | Node.update(self, gen) 585 | self.child.update(gen) 586 | gen.add_to(self.first, self.child.first) 587 | gen.equate(self.follow, self.child.follow) 588 | 589 | class Option(Wrapper): 590 | """This class represents an optional clause of the form [A]""" 591 | def setup(self, gen): 592 | Wrapper.setup(self, gen) 593 | if not self.accepts_epsilon: 594 | self.accepts_epsilon = 1 595 | gen.changed() 596 | 597 | def __str__(self): 598 | return '[ %s ]' % str(self.child) 599 | 600 | def output(self, gen, indent): 601 | if self.child.accepts_epsilon: 602 | print('Warning in rule', self.rule+': contents may be empty.', file=sys.stderr) 603 | gen.write(indent, "if %s:\n" % 604 | gen.peek_test(self.first, self.child.first)) 605 | self.child.output(gen, indent+INDENT) 606 | 607 | if gen.has_option('context_insensitive_scanner'): 608 | gen.write(indent, "if %s:\n" % 609 | gen.not_peek_test(gen.non_ignored_tokens(), self.follow)) 610 | gen.write(indent+INDENT, "raise runtime.SyntaxError(pos=self._scanner.get_pos(), context=_context, msg='Need one of ' + ', '.join(%s))\n" % 611 | repr(self.first)) 612 | 613 | 614 | class Plus(Wrapper): 615 | """This class represents a 1-or-more repetition clause of the form A+""" 616 | def setup(self, gen): 617 | Wrapper.setup(self, gen) 618 | if self.accepts_epsilon != self.child.accepts_epsilon: 619 | self.accepts_epsilon = self.child.accepts_epsilon 620 | gen.changed() 621 | 622 | def __str__(self): 623 | return '%s+' % str(self.child) 624 | 625 | def update(self, gen): 626 | Wrapper.update(self, gen) 627 | gen.add_to(self.child.follow, self.child.first) 628 | 629 | def output(self, gen, indent): 630 | if self.child.accepts_epsilon: 631 | print('Warning in rule', self.rule+':', file=sys.stderr) 632 | print(' * The repeated pattern could be empty. 
The resulting parser may not work properly.', file=sys.stderr)
633 |         gen.write(indent, "while 1:\n")
634 |         self.child.output(gen, indent+INDENT)
635 |         union = self.first[:]
636 |         gen.add_to(union, self.follow)
637 |         gen.write(indent+INDENT, "if %s: break\n" %
638 |                   gen.not_peek_test(union, self.child.first))
639 | 
640 |         if gen.has_option('context_insensitive_scanner'):
641 |             gen.write(indent, "if %s:\n" %
642 |                       gen.not_peek_test(gen.non_ignored_tokens(), self.follow))
643 |             gen.write(indent+INDENT, "raise runtime.SyntaxError(pos=self._scanner.get_pos(), context=_context, msg='Need one of ' + ', '.join(%s))\n" %
644 |                       repr(self.first))
645 | 
646 | 
647 | class Star(Wrapper):
648 |     """This class represents a 0-or-more repetition clause of the form A*"""
649 |     def setup(self, gen):
650 |         Wrapper.setup(self, gen)
651 |         if not self.accepts_epsilon:
652 |             self.accepts_epsilon = 1
653 |             gen.changed()
654 | 
655 |     def __str__(self):
656 |         return '%s*' % str(self.child)
657 | 
658 |     def update(self, gen):
659 |         Wrapper.update(self, gen)
660 |         gen.add_to(self.child.follow, self.child.first)
661 | 
662 |     def output(self, gen, indent):
663 |         if self.child.accepts_epsilon:
664 |             print('Warning in rule', self.rule+':', file=sys.stderr)
665 |             print(' * The repeated pattern could be empty. The resulting parser probably will not work properly.', file=sys.stderr)
666 |         gen.write(indent, "while %s:\n" %
667 |                   gen.peek_test(self.follow, self.child.first))
668 |         self.child.output(gen, indent+INDENT)
669 | 
670 |         # TODO: need to generate tests like this in lots of rules
671 |         if gen.has_option('context_insensitive_scanner'):
672 |             gen.write(indent, "if %s:\n" %
673 |                       gen.not_peek_test(gen.non_ignored_tokens(), self.follow))
674 |             gen.write(indent+INDENT, "raise runtime.SyntaxError(pos=self._scanner.get_pos(), context=_context, msg='Need one of ' + ', '.join(%s))\n" %
675 |                       repr(self.first))
676 | 
--------------------------------------------------------------------------------
/yapps/runtime.py:
--------------------------------------------------------------------------------
1 | # Yapps 2 Runtime, part of Yapps 2 - yet another python parser system
2 | # Copyright 1999-2003 by Amit J. Patel
3 | # Enhancements copyright 2003-2004 by Matthias Urlichs
4 | #
5 | # This version of the Yapps 2 Runtime can be distributed under the
6 | # terms of the MIT open source license, either found in the LICENSE file
7 | # included with the Yapps distribution
8 | # or at
9 | # <http://www.opensource.org/licenses/mit-license.php>
10 | #
11 | 
12 | """Run time libraries needed to run parsers generated by Yapps.
13 | 
14 | This module defines parse-time exception classes, a scanner class, a
15 | base class for parsers produced by Yapps, and a context class that
16 | keeps track of the parse stack.
17 | 
18 | """
19 | 
20 | from __future__ import print_function
21 | import sys, re
22 | 
23 | MIN_WINDOW = 4096
24 | # Size of the sliding window kept when scanning input from a file
25 | 
26 | class SyntaxError(Exception):
27 |     """When we run into an unexpected token, this is the exception to use"""
28 |     def __init__(self, pos=None, msg="Bad Token", context=None):
29 |         Exception.__init__(self)
30 |         self.pos = pos
31 |         self.msg = msg
32 |         self.context = context
33 | 
34 |     def __str__(self):
35 |         if not self.pos: return 'SyntaxError'
36 |         else: return 'SyntaxError@%s(%s)' % (repr(self.pos), self.msg)
37 | 
38 | class NoMoreTokens(Exception):
39 |     """Another exception object, for when we run out of tokens"""
40 |     pass
41 | 
42 | class Token(object):
43 |     """Yapps token.
44 | 
45 |     This is a container for a scanned token.
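
    An illustrative construction and its repr (the position argument is a
    file/line/column tuple):

        >>> Token('ID', 'spam', ('<input>', 1, 0))
        <ID: 'spam' @ <input>:1.0>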
46 | """ 47 | 48 | def __init__(self, type,value, pos=None): 49 | """Initialize a token.""" 50 | self.type = type 51 | self.value = value 52 | self.pos = pos 53 | 54 | def __repr__(self): 55 | output = '<%s: %s' % (self.type, repr(self.value)) 56 | if self.pos: 57 | output += " @ " 58 | if self.pos[0]: 59 | output += "%s:" % self.pos[0] 60 | if self.pos[1]: 61 | output += "%d" % self.pos[1] 62 | if self.pos[2] is not None: 63 | output += ".%d" % self.pos[2] 64 | output += ">" 65 | return output 66 | 67 | in_name=0 68 | class Scanner(object): 69 | """Yapps scanner. 70 | 71 | The Yapps scanner can work in context sensitive or context 72 | insensitive modes. The token(i) method is used to retrieve the 73 | i-th token. It takes a restrict set that limits the set of tokens 74 | it is allowed to return. In context sensitive mode, this restrict 75 | set guides the scanner. In context insensitive mode, there is no 76 | restriction (the set is always the full set of tokens). 77 | 78 | """ 79 | 80 | def __init__(self, patterns, ignore, input="", 81 | file=None,filename=None,stacked=False): 82 | """Initialize the scanner. 83 | 84 | Parameters: 85 | patterns : [(terminal, uncompiled regex), ...] or None 86 | ignore : {terminal:None, ...} 87 | input : string 88 | 89 | If patterns is None, we assume that the subclass has 90 | defined self.patterns : [(terminal, compiled regex), ...]. 91 | Note that the patterns parameter expects uncompiled regexes, 92 | whereas the self.patterns field expects compiled regexes. 93 | 94 | The 'ignore' value is either None or a callable, which is called 95 | with the scanner and the to-be-ignored match object; this can 96 | be used for include file or comment handling. 97 | """ 98 | 99 | if not filename: 100 | global in_name 101 | filename="" % in_name 102 | in_name += 1 103 | 104 | self.input = input 105 | self.ignore = ignore 106 | self.file = file 107 | self.filename = filename 108 | self.pos = 0 109 | self.del_pos = 0 # skipped 110 | self.line = 1 111 | self.del_line = 0 # skipped 112 | self.col = 0 113 | self.tokens = [] 114 | self.stack = None 115 | self.stacked = stacked 116 | 117 | self.last_read_token = None 118 | self.last_token = None 119 | self.last_types = None 120 | 121 | if patterns is not None: 122 | # Compile the regex strings into regex objects 123 | self.patterns = [] 124 | for terminal, regex in patterns: 125 | self.patterns.append( (terminal, re.compile(regex)) ) 126 | 127 | def stack_input(self, input="", file=None, filename=None): 128 | """Temporarily parse from a second file.""" 129 | 130 | # Already reading from somewhere else: Go on top of that, please. 
131 |         if self.stack:
132 |             # autogenerate a recursion-level-identifying filename
133 |             if not filename:
134 |                 filename = 1
135 |             else:
136 |                 try:
137 |                     filename += 1
138 |                 except TypeError:
139 |                     pass
140 |             # now pass off to the include file
141 |             self.stack.stack_input(input, file, filename)
142 |         else:
143 | 
144 |             try:
145 |                 filename += 0
146 |             except TypeError:
147 |                 pass
148 |             else:
149 |                 filename = "<str_%d>" % filename
150 | 
151 |             # self.stack = object.__new__(self.__class__)
152 |             # Scanner.__init__(self.stack,self.patterns,self.ignore,input,file,filename, stacked=True)
153 | 
154 |             # Note that the pattern+ignore are added by the generated
155 |             # scanner code
156 |             self.stack = self.__class__(input, file, filename, stacked=True)
157 | 
158 |     def get_pos(self):
159 |         """Return a file/line/char tuple."""
160 |         if self.stack: return self.stack.get_pos()
161 | 
162 |         return (self.filename, self.line+self.del_line, self.col)
163 | 
164 |     # def __repr__(self):
165 |     #     """Print the last few tokens that have been scanned in"""
166 |     #     output = ''
167 |     #     for t in self.tokens:
168 |     #         output += '%s\n' % (repr(t),)
169 |     #     return output
170 | 
171 |     def print_line_with_pointer(self, pos, length=0, out=sys.stderr):
172 |         """Print the line of the input that includes position 'pos',
173 |         along with a second line with a single caret (^) at that position"""
174 | 
175 |         file, line, p = pos
176 |         if file != self.filename:
177 |             if self.stack: return self.stack.print_line_with_pointer(pos, length=length, out=out)
178 |             print("(%s: not in input buffer)" % file, file=out)
179 |             return
180 | 
181 |         text = self.input
182 |         p += length-1 # starts at pos 1
183 | 
184 |         origline = line
185 |         line -= self.del_line
186 |         spos = 0
187 |         if line > 0:
188 |             while 1:
189 |                 line = line - 1
190 |                 try:
191 |                     cr = text.index("\n", spos)
192 |                 except ValueError:
193 |                     if line:
194 |                         text = ""
195 |                     break
196 |                 if line == 0:
197 |                     text = text[spos:cr]
198 |                     break
199 |                 spos = cr+1
200 |         else:
201 |             print("(%s:%d not in input buffer)" % (file, origline), file=out)
202 |             return
203 | 
204 |         # Now try printing part of the line
205 |         text = text[max(p-80, 0):p+80]
206 |         p = p - max(p-80, 0)
207 | 
208 |         # Strip to the left
209 |         i = text[:p].rfind('\n')
210 |         j = text[:p].rfind('\r')
211 |         if i < 0 or (0 <= j < i): i = j
212 |         if 0 <= i < p:
213 |             p = p - i - 1
214 |             text = text[i+1:]
215 | 
216 |         # Strip to the right
217 |         i = text.find('\n', p)
218 |         j = text.find('\r', p)
219 |         if i < 0 or (0 <= j < i): i = j
220 |         if i >= 0:
221 |             text = text[:i]
222 | 
223 |         # Now shorten the text
224 |         while len(text) > 70 and p > 60:
225 |             # Cut off 10 chars
226 |             text = "..." + text[10:]
227 |             p = p - 7
228 | 
229 |         # Now print the string, along with an indicator
230 |         print('> ', text, file=out)
231 |         print('> ', ' '*p + '^', file=out)
232 | 
233 |     def grab_input(self):
234 |         """Get more input if possible."""
235 |         if not self.file: return
236 |         if len(self.input) - self.pos >= MIN_WINDOW: return
237 | 
238 |         data = self.file.read(MIN_WINDOW)
239 |         if data is None or data == "":
240 |             self.file = None
241 | 
242 |         # Drop bytes from the start, if necessary.
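        # Worked example: with MIN_WINDOW = 4096, once self.pos passes
        # 2*MIN_WINDOW = 8192 we drop the first 4096 characters of the
        # buffer, recording in del_pos/del_line how many characters and
        # newlines were discarded so get_pos() still reports absolute
        # line numbers.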
243 | if self.pos > 2*MIN_WINDOW: 244 | self.del_pos += MIN_WINDOW 245 | self.del_line += self.input[:MIN_WINDOW].count("\n") 246 | self.pos -= MIN_WINDOW 247 | self.input = self.input[MIN_WINDOW:] + data 248 | else: 249 | self.input = self.input + data 250 | 251 | def getchar(self): 252 | """Return the next character.""" 253 | self.grab_input() 254 | 255 | c = self.input[self.pos] 256 | self.pos += 1 257 | return c 258 | 259 | def token(self, restrict, context=None): 260 | """Scan for another token.""" 261 | 262 | while 1: 263 | if self.stack: 264 | try: 265 | return self.stack.token(restrict, context) 266 | except StopIteration: 267 | self.stack = None 268 | 269 | # Keep looking for a token, ignoring any in self.ignore 270 | self.grab_input() 271 | 272 | # special handling for end-of-file 273 | if self.stacked and self.pos==len(self.input): 274 | raise StopIteration 275 | 276 | # Search the patterns for the longest match, with earlier 277 | # tokens in the list having preference 278 | best_match = -1 279 | best_pat = '(error)' 280 | best_m = None 281 | for p, regexp in self.patterns: 282 | # First check to see if we're ignoring this token 283 | if restrict and p not in restrict and p not in self.ignore: 284 | continue 285 | m = regexp.match(self.input, self.pos) 286 | if m and m.end()-m.start() > best_match: 287 | # We got a match that's better than the previous one 288 | best_pat = p 289 | best_match = m.end()-m.start() 290 | best_m = m 291 | 292 | # If we didn't find anything, raise an error 293 | if best_pat == '(error)' and best_match < 0: 294 | msg = 'Bad Token' 295 | if restrict: 296 | msg = 'Trying to find one of '+', '.join(restrict) 297 | raise SyntaxError(self.get_pos(), msg, context=context) 298 | 299 | ignore = best_pat in self.ignore 300 | value = self.input[self.pos:self.pos+best_match] 301 | if not ignore: 302 | tok=Token(type=best_pat, value=value, pos=self.get_pos()) 303 | 304 | self.pos += best_match 305 | 306 | npos = value.rfind("\n") 307 | if npos > -1: 308 | self.col = best_match-npos 309 | self.line += value.count("\n") 310 | else: 311 | self.col += best_match 312 | 313 | # If we found something that isn't to be ignored, return it 314 | if not ignore: 315 | if len(self.tokens) >= 10: 316 | del self.tokens[0] 317 | self.tokens.append(tok) 318 | self.last_read_token = tok 319 | # print repr(tok) 320 | return tok 321 | else: 322 | ignore = self.ignore[best_pat] 323 | if ignore: 324 | ignore(self, best_m) 325 | 326 | def peek(self, *types, **kw): 327 | """Returns the token type for lookahead; if there are any args 328 | then the list of args is the set of token types to allow""" 329 | context = kw.get("context",None) 330 | if self.last_token is None: 331 | self.last_types = types 332 | self.last_token = self.token(types,context) 333 | elif self.last_types: 334 | for t in types: 335 | if t not in self.last_types: 336 | raise NotImplementedError("Unimplemented: restriction set changed") 337 | return self.last_token.type 338 | 339 | def scan(self, type, **kw): 340 | """Returns the matched text, and moves to the next token""" 341 | context = kw.get("context",None) 342 | 343 | if self.last_token is None: 344 | tok = self.token([type],context) 345 | else: 346 | if self.last_types and type not in self.last_types: 347 | raise NotImplementedError("Unimplemented: restriction set changed") 348 | 349 | tok = self.last_token 350 | self.last_token = None 351 | if tok.type != type: 352 | if not self.last_types: self.last_types=[] 353 | raise SyntaxError(tok.pos, 'Trying to find 
'+type+': '+ ', '.join(self.last_types)+", got "+tok.type, context=context) 354 | return tok.value 355 | 356 | class Parser(object): 357 | """Base class for Yapps-generated parsers. 358 | 359 | """ 360 | 361 | def __init__(self, scanner): 362 | self._scanner = scanner 363 | 364 | def _stack(self, input="",file=None,filename=None): 365 | """Temporarily read from someplace else""" 366 | self._scanner.stack_input(input,file,filename) 367 | self._tok = None 368 | 369 | def _peek(self, *types, **kw): 370 | """Returns the token type for lookahead; if there are any args 371 | then the list of args is the set of token types to allow""" 372 | return self._scanner.peek(*types, **kw) 373 | 374 | def _scan(self, type, **kw): 375 | """Returns the matched text, and moves to the next token""" 376 | return self._scanner.scan(type, **kw) 377 | 378 | class Context(object): 379 | """Class to represent the parser's call stack. 380 | 381 | Every rule creates a Context that links to its parent rule. The 382 | contexts can be used for debugging. 383 | 384 | """ 385 | 386 | def __init__(self, parent, scanner, rule, args=()): 387 | """Create a new context. 388 | 389 | Args: 390 | parent: Context object or None 391 | scanner: Scanner object 392 | rule: string (name of the rule) 393 | args: tuple listing parameters to the rule 394 | 395 | """ 396 | self.parent = parent 397 | self.scanner = scanner 398 | self.rule = rule 399 | self.args = args 400 | while scanner.stack: scanner = scanner.stack 401 | self.token = scanner.last_read_token 402 | 403 | def __str__(self): 404 | output = '' 405 | if self.parent: output = str(self.parent) + ' > ' 406 | output += self.rule 407 | return output 408 | 409 | def print_error(err, scanner, max_ctx=None): 410 | """Print error messages, the parser stack, and the input text -- for human-readable error messages.""" 411 | # NOTE: this function assumes 80 columns :-( 412 | # Figure out the line number 413 | pos = err.pos 414 | if not pos: 415 | pos = scanner.get_pos() 416 | 417 | file_name, line_number, column_number = pos 418 | print('%s:%d:%d: %s' % (file_name, line_number, column_number, err.msg), file=sys.stderr) 419 | 420 | scanner.print_line_with_pointer(pos) 421 | 422 | context = err.context 423 | token = None 424 | while context: 425 | print('while parsing %s%s:' % (context.rule, tuple(context.args)), file=sys.stderr) 426 | if context.token: 427 | token = context.token 428 | if token: 429 | scanner.print_line_with_pointer(token.pos, length=len(token.value)) 430 | context = context.parent 431 | if max_ctx: 432 | max_ctx = max_ctx-1 433 | if not max_ctx: 434 | break 435 | 436 | def wrap_error_reporter(parser, rule, *args,**kw): 437 | try: 438 | return getattr(parser, rule)(*args,**kw) 439 | except SyntaxError as e: 440 | print_error(e, parser._scanner) 441 | except NoMoreTokens: 442 | print('Could not complete parsing; stopped around here:', file=sys.stderr) 443 | print(parser._scanner, file=sys.stderr) 444 | -------------------------------------------------------------------------------- /yapps2: -------------------------------------------------------------------------------- 1 | yapps/cli_tool.py --------------------------------------------------------------------------------
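
Usage sketch (illustrative, not part of the repository): compiling a grammar
such as examples/calc.g with the yapps2 tool produces a module containing a
Scanner subclass, a Parser subclass, and helper code built on runtime.py
above. The module, class, and rule names below are assumed to follow that
grammar:

    from yapps import runtime
    import calc  # hypothetical module generated by: yapps2 calc.g

    scanner = calc.calcScanner('1 + 2 * 3')
    parser = calc.calc(scanner)
    result = runtime.wrap_error_reporter(parser, 'goal')
    # On success, result is whatever rule 'goal' returned; on a syntax
    # error, print_error() writes file:line:column and the rule stack
    # (via Context) to stderr and result is None.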