├── .gitignore ├── .travis.yml ├── ChangeLog ├── LICENSE ├── NOTES ├── README.md ├── doc ├── yapps2.man ├── yapps2.tex └── yapps_grammar.g ├── examples ├── calc.g ├── expr.g ├── lisp.g ├── notes └── xml.g ├── setup.py ├── test.sh ├── test ├── empty_clauses.g ├── line_numbers.g └── option.g ├── yapps ├── __init__.py ├── cli_tool.py ├── grammar.py ├── parsetree.py └── runtime.py └── yapps2 /.gitignore: -------------------------------------------------------------------------------- 1 | /.gitattributes 2 | /*.egg-info 3 | /build 4 | /dist 5 | /README.txt 6 | *.pyc 7 | *.pyo 8 | /examples/expr.py 9 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "2.7" 4 | - "3.4" 5 | - "3.5" 6 | - "3.6" 7 | - "3.7" 8 | - "pypy" 9 | script: sh test.sh 10 | -------------------------------------------------------------------------------- /ChangeLog: -------------------------------------------------------------------------------- 1 | 2003-08-27 Amit Patel 2 | 3 | * *: (VERSION) Release 2.1.1 4 | 5 | * *: Added a test/ directory for test cases; I had previously put 6 | tests in the examples/ directory, which is a bad place to put 7 | them. Examples are useful for learning how Yapps works. Tests 8 | are for testing specific features of Yapps. 9 | 10 | * parsetree.py (Plus.update): Fixed a long-standing bug in which 11 | the FOLLOW set of 'a'+ would include 'a'. In theory this makes no 12 | practical difference because the 'a'+ rule eats up all the 'a' 13 | tokens anyway. However, it makes error messages a little bit more 14 | confusing because they imply that an 'a' can follow. 15 | 16 | * yappsrt.py (print_error): Incorporated the context object into 17 | the printing of error messages. 18 | 19 | 2003-08-12 Amit Patel 20 | 21 | * *: (VERSION) Release 2.1.0 22 | 23 | * parsetree.py: Improved error message generation. Instead of 24 | relying on the scanner to produce errors, the parser now checks 25 | things explicitly and produces errors directly. The parser has 26 | better knowledge of the context, so its error messages are more 27 | precise and helpful. 28 | 29 | * yapps_grammar.g: Instead of setting self.rule in the setup() 30 | method, pass it in the constructor. To make it available at 31 | construction time, pass it along as another attribute in the 32 | attribute grammar. 33 | 34 | 2003-08-11 Amit Patel 35 | 36 | * parsetree.py: Generated parsers now include a context object 37 | that describes the parse rule stack. For example, while parsing 38 | rule A, called from rule B, called from rule D, the context object 39 | will let you reconstruct the path D > B > A. [Thanks David Morley] 40 | 41 | * *: Removed all output when things are working 42 | properly; all warnings/errors now go to stderr. 43 | 44 | * yapps_grammar.g: Added support for A? meaning an optional A. 45 | This is equivalent to [A]. 46 | 47 | * yapps2.py: Design - refactored yapps2.py into yapps2.py + 48 | grammar.py + parsetree.py. grammar.py is automatically generated 49 | from grammar.g. Added lots of docstrings. 50 | 51 | 2003-08-09 Amit Patel 52 | 53 | * yapps2.py: Documentation - added doctest tests to some of the 54 | set algorithms in class Generator. 55 | 56 | * yapps2.py: Style - removed "import *" everywhere. 
57 | 58 | * yapps2.py: Style - moved to Python 2 -- string methods, 59 | list comprehensions, inline syntax for apply 60 | 61 | 2003-07-28 Amit Patel 62 | 63 | * *: (VERSION) Release 2.0.4 64 | 65 | * yappsrt.py: Style - replaced raising string exceptions 66 | with raising class exceptions. [Thanks Alex Verstak] 67 | 68 | * yappsrt.py: (SyntaxError) Bug fix - SyntaxError.__init__ should 69 | call Exception.__init__ 70 | 71 | * yapps2.py: Bug fix - identifiers in grammar rules that had 72 | digits in them were not accessible in the {{python code}} sections 73 | of the grammar. 74 | 75 | * yapps2.py: Style - changed "b >= a and b < c" to "a <= b < c" 76 | 77 | * yapps2.py: Style - change "`expr`" to "repr(expr)" 78 | 79 | 2002-08-00 Amit Patel 80 | 81 | * *: (VERSION) Release 2.0.3 82 | 83 | * yapps2.py: Bug fix - inline tokens using the r"" syntax weren't 84 | treated properly. 85 | 86 | 2002-04-00 Amit Patel 87 | 88 | * *: (VERSION) Release 2.0.2 89 | 90 | * yapps2.py: Bug fix - when generating the "else" clause, if the 91 | comment was too long, Yapps was not emitting a newline. [Thanks 92 | Steven Engelhardt] 93 | 94 | 2001-10-00 Amit Patel 95 | 96 | * *: (VERSION) Release 2.0.1 97 | 98 | * yappsrt.py: (SyntaxError) Style - the exception classes now 99 | inherit from Exception. [Thanks Rich Salz] 100 | 101 | * yappsrt.py: (Scanner) Performance - instead of passing the set 102 | of tokens into the scanner at initialization time, we build the 103 | list at compile time. You can still override the default list per 104 | instance of the scanner, but in the common case, we don't have to 105 | rebuild the token list. [Thanks Amaury Forgeot d'Arc] 106 | 107 | 108 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included 12 | in all copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 15 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 17 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 18 | CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 19 | TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 20 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /NOTES: -------------------------------------------------------------------------------- 1 | [Last updated August 11, 2003] 2 | 3 | Notes for myself: 4 | 5 | Document the LINENO trick 6 | 7 | Add a way to have a self-contained mode that doesn't require yappsrt? 8 | 9 | Add a debugging mode that helps you understand how the grammar 10 | is constructed and how things are being parsed 11 | 12 | Optimize (remove) unused variables 13 | 14 | Yapps produces a bunch of inline list literals. 
We should be able to
15 | instead create these lists as class variables (but this makes it
16 | harder to read the code). Also, 'A in X' could be written
17 | 'X.has_key(A)' if we can convert the lists into dictionaries ahead
18 | of time.
19 |
20 | Add a convenience to automatically gather up the values returned
21 | from subpatterns, put them into a list, and return them
22 |
23 | "Gather" mode that simply outputs the return values for certain nodes.
24 | For example, if you just want all expressions, you could ask yapps
25 | to gather the results of the 'expr' rule into a list. This would
26 | ignore all the higher level structure.
27 |
28 | Improve the documentation
29 |
30 | Write some larger examples (probably XML/HTML)
31 |
32 | EOF needs to be dealt with. It's probably a token that can match anywhere.
33 |
34 | Get rid of old-style regex support
35 |
36 | Use SRE's lex support to speed up lexing (this may be hard given that
37 | yapps allows for context-sensitive lexers)
38 |
39 | Look over Dan Connolly's experience with Yapps (bugs, frustrations, etc.)
40 | and see what improvements could be made
41 |
42 | Add something to pretty-print the grammar (without the actions)
43 |
44 | Maybe conditionals? Follow this rule only if some condition holds.
45 | But this would be useful mainly when multiple rules match, and we
46 | want the first matching rule. The conditional would mean we skip to
47 | the next rule. Maybe this is part of the attribute grammar system,
48 | where rule X<0> can be specified separately from X.
49 |
50 | Convenience functions that could build return values for all rules
51 | without specifying the code for each rule individually
52 |
53 | Patterns (abstractions over rules) -- for example, comma separated values
54 | have a certain rule pattern that gets replicated all over the place
55 |
56 | These are rules that take other rules as parameters.
57 |
58 | rule list: {{ result = [] }}
59 |     [ element {{ result.append(element) }}
60 |         ( separator element {{ result.append(element) }}
61 |         )*
62 |     ] {{ return result }}
63 |
64 | Inheritance of parser and scanner classes. The base class (Parser)
65 | may define common tokens like ID, STR, NUM, space, comments, EOF,
66 | etc., and common rules/patterns like optional, sequence,
67 | delimiter-separated sequence.
68 |
69 | Why do A? and (A | ) produce different code? It seems that they
70 | should produce the very same code.
71 |
72 | Look at everyone's Yapps grammars, and come up with larger examples
73 | http://www.w3.org/2000/10/swap/SemEnglish.g
74 | http://www.w3.org/2000/10/swap/kifExpr.g
75 | http://www.w3.org/2000/10/swap/rdfn3.g
76 |
77 | Construct lots of erroneous grammars and see what Yapps does with them
78 | (improve error reporting)
79 |
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | YAPPS: Yet Another Python Parser System
2 | ----------------------------------------
3 |
4 | For the most complete and excellent documentation (e.g. [manual with
5 | examples](http://theory.stanford.edu/~amitp/yapps/yapps2/manual/)) and info,
6 | please see the original project website: http://theory.stanford.edu/~amitp/yapps/
7 |
8 | YAPPS is an easy-to-use parser generator that is written in Python and generates
9 | Python code.
10 | There are several parser generator systems already available for Python, but
11 | this parser has different goals: Yapps is simple, very easy to use, and produces
12 | human-readable parsers.
13 |
14 | It is not the fastest or most powerful parser.
15 | Yapps is designed to be used when regular expressions are not enough and other
16 | parser systems are too much: situations where you might otherwise write your own
17 | recursive descent parser.
18 |
19 | This fork contains several upward-compatible enhancements to the original
20 | YAPPS source, originally included in the [debian package](http://packages.debian.org/sid/yapps2):
21 |
22 | * Handle stacked input ("include files").
23 | * Augmented ignore-able patterns (can parse multi-line C comments correctly).
24 | * Better error reporting.
25 | * Read input incrementally.
26 |
27 |
28 | Installation
29 | ----------------------------------------
30 |
31 | It's a regular package for Python 2.7 (not 3.X, but there are links to 3.X
32 | patches listed on the [original author
33 | website](http://theory.stanford.edu/~amitp/yapps/)), but it is not on PyPI, so it can be
34 | installed from a checkout with something like this:
35 |
36 | % python setup.py install
37 |
38 | A better way is to use [pip](http://pip-installer.org/) to install all the
39 | necessary dependencies as well:
40 |
41 | % pip install 'git+https://github.com/mk-fg/yapps.git#egg=yapps'
42 |
43 | Note that installing into the system-wide PATH and site-packages usually
44 | requires elevated privileges.
45 | Use "install --user",
46 | [~/.pydistutils.cfg](http://docs.python.org/install/index.html#distutils-configuration-files)
47 | or [virtualenv](http://pypi.python.org/pypi/virtualenv) to do unprivileged
48 | installs into custom paths.
49 |
50 | Alternatively, `./yapps2` can be run right from the checkout tree, without any
51 | installation.
52 |
53 | No extra package dependencies.
-------------------------------------------------------------------------------- /doc/yapps2.man: --------------------------------------------------------------------------------
1 | .TH YAPPS2 1
2 | .SH NAME
3 | yapps2 \- generate python parser code from grammar description file
4 | .SH SYNOPSIS
5 | .B yapps2
6 | [\fB\-h\fR]
7 | [\fB\-i\fR, \fB\-\-context-insensitive-scanner\fR]
8 | [\fB\-t\fR, \fB\-\-indent-with-tabs\fR]
9 | [\fB\-\-dump\fR]
10 | .IR grammar_file [ parser_file ]
11 | .SH DESCRIPTION
12 | .B yapps2
13 | generates python parser code from a grammar description file.
14 | .SH OPTIONS
15 | .TP
16 | .BR \-h ", " \-\-help\fR
17 | Show a help message and exit.
18 | .TP
19 | .BR \-i ", " \-\-context-insensitive-scanner\fR
20 | Scan all tokens. See the documentation for details.
21 | .TP
22 | .BR \-t ", " \-\-indent-with-tabs\fR
23 | Use tabs instead of four spaces for indentation in generated code.
24 | .TP
25 | .BR \-\-dump\fR
26 | Dump out grammar information.
27 | .TP
28 | .BR grammar_file
29 | Grammar description file (input).
30 | .TP
31 | .BR parser_file
32 | Name of the output file to be generated.
33 | .BR
34 | If omitted, the grammar file's name with .py appended will be used.
35 | \"-\" or \"/dev/stdout\" can be used to send generated code to stdout.
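.SH EXAMPLES
The following invocations are illustrative; the grammar file name calc.g is
only an example, not a file shipped with yapps2.
.nf
yapps2 calc.g                  write the generated parser to calc.py
yapps2 calc.g my_parser.py     write the generated parser to my_parser.py
yapps2 \-i calc.g              generate with a context\-insensitive scanner
.fi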
36 | -------------------------------------------------------------------------------- /doc/yapps2.tex: -------------------------------------------------------------------------------- 1 | \documentclass[10pt]{article} 2 | \usepackage{palatino} 3 | \usepackage{html} 4 | \usepackage{color} 5 | 6 | \setlength{\headsep}{0in} 7 | \setlength{\headheight}{0in} 8 | \setlength{\textheight}{8.5in} 9 | \setlength{\textwidth}{5.9in} 10 | \setlength{\oddsidemargin}{0.25in} 11 | 12 | \definecolor{darkblue}{rgb}{0,0,0.6} 13 | \definecolor{darkerblue}{rgb}{0,0,0.3} 14 | 15 | %% \newcommand{\mysection}[1]{\section{\textcolor{darkblue}{#1}}} 16 | %% \newcommand{\mysubsection}[1]{\subsection{\textcolor{darkerblue}{#1}}} 17 | \newcommand{\mysection}[1]{\section{#1}} 18 | \newcommand{\mysubsection}[1]{\subsection{#1}} 19 | 20 | \bodytext{bgcolor=white text=black link=#004080 vlink=#006020} 21 | 22 | \newcommand{\first}{\textsc{first}} 23 | \newcommand{\follow}{\textsc{follow}} 24 | 25 | \begin{document} 26 | 27 | \begin{center} 28 | \hfill \begin{tabular}{c} 29 | {\Large The \emph{Yapps} Parser Generator System}\\ 30 | \verb|http://theory.stanford.edu/~amitp/Yapps/|\\ 31 | Version 2\\ 32 | \\ 33 | Amit J. Patel\\ 34 | \htmladdnormallink{http://www-cs-students.stanford.edu/~amitp/} 35 | {http://www-cs-students.stanford.edu/~amitp/} 36 | 37 | \end{tabular} \hfill \rule{0in}{0in} 38 | \end{center} 39 | 40 | \mysection{Introduction} 41 | 42 | \emph{Yapps} (\underline{Y}et \underline{A}nother \underline{P}ython 43 | \underline{P}arser \underline{S}ystem) is an easy to use parser 44 | generator that is written in Python and generates Python code. There 45 | are several parser generator systems already available for Python, 46 | including \texttt{PyLR, kjParsing, PyBison,} and \texttt{mcf.pars,} 47 | but I had different goals for my parser. Yapps is simple, is easy to 48 | use, and produces human-readable parsers. It is not the fastest or 49 | most powerful parser. Yapps is designed to be used when regular 50 | expressions are not enough and other parser systems are too much: 51 | situations where you may write your own recursive descent parser. 52 | 53 | Some unusual features of Yapps that may be of interest are: 54 | 55 | \begin{enumerate} 56 | 57 | \item Yapps produces recursive descent parsers that are readable by 58 | humans, as opposed to table-driven parsers that are difficult to 59 | read. A Yapps parser for a simple calculator looks similar to the 60 | one that Mark Lutz wrote by hand for \emph{Programming Python.} 61 | 62 | \item Yapps also allows for rules that accept parameters and pass 63 | arguments to be used while parsing subexpressions. Grammars that 64 | allow for arguments to be passed to subrules and for values to be 65 | passed back are often called \emph{attribute grammars.} In many 66 | cases parameterized rules can be used to perform actions at ``parse 67 | time'' that are usually delayed until later. For example, 68 | information about variable declarations can be passed into the 69 | rules that parse a procedure body, so that undefined variables can 70 | be detected at parse time. The types of defined variables can be 71 | used in parsing as well---for example, if the type of {\tt X} is 72 | known, we can determine whether {\tt X(1)} is an array reference or 73 | a function call. 74 | 75 | \item Yapps grammars are fairly easy to write, although there are 76 | some inconveniences having to do with ELL(1) parsing that have to be 77 | worked around. 
For example, rules have to be left factored and 78 | rules may not be left recursive. However, neither limitation seems 79 | to be a problem in practice. 80 | 81 | Yapps grammars look similar to the notation used in the Python 82 | reference manual, with operators like \verb:*:, \verb:+:, \verb:|:, 83 | \verb:[]:, and \verb:(): for patterns, names ({\tt tim}) for rules, 84 | regular expressions (\verb:"[a-z]+":) for tokens, and \verb:#: for 85 | comments. 86 | 87 | \item The Yapps parser generator is written as a single Python module 88 | with no C extensions. Yapps produces parsers that are written 89 | entirely in Python, and require only the Yapps run-time module (5k) 90 | for support. 91 | 92 | \item Yapps's scanner is context-sensitive, picking tokens based on 93 | the types of the tokens accepted by the parser. This can be 94 | helpful when implementing certain kinds of parsers, such as for a 95 | preprocessor. 96 | 97 | \end{enumerate} 98 | 99 | There are several disadvantages of using Yapps over another parser system: 100 | 101 | \begin{enumerate} 102 | 103 | \item Yapps parsers are \texttt{ELL(1)} (Extended LL(1)), which is 104 | less powerful than \texttt{LALR} (used by \texttt{PyLR}) or 105 | \texttt{SLR} (used by \texttt{kjParsing}), so Yapps would not be a 106 | good choice for parsing complex languages. For example, allowing 107 | both \texttt{x := 5;} and \texttt{x;} as statements is difficult 108 | because we must distinguish based on only one token of lookahead. 109 | Seeing only \texttt{x}, we cannot decide whether we have an 110 | assignment statement or an expression statement. (Note however 111 | that this kind of grammar can be matched with backtracking; see 112 | section \ref{sec:future}.) 113 | 114 | \item The scanner that Yapps provides can only read from strings, not 115 | files, so an entire file has to be read in before scanning can 116 | begin. It is possible to build a custom scanner, though, so in 117 | cases where stream input is needed (from the console, a network, or 118 | a large file are examples), the Yapps parser can be given a custom 119 | scanner that reads from a stream instead of a string. 120 | 121 | \item Yapps is not designed with efficiency in mind. 122 | 123 | \end{enumerate} 124 | 125 | Yapps provides an easy to use parser generator that produces parsers 126 | similar to what you might write by hand. It is not meant to be a 127 | solution for all parsing problems, but instead an aid for those times 128 | you would write a parser by hand rather than using one of the more 129 | powerful parsing packages available. 130 | 131 | Yapps 2.0 is easier to use than Yapps 1.0. New features include a 132 | less restrictive input syntax, which allows mixing of sequences, 133 | choices, terminals, and nonterminals; optional matching; the ability 134 | to insert single-line statements into the generated parser; and 135 | looping constructs \verb|*| and \verb|+| similar to the repetitive 136 | matching constructs in regular expressions. Unfortunately, the 137 | addition of these constructs has made Yapps 2.0 incompatible with 138 | Yapps 1.0, so grammars will have to be rewritten. See section 139 | \ref{sec:Upgrading} for tips on changing Yapps 1.0 grammars for use 140 | with Yapps 2.0. 141 | 142 | \mysection{Examples} 143 | 144 | In this section are several examples that show the use of Yapps. 145 | First, an introduction shows how to construct grammars and write them 146 | in Yapps form. 
This example can be skipped by someone familiar with
147 | grammars and parsing. Next is a Lisp expression grammar that produces
148 | a parse tree as output. This example demonstrates the use of tokens
149 | and rules, as well as returning values from rules. The third example
150 | is an expression evaluation grammar that evaluates during parsing
151 | (instead of producing a parse tree).
152 |
153 | \mysubsection{Introduction to Grammars}
154 |
155 | A \emph{grammar} for a natural language specifies how words can be put
156 | together to form large structures, such as phrases and sentences. A
157 | grammar for a computer language is similar in that it specifies how
158 | small components (called \emph{tokens}) can be put together to form
159 | larger structures. In this section we will write a grammar for a tiny
160 | subset of English.
161 |
162 | Simple English sentences can be described as being a noun phrase
163 | followed by a verb followed by a noun phrase. For example, in the
164 | sentence, ``Jack sank the blue ship,'' the word ``Jack'' is the first
165 | noun phrase, ``sank'' is the verb, and ``the blue ship'' is the second
166 | noun phrase. In addition we should say what a noun phrase is; for
167 | this example we shall say that a noun phrase is an optional article
168 | (a, an, the) followed by any number of adjectives followed by a noun.
169 | The tokens in our language are the articles, nouns, verbs, and
170 | adjectives. The \emph{rules} in our language will tell us how to
171 | combine the tokens together to form lists of adjectives, noun phrases,
172 | and sentences:
173 |
174 | \begin{itemize}
175 | \item \texttt{sentence: noun\_phrase verb noun\_phrase}
176 | \item \texttt{noun\_phrase: [article] adjective* noun}
177 | \end{itemize}
178 |
179 | Notice that some things that we said easily in English, such as
180 | ``optional article,'' are expressed using special syntax, such as
181 | brackets. When we said, ``any number of adjectives,'' we wrote
182 | \texttt{adjective*}, where the \texttt{*} means ``zero or more of the
183 | preceding pattern''.
184 |
185 | The grammar given above is close to a Yapps grammar. We also have to
186 | specify what the tokens are, and what to do when a pattern is matched.
187 | For this example, we will do nothing when patterns are matched; the
188 | next example will explain how to perform match actions.
189 |
190 | \begin{verbatim}
191 | parser TinyEnglish:
192 |     ignore: "\\W+"
193 |     token noun: "(Jack|spam|ship)"
194 |     token verb: "(sank|threw)"
195 |     token article: "(an|a|the)"
196 |     token adjective: "(blue|red|green)"
197 |
198 |     rule sentence: noun_phrase verb noun_phrase
199 |     rule noun_phrase: [article] adjective* noun
200 | \end{verbatim}
201 |
202 | The tokens are specified as Python \emph{regular expressions}. Since
203 | Yapps produces Python code, you can write any regular expression that
204 | would be accepted by Python. (\emph{Note:} These are Python 1.5
205 | regular expressions from the \texttt{re} module, not Python 1.4
206 | regular expressions from the \texttt{regex} module.) In addition to
207 | tokens that you want to see (which are given names), you can also
208 | specify tokens to ignore, marked by the \texttt{ignore} keyword. In
209 | this parser we want to ignore whitespace.
210 |
211 | The TinyEnglish grammar shows how you define tokens and rules, but it
212 | does not specify what should happen once we've matched the rules.
In
213 | the next example, we will take a grammar and produce a \emph{parse
214 | tree} from it.
215 |
216 | \mysubsection{Lisp Expressions}
217 |
218 | Lisp syntax, although hated by many, has a redeeming quality: it is
219 | simple to parse. In this section we will construct a Yapps grammar to
220 | parse Lisp expressions and produce a parse tree as output.
221 |
222 | \subsubsection*{Defining the Grammar}
223 |
224 | The syntax of Lisp is simple. It has expressions, which are
225 | identifiers, strings, numbers, and lists. A list is a left
226 | parenthesis followed by some number of expressions (separated by
227 | spaces) followed by a right parenthesis. For example, \verb|5|,
228 | \verb|"ni"|, and \verb|(print "1+2 = " (+ 1 2))| are Lisp expressions.
229 | Written as a grammar,
230 |
231 | \begin{verbatim}
232 | expr: ID | STR | NUM | list
233 | list: ( expr* )
234 | \end{verbatim}
235 |
236 | In addition to having a grammar, we need to specify what to do every
237 | time something is matched. For the tokens, which are strings, we just
238 | want to get the ``value'' of the token, attach its type (identifier,
239 | string, or number) in some way, and return it. For the lists, we want
240 | to construct and return a Python list.
241 |
242 | Once some pattern is matched, we write a return statement enclosed
243 | in \verb|{{...}}|. The braces allow us to insert any one-line
244 | statement into the parser. Within this statement, we can refer to the
245 | values returned by matching each part of the rule. After matching a
246 | token such as \texttt{ID}, ``ID'' will be bound to the text of the
247 | matched token. Let's take a look at the rule:
248 |
249 | \begin{verbatim}
250 | rule expr: ID {{ return ('id', ID) }}
251 |   ...
252 | \end{verbatim}
253 |
254 | In a rule, tokens return the text that was matched. For identifiers,
255 | we just return the identifier, along with a ``tag'' telling us that
256 | this is an identifier and not a string or some other value. Sometimes
257 | we may need to convert this text to a different form. For example, if
258 | a string is matched, we want to remove quotes and handle special forms
259 | like \verb|\n|. If a number is matched, we want to convert it into a
260 | number. Let's look at the return values for the other tokens:
261 |
262 | \begin{verbatim}
263 |   ...
264 |   | STR {{ return ('str', eval(STR)) }}
265 |   | NUM {{ return ('num', atoi(NUM)) }}
266 |   ...
267 | \end{verbatim}
268 |
269 | If we get a string, we want to remove the quotes and process any
270 | special backslash codes, so we run \texttt{eval} on the quoted string.
271 | If we get a number, we convert it to an integer with \texttt{atoi} and
272 | then return the number along with its type tag.
273 |
274 | For matching a list, we need to do something slightly more
275 | complicated. If we match a Lisp list of expressions, we want to
276 | create a Python list with those values.
277 |
278 | \begin{verbatim}
279 | rule list: "\\("                  # Match the opening parenthesis
280 |     {{ result = [] }}             # Create a Python list
281 |     (
282 |       expr                        # When we match an expression,
283 |       {{ result.append(expr) }}   # add it to the list
284 |     )*                            # * means repeat this if needed
285 |     "\\)"                         # Match the closing parenthesis
286 |     {{ return result }}           # Return the Python list
287 | \end{verbatim}
288 |
289 | In this rule we first match the opening parenthesis, then go into a
290 | loop. In this loop we match expressions and add them to the list.
When there are no more expressions to match, we match the closing
292 | parenthesis and return the resulting list. Note that \verb:#: is used for
293 | comments, just as in Python.
294 |
295 | The complete grammar is specified as follows:
296 | \begin{verbatim}
297 | parser Lisp:
298 |     ignore: '\\s+'
299 |     token NUM: '[0-9]+'
300 |     token ID: '[-+*/!@%^&=.a-zA-Z0-9_]+'
301 |     token STR: '"([^\\"]+|\\\\.)*"'
302 |
303 |     rule expr: ID     {{ return ('id', ID) }}
304 |       | STR           {{ return ('str', eval(STR)) }}
305 |       | NUM           {{ return ('num', atoi(NUM)) }}
306 |       | list          {{ return list }}
307 |     rule list: "\\("  {{ result = [] }}
308 |       ( expr          {{ result.append(expr) }}
309 |       )*
310 |       "\\)"           {{ return result }}
311 | \end{verbatim}
312 |
313 | One thing you may have noticed is that \verb|"\\("| and \verb|"\\)"|
314 | appear in the \texttt{list} rule. These are \emph{inline tokens}:
315 | they appear in the rules without being given a name with the
316 | \texttt{token} keyword. Inline tokens are more convenient to use, but
317 | since they do not have a name, the text that is matched cannot be used
318 | in the return value. They are best used for short simple patterns
319 | (usually punctuation or keywords).
320 |
321 | Another thing to notice is that the number and identifier tokens
322 | overlap. For example, ``487'' matches both NUM and ID. In Yapps, the
323 | scanner only tries to match tokens that are acceptable to the parser.
324 | This rule doesn't help here, since both NUM and ID can appear in the
325 | same place in the grammar. There are two rules used to pick tokens if
326 | more than one matches. One is that the \emph{longest} match is
327 | preferred. For example, ``487x'' will match as an ID (487x) rather
328 | than as a NUM (487) followed by an ID (x). The second rule is that if
329 | the two matches are the same length, the \emph{first} one listed in
330 | the grammar is preferred. For example, ``487'' will match as a NUM
331 | rather than an ID because NUM is listed first in the grammar. Inline
332 | tokens have preference over any tokens you have listed.
333 |
334 | Now that our grammar is defined, we can run Yapps to produce a parser,
335 | and then run the parser to produce a parse tree.
336 |
337 | \subsubsection*{Running Yapps}
338 |
339 | In the Yapps module is a function \texttt{generate} that takes an
340 | input filename and writes a parser to another file. We can use this
341 | function to generate the Lisp parser, which is assumed to be in
342 | \texttt{lisp.g}.
343 |
344 | \begin{verbatim}
345 | % python
346 | Python 1.5.1 (#1, Sep 3 1998, 22:51:17) [GCC 2.7.2.3] on linux-i386
347 | Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
348 | >>> import yapps
349 | >>> yapps.generate('lisp.g')
350 | \end{verbatim}
351 |
352 | At this point, Yapps has written a file \texttt{lisp.py} that contains
353 | the parser. In that file are two classes (one scanner and one parser)
354 | and a function (called \texttt{parse}) that puts things together for
355 | you.
356 |
357 | Alternatively, we can run Yapps from the command line to generate the
358 | parser file:
359 |
360 | \begin{verbatim}
361 | % python yapps.py lisp.g
362 | \end{verbatim}
363 |
364 | After running Yapps either from within Python or from the command
365 | line, we can use the Lisp parser by calling the \texttt{parse}
366 | function. The first parameter should be the rule we want to match,
367 | and the second parameter should be the string to parse.
368 |
369 | \begin{verbatim}
370 | >>> import lisp
371 | >>> lisp.parse('expr', '(+ 3 4)')
372 | [('id', '+'), ('num', 3), ('num', 4)]
373 | >>> lisp.parse('expr', '(print "3 = " (+ 1 2))')
374 | [('id', 'print'), ('str', '3 = '), [('id', '+'), ('num', 1), ('num', 2)]]
375 | \end{verbatim}
376 |
377 | The \texttt{parse} function is not the only way to use the parser;
378 | section \ref{sec:Parser-Objects} describes how to access parser objects
379 | directly.
380 |
381 | We've now gone through the steps in creating a grammar, writing a
382 | grammar file for Yapps, producing a parser, and using the parser. In
383 | the next example we'll see how rules can take parameters and also how
384 | to do computations instead of just returning a parse tree.
385 |
386 | \mysubsection{Calculator}
387 |
388 | A common example parser given in many textbooks is that for simple
389 | expressions, with numbers, addition, subtraction, multiplication,
390 | division, and parenthesization of subexpressions. We'll write this
391 | example in Yapps, evaluating the expression as we parse.
392 |
393 | Unlike \texttt{yacc}, Yapps does not have any way to specify
394 | precedence rules, so we have to do it ourselves. We say that an
395 | expression is the sum of factors, that a factor is the product of
396 | terms, and that a term is a number or a parenthesized expression:
397 |
398 | \begin{verbatim}
399 | expr: factor ( ("+"|"-") factor )*
400 | factor: term ( ("*"|"/") term )*
401 | term: NUM | "(" expr ")"
402 | \end{verbatim}
403 |
404 | In order to evaluate the expression as we go, we should keep an
405 | accumulator while evaluating the lists of terms or factors. Just as
406 | we kept a ``result'' variable to build a parse tree for Lisp
407 | expressions, we will use a variable to evaluate numerical
408 | expressions. The full grammar is given below:
409 |
410 | \begin{verbatim}
411 | parser Calculator:
412 |     token END: "$"    # $ means end of string
413 |     token NUM: "[0-9]+"
414 |
415 |     rule goal: expr END       {{ return expr }}
416 |
417 |     # An expression is the sum and difference of factors
418 |     rule expr: factor         {{ v = factor }}
419 |       ( "[+]" factor          {{ v = v+factor }}
420 |       | "-" factor            {{ v = v-factor }}
421 |       )*                      {{ return v }}
422 |
423 |     # A factor is the product and division of terms
424 |     rule factor: term         {{ v = term }}
425 |       ( "[*]" term            {{ v = v*term }}
426 |       | "/" term              {{ v = v/term }}
427 |       )*                      {{ return v }}
428 |
429 |     # A term is either a number or an expression surrounded by parentheses
430 |     rule term: NUM            {{ return atoi(NUM) }}
431 |       | "\\(" expr "\\)"      {{ return expr }}
432 | \end{verbatim}
433 |
434 | The top-level rule is \emph{goal}, which says that we are looking for
435 | an expression followed by the end of the string. The \texttt{END}
436 | token is needed because without it, it isn't clear when to stop
437 | parsing. For example, the string ``1+3'' could be parsed either as
438 | the expression ``1'' followed by the string ``+3'' or it could be
439 | parsed as the expression ``1+3''. By requiring expressions to end
440 | with \texttt{END}, the parser is forced to take ``1+3''.
441 |
442 | In the two rules with repetition, the accumulator is named \texttt{v}.
443 | After reading in one expression, we initialize the accumulator. Each
444 | time through the loop, we modify the accumulator by adding,
445 | subtracting, multiplying by, or dividing the previous accumulator by
446 | the expression that has been parsed. At the end of the rule, we
447 | return the accumulator.
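To make this concrete, here is a hypothetical session with the generated
parser (assuming the grammar above was saved as \texttt{calc.g} and compiled
to \texttt{calc.py}; since this grammar declares no \texttt{ignore} pattern,
the input must not contain spaces):

\begin{verbatim}
>>> import calc
>>> calc.parse('goal', '1+2*3')    # factor binds tighter than "+"
7
>>> calc.parse('goal', '(1+2)*3')  # parentheses override precedence
9
\end{verbatim}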
448 |
449 | The calculator example shows how to process lists of elements using
450 | loops, as well as how to handle precedence of operators.
451 |
452 | \emph{Note:} It's often important to put the \texttt{END} token in, so
453 | put it in unless you are sure that your grammar has some other
454 | non-ambiguous token marking the end of the program.
455 |
456 | \mysubsection{Calculator with Memory}
457 |
458 | In the previous example we learned how to write a calculator that
459 | evaluates simple numerical expressions. In this section we will
460 | extend the example to support both local and global variables.
461 |
462 | To support global variables, we will add assignment statements to the
463 | ``goal'' rule.
464 |
465 | \begin{verbatim}
466 | rule goal: expr END         {{ return expr }}
467 |   | 'set' ID expr END       {{ global_vars[ID] = expr }}
468 |                             {{ return expr }}
469 | \end{verbatim}
470 |
471 | To use these variables, we need a new kind of terminal:
472 |
473 | \begin{verbatim}
474 | rule term: ... | ID {{ return global_vars[ID] }}
475 | \end{verbatim}
476 |
477 | So far, these changes are straightforward. We simply have a global
478 | dictionary \texttt{global\_vars} that stores the variables and values,
479 | we modify it when there is an assignment statement, and we look up
480 | variables in it when we see a variable name.
481 |
482 | To support local variables, we will add variable declarations to the
483 | set of allowed expressions.
484 |
485 | \begin{verbatim}
486 | rule term: ... | 'let' VAR '=' expr 'in' expr ...
487 | \end{verbatim}
488 |
489 | This is where it becomes tricky. Local variables should be stored in
490 | a local dictionary, not in the global one. One trick would be to save
491 | a copy of the global dictionary, modify it, and then restore it
492 | later. In this example we will instead use \emph{attributes} to
493 | create local information and pass it to subrules.
494 |
495 | A rule can optionally take parameters. When we invoke the rule, we
496 | must pass in arguments. For local variables, let's use a single
497 | parameter, \texttt{local\_vars}:
498 |
499 | \begin{verbatim}
500 | rule expr<<local_vars>>: ...
501 | rule factor<<local_vars>>: ...
502 | rule term<<local_vars>>: ...
503 | \end{verbatim}
504 |
505 | Each time we want to match \texttt{expr}, \texttt{factor}, or
506 | \texttt{term}, we will pass the local variables in the current rule to
507 | the subrule. One interesting case is when we pass as an argument
508 | something \emph{other} than \texttt{local\_vars}:
509 |
510 | \begin{verbatim}
511 | rule term<<local_vars>>: ...
512 |   | 'let' VAR '=' expr<<local_vars>>
513 |     {{ local_vars = [(VAR, expr)] + local_vars }}
514 |     'in' expr<<local_vars>>
515 |     {{ return expr }}
516 | \end{verbatim}
517 |
518 | Note that the assignment to the local variables list does not modify
519 | the original list. This is important to keep local variables from
520 | being seen outside the ``let''.
521 |
522 | The other interesting case is when we find a variable:
523 |
524 | \begin{verbatim}
525 | global_vars = {}
526 |
527 | def lookup(map, name):
528 |     for x,v in map:
            if x==name: return v
529 |     return global_vars[name]
530 | %%
531 | ...
532 | rule term<<local_vars>>: ...
533 |   | VAR {{ return lookup(local_vars, VAR) }}
534 | \end{verbatim}
535 |
536 | The lookup function will search through the local variable list, and
537 | if it cannot find the name there, it will look it up in the global
538 | variable dictionary.
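Because new bindings are prepended to \texttt{local\_vars} and
\texttt{lookup} returns the first match it finds, an inner \texttt{let}
shadows an outer binding of the same name. A hypothetical session
(assuming the complete grammar mentioned below has been compiled to
\texttt{calc.py}) illustrates this:

\begin{verbatim}
>>> import calc
>>> calc.parse('goal', 'let x = 1 in let x = 2 in x+x')
4
\end{verbatim}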
539 |
540 | A complete grammar for this example, including a read-eval-print loop
541 | for interacting with the calculator, can be found in the examples
542 | subdirectory included with Yapps.
543 |
544 | In this section we saw how to insert code before the parser. We also
545 | saw how to use attributes to transmit local information from one rule
546 | to its subrules.
547 |
548 | \mysection{Grammars}
549 |
550 | Each Yapps grammar has a name, a list of tokens, and a set of
551 | production rules. A grammar named \texttt{X} will be used to produce
552 | a parser named \texttt{X} and a scanner named \texttt{XScanner}. As
553 | in Python, names are case sensitive, start with a letter, and contain
554 | letters, numbers, and underscores (\_).
555 |
556 | There are three kinds of tokens in Yapps: named, inline, and ignored.
557 | As their name implies, named tokens are given a name, using the token
558 | construct: \texttt{token \emph{name} : \emph{regexp}}. In a rule, the
559 | token can be matched by using the name. Inline tokens are regular
560 | expressions that are used in rules without being declared. Ignored
561 | tokens are declared using the ignore construct: \texttt{ignore:
562 | \emph{regexp}}. These tokens are ignored by the scanner, and are
563 | not seen by the parser. Often whitespace is an ignored token. The
564 | regular expressions used to define tokens should use the syntax
565 | defined in the \texttt{re} module, so some symbols may have to be
566 | backslashed.
567 |
568 | Production rules in Yapps have a name and a pattern to match. If the
569 | rule is parameterized, the name should be followed by a list of
570 | parameter names in \verb|<<...>>|. A pattern can be a simple pattern
571 | or a compound pattern. Simple patterns are the name of a named token,
572 | a regular expression in quotes (inline token), the name of a
573 | production rule (followed by arguments in \verb|<<...>>|, if the rule
574 | has parameters), and single line Python statements (\verb|{{...}}|).
575 | Compound patterns are sequences (\verb|A B C ...|), choices (
576 | \verb:A | B | C | ...:), options (\verb|[...]|), zero-or-more repetitions
577 | (\verb|...*|), and one-or-more repetitions (\verb|...+|). Like
578 | regular expressions, repetition operators have a higher precedence
579 | than sequences, and sequences have a higher precedence than choices.
580 |
581 | Whenever \verb|{{...}}| is used, a legal one-line Python statement
582 | should be put inside the braces. The token \verb|}}| should not
583 | appear within the \verb|{{...}}| section, even within a string, since
584 | Yapps does not attempt to parse the Python statement. A workaround
585 | for strings is to put two strings together (\verb|"}" "}"|), or to use
586 | backslashes (\verb|"}\}"|). At the end of a rule you should use a
587 | \verb|{{ return X }}| statement to return a value. However, you
588 | should \emph{not} use any control statements (\texttt{return},
589 | \texttt{continue}, \texttt{break}) in the middle of a rule. Yapps
590 | needs to make assumptions about the control flow to generate a parser,
591 | and any changes to the control flow will confuse Yapps.
592 |
593 | The \verb|<<...>>| form can occur in two places: to define parameters
594 | to a rule and to give arguments when matching a rule. Parameters use
595 | the syntax used for Python functions, so they can include default
596 | arguments and the special forms (\verb|*args| and \verb|**kwargs|).
597 | Arguments use the syntax for Python function call arguments, so they 598 | can include normal arguments and keyword arguments. The token 599 | \verb|>>| should not appear within the \verb|<<...>>| section. 600 | 601 | In both the statements and rule arguments, you can use names defined 602 | by the parser to refer to matched patterns. You can refer to the text 603 | matched by a named token by using the token name. You can use the 604 | value returned by a production rule by using the name of that rule. 605 | If a name \texttt{X} is matched more than once (such as in loops), you 606 | will have to save the earlier value(s) in a temporary variable, and 607 | then use that temporary variable in the return value. The next 608 | section has an example of a name that occurs more than once. 609 | 610 | \mysubsection{Left Factoring} 611 | \label{sec:Left-Factoring} 612 | 613 | Yapps produces ELL(1) parsers, which determine which clause to match 614 | based on the first token available. Sometimes the leftmost tokens of 615 | several clauses may be the same. The classic example is the 616 | \emph{if/then/else} construct in Pascal: 617 | 618 | \begin{verbatim} 619 | rule stmt: "if" expr "then" stmt {{ then_part = stmt }} 620 | "else" stmt {{ return ('If',expr,then_part,stmt) }} 621 | | "if" expr "then" stmt {{ return ('If',expr,stmt,[]) }} 622 | \end{verbatim} 623 | 624 | (Note that we have to save the first \texttt{stmt} into a variable 625 | because there is another \texttt{stmt} that will be matched.) The 626 | left portions of the two clauses are the same, which presents a 627 | problem for the parser. The solution is \emph{left-factoring}: the 628 | common parts are put together, and \emph{then} a choice is made about 629 | the remaining part: 630 | 631 | \begin{verbatim} 632 | rule stmt: "if" expr 633 | "then" stmt {{ then_part = stmt }} 634 | {{ else_part = [] }} 635 | [ "else" stmt {{ else_part = stmt }} ] 636 | {{ return ('If', expr, then_part, else_part) }} 637 | \end{verbatim} 638 | 639 | Unfortunately, the classic \emph{if/then/else} situation is 640 | \emph{still} ambiguous when you left-factor. Yapps can deal with this 641 | situation, but will report a warning; see section 642 | \ref{sec:Ambiguous-Grammars} for details. 643 | 644 | In general, replace rules of the form: 645 | 646 | \begin{verbatim} 647 | rule A: a b1 {{ return E1 }} 648 | | a b2 {{ return E2 }} 649 | | c3 {{ return E3 }} 650 | | c4 {{ return E4 }} 651 | \end{verbatim} 652 | 653 | with rules of the form: 654 | 655 | \begin{verbatim} 656 | rule A: a ( b1 {{ return E1 }} 657 | | b2 {{ return E2 }} 658 | ) 659 | | c3 {{ return E3 }} 660 | | c4 {{ return E4 }} 661 | \end{verbatim} 662 | 663 | \mysubsection{Left Recursion} 664 | 665 | A common construct in grammars is for matching a list of patterns, 666 | sometimes separated with delimiters such as commas or semicolons. In 667 | LR-based parser systems, we can parse a list with something like this: 668 | 669 | \begin{verbatim} 670 | rule sum: NUM {{ return NUM }} 671 | | sum "+" NUM {{ return (sum, NUM) }} 672 | \end{verbatim} 673 | 674 | Parsing \texttt{1+2+3+4} would produce the output 675 | \texttt{(((1,2),3),4)}, which is what we want from a left-associative 676 | addition operator. Unfortunately, this grammar is \emph{left 677 | recursive,} because the \texttt{sum} rule contains a clause that 678 | begins with \texttt{sum}. (The recursion occurs at the left side of 679 | the clause.) 
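To see why left recursion is a problem for the recursive descent parsers
that Yapps generates, consider what the generated method for \texttt{sum}
would have to look like. The following is a hypothetical sketch, not actual
Yapps output:

\begin{verbatim}
def sum(self):
    # To try the clause: sum "+" NUM, the method must first match
    # a sum -- that is, call itself -- before it has consumed any
    # input, so it would recurse forever.
    result = self.sum()
    self._scan('"+"')
    NUM = self._scan('NUM')
    return (result, NUM)
\end{verbatim}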
680 |
681 | We must restructure this grammar to be \emph{right recursive} instead:
682 |
683 | \begin{verbatim}
684 | rule sum: NUM       {{ return NUM }}
685 |   | NUM "+" sum     {{ return (NUM, sum) }}
686 | \end{verbatim}
687 |
688 | Unfortunately, using this grammar, \texttt{1+2+3+4} would be parsed as
689 | \texttt{(1,(2,(3,4)))}, which no longer follows left associativity.
690 | The rule also needs to be left-factored. Instead, we write the
691 | pattern as a loop:
692 |
693 | \begin{verbatim}
694 | rule sum: NUM       {{ v = NUM }}
695 |   ( "[+]" NUM       {{ v = (v,NUM) }} )*
696 |                     {{ return v }}
697 | \end{verbatim}
698 |
699 | In general, replace rules of the form:
700 |
701 | \begin{verbatim}
702 | rule A: A a1 -> << E1 >>
703 |   | A a2 -> << E2 >>
704 |   | b3 -> << E3 >>
705 |   | b4 -> << E4 >>
706 | \end{verbatim}
707 |
708 | with rules of the form:
709 |
710 | \begin{verbatim}
711 | rule A: ( b3 {{ A = E3 }}
712 |   | b4 {{ A = E4 }} )
713 |   ( a1 {{ A = E1 }}
714 |   | a2 {{ A = E2 }} )*
715 |   {{ return A }}
716 | \end{verbatim}
717 |
718 | We have taken a rule that proved problematic with recursion and
719 | turned it into a rule that works well with looping constructs.
720 |
721 | \mysubsection{Ambiguous Grammars}
722 | \label{sec:Ambiguous-Grammars}
723 |
724 | In section \ref{sec:Left-Factoring} we saw the classic if/then/else
725 | ambiguity, which occurs because the ``else \ldots'' portion of an ``if
726 | \ldots then \ldots else \ldots'' construct is optional. Programs with
727 | nested if/then/else constructs can be ambiguous when one of the else
728 | clauses is missing:
729 | \begin{verbatim}
730 | if 1 then                  if 1 then
731 |    if 5 then                  if 5 then
732 |       x := 1;                    x := 1;
733 |    else                    else
734 |       y := 9;                 y := 9;
735 | \end{verbatim}
736 |
737 | The indentation shows that the program can be parsed in two different
738 | ways. (Of course, if we all would adopt Python's indentation-based
739 | structuring, this would never happen!) Usually we want the parsing on
740 | the left: the ``else'' should be associated with the closest ``if''
741 | statement. In section \ref{sec:Left-Factoring} we ``solved'' the
742 | problem by using the following grammar:
743 |
744 | \begin{verbatim}
745 | rule stmt: "if" expr
746 |    "then" stmt            {{ then_part = stmt }}
747 |                           {{ else_part = [] }}
748 |    [ "else" stmt          {{ else_part = stmt }} ]
749 |                           {{ return ('If', expr, then_part, else_part) }}
750 | \end{verbatim}
751 |
752 | Here, we have an optional match of ``else'' followed by a statement.
753 | The ambiguity is that if an ``else'' is present, it is not clear
754 | whether you want it parsed immediately or if you want it to be parsed
755 | by the outer ``if''.
756 |
757 | Yapps will deal with the situation by matching the else pattern
758 | when it can. The parser will work in this case because it prefers the
759 | \emph{first} matching clause, which tells Yapps to parse the ``else''.
760 | That is exactly what we want!
761 |
762 | For ambiguity cases with choices, Yapps will choose the \emph{first}
763 | matching choice. However, remember that Yapps only looks at the first
764 | token to determine its decision, so {\tt (a b | a c)} will result in
765 | Yapps choosing {\tt a b} even when the input is {\tt a c}. It only
766 | looks at the first token, {\tt a}, to make its decision.
767 |
768 | \mysection{Customization}
769 |
770 | Both the parsers and the scanners can be customized. The parser is
771 | usually extended by subclassing, and the scanner can either be
772 | subclassed or completely replaced.
773 |
774 | \mysubsection{Customizing Parsers}
775 |
776 | If additional fields and methods are needed in order for a parser to
777 | work, Python subclassing can be used. (This is unlike parser classes
778 | written in static languages, in which these fields and methods must be
779 | defined in the generated parser class.) We simply subclass the
780 | generated parser, and add any fields or methods required. Expressions
781 | in the grammar can call methods of the subclass to perform any actions
782 | that cannot be expressed as a simple expression. For example,
783 | consider this simple grammar:
784 |
785 | \begin{verbatim}
786 | parser X:
787 |     rule goal: "something" {{ self.printmsg() }}
788 | \end{verbatim}
789 |
790 | The \texttt{printmsg} function need not be implemented in the parser
791 | class \texttt{X}; it can be implemented in a subclass:
792 |
793 | \begin{verbatim}
794 | import Xparser
795 |
796 | class MyX(Xparser.X):
797 |     def printmsg(self):
798 |         print "Hello!"
799 | \end{verbatim}
800 |
801 | \mysubsection{Customizing Scanners}
802 |
803 | The generated parser class is not dependent on the generated scanner
804 | class. A scanner object is passed to the parser object's constructor
805 | in the \texttt{parse} function. To use a different scanner, write
806 | your own function to construct parser objects, with an instance of a
807 | different scanner. Scanner objects must have a \texttt{token} method
808 | that accepts an integer \texttt{N} as well as a list of allowed token
809 | types, and returns the Nth token, as a tuple. The default scanner
810 | raises \texttt{NoMoreTokens} if no tokens are available, and
811 | \texttt{SyntaxError} if no token could be matched. However, the
812 | parser does not rely on these exceptions; only the \texttt{parse}
813 | convenience function (which calls \texttt{wrap\_error\_reporter}) and
814 | the \texttt{print\_error} error display function use those exceptions.
815 |
816 | The tuples representing tokens have four elements. The first two are
817 | the beginning and ending indices of the matched text in the input
818 | string. The third element is the type tag, matching either the name
819 | of a named token or the quoted regexp of an inline or ignored token.
820 | The fourth element of the token tuple is the matched text. If the
821 | input string is \texttt{s}, and the token tuple is
822 | \texttt{(b,e,type,val)}, then \texttt{val} should be equal to
823 | \texttt{s[b:e]}.
824 |
825 | The generated parsers do not use the beginning or ending index. They use
826 | only the token type and value. However, the default error reporter
827 | uses the beginning and ending index to show the user where the error
828 | is.
829 |
830 | \mysection{Parser Mechanics}
831 |
832 | The base parser class (Parser) defines two methods, \texttt{\_scan}
833 | and \texttt{\_peek}, and two fields, \texttt{\_pos} and
834 | \texttt{\_scanner}. The generated parser inherits from the base
835 | parser, and contains one method for each rule in the grammar. To
836 | avoid name clashes, do not use names that begin with an underscore
837 | (\texttt{\_}).
838 |
839 | \mysubsection{Parser Objects}
840 | \label{sec:Parser-Objects}
841 |
842 | Yapps produces as output two exception classes, a scanner class, a
843 | parser class, and a function \texttt{parse} that puts everything
844 | together. The \texttt{parse} function does not have to be used;
845 | instead, one can create a parser and scanner object and use them
846 | together for parsing.
847 |
848 | \begin{verbatim}
849 | def parse(rule, text):
850 |     P = X(XScanner(text))
851 |     return wrap_error_reporter(P, rule)
852 | \end{verbatim}
853 |
854 | The \texttt{parse} function takes the name of a rule and an input string
855 | as input. It creates a scanner and parser object, then calls
856 | \texttt{wrap\_error\_reporter} to execute the method in the parser
857 | object named \texttt{rule}. The wrapper function will call the
858 | appropriate parser rule and report any parsing errors to standard
859 | output.
860 |
861 | There are several situations in which the \texttt{parse} function
862 | would not be useful. If a different parser or scanner is being used,
863 | or exceptions are to be handled differently, a new \texttt{parse}
864 | function would be required. The supplied \texttt{parse} function can
865 | be used as a template for writing a function for your own needs. An
866 | example of a custom parse function is the \texttt{generate} function
867 | in \texttt{Yapps.py}.
868 |
869 | \mysubsection{Context Sensitive Scanner}
870 |
871 | Unlike most scanners, the scanner produced by Yapps can take into
872 | account the context in which tokens are needed, and try to match only
873 | good tokens. For example, in the grammar:
874 |
875 | \begin{verbatim}
876 | parser IniFile:
877 |     token ID: "[a-zA-Z_0-9]+"
878 |     token VAL: ".*"
879 |
880 |     rule pair: ID "[ \t]*=[ \t]*" VAL "\n"
881 | \end{verbatim}
882 |
883 | we would like to scan lines of text and pick out a name/value pair.
884 | In a conventional scanner, the input string \texttt{shell=progman.exe}
885 | would be turned into a single token of type \texttt{VAL}. The Yapps
886 | scanner, however, knows that at the beginning of the line, an
887 | \texttt{ID} is expected, so it will return \texttt{"shell"} as a token
888 | of type \texttt{ID}. Later, it will return \texttt{"progman.exe"} as
889 | a token of type \texttt{VAL}.
890 |
891 | Context sensitivity decreases the separation between scanner and
892 | parser, but it is useful in parsers like \texttt{IniFile}, where the
893 | tokens themselves are not unambiguous, but \emph{are} unambiguous
894 | given a particular stage in the parsing process.
895 |
896 | Unfortunately, context sensitivity can make it more difficult to
897 | detect errors in the input. For example, in parsing a Pascal-like
898 | language with ``begin'' and ``end'' as keywords, a context sensitive
899 | scanner would only match ``end'' as the END token if the parser is in
900 | a place that will accept the END token. If not, then the scanner
901 | would match ``end'' as an identifier. To disable the context
902 | sensitive scanner in Yapps, add the
903 | \texttt{context-insensitive-scanner} option to the grammar:
904 |
905 | \begin{verbatim}
906 | parser X:
907 |     option: "context-insensitive-scanner"
908 | \end{verbatim}
909 |
910 | Context-insensitive scanning makes the parser look cleaner as well.
911 |
912 | \mysubsection{Internal Variables}
913 |
914 | There are two internal fields that may be of use. The parser object
915 | has two fields, \texttt{\_pos}, which is the index of the current
916 | token being matched, and \texttt{\_scanner}, which is the scanner
917 | object. The token itself can be retrieved by accessing the scanner
918 | object and calling the \texttt{token} method with the token index.
However, if you call \texttt{token} before the token has been requested by
the parser, it may mess up a context-sensitive scanner.\footnote{When using
a context-sensitive scanner, the parser tells the scanner what the valid
token types are at each point. If you call \texttt{token} before the parser
can tell the scanner the valid token types, the scanner will attempt to
match without considering the context.} A
919 | potentially useful combination of these fields is to extract the
920 | portion of the input matched by the current rule. To do this, just save the scanner state (\texttt{\_scanner.pos}) before the text is matched and then again after the text is matched:
921 |
922 | \begin{verbatim}
923 | rule R:
924 |     {{ start = self._scanner.pos }}
925 |     a b c
926 |     {{ end = self._scanner.pos }}
927 |     {{ print 'Text is', self._scanner.input[start:end] }}
928 | \end{verbatim}
929 |
930 | \mysubsection{Pre- and Post-Parser Code}
931 |
932 | Sometimes the parser code needs to rely on helper variables,
933 | functions, and classes. A Yapps grammar can optionally be surrounded
934 | by double percent signs, to separate the grammar from Python code.
935 |
936 | \begin{verbatim}
937 | ... Python code ...
938 | %%
939 | ... Yapps grammar ...
940 | %%
941 | ... Python code ...
942 | \end{verbatim}
943 |
944 | The second \verb|%%| can be omitted if there is no Python code at the
945 | end, and the first \verb|%%| can be omitted if there is no extra
946 | Python code at all. (To have code only at the end, both separators
947 | are required.)
948 |
949 | If the second \verb|%%| is omitted, Yapps will insert testing code
950 | that allows you to use the generated parser to parse a file.
951 |
952 | The extended calculator example in the Yapps examples subdirectory
953 | includes both pre-parser and post-parser code.
954 |
955 | \mysubsection{Representation of Grammars}
956 |
957 | For each kind of pattern there is a class derived from Pattern. Yapps
958 | has classes for Terminal, NonTerminal, Sequence, Choice, Option, Plus,
959 | Star, and Eval. Each of these classes has the following interface:
960 |
961 | \begin{itemize}
962 | \item[setup(\emph{gen})] Set accepts-$\epsilon$, and call
963 | \emph{gen.changed()} if it changed. This function can change the
964 | flag from false to true but \emph{not} from true to false.
965 | \item[update(\emph{gen})] Set \first and \follow, and call
966 | \emph{gen.changed()} if either changed. This function can add to
967 | the sets but \emph{not} remove from them.
968 | \item[output(\emph{gen}, \emph{indent})] Generate code for matching
969 | this rule, using \emph{indent} as the current indentation level.
970 | Writes are performed using \emph{gen.write}.
971 | \item[used(\emph{vars})] Given a list of variables \emph{vars},
972 | return two lists: one containing the variables that are used, and
973 | one containing the variables that are assigned. This function is
974 | used for optimizing the resulting code.
975 | \end{itemize}
976 |
977 | Both \emph{setup} and \emph{update} monotonically increase the
978 | variables they modify. Since the variables can only increase a finite
979 | number of times, we can repeatedly call the function until the
980 | variables stabilize. The \emph{used} function is not currently
981 | implemented.
982 |
983 | With each pattern in the grammar Yapps associates three pieces of
984 | information: the \first set, the \follow set, and the
985 | accepts-$\epsilon$ flag.
986 | 987 | The \first set contains the tokens that can appear as we start 988 | matching the pattern. The \follow set contains the tokens that can 989 | appear immediately after we match the pattern. The accepts-$\epsilon$ 990 | flag is true if the pattern can match no tokens. In this case, \first 991 | will contain all the elements in \follow. The \follow set is not 992 | needed when accepts-$\epsilon$ is false, and may not be accurate in 993 | those cases. 994 | 995 | Yapps does not compute these sets precisely. Its approximation can 996 | miss certain cases, such as this one: 997 | 998 | \begin{verbatim} 999 | rule C: ( A* | B ) 1000 | rule B: C [A] 1001 | \end{verbatim} 1002 | 1003 | Yapps will calculate {\tt C}'s \follow set to include {\tt A}. 1004 | However, {\tt C} will always match all the {\tt A}'s, so {\tt A} will 1005 | never follow it. Yapps 2.0 does not properly handle this construct, 1006 | but if it seems important, I may add support for it in a future 1007 | version. 1008 | 1009 | Yapps also cannot handle constructs that depend on the calling 1010 | sequence. For example: 1011 | 1012 | \begin{verbatim} 1013 | rule R: U | 'b' 1014 | rule S: | 'c' 1015 | rule T: S 'b' 1016 | rule U: S 'a' 1017 | \end{verbatim} 1018 | 1019 | The \follow set for {\tt S} includes {\tt a} and {\tt b}. Since {\tt 1020 | S} can be empty, the \first set for {\tt S} should include {\tt a}, 1021 | {\tt b}, and {\tt c}. However, when parsing {\tt R}, if the lookahead 1022 | is {\tt b} we should \emph{not} parse {\tt U}. That's because in {\tt 1023 | U}, {\tt S} is followed by {\tt a} and not {\tt b}. Therefore in 1024 | {\tt R}, we should choose rule {\tt U} only if there is an {\tt a} or 1025 | {\tt c}, but not if there is a {\tt b}. Yapps and many other LL(1) 1026 | systems do not distinguish {\tt S b} and {\tt S a}, making {\tt 1027 | S}'s \follow set {\tt a, b}, and making {\tt R} always try to match 1028 | {\tt U}. In this case we can solve the problem by changing {\tt R} to 1029 | \verb:'b' | U: but it may not always be possible to solve all such 1030 | problems in this way. 1031 | 1032 | \appendix 1033 | 1034 | \mysection{Grammar for Parsers} 1035 | 1036 | This is the grammar for parsers, without any Python code mixed in. 1037 | The complete grammar can be found in \texttt{parsedesc.g} in the Yapps 1038 | distribution. 1039 | 1040 | \begin{verbatim} 1041 | parser ParserDescription: 1042 | ignore: "\\s+" 1043 | ignore: "#.*?\r?\n" 1044 | token END: "$" # $ means end of string 1045 | token ATTR: "<<.+?>>" 1046 | token STMT: "{{.+?}}" 1047 | token ID: '[a-zA-Z_][a-zA-Z_0-9]*' 1048 | token STR: '[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"' 1049 | 1050 | rule Parser: "parser" ID ":" 1051 | Options 1052 | Tokens 1053 | Rules 1054 | END 1055 | 1056 | rule Options: ( "option" ":" STR )* 1057 | rule Tokens: ( "token" ID ":" STR | "ignore" ":" STR )* 1058 | rule Rules: ( "rule" ID OptParam ":" ClauseA )* 1059 | 1060 | rule ClauseA: ClauseB ( '[|]' ClauseB )* 1061 | rule ClauseB: ClauseC* 1062 | rule ClauseC: ClauseD [ '[+]' | '[*]' ] 1063 | rule ClauseD: STR | ID [ATTR] | STMT 1064 | | '\\(' ClauseA '\\)' | '\\[' ClauseA '\\]' 1065 | \end{verbatim} 1066 | 1067 | \mysection{Upgrading} 1068 | 1069 | Yapps 2.0 is not backwards compatible with Yapps 1.0. This section 1070 | gives some tips for upgrading: 1071 | 1072 | \begin{enumerate} 1073 | \item Yapps 1.0 was distributed as a single file.
Yapps 2.0 is 1074 | instead distributed as two Python files: a \emph{parser generator} 1075 | (26k) and a \emph{parser runtime} (5k). You need both files to 1076 | create parsers, but you need only the runtime (\texttt{yappsrt.py}) 1077 | to use the parsers. 1078 | 1079 | \item Yapps 1.0 supported Python 1.4 regular expressions from the 1080 | \texttt{regex} module. Yapps 2.0 uses Python 1.5 regular 1081 | expressions from the \texttt{re} module. \emph{The new syntax for 1082 | regular expressions is not compatible with the old syntax.} 1083 | Andrew Kuchling has a \htmladdnormallink{guide to converting 1084 | regular 1085 | expressions}{http://www.python.org/doc/howto/regex-to-re/} on his 1086 | web page. 1087 | 1088 | \item Yapps 1.0 wants a pattern and then a return value in \verb|->| 1089 | \verb|<<...>>|. Yapps 2.0 allows patterns and Python statements to 1090 | be mixed. To convert a rule like this: 1091 | 1092 | \begin{verbatim} 1093 | rule R: A B C -> << E1 >> 1094 | | X Y Z -> << E2 >> 1095 | \end{verbatim} 1096 | 1097 | to Yapps 2.0 form, replace the return value specifiers with return 1098 | statements: 1099 | 1100 | \begin{verbatim} 1101 | rule R: A B C {{ return E1 }} 1102 | | X Y Z {{ return E2 }} 1103 | \end{verbatim} 1104 | 1105 | \item Yapps 2.0 does not perform tail recursion elimination. This 1106 | means any recursive rules you write will be turned into recursive 1107 | methods in the parser. The parser will work, but may be slower. 1108 | It can be made faster by rewriting recursive rules, using instead 1109 | the looping operators \verb|*| and \verb|+| provided in Yapps 2.0. 1110 | 1111 | \end{enumerate} 1112 | 1113 | \mysection{Troubleshooting} 1114 | 1115 | \begin{itemize} 1116 | \item A common error is to write a grammar that doesn't have an END 1117 | token. End tokens are needed when it is not clear when to stop 1118 | parsing. For example, when parsing the expression {\tt 3+5}, it is 1119 | not clear after reading {\tt 3} whether to treat it as a complete 1120 | expression or whether the parser should continue reading. 1121 | Therefore the grammar for numeric expressions should include an end 1122 | token. Another example is the grammar for Lisp expressions. In 1123 | Lisp, it is always clear when you should stop parsing, so you do 1124 | \emph{not} need an end token. In fact, it may be more useful not 1125 | to have an end token, so that you can read in several Lisp expressions. 1126 | \item If there is a chance of ambiguity, make sure to put the choices 1127 | in the order you want them checked. Usually the most specific 1128 | choice should be first. Empty sequences should usually be last. 1129 | \item The context sensitive scanner is not appropriate for all 1130 | grammars. You might try using the context-insensitive scanner with the 1131 | {\tt context-insensitive-scanner} option in the grammar. 1132 | \item If performance turns out to be a problem, try writing a custom 1133 | scanner. The Yapps scanner is rather slow (but flexible and easy 1134 | to understand). 1135 | \end{itemize} 1136 | 1137 | \mysection{History} 1138 | 1139 | Yapps 1 had several limitations that bothered me while writing 1140 | parsers: 1141 | 1142 | \begin{enumerate} 1143 | \item It was not possible to insert statements into the generated 1144 | parser. A common workaround was to write an auxiliary function 1145 | that executed those statements, and to call that function as part 1146 | of the return value calculation.
For example, several of my 1147 | parsers had an ``append(x,y)'' function that existed solely to call 1148 | ``x.append(y)''. 1149 | \item The way in which grammars were specified was rather 1150 | restrictive: a rule was a choice of clauses. Each clause was a 1151 | sequence of tokens and rule names, followed by a return value. 1152 | \item Optional matching had to be put into a separate rule because 1153 | choices were only made at the beginning of a rule. 1154 | \item Repetition had to be specified in terms of recursion. Not only 1155 | was this awkward (sometimes requiring additional rules), but I also had to 1156 | add a tail recursion optimization to Yapps to transform the 1157 | recursion back into a loop. 1158 | \end{enumerate} 1159 | 1160 | Yapps 2 addresses each of these limitations. 1161 | 1162 | \begin{enumerate} 1163 | \item Statements can occur anywhere within a rule. (However, only 1164 | one-line statements are allowed; multiline blocks marked by 1165 | indentation are not.) 1166 | \item Grammars can be specified using any mix of sequences, choices, 1167 | tokens, and rule names. To allow for complex structures, 1168 | parentheses can be used for grouping. 1169 | \item Given choices and parenthesization, optional matching can be 1170 | expressed as a choice between some pattern and nothing. In 1171 | addition, Yapps 2 has the convenience syntax \verb|[A B ...]| for 1172 | matching \verb|A B ...| optionally. 1173 | \item Repetition operators \verb|*| for zero or more and \verb|+| for 1174 | one or more make it easy to specify repeating patterns. 1175 | \end{enumerate} 1176 | 1177 | It is my hope that Yapps 2 will be flexible enough to meet my needs 1178 | for another year, yet simple enough that I do not hesitate to use it. 1179 | 1180 | \mysection{Debian Extensions} 1181 | \label{sec:debian} 1182 | 1183 | The Debian version adds the following enhancements to the original 1184 | Yapps code. They were written by Matthias Urlichs. 1185 | 1186 | \begin{enumerate} 1187 | \item Yapps can stack input sources (``include files''). A usage example 1188 | is supplied with the calc.g sample program. 1189 | \item Yapps now understands augmented ignore-able patterns. 1190 | This means that Yapps can parse multi-line C comments; this wasn't 1191 | possible before. 1192 | \item Better error reporting. 1193 | \item Yapps now reads its input incrementally. 1194 | \end{enumerate} 1195 | 1196 | The parser runtime has been renamed to \texttt{yapps/runtime.py}. 1197 | In Debian, this file is provided by the \texttt{yapps2-runtime} package. 1198 | You need to depend on it if you Debianize Python programs that use 1199 | Yapps. 1200 | 1201 | \mysection{Future Extensions} 1202 | \label{sec:future} 1203 | 1204 | I am still investigating the possibility of LL(2) and higher 1205 | lookahead. However, it looks like the resulting parsers will be 1206 | somewhat ugly. 1207 | 1208 | It would be nice to control choices with user-defined predicates. 1209 | 1210 | The most likely future extension is backtracking. A grammar pattern 1211 | like \verb|(VAR ':=' expr)? {{ return Assign(VAR,expr) }} : expr {{ return expr }}| 1212 | would turn into code that attempted to match \verb|VAR ':=' expr|. If 1213 | it succeeded, it would run \verb|{{ return ... }}|. If it failed, it 1214 | would match \verb|expr {{ return expr }}|. Backtracking may make it 1215 | less necessary to write LL(2) grammars.
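As a rough sketch (hypothetical: the \texttt{rewind} helper below does not exist in the current runtime, and the real design may well differ), the code generated for such a pattern might look like:

\begin{verbatim}
pos = self._scanner.get_pos()      # checkpoint before the attempt
try:
    VAR = self._scan('VAR')
    self._scan("':='")
    expr = self.expr()
    return Assign(VAR, expr)
except runtime.SyntaxError:
    self._scanner.rewind(pos)      # hypothetical: undo the failed match
    expr = self.expr()
    return expr
\end{verbatim}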
1216 | 1217 | \mysection{References} 1218 | 1219 | \begin{enumerate} 1220 | \item The \htmladdnormallink{Python-Parser 1221 | SIG}{http://www.python.org/sigs/parser-sig/} is the first place 1222 | to look for a list of parser systems for Python. 1223 | 1224 | \item ANTLR/PCCTS, by Terrence Parr, is available at 1225 | \htmladdnormallink{The ANTLR Home Page}{http://www.antlr.org/}. 1226 | 1227 | \item PyLR, by Scott Cotton, is at \htmladdnormallink{his Starship 1228 | page}{http://starship.skyport.net/crew/scott/PyLR.html}. 1229 | 1230 | \item John Aycock's \htmladdnormallink{Compiling Little Languages 1231 | Framework}{http://www.foretec.com/python/workshops/1998-11/proceedings/papers/aycock-little/aycock-little.html}. 1232 | 1233 | \item PyBison, by Scott Hassan, can be found at 1234 | \htmladdnormallink{his Python Projects 1235 | page}{http://coho.stanford.edu/\~{}hassan/Python/}. 1236 | 1237 | \item mcf.pars, by Mike C. Fletcher, is available at 1238 | \htmladdnormallink{his web 1239 | page}{http://members.rogers.com/mcfletch/programming/simpleparse/simpleparse.html}. 1240 | 1241 | \item kwParsing, by Aaron Watters, is available at 1242 | \htmladdnormallink{his Starship 1243 | page}{http://starship.skyport.net/crew/aaron_watters/kwParsing/}. 1244 | \end{enumerate} 1245 | 1246 | \end{document} 1247 | -------------------------------------------------------------------------------- /doc/yapps_grammar.g: -------------------------------------------------------------------------------- 1 | # grammar.py, part of Yapps 2 - yet another python parser system 2 | # Copyright 1999-2003 by Amit J. Patel 3 | # Enhancements copyright 2003-2004 by Matthias Urlichs 4 | # 5 | # This version of the Yapps 2 grammar can be distributed under the 6 | # terms of the MIT open source license, either found in the LICENSE 7 | # file included with the Yapps distribution 8 | # or at 9 | # 10 | # 11 | 12 | """Parser for Yapps grammars. 13 | 14 | This file defines the grammar of Yapps grammars. Naturally, it is 15 | implemented in Yapps. The grammar.py module needed by Yapps is built 16 | by running Yapps on yapps_grammar.g. (Holy circularity, Batman!) 
17 | 18 | """ 19 | 20 | import sys, re 21 | from yapps import parsetree 22 | 23 | ###################################################################### 24 | def cleanup_choice(rule, lst): 25 | if len(lst) == 0: return Sequence(rule, []) 26 | if len(lst) == 1: return lst[0] 27 | return parsetree.Choice(rule, *tuple(lst)) 28 | 29 | def cleanup_sequence(rule, lst): 30 | if len(lst) == 1: return lst[0] 31 | return parsetree.Sequence(rule, *tuple(lst)) 32 | 33 | def resolve_name(rule, tokens, id, args): 34 | if id in [x[0] for x in tokens]: 35 | # It's a token 36 | if args: 37 | print('Warning: ignoring parameters on TOKEN %s<<%s>>' % (id, args)) 38 | return parsetree.Terminal(rule, id) 39 | else: 40 | # It's a name, so assume it's a nonterminal 41 | return parsetree.NonTerminal(rule, id, args) 42 | 43 | %% 44 | parser ParserDescription: 45 | 46 | ignore: "[ \t\r\n]+" 47 | ignore: "#.*?\r?\n" 48 | token EOF: "$" 49 | token ATTR: "<<.+?>>" 50 | token STMT: "{{.+?}}" 51 | token ID: '[a-zA-Z_][a-zA-Z_0-9]*' 52 | token STR: '[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"' 53 | token LP: '\\(' 54 | token RP: '\\)' 55 | token LB: '\\[' 56 | token RB: '\\]' 57 | token OR: '[|]' 58 | token STAR: '[*]' 59 | token PLUS: '[+]' 60 | token QUEST: '[?]' 61 | token COLON: ':' 62 | 63 | rule Parser: "parser" ID ":" 64 | Options 65 | Tokens 66 | Rules<> 67 | EOF 68 | {{ return parsetree.Generator(ID,Options,Tokens,Rules) }} 69 | 70 | rule Options: {{ opt = {} }} 71 | ( "option" ":" Str {{ opt[Str] = 1 }} )* 72 | {{ return opt }} 73 | 74 | rule Tokens: {{ tok = [] }} 75 | ( 76 | "token" ID ":" Str {{ tok.append( (ID,Str) ) }} 77 | | "ignore" 78 | ":" Str {{ ign = ('#ignore',Str) }} 79 | ( STMT {{ ign = ign + (STMT[2:-2],) }} )? 80 | {{ tok.append( ign ) }} 81 | )* 82 | {{ return tok }} 83 | 84 | rule Rules<>: 85 | {{ rul = [] }} 86 | ( 87 | "rule" ID OptParam ":" ClauseA<> 88 | {{ rul.append( (ID, OptParam, ClauseA) ) }} 89 | )* 90 | {{ return rul }} 91 | 92 | rule ClauseA<>: 93 | ClauseB<> 94 | {{ v = [ClauseB] }} 95 | ( OR ClauseB<> {{ v.append(ClauseB) }} )* 96 | {{ return cleanup_choice(rule,v) }} 97 | 98 | rule ClauseB<>: 99 | {{ v = [] }} 100 | ( ClauseC<> {{ v.append(ClauseC) }} )* 101 | {{ return cleanup_sequence(rule, v) }} 102 | 103 | rule ClauseC<>: 104 | ClauseD<> 105 | ( PLUS {{ return parsetree.Plus(rule, ClauseD) }} 106 | | STAR {{ return parsetree.Star(rule, ClauseD) }} 107 | | QUEST {{ return parsetree.Option(rule, ClauseD) }} 108 | | {{ return ClauseD }} ) 109 | 110 | rule ClauseD<>: 111 | STR {{ t = (STR, eval(STR,{},{})) }} 112 | {{ if t not in tokens: tokens.insert( 0, t ) }} 113 | {{ return parsetree.Terminal(rule, STR) }} 114 | | ID OptParam {{ return resolve_name(rule,tokens, ID, OptParam) }} 115 | | LP ClauseA<> RP {{ return ClauseA }} 116 | | LB ClauseA<> RB {{ return parsetree.Option(rule, ClauseA) }} 117 | | STMT {{ return parsetree.Eval(rule, STMT[2:-2]) }} 118 | 119 | rule OptParam: [ ATTR {{ return ATTR[2:-2] }} ] {{ return '' }} 120 | rule Str: STR {{ return eval(STR,{},{}) }} 121 | %% 122 | -------------------------------------------------------------------------------- /examples/calc.g: -------------------------------------------------------------------------------- 1 | globalvars = {} # We will store the calculator's variables here 2 | 3 | def lookup(map, name): 4 | for x,v in map: 5 | if x == name: return v 6 | if not globalvars.has_key(name): print 'Undefined (defaulting to 0):', name 7 | return globalvars.get(name, 0) 8 | 9 | def stack_input(scanner,ign): 10 | 
"""Grab more input""" 11 | scanner.stack_input(raw_input(">?> ")) 12 | 13 | %% 14 | parser Calculator: 15 | ignore: "[ \r\t\n]+" 16 | ignore: "[?]" {{ stack_input }} 17 | 18 | token END: "$" 19 | token NUM: "[0-9]+" 20 | token VAR: "[a-zA-Z_]+" 21 | 22 | # Each line can either be an expression or an assignment statement 23 | rule goal: expr<<[]>> END {{ print '=', expr }} 24 | {{ return expr }} 25 | | "set" VAR expr<<[]>> END {{ globalvars[VAR] = expr }} 26 | {{ print VAR, '=', expr }} 27 | {{ return expr }} 28 | 29 | # An expression is the sum and difference of factors 30 | rule expr<>: factor<> {{ n = factor }} 31 | ( "[+]" factor<> {{ n = n+factor }} 32 | | "-" factor<> {{ n = n-factor }} 33 | )* {{ return n }} 34 | 35 | # A factor is the product and division of terms 36 | rule factor<>: term<> {{ v = term }} 37 | ( "[*]" term<> {{ v = v*term }} 38 | | "/" term<> {{ v = v/term }} 39 | )* {{ return v }} 40 | 41 | # A term is a number, variable, or an expression surrounded by parentheses 42 | rule term<>: 43 | NUM {{ return int(NUM) }} 44 | | VAR {{ return lookup(V, VAR) }} 45 | | "\\(" expr "\\)" {{ return expr }} 46 | | "let" VAR "=" expr<> {{ V = [(VAR, expr)] + V }} 47 | "in" expr<> {{ return expr }} 48 | %% 49 | if __name__=='__main__': 50 | print 'Welcome to the calculator sample for Yapps 2.' 51 | print ' Enter either "" or "set ",' 52 | print ' or just press return to exit. An expression can have' 53 | print ' local variables: let x = expr in expr' 54 | # We could have put this loop into the parser, by making the 55 | # `goal' rule use (expr | set var expr)*, but by putting the 56 | # loop into Python code, we can make it interactive (i.e., enter 57 | # one expression, get the result, enter another expression, etc.) 58 | while 1: 59 | try: s = raw_input('>>> ') 60 | except EOFError: break 61 | if not s.strip(): break 62 | parse('goal', s) 63 | print 'Bye.' 
64 | 65 | -------------------------------------------------------------------------------- /examples/expr.g: -------------------------------------------------------------------------------- 1 | parser Calculator: 2 | token END: "$" 3 | token NUM: "[0-9]+" 4 | 5 | rule goal: expr END {{ return expr }} 6 | 7 | # An expression is the sum and difference of factors 8 | rule expr: factor {{ v = factor }} 9 | ( "[+]" factor {{ v = v+factor }} 10 | | "-" factor {{ v = v-factor }} 11 | )* {{ return v }} 12 | 13 | # A factor is the product and division of terms 14 | rule factor: term {{ v = term }} 15 | ( "[*]" term {{ v = v*term }} 16 | | "/" term {{ v = v/term }} 17 | )* {{ return v }} 18 | 19 | # A term is either a number or an expression surrounded by parentheses 20 | rule term: NUM {{ return int(NUM) }} 21 | | "\\(" expr "\\)" {{ return expr }} 22 | -------------------------------------------------------------------------------- /examples/lisp.g: -------------------------------------------------------------------------------- 1 | parser Lisp: 2 | ignore: r'\s+' 3 | token NUM: r'[0-9]+' 4 | token ID: r'[-+*/!@$%^&=.a-zA-Z0-9_]+' 5 | token STR: r'"([^\\"]+|\\.)*"' 6 | 7 | rule expr: ID {{ return ('id',ID) }} 8 | | STR {{ return ('str',eval(STR)) }} 9 | | NUM {{ return ('num',int(NUM)) }} 10 | | r"\(" 11 | {{ e = [] }} # initialize the list 12 | ( expr {{ e.append(expr) }} ) * # put each expr into the list 13 | r"\)" {{ return e }} # return the list 14 | -------------------------------------------------------------------------------- /examples/notes: -------------------------------------------------------------------------------- 1 | Hints 2 | ##### 3 | 4 | Some additional hints for your edification. 5 | 6 | Author: Matthias Urlichs 7 | 8 | How to process C preprocessor codes: 9 | ==================================== 10 | 11 | Rudimentary include handling has been added to the parser by me. 12 | 13 | However, if you want to do anything fancy, like for instance whatever 14 | the C preprocessor does, things get more complicated. Fortunately, 15 | there's already a nice tool to handle C preprocessing -- CPP itself. 16 | 17 | If you want to report errors correctly in that situation, do this: 18 | 19 | def set_line(s,m): 20 | """Fixup the scanner's idea of the current line""" 21 | s.filename = m.group(2) 22 | line = int(m.group(1)) 23 | s.del_line = line - s.line 24 | 25 | %% 26 | parser whatever: 27 | ignore: '^#\s*(\d+)\s*"([^"\n]+)"\s*\n' {{ set_line }} 28 | ignore: '^#.*\n' 29 | 30 | [...] 31 | %% 32 | if __name__=='__main__': 33 | import sys,os 34 | for a in sys.argv[1:]: 35 | f=os.popen("cpp "+repr(a),"r") 36 | 37 | P = whatever(whateverScanner("", filename=a, file=f)) 38 | try: P.goal() 39 | except runtime.SyntaxError as e: 40 | runtime.print_error(e, P._scanner) 41 | sys.exit(1) 42 | 43 | f.close() 44 | 45 | -------------------------------------------------------------------------------- /examples/xml.g: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2 2 | 3 | # xml.g 4 | # 5 | # Amit J. Patel, August 2003 6 | # 7 | # Simple (non-conforming, non-validating) parsing of XML documents, 8 | # based on Robert D. Cameron's "REX" shallow parser. It doesn't 9 | # handle CDATA and lots of other stuff; it's meant to demonstrate 10 | # Yapps, not replace a proper XML parser. 
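# As a quick illustration (added note, not in the original header):
# parsing '<a href="x">hi</a>' with the grammar below returns nested
# lists of the form ['a', {'href': 'x'}, 'hi'], i.e. tag name,
# attribute dictionary, then child nodes.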
11 | 12 | %% 13 | 14 | parser xml: 15 | token nodetext: r'[^<>]+' 16 | token attrtext_singlequote: "[^']*" 17 | token attrtext_doublequote: '[^"]*' 18 | token SP: r'\s' 19 | token id: r'[a-zA-Z_:][a-zA-Z0-9_:.-]*' 20 | 21 | rule node: 22 | r'<!--.*?-->' {{ return ['!--comment'] }} 23 | | r'<!\[CDATA\[.*?\]\]>' {{ return ['![CDATA['] }} 24 | | r'<!DOCTYPE[^>]*>' {{ return ['!doctype'] }} 25 | | '<' SP* id SP* attributes SP* {{ startid = id }} 26 | ( '>' nodes '</' SP* id SP* '>' {{ assert startid == id, 'Mismatched tags <%s> ... </%s>' % (startid, id) }} 27 | {{ return [id, attributes] + nodes }} 28 | | '/\s*>' {{ return [id, attributes] }} 29 | ) 30 | | nodetext {{ return nodetext }} 31 | 32 | rule nodes: {{ result = [] }} 33 | ( node {{ result.append(node) }} 34 | ) * {{ return result }} 35 | 36 | rule attribute: id SP* '=' SP* 37 | ( '"' attrtext_doublequote '"' {{ return (id, attrtext_doublequote) }} 38 | | "'" attrtext_singlequote "'" {{ return (id, attrtext_singlequote) }} 39 | ) 40 | 41 | rule attributes: {{ result = {} }} 42 | ( attribute SP* {{ result[attribute[0]] = attribute[1] }} 43 | ) * {{ return result }} 44 | 45 | %% 46 | 47 | if __name__ == '__main__': 48 | tests = ['', 49 | 'some text', 50 | '< bad xml', 51 | '
', 52 | '< spacey a = "foo" / >', 53 | 'text ... ', 54 | ' middle ', 55 | ' foo bar ', 56 | ] 57 | print 58 | print '____Running tests_______________________________________' 59 | for test in tests: 60 | print 61 | try: 62 | parser = xml(xmlScanner(test)) 63 | output = '%s ==> %s' % (repr(test), repr(parser.node())) 64 | except (yappsrt.SyntaxError, AssertionError) as e: 65 | output = '%s ==> FAILED ==> %s' % (repr(test), e) 66 | print output 67 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from setuptools import setup, find_packages 4 | import os 5 | from yapps import __version__ as version 6 | 7 | pkg_root = os.path.dirname(__file__) 8 | 9 | # Error-handling here is to allow package to be built w/o README included 10 | try: readme = open(os.path.join(pkg_root, 'README.txt')).read() 11 | except IOError: readme = '' 12 | 13 | setup( 14 | name = 'Yapps2', 15 | version = version, 16 | author = 'Amit J. Patel, Matthias Urlichs', 17 | author_email = 'amitp@cs.stanford.edu, smurf@debian.org', 18 | maintainer = 'Mike Kazantsev', 19 | maintainer_email = 'mk.fraggod@gmail.com', 20 | license = 'MIT', 21 | url = 'https://github.com/mk-fg/yapps', 22 | 23 | description = 'Yet Another Python Parser System', 24 | long_description = readme, 25 | 26 | packages = find_packages(), 27 | include_package_data = True, 28 | package_data = {'': ['README.txt']}, 29 | exclude_package_data = {'': ['README.*']}, 30 | 31 | entry_points = dict(console_scripts=[ 32 | 'yapps2 = yapps.cli_tool:main' ]) ) 33 | -------------------------------------------------------------------------------- /test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | set -e 4 | trap 'echo ERROR' 0 5 | 6 | export PYTHONPATH=$(pwd) 7 | for PY_G in python python3 ; do 8 | $PY_G ./yapps2 examples/expr.g examples/expr.py 9 | 10 | for PY_X in python python3 ; do 11 | test "$(echo "1+2*3+4" | $PY_X examples/expr.py goal)" = 11 12 | done 13 | 14 | done 15 | 16 | trap 'rm examples/expr.py; echo OK' 0 17 | -------------------------------------------------------------------------------- /test/empty_clauses.g: -------------------------------------------------------------------------------- 1 | # This parser tests the use of OR clauses with one of them being empty 2 | # 3 | # The output of --dump should indicate the FOLLOW set for (A | ) is 'c'. 4 | 5 | parser Test: 6 | rule TestPlus: ( A | ) 'c' 7 | rule A: 'a'+ 8 | 9 | rule TestStar: ( B | ) 'c' 10 | rule B: 'b'* -------------------------------------------------------------------------------- /test/line_numbers.g: -------------------------------------------------------------------------------- 1 | # 2 | # The error messages produced by Yapps have a line number. 3 | # The line number should take the Python code section into account. 4 | 5 | # The line number should be 10. 6 | 7 | %% 8 | 9 | parser error_1: 10 | this_is_an_error; 11 | -------------------------------------------------------------------------------- /test/option.g: -------------------------------------------------------------------------------- 1 | 2 | %% 3 | 4 | parser test_option: 5 | ignore: r'\s+' 6 | token a: 'a' 7 | token b: 'b' 8 | token EOF: r'$' 9 | 10 | rule test_brackets: a [b] EOF 11 | 12 | rule test_question_mark: a b? 
EOF 13 | 14 | %% 15 | 16 | # The generated code for test_brackets and test_question_mark should 17 | # be the same. 18 | -------------------------------------------------------------------------------- /yapps/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | __version__ = "2.2.1" 3 | -------------------------------------------------------------------------------- /yapps/cli_tool.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # 4 | # Yapps 2 - yet another python parser system 5 | # Copyright 1999-2003 by Amit J. Patel 6 | # 7 | # This version of Yapps 2 can be distributed under the 8 | # terms of the MIT open source license, either found in the LICENSE file 9 | # included with the Yapps distribution 10 | # or at 11 | # 12 | # 13 | 14 | import os, sys, re, types 15 | #from six import string_types 16 | PY2 = sys.version_info[0] == 2 17 | if PY2: 18 | string_types = (basestring,) 19 | else: 20 | string_types = (str,) 21 | 22 | try: from yapps import runtime, parsetree, grammar 23 | except ImportError: 24 | # For running binary from a checkout-path directly 25 | if os.path.isfile('yapps/__init__.py'): 26 | sys.path.append('.') 27 | from yapps import runtime, parsetree, grammar 28 | else: raise 29 | 30 | 31 | def generate(inputfile, outputfile=None, dump=0, **flags): 32 | """Generate a grammar, given an input filename (X.g) 33 | and an output filename (defaulting to X.py).""" 34 | 35 | inputfilename = inputfile if isinstance( 36 | inputfile, string_types ) else inputfile.name 37 | if not outputfile: 38 | if inputfilename.endswith('.g'): 39 | outputfile = inputfilename[:-2] + '.py' 40 | else: 41 | raise Exception('Must specify output filename if input filename is not *.g') 42 | 43 | DIVIDER = '\n%%\n' # This pattern separates the pre/post parsers 44 | preparser, postparser = None, None # Code before and after the parser desc 45 | 46 | # Read the entire file 47 | if isinstance(inputfile, string_types): 48 | inputfile = open(inputfilename) 49 | s = inputfile.read() 50 | 51 | # See if there's a separation between the pre-parser and parser 52 | f = s.find(DIVIDER) 53 | if f >= 0: preparser, s = s[:f]+'\n\n', s[f+len(DIVIDER):] 54 | 55 | # See if there's a separation between the parser and post-parser 56 | f = s.find(DIVIDER) 57 | if f >= 0: s, postparser = s[:f], '\n\n'+s[f+len(DIVIDER):] 58 | 59 | # Create the parser and scanner and parse the text 60 | scanner = grammar.ParserDescriptionScanner(s, filename=inputfilename) 61 | if preparser: scanner.del_line += preparser.count('\n') 62 | 63 | parser = grammar.ParserDescription(scanner) 64 | t = runtime.wrap_error_reporter(parser, 'Parser') 65 | if t is None: return 1 # Failure 66 | if preparser is not None: t.preparser = preparser 67 | if postparser is not None: t.postparser = postparser 68 | 69 | # Add command line options to the set 70 | t.options.update(flags) 71 | 72 | # Generate the output 73 | if dump: 74 | t.dump_information() 75 | else: 76 | if isinstance(outputfile, string_types): 77 | outputfile = open(outputfile, 'w') 78 | t.output = outputfile 79 | t.generate_output() 80 | return 0 81 | 82 | 83 | def main(argv=None): 84 | import doctest 85 | doctest.testmod(sys.modules['__main__']) 86 | doctest.testmod(parsetree) 87 | 88 | import argparse 89 | parser = argparse.ArgumentParser( 90 | description='Generate python parser code from grammar description file.') 91 | parser.add_argument('grammar_path', help='Path to grammar 
description file (input).') 92 | parser.add_argument('parser_path', nargs='?', 93 | help='Path to output file to be generated.' 94 | ' Input path, but with .py will be used, if omitted.' 95 | ' "-" or "/dev/stdout" (on some systems) can be used to output generated code to stdout.') 96 | parser.add_argument('-i', '--context-insensitive-scanner', 97 | action='store_true', help='Scan all tokens (see docs).') 98 | parser.add_argument('-t', '--indent-with-tabs', action='store_true', 99 | help='Use tabs instead of four spaces for indentation in generated code.') 100 | parser.add_argument('--dump', action='store_true', help='Dump out grammar information.') 101 | optz = parser.parse_args(argv if argv is not None else sys.argv[1:]) 102 | 103 | parser_flags = dict() 104 | for k in 'dump', 'context_insensitive_scanner': 105 | if getattr(optz, k, False): parser_flags[k] = True 106 | if optz.indent_with_tabs: parsetree.INDENT = '\t' # not the cleanest way 107 | 108 | outputfile = optz.parser_path 109 | if outputfile == '-': outputfile = sys.stdout 110 | 111 | sys.exit(generate(optz.grammar_path, outputfile, **parser_flags)) 112 | 113 | 114 | if __name__ == '__main__': main() 115 | -------------------------------------------------------------------------------- /yapps/grammar.py: -------------------------------------------------------------------------------- 1 | # grammar.py, part of Yapps 2 - yet another python parser system 2 | # Copyright 1999-2003 by Amit J. Patel 3 | # Enhancements copyright 2003-2004 by Matthias Urlichs 4 | # 5 | # This version of the Yapps 2 grammar can be distributed under the 6 | # terms of the MIT open source license, either found in the LICENSE 7 | # file included with the Yapps distribution 8 | # or at 9 | # 10 | # 11 | 12 | """Parser for Yapps grammars. 13 | 14 | This file defines the grammar of Yapps grammars. Naturally, it is 15 | implemented in Yapps. The grammar.py module needed by Yapps is built 16 | by running Yapps on yapps_grammar.g. (Holy circularity, Batman!) 
17 | 18 | """ 19 | 20 | import sys, re 21 | from yapps import parsetree 22 | 23 | ###################################################################### 24 | def cleanup_choice(rule, lst): 25 | if len(lst) == 0: return Sequence(rule, []) 26 | if len(lst) == 1: return lst[0] 27 | return parsetree.Choice(rule, *tuple(lst)) 28 | 29 | def cleanup_sequence(rule, lst): 30 | if len(lst) == 1: return lst[0] 31 | return parsetree.Sequence(rule, *tuple(lst)) 32 | 33 | def resolve_name(rule, tokens, id, args): 34 | if id in [x[0] for x in tokens]: 35 | # It's a token 36 | if args: 37 | print('Warning: ignoring parameters on TOKEN %s<<%s>>' % (id, args)) 38 | return parsetree.Terminal(rule, id) 39 | else: 40 | # It's a name, so assume it's a nonterminal 41 | return parsetree.NonTerminal(rule, id, args) 42 | 43 | 44 | # Begin -- grammar generated by Yapps 45 | import sys, re 46 | from yapps import runtime 47 | 48 | class ParserDescriptionScanner(runtime.Scanner): 49 | patterns = [ 50 | ('"rule"', re.compile('rule')), 51 | ('"ignore"', re.compile('ignore')), 52 | ('"token"', re.compile('token')), 53 | ('"option"', re.compile('option')), 54 | ('":"', re.compile(':')), 55 | ('"parser"', re.compile('parser')), 56 | ('[ \t\r\n]+', re.compile('[ \t\r\n]+')), 57 | ('#.*?\r?\n', re.compile('#.*?\r?\n')), 58 | ('EOF', re.compile('$')), 59 | ('ATTR', re.compile('<<.+?>>')), 60 | ('STMT', re.compile('{{.+?}}')), 61 | ('ID', re.compile('[a-zA-Z_][a-zA-Z_0-9]*')), 62 | ('STR', re.compile('[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"')), 63 | ('LP', re.compile('\\(')), 64 | ('RP', re.compile('\\)')), 65 | ('LB', re.compile('\\[')), 66 | ('RB', re.compile('\\]')), 67 | ('OR', re.compile('[|]')), 68 | ('STAR', re.compile('[*]')), 69 | ('PLUS', re.compile('[+]')), 70 | ('QUEST', re.compile('[?]')), 71 | ('COLON', re.compile(':')), 72 | ] 73 | def __init__(self, str,*args,**kw): 74 | runtime.Scanner.__init__(self,None,{'[ \t\r\n]+':None,'#.*?\r?\n':None,},str,*args,**kw) 75 | 76 | class ParserDescription(runtime.Parser): 77 | Context = runtime.Context 78 | def Parser(self, _parent=None): 79 | _context = self.Context(_parent, self._scanner, 'Parser', []) 80 | self._scan('"parser"', context=_context) 81 | ID = self._scan('ID', context=_context) 82 | self._scan('":"', context=_context) 83 | Options = self.Options(_context) 84 | Tokens = self.Tokens(_context) 85 | Rules = self.Rules(Tokens, _context) 86 | EOF = self._scan('EOF', context=_context) 87 | return parsetree.Generator(ID,Options,Tokens,Rules) 88 | 89 | def Options(self, _parent=None): 90 | _context = self.Context(_parent, self._scanner, 'Options', []) 91 | opt = {} 92 | while self._peek('"option"', '"token"', '"ignore"', 'EOF', '"rule"', context=_context) == '"option"': 93 | self._scan('"option"', context=_context) 94 | self._scan('":"', context=_context) 95 | Str = self.Str(_context) 96 | opt[Str] = 1 97 | return opt 98 | 99 | def Tokens(self, _parent=None): 100 | _context = self.Context(_parent, self._scanner, 'Tokens', []) 101 | tok = [] 102 | while self._peek('"token"', '"ignore"', 'EOF', '"rule"', context=_context) in ['"token"', '"ignore"']: 103 | _token = self._peek('"token"', '"ignore"', context=_context) 104 | if _token == '"token"': 105 | self._scan('"token"', context=_context) 106 | ID = self._scan('ID', context=_context) 107 | self._scan('":"', context=_context) 108 | Str = self.Str(_context) 109 | tok.append( (ID,Str) ) 110 | else: # == '"ignore"' 111 | self._scan('"ignore"', context=_context) 112 | self._scan('":"', context=_context) 
113 | Str = self.Str(_context) 114 | ign = ('#ignore',Str) 115 | if self._peek('STMT', '"token"', '"ignore"', 'EOF', '"rule"', context=_context) == 'STMT': 116 | STMT = self._scan('STMT', context=_context) 117 | ign = ign + (STMT[2:-2],) 118 | tok.append( ign ) 119 | return tok 120 | 121 | def Rules(self, tokens, _parent=None): 122 | _context = self.Context(_parent, self._scanner, 'Rules', [tokens]) 123 | rul = [] 124 | while self._peek('"rule"', 'EOF', context=_context) == '"rule"': 125 | self._scan('"rule"', context=_context) 126 | ID = self._scan('ID', context=_context) 127 | OptParam = self.OptParam(_context) 128 | self._scan('":"', context=_context) 129 | ClauseA = self.ClauseA(ID, tokens, _context) 130 | rul.append( (ID, OptParam, ClauseA) ) 131 | return rul 132 | 133 | def ClauseA(self, rule, tokens, _parent=None): 134 | _context = self.Context(_parent, self._scanner, 'ClauseA', [rule, tokens]) 135 | ClauseB = self.ClauseB(rule,tokens, _context) 136 | v = [ClauseB] 137 | while self._peek('OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) == 'OR': 138 | OR = self._scan('OR', context=_context) 139 | ClauseB = self.ClauseB(rule,tokens, _context) 140 | v.append(ClauseB) 141 | return cleanup_choice(rule,v) 142 | 143 | def ClauseB(self, rule,tokens, _parent=None): 144 | _context = self.Context(_parent, self._scanner, 'ClauseB', [rule,tokens]) 145 | v = [] 146 | while self._peek('STR', 'ID', 'LP', 'LB', 'STMT', 'OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) in ['STR', 'ID', 'LP', 'LB', 'STMT']: 147 | ClauseC = self.ClauseC(rule,tokens, _context) 148 | v.append(ClauseC) 149 | return cleanup_sequence(rule, v) 150 | 151 | def ClauseC(self, rule,tokens, _parent=None): 152 | _context = self.Context(_parent, self._scanner, 'ClauseC', [rule,tokens]) 153 | ClauseD = self.ClauseD(rule,tokens, _context) 154 | _token = self._peek('PLUS', 'STAR', 'QUEST', 'STR', 'ID', 'LP', 'LB', 'STMT', 'OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) 155 | if _token == 'PLUS': 156 | PLUS = self._scan('PLUS', context=_context) 157 | return parsetree.Plus(rule, ClauseD) 158 | elif _token == 'STAR': 159 | STAR = self._scan('STAR', context=_context) 160 | return parsetree.Star(rule, ClauseD) 161 | elif _token == 'QUEST': 162 | QUEST = self._scan('QUEST', context=_context) 163 | return parsetree.Option(rule, ClauseD) 164 | else: 165 | return ClauseD 166 | 167 | def ClauseD(self, rule,tokens, _parent=None): 168 | _context = self.Context(_parent, self._scanner, 'ClauseD', [rule,tokens]) 169 | _token = self._peek('STR', 'ID', 'LP', 'LB', 'STMT', context=_context) 170 | if _token == 'STR': 171 | STR = self._scan('STR', context=_context) 172 | t = (STR, eval(STR,{},{})) 173 | if t not in tokens: tokens.insert( 0, t ) 174 | return parsetree.Terminal(rule, STR) 175 | elif _token == 'ID': 176 | ID = self._scan('ID', context=_context) 177 | OptParam = self.OptParam(_context) 178 | return resolve_name(rule,tokens, ID, OptParam) 179 | elif _token == 'LP': 180 | LP = self._scan('LP', context=_context) 181 | ClauseA = self.ClauseA(rule,tokens, _context) 182 | RP = self._scan('RP', context=_context) 183 | return ClauseA 184 | elif _token == 'LB': 185 | LB = self._scan('LB', context=_context) 186 | ClauseA = self.ClauseA(rule,tokens, _context) 187 | RB = self._scan('RB', context=_context) 188 | return parsetree.Option(rule, ClauseA) 189 | else: # == 'STMT' 190 | STMT = self._scan('STMT', context=_context) 191 | return parsetree.Eval(rule, STMT[2:-2]) 192 | 193 | def OptParam(self, _parent=None): 194 | _context = 
self.Context(_parent, self._scanner, 'OptParam', []) 195 | if self._peek('ATTR', '":"', 'PLUS', 'STAR', 'QUEST', 'STR', 'ID', 'LP', 'LB', 'STMT', 'OR', 'RP', 'RB', '"rule"', 'EOF', context=_context) == 'ATTR': 196 | ATTR = self._scan('ATTR', context=_context) 197 | return ATTR[2:-2] 198 | return '' 199 | 200 | def Str(self, _parent=None): 201 | _context = self.Context(_parent, self._scanner, 'Str', []) 202 | STR = self._scan('STR', context=_context) 203 | return eval(STR,{},{}) 204 | 205 | 206 | def parse(rule, text): 207 | P = ParserDescription(ParserDescriptionScanner(text)) 208 | return runtime.wrap_error_reporter(P, rule) 209 | 210 | # End -- grammar generated by Yapps 211 | -------------------------------------------------------------------------------- /yapps/parsetree.py: -------------------------------------------------------------------------------- 1 | # parsetree.py, part of Yapps 2 - yet another python parser system 2 | # Copyright 1999-2003 by Amit J. Patel 3 | # 4 | # This version of the Yapps 2 Runtime can be distributed under the 5 | # terms of the MIT open source license, either found in the LICENSE file 6 | # included with the Yapps distribution 7 | # or at 8 | # 9 | # 10 | 11 | """Classes used to represent parse trees and generate output. 12 | 13 | This module defines the Generator class, which drives the generation 14 | of Python output from a grammar parse tree. It also defines nodes 15 | used to represent the parse tree; they are derived from class Node. 16 | 17 | The main logic of Yapps is in this module. 18 | """ 19 | 20 | from __future__ import print_function 21 | import sys, re 22 | 23 | ###################################################################### 24 | INDENT = ' '*4 25 | 26 | class Generator: 27 | 28 | # TODO: many of the methods here should be class methods, not instance methods 29 | 30 | def __init__(self, name, options, tokens, rules): 31 | self.change_count = 0 32 | self.name = name 33 | self.options = options 34 | self.preparser = '' 35 | self.postparser = None 36 | 37 | self.tokens = {} # Map from tokens to regexps 38 | self.ignore = {} # List of token names to ignore in parsing, map to statements 39 | self.terminals = [] # List of token names (to maintain ordering) 40 | for t in tokens: 41 | if len(t) == 3: 42 | n,t,s = t 43 | else: 44 | n,t = t 45 | s = None 46 | 47 | if n == '#ignore': 48 | n = t 49 | self.ignore[n] = s 50 | if n in self.tokens.keys() and self.tokens[n] != t: 51 | print('Warning: token %s defined more than once.' % n, file=sys.stderr) 52 | self.tokens[n] = t 53 | self.terminals.append(n) 54 | 55 | self.rules = {} # Map from rule names to parser nodes 56 | self.params = {} # Map from rule names to parameters 57 | self.goals = [] # List of rule names (to maintain ordering) 58 | for n,p,r in rules: 59 | self.params[n] = p 60 | self.rules[n] = r 61 | self.goals.append(n) 62 | 63 | self.output = sys.stdout 64 | 65 | def has_option(self, name): 66 | return self.options.get(name, False) 67 | 68 | def non_ignored_tokens(self): 69 | return [x for x in self.terminals if x not in self.ignore] 70 | 71 | def changed(self): 72 | """Increments the change count. 73 | 74 | >>> t = Generator('', [], [], []) 75 | >>> old_count = t.change_count 76 | >>> t.changed() 77 | >>> assert t.change_count == old_count + 1 78 | """ 79 | self.change_count = 1+self.change_count 80 | 81 | def set_subtract(self, a, b): 82 | """Returns the elements of a that are not in b. 
83 | 84 | >>> t = Generator('', [], [], []) 85 | >>> t.set_subtract([], []) 86 | [] 87 | >>> t.set_subtract([1, 2], [1, 2]) 88 | [] 89 | >>> t.set_subtract([1, 2, 3], [2]) 90 | [1, 3] 91 | >>> t.set_subtract([1], [2, 3, 4]) 92 | [1] 93 | """ 94 | result = [] 95 | for x in a: 96 | if x not in b: 97 | result.append(x) 98 | return result 99 | 100 | def subset(self, a, b): 101 | """True iff all elements of sequence a are inside sequence b 102 | 103 | >>> t = Generator('', [], [], []) 104 | >>> t.subset([], [1, 2, 3]) 105 | 1 106 | >>> t.subset([1, 2, 3], []) 107 | 0 108 | >>> t.subset([1], [1, 2, 3]) 109 | 1 110 | >>> t.subset([3, 2, 1], [1, 2, 3]) 111 | 1 112 | >>> t.subset([1, 1, 1], [1, 2, 3]) 113 | 1 114 | >>> t.subset([1, 2, 3], [1, 1, 1]) 115 | 0 116 | """ 117 | for x in a: 118 | if x not in b: 119 | return 0 120 | return 1 121 | 122 | def equal_set(self, a, b): 123 | """True iff subset(a, b) and subset(b, a) 124 | 125 | >>> t = Generator('', [], [], []) 126 | >>> a_set = [1, 2, 3] 127 | >>> t.equal_set(a_set, a_set) 128 | 1 129 | >>> t.equal_set(a_set, a_set[:]) 130 | 1 131 | >>> t.equal_set([], a_set) 132 | 0 133 | >>> t.equal_set([1, 2, 3], [3, 2, 1]) 134 | 1 135 | """ 136 | if len(a) != len(b): return 0 137 | if a == b: return 1 138 | return self.subset(a, b) and self.subset(b, a) 139 | 140 | def add_to(self, parent, additions): 141 | "Modify _parent_ to include all elements in _additions_" 142 | for x in additions: 143 | if x not in parent: 144 | parent.append(x) 145 | self.changed() 146 | 147 | def equate(self, a, b): 148 | """Extend (a) and (b) so that they contain each others' elements. 149 | 150 | >>> t = Generator('', [], [], []) 151 | >>> a = [1, 2] 152 | >>> b = [2, 3] 153 | >>> t.equate(a, b) 154 | >>> a 155 | [1, 2, 3] 156 | >>> b 157 | [2, 3, 1] 158 | """ 159 | self.add_to(a, b) 160 | self.add_to(b, a) 161 | 162 | def write(self, *args): 163 | for a in args: 164 | self.output.write(a) 165 | 166 | def in_test(self, expr, full, set): 167 | """Generate a test of (expr) being in (set), where (set) is a subset of (full) 168 | 169 | expr is a string (Python expression) 170 | set is a list of values (which will be converted with repr) 171 | full is the list of all values expr could possibly evaluate to 172 | 173 | >>> t = Generator('', [], [], []) 174 | >>> t.in_test('x', [1,2,3,4], []) 175 | '0' 176 | >>> t.in_test('x', [1,2,3,4], [1,2,3,4]) 177 | '1' 178 | >>> t.in_test('x', [1,2,3,4], [1]) 179 | 'x == 1' 180 | >>> t.in_test('a+b', [1,2,3,4], [1,2]) 181 | 'a+b in [1, 2]' 182 | >>> t.in_test('x', [1,2,3,4,5], [1,2,3]) 183 | 'x not in [4, 5]' 184 | >>> t.in_test('x', [1,2,3,4,5], [1,2,3,4]) 185 | 'x != 5' 186 | """ 187 | 188 | if not set: return '0' 189 | if len(set) == 1: return '%s == %s' % (expr, repr(set[0])) 190 | if full and len(set) > len(full)/2: 191 | # Reverse the sense of the test. 
192 | not_set = [x for x in full if x not in set] 193 | return self.not_in_test(expr, full, not_set) 194 | return '%s in %s' % (expr, repr(set)) 195 | 196 | def not_in_test(self, expr, full, set): 197 | """Like in_test, but the reverse test.""" 198 | if not set: return '1' 199 | if len(set) == 1: return '%s != %s' % (expr, repr(set[0])) 200 | return '%s not in %s' % (expr, repr(set)) 201 | 202 | def peek_call(self, a): 203 | """Generate a call to scan for a token in the set 'a'""" 204 | assert type(a) == type([]) 205 | a_set = (repr(a)[1:-1]) 206 | if self.equal_set(a, self.non_ignored_tokens()): a_set = '' 207 | if self.has_option('context_insensitive_scanner'): a_set = '' 208 | if a_set: a_set += "," 209 | 210 | return 'self._peek(%s context=_context)' % a_set 211 | 212 | def peek_test(self, a, b): 213 | """Generate a call to test whether the next token (which could be any of 214 | the elements in a) is in the set b.""" 215 | if self.subset(a, b): return '1' 216 | if self.has_option('context_insensitive_scanner'): a = self.non_ignored_tokens() 217 | return self.in_test(self.peek_call(a), a, b) 218 | 219 | def not_peek_test(self, a, b): 220 | """Like peek_test, but the opposite sense.""" 221 | if self.subset(a, b): return '0' 222 | return self.not_in_test(self.peek_call(a), a, b) 223 | 224 | def calculate(self): 225 | """The main loop to compute the epsilon, first, follow sets. 226 | The loop continues until the sets converge. This works because 227 | each set can only get larger, so when they stop getting larger, 228 | we're done.""" 229 | # First we determine whether a rule accepts epsilon (the empty sequence) 230 | while 1: 231 | for r in self.goals: 232 | self.rules[r].setup(self) 233 | if self.change_count == 0: break 234 | self.change_count = 0 235 | 236 | # Now we compute the first/follow sets 237 | while 1: 238 | for r in self.goals: 239 | self.rules[r].update(self) 240 | if self.change_count == 0: break 241 | self.change_count = 0 242 | 243 | def dump_information(self): 244 | """Display the grammar in somewhat human-readable form.""" 245 | self.calculate() 246 | for r in self.goals: 247 | print(' _____' + '_'*len(r)) 248 | print(('___/Rule '+r+'\\' + '_'*80)[:79]) 249 | queue = [self.rules[r]] 250 | while queue: 251 | top = queue[0] 252 | del queue[0] 253 | 254 | print('Rule', repr(top), 'of class', top.__class__.__name__) 255 | top.first.sort() 256 | top.follow.sort() 257 | eps = [] 258 | if top.accepts_epsilon: eps = ['(null)'] 259 | print(' FIRST:', ', '.join(top.first+eps)) 260 | print(' FOLLOW:', ', '.join(top.follow)) 261 | for x in top.get_children(): queue.append(x) 262 | 263 | def repr_ignore(self): 264 | out="{" 265 | for t,s in self.ignore.items(): 266 | if s is None: s=repr(s) 267 | out += "%s:%s," % (repr(t),s) 268 | out += "}" 269 | return out 270 | 271 | def generate_output(self): 272 | self.calculate() 273 | self.write(self.preparser) 274 | self.write("# Begin -- grammar generated by Yapps\n") 275 | self.write("from __future__ import print_function\n") 276 | self.write("import sys, re\n") 277 | self.write("from yapps import runtime\n") 278 | self.write("\n") 279 | self.write("class ", self.name, "Scanner(runtime.Scanner):\n") 280 | self.write(INDENT, "patterns = [\n") 281 | for p in self.terminals: 282 | self.write(INDENT*2, "(%s, re.compile(%s)),\n" % ( 283 | repr(p), repr(self.tokens[p]))) 284 | self.write(INDENT, "]\n") 285 | self.write(INDENT, "def __init__(self, str,*args,**kw):\n") 286 | self.write(INDENT*2, 
"runtime.Scanner.__init__(self,None,%s,str,*args,**kw)\n" % 287 | self.repr_ignore()) 288 | self.write("\n") 289 | 290 | self.write("class ", self.name, "(runtime.Parser):\n") 291 | self.write(INDENT, "Context = runtime.Context\n") 292 | for r in self.goals: 293 | self.write(INDENT, "def ", r, "(self") 294 | if self.params[r]: self.write(", ", self.params[r]) 295 | self.write(", _parent=None):\n") 296 | self.write(INDENT*2, "_context = self.Context(_parent, self._scanner, %s, [%s])\n" % 297 | (repr(r), self.params.get(r, ''))) 298 | self.rules[r].output(self, INDENT*2) 299 | self.write("\n") 300 | 301 | self.write("\n") 302 | self.write("def parse(rule, text):\n") 303 | self.write(INDENT, "P = ", self.name, "(", self.name, "Scanner(text))\n") 304 | self.write(INDENT, "return runtime.wrap_error_reporter(P, rule)\n") 305 | self.write("\n") 306 | if self.postparser is not None: 307 | self.write("# End -- grammar generated by Yapps\n") 308 | self.write(self.postparser) 309 | else: 310 | self.write("if __name__ == '__main__':\n") 311 | self.write(INDENT, "from sys import argv, stdin\n") 312 | self.write(INDENT, "if len(argv) >= 2:\n") 313 | self.write(INDENT*2, "if len(argv) >= 3:\n") 314 | self.write(INDENT*3, "f = open(argv[2],'r')\n") 315 | self.write(INDENT*2, "else:\n") 316 | self.write(INDENT*3, "f = stdin\n") 317 | self.write(INDENT*2, "print(parse(argv[1], f.read()))\n") 318 | self.write(INDENT, "else: print ('Args: []', file=sys.stderr)\n") 319 | self.write("# End -- grammar generated by Yapps\n") 320 | 321 | ###################################################################### 322 | class Node: 323 | """This is the base class for all components of a grammar.""" 324 | def __init__(self, rule): 325 | self.rule = rule # name of the rule containing this node 326 | self.first = [] 327 | self.follow = [] 328 | self.accepts_epsilon = 0 329 | 330 | def setup(self, gen): 331 | # Setup will change accepts_epsilon, 332 | # sometimes from 0 to 1 but never 1 to 0. 333 | # It will take a finite number of steps to set things up 334 | pass 335 | 336 | def used(self, vars): 337 | "Return two lists: one of vars used, and the other of vars assigned" 338 | return vars, [] 339 | 340 | def get_children(self): 341 | "Return a list of sub-nodes" 342 | return [] 343 | 344 | def __repr__(self): 345 | return str(self) 346 | 347 | def update(self, gen): 348 | if self.accepts_epsilon: 349 | gen.add_to(self.first, self.follow) 350 | 351 | def output(self, gen, indent): 352 | "Write out code to _gen_ with _indent_:string indentation" 353 | gen.write(indent, "assert 0 # Invalid parser node\n") 354 | 355 | class Terminal(Node): 356 | """This class stores terminal nodes, which are tokens.""" 357 | def __init__(self, rule, token): 358 | Node.__init__(self, rule) 359 | self.token = token 360 | self.accepts_epsilon = 0 361 | 362 | def __str__(self): 363 | return self.token 364 | 365 | def update(self, gen): 366 | Node.update(self, gen) 367 | if self.first != [self.token]: 368 | self.first = [self.token] 369 | gen.changed() 370 | 371 | def output(self, gen, indent): 372 | gen.write(indent) 373 | if re.match('[a-zA-Z_][a-zA-Z_0-9]*$', self.token): 374 | gen.write(self.token, " = ") 375 | gen.write("self._scan(%s, context=_context)\n" % repr(self.token)) 376 | 377 | class Eval(Node): 378 | """This class stores evaluation nodes, from {{ ... 
}} clauses.""" 379 | def __init__(self, rule, expr): 380 | Node.__init__(self, rule) 381 | self.expr = expr 382 | 383 | def setup(self, gen): 384 | Node.setup(self, gen) 385 | if not self.accepts_epsilon: 386 | self.accepts_epsilon = 1 387 | gen.changed() 388 | 389 | def __str__(self): 390 | return '{{ %s }}' % self.expr.strip() 391 | 392 | def output(self, gen, indent): 393 | gen.write(indent, self.expr.strip(), '\n') 394 | 395 | class NonTerminal(Node): 396 | """This class stores nonterminal nodes, which are rules with arguments.""" 397 | def __init__(self, rule, name, args): 398 | Node.__init__(self, rule) 399 | self.name = name 400 | self.args = args 401 | 402 | def setup(self, gen): 403 | Node.setup(self, gen) 404 | try: 405 | self.target = gen.rules[self.name] 406 | if self.accepts_epsilon != self.target.accepts_epsilon: 407 | self.accepts_epsilon = self.target.accepts_epsilon 408 | gen.changed() 409 | except KeyError: # Oops, it's nonexistent 410 | print('Error: no rule <%s>' % self.name, file=sys.stderr) 411 | self.target = self 412 | 413 | def __str__(self): 414 | return '%s' % self.name 415 | 416 | def update(self, gen): 417 | Node.update(self, gen) 418 | gen.equate(self.first, self.target.first) 419 | gen.equate(self.follow, self.target.follow) 420 | 421 | def output(self, gen, indent): 422 | gen.write(indent) 423 | gen.write(self.name, " = ") 424 | args = self.args 425 | if args: args += ', ' 426 | args += '_context' 427 | gen.write("self.", self.name, "(", args, ")\n") 428 | 429 | class Sequence(Node): 430 | """This class stores a sequence of nodes (A B C ...)""" 431 | def __init__(self, rule, *children): 432 | Node.__init__(self, rule) 433 | self.children = children 434 | 435 | def setup(self, gen): 436 | Node.setup(self, gen) 437 | for c in self.children: c.setup(gen) 438 | 439 | if not self.accepts_epsilon: 440 | # If it's not already accepting epsilon, it might now do so. 
441 | for c in self.children: 442 | # any non-epsilon means all is non-epsilon 443 | if not c.accepts_epsilon: break 444 | else: 445 | self.accepts_epsilon = 1 446 | gen.changed() 447 | 448 | def get_children(self): 449 | return self.children 450 | 451 | def __str__(self): 452 | return '( %s )' % ' '.join(map(str, self.children)) 453 | 454 | def update(self, gen): 455 | Node.update(self, gen) 456 | for g in self.children: 457 | g.update(gen) 458 | 459 | empty = 1 460 | for g_i in range(len(self.children)): 461 | g = self.children[g_i] 462 | 463 | if empty: gen.add_to(self.first, g.first) 464 | if not g.accepts_epsilon: empty = 0 465 | 466 | if g_i == len(self.children)-1: 467 | next = self.follow 468 | else: 469 | next = self.children[1+g_i].first 470 | gen.add_to(g.follow, next) 471 | 472 | if self.children: 473 | gen.add_to(self.follow, self.children[-1].follow) 474 | 475 | def output(self, gen, indent): 476 | if self.children: 477 | for c in self.children: 478 | c.output(gen, indent) 479 | else: 480 | # Placeholder for empty sequences, just in case 481 | gen.write(indent, 'pass\n') 482 | 483 | class Choice(Node): 484 | """This class stores a choice between nodes (A | B | C | ...)""" 485 | def __init__(self, rule, *children): 486 | Node.__init__(self, rule) 487 | self.children = children 488 | 489 | def setup(self, gen): 490 | Node.setup(self, gen) 491 | for c in self.children: c.setup(gen) 492 | 493 | if not self.accepts_epsilon: 494 | for c in self.children: 495 | if c.accepts_epsilon: 496 | self.accepts_epsilon = 1 497 | gen.changed() 498 | 499 | def get_children(self): 500 | return self.children 501 | 502 | def __str__(self): 503 | return '( %s )' % ' | '.join(map(str, self.children)) 504 | 505 | def update(self, gen): 506 | Node.update(self, gen) 507 | for g in self.children: 508 | g.update(gen) 509 | 510 | for g in self.children: 511 | gen.add_to(self.first, g.first) 512 | gen.add_to(self.follow, g.follow) 513 | for g in self.children: 514 | gen.add_to(g.follow, self.follow) 515 | if self.accepts_epsilon: 516 | gen.add_to(self.first, self.follow) 517 | 518 | def output(self, gen, indent): 519 | test = "if" 520 | gen.write(indent, "_token = ", gen.peek_call(self.first), "\n") 521 | tokens_seen = [] 522 | tokens_unseen = self.first[:] 523 | if gen.has_option('context_insensitive_scanner'): 524 | # Context insensitive scanners can return ANY token, 525 | # not only the ones in first. 526 | tokens_unseen = gen.non_ignored_tokens() 527 | for c in self.children: 528 | testset = c.first[:] 529 | removed = [] 530 | for x in testset: 531 | if x in tokens_seen: 532 | testset.remove(x) 533 | removed.append(x) 534 | if x in tokens_unseen: tokens_unseen.remove(x) 535 | tokens_seen = tokens_seen + testset 536 | if removed: 537 | if not testset: 538 | print('Error in rule', self.rule+':', file=sys.stderr) 539 | else: 540 | print('Warning in rule', self.rule+':', file=sys.stderr) 541 | print(' *', self, file=sys.stderr) 542 | print(' * These tokens could be matched by more than one clause:', file=sys.stderr) 543 | print(' *', ' '.join(removed), file=sys.stderr) 544 | 545 | if testset: 546 | if not tokens_unseen: # context sensitive scanners only! 
547 | if test == 'if': 548 | # if it's the first AND last test, then 549 | # we can simply put the code without an if/else 550 | c.output(gen, indent) 551 | else: 552 | gen.write(indent, "else:") 553 | t = gen.in_test('', [], testset) 554 | if len(t) < 70-len(indent): 555 | gen.write(' #', t) 556 | gen.write("\n") 557 | c.output(gen, indent+INDENT) 558 | else: 559 | gen.write(indent, test, " ", 560 | gen.in_test('_token', tokens_unseen, testset), 561 | ":\n") 562 | c.output(gen, indent+INDENT) 563 | test = "elif" 564 | 565 | if tokens_unseen: 566 | gen.write(indent, "else:\n") 567 | gen.write(indent, INDENT, "raise runtime.SyntaxError(_token[0], ") 568 | gen.write("'Could not match ", self.rule, "')\n") 569 | 570 | class Wrapper(Node): 571 | """This is a base class for nodes that modify a single child.""" 572 | def __init__(self, rule, child): 573 | Node.__init__(self, rule) 574 | self.child = child 575 | 576 | def setup(self, gen): 577 | Node.setup(self, gen) 578 | self.child.setup(gen) 579 | 580 | def get_children(self): 581 | return [self.child] 582 | 583 | def update(self, gen): 584 | Node.update(self, gen) 585 | self.child.update(gen) 586 | gen.add_to(self.first, self.child.first) 587 | gen.equate(self.follow, self.child.follow) 588 | 589 | class Option(Wrapper): 590 | """This class represents an optional clause of the form [A]""" 591 | def setup(self, gen): 592 | Wrapper.setup(self, gen) 593 | if not self.accepts_epsilon: 594 | self.accepts_epsilon = 1 595 | gen.changed() 596 | 597 | def __str__(self): 598 | return '[ %s ]' % str(self.child) 599 | 600 | def output(self, gen, indent): 601 | if self.child.accepts_epsilon: 602 | print('Warning in rule', self.rule+': contents may be empty.', file=sys.stderr) 603 | gen.write(indent, "if %s:\n" % 604 | gen.peek_test(self.first, self.child.first)) 605 | self.child.output(gen, indent+INDENT) 606 | 607 | if gen.has_option('context_insensitive_scanner'): 608 | gen.write(indent, "if %s:\n" % 609 | gen.not_peek_test(gen.non_ignored_tokens(), self.follow)) 610 | gen.write(indent+INDENT, "raise runtime.SyntaxError(pos=self._scanner.get_pos(), context=_context, msg='Need one of ' + ', '.join(%s))\n" % 611 | repr(self.first)) 612 | 613 | 614 | class Plus(Wrapper): 615 | """This class represents a 1-or-more repetition clause of the form A+""" 616 | def setup(self, gen): 617 | Wrapper.setup(self, gen) 618 | if self.accepts_epsilon != self.child.accepts_epsilon: 619 | self.accepts_epsilon = self.child.accepts_epsilon 620 | gen.changed() 621 | 622 | def __str__(self): 623 | return '%s+' % str(self.child) 624 | 625 | def update(self, gen): 626 | Wrapper.update(self, gen) 627 | gen.add_to(self.child.follow, self.child.first) 628 | 629 | def output(self, gen, indent): 630 | if self.child.accepts_epsilon: 631 | print('Warning in rule', self.rule+':', file=sys.stderr) 632 | print(' * The repeated pattern could be empty. 
The resulting parser may not work properly.', file=sys.stderr)
633 |         gen.write(indent, "while 1:\n")
634 |         self.child.output(gen, indent+INDENT)
635 |         union = self.first[:]
636 |         gen.add_to(union, self.follow)
637 |         gen.write(indent+INDENT, "if %s: break\n" %
638 |                   gen.not_peek_test(union, self.child.first))
639 | 
640 |         if gen.has_option('context_insensitive_scanner'):
641 |             gen.write(indent, "if %s:\n" %
642 |                       gen.not_peek_test(gen.non_ignored_tokens(), self.follow))
643 |             gen.write(indent+INDENT, "raise runtime.SyntaxError(pos=self._scanner.get_pos(), context=_context, msg='Need one of ' + ', '.join(%s))\n" %
644 |                       repr(self.first))
645 | 
646 | 
647 | class Star(Wrapper):
648 |     """This class represents a 0-or-more repetition clause of the form A*"""
649 |     def setup(self, gen):
650 |         Wrapper.setup(self, gen)
651 |         if not self.accepts_epsilon:
652 |             self.accepts_epsilon = 1
653 |             gen.changed()
654 | 
655 |     def __str__(self):
656 |         return '%s*' % str(self.child)
657 | 
658 |     def update(self, gen):
659 |         Wrapper.update(self, gen)
660 |         gen.add_to(self.child.follow, self.child.first)
661 | 
662 |     def output(self, gen, indent):
663 |         if self.child.accepts_epsilon:
664 |             print('Warning in rule', self.rule+':', file=sys.stderr)
665 |             print(' * The repeated pattern could be empty. The resulting parser probably will not work properly.', file=sys.stderr)
666 |         gen.write(indent, "while %s:\n" %
667 |                   gen.peek_test(self.follow, self.child.first))
668 |         self.child.output(gen, indent+INDENT)
669 | 
670 |         # TODO: need to generate tests like this in lots of rules
671 |         if gen.has_option('context_insensitive_scanner'):
672 |             gen.write(indent, "if %s:\n" %
673 |                       gen.not_peek_test(gen.non_ignored_tokens(), self.follow))
674 |             gen.write(indent+INDENT, "raise runtime.SyntaxError(pos=self._scanner.get_pos(), context=_context, msg='Need one of ' + ', '.join(%s))\n" %
675 |                       repr(self.first))
676 | 
--------------------------------------------------------------------------------
/yapps/runtime.py:
--------------------------------------------------------------------------------
1 | # Yapps 2 Runtime, part of Yapps 2 - yet another python parser system
2 | # Copyright 1999-2003 by Amit J. Patel
3 | # Enhancements copyright 2003-2004 by Matthias Urlichs
4 | #
5 | # This version of the Yapps 2 Runtime can be distributed under the
6 | # terms of the MIT open source license, either found in the LICENSE file
7 | # included with the Yapps distribution
8 | # or at
9 | # <http://www.opensource.org/licenses/mit-license.php>
10 | #
11 | 
12 | """Run time libraries needed to run parsers generated by Yapps.
13 | 
14 | This module defines parse-time exception classes, a scanner class, a
15 | base class for parsers produced by Yapps, and a context class that
16 | keeps track of the parse stack.
17 | 
18 | """
19 | 
20 | from __future__ import print_function
21 | import sys, re
22 | 
23 | MIN_WINDOW = 4096
24 | # Size of the sliding window kept when scanning input from a file
25 | 
26 | class SyntaxError(Exception):
27 |     """When we run into an unexpected token, this is the exception to use"""
28 |     def __init__(self, pos=None, msg="Bad Token", context=None):
29 |         Exception.__init__(self)
30 |         self.pos = pos
31 |         self.msg = msg
32 |         self.context = context
33 | 
34 |     def __str__(self):
35 |         if not self.pos: return 'SyntaxError'
36 |         else: return 'SyntaxError@%s(%s)' % (repr(self.pos), self.msg)
37 | 
38 | class NoMoreTokens(Exception):
39 |     """Another exception object, for when we run out of tokens"""
40 |     pass
41 | 
42 | class Token(object):
43 |     """Yapps token.
44 | 
45 |     This is a container for a scanned token.
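
    An illustrative construction and its repr (the position argument is a
    file/line/column tuple):

        >>> Token('ID', 'spam', ('<input>', 1, 0))
        <ID: 'spam' @ <input>:1.0>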
46 | """ 47 | 48 | def __init__(self, type,value, pos=None): 49 | """Initialize a token.""" 50 | self.type = type 51 | self.value = value 52 | self.pos = pos 53 | 54 | def __repr__(self): 55 | output = '<%s: %s' % (self.type, repr(self.value)) 56 | if self.pos: 57 | output += " @ " 58 | if self.pos[0]: 59 | output += "%s:" % self.pos[0] 60 | if self.pos[1]: 61 | output += "%d" % self.pos[1] 62 | if self.pos[2] is not None: 63 | output += ".%d" % self.pos[2] 64 | output += ">" 65 | return output 66 | 67 | in_name=0 68 | class Scanner(object): 69 | """Yapps scanner. 70 | 71 | The Yapps scanner can work in context sensitive or context 72 | insensitive modes. The token(i) method is used to retrieve the 73 | i-th token. It takes a restrict set that limits the set of tokens 74 | it is allowed to return. In context sensitive mode, this restrict 75 | set guides the scanner. In context insensitive mode, there is no 76 | restriction (the set is always the full set of tokens). 77 | 78 | """ 79 | 80 | def __init__(self, patterns, ignore, input="", 81 | file=None,filename=None,stacked=False): 82 | """Initialize the scanner. 83 | 84 | Parameters: 85 | patterns : [(terminal, uncompiled regex), ...] or None 86 | ignore : {terminal:None, ...} 87 | input : string 88 | 89 | If patterns is None, we assume that the subclass has 90 | defined self.patterns : [(terminal, compiled regex), ...]. 91 | Note that the patterns parameter expects uncompiled regexes, 92 | whereas the self.patterns field expects compiled regexes. 93 | 94 | The 'ignore' value is either None or a callable, which is called 95 | with the scanner and the to-be-ignored match object; this can 96 | be used for include file or comment handling. 97 | """ 98 | 99 | if not filename: 100 | global in_name 101 | filename="" % in_name 102 | in_name += 1 103 | 104 | self.input = input 105 | self.ignore = ignore 106 | self.file = file 107 | self.filename = filename 108 | self.pos = 0 109 | self.del_pos = 0 # skipped 110 | self.line = 1 111 | self.del_line = 0 # skipped 112 | self.col = 0 113 | self.tokens = [] 114 | self.stack = None 115 | self.stacked = stacked 116 | 117 | self.last_read_token = None 118 | self.last_token = None 119 | self.last_types = None 120 | 121 | if patterns is not None: 122 | # Compile the regex strings into regex objects 123 | self.patterns = [] 124 | for terminal, regex in patterns: 125 | self.patterns.append( (terminal, re.compile(regex)) ) 126 | 127 | def stack_input(self, input="", file=None, filename=None): 128 | """Temporarily parse from a second file.""" 129 | 130 | # Already reading from somewhere else: Go on top of that, please. 
131 |         if self.stack:
132 |             # autogenerate a recursion-level-identifying filename
133 |             if not filename:
134 |                 filename = 1
135 |             else:
136 |                 try:
137 |                     filename += 1
138 |                 except TypeError:
139 |                     pass
140 |             # now pass off to the include file
141 |             self.stack.stack_input(input, file, filename)
142 |         else:
143 | 
144 |             try:
145 |                 filename += 0
146 |             except TypeError:
147 |                 pass
148 |             else:
149 |                 filename = "<str_%d>" % filename
150 | 
151 |             # self.stack = object.__new__(self.__class__)
152 |             # Scanner.__init__(self.stack,self.patterns,self.ignore,input,file,filename, stacked=True)
153 | 
154 |             # Note that the pattern+ignore are added by the generated
155 |             # scanner code
156 |             self.stack = self.__class__(input, file, filename, stacked=True)
157 | 
158 |     def get_pos(self):
159 |         """Return a file/line/char tuple."""
160 |         if self.stack: return self.stack.get_pos()
161 | 
162 |         return (self.filename, self.line+self.del_line, self.col)
163 | 
164 |     # def __repr__(self):
165 |     #     """Print the last few tokens that have been scanned in"""
166 |     #     output = ''
167 |     #     for t in self.tokens:
168 |     #         output += '%s\n' % (repr(t),)
169 |     #     return output
170 | 
171 |     def print_line_with_pointer(self, pos, length=0, out=sys.stderr):
172 |         """Print the line of the input that includes position 'pos',
173 |         along with a second line with a single caret (^) at that position"""
174 | 
175 |         file, line, p = pos
176 |         if file != self.filename:
177 |             if self.stack: return self.stack.print_line_with_pointer(pos, length=length, out=out)
178 |             print("(%s: not in input buffer)" % file, file=out)
179 |             return
180 | 
181 |         text = self.input
182 |         p += length-1 # starts at pos 1
183 | 
184 |         origline = line
185 |         line -= self.del_line
186 |         spos = 0
187 |         if line > 0:
188 |             while 1:
189 |                 line = line - 1
190 |                 try:
191 |                     cr = text.index("\n", spos)
192 |                 except ValueError:
193 |                     if line:
194 |                         text = ""
195 |                     break
196 |                 if line == 0:
197 |                     text = text[spos:cr]
198 |                     break
199 |                 spos = cr+1
200 |         else:
201 |             print("(%s:%d not in input buffer)" % (file, origline), file=out)
202 |             return
203 | 
204 |         # Now try printing part of the line
205 |         text = text[max(p-80, 0):p+80]
206 |         p = p - max(p-80, 0)
207 | 
208 |         # Strip to the left
209 |         i = text[:p].rfind('\n')
210 |         j = text[:p].rfind('\r')
211 |         if i < 0 or (0 <= j < i): i = j
212 |         if 0 <= i < p:
213 |             p = p - i - 1
214 |             text = text[i+1:]
215 | 
216 |         # Strip to the right
217 |         i = text.find('\n', p)
218 |         j = text.find('\r', p)
219 |         if i < 0 or (0 <= j < i): i = j
220 |         if i >= 0:
221 |             text = text[:i]
222 | 
223 |         # Now shorten the text
224 |         while len(text) > 70 and p > 60:
225 |             # Cut off 10 chars
226 |             text = "..." + text[10:]
227 |             p = p - 7
228 | 
229 |         # Now print the string, along with an indicator
230 |         print('> ', text, file=out)
231 |         print('> ', ' '*p + '^', file=out)
232 | 
233 |     def grab_input(self):
234 |         """Get more input if possible."""
235 |         if not self.file: return
236 |         if len(self.input) - self.pos >= MIN_WINDOW: return
237 | 
238 |         data = self.file.read(MIN_WINDOW)
239 |         if data is None or data == "":
240 |             self.file = None
241 | 
242 |         # Drop bytes from the start, if necessary.
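        # Worked example: with MIN_WINDOW = 4096, once self.pos passes
        # 2*MIN_WINDOW = 8192 we drop the first 4096 characters of the
        # buffer, recording in del_pos/del_line how many characters and
        # newlines were discarded so get_pos() still reports absolute
        # line numbers.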
243 | if self.pos > 2*MIN_WINDOW: 244 | self.del_pos += MIN_WINDOW 245 | self.del_line += self.input[:MIN_WINDOW].count("\n") 246 | self.pos -= MIN_WINDOW 247 | self.input = self.input[MIN_WINDOW:] + data 248 | else: 249 | self.input = self.input + data 250 | 251 | def getchar(self): 252 | """Return the next character.""" 253 | self.grab_input() 254 | 255 | c = self.input[self.pos] 256 | self.pos += 1 257 | return c 258 | 259 | def token(self, restrict, context=None): 260 | """Scan for another token.""" 261 | 262 | while 1: 263 | if self.stack: 264 | try: 265 | return self.stack.token(restrict, context) 266 | except StopIteration: 267 | self.stack = None 268 | 269 | # Keep looking for a token, ignoring any in self.ignore 270 | self.grab_input() 271 | 272 | # special handling for end-of-file 273 | if self.stacked and self.pos==len(self.input): 274 | raise StopIteration 275 | 276 | # Search the patterns for the longest match, with earlier 277 | # tokens in the list having preference 278 | best_match = -1 279 | best_pat = '(error)' 280 | best_m = None 281 | for p, regexp in self.patterns: 282 | # First check to see if we're ignoring this token 283 | if restrict and p not in restrict and p not in self.ignore: 284 | continue 285 | m = regexp.match(self.input, self.pos) 286 | if m and m.end()-m.start() > best_match: 287 | # We got a match that's better than the previous one 288 | best_pat = p 289 | best_match = m.end()-m.start() 290 | best_m = m 291 | 292 | # If we didn't find anything, raise an error 293 | if best_pat == '(error)' and best_match < 0: 294 | msg = 'Bad Token' 295 | if restrict: 296 | msg = 'Trying to find one of '+', '.join(restrict) 297 | raise SyntaxError(self.get_pos(), msg, context=context) 298 | 299 | ignore = best_pat in self.ignore 300 | value = self.input[self.pos:self.pos+best_match] 301 | if not ignore: 302 | tok=Token(type=best_pat, value=value, pos=self.get_pos()) 303 | 304 | self.pos += best_match 305 | 306 | npos = value.rfind("\n") 307 | if npos > -1: 308 | self.col = best_match-npos 309 | self.line += value.count("\n") 310 | else: 311 | self.col += best_match 312 | 313 | # If we found something that isn't to be ignored, return it 314 | if not ignore: 315 | if len(self.tokens) >= 10: 316 | del self.tokens[0] 317 | self.tokens.append(tok) 318 | self.last_read_token = tok 319 | # print repr(tok) 320 | return tok 321 | else: 322 | ignore = self.ignore[best_pat] 323 | if ignore: 324 | ignore(self, best_m) 325 | 326 | def peek(self, *types, **kw): 327 | """Returns the token type for lookahead; if there are any args 328 | then the list of args is the set of token types to allow""" 329 | context = kw.get("context",None) 330 | if self.last_token is None: 331 | self.last_types = types 332 | self.last_token = self.token(types,context) 333 | elif self.last_types: 334 | for t in types: 335 | if t not in self.last_types: 336 | raise NotImplementedError("Unimplemented: restriction set changed") 337 | return self.last_token.type 338 | 339 | def scan(self, type, **kw): 340 | """Returns the matched text, and moves to the next token""" 341 | context = kw.get("context",None) 342 | 343 | if self.last_token is None: 344 | tok = self.token([type],context) 345 | else: 346 | if self.last_types and type not in self.last_types: 347 | raise NotImplementedError("Unimplemented: restriction set changed") 348 | 349 | tok = self.last_token 350 | self.last_token = None 351 | if tok.type != type: 352 | if not self.last_types: self.last_types=[] 353 | raise SyntaxError(tok.pos, 'Trying to find 
'+type+': '+ ', '.join(self.last_types)+", got "+tok.type, context=context) 354 | return tok.value 355 | 356 | class Parser(object): 357 | """Base class for Yapps-generated parsers. 358 | 359 | """ 360 | 361 | def __init__(self, scanner): 362 | self._scanner = scanner 363 | 364 | def _stack(self, input="",file=None,filename=None): 365 | """Temporarily read from someplace else""" 366 | self._scanner.stack_input(input,file,filename) 367 | self._tok = None 368 | 369 | def _peek(self, *types, **kw): 370 | """Returns the token type for lookahead; if there are any args 371 | then the list of args is the set of token types to allow""" 372 | return self._scanner.peek(*types, **kw) 373 | 374 | def _scan(self, type, **kw): 375 | """Returns the matched text, and moves to the next token""" 376 | return self._scanner.scan(type, **kw) 377 | 378 | class Context(object): 379 | """Class to represent the parser's call stack. 380 | 381 | Every rule creates a Context that links to its parent rule. The 382 | contexts can be used for debugging. 383 | 384 | """ 385 | 386 | def __init__(self, parent, scanner, rule, args=()): 387 | """Create a new context. 388 | 389 | Args: 390 | parent: Context object or None 391 | scanner: Scanner object 392 | rule: string (name of the rule) 393 | args: tuple listing parameters to the rule 394 | 395 | """ 396 | self.parent = parent 397 | self.scanner = scanner 398 | self.rule = rule 399 | self.args = args 400 | while scanner.stack: scanner = scanner.stack 401 | self.token = scanner.last_read_token 402 | 403 | def __str__(self): 404 | output = '' 405 | if self.parent: output = str(self.parent) + ' > ' 406 | output += self.rule 407 | return output 408 | 409 | def print_error(err, scanner, max_ctx=None): 410 | """Print error messages, the parser stack, and the input text -- for human-readable error messages.""" 411 | # NOTE: this function assumes 80 columns :-( 412 | # Figure out the line number 413 | pos = err.pos 414 | if not pos: 415 | pos = scanner.get_pos() 416 | 417 | file_name, line_number, column_number = pos 418 | print('%s:%d:%d: %s' % (file_name, line_number, column_number, err.msg), file=sys.stderr) 419 | 420 | scanner.print_line_with_pointer(pos) 421 | 422 | context = err.context 423 | token = None 424 | while context: 425 | print('while parsing %s%s:' % (context.rule, tuple(context.args)), file=sys.stderr) 426 | if context.token: 427 | token = context.token 428 | if token: 429 | scanner.print_line_with_pointer(token.pos, length=len(token.value)) 430 | context = context.parent 431 | if max_ctx: 432 | max_ctx = max_ctx-1 433 | if not max_ctx: 434 | break 435 | 436 | def wrap_error_reporter(parser, rule, *args,**kw): 437 | try: 438 | return getattr(parser, rule)(*args,**kw) 439 | except SyntaxError as e: 440 | print_error(e, parser._scanner) 441 | except NoMoreTokens: 442 | print('Could not complete parsing; stopped around here:', file=sys.stderr) 443 | print(parser._scanner, file=sys.stderr) 444 | -------------------------------------------------------------------------------- /yapps2: -------------------------------------------------------------------------------- 1 | yapps/cli_tool.py --------------------------------------------------------------------------------
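
Usage sketch (illustrative, not part of the repository): compiling a grammar
such as examples/calc.g with the yapps2 tool produces a module containing a
Scanner subclass, a Parser subclass, and helper code built on runtime.py
above. The module, class, and rule names below are assumed to follow that
grammar:

    from yapps import runtime
    import calc  # hypothetical module generated by: yapps2 calc.g

    scanner = calc.calcScanner('1 + 2 * 3')
    parser = calc.calc(scanner)
    result = runtime.wrap_error_reporter(parser, 'goal')
    # On success, result is whatever rule 'goal' returned; on a syntax
    # error, print_error() writes file:line:column and the rule stack
    # (via Context) to stderr and result is None.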