Preface

Is there an area where a new programming language, developed from scratch, can still be applied? At first glance, there are no requirements that existing languages cannot cover. However, such areas exist, e.g. systems where parallel calculation is required. Remarkably, the computing community has itself long been looking for a language to support grammar parsers. BNF (Backus-Naur Form) notation and its derivatives (ABNF, EBNF, etc.) are widely used for formal specifications of programming languages.

Although such specifications are good for both requirements and design, they do not help with programming the parser itself. Another approach is to generate the parser code. The best-known tool is yacc (Yet Another Compiler Compiler), a precompile step that generates C parser code from a syntax definition in BNF-like form. This solution is considered heavyweight, and the software community has been looking for something lighter.

For C++ the approach is obvious: a lightweight solution should be presented as an embedded domain-specific language (EDSL). This means BNF-like statements have to be mapped to C++ keywords and overloaded operators. One known example is the boost::spirit template library. This paper introduces the BNFLite C++ template library and demonstrates how to create the simplest byte-code formula compiler using this tool.

Introduction to BNFLite

BNFLite implements the embedded domain-specific language approach for grammar specifications: all "production rules" are constructed as instances of C++ classes combined by means of overloaded operators. For example, an integer number can be described in BNF as:

<digit> ::= <0> | <1> | <2> | <3> | <4> | <5> | <6> | <7> | <8> | <9>
<number> ::= <digit> | <digit> <number>

This representation can be written in C++ using the BNFLite approach:

#include "bnflite.h" // include the library source code
using namespace bnf;
Lexem Digit = Token("0") | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
Lexem Number;
Number = Digit | Digit + Number;

Executing this C++ code produces internal data structures for the subsequent parsing procedures.
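
To make the recursion in the <number> rule concrete, here is a minimal hand-written sketch that does not use BNFLite at all (the library's internal structures are not shown in this article). The names match_digit and match_number are hypothetical helpers introduced only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers, not part of BNFLite.
// <digit> ::= "0" | ... | "9": returns the number of characters consumed (0 or 1).
std::size_t match_digit(const std::string& s, std::size_t pos)
{
    return (pos < s.size() && s[pos] >= '0' && s[pos] <= '9') ? 1 : 0;
}

// <number> ::= <digit> | <digit> <number>
// Returns the length of the longest number starting at pos (0 = no match).
std::size_t match_number(const std::string& s, std::size_t pos)
{
    std::size_t d = match_digit(s, pos);
    if (d == 0)
        return 0;                          // no leading digit: not a number
    return d + match_number(s, pos + d);   // recursive alternative: <digit> <number>
}
```

For the input "123abc", match_number consumes the three leading digits and stops at the first letter.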

Requirements for Formula Compiler

Let us specify the base requirements for the language of the formula compiler to be developed.

The byte-code compiler shall be based on an expression-like language:
1. The following base types shall be supported: <INTEGER> | <FLOAT> | <STRING>
2. The compiler shall support the unary minus operation and the basic binary arithmetic operations
3. The compiler shall be able to generate code that calls C++ functions defined in the host program

This is enough to design a simple programming language, which can be described in Backus-Naur Form.

The formal BNF term is "production rule". Each rule except a "terminal" is a set of alternatives, each of which is a sequence of more concrete rules or terminals: production_rule ::= <rule_1>...<rule_n> | <rule_n_1>...<rule_m>. The designed expression-like language can therefore be presented by the following rules:

<operand> ::= <INTEGER> | <FLOAT> | <STRING>
<operator> ::= "+" | "-" | "*" | "/"
<arguments> ::= <EMPTY> | <expression> | <arguments> "," <expression>
<function> ::= <IDENTIFIER> "(" <arguments> ")"
<expression> ::= <operand> | <function> | "-" <expression> | <expression> <operator> <expression> | "(" <expression> ")"

Such a description is formally correct, but it has two known issues. First, binary operator precedence is not expressed. Second, it contains a redundant number of recursions, which can be replaced by repetitions.

Augmented BNF (ABNF) specifications introduce constructions like <a>*<b><element> to support repetition, where <a> and <b> denote at least <a> and at most <b> occurrences of the element. For example, 3*3<element> allows exactly three occurrences and 1*2<element> allows one or two. The simplified construction *<element> allows any number (from 0 to infinity), and 1*<element> requires at least one. An optional element, *1<element> (at most one), can also be written as [element].
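
The repetition bounds can be illustrated with a small stand-alone sketch. It is not taken from BNFLite or from any ABNF tool; match_repeat, Matcher, kUnbounded and digit are hypothetical names, and the matcher is assumed to be greedy.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <string>

// Hypothetical helper types, not part of BNFLite.
// An element matcher returns the number of characters it consumes at pos (0 = no match).
using Matcher = std::function<std::size_t(const std::string&, std::size_t)>;

// Stands for the missing upper bound in *<element> or 1*<element>.
constexpr std::size_t kUnbounded = std::numeric_limits<std::size_t>::max();

// Implements <min>*<max><element>: greedily matches up to max occurrences.
// Returns the number of characters consumed, or -1 if fewer than min occurrences matched.
long match_repeat(const Matcher& element, std::size_t min, std::size_t max,
                  const std::string& s, std::size_t pos)
{
    std::size_t count = 0;
    const std::size_t start = pos;
    while (count < max) {
        std::size_t n = element(s, pos);
        if (n == 0)
            break;                 // element no longer matches
        pos += n;
        ++count;
    }
    return count >= min ? static_cast<long>(pos - start) : -1;
}

// Element matcher for DIGIT = '0'-'9'.
std::size_t digit(const std::string& s, std::size_t pos)
{
    return (pos < s.size() && s[pos] >= '0' && s[pos] <= '9') ? 1 : 0;
}
```

With these helpers, 3*3DIGIT corresponds to match_repeat(digit, 3, 3, ...), *DIGIT to match_repeat(digit, 0, kUnbounded, ...), and [DIGIT] to match_repeat(digit, 0, 1, ...).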

The full specification of the formula compiler language in ABNF form is not very complex:

ALPHA = 'A'-'Z' | 'a'-'z'
DIGIT = '0'-'9'
digit1_9 = '1'-'9'
string = '\"' *( ALPHA | DIGIT | '_' ) '\"'
e = 'e' | 'E'
exp = e ['-' | '+'] 1*DIGIT
frac = '.' 1*DIGIT
int = '0' | ( digit1_9 *DIGIT )
number = [ '-' ] int [ frac ] [ exp ]
operand = int | number | string
operator = '+' | '-' | '*' | '/'
identifier = (ALPHA | '_') *( ALPHA | DIGIT | '_' )
arguments = expression *( ',' expression )
function = identifier '(' [arguments] ')'
elementary = operand | function | ('-' elementary) | ('(' expression ')')
primary = elementary *(('/' elementary) | ('*' elementary))
expression = primary *(('-' primary) | ('+' primary))

These specs are more computer-friendly than the previous ones.
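
The layering of elementary, primary and expression is what encodes operator precedence, and the repetitions map directly onto loops. The following hand-written sketch shows this for a reduced subset of the grammar. It is not BNFLite code: MiniParser is a hypothetical name, operands are integers only, and functions, strings and whitespace are left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sketch of a recursive descent evaluator for a subset of the grammar.
struct MiniParser {
    std::string text;
    std::size_t pos;

    explicit MiniParser(std::string t) : text(std::move(t)), pos(0) {}

    bool at(char c) const { return pos < text.size() && text[pos] == c; }

    // elementary = int | ('-' elementary) | ('(' expression ')')
    int elementary() {
        if (at('-')) { ++pos; return -elementary(); }
        if (at('(')) {
            ++pos;
            int v = expression();
            if (at(')')) ++pos;            // a missing ')' is tolerated in this sketch
            return v;
        }
        int v = 0;
        while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos])))
            v = v * 10 + (text[pos++] - '0');
        return v;
    }

    // primary = elementary *(('/' elementary) | ('*' elementary))
    int primary() {
        int v = elementary();
        while (at('*') || at('/')) {
            char op = text[pos++];
            int rhs = elementary();
            if (op == '*') v *= rhs;
            else if (rhs != 0) v /= rhs;   // division by zero is skipped in this sketch
        }
        return v;
    }

    // expression = primary *(('-' primary) | ('+' primary))
    int expression() {
        int v = primary();
        while (at('+') || at('-')) {
            char op = text[pos++];
            int rhs = primary();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }
};
```

Because primary is evaluated before expression combines its results, MiniParser("2+3*4").expression() multiplies first and yields 14.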

Short Design Notes

The simple byte-code compiler is a minor extension of an expression calculator. For example, the expression 2 + 3 * 4 can be converted to a tree.

└──+
   ├── 2
   └──*
      ├── 3
      └── 4

It can be written in the C-like form “add(2, mul(3, 4))”. Let us write it in reverse: “(2,(3,4)mul)add”. This form is called reverse Polish notation (RPN), and byte-codes can be easily generated from it:

Int(2), Int(3), Int(4), Mul<I,I>, Add<I,I>

The Int(number) operation pushes the number onto the stack. A Mul or Add operation pops two operands from the stack and pushes the result.
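
The push and pop behavior can be sketched with a tiny stack machine. This is a simplified illustration rather than the interpreter from the demo package: the Op, Instr and run names are hypothetical, only integers are handled, and the program is assumed to be well formed.

```cpp
#include <stack>
#include <vector>

// Hypothetical, simplified instruction set for illustration (integers only).
enum class Op { Int, Mul, Add };
struct Instr { Op op; int value; };   // value is used by Op::Int only

// Executes a well-formed program on an operand stack and returns the final result.
int run(const std::vector<Instr>& program)
{
    std::stack<int> st;
    for (const Instr& in : program) {
        if (in.op == Op::Int) {       // Int(n): push the immediate operand
            st.push(in.value);
            continue;
        }
        int rhs = st.top(); st.pop(); // Mul/Add: pop two operands...
        int lhs = st.top(); st.pop();
        st.push(in.op == Op::Mul ? lhs * rhs : lhs + rhs);   // ...and push the result
    }
    return st.top();
}
```

Running the sequence Int(2), Int(3), Int(4), Mul, Add through run() leaves 14 on the stack, the value of 2 + 3 * 4.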

The practical benefit of the formula compiler appears only when functions are used. Let us consider the “linear map” 2 + 3 * GetX(). The byte-code will be:

Int(2), Int(3), Call<0>, Mul<I,I>, Add<I,I>

For example, this functionality can be applied row by row to a database column X to obtain a column Y. (Moreover, the provided demo performs four calculations simultaneously, but that is a topic for another paper.)
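
Applying one compiled formula to a whole column might look like the sketch below. It is an assumption-laden illustration, not the demo's interpreter: Op, Instr and map_column are hypothetical names, the Call instruction simply pushes the X value of the current row as a stand-in for a host function such as GetX(), and rows are processed one at a time rather than four at once.

```cpp
#include <stack>
#include <vector>

// Hypothetical, simplified instruction set for illustration (integers only).
enum class Op { Int, Call, Mul, Add };
struct Instr { Op op; int value; };   // value is used by Op::Int only

// Runs the same program once per row of column X and collects column Y.
std::vector<int> map_column(const std::vector<Instr>& program, const std::vector<int>& x)
{
    std::vector<int> y;
    for (int xi : x) {
        std::stack<int> st;
        for (const Instr& in : program) {
            switch (in.op) {
            case Op::Int:  st.push(in.value); break;   // immediate operand
            case Op::Call: st.push(xi);       break;   // stand-in for GetX()
            default: {                                  // Mul or Add
                int rhs = st.top(); st.pop();
                int lhs = st.top(); st.pop();
                st.push(in.op == Op::Mul ? lhs * rhs : lhs + rhs);
            }
            }
        }
        y.push_back(st.top());
    }
    return y;
}
```

For X = 0, 1, 2, 3 and the byte-code Int(2), Int(3), Call, Mul, Add, the resulting Y column is 2, 5, 8, 11.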

Every byte-code consists of two fields: a type and a union holding an integer, a float, or a pointer to a string.

struct byte_code
{
    OpCode type;
    union {
        int val_i;
        float val_f;
        const char* val_s;
    };
    byte_code(): type(opNop), val_i(0) {};
    /* … */

The union represents the operand for the simple “immediate addressing mode”. The type is the operation code based on the following enumeration:

enum OpCode
{
    opFatal = -1, opNop = 0,
    opInt = 1,  opFloat = 2,  opStr = 3,  opMaskType = 0x03,
    opError = 1, opNeg = 2, opPos = 3, opCall = 4,
    opToInt = 5,  opToFloat = 6, opToStr = 7,
    opAdd = 2,  opSub = 3,  opMul = 4,  opDiv = 5,
};

The real operation code is a little more complex: it also contains the types of the stack operands.
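
One way such a combined code could be laid out is sketched below. The bit layout is purely an assumption made for illustration (two bits per operand type, matching opMaskType = 0x03, with the operation in the higher bits); the encoding actually used by the demo package may differ. The helper names pack_op, op_of, left_type_of and right_type_of are hypothetical.

```cpp
// Subset of the enumeration above, repeated so the sketch is self-contained.
enum OpCode { opInt = 1, opFloat = 2, opStr = 3, opMaskType = 0x03, opAdd = 2, opMul = 4 };

// Assumed layout: [ operation | left operand type (2 bits) | right operand type (2 bits) ].
inline int pack_op(int operation, int left_type, int right_type)
{
    return (operation << 4) | ((left_type & opMaskType) << 2) | (right_type & opMaskType);
}

// Accessors that undo the packing.
inline int op_of(int code)         { return code >> 4; }
inline int left_type_of(int code)  { return (code >> 2) & opMaskType; }
inline int right_type_of(int code) { return code & opMaskType; }
```

Under this assumed layout, an instruction written as Mul<I,F> would be pack_op(opMul, opInt, opFloat), and an interpreter could dispatch on the three fields separately.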

BNFLite Grammar classes (Token, Lexem and Rule)

The BNFLite description is similar to the ABNF specification above.

Number:

Token digit1_9('1', '9');
Token DIGIT("0123456789");
Lexem i_digit = 1*DIGIT;
Lexem frac_ = "." + i_digit;
Lexem int_ = "0" | digit1_9  + *DIGIT;
Lexem exp_ = "Ee" + !Token("+-") + i_digit;
Lexem number_ = !Token("-") + int_ + !frac_ + !exp_;
Rule number = number_;

The Token class represents terminals, which are symbols. The Lexem class constructs strings of symbols. Parsing is the process of analyzing the tokens and lexemes of the input stream. The unary C++ operator `*` means the construction can be repeated from 0 to infinity times. The binary operator `*`, as in 1*DIGIT, means the construction can be repeated from 1 to infinity times. The unary operator `!` means the construction is optional.
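
How overloaded operators can carry this meaning is shown by the miniature EDSL below. It is a hypothetical teaching sketch and not the BNFLite implementation: the type P, the functions token and make_number, and the choice of "first match wins" for alternatives are all assumptions made for brevity.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical miniature EDSL, not the BNFLite implementation.
// A parser maps (input, position) to the new position, or npos on failure.
const std::size_t npos = std::string::npos;
struct P {
    std::function<std::size_t(const std::string&, std::size_t)> run;
};

// Terminal: any single character from the given set.
P token(std::string set) {
    return P{[set](const std::string& s, std::size_t i) {
        return (i < s.size() && set.find(s[i]) != std::string::npos) ? i + 1 : npos;
    }};
}
// Sequence: a + b
P operator+(P a, P b) {
    return P{[a, b](const std::string& s, std::size_t i) {
        std::size_t j = a.run(s, i);
        return j == npos ? npos : b.run(s, j);
    }};
}
// Alternative: a | b (the first matching alternative wins in this sketch)
P operator|(P a, P b) {
    return P{[a, b](const std::string& s, std::size_t i) {
        std::size_t j = a.run(s, i);
        return j != npos ? j : b.run(s, i);
    }};
}
// Repetition from 0 to infinity: *a
P operator*(P a) {
    return P{[a](const std::string& s, std::size_t i) {
        for (std::size_t j; (j = a.run(s, i)) != npos && j != i; i = j) {}
        return i;
    }};
}
// Optional: !a
P operator!(P a) {
    return P{[a](const std::string& s, std::size_t i) {
        std::size_t j = a.run(s, i);
        return j != npos ? j : i;
    }};
}

// number = [ '-' ] int [ frac ], written with the overloaded operators.
P make_number() {
    P digit1_9 = token("123456789");
    P DIGIT = token("0123456789");
    P int_ = token("0") | digit1_9 + *DIGIT;
    P frac = token(".") + DIGIT + *DIGIT;
    return !token("-") + int_ + !frac;
}
```

C++ operator precedence does part of the work here: unary `*` and `!` bind tighter than `+`, which binds tighter than `|`, so the grammar line reads much like its ABNF counterpart. Calling make_number().run("-12.5", 0) consumes all five characters.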

In practice, Rule is similar to Lexem except for callbacks and sensitivity to spaces. For example, declaring Rule number_ = !Token("-") + int_ + !frac_ + !exp_; instead of a Lexem would incorrectly allow spaces, e.g. between the integer and fractional parts.

Strings and identifiers:

Token az_("_"); az_.Add('A', 'Z'); az_.Add('a', 'z');
Token az01_(az_); az01_.Add('0', '9');
Token all(1,255); all.Remove("\"");
Lexem identifier = az_  + *(az01_);
Lexem quotedstring = "\"" + *all + "\"";

The Token all represents all symbols except the double quote.

Major parsing rules:

Rule expression;
Rule unary;
Rule function = identifier + "(" + !(expression + *("," + expression)) +  ")";
Rule elementary = AcceptFirst()
        | "(" + expression + ")"
        | function
        | number
        | quotedstring + printErr
        | unary;
unary = Token("-") + elementary;
Rule primary = elementary + *("*%/" + elementary);
/* Rule */ expression = primary + *("+-" + primary);

The expression and unary rules are recursive, so they need to be declared in advance. The AcceptFirst() statement changes the parser behavior for this particular Rule from the default “accept best” mode to “accept first”: the parser then chooses the first appropriate alternative instead of the most appropriate one.
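
The difference between the two modes can be sketched as follows. This is a hypothetical illustration: it assumes that "most appropriate" means the longest match, and the names Alt, accept_first, accept_best and literal do not come from BNFLite, whose own selection logic may differ in detail.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch; an alternative returns the number of characters it matches (0 = no match).
using Alt = std::function<std::size_t(const std::string&)>;

// "Accept first": stop at the first alternative that matches.
std::size_t accept_first(const std::vector<Alt>& alts, const std::string& s)
{
    for (const Alt& a : alts)
        if (std::size_t n = a(s))
            return n;
    return 0;
}

// "Accept best" (assumed here to mean the longest match): try every alternative.
std::size_t accept_best(const std::vector<Alt>& alts, const std::string& s)
{
    std::size_t best = 0;
    for (const Alt& a : alts) {
        std::size_t n = a(s);
        if (n > best)
            best = n;
    }
    return best;
}

// Literal matcher: matches the given text at the start of s.
Alt literal(std::string text)
{
    return [text](const std::string& s) {
        return s.compare(0, text.size(), text) == 0 ? text.size() : std::size_t(0);
    };
}
```

With the alternatives "<" and "<=" listed in that order and the input "<= 3", accept_first stops after one character while accept_best takes two. Accepting the first match is cheaper, but it makes the order of alternatives significant.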

Parser Calls and Callbacks

A recursive descent parser implies tree composition. The user needs a means to track the tree traversal recursion. First of all, the interface object structure used to interact with the parser should be specified.

typedef Interface< std::list<byte_code> > Gen;
/* ... */
const char* tail = 0;
Gen result;
int tst = Analyze(expression, expr.c_str(), &tail, result);

Parsing is started by the Analyze call with the following arguments:

expression – root Rule object of the BNFLite grammar
expr.c_str() – string of the expression to be compiled to byte-code
tail – pointer to the end of the text on successful parsing, or to the place where parsing stopped due to an error
result – final object produced by the user's callbacks
tst – return value: the result of parsing as bitwise flags; negative on error

A callback function accepts a vector of the byte-code lists formed earlier and returns the list of newly formed byte-codes. For example, Gen DoNumber(std::vector<Gen>& res) accepts a one-element vector representing a number. It has to return a Gen object filled with the user's data (either Int<> or Float<>).

In the general case, the callback Gen DoBinary(std::vector<Gen>& res) accepts the left byte-code list, the sign of the operation (+, -, *, /), and the right byte-code list. The callback simply joins the left and right byte-code lists and appends the final byte-code according to the sign.
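
The joining step is what produces the postfix order. The sketch below shows only that idea; it does not use the Gen interface from the article. Code and join_binary are hypothetical names, and byte-codes are represented by plain strings to keep the example self-contained.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for a byte-code list; byte-codes are shown as strings.
using Code = std::list<std::string>;

// Joins the left and right operand code and appends the operation,
// which yields postfix (RPN) order.
Code join_binary(Code left, char sign, Code right)
{
    left.splice(left.end(), right);   // left operand code, then right operand code
    switch (sign) {
    case '+': left.push_back("Add"); break;
    case '-': left.push_back("Sub"); break;
    case '*': left.push_back("Mul"); break;
    case '/': left.push_back("Div"); break;
    default:  left.push_back("Error"); break;   // unknown sign
    }
    return left;
}
```

For 2 + 3 * 4 the inner call joins Int(3) and Int(4) with Mul, and the outer call joins Int(2) with that result and Add, giving Int(2), Int(3), Int(4), Mul, Add.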

Comparison with Boost::Spirit Implementation

This is the grammar part implemented by means of the boost::spirit template library:

namespace q = boost::spirit::qi;

typedef boost::variant<
    int, float, std::string,
    boost::recursive_wrapper<struct binary_node>,
    boost::recursive_wrapper<struct unary_node>,
    boost::recursive_wrapper<struct call_node>
> branch_t;

template <typename Iterator>
struct SParser: q::grammar<Iterator, branch_t, q::space_type>
{
    q::rule<Iterator, branch_t, q::space_type> expression, primary, elementary, unary;
    q::rule<Iterator, std::string()> quotedstring;
    q::rule<Iterator, std::string()> identifier;
    q::rule<Iterator, std::vector<branch_t>, q::space_type> arglist;
    q::rule<Iterator, call_node, q::space_type> function;

    SParser() : SParser::base_type(expression)
    {
    using boost::phoenix::construct;

    expression = primary[q::_val = q::_1]
        >> *('+'  > primary[q::_val = construct<binary_node>(q::_val, q::_1, '+')]
            | '-' > primary[q::_val = construct<binary_node>(q::_val, q::_1, '-')] );
    primary =   elementary[q::_val = q::_1]
        >> *('*'  > elementary[q::_val = construct<binary_node>(q::_val, q::_1, '*')]
            | '/' > elementary[q::_val = construct<binary_node>(q::_val, q::_1, '/')] );
    unary = '-' > elementary[q::_val = construct<unary_node>(q::_1, '-')];

    elementary = q::real_parser<float, q::strict_real_policies<float>>()[q::_val = q::_1]
        | q::int_[q::_val = q::_1]
        | quotedstring[q::_val = q::_1]
        | ('(' > expression > ')')[q::_val = q::_1]
        | function[q::_val = q::_1]
        | unary[q::_val = q::_1];

    quotedstring = '"' > *(q::char_ - '"') > '"';
    identifier = (q::alpha | q::char_('_')) >> *(q::alnum | q::char_('_'));

    arglist = '(' > (expression % ',')[q::_val = q::_1] > ')';
    function = (identifier > arglist)[q::_val = construct<call_node>(q::_1, q::_2)];

    on_error(expression, std::cout << boost::phoenix::val("Error "));
    }
};

Please refer to the boost::spirit documentation for details. The example is provided to demonstrate the similarity of the boost::spirit grammar to the BNFLite one.

Boost::spirit is a more mature tool, and the example above does not use callbacks. For each “rule” it calls constructors of classes that should be members of the boost::variant. Used this way, boost::spirit is not suited to one-pass compilers. However, boost::variant can be effective if something more complex is required (e.g. byte-code optimizers).

The problems of boost::spirit lie in another area. Any inclusion of boost::spirit considerably increases project compile time. Another significant drawback of boost::spirit is related to run-time debugging of the grammar: it is not possible at all!

Debugging of BNFLite Grammar

Writing a grammar in an EDSL is unusual, and the user does not have a full understanding of the parser. If the Analyze call returns an error for correct text, the user should always consider the possibility of grammar bugs.

Return code

The return code from the Analyze call can contain flags related to the grammar. For example, the eBadRule and eBadLexem flags mean the tree of rules is not properly built.

Names and breakpoints

The user can assign a name to a Rule. This helps to track the recursive descent parser through the Rule::_parse function. The debugger call stack shows which Rule was applied and when; the user just needs to watch the this->name variable. It is not as difficult as it seems at first glance.

Grammar subsets

The Analyze function can be applied as a unit test to any Rule representing a subset of the grammar.

Tracing

A function with the prototype “bool foo(const char* lexem, size_t len)” can be used in BNFLite expressions for two purposes: to obtain intermediate results and to report anticipated errors.

This function will print the parsed number:

static bool DebugNumber(const char* lexem, size_t len)
{
    // %.*s expects an int precision, so len is cast explicitly
    printf("The number is: %.*s;\n", (int)len, lexem);
    return true;
}
    /* … */
Rule number = number_ + DebugNumber;

The function should return true if the result is correct.

Let us assume that numbers with a leading ‘+’ are not allowed:

static bool ErrorPlusNumber(const char* lexem, size_t len)
{
    // %.*s expects an int precision, so len is cast explicitly
    printf("The number %.*s with plus is not allowed\n", (int)len, lexem);
    return false;
}
    /* … */
Lexem number_p = !Token("+") + int_ + !frac_ + !exp_ + ErrorPlusNumber;
Rule number = number_ | number_p;

The function should return false to signal the incorrect result to the parser. C++11 constructions like the one below are also possible:

Rule number = number_ | [](const char* lexem, size_t len)
{ return !printf("The number %.*s with plus is not allowed\n", (int)len, lexem); };
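
The hook mechanism described in this section can be imitated without the library. In the hypothetical sketch below, a matcher hands the matched text to a hook with the same prototype, and a false return value vetoes the match. The names Hook, match_with_hook, trace and reject_plus are made up for this illustration, and the matcher handles only digits with an optional leading '+'.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical sketch, not BNFLite: a hook sees the matched text and can veto the match.
using Hook = bool (*)(const char* lexem, std::size_t len);

// Matches a run of digits with an optional leading '+', then asks the hook to confirm.
// Returns the number of characters accepted (0 when nothing matched or the hook vetoed).
std::size_t match_with_hook(const std::string& s, Hook hook)
{
    std::size_t i = (!s.empty() && s[0] == '+') ? 1 : 0;
    const std::size_t digits_start = i;
    while (i < s.size() && s[i] >= '0' && s[i] <= '9')
        ++i;
    if (i == digits_start)
        return 0;                         // no digits at all
    return hook(s.data(), i) ? i : 0;
}

// Tracing hook: prints the lexeme and accepts it.
bool trace(const char* lexem, std::size_t len)
{
    std::printf("matched: %.*s\n", static_cast<int>(len), lexem);
    return true;
}

// Error hook: rejects lexemes that start with '+'.
bool reject_plus(const char* lexem, std::size_t len)
{
    return !(len > 0 && lexem[0] == '+');
}
```

With the tracing hook, "+42" is accepted in full; with the error hook the same input is rejected, while "42" still passes.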

Building and Running

The attached simplest formula compiler package contains the following C++ files:

main.cpp - starter of the byte-code compiler and interpreter
parser.cpp - BNFLite parser with grammar section and callbacks
code_gen.cpp - byte-code generator
code_lib.cpp - several examples of built-in functions (e.g. POW(2,3), the power 2*2*2)
code_run.cpp - byte-code interpreter (uses SSE2 for parallel calculation of 4 formulas)

The package has several prebuilt projects, but in general it can be built as:

$ g++ -O2 -march=pentium4 -std=c++14 -I.. code_gen.cpp parser.cpp code_lib.cpp main.cpp code_run.cpp

The function GetX() returns four values: 0, 1, 2, 3. It can be used in an expression to be compiled and calculated:

$ a.exe "2 + 3 *GetX()"
5 byte-codes in: 2+3*GetX()
Byte-code: Int(2),Int(3),opCall<I>,opMul<I,I>,opAdd<I,I>;
result = 2, 5, 8, 11;

Conclusion

The presented formula compiler realizes several ideas in order to demonstrate their feasibility. First, it performs parallel computation according to a compiled formula. Second, it introduces the BNFLite header library to extend the applicability of BNF forms. A fast parser implementation can now be easily created for specific customer requirements wherever a BNF-like language is used.

References

[1] BNFLite with some examples, https://github.com/r35382/bnflite
