├── LICENSE ├── README.md └── articles ├── digit-separator.md ├── range-operator-2.md └── range-operator.md /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Thomas Punt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Update: I will no longer be posting articles here. Instead, check out the [PHPInternals.net](https://phpinternals.net) website for new content around PHP's internals. 2 | 3 | # PHP Internals Articles 4 | 5 | This repository contains articles that will aim to provide an introduction to 6 | [PHP's internals](https://github.com/php/php-src). 
Some articles will take a 7 | pragmatic approach (typically having more breadth with less depth), and other 8 | articles will focus solely on particular aspects of internals (typically having 9 | less breadth with great depth). All of the articles here will be written for 10 | PHP 7 (powered by the Zend Engine version 3 (ZE3)) and above. 11 | 12 | These articles assume that the reader is able to build PHP from source. If this 13 | is not the case, then please see the [Building PHP](http://www.phpinternalsbook.com/build_system/building_php.html) chapter of 14 | the [PHP Internals Book](http://www.phpinternalsbook.com/) first. 15 | 16 | 17 | ## Article Index 18 | 19 | - Range Operator: 20 | - [Part 1] [Implementing a Range Operator in PHP](https://github.com/tpunt/php-internals-articles/blob/master/articles/range-operator.md) 21 | - [Part 2] [A Reimplementation of the Range Operator](https://github.com/tpunt/php-internals-articles/blob/master/articles/range-operator-2.md) 22 | - [Implementing a Digit Separator in PHP](https://github.com/tpunt/php-internals-articles/blob/master/articles/digit-separator.md) 23 | 24 | 25 | Feel free to open up issues to request/suggest topics that you would like to see covered! 26 | 27 | ## Other sources for PHP's internals 28 | - [PHP Internals Book](http://www.phpinternalsbook.com/) 29 | - [Nikita's Blog](http://nikic.github.io) 30 | - [Julien's Blog](http://jpauli.github.io) 31 | - [Anthony's Blog](https://blog.ircmaxell.com) 32 | - [Sara's Blog](http://blog.golemon.com) 33 | - [Extending and Embedding PHP](http://www.amazon.com/Extending-Embedding-PHP-Sara-Golemon/dp/067232704X) 34 | -------------------------------------------------------------------------------- /articles/digit-separator.md: -------------------------------------------------------------------------------- 1 | # Implementing a Digit Separator 2 | 3 | The digit separator provides developers with a way to format numerical literals 4 | in code. 
This provides greater readability to large numbers by breaking them 5 | into sub-groups, where a character (the underscore, in this case) acts as a 6 | delimiter. 7 | 8 | This article will explain how we can add this relatively simple feature into 9 | PHP. 10 | 11 | Digression: 12 | > The official RFC for this can be found at: [Number Format Separator] 13 | (https://wiki.php.net/rfc/number_format_separator) 14 | 15 | 16 | ## Digit Separator Syntax 17 | 18 | The following syntactic rules have been made: 19 | 1. Disallow leading underscores 20 | 2. Disallow trailing underscores 21 | 3. Disallow adjacent underscores 22 | 4. Allow underscores between digits only 23 | 5. Allow arbitrary grouping of digits 24 | 25 | (These five rules will be referenced below.) 26 | 27 | Thus, the following examples demonstrate invalid usages of the underscore: 28 | ```PHP 29 | _100; // invalid as a numerical literal since it's already a valid constant name in PHP 30 | 100_; // Parse error: syntax error, unexpected '_' (T_STRING)... 31 | 1__000___000_00; // Parse error: syntax error, unexpected '__000___000_00' (T_STRING)... 32 | 100_.0; // Parse error: syntax error, unexpected '_' (T_STRING) in... 33 | 100._01; // Parse error: syntax error, unexpected '_01' (T_STRING) in... 34 | 0x_0123; // Parse error: syntax error, unexpected 'x_0123' (T_STRING) in... 35 | 0b_0101; // Parse error: syntax error, unexpected 'b_0101' (T_STRING) in... 36 | 1_e2; // Parse error: syntax error, unexpected '_e2' (T_STRING)... 37 | 1e_2; // Parse error: syntax error, unexpected 'e_2' (T_STRING)... 38 | ``` 39 | 40 | And the following examples demonstrate valid usages of the underscore: 41 | ```PHP 42 | 1_234_567; 43 | 123_4567; 44 | 1_00_00_00_000; 45 | 3.141_592; 46 | 0x11_22_33_44_55_66; 47 | 0x1122_3344_5566; 48 | 0b0010_1101; 49 | 0267_3432; 50 | 1_123.456_7e2; 51 | ``` 52 | 53 | ## The Implementation 54 | 55 | Implementing our digit separator will only involve updating the lexer. 
We will 56 | start by changing the current lexing rules so that numerical literals with 57 | underscores in them are recognised by the lexer. These underscores will then be 58 | stripped so that the numerical literals can be parsed (converted from stringy 59 | numbers to actual numbers) by PHP's internal functions/macros into suitable 60 | numeric types. 61 | 62 | ### Update the lexing rules 63 | 64 | If we execute the following code: 65 | ``` 66 | 1_000; 67 | ``` 68 | 69 | It will result in: 70 | ``` 71 | Parse error: syntax error, unexpected '_0' (T_STRING) in Command line code on line 1 72 | ``` 73 | 74 | We are given a nice parse error because the syntax is currently unknown to the 75 | lexer. The lexer attempted to match against all of the numerical literal rules, 76 | but failed to find a match because it does not recognise numbers with 77 | underscores in them. So let's start by updating these rules. 78 | 79 | In [Zend/zend_language_scanner.l](http://lxr.php.net/xref/PHP_7_0/Zend/zend_language_scanner.l) 80 | (line ~1100), we have the following rules: 81 | ``` 82 | LNUM [0-9]+ 83 | DNUM ([0-9]*"."[0-9]+)|([0-9]+"."[0-9]*) 84 | EXPONENT_DNUM (({LNUM}|{DNUM})[eE][+-]?{LNUM}) 85 | HNUM "0x"[0-9a-fA-F]+ 86 | BNUM "0b"[01]+ 87 | ``` 88 | 89 | The above shows the five basic ways to represent numerics in PHP. The regular 90 | expressions define their syntax, and so these are what we're going to want to 91 | change. 
Let's update these regular expressions with the five aforementioned 92 | syntax rules in mind for using underscores in numerical literals: 93 | ``` 94 | LNUM [0-9]+(_[0-9]+)* 95 | DNUM (([0-9]+(_[0-9]+)*)*"."([0-9]+(_[0-9]+)*)+)|(([0-9]+(_[0-9]+)*)+"."([0-9]+(_[0-9]+)*)*) 96 | EXPONENT_DNUM (({LNUM}|{DNUM})[eE][+-]?{LNUM}) 97 | HNUM "0x"[0-9a-fA-F]+(_[0-9a-fA-F]+)* 98 | BNUM "0b"[01]+(_[01]+)* 99 | ``` 100 | 101 | Looking at the `LNUM` rule, we can see that the regular expression specifies 102 | that there must be at least one leading digit (satisfying rule #1). It then 103 | says that there may be zero or more instances of an underscore and at 104 | least one following digit (satisfying rules #2, #3, and #5). Rule #4 has been 105 | satisfied in the `DNUM`, `EXPONENT_DNUM`, `HNUM`, and `BNUM` rules by 106 | disallowing the underscore to sit adjacently to the **.**, **e**, **0x**, and 107 | **0b**. 108 | 109 | 110 | Now, compile PHP and execute the following code again: 111 | ```PHP 112 | 1_000; 113 | ``` 114 | 115 | Result: 116 | ``` 117 | Parse error: Invalid numeric literal in Command line code on line 1 118 | ``` 119 | 120 | The result is still a parse error, but now it is no longer a syntax error. So 121 | the lexer now sees that it is a numerical literal, but it thinks it's invalid 122 | because the string to integer conversion functions/macros in PHP's internals 123 | (like `ZEND_STRTOL` and `zend_strtod`) cannot parse the stringy numerics with 124 | underscores in. We could update these functions/macros to cater for the new 125 | underscores, but this would change the coercion rules in numerous other parts 126 | of the language that use these. 127 | 128 | The simpler solution (that would also maintain backwards compatibility) would 129 | be to strip the underscores in the lexer. This will be the next step. 
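To make the updated `LNUM` rule more concrete, the following is a minimal hand-written sketch of the same pattern, `[0-9]+(_[0-9]+)*`, in plain C. The `is_valid_lnum` helper is hypothetical and is not part of php-src (the real matching is generated by re2c from the rules above); it only illustrates how one pattern covers rules #1 to #4.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper (not part of php-src): returns true when the whole of
 * `s` matches the updated LNUM pattern [0-9]+(_[0-9]+)*. */
bool is_valid_lnum(const char *s)
{
    /* Rule #1: the literal must begin with a digit, never an underscore.
     * This also rejects the empty string. */
    if (!isdigit((unsigned char)*s)) {
        return false;
    }

    while (*s != '\0') {
        if (isdigit((unsigned char)*s)) {
            ++s;
        } else if (*s == '_' && isdigit((unsigned char)s[1])) {
            /* Rules #2 to #4: an underscore is accepted only when a digit
             * follows it, which rules out trailing and adjacent underscores.
             * Consume the underscore and that digit together. */
            s += 2;
        } else {
            return false;
        }
    }

    return true;
}
```

With this sketch, `is_valid_lnum("1_000")` and `is_valid_lnum("1_00_00")` hold (rule #5, arbitrary grouping), while `"_100"`, `"100_"`, and `"1__000"` are all rejected.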
130 | 131 | ### Updating the lexing actions 132 | 133 | Let's take a look at the current token definition for `BNUM`: 134 | ```C 135 | <ST_IN_SCRIPTING>{BNUM} { 136 | char *bin = yytext + 2; /* Skip "0b" */ 137 | int len = yyleng - 2; 138 | char *end; 139 | 140 | /* Skip any leading 0s */ 141 | while (*bin == '0') { 142 | ++bin; 143 | --len; 144 | } 145 | 146 | if (len < SIZEOF_ZEND_LONG * 8) { 147 | if (len == 0) { 148 | ZVAL_LONG(zendlval, 0); 149 | } else { 150 | errno = 0; 151 | ZVAL_LONG(zendlval, ZEND_STRTOL(bin, &end, 2)); 152 | ZEND_ASSERT(!errno && end == yytext + yyleng); 153 | } 154 | RETURN_TOKEN(T_LNUMBER); 155 | } else { 156 | ZVAL_DOUBLE(zendlval, zend_bin_strtod(bin, (const char **)&end)); 157 | /* errno isn't checked since we allow HUGE_VAL/INF overflow */ 158 | ZEND_ASSERT(end == yytext + yyleng); 159 | RETURN_TOKEN(T_DNUMBER); 160 | } 161 | } 162 | ``` 163 | 164 | The `ST_IN_SCRIPTING` mode is the state the lexer is currently in when it 165 | matches the `BNUM` rule, and so it will only match binary numbers when in 166 | normal scripting mode. The code between the curly braces is C code that will be 167 | executed when a binary number is found in the source code. 168 | 169 | The `yytext` macro is used to fetch the matched input data, and the `yyleng` 170 | macro holds the length of the matched text. They are both macros 171 | (disguised as variables) rather than actual 172 | `yytext` and `yyleng` variables because they have global state. 173 | 174 | Digression: 175 | > Variables with global state need the Thread Safe Resource Manager (TSRM) to 176 | > handle the fetching of these global variables when PHP is built in Zend 177 | > Thread Safety (ZTS) mode. For example, the `yytext` macro expands to 178 | > `((char*)SCNG(yy_text))`, where the `SCNG` macro is expanded into 179 | > `LANG_SCNG`, which encapsulates this fetch operation for us. 
As a consequence 180 | > of thread safety in PHP, you'll see all global variables encapsulated in 181 | > macros so that the TSRM is hidden away behind `#ifdefs` that are resolved at 182 | > preprocessing time. 183 | 184 | Since the leading **0b** is not needed when parsing binary literals, the 185 | `bin` variable skips it by being set to `yytext + 2`. This also decreases the 186 | matched text length by two, which is why `len = yyleng - 2`. From there, we skip 187 | all leading 0's because they are not important, and then the binary number is 188 | either parsed as an integer if it is within the range of a long, or as a double 189 | otherwise. 190 | 191 | Let's take a look at the updated `BNUM` token definition to account for 192 | underscores: 193 | ```C 194 | <ST_IN_SCRIPTING>{BNUM} { 195 | /* The +/- 2 skips "0b" */ 196 | int len = yyleng - 2, contains_underscores, i; 197 | char *end, *bin = yytext + 2; 198 | 199 | /* Skip any leading 0s and underscores */ 200 | while (*bin == '0' || *bin == '_') { 201 | ++bin; 202 | --len; 203 | } 204 | 205 | for (i = 0; i < len && bin[i] != '_'; ++i); 206 | 207 | contains_underscores = i != len; 208 | 209 | if (contains_underscores) { 210 | bin = estrndup(bin, len); 211 | STRIP_UNDERSCORES(bin, len); 212 | } 213 | 214 | if (len < SIZEOF_ZEND_LONG * 8) { 215 | if (len == 0) { 216 | ZVAL_LONG(zendlval, 0); 217 | } else { 218 | errno = 0; 219 | ZVAL_LONG(zendlval, ZEND_STRTOL(bin, &end, 2)); 220 | ZEND_ASSERT(!errno && end == bin + len); 221 | } 222 | if (contains_underscores) { 223 | efree(bin); 224 | } 225 | RETURN_TOKEN(T_LNUMBER); 226 | } else { 227 | ZVAL_DOUBLE(zendlval, zend_bin_strtod(bin, (const char **)&end)); 228 | /* errno isn't checked since we allow HUGE_VAL/INF overflow */ 229 | ZEND_ASSERT(end == bin + len); 230 | if (contains_underscores) { 231 | efree(bin); 232 | } 233 | RETURN_TOKEN(T_DNUMBER); 234 | } 235 | } 236 | ``` 237 | 238 | The new code differs in a few ways from the previous code. 
Firstly, `0`'s and 239 | `_`'s are now stripped from the beginning of the matched text. This ensures 240 | that binary numbers like the following are properly stripped: 241 | ```PHP 242 | 0b00_00_11_01; 243 | ``` 244 | 245 | Secondly, we are now checking if there are any underscores present in the 246 | matched text. If there aren't any, then we continue working directly off of 247 | `yytext`. If, however, there are underscores present, then we must perform a 248 | copy of the matched string because `yytext` cannot be directly modified. This 249 | is because files in PHP are loaded in via `mmap`, and so attempting to rewrite 250 | `yytext` (that points to this) will cause a bus error because of insufficient 251 | write permissions for that mmap'ed segment. The string copying is done with the 252 | `estrndup` macro (notice the leading **e** here). 253 | 254 | Digression: 255 | > Whenever new memory needs to be allocated, the normal memory allocation 256 | > functions are not used - instead, their counterpart macros (disguised as 257 | > functions) are. These macros (including `emalloc`, `ecalloc`, `efree`, and 258 | > so on) are a part of the Zend Memory Manager (ZendMM). ZendMM handles the 259 | > cleanup of allocated memory when the Zend Engine bails out because of an 260 | > error (like a fatal error) typically by calling `php_error_docref`. It also 261 | > keeps track of the amount of memory allocated for the current request to 262 | > ensure that the amount of allocated memory does not surpass the 263 | > `memory_limit` ini setting. 
264 | 265 | A new `STRIP_UNDERSCORES` macro has also been added (at the top of the same 266 | file): 267 | ```C 268 | #define STRIP_UNDERSCORES(n, len) do { \ 269 | int i, old_len = len; \ 270 | char *new_n, *old_n; \ 271 | for (i = 0, new_n = old_n = n; i < old_len; ++i, ++old_n) { \ 272 | if (*old_n != '_') { \ 273 | *new_n++ = *old_n; \ 274 | } else { \ 275 | --len; \ 276 | } \ 277 | } \ 278 | if (old_len > len) { \ 279 | *new_n = '\0'; \ 280 | } \ 281 | } while (0) 282 | ``` 283 | 284 | This simply compacts the copied string in place, skipping 285 | underscores as it encounters them. 286 | 287 | The other small changes made in `BNUM`'s new definition are the replacement of 288 | `yytext` with `bin` and `yyleng` with `len`, as well as the conditional 289 | `efree`ing if the string was copied. 290 | 291 | Now compile PHP and try using underscores in binary numbers: 292 | ```PHP 293 | var_dump(0b00_00_11_01); // int(13) 294 | ``` 295 | 296 | It works! 297 | 298 | Have a go at rewriting the other `LNUM`, `HNUM`, and `DNUM` definitions (they 299 | are similar to, if not simpler than, `BNUM`). To see what they look like, 300 | check out the [changed files of my PR](https://github.com/php/php-src/pull/1699/files). 301 | 302 | ## Conclusion 303 | 304 | We've covered how only a few changes in the lexer rules and definitions are 305 | needed to accept a new syntax in PHP's numerical literals. On the way, we've 306 | also glossed over the TSRM and ZendMM. I hope this has provided a brief, yet 307 | pragmatic overview into PHP's lexer, and I hope to delve further into this 308 | topic in future articles. 
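As a recap, the copy, strip, and convert sequence performed by the updated `BNUM` action can be tried outside of PHP with a small standalone sketch. It makes several assumptions: `parse_binary_literal` is a hypothetical name, plain `malloc`/`free` stand in for ZendMM's `estrndup`/`efree`, the input is assumed to have already matched the updated `BNUM` rule, and the value is assumed to fit in a `long` (the fallback to a double is not reproduced).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the lexer action: converts a binary literal
 * such as "0b00_00_11_01" into a long. Assumes the text already matched the
 * updated BNUM rule and that the value fits in a long. */
long parse_binary_literal(const char *text)
{
    size_t len = strlen(text);
    const char *src;
    char *copy, *dst;
    long value;

    assert(len > 2 && text[0] == '0' && text[1] == 'b');

    /* Work on a private copy, as the lexer does, because the matched input
     * must not be modified in place. malloc() is used here where the lexer
     * would use ZendMM. */
    copy = malloc(len - 2 + 1);
    if (copy == NULL) {
        abort();
    }

    /* Copy everything after "0b", dropping underscores along the way. */
    dst = copy;
    for (src = text + 2; *src != '\0'; ++src) {
        if (*src != '_') {
            *dst++ = *src;
        }
    }
    *dst = '\0';

    /* Base 2 conversion of the underscore-free digits. */
    value = strtol(copy, NULL, 2);
    free(copy);

    return value;
}
```

Calling `parse_binary_literal("0b00_00_11_01")` yields 13, matching the `var_dump` output shown earlier.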
309 | -------------------------------------------------------------------------------- /articles/range-operator-2.md: -------------------------------------------------------------------------------- 1 | # A Reimplementation of the Range Operator 2 | 3 | [In the prequel to this article](https://github.com/tpunt/php-internals-articles/blob/master/articles/range-operator.md) 4 | (hint: make sure you've read it first), I showed one way to implement a range 5 | operator in PHP. Initial implementations, however, are rarely the best, and so 6 | it is the intention of this article to look at how the previous implementation 7 | can be improved. 8 | 9 | Thanks once again to [Nikita Popov](https://github.com/nikic/) for proofreading 10 | this article! 11 | 12 | 13 | ## Previous Implementation Drawbacks 14 | 15 | The initial implementation put all of the logic for the range operator into the 16 | Zend VM, which forced computation to take place purely at runtime when the 17 | `ZEND_RANGE` opcode was executed. This not only meant that computation could 18 | not be shifted to compile time for operands that were literal, but also meant 19 | that some features would simply not work. 20 | 21 | In this implementation, we will shift the range operator logic out of the Zend 22 | VM to enable for computation to be done at either compile time (for literal 23 | operands) or runtime (for dynamic operands). This will not only provide a small 24 | win for Opcache users, but will more importantly enable for constant expression 25 | features to be used with the range operator. 26 | 27 | For example: 28 | ```PHP 29 | // as constant definitions 30 | const AN_ARRAY = 1 |> 100; 31 | 32 | // as initial property definitions 33 | class A 34 | { 35 | private $a = 1 |> 2; 36 | } 37 | 38 | // as default values for optional parameters: 39 | function a($a = 1 |> 2) 40 | { 41 | // 42 | } 43 | ``` 44 | 45 | So without further ado, let's reimplement the range operator. 
46 | 47 | 48 | ## Updating the Lexer 49 | 50 | The lexer implementation remains exactly the same. The token is first 51 | registered in [Zend/zend_language_scanner.l](http://lxr.php.net/xref/PHP_7_0/Zend/zend_language_scanner.l) 52 | (line ~1200): 53 | ```C 54 | <ST_IN_SCRIPTING>"|>" { 55 | RETURN_TOKEN(T_RANGE); 56 | } 57 | ``` 58 | 59 | And then declared in [Zend/zend_language_parser.y](http://lxr.php.net/xref/PHP_7_0/Zend/zend_language_parser.y) 60 | (line ~220): 61 | ``` 62 | %token T_RANGE "|> (T_RANGE)" 63 | ``` 64 | 65 | The tokenizer extension must again be regenerated by going into the 66 | **ext/tokenizer** directory and executing the **tokenizer_data_gen.sh** file. 67 | 68 | 69 | ## Updating the Parser 70 | 71 | The parser implementation is partially the same as before. We again start by 72 | stating the operator's precedence and associativity by adding the `T_RANGE` 73 | token onto the end of the following line (line ~70): 74 | ``` 75 | %nonassoc T_IS_EQUAL T_IS_NOT_EQUAL T_IS_IDENTICAL T_IS_NOT_IDENTICAL T_SPACESHIP T_RANGE 76 | ``` 77 | 78 | We then update the `expr_without_variable` production rule again, though this 79 | time the semantic action (the code within the curly braces) will be slightly 80 | different. Update it with the following code (I placed it just below the 81 | `T_SPACESHIP` rule, line ~930): 82 | ``` 83 | | expr T_RANGE expr 84 | { $$ = zend_ast_create_binary_op(ZEND_RANGE, $1, $3); } 85 | ``` 86 | 87 | This time, we've used the `zend_ast_create_binary_op` function (instead of the 88 | `zend_ast_create` function), which creates a `ZEND_AST_BINARY_OP` node for us. 89 | `zend_ast_create_binary_op` takes an opcode name that will be used to 90 | distinguish binary operations from one another during the compilation stage. 
91 | 92 | Since we're reusing the `ZEND_AST_BINARY_OP` node type now, there is no need to 93 | define a new `ZEND_AST_RANGE` node type as done before in the 94 | [Zend/zend_ast.h](http://lxr.php.net/xref/PHP_7_0/Zend/zend_ast.h) file. 95 | 96 | 97 | ## Updating the Compilation Stage 98 | 99 | This time, there is no need to update the **Zend/zend_compile.c** file since 100 | [it already contains the necessary logic](http://lxr.php.net/xref/PHP_7_0/Zend/zend_compile.c#7137) 101 | to handle binary operations. Thus, we are simply reusing this logic by making 102 | our operator a `ZEND_AST_BINARY_OP` node. 103 | 104 | The following is a trimmed version of the `zend_compile_binary_op` function: 105 | ```C 106 | void zend_compile_binary_op(znode *result, zend_ast *ast) /* {{{ */ 107 | { 108 | zend_ast *left_ast = ast->child[0]; 109 | zend_ast *right_ast = ast->child[1]; 110 | uint32_t opcode = ast->attr; 111 | 112 | znode left_node, right_node; 113 | zend_compile_expr(&left_node, left_ast); 114 | zend_compile_expr(&right_node, right_ast); 115 | 116 | if (left_node.op_type == IS_CONST && right_node.op_type == IS_CONST) { 117 | if (zend_try_ct_eval_binary_op(&result->u.constant, opcode, 118 | &left_node.u.constant, &right_node.u.constant) 119 | ) { 120 | result->op_type = IS_CONST; 121 | zval_ptr_dtor(&left_node.u.constant); 122 | zval_ptr_dtor(&right_node.u.constant); 123 | return; 124 | } 125 | } 126 | 127 | do { 128 | // redacted code 129 | zend_emit_op_tmp(result, opcode, &left_node, &right_node); 130 | } while (0); 131 | } 132 | /* }}} */ 133 | ``` 134 | 135 | As we can see, it is pretty similar to the `zend_compile_range` function we 136 | created last time. The two important differences are in regards to how the 137 | opcode type is acquired and what happens when both operands are literals. 
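The compile-time versus runtime split described above can be illustrated with a toy sketch that is independent of the engine. Every name here (`opcode_t`, `operand_t`, `compile_binary_op`) is hypothetical and only mimics the shape of `zend_compile_binary_op`: the opcode is passed in rather than hardcoded, and the result is folded into a constant only when both operands are constants.

```c
#include <stdbool.h>

/* Hypothetical toy types; they are not the engine's znode or opcodes. */
typedef enum { OP_ADD, OP_MUL } opcode_t;

typedef struct {
    bool is_const; /* true when the value is known at compile time */
    long value;    /* only meaningful when is_const is true */
} operand_t;

/* Returns true when the operation was evaluated at compile time. When it
 * returns false, a real compiler would emit an opcode for the VM to execute
 * at runtime; this sketch only marks the result as non-constant. */
bool compile_binary_op(operand_t *result, opcode_t opcode,
                       const operand_t *left, const operand_t *right)
{
    if (left->is_const && right->is_const) {
        /* Both operands are literals, so fold them now. */
        result->is_const = true;
        result->value = (opcode == OP_ADD)
            ? left->value + right->value
            : left->value * right->value;
        return true;
    }

    /* At least one operand is dynamic, so defer the work to runtime. */
    result->is_const = false;
    result->value = 0;
    return false;
}
```

With two constant operands holding 2 and 3, `compile_binary_op` with `OP_MUL` stores a constant 6 in the result; if either operand has `is_const` set to false, nothing is folded and the result is marked as non-constant.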
138 | 139 | The opcode type is acquired from the AST node this time (as opposed to being 140 | hardcoded, as seen last time), since the `ZEND_AST_BINARY_OP` node stores this 141 | value (as seen from the new production rule's semantic action) to differentiate 142 | between binary operations. When both operands are literals, the 143 | `zend_try_ct_eval_binary_op` function will be invoked. This function looks as 144 | follows: 145 | ```C 146 | static inline zend_bool zend_try_ct_eval_binary_op(zval *result, uint32_t opcode, zval *op1, zval *op2) /* {{{ */ 147 | { 148 | binary_op_type fn = get_binary_op(opcode); 149 | 150 | /* don't evaluate division by zero at compile-time */ 151 | if ((opcode == ZEND_DIV || opcode == ZEND_MOD) && 152 | zval_get_long(op2) == 0) { 153 | return 0; 154 | } else if ((opcode == ZEND_SL || opcode == ZEND_SR) && 155 | zval_get_long(op2) < 0) { 156 | return 0; 157 | } 158 | 159 | fn(result, op1, op2); 160 | return 1; 161 | } 162 | /* }}} */ 163 | ``` 164 | 165 | The function obtains a callback from the `get_binary_op` function 166 | ([source](http://lxr.php.net/xref/PHP_7_0/Zend/zend_opcode.c#721)) 167 | in [Zend/zend_opcode.c](http://lxr.php.net/xref/PHP_7_0/Zend/zend_opcode.c) 168 | according to the opcode type. This means we will need to update this function 169 | next to cater for the `ZEND_RANGE` opcode. Add the following case statement to 170 | the `get_binary_op` function (line ~750): 171 | ```C 172 | case ZEND_RANGE: 173 | return (binary_op_type) range_function; 174 | ``` 175 | 176 | Now we must define the `range_function` function. 
This will be done in the 177 | [Zend/zend_operators.c](http://lxr.php.net/xref/PHP_7_0/Zend/zend_operators.c) 178 | file alongside all of the other operators: 179 | ```C 180 | ZEND_API int ZEND_FASTCALL range_function(zval *result, zval *op1, zval *op2) /* {{{ */ 181 | { 182 | zval tmp; 183 | 184 | ZVAL_DEREF(op1); 185 | ZVAL_DEREF(op2); 186 | 187 | if (Z_TYPE_P(op1) == IS_LONG && Z_TYPE_P(op2) == IS_LONG) { 188 | zend_long min = Z_LVAL_P(op1), max = Z_LVAL_P(op2); 189 | zend_ulong size, i; 190 | 191 | if (min > max) { 192 | zend_throw_error(NULL, "Min should be less than (or equal to) max"); 193 | return FAILURE; 194 | } 195 | 196 | // calculate size (one less than the total size for an inclusive range) 197 | size = max - min; 198 | 199 | // the size cannot be greater than or equal to HT_MAX_SIZE 200 | // HT_MAX_SIZE - 1 takes into account the inclusive range size 201 | if (size >= HT_MAX_SIZE - 1) { 202 | zend_throw_error(NULL, "Range size is too large"); 203 | return FAILURE; 204 | } 205 | 206 | // increment the size to take into account the inclusive range 207 | ++size; 208 | 209 | // set the zval type to be a long 210 | Z_TYPE_INFO(tmp) = IS_LONG; 211 | 212 | // initialise the array to a given size 213 | array_init_size(result, size); 214 | zend_hash_real_init(Z_ARRVAL_P(result), 1); 215 | ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) { 216 | for (i = 0; i < size; ++i) { 217 | Z_LVAL(tmp) = min + i; 218 | ZEND_HASH_FILL_ADD(&tmp); 219 | } 220 | } ZEND_HASH_FILL_END(); 221 | } else if ( // if both operands are either integers or doubles 222 | (Z_TYPE_P(op1) == IS_LONG || Z_TYPE_P(op1) == IS_DOUBLE) 223 | && (Z_TYPE_P(op2) == IS_LONG || Z_TYPE_P(op2) == IS_DOUBLE) 224 | ) { 225 | long double min, max, size, i; 226 | 227 | if (Z_TYPE_P(op1) == IS_LONG) { 228 | min = (long double) Z_LVAL_P(op1); 229 | max = (long double) Z_DVAL_P(op2); 230 | } else if (Z_TYPE_P(op2) == IS_LONG) { 231 | min = (long double) Z_DVAL_P(op1); 232 | max = (long double) Z_LVAL_P(op2); 233 | } 
else { 234 | min = (long double) Z_DVAL_P(op1); 235 | max = (long double) Z_DVAL_P(op2); 236 | } 237 | 238 | if (min > max) { 239 | zend_throw_error(NULL, "Min should be less than (or equal to) max"); 240 | return FAILURE; 241 | } 242 | 243 | size = max - min; 244 | 245 | if (size >= HT_MAX_SIZE - 1) { 246 | zend_throw_error(NULL, "Range size is too large"); 247 | return FAILURE; 248 | } 249 | 250 | // we cast the size to an integer to get rid of the decimal places, 251 | // since we only care about whole number sizes 252 | size = (int) size + 1; 253 | 254 | Z_TYPE_INFO(tmp) = IS_DOUBLE; 255 | 256 | array_init_size(result, size); 257 | zend_hash_real_init(Z_ARRVAL_P(result), 1); 258 | ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) { 259 | for (i = 0; i < size; ++i) { 260 | Z_DVAL(tmp) = min + i; 261 | ZEND_HASH_FILL_ADD(&tmp); 262 | } 263 | } ZEND_HASH_FILL_END(); 264 | } else { 265 | zend_throw_error(NULL, "Unsupported operand types - only ints and floats are supported"); 266 | return FAILURE; 267 | } 268 | 269 | return SUCCESS; 270 | } 271 | /* }}} */ 272 | ``` 273 | 274 | The function prototype contains two new macros: `ZEND_API` and `ZEND_FASTCALL`. 275 | `ZEND_API` is used to control the visibility of functions by making them 276 | available to extensions that are compiled as shared objects. `ZEND_FASTCALL` is 277 | used to ensure a more efficient calling convention is used, where the first two 278 | arguments will be passed using registers rather than the stack (more relevant 279 | to 32bit builds than 64bit builds on x86). 280 | 281 | The function body is very similar to what we had in the **Zend/zend_vm_def.h** 282 | file in the previous article. 
The VM-specific stuff is no longer present, 283 | including the `HANDLE_EXCEPTION` macro calls (which have been replaced with 284 | `return FAILURE;`), and the `ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION` macro calls 285 | have been removed entirely (this check and operation need to stay in the VM, 286 | and so the macro will be invoked from the VM code later). 287 | 288 | Another noteworthy difference is that we're applying `ZVAL_DEREF` to both 289 | operands to ensure that references are handled properly. This was 290 | something that was previously done inside of the VM using the pseudo-macro 291 | `GET_OPn_ZVAL_PTR_DEREF`, but has now been shifted into this function. This was 292 | done not because it is needed at compile time (since for compile time handling, 293 | both operands would have to be literals, and they cannot be referenced), but 294 | because it enables other places inside the codebase to safely invoke 295 | `range_function` without having to worry about reference handling. As such, 296 | reference handling is performed by most of the operator functions instead of 297 | in their VM opcode definition (except where performance matters). 298 | 299 | Lastly, we must add the `range_function` prototype to the 300 | [Zend/zend_operators.h](http://lxr.php.net/xref/PHP_7_0/Zend/zend_operators.h) 301 | file: 302 | ```C 303 | ZEND_API int ZEND_FASTCALL range_function(zval *result, zval *op1, zval *op2); 304 | ``` 305 | 306 | 307 | ## Updating the Zend VM 308 | 309 | Now we must once again update the Zend VM to handle the execution of the 310 | `ZEND_RANGE` opcode during runtime. 
Place the following code in 311 | **Zend/zend_vm_def.h** (at the bottom): 312 | ```C 313 | ZEND_VM_HANDLER(182, ZEND_RANGE, CONST|TMPVAR|CV, CONST|TMPVAR|CV) 314 | { 315 | USE_OPLINE 316 | zend_free_op free_op1, free_op2; 317 | zval *op1, *op2; 318 | 319 | SAVE_OPLINE(); 320 | op1 = GET_OP1_ZVAL_PTR(BP_VAR_R); 321 | op2 = GET_OP2_ZVAL_PTR(BP_VAR_R); 322 | range_function(EX_VAR(opline->result.var), op1, op2); 323 | FREE_OP1(); 324 | FREE_OP2(); 325 | ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION(); 326 | } 327 | ``` 328 | (Again, the opcode number must be one greater than the current highest opcode 329 | number, which can be seen at the bottom of the **Zend/zend_vm_opcodes.h** 330 | file.) 331 | 332 | The definition this time is far shorter since all of the work is handled in 333 | `range_function`. We simply invoke this function, passing in the result operand 334 | of the current opline to hold the computed value. The exception checks and 335 | skipping onto the next opcode that were removed from `range_function` are still 336 | handled in the VM by the call to `ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION` at the 337 | end. Also, as mentioned previously, we avoid handling references in the VM by 338 | using the `GET_OPn_ZVAL_PTR` pseudo-macros instead (rather than 339 | `GET_OPn_ZVAL_PTR_DEREF`). 340 | 341 | Now regenerate the VM by executing the **Zend/zend_vm_gen.php** file. 342 | 343 | Lastly, the pretty printer needs updating in the [Zend/zend_ast.c](http://lxr.php.net/xref/PHP_7_0/Zend/zend_ast.c) 344 | file once again. 
Update the precedence table comment by specifying the new 345 | operator to have a priority of 170 (line ~520): 346 | ``` 347 | * 170 non-associative == != === !== |> 348 | ``` 349 | 350 | Then, insert a `case` statement into the `zend_ast_export_ex` function to 351 | handle the `ZEND_RANGE` opcode in the `ZEND_AST_BINARY_OP` case statement (line 352 | ~1300): 353 | ```C 354 | case ZEND_RANGE: BINARY_OP(" |> ", 170, 171, 171); 355 | ``` 356 | 357 | 358 | ## Conclusion 359 | 360 | This article has shown an alternative way to implement the range operator, 361 | where the computation logic was shifted out of the VM. This had the advantage 362 | of being able to use the range operator in constant expression contexts. 363 | 364 | The third part to this article series will build upon this implementation by 365 | covering how we can overload this operator. This will enable for objects to be 366 | used as operands (such as those from the GMP library or those that implement an 367 | `__toString` method). It will also show how we can add *proper* support for 368 | strings (not like the support seen with PHP's current [range](http://php.net/range) 369 | function). But for now, I hope this has served as a nice demonstration of some 370 | of ZE's further aspects when implementing operators into PHP. 371 | -------------------------------------------------------------------------------- /articles/range-operator.md: -------------------------------------------------------------------------------- 1 | # Implementing a Range Operator into PHP 2 | 3 | ## Introduction 4 | 5 | This article will demonstrate how to implement a new operator in PHP. 
The 6 | following steps will be taken to do this: 7 | - **Updating the lexer**: This will make it aware of the new operator syntax 8 | so that it can be turned into a token 9 | - **Updating the parser**: This will say where it can be used, as well as what 10 | precedence and associativity it will have 11 | - **Updating the compilation stage**: This is where the abstract syntax tree 12 | (AST) is traversed and opcodes are emitted from it 13 | - **Updating the Zend VM**: This is used to handle the interpretation of the 14 | new opcode for the operator during script execution 15 | 16 | This article therefore seeks to provide a brief overview of a number of PHP's 17 | internal aspects. 18 | 19 | Also, a big thank you to [Nikita Popov](https://github.com/nikic/) for 20 | proofreading and helping to improve my article! 21 | 22 | ## The Range Operator 23 | 24 | The operator that will be added into PHP in this article will be called the 25 | range operator (`|>`). To keep things simple, the range operator will be 26 | defined with the following semantics: 27 | 1. The incrementation step will always be one 28 | 2. Both operands must either be integers or floats 29 | 3. If min = max, return a one element array consisting of min. 30 | 31 | (The above points will all be referenced in the final section, **Updating the 32 | Zend VM**, when we finally implement the above semantics.) 33 | 34 | If any of the above semantics are not satisfied, then an `Error` exception 35 | will be thrown. 
This will therefore occur if: 36 | - either operand is not an integer or float 37 | - min > max 38 | - the range (max - min) is too large 39 | 40 | Examples: 41 | ```PHP 42 | 1 |> 3; // [1, 2, 3] 43 | 2.5 |> 5; // [2.5, 3.5, 4.5] 44 | 45 | $a = $b = 1; 46 | $a |> $b; // [1] 47 | 48 | 2 |> 1; // Error exception 49 | 1 |> '1'; // Error exception 50 | new StdClass |> 1; // Error exception 51 | ``` 52 | 53 | ## Updating the Lexer 54 | 55 | Firstly, the new token must be registered in the lexer so that when the source 56 | code is tokenised, it will turn `|>` into the `T_RANGE` token. For this, the 57 | [Zend/zend_language_scanner.l](http://lxr.php.net/xref/PHP_7_0/Zend/zend_language_scanner.l) 58 | file will need to be updated by adding the following code to it (where all of 59 | the other tokens are defined, line ~1200): 60 | ```C 61 | <ST_IN_SCRIPTING>"|>" { 62 | RETURN_TOKEN(T_RANGE); 63 | } 64 | ``` 65 | 66 | The `ST_IN_SCRIPTING` prefix names the state the lexer must be in for the rule to apply. This means 67 | it will only match the `|>` character sequence when it is in normal scripting 68 | mode. The code between the curly braces is C code that will be executed when 69 | `|>` is found in the source code. In this example, it simply returns a 70 | `T_RANGE` token. 71 | 72 | Digression: 73 | > Since we're modifying the lexer, we will need Re2c to regenerate it. This 74 | dependency is not needed for normal builds of PHP. 75 | 76 | Next, the `T_RANGE` identifier must be declared in the 77 | [Zend/zend_language_parser.y](http://lxr.php.net/xref/PHP_7_0/Zend/zend_language_parser.y) 78 | file. To do this, we must add the following line to where the other token 79 | identifiers are declared (at the end will do, line ~220): 80 | ``` 81 | %token T_RANGE "|> (T_RANGE)" 82 | ``` 83 | 84 | PHP now recognises the new operator: 85 | ```PHP 86 | 1 |> 2; // Parse error: syntax error, unexpected '|>' (T_RANGE) in...
87 | ``` 88 | 89 | But since its usage hasn't been defined yet, using it will lead to a parse 90 | error. This will be fixed in the next section. 91 | 92 | First though, we must regenerate the **ext/tokenizer/tokenizer_data.c** file in 93 | the tokenizer extension to cater for the newly added token. (The tokenizer 94 | extension simply provides an interface for PHP's lexer to userland through the 95 | `token_get_all` and `token_name` functions.) At the moment, it is blissfully 96 | ignorant of our new `T_RANGE` token: 97 | ```PHP 98 | echo token_name(token_get_all('<?php 1|>2;')[2][0]); // UNKNOWN 99 | ``` 100 | 101 | We regenerate the **ext/tokenizer/tokenizer_data.c** file by going into the 102 | **ext/tokenizer** directory and executing the **tokenizer_data_gen.sh** file. 103 | Then go back into the root php-src directory and build PHP again. Now the 104 | tokenizer extension works again: 105 | ```PHP 106 | echo token_name(token_get_all('<?php 1|>2;')[2][0]); // T_RANGE 107 | ``` 108 | 109 | 110 | ## Updating the Parser 111 | 112 | The parser needs to be updated now so that it can validate where the new 113 | `T_RANGE` token is used in PHP scripts. It's also responsible for stating the 114 | precedence and associativity of the new operator and generating the abstract 115 | syntax tree (AST) node for the new construct. This will all be done in the 116 | [Zend/zend_language_parser.y](http://lxr.php.net/xref/PHP_7_0/Zend/zend_language_parser.y) 117 | grammar file, which contains the token definitions and production rules that 118 | Bison will use to generate the parser. 119 | 120 | Digression: 121 | > **Precedence** determines the rules of grouping expressions. For example, in 122 | the expression `3 + 4 * 2`, `*` has a higher precedence than `+`, and so it 123 | will be grouped as `3 + (4 * 2)`. 124 | > 125 | > **Associativity** is how the operator will behave when chained.
It determines 126 | whether the operator can be chained, and if so, then what direction it will be 127 | grouped from in a particular expression. For example, the ternary operator has 128 | (rather strangely) left-associativity, and so it will be grouped and executed 129 | from left to right. Therefore, the following expression: 130 | ```PHP 131 | 1 ? 0 : 1 ? 0 : 1; // 1 132 | ``` 133 | Will be executed as follows: 134 | ```PHP 135 | (1 ? 0 : 1) ? 0 : 1; // 1 136 | ``` 137 | This can, of course, be changed (read: rectified) to be right-associative with 138 | proper grouping: 139 | ```PHP 140 | $a = 1 ? 0 : (1 ? 0 : 1); // 0 141 | ``` 142 | Some operators, however, are non-associative and therefore cannot be chained at 143 | all. For example, the less than (`<`) operator is like this, and so the 144 | following is invalid: 145 | ```PHP 146 | 1 < $a < 2; 147 | ``` 148 | 149 | Since the range operator will evaluate to an array, having it as an input 150 | operand to itself would be rather useless (i.e. `1 |> 3 |> 5` would 151 | be nonsensical). So let's make the operator non-associative, and whilst we're 152 | at it, let's set it to have the same precedence as the combined comparison 153 | operator (`T_SPACESHIP`). This is done by adding the `T_RANGE` token onto the 154 | end of the following line (line ~70): 155 | ``` 156 | %nonassoc T_IS_EQUAL T_IS_NOT_EQUAL T_IS_IDENTICAL T_IS_NOT_IDENTICAL T_SPACESHIP T_RANGE 157 | ``` 158 | 159 | Next, we must update the `expr_without_variable` production rule to cater for 160 | our new operator. This will be done by adding the following code into the rule 161 | (I placed it just below the `T_SPACESHIP` rule, line ~930): 162 | ``` 163 | | expr T_RANGE expr 164 | { $$ = zend_ast_create(ZEND_AST_RANGE, $1, $3); } 165 | ``` 166 | 167 | The pipe character (|) is used to denote an *or*, meaning that any one of 168 | those rules can match in that particular production rule.
The code within 169 | the curly braces is to be executed when that match occurs. The `$$` denotes 170 | the result node that stores the value of the expression. The `zend_ast_create` 171 | function is used to create our AST node for our operator. This AST node is 172 | created with the name `ZEND_AST_RANGE`, and has two values: `$1` references the 173 | left operand (**expr** T_RANGE expr) and `$3` references the right operand 174 | (expr T_RANGE **expr**). 175 | 176 | Next, we will need to define the `ZEND_AST_RANGE` constant for the AST. To do 177 | this, the [Zend/zend_ast.h](http://lxr.php.net/xref/PHP_7_0/Zend/zend_ast.h) 178 | file will need to be updated by simply adding the `ZEND_AST_RANGE` constant 179 | under the list of two children nodes (I added it under `ZEND_AST_COALESCE`): 180 | ``` 181 | ZEND_AST_RANGE, 182 | ``` 183 | 184 | Now executing our range operator will just cause the interpreter to hang: 185 | ```PHP 186 | 1 |> 2; 187 | ``` 188 | 189 | It's time to update the compilation stage. 190 | 191 | 192 | ## Updating the Compilation Stage 193 | 194 | We now need to update the compilation stage. The parser outputs an AST that 195 | is then recursively traversed, where functions are triggered to execute as each 196 | node in the AST is visited. These triggered functions emit opcodes that the 197 | Zend VM will then execute later during the interpretation phase. 
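
The recursive traversal described above can be sketched outside of php-src. This is only a minimal illustration, not how `zend_compile.c` is written: every name in it (`node`, `op`, `compile_node`, `NODE_RANGE`, `OP_PUSH`, `OP_RANGE`) is hypothetical, and the stack-style `OP_PUSH` instruction is a simplification, since the Zend VM passes operands through op nodes instead.

```C
/* Hypothetical node kinds and opcodes for this sketch only;
   none of these are php-src definitions. */
enum node_kind { NODE_NUM, NODE_RANGE };
enum opcode { OP_PUSH, OP_RANGE };

struct node {
    enum node_kind kind;
    long value;              /* used by NODE_NUM */
    struct node *child[2];   /* used by NODE_RANGE */
};

struct op {
    enum opcode code;
    long operand;            /* used by OP_PUSH */
};

/* Recursively visit the tree and append one instruction per node.
   Returns the new instruction count. The caller must make sure
   that `out` has room for one instruction per node in the tree. */
static int compile_node(const struct node *n, struct op *out, int count)
{
    switch (n->kind) {
        case NODE_NUM:
            out[count].code = OP_PUSH;
            out[count].operand = n->value;
            return count + 1;
        case NODE_RANGE:
            /* compile both operands first, then emit the operator */
            count = compile_node(n->child[0], out, count);
            count = compile_node(n->child[1], out, count);
            out[count].code = OP_RANGE;
            out[count].operand = 0;
            return count + 1;
    }
    return count;
}
```

Compiling a tree for `1 |> 3` with this sketch emits the two operand instructions followed by the operator instruction, which is the same ordering that compiling the children before emitting the parent produces in general.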
198 | 199 | This compilation happens in [Zend/zend_compile.c](http://lxr.php.net/xref/PHP_7_0/Zend/zend_compile.c), 200 | so let's start by adding our new AST node name (`ZEND_AST_RANGE`) into the 201 | large switch statement in the `zend_compile_expr` function (I've added it just 202 | below `ZEND_AST_COALESCE`, line ~7200): 203 | ```C 204 | case ZEND_AST_RANGE: 205 | zend_compile_range(result, ast); 206 | return; 207 | ``` 208 | 209 | Now we must define the `zend_compile_range` function somewhere in that same 210 | file: 211 | ```C 212 | void zend_compile_range(znode *result, zend_ast *ast) /* {{{ */ 213 | { 214 | zend_ast *left_ast = ast->child[0]; 215 | zend_ast *right_ast = ast->child[1]; 216 | znode left_node, right_node; 217 | 218 | zend_compile_expr(&left_node, left_ast); 219 | zend_compile_expr(&right_node, right_ast); 220 | 221 | zend_emit_op_tmp(result, ZEND_RANGE, &left_node, &right_node); 222 | } 223 | /* }}} */ 224 | ``` 225 | 226 | We start by firstly dereferencing the left and right operands of the 227 | `ZEND_AST_RANGE` node into the pointer variables `left_ast` and `right_ast`. We 228 | then define two `znode` variables that will hold the result of compiling down 229 | the AST nodes for both of the operands (this is the recursive part of 230 | traversing the AST and compiling its nodes into opcodes). 231 | 232 | Next, we emit the `ZEND_RANGE` opcode with its two operands using the 233 | `zend_emit_op_tmp` function. 234 | 235 | Now would probably be a good time to quickly discuss opcodes and their types 236 | to better explain the usage of the `zend_emit_op_tmp` function. 237 | 238 | Opcodes are instructions that are executed by the Zend VM. They each have: 239 | - a name (a constant that maps to some integer) 240 | - an op1 node (optional) 241 | - an op2 node (optional) 242 | - a result node (optional). This is usually used to store a temporary value of 243 | the opcode operation 244 | - an extended value (optional). 
This is an integer value that is used to 245 | differentiate between behaviours for overloaded opcodes 246 | 247 | Digression: 248 | > The opcodes for a PHP script can be seen using either: 249 | > - PHPDBG: `sapi/phpdbg/phpdbg -np* program.php` 250 | > - Opcache 251 | > - [Vulcan Logic Disassembler (VLD) extension] 252 | (https://pecl.php.net/package/vld): `sapi/cli/php -dvld.active=1 program.php` 253 | > - Or, if the script is short and trivial, then [3v4l](https://3v4l.org) can 254 | be used 255 | 256 | Opcode nodes (`znode_op` structures) can be of a number of different types: 257 | - `IS_CV` - for **C**ompiled **V**ariables. These are simple variables (like 258 | `$a`) that are cached in a simple array to bypass hash table lookups. They 259 | were introduced in PHP 5.1 under the Compiled Variables optimisation. They're 260 | denoted by `!n` in VLD (where **n** is an integer) 261 | - `IS_VAR` - for all other variable-like expressions that aren't simple (like 262 | `$a->b`). They can hold an `IS_REFERENCE` zval, and are denoted by `$n` in VLD 263 | (where **n** is an integer) 264 | - `IS_CONST` - for literal values (e.g. hard-coded strings) 265 | - `IS_TMP_VAR` - temporary variables are used to hold the intermediate result 266 | of an expression (making them typically short-lived). They too can be 267 | refcounted (as of PHP 7), but cannot hold an `IS_REFERENCE` zval (because 268 | temporary values cannot be used as references). They are denoted by `~n` in 269 | VLD (where **n** is an integer) 270 | - `IS_UNUSED` - generally used to mark an op node as not used. Sometimes, 271 | however, data will be stored in the `znode_op.num` member to be used in the 272 | VM. 273 | 274 | This brings us back to the above `zend_emit_op_tmp` function, which will emit a 275 | `zend_op` with a type of `IS_TMP_VAR`. 
We want to do this because our operator 276 | will be an expression, and so the value it produces (an array) will be a 277 | temporary variable that may be used as an operand to another opcode (such as 278 | `ASSIGN` from code like `$var = 1 |> 3;`). 279 | 280 | 281 | ## Updating the Zend VM 282 | 283 | Now we will need to update the Zend VM to handle the execution of our new 284 | opcode. This will involve updating the **Zend/zend_vm_def.h** file by adding 285 | the following code (at the bottom will do): 286 | ```C 287 | ZEND_VM_HANDLER(182, ZEND_RANGE, CONST|TMP|VAR|CV, CONST|TMP|VAR|CV) 288 | { 289 | USE_OPLINE 290 | zend_free_op free_op1, free_op2; 291 | zval *op1, *op2, *result, tmp; 292 | 293 | SAVE_OPLINE(); 294 | op1 = GET_OP1_ZVAL_PTR_DEREF(BP_VAR_R); 295 | op2 = GET_OP2_ZVAL_PTR_DEREF(BP_VAR_R); 296 | result = EX_VAR(opline->result.var); 297 | 298 | // if both operands are integers 299 | if (Z_TYPE_P(op1) == IS_LONG && Z_TYPE_P(op2) == IS_LONG) { 300 | // for when min and max are integers 301 | } else if ( // if both operands are either integers or doubles 302 | (Z_TYPE_P(op1) == IS_LONG || Z_TYPE_P(op1) == IS_DOUBLE) 303 | && (Z_TYPE_P(op2) == IS_LONG || Z_TYPE_P(op2) == IS_DOUBLE) 304 | ) { 305 | // for when min and max are either integers or floats 306 | } else { 307 | // for when min and max are neither integers nor floats 308 | } 309 | 310 | FREE_OP1(); 311 | FREE_OP2(); 312 | ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION(); 313 | } 314 | ``` 315 | (The opcode number should be one more than the previous highest, so 182 may 316 | already be taken for you. To quickly see what the highest current opcode number 317 | is, look at the bottom of the **Zend/zend_vm_opcodes.h** file - the 318 | `ZEND_VM_LAST_OPCODE` constant should hold this value.) 319 | 320 | Digression: 321 | > The above code contains a number of pseudo-macros (like `USE_OPLINE` and 322 | `GET_OP1_ZVAL_PTR_DEREF`). 
These aren't actual C macros - instead, they're 323 | replaced by the **Zend/zend_vm_gen.php** script during VM generation, as 324 | opposed to by the preprocessor during source code compilation. Therefore, if 325 | you'd like to look up their definitions, you'll need to dig through the 326 | **Zend/zend_vm_gen.php** file. 327 | 328 | The `ZEND_VM_HANDLER` pseudo-macro contains each opcode's definition. It can 329 | have 5 parameters: 330 | 1. The opcode number (182) 331 | 2. The opcode name (ZEND_RANGE) 332 | 3. The valid left operand types (CONST|TMP|VAR|CV) (see `$vm_op_decode` in 333 | **Zend/zend_vm_gen.php** for all types) 334 | 4. The valid right operand types (CONST|TMP|VAR|CV) (ditto) 335 | 5. An optional flag holding the extended value for overloaded opcodes (see 336 | `$vm_ext_decode` in **Zend/zend_vm_gen.php** for all types) 337 | 338 | From our above definitions of the types, we can see that: 339 | ```PHP 340 | // CONST enables for 341 | 1 |> 5.0; 342 | 343 | // TMP enables for 344 | (2**2) |> (1 + 3); 345 | 346 | // VAR enables for 347 | $cmplx->var |> $var[1]; 348 | 349 | // CV enables for 350 | $a |> $b; 351 | ``` 352 | 353 | Digression: 354 | > If one or both operands are not used, then they are marked with `ANY`. 355 | 356 | Digression: 357 | > `TMPVAR` was introduced into ZE3 and is similar to `TMP|VAR` in that it 358 | handles the same opcode node types, but differs in what code is generated. 359 | `TMPVAR` generates a single method to handle both `TMP` and `VAR`, which 360 | decreases the VM size but requires more conditional logic. Conversely, 361 | `TMP|VAR` generates methods for both of `TMP` and `VAR`, increasing the VM size 362 | but with fewer conditionals. 363 | 364 | Moving onto the body of our opcode definition, we begin by invoking the 365 | `USE_OPLINE` pseudo-macro to declare the `opline` variable (a `zend_op` 366 | struct).
This will be used to fetch the operands (with pseudo-macros like 367 | `GET_OP1_ZVAL_PTR_DEREF`) and to set the return value of the opcode. 368 | 369 | Next, we declare two `zend_free_op` variables. These are simply [pointers to 370 | zvals](http://lxr.php.net/xref/PHP_7_0/Zend/zend_execute.h#314) that are 371 | declared for each operand we use. They are used when checking if that 372 | particular operand needs to be freed. Four `zval` variables are then declared. 373 | `op1` and `op2` are pointers to `zval`s that hold the operand values. `result` 374 | is declared to store the result of the opcode operation. Lastly, `tmp` is 375 | used as an intermediary value of the range looping operation that will be 376 | copied upon each iteration into a hash table. 377 | 378 | The `op1` and `op2` variables are then initialised by the 379 | `GET_OP1_ZVAL_PTR_DEREF` and `GET_OP2_ZVAL_PTR_DEREF` pseudo-macros, 380 | respectively. These pseudo-macros are also responsible for [initialising the 381 | `free_op1` and `free_op2` variables] 382 | (http://lxr.php.net/xref/PHP_7_0/Zend/zend_vm_gen.php#142). The `BP_VAR_R` 383 | constant that is passed into the aforementioned macros is a type flag. It 384 | stands for *BackPatching Variable Read* and is used when [fetching compiled 385 | variables](http://lxr.php.net/xref/PHP_7_0/Zend/zend_execute.c#_get_zval_cv_lookup). 386 | Lastly, the opline's result is dereferenced and assigned to `result`, to be 387 | used later on.
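
Before filling in the branches, the three-way type check from the handler skeleton can be tried in isolation. The following is a rough sketch that assumes a heavily simplified stand-in for a zval: `sketch_val`, `sketch_type`, `is_number` and `pick_branch` are all hypothetical names, and none of the real `Z_TYPE_P` machinery is used.

```C
/* Hypothetical, simplified stand-ins for a zval and its type tags. */
enum sketch_type { SK_LONG, SK_DOUBLE, SK_OTHER };

struct sketch_val {
    enum sketch_type type;
    union {
        long lval;
        double dval;
    } value;
};

enum sketch_branch { BRANCH_INT, BRANCH_FLOAT, BRANCH_ERROR };

/* True when the value is one of the two supported operand types. */
static int is_number(const struct sketch_val *v)
{
    return v->type == SK_LONG || v->type == SK_DOUBLE;
}

/* Mirrors the if / else if / else structure of the handler skeleton:
   two integers, then any other integer/float mix, then everything else. */
static enum sketch_branch pick_branch(const struct sketch_val *op1,
                                      const struct sketch_val *op2)
{
    if (op1->type == SK_LONG && op2->type == SK_LONG) {
        return BRANCH_INT;
    } else if (is_number(op1) && is_number(op2)) {
        return BRANCH_FLOAT;
    }
    return BRANCH_ERROR;
}
```

Checking the integer-only case first means the second test only ever sees combinations where at least one operand is a float, which is why the float branch can safely treat both values as floating point.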
388 | 389 | Let's now fill in the first `if` body when both `min` and `max` are integers: 390 | ```C 391 | zend_long min = Z_LVAL_P(op1), max = Z_LVAL_P(op2); 392 | zend_ulong size, i; 393 | 394 | if (min > max) { 395 | zend_throw_error(NULL, "Min should be less than (or equal to) max"); 396 | HANDLE_EXCEPTION(); 397 | } 398 | 399 | // calculate size (one less than the total size for an inclusive range) 400 | size = max - min; 401 | 402 | // the size cannot be greater than or equal to HT_MAX_SIZE 403 | // HT_MAX_SIZE - 1 takes into account the inclusive range size 404 | if (size >= HT_MAX_SIZE - 1) { 405 | zend_throw_error(NULL, "Range size is too large"); 406 | HANDLE_EXCEPTION(); 407 | } 408 | 409 | // increment the size to take into account the inclusive range 410 | ++size; 411 | 412 | // set the zval type to be a long 413 | Z_TYPE_INFO(tmp) = IS_LONG; 414 | 415 | // initialise the array to a given size 416 | array_init_size(result, size); 417 | zend_hash_real_init(Z_ARRVAL_P(result), 1); 418 | ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) { 419 | for (i = 0; i < size; ++i) { 420 | Z_LVAL(tmp) = min + i; 421 | ZEND_HASH_FILL_ADD(&tmp); 422 | } 423 | } ZEND_HASH_FILL_END(); 424 | ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION(); 425 | ``` 426 | 427 | We start by defining the `min` and `max` variables. These are declared as 428 | `zend_long`, which must be used when declaring long integers (likewise with 429 | `zend_ulong` for defining unsigned long integers). This size is then declared 430 | as `zend_ulong`, which will hold the size of the array to be generated. 431 | 432 | A check is then performed to see if `min` is greater than `max` - if it is, an 433 | `Error` exception is thrown. By passing in `NULL` as the first argument to 434 | `zend_throw_error`, the exception class defaults to `Error`. 
We could 435 | specialise this exception by sub-classing Error and make a new class entry in 436 | [Zend/zend_exceptions.c](http://lxr.php.net/xref/PHP_7_0/Zend/zend_exceptions.c), 437 | but that's probably best covered in a later article. If an exception is thrown, 438 | then we invoke the `HANDLE_EXCEPTION` pseudo-macro that skips onto the next 439 | opcode to be executed. 440 | 441 | Next, we calculate the size of the array to be generated. This size is one less 442 | than the actual size because it does not take into account the inclusive range. 443 | The reason why we don't simply plus one onto this size is because of the 444 | potential for overflow to occur if `min` is equal to `ZEND_LONG_MIN` 445 | (`PHP_INT_MIN`) and `max` is equal to `ZEND_LONG_MAX` (`PHP_INT_MAX`). 446 | 447 | The size is then checked against the `HT_MAX_SIZE` constant to ensure that the 448 | array will fit inside of the hash table. The total array size must not be 449 | greater than or equal to `HT_MAX_SIZE` - if it is, then we once again throw an 450 | Error exception and exit the VM. 451 | 452 | Because `HT_MAX_SIZE` is equal to `INT_MAX + 1`, we know that if `size` is less 453 | than this, we can safely increment size without fear of overflow. So this is 454 | what we do next so that our `size` now accommodates for an inclusive range. 455 | 456 | We then change the type of the `tmp` zval to `IS_LONG`, and then initialise 457 | `result` using the `array_init_size` macro. This macro basically sets the type 458 | of `result` to `IS_ARRAY_EX`, allocates memory for the `zend_array` structure 459 | (a hashtable), and sets up its corresponding hashtable. The 460 | `zend_hash_real_init` function then allocates memory for the `Bucket` 461 | structures that hold each of the elements of the array. The second argument, 462 | `1`, specifies that we would like it to be a packed hashtable. 463 | 464 | Digression: 465 | > A packed hashtable is effectively an *actual* array, i.e. 
one that is 466 | numerically accessed via integer keys (unlike typical associative arrays in 467 | PHP). This optimisation was introduced into PHP 7 because it was recognised 468 | that many arrays in PHP were integer indexed (keys in increasing order). Packed 469 | hashtables allow the hashtable buckets to be directly accessed (like a normal 470 | array). See Nikita's [PHP's new hashtable implementation] 471 | (http://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html) 472 | article for more information. 473 | 474 | Digression: 475 | > The `_zend_array` structure has two aliases: `zend_array` and `HashTable`. 476 | 477 | Next, we populate the array. This is done with the `ZEND_HASH_FILL_PACKED` 478 | macro ([definition](http://lxr.php.net/xref/PHP_7_0/Zend/zend_hash.h#873)), 479 | which basically keeps track of the current bucket to insert into. The `tmp` 480 | zval stores the intermediary result (the array element) when generating the 481 | array. The `ZEND_HASH_FILL_ADD` macro makes a copy of `tmp`, inserts this copy 482 | into the current hashtable bucket, and increments to the next bucket for the 483 | next iteration. 484 | 485 | Finally, the `ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION` macro (introduced in ZE3 486 | to replace the separate `CHECK_EXCEPTION()` and `ZEND_VM_NEXT_OPCODE()` 487 | calls made in ZE2) checks whether an exception has occurred. Provided an 488 | exception hasn't occurred, then the VM skips onto the next opcode. 
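
The size calculation in the integer branch can be checked in isolation with a standalone sketch. It assumes 64-bit integers, and `range_count` and `SKETCH_MAX_SIZE` are hypothetical names; the limit is only modelled on the statement above that `HT_MAX_SIZE` is equal to `INT_MAX + 1`. Unlike the handler code, the subtraction here is done on unsigned values so that the widest possible range wraps in a well-defined way in plain C.

```C
#include <stdint.h>

/* Hypothetical stand-in for HT_MAX_SIZE (assumed to be INT_MAX + 1). */
#define SKETCH_MAX_SIZE ((uint64_t) INT32_MAX + 1)

/* Stores the inclusive element count for min..max in *count and
   returns 1, or returns 0 if min > max or the range is too large. */
static int range_count(int64_t min, int64_t max, uint64_t *count)
{
    uint64_t size;

    if (min > max) {
        return 0;
    }

    /* One less than the inclusive count. Unsigned subtraction wraps
       modulo 2^64, so INT64_MAX - INT64_MIN becomes UINT64_MAX here. */
    size = (uint64_t) max - (uint64_t) min;

    /* Reject before incrementing, so that the + 1 below cannot wrap. */
    if (size >= SKETCH_MAX_SIZE - 1) {
        return 0;
    }

    *count = size + 1;
    return 1;
}
```

With this ordering, `range_count(INT64_MIN, INT64_MAX, &n)` is rejected by the limit check, whereas adding one before the check would have wrapped the size around to zero.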
489 | 490 | Let's now take a look at the `else if` block: 491 | ```C 492 | long double min, max, size, i; 493 | 494 | if (Z_TYPE_P(op1) == IS_LONG) { 495 | min = (long double) Z_LVAL_P(op1); 496 | max = (long double) Z_DVAL_P(op2); 497 | } else if (Z_TYPE_P(op2) == IS_LONG) { 498 | min = (long double) Z_DVAL_P(op1); 499 | max = (long double) Z_LVAL_P(op2); 500 | } else { 501 | min = (long double) Z_DVAL_P(op1); 502 | max = (long double) Z_DVAL_P(op2); 503 | } 504 | 505 | if (min > max) { 506 | zend_throw_error(NULL, "Min should be less than (or equal to) max"); 507 | HANDLE_EXCEPTION(); 508 | } 509 | 510 | size = max - min; 511 | 512 | if (size >= HT_MAX_SIZE - 1) { 513 | zend_throw_error(NULL, "Range size is too large"); 514 | HANDLE_EXCEPTION(); 515 | } 516 | 517 | // we cast the size to an integer to get rid of the decimal places, 518 | // since we only care about whole number sizes 519 | size = (int) size + 1; 520 | 521 | Z_TYPE_INFO(tmp) = IS_DOUBLE; 522 | 523 | array_init_size(result, size); 524 | zend_hash_real_init(Z_ARRVAL_P(result), 1); 525 | ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) { 526 | for (i = 0; i < size; ++i) { 527 | Z_DVAL(tmp) = min + i; 528 | ZEND_HASH_FILL_ADD(&tmp); 529 | } 530 | } ZEND_HASH_FILL_END(); 531 | ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION(); 532 | ``` 533 | 534 | Digression: 535 | > We use `long double` to handle the cases where there is potentially a mix of 536 | floats and integers as operands. This is because `double` only has 53 bits of 537 | precision, and so any integer greater than 2^53 may not be accurately 538 | represented as a `double`. `long double`, on the other hand, has at least 64 539 | bits of precision on platforms that use the 80-bit x87 extended format, and so it can accurately represent 64 bit 540 | integers there (the C standard itself only guarantees that `long double` is no narrower than `double`). 541 | 542 | This code is very similar to the previous logic. The main difference now is 543 | that we handle the data as floating point numbers.
This includes fetching them 544 | with the `Z_DVAL_P` macro, setting the type info for `tmp` to `IS_DOUBLE`, 545 | and inserting the zval (of type double) with the `Z_DVAL` macro. 546 | 547 | Lastly, we must handle the case where either (or both) `min` and `max` are not 548 | integers or floats. As defined in point #2 of our range operator semantics, 549 | only integers and floats are supported as operands - if anything else is 550 | provided, an Error exception will be thrown. Paste the following code in the 551 | `else` block: 552 | ```C 553 | zend_throw_error(NULL, "Unsupported operand types - only ints and floats are supported"); 554 | HANDLE_EXCEPTION(); 555 | ``` 556 | 557 | With our opcode definition done, we must now regenerate the VM. This is done by 558 | running the **Zend/zend_vm_gen.php** file, which will use the 559 | **Zend/zend_vm_def.h** file to regenerate the **Zend/zend_vm_opcodes.h**, 560 | **Zend/zend_vm_opcodes.c**, and **Zend/zend_vm_execute.h** files. 561 | 562 | Now build PHP again so that we can see the range operator in action: 563 | ```PHP 564 | var_dump(1 |> 1.5); 565 | 566 | var_dump(PHP_INT_MIN |> PHP_INT_MIN + 1); 567 | ``` 568 | 569 | Outputs: 570 | ``` 571 | array(1) { 572 | [0]=> 573 | float(1) 574 | } 575 | 576 | array(2) { 577 | [0]=> 578 | int(-9223372036854775808) 579 | [1]=> 580 | int(-9223372036854775807) 581 | } 582 | ``` 583 | 584 | Now our operator finally works! We're not quite done yet, though. We still need 585 | to update the AST pretty printer (that turns the AST back to code). It 586 | currently does not support our range operator - this can be seen by using it 587 | within `assert()`: 588 | ```PHP 589 | assert(1 |> 2); // segfaults 590 | ``` 591 | 592 | Digression: 593 | > `assert()` uses the pretty printer to output the expression being asserted as 594 | part of its error message upon failure. 
This is only done when the asserted 595 | expression is not in string form (since the pretty printer would not be needed 596 | otherwise), and is something that is new to PHP 7. 597 | 598 | To rectify this, we simply need to update the [Zend/zend_ast.c] 599 | (http://lxr.php.net/xref/PHP_7_0/Zend/zend_ast.c) file to turn our 600 | `ZEND_AST_RANGE` node into a string. We will firstly update the precedence 601 | table comment (line ~520) by specifying our new operator to have a priority of 602 | 170 (this should match the zend_language_parser.y file): 603 | ``` 604 | * 170 non-associative == != === !== |> 605 | ``` 606 | 607 | Next, we need to insert a `case` statement in the `zend_ast_export_ex` function 608 | to handle `ZEND_AST_RANGE` (just above `case ZEND_AST_GREATER`): 609 | ```C 610 | case ZEND_AST_RANGE: BINARY_OP(" |> ", 170, 171, 171); 611 | case ZEND_AST_GREATER: BINARY_OP(" > ", 180, 181, 181); 612 | case ZEND_AST_GREATER_EQUAL: BINARY_OP(" >= ", 180, 181, 181); 613 | ``` 614 | 615 | The pretty printer has been updated now and `assert()` works fine once again: 616 | ```PHP 617 | assert(false && 1 |> 2); // Warning: assert(): assert(false && 1 |> 2) failed... 618 | ``` 619 | 620 | 621 | ## Conclusion 622 | 623 | This article has covered a lot of ground, albeit thinly. It has shown the 624 | stages ZE goes through when running PHP scripts, how these stages 625 | interoperate with one another, and how we can modify each of these stages to 626 | include a new operator into PHP. This article demonstrated just one possible 627 | implementation of the range operator in PHP - I hope to cover an alternative 628 | (and better) implementation, as well as other related topics, in future 629 | articles (like creating new internal classes). 630 | --------------------------------------------------------------------------------