├── spec
    ├── introduction.html
    ├── index.html
    ├── syntax.html
    └── semantics.html
└── README.md


/spec/introduction.html:
--------------------------------------------------------------------------------

Lookarounds are zero-width assertions that match a string without consuming anything. ECMAScript has lookahead assertions that do this in the forward direction, but the language is missing a way to do this backward, which lookbehind assertions provide. With lookbehind assertions, one can make sure that a pattern is or isn't preceded by another, e.g. matching a dollar amount without capturing the dollar sign.
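As a small illustration of the dollar-amount case, the sketch below uses the `(?<=...)` and `(?<!...)` forms defined by this proposal. The sample strings and variable names are made up for the example.

```javascript
// Positive lookbehind: the digits must be preceded by "$", which is not part of the match.
const dollarAmount = /(?<=\$)\d+(\.\d*)?/;
const dollars = "$10.53".match(dollarAmount); // dollars[0] is "10.53"
const euros = "€10.53".match(dollarAmount);   // null, there is no "$" before the digits

// Negative lookbehind: a digit run at a word boundary that is NOT preceded by "$".
const nonDollarAmount = /(?<!\$)\b\d+(\.\d*)?/;
const other = "€10.53".match(nonDollarAmount); // other[0] is "10.53"

console.log(dollars[0], euros, other[0]);
```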


See the proposal repository for background material and discussion.

--------------------------------------------------------------------------------
/spec/index.html:
--------------------------------------------------------------------------------
title: RegExp Lookbehind Assertions
status: proposal
stage: 4
location: https://github.com/tc39/proposal-regexp-lookbehind
copyright: false
contributors: Gorkem Yakin, Nozomu Katō, Daniel Ehrenberg, Thomas Wood

Introduction

Pattern Syntax

Pattern Semantics

--------------------------------------------------------------------------------
/spec/syntax.html:
--------------------------------------------------------------------------------
Pattern[U] ::
  Disjunction[?U]

Disjunction[U] ::
  Alternative[?U]
  Alternative[?U] `|` Disjunction[?U]

Alternative[U] ::
  [empty]
  Alternative[?U] Term[?U]

Term[U] ::
  Assertion[?U]
  Atom[?U]
  Atom[?U] Quantifier

Assertion[U] ::
  `^`
  `$`
  `\` `b`
  `\` `B`
  `(` `?` `=` Disjunction[?U] `)`
  `(` `?` `!` Disjunction[?U] `)`
  `(` `?` `<=` Disjunction[?U] `)`
  `(` `?` `<!` Disjunction[?U] `)`

Quantifier ::
  QuantifierPrefix
  QuantifierPrefix `?`

QuantifierPrefix ::
  `*`
  `+`
  `?`
  `{` DecimalDigits `}`
  `{` DecimalDigits `,` `}`
  `{` DecimalDigits `,` DecimalDigits `}`

Atom[U] ::
  PatternCharacter
  `.`
  `\` AtomEscape[?U]
  CharacterClass[?U]
  `(` Disjunction[?U] `)`
  `(` `?` `:` Disjunction[?U] `)`

SyntaxCharacter :: one of
  `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `]` `{` `}` `|`

PatternCharacter ::
  SourceCharacter but not SyntaxCharacter

AtomEscape[U] ::
  DecimalEscape
  CharacterClassEscape
  CharacterEscape[?U]

CharacterEscape[U] ::
  ControlEscape
  `c` ControlLetter
  `0` [lookahead <! DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?U]
  IdentityEscape[?U]

ControlEscape :: one of
  `f` `n` `r` `t` `v`

ControlLetter :: one of
  `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m` `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
  `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`

RegExpUnicodeEscapeSequence[U] ::
  [+U] `u` LeadSurrogate `\u` TrailSurrogate
  [+U] `u` LeadSurrogate
  [+U] `u` TrailSurrogate
  [+U] `u` NonSurrogate
  [~U] `u` Hex4Digits
  [+U] `u{` HexDigits `}`

LeadSurrogate ::
  Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xD800 to 0xDBFF]

TrailSurrogate ::
  Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xDC00 to 0xDFFF]

NonSurrogate ::
  Hex4Digits [> but only if the SV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF]

IdentityEscape[U] ::
  [+U] SyntaxCharacter
  [+U] `/`
  [~U] SourceCharacter but not UnicodeIDContinue

DecimalEscape ::
  NonZeroDigit DecimalDigits? [lookahead <! DecimalDigit]

CharacterClassEscape :: one of
  `d` `D` `s` `S` `w` `W`

CharacterClass[U] ::
  `[` [lookahead <! {`^`}] ClassRanges[?U] `]`
  `[` `^` ClassRanges[?U] `]`

ClassRanges[U] ::
  [empty]
  NonemptyClassRanges[?U]

NonemptyClassRanges[U] ::
  ClassAtom[?U]
  ClassAtom[?U] NonemptyClassRangesNoDash[?U]
  ClassAtom[?U] `-` ClassAtom[?U] ClassRanges[?U]

NonemptyClassRangesNoDash[U] ::
  ClassAtom[?U]
  ClassAtomNoDash[?U] NonemptyClassRangesNoDash[?U]
  ClassAtomNoDash[?U] `-` ClassAtom[?U] ClassRanges[?U]

ClassAtom[U] ::
  `-`
  ClassAtomNoDash[?U]

ClassAtomNoDash[U] ::
  SourceCharacter but not one of `\` or `]` or `-`
  `\` ClassEscape[?U]

ClassEscape[U] ::
  `b`
  [+U] `-`
  CharacterClassEscape
  CharacterEscape[?U]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# RegExp Lookbehind Assertions

Authors: Gorkem Yakin, Nozomu Katō, Daniel Ehrenberg

Stage 4

## Introduction

Lookarounds are zero-width assertions that match a string without consuming anything.
ECMAScript has lookahead assertions that do this in the forward direction, but the language is missing a way to do this backward, which lookbehind assertions provide. With lookbehind assertions, one can make sure that a pattern is or isn't preceded by another, e.g. matching a dollar amount without capturing the dollar sign.

## High Level API

There are two versions of lookbehind assertions: positive and negative.

Positive lookbehind assertions are denoted as `(?<=...)` and they ensure that the pattern contained within precedes the pattern following the assertion. For example, if one wants to match a dollar amount without capturing the dollar sign, `/(?<=\$)\d+(\.\d*)?/` can be used, matching `'$10.53'` and returning `'10.53'`. This, however, wouldn't match `'€10.53'`.

Negative lookbehind assertions are denoted as `(?

Pattern

The production Pattern :: Disjunction evaluates as follows:

1. Evaluate |Disjunction| with +1 as its _direction_ argument to obtain a Matcher _m_.
1. Return an internal closure that takes two arguments, a String _str_ and an integer _index_, and performs the following steps:
  1. Assert: _index_ ≤ the number of elements in _str_.
  1. If _Unicode_ is *true*, let _Input_ be a List consisting of the sequence of code points of _str_ interpreted as a UTF-16 encoded Unicode string. Otherwise, let _Input_ be a List consisting of the sequence of code units that are the elements of _str_. _Input_ will be used throughout the Pattern Semantics algorithms. Each element of _Input_ is considered to be a character.
  1. Let _InputLength_ be the number of characters contained in _Input_. This variable will be used throughout the Pattern Semantics algorithms.
  1. Let _listIndex_ be the index into _Input_ of the character that was obtained from element _index_ of _str_.
  1. Let _c_ be a Continuation that always returns its State argument as a successful MatchResult.
  1. Let _cap_ be a List of _NcapturingParens_ *undefined* values, indexed 1 through _NcapturingParens_.
  1. Let _x_ be the State (_listIndex_, _cap_).
  1. Call _m_(_x_, _c_) and return its result.

A Pattern evaluates (“compiles”) to an internal procedure value. RegExpBuiltinExec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String, and, if it does match, what the values of the capturing parentheses would be. The Pattern Semantics algorithms are designed so that compiling a pattern may throw a *SyntaxError* exception; on the other hand, once the pattern is successfully compiled, applying the resulting internal procedure to find a match in a String cannot throw an exception (except for any host-defined exceptions that can occur anywhere such as out-of-memory).

Disjunction

With argument _direction_.

The production Disjunction :: Alternative evaluates by evaluating |Alternative| to obtain a Matcher and returning that Matcher.

The production Disjunction :: Alternative `|` Disjunction evaluates as follows:

/a|ab/.exec("abc")

returns the result `"a"` and not `"ab"`. Moreover,

/((a)|(ab))((c)|(bc))/.exec("abc")

returns the array

["abc", "a", "a", undefined, "bc", undefined, "bc"]

and not

["abc", "ab", undefined, "ab", "c", "c", undefined]

The order in which the two alternatives are tried is independent of the value of _direction_.
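The direction-independence of alternative ordering can be observed from script. The following is an illustrative sketch (the patterns and input are made up for the example): inside a lookbehind the pattern is matched backward, yet the leftmost alternative is still tried first.

```javascript
// Both alternatives can match backward from the position just before "c".
// The capture shows which alternative was tried (and accepted) first.
const first = /(?<=(b|ab))c/.exec("abc");  // "b" is tried first and succeeds
const second = /(?<=(ab|b))c/.exec("abc"); // "ab" is tried first and succeeds

console.log(first[1], second[1]); // b ab
```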


Alternative

With argument _direction_.

The production Alternative :: [empty] evaluates by returning a Matcher that takes two arguments, a State _x_ and a Continuation _c_, and returns the result of calling _c_(_x_).

The production Alternative :: Alternative Term evaluates as follows:

1. Evaluate |Alternative| with argument _direction_ to obtain a Matcher _m1_.
1. Evaluate |Term| with argument _direction_ to obtain a Matcher _m2_.
1. If _direction_ is equal to +1, then
  1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
    1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m2_(_y_, _c_).
    1. Call _m1_(_x_, _d_) and return its result.
1. Else, _direction_ is equal to -1.
  1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
    1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m1_(_y_, _c_).
    1. Call _m2_(_x_, _d_) and return its result.

When _direction_ is equal to -1, the evaluation order of |Alternative| and |Term| is reversed.
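The effect of the reversed evaluation order is visible when two greedy groups compete for the same characters. A minimal sketch, with an input string chosen only for illustration:

```javascript
// Forward: the first (\d+) is matched first and greedily takes as much as it can.
const forward = /^(\d+)(\d+)/.exec("1053");

// Backward: inside the lookbehind the second (\d+) is matched first, right to left,
// so it is the one that takes as much as it can.
const backward = /(?<=(\d+)(\d+))$/.exec("1053");

console.log(forward.slice(1));  // [ '105', '3' ]
console.log(backward.slice(1)); // [ '1', '053' ]
```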

Term

With argument _direction_.

The production Term :: Assertion evaluates by returning an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:

1. Evaluate |Assertion| to obtain an AssertionTester _t_.
1. Call _t_(_x_) and let _r_ be the resulting Boolean value.
1. If _r_ is *false*, return ~failure~.
1. Call _c_(_x_) and return its result.

The AssertionTester is independent of _direction_.

The production Term :: Atom evaluates as follows:

1. Return the Matcher that is the result of evaluating |Atom| with argument _direction_.

The production Term :: Atom Quantifier evaluates as follows:

1. Evaluate |Atom| with argument _direction_ to obtain a Matcher _m_.
1. Evaluate |Quantifier| to obtain the three results: an integer _min_, an integer (or ∞) _max_, and Boolean _greedy_.
1. If _max_ is finite and less than _min_, throw a *SyntaxError* exception.
1. Let _parenIndex_ be the number of left capturing parentheses in the entire regular expression that occur to the left of this production expansion's |Term|. This is the total number of times the Atom :: `(` Disjunction `)` production is expanded prior to this production's |Term| plus the total number of Atom :: `(` Disjunction `)` productions enclosing this |Term|.
1. Let _parenCount_ be the number of left capturing parentheses in the expansion of this production's |Atom|. This is the total number of Atom :: `(` Disjunction `)` productions enclosed by this production's |Atom|.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
  1. Call RepeatMatcher(_m_, _min_, _max_, _greedy_, _x_, _c_, _parenIndex_, _parenCount_) and return its result.

Runtime Semantics: RepeatMatcher Abstract Operation

The abstract operation RepeatMatcher takes eight parameters, a Matcher _m_, an integer _min_, an integer (or ∞) _max_, a Boolean _greedy_, a State _x_, a Continuation _c_, an integer _parenIndex_, and an integer _parenCount_, and performs the following steps:

1. If _max_ is zero, return _c_(_x_).
1. Let _d_ be an internal Continuation closure that takes one State argument _y_ and performs the following steps when evaluated:
  1. If _min_ is zero and _y_'s _endIndex_ is equal to _x_'s _endIndex_, return ~failure~.
  1. If _min_ is zero, let _min2_ be zero; otherwise let _min2_ be _min_-1.
  1. If _max_ is ∞, let _max2_ be ∞; otherwise let _max2_ be _max_-1.
  1. Call RepeatMatcher(_m_, _min2_, _max2_, _greedy_, _y_, _c_, _parenIndex_, _parenCount_) and return its result.
1. Let _cap_ be a fresh copy of _x_'s _captures_ List.
1. For each integer _k_ that satisfies _parenIndex_ < _k_ and _k_ ≤ _parenIndex_+_parenCount_, set _cap_[_k_] to *undefined*.
1. Let _e_ be _x_'s _endIndex_.
1. Let _xr_ be the State (_e_, _cap_).
1. If _min_ is not zero, return _m_(_xr_, _d_).
1. If _greedy_ is *false*, then
  1. Call _c_(_x_) and let _z_ be its result.
  1. If _z_ is not ~failure~, return _z_.
  1. Call _m_(_xr_, _d_) and return its result.
1. Call _m_(_xr_, _d_) and let _z_ be its result.
1. If _z_ is not ~failure~, return _z_.
1. Call _c_(_x_) and return its result.
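The RepeatMatcher steps translate almost line for line into continuation-passing JavaScript. The following is a sketch for illustration only, not how any engine implements quantifiers: the State shape `{ endIndex, captures }`, the `FAILURE` sentinel and the forward-only `charMatcher` helper are assumptions introduced for this example.

```javascript
// A State is { endIndex, captures }. A Continuation maps a State to a State, or to FAILURE.
const FAILURE = null;

// Hypothetical single-character Matcher (forward direction only), used to drive the example.
function charMatcher(input, predicate) {
  return (x, c) => {
    const e = x.endIndex;
    if (e >= input.length || !predicate(input[e])) return FAILURE;
    return c({ endIndex: e + 1, captures: x.captures });
  };
}

function repeatMatcher(m, min, max, greedy, x, c, parenIndex, parenCount) {
  if (max === 0) return c(x);
  const d = (y) => {
    // Once the minimum is met, an iteration that consumed nothing is rejected.
    if (min === 0 && y.endIndex === x.endIndex) return FAILURE;
    const min2 = min === 0 ? 0 : min - 1;
    const max2 = max === Infinity ? Infinity : max - 1;
    return repeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount);
  };
  // Clear the captures that belong to the quantified Atom before this iteration.
  const cap = x.captures.slice();
  for (let k = parenIndex + 1; k <= parenIndex + parenCount; k++) cap[k] = undefined;
  const xr = { endIndex: x.endIndex, captures: cap };
  if (min !== 0) return m(xr, d);
  if (!greedy) {
    const z = c(x);
    return z !== FAILURE ? z : m(xr, d);
  }
  const z = m(xr, d);
  return z !== FAILURE ? z : c(x);
}

// [a-z]{2,4} and [a-z]{2,4}? applied to "abcdefghi", starting after the leading "a".
const input = "abcdefghi";
const lower = charMatcher(input, (ch) => ch >= "a" && ch <= "z");
const accept = (state) => state;
const start = { endIndex: 1, captures: [] };
const greedyEnd = repeatMatcher(lower, 2, 4, true, start, accept, 0, 0).endIndex;
const lazyEnd = repeatMatcher(lower, 2, 4, false, start, accept, 0, 0).endIndex;

console.log(input.slice(0, greedyEnd), input.slice(0, lazyEnd)); // abcde abc
```

Passing the rest of the pattern as the continuation `c` is what lets a later failure backtrack into an earlier repetition: `m(xr, d)` only returns success if everything after it also succeeded.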

If the |Atom| and the sequel of the regular expression all have choice points, the |Atom| is first matched as many (or as few, if non-greedy) times as possible. All choices in the sequel are tried before moving on to the next choice in the last repetition of |Atom|. All choices in the last (n^th) repetition of |Atom| are tried before moving on to the next choice in the next-to-last (n-1)^st repetition of |Atom|; at which point it may turn out that more or fewer repetitions of |Atom| are now possible; these are exhausted (again, starting with either as few or as many as possible) before moving on to the next choice in the (n-1)^st repetition of |Atom| and so on.

Compare

/a[a-z]{2,4}/.exec("abcdefghi")

which returns `"abcde"` with

/a[a-z]{2,4}?/.exec("abcdefghi")

which returns `"abc"`.

Consider also

/(aa|aabaac|ba|b|c)*/.exec("aabaac")

which, by the choice point ordering above, returns the array

["aaba", "ba"]

and not any of:

["aabaac", "aabaac"]
["aabaac", "c"]

The above ordering of choice points can be used to write a regular expression that calculates the greatest common divisor of two numbers (represented in unary notation). The following example calculates the gcd of 10 and 15:

"aaaaaaaaaa,aaaaaaaaaaaaaaa".replace(/^(a+)\1*,\1+$/,"$1")

which returns the gcd in unary notation `"aaaaa"`.
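The unary gcd trick can be wrapped in a function to try other inputs. This is a sketch: `unaryGcd` is a hypothetical helper name, and it assumes both arguments are positive integers small enough that backtracking stays cheap.

```javascript
// Hypothetical helper: gcd of two positive integers via the unary-notation regular expression.
// The greedy (a+) tries the longest candidate first, so the first divisor that fits both
// numbers is the greatest one.
function unaryGcd(m, n) {
  const unary = "a".repeat(m) + "," + "a".repeat(n);
  return unary.replace(/^(a+)\1*,\1+$/, "$1").length;
}

console.log(unaryGcd(10, 15)); // 5
```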

Step 4 of the RepeatMatcher clears |Atom|'s captures each time |Atom| is repeated. We can see its behaviour in the regular expression

/(z)((a+)?(b+)?(c))*/.exec("zaacbbbcac")

which returns the array

["zaacbbbcac", "z", "ac", "a", undefined, "c"]

and not

["zaacbbbcac", "z", "ac", "a", "bbb", "c"]

because each iteration of the outermost `*` clears all captured Strings contained in the quantified |Atom|, which in this case includes capture Strings numbered 2, 3, 4, and 5.

Step 1 of the RepeatMatcher's _d_ closure states that, once the minimum number of repetitions has been satisfied, any more expansions of |Atom| that match the empty character sequence are not considered for further repetitions. This prevents the regular expression engine from falling into an infinite loop on patterns such as:

/(a*)*/.exec("b")

or the slightly more complicated:

/(a*)b\1+/.exec("baaaac")

which returns the array

["b", ""]
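Both patterns can be run directly to confirm that they terminate. A short sketch; the variable names are only for the example. In the first case the empty iteration of `(a*)` is rejected, so the quantified group ends up not participating in the match at all.

```javascript
// The empty iteration is rejected, so the match is empty and capture 1 stays undefined.
const r1 = /(a*)*/.exec("b");

// (a*) captures "", then \1+ matches the empty capture once and stops.
const r2 = /(a*)b\1+/.exec("baaaac");

console.log(r1, r2);
```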


Assertion

The production Assertion :: `^` evaluates by returning an internal AssertionTester closure that takes a State argument _x_ and performs the following steps when evaluated:

Even when the `y` flag is used with a pattern, `^` always matches only at the beginning of _Input_, or (if _Multiline_ is *true*) at the beginning of a line.

The production Assertion :: `$` evaluates by returning an internal AssertionTester closure that takes a State argument _x_ and performs the following steps when evaluated:

The production Assertion :: `\` `b` evaluates by returning an internal AssertionTester closure that takes a State argument _x_ and performs the following steps when evaluated:

The production Assertion :: `\` `B` evaluates by returning an internal AssertionTester closure that takes a State argument _x_ and performs the following steps when evaluated:

The production Assertion :: `(` `?` `=` Disjunction `)` evaluates as follows:

1. Evaluate |Disjunction| with +1 as its _direction_ argument to obtain a Matcher _m_.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
  1. Let _d_ be a Continuation that always returns its State argument as a successful MatchResult.
  1. Call _m_(_x_, _d_) and let _r_ be its result.
  1. If _r_ is ~failure~, return ~failure~.
  1. Let _y_ be _r_'s State.
  1. Let _cap_ be _y_'s _captures_ List.
  1. Let _xe_ be _x_'s _endIndex_.
  1. Let _z_ be the State (_xe_, _cap_).
  1. Call _c_(_z_) and return its result.

The production Assertion :: `(` `?` `!` Disjunction `)` evaluates as follows:

The production Assertion :: `(` `?` `<=` Disjunction `)` evaluates as follows:

1. Evaluate |Disjunction| with -1 as its _direction_ argument to obtain a Matcher _m_.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
  1. Let _d_ be a Continuation that always returns its State argument as a successful MatchResult.
  1. Call _m_(_x_, _d_) and let _r_ be its result.
  1. If _r_ is ~failure~, return ~failure~.
  1. Let _y_ be _r_'s State.
  1. Let _cap_ be _y_'s _captures_ List.
  1. Let _xe_ be _x_'s _endIndex_.
  1. Let _z_ be the State (_xe_, _cap_).
  1. Call _c_(_z_) and return its result.

The production Assertion :: `(` `?` `<!` Disjunction `)` evaluates as follows:

1. Evaluate |Disjunction| with -1 as its _direction_ argument to obtain a Matcher _m_.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
  1. Let _d_ be a Continuation that always returns its State argument as a successful MatchResult.
  1. Call _m_(_x_, _d_) and let _r_ be its result.
  1. If _r_ is not ~failure~, return ~failure~.
  1. Call _c_(_x_) and return its result.
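Two consequences of the lookbehind steps above can be checked from script: the position is not advanced (the State passed on keeps _x_'s _endIndex_), and only the positive form passes captures on to the sequel. A small sketch with made-up inputs:

```javascript
// Positive lookbehind: zero-width, and the capture made inside it survives.
const positive = /(?<=(\w))b/.exec("ab");

// Negative lookbehind: it succeeds only when the inner pattern fails,
// so the sequel continues with the original captures and group 1 stays undefined.
const negative = /(?<!(x))b/.exec("ab");

console.log(positive, negative);
```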

Runtime Semantics: WordCharacters Abstract Operation

The abstract operation WordCharacters performs the following steps:

1. Let _A_ be a set of characters containing the sixty-three characters:

   `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m` `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
   `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
   `0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `_`

1. Let _U_ be an empty set.
1. For each character _c_ not in set _A_ where Canonicalize(_c_) is in _A_, add _c_ to _U_.
1. Assert: Unless _Unicode_ and _IgnoreCase_ are both *true*, _U_ is empty.
1. Add the characters in set _U_ to set _A_.
1. Return _A_.

Runtime Semantics: IsWordChar Abstract Operation

The abstract operation IsWordChar takes an integer parameter _e_ and performs the following steps:

1. If _e_ is -1 or _e_ is _InputLength_, return *false*.
1. Let _c_ be the character _Input_[_e_].
1. Let _wordChars_ be the result of ! WordCharacters().
1. If _c_ is in _wordChars_, return *true*.
1. Return *false*.
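IsWordChar can be sketched in a few lines of JavaScript. The sketch assumes the common case in which _Unicode_ and _IgnoreCase_ are not both *true*, so the word-character set is exactly the sixty-three characters listed above. The `isWordBoundary` helper is an additional assumption: it follows the usual ECMAScript definition of `\b` (exactly one side is a word character), whose steps are not reproduced in this text.

```javascript
// Sketch of IsWordChar over a plain string `input`; `e` is a position as in the steps above.
function isWordChar(input, e) {
  if (e === -1 || e === input.length) return false;
  return /^[A-Za-z0-9_]$/.test(input[e]);
}

// Assumption: \b holds when exactly one of the characters around position e is a word character.
function isWordBoundary(input, e) {
  return isWordChar(input, e - 1) !== isWordChar(input, e);
}

console.log(isWordChar("a-b", 0), isWordChar("a-b", 1)); // true false
console.log(isWordBoundary("ab", 1));                    // false
```

Treating positions -1 and _InputLength_ as non-word characters is what makes `\b` hold at the start and end of an input that begins or ends with a word character.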

Atom

With argument _direction_.

The production Atom :: PatternCharacter evaluates as follows:

1. Let _ch_ be the character matched by |PatternCharacter|.
1. Let _A_ be a one-element CharSet containing the character _ch_.
1. Call CharacterSetMatcher(_A_, *false*, _direction_) and return its Matcher result.

The production Atom :: `.` evaluates as follows:

1. Let _A_ be the set of all characters except |LineTerminator|.
1. Call CharacterSetMatcher(_A_, *false*, _direction_) and return its Matcher result.

The production Atom :: `\` AtomEscape evaluates as follows:

1. Return the Matcher that is the result of evaluating |AtomEscape| with argument _direction_.

The production Atom :: CharacterClass evaluates as follows:

1. Evaluate |CharacterClass| to obtain a CharSet _A_ and a Boolean _invert_.
1. Call CharacterSetMatcher(_A_, _invert_, _direction_) and return its Matcher result.

The production Atom :: `(` Disjunction `)` evaluates as follows:

1. Evaluate |Disjunction| with argument _direction_ to obtain a Matcher _m_.
1. Let _parenIndex_ be the number of left capturing parentheses in the entire regular expression that occur to the left of this production expansion's initial left parenthesis. This is the total number of times the Atom :: `(` Disjunction `)` production is expanded prior to this production's |Atom| plus the total number of Atom :: `(` Disjunction `)` productions enclosing this |Atom|.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
  1. Let _d_ be an internal Continuation closure that takes one State argument _y_ and performs the following steps:
    1. Let _cap_ be a fresh copy of _y_'s _captures_ List.
    1. Let _xe_ be _x_'s _endIndex_.
    1. Let _ye_ be _y_'s _endIndex_.
    1. If _direction_ is equal to +1, then
      1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
    1. Else,
      1. Assert: _direction_ is equal to -1.
      1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _ye_ (inclusive) through _xe_ (exclusive).
    1. Set _cap_[_parenIndex_+1] to _s_.
    1. Let _z_ be the State (_ye_, _cap_).
    1. Call _c_(_z_) and return its result.
  1. Call _m_(_x_, _d_) and return its result.
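The direction-dependent slice in the capturing-group steps means a capture is always recorded in input order, even when it was matched right to left. A minimal sketch, where `captureSlice` is a hypothetical helper that mirrors only the two branches computing _s_:

```javascript
// xe is where the group started matching, ye is where it finished.
// Going backward, ye is the smaller index, so the bounds are swapped.
function captureSlice(input, xe, ye, direction) {
  return direction === 1 ? input.slice(xe, ye) : input.slice(ye, xe);
}

console.log(captureSlice("abc", 0, 2, 1));   // ab (matched forward from 0 to 2)
console.log(captureSlice("abc", 2, 0, -1));  // ab (matched backward from 2 to 0)
console.log(/(?<=(ab))c/.exec("abc")[1]);    // ab, not "ba"
```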

The production Atom :: `(` `?` `:` Disjunction `)` evaluates as follows:

1. Return the Matcher that is the result of evaluating |Disjunction| with argument _direction_.

Runtime Semantics: CharacterSetMatcher Abstract Operation

The abstract operation CharacterSetMatcher takes ~~two~~ three arguments, a CharSet _A_, a Boolean flag _invert_, and an integer _direction_, and performs the following steps:

Runtime Semantics: Canonicalize ( _ch_ )

The abstract operation Canonicalize takes a character parameter _ch_ and performs the following steps:

1. If _IgnoreCase_ is *false*, return _ch_.
1. If _Unicode_ is *true*, then
  1. If the file CaseFolding.txt of the Unicode Character Database provides a simple or common case folding mapping for _ch_, return the result of applying that mapping to _ch_.
  1. Else, return _ch_.
1. Else,
  1. Assert: _ch_ is a UTF-16 code unit.
  1. Let _s_ be the ECMAScript String value consisting of the single code unit _ch_.
  1. Let _u_ be the same result produced as if by performing the algorithm for `String.prototype.toUpperCase` using _s_ as the *this* value.
  1. Assert: _u_ is a String value.
  1. If _u_ does not consist of a single code unit, return _ch_.
  1. Let _cu_ be _u_'s single code unit element.
  1. If _ch_'s code unit value ≥ 128 and _cu_'s code unit value < 128, return _ch_.
  1. Return _cu_.

Parentheses of the form `(` |Disjunction| `)` serve both to group the components of the |Disjunction| pattern together and to save the result of the match. The result can be used either in a backreference (`\\` followed by a nonzero decimal number), referenced in a replace String, or returned as part of an array from the regular expression matching internal procedure. To inhibit the capturing behaviour of parentheses, use the form `(?:` |Disjunction| `)` instead.

The form `(?=` |Disjunction| `)` specifies a zero-width positive lookahead. In order for it to succeed, the pattern inside |Disjunction| must match at the current position, but the current position is not advanced before matching the sequel. If |Disjunction| can match at the current position in several ways, only the first one is tried. Unlike other regular expression operators, there is no backtracking into a `(?=` form (this unusual behaviour is inherited from Perl). This only matters when the |Disjunction| contains capturing parentheses and the sequel of the pattern contains backreferences to those captures.

For example,

/(?=(a+))/.exec("baaabac")

matches the empty String immediately after the first `b` and therefore returns the array:

["", "aaa"]

To illustrate the lack of backtracking into the lookahead, consider:

/(?=(a+))a*b\1/.exec("baaabac")

This expression returns

["aba", "a"]

and not:

["aaaba", "a"]

The form `(?!` |Disjunction| `)` specifies a zero-width negative lookahead. In order for it to succeed, the pattern inside |Disjunction| must fail to match at the current position. The current position is not advanced before matching the sequel. |Disjunction| can contain capturing parentheses, but backreferences to them only make sense from within |Disjunction| itself. Backreferences to these capturing parentheses from elsewhere in the pattern always return *undefined* because the negative lookahead must fail for the pattern to succeed. For example,

/(.*?)a(?!(a+)b\2c)\2(.*)/.exec("baaabaac")

looks for an `a` not immediately followed by some positive number n of `a`'s, a `b`, another n `a`'s (specified by the first `\\2`) and a `c`. The second `\\2` is outside the negative lookahead, so it matches against *undefined* and therefore always succeeds. The whole expression returns the array:

["baaabaac", "ba", undefined, "abaac"]

In case-insignificant matches when _Unicode_ is *true*, all characters are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared. The simple mapping always maps to a single code point, so it does not map, for example, `"ß"` (U+00DF) to `"SS"`. It may however map a code point outside the Basic Latin range to a character within, for example, `"ſ"` (U+017F) to `"s"`. Such characters are not mapped if _Unicode_ is *false*. This prevents Unicode code points such as U+017F and U+212A from matching regular expressions such as `/[a-z]/i`, but they will match `/[a-z]/ui`.
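The non-Unicode branch of Canonicalize is short enough to sketch directly, which makes the U+017F and U+00DF cases concrete. `canonicalizeNonUnicode` is a hypothetical name, and the _IgnoreCase_ flag is passed as a parameter here rather than read from the pattern.

```javascript
// Sketch of the non-Unicode branch of Canonicalize for a single UTF-16 code unit `ch`.
function canonicalizeNonUnicode(ch, ignoreCase) {
  if (!ignoreCase) return ch;
  const u = ch.toUpperCase();
  if (u.length !== 1) return ch; // e.g. "ß" uppercases to "SS", so it is left alone
  // A non-ASCII character must not canonicalize into the ASCII range, e.g. "ſ" to "S".
  if (ch.charCodeAt(0) >= 128 && u.charCodeAt(0) < 128) return ch;
  return u;
}

console.log(canonicalizeNonUnicode("a", true));                 // A
console.log(canonicalizeNonUnicode("\u017F", true) === "\u017F"); // true
console.log(/[a-z]/i.test("\u017F"), /[a-z]/ui.test("\u017F"));   // false true
```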


AtomEscape

With argument _direction_.

The production AtomEscape :: DecimalEscape evaluates as follows:

1. Evaluate |DecimalEscape| to obtain an integer _n_.
1. If _n_>_NcapturingParens_, throw a *SyntaxError* exception.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
  1. Let _cap_ be _x_'s _captures_ List.
  1. Let _s_ be _cap_[_n_].
  1. If _s_ is *undefined*, return _c_(_x_).
  1. Let _e_ be _x_'s _endIndex_.
  1. Let _len_ be _s_'s length.
  1. ~~Let _f_ be _e_+_len_.~~
  1. Let _f_ be _e_ + _direction_×_len_.
  1. If _f_ < 0 or _f_>_InputLength_, return ~failure~.
  1. Let _g_ be min(_e_, _f_).
  1. If there exists an integer _i_ between 0 (inclusive) and _len_ (exclusive) such that Canonicalize(_s_[_i_]) is not the same character value as Canonicalize(_Input_[~~_e_+_i_~~ _g_+_i_]), return ~failure~.
  1. Let _y_ be the State (_f_, _cap_).
  1. Call _c_(_y_) and return its result.
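The backreference steps can be sketched as a Matcher factory. This is an illustration under assumptions, not engine code: the State shape `{ endIndex, captures }`, the `FAILURE` sentinel and the optional `canonicalize` parameter (identity by default, standing in for Canonicalize) are introduced for the example.

```javascript
const FAILURE = null;

// Sketch of the Matcher for backreference \n over a plain string `input`.
function backreferenceMatcher(input, n, direction, canonicalize = (ch) => ch) {
  return (x, c) => {
    const s = x.captures[n];
    if (s === undefined) return c(x);      // an unset capture always succeeds
    const e = x.endIndex;
    const len = s.length;
    const f = e + direction * len;         // where the match would end
    if (f < 0 || f > input.length) return FAILURE;
    const g = Math.min(e, f);              // leftmost index of the compared span
    for (let i = 0; i < len; i++) {
      if (canonicalize(s[i]) !== canonicalize(input[g + i])) return FAILURE;
    }
    return c({ endIndex: f, captures: x.captures });
  };
}

const accept = (state) => state;
const captures = [undefined, "ab"];
const back = backreferenceMatcher("abab", 1, -1);
console.log(back({ endIndex: 4, captures }, accept).endIndex); // 2
console.log(back({ endIndex: 1, captures }, accept));          // null, not enough input to the left
```

Because a lookbehind evaluates its terms right to left, a backreference inside it only sees a capture if the group sits to its right in the pattern: `/(?<=\1(a))b/` requires two `a`s before `b`, while in `/(?<=(a)\1)b/` the `\1` runs before the group has captured and therefore matches the empty sequence.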

The production AtomEscape :: CharacterEscape evaluates as follows:

The production AtomEscape :: CharacterClassEscape evaluates as follows:

1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_.
1. Call CharacterSetMatcher(_A_, *false*, _direction_) and return its Matcher result.

An escape sequence of the form `\\` followed by a nonzero decimal number _n_ matches the result of the _n_th set of capturing parentheses. It is an error if the regular expression has fewer than _n_ capturing parentheses. If the regular expression has _n_ or more capturing parentheses but the _n_th one is *undefined* because it has not captured anything, then the backreference always succeeds.

--------------------------------------------------------------------------------