├── LICENSE
└── README.md


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Zsolt Nagy

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# JavaScript-RegExp-Cheat-Sheet version 1.0.0

This document describes the regular expression features available in JavaScript up to ES2018. Some ES2018 features have not been released.

The current version of this document is very far from done. This is just an initial commit describing some of the basic features of constructing simple regular expressions in JavaScript. This document will only be shared from version 1.0 onwards.

## Basic definitions

- String `s` *matches* the regex pattern `/p/` whenever `s` contains the pattern 'p'.
    - Example: `abc` matches `/a/`, `/b/`, `/c/`
- For simplicity, we will use the *matches* verb loosely in a sense that
    - a string can *match* a regex (e.g. 'a' matches /a/)
    - a regex can *match* a string (e.g. /a/ matches 'a')
- A regex pattern consists of *literal characters* that match themselves, and *metasyntax characters*
- Literal characters can be *concatenated* in a regular expression. String `s` matches `/ab/` if there is an `a` character *directly* followed by a `b` character.
    - Example: `abc` matches `/ab/`, `/bc/`, `/abc/`
    - Example: `abc` does not match `/ac/`, `/cd/`, `/abcd/`
- *Alternative execution* can be achieved with the metasyntax character `|`
    - `/a|b/` means: match either `a` or `b`
    - Example: 'ac', 'b', 'ab' match `/a|b/`
    - Example: 'c' does not match `/a|b/`
- Iteration is achieved using repeat modifiers. One repeat modifier is the `*` (asterisk) metasyntax character.
    - Example: `/a*/` matches any string containing any number of `a` characters
    - Example: `/a*/` matches any string, including `''`, because they all contain at least zero `a` characters
- Matching is *greedy*. A *greedy* match attempts to stay in iterations as long as possible.
    - Example: `s = 'baaa'` matches `/a*a/` in the following way:
        - `s[0]`: `'b'` is discarded
        - `s[1]`: `'a'` matches the pattern `a*`
        - `s[1] - s[2]`: `'aa'` matches the pattern `a*`
        - `s[1] - s[3]`: `'aaa'` matches the pattern `a*`
        - as there are no more characters in `s` and there is a character yet to be matched in the regex, we *backtrack* one character
        - `s[1] - s[2]`: `'aa'` matches the pattern `a*`, and we stop investigating the `a*` pattern
        - `s[3]`: `'a'` matches the `a` pattern
        - there is a complete match: `s[1] - s[2]` match the `a*` pattern, and `s[3]` matches the `a` pattern. The returned match is `aaa` starting at index `1` of string `s`
- Backtracking is *minimal*. We attempt to backtrack one character at a time in the string, and attempt to interpret the rest of the regex pattern on the remainder of the string.


## Constructing a regex

* literal form: `/regex/`
* constructor: `new RegExp( 'regex' );`
    - escaping: `/\d/` becomes `new RegExp( '\\d' )`
    - argument list: `new RegExp( pattern, modifiers );`

Applying modifiers in the constructor and in literal form:
```
const regex1 = new RegExp( 'regex', 'ig' );
const regex2 = /regex/ig;
```

* The `RegExp` constructor also accepts a regular expression:

```
> new RegExp( /regex/, 'i' );
/regex/i
```

## List of JavaScript regex modifiers

| Modifier | Description |
|----------|-------------|
| `i` | case-insensitive matching. Upper and lower cases don't matter. |
| | |
| `g` | global match. We attempt to find all matches instead of just returning the first match. The internal state of the regular expression stores where the last match was located, and matching is resumed where it was left in the previous match. |
| | |
| `m` | multiline match. It makes the `^` and `$` characters match the beginning and the end of each line of the tested string. Lines are separated by line terminators such as `\n` or `\r`. |
| | |
| `u` | unicode search. The regex pattern is treated as a unicode sequence |
| | |
| `y` | sticky search. Matching is attempted only at the position stored in the `lastIndex` property of the regex. |

Example:

```
> const str = 'Regular Expressions';

> /re/.test( str );
false
> /re/i.test( str );   // matches: 're', 'Re', 'rE', 'RE'
true
```

## Regex API

- `regex.exec( str )`: returns information on the first match. `exec` allows iteration on the regex for all matches if the `g` modifier is set for the regex
- `regex.test( str )`: returns `true` if and only if the regex *matches* `str`

```
> const regex = /ab/;
> const str = 'bbababb';
> const noMatchStr = 'c';

> regex.exec( str );    // first match is at index 2
[ 0: "ab", index: 2, input: "bbababb" ]
> regex.exec( noMatchStr );
null

> regex.test( str );
true
> regex.test( noMatchStr );
false

> const globalRegex = /ab/g;
> globalRegex.exec( str );
[ 0: "ab", index: 2, input: "bbababb" ]
> globalRegex.exec( str );
[ 0: "ab", index: 4, input: "bbababb" ]
> globalRegex.exec( str );
null

> let result;
> while ( ( result = globalRegex.exec( str ) ) !== null ) {
    console.log( result );
  }
[ 0: "ab", index: 2, input: "bbababb" ]
[ 0: "ab", index: 4, input: "bbababb" ]
```

## String API

- `str.match( regex )`: for non-global regular expression arguments, it returns the same value as `regex.exec( str )`. For global regular expressions, the return value is an array containing all the matches. Returns `null` if no match has been found.
- `str.replace( regex, newString )`: replaces the first full match with `newString`. If `regex` has a global modifier, `str.replace( regex, newString )` replaces all matches with `newString`. Does not mutate the original string `str`.
- `str.search( regex )`: returns the index of the first match. Returns `-1` when no match is found. Does not consider the global modifier.
- `str.split( regex )`: returns an array containing the substrings in-between matches. Does not consider the global modifier.

```
> const regex = /ab/;
> const globalRegex = /ab/g;
> const str = 'bbababb';
> const noMatchStr = 'c';

> str.match( regex );
["ab", index: 2, input: "bbababb"]

> str.match( globalRegex );
["ab", "ab"]

> noMatchStr.match( globalRegex );
null

> str.replace( regex, 'X' );
"bbXabb"

> str.replace( globalRegex, 'X' )
"bbXXb"

> str.search( regex );
2

> noMatchStr.search( regex );
-1

> str.split( regex );
["bb", "", "b"]

> noMatchStr.split( regex );
["c"]
```

## Literal characters

A regex *literal character* matches itself. The expression `/a/` matches any string that contains the `a` character. Example:

```
/a/.test( 'Andrea' )   // true, because the last character is 'a'
/a/.test( 'André' )    // false, because there is no 'a' in the string
/a/i.test( 'André' )   // true, because 'A' matches /a/i with the case-insensitive flag `i` applied on `/a/`
```

Literal characters are: all characters except metasyntax characters such as: `.`, `*`, `^`, `$`, `[`, `]`, `(`, `)`, `{`, `}`, `|`, `?`, `+`, `\`

When you need a metasyntax character, place a backslash in front of it. Examples: `\.`, `\\`, `\[`.
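
A minimal sketch of escaping in practice: the dot is a metasyntax character unless a backslash precedes it, and the backslash itself has to be doubled inside a string passed to the `RegExp` constructor. The helper `escapeRegExp` and the sample strings are hypothetical names introduced only for this illustration; the helper is not part of the JavaScript API described in this document.

```javascript
// An unescaped dot matches an arbitrary character.
const unescapedDot = /a.c/;
// An escaped dot matches only a literal '.' character.
const escapedDot = /a\.c/;

console.log( unescapedDot.test( 'abc' ) );   // true
console.log( escapedDot.test( 'abc' ) );     // false
console.log( escapedDot.test( 'a.c' ) );     // true

// In a string passed to the constructor, the backslash itself is escaped.
const escapedDotFromString = new RegExp( 'a\\.c' );
console.log( escapedDotFromString.test( 'a.c' ) );   // true

// Hypothetical helper (an assumption for this sketch): it places a
// backslash in front of every metasyntax character found in the text.
const escapeRegExp = text => text.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );

const literalPrice = new RegExp( escapeRegExp( '$5.00' ) );
console.log( literalPrice.test( 'Total: $5.00' ) );   // true
console.log( literalPrice.test( 'Total: 5x00' ) );    // false
```

Escaping user-supplied text this way is one possible approach when the text has to be matched literally; it is not the only one.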

Whitespaces:
- behave as literal characters, exact match is required
- use escape sequences for more flexibility, such as:
    - `\n` for a newline
    - `\t` for a tab
    - `\b` for word boundaries

## Metasyntax characters

| Metasyntax character | Semantics |
|----------------------|-----------|
| `.` | arbitrary character class |
| | |
| `[]` | character sets, `[012]` means 0, or 1, or 2 |
| | |
| `^` | (1) negation, e.g. in a character set `[^890]` means not 8, not 9, and not 0, (2) anchor matching the start of the string or line |
| | |
| `$` | anchor matching the end of the string or line |
| | |
| `\|` | alternative execution (or) |
| | |
| `*` | iteration: match any number of times |
| | |
| `?` | optional parts in the expression |
| | |
| `+` | match at least once |
| | |
| `{}` and `{,}` | specify a range for the number of times an expression has to match the string. Forms: `{3}` exactly 3 times, `{3,}` at least 3 times, `{3,5}` between 3 and 5 times. |
| | |
| `()` | (1) overriding precedence through grouping, (2) extracting substrings |
| | |
| `\` | (1) before a metasyntax character: the next character becomes a literal character (e.g. `\\`). (2) before a special character: the sequence is interpreted as a special character sequence (e.g. `\d` as digit). |
| | |
| `(?:`, `)` | non-capturing parentheses. Anything in-between `(?:` and `)` is matched, but not captured. Should be used to achieve only functionality (1) of `()` parentheses. |
| | |
| `(?=`, `)` | lookahead. E.g. `.(?=end)` matches an arbitrary character if it is followed by the characters `end`. Only the character is returned in the match, `end` is excluded. |
| | |
| `(?!`, `)` | negative lookahead. E.g. `.(?!end)` matches an arbitrary character if it is **not** followed by the characters `end`. Only the character is returned in the match, `end` is excluded. |
| | |
| `\b` | word boundary. Zero length assertion. Matches the start or end position of a word. E.g. `\bworld\b` matches the string `'the world'` |
| | |
| `[\b]` | matches a backspace character. This is **not** a character set including a word boundary. |
| | |
| `\B` | not a word boundary. |
| | |
| `\c` | `\c` is followed by character `x`. `\cx` matches `CTRL + x`. |
| | |
| `\d` | digit. `[0-9]` |
| | |
| `\D` | non-digit. `[^0-9]` |
| | |
| `\f` | form feed character |
| | |
| `\n` | newline character (line feed) |
| | |
| `\r` | carriage return character |
| | |
| `\s` | one arbitrary white space character |
| | |
| `\S` | one non-whitespace character |
| | |
| `\t` | tab character |
| | |
| `\u` | `\u` followed by four hexadecimal digits matches the unicode character described by those four digits. When the `u` flag is set, the format `\u{0abc}` can also be used. |
| | |
| `\v` | vertical tab character |
| | |
| `\w` | word character: alphanumeric character or underscore. `[A-Za-z0-9_]` |
| | |
| `\W` | non-word character. `[^A-Za-z0-9_]` |
| | |
| `\x` | `\x` followed by two hexadecimal digits matches the character described by those two digits. |
| | |
| `\0` | NULL character. Equivalent to `\x00` and `\u0000`. When the `u` flag is set for the regex, it is equivalent to `\u{0000}`. |
| | |
| `\1`, `\2`, ... | backreference. `\i` is a reference to the matched contents of the *i*th capture group. |



Examples:

```
/.../.test( 'ab' )    // false, we need at least three arbitrary characters
/.a*/.test( 'a' )     // true, . matches 'a', and a* matches ''
/^a/.test( 'ba' )     // false, 'ba' does not start with a
/^a/.test( 'ab' )     // true, 'ab' starts with a
/a$/.test( 'ba' )     // true, 'ba' ends with a
/a$/.test( 'ab' )     // false, 'ab' does not end with a
/^a$/.test( 'ab' )    // false, 'ab' does not fully match a
/^a*$/.test( 'aa' )   // true, 'aa' fully matches a pattern consisting of
                      //       a characters only
/[ab]/.test( 'b' )    // true, 'b' contains a character that is a
                      //       member of the character class `[ab]`
/a|b/.test( 'b' )     // true, 'b' contains a character that is
                      //       either `a` or `b`
/ba?b/.test( 'bb' )   // true, the optional a is not included
/ba?b/.test( 'bab' )  // true, the optional a is included
/ba?b/.test( 'bcb' )  // false, only matches 'bb' or 'bab'
/a+/.test( '' )       // false, at least one `a` character is needed
/a+/.test( 'ba' )     // true, the `a` character was found
/a{3}/.test( 'baaab' ) // true, three consecutive 'a' characters were found
/(a|b)c/.test( 'abc' ) // true, a `b` character is followed by `c`
/a(?=b)/.test( 'ab' ) // true. It matches 'a', because 'a' is followed by 'b'
/a(?!b)/.test( 'ab' ) // false, because 'a' is followed by 'b'
/\ba/.test( 'bab' )   // false, because 'a' is not the first character of a word
/\Ba/.test( 'bab' )   // true, because 'a' is not the first character of a word
/(\d)\1/.test( '55' ) // true. It matches two consecutive digits with the same value
```

In the `/(a|b)c/` example, notice the parentheses. As the `|` operator has the lowest precedence out of all operators, the parentheses raise the precedence of the alternative execution `a|b` above that of concatenation, giving `(a|b)c`.
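
The precedence rule described above can be sketched as follows. The two patterns and their variable names are assumptions chosen for this illustration: the only difference between them is the pair of parentheses around the alternatives.

```javascript
// Without parentheses, | has the lowest precedence:
// /^a|bc$/ means "starts with a" OR "ends with bc".
const withoutGrouping = /^a|bc$/;

// With parentheses, the alternative execution is evaluated first:
// /^(a|b)c$/ means "a or b, directly followed by c, and nothing else".
const withGrouping = /^(a|b)c$/;

console.log( withoutGrouping.test( 'ax' ) );    // true, 'ax' starts with a
console.log( withoutGrouping.test( 'xbc' ) );   // true, 'xbc' ends with bc
console.log( withGrouping.test( 'ax' ) );       // false
console.log( withGrouping.test( 'xbc' ) );      // false
console.log( withGrouping.test( 'ac' ) );       // true
console.log( withGrouping.test( 'bc' ) );       // true
```

When the grouped substring does not have to be extracted, the non-capturing form `(?:a|b)` listed in the table above gives the same precedence change.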

## Character sets, character classes

- `[abc]` is `a|b|c`
- `[a-c]` is `[abc]`
- `[0-9a-fA-F]` is a case-insensitive hexadecimal digit
- `[^abc]` is an arbitrary character that is not `a`, not `b`, and not `c`
- `.`: arbitrary character class
    - Example: `/..e/`: three character sequence ending with `e`
- other character classes such as digit (`\d`), not a digit (`\D`), word (`\w`), not a word (`\W`), whitespace character (`\s`): check out the section on metasyntax characters

## Basic (greedy) Repeat modifiers

Matching is maximal. Backtracking is minimal: it proceeds character by character.

| Repeat modifier | Description |
|-----------------|-------------|
| `+` | Match at least once |
| | |
| `?` | Match at most once |
| | |
| `*` | Match any number of times |
| | |
| `{min,max}` | Match at least `min` times, and at most `max` times |
| | |
| `{n}` | Match exactly `n` times |

Examples:

- `/^a+$/` matches any string consisting of one or more `'a'` characters and nothing else
- `/^a?$/` matches `''` or `'a'`. The string may contain at most one `'a'` character
- `/^a*$/` matches the empty string and everything matched by `/^a+$/`
- `/^a{3,5}$/` matches `'aaa'`, `'aaaa'`, and `'aaaaa'`
- `/(ab){3}/` matches any string containing the substring `'ababab'`

## Lazy Repeat Modifiers

Matching is minimal. During backtracking, we add one character at a time.
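
The contrast between greedy (maximal) and lazy (minimal) matching can be sketched as follows. The sample string and the variable names are assumptions chosen for this illustration; appending `?` to a repeat modifier makes it lazy.

```javascript
const html = '<b>bold</b> and <i>italic</i>';

// Greedy: .+ consumes as much as possible, so the match ends at the last '>'.
const greedyMatch = html.match( /<.+>/ )[ 0 ];

// Lazy: .+? consumes as little as possible, so the match ends at the first '>'.
const lazyMatch = html.match( /<.+?>/ )[ 0 ];

console.log( greedyMatch );   // '<b>bold</b> and <i>italic</i>'
console.log( lazyMatch );     // '<b>'

// The same contrast with a bounded repeat modifier.
const greedyRange = 'aaaaa'.match( /a{2,4}/ )[ 0 ];
const lazyRange = 'aaaaa'.match( /a{2,4}?/ )[ 0 ];

console.log( greedyRange );   // 'aaaa'
console.log( lazyRange );     // 'aa'
```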

| Repeat modifier (PCRE) | Description |
|-----------------|-------------|
| `+?` | Match at least once |
| | |
| `??` | Match at most once |
| | |
| `*?` | Match any number of times |
| | |
| `{min,max}?` | Match at least `min` times, and at most `max` times |
| | |
| `{n}?` | Match exactly `n` times |

Examples for lazy matching:

- `/^a+?$/` lazily matches any string consisting of one or more `'a'` characters and nothing else
- `/^a??$/` lazily matches `''` or `'a'`. The string may contain at most one `'a'` character
- `/^a*?$/` lazily matches the empty string and everything matched by `/^a+$/`
- `/^a{3,5}?$/` lazily matches `'aaa'`, `'aaaa'`, and `'aaaaa'`
- `/(ab){3}?/` lazily matches any string containing the substring `'ababab'`

## Capture groups

- `(` and `)` capture a substring inside a regex
- Capture groups are numbered, starting with `1`, in the order of their opening parentheses
- `(?:` and `)` act as non-capturing parentheses, they are not included in the capture group numbering

Example:

```
/a(b|c(d|(e))(f))$/
  ^   ^  ^   ^
  |   |  |   |
  1   2  3   4
```

```
> console.table( /^a(b|c(d|(e))(f+))$/.exec( 'ab' ) )
(index)  Value
0        "ab"
1        "b"
2        undefined
3        undefined
4        undefined
index    0
input    "ab"

> console.table( /^a(b|c(d|(e))(f+))$/.exec( 'aceff' ) )
(index)  Value
0        "aceff"
1        "ceff"
2        "e"
3        "e"
4        "ff"
index    0
input    "aceff"

> console.table( /^a(b|c(?:d|(e))(f+))$/.exec( 'aceff' ) )
(index)  Value
0        "aceff"
1        "ceff"
2        "e"
3        "ff"
index    0
input    "aceff"
```

## Lookahead and Lookbehind

| Lookahead type | JavaScript syntax | Remark |
|----------------|-------------|---------------|
| positive lookahead | `(?=pattern)` | |
| negative lookahead | `(?!pattern)` | |
| positive lookbehind | `(?<=pattern)` | only expected in ES2018 |
| negative lookbehind | `(?<!pattern)` | only expected in ES2018 |

```
> /a(?=b)/.exec( 'ab' )
["a", index: 0, input: "ab"]

> /a(?!\d)/.exec( 'ab' )
["a", index: 0, input: "ab"]
> /a(?!\d)/.exec( 'a0' )
null

> /(?<=a)b/.exec( 'ab' )
["b", index: 1, input: "ab"]   // executed in latest Google Chrome

/(?
```

a|b)`

## Recommended RegExp library

[XRegExp](https://github.com/slevithan/xregexp)
- Extended formatting
- Named captures
- Possessive loops
- etc.

## Articles

Check out [my articles](http://www.zsoltnagy.eu/category/regular-expressions/) on regular expressions.

--------------------------------------------------------------------------------