├── .gitignore
├── README.md
├── Notes
    ├── img
    │   ├── the.svg
    │   ├── [^a].svg
    │   ├── _W_W_W.svg
    │   ├── _w_w_w.svg
    │   ├── _d_d_d.svg
    │   ├── _s_s_s.svg
    │   ├── _D_D_D.svg
    │   ├── _S_S_S.svg
    │   ├── hobby_ies.svg
    │   ├── a(_!b).svg
    │   ├── a(_=b).svg
    │   ├── NLP{3,5}.svg
    │   ├── ab+c.svg
    │   ├── a(bc)+.svg
    │   ├── colou_r.svg
    │   ├── ab_c.svg
    │   ├── [tT]he.svg
    │   ├── ab{2}c.svg
    │   ├── [hH]ello.svg
    │   ├── ab{2,}c.svg
    │   ├── hobb(y_ies).svg
    │   ├── a[^0-9]b.svg
    │   ├── _b[tT]he_b.svg
    │   ├── [^a-zA-Z0-9_].svg
    │   ├── [a-zA-Z0-9_].svg
    │   ├── [^a-zA-Z][tT]he[^a-zA-Z].svg
    │   └── (^_[^a-zA-Z])[tT]he([^a-zA-Z]_$).svg
    └── README.md
└── SolvedExercises
    └── README.md


/.gitignore:
--------------------------------------------------------------------------------
*.pdf

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Regular-Expressions-Guide-for-Beginners

## Table of Contents
- [Notes](./Notes/README.md)
- [Solved Exercises](./SolvedExercises/README.md)

--------------------------------------------------------------------------------
/Notes/img/the.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: the]

--------------------------------------------------------------------------------
/Notes/img/[^a].svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: None of: a]

--------------------------------------------------------------------------------
/Notes/img/_W_W_W.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: non-word, non-word, non-word]

--------------------------------------------------------------------------------
/Notes/img/_w_w_w.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: word, word, word]

--------------------------------------------------------------------------------
/Notes/img/_d_d_d.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: digit, digit, digit]

--------------------------------------------------------------------------------
/Notes/img/_s_s_s.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: white space, white space, white space]

--------------------------------------------------------------------------------
/Notes/img/_D_D_D.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: non-digit, non-digit, non-digit]

--------------------------------------------------------------------------------
/Notes/img/_S_S_S.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: non-white space, non-white space, non-white space]

--------------------------------------------------------------------------------
/Notes/img/hobby_ies.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: hobby, ies]

--------------------------------------------------------------------------------
/Notes/img/a(_!b).svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: a, negative lookahead, b]

--------------------------------------------------------------------------------
/Notes/img/a(_=b).svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: a, positive lookahead, b]

--------------------------------------------------------------------------------
/Notes/img/NLP{3,5}.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: NLP, 2…4 times, repeats 3…5 times in total]

--------------------------------------------------------------------------------
/Notes/img/ab+c.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: abc]

--------------------------------------------------------------------------------
/Notes/img/a(bc)+.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: a, group #1, bc]

--------------------------------------------------------------------------------
/Notes/img/colou_r.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: colour]

--------------------------------------------------------------------------------
/Notes/img/ab_c.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: abc]

--------------------------------------------------------------------------------
/Notes/img/[tT]he.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: One of: t, T; he]

--------------------------------------------------------------------------------
/Notes/img/ab{2}c.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: a, b, once, repeats 2 times in total, c]

--------------------------------------------------------------------------------
/Notes/img/[hH]ello.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: One of: h, H; ello]

--------------------------------------------------------------------------------
/Notes/img/ab{2,}c.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: a, b, 1+ times, repeats 2+ times in total, c]

--------------------------------------------------------------------------------
/Notes/img/hobb(y_ies).svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: hobb, group #1, y, ies]

--------------------------------------------------------------------------------
/Notes/img/a[^0-9]b.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: a, None of: 0-9, b]

--------------------------------------------------------------------------------
/SolvedExercises/README.md:
--------------------------------------------------------------------------------
# Exercises
## 1) Write regular expressions for the following:
1. The set of all alphabetic strings:
    - We mean alphabetic characters repeated any number of times.
    - `[a-zA-Z]+`.
    - We can't use `\w+` because it also matches digits and underscores.
2. The set of all lower case alphabetic strings ending in b:
    - `[a-z]*b`.
    - We used `*b` instead of `+b` because we want to match the single-character string `b`.
    - As a rule, always consider whether a sub-pattern should be allowed to match the empty string; here the part before the final `b` may be empty.
3. The set of all strings from the alphabet `a`,`b` such that each `a` is immediately preceded by and immediately followed by `b`:
    - An `a` can never stand alone: it must have a `b` directly before it and a `b` directly after it.
    - That means that we should have something like `(ab+)` in our sequence.
    - This unit must itself be preceded by at least one `b`, so we get something like `b+(ab+)`.
    - This regex matches strings like `bbbbabbb`, but not `bbbabab`, which contains two `a`s.
    - We add an asterisk after the group to allow any number of occurrences.
    - The final regex is: `b+(ab+)*`.
    - If the empty string (which contains no `a` at all) should also be accepted, make the whole pattern optional: `(b+(ab+)*)?`.
4. The set of all binary strings with at least four ones:
    - We are looking for strings made only of `1`s and `0`s in which `1` appears at least four times.
    - We can hard-code it as four ones, each preceded by any number of `0`s, followed by any binary string.
    - This looks like `0*10*10*10*10*[01]*`, which is equivalent to `(0*1){4}[01]*`.
    - Since the string is always binary, we can also allow any binary prefix before each required `1`: `([01]*1){4}[01]*`.
    - We can take more advantage of this assumption, because every `1` is preceded by zero or more `0`s.
    - So we can represent it as `(0*1){4,}0*`, which can also be written as `(0*10*){4,}`.
    - Another possible answer is `(0*1+0*){4,}`. The unrolled form `0*1+0*1+0*1+0*1+0*` is not equivalent: it allows at most four separate runs of `1`s, so it rejects a string such as `101010101`.
5. The set of all binary strings where the number of zeros is a multiple of 3:
    - We need 3, 6, 9, ... `0`s. Here we assume that zero occurrences are not allowed.
    - Those zeros can be separated by `1`s.
    - This is a possible solution: `(1*01*01*01*)+`.
    - It can be rewritten as: `((1*01*){3})+`.
    - If strings without any `0` should also match (0 is a multiple of 3), use `1*(01*01*01*)*` instead.

## 2) Write regular expressions for the following languages.
>> By “word”, we mean an alphabetic string separated from other words by whitespace, any relevant punctuation, line breaks, and so forth. Strictly speaking, `\b` is not an exact word boundary under this definition, because it also treats digits and underscores as word characters; the solutions below still use it as an approximation.
1. The set of all strings with two consecutive repeated words in the same case (e.g., “Humbert Humbert” and “the the” but not “the bug” or “the big bug”):
    - When we are asked to search for repeated strings, capturing groups should come to mind.
    - We need to match any word: `[a-zA-Z]+`
    - Words can be followed (separated) by any amount of whitespace: `([a-zA-Z]+)\s+`
    - We want to see another occurrence of the same word after the whitespace: `([a-zA-Z]+)\s+\1`
    - Finally, we wrap the pattern in word boundaries so that it does not match inside longer words (e.g. `the theory`): `\b([a-zA-Z]+)\s+\1\b`.
2. All strings that start at the beginning of the line with an integer and that end at the end of the line with a word:
    - Start with an integer: `^\d+`.
    - End with a word (an alphabetic string): `[a-zA-Z]+$`.
    - Anything may come in between: `.*\b`. We added `\b` because the final word must be separated from what precedes it.
    - Combine them: `^\d+.*\b[a-zA-Z]+$`.
    - We can go a more sophisticated way and use `^(-)?\d` instead of `^\d` to also accept negative numbers: `^(-)?\d+.*\b[a-zA-Z]+$`.
3. All strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto):
    - Capture the `grotto` word: `\bgrotto\b`.
    - Capture the `raven` word: `\braven\b`.
    - Capture the string with both `grotto` and `raven`: `.*\bgrotto\b.*\braven\b.*`.
    - This captures them only in this order. We should also cover the other order using `|` (with no spaces around it, since a space in a regex is a literal character): `.*\bgrotto\b.*\braven\b.*|.*\braven\b.*\bgrotto\b.*`
    - With more words, listing every permutation does not scale. Lookaheads (see the Notes) avoid this: `^(?=.*\bgrotto\b)(?=.*\braven\b).*$`.

## 3) Write a regular expression that matches responses to this question: “What are blue, grey and red?” The following 6 responses should be matched:
- colours, colors:
    - Optional `u`.
    - `colou?rs`.
- they're colours, they're colors:
    - Optional `they're`.
    - `(they're )?colou?rs`.
- they are colours, they are colors:
    - `they are` is also accepted.
    - `(they('| a)re )?colou?rs`. The second alternative needs the leading space, otherwise it would match `theyare` instead of `they are`.
--------------------------------------------------------------------------------
/Notes/img/_b[tT]he_b.svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: word boundary, One of: t, T; he, word boundary]

--------------------------------------------------------------------------------
/Notes/img/[^a-zA-Z0-9_].svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: None of: a-z, A-Z, 0-9, _]

--------------------------------------------------------------------------------
/Notes/img/[a-zA-Z0-9_].svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: One of: a-z, A-Z, 0-9, _]

--------------------------------------------------------------------------------
/Notes/img/[^a-zA-Z][tT]he[^a-zA-Z].svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: None of: a-z, A-Z; One of: t, T; he; None of: a-z, A-Z]

--------------------------------------------------------------------------------
/Notes/img/(^_[^a-zA-Z])[tT]he([^a-zA-Z]_$).svg:
--------------------------------------------------------------------------------
[SVG diagram; markup not preserved. Surviving label text: group #1 (Start of line, None of: a-z, A-Z); One of: t, T; he; group #2 (None of: a-z, A-Z, End of line)]

--------------------------------------------------------------------------------
/Notes/README.md:
--------------------------------------------------------------------------------
# Regular Expressions
- Formally, a regular expression is an **algebraic notation** for characterizing a set of strings.
- They are useful for searching in texts. A search needs:
    - A pattern to search for.
    - A corpus of texts to search through.
- They are case sensitive.
- Regular expressions are **greedy**: they always match the longest string they can.
- We will use the online tool [RegEx Pal](https://www.regexpal.com/) for testing our regular expressions.
- You can use this [dummy text](https://www.lipsum.com/) for testing.

>> Note: we may show some regular expressions delimited by slashes, but the slashes are not part of the regular expressions.

>> Note: We can set a flag to ignore the case.

## Simple text
- **String**: Typing a string of characters will match that string.
    - E.g.: `mostafa` will match every `mostafa` in the passage.
- **Single Character**: Typing a single character will match that character.
    - E.g.: `!` will match every `!` in the passage.
## Disjunction
- **Disjunction**: Typing characters enclosed by square brackets `[]` matches any one of the characters from this list.
    - ![img](img/[hH]ello.svg)
    - E.g.: `[hH]ello` will match every `hello` or `Hello` in the passage.
    - E.g.: `[mM]ostafa` will match every `mostafa` or `Mostafa` in the passage.
## Range
- **Range**: Typing two characters enclosed by square brackets `[]` and separated by a dash `-` matches any single character in that range.
    - E.g.: `[a-c]ello` will match every `aello`, `bello`, `cello` in the passage.
    - E.g.: `[0-9]ello` will match every `0ello`, `1ello`, `2ello`, `3ello`, `4ello`, `5ello`, `6ello`, `7ello`, `8ello`, `9ello` in the passage.
## Caret
- **Caret**: Square brackets can also be used to specify what a single character cannot be, by use of the caret `^`.
    - If the caret `^` is the first symbol after the open square bracket `[`, the resulting pattern is negated.
    - E.g.: `[^a]` will match any single character (including special characters) except `a`.
    - ![img](img/[^a].svg)
    - E.g.: `[^A-Z]` will match every character that is not a capital letter.
    - E.g.: `[^0-9]` will match every character that is not a digit.
    - E.g.: `a[^0-9]b` will match every `a` followed by a character that is not a digit and then a `b`.
    - ![img](img/a[^0-9]b.svg)
    - This is only true when the caret is the first symbol after the open square bracket. Anywhere else inside the brackets, it stands for a literal caret.
    - E.g.: `[a^b]` will match any one of `a`, `^`, or `b`. Outside brackets, most engines treat `^` as an anchor, so write `a\^b` to match a literal `a^b`.
## Quantifiers
- **Question Mark**: The `?` indicates **zero** or **one** occurrences of the **preceding** element.
    - ![img](img/colou_r.svg)
    - E.g.: `colou?r` matches both `color` and `colour`.
- **Asterisk**: The `*` indicates **zero** or **more** occurrences of the **preceding** element.
    - ![img](img/ab_c.svg)
    - E.g.: `ab*c` matches `ac`, `abc`, `abbc`, `abbbc`, and so on.
- **Plus Sign**: The `+` indicates **one** or **more** occurrences of the **preceding** element.
    - ![img](img/ab+c.svg)
    - E.g.: `ab+c` matches `abc`, `abbc`, `abbbc`, and so on, but not `ac`.
- **{n}**: The **preceding** item is matched **exactly n** times.
    - ![img](img/ab{2}c.svg)
    - E.g.: `ab{2}c` matches `abbc`.
- **{min,}**: The **preceding** item is matched **min** or **more** times.
    - ![img](img/ab{2,}c.svg)
    - E.g.: `ab{2,}c` matches `abbc`, `abbbc`, `abbbbc`, and so on.
- **{,max}**: The **preceding** item is matched **at most max** times. Some engines (e.g. JavaScript) do not support this short form and need `{0,max}`.
    - E.g.: `ab{,2}c` matches `ac`, `abc`, `abbc`.
- **{min,max}**: The **preceding** item is matched **between min and max** times (inclusive).
    - ![img](img/NLP{3,5}.svg)
    - E.g.: `NLP{3,5}` matches `NLPPP`, `NLPPPP`, `NLPPPPP`, but not `NLP` or `NLPP`.
## Wildcard & Anchors
- **Dot**: The `.` matches any single character except the newline character `\n`.
    - E.g.: `a.c` matches `abc`, `a c`, `a1c`, `a-c`, and so on.
    - E.g.: `.{5}` matches any five-character string.
    - E.g.: `.*Mostafa.*` matches any string with `Mostafa` as a sub-string.
    - E.g.: `Mostafa.*Wael` matches any text that starts with `Mostafa` and ends with `Wael` (on the same line).
>> a “word” for the purposes of a regular expression is defined as any sequence of **word characters**: **digits**, **underscores**, or **letters**
- **Anchors**: They don't match characters; rather, they assert conditions about a position in the string.
    - **Caret**: The `^` matches the **beginning** of a **string**, or of each **line** if the multiline flag is on.
        - E.g.: `^a` matches `a` in `abc`, but not `a` in `bac`.
    - **Dollar Sign**: The `$` matches the **end** of a **string**, or of each **line** if the multiline flag is on.
        - E.g.: `a$` matches `a` in `bca`, but not `a` in `abc` or `bac`.
    - **Word Boundary**: The `\b` matches a **word boundary**. It is a position, not a character: the point between a word character and a non-word character (such as **whitespace** or **punctuation**), or the **beginning** or **end** of a string next to a word character.
        - The following three positions qualify as word boundaries:
            - Before the first character in a string if the first character is a word character.
            - After the last character in a string if the last character is a word character.
            - Between two characters in a string if one is a word character and the other is not.
        - E.g.: The word boundaries in `Mostafa, focus!` are:
            - Before the `M`.
            - After the last `a`.
            - Before the `f`.
            - After the `s` in `focus`.
        - E.g.: `\bMostafa\b` matches `Mostafa` in `Mostafa, focus!`, but not `Mostafa` in `Mostafaaaaa, focus!`.
        - E.g.: `\bWael` matches `Wael` in `mostafa Wael` and `mostafa-Wael`, but not `Wael` in `mostafa_Wael` or `mostafa2Wael` or `mostafaWael`.
        - E.g.: `\b\d\d:\d\d\b` matches `01:30` in `I wrote this at 01:30 AM`
    - **Non-Word Boundary**: The `\B` matches any position that is **not a word boundary**, i.e. between two word characters (**digits**, **underscores**, or **letters**) or between two non-word characters.
        - E.g.: `\BWael` matches `Wael` in `mostafa_Wael`, `mostafa2Wael` or `mostafaWael`, but not `Wael` in `mostafa Wael` or `mostafa-Wael`.
## Grouping
- **Pipe Symbol**: The `|` matches either the expression **before** it or the expression **after** it.
    - E.g.: `hobby|ies` matches `hobby` in `my hobby` and `ies` in `hobbies`.
    - ![img](img/hobby_ies.svg)
- **Parenthesis**: The `()` is used for grouping characters together to allow operators to act on them as a group.
    - E.g.: `hobb(y|ies)` matches `hobby` in `my hobby` and `hobbies` in `my hobbies`.
    - ![img](img/hobb(y_ies).svg)
    - E.g.: `a(bc)+` matches `abc` in `abc`, `abcbc` in `abcbc`, and `abcbcbc` in `abcbcbccc`.
    - ![img](img/a(bc)+.svg)
### Aliases
- **Aliases**: To save typing for common ranges.
    - **\d**: Expands to `[0-9]` and matches any digit.
        - E.g.: `\d` or `[0-9]` match `0` in `0abc` and `9` in `9abc`.
        - E.g.: `\d\d\d` or `[0-9][0-9][0-9]` match `123` in `123abc`.
        - ![img](img/_d_d_d.svg)
    - **\D**: Expands to `[^0-9]` and matches any non-digit.
        - E.g.: `\D` or `[^0-9]` match each of `a`, `b`, and `c` (one character at a time) in `0abc5` and in `9abc8`.
        - E.g.: `\D\D\D` or `[^0-9][^0-9][^0-9]` match `abc` in `123abc`.
        - ![img](img/_D_D_D.svg)
    - **\w**: Expands to `[a-zA-Z0-9_]` and matches any word character (digits, underscores, or letters).
        - E.g.: `\w` or `[a-zA-Z0-9_]` match each of `a`, `b`, and `c` (one character at a time) in `ab-c`.
        - ![img](img/[a-zA-Z0-9_].svg)
        - E.g.: `\w\w\w` or `[a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_]` match `mos` in `most-tafa`.
        - ![img](img/_w_w_w.svg)
    - **\W**: Expands to `[^a-zA-Z0-9_]` and matches any non-word character (anything that is not a digit, an underscore, or a letter).
        - E.g.: `\W` or `[^a-zA-Z0-9_]` match `-` in `ab-c`.
        - ![img](img/[^a-zA-Z0-9_].svg)
        - E.g.: `\W\W\W` or `[^a-zA-Z0-9_][^a-zA-Z0-9_][^a-zA-Z0-9_]` match `---` in `most---tafa`.
        - ![img](img/_W_W_W.svg)
    - **\s**: Expands to `[ \t\n\r\f\v]` and matches any whitespace character (space, tab, newline, and so on).
        - E.g.: `\s` or `[ \t\n\r\f\v]` match the space in `a b`.
        - E.g.: `\s\s\s` or `[ \t\n\r\f\v][ \t\n\r\f\v][ \t\n\r\f\v]` match three consecutive whitespace characters, such as three spaces between `a` and `b`.
        - ![img](img/_s_s_s.svg)
    - **\S**: Expands to `[^ \t\n\r\f\v]` and matches any non-whitespace character.
        - E.g.: `\S` or `[^ \t\n\r\f\v]` match each of `a`, `-`, and `b` (one character at a time) in `a -b`.
        - E.g.: `\S\S\S` or `[^ \t\n\r\f\v][^ \t\n\r\f\v][^ \t\n\r\f\v]` match `Mos` and `afa` in `Mos afa`.
        - ![img](img/_S_S_S.svg)

## Backslash
- **Backslashes**: Escape characters that are otherwise special, so that they match literally.
    - **\\***: Matches the `*`.
    - **\\.**: Matches the `.`.
    - **\\?**: Matches the `?`.
- Or even:
    - **\n**: Matches the newline character.
    - **\t**: Matches the tab character.

## Simple Example
Write a RegEx to find cases of the English word `the`. We want to match it even when it is at the beginning of a line or next to a digit or an underscore.
1.
   `the`:
    - ![the](img/the.svg)
    - Wrong!
    - Misses `The` with capital `T`.
2. `[tT]he`:
    - ![tT](img/[tT]he.svg)
    - Wrong!
    - Incorrectly returns texts where `the` is embedded in other words.
    - E.g.: `other` or `theology`.
3. `\b[tT]he\b`:
    - ![\b[tT]he\b](img/_b%5BtT%5Dhe_b.svg)
    - Wrong!
    - Won’t treat underscores and numbers as word boundaries.
    - But we want to detect sequences such as `the_` or `the25`.
4. `[^a-zA-Z][tT]he[^a-zA-Z]`:
    - ![img](img/[^a-zA-Z][tT]he[^a-zA-Z].svg)
    - Wrong!
    - Here we specify that we want instances with no alphabetic letters on either side of `the`, but this misses `the` when it begins or ends a line.
5. `(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)`:
    - ![img](img/(^_[^a-zA-Z])[tT]he([^a-zA-Z]_$).svg)
    - Correct!
    - Before `the` we require either the beginning of the line or a non-alphabetic character, and after it either a non-alphabetic character or the end of the line.
    - Limitation with consecutive `the`: the surrounding non-alphabetic characters are consumed by the match, so in `the the` the shared space belongs to the first match and the second `the` is missed.
## Substitutions
- We can replace a string we have found, or parts of it, with something else.
    - E.g.: `s/this/that/` will replace `this` with `that`.
    - E.g.: `s/colour/color/` will replace `colour` with `color`.

## Capturing Groups & Referencing
- The regex engine stores the text matched by each group in memory so that it can be referred back to later.
- To do so, we must enclose that part of the regex in parentheses `()`.
    - E.g.: `CMP ([0-9]+)` will capture the number `23` in `CMP 23` and save it in memory.
    - We can refer to that exact number using a one-based index like this: `CMP ([0-9]+) are \1 years old` will match `CMP 23 are 23 years old`.
    - E.g.: `The (.*)er they (.*)` will match `The faster they ran`, where `fast` can be accessed using `\1` and `ran` can be accessed using `\2`.
    - E.g.: `The (.*)er they (.*), the \1er we \2` will match `The faster they ran, the faster we ran`.
- This is called a **capturing group**. We can use a **non-capturing group** instead by adding `?:` right after the opening parenthesis.
    - E.g.: `(?:some|few) (people) like \1` will match `some people like people` and `few people like people`. Here `\1` refers to `people`, not `some` or `few`, because those are in a **non-capturing group**.
## Types of Errors
- The process we just went through was based on fixing two kinds of errors:
    - **False positives**, strings that we incorrectly **matched**, like `other` or `there`.
    - **False negatives**, strings that we incorrectly **missed**, like `The`.
- Reducing the overall error rate for an application thus involves two antagonistic efforts:
    - Increasing **precision** (minimizing false **positives**).
    - Increasing **recall** (minimizing false **negatives**).

## Lookaround Assertions
Lookaround assertions let a match depend on what follows or precedes a pattern without making that context part of the returned text. They do not consume characters in the string; they only assert whether a match is possible at the current position (zero-length assertions).
- Types of lookaround assertion:
    - Positive Lookahead `(?=f)`:
        - ![img](img/a(_=b).svg)
        - Asserts that what immediately follows the current position in the string is `f`.
        - `a(?=b)` will match `a` in `abc` but will not match `a` in `acb` or `bac`.
    - Positive Lookbehind `(?<=f)`:
        - Asserts that what immediately precedes the current position in the string is `f`.
        - `(?<=y)z` will match `z` in `xyz` but will not match `z` in `zyx`.
    - Negative Lookahead `(?!f)`:
        - ![img](img/a(_!b).svg)
        - Asserts that what immediately follows the current position in the string is not `f`.
        - `a(?!b)` will match `a` in `acb` but will not match `a` in `abc`.
    - Negative Lookbehind `(?<!f)`:
        - Asserts that what immediately precedes the current position in the string is not `f`.
        - `(?<!y)z` will match `z` in `zyx` but will not match `z` in `xyz`.

>> We used `!` instead of `=` in the negative expressions.

## Simple Example
Write a RegEx to concatenate separate digits. E.g.: `2 3` becomes `23`, ...

Answer: `s/(?<=\d)\s+(?=\d)//`.

Illustration:
- To replace `x` with `y`: `s/x/y/`
- To remove `x`: `s/x//`
- We want to select every run of whitespace that is bounded by two digits:
    - Should be preceded by a digit (lookbehind): `(?<=\d)`.
    - Followed by a digit (lookahead): `(?=\d)`.
    - One or more whitespace characters in between: `\s+`.

>> Diagrams made with [Regex Pal](https://www.regexpal.com/).

--------------------------------------------------------------------------------
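The digit-joining substitution from the last example can be tried out in Python. This is only a minimal sketch using the standard `re` module; the function name `join_separated_digits` is a hypothetical helper introduced for illustration, not part of the notes. The lookbehind `(?<=\d)` sits before the whitespace and the lookahead `(?=\d)` after it, so the digits themselves are checked but never removed.

```python
import re

# Whitespace that has a digit directly before it (lookbehind)
# and a digit directly after it (lookahead).
# The digits are only asserted, not consumed, so they stay in the text.
DIGIT_GAP = re.compile(r"(?<=\d)\s+(?=\d)")


def join_separated_digits(text: str) -> str:
    """Remove every run of whitespace that is bounded by two digits.

    Hypothetical helper for illustration; re.sub replaces all matches,
    which corresponds to a global substitution.
    """
    return DIGIT_GAP.sub("", text)


if __name__ == "__main__":
    print(join_separated_digits("2 3"))             # 23
    print(join_separated_digits("call 0 1 2 now"))  # call 012 now
    print(join_separated_digits("a 1 b"))           # a 1 b (unchanged)
```

Because lookarounds are zero-length, the space between `1` and `2` still sees `1` as its preceding digit even after the space between `0` and `1` has matched; a pattern that consumed the digits, such as `\d\s+\d`, would need capture groups and could miss overlapping cases.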