├── .gitignore ├── LICENSE ├── MANIFEST.in ├── README.md ├── python └── pcre.py ├── setup.py ├── src └── pcremodule.c └── test ├── __init__.py ├── pcre_tests.py ├── test_pcre.py ├── test_pcre_re_compat.py └── test_support.py /.gitignore: -------------------------------------------------------------------------------- 1 | build 2 | dist 3 | MANIFEST 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2012-2015, Arkadiusz Wahlig 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | * Redistributions of source code must retain the above copyright 7 | notice, this list of conditions and the following disclaimer. 8 | * Redistributions in binary form must reproduce the above copyright 9 | notice, this list of conditions and the following disclaimer in the 10 | documentation and/or other materials provided with the distribution. 11 | * Neither the name of the nor the 12 | names of its contributors may be used to endorse or promote products 13 | derived from this software without specific prior written permission. 14 | 15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 16 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 17 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 18 | DISCLAIMED. IN NO EVENT SHALL BE LIABLE FOR ANY 19 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 20 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 21 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 22 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 23 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 25 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md 2 | include LICENSE 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | python-pcre 2 | =========== 3 | 4 | Python bindings for PCRE regex engine. 5 | 6 | 7 | Requirements 8 | ------------ 9 | 10 | * [PCRE](http://www.pcre.org) 8.x 11 | * [Python](http://python.org) 2.6+ or 3.x 12 | 13 | Tested with Python 2.6, 2.7, 3.4 and PCRE 8.12, 8.30, 8.35. 14 | 15 | 16 | Building and installation 17 | ------------------------- 18 | 19 | A standard distutils `setup.py` script is provided. 20 | After making sure all dependencies are installed, building 21 | and installation should be as simple as: 22 | 23 | ``` 24 | $ python setup.py build install 25 | ``` 26 | 27 | When building PCRE, UTF-8 mode must be enabled (`./configure --enable-utf`). You might 28 | also want to enable stackless recursion (`--disable-stack-for-recursion`) and unicode 29 | character properties (`--enable-unicode-properties`). If you plan to use JIT, 30 | add `--enable-jit`. 
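The `pcre.config` object exposes PCRE's build-time configuration and gives a quick way to check that these options took effect; the snippet below is only a sketch, assuming a build with all of them enabled:

```python
>>> import pcre
>>> pcre.config.utf_8               # --enable-utf
True
>>> pcre.config.unicode_properties  # --enable-unicode-properties
True
>>> pcre.config.jit                 # --enable-jit
True
```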
31 | 32 | 33 | Differences between python-pcre and re 34 | --------------------------------------- 35 | 36 | The API is very similar to that of the built-in 37 | [re module](http://docs.python.org/library/re.html). 38 | 39 | Differences: 40 | 41 | * slightly different regex syntax 42 | * by default, `sub()`, `subn()`, `expand()` use `str.format()` instead of `\1` substitution 43 | (see below) 44 | * `DEBUG` and `LOCALE` flags are not supported 45 | * patterns are not cached 46 | * scanner APIs are not supported 47 | 48 | For a comprehensive reference on the PCRE regex syntax, see the 49 | [PHP documentation](http://php.net/manual/en/reference.pcre.pattern.syntax.php). 50 | 51 | 52 | Substitution 53 | ------------ 54 | 55 | By default, python-pcre uses `str.format()` instead of the `re`-style `\1` and `\g<name>` 56 | substitution in calls to `sub()`, `subn()` and `expand()`. 57 | 58 | Example: 59 | 60 | ```python 61 | >>> import pcre 62 | >>> pcre.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):', 63 | ... 'static PyObject*\npy_{1}(void)\n{{', # str.format() template 64 | ... 'def myfunc():') 65 | 'static PyObject*\npy_myfunc(void)\n{' 66 | ``` 67 | Note the `{1}` and the escaped `{{` in the template string. 68 | 69 | The built-in re module would use `\1` instead: 70 | `r'static PyObject*\npy_\1(void)\n{'` 71 | 72 | Named groups are referenced using `{name}` instead of `\g<name>`. 73 | 74 | The entire match can be referenced using `{0}`. 75 | 76 | This makes the template string easier to read and means that it no longer needs to be 77 | a raw string. 78 | 79 | However, `re` template mode can be enabled using `enable_re_template_mode()`. 80 | This might be useful if python-pcre is to be used with existing `re`-based code. 81 | 82 | ```python 83 | >>> pcre.enable_re_template_mode() 84 | >>> pcre.sub(r'(.)', r'[\1]', 'foo') 85 | '[f][o][o]' 86 | ``` 87 | 88 | A function to convert `re` templates is also provided for one-off cases. 89 | 90 | ```python 91 | >>> pcre.convert_re_template(r'static PyObject*\npy_\1(void)\n{') 92 | 'static PyObject*\npy_{1}(void)\n{{' 93 | ``` 94 | 95 | A small difference between the two modes is that in `str.format()` mode, groups that 96 | didn't match are replaced with `''`, whereas in `re` mode it is an error to reference 97 | such groups in the template. 98 | 99 | Also note that in Python 3.x `bytes.format()` is not available, so the template needs 100 | to be a `str`. 101 | 102 | 103 | Unicode handling 104 | ---------------- 105 | 106 | python-pcre internally uses the UTF-8 interface of the PCRE library. 107 | 108 | Patterns or matched subjects specified as byte strings that contain ASCII characters 109 | only (0-127) are passed to PCRE directly, as ASCII is a subset of UTF-8. 110 | Other byte strings are internally re-encoded using a simple Latin-1 to UTF-8 codec 111 | which maps characters 128-255 to unicode codepoints of the same value. 112 | This conversion is transparent to the caller. 113 | 114 | If you know that your byte strings are UTF-8, you can use the `pcre.UTF8` flag 115 | to tell python-pcre to pass them directly to PCRE. This flag has to be specified 116 | every time a UTF-8 pattern is compiled or a UTF-8 subject is matched. Note that 117 | in this mode things like `.` may match multiple bytes: 118 | 119 | ```python 120 | >>> pcre.compile('.').match(b'\xc3\x9c', flags=pcre.UTF8).group() 121 | b'\xc3\x9c' # two bytes 122 | >>> _.decode('utf-8') 123 | u'\xdc' # one character 124 | ``` 125 |
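Without the flag, the same two bytes are treated as two separate Latin-1 characters, so `.` matches only the first one; a minimal sketch of that default behaviour, following the conversion rules described above:

```python
>>> pcre.compile('.').match(b'\xc3\x9c').group()
b'\xc3'  # one Latin-1 character, i.e. one byte of the original subject
```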
126 | python-pcre also accepts unicode strings as input. In Python 3.3 or newer, which 127 | implements [PEP 393](http://legacy.python.org/dev/peps/pep-0393/), unicode strings 128 | stored internally as ASCII are passed to PCRE directly. Other internal formats are 129 | encoded into UTF-8 using Python APIs (which use the UTF-8 form cached in the unicode 130 | object if available). In older Python versions these optimizations are not supported, 131 | so all unicode objects require the extra encoding step. 132 | 133 | python-pcre also accepts objects supporting the buffer interface, such as `array.array` 134 | objects. Both the old and the new buffer APIs are supported, with buffers containing either 135 | bytes or unicode characters; the same UTF-8 encoding strategy as for byte/unicode strings applies. 136 | 137 | When internally encoding subject strings to UTF-8, any offsets accepted as input 138 | or provided as output are also converted between byte and character offsets so that 139 | the caller doesn't need to be aware of the conversion -- the offsets are always 140 | indexes into the specified subject string, whether it's a byte string or a unicode 141 | string. 142 | 143 | 144 | License 145 | ------- 146 | 147 | ``` 148 | Copyright (c) 2012-2015, Arkadiusz Wahlig 149 | All rights reserved. 150 | 151 | Redistribution and use in source and binary forms, with or without 152 | modification, are permitted provided that the following conditions are met: 153 | * Redistributions of source code must retain the above copyright 154 | notice, this list of conditions and the following disclaimer. 155 | * Redistributions in binary form must reproduce the above copyright 156 | notice, this list of conditions and the following disclaimer in the 157 | documentation and/or other materials provided with the distribution. 158 | * Neither the name of the nor the 159 | names of its contributors may be used to endorse or promote products 160 | derived from this software without specific prior written permission. 161 | 162 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 163 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 164 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 165 | DISCLAIMED. IN NO EVENT SHALL BE LIABLE FOR ANY 166 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 167 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 168 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 169 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 170 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 171 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 172 | ``` 173 | -------------------------------------------------------------------------------- /python/pcre.py: -------------------------------------------------------------------------------- 1 | """ python-pcre 2 | 3 | Copyright (c) 2012-2015, Arkadiusz Wahlig 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | * Redistributions of source code must retain the above copyright 9 | notice, this list of conditions and the following disclaimer. 10 | * Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution.
13 | * Neither the name of the nor the 14 | names of its contributors may be used to endorse or promote products 15 | derived from this software without specific prior written permission. 16 | 17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 18 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 19 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 20 | DISCLAIMED. IN NO EVENT SHALL BE LIABLE FOR ANY 21 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 22 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 23 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 24 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 25 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 26 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 27 | """ 28 | 29 | import _pcre 30 | 31 | __version__ = '0.7' 32 | 33 | class Pattern(_pcre.Pattern): 34 | def search(self, string, pos=-1, endpos=-1, flags=0): 35 | try: 36 | return Match(self, string, pos, endpos, flags) 37 | except NoMatch: 38 | pass 39 | 40 | def match(self, string, pos=-1, endpos=-1, flags=0): 41 | try: 42 | return Match(self, string, pos, endpos, flags | ANCHORED) 43 | except NoMatch: 44 | pass 45 | 46 | def split(self, string, maxsplit=0, flags=0): 47 | output = [] 48 | pos = n = 0 49 | for match in self.finditer(string, flags=flags): 50 | start, end = match.span() 51 | if start != end: 52 | output.append(string[pos:start]) 53 | output.extend(match.groups()) 54 | pos = end 55 | n += 1 56 | if 0 < maxsplit <= n: 57 | break 58 | output.append(string[pos:]) 59 | return output 60 | 61 | def findall(self, string, pos=-1, endpos=-1, flags=0): 62 | matches = self.finditer(string, pos, endpos, flags) 63 | if self.groups == 0: 64 | return [m.group() for m in matches] 65 | if self.groups == 1: 66 | return [m.groups('')[0] for m in matches] 67 | return [m.groups('') for m in matches] 68 | 69 | def finditer(self, string, pos=-1, endpos=-1, flags=0): 70 | try: 71 | while 1: 72 | match = Match(self, string, pos, endpos, flags) 73 | yield match 74 | start, pos = match.span() 75 | if pos == start: 76 | pos += 1 77 | except NoMatch: 78 | pass 79 | 80 | def sub(self, repl, string, count=0, flags=0): 81 | return self.subn(repl, string, count, flags)[0] 82 | 83 | def subn(self, repl, string, count=0, flags=0): 84 | if not hasattr(repl, '__call__'): 85 | repl = lambda match, tmpl=repl: match.expand(tmpl) 86 | output = [] 87 | pos = n = 0 88 | for match in self.finditer(string, flags=flags): 89 | start, end = match.span() 90 | if not pos == start == end or pos == 0: 91 | output.extend((string[pos:start], repl(match))) 92 | pos = end 93 | n += 1 94 | if 0 < count <= n: 95 | break 96 | output.append(string[pos:]) 97 | return (string[:0].join(output), n) 98 | 99 | def __reduce__(self): 100 | if self.pattern is None: 101 | return (Pattern, (None, 0, self.dumps())) 102 | return (Pattern, (self.pattern, self.flags)) 103 | 104 | def __repr__(self): 105 | if self.pattern is None: 106 | return '{0}.loads({1})'.format(__name__, repr(self.dumps())) 107 | flags = self.flags 108 | if flags: 109 | v = [] 110 | for name in _FLAGS: 111 | value = getattr(_pcre, name) 112 | if flags & value: 113 | v.append('{0}.{1}'.format(__name__, name)) 114 | flags &= ~value 115 | if flags: 116 | v.append(hex(flags)) 117 | return '{0}.compile({1}, {2})'.format(__name__, 
repr(self.pattern), '|'.join(v)) 118 | return '{0}.compile({1})'.format(__name__, repr(self.pattern)) 119 | 120 | class Match(_pcre.Match): 121 | def expand(self, template): 122 | return template.format(self.group(), *self.groups(''), **self.groupdict('')) 123 | 124 | def __repr__(self): 125 | cls = self.__class__ 126 | return '<{0}.{1} object; span={2}, match={3}>'.format(cls.__module__, 127 | cls.__name__, repr(self.span()), repr(self.group())) 128 | 129 | class REMatch(Match): 130 | def expand(self, template): 131 | groups = (self.group(),) + self.groups() 132 | groupdict = self.groupdict() 133 | def repl(match): 134 | esc, index, group, badgroup = match.groups() 135 | if esc: 136 | return ('\\' + esc).decode('string-escape') 137 | if badgroup: 138 | raise PCREError(100, 'invalid group name') 139 | try: 140 | if index or group.isdigit(): 141 | result = groups[int(index or group)] 142 | else: 143 | result = groupdict[group] 144 | except IndexError: 145 | raise PCREError(15, 'invalid group reference') 146 | except KeyError: 147 | raise IndexError('unknown group name') 148 | if result is None: 149 | raise PCREError(101, 'unmatched group') 150 | return result 151 | return _REGEX_RE_TEMPLATE.sub(repl, template) 152 | 153 | def compile(pattern, flags=0): 154 | if isinstance(pattern, _pcre.Pattern): 155 | if flags != 0: 156 | raise ValueError('cannot process flags argument with a compiled pattern') 157 | return pattern 158 | return Pattern(pattern, flags) 159 | 160 | def match(pattern, string, flags=0): 161 | return compile(pattern, flags).match(string) 162 | 163 | def search(pattern, string, flags=0): 164 | return compile(pattern, flags).search(string) 165 | 166 | def split(pattern, string, maxsplit=0, flags=0): 167 | return compile(pattern, flags).split(string, maxsplit) 168 | 169 | def findall(pattern, string, flags=0): 170 | return compile(pattern, flags).findall(string) 171 | 172 | def finditer(pattern, string, flags=0): 173 | return compile(pattern, flags).finditer(string) 174 | 175 | def sub(pattern, repl, string, count=0, flags=0): 176 | return compile(pattern, flags).sub(repl, string, count) 177 | 178 | def subn(pattern, repl, string, count=0, flags=0): 179 | return compile(pattern, flags).subn(repl, string, count) 180 | 181 | def loads(data): 182 | # Loads a pattern serialized with Pattern.dumps(). 183 | return Pattern(None, loads=data) 184 | 185 | def escape(pattern): 186 | # Escapes a regular expression. 187 | s = list(pattern) 188 | alnum = _ALNUM 189 | for i, c in enumerate(pattern): 190 | if c not in alnum: 191 | s[i] = '\\000' if c == '\000' else ('\\' + c) 192 | return pattern[:0].join(s) 193 | 194 | def escape_template(template): 195 | # Escapes "{" and "}" characters in the template. 196 | return template.replace('{', '{{').replace('}', '}}') 197 | 198 | def convert_re_template(template): 199 | # Converts re template r"\1\g" to "{1}{id}" format. 200 | def repl(match): 201 | esc, index, group, badgroup = match.groups() 202 | if esc: 203 | return ('\\' + esc).decode('string-escape') 204 | if badgroup: 205 | raise PCREError(100, 'invalid group name') 206 | return '{%s}' % (index or group) 207 | return _REGEX_RE_TEMPLATE.sub(repl, escape_template(template)) 208 | 209 | def enable_re_template_mode(): 210 | # Makes calls to sub() take re templates instead of str.format() templates. 
211 | global Match 212 | Match = REMatch 213 | 214 | _ALNUM = frozenset('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890') 215 | error = PCREError = _pcre.PCREError 216 | NoMatch = _pcre.NoMatch 217 | MAXREPEAT = 65536 218 | 219 | # Provides PCRE build-time configuration. 220 | config = type('config', (), _pcre.get_config()) 221 | 222 | # Pattern and/or match flags 223 | _FLAGS = ('IGNORECASE', 'MULTILINE', 'DOTALL', 'UNICODE', 'VERBOSE', 224 | 'ANCHORED', 'NOTBOL', 'NOTEOL', 'NOTEMPTY', 'NOTEMPTY_ATSTART', 225 | 'UTF8', 'NO_UTF8_CHECK') 226 | 227 | # Copy flags from _pcre module 228 | ns = globals() 229 | for name in _FLAGS: 230 | ns[name] = getattr(_pcre, name) 231 | del ns, name 232 | 233 | # Short versions 234 | I = IGNORECASE 235 | M = MULTILINE 236 | S = DOTALL 237 | U = UNICODE 238 | X = VERBOSE 239 | 240 | # Study flags 241 | STUDY_JIT = _pcre.STUDY_JIT 242 | 243 | # Used to parse re templates. 244 | _REGEX_RE_TEMPLATE = compile(r'\\(?:([\\abfnrtv]|0[0-7]{0,2}|[0-7]{3})|' 245 | r'(\d{1,2})|g<(\d+|[^\d\W]\w*)>|(g[^>]*))') 246 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ python-pcre 4 | 5 | Copyright (c) 2012-2015, Arkadiusz Wahlig 6 | All rights reserved. 7 | 8 | Redistribution and use in source and binary forms, with or without 9 | modification, are permitted provided that the following conditions are met: 10 | * Redistributions of source code must retain the above copyright 11 | notice, this list of conditions and the following disclaimer. 12 | * Redistributions in binary form must reproduce the above copyright 13 | notice, this list of conditions and the following disclaimer in the 14 | documentation and/or other materials provided with the distribution. 15 | * Neither the name of the nor the 16 | names of its contributors may be used to endorse or promote products 17 | derived from this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 20 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 21 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL BE LIABLE FOR ANY 23 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 24 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 25 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 26 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 27 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 28 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
29 | """ 30 | 31 | from distutils.core import setup, Extension 32 | 33 | 34 | _pcre = Extension('_pcre', ['src/pcremodule.c'], 35 | libraries=['pcre'], 36 | extra_compile_args=['-fno-strict-aliasing']) 37 | 38 | 39 | setup(name='python-pcre', 40 | version='0.7', 41 | description='Python PCRE bindings', 42 | author='Arkadiusz Wahlig', 43 | url='https://github.com/awahlig/python-pcre', 44 | package_dir={'': 'python'}, 45 | py_modules=['pcre'], 46 | ext_modules=[_pcre]) 47 | -------------------------------------------------------------------------------- /src/pcremodule.c: -------------------------------------------------------------------------------- 1 | /* python-pcre 2 | 3 | Copyright (c) 2012-2015, Arkadiusz Wahlig 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | * Redistributions of source code must retain the above copyright 9 | notice, this list of conditions and the following disclaimer. 10 | * Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | * Neither the name of the nor the 14 | names of its contributors may be used to endorse or promote products 15 | derived from this software without specific prior written permission. 16 | 17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 18 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 19 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 20 | DISCLAIMED. IN NO EVENT SHALL BE LIABLE FOR ANY 21 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 22 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 23 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 24 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 25 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 26 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 27 | */ 28 | 29 | #include 30 | #include 31 | 32 | #include 33 | 34 | #if PY_MAJOR_VERSION >= 3 35 | # define PY3 36 | # define PyInt_FromLong PyLong_FromLong 37 | # if PY_VERSION_HEX >= 0x03030000 38 | # define PY3_NEW_UNICODE 39 | # endif 40 | #endif 41 | 42 | /* Custom errors/configs. */ 43 | #define PYPCRE_ERROR_STUDY (-50) 44 | #define PYPCRE_CONFIG_NONE (1000) 45 | #define PYPCRE_CONFIG_VERSION (1001) 46 | 47 | /* JIT was added in PCRE 8.20. */ 48 | #ifdef PCRE_STUDY_JIT_COMPILE 49 | # define PYPCRE_HAS_JIT_API 50 | #else 51 | # define PCRE_STUDY_JIT_COMPILE (0) 52 | # define PCRE_CONFIG_JIT PYPCRE_CONFIG_NONE 53 | # define PCRE_CONFIG_JITTARGET PYPCRE_CONFIG_NONE 54 | # define pcre_free_study pcre_free 55 | #endif 56 | 57 | /* Flag added in PCRE 8.34. */ 58 | #ifndef PCRE_CONFIG_PARENS_LIMIT 59 | # define PCRE_CONFIG_PARENS_LIMIT PYPCRE_CONFIG_NONE 60 | #endif 61 | 62 | static PyObject *PyExc_PCREError; 63 | static PyObject *PyExc_NoMatch; 64 | 65 | /* Used to hold UTF-8 data extracted from any of the supported 66 | * input objects in a most efficient way. 67 | */ 68 | typedef struct { 69 | const char *string; 70 | int length; 71 | PyObject *op; 72 | Py_buffer *buffer; 73 | } pypcre_string_t; 74 | 75 | /* Release buffer created by pypcre_buffer_get(). 
*/ 76 | static void 77 | pypcre_buffer_release(Py_buffer *buffer) 78 | { 79 | if (buffer) { 80 | PyBuffer_Release(buffer); 81 | PyMem_Free(buffer); 82 | } 83 | } 84 | 85 | /* Get new style buffer from object . */ 86 | static Py_buffer * 87 | pypcre_buffer_get(PyObject *op, int flags) 88 | { 89 | Py_buffer *view; 90 | 91 | view = (Py_buffer *)PyMem_Malloc(sizeof(Py_buffer)); 92 | if (view == NULL) { 93 | PyErr_NoMemory(); 94 | return NULL; 95 | } 96 | 97 | memset(view, 0, sizeof(Py_buffer)); 98 | if (PyObject_GetBuffer(op, view, flags) < 0) { 99 | PyMem_Free(view); 100 | return NULL; 101 | } 102 | 103 | return view; 104 | } 105 | 106 | /* Release string created by pypcre_string_get(). */ 107 | static void 108 | pypcre_string_release(pypcre_string_t *str) 109 | { 110 | if (str) { 111 | pypcre_buffer_release(str->buffer); 112 | Py_XDECREF(str->op); 113 | memset(str, 0, sizeof(pypcre_string_t)); 114 | } 115 | } 116 | 117 | /* Helper function handling buffers containing bytes. */ 118 | static int 119 | _string_get_from_bytes(pypcre_string_t *str, PyObject *op, int *options, 120 | Py_buffer *view, int viewrel) 121 | { 122 | const unsigned char *start = (const unsigned char *)view->buf; 123 | const unsigned char *p, *end = start + view->len; 124 | unsigned char c, *q; 125 | Py_ssize_t count = 0; 126 | 127 | if (!(*options & PCRE_UTF8)) { 128 | *options |= PCRE_NO_UTF8_CHECK; 129 | 130 | /* Count non-ascii bytes. */ 131 | for (p = start; p < end; ++p) { 132 | if (*p > 127) 133 | ++count; 134 | } 135 | } 136 | 137 | /* As-is if ascii or declared UTF-8 by caller. */ 138 | if (count == 0) { 139 | str->string = (const char *)view->buf; 140 | str->length = view->len; 141 | /* Save buffer if it needs to be released. */ 142 | if (viewrel) 143 | str->buffer = view; 144 | str->op = op; 145 | Py_INCREF(op); 146 | return 0; 147 | } 148 | 149 | /* Non-ascii characters will take two bytes. */ 150 | count += view->len; 151 | op = PyBytes_FromStringAndSize(NULL, count); 152 | if (op == NULL) { 153 | if (viewrel) 154 | pypcre_buffer_release(view); 155 | return -1; 156 | } 157 | 158 | q = (unsigned char *)PyBytes_AS_STRING(op); 159 | str->string = (const char *)q; 160 | str->length = count; 161 | str->op = op; 162 | 163 | /* Inline Latin1 -> UTF-8 conversion. */ 164 | for (p = start; p < end; ++p) { 165 | if ((c = *p) > 127) { 166 | *q++ = 0xc0 | (c >> 6); 167 | *q++ = 0x80 | (c & 0x3f); 168 | } 169 | else 170 | *q++ = c; 171 | } 172 | 173 | if (viewrel) 174 | pypcre_buffer_release(view); 175 | return 0; 176 | } 177 | 178 | #ifndef PY3_NEW_UNICODE 179 | /* Helper function handling buffers containing Py_UNICODE. */ 180 | static int 181 | _string_get_from_pyunicode(pypcre_string_t *str, int *options, Py_buffer *view, 182 | int viewrel, Py_ssize_t size) 183 | { 184 | PyObject *op; 185 | 186 | op = PyUnicode_EncodeUTF8((const Py_UNICODE *)view->buf, size, NULL); 187 | if (viewrel) 188 | pypcre_buffer_release(view); 189 | if (op == NULL) 190 | return -1; 191 | 192 | str->string = PyBytes_AS_STRING(op); 193 | str->length = PyBytes_GET_SIZE(op); 194 | str->op = op; 195 | *options |= PCRE_NO_UTF8_CHECK; 196 | return 0; 197 | } 198 | #endif 199 | 200 | /* Extract UTF-8 data from . 201 | * Sets str->string and str->length to UTF-8 data buffer. 202 | * Sets str->op to if no encoding was required or to a bytes object 203 | * owning str->string if encoded internally to UTF-8. 204 | * If PCRE_UTF8 option is set, bytes-like objects are assumed to be UTF-8. 205 | * Sets PCRE_NO_UTF8_CHECK option if encoded internally or ascii. 
206 | * Returns 0 if successful or sets an exception and returns -1 in case 207 | * of an error. 208 | */ 209 | static int 210 | pypcre_string_get(pypcre_string_t *str, PyObject *op, int *options) 211 | { 212 | memset(str, 0, sizeof(pypcre_string_t)); 213 | 214 | /* This is not strictly needed because bytes support the buffer 215 | * interface but it's more efficient. 216 | */ 217 | if (PyBytes_Check(op)) { 218 | Py_buffer view; 219 | 220 | view.buf = PyBytes_AS_STRING(op); 221 | view.len = PyBytes_GET_SIZE(op); 222 | return _string_get_from_bytes(str, op, options, &view, 0); 223 | } 224 | 225 | if (PyUnicode_Check(op)) { 226 | *options |= PCRE_NO_UTF8_CHECK; 227 | 228 | #ifdef PY3_NEW_UNICODE 229 | if (PyUnicode_READY(op) < 0) 230 | return -1; 231 | 232 | /* If stored as ascii, return as-is because ascii is a subset of UTF-8. */ 233 | if (PyUnicode_IS_ASCII(op)) { 234 | str->string = (const char *)PyUnicode_DATA(op); 235 | str->length = PyUnicode_GET_LENGTH(op); 236 | str->op = op; 237 | Py_INCREF(op); 238 | return 0; 239 | } 240 | #endif 241 | 242 | /* Encode into UTF-8 bytes object. */ 243 | op = PyUnicode_AsUTF8String(op); 244 | if (op == NULL) 245 | return -1; 246 | 247 | str->string = PyBytes_AS_STRING(op); 248 | str->length = PyBytes_GET_SIZE(op); 249 | str->op = op; 250 | return 0; 251 | } 252 | 253 | /* Try the new buffer interface, */ 254 | if (PyObject_CheckBuffer(op)) { 255 | Py_buffer *view; 256 | 257 | view = pypcre_buffer_get(op, PyBUF_ND); 258 | if (view == NULL) 259 | return -1; 260 | 261 | /* Buffer contains bytes. */ 262 | if (view->itemsize == 1 && view->ndim == 1) 263 | return _string_get_from_bytes(str, op, options, view, 1); 264 | 265 | #ifdef PY3_NEW_UNICODE 266 | else if ((view->itemsize == 2 || view->itemsize == 4) && view->ndim == 1) { 267 | /* Buffer contains 2-byte or 4-byte values. */ 268 | PyObject *unicode; 269 | 270 | /* Convert to unicode object. */ 271 | unicode = PyUnicode_FromKindAndData((view->itemsize == 2 ? PyUnicode_2BYTE_KIND : 272 | PyUnicode_4BYTE_KIND), view->buf, view->shape[0]); 273 | pypcre_buffer_release(view); 274 | if (unicode == NULL) 275 | return -1; 276 | 277 | /* Encode into UTF-8 bytes object. */ 278 | op = PyUnicode_AsUTF8String(unicode); 279 | Py_DECREF(unicode); 280 | if (op == NULL) 281 | return -1; 282 | 283 | str->string = PyBytes_AS_STRING(op); 284 | str->length = PyBytes_GET_SIZE(op); 285 | str->op = op; 286 | *options |= PCRE_NO_UTF8_CHECK; 287 | return 0; 288 | } 289 | #else 290 | /* Buffer contains Py_UNICODE values. */ 291 | else if (view->itemsize == sizeof(Py_UNICODE) && view->ndim == 1) 292 | return _string_get_from_pyunicode(str, options, view, 1, view->shape[0]); 293 | #endif 294 | 295 | pypcre_buffer_release(view); 296 | PyErr_SetString(PyExc_TypeError, "unsupported buffer format"); 297 | return -1; 298 | } 299 | 300 | #ifndef PY3 301 | /* Try the old buffer interface. */ 302 | if (Py_TYPE(op)->tp_as_buffer 303 | && Py_TYPE(op)->tp_as_buffer->bf_getreadbuffer 304 | && Py_TYPE(op)->tp_as_buffer->bf_getsegcount 305 | /* Must have exactly 1 segment. */ 306 | && Py_TYPE(op)->tp_as_buffer->bf_getsegcount(op, NULL) == 1) { 307 | 308 | Py_buffer view; 309 | Py_ssize_t size; 310 | 311 | /* Read the segment. */ 312 | view.len = Py_TYPE(op)->tp_as_buffer->bf_getreadbuffer(op, 0, &view.buf); 313 | if (view.len < 0) 314 | return -1; 315 | 316 | /* Get object length. */ 317 | size = PyObject_Size(op); 318 | if (size < 0) 319 | return -1; 320 | 321 | /* Segment contains bytes. 
*/ 322 | if (PyBytes_Check(op) || view.len == size) 323 | return _string_get_from_bytes(str, op, options, &view, 0); 324 | 325 | /* Segment contains unicode. */ 326 | else if (view.len == sizeof(Py_UNICODE) * size) 327 | return _string_get_from_pyunicode(str, options, &view, 0, size); 328 | 329 | PyErr_SetString(PyExc_TypeError, "buffer size mismatch"); 330 | return -1; 331 | } 332 | #endif 333 | 334 | /* Unsupported object type. */ 335 | PyErr_Format(PyExc_TypeError, "expected string or buffer, not %.200s", 336 | Py_TYPE(op)->tp_name); 337 | return -1; 338 | } 339 | 340 | #define ISUTF8(c) (((c) & 0xC0) != 0x80) 341 | #ifdef Py_UNICODE_WIDE 342 | #define UTF8LOOPBODY { \ 343 | (void)(ISUTF8(s[++i]) || ISUTF8(s[++i]) || ISUTF8(s[++i]) || ++i); ++charnum; } 344 | #else 345 | #define UTF8LOOPBODY { \ 346 | (void)(ISUTF8(s[++i]) || ISUTF8(s[++i]) || ++i); ++charnum; } 347 | #endif 348 | 349 | /* Converts UTF-8 byte offsets into character offsets. 350 | * If is specified, it must not be less than . 351 | */ 352 | static void 353 | pypcre_string_byte_to_char_offsets(const pypcre_string_t *str, int *pos, int *endpos) 354 | { 355 | const char *s = str->string; 356 | Py_ssize_t length = str->length; 357 | int charnum = 0, i = 0, offset; 358 | 359 | if (pos && (*pos >= 0)) { 360 | offset = *pos; 361 | while ((i < offset) && (i < length)) 362 | UTF8LOOPBODY 363 | *pos = charnum; 364 | } 365 | if (endpos && (*endpos >= 0)) { 366 | offset = *endpos; 367 | while ((i < offset) && (i < length)) 368 | UTF8LOOPBODY 369 | *endpos = charnum; 370 | } 371 | } 372 | 373 | /* Converts character offsets into UTF-8 byte offsets. 374 | * If is specified it must not be less than . 375 | */ 376 | static void 377 | pypcre_string_char_to_byte_offsets(const pypcre_string_t *str, int *pos, int *endpos) 378 | { 379 | const char *s = str->string; 380 | Py_ssize_t length = str->length; 381 | int charnum = 0, i = 0, offset; 382 | 383 | if (pos && (*pos >= 0)) { 384 | offset = *pos; 385 | while ((charnum < offset) && (i < length)) 386 | UTF8LOOPBODY 387 | *pos = i; 388 | } 389 | if (endpos && (*endpos >= 0)) { 390 | offset = *endpos; 391 | while ((charnum < offset) && (i < length)) 392 | UTF8LOOPBODY 393 | *endpos = i; 394 | } 395 | } 396 | 397 | /* Sets an exception from PCRE error code and error string. */ 398 | static void 399 | set_pcre_error(int rc, const char *s) 400 | { 401 | PyObject *op; 402 | 403 | switch (rc) { 404 | case PCRE_ERROR_NOMEMORY: 405 | PyErr_NoMemory(); 406 | break; 407 | 408 | case PCRE_ERROR_NOMATCH: 409 | PyErr_SetNone(PyExc_NoMatch); 410 | break; 411 | 412 | case 5: /* number too big in {} quantifier */ 413 | PyErr_SetString(PyExc_OverflowError, s); 414 | break; 415 | 416 | default: 417 | op = Py_BuildValue("(is)", rc, s); 418 | if (op) { 419 | PyErr_SetObject(PyExc_PCREError, op); 420 | Py_DECREF(op); 421 | } 422 | } 423 | } 424 | 425 | /* 426 | * Pattern 427 | */ 428 | 429 | typedef struct { 430 | PyObject_HEAD 431 | PyObject *pattern; /* as passed in */ 432 | PyObject *groupindex; /* name->index dict */ 433 | pcre *code; /* compiled pattern */ 434 | pcre_extra *extra; /* pcre_study result */ 435 | #ifdef PYPCRE_HAS_JIT_API 436 | pcre_jit_stack *jit_stack; /* user-allocated jit stack */ 437 | #endif 438 | int flags; /* as passed in */ 439 | int groups; /* capturing groups count */ 440 | } PyPatternObject; 441 | 442 | /* Returns 0 if Pattern.__init__ has been called or sets an exception 443 | * and returns -1 if not. 
Pattern.__init__ sets all fields in one go 444 | * so 0 means they can all be safely used. 445 | */ 446 | static int 447 | assert_pattern_ready(PyPatternObject *op) 448 | { 449 | if (op && op->code) 450 | return 0; 451 | 452 | PyErr_SetString(PyExc_AssertionError, "pattern not ready"); 453 | return -1; 454 | } 455 | 456 | /* Converts an object into group index or sets an exception and returns -1 457 | * if object is of bad type or value is out of range. 458 | * Supports int/long group indexes and str/unicode group names. 459 | */ 460 | static Py_ssize_t 461 | get_index(PyPatternObject *op, PyObject *index) 462 | { 463 | Py_ssize_t i = -1; 464 | 465 | #ifdef PY3 466 | if (PyLong_Check(index)) 467 | i = PyLong_AsSsize_t(index); 468 | #else 469 | if (PyInt_Check(index) || PyLong_Check(index)) 470 | i = PyInt_AsSsize_t(index); 471 | #endif 472 | 473 | else { 474 | index = PyDict_GetItem(op->groupindex, index); 475 | #ifdef PY3 476 | if (index && PyLong_Check(index)) 477 | i = PyLong_AsSsize_t(index); 478 | #else 479 | if (index && (PyInt_Check(index) || PyLong_Check(index))) 480 | i = PyInt_AsSsize_t(index); 481 | #endif 482 | } 483 | 484 | /* PyInt_AsSsize_t() may have failed. */ 485 | if (PyErr_Occurred()) 486 | return -1; 487 | 488 | /* Return group index if it's in range. */ 489 | if (i >= 0 && i <= op->groups) 490 | return i; 491 | 492 | PyErr_SetString(PyExc_IndexError, "no such group"); 493 | return -1; 494 | } 495 | 496 | /* Create a mapping from group names to group indexes. */ 497 | static PyObject * 498 | make_groupindex(pcre *code, int unicode) 499 | { 500 | PyObject *dict; 501 | int rc, index, count, size; 502 | const unsigned char *table; 503 | PyObject *key, *value; 504 | 505 | if ((rc = pcre_fullinfo(code, NULL, PCRE_INFO_NAMECOUNT, &count)) != 0 506 | || (rc = pcre_fullinfo(code, NULL, PCRE_INFO_NAMEENTRYSIZE, &size)) != 0 507 | || (rc = pcre_fullinfo(code, NULL, PCRE_INFO_NAMETABLE, &table)) != 0) { 508 | set_pcre_error(rc, "failed to query nametable properties"); 509 | return NULL; 510 | } 511 | 512 | dict = PyDict_New(); 513 | if (dict == NULL) 514 | return NULL; 515 | 516 | for (index = 0; index < count; ++index) { 517 | /* Group name starts from the third byte. Must not be empty. */ 518 | if (table[2] == 0) { 519 | Py_DECREF(dict); 520 | set_pcre_error(84, "group name must not be empty"); 521 | return NULL; 522 | } 523 | 524 | /* Create group name object. */ 525 | #ifdef PY3 526 | /* XXX re module in 3 always uses unicode here */ 527 | key = PyUnicode_FromString((const char *)(table + 2)); 528 | #else 529 | if (unicode) 530 | key = PyUnicode_FromString((const char *)(table + 2)); 531 | else 532 | key = PyBytes_FromString((const char *)(table + 2)); 533 | #endif 534 | if (key == NULL) { 535 | Py_DECREF(dict); 536 | return NULL; 537 | } 538 | 539 | /* First two bytes contain the group index. 
*/ 540 | value = PyInt_FromLong((table[0] << 8) | table[1]); 541 | if (value == NULL) { 542 | Py_DECREF(key); 543 | Py_DECREF(dict); 544 | return NULL; 545 | } 546 | 547 | rc = PyDict_SetItem(dict, key, value); 548 | Py_DECREF(value); 549 | Py_DECREF(key); 550 | if (rc < 0) { 551 | Py_DECREF(dict); 552 | return NULL; 553 | } 554 | 555 | table += size; 556 | } 557 | 558 | return dict; 559 | } 560 | 561 | static int 562 | pattern_init(PyPatternObject *self, PyObject *args, PyObject *kwds) 563 | { 564 | PyObject *pattern, *loads = NULL, *groupindex; 565 | int rc, groups, flags = 0; 566 | pcre *code; 567 | 568 | static const char *const kwlist[] = {"pattern", "flags", "loads", NULL}; 569 | 570 | if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|iS:__init__", (char **)kwlist, 571 | &pattern, &flags, &loads)) 572 | return -1; 573 | 574 | /* Patterns can be serialized using dumps() and then unserialized 575 | * using the "loads" argument. 576 | */ 577 | if (loads) { 578 | Py_ssize_t size; 579 | 580 | size = PyBytes_GET_SIZE(loads); 581 | code = pcre_malloc(size); 582 | if (code == NULL) { 583 | PyErr_NoMemory(); 584 | return -1; 585 | } 586 | 587 | memcpy(code, PyBytes_AS_STRING(loads), size); 588 | } 589 | else { 590 | pypcre_string_t str; 591 | const char *err = NULL; 592 | int o, options = flags; 593 | 594 | /* Extract UTF-8 string from the pattern object. Encode if needed. */ 595 | if (pypcre_string_get(&str, pattern, &options) < 0) 596 | return -1; 597 | 598 | /* Compile the regex. */ 599 | code = pcre_compile2(str.string, options | PCRE_UTF8, &rc, &err, &o, NULL); 600 | if (code == NULL) { 601 | PyObject *op; 602 | 603 | /* Convert byte offset into character offset if needed. */ 604 | if (str.op != pattern) 605 | pypcre_string_byte_to_char_offsets(&str, &o, NULL); 606 | pypcre_string_release(&str); 607 | 608 | op = PyBytes_FromFormat("%.200s at position %d", err, o); 609 | if (op) { 610 | /* Note. Compilation error codes are positive. */ 611 | set_pcre_error(rc, PyBytes_AS_STRING(op)); 612 | Py_DECREF(op); 613 | } 614 | return -1; 615 | } 616 | pypcre_string_release(&str); 617 | } 618 | 619 | /* Get number of capturing groups. */ 620 | if ((rc = pcre_fullinfo(code, NULL, PCRE_INFO_CAPTURECOUNT, &groups)) != 0) { 621 | pcre_free(code); 622 | set_pcre_error(rc, "failed to query number of capturing groups"); 623 | return -1; 624 | } 625 | 626 | /* Create a dict mapping named group names to their indexes. 
*/ 627 | groupindex = make_groupindex(code, PyUnicode_Check(pattern)); 628 | if (groupindex == NULL) { 629 | pcre_free(code); 630 | return -1; 631 | } 632 | 633 | pcre_free(self->code); 634 | self->code = code; 635 | 636 | Py_CLEAR(self->pattern); 637 | self->pattern = pattern; 638 | Py_INCREF(pattern); 639 | 640 | Py_CLEAR(self->groupindex); 641 | self->groupindex = groupindex; 642 | 643 | self->flags = flags; 644 | self->groups = groups; 645 | 646 | return 0; 647 | } 648 | 649 | static void 650 | pattern_dealloc(PyPatternObject *self) 651 | { 652 | Py_XDECREF(self->pattern); 653 | Py_XDECREF(self->groupindex); 654 | pcre_free(self->code); 655 | pcre_free_study(self->extra); 656 | #ifdef PYPCRE_HAS_JIT_API 657 | if (self->jit_stack) 658 | pcre_jit_stack_free(self->jit_stack); 659 | #endif 660 | Py_TYPE(self)->tp_free(self); 661 | } 662 | 663 | static PyObject * 664 | pattern_study(PyPatternObject *self, PyObject *args) 665 | { 666 | int options = 0; 667 | const char *err = NULL; 668 | pcre_extra *extra; 669 | 670 | if (!PyArg_ParseTuple(args, "|i:study", &options)) 671 | return NULL; 672 | 673 | if (assert_pattern_ready(self) < 0) 674 | return NULL; 675 | 676 | /* Study the pattern. */ 677 | extra = pcre_study(self->code, options, &err); 678 | if (err) { 679 | set_pcre_error(PYPCRE_ERROR_STUDY, err); 680 | return NULL; 681 | } 682 | 683 | /* Replace previous study results. */ 684 | pcre_free_study(self->extra); 685 | self->extra = extra; 686 | 687 | /* Return True if studying the pattern produced additional 688 | * information that will help speed up matching. 689 | */ 690 | return PyBool_FromLong(extra != NULL); 691 | } 692 | 693 | static PyObject * 694 | pattern_set_jit_stack(PyPatternObject *self, PyObject *args) 695 | { 696 | int startsize, maxsize; 697 | #ifdef PYPCRE_HAS_JIT_API 698 | int rc, jit; 699 | pcre_jit_stack *stack; 700 | #endif 701 | 702 | if (!PyArg_ParseTuple(args, "ii", &startsize, &maxsize)) 703 | return NULL; 704 | 705 | #ifdef PYPCRE_HAS_JIT_API 706 | /* Check whether PCRE library has been built with JIT support. */ 707 | if ((rc = pcre_config(PCRE_CONFIG_JIT, &jit)) != 0) { 708 | set_pcre_error(rc, "failed to query JIT support"); 709 | return NULL; 710 | } 711 | 712 | /* Error if no JIT support. */ 713 | if (!jit) { 714 | PyErr_SetString(PyExc_AssertionError, "PCRE library built without JIT support"); 715 | return NULL; 716 | } 717 | 718 | /* Assigning a new JIT stack requires a studied pattern. */ 719 | if (self->extra == NULL) { 720 | PyErr_SetString(PyExc_AssertionError, "pattern must be studied first"); 721 | return NULL; 722 | } 723 | 724 | /* Allocate new JIT stack. */ 725 | stack = pcre_jit_stack_alloc(startsize, maxsize); 726 | if (stack == NULL) { 727 | PyErr_NoMemory(); 728 | return NULL; 729 | } 730 | 731 | /* Release old stack and assign the new one. */ 732 | if (self->jit_stack) 733 | pcre_jit_stack_free(self->jit_stack); 734 | self->jit_stack = stack; 735 | pcre_assign_jit_stack(self->extra, NULL, stack); 736 | 737 | Py_RETURN_NONE; 738 | 739 | #else 740 | /* JIT API not supported. */ 741 | PyErr_SetString(PyExc_AssertionError, "PCRE library too old"); 742 | return NULL; 743 | 744 | #endif 745 | } 746 | 747 | /* Serializes a pattern into a string. 748 | * Pattern can be unserialized using the "loads" argument of __init__. 
749 | */ 750 | static PyObject * 751 | pattern_dumps(PyPatternObject *self) 752 | { 753 | size_t size; 754 | int rc; 755 | 756 | if (assert_pattern_ready(self) < 0) 757 | return NULL; 758 | 759 | rc = pcre_fullinfo(self->code, NULL, PCRE_INFO_SIZE, &size); 760 | if (rc != 0) { 761 | set_pcre_error(rc, "failed to query pattern size"); 762 | return NULL; 763 | } 764 | return PyBytes_FromStringAndSize((char *)self->code, size); 765 | } 766 | 767 | static PyObject * 768 | pattern_richcompare(PyPatternObject *self, PyObject *other, int op); 769 | 770 | static const PyMethodDef pattern_methods[] = { 771 | {"study", (PyCFunction)pattern_study, METH_VARARGS}, 772 | {"set_jit_stack", (PyCFunction)pattern_set_jit_stack, METH_VARARGS}, 773 | {"dumps", (PyCFunction)pattern_dumps, METH_NOARGS}, 774 | {NULL} /* sentinel */ 775 | }; 776 | 777 | static const PyMemberDef pattern_members[] = { 778 | {"pattern", T_OBJECT, offsetof(PyPatternObject, pattern), READONLY}, 779 | {"flags", T_INT, offsetof(PyPatternObject, flags), READONLY}, 780 | {"groups", T_INT, offsetof(PyPatternObject, groups), READONLY}, 781 | {"groupindex", T_OBJECT, offsetof(PyPatternObject, groupindex), READONLY}, 782 | {NULL} /* sentinel */ 783 | }; 784 | 785 | static PyTypeObject PyPattern_Type = { 786 | PyVarObject_HEAD_INIT(NULL, 0) 787 | "_pcre.Pattern", /* tp_name */ 788 | sizeof(PyPatternObject), /* tp_basicsize */ 789 | 0, /* tp_itemsize */ 790 | (destructor)pattern_dealloc, /* tp_dealloc */ 791 | 0, /* tp_print */ 792 | 0, /* tp_getattr */ 793 | 0, /* tp_setattr */ 794 | 0, /* tp_compare */ 795 | 0, /* tp_repr */ 796 | 0, /* tp_as_number */ 797 | 0, /* tp_as_sequence */ 798 | 0, /* tp_as_mapping */ 799 | 0, /* tp_hash */ 800 | 0, /* tp_call */ 801 | 0, /* tp_str */ 802 | 0, /* tp_getattro */ 803 | 0, /* tp_setattro */ 804 | 0, /* tp_as_buffer */ 805 | Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE, /* tp_flags */ 806 | 0, /* tp_doc */ 807 | 0, /* tp_traverse */ 808 | 0, /* tp_clear */ 809 | (richcmpfunc)pattern_richcompare, /* tp_richcompare */ 810 | 0, /* tp_weaklistoffset */ 811 | 0, /* tp_iter */ 812 | 0, /* tp_iternext */ 813 | (PyMethodDef *)pattern_methods, /* tp_methods */ 814 | (PyMemberDef *)pattern_members, /* tp_members */ 815 | 0, /* tp_getset */ 816 | 0, /* tp_base */ 817 | 0, /* tp_dict */ 818 | 0, /* tp_descr_get */ 819 | 0, /* tp_descr_set */ 820 | 0, /* tp_dictoffset */ 821 | (initproc)pattern_init, /* tp_init */ 822 | 0, /* tp_alloc */ 823 | 0, /* tp_new */ 824 | 0, /* tp_free */ 825 | }; 826 | 827 | static PyObject * 828 | pattern_richcompare(PyPatternObject *self, PyObject *otherobj, int op) 829 | { 830 | PyPatternObject *other; 831 | int equal, rc; 832 | size_t size, other_size; 833 | 834 | /* Only == and != comparisons to another pattern supported. */ 835 | if (!PyObject_TypeCheck(otherobj, &PyPattern_Type) || (op != Py_EQ && op != Py_NE)) { 836 | Py_INCREF(Py_NotImplemented); 837 | return Py_NotImplemented; 838 | } 839 | 840 | other = (PyPatternObject *)otherobj; 841 | if (self->code == other->code) 842 | equal = 1; 843 | else if (self->code == NULL || other->code == NULL) 844 | equal = 0; 845 | else if ((rc = pcre_fullinfo(self->code, NULL, PCRE_INFO_SIZE, &size)) != 0 846 | || (rc = pcre_fullinfo(other->code, NULL, PCRE_INFO_SIZE, &other_size)) != 0) { 847 | set_pcre_error(rc, "failed to query pattern size"); 848 | return NULL; 849 | } 850 | else if (size != other_size) 851 | equal = 0; 852 | else 853 | equal = (memcmp(self->code, other->code, size) == 0); 854 | 855 | return PyBool_FromLong(op == Py_EQ ? 
equal : !equal); 856 | } 857 | 858 | /* 859 | * Match 860 | */ 861 | 862 | typedef struct { 863 | PyObject_HEAD 864 | PyPatternObject *pattern; /* pattern instance */ 865 | PyObject *subject; /* as passed in */ 866 | pypcre_string_t str; /* UTF-8 string */ 867 | int *ovector; /* matched spans */ 868 | int startpos; /* after boundary checks */ 869 | int endpos; /* after boundary checks */ 870 | int flags; /* as passed in */ 871 | int lastindex; /* returned by pcre_exec */ 872 | } PyMatchObject; 873 | 874 | /* Returns 0 if Match.__init__ has been called or sets an exception 875 | * and returns -1 if not. Match.__init__ sets all fields in one go 876 | * so 0 means they can all be safely used. 877 | */ 878 | static int 879 | assert_match_ready(PyMatchObject *op) 880 | { 881 | if (op && op->ovector) 882 | return 0; 883 | 884 | PyErr_SetString(PyExc_AssertionError, "match not ready"); 885 | return -1; 886 | } 887 | 888 | /* Retrieves offsets into the subject object for given group. Both 889 | * pos and endpos can be NULL if not used. Returns 0 if successful 890 | * or sets an exception and returns -1 in case of an error. 891 | */ 892 | static int 893 | get_span(PyMatchObject *op, Py_ssize_t index, int *pos, int *endpos) 894 | { 895 | if (index < 0 || index > op->pattern->groups) { 896 | PyErr_SetString(PyExc_IndexError, "no such group"); 897 | return -1; 898 | } 899 | 900 | if (pos) 901 | *pos = op->ovector[index * 2]; 902 | if (endpos) 903 | *endpos = op->ovector[index * 2 + 1]; 904 | 905 | /* Sanity check. */ 906 | if (pos && endpos && (*pos > *endpos) && (*endpos >= 0)) { 907 | PyErr_SetString(PyExc_RuntimeError, "bad span"); 908 | return -1; 909 | } 910 | 911 | /* If subject has been encoded internally to UTF-8, 912 | * convert byte offsets into character offsets. 913 | */ 914 | if (op->subject != op->str.op) 915 | pypcre_string_byte_to_char_offsets(&op->str, pos, endpos); 916 | 917 | return 0; 918 | } 919 | 920 | /* Slices the subject string using offsets from given group. If group 921 | * has no match, returns the default object. Returns new reference. 922 | */ 923 | static PyObject * 924 | get_slice(PyMatchObject *op, Py_ssize_t index, PyObject *def) 925 | { 926 | int pos, endpos; 927 | 928 | if (get_span(op, index, &pos, &endpos) < 0) 929 | return NULL; 930 | 931 | if (pos >= 0 && endpos >= 0) 932 | return PySequence_GetSlice(op->subject, pos, endpos); 933 | 934 | Py_INCREF(def); 935 | return def; 936 | } 937 | 938 | /* Same as get_slice() but takes PyObject group index. */ 939 | static PyObject * 940 | get_slice_o(PyMatchObject *op, PyObject *index, PyObject *def) 941 | { 942 | Py_ssize_t i = get_index(op->pattern, index); 943 | if (i < 0) 944 | return NULL; 945 | 946 | return get_slice(op, i, def); 947 | } 948 | 949 | static int 950 | match_init(PyMatchObject *self, PyObject *args, PyObject *kwds) 951 | { 952 | PyPatternObject *pattern; 953 | PyObject *subject; 954 | int pos = -1, endpos = -1, flags = 0, options, *ovector, ovecsize, startoffset, size, rc; 955 | pypcre_string_t str; 956 | 957 | static const char *const kwlist[] = {"pattern", "string", "pos", "endpos", "flags", NULL}; 958 | 959 | if (!PyArg_ParseTupleAndKeywords(args, kwds, "O!O|iii:__init__", (char **)kwlist, 960 | &PyPattern_Type, &pattern, &subject, &pos, &endpos, &flags)) 961 | return -1; 962 | 963 | if (assert_pattern_ready(pattern) < 0) 964 | return -1; 965 | 966 | /* Extract UTF-8 string from the subject object. Encode if needed. 
*/ 967 | options = flags; 968 | if (pypcre_string_get(&str, subject, &options) < 0) 969 | return -1; 970 | 971 | /* Check bounds. */ 972 | if (pos < 0) 973 | pos = 0; 974 | if (endpos < 0 || endpos > str.length) 975 | endpos = str.length; 976 | if (pos > endpos) { 977 | pypcre_string_release(&str); 978 | PyErr_SetNone(PyExc_NoMatch); 979 | return -1; 980 | } 981 | 982 | /* If subject has been encoded internally, convert provided character offsets 983 | * into byte offsets. 984 | */ 985 | startoffset = pos; 986 | size = endpos; 987 | if (str.op != subject) 988 | pypcre_string_char_to_byte_offsets(&str, &startoffset, &size); 989 | 990 | /* Create ovector array. */ 991 | ovecsize = (pattern->groups + 1) * 3; 992 | ovector = pcre_malloc(ovecsize * sizeof(int)); 993 | if (ovector == NULL) { 994 | pypcre_string_release(&str); 995 | PyErr_NoMemory(); 996 | return -1; 997 | } 998 | 999 | /* Perform the match. */ 1000 | rc = pcre_exec(pattern->code, pattern->extra, str.string, size, startoffset, 1001 | options & ~PCRE_UTF8, ovector, ovecsize); 1002 | if (rc < 0) { 1003 | pypcre_string_release(&str); 1004 | pcre_free(ovector); 1005 | set_pcre_error(rc, "failed to match pattern"); 1006 | return -1; 1007 | } 1008 | 1009 | Py_CLEAR(self->pattern); 1010 | self->pattern = pattern; 1011 | Py_INCREF(pattern); 1012 | 1013 | Py_CLEAR(self->subject); 1014 | self->subject = subject; 1015 | Py_INCREF(subject); 1016 | 1017 | pypcre_string_release(&self->str); 1018 | memcpy(&self->str, &str, sizeof(pypcre_string_t)); 1019 | 1020 | pcre_free(self->ovector); 1021 | self->ovector = ovector; 1022 | 1023 | self->startpos = pos; 1024 | self->endpos = endpos; 1025 | self->flags = flags; 1026 | self->lastindex = rc - 1; 1027 | 1028 | return 0; 1029 | } 1030 | 1031 | static void 1032 | match_dealloc(PyMatchObject *self) 1033 | { 1034 | Py_XDECREF(self->pattern); 1035 | Py_XDECREF(self->subject); 1036 | pypcre_string_release(&self->str); 1037 | pcre_free(self->ovector); 1038 | Py_TYPE(self)->tp_free(self); 1039 | } 1040 | 1041 | static PyObject * 1042 | match_group(PyMatchObject *self, PyObject *args) 1043 | { 1044 | PyObject *result; 1045 | Py_ssize_t i, size; 1046 | 1047 | if (assert_match_ready(self) < 0) 1048 | return NULL; 1049 | 1050 | size = PyTuple_GET_SIZE(args); 1051 | switch (size) { 1052 | case 0: /* no args -- return the whole match */ 1053 | result = get_slice(self, 0, Py_None); 1054 | break; 1055 | 1056 | case 1: /* one arg -- return a single slice */ 1057 | result = get_slice_o(self, PyTuple_GET_ITEM(args, 0), Py_None); 1058 | break; 1059 | 1060 | default: /* more than one arg -- return a tuple of slices */ 1061 | result = PyTuple_New(size); 1062 | if (result == NULL) 1063 | return NULL; 1064 | for (i = 0; i < size; ++i) { 1065 | PyObject *item = get_slice_o(self, PyTuple_GET_ITEM(args, i), Py_None); 1066 | if (item == NULL) { 1067 | Py_DECREF(result); 1068 | return NULL; 1069 | } 1070 | PyTuple_SET_ITEM(result, i, item); 1071 | } 1072 | break; 1073 | } 1074 | 1075 | return result; 1076 | } 1077 | 1078 | static PyObject * 1079 | match_start(PyMatchObject *self, PyObject *args) 1080 | { 1081 | PyObject *index = NULL; 1082 | Py_ssize_t i = 0; 1083 | int pos; 1084 | 1085 | if (!PyArg_UnpackTuple(args, "start", 0, 1, &index)) 1086 | return NULL; 1087 | 1088 | if (assert_match_ready(self) < 0) 1089 | return NULL; 1090 | 1091 | if (index) { 1092 | i = get_index(self->pattern, index); 1093 | if (i < 0) 1094 | return NULL; 1095 | } 1096 | 1097 | if (get_span(self, i, &pos, NULL) < 0) 1098 | return NULL; 1099 | 1100 
| return PyInt_FromLong(pos); 1101 | } 1102 | 1103 | static PyObject * 1104 | match_end(PyMatchObject *self, PyObject *args) 1105 | { 1106 | PyObject *index = NULL; 1107 | Py_ssize_t i = 0; 1108 | int endpos; 1109 | 1110 | if (!PyArg_UnpackTuple(args, "end", 0, 1, &index)) 1111 | return NULL; 1112 | 1113 | if (assert_match_ready(self) < 0) 1114 | return NULL; 1115 | 1116 | if (index) { 1117 | i = get_index(self->pattern, index); 1118 | if (i < 0) 1119 | return NULL; 1120 | } 1121 | 1122 | if (get_span(self, i, NULL, &endpos) < 0) 1123 | return NULL; 1124 | 1125 | return PyInt_FromLong(endpos); 1126 | } 1127 | 1128 | static PyObject * 1129 | match_span(PyMatchObject *self, PyObject *args) 1130 | { 1131 | PyObject *index = NULL; 1132 | Py_ssize_t i = 0; 1133 | int pos, endpos; 1134 | 1135 | if (!PyArg_UnpackTuple(args, "span", 0, 1, &index)) 1136 | return NULL; 1137 | 1138 | if (assert_match_ready(self) < 0) 1139 | return NULL; 1140 | 1141 | if (index) { 1142 | i = get_index(self->pattern, index); 1143 | if (i < 0) 1144 | return NULL; 1145 | } 1146 | 1147 | if (get_span(self, i, &pos, &endpos) < 0) 1148 | return NULL; 1149 | 1150 | return Py_BuildValue("(ii)", pos, endpos); 1151 | } 1152 | 1153 | static PyObject * 1154 | match_groups(PyMatchObject *self, PyObject *args) 1155 | { 1156 | PyObject *result; 1157 | PyObject *def = Py_None; 1158 | Py_ssize_t index; 1159 | 1160 | if (!PyArg_UnpackTuple(args, "groups", 0, 1, &def)) 1161 | return NULL; 1162 | 1163 | if (assert_match_ready(self) < 0) 1164 | return NULL; 1165 | 1166 | result = PyTuple_New(self->pattern->groups); 1167 | if (result == NULL) 1168 | return NULL; 1169 | 1170 | for (index = 1; index <= self->pattern->groups; ++index) { 1171 | PyObject *item = get_slice(self, index, def); 1172 | if (item == NULL) { 1173 | Py_DECREF(result); 1174 | return NULL; 1175 | } 1176 | PyTuple_SET_ITEM(result, index - 1, item); 1177 | } 1178 | 1179 | return result; 1180 | } 1181 | 1182 | static PyObject * 1183 | match_groupdict(PyMatchObject *self, PyObject *args) 1184 | { 1185 | PyObject *def = Py_None; 1186 | PyObject *dict, *key, *value; 1187 | Py_ssize_t pos; 1188 | int rc; 1189 | 1190 | if (!PyArg_UnpackTuple(args, "groupdict", 0, 1, &def)) 1191 | return NULL; 1192 | 1193 | if (assert_match_ready(self) < 0) 1194 | return NULL; 1195 | 1196 | dict = PyDict_New(); 1197 | if (dict == NULL) 1198 | return NULL; 1199 | 1200 | pos = 0; 1201 | while (PyDict_Next(self->pattern->groupindex, &pos, &key, &value)) { 1202 | value = get_slice_o(self, value, def); 1203 | if (value == NULL) { 1204 | Py_DECREF(dict); 1205 | return NULL; 1206 | } 1207 | rc = PyDict_SetItem(dict, key, value); 1208 | Py_DECREF(value); 1209 | if (rc < 0) { 1210 | Py_DECREF(dict); 1211 | return NULL; 1212 | } 1213 | } 1214 | 1215 | return dict; 1216 | } 1217 | 1218 | static PyObject * 1219 | match_lastindex_getter(PyMatchObject *self, void *closure) 1220 | { 1221 | if (self->lastindex > 0) 1222 | return PyInt_FromLong(self->lastindex); 1223 | Py_RETURN_NONE; 1224 | } 1225 | 1226 | static PyObject * 1227 | match_lastgroup_getter(PyMatchObject *self, void *closure) 1228 | { 1229 | PyObject *key, *value; 1230 | Py_ssize_t pos; 1231 | 1232 | if (assert_match_ready(self) < 0) 1233 | return NULL; 1234 | 1235 | /* Simple reverse lookup into groupindex. 
*/ 1236 | pos = 0; 1237 | while (PyDict_Next(self->pattern->groupindex, &pos, &key, &value)) { 1238 | #ifdef PY3 1239 | if (PyLong_Check(value) && PyLong_AS_LONG(value) == self->lastindex) 1240 | #else 1241 | if (PyInt_Check(value) && PyInt_AS_LONG(value) == self->lastindex) 1242 | #endif 1243 | { 1244 | Py_INCREF(key); 1245 | return key; 1246 | } 1247 | } 1248 | 1249 | Py_RETURN_NONE; 1250 | } 1251 | 1252 | static PyObject * 1253 | match_regs_getter(PyMatchObject *self, void *closure) 1254 | { 1255 | PyObject *regs, *item; 1256 | Py_ssize_t count, i; 1257 | 1258 | if (assert_match_ready(self) < 0) 1259 | return NULL; 1260 | 1261 | count = self->pattern->groups + 1; 1262 | regs = PyTuple_New(count); 1263 | if (regs == NULL) 1264 | return NULL; 1265 | 1266 | for (i = 0; i < count; ++i) { 1267 | item = Py_BuildValue("(ii)", self->ovector[(i * 2)], 1268 | self->ovector[(i * 2) + 1]); 1269 | if (item == NULL) { 1270 | Py_DECREF(regs); 1271 | return NULL; 1272 | } 1273 | PyTuple_SET_ITEM(regs, i, item); 1274 | } 1275 | 1276 | return regs; 1277 | } 1278 | 1279 | static const PyMethodDef match_methods[] = { 1280 | {"group", (PyCFunction)match_group, METH_VARARGS}, 1281 | {"start", (PyCFunction)match_start, METH_VARARGS}, 1282 | {"end", (PyCFunction)match_end, METH_VARARGS}, 1283 | {"span", (PyCFunction)match_span, METH_VARARGS}, 1284 | {"groups", (PyCFunction)match_groups, METH_VARARGS}, 1285 | {"groupdict", (PyCFunction)match_groupdict, METH_VARARGS}, 1286 | {NULL} /* sentinel */ 1287 | }; 1288 | 1289 | static const PyGetSetDef match_getset[] = { 1290 | {"lastindex", (getter)match_lastindex_getter}, 1291 | {"lastgroup", (getter)match_lastgroup_getter}, 1292 | {"regs", (getter)match_regs_getter}, 1293 | {NULL} /* sentinel */ 1294 | }; 1295 | 1296 | static const PyMemberDef match_members[] = { 1297 | {"string", T_OBJECT, offsetof(PyMatchObject, subject), READONLY}, 1298 | {"re", T_OBJECT, offsetof(PyMatchObject, pattern), READONLY}, 1299 | {"pos", T_INT, offsetof(PyMatchObject, startpos), READONLY}, 1300 | {"endpos", T_INT, offsetof(PyMatchObject, endpos), READONLY}, 1301 | {"flags", T_INT, offsetof(PyMatchObject, flags), READONLY}, 1302 | {NULL} /* sentinel */ 1303 | }; 1304 | 1305 | static PyTypeObject PyMatch_Type = { 1306 | PyVarObject_HEAD_INIT(NULL, 0) 1307 | "_pcre.Match", /* tp_name */ 1308 | sizeof(PyMatchObject), /* tp_basicsize */ 1309 | 0, /* tp_itemsize */ 1310 | (destructor)match_dealloc, /* tp_dealloc */ 1311 | 0, /* tp_print */ 1312 | 0, /* tp_getattr */ 1313 | 0, /* tp_setattr */ 1314 | 0, /* tp_compare */ 1315 | 0, /* tp_repr */ 1316 | 0, /* tp_as_number */ 1317 | 0, /* tp_as_sequence */ 1318 | 0, /* tp_as_mapping */ 1319 | 0, /* tp_hash */ 1320 | 0, /* tp_call */ 1321 | 0, /* tp_str */ 1322 | 0, /* tp_getattro */ 1323 | 0, /* tp_setattro */ 1324 | 0, /* tp_as_buffer */ 1325 | Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE, /* tp_flags */ 1326 | 0, /* tp_doc */ 1327 | 0, /* tp_traverse */ 1328 | 0, /* tp_clear */ 1329 | 0, /* tp_richcompare */ 1330 | 0, /* tp_weaklistoffset */ 1331 | 0, /* tp_iter */ 1332 | 0, /* tp_iternext */ 1333 | (PyMethodDef *)match_methods, /* tp_methods */ 1334 | (PyMemberDef *)match_members, /* tp_members */ 1335 | (PyGetSetDef *)match_getset, /* tp_getset */ 1336 | 0, /* tp_base */ 1337 | 0, /* tp_dict */ 1338 | 0, /* tp_descr_get */ 1339 | 0, /* tp_descr_set */ 1340 | 0, /* tp_dictoffset */ 1341 | (initproc)match_init, /* tp_init */ 1342 | 0, /* tp_alloc */ 1343 | 0, /* tp_new */ 1344 | 0, /* tp_free */ 1345 | }; 1346 | 1347 | /* 1348 | * _pcre 1349 | */ 
1350 | 1351 | static int 1352 | _config_do_get_int(PyObject *dict, const char *name, int what, int boolean) 1353 | { 1354 | int rc, value = 0; 1355 | PyObject *op; 1356 | 1357 | if (what != PYPCRE_CONFIG_NONE) 1358 | pcre_config(what, &value); 1359 | 1360 | if (boolean) 1361 | op = PyBool_FromLong(value); 1362 | else 1363 | op = PyInt_FromLong(value); 1364 | if (op == NULL) 1365 | return -1; 1366 | 1367 | rc = PyDict_SetItemString(dict, name, op); 1368 | Py_DECREF(op); 1369 | return rc; 1370 | } 1371 | 1372 | static unsigned long 1373 | _config_get_ulong(PyObject *dict, const char *name, int what) 1374 | { 1375 | int rc; 1376 | unsigned long value = 0; 1377 | PyObject *op; 1378 | 1379 | if (what != PYPCRE_CONFIG_NONE) 1380 | pcre_config(what, &value); 1381 | 1382 | op = PyInt_FromLong(value); 1383 | if (op == NULL) 1384 | return -1; 1385 | 1386 | rc = PyDict_SetItemString(dict, name, op); 1387 | Py_DECREF(op); 1388 | return rc; 1389 | } 1390 | 1391 | static int 1392 | _config_get_int(PyObject *dict, const char *name, int what) 1393 | { 1394 | return _config_do_get_int(dict, name, what, 0); 1395 | } 1396 | 1397 | static int 1398 | _config_get_bool(PyObject *dict, const char *name, int what) 1399 | { 1400 | return _config_do_get_int(dict, name, what, 1); 1401 | } 1402 | 1403 | static int 1404 | _config_get_str(PyObject *dict, const char *name, int what) 1405 | { 1406 | int rc; 1407 | const char *value = NULL; 1408 | PyObject *op; 1409 | 1410 | if (what == PYPCRE_CONFIG_VERSION) 1411 | value = pcre_version(); 1412 | else if (what != PYPCRE_CONFIG_NONE) 1413 | pcre_config(what, &value); 1414 | 1415 | if (value == NULL) 1416 | value = ""; 1417 | 1418 | #ifdef PY3 1419 | op = PyUnicode_FromString(value); 1420 | #else 1421 | op = PyBytes_FromString(value); 1422 | #endif 1423 | if (op == NULL) 1424 | return -1; 1425 | 1426 | rc = PyDict_SetItemString(dict, name, op); 1427 | Py_DECREF(op); 1428 | return rc; 1429 | } 1430 | 1431 | static PyObject * 1432 | get_config(PyObject *self) 1433 | { 1434 | PyObject *dict; 1435 | 1436 | dict = PyDict_New(); 1437 | if (dict == NULL) 1438 | return NULL; 1439 | 1440 | if (_config_get_str(dict, "version", PYPCRE_CONFIG_VERSION) < 0 1441 | || _config_get_bool(dict, "utf_8", PCRE_CONFIG_UTF8) < 0 1442 | || _config_get_bool(dict, "unicode_properties", PCRE_CONFIG_UNICODE_PROPERTIES) < 0 1443 | || _config_get_bool(dict, "jit", PCRE_CONFIG_JIT) < 0 1444 | || _config_get_str(dict, "jit_target", PCRE_CONFIG_JITTARGET) < 0 1445 | || _config_get_int(dict, "newline", PCRE_CONFIG_NEWLINE) < 0 1446 | || _config_get_bool(dict, "bsr", PCRE_CONFIG_BSR) < 0 1447 | || _config_get_int(dict, "link_size", PCRE_CONFIG_LINK_SIZE) < 0 1448 | || _config_get_ulong(dict, "parens_limit", PCRE_CONFIG_PARENS_LIMIT) < 0 1449 | || _config_get_ulong(dict, "match_limit", PCRE_CONFIG_MATCH_LIMIT) < 0 1450 | || _config_get_ulong(dict, "match_limit_recursion", PCRE_CONFIG_MATCH_LIMIT_RECURSION) < 0 1451 | || _config_get_bool(dict, "stack_recurse", PCRE_CONFIG_STACKRECURSE) < 0) { 1452 | Py_DECREF(dict); 1453 | return NULL; 1454 | } 1455 | 1456 | return dict; 1457 | } 1458 | 1459 | static const PyMethodDef pypcre_methods[] = { 1460 | {"get_config", (PyCFunction)get_config, METH_NOARGS}, 1461 | {NULL} /* sentinel */ 1462 | }; 1463 | 1464 | #ifdef PY3 1465 | static PyModuleDef pypcre_module = { 1466 | PyModuleDef_HEAD_INIT, 1467 | "_pcre", 1468 | NULL, 1469 | -1, 1470 | (PyMethodDef *)pypcre_methods 1471 | }; 1472 | #endif 1473 | 1474 | static PyObject * 1475 | pypcre_init(void) 1476 | { 1477 | 
PyObject *m; 1478 | 1479 | /* Use Python memory manager for PCRE allocations. */ 1480 | pcre_malloc = PyMem_Malloc; 1481 | pcre_free = PyMem_Free; 1482 | 1483 | /* _pcre */ 1484 | #ifdef PY3 1485 | m = PyModule_Create(&pypcre_module); 1486 | #else 1487 | m = Py_InitModule("_pcre", (PyMethodDef *)pypcre_methods); 1488 | #endif 1489 | if (m == NULL) 1490 | return NULL; 1491 | 1492 | /* Pattern */ 1493 | PyPattern_Type.tp_new = PyType_GenericNew; 1494 | PyType_Ready(&PyPattern_Type); 1495 | Py_INCREF(&PyPattern_Type); 1496 | PyModule_AddObject(m, "Pattern", (PyObject *)&PyPattern_Type); 1497 | 1498 | /* Match */ 1499 | PyMatch_Type.tp_new = PyType_GenericNew; 1500 | PyType_Ready(&PyMatch_Type); 1501 | Py_INCREF(&PyMatch_Type); 1502 | PyModule_AddObject(m, "Match", (PyObject *)&PyMatch_Type); 1503 | 1504 | /* NoMatch exception */ 1505 | PyExc_NoMatch = PyErr_NewException("pcre.NoMatch", 1506 | PyExc_Exception, NULL); 1507 | Py_INCREF(PyExc_NoMatch); 1508 | PyModule_AddObject(m, "NoMatch", PyExc_NoMatch); 1509 | 1510 | /* PCREError exception */ 1511 | PyExc_PCREError = PyErr_NewException("pcre.PCREError", 1512 | PyExc_EnvironmentError, NULL); 1513 | Py_INCREF(PyExc_PCREError); 1514 | PyModule_AddObject(m, "PCREError", PyExc_PCREError); 1515 | 1516 | /* pcre_compile and/or pcre_exec flags */ 1517 | PyModule_AddIntConstant(m, "IGNORECASE", PCRE_CASELESS); 1518 | PyModule_AddIntConstant(m, "MULTILINE", PCRE_MULTILINE); 1519 | PyModule_AddIntConstant(m, "DOTALL", PCRE_DOTALL); 1520 | PyModule_AddIntConstant(m, "UNICODE", PCRE_UCP); 1521 | PyModule_AddIntConstant(m, "VERBOSE", PCRE_EXTENDED); 1522 | PyModule_AddIntConstant(m, "ANCHORED", PCRE_ANCHORED); 1523 | PyModule_AddIntConstant(m, "NOTBOL", PCRE_NOTBOL); 1524 | PyModule_AddIntConstant(m, "NOTEOL", PCRE_NOTEOL); 1525 | PyModule_AddIntConstant(m, "NOTEMPTY", PCRE_NOTEMPTY); 1526 | PyModule_AddIntConstant(m, "NOTEMPTY_ATSTART", PCRE_NOTEMPTY_ATSTART); 1527 | PyModule_AddIntConstant(m, "UTF8", PCRE_UTF8); 1528 | PyModule_AddIntConstant(m, "NO_UTF8_CHECK", PCRE_NO_UTF8_CHECK); 1529 | 1530 | /* pcre_study flags */ 1531 | PyModule_AddIntConstant(m, "STUDY_JIT", PCRE_STUDY_JIT_COMPILE); 1532 | 1533 | return m; 1534 | } 1535 | 1536 | #ifdef PY3 1537 | PyMODINIT_FUNC 1538 | PyInit__pcre(void) 1539 | { 1540 | return pypcre_init(); 1541 | } 1542 | #else 1543 | PyMODINIT_FUNC 1544 | init_pcre(void) 1545 | { 1546 | pypcre_init(); 1547 | } 1548 | #endif 1549 | -------------------------------------------------------------------------------- /test/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awahlig/python-pcre/21cc038ade8eb6154c6dd21da26dbd49b9eb3bae/test/__init__.py -------------------------------------------------------------------------------- /test/pcre_tests.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- mode: python -*- 3 | 4 | # -------------------------------------------------------------------------- 5 | # This file is a modified re module testsuite from Python 2.7.5. 6 | # 7 | # Modifications have been commented with "PCRE:". 8 | # Regular expressions have been changed to PCRE format. 
9 | # -------------------------------------------------------------------------- 10 | 11 | # Re test suite and benchmark suite v1.5 12 | 13 | # The 3 possible outcomes for each pattern 14 | [SUCCEED, FAIL, SYNTAX_ERROR] = range(3) 15 | 16 | # Benchmark suite (needs expansion) 17 | # 18 | # The benchmark suite does not test correctness, just speed. The 19 | # first element of each tuple is the regex pattern; the second is a 20 | # string to match it against. The benchmarking code will embed the 21 | # second string inside several sizes of padding, to test how regex 22 | # matching performs on large strings. 23 | 24 | benchmarks = [ 25 | 26 | # test common prefix 27 | ('Python|Perl', 'Perl'), # Alternation 28 | ('(Python|Perl)', 'Perl'), # Grouped alternation 29 | 30 | ('Python|Perl|Tcl', 'Perl'), # Alternation 31 | ('(Python|Perl|Tcl)', 'Perl'), # Grouped alternation 32 | 33 | ('(Python)\\1', 'PythonPython'), # Backreference 34 | ('([0a-z][a-z0-9]*,)+', 'a5,b7,c9,'), # Disable the fastmap optimization 35 | ('([a-z][a-z0-9]*,)+', 'a5,b7,c9,'), # A few sets 36 | 37 | ('Python', 'Python'), # Simple text literal 38 | ('.*Python', 'Python'), # Bad text literal 39 | ('.*Python.*', 'Python'), # Worse text literal 40 | ('.*(Python)', 'Python'), # Bad text literal with grouping 41 | 42 | ] 43 | 44 | # Test suite (for verifying correctness) 45 | # 46 | # The test suite is a list of 5- or 3-tuples. The 5 parts of a 47 | # complete tuple are: 48 | # element 0: a string containing the pattern 49 | # 1: the string to match against the pattern 50 | # 2: the expected result (SUCCEED, FAIL, SYNTAX_ERROR) 51 | # 3: a string that will be eval()'ed to produce a test string. 52 | # This is an arbitrary Python expression; the available 53 | # variables are "found" (the whole match), and "g1", "g2", ... 54 | # up to "g99" contain the contents of each group, or the 55 | # string 'None' if the group wasn't given a value, or the 56 | # string 'Error' if the group index was out of range; 57 | # also "groups", the return value of m.group() (a tuple). 58 | # 4: The expected result of evaluating the expression. 59 | # If the two don't match, an error is reported. 60 | # 61 | # If the regex isn't expected to work, the latter two elements can be omitted. 
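# Illustrative sketch only -- the real driver lives in test_pcre.py, and the
# helper name check_entry below is made up for illustration.  Roughly, a
# complete 5-tuple could be verified along these lines:
#
#   import pcre
#
#   def check_entry(pattern, subject, outcome, expr=None, expected=None):
#       try:
#           m = pcre.search(pattern, subject)
#       except pcre.error:
#           return outcome == SYNTAX_ERROR
#       if m is None:
#           return outcome == FAIL
#       vardict = {'found': m.group(0), 'groups': m.group()}
#       for i in range(1, 100):
#           try:
#               g = m.group(i)
#               vardict['g%d' % i] = 'None' if g is None else g
#           except IndexError:
#               vardict['g%d' % i] = 'Error'
#       return outcome == SUCCEED and eval(expr, vardict) == expected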
62 | 63 | tests = [ 64 | # Test ?P< and ?P= extensions 65 | ('(?Pa)', '', SYNTAX_ERROR), # Begins with a digit 67 | ('(?Pa)', '', SYNTAX_ERROR), # Begins with an illegal char 68 | ('(?Pa)', '', SYNTAX_ERROR), # Begins with an illegal char 69 | 70 | # Same tests, for the ?P= form 71 | ('(?Pa)(?P=foo_123', 'aa', SYNTAX_ERROR), 72 | ('(?Pa)(?P=1)', 'aa', SYNTAX_ERROR), 73 | ('(?Pa)(?P=!)', 'aa', SYNTAX_ERROR), 74 | ('(?Pa)(?P=foo_124', 'aa', SYNTAX_ERROR), # Backref to undefined group 75 | 76 | ('(?Pa)', 'a', SUCCEED, 'g1', 'a'), 77 | ('(?Pa)(?P=foo_123)', 'aa', SUCCEED, 'g1', 'a'), 78 | 79 | # Test octal escapes 80 | ('\\1', 'a', SYNTAX_ERROR), # Backreference 81 | ('[\\1]', '\1', SUCCEED, 'found', '\1'), # Character 82 | ('\\09', chr(0) + '9', SUCCEED, 'found', chr(0) + '9'), 83 | ('\\141', 'a', SUCCEED, 'found', 'a'), 84 | # PCRE: PCRE interprets "\119" as group 119, not group 11 + "9" 85 | #('(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\\119', 'abcdefghijklk9', SUCCEED, 'found+"-"+g11', 'abcdefghijklk9-k'), 86 | 87 | # Test \0 is handled everywhere 88 | (r'\0', '\0', SUCCEED, 'found', '\0'), 89 | (r'[\0a]', '\0', SUCCEED, 'found', '\0'), 90 | (r'[a\0]', '\0', SUCCEED, 'found', '\0'), 91 | (r'[^a\0]', '\0', FAIL), 92 | 93 | # Test various letter escapes 94 | (r'\a[\b]\f\n\r\t\v', '\a\b\f\n\r\t\v', SUCCEED, 'found', '\a\b\f\n\r\t\v'), 95 | (r'[\a][\b][\f][\n][\r][\t][\v]', '\a\b\f\n\r\t\v', SUCCEED, 'found', '\a\b\f\n\r\t\v'), 96 | # NOTE: not an error under PCRE/PRE: 97 | (r'\u', '', SYNTAX_ERROR), # A Perl escape 98 | # PCRE: these have special meaning, syntax error 99 | #(r'\c\e\g\h\i\j\k\m\o\p\q\y\z', 'ceghijkmopqyz', SUCCEED, 'found', 'ceghijkmopqyz'), 100 | (r'\xff', '\377', SUCCEED, 'found', chr(255)), 101 | # new \x semantics 102 | (r'\x00ffffffffffffff', '\377', FAIL, 'found', chr(255)), 103 | (r'\x00f', '\017', FAIL, 'found', chr(15)), 104 | (r'\x00fe', '\376', FAIL, 'found', chr(254)), 105 | # (r'\x00ffffffffffffff', '\377', SUCCEED, 'found', chr(255)), 106 | # (r'\x00f', '\017', SUCCEED, 'found', chr(15)), 107 | # (r'\x00fe', '\376', SUCCEED, 'found', chr(254)), 108 | 109 | (r"^\w+=(\\[\000-\277]|[^\n\\])*", "SRC=eval.c g.c blah blah blah \\\\\n\tapes.c", 110 | SUCCEED, 'found', "SRC=eval.c g.c blah blah blah \\\\"), 111 | 112 | # Test that . 
only matches \n in DOTALL mode 113 | ('a.b', 'acb', SUCCEED, 'found', 'acb'), 114 | ('a.b', 'a\nb', FAIL), 115 | ('a.*b', 'acc\nccb', FAIL), 116 | ('a.{4,5}b', 'acc\nccb', FAIL), 117 | # PCRE: depends on build and compilation flags 118 | #('a.b', 'a\rb', SUCCEED, 'found', 'a\rb'), 119 | # PCRE: (?s) must be before the dot 120 | #('a.b(?s)', 'a\nb', SUCCEED, 'found', 'a\nb'), 121 | #('a.*(?s)b', 'acc\nccb', SUCCEED, 'found', 'acc\nccb'), 122 | ('(?s)a.{4,5}b', 'acc\nccb', SUCCEED, 'found', 'acc\nccb'), 123 | ('(?s)a.b', 'a\nb', SUCCEED, 'found', 'a\nb'), 124 | 125 | (')', '', SYNTAX_ERROR), # Unmatched right bracket 126 | ('', '', SUCCEED, 'found', ''), # Empty pattern 127 | ('abc', 'abc', SUCCEED, 'found', 'abc'), 128 | ('abc', 'xbc', FAIL), 129 | ('abc', 'axc', FAIL), 130 | ('abc', 'abx', FAIL), 131 | ('abc', 'xabcy', SUCCEED, 'found', 'abc'), 132 | ('abc', 'ababc', SUCCEED, 'found', 'abc'), 133 | ('ab*c', 'abc', SUCCEED, 'found', 'abc'), 134 | ('ab*bc', 'abc', SUCCEED, 'found', 'abc'), 135 | ('ab*bc', 'abbc', SUCCEED, 'found', 'abbc'), 136 | ('ab*bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 137 | ('ab+bc', 'abbc', SUCCEED, 'found', 'abbc'), 138 | ('ab+bc', 'abc', FAIL), 139 | ('ab+bc', 'abq', FAIL), 140 | ('ab+bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 141 | ('ab?bc', 'abbc', SUCCEED, 'found', 'abbc'), 142 | ('ab?bc', 'abc', SUCCEED, 'found', 'abc'), 143 | ('ab?bc', 'abbbbc', FAIL), 144 | ('ab?c', 'abc', SUCCEED, 'found', 'abc'), 145 | ('^abc$', 'abc', SUCCEED, 'found', 'abc'), 146 | ('^abc$', 'abcc', FAIL), 147 | ('^abc', 'abcc', SUCCEED, 'found', 'abc'), 148 | ('^abc$', 'aabc', FAIL), 149 | ('abc$', 'aabc', SUCCEED, 'found', 'abc'), 150 | ('^', 'abc', SUCCEED, 'found+"-"', '-'), 151 | ('$', 'abc', SUCCEED, 'found+"-"', '-'), 152 | ('a.c', 'abc', SUCCEED, 'found', 'abc'), 153 | ('a.c', 'axc', SUCCEED, 'found', 'axc'), 154 | ('a.*c', 'axyzc', SUCCEED, 'found', 'axyzc'), 155 | ('a.*c', 'axyzd', FAIL), 156 | ('a[bc]d', 'abc', FAIL), 157 | ('a[bc]d', 'abd', SUCCEED, 'found', 'abd'), 158 | ('a[b-d]e', 'abd', FAIL), 159 | ('a[b-d]e', 'ace', SUCCEED, 'found', 'ace'), 160 | ('a[b-d]', 'aac', SUCCEED, 'found', 'ac'), 161 | ('a[-b]', 'a-', SUCCEED, 'found', 'a-'), 162 | ('a[\\-b]', 'a-', SUCCEED, 'found', 'a-'), 163 | # NOTE: not an error under PCRE/PRE: 164 | ('a[b-]', 'a-', SYNTAX_ERROR), 165 | ('a[]b', '-', SYNTAX_ERROR), 166 | ('a[', '-', SYNTAX_ERROR), 167 | ('a\\', '-', SYNTAX_ERROR), 168 | ('abc)', '-', SYNTAX_ERROR), 169 | ('(abc', '-', SYNTAX_ERROR), 170 | ('a]', 'a]', SUCCEED, 'found', 'a]'), 171 | ('a[]]b', 'a]b', SUCCEED, 'found', 'a]b'), 172 | ('a[\]]b', 'a]b', SUCCEED, 'found', 'a]b'), 173 | ('a[^bc]d', 'aed', SUCCEED, 'found', 'aed'), 174 | ('a[^bc]d', 'abd', FAIL), 175 | ('a[^-b]c', 'adc', SUCCEED, 'found', 'adc'), 176 | ('a[^-b]c', 'a-c', FAIL), 177 | ('a[^]b]c', 'a]c', FAIL), 178 | ('a[^]b]c', 'adc', SUCCEED, 'found', 'adc'), 179 | ('\\ba\\b', 'a-', SUCCEED, '"-"', '-'), 180 | ('\\ba\\b', '-a', SUCCEED, '"-"', '-'), 181 | ('\\ba\\b', '-a-', SUCCEED, '"-"', '-'), 182 | ('\\by\\b', 'xy', FAIL), 183 | ('\\by\\b', 'yz', FAIL), 184 | ('\\by\\b', 'xyz', FAIL), 185 | ('x\\b', 'xyz', FAIL), 186 | ('x\\B', 'xyz', SUCCEED, '"-"', '-'), 187 | ('\\Bz', 'xyz', SUCCEED, '"-"', '-'), 188 | ('z\\B', 'xyz', FAIL), 189 | ('\\Bx', 'xyz', FAIL), 190 | ('\\Ba\\B', 'a-', FAIL, '"-"', '-'), 191 | ('\\Ba\\B', '-a', FAIL, '"-"', '-'), 192 | ('\\Ba\\B', '-a-', FAIL, '"-"', '-'), 193 | ('\\By\\B', 'xy', FAIL), 194 | ('\\By\\B', 'yz', FAIL), 195 | ('\\By\\b', 'xy', SUCCEED, '"-"', '-'), 196 | ('\\by\\B', 
'yz', SUCCEED, '"-"', '-'), 197 | ('\\By\\B', 'xyz', SUCCEED, '"-"', '-'), 198 | ('ab|cd', 'abc', SUCCEED, 'found', 'ab'), 199 | ('ab|cd', 'abcd', SUCCEED, 'found', 'ab'), 200 | ('()ef', 'def', SUCCEED, 'found+"-"+g1', 'ef-'), 201 | ('$b', 'b', FAIL), 202 | ('a\\(b', 'a(b', SUCCEED, 'found+"-"+g1', 'a(b-Error'), 203 | ('a\\(*b', 'ab', SUCCEED, 'found', 'ab'), 204 | ('a\\(*b', 'a((b', SUCCEED, 'found', 'a((b'), 205 | ('a\\\\b', 'a\\b', SUCCEED, 'found', 'a\\b'), 206 | ('((a))', 'abc', SUCCEED, 'found+"-"+g1+"-"+g2', 'a-a-a'), 207 | ('(a)b(c)', 'abc', SUCCEED, 'found+"-"+g1+"-"+g2', 'abc-a-c'), 208 | ('a+b+c', 'aabbabc', SUCCEED, 'found', 'abc'), 209 | ('(a+|b)*', 'ab', SUCCEED, 'found+"-"+g1', 'ab-b'), 210 | ('(a+|b)+', 'ab', SUCCEED, 'found+"-"+g1', 'ab-b'), 211 | ('(a+|b)?', 'ab', SUCCEED, 'found+"-"+g1', 'a-a'), 212 | (')(', '-', SYNTAX_ERROR), 213 | ('[^ab]*', 'cde', SUCCEED, 'found', 'cde'), 214 | ('abc', '', FAIL), 215 | ('a*', '', SUCCEED, 'found', ''), 216 | ('a|b|c|d|e', 'e', SUCCEED, 'found', 'e'), 217 | ('(a|b|c|d|e)f', 'ef', SUCCEED, 'found+"-"+g1', 'ef-e'), 218 | ('abcd*efg', 'abcdefg', SUCCEED, 'found', 'abcdefg'), 219 | ('ab*', 'xabyabbbz', SUCCEED, 'found', 'ab'), 220 | ('ab*', 'xayabbbz', SUCCEED, 'found', 'a'), 221 | ('(ab|cd)e', 'abcde', SUCCEED, 'found+"-"+g1', 'cde-cd'), 222 | ('[abhgefdc]ij', 'hij', SUCCEED, 'found', 'hij'), 223 | ('^(ab|cd)e', 'abcde', FAIL, 'xg1y', 'xy'), 224 | ('(abc|)ef', 'abcdef', SUCCEED, 'found+"-"+g1', 'ef-'), 225 | ('(a|b)c*d', 'abcd', SUCCEED, 'found+"-"+g1', 'bcd-b'), 226 | ('(ab|ab*)bc', 'abc', SUCCEED, 'found+"-"+g1', 'abc-a'), 227 | ('a([bc]*)c*', 'abc', SUCCEED, 'found+"-"+g1', 'abc-bc'), 228 | ('a([bc]*)(c*d)', 'abcd', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcd-bc-d'), 229 | ('a([bc]+)(c*d)', 'abcd', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcd-bc-d'), 230 | ('a([bc]*)(c+d)', 'abcd', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcd-b-cd'), 231 | ('a[bcd]*dcdcde', 'adcdcde', SUCCEED, 'found', 'adcdcde'), 232 | ('a[bcd]+dcdcde', 'adcdcde', FAIL), 233 | ('(ab|a)b*c', 'abc', SUCCEED, 'found+"-"+g1', 'abc-ab'), 234 | ('((a)(b)c)(d)', 'abcd', SUCCEED, 'g1+"-"+g2+"-"+g3+"-"+g4', 'abc-a-b-d'), 235 | ('[a-zA-Z_][a-zA-Z0-9_]*', 'alpha', SUCCEED, 'found', 'alpha'), 236 | ('^a(bc+|b[eh])g|.h$', 'abh', SUCCEED, 'found+"-"+g1', 'bh-None'), 237 | ('(bc+d$|ef*g.|h?i(j|k))', 'effgz', SUCCEED, 'found+"-"+g1+"-"+g2', 'effgz-effgz-None'), 238 | ('(bc+d$|ef*g.|h?i(j|k))', 'ij', SUCCEED, 'found+"-"+g1+"-"+g2', 'ij-ij-j'), 239 | ('(bc+d$|ef*g.|h?i(j|k))', 'effg', FAIL), 240 | ('(bc+d$|ef*g.|h?i(j|k))', 'bcdd', FAIL), 241 | ('(bc+d$|ef*g.|h?i(j|k))', 'reffgz', SUCCEED, 'found+"-"+g1+"-"+g2', 'effgz-effgz-None'), 242 | ('(((((((((a)))))))))', 'a', SUCCEED, 'found', 'a'), 243 | ('multiple words of text', 'uh-uh', FAIL), 244 | ('multiple words', 'multiple words, yeah', SUCCEED, 'found', 'multiple words'), 245 | ('(.*)c(.*)', 'abcde', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcde-ab-de'), 246 | ('\\((.*), (.*)\\)', '(a, b)', SUCCEED, 'g2+"-"+g1', 'b-a'), 247 | ('[k]', 'ab', FAIL), 248 | ('a[-]?c', 'ac', SUCCEED, 'found', 'ac'), 249 | ('(abc)\\1', 'abcabc', SUCCEED, 'g1', 'abc'), 250 | ('([a-c]*)\\1', 'abcabc', SUCCEED, 'g1', 'abc'), 251 | ('^(.+)?B', 'AB', SUCCEED, 'g1', 'A'), 252 | ('(a+).\\1$', 'aaaaa', SUCCEED, 'found+"-"+g1', 'aaaaa-aa'), 253 | ('^(a+).\\1$', 'aaaa', FAIL), 254 | ('(abc)\\1', 'abcabc', SUCCEED, 'found+"-"+g1', 'abcabc-abc'), 255 | ('([a-c]+)\\1', 'abcabc', SUCCEED, 'found+"-"+g1', 'abcabc-abc'), 256 | ('(a)\\1', 'aa', SUCCEED, 'found+"-"+g1', 'aa-a'), 257 | ('(a+)\\1', 
'aa', SUCCEED, 'found+"-"+g1', 'aa-a'), 258 | ('(a+)+\\1', 'aa', SUCCEED, 'found+"-"+g1', 'aa-a'), 259 | ('(a).+\\1', 'aba', SUCCEED, 'found+"-"+g1', 'aba-a'), 260 | ('(a)ba*\\1', 'aba', SUCCEED, 'found+"-"+g1', 'aba-a'), 261 | ('(aa|a)a\\1$', 'aaa', SUCCEED, 'found+"-"+g1', 'aaa-a'), 262 | ('(a|aa)a\\1$', 'aaa', SUCCEED, 'found+"-"+g1', 'aaa-a'), 263 | ('(a+)a\\1$', 'aaa', SUCCEED, 'found+"-"+g1', 'aaa-a'), 264 | ('([abc]*)\\1', 'abcabc', SUCCEED, 'found+"-"+g1', 'abcabc-abc'), 265 | ('(a)(b)c|ab', 'ab', SUCCEED, 'found+"-"+g1+"-"+g2', 'ab-None-None'), 266 | ('(a)+x', 'aaax', SUCCEED, 'found+"-"+g1', 'aaax-a'), 267 | ('([ac])+x', 'aacx', SUCCEED, 'found+"-"+g1', 'aacx-c'), 268 | ('([^/]*/)*sub1/', 'd:msgs/tdir/sub1/trial/away.cpp', SUCCEED, 'found+"-"+g1', 'd:msgs/tdir/sub1/-tdir/'), 269 | ('([^.]*)\\.([^:]*):[T ]+(.*)', 'track1.title:TBlah blah blah', SUCCEED, 'found+"-"+g1+"-"+g2+"-"+g3', 'track1.title:TBlah blah blah-track1-title-Blah blah blah'), 270 | ('([^N]*N)+', 'abNNxyzN', SUCCEED, 'found+"-"+g1', 'abNNxyzN-xyzN'), 271 | ('([^N]*N)+', 'abNNxyz', SUCCEED, 'found+"-"+g1', 'abNN-N'), 272 | ('([abc]*)x', 'abcx', SUCCEED, 'found+"-"+g1', 'abcx-abc'), 273 | ('([abc]*)x', 'abc', FAIL), 274 | ('([xyz]*)x', 'abcx', SUCCEED, 'found+"-"+g1', 'x-'), 275 | ('(a)+b|aac', 'aac', SUCCEED, 'found+"-"+g1', 'aac-None'), 276 | 277 | # Test symbolic groups 278 | 279 | ('(?Paaa)a', 'aaaa', SYNTAX_ERROR), 280 | ('(?Paaa)a', 'aaaa', SUCCEED, 'found+"-"+id', 'aaaa-aaa'), 281 | ('(?Paa)(?P=id)', 'aaaa', SUCCEED, 'found+"-"+id', 'aaaa-aa'), 282 | ('(?Paa)(?P=xd)', 'aaaa', SYNTAX_ERROR), 283 | 284 | # Test octal escapes/memory references 285 | 286 | ('\\1', 'a', SYNTAX_ERROR), 287 | ('\\09', chr(0) + '9', SUCCEED, 'found', chr(0) + '9'), 288 | ('\\141', 'a', SUCCEED, 'found', 'a'), 289 | # PCRE: PCRE interprets "\119" as group 119, not group 11 + "9" 290 | #('(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\\119', 'abcdefghijklk9', SUCCEED, 'found+"-"+g11', 'abcdefghijklk9-k'), 291 | 292 | # All tests from Perl 293 | 294 | ('abc', 'abc', SUCCEED, 'found', 'abc'), 295 | ('abc', 'xbc', FAIL), 296 | ('abc', 'axc', FAIL), 297 | ('abc', 'abx', FAIL), 298 | ('abc', 'xabcy', SUCCEED, 'found', 'abc'), 299 | ('abc', 'ababc', SUCCEED, 'found', 'abc'), 300 | ('ab*c', 'abc', SUCCEED, 'found', 'abc'), 301 | ('ab*bc', 'abc', SUCCEED, 'found', 'abc'), 302 | ('ab*bc', 'abbc', SUCCEED, 'found', 'abbc'), 303 | ('ab*bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 304 | ('ab{0,}bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 305 | ('ab+bc', 'abbc', SUCCEED, 'found', 'abbc'), 306 | ('ab+bc', 'abc', FAIL), 307 | ('ab+bc', 'abq', FAIL), 308 | ('ab{1,}bc', 'abq', FAIL), 309 | ('ab+bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 310 | ('ab{1,}bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 311 | ('ab{1,3}bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 312 | ('ab{3,4}bc', 'abbbbc', SUCCEED, 'found', 'abbbbc'), 313 | ('ab{4,5}bc', 'abbbbc', FAIL), 314 | ('ab?bc', 'abbc', SUCCEED, 'found', 'abbc'), 315 | ('ab?bc', 'abc', SUCCEED, 'found', 'abc'), 316 | ('ab{0,1}bc', 'abc', SUCCEED, 'found', 'abc'), 317 | ('ab?bc', 'abbbbc', FAIL), 318 | ('ab?c', 'abc', SUCCEED, 'found', 'abc'), 319 | ('ab{0,1}c', 'abc', SUCCEED, 'found', 'abc'), 320 | ('^abc$', 'abc', SUCCEED, 'found', 'abc'), 321 | ('^abc$', 'abcc', FAIL), 322 | ('^abc', 'abcc', SUCCEED, 'found', 'abc'), 323 | ('^abc$', 'aabc', FAIL), 324 | ('abc$', 'aabc', SUCCEED, 'found', 'abc'), 325 | ('^', 'abc', SUCCEED, 'found', ''), 326 | ('$', 'abc', SUCCEED, 'found', ''), 327 | ('a.c', 'abc', SUCCEED, 'found', 'abc'), 328 
| ('a.c', 'axc', SUCCEED, 'found', 'axc'), 329 | ('a.*c', 'axyzc', SUCCEED, 'found', 'axyzc'), 330 | ('a.*c', 'axyzd', FAIL), 331 | ('a[bc]d', 'abc', FAIL), 332 | ('a[bc]d', 'abd', SUCCEED, 'found', 'abd'), 333 | ('a[b-d]e', 'abd', FAIL), 334 | ('a[b-d]e', 'ace', SUCCEED, 'found', 'ace'), 335 | ('a[b-d]', 'aac', SUCCEED, 'found', 'ac'), 336 | ('a[-b]', 'a-', SUCCEED, 'found', 'a-'), 337 | ('a[b-]', 'a-', SUCCEED, 'found', 'a-'), 338 | ('a[b-a]', '-', SYNTAX_ERROR), 339 | ('a[]b', '-', SYNTAX_ERROR), 340 | ('a[', '-', SYNTAX_ERROR), 341 | ('a]', 'a]', SUCCEED, 'found', 'a]'), 342 | ('a[]]b', 'a]b', SUCCEED, 'found', 'a]b'), 343 | ('a[^bc]d', 'aed', SUCCEED, 'found', 'aed'), 344 | ('a[^bc]d', 'abd', FAIL), 345 | ('a[^-b]c', 'adc', SUCCEED, 'found', 'adc'), 346 | ('a[^-b]c', 'a-c', FAIL), 347 | ('a[^]b]c', 'a]c', FAIL), 348 | ('a[^]b]c', 'adc', SUCCEED, 'found', 'adc'), 349 | ('ab|cd', 'abc', SUCCEED, 'found', 'ab'), 350 | ('ab|cd', 'abcd', SUCCEED, 'found', 'ab'), 351 | ('()ef', 'def', SUCCEED, 'found+"-"+g1', 'ef-'), 352 | ('*a', '-', SYNTAX_ERROR), 353 | ('(*)b', '-', SYNTAX_ERROR), 354 | ('$b', 'b', FAIL), 355 | ('a\\', '-', SYNTAX_ERROR), 356 | ('a\\(b', 'a(b', SUCCEED, 'found+"-"+g1', 'a(b-Error'), 357 | ('a\\(*b', 'ab', SUCCEED, 'found', 'ab'), 358 | ('a\\(*b', 'a((b', SUCCEED, 'found', 'a((b'), 359 | ('a\\\\b', 'a\\b', SUCCEED, 'found', 'a\\b'), 360 | ('abc)', '-', SYNTAX_ERROR), 361 | ('(abc', '-', SYNTAX_ERROR), 362 | ('((a))', 'abc', SUCCEED, 'found+"-"+g1+"-"+g2', 'a-a-a'), 363 | ('(a)b(c)', 'abc', SUCCEED, 'found+"-"+g1+"-"+g2', 'abc-a-c'), 364 | ('a+b+c', 'aabbabc', SUCCEED, 'found', 'abc'), 365 | ('a{1,}b{1,}c', 'aabbabc', SUCCEED, 'found', 'abc'), 366 | ('a**', '-', SYNTAX_ERROR), 367 | ('a.+?c', 'abcabc', SUCCEED, 'found', 'abc'), 368 | ('(a+|b)*', 'ab', SUCCEED, 'found+"-"+g1', 'ab-b'), 369 | ('(a+|b){0,}', 'ab', SUCCEED, 'found+"-"+g1', 'ab-b'), 370 | ('(a+|b)+', 'ab', SUCCEED, 'found+"-"+g1', 'ab-b'), 371 | ('(a+|b){1,}', 'ab', SUCCEED, 'found+"-"+g1', 'ab-b'), 372 | ('(a+|b)?', 'ab', SUCCEED, 'found+"-"+g1', 'a-a'), 373 | ('(a+|b){0,1}', 'ab', SUCCEED, 'found+"-"+g1', 'a-a'), 374 | (')(', '-', SYNTAX_ERROR), 375 | ('[^ab]*', 'cde', SUCCEED, 'found', 'cde'), 376 | ('abc', '', FAIL), 377 | ('a*', '', SUCCEED, 'found', ''), 378 | ('([abc])*d', 'abbbcd', SUCCEED, 'found+"-"+g1', 'abbbcd-c'), 379 | ('([abc])*bcd', 'abcd', SUCCEED, 'found+"-"+g1', 'abcd-a'), 380 | ('a|b|c|d|e', 'e', SUCCEED, 'found', 'e'), 381 | ('(a|b|c|d|e)f', 'ef', SUCCEED, 'found+"-"+g1', 'ef-e'), 382 | ('abcd*efg', 'abcdefg', SUCCEED, 'found', 'abcdefg'), 383 | ('ab*', 'xabyabbbz', SUCCEED, 'found', 'ab'), 384 | ('ab*', 'xayabbbz', SUCCEED, 'found', 'a'), 385 | ('(ab|cd)e', 'abcde', SUCCEED, 'found+"-"+g1', 'cde-cd'), 386 | ('[abhgefdc]ij', 'hij', SUCCEED, 'found', 'hij'), 387 | ('^(ab|cd)e', 'abcde', FAIL), 388 | ('(abc|)ef', 'abcdef', SUCCEED, 'found+"-"+g1', 'ef-'), 389 | ('(a|b)c*d', 'abcd', SUCCEED, 'found+"-"+g1', 'bcd-b'), 390 | ('(ab|ab*)bc', 'abc', SUCCEED, 'found+"-"+g1', 'abc-a'), 391 | ('a([bc]*)c*', 'abc', SUCCEED, 'found+"-"+g1', 'abc-bc'), 392 | ('a([bc]*)(c*d)', 'abcd', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcd-bc-d'), 393 | ('a([bc]+)(c*d)', 'abcd', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcd-bc-d'), 394 | ('a([bc]*)(c+d)', 'abcd', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcd-b-cd'), 395 | ('a[bcd]*dcdcde', 'adcdcde', SUCCEED, 'found', 'adcdcde'), 396 | ('a[bcd]+dcdcde', 'adcdcde', FAIL), 397 | ('(ab|a)b*c', 'abc', SUCCEED, 'found+"-"+g1', 'abc-ab'), 398 | ('((a)(b)c)(d)', 'abcd', SUCCEED, 
'g1+"-"+g2+"-"+g3+"-"+g4', 'abc-a-b-d'), 399 | ('[a-zA-Z_][a-zA-Z0-9_]*', 'alpha', SUCCEED, 'found', 'alpha'), 400 | ('^a(bc+|b[eh])g|.h$', 'abh', SUCCEED, 'found+"-"+g1', 'bh-None'), 401 | ('(bc+d$|ef*g.|h?i(j|k))', 'effgz', SUCCEED, 'found+"-"+g1+"-"+g2', 'effgz-effgz-None'), 402 | ('(bc+d$|ef*g.|h?i(j|k))', 'ij', SUCCEED, 'found+"-"+g1+"-"+g2', 'ij-ij-j'), 403 | ('(bc+d$|ef*g.|h?i(j|k))', 'effg', FAIL), 404 | ('(bc+d$|ef*g.|h?i(j|k))', 'bcdd', FAIL), 405 | ('(bc+d$|ef*g.|h?i(j|k))', 'reffgz', SUCCEED, 'found+"-"+g1+"-"+g2', 'effgz-effgz-None'), 406 | ('((((((((((a))))))))))', 'a', SUCCEED, 'g10', 'a'), 407 | ('((((((((((a))))))))))\\10', 'aa', SUCCEED, 'found', 'aa'), 408 | # Python does not have the same rules for \\41 so this is a syntax error 409 | # ('((((((((((a))))))))))\\41', 'aa', FAIL), 410 | # ('((((((((((a))))))))))\\41', 'a!', SUCCEED, 'found', 'a!'), 411 | ('((((((((((a))))))))))\\41', '', SYNTAX_ERROR), 412 | ('(?i)((((((((((a))))))))))\\41', '', SYNTAX_ERROR), 413 | ('(((((((((a)))))))))', 'a', SUCCEED, 'found', 'a'), 414 | ('multiple words of text', 'uh-uh', FAIL), 415 | ('multiple words', 'multiple words, yeah', SUCCEED, 'found', 'multiple words'), 416 | ('(.*)c(.*)', 'abcde', SUCCEED, 'found+"-"+g1+"-"+g2', 'abcde-ab-de'), 417 | ('\\((.*), (.*)\\)', '(a, b)', SUCCEED, 'g2+"-"+g1', 'b-a'), 418 | ('[k]', 'ab', FAIL), 419 | ('a[-]?c', 'ac', SUCCEED, 'found', 'ac'), 420 | ('(abc)\\1', 'abcabc', SUCCEED, 'g1', 'abc'), 421 | ('([a-c]*)\\1', 'abcabc', SUCCEED, 'g1', 'abc'), 422 | ('(?i)abc', 'ABC', SUCCEED, 'found', 'ABC'), 423 | ('(?i)abc', 'XBC', FAIL), 424 | ('(?i)abc', 'AXC', FAIL), 425 | ('(?i)abc', 'ABX', FAIL), 426 | ('(?i)abc', 'XABCY', SUCCEED, 'found', 'ABC'), 427 | ('(?i)abc', 'ABABC', SUCCEED, 'found', 'ABC'), 428 | ('(?i)ab*c', 'ABC', SUCCEED, 'found', 'ABC'), 429 | ('(?i)ab*bc', 'ABC', SUCCEED, 'found', 'ABC'), 430 | ('(?i)ab*bc', 'ABBC', SUCCEED, 'found', 'ABBC'), 431 | ('(?i)ab*?bc', 'ABBBBC', SUCCEED, 'found', 'ABBBBC'), 432 | ('(?i)ab{0,}?bc', 'ABBBBC', SUCCEED, 'found', 'ABBBBC'), 433 | ('(?i)ab+?bc', 'ABBC', SUCCEED, 'found', 'ABBC'), 434 | ('(?i)ab+bc', 'ABC', FAIL), 435 | ('(?i)ab+bc', 'ABQ', FAIL), 436 | ('(?i)ab{1,}bc', 'ABQ', FAIL), 437 | ('(?i)ab+bc', 'ABBBBC', SUCCEED, 'found', 'ABBBBC'), 438 | ('(?i)ab{1,}?bc', 'ABBBBC', SUCCEED, 'found', 'ABBBBC'), 439 | ('(?i)ab{1,3}?bc', 'ABBBBC', SUCCEED, 'found', 'ABBBBC'), 440 | ('(?i)ab{3,4}?bc', 'ABBBBC', SUCCEED, 'found', 'ABBBBC'), 441 | ('(?i)ab{4,5}?bc', 'ABBBBC', FAIL), 442 | ('(?i)ab??bc', 'ABBC', SUCCEED, 'found', 'ABBC'), 443 | ('(?i)ab??bc', 'ABC', SUCCEED, 'found', 'ABC'), 444 | ('(?i)ab{0,1}?bc', 'ABC', SUCCEED, 'found', 'ABC'), 445 | ('(?i)ab??bc', 'ABBBBC', FAIL), 446 | ('(?i)ab??c', 'ABC', SUCCEED, 'found', 'ABC'), 447 | ('(?i)ab{0,1}?c', 'ABC', SUCCEED, 'found', 'ABC'), 448 | ('(?i)^abc$', 'ABC', SUCCEED, 'found', 'ABC'), 449 | ('(?i)^abc$', 'ABCC', FAIL), 450 | ('(?i)^abc', 'ABCC', SUCCEED, 'found', 'ABC'), 451 | ('(?i)^abc$', 'AABC', FAIL), 452 | ('(?i)abc$', 'AABC', SUCCEED, 'found', 'ABC'), 453 | ('(?i)^', 'ABC', SUCCEED, 'found', ''), 454 | ('(?i)$', 'ABC', SUCCEED, 'found', ''), 455 | ('(?i)a.c', 'ABC', SUCCEED, 'found', 'ABC'), 456 | ('(?i)a.c', 'AXC', SUCCEED, 'found', 'AXC'), 457 | ('(?i)a.*?c', 'AXYZC', SUCCEED, 'found', 'AXYZC'), 458 | ('(?i)a.*c', 'AXYZD', FAIL), 459 | ('(?i)a[bc]d', 'ABC', FAIL), 460 | ('(?i)a[bc]d', 'ABD', SUCCEED, 'found', 'ABD'), 461 | ('(?i)a[b-d]e', 'ABD', FAIL), 462 | ('(?i)a[b-d]e', 'ACE', SUCCEED, 'found', 'ACE'), 463 | ('(?i)a[b-d]', 'AAC', SUCCEED, 
'found', 'AC'), 464 | ('(?i)a[-b]', 'A-', SUCCEED, 'found', 'A-'), 465 | ('(?i)a[b-]', 'A-', SUCCEED, 'found', 'A-'), 466 | ('(?i)a[b-a]', '-', SYNTAX_ERROR), 467 | ('(?i)a[]b', '-', SYNTAX_ERROR), 468 | ('(?i)a[', '-', SYNTAX_ERROR), 469 | ('(?i)a]', 'A]', SUCCEED, 'found', 'A]'), 470 | ('(?i)a[]]b', 'A]B', SUCCEED, 'found', 'A]B'), 471 | ('(?i)a[^bc]d', 'AED', SUCCEED, 'found', 'AED'), 472 | ('(?i)a[^bc]d', 'ABD', FAIL), 473 | ('(?i)a[^-b]c', 'ADC', SUCCEED, 'found', 'ADC'), 474 | ('(?i)a[^-b]c', 'A-C', FAIL), 475 | ('(?i)a[^]b]c', 'A]C', FAIL), 476 | ('(?i)a[^]b]c', 'ADC', SUCCEED, 'found', 'ADC'), 477 | ('(?i)ab|cd', 'ABC', SUCCEED, 'found', 'AB'), 478 | ('(?i)ab|cd', 'ABCD', SUCCEED, 'found', 'AB'), 479 | ('(?i)()ef', 'DEF', SUCCEED, 'found+"-"+g1', 'EF-'), 480 | ('(?i)*a', '-', SYNTAX_ERROR), 481 | ('(?i)(*)b', '-', SYNTAX_ERROR), 482 | ('(?i)$b', 'B', FAIL), 483 | ('(?i)a\\', '-', SYNTAX_ERROR), 484 | ('(?i)a\\(b', 'A(B', SUCCEED, 'found+"-"+g1', 'A(B-Error'), 485 | ('(?i)a\\(*b', 'AB', SUCCEED, 'found', 'AB'), 486 | ('(?i)a\\(*b', 'A((B', SUCCEED, 'found', 'A((B'), 487 | ('(?i)a\\\\b', 'A\\B', SUCCEED, 'found', 'A\\B'), 488 | ('(?i)abc)', '-', SYNTAX_ERROR), 489 | ('(?i)(abc', '-', SYNTAX_ERROR), 490 | ('(?i)((a))', 'ABC', SUCCEED, 'found+"-"+g1+"-"+g2', 'A-A-A'), 491 | ('(?i)(a)b(c)', 'ABC', SUCCEED, 'found+"-"+g1+"-"+g2', 'ABC-A-C'), 492 | ('(?i)a+b+c', 'AABBABC', SUCCEED, 'found', 'ABC'), 493 | ('(?i)a{1,}b{1,}c', 'AABBABC', SUCCEED, 'found', 'ABC'), 494 | ('(?i)a**', '-', SYNTAX_ERROR), 495 | ('(?i)a.+?c', 'ABCABC', SUCCEED, 'found', 'ABC'), 496 | ('(?i)a.*?c', 'ABCABC', SUCCEED, 'found', 'ABC'), 497 | ('(?i)a.{0,5}?c', 'ABCABC', SUCCEED, 'found', 'ABC'), 498 | ('(?i)(a+|b)*', 'AB', SUCCEED, 'found+"-"+g1', 'AB-B'), 499 | ('(?i)(a+|b){0,}', 'AB', SUCCEED, 'found+"-"+g1', 'AB-B'), 500 | ('(?i)(a+|b)+', 'AB', SUCCEED, 'found+"-"+g1', 'AB-B'), 501 | ('(?i)(a+|b){1,}', 'AB', SUCCEED, 'found+"-"+g1', 'AB-B'), 502 | ('(?i)(a+|b)?', 'AB', SUCCEED, 'found+"-"+g1', 'A-A'), 503 | ('(?i)(a+|b){0,1}', 'AB', SUCCEED, 'found+"-"+g1', 'A-A'), 504 | ('(?i)(a+|b){0,1}?', 'AB', SUCCEED, 'found+"-"+g1', '-None'), 505 | ('(?i))(', '-', SYNTAX_ERROR), 506 | ('(?i)[^ab]*', 'CDE', SUCCEED, 'found', 'CDE'), 507 | ('(?i)abc', '', FAIL), 508 | ('(?i)a*', '', SUCCEED, 'found', ''), 509 | ('(?i)([abc])*d', 'ABBBCD', SUCCEED, 'found+"-"+g1', 'ABBBCD-C'), 510 | ('(?i)([abc])*bcd', 'ABCD', SUCCEED, 'found+"-"+g1', 'ABCD-A'), 511 | ('(?i)a|b|c|d|e', 'E', SUCCEED, 'found', 'E'), 512 | ('(?i)(a|b|c|d|e)f', 'EF', SUCCEED, 'found+"-"+g1', 'EF-E'), 513 | ('(?i)abcd*efg', 'ABCDEFG', SUCCEED, 'found', 'ABCDEFG'), 514 | ('(?i)ab*', 'XABYABBBZ', SUCCEED, 'found', 'AB'), 515 | ('(?i)ab*', 'XAYABBBZ', SUCCEED, 'found', 'A'), 516 | ('(?i)(ab|cd)e', 'ABCDE', SUCCEED, 'found+"-"+g1', 'CDE-CD'), 517 | ('(?i)[abhgefdc]ij', 'HIJ', SUCCEED, 'found', 'HIJ'), 518 | ('(?i)^(ab|cd)e', 'ABCDE', FAIL), 519 | ('(?i)(abc|)ef', 'ABCDEF', SUCCEED, 'found+"-"+g1', 'EF-'), 520 | ('(?i)(a|b)c*d', 'ABCD', SUCCEED, 'found+"-"+g1', 'BCD-B'), 521 | ('(?i)(ab|ab*)bc', 'ABC', SUCCEED, 'found+"-"+g1', 'ABC-A'), 522 | ('(?i)a([bc]*)c*', 'ABC', SUCCEED, 'found+"-"+g1', 'ABC-BC'), 523 | ('(?i)a([bc]*)(c*d)', 'ABCD', SUCCEED, 'found+"-"+g1+"-"+g2', 'ABCD-BC-D'), 524 | ('(?i)a([bc]+)(c*d)', 'ABCD', SUCCEED, 'found+"-"+g1+"-"+g2', 'ABCD-BC-D'), 525 | ('(?i)a([bc]*)(c+d)', 'ABCD', SUCCEED, 'found+"-"+g1+"-"+g2', 'ABCD-B-CD'), 526 | ('(?i)a[bcd]*dcdcde', 'ADCDCDE', SUCCEED, 'found', 'ADCDCDE'), 527 | ('(?i)a[bcd]+dcdcde', 'ADCDCDE', FAIL), 528 | 
('(?i)(ab|a)b*c', 'ABC', SUCCEED, 'found+"-"+g1', 'ABC-AB'), 529 | ('(?i)((a)(b)c)(d)', 'ABCD', SUCCEED, 'g1+"-"+g2+"-"+g3+"-"+g4', 'ABC-A-B-D'), 530 | ('(?i)[a-zA-Z_][a-zA-Z0-9_]*', 'ALPHA', SUCCEED, 'found', 'ALPHA'), 531 | ('(?i)^a(bc+|b[eh])g|.h$', 'ABH', SUCCEED, 'found+"-"+g1', 'BH-None'), 532 | ('(?i)(bc+d$|ef*g.|h?i(j|k))', 'EFFGZ', SUCCEED, 'found+"-"+g1+"-"+g2', 'EFFGZ-EFFGZ-None'), 533 | ('(?i)(bc+d$|ef*g.|h?i(j|k))', 'IJ', SUCCEED, 'found+"-"+g1+"-"+g2', 'IJ-IJ-J'), 534 | ('(?i)(bc+d$|ef*g.|h?i(j|k))', 'EFFG', FAIL), 535 | ('(?i)(bc+d$|ef*g.|h?i(j|k))', 'BCDD', FAIL), 536 | ('(?i)(bc+d$|ef*g.|h?i(j|k))', 'REFFGZ', SUCCEED, 'found+"-"+g1+"-"+g2', 'EFFGZ-EFFGZ-None'), 537 | ('(?i)((((((((((a))))))))))', 'A', SUCCEED, 'g10', 'A'), 538 | ('(?i)((((((((((a))))))))))\\10', 'AA', SUCCEED, 'found', 'AA'), 539 | #('(?i)((((((((((a))))))))))\\41', 'AA', FAIL), 540 | #('(?i)((((((((((a))))))))))\\41', 'A!', SUCCEED, 'found', 'A!'), 541 | ('(?i)(((((((((a)))))))))', 'A', SUCCEED, 'found', 'A'), 542 | ('(?i)(?:(?:(?:(?:(?:(?:(?:(?:(?:(a))))))))))', 'A', SUCCEED, 'g1', 'A'), 543 | ('(?i)(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))', 'C', SUCCEED, 'g1', 'C'), 544 | ('(?i)multiple words of text', 'UH-UH', FAIL), 545 | ('(?i)multiple words', 'MULTIPLE WORDS, YEAH', SUCCEED, 'found', 'MULTIPLE WORDS'), 546 | ('(?i)(.*)c(.*)', 'ABCDE', SUCCEED, 'found+"-"+g1+"-"+g2', 'ABCDE-AB-DE'), 547 | ('(?i)\\((.*), (.*)\\)', '(A, B)', SUCCEED, 'g2+"-"+g1', 'B-A'), 548 | ('(?i)[k]', 'AB', FAIL), 549 | # ('(?i)abcd', 'ABCD', SUCCEED, 'found+"-"+\\found+"-"+\\\\found', 'ABCD-$&-\\ABCD'), 550 | # ('(?i)a(bc)d', 'ABCD', SUCCEED, 'g1+"-"+\\g1+"-"+\\\\g1', 'BC-$1-\\BC'), 551 | ('(?i)a[-]?c', 'AC', SUCCEED, 'found', 'AC'), 552 | ('(?i)(abc)\\1', 'ABCABC', SUCCEED, 'g1', 'ABC'), 553 | ('(?i)([a-c]*)\\1', 'ABCABC', SUCCEED, 'g1', 'ABC'), 554 | ('a(?!b).', 'abad', SUCCEED, 'found', 'ad'), 555 | ('a(?=d).', 'abad', SUCCEED, 'found', 'ad'), 556 | ('a(?=c|d).', 'abad', SUCCEED, 'found', 'ad'), 557 | ('a(?:b|c|d)(.)', 'ace', SUCCEED, 'g1', 'e'), 558 | ('a(?:b|c|d)*(.)', 'ace', SUCCEED, 'g1', 'e'), 559 | ('a(?:b|c|d)+?(.)', 'ace', SUCCEED, 'g1', 'e'), 560 | ('a(?:b|(c|e){1,2}?|d)+?(.)', 'ace', SUCCEED, 'g1 + g2', 'ce'), 561 | ('^(.+)?B', 'AB', SUCCEED, 'g1', 'A'), 562 | 563 | # lookbehind: split by : but not if it is escaped by -. 
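# (Illustrative, not a test entry: e.g. pcre.split(r'(?<!-):', 'a:b-:c')
#  would give ['a', 'b-:c'], since the ':' preceded by '-' is not split on.)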
564 | ('(?]*?b', 'a>b', FAIL), 670 | # bug 490573: minimizing repeat problem 671 | (r'^a*?$', 'foo', FAIL), 672 | # bug 470582: nested groups problem 673 | (r'^((a)c)?(ab)$', 'ab', SUCCEED, 'g1+"-"+g2+"-"+g3', 'None-None-ab'), 674 | # another minimizing repeat problem (capturing groups in assertions) 675 | ('^([ab]*?)(?=(b)?)c', 'abc', SUCCEED, 'g1+"-"+g2', 'ab-None'), 676 | ('^([ab]*?)(?!(b))c', 'abc', SUCCEED, 'g1+"-"+g2', 'ab-None'), 677 | ('^([ab]*?)(?x)', '\g\g', 'xx'), 'xxxx') 80 | self.assertEqual(re.sub('(?Px)', '\g\g<1>', 'xx'), 'xxxx') 81 | self.assertEqual(re.sub('(?Px)', '\g\g', 'xx'), 'xxxx') 82 | self.assertEqual(re.sub('(?Px)', '\g<1>\g<1>', 'xx'), 'xxxx') 83 | 84 | self.assertEqual(re.sub('a',r'\t\n\v\r\f\a\b\B\Z\a\A\w\W\s\S\d\D','a'), 85 | '\t\n\v\r\f\a\b\\B\\Z\a\\A\\w\\W\\s\\S\\d\\D') 86 | self.assertEqual(re.sub('a', '\t\n\v\r\f\a', 'a'), '\t\n\v\r\f\a') 87 | self.assertEqual(re.sub('a', '\t\n\v\r\f\a', 'a'), 88 | (chr(9)+chr(10)+chr(11)+chr(13)+chr(12)+chr(7))) 89 | 90 | self.assertEqual(re.sub('^\s*', 'X', 'test'), 'Xtest') 91 | 92 | def test_bug_449964(self): 93 | # fails for group followed by other escape 94 | self.assertEqual(re.sub(r'(?Px)', '\g<1>\g<1>\\b', 'xx'), 95 | 'xx\bxx\b') 96 | 97 | def test_bug_449000(self): 98 | # Test for sub() on escaped characters 99 | self.assertEqual(re.sub(r'\r\n', r'\n', 'abc\r\ndef\r\n'), 100 | 'abc\ndef\n') 101 | self.assertEqual(re.sub('\r\n', r'\n', 'abc\r\ndef\r\n'), 102 | 'abc\ndef\n') 103 | self.assertEqual(re.sub(r'\r\n', '\n', 'abc\r\ndef\r\n'), 104 | 'abc\ndef\n') 105 | self.assertEqual(re.sub('\r\n', '\n', 'abc\r\ndef\r\n'), 106 | 'abc\ndef\n') 107 | 108 | def test_bug_1140(self): 109 | # re.sub(x, y, u'') should return u'', not '', and 110 | # re.sub(x, y, '') should return '', not u''. 111 | # Also: 112 | # re.sub(x, y, unicode(x)) should return unicode(y), and 113 | # re.sub(x, y, str(x)) should return 114 | # str(y) if isinstance(y, str) else unicode(y). 
115 | for x in 'x', u'x': 116 | for y in 'y', u'y': 117 | z = re.sub(x, y, u'') 118 | self.assertEqual(z, u'') 119 | self.assertEqual(type(z), unicode) 120 | # 121 | z = re.sub(x, y, '') 122 | self.assertEqual(z, '') 123 | self.assertEqual(type(z), str) 124 | # 125 | z = re.sub(x, y, unicode(x)) 126 | self.assertEqual(z, y) 127 | self.assertEqual(type(z), unicode) 128 | # 129 | z = re.sub(x, y, str(x)) 130 | self.assertEqual(z, y) 131 | self.assertEqual(type(z), type(y)) 132 | 133 | def test_bug_1661(self): 134 | # Verify that flags do not get silently ignored with compiled patterns 135 | pattern = re.compile('.') 136 | self.assertRaises(ValueError, re.match, pattern, 'A', re.I) 137 | self.assertRaises(ValueError, re.search, pattern, 'A', re.I) 138 | self.assertRaises(ValueError, re.findall, pattern, 'A', re.I) 139 | self.assertRaises(ValueError, re.compile, pattern, re.I) 140 | 141 | def test_bug_3629(self): 142 | # A regex that triggered a bug in the sre-code validator 143 | re.compile("(?P)(?(quote))") 144 | 145 | def test_sub_template_numeric_escape(self): 146 | # bug 776311 and friends 147 | self.assertEqual(re.sub('x', r'\0', 'x'), '\0') 148 | self.assertEqual(re.sub('x', r'\000', 'x'), '\000') 149 | self.assertEqual(re.sub('x', r'\001', 'x'), '\001') 150 | self.assertEqual(re.sub('x', r'\008', 'x'), '\0' + '8') 151 | self.assertEqual(re.sub('x', r'\009', 'x'), '\0' + '9') 152 | self.assertEqual(re.sub('x', r'\111', 'x'), '\111') 153 | self.assertEqual(re.sub('x', r'\117', 'x'), '\117') 154 | 155 | self.assertEqual(re.sub('x', r'\1111', 'x'), '\1111') 156 | self.assertEqual(re.sub('x', r'\1111', 'x'), '\111' + '1') 157 | 158 | self.assertEqual(re.sub('x', r'\00', 'x'), '\x00') 159 | self.assertEqual(re.sub('x', r'\07', 'x'), '\x07') 160 | self.assertEqual(re.sub('x', r'\08', 'x'), '\0' + '8') 161 | self.assertEqual(re.sub('x', r'\09', 'x'), '\0' + '9') 162 | self.assertEqual(re.sub('x', r'\0a', 'x'), '\0' + 'a') 163 | 164 | self.assertEqual(re.sub('x', r'\400', 'x'), '\0') 165 | self.assertEqual(re.sub('x', r'\777', 'x'), '\377') 166 | 167 | self.assertRaises(re.error, re.sub, 'x', r'\1', 'x') 168 | self.assertRaises(re.error, re.sub, 'x', r'\8', 'x') 169 | self.assertRaises(re.error, re.sub, 'x', r'\9', 'x') 170 | self.assertRaises(re.error, re.sub, 'x', r'\11', 'x') 171 | self.assertRaises(re.error, re.sub, 'x', r'\18', 'x') 172 | self.assertRaises(re.error, re.sub, 'x', r'\1a', 'x') 173 | self.assertRaises(re.error, re.sub, 'x', r'\90', 'x') 174 | self.assertRaises(re.error, re.sub, 'x', r'\99', 'x') 175 | self.assertRaises(re.error, re.sub, 'x', r'\118', 'x') # r'\11' + '8' 176 | self.assertRaises(re.error, re.sub, 'x', r'\11a', 'x') 177 | self.assertRaises(re.error, re.sub, 'x', r'\181', 'x') # r'\18' + '1' 178 | self.assertRaises(re.error, re.sub, 'x', r'\800', 'x') # r'\80' + '0' 179 | 180 | # in python2.3 (etc), these loop endlessly in sre_parser.py 181 | self.assertEqual(re.sub('(((((((((((x)))))))))))', r'\11', 'x'), 'x') 182 | self.assertEqual(re.sub('((((((((((y))))))))))(.)', r'\118', 'xyz'), 183 | 'xz8') 184 | self.assertEqual(re.sub('((((((((((y))))))))))(.)', r'\11a', 'xyz'), 185 | 'xza') 186 | 187 | def test_qualified_re_sub(self): 188 | self.assertEqual(re.sub('a', 'b', 'aaaaa'), 'bbbbb') 189 | self.assertEqual(re.sub('a', 'b', 'aaaaa', 1), 'baaaa') 190 | 191 | def test_bug_114660(self): 192 | self.assertEqual(re.sub(r'(\S)\s+(\S)', r'\1 \2', 'hello there'), 193 | 'hello there') 194 | 195 | def test_bug_462270(self): 196 | # Test for empty sub() behaviour, see SF 
bug #462270 197 | self.assertEqual(re.sub('x*', '-', 'abxd'), '-a-b-d-') 198 | self.assertEqual(re.sub('x+', '-', 'abxd'), 'ab-d') 199 | 200 | def test_symbolic_groups(self): 201 | re.compile('(?Px)(?P=a)(?(a)y)') 202 | re.compile('(?Px)(?P=a1)(?(a1)y)') 203 | self.assertRaises(re.error, re.compile, '(?P)(?P)') 204 | self.assertRaises(re.error, re.compile, '(?Px)') 205 | self.assertRaises(re.error, re.compile, '(?P=)') 206 | self.assertRaises(re.error, re.compile, '(?P=1)') 207 | self.assertRaises(re.error, re.compile, '(?P=a)') 208 | self.assertRaises(re.error, re.compile, '(?P=a1)') 209 | self.assertRaises(re.error, re.compile, '(?P=a.)') 210 | self.assertRaises(re.error, re.compile, '(?P<)') 211 | self.assertRaises(re.error, re.compile, '(?P<>)') 212 | self.assertRaises(re.error, re.compile, '(?P<1>)') 213 | self.assertRaises(re.error, re.compile, '(?P)') 214 | self.assertRaises(re.error, re.compile, '(?())') 215 | self.assertRaises(re.error, re.compile, '(?(a))') 216 | self.assertRaises(re.error, re.compile, '(?(1a))') 217 | self.assertRaises(re.error, re.compile, '(?(a.))') 218 | 219 | def test_symbolic_refs(self): 220 | self.assertRaises(re.error, re.sub, '(?Px)', '\gx)', '\g<', 'xx') 222 | self.assertRaises(re.error, re.sub, '(?Px)', '\g', 'xx') 223 | self.assertRaises(re.error, re.sub, '(?Px)', '\g', 'xx') 224 | self.assertRaises(re.error, re.sub, '(?Px)', '\g<>', 'xx') 225 | self.assertRaises(re.error, re.sub, '(?Px)', '\g<1a1>', 'xx') 226 | self.assertRaises(IndexError, re.sub, '(?Px)', '\g', 'xx') 227 | self.assertRaises(re.error, re.sub, '(?Px)|(?Py)', '\g', 'xx') 228 | self.assertRaises(re.error, re.sub, '(?Px)|(?Py)', '\\2', 'xx') 229 | self.assertRaises(re.error, re.sub, '(?Px)', '\g<-1>', 'xx') 230 | 231 | def test_re_subn(self): 232 | self.assertEqual(re.subn("(?i)b+", "x", "bbbb BBBB"), ('x x', 2)) 233 | self.assertEqual(re.subn("b+", "x", "bbbb BBBB"), ('x BBBB', 1)) 234 | self.assertEqual(re.subn("b+", "x", "xyz"), ('xyz', 0)) 235 | self.assertEqual(re.subn("b*", "x", "xyz"), ('xxxyxzx', 4)) 236 | self.assertEqual(re.subn("b*", "x", "xyz", 2), ('xxxyz', 2)) 237 | 238 | def test_re_split(self): 239 | self.assertEqual(re.split(":", ":a:b::c"), ['', 'a', 'b', '', 'c']) 240 | self.assertEqual(re.split(":*", ":a:b::c"), ['', 'a', 'b', 'c']) 241 | self.assertEqual(re.split("(:*)", ":a:b::c"), 242 | ['', ':', 'a', ':', 'b', '::', 'c']) 243 | self.assertEqual(re.split("(?::*)", ":a:b::c"), ['', 'a', 'b', 'c']) 244 | self.assertEqual(re.split("(:)*", ":a:b::c"), 245 | ['', ':', 'a', ':', 'b', ':', 'c']) 246 | self.assertEqual(re.split("([b:]+)", ":a:b::c"), 247 | ['', ':', 'a', ':b::', 'c']) 248 | self.assertEqual(re.split("(b)|(:+)", ":a:b::c"), 249 | ['', None, ':', 'a', None, ':', '', 'b', None, '', 250 | None, '::', 'c']) 251 | self.assertEqual(re.split("(?:b)|(?::+)", ":a:b::c"), 252 | ['', 'a', '', '', 'c']) 253 | 254 | def test_qualified_re_split(self): 255 | self.assertEqual(re.split(":", ":a:b::c", 2), ['', 'a', 'b::c']) 256 | self.assertEqual(re.split(':', 'a:b:c:d', 2), ['a', 'b', 'c:d']) 257 | self.assertEqual(re.split("(:)", ":a:b::c", 2), 258 | ['', ':', 'a', ':', 'b::c']) 259 | self.assertEqual(re.split("(:*)", ":a:b::c", 2), 260 | ['', ':', 'a', ':', 'b::c']) 261 | 262 | def test_re_findall(self): 263 | self.assertEqual(re.findall(":+", "abc"), []) 264 | self.assertEqual(re.findall(":+", "a:b::c:::d"), [":", "::", ":::"]) 265 | self.assertEqual(re.findall("(:+)", "a:b::c:::d"), [":", "::", ":::"]) 266 | self.assertEqual(re.findall("(:)(:*)", "a:b::c:::d"), [(":", 
""), 267 | (":", ":"), 268 | (":", "::")]) 269 | 270 | def test_bug_117612(self): 271 | self.assertEqual(re.findall(r"(a|(b))", "aba"), 272 | [("a", ""),("b", "b"),("a", "")]) 273 | 274 | def test_re_match(self): 275 | self.assertEqual(re.match('a', 'a').groups(), ()) 276 | self.assertEqual(re.match('(a)', 'a').groups(), ('a',)) 277 | self.assertEqual(re.match(r'(a)', 'a').group(0), 'a') 278 | self.assertEqual(re.match(r'(a)', 'a').group(1), 'a') 279 | self.assertEqual(re.match(r'(a)', 'a').group(1, 1), ('a', 'a')) 280 | 281 | pat = re.compile('((a)|(b))(c)?') 282 | self.assertEqual(pat.match('a').groups(), ('a', 'a', None, None)) 283 | self.assertEqual(pat.match('b').groups(), ('b', None, 'b', None)) 284 | self.assertEqual(pat.match('ac').groups(), ('a', 'a', None, 'c')) 285 | self.assertEqual(pat.match('bc').groups(), ('b', None, 'b', 'c')) 286 | self.assertEqual(pat.match('bc').groups(""), ('b', "", 'b', 'c')) 287 | 288 | # A single group 289 | m = re.match('(a)', 'a') 290 | self.assertEqual(m.group(0), 'a') 291 | self.assertEqual(m.group(0), 'a') 292 | self.assertEqual(m.group(1), 'a') 293 | self.assertEqual(m.group(1, 1), ('a', 'a')) 294 | 295 | pat = re.compile('(?:(?Pa)|(?Pb))(?Pc)?') 296 | self.assertEqual(pat.match('a').group(1, 2, 3), ('a', None, None)) 297 | self.assertEqual(pat.match('b').group('a1', 'b2', 'c3'), 298 | (None, 'b', None)) 299 | self.assertEqual(pat.match('ac').group(1, 'b2', 3), ('a', None, 'c')) 300 | 301 | def test_re_groupref_exists(self): 302 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', '(a)').groups(), 303 | ('(', 'a')) 304 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', 'a').groups(), 305 | (None, 'a')) 306 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', 'a)'), None) 307 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', '(a'), None) 308 | self.assertEqual(re.match('^(?:(a)|c)((?(1)b|d))$', 'ab').groups(), 309 | ('a', 'b')) 310 | self.assertEqual(re.match('^(?:(a)|c)((?(1)b|d))$', 'cd').groups(), 311 | (None, 'd')) 312 | self.assertEqual(re.match('^(?:(a)|c)((?(1)|d))$', 'cd').groups(), 313 | (None, 'd')) 314 | self.assertEqual(re.match('^(?:(a)|c)((?(1)|d))$', 'a').groups(), 315 | ('a', '')) 316 | 317 | # Tests for bug #1177831: exercise groups other than the first group 318 | p = re.compile('(?Pa)(?Pb)?((?(g2)c|d))') 319 | self.assertEqual(p.match('abc').groups(), 320 | ('a', 'b', 'c')) 321 | self.assertEqual(p.match('ad').groups(), 322 | ('a', None, 'd')) 323 | self.assertEqual(p.match('abd'), None) 324 | self.assertEqual(p.match('ac'), None) 325 | 326 | 327 | def test_re_groupref(self): 328 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1$', '|a|').groups(), 329 | ('|', 'a')) 330 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1?$', 'a').groups(), 331 | (None, 'a')) 332 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1$', 'a|'), None) 333 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1$', '|a'), None) 334 | self.assertEqual(re.match(r'^(?:(a)|c)(\1)$', 'aa').groups(), 335 | ('a', 'a')) 336 | self.assertEqual(re.match(r'^(?:(a)|c)(\1)?$', 'c').groups(), 337 | (None, None)) 338 | 339 | def test_groupdict(self): 340 | self.assertEqual(re.match('(?Pfirst) (?Psecond)', 341 | 'first second').groupdict(), 342 | {'first':'first', 'second':'second'}) 343 | 344 | def test_expand(self): 345 | self.assertEqual(re.match("(?Pfirst) (?Psecond)", 346 | "first second") 347 | .expand(r"\2 \1 \g \g"), 348 | "second first second first") 349 | 350 | def test_repeat_minmax(self): 351 | self.assertEqual(re.match("^(\w){1}$", "abc"), None) 352 | 
self.assertEqual(re.match("^(\w){1}?$", "abc"), None) 353 | self.assertEqual(re.match("^(\w){1,2}$", "abc"), None) 354 | self.assertEqual(re.match("^(\w){1,2}?$", "abc"), None) 355 | 356 | self.assertEqual(re.match("^(\w){3}$", "abc").group(1), "c") 357 | self.assertEqual(re.match("^(\w){1,3}$", "abc").group(1), "c") 358 | self.assertEqual(re.match("^(\w){1,4}$", "abc").group(1), "c") 359 | self.assertEqual(re.match("^(\w){3,4}?$", "abc").group(1), "c") 360 | self.assertEqual(re.match("^(\w){3}?$", "abc").group(1), "c") 361 | self.assertEqual(re.match("^(\w){1,3}?$", "abc").group(1), "c") 362 | self.assertEqual(re.match("^(\w){1,4}?$", "abc").group(1), "c") 363 | self.assertEqual(re.match("^(\w){3,4}?$", "abc").group(1), "c") 364 | 365 | self.assertEqual(re.match("^x{1}$", "xxx"), None) 366 | self.assertEqual(re.match("^x{1}?$", "xxx"), None) 367 | self.assertEqual(re.match("^x{1,2}$", "xxx"), None) 368 | self.assertEqual(re.match("^x{1,2}?$", "xxx"), None) 369 | 370 | self.assertNotEqual(re.match("^x{3}$", "xxx"), None) 371 | self.assertNotEqual(re.match("^x{1,3}$", "xxx"), None) 372 | self.assertNotEqual(re.match("^x{1,4}$", "xxx"), None) 373 | self.assertNotEqual(re.match("^x{3,4}?$", "xxx"), None) 374 | self.assertNotEqual(re.match("^x{3}?$", "xxx"), None) 375 | self.assertNotEqual(re.match("^x{1,3}?$", "xxx"), None) 376 | # PCRE: need to make ungreedy for this case to work in PCRE 377 | #self.assertNotEqual(re.match("^x{1,4}?$", "xxx"), None) 378 | self.assertNotEqual(re.match("(?U)^x{1,4}?$", "xxx"), None) 379 | self.assertNotEqual(re.match("^x{3,4}?$", "xxx"), None) 380 | 381 | self.assertEqual(re.match("^x{}$", "xxx"), None) 382 | self.assertNotEqual(re.match("^x{}$", "x{}"), None) 383 | 384 | def test_getattr(self): 385 | self.assertEqual(re.match("(a)", "a").pos, 0) 386 | self.assertEqual(re.match("(a)", "a").endpos, 1) 387 | self.assertEqual(re.match("(a)", "a").string, "a") 388 | self.assertEqual(re.match("(a)", "a").regs, ((0, 1), (0, 1))) 389 | self.assertNotEqual(re.match("(a)", "a").re, None) 390 | 391 | def test_special_escapes(self): 392 | self.assertEqual(re.search(r"\b(b.)\b", 393 | "abcd abc bcd bx").group(1), "bx") 394 | self.assertEqual(re.search(r"\B(b.)\B", 395 | "abc bcd bc abxd").group(1), "bx") 396 | # PCRE: python-pcre doesn't support LOCALE flag 397 | #self.assertEqual(re.search(r"\b(b.)\b", 398 | # "abcd abc bcd bx", re.LOCALE).group(1), "bx") 399 | #self.assertEqual(re.search(r"\B(b.)\B", 400 | # "abc bcd bc abxd", re.LOCALE).group(1), "bx") 401 | self.assertEqual(re.search(r"\b(b.)\b", 402 | "abcd abc bcd bx", re.UNICODE).group(1), "bx") 403 | self.assertEqual(re.search(r"\B(b.)\B", 404 | "abc bcd bc abxd", re.UNICODE).group(1), "bx") 405 | self.assertEqual(re.search(r"^abc$", "\nabc\n", re.M).group(0), "abc") 406 | self.assertEqual(re.search(r"^\Aabc\Z$", "abc", re.M).group(0), "abc") 407 | self.assertEqual(re.search(r"^\Aabc\Z$", "\nabc\n", re.M), None) 408 | self.assertEqual(re.search(r"\b(b.)\b", 409 | u"abcd abc bcd bx").group(1), "bx") 410 | self.assertEqual(re.search(r"\B(b.)\B", 411 | u"abc bcd bc abxd").group(1), "bx") 412 | self.assertEqual(re.search(r"^abc$", u"\nabc\n", re.M).group(0), "abc") 413 | self.assertEqual(re.search(r"^\Aabc\Z$", u"abc", re.M).group(0), "abc") 414 | self.assertEqual(re.search(r"^\Aabc\Z$", u"\nabc\n", re.M), None) 415 | self.assertEqual(re.search(r"\d\D\w\W\s\S", 416 | "1aa! a").group(0), "1aa! a") 417 | # PCRE: python-pcre doesn't support LOCALE flag 418 | #self.assertEqual(re.search(r"\d\D\w\W\s\S", 419 | # "1aa! 
a", re.LOCALE).group(0), "1aa! a") 420 | self.assertEqual(re.search(r"\d\D\w\W\s\S", 421 | "1aa! a", re.UNICODE).group(0), "1aa! a") 422 | 423 | def test_string_boundaries(self): 424 | # See http://bugs.python.org/issue10713 425 | self.assertEqual(re.search(r"\b(abc)\b", "abc").group(1), 426 | "abc") 427 | # There's a word boundary at the start of a string. 428 | self.assertTrue(re.match(r"\b", "abc")) 429 | # A non-empty string includes a non-boundary zero-length match. 430 | self.assertTrue(re.search(r"\B", "abc")) 431 | # There is no non-boundary match at the start of a string. 432 | self.assertFalse(re.match(r"\B", "abc")) 433 | # However, an empty string contains no word boundaries, and also no 434 | # non-boundaries. 435 | # PCRE: PCRE needs the NOTEMPTY_ATSTART flag to pass 436 | #self.assertEqual(re.search(r"\B", ""), None) 437 | self.assertEqual(re.compile(r"\B").search("", flags=re.NOTEMPTY_ATSTART), None) 438 | # This one is questionable and different from the perlre behaviour, 439 | # but describes current behavior. 440 | self.assertEqual(re.search(r"\b", ""), None) 441 | # A single word-character string has two boundaries, but no 442 | # non-boundary gaps. 443 | self.assertEqual(len(re.findall(r"\b", "a")), 2) 444 | self.assertEqual(len(re.findall(r"\B", "a")), 0) 445 | # If there are no words, there are no boundaries 446 | self.assertEqual(len(re.findall(r"\b", " ")), 0) 447 | self.assertEqual(len(re.findall(r"\b", " ")), 0) 448 | # Can match around the whitespace. 449 | self.assertEqual(len(re.findall(r"\B", " ")), 2) 450 | 451 | def test_bigcharset(self): 452 | self.assertEqual(re.match(u"([\u2222\u2223])", 453 | u"\u2222").group(1), u"\u2222") 454 | self.assertEqual(re.match(u"([\u2222\u2223])", 455 | u"\u2222", re.UNICODE).group(1), u"\u2222") 456 | r = u'[%s]' % u''.join(map(unichr, range(256, 2**16, 255))) 457 | self.assertEqual(re.match(r, u"\uff01", re.UNICODE).group(), u"\uff01") 458 | 459 | def test_big_codesize(self): 460 | # Issue #1160 461 | # PCRE: Size needs to be reduced to pass. 
462 | #r = re.compile('|'.join(('%d'%x for x in range(10000)))) 463 | r = re.compile('|'.join(('%d'%x for x in range(6000)))) # PCRE 464 | self.assertIsNotNone(r.match('1000')) 465 | self.assertIsNotNone(r.match('9999')) 466 | 467 | def test_anyall(self): 468 | self.assertEqual(re.match("a.b", "a\nb", re.DOTALL).group(0), 469 | "a\nb") 470 | self.assertEqual(re.match("a.*b", "a\n\nb", re.DOTALL).group(0), 471 | "a\n\nb") 472 | 473 | def test_non_consuming(self): 474 | self.assertEqual(re.match("(a(?=\s[^a]))", "a b").group(1), "a") 475 | self.assertEqual(re.match("(a(?=\s[^a]*))", "a b").group(1), "a") 476 | self.assertEqual(re.match("(a(?=\s[abc]))", "a b").group(1), "a") 477 | self.assertEqual(re.match("(a(?=\s[abc]*))", "a bc").group(1), "a") 478 | self.assertEqual(re.match(r"(a)(?=\s\1)", "a a").group(1), "a") 479 | self.assertEqual(re.match(r"(a)(?=\s\1*)", "a aa").group(1), "a") 480 | self.assertEqual(re.match(r"(a)(?=\s(abc|a))", "a a").group(1), "a") 481 | 482 | self.assertEqual(re.match(r"(a(?!\s[^a]))", "a a").group(1), "a") 483 | self.assertEqual(re.match(r"(a(?!\s[abc]))", "a d").group(1), "a") 484 | self.assertEqual(re.match(r"(a)(?!\s\1)", "a b").group(1), "a") 485 | self.assertEqual(re.match(r"(a)(?!\s(abc|a))", "a b").group(1), "a") 486 | 487 | def test_ignore_case(self): 488 | self.assertEqual(re.match("abc", "ABC", re.I).group(0), "ABC") 489 | self.assertEqual(re.match("abc", u"ABC", re.I).group(0), "ABC") 490 | self.assertEqual(re.match(r"(a\s[^a])", "a b", re.I).group(1), "a b") 491 | self.assertEqual(re.match(r"(a\s[^a]*)", "a bb", re.I).group(1), "a bb") 492 | self.assertEqual(re.match(r"(a\s[abc])", "a b", re.I).group(1), "a b") 493 | self.assertEqual(re.match(r"(a\s[abc]*)", "a bb", re.I).group(1), "a bb") 494 | self.assertEqual(re.match(r"((a)\s\2)", "a a", re.I).group(1), "a a") 495 | self.assertEqual(re.match(r"((a)\s\2*)", "a aa", re.I).group(1), "a aa") 496 | self.assertEqual(re.match(r"((a)\s(abc|a))", "a a", re.I).group(1), "a a") 497 | self.assertEqual(re.match(r"((a)\s(abc|a)*)", "a aa", re.I).group(1), "a aa") 498 | 499 | def test_category(self): 500 | self.assertEqual(re.match(r"(\s)", " ").group(1), " ") 501 | 502 | def test_getlower(self): 503 | # PCRE: we're not testing _sre here 504 | #import _sre 505 | #self.assertEqual(_sre.getlower(ord('A'), 0), ord('a')) 506 | #self.assertEqual(_sre.getlower(ord('A'), re.LOCALE), ord('a')) 507 | #self.assertEqual(_sre.getlower(ord('A'), re.UNICODE), ord('a')) 508 | 509 | self.assertEqual(re.match("abc", "ABC", re.I).group(0), "ABC") 510 | self.assertEqual(re.match("abc", u"ABC", re.I).group(0), "ABC") 511 | 512 | def test_not_literal(self): 513 | self.assertEqual(re.search("\s([^a])", " b").group(1), "b") 514 | self.assertEqual(re.search("\s([^a]*)", " bb").group(1), "bb") 515 | 516 | def test_search_coverage(self): 517 | self.assertEqual(re.search("\s(b)", " b").group(1), "b") 518 | self.assertEqual(re.search("a\s", "a ").group(0), "a ") 519 | 520 | def assertMatch(self, pattern, text, match=None, span=None, 521 | matcher=re.match): 522 | if match is None and span is None: 523 | # the pattern matches the whole text 524 | match = text 525 | span = (0, len(text)) 526 | elif match is None or span is None: 527 | raise ValueError('If match is not None, span should be specified ' 528 | '(and vice versa).') 529 | m = matcher(pattern, text) 530 | self.assertTrue(m) 531 | self.assertEqual(m.group(), match) 532 | self.assertEqual(m.span(), span) 533 | 534 | def test_re_escape(self): 535 | alnum_chars = string.ascii_letters 
+ string.digits 536 | p = u''.join(unichr(i) for i in range(256)) 537 | for c in p: 538 | if c in alnum_chars: 539 | self.assertEqual(re.escape(c), c) 540 | elif c == u'\x00': 541 | self.assertEqual(re.escape(c), u'\\000') 542 | else: 543 | self.assertEqual(re.escape(c), u'\\' + c) 544 | self.assertMatch(re.escape(c), c) 545 | self.assertMatch(re.escape(p), p) 546 | 547 | def test_re_escape_byte(self): 548 | alnum_chars = (string.ascii_letters + string.digits).encode('ascii') 549 | p = ''.join(chr(i) for i in range(256)) 550 | for b in p: 551 | if b in alnum_chars: 552 | self.assertEqual(re.escape(b), b) 553 | elif b == b'\x00': 554 | self.assertEqual(re.escape(b), b'\\000') 555 | else: 556 | self.assertEqual(re.escape(b), b'\\' + b) 557 | self.assertMatch(re.escape(b), b) 558 | self.assertMatch(re.escape(p), p) 559 | 560 | def test_re_escape_non_ascii(self): 561 | s = u'xxx\u2620\u2620\u2620xxx' 562 | s_escaped = re.escape(s) 563 | self.assertEqual(s_escaped, u'xxx\\\u2620\\\u2620\\\u2620xxx') 564 | self.assertMatch(s_escaped, s) 565 | self.assertMatch(u'.%s+.' % re.escape(u'\u2620'), s, 566 | u'x\u2620\u2620\u2620x', (2, 7), re.search) 567 | 568 | def test_re_escape_non_ascii_bytes(self): 569 | b = u'y\u2620y\u2620y'.encode('utf-8') 570 | b_escaped = re.escape(b) 571 | self.assertEqual(b_escaped, b'y\\\xe2\\\x98\\\xa0y\\\xe2\\\x98\\\xa0y') 572 | self.assertMatch(b_escaped, b) 573 | res = re.findall(re.escape(u'\u2620'.encode('utf-8')), b) 574 | self.assertEqual(len(res), 2) 575 | 576 | def test_pickling(self): 577 | import pickle 578 | self.pickle_test(pickle) 579 | import cPickle 580 | self.pickle_test(cPickle) 581 | # PCRE: we're not testing _sre here 582 | # old pickles expect the _compile() reconstructor in sre module 583 | #import_module("sre", deprecated=True) 584 | #from sre import _compile 585 | 586 | def pickle_test(self, pickle): 587 | oldpat = re.compile('a(?:b|(c|e){1,2}?|d)+?(.)') 588 | s = pickle.dumps(oldpat) 589 | newpat = pickle.loads(s) 590 | self.assertEqual(oldpat, newpat) 591 | 592 | def test_constants(self): 593 | self.assertEqual(re.I, re.IGNORECASE) 594 | # PCRE: python-pcre doesn't support LOCALE flag 595 | #self.assertEqual(re.L, re.LOCALE) 596 | self.assertEqual(re.M, re.MULTILINE) 597 | self.assertEqual(re.S, re.DOTALL) 598 | self.assertEqual(re.X, re.VERBOSE) 599 | 600 | def test_flags(self): 601 | # PCRE: python-pcre doesn't support LOCALE flag 602 | #for flag in [re.I, re.M, re.X, re.S, re.L]: 603 | for flag in [re.I, re.M, re.X, re.S]: 604 | self.assertNotEqual(re.compile('^pattern$', flag), None) 605 | 606 | def test_sre_character_literals(self): 607 | for i in [0, 8, 16, 32, 64, 127, 128, 255]: 608 | self.assertNotEqual(re.match(r"\%03o" % i, chr(i)), None) 609 | self.assertNotEqual(re.match(r"\%03o0" % i, chr(i)+"0"), None) 610 | self.assertNotEqual(re.match(r"\%03o8" % i, chr(i)+"8"), None) 611 | self.assertNotEqual(re.match(r"\x%02x" % i, chr(i)), None) 612 | self.assertNotEqual(re.match(r"\x%02x0" % i, chr(i)+"0"), None) 613 | self.assertNotEqual(re.match(r"\x%02xz" % i, chr(i)+"z"), None) 614 | # PCRE: PCRE doesn't see this as an error 615 | #self.assertRaises(re.error, re.match, "\911", "") 616 | 617 | def test_sre_character_class_literals(self): 618 | for i in [0, 8, 16, 32, 64, 127, 128, 255]: 619 | self.assertNotEqual(re.match(r"[\%03o]" % i, chr(i)), None) 620 | self.assertNotEqual(re.match(r"[\%03o0]" % i, chr(i)), None) 621 | self.assertNotEqual(re.match(r"[\%03o8]" % i, chr(i)), None) 622 | self.assertNotEqual(re.match(r"[\x%02x]" % i, 
chr(i)), None) 623 | self.assertNotEqual(re.match(r"[\x%02x0]" % i, chr(i)), None) 624 | self.assertNotEqual(re.match(r"[\x%02xz]" % i, chr(i)), None) 625 | # PCRE: PCRE doesn't see this as an error 626 | #self.assertRaises(re.error, re.match, "[\911]", "") 627 | 628 | def test_bug_113254(self): 629 | self.assertEqual(re.match(r'(a)|(b)', 'b').start(1), -1) 630 | self.assertEqual(re.match(r'(a)|(b)', 'b').end(1), -1) 631 | self.assertEqual(re.match(r'(a)|(b)', 'b').span(1), (-1, -1)) 632 | 633 | def test_bug_527371(self): 634 | # bug described in patches 527371/672491 635 | self.assertEqual(re.match(r'(a)?a','a').lastindex, None) 636 | self.assertEqual(re.match(r'(a)(b)?b','ab').lastindex, 1) 637 | self.assertEqual(re.match(r'(?P<a>a)(?P<b>b)?b','ab').lastgroup, 'a') 638 | # PCRE: PCRE's lastgroup/lastindex is based on opening parentheses 639 | #self.assertEqual(re.match("(?P<a>a(b))", "ab").lastgroup, 'a') 640 | #self.assertEqual(re.match("((a))", "a").lastindex, 1) 641 | self.assertEqual(re.match("(?P<a>a(b))", "ab").lastgroup, None) 642 | self.assertEqual(re.match("((a))", "a").lastindex, 2) 643 | 644 | def test_bug_545855(self): 645 | # bug 545855 -- This pattern failed to cause a compile error as it 646 | # should, instead provoking a TypeError. 647 | self.assertRaises(re.error, re.compile, 'foo[a-') 648 | 649 | def test_bug_418626(self): 650 | # bugs 418626 at al. -- Testing Greg Chapman's addition of op code 651 | # SRE_OP_MIN_REPEAT_ONE for eliminating recursion on simple uses of 652 | # pattern '*?' on a long string. 653 | self.assertEqual(re.match('.*?c', 10000*'ab'+'cd').end(0), 20001) 654 | self.assertEqual(re.match('.*?cd', 5000*'ab'+'c'+5000*'ab'+'cde').end(0), 655 | 20003) 656 | self.assertEqual(re.match('.*?cd', 20000*'abc'+'de').end(0), 60001) 657 | # non-simple '*?' still used to hit the recursion limit, before the 658 | # non-recursive scheme was implemented. 659 | self.assertEqual(re.search('(a|b)*?c', 10000*'ab'+'cd').end(0), 20001) 660 | 661 | def test_bug_612074(self): 662 | pat=u"["+re.escape(u"\u2039")+u"]" 663 | self.assertEqual(re.compile(pat) and 1, 1) 664 | 665 | def test_stack_overflow(self): 666 | # nasty cases that used to overflow the straightforward recursive 667 | # implementation of repeated groups.
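The modified expectations in `test_bug_527371` above encode a real behavioural difference: python-pcre derives `lastindex`/`lastgroup` from PCRE's opening-parenthesis order, while the built-in `re` reports the outer group. A minimal sketch of the two results for the same patterns, assuming both modules are importable:

```python
import re      # built-in engine
import pcre    # python-pcre

# Built-in re treats the outer group as the "last" matched group here.
assert re.match("((a))", "a").lastindex == 1
assert re.match("(?P<a>a(b))", "ab").lastgroup == 'a'

# python-pcre follows PCRE's opening-parenthesis order, so the inner,
# unnamed group wins: lastindex is 2 and lastgroup is None.
assert pcre.match("((a))", "a").lastindex == 2
assert pcre.match("(?P<a>a(b))", "ab").lastgroup is None
```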
668 | self.assertEqual(re.match('(x)*', 50000*'x').group(1), 'x') 669 | self.assertEqual(re.match('(x)*y', 50000*'x'+'y').group(1), 'x') 670 | self.assertEqual(re.match('(x)*?y', 50000*'x'+'y').group(1), 'x') 671 | 672 | def test_unlimited_zero_width_repeat(self): 673 | # Issue #9669 674 | self.assertIsNone(re.match(r'(?:a?)*y', 'z')) 675 | self.assertIsNone(re.match(r'(?:a?)+y', 'z')) 676 | self.assertIsNone(re.match(r'(?:a?){2,}y', 'z')) 677 | self.assertIsNone(re.match(r'(?:a?)*?y', 'z')) 678 | self.assertIsNone(re.match(r'(?:a?)+?y', 'z')) 679 | self.assertIsNone(re.match(r'(?:a?){2,}?y', 'z')) 680 | 681 | # PCRE: python-pcre doesn't support scanner APIs 682 | def __test_scanner(self): 683 | def s_ident(scanner, token): return token 684 | def s_operator(scanner, token): return "op%s" % token 685 | def s_float(scanner, token): return float(token) 686 | def s_int(scanner, token): return int(token) 687 | 688 | scanner = Scanner([ 689 | (r"[a-zA-Z_]\w*", s_ident), 690 | (r"\d+\.\d*", s_float), 691 | (r"\d+", s_int), 692 | (r"=|\+|-|\*|/", s_operator), 693 | (r"\s+", None), 694 | ]) 695 | 696 | self.assertNotEqual(scanner.scanner.scanner("").pattern, None) 697 | 698 | self.assertEqual(scanner.scan("sum = 3*foo + 312.50 + bar"), 699 | (['sum', 'op=', 3, 'op*', 'foo', 'op+', 312.5, 700 | 'op+', 'bar'], '')) 701 | 702 | def test_bug_448951(self): 703 | # bug 448951 (similar to 429357, but with single char match) 704 | # (Also test greedy matches.) 705 | for op in '','?','*': 706 | self.assertEqual(re.match(r'((.%s):)?z'%op, 'z').groups(), 707 | (None, None)) 708 | self.assertEqual(re.match(r'((.%s):)?z'%op, 'a:z').groups(), 709 | ('a:', 'a')) 710 | 711 | def test_bug_725106(self): 712 | # capturing groups in alternatives in repeats 713 | self.assertEqual(re.match('^((a)|b)*', 'abc').groups(), 714 | ('b', 'a')) 715 | self.assertEqual(re.match('^(([ab])|c)*', 'abc').groups(), 716 | ('c', 'b')) 717 | self.assertEqual(re.match('^((d)|[ab])*', 'abc').groups(), 718 | ('b', None)) 719 | self.assertEqual(re.match('^((a)c|[ab])*', 'abc').groups(), 720 | ('b', None)) 721 | self.assertEqual(re.match('^((a)|b)*?c', 'abc').groups(), 722 | ('b', 'a')) 723 | self.assertEqual(re.match('^(([ab])|c)*?d', 'abcd').groups(), 724 | ('c', 'b')) 725 | self.assertEqual(re.match('^((d)|[ab])*?c', 'abc').groups(), 726 | ('b', None)) 727 | self.assertEqual(re.match('^((a)c|[ab])*?c', 'abc').groups(), 728 | ('b', None)) 729 | 730 | def test_bug_725149(self): 731 | # mark_stack_base restoring before restoring marks 732 | self.assertEqual(re.match('(a)(?:(?=(b)*)c)*', 'abb').groups(), 733 | ('a', None)) 734 | self.assertEqual(re.match('(a)((?!(b)*))*', 'abb').groups(), 735 | ('a', None, None)) 736 | 737 | def test_bug_764548(self): 738 | # bug 764548, re.compile() barfs on str/unicode subclasses 739 | try: 740 | unicode 741 | except NameError: 742 | self.skipTest('no problem if we have no unicode') 743 | class my_unicode(unicode): pass 744 | pat = re.compile(my_unicode("abc")) 745 | self.assertEqual(pat.match("xyz"), None) 746 | 747 | def test_finditer(self): 748 | iter = re.finditer(r":+", "a:b::c:::d") 749 | self.assertEqual([item.group(0) for item in iter], 750 | [":", "::", ":::"]) 751 | 752 | def test_bug_926075(self): 753 | try: 754 | unicode 755 | except NameError: 756 | self.skipTest('no problem if we have no unicode') 757 | self.assertTrue(re.compile('bug_926075') is not 758 | re.compile(eval("u'bug_926075'"))) 759 | 760 | def test_bug_931848(self): 761 | try: 762 | unicode 763 | except NameError: 764 | 
self.skipTest('no problem if we have no unicode') 765 | pattern = eval('u"[\u002E\u3002\uFF0E\uFF61]"') 766 | self.assertEqual(re.compile(pattern).split("a.b.c"), 767 | ['a','b','c']) 768 | 769 | def test_bug_581080(self): 770 | iter = re.finditer(r"\s", "a b") 771 | self.assertEqual(iter.next().span(), (1,2)) 772 | self.assertRaises(StopIteration, iter.next) 773 | 774 | # PCRE: python-pcre doesn't support scanner APIs 775 | return 776 | scanner = re.compile(r"\s").scanner("a b") 777 | self.assertEqual(scanner.search().span(), (1, 2)) 778 | self.assertEqual(scanner.search(), None) 779 | 780 | def test_bug_817234(self): 781 | iter = re.finditer(r".*", "asdf") 782 | self.assertEqual(iter.next().span(), (0, 4)) 783 | self.assertEqual(iter.next().span(), (4, 4)) 784 | self.assertRaises(StopIteration, iter.next) 785 | 786 | def test_bug_6561(self): 787 | # '\d' should match characters in Unicode category 'Nd' 788 | # (Number, Decimal Digit), but not those in 'Nl' (Number, 789 | # Letter) or 'No' (Number, Other). 790 | decimal_digits = [ 791 | u'\u0037', # '\N{DIGIT SEVEN}', category 'Nd' 792 | u'\u0e58', # '\N{THAI DIGIT SIX}', category 'Nd' 793 | u'\uff10', # '\N{FULLWIDTH DIGIT ZERO}', category 'Nd' 794 | ] 795 | for x in decimal_digits: 796 | self.assertEqual(re.match('^\d$', x, re.UNICODE).group(0), x) 797 | 798 | not_decimal_digits = [ 799 | u'\u2165', # '\N{ROMAN NUMERAL SIX}', category 'Nl' 800 | u'\u3039', # '\N{HANGZHOU NUMERAL TWENTY}', category 'Nl' 801 | u'\u2082', # '\N{SUBSCRIPT TWO}', category 'No' 802 | u'\u32b4', # '\N{CIRCLED NUMBER THIRTY NINE}', category 'No' 803 | ] 804 | for x in not_decimal_digits: 805 | self.assertIsNone(re.match('^\d$', x, re.UNICODE)) 806 | 807 | def test_empty_array(self): 808 | # SF buf 1647541 809 | import array 810 | for typecode in 'cbBuhHiIlLfd': 811 | a = array.array(typecode) 812 | self.assertEqual(re.compile("bla").match(a), None) 813 | self.assertEqual(re.compile("").match(a).groups(), ()) 814 | 815 | def test_inline_flags(self): 816 | # Bug #1700 817 | upper_char = unichr(0x1ea0) # Latin Capital Letter A with Dot Bellow 818 | lower_char = unichr(0x1ea1) # Latin Small Letter A with Dot Bellow 819 | 820 | p = re.compile(upper_char, re.I | re.U) 821 | q = p.match(lower_char) 822 | self.assertNotEqual(q, None) 823 | 824 | p = re.compile(lower_char, re.I | re.U) 825 | q = p.match(upper_char) 826 | self.assertNotEqual(q, None) 827 | 828 | p = re.compile('(?i)' + upper_char, re.U) 829 | q = p.match(lower_char) 830 | self.assertNotEqual(q, None) 831 | 832 | p = re.compile('(?i)' + lower_char, re.U) 833 | q = p.match(upper_char) 834 | self.assertNotEqual(q, None) 835 | 836 | # PCRE: UNICODE mode can be enabled using (*UCP) instead of (?u) 837 | #p = re.compile('(?iu)' + upper_char) 838 | p = re.compile('(*UCP)(?i)' + upper_char) 839 | q = p.match(lower_char) 840 | self.assertNotEqual(q, None) 841 | 842 | # PCRE: UNICODE mode can be enabled using (*UCP) instead of (?u) 843 | #p = re.compile('(?iu)' + lower_char) 844 | p = re.compile('(*UCP)(?i)' + lower_char) 845 | q = p.match(upper_char) 846 | self.assertNotEqual(q, None) 847 | 848 | def test_dollar_matches_twice(self): 849 | "$ matches the end of string, and just before the terminating \n" 850 | pattern = re.compile('$') 851 | self.assertEqual(pattern.sub('#', 'a\nb\n'), 'a\nb#\n#') 852 | self.assertEqual(pattern.sub('#', 'a\nb\nc'), 'a\nb\nc#') 853 | self.assertEqual(pattern.sub('#', '\n'), '#\n#') 854 | 855 | pattern = re.compile('$', re.MULTILINE) 856 | self.assertEqual(pattern.sub('#', 'a\nb\n' 
), 'a#\nb#\n#' ) 857 | self.assertEqual(pattern.sub('#', 'a\nb\nc'), 'a#\nb#\nc#') 858 | self.assertEqual(pattern.sub('#', '\n'), '#\n#') 859 | 860 | def test_dealloc(self): 861 | # PCRE: we're not testing _sre here 862 | # issue 3299: check for segfault in debug build 863 | #import _sre 864 | # the overflow limit is different on wide and narrow builds and it 865 | # depends on the definition of SRE_CODE (see sre.h). 866 | # 2**128 should be big enough to overflow on both. For smaller values 867 | # a RuntimeError is raised instead of OverflowError. 868 | #long_overflow = 2**128 869 | # PCRE: finditer is implemented as generator function -- next() has to be called 870 | #self.assertRaises(TypeError, re.finditer, "a", {}) 871 | self.assertRaises(TypeError, re.finditer("a", {}).next) 872 | #self.assertRaises(OverflowError, _sre.compile, "abc", 0, [long_overflow]) 873 | 874 | def test_compile(self): 875 | # Test return value when given string and pattern as parameter 876 | pattern = re.compile('random pattern') 877 | self.assertIsInstance(pattern, re._pattern_type) 878 | same_pattern = re.compile(pattern) 879 | self.assertIsInstance(same_pattern, re._pattern_type) 880 | self.assertIs(same_pattern, pattern) 881 | # Test behaviour when not given a string or pattern as parameter 882 | self.assertRaises(TypeError, re.compile, 0) 883 | 884 | def test_bug_13899(self): 885 | # Issue #13899: re pattern r"[\A]" should work like "A" but matches 886 | # nothing. Ditto B and Z. 887 | self.assertEqual(re.findall(r'[\A\B\b\C\Z]', 'AB\bCZ'), 888 | ['A', 'B', '\b', 'C', 'Z']) 889 | 890 | @precisionbigmemtest(size=_2G, memuse=1) 891 | def test_large_search(self, size): 892 | # Issue #10182: indices were 32-bit-truncated. 893 | s = 'a' * size 894 | m = re.search('$', s) 895 | self.assertIsNotNone(m) 896 | self.assertEqual(m.start(), size) 897 | self.assertEqual(m.end(), size) 898 | 899 | # The huge memuse is because of re.sub() using a list and a join() 900 | # to create the replacement result. 901 | @precisionbigmemtest(size=_2G, memuse=16 + 2) 902 | def test_large_subn(self, size): 903 | # Issue #10182: indices were 32-bit-truncated. 904 | s = 'a' * size 905 | r, n = re.subn('', '', s) 906 | self.assertEqual(r, s) 907 | self.assertEqual(n, size + 1) 908 | 909 | 910 | def test_repeat_minmax_overflow(self): 911 | # Issue #13169 912 | 913 | # PCRE: PCRE requires "{0,x}" instead of "{,x}" 914 | # PCRE: PCRE doesn't support repetition over 64K 915 | # PCRE: 2**128 repetition raises NO error -- PCRE bug? 916 | 917 | string = "x" * 100000 918 | self.assertEqual(re.match(r".{65535}", string).span(), (0, 65535)) 919 | self.assertEqual(re.match(r".{0,65535}", string).span(), (0, 65535)) 920 | self.assertEqual(re.match(r".{65534,}?", string).span(), (0, 65534)) 921 | self.assertEqual(re.match(r".{65535}", string).span(), (0, 65535)) 922 | self.assertEqual(re.match(r".{0,65535}", string).span(), (0, 65535)) 923 | self.assertEqual(re.match(r".{65535,}?", string).span(), (0, 65535)) 924 | self.assertRaises(OverflowError, re.compile, r".{%d}" % 65536) 925 | self.assertRaises(OverflowError, re.compile, r".{0,%d}" % 65536) 926 | self.assertRaises(OverflowError, re.compile, r".{%d,}?" 
% 65536) 927 | self.assertRaises(OverflowError, re.compile, r".{%d,%d}" % (65537, 65536)) 928 | return 929 | 930 | string = "x" * 100000 931 | self.assertEqual(re.match(r".{65535}", string).span(), (0, 65535)) 932 | self.assertEqual(re.match(r".{,65535}", string).span(), (0, 65535)) 933 | self.assertEqual(re.match(r".{65535,}?", string).span(), (0, 65535)) 934 | self.assertEqual(re.match(r".{65536}", string).span(), (0, 65536)) 935 | self.assertEqual(re.match(r".{,65536}", string).span(), (0, 65536)) 936 | self.assertEqual(re.match(r".{65536,}?", string).span(), (0, 65536)) 937 | # 2**128 should be big enough to overflow both SRE_CODE and Py_ssize_t. 938 | self.assertRaises(OverflowError, re.compile, r".{%d}" % 2**128) 939 | self.assertRaises(OverflowError, re.compile, r".{,%d}" % 2**128) 940 | self.assertRaises(OverflowError, re.compile, r".{%d,}?" % 2**128) 941 | self.assertRaises(OverflowError, re.compile, r".{%d,%d}" % (2**129, 2**128)) 942 | 943 | @cpython_only 944 | def test_repeat_minmax_overflow_maxrepeat(self): 945 | try: 946 | # PCRE: this is defined in pcre module (and is always 64K) 947 | #from _sre import MAXREPEAT 948 | from pcre import MAXREPEAT 949 | except ImportError: 950 | self.skipTest('requires _sre.MAXREPEAT constant') 951 | 952 | # PCRE: PCRE requires "{0,x}" instead of "{,x}" 953 | 954 | string = "x" * 10000 955 | self.assertIsNone(re.match(r".{%d}" % (MAXREPEAT - 1), string)) 956 | self.assertEqual(re.match(r".{0,%d}" % (MAXREPEAT - 1), string).span(), 957 | (0, 10000)) 958 | self.assertIsNone(re.match(r".{%d,}?" % (MAXREPEAT - 1), string)) 959 | self.assertRaises(OverflowError, re.compile, r".{%d}" % MAXREPEAT) 960 | self.assertRaises(OverflowError, re.compile, r".{0,%d}" % MAXREPEAT) 961 | self.assertRaises(OverflowError, re.compile, r".{%d,}?" % MAXREPEAT) 962 | return 963 | 964 | string = "x" * 100000 965 | self.assertIsNone(re.match(r".{%d}" % (MAXREPEAT - 1), string)) 966 | self.assertEqual(re.match(r".{,%d}" % (MAXREPEAT - 1), string).span(), 967 | (0, 100000)) 968 | self.assertIsNone(re.match(r".{%d,}?" % (MAXREPEAT - 1), string)) 969 | self.assertRaises(OverflowError, re.compile, r".{%d}" % MAXREPEAT) 970 | self.assertRaises(OverflowError, re.compile, r".{,%d}" % MAXREPEAT) 971 | self.assertRaises(OverflowError, re.compile, r".{%d,}?" % MAXREPEAT) 972 | 973 | # PCRE: PCRE's errors don't contain group names 974 | def __test_backref_group_name_in_exception(self): 975 | # Issue 17341: Poor error message when compiling invalid regex 976 | with self.assertRaisesRegexp(sre_constants.error, ''): 977 | re.compile('(?P=)') 978 | 979 | # PCRE: PCRE's errors don't contain group names 980 | def __test_group_name_in_exception(self): 981 | # Issue 17341: Poor error message when compiling invalid regex 982 | with self.assertRaisesRegexp(sre_constants.error, '\?foo'): 983 | re.compile('(?P)') 984 | 985 | def test_issue17998(self): 986 | for reps in '*', '+', '?', '{1}': 987 | for mod in '', '?': 988 | pattern = '.' 
+ reps + mod + 'yz' 989 | self.assertEqual(re.compile(pattern, re.S).findall('xyz'), 990 | ['xyz'], msg=pattern) 991 | pattern = pattern.encode() 992 | self.assertEqual(re.compile(pattern, re.S).findall(b'xyz'), 993 | [b'xyz'], msg=pattern) 994 | 995 | 996 | def test_bug_2537(self): 997 | # issue 2537: empty submatches 998 | for outer_op in ('{0,}', '*', '+', '{1,187}'): 999 | for inner_op in ('{0,}', '*', '?'): 1000 | r = re.compile("^((x|y)%s)%s" % (inner_op, outer_op)) 1001 | m = r.match("xyyzy") 1002 | self.assertEqual(m.group(0), "xyy") 1003 | self.assertEqual(m.group(1), "") 1004 | self.assertEqual(m.group(2), "y") 1005 | 1006 | # PCRE: python-pcre doesn't support DEBUG flag 1007 | def __test_debug_flag(self): 1008 | with captured_stdout() as out: 1009 | re.compile('foo', re.DEBUG) 1010 | self.assertEqual(out.getvalue().splitlines(), 1011 | ['literal 102', 'literal 111', 'literal 111']) 1012 | # Debug output is output again even a second time (bypassing 1013 | # the cache -- issue #20426). 1014 | with captured_stdout() as out: 1015 | re.compile('foo', re.DEBUG) 1016 | self.assertEqual(out.getvalue().splitlines(), 1017 | ['literal 102', 'literal 111', 'literal 111']) 1018 | 1019 | def test_keyword_parameters(self): 1020 | # Issue #20283: Accepting the string keyword parameter. 1021 | pat = re.compile(r'(ab)') 1022 | self.assertEqual( 1023 | pat.match(string='abracadabra', pos=7, endpos=10).span(), (7, 9)) 1024 | self.assertEqual( 1025 | pat.search(string='abracadabra', pos=3, endpos=10).span(), (7, 9)) 1026 | self.assertEqual( 1027 | pat.findall(string='abracadabra', pos=3, endpos=10), ['ab']) 1028 | self.assertEqual( 1029 | pat.split(string='abracadabra', maxsplit=1), 1030 | ['', 'ab', 'racadabra']) 1031 | 1032 | 1033 | def run_re_tests(): 1034 | # PCRE: use pcre_tests instead of re_tests 1035 | #from test.re_tests import tests, SUCCEED, FAIL, SYNTAX_ERROR 1036 | from test.pcre_tests import tests, SUCCEED, FAIL, SYNTAX_ERROR 1037 | if verbose: 1038 | #print 'Running re_tests test suite' 1039 | print 'Running pcre_tests test suite' 1040 | else: 1041 | # To save time, only run the first and last 10 tests 1042 | #tests = tests[:10] + tests[-10:] 1043 | pass 1044 | 1045 | for t in tests: 1046 | sys.stdout.flush() 1047 | pattern = s = outcome = repl = expected = None 1048 | if len(t) == 5: 1049 | pattern, s, outcome, repl, expected = t 1050 | elif len(t) == 3: 1051 | pattern, s, outcome = t 1052 | else: 1053 | raise ValueError, ('Test tuples should have 3 or 5 fields', t) 1054 | 1055 | try: 1056 | obj = re.compile(pattern) 1057 | except re.error: 1058 | if outcome == SYNTAX_ERROR: pass # Expected a syntax error 1059 | else: 1060 | print '=== Syntax error:', t 1061 | except KeyboardInterrupt: raise KeyboardInterrupt 1062 | except: 1063 | print '*** Unexpected error ***', t 1064 | if verbose: 1065 | traceback.print_exc(file=sys.stdout) 1066 | else: 1067 | try: 1068 | result = obj.search(s) 1069 | except re.error, msg: 1070 | print '=== Unexpected exception', t, repr(msg) 1071 | if outcome == SYNTAX_ERROR: 1072 | # This should have been a syntax error; forget it. 1073 | pass 1074 | elif outcome == FAIL: 1075 | if result is None: pass # No match, as expected 1076 | else: print '=== Succeeded incorrectly', t 1077 | elif outcome == SUCCEED: 1078 | if result is not None: 1079 | # Matched, as expected, so now we compute the 1080 | # result string and compare it to our expected result. 
1081 | start, end = result.span(0) 1082 | vardict={'found': result.group(0), 1083 | 'groups': result.group(), 1084 | 'flags': result.re.flags} 1085 | for i in range(1, 100): 1086 | try: 1087 | gi = result.group(i) 1088 | # Special hack because else the string concat fails: 1089 | if gi is None: 1090 | gi = "None" 1091 | except IndexError: 1092 | gi = "Error" 1093 | vardict['g%d' % i] = gi 1094 | for i in result.re.groupindex.keys(): 1095 | try: 1096 | gi = result.group(i) 1097 | if gi is None: 1098 | gi = "None" 1099 | except IndexError: 1100 | gi = "Error" 1101 | vardict[i] = gi 1102 | repl = eval(repl, vardict) 1103 | if repl != expected: 1104 | print '=== grouping error', t, 1105 | print repr(repl) + ' should be ' + repr(expected) 1106 | else: 1107 | print '=== Failed incorrectly', t 1108 | 1109 | # Try the match on a unicode string, and check that it 1110 | # still succeeds. 1111 | try: 1112 | result = obj.search(unicode(s, "latin-1")) 1113 | if result is None: 1114 | print '=== Fails on unicode match', t 1115 | except NameError: 1116 | continue # 1.5.2 1117 | except TypeError: 1118 | continue # unicode test case 1119 | 1120 | # Try the match on a unicode pattern, and check that it 1121 | # still succeeds. 1122 | obj=re.compile(unicode(pattern, "latin-1")) 1123 | result = obj.search(s) 1124 | if result is None: 1125 | print '=== Fails on unicode pattern match', t 1126 | 1127 | # Try the match with the search area limited to the extent 1128 | # of the match and see if it still succeeds. \B will 1129 | # break (because it won't match at the end or start of a 1130 | # string), so we'll ignore patterns that feature it. 1131 | 1132 | if pattern[:2] != '\\B' and pattern[-2:] != '\\B' \ 1133 | and result is not None: 1134 | obj = re.compile(pattern) 1135 | result = obj.search(s, result.start(0), result.end(0) + 1) 1136 | if result is None: 1137 | print '=== Failed on range-limited match', t 1138 | 1139 | # Try the match with IGNORECASE enabled, and check that it 1140 | # still succeeds. 1141 | obj = re.compile(pattern, re.IGNORECASE) 1142 | result = obj.search(s) 1143 | if result is None: 1144 | print '=== Fails on case-insensitive match', t 1145 | 1146 | # Try the match with LOCALE enabled, and check that it 1147 | # still succeeds. 1148 | # PCRE: python-pcre doesn't support LOCALE flag 1149 | #obj = re.compile(pattern, re.LOCALE) 1150 | #result = obj.search(s) 1151 | #if result is None: 1152 | # print '=== Fails on locale-sensitive match', t 1153 | 1154 | # Try the match with UNICODE locale enabled, and check 1155 | # that it still succeeds. 1156 | obj = re.compile(pattern, re.UNICODE) 1157 | result = obj.search(s) 1158 | if result is None: 1159 | print '=== Fails on unicode-sensitive match', t 1160 | 1161 | def test_main(): 1162 | run_unittest(ReTests) 1163 | run_re_tests() 1164 | 1165 | if __name__ == "__main__": 1166 | test_main() 1167 | -------------------------------------------------------------------------------- /test/test_pcre_re_compat.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # -------------------------------------------------------------------------- 4 | # This file is a modified re module testsuite from Python 2.7.5. 5 | # 6 | # Modifications have been commented with "PCRE:". 7 | # Imports pcre instead of re. 8 | # Internal _sre tests have been commented out. 9 | # 10 | # Provides a way to determine in what ways python-pcre is different than re. 
11 | # -------------------------------------------------------------------------- 12 | 13 | from test.test_support import verbose, run_unittest, import_module 14 | from test.test_support import precisionbigmemtest, _2G, cpython_only 15 | from test.test_support import captured_stdout 16 | 17 | # PCRE: import pcre instead of re 18 | #import re 19 | #from re import Scanner 20 | #import sre_constants 21 | import pcre as re 22 | re.enable_re_template_mode() 23 | sre_constants = re 24 | re._pattern_type = re.Pattern 25 | 26 | import sys 27 | import string 28 | import traceback 29 | from weakref import proxy 30 | 31 | 32 | # Misc tests from Tim Peters' re.doc 33 | 34 | # WARNING: Don't change details in these tests if you don't know 35 | # what you're doing. Some of these tests were carefully modeled to 36 | # cover most of the code. 37 | 38 | import unittest 39 | 40 | class ReTests(unittest.TestCase): 41 | 42 | def test_weakref(self): 43 | s = 'QabbbcR' 44 | x = re.compile('ab+c') 45 | y = proxy(x) 46 | self.assertEqual(x.findall('QabbbcR'), y.findall('QabbbcR')) 47 | 48 | def test_search_star_plus(self): 49 | self.assertEqual(re.search('x*', 'axx').span(0), (0, 0)) 50 | self.assertEqual(re.search('x*', 'axx').span(), (0, 0)) 51 | self.assertEqual(re.search('x+', 'axx').span(0), (1, 3)) 52 | self.assertEqual(re.search('x+', 'axx').span(), (1, 3)) 53 | self.assertEqual(re.search('x', 'aaa'), None) 54 | self.assertEqual(re.match('a*', 'xxx').span(0), (0, 0)) 55 | self.assertEqual(re.match('a*', 'xxx').span(), (0, 0)) 56 | self.assertEqual(re.match('x*', 'xxxa').span(0), (0, 3)) 57 | self.assertEqual(re.match('x*', 'xxxa').span(), (0, 3)) 58 | self.assertEqual(re.match('a+', 'xxx'), None) 59 | 60 | def bump_num(self, matchobj): 61 | int_value = int(matchobj.group(0)) 62 | return str(int_value + 1) 63 | 64 | def test_basic_re_sub(self): 65 | self.assertEqual(re.sub("(?i)b+", "x", "bbbb BBBB"), 'x x') 66 | self.assertEqual(re.sub(r'\d+', self.bump_num, '08.2 -2 23x99y'), 67 | '9.3 -3 24x100y') 68 | self.assertEqual(re.sub(r'\d+', self.bump_num, '08.2 -2 23x99y', 3), 69 | '9.3 -3 23x99y') 70 | 71 | self.assertEqual(re.sub('.', lambda m: r"\n", 'x'), '\\n') 72 | self.assertEqual(re.sub('.', r"\n", 'x'), '\n') 73 | 74 | s = r"\1\1" 75 | self.assertEqual(re.sub('(.)', s, 'x'), 'xx') 76 | self.assertEqual(re.sub('(.)', re.escape(s), 'x'), s) 77 | self.assertEqual(re.sub('(.)', lambda m: s, 'x'), s) 78 | 79 | self.assertEqual(re.sub('(?Px)', '\g\g', 'xx'), 'xxxx') 80 | self.assertEqual(re.sub('(?Px)', '\g\g<1>', 'xx'), 'xxxx') 81 | self.assertEqual(re.sub('(?Px)', '\g\g', 'xx'), 'xxxx') 82 | self.assertEqual(re.sub('(?Px)', '\g<1>\g<1>', 'xx'), 'xxxx') 83 | 84 | self.assertEqual(re.sub('a',r'\t\n\v\r\f\a\b\B\Z\a\A\w\W\s\S\d\D','a'), 85 | '\t\n\v\r\f\a\b\\B\\Z\a\\A\\w\\W\\s\\S\\d\\D') 86 | self.assertEqual(re.sub('a', '\t\n\v\r\f\a', 'a'), '\t\n\v\r\f\a') 87 | self.assertEqual(re.sub('a', '\t\n\v\r\f\a', 'a'), 88 | (chr(9)+chr(10)+chr(11)+chr(13)+chr(12)+chr(7))) 89 | 90 | self.assertEqual(re.sub('^\s*', 'X', 'test'), 'Xtest') 91 | 92 | def test_bug_449964(self): 93 | # fails for group followed by other escape 94 | self.assertEqual(re.sub(r'(?Px)', '\g<1>\g<1>\\b', 'xx'), 95 | 'xx\bxx\b') 96 | 97 | def test_bug_449000(self): 98 | # Test for sub() on escaped characters 99 | self.assertEqual(re.sub(r'\r\n', r'\n', 'abc\r\ndef\r\n'), 100 | 'abc\ndef\n') 101 | self.assertEqual(re.sub('\r\n', r'\n', 'abc\r\ndef\r\n'), 102 | 'abc\ndef\n') 103 | self.assertEqual(re.sub(r'\r\n', '\n', 'abc\r\ndef\r\n'), 104 | 
'abc\ndef\n') 105 | self.assertEqual(re.sub('\r\n', '\n', 'abc\r\ndef\r\n'), 106 | 'abc\ndef\n') 107 | 108 | def test_bug_1140(self): 109 | # re.sub(x, y, u'') should return u'', not '', and 110 | # re.sub(x, y, '') should return '', not u''. 111 | # Also: 112 | # re.sub(x, y, unicode(x)) should return unicode(y), and 113 | # re.sub(x, y, str(x)) should return 114 | # str(y) if isinstance(y, str) else unicode(y). 115 | for x in 'x', u'x': 116 | for y in 'y', u'y': 117 | z = re.sub(x, y, u'') 118 | self.assertEqual(z, u'') 119 | self.assertEqual(type(z), unicode) 120 | # 121 | z = re.sub(x, y, '') 122 | self.assertEqual(z, '') 123 | self.assertEqual(type(z), str) 124 | # 125 | z = re.sub(x, y, unicode(x)) 126 | self.assertEqual(z, y) 127 | self.assertEqual(type(z), unicode) 128 | # 129 | z = re.sub(x, y, str(x)) 130 | self.assertEqual(z, y) 131 | self.assertEqual(type(z), type(y)) 132 | 133 | def test_bug_1661(self): 134 | # Verify that flags do not get silently ignored with compiled patterns 135 | pattern = re.compile('.') 136 | self.assertRaises(ValueError, re.match, pattern, 'A', re.I) 137 | self.assertRaises(ValueError, re.search, pattern, 'A', re.I) 138 | self.assertRaises(ValueError, re.findall, pattern, 'A', re.I) 139 | self.assertRaises(ValueError, re.compile, pattern, re.I) 140 | 141 | def test_bug_3629(self): 142 | # A regex that triggered a bug in the sre-code validator 143 | re.compile("(?P)(?(quote))") 144 | 145 | def test_sub_template_numeric_escape(self): 146 | # bug 776311 and friends 147 | self.assertEqual(re.sub('x', r'\0', 'x'), '\0') 148 | self.assertEqual(re.sub('x', r'\000', 'x'), '\000') 149 | self.assertEqual(re.sub('x', r'\001', 'x'), '\001') 150 | self.assertEqual(re.sub('x', r'\008', 'x'), '\0' + '8') 151 | self.assertEqual(re.sub('x', r'\009', 'x'), '\0' + '9') 152 | self.assertEqual(re.sub('x', r'\111', 'x'), '\111') 153 | self.assertEqual(re.sub('x', r'\117', 'x'), '\117') 154 | 155 | self.assertEqual(re.sub('x', r'\1111', 'x'), '\1111') 156 | self.assertEqual(re.sub('x', r'\1111', 'x'), '\111' + '1') 157 | 158 | self.assertEqual(re.sub('x', r'\00', 'x'), '\x00') 159 | self.assertEqual(re.sub('x', r'\07', 'x'), '\x07') 160 | self.assertEqual(re.sub('x', r'\08', 'x'), '\0' + '8') 161 | self.assertEqual(re.sub('x', r'\09', 'x'), '\0' + '9') 162 | self.assertEqual(re.sub('x', r'\0a', 'x'), '\0' + 'a') 163 | 164 | self.assertEqual(re.sub('x', r'\400', 'x'), '\0') 165 | self.assertEqual(re.sub('x', r'\777', 'x'), '\377') 166 | 167 | self.assertRaises(re.error, re.sub, 'x', r'\1', 'x') 168 | self.assertRaises(re.error, re.sub, 'x', r'\8', 'x') 169 | self.assertRaises(re.error, re.sub, 'x', r'\9', 'x') 170 | self.assertRaises(re.error, re.sub, 'x', r'\11', 'x') 171 | self.assertRaises(re.error, re.sub, 'x', r'\18', 'x') 172 | self.assertRaises(re.error, re.sub, 'x', r'\1a', 'x') 173 | self.assertRaises(re.error, re.sub, 'x', r'\90', 'x') 174 | self.assertRaises(re.error, re.sub, 'x', r'\99', 'x') 175 | self.assertRaises(re.error, re.sub, 'x', r'\118', 'x') # r'\11' + '8' 176 | self.assertRaises(re.error, re.sub, 'x', r'\11a', 'x') 177 | self.assertRaises(re.error, re.sub, 'x', r'\181', 'x') # r'\18' + '1' 178 | self.assertRaises(re.error, re.sub, 'x', r'\800', 'x') # r'\80' + '0' 179 | 180 | # in python2.3 (etc), these loop endlessly in sre_parser.py 181 | self.assertEqual(re.sub('(((((((((((x)))))))))))', r'\11', 'x'), 'x') 182 | self.assertEqual(re.sub('((((((((((y))))))))))(.)', r'\118', 'xyz'), 183 | 'xz8') 184 | self.assertEqual(re.sub('((((((((((y))))))))))(.)', 
r'\11a', 'xyz'), 185 | 'xza') 186 | 187 | def test_qualified_re_sub(self): 188 | self.assertEqual(re.sub('a', 'b', 'aaaaa'), 'bbbbb') 189 | self.assertEqual(re.sub('a', 'b', 'aaaaa', 1), 'baaaa') 190 | 191 | def test_bug_114660(self): 192 | self.assertEqual(re.sub(r'(\S)\s+(\S)', r'\1 \2', 'hello there'), 193 | 'hello there') 194 | 195 | def test_bug_462270(self): 196 | # Test for empty sub() behaviour, see SF bug #462270 197 | self.assertEqual(re.sub('x*', '-', 'abxd'), '-a-b-d-') 198 | self.assertEqual(re.sub('x+', '-', 'abxd'), 'ab-d') 199 | 200 | def test_symbolic_groups(self): 201 | re.compile('(?Px)(?P=a)(?(a)y)') 202 | re.compile('(?Px)(?P=a1)(?(a1)y)') 203 | self.assertRaises(re.error, re.compile, '(?P)(?P)') 204 | self.assertRaises(re.error, re.compile, '(?Px)') 205 | self.assertRaises(re.error, re.compile, '(?P=)') 206 | self.assertRaises(re.error, re.compile, '(?P=1)') 207 | self.assertRaises(re.error, re.compile, '(?P=a)') 208 | self.assertRaises(re.error, re.compile, '(?P=a1)') 209 | self.assertRaises(re.error, re.compile, '(?P=a.)') 210 | self.assertRaises(re.error, re.compile, '(?P<)') 211 | self.assertRaises(re.error, re.compile, '(?P<>)') 212 | self.assertRaises(re.error, re.compile, '(?P<1>)') 213 | self.assertRaises(re.error, re.compile, '(?P)') 214 | self.assertRaises(re.error, re.compile, '(?())') 215 | self.assertRaises(re.error, re.compile, '(?(a))') 216 | self.assertRaises(re.error, re.compile, '(?(1a))') 217 | self.assertRaises(re.error, re.compile, '(?(a.))') 218 | 219 | def test_symbolic_refs(self): 220 | self.assertRaises(re.error, re.sub, '(?Px)', '\gx)', '\g<', 'xx') 222 | self.assertRaises(re.error, re.sub, '(?Px)', '\g', 'xx') 223 | self.assertRaises(re.error, re.sub, '(?Px)', '\g', 'xx') 224 | self.assertRaises(re.error, re.sub, '(?Px)', '\g<>', 'xx') 225 | self.assertRaises(re.error, re.sub, '(?Px)', '\g<1a1>', 'xx') 226 | self.assertRaises(IndexError, re.sub, '(?Px)', '\g', 'xx') 227 | self.assertRaises(re.error, re.sub, '(?Px)|(?Py)', '\g', 'xx') 228 | self.assertRaises(re.error, re.sub, '(?Px)|(?Py)', '\\2', 'xx') 229 | self.assertRaises(re.error, re.sub, '(?Px)', '\g<-1>', 'xx') 230 | 231 | def test_re_subn(self): 232 | self.assertEqual(re.subn("(?i)b+", "x", "bbbb BBBB"), ('x x', 2)) 233 | self.assertEqual(re.subn("b+", "x", "bbbb BBBB"), ('x BBBB', 1)) 234 | self.assertEqual(re.subn("b+", "x", "xyz"), ('xyz', 0)) 235 | self.assertEqual(re.subn("b*", "x", "xyz"), ('xxxyxzx', 4)) 236 | self.assertEqual(re.subn("b*", "x", "xyz", 2), ('xxxyz', 2)) 237 | 238 | def test_re_split(self): 239 | self.assertEqual(re.split(":", ":a:b::c"), ['', 'a', 'b', '', 'c']) 240 | self.assertEqual(re.split(":*", ":a:b::c"), ['', 'a', 'b', 'c']) 241 | self.assertEqual(re.split("(:*)", ":a:b::c"), 242 | ['', ':', 'a', ':', 'b', '::', 'c']) 243 | self.assertEqual(re.split("(?::*)", ":a:b::c"), ['', 'a', 'b', 'c']) 244 | self.assertEqual(re.split("(:)*", ":a:b::c"), 245 | ['', ':', 'a', ':', 'b', ':', 'c']) 246 | self.assertEqual(re.split("([b:]+)", ":a:b::c"), 247 | ['', ':', 'a', ':b::', 'c']) 248 | self.assertEqual(re.split("(b)|(:+)", ":a:b::c"), 249 | ['', None, ':', 'a', None, ':', '', 'b', None, '', 250 | None, '::', 'c']) 251 | self.assertEqual(re.split("(?:b)|(?::+)", ":a:b::c"), 252 | ['', 'a', '', '', 'c']) 253 | 254 | def test_qualified_re_split(self): 255 | self.assertEqual(re.split(":", ":a:b::c", 2), ['', 'a', 'b::c']) 256 | self.assertEqual(re.split(':', 'a:b:c:d', 2), ['a', 'b', 'c:d']) 257 | self.assertEqual(re.split("(:)", ":a:b::c", 2), 258 | ['', ':', 'a', 
':', 'b::c']) 259 | self.assertEqual(re.split("(:*)", ":a:b::c", 2), 260 | ['', ':', 'a', ':', 'b::c']) 261 | 262 | def test_re_findall(self): 263 | self.assertEqual(re.findall(":+", "abc"), []) 264 | self.assertEqual(re.findall(":+", "a:b::c:::d"), [":", "::", ":::"]) 265 | self.assertEqual(re.findall("(:+)", "a:b::c:::d"), [":", "::", ":::"]) 266 | self.assertEqual(re.findall("(:)(:*)", "a:b::c:::d"), [(":", ""), 267 | (":", ":"), 268 | (":", "::")]) 269 | 270 | def test_bug_117612(self): 271 | self.assertEqual(re.findall(r"(a|(b))", "aba"), 272 | [("a", ""),("b", "b"),("a", "")]) 273 | 274 | def test_re_match(self): 275 | self.assertEqual(re.match('a', 'a').groups(), ()) 276 | self.assertEqual(re.match('(a)', 'a').groups(), ('a',)) 277 | self.assertEqual(re.match(r'(a)', 'a').group(0), 'a') 278 | self.assertEqual(re.match(r'(a)', 'a').group(1), 'a') 279 | self.assertEqual(re.match(r'(a)', 'a').group(1, 1), ('a', 'a')) 280 | 281 | pat = re.compile('((a)|(b))(c)?') 282 | self.assertEqual(pat.match('a').groups(), ('a', 'a', None, None)) 283 | self.assertEqual(pat.match('b').groups(), ('b', None, 'b', None)) 284 | self.assertEqual(pat.match('ac').groups(), ('a', 'a', None, 'c')) 285 | self.assertEqual(pat.match('bc').groups(), ('b', None, 'b', 'c')) 286 | self.assertEqual(pat.match('bc').groups(""), ('b', "", 'b', 'c')) 287 | 288 | # A single group 289 | m = re.match('(a)', 'a') 290 | self.assertEqual(m.group(0), 'a') 291 | self.assertEqual(m.group(0), 'a') 292 | self.assertEqual(m.group(1), 'a') 293 | self.assertEqual(m.group(1, 1), ('a', 'a')) 294 | 295 | pat = re.compile('(?:(?Pa)|(?Pb))(?Pc)?') 296 | self.assertEqual(pat.match('a').group(1, 2, 3), ('a', None, None)) 297 | self.assertEqual(pat.match('b').group('a1', 'b2', 'c3'), 298 | (None, 'b', None)) 299 | self.assertEqual(pat.match('ac').group(1, 'b2', 3), ('a', None, 'c')) 300 | 301 | def test_re_groupref_exists(self): 302 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', '(a)').groups(), 303 | ('(', 'a')) 304 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', 'a').groups(), 305 | (None, 'a')) 306 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', 'a)'), None) 307 | self.assertEqual(re.match('^(\()?([^()]+)(?(1)\))$', '(a'), None) 308 | self.assertEqual(re.match('^(?:(a)|c)((?(1)b|d))$', 'ab').groups(), 309 | ('a', 'b')) 310 | self.assertEqual(re.match('^(?:(a)|c)((?(1)b|d))$', 'cd').groups(), 311 | (None, 'd')) 312 | self.assertEqual(re.match('^(?:(a)|c)((?(1)|d))$', 'cd').groups(), 313 | (None, 'd')) 314 | self.assertEqual(re.match('^(?:(a)|c)((?(1)|d))$', 'a').groups(), 315 | ('a', '')) 316 | 317 | # Tests for bug #1177831: exercise groups other than the first group 318 | p = re.compile('(?Pa)(?Pb)?((?(g2)c|d))') 319 | self.assertEqual(p.match('abc').groups(), 320 | ('a', 'b', 'c')) 321 | self.assertEqual(p.match('ad').groups(), 322 | ('a', None, 'd')) 323 | self.assertEqual(p.match('abd'), None) 324 | self.assertEqual(p.match('ac'), None) 325 | 326 | 327 | def test_re_groupref(self): 328 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1$', '|a|').groups(), 329 | ('|', 'a')) 330 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1?$', 'a').groups(), 331 | (None, 'a')) 332 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1$', 'a|'), None) 333 | self.assertEqual(re.match(r'^(\|)?([^()]+)\1$', '|a'), None) 334 | self.assertEqual(re.match(r'^(?:(a)|c)(\1)$', 'aa').groups(), 335 | ('a', 'a')) 336 | self.assertEqual(re.match(r'^(?:(a)|c)(\1)?$', 'c').groups(), 337 | (None, None)) 338 | 339 | def test_groupdict(self): 340 | 
self.assertEqual(re.match('(?Pfirst) (?Psecond)', 341 | 'first second').groupdict(), 342 | {'first':'first', 'second':'second'}) 343 | 344 | def test_expand(self): 345 | self.assertEqual(re.match("(?Pfirst) (?Psecond)", 346 | "first second") 347 | .expand(r"\2 \1 \g \g"), 348 | "second first second first") 349 | 350 | def test_repeat_minmax(self): 351 | self.assertEqual(re.match("^(\w){1}$", "abc"), None) 352 | self.assertEqual(re.match("^(\w){1}?$", "abc"), None) 353 | self.assertEqual(re.match("^(\w){1,2}$", "abc"), None) 354 | self.assertEqual(re.match("^(\w){1,2}?$", "abc"), None) 355 | 356 | self.assertEqual(re.match("^(\w){3}$", "abc").group(1), "c") 357 | self.assertEqual(re.match("^(\w){1,3}$", "abc").group(1), "c") 358 | self.assertEqual(re.match("^(\w){1,4}$", "abc").group(1), "c") 359 | self.assertEqual(re.match("^(\w){3,4}?$", "abc").group(1), "c") 360 | self.assertEqual(re.match("^(\w){3}?$", "abc").group(1), "c") 361 | self.assertEqual(re.match("^(\w){1,3}?$", "abc").group(1), "c") 362 | self.assertEqual(re.match("^(\w){1,4}?$", "abc").group(1), "c") 363 | self.assertEqual(re.match("^(\w){3,4}?$", "abc").group(1), "c") 364 | 365 | self.assertEqual(re.match("^x{1}$", "xxx"), None) 366 | self.assertEqual(re.match("^x{1}?$", "xxx"), None) 367 | self.assertEqual(re.match("^x{1,2}$", "xxx"), None) 368 | self.assertEqual(re.match("^x{1,2}?$", "xxx"), None) 369 | 370 | self.assertNotEqual(re.match("^x{3}$", "xxx"), None) 371 | self.assertNotEqual(re.match("^x{1,3}$", "xxx"), None) 372 | self.assertNotEqual(re.match("^x{1,4}$", "xxx"), None) 373 | self.assertNotEqual(re.match("^x{3,4}?$", "xxx"), None) 374 | self.assertNotEqual(re.match("^x{3}?$", "xxx"), None) 375 | self.assertNotEqual(re.match("^x{1,3}?$", "xxx"), None) 376 | self.assertNotEqual(re.match("^x{1,4}?$", "xxx"), None) 377 | self.assertNotEqual(re.match("^x{3,4}?$", "xxx"), None) 378 | 379 | self.assertEqual(re.match("^x{}$", "xxx"), None) 380 | self.assertNotEqual(re.match("^x{}$", "x{}"), None) 381 | 382 | def test_getattr(self): 383 | self.assertEqual(re.match("(a)", "a").pos, 0) 384 | self.assertEqual(re.match("(a)", "a").endpos, 1) 385 | self.assertEqual(re.match("(a)", "a").string, "a") 386 | self.assertEqual(re.match("(a)", "a").regs, ((0, 1), (0, 1))) 387 | self.assertNotEqual(re.match("(a)", "a").re, None) 388 | 389 | def test_special_escapes(self): 390 | self.assertEqual(re.search(r"\b(b.)\b", 391 | "abcd abc bcd bx").group(1), "bx") 392 | self.assertEqual(re.search(r"\B(b.)\B", 393 | "abc bcd bc abxd").group(1), "bx") 394 | self.assertEqual(re.search(r"\b(b.)\b", 395 | "abcd abc bcd bx", re.LOCALE).group(1), "bx") 396 | self.assertEqual(re.search(r"\B(b.)\B", 397 | "abc bcd bc abxd", re.LOCALE).group(1), "bx") 398 | self.assertEqual(re.search(r"\b(b.)\b", 399 | "abcd abc bcd bx", re.UNICODE).group(1), "bx") 400 | self.assertEqual(re.search(r"\B(b.)\B", 401 | "abc bcd bc abxd", re.UNICODE).group(1), "bx") 402 | self.assertEqual(re.search(r"^abc$", "\nabc\n", re.M).group(0), "abc") 403 | self.assertEqual(re.search(r"^\Aabc\Z$", "abc", re.M).group(0), "abc") 404 | self.assertEqual(re.search(r"^\Aabc\Z$", "\nabc\n", re.M), None) 405 | self.assertEqual(re.search(r"\b(b.)\b", 406 | u"abcd abc bcd bx").group(1), "bx") 407 | self.assertEqual(re.search(r"\B(b.)\B", 408 | u"abc bcd bc abxd").group(1), "bx") 409 | self.assertEqual(re.search(r"^abc$", u"\nabc\n", re.M).group(0), "abc") 410 | self.assertEqual(re.search(r"^\Aabc\Z$", u"abc", re.M).group(0), "abc") 411 | self.assertEqual(re.search(r"^\Aabc\Z$", u"\nabc\n", 
re.M), None) 412 | self.assertEqual(re.search(r"\d\D\w\W\s\S", 413 | "1aa! a").group(0), "1aa! a") 414 | self.assertEqual(re.search(r"\d\D\w\W\s\S", 415 | "1aa! a", re.LOCALE).group(0), "1aa! a") 416 | self.assertEqual(re.search(r"\d\D\w\W\s\S", 417 | "1aa! a", re.UNICODE).group(0), "1aa! a") 418 | 419 | def test_string_boundaries(self): 420 | # See http://bugs.python.org/issue10713 421 | self.assertEqual(re.search(r"\b(abc)\b", "abc").group(1), 422 | "abc") 423 | # There's a word boundary at the start of a string. 424 | self.assertTrue(re.match(r"\b", "abc")) 425 | # A non-empty string includes a non-boundary zero-length match. 426 | self.assertTrue(re.search(r"\B", "abc")) 427 | # There is no non-boundary match at the start of a string. 428 | self.assertFalse(re.match(r"\B", "abc")) 429 | # However, an empty string contains no word boundaries, and also no 430 | # non-boundaries. 431 | self.assertEqual(re.search(r"\B", ""), None) 432 | # This one is questionable and different from the perlre behaviour, 433 | # but describes current behavior. 434 | self.assertEqual(re.search(r"\b", ""), None) 435 | # A single word-character string has two boundaries, but no 436 | # non-boundary gaps. 437 | self.assertEqual(len(re.findall(r"\b", "a")), 2) 438 | self.assertEqual(len(re.findall(r"\B", "a")), 0) 439 | # If there are no words, there are no boundaries 440 | self.assertEqual(len(re.findall(r"\b", " ")), 0) 441 | self.assertEqual(len(re.findall(r"\b", " ")), 0) 442 | # Can match around the whitespace. 443 | self.assertEqual(len(re.findall(r"\B", " ")), 2) 444 | 445 | def test_bigcharset(self): 446 | self.assertEqual(re.match(u"([\u2222\u2223])", 447 | u"\u2222").group(1), u"\u2222") 448 | self.assertEqual(re.match(u"([\u2222\u2223])", 449 | u"\u2222", re.UNICODE).group(1), u"\u2222") 450 | r = u'[%s]' % u''.join(map(unichr, range(256, 2**16, 255))) 451 | self.assertEqual(re.match(r, u"\uff01", re.UNICODE).group(), u"\uff01") 452 | 453 | def test_big_codesize(self): 454 | # Issue #1160 455 | r = re.compile('|'.join(('%d'%x for x in range(10000)))) 456 | self.assertIsNotNone(r.match('1000')) 457 | self.assertIsNotNone(r.match('9999')) 458 | 459 | def test_anyall(self): 460 | self.assertEqual(re.match("a.b", "a\nb", re.DOTALL).group(0), 461 | "a\nb") 462 | self.assertEqual(re.match("a.*b", "a\n\nb", re.DOTALL).group(0), 463 | "a\n\nb") 464 | 465 | def test_non_consuming(self): 466 | self.assertEqual(re.match("(a(?=\s[^a]))", "a b").group(1), "a") 467 | self.assertEqual(re.match("(a(?=\s[^a]*))", "a b").group(1), "a") 468 | self.assertEqual(re.match("(a(?=\s[abc]))", "a b").group(1), "a") 469 | self.assertEqual(re.match("(a(?=\s[abc]*))", "a bc").group(1), "a") 470 | self.assertEqual(re.match(r"(a)(?=\s\1)", "a a").group(1), "a") 471 | self.assertEqual(re.match(r"(a)(?=\s\1*)", "a aa").group(1), "a") 472 | self.assertEqual(re.match(r"(a)(?=\s(abc|a))", "a a").group(1), "a") 473 | 474 | self.assertEqual(re.match(r"(a(?!\s[^a]))", "a a").group(1), "a") 475 | self.assertEqual(re.match(r"(a(?!\s[abc]))", "a d").group(1), "a") 476 | self.assertEqual(re.match(r"(a)(?!\s\1)", "a b").group(1), "a") 477 | self.assertEqual(re.match(r"(a)(?!\s(abc|a))", "a b").group(1), "a") 478 | 479 | def test_ignore_case(self): 480 | self.assertEqual(re.match("abc", "ABC", re.I).group(0), "ABC") 481 | self.assertEqual(re.match("abc", u"ABC", re.I).group(0), "ABC") 482 | self.assertEqual(re.match(r"(a\s[^a])", "a b", re.I).group(1), "a b") 483 | self.assertEqual(re.match(r"(a\s[^a]*)", "a bb", re.I).group(1), "a bb") 484 | 
self.assertEqual(re.match(r"(a\s[abc])", "a b", re.I).group(1), "a b") 485 | self.assertEqual(re.match(r"(a\s[abc]*)", "a bb", re.I).group(1), "a bb") 486 | self.assertEqual(re.match(r"((a)\s\2)", "a a", re.I).group(1), "a a") 487 | self.assertEqual(re.match(r"((a)\s\2*)", "a aa", re.I).group(1), "a aa") 488 | self.assertEqual(re.match(r"((a)\s(abc|a))", "a a", re.I).group(1), "a a") 489 | self.assertEqual(re.match(r"((a)\s(abc|a)*)", "a aa", re.I).group(1), "a aa") 490 | 491 | def test_category(self): 492 | self.assertEqual(re.match(r"(\s)", " ").group(1), " ") 493 | 494 | def test_getlower(self): 495 | # PCRE: disabled, we're not testing _sre here 496 | #import _sre 497 | #self.assertEqual(_sre.getlower(ord('A'), 0), ord('a')) 498 | #self.assertEqual(_sre.getlower(ord('A'), re.LOCALE), ord('a')) 499 | #self.assertEqual(_sre.getlower(ord('A'), re.UNICODE), ord('a')) 500 | 501 | self.assertEqual(re.match("abc", "ABC", re.I).group(0), "ABC") 502 | self.assertEqual(re.match("abc", u"ABC", re.I).group(0), "ABC") 503 | 504 | def test_not_literal(self): 505 | self.assertEqual(re.search("\s([^a])", " b").group(1), "b") 506 | self.assertEqual(re.search("\s([^a]*)", " bb").group(1), "bb") 507 | 508 | def test_search_coverage(self): 509 | self.assertEqual(re.search("\s(b)", " b").group(1), "b") 510 | self.assertEqual(re.search("a\s", "a ").group(0), "a ") 511 | 512 | def assertMatch(self, pattern, text, match=None, span=None, 513 | matcher=re.match): 514 | if match is None and span is None: 515 | # the pattern matches the whole text 516 | match = text 517 | span = (0, len(text)) 518 | elif match is None or span is None: 519 | raise ValueError('If match is not None, span should be specified ' 520 | '(and vice versa).') 521 | m = matcher(pattern, text) 522 | self.assertTrue(m) 523 | self.assertEqual(m.group(), match) 524 | self.assertEqual(m.span(), span) 525 | 526 | def test_re_escape(self): 527 | alnum_chars = string.ascii_letters + string.digits 528 | p = u''.join(unichr(i) for i in range(256)) 529 | for c in p: 530 | if c in alnum_chars: 531 | self.assertEqual(re.escape(c), c) 532 | elif c == u'\x00': 533 | self.assertEqual(re.escape(c), u'\\000') 534 | else: 535 | self.assertEqual(re.escape(c), u'\\' + c) 536 | self.assertMatch(re.escape(c), c) 537 | self.assertMatch(re.escape(p), p) 538 | 539 | def test_re_escape_byte(self): 540 | alnum_chars = (string.ascii_letters + string.digits).encode('ascii') 541 | p = ''.join(chr(i) for i in range(256)) 542 | for b in p: 543 | if b in alnum_chars: 544 | self.assertEqual(re.escape(b), b) 545 | elif b == b'\x00': 546 | self.assertEqual(re.escape(b), b'\\000') 547 | else: 548 | self.assertEqual(re.escape(b), b'\\' + b) 549 | self.assertMatch(re.escape(b), b) 550 | self.assertMatch(re.escape(p), p) 551 | 552 | def test_re_escape_non_ascii(self): 553 | s = u'xxx\u2620\u2620\u2620xxx' 554 | s_escaped = re.escape(s) 555 | self.assertEqual(s_escaped, u'xxx\\\u2620\\\u2620\\\u2620xxx') 556 | self.assertMatch(s_escaped, s) 557 | self.assertMatch(u'.%s+.' 
% re.escape(u'\u2620'), s, 558 | u'x\u2620\u2620\u2620x', (2, 7), re.search) 559 | 560 | def test_re_escape_non_ascii_bytes(self): 561 | b = u'y\u2620y\u2620y'.encode('utf-8') 562 | b_escaped = re.escape(b) 563 | self.assertEqual(b_escaped, b'y\\\xe2\\\x98\\\xa0y\\\xe2\\\x98\\\xa0y') 564 | self.assertMatch(b_escaped, b) 565 | res = re.findall(re.escape(u'\u2620'.encode('utf-8')), b) 566 | self.assertEqual(len(res), 2) 567 | 568 | def test_pickling(self): 569 | import pickle 570 | self.pickle_test(pickle) 571 | import cPickle 572 | self.pickle_test(cPickle) 573 | # PCRE: disabled, we're not testing _sre here 574 | # old pickles expect the _compile() reconstructor in sre module 575 | #import_module("sre", deprecated=True) 576 | #from sre import _compile 577 | 578 | def pickle_test(self, pickle): 579 | oldpat = re.compile('a(?:b|(c|e){1,2}?|d)+?(.)') 580 | s = pickle.dumps(oldpat) 581 | newpat = pickle.loads(s) 582 | self.assertEqual(oldpat, newpat) 583 | 584 | def test_constants(self): 585 | self.assertEqual(re.I, re.IGNORECASE) 586 | self.assertEqual(re.L, re.LOCALE) 587 | self.assertEqual(re.M, re.MULTILINE) 588 | self.assertEqual(re.S, re.DOTALL) 589 | self.assertEqual(re.X, re.VERBOSE) 590 | 591 | def test_flags(self): 592 | for flag in [re.I, re.M, re.X, re.S, re.L]: 593 | self.assertNotEqual(re.compile('^pattern$', flag), None) 594 | 595 | def test_sre_character_literals(self): 596 | for i in [0, 8, 16, 32, 64, 127, 128, 255]: 597 | self.assertNotEqual(re.match(r"\%03o" % i, chr(i)), None) 598 | self.assertNotEqual(re.match(r"\%03o0" % i, chr(i)+"0"), None) 599 | self.assertNotEqual(re.match(r"\%03o8" % i, chr(i)+"8"), None) 600 | self.assertNotEqual(re.match(r"\x%02x" % i, chr(i)), None) 601 | self.assertNotEqual(re.match(r"\x%02x0" % i, chr(i)+"0"), None) 602 | self.assertNotEqual(re.match(r"\x%02xz" % i, chr(i)+"z"), None) 603 | self.assertRaises(re.error, re.match, "\911", "") 604 | 605 | def test_sre_character_class_literals(self): 606 | for i in [0, 8, 16, 32, 64, 127, 128, 255]: 607 | self.assertNotEqual(re.match(r"[\%03o]" % i, chr(i)), None) 608 | self.assertNotEqual(re.match(r"[\%03o0]" % i, chr(i)), None) 609 | self.assertNotEqual(re.match(r"[\%03o8]" % i, chr(i)), None) 610 | self.assertNotEqual(re.match(r"[\x%02x]" % i, chr(i)), None) 611 | self.assertNotEqual(re.match(r"[\x%02x0]" % i, chr(i)), None) 612 | self.assertNotEqual(re.match(r"[\x%02xz]" % i, chr(i)), None) 613 | self.assertRaises(re.error, re.match, "[\911]", "") 614 | 615 | def test_bug_113254(self): 616 | self.assertEqual(re.match(r'(a)|(b)', 'b').start(1), -1) 617 | self.assertEqual(re.match(r'(a)|(b)', 'b').end(1), -1) 618 | self.assertEqual(re.match(r'(a)|(b)', 'b').span(1), (-1, -1)) 619 | 620 | def test_bug_527371(self): 621 | # bug described in patches 527371/672491 622 | self.assertEqual(re.match(r'(a)?a','a').lastindex, None) 623 | self.assertEqual(re.match(r'(a)(b)?b','ab').lastindex, 1) 624 | self.assertEqual(re.match(r'(?Pa)(?Pb)?b','ab').lastgroup, 'a') 625 | self.assertEqual(re.match("(?Pa(b))", "ab").lastgroup, 'a') 626 | self.assertEqual(re.match("((a))", "a").lastindex, 1) 627 | 628 | def test_bug_545855(self): 629 | # bug 545855 -- This pattern failed to cause a compile error as it 630 | # should, instead provoking a TypeError. 631 | self.assertRaises(re.error, re.compile, 'foo[a-') 632 | 633 | def test_bug_418626(self): 634 | # bugs 418626 at al. -- Testing Greg Chapman's addition of op code 635 | # SRE_OP_MIN_REPEAT_ONE for eliminating recursion on simple uses of 636 | # pattern '*?' 
on a long string. 637 | self.assertEqual(re.match('.*?c', 10000*'ab'+'cd').end(0), 20001) 638 | self.assertEqual(re.match('.*?cd', 5000*'ab'+'c'+5000*'ab'+'cde').end(0), 639 | 20003) 640 | self.assertEqual(re.match('.*?cd', 20000*'abc'+'de').end(0), 60001) 641 | # non-simple '*?' still used to hit the recursion limit, before the 642 | # non-recursive scheme was implemented. 643 | self.assertEqual(re.search('(a|b)*?c', 10000*'ab'+'cd').end(0), 20001) 644 | 645 | def test_bug_612074(self): 646 | pat=u"["+re.escape(u"\u2039")+u"]" 647 | self.assertEqual(re.compile(pat) and 1, 1) 648 | 649 | def test_stack_overflow(self): 650 | # nasty cases that used to overflow the straightforward recursive 651 | # implementation of repeated groups. 652 | self.assertEqual(re.match('(x)*', 50000*'x').group(1), 'x') 653 | self.assertEqual(re.match('(x)*y', 50000*'x'+'y').group(1), 'x') 654 | self.assertEqual(re.match('(x)*?y', 50000*'x'+'y').group(1), 'x') 655 | 656 | def test_unlimited_zero_width_repeat(self): 657 | # Issue #9669 658 | self.assertIsNone(re.match(r'(?:a?)*y', 'z')) 659 | self.assertIsNone(re.match(r'(?:a?)+y', 'z')) 660 | self.assertIsNone(re.match(r'(?:a?){2,}y', 'z')) 661 | self.assertIsNone(re.match(r'(?:a?)*?y', 'z')) 662 | self.assertIsNone(re.match(r'(?:a?)+?y', 'z')) 663 | self.assertIsNone(re.match(r'(?:a?){2,}?y', 'z')) 664 | 665 | def test_scanner(self): 666 | def s_ident(scanner, token): return token 667 | def s_operator(scanner, token): return "op%s" % token 668 | def s_float(scanner, token): return float(token) 669 | def s_int(scanner, token): return int(token) 670 | 671 | scanner = Scanner([ 672 | (r"[a-zA-Z_]\w*", s_ident), 673 | (r"\d+\.\d*", s_float), 674 | (r"\d+", s_int), 675 | (r"=|\+|-|\*|/", s_operator), 676 | (r"\s+", None), 677 | ]) 678 | 679 | self.assertNotEqual(scanner.scanner.scanner("").pattern, None) 680 | 681 | self.assertEqual(scanner.scan("sum = 3*foo + 312.50 + bar"), 682 | (['sum', 'op=', 3, 'op*', 'foo', 'op+', 312.5, 683 | 'op+', 'bar'], '')) 684 | 685 | def test_bug_448951(self): 686 | # bug 448951 (similar to 429357, but with single char match) 687 | # (Also test greedy matches.) 
688 | for op in '','?','*': 689 | self.assertEqual(re.match(r'((.%s):)?z'%op, 'z').groups(), 690 | (None, None)) 691 | self.assertEqual(re.match(r'((.%s):)?z'%op, 'a:z').groups(), 692 | ('a:', 'a')) 693 | 694 | def test_bug_725106(self): 695 | # capturing groups in alternatives in repeats 696 | self.assertEqual(re.match('^((a)|b)*', 'abc').groups(), 697 | ('b', 'a')) 698 | self.assertEqual(re.match('^(([ab])|c)*', 'abc').groups(), 699 | ('c', 'b')) 700 | self.assertEqual(re.match('^((d)|[ab])*', 'abc').groups(), 701 | ('b', None)) 702 | self.assertEqual(re.match('^((a)c|[ab])*', 'abc').groups(), 703 | ('b', None)) 704 | self.assertEqual(re.match('^((a)|b)*?c', 'abc').groups(), 705 | ('b', 'a')) 706 | self.assertEqual(re.match('^(([ab])|c)*?d', 'abcd').groups(), 707 | ('c', 'b')) 708 | self.assertEqual(re.match('^((d)|[ab])*?c', 'abc').groups(), 709 | ('b', None)) 710 | self.assertEqual(re.match('^((a)c|[ab])*?c', 'abc').groups(), 711 | ('b', None)) 712 | 713 | def test_bug_725149(self): 714 | # mark_stack_base restoring before restoring marks 715 | self.assertEqual(re.match('(a)(?:(?=(b)*)c)*', 'abb').groups(), 716 | ('a', None)) 717 | self.assertEqual(re.match('(a)((?!(b)*))*', 'abb').groups(), 718 | ('a', None, None)) 719 | 720 | def test_bug_764548(self): 721 | # bug 764548, re.compile() barfs on str/unicode subclasses 722 | try: 723 | unicode 724 | except NameError: 725 | self.skipTest('no problem if we have no unicode') 726 | class my_unicode(unicode): pass 727 | pat = re.compile(my_unicode("abc")) 728 | self.assertEqual(pat.match("xyz"), None) 729 | 730 | def test_finditer(self): 731 | iter = re.finditer(r":+", "a:b::c:::d") 732 | self.assertEqual([item.group(0) for item in iter], 733 | [":", "::", ":::"]) 734 | 735 | def test_bug_926075(self): 736 | try: 737 | unicode 738 | except NameError: 739 | self.skipTest('no problem if we have no unicode') 740 | self.assertTrue(re.compile('bug_926075') is not 741 | re.compile(eval("u'bug_926075'"))) 742 | 743 | def test_bug_931848(self): 744 | try: 745 | unicode 746 | except NameError: 747 | self.skipTest('no problem if we have no unicode') 748 | pattern = eval('u"[\u002E\u3002\uFF0E\uFF61]"') 749 | self.assertEqual(re.compile(pattern).split("a.b.c"), 750 | ['a','b','c']) 751 | 752 | def test_bug_581080(self): 753 | iter = re.finditer(r"\s", "a b") 754 | self.assertEqual(iter.next().span(), (1,2)) 755 | self.assertRaises(StopIteration, iter.next) 756 | 757 | scanner = re.compile(r"\s").scanner("a b") 758 | self.assertEqual(scanner.search().span(), (1, 2)) 759 | self.assertEqual(scanner.search(), None) 760 | 761 | def test_bug_817234(self): 762 | iter = re.finditer(r".*", "asdf") 763 | self.assertEqual(iter.next().span(), (0, 4)) 764 | self.assertEqual(iter.next().span(), (4, 4)) 765 | self.assertRaises(StopIteration, iter.next) 766 | 767 | def test_bug_6561(self): 768 | # '\d' should match characters in Unicode category 'Nd' 769 | # (Number, Decimal Digit), but not those in 'Nl' (Number, 770 | # Letter) or 'No' (Number, Other). 
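The comment above spells out how `\d` should behave under Unicode matching (category Nd only). With python-pcre the same behaviour can also be requested inside the pattern via PCRE's `(*UCP)` option, as used elsewhere in these tests in place of `(?u)`. A small sketch, assuming the underlying PCRE build includes Unicode property support:

```python
import pcre

# \d restricted to Unicode decimal digits (category Nd), enabled by the
# (*UCP) start-of-pattern option rather than a compile flag.
assert pcre.match(u'(*UCP)^\\d$', u'\uff10') is not None  # FULLWIDTH DIGIT ZERO, 'Nd'
assert pcre.match(u'(*UCP)^\\d$', u'\u2165') is None      # ROMAN NUMERAL SIX, 'Nl'
```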
771 | decimal_digits = [ 772 | u'\u0037', # '\N{DIGIT SEVEN}', category 'Nd' 773 | u'\u0e58', # '\N{THAI DIGIT SIX}', category 'Nd' 774 | u'\uff10', # '\N{FULLWIDTH DIGIT ZERO}', category 'Nd' 775 | ] 776 | for x in decimal_digits: 777 | self.assertEqual(re.match('^\d$', x, re.UNICODE).group(0), x) 778 | 779 | not_decimal_digits = [ 780 | u'\u2165', # '\N{ROMAN NUMERAL SIX}', category 'Nl' 781 | u'\u3039', # '\N{HANGZHOU NUMERAL TWENTY}', category 'Nl' 782 | u'\u2082', # '\N{SUBSCRIPT TWO}', category 'No' 783 | u'\u32b4', # '\N{CIRCLED NUMBER THIRTY NINE}', category 'No' 784 | ] 785 | for x in not_decimal_digits: 786 | self.assertIsNone(re.match('^\d$', x, re.UNICODE)) 787 | 788 | def test_empty_array(self): 789 | # SF buf 1647541 790 | import array 791 | for typecode in 'cbBuhHiIlLfd': 792 | a = array.array(typecode) 793 | self.assertEqual(re.compile("bla").match(a), None) 794 | self.assertEqual(re.compile("").match(a).groups(), ()) 795 | 796 | def test_inline_flags(self): 797 | # Bug #1700 798 | upper_char = unichr(0x1ea0) # Latin Capital Letter A with Dot Bellow 799 | lower_char = unichr(0x1ea1) # Latin Small Letter A with Dot Bellow 800 | 801 | p = re.compile(upper_char, re.I | re.U) 802 | q = p.match(lower_char) 803 | self.assertNotEqual(q, None) 804 | 805 | p = re.compile(lower_char, re.I | re.U) 806 | q = p.match(upper_char) 807 | self.assertNotEqual(q, None) 808 | 809 | p = re.compile('(?i)' + upper_char, re.U) 810 | q = p.match(lower_char) 811 | self.assertNotEqual(q, None) 812 | 813 | p = re.compile('(?i)' + lower_char, re.U) 814 | q = p.match(upper_char) 815 | self.assertNotEqual(q, None) 816 | 817 | p = re.compile('(?iu)' + upper_char) 818 | q = p.match(lower_char) 819 | self.assertNotEqual(q, None) 820 | 821 | p = re.compile('(?iu)' + lower_char) 822 | q = p.match(upper_char) 823 | self.assertNotEqual(q, None) 824 | 825 | def test_dollar_matches_twice(self): 826 | "$ matches the end of string, and just before the terminating \n" 827 | pattern = re.compile('$') 828 | self.assertEqual(pattern.sub('#', 'a\nb\n'), 'a\nb#\n#') 829 | self.assertEqual(pattern.sub('#', 'a\nb\nc'), 'a\nb\nc#') 830 | self.assertEqual(pattern.sub('#', '\n'), '#\n#') 831 | 832 | pattern = re.compile('$', re.MULTILINE) 833 | self.assertEqual(pattern.sub('#', 'a\nb\n' ), 'a#\nb#\n#' ) 834 | self.assertEqual(pattern.sub('#', 'a\nb\nc'), 'a#\nb#\nc#') 835 | self.assertEqual(pattern.sub('#', '\n'), '#\n#') 836 | 837 | def test_dealloc(self): 838 | # PCRE: disabled, we're not testing _sre here 839 | # issue 3299: check for segfault in debug build 840 | #import _sre 841 | # the overflow limit is different on wide and narrow builds and it 842 | # depends on the definition of SRE_CODE (see sre.h). 843 | # 2**128 should be big enough to overflow on both. For smaller values 844 | # a RuntimeError is raised instead of OverflowError. 
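As the PCRE note just below points out, python-pcre implements `finditer()` as a generator, so a bad subject type is only reported once iteration starts; that is why this dealloc test calls `.next()` before expecting the `TypeError`. A minimal sketch of that behaviour, assuming python-pcre is installed:

```python
import pcre

it = pcre.finditer("a", {})      # creating the generator raises nothing
try:
    it.next()                    # Python 2 iterator protocol
except TypeError:
    pass                         # the type error only surfaces here
```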
845 | #long_overflow = 2**128 846 | # PCRE: finditer is implemented as generator function -- next() has to be called 847 | # to throw the error 848 | #self.assertRaises(TypeError, re.finditer, "a", {}) 849 | self.assertRaises(TypeError, re.finditer("a", {}).next) 850 | #self.assertRaises(OverflowError, _sre.compile, "abc", 0, [long_overflow]) 851 | 852 | def test_compile(self): 853 | # Test return value when given string and pattern as parameter 854 | pattern = re.compile('random pattern') 855 | self.assertIsInstance(pattern, re._pattern_type) 856 | same_pattern = re.compile(pattern) 857 | self.assertIsInstance(same_pattern, re._pattern_type) 858 | self.assertIs(same_pattern, pattern) 859 | # Test behaviour when not given a string or pattern as parameter 860 | self.assertRaises(TypeError, re.compile, 0) 861 | 862 | def test_bug_13899(self): 863 | # Issue #13899: re pattern r"[\A]" should work like "A" but matches 864 | # nothing. Ditto B and Z. 865 | self.assertEqual(re.findall(r'[\A\B\b\C\Z]', 'AB\bCZ'), 866 | ['A', 'B', '\b', 'C', 'Z']) 867 | 868 | @precisionbigmemtest(size=_2G, memuse=1) 869 | def test_large_search(self, size): 870 | # Issue #10182: indices were 32-bit-truncated. 871 | s = 'a' * size 872 | m = re.search('$', s) 873 | self.assertIsNotNone(m) 874 | self.assertEqual(m.start(), size) 875 | self.assertEqual(m.end(), size) 876 | 877 | # The huge memuse is because of re.sub() using a list and a join() 878 | # to create the replacement result. 879 | @precisionbigmemtest(size=_2G, memuse=16 + 2) 880 | def test_large_subn(self, size): 881 | # Issue #10182: indices were 32-bit-truncated. 882 | s = 'a' * size 883 | r, n = re.subn('', '', s) 884 | self.assertEqual(r, s) 885 | self.assertEqual(n, size + 1) 886 | 887 | 888 | def test_repeat_minmax_overflow(self): 889 | # Issue #13169 890 | string = "x" * 100000 891 | self.assertEqual(re.match(r".{65535}", string).span(), (0, 65535)) 892 | self.assertEqual(re.match(r".{,65535}", string).span(), (0, 65535)) 893 | self.assertEqual(re.match(r".{65535,}?", string).span(), (0, 65535)) 894 | self.assertEqual(re.match(r".{65536}", string).span(), (0, 65536)) 895 | self.assertEqual(re.match(r".{,65536}", string).span(), (0, 65536)) 896 | self.assertEqual(re.match(r".{65536,}?", string).span(), (0, 65536)) 897 | # 2**128 should be big enough to overflow both SRE_CODE and Py_ssize_t. 898 | self.assertRaises(OverflowError, re.compile, r".{%d}" % 2**128) 899 | self.assertRaises(OverflowError, re.compile, r".{,%d}" % 2**128) 900 | self.assertRaises(OverflowError, re.compile, r".{%d,}?" % 2**128) 901 | self.assertRaises(OverflowError, re.compile, r".{%d,%d}" % (2**129, 2**128)) 902 | 903 | @cpython_only 904 | def test_repeat_minmax_overflow_maxrepeat(self): 905 | try: 906 | # PCRE: this is defined in pcre module 907 | #from _sre import MAXREPEAT 908 | from pcre import MAXREPEAT 909 | except ImportError: 910 | self.skipTest('requires _sre.MAXREPEAT constant') 911 | string = "x" * 100000 912 | self.assertIsNone(re.match(r".{%d}" % (MAXREPEAT - 1), string)) 913 | self.assertEqual(re.match(r".{,%d}" % (MAXREPEAT - 1), string).span(), 914 | (0, 100000)) 915 | self.assertIsNone(re.match(r".{%d,}?" % (MAXREPEAT - 1), string)) 916 | self.assertRaises(OverflowError, re.compile, r".{%d}" % MAXREPEAT) 917 | self.assertRaises(OverflowError, re.compile, r".{,%d}" % MAXREPEAT) 918 | self.assertRaises(OverflowError, re.compile, r".{%d,}?" 
% MAXREPEAT)
919 | 
920 | def test_backref_group_name_in_exception(self):
921 | # Issue 17341: Poor error message when compiling invalid regex
922 | with self.assertRaisesRegexp(sre_constants.error, '<foo>'):
923 | re.compile('(?P=<foo>)')
924 | 
925 | def test_group_name_in_exception(self):
926 | # Issue 17341: Poor error message when compiling invalid regex
927 | with self.assertRaisesRegexp(sre_constants.error, '\?foo'):
928 | re.compile('(?P<?foo>)')
929 | 
930 | def test_issue17998(self):
931 | for reps in '*', '+', '?', '{1}':
932 | for mod in '', '?':
933 | pattern = '.' + reps + mod + 'yz'
934 | self.assertEqual(re.compile(pattern, re.S).findall('xyz'),
935 | ['xyz'], msg=pattern)
936 | pattern = pattern.encode()
937 | self.assertEqual(re.compile(pattern, re.S).findall(b'xyz'),
938 | [b'xyz'], msg=pattern)
939 | 
940 | 
941 | def test_bug_2537(self):
942 | # issue 2537: empty submatches
943 | for outer_op in ('{0,}', '*', '+', '{1,187}'):
944 | for inner_op in ('{0,}', '*', '?'):
945 | r = re.compile("^((x|y)%s)%s" % (inner_op, outer_op))
946 | m = r.match("xyyzy")
947 | self.assertEqual(m.group(0), "xyy")
948 | self.assertEqual(m.group(1), "")
949 | self.assertEqual(m.group(2), "y")
950 | 
951 | def test_debug_flag(self):
952 | with captured_stdout() as out:
953 | re.compile('foo', re.DEBUG)
954 | self.assertEqual(out.getvalue().splitlines(),
955 | ['literal 102', 'literal 111', 'literal 111'])
956 | # Debug output is output again even a second time (bypassing
957 | # the cache -- issue #20426).
958 | with captured_stdout() as out:
959 | re.compile('foo', re.DEBUG)
960 | self.assertEqual(out.getvalue().splitlines(),
961 | ['literal 102', 'literal 111', 'literal 111'])
962 | 
963 | def test_keyword_parameters(self):
964 | # Issue #20283: Accepting the string keyword parameter.
965 | pat = re.compile(r'(ab)')
966 | self.assertEqual(
967 | pat.match(string='abracadabra', pos=7, endpos=10).span(), (7, 9))
968 | self.assertEqual(
969 | pat.search(string='abracadabra', pos=3, endpos=10).span(), (7, 9))
970 | self.assertEqual(
971 | pat.findall(string='abracadabra', pos=3, endpos=10), ['ab'])
972 | self.assertEqual(
973 | pat.split(string='abracadabra', maxsplit=1),
974 | ['', 'ab', 'racadabra'])
975 | 
976 | 
977 | def run_re_tests():
978 | from test.re_tests import tests, SUCCEED, FAIL, SYNTAX_ERROR
979 | if verbose:
980 | print 'Running re_tests test suite'
981 | else:
982 | # To save time, only run the first and last 10 tests
983 | #tests = tests[:10] + tests[-10:]
984 | pass
985 | 
986 | for t in tests:
987 | sys.stdout.flush()
988 | pattern = s = outcome = repl = expected = None
989 | if len(t) == 5:
990 | pattern, s, outcome, repl, expected = t
991 | elif len(t) == 3:
992 | pattern, s, outcome = t
993 | else:
994 | raise ValueError, ('Test tuples should have 3 or 5 fields', t)
995 | 
996 | try:
997 | obj = re.compile(pattern)
998 | except re.error:
999 | if outcome == SYNTAX_ERROR: pass # Expected a syntax error
1000 | else:
1001 | print '=== Syntax error:', t
1002 | except KeyboardInterrupt: raise KeyboardInterrupt
1003 | except:
1004 | print '*** Unexpected error ***', t
1005 | if verbose:
1006 | traceback.print_exc(file=sys.stdout)
1007 | else:
1008 | try:
1009 | result = obj.search(s)
1010 | except re.error, msg:
1011 | print '=== Unexpected exception', t, repr(msg)
1012 | if outcome == SYNTAX_ERROR:
1013 | # This should have been a syntax error; forget it.
1014 | pass 1015 | elif outcome == FAIL: 1016 | if result is None: pass # No match, as expected 1017 | else: print '=== Succeeded incorrectly', t 1018 | elif outcome == SUCCEED: 1019 | if result is not None: 1020 | # Matched, as expected, so now we compute the 1021 | # result string and compare it to our expected result. 1022 | start, end = result.span(0) 1023 | vardict={'found': result.group(0), 1024 | 'groups': result.group(), 1025 | 'flags': result.re.flags} 1026 | for i in range(1, 100): 1027 | try: 1028 | gi = result.group(i) 1029 | # Special hack because else the string concat fails: 1030 | if gi is None: 1031 | gi = "None" 1032 | except IndexError: 1033 | gi = "Error" 1034 | vardict['g%d' % i] = gi 1035 | for i in result.re.groupindex.keys(): 1036 | try: 1037 | gi = result.group(i) 1038 | if gi is None: 1039 | gi = "None" 1040 | except IndexError: 1041 | gi = "Error" 1042 | vardict[i] = gi 1043 | repl = eval(repl, vardict) 1044 | if repl != expected: 1045 | print '=== grouping error', t, 1046 | print repr(repl) + ' should be ' + repr(expected) 1047 | else: 1048 | print '=== Failed incorrectly', t 1049 | 1050 | # Try the match on a unicode string, and check that it 1051 | # still succeeds. 1052 | try: 1053 | result = obj.search(unicode(s, "latin-1")) 1054 | if result is None: 1055 | print '=== Fails on unicode match', t 1056 | except NameError: 1057 | continue # 1.5.2 1058 | except TypeError: 1059 | continue # unicode test case 1060 | 1061 | # Try the match on a unicode pattern, and check that it 1062 | # still succeeds. 1063 | obj=re.compile(unicode(pattern, "latin-1")) 1064 | result = obj.search(s) 1065 | if result is None: 1066 | print '=== Fails on unicode pattern match', t 1067 | 1068 | # Try the match with the search area limited to the extent 1069 | # of the match and see if it still succeeds. \B will 1070 | # break (because it won't match at the end or start of a 1071 | # string), so we'll ignore patterns that feature it. 1072 | 1073 | if pattern[:2] != '\\B' and pattern[-2:] != '\\B' \ 1074 | and result is not None: 1075 | obj = re.compile(pattern) 1076 | result = obj.search(s, result.start(0), result.end(0) + 1) 1077 | if result is None: 1078 | print '=== Failed on range-limited match', t 1079 | 1080 | # Try the match with IGNORECASE enabled, and check that it 1081 | # still succeeds. 1082 | obj = re.compile(pattern, re.IGNORECASE) 1083 | result = obj.search(s) 1084 | if result is None: 1085 | print '=== Fails on case-insensitive match', t 1086 | 1087 | # Try the match with LOCALE enabled, and check that it 1088 | # still succeeds. 1089 | obj = re.compile(pattern, re.LOCALE) 1090 | result = obj.search(s) 1091 | if result is None: 1092 | print '=== Fails on locale-sensitive match', t 1093 | 1094 | # Try the match with UNICODE locale enabled, and check 1095 | # that it still succeeds. 1096 | obj = re.compile(pattern, re.UNICODE) 1097 | result = obj.search(s) 1098 | if result is None: 1099 | print '=== Fails on unicode-sensitive match', t 1100 | 1101 | def test_main(): 1102 | run_unittest(ReTests) 1103 | run_re_tests() 1104 | 1105 | if __name__ == "__main__": 1106 | test_main() 1107 | --------------------------------------------------------------------------------