├── .gitignore ├── .travis.yml ├── LICENSE ├── Makefile ├── README.md ├── __init__.py ├── _cdifflib.c ├── _cdifflib3.c ├── cdifflib.py ├── pyproject.toml ├── setup.py └── tests ├── cdifflib_tests.py └── testdata.py /.gitignore: -------------------------------------------------------------------------------- 1 | CDiffLib.egg-info 2 | *.pyc 3 | _cdifflib*.so 4 | build/ 5 | dist/ 6 | wheelhouse/ 7 | venv/ 8 | .pypirc 9 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | sudo: false 3 | python: 4 | - "2.7" 5 | - "3.3" 6 | - "3.4" 7 | - "3.5" 8 | - "3.6" 9 | os: 10 | - linux 11 | # - osx # Unfortunately Py2.7 seems broken on travis as of 2017-07 12 | install: 13 | - python setup.py install 14 | script: 15 | - python setup.py test 16 | notifications: 17 | email: 18 | on_success: change 19 | on_failure: always 20 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013, Matthew Duggan 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without modification, 5 | are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, this 11 | list of conditions and the following disclaimer in the documentation and/or 12 | other materials provided with the distribution. 13 | 14 | * Neither the name of the {organization} nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 
17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR 22 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 25 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | SHELL := /bin/bash 2 | 3 | default: venv build install test 4 | 5 | build: 6 | source .venv/bin/activate && python -m build -s -w 7 | 8 | install: 9 | source .venv/bin/activate && pip install dist/cdifflib-*.tar.gz 10 | 11 | venv: 12 | python3 -m venv .venv 13 | source .venv/bin/activate && pip install build ruff mypy twine pytest 14 | 15 | test: 16 | source .venv/bin/activate && python -m pytest tests/cdifflib_tests.py 17 | 18 | clean: 19 | rm -rf build/ 20 | rm -rf dist/ 21 | rm -rf CDiffLib.egg-info 22 | rm -f _cdifflib.so 23 | rm -f *.pyc 24 | rm -f tests/*.pyc 25 | rm -rf __pycache__ 26 | rm -rf tests/__pycache__ 27 | rm -rf .venv/ 28 | 29 | PYVERSIONS = 3.9 3.10 3.11 3.12 3.13 30 | 31 | multidist: 32 | source .venv/bin/activate && python -m build -s 33 | $(foreach pyver,$(PYVERSIONS),rm -rf venv-tmp-$(pyver) && python$(pyver) -m venv venv-tmp-$(pyver) && source venv-tmp-$(pyver)/bin/activate && pip install build && python -m build && rm -rf venv-tmp-$(pyver)) 34 | twine check dist/* 35 
| 36 | upload: 37 | twine upload dist/* 38 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | cdifflib 2 | ======== 3 | [](https://travis-ci.org/mduggan/cdifflib/) 4 | 5 | Python [difflib](http://docs.python.org/2/library/difflib.html) sequence 6 | matcher reimplemented in C. 7 | 8 | Actually only contains the reimplemented parts. Creates a `CSequenceMatcher` type 9 | which inherits most functions from `difflib.SequenceMatcher`. 10 | 11 | `cdifflib` is about 4x the speed of the pure python `difflib` when diffing 12 | large streams. 13 | 14 | Limitations 15 | ----------- 16 | The C part of the code can only work on `list` rather than generic iterables, 17 | so anything that isn't a `list` will be converted to `list` in the 18 | `CSequenceMatcher` constructor. This may cause undesirable behavior if you're 19 | not expecting it. 20 | 21 | Works with Python 2.7 and 3.6 (Should work on all 3.3+) 22 | 23 | Usage 24 | ----- 25 | Can be used just like the `difflib.SequenceMatcher` as long as you pass lists. These examples are right out of the [difflib docs](http://docs.python.org/2/library/difflib.html): 26 | ```Python 27 | >>> from cdifflib import CSequenceMatcher 28 | >>> s = CSequenceMatcher(None, ' abcd', 'abcd abcd') 29 | >>> s.find_longest_match(0, 5, 0, 9) 30 | Match(a=0, b=4, size=5) 31 | >>> s = CSequenceMatcher(lambda x: x == " ", 32 | ... "private Thread currentThread;", 33 | ... 
"private volatile Thread currentThread;") 34 | >>> print(round(s.ratio(), 3)) 35 | 0.866 36 | ``` 37 | 38 | It's completely compatible, so you can replace the difflib version on startup 39 | and then other libraries will use CSequenceMatcher too, e.g.: 40 | ```Python 41 | from cdifflib import CSequenceMatcher 42 | import difflib 43 | difflib.SequenceMatcher = CSequenceMatcher 44 | import library_that_uses_difflib 45 | 46 | # Now the library will transparently be using the C SequenceMatcher - other 47 | # things remain the same 48 | library_that_uses_difflib.do_some_diffing() 49 | ``` 50 | 51 | 52 | Making 53 | ------ 54 | Set up dev environment: 55 | ``` 56 | make venv 57 | source .venv/bin/activate 58 | ``` 59 | 60 | To build/install into the venv: 61 | ``` 62 | make build 63 | make install 64 | ``` 65 | 66 | To test: 67 | ``` 68 | make test 69 | ``` 70 | 71 | License etc 72 | ----------- 73 | This code lives at https://github.com/mduggan. See LICENSE for the license. 74 | 75 | 76 | Changelog 77 | --------- 78 | * 1.2.9 - Repackage again, no code change (#13) 79 | * 1.2.8 - Bump to fix version number in py file, no code change 80 | * 1.2.7 - Update for newer pythons (#12) 81 | * 1.2.6 - Clear state correctly when replacing seq1 (#10) 82 | * 1.2.5 - Fix some memory leaks (#7) 83 | * 1.2.4 - Repackage yet again using twine for pypi upload (no binary changes) 84 | * 1.2.3 - Repackage again with changelog update and corrected src package (no binary changes) 85 | * 1.2.2 - Repackage to add README.md in a way pypi supports (no binary changes) 86 | * 1.2.1 - Fix bug for longer sequences with "autojunk" 87 | * 1.2.0 - Python 3 support for other versions 88 | * 1.1.0 - Added Python 3.6 support (thanks Bclavie) 89 | * 1.0.4 - Changes to make it compile on MSVC++ compiler, no change for other platforms 90 | * 1.0.2 - Bugfix - also replace set_seq1 implementation so `difflib.compare` works with a `CSequenceMatcher` 91 | * 1.0.1 - Implement more bits in c to squeeze a bit more 
speed out 92 | * 1.0.0 - First release 93 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mduggan/cdifflib/6e70fcff18194e0159a23845beb13ac7875fa067/__init__.py -------------------------------------------------------------------------------- /_cdifflib.c: -------------------------------------------------------------------------------- 1 | #include <Python.h> 2 | 3 | #if PY_MAJOR_VERSION == 2 4 | 5 | // 6 | // A simple wrapper to see if two Python list entries are "Python equal". 7 | // 8 | static __inline int 9 | list_items_eq(PyObject *a, int ai, PyObject *b, int bi) 10 | { 11 | PyObject *o1 = PyList_GET_ITEM(a, ai); 12 | PyObject *o2 = PyList_GET_ITEM(b, bi); 13 | int result = PyObject_RichCompareBool(o1, o2, Py_EQ); 14 | return result; 15 | } 16 | 17 | // 18 | // A simple wrapper to call a callable python object with an argument and 19 | // return the result as a boolean. 20 | // 21 | static __inline int 22 | call_obj(PyObject *callable, PyObject *arg) 23 | { 24 | PyObject *result; 25 | int retval; 26 | if (!callable) 27 | return 0; 28 | assert(PyCallable_Check(callable)); 29 | result = PyObject_CallFunctionObjArgs(callable, arg, NULL); 30 | retval = PyObject_IsTrue(result); 31 | Py_DECREF(result); 32 | return retval; 33 | } 34 | 35 | static void 36 | _find_longest_match_worker( 37 | PyObject *self, 38 | PyObject *a, 39 | PyObject *b, 40 | PyObject *isbjunk, 41 | const int alo, 42 | const int ahi, 43 | const int blo, 44 | const int bhi, 45 | long *_besti, 46 | long *_bestj, 47 | long *_bestsize 48 | ) 49 | { 50 | int besti = alo; 51 | int bestj = blo; 52 | int bestsize = 0; 53 | int i; 54 | 55 | assert(self && a && b); 56 | 57 | // Degenerate case: evaluate an empty range: there is no match. 
58 | if (alo == ahi || blo == bhi) { 59 | *_besti = alo; 60 | *_bestj = blo; 61 | *_bestsize = 0; 62 | return; 63 | } 64 | 65 | //printf("longest match helper\n"); 66 | { 67 | PyObject *b2j = PyObject_GetAttrString(self, "b2j"); 68 | PyObject *j2len = PyDict_New(); 69 | PyObject *newj2len = PyDict_New(); 70 | 71 | assert(PyDict_CheckExact(b2j)); 72 | 73 | // 74 | // This loop creates a lot of python objects only to read back 75 | // their values inside the same loop. It should be faster using a 76 | // simpler data structure to do the same thing, but I rewrote it 77 | // with a c++ unordered_map and it was half the speed :( 78 | // 79 | for (i = alo; i < ahi; i++) 80 | { 81 | PyObject *tmp; 82 | PyObject *oj = PyDict_GetItem(b2j, PyList_GET_ITEM(a, i)); 83 | 84 | /* oj is a list of indexes in b at which the line a[i] appears, or 85 | NULL if it does not appear */ 86 | if (oj != NULL) 87 | { 88 | int ojlen, oji; 89 | assert(PyList_Check(oj)); 90 | ojlen = (int)PyList_GET_SIZE(oj); 91 | for (oji = 0; oji < ojlen; oji++) 92 | { 93 | PyObject *j2len_j, *jint, *kint, *jminus1; 94 | int j; 95 | int k = 1; 96 | 97 | jint = PyList_GET_ITEM(oj, oji); 98 | assert(PyInt_CheckExact(jint)); 99 | j = (int)PyInt_AS_LONG(jint); 100 | 101 | if (j < blo) 102 | continue; 103 | if (j >= bhi) 104 | break; 105 | 106 | jminus1 = PyInt_FromLong(j-1); 107 | j2len_j = PyDict_GetItem(j2len, jminus1); 108 | Py_DECREF(jminus1); 109 | if (j2len_j) 110 | k += (int)PyInt_AS_LONG(j2len_j); 111 | 112 | // this looks like an allocation, but k is usually low 113 | kint = PyInt_FromLong(k); 114 | PyDict_SetItem(newj2len, jint, kint); 115 | Py_DECREF(kint); 116 | 117 | if (k > bestsize) { 118 | besti = i-k; 119 | bestj = j-k; 120 | bestsize = k; 121 | } 122 | } 123 | } 124 | 125 | // Cycle j2len and newj2len 126 | tmp = j2len; 127 | j2len = newj2len; 128 | newj2len = tmp; 129 | PyDict_Clear(newj2len); 130 | } 131 | 132 | /* besti and bestj are offset by 1 if set in the loop above */ 133 | if (bestsize) 
134 | { 135 | besti++; 136 | bestj++; 137 | } 138 | 139 | /* Done with these now. */ 140 | Py_DECREF(j2len); 141 | Py_DECREF(newj2len); 142 | Py_DECREF(b2j); 143 | } 144 | 145 | //printf("twiddle values %d %d %d %d %d %d\n", besti, alo, ahi, bestj, blo, bhi); 146 | while (besti > alo && bestj > blo && 147 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) && 148 | list_items_eq(a, besti-1, b, bestj-1)) 149 | { 150 | besti--; 151 | bestj--; 152 | bestsize++; 153 | } 154 | 155 | //printf("twiddle values 2\n"); 156 | while (besti+bestsize < ahi && bestj+bestsize < bhi && 157 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) && 158 | list_items_eq(a, besti+bestsize, b, bestj+bestsize)) 159 | { 160 | bestsize++; 161 | } 162 | 163 | 164 | //printf("twiddle values 3\n"); 165 | while (besti > alo && bestj > blo && 166 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) && 167 | list_items_eq(a, besti-1, b, bestj-1)) 168 | { 169 | besti--; 170 | bestj--; 171 | bestsize++; 172 | } 173 | 174 | //printf("twiddle values 4\n"); 175 | while (besti+bestsize < ahi && bestj+bestsize < bhi && 176 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) && 177 | list_items_eq(a, besti+bestsize, b, bestj+bestsize)) 178 | { 179 | bestsize++; 180 | } 181 | 182 | //printf("helper done\n"); 183 | *_besti = besti; 184 | *_bestj = bestj; 185 | *_bestsize = bestsize; 186 | } 187 | 188 | 189 | // 190 | // A very simple C reimplementation of Python 2.7's 191 | // difflib.SequenceMatcher.find_longest_match() 192 | // 193 | // The algorithm is identical (right down to using Python dicts and lists for 194 | // local variables), but the c version runs in 1/4 the time. 
195 | // 196 | static PyObject * 197 | find_longest_match(PyObject *module, PyObject *args) 198 | { 199 | long alo, ahi, blo, bhi; 200 | long besti, bestj, bestsize; 201 | PyObject *self, *a, *b, *isbjunk; 202 | 203 | if (!PyArg_ParseTuple(args, "Ollll", &self, &alo, &ahi, &blo, &bhi)) { 204 | PyErr_SetString(PyExc_ValueError, "find_longest_match parameters not as expected"); 205 | return NULL; 206 | } 207 | 208 | assert(self); 209 | 210 | //printf("check junk\n"); 211 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */ 212 | { 213 | PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk"); 214 | if (nojunk && PyObject_IsTrue(nojunk)) 215 | { 216 | isbjunk = NULL; 217 | } 218 | else 219 | { 220 | PyErr_Clear(); 221 | isbjunk = PyObject_GetAttrString(self, "isbjunk"); 222 | assert(isbjunk); 223 | if (!PyCallable_Check(isbjunk)) { 224 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable"); 225 | return NULL; 226 | } 227 | } 228 | if (nojunk) 229 | Py_DECREF(nojunk); 230 | } 231 | 232 | //printf("get members\n"); 233 | // FIXME: Really should support non-list sequences for a and b 234 | a = PyObject_GetAttrString(self, "a"); 235 | b = PyObject_GetAttrString(self, "b"); 236 | if (!PyList_Check(a) || !PyList_Check(b)) 237 | return NULL; 238 | 239 | // This function actually does the work, the rest is just window dressing. 
240 | _find_longest_match_worker(self, a, b, isbjunk, alo, ahi, blo, bhi, &besti, &bestj, &bestsize); 241 | 242 | //printf("done\n"); 243 | 244 | Py_DECREF(a); 245 | Py_DECREF(b); 246 | if (isbjunk) 247 | Py_DECREF(isbjunk); 248 | 249 | return Py_BuildValue("iii", besti, bestj, bestsize); 250 | } 251 | 252 | /* 253 | def __helper(self, alo, ahi, blo, bhi, answer): 254 | i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi) 255 | # a[alo:i] vs b[blo:j] unknown 256 | # a[i:i+k] same as b[j:j+k] 257 | # a[i+k:ahi] vs b[j+k:bhi] unknown 258 | if k: 259 | if alo < i and blo < j: 260 | self.__helper(alo, i, blo, j, answer) 261 | answer.append(x) 262 | if i+k < ahi and j+k < bhi: 263 | self.__helper(i+k, ahi, j+k, bhi, answer) 264 | */ 265 | 266 | static void 267 | matching_block_helper(PyObject *self, PyObject *a, PyObject *b, PyObject *isjunk, PyObject *answer, const long alo, const long ahi, const long blo, const long bhi) 268 | { 269 | long i, j, k; 270 | //printf("matching_block_helper 1\n"); 271 | _find_longest_match_worker(self, a, b, isjunk, alo, ahi, blo, bhi, &i, &j, &k); 272 | //printf("matching_block_helper 2\n"); 273 | 274 | if (k) { 275 | PyObject *p = Py_BuildValue("(iii)", i, j, k); 276 | if (alo < i && blo < j) 277 | matching_block_helper(self, a, b, isjunk, answer, alo, i, blo, j); 278 | PyList_Append(answer, p); 279 | Py_DECREF(p); 280 | if (i+k < ahi && j+k < bhi) 281 | matching_block_helper(self, a, b, isjunk, answer, i+k, ahi, j+k, bhi); 282 | } 283 | //printf("matching_block_helper 3\n"); 284 | } 285 | 286 | static PyObject * 287 | matching_blocks(PyObject *module, PyObject *args) 288 | { 289 | PyObject *self, *a, *b, *isbjunk, *matching; 290 | long la, lb; 291 | 292 | if (!PyArg_ParseTuple(args, "O", &self)) { 293 | PyErr_SetString(PyExc_ValueError, "expected one argument, self"); 294 | return NULL; 295 | } 296 | 297 | //printf("matching_blocks 1\n"); 298 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */ 299 | { 300 
| PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk"); 301 | if (nojunk && PyObject_IsTrue(nojunk)) 302 | { 303 | isbjunk = NULL; 304 | } 305 | else 306 | { 307 | PyErr_Clear(); 308 | isbjunk = PyObject_GetAttrString(self, "isbjunk"); 309 | assert(isbjunk); 310 | if (!PyCallable_Check(isbjunk)) { 311 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable"); 312 | return NULL; 313 | } 314 | } 315 | if (nojunk) 316 | Py_DECREF(nojunk); 317 | } 318 | 319 | // FIXME: Really should support non-list sequences for a and b 320 | //printf("matching_blocks 2\n"); 321 | a = PyObject_GetAttrString(self, "a"); 322 | b = PyObject_GetAttrString(self, "b"); 323 | if (!PyList_Check(a) || !PyList_Check(b)) { 324 | PyErr_SetString(PyExc_ValueError, "cdifflib only supports lists for both sequences"); 325 | return NULL; 326 | } 327 | 328 | //printf("matching_blocks 3\n"); 329 | la = PyList_GET_SIZE(a); 330 | lb = PyList_GET_SIZE(b); 331 | 332 | matching = PyList_New(0); 333 | 334 | matching_block_helper(self, a, b, isbjunk, matching, 0, la, 0, lb); 335 | 336 | //printf("matching_blocks 4\n"); 337 | if (isbjunk) 338 | Py_DECREF(isbjunk); 339 | Py_DECREF(a); 340 | Py_DECREF(b); 341 | // don't decrement matching, put it straight in to the return val 342 | return Py_BuildValue("N", matching); 343 | } 344 | 345 | 346 | static PyObject * 347 | chain_b(PyObject *module, PyObject *args) 348 | { 349 | long n; 350 | Py_ssize_t i; 351 | 352 | // These are temporary and are decremented after use 353 | PyObject *b, *isjunk, *fast_b, *self; 354 | 355 | // These are needed through the function and are decremented at the end 356 | PyObject *junk = NULL, *popular = NULL, *b2j = NULL, *retval = NULL, *autojunk = NULL; 357 | 358 | //printf("chain_b\n"); 359 | 360 | if (!PyArg_ParseTuple(args, "O", &self)) 361 | goto error; 362 | 363 | b = PyObject_GetAttrString(self, "b"); 364 | if (b == NULL || b == Py_None) 365 | goto error; 366 | b2j = PyDict_New(); 367 | PyObject_SetAttrString(self, 
"b2j", b2j); 368 | 369 | /* construct b2j here */ 370 | //printf("construct b2j\n"); 371 | assert(PySequence_Check(b)); 372 | fast_b = PySequence_Fast(b, "accessing sequence 2"); 373 | Py_DECREF(b); 374 | n = PySequence_Fast_GET_SIZE(fast_b); 375 | for (i = 0; i < n; i++) 376 | { 377 | PyObject *iint; 378 | PyObject *indices; 379 | PyObject *elt = PySequence_Fast_GET_ITEM(fast_b, i); 380 | assert(elt && elt != Py_None); 381 | indices = PyDict_GetItem(b2j, elt); 382 | assert(indices == NULL || indices != Py_None); 383 | if (indices == NULL) 384 | { 385 | if (PyErr_Occurred()) 386 | { 387 | if (!PyErr_ExceptionMatches(PyExc_KeyError)) 388 | { 389 | Py_DECREF(fast_b); 390 | goto error; 391 | } 392 | PyErr_Clear(); 393 | } 394 | 395 | indices = PyList_New(0); 396 | PyDict_SetItem(b2j, elt, indices); 397 | Py_DECREF(indices); 398 | } 399 | iint = PyInt_FromLong(i); 400 | PyList_Append(indices, iint); 401 | Py_DECREF(iint); 402 | } 403 | Py_DECREF(fast_b); 404 | 405 | assert(!PyErr_Occurred()); 406 | 407 | //printf("determine junk\n"); 408 | junk = PySet_New(NULL); 409 | isjunk = PyObject_GetAttrString(self, "isjunk"); 410 | if (isjunk != NULL && isjunk != Py_None) 411 | { 412 | PyObject *keys = PyDict_Keys(b2j); 413 | PyObject *fastkeys; 414 | assert(PySequence_Check(keys)); 415 | fastkeys = PySequence_Fast(keys, "dict keys"); 416 | Py_DECREF(keys); 417 | /* call isjunk here */ 418 | for (i = 0; i < PySequence_Fast_GET_SIZE(fastkeys); i++) 419 | { 420 | PyObject *elt = PySequence_Fast_GET_ITEM(fastkeys, i); 421 | if (call_obj(isjunk, elt)) 422 | { 423 | PySet_Add(junk, elt); 424 | PyDict_DelItem(b2j, elt); 425 | } 426 | } 427 | Py_DECREF(fastkeys); 428 | Py_DECREF(isjunk); 429 | } 430 | 431 | /* build autojunk here */ 432 | //printf("build autojunk\n"); 433 | popular = PySet_New(NULL); 434 | autojunk = PyObject_GetAttrString(self, "autojunk"); 435 | assert(autojunk != NULL); 436 | if (PyObject_IsTrue(autojunk) && n >= 200) { 437 | long ntest = n/100 + 1; 438 | PyObject 
*items = PyDict_Items(b2j); 439 | long b2jlen = PyList_GET_SIZE(items); 440 | for (i = 0; i < b2jlen; i++) 441 | { 442 | PyObject *tuple = PyList_GET_ITEM(items, i); 443 | PyObject *elt = PyTuple_GET_ITEM(tuple, 0); 444 | PyObject *idxs = PyTuple_GET_ITEM(tuple, 1); 445 | 446 | assert(PyList_Check(idxs)); 447 | 448 | if (PyList_GET_SIZE(idxs) > ntest) 449 | { 450 | PySet_Add(popular, elt); 451 | PyDict_DelItem(b2j, elt); 452 | } 453 | } 454 | Py_DECREF(items); 455 | } 456 | 457 | retval = Py_BuildValue("OO", junk, popular); 458 | assert(!PyErr_Occurred()); 459 | 460 | error: 461 | if (b2j) 462 | Py_DECREF(b2j); 463 | if (junk) 464 | Py_DECREF(junk); 465 | if (popular) 466 | Py_DECREF(popular); 467 | if (autojunk) 468 | Py_DECREF(autojunk); 469 | return retval; 470 | } 471 | 472 | // 473 | // Define functions in this module 474 | // 475 | static PyMethodDef CDiffLibMethods[4] = { 476 | {"find_longest_match", find_longest_match, METH_VARARGS, 477 | "c implementation of difflib.SequenceMatcher.find_longest_match"}, 478 | {"chain_b", chain_b, METH_VARARGS, 479 | "c implementation of most of difflib.SequenceMatcher.__chain_b"}, 480 | {"matching_blocks", matching_blocks, METH_VARARGS, 481 | "c implementation of part of difflib.SequenceMatcher.get_matching_blocks"}, 482 | {NULL, NULL, 0, NULL} /* Sentinel */ 483 | }; 484 | 485 | // 486 | // Module init entrypoint. 487 | // 488 | PyMODINIT_FUNC 489 | init_cdifflib(void) 490 | { 491 | PyObject *m; 492 | 493 | m = Py_InitModule("_cdifflib", CDiffLibMethods); 494 | if (m == NULL) 495 | return; 496 | // No special initialisation to do at the moment.. 497 | } 498 | 499 | #endif // PY_MAJOR_VERSION == 2 500 | -------------------------------------------------------------------------------- /_cdifflib3.c: -------------------------------------------------------------------------------- 1 | #include <Python.h> 2 | 3 | #if PY_MAJOR_VERSION == 3 4 | 5 | // 6 | // A simple wrapper to see if two Python list entries are "Python equal". 
7 | // 8 | static __inline int 9 | list_items_eq(PyObject *a, int ai, PyObject *b, int bi) 10 | { 11 | PyObject *o1 = PyList_GET_ITEM(a, ai); 12 | PyObject *o2 = PyList_GET_ITEM(b, bi); 13 | int result = PyObject_RichCompareBool(o1, o2, Py_EQ); 14 | return result; 15 | } 16 | 17 | // 18 | // A simple wrapper to call a callable python object with an argument and 19 | // return the result as a boolean. 20 | // 21 | static __inline int 22 | call_obj(PyObject *callable, PyObject *arg) 23 | { 24 | PyObject *result; 25 | int retval; 26 | if (!callable) 27 | return 0; 28 | assert(PyCallable_Check(callable)); 29 | result = PyObject_CallFunctionObjArgs(callable, arg, NULL); 30 | retval = PyObject_IsTrue(result); 31 | Py_DECREF(result); 32 | return retval; 33 | } 34 | 35 | static void 36 | _find_longest_match_worker( 37 | PyObject *self, 38 | PyObject *a, 39 | PyObject *b, 40 | PyObject *isbjunk, 41 | const int alo, 42 | const int ahi, 43 | const int blo, 44 | const int bhi, 45 | long *_besti, 46 | long *_bestj, 47 | long *_bestsize 48 | ) 49 | { 50 | int besti = alo; 51 | int bestj = blo; 52 | int bestsize = 0; 53 | int i; 54 | 55 | assert(self && a && b); 56 | 57 | // Degenerate case: evaluate an empty range: there is no match. 58 | if (alo == ahi || blo == bhi) { 59 | *_besti = alo; 60 | *_bestj = blo; 61 | *_bestsize = 0; 62 | return; 63 | } 64 | 65 | //printf("longest match helper\n"); 66 | { 67 | PyObject *b2j = PyObject_GetAttrString(self, "b2j"); 68 | PyObject *j2len = PyDict_New(); 69 | PyObject *newj2len = PyDict_New(); 70 | 71 | assert(PyDict_CheckExact(b2j)); 72 | 73 | // 74 | // This loop creates a lot of python objects only to read back 75 | // their values inside the same loop. 
It should be faster using a 76 | // simpler data structure to do the same thing, but I rewrote it 77 | // with a c++ unordered_map and it was half the speed :( 78 | // 79 | for (i = alo; i < ahi; i++) 80 | { 81 | PyObject *tmp; 82 | PyObject *oj = PyDict_GetItem(b2j, PyList_GET_ITEM(a, i)); 83 | 84 | /* oj is a list of indexes in b at which the line a[i] appears, or 85 | NULL if it does not appear */ 86 | if (oj != NULL) 87 | { 88 | int ojlen, oji; 89 | assert(PyList_Check(oj)); 90 | ojlen = (int)PyList_GET_SIZE(oj); 91 | for (oji = 0; oji < ojlen; oji++) 92 | { 93 | PyObject *j2len_j, *jint, *kint, *jminus1; 94 | int j; 95 | int k = 1; 96 | 97 | jint = PyList_GET_ITEM(oj, oji); 98 | assert(PyLong_CheckExact(jint)); 99 | j = (int)PyLong_AsLong(jint); 100 | 101 | if (j < blo) 102 | continue; 103 | if (j >= bhi) 104 | break; 105 | 106 | jminus1 = PyLong_FromLong(j-1); 107 | j2len_j = PyDict_GetItem(j2len, jminus1); 108 | Py_DECREF(jminus1); 109 | if (j2len_j) 110 | k += (int)PyLong_AsLong(j2len_j); 111 | 112 | // this looks like an allocation, but k is usually low 113 | kint = PyLong_FromLong(k); 114 | PyDict_SetItem(newj2len, jint, kint); 115 | Py_DECREF(kint); 116 | 117 | if (k > bestsize) { 118 | besti = i-k; 119 | bestj = j-k; 120 | bestsize = k; 121 | } 122 | } 123 | } 124 | 125 | // Cycle j2len and newj2len 126 | tmp = j2len; 127 | j2len = newj2len; 128 | newj2len = tmp; 129 | PyDict_Clear(newj2len); 130 | } 131 | 132 | /* besti and bestj are offset by 1 if set in the loop above */ 133 | if (bestsize) 134 | { 135 | besti++; 136 | bestj++; 137 | } 138 | 139 | /* Done with these now. 
*/ 140 | Py_DECREF(j2len); 141 | Py_DECREF(newj2len); 142 | Py_DECREF(b2j); 143 | } 144 | 145 | //printf("twiddle values %d %d %d %d %d %d\n", besti, alo, ahi, bestj, blo, bhi); 146 | while (besti > alo && bestj > blo && 147 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) && 148 | list_items_eq(a, besti-1, b, bestj-1)) 149 | { 150 | besti--; 151 | bestj--; 152 | bestsize++; 153 | } 154 | 155 | //printf("twiddle values 2\n"); 156 | while (besti+bestsize < ahi && bestj+bestsize < bhi && 157 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) && 158 | list_items_eq(a, besti+bestsize, b, bestj+bestsize)) 159 | { 160 | bestsize++; 161 | } 162 | 163 | 164 | //printf("twiddle values 3\n"); 165 | while (besti > alo && bestj > blo && 166 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) && 167 | list_items_eq(a, besti-1, b, bestj-1)) 168 | { 169 | besti--; 170 | bestj--; 171 | bestsize++; 172 | } 173 | 174 | //printf("twiddle values 4\n"); 175 | while (besti+bestsize < ahi && bestj+bestsize < bhi && 176 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) && 177 | list_items_eq(a, besti+bestsize, b, bestj+bestsize)) 178 | { 179 | bestsize++; 180 | } 181 | 182 | //printf("helper done\n"); 183 | *_besti = besti; 184 | *_bestj = bestj; 185 | *_bestsize = bestsize; 186 | } 187 | 188 | 189 | // 190 | // A very simple C reimplementation of Python 2.7's 191 | // difflib.SequenceMatcher.find_longest_match() 192 | // 193 | // The algorithm is identical (right down to using Python dicts and lists for 194 | // local variables), but the c version runs in 1/4 the time. 
195 | // 196 | static PyObject * 197 | find_longest_match(PyObject *module, PyObject *args) 198 | { 199 | long alo, ahi, blo, bhi; 200 | long besti, bestj, bestsize; 201 | PyObject *self, *a, *b, *isbjunk; 202 | 203 | if (!PyArg_ParseTuple(args, "Ollll", &self, &alo, &ahi, &blo, &bhi)) { 204 | PyErr_SetString(PyExc_ValueError, "find_longest_match parameters not as expected"); 205 | return NULL; 206 | } 207 | 208 | assert(self); 209 | 210 | //printf("check junk\n"); 211 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */ 212 | { 213 | PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk"); 214 | if (nojunk && PyObject_IsTrue(nojunk)) 215 | { 216 | isbjunk = NULL; 217 | } 218 | else 219 | { 220 | PyErr_Clear(); 221 | isbjunk = PyObject_GetAttrString(self, "isbjunk"); 222 | assert(isbjunk); 223 | if (!PyCallable_Check(isbjunk)) { 224 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable"); 225 | return NULL; 226 | } 227 | } 228 | if (nojunk) 229 | Py_DECREF(nojunk); 230 | } 231 | 232 | //printf("get members\n"); 233 | // FIXME: Really should support non-list sequences for a and b 234 | a = PyObject_GetAttrString(self, "a"); 235 | b = PyObject_GetAttrString(self, "b"); 236 | if (!PyList_Check(a) || !PyList_Check(b)) 237 | return NULL; 238 | 239 | // This function actually does the work, the rest is just window dressing. 
240 | _find_longest_match_worker(self, a, b, isbjunk, alo, ahi, blo, bhi, &besti, &bestj, &bestsize); 241 | 242 | //printf("done\n"); 243 | 244 | Py_DECREF(a); 245 | Py_DECREF(b); 246 | if (isbjunk) 247 | Py_DECREF(isbjunk); 248 | 249 | return Py_BuildValue("iii", besti, bestj, bestsize); 250 | } 251 | 252 | /* 253 | def __helper(self, alo, ahi, blo, bhi, answer): 254 | i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi) 255 | # a[alo:i] vs b[blo:j] unknown 256 | # a[i:i+k] same as b[j:j+k] 257 | # a[i+k:ahi] vs b[j+k:bhi] unknown 258 | if k: 259 | if alo < i and blo < j: 260 | self.__helper(alo, i, blo, j, answer) 261 | answer.append(x) 262 | if i+k < ahi and j+k < bhi: 263 | self.__helper(i+k, ahi, j+k, bhi, answer) 264 | */ 265 | 266 | static void 267 | matching_block_helper(PyObject *self, PyObject *a, PyObject *b, PyObject *isjunk, PyObject *answer, const long alo, const long ahi, const long blo, const long bhi) 268 | { 269 | long i, j, k; 270 | //printf("matching_block_helper 1\n"); 271 | _find_longest_match_worker(self, a, b, isjunk, alo, ahi, blo, bhi, &i, &j, &k); 272 | //printf("matching_block_helper 2\n"); 273 | 274 | if (k) { 275 | PyObject *p = Py_BuildValue("(iii)", i, j, k); 276 | if (alo < i && blo < j) 277 | matching_block_helper(self, a, b, isjunk, answer, alo, i, blo, j); 278 | PyList_Append(answer, p); 279 | Py_DECREF(p); 280 | if (i+k < ahi && j+k < bhi) 281 | matching_block_helper(self, a, b, isjunk, answer, i+k, ahi, j+k, bhi); 282 | } 283 | //printf("matching_block_helper 3\n"); 284 | } 285 | 286 | static PyObject * 287 | matching_blocks(PyObject *module, PyObject *args) 288 | { 289 | PyObject *self, *a, *b, *isbjunk, *matching; 290 | long la, lb; 291 | 292 | if (!PyArg_ParseTuple(args, "O", &self)) { 293 | PyErr_SetString(PyExc_ValueError, "expected one argument, self"); 294 | return NULL; 295 | } 296 | 297 | //printf("matching_blocks 1\n"); 298 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */ 299 | { 300 
| PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk"); 301 | if (nojunk && PyObject_IsTrue(nojunk)) 302 | { 303 | isbjunk = NULL; 304 | } 305 | else 306 | { 307 | PyErr_Clear(); 308 | isbjunk = PyObject_GetAttrString(self, "isbjunk"); 309 | assert(isbjunk); 310 | if (!PyCallable_Check(isbjunk)) { 311 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable"); 312 | return NULL; 313 | } 314 | } 315 | if (nojunk) 316 | Py_DECREF(nojunk); 317 | } 318 | 319 | // FIXME: Really should support non-list sequences for a and b 320 | //printf("matching_blocks 2\n"); 321 | a = PyObject_GetAttrString(self, "a"); 322 | b = PyObject_GetAttrString(self, "b"); 323 | if (!PyList_Check(a) || !PyList_Check(b)) { 324 | PyErr_SetString(PyExc_ValueError, "cdifflib only supports lists for both sequences"); 325 | return NULL; 326 | } 327 | 328 | //printf("matching_blocks 3\n"); 329 | la = PyList_GET_SIZE(a); 330 | lb = PyList_GET_SIZE(b); 331 | 332 | matching = PyList_New(0); 333 | 334 | matching_block_helper(self, a, b, isbjunk, matching, 0, la, 0, lb); 335 | 336 | //printf("matching_blocks 4\n"); 337 | if (isbjunk) 338 | Py_DECREF(isbjunk); 339 | Py_DECREF(a); 340 | Py_DECREF(b); 341 | // don't decrement matching, put it straight in to the return val 342 | return Py_BuildValue("N", matching); 343 | } 344 | 345 | 346 | static PyObject * 347 | chain_b(PyObject *module, PyObject *args) 348 | { 349 | long n; 350 | Py_ssize_t i; 351 | 352 | // These are temporary and are decremented after use 353 | PyObject *b, *isjunk, *fast_b, *self; 354 | 355 | // These are needed through the function and are decremented at the end 356 | PyObject *junk = NULL, *popular = NULL, *b2j = NULL, *retval = NULL, *autojunk = NULL; 357 | 358 | //printf("chain_b\n"); 359 | 360 | if (!PyArg_ParseTuple(args, "O", &self)) 361 | goto error; 362 | 363 | b = PyObject_GetAttrString(self, "b"); 364 | if (b == NULL || b == Py_None) 365 | goto error; 366 | b2j = PyDict_New(); 367 | PyObject_SetAttrString(self, 
"b2j", b2j); 368 | 369 | /* construct b2j here */ 370 | //printf("construct b2j\n"); 371 | assert(PySequence_Check(b)); 372 | fast_b = PySequence_Fast(b, "accessing sequence 2"); 373 | Py_DECREF(b); 374 | n = PySequence_Fast_GET_SIZE(fast_b); 375 | for (i = 0; i < n; i++) 376 | { 377 | PyObject *iint; 378 | PyObject *indices; 379 | PyObject *elt = PySequence_Fast_GET_ITEM(fast_b, i); 380 | assert(elt && elt != Py_None); 381 | indices = PyDict_GetItem(b2j, elt); 382 | assert(indices == NULL || indices != Py_None); 383 | if (indices == NULL) 384 | { 385 | if (PyErr_Occurred()) 386 | { 387 | if (!PyErr_ExceptionMatches(PyExc_KeyError)) 388 | { 389 | Py_DECREF(fast_b); 390 | goto error; 391 | } 392 | PyErr_Clear(); 393 | } 394 | 395 | indices = PyList_New(0); 396 | PyDict_SetItem(b2j, elt, indices); 397 | Py_DECREF(indices); 398 | } 399 | iint = PyLong_FromLong(i); 400 | PyList_Append(indices, iint); 401 | Py_DECREF(iint); 402 | } 403 | Py_DECREF(fast_b); 404 | 405 | assert(!PyErr_Occurred()); 406 | 407 | //printf("determine junk\n"); 408 | junk = PySet_New(NULL); 409 | isjunk = PyObject_GetAttrString(self, "isjunk"); 410 | if (isjunk != NULL && isjunk != Py_None) 411 | { 412 | PyObject *keys = PyDict_Keys(b2j); 413 | PyObject *fastkeys; 414 | assert(PySequence_Check(keys)); 415 | fastkeys = PySequence_Fast(keys, "dict keys"); 416 | Py_DECREF(keys); 417 | /* call isjunk here */ 418 | for (i = 0; i < PySequence_Fast_GET_SIZE(fastkeys); i++) 419 | { 420 | PyObject *elt = PySequence_Fast_GET_ITEM(fastkeys, i); 421 | if (call_obj(isjunk, elt)) 422 | { 423 | PySet_Add(junk, elt); 424 | PyDict_DelItem(b2j, elt); 425 | } 426 | } 427 | Py_DECREF(fastkeys); 428 | Py_DECREF(isjunk); 429 | } 430 | 431 | /* build autojunk here */ 432 | //printf("build autojunk\n"); 433 | popular = PySet_New(NULL); 434 | autojunk = PyObject_GetAttrString(self, "autojunk"); 435 | assert(autojunk != NULL); 436 | if (PyObject_IsTrue(autojunk) && n >= 200) { 437 | long ntest = n/100 + 1; 438 | PyObject 
*items = PyDict_Items(b2j); 439 | long b2jlen = PyList_GET_SIZE(items); 440 | for (i = 0; i < b2jlen; i++) 441 | { 442 | PyObject *tuple = PyList_GET_ITEM(items, i); 443 | PyObject *elt = PyTuple_GET_ITEM(tuple, 0); 444 | PyObject *idxs = PyTuple_GET_ITEM(tuple, 1); 445 | 446 | assert(PyList_Check(idxs)); 447 | 448 | if (PyList_GET_SIZE(idxs) > ntest) 449 | { 450 | PySet_Add(popular, elt); 451 | PyDict_DelItem(b2j, elt); 452 | } 453 | } 454 | Py_DECREF(items); 455 | } 456 | 457 | retval = Py_BuildValue("OO", junk, popular); 458 | assert(!PyErr_Occurred()); 459 | 460 | error: 461 | if (b2j) 462 | Py_DECREF(b2j); 463 | if (junk) 464 | Py_DECREF(junk); 465 | if (popular) 466 | Py_DECREF(popular); 467 | if (autojunk) 468 | Py_DECREF(autojunk); 469 | return retval; 470 | } 471 | 472 | // 473 | // Define functions in this module 474 | // 475 | 476 | static PyMethodDef CDiffLibMethods[4] = { 477 | {"find_longest_match", find_longest_match, METH_VARARGS, 478 | "c implementation of difflib.SequenceMatcher.find_longest_match"}, 479 | {"chain_b", chain_b, METH_VARARGS, 480 | "c implementation of most of difflib.SequenceMatcher.__chain_b"}, 481 | {"matching_blocks", matching_blocks, METH_VARARGS, 482 | "c implementation of part of difflib.SequenceMatcher.get_matching_blocks"}, 483 | {NULL, NULL, 0, NULL} /* Sentinel */ 484 | }; 485 | 486 | static struct PyModuleDef _cdifflib = { 487 | PyModuleDef_HEAD_INIT, 488 | "_cdifflib", 489 | "C Implementation of Python3's difflib", 490 | -1, 491 | CDiffLibMethods 492 | }; 493 | 494 | // Init module 495 | 496 | PyMODINIT_FUNC PyInit__cdifflib() 497 | { 498 | return PyModule_Create(&_cdifflib); 499 | } 500 | 501 | #endif // PY_MAJOR_VERSION == 3 502 | -------------------------------------------------------------------------------- /cdifflib.py: -------------------------------------------------------------------------------- 1 | """ 2 | Module cdifflib -- c implementation of difflib. 
3 | 4 | Class CSequenceMatcher: 5 | A faster version of difflib.SequenceMatcher. Reimplements the 6 | bottleneck functions (find_longest_match, chain_b and the matching-blocks loop) in native C. The rest of the 7 | implementation is inherited. 8 | """ 9 | 10 | __all__ = ["CSequenceMatcher", "__version__"] 11 | 12 | __version__ = "1.2.9" 13 | 14 | import sys 15 | from difflib import SequenceMatcher as _SequenceMatcher 16 | from difflib import Match as _Match 17 | import _cdifflib 18 | 19 | 20 | class CSequenceMatcher(_SequenceMatcher): 21 | def __init__(self, isjunk=None, a="", b="", autojunk=True): 22 | """Construct a CSequenceMatcher. 23 | 24 | Simply wraps the difflib.SequenceMatcher. 25 | """ 26 | if sys.version_info[0] == 2 and sys.version_info[1] < 7: 27 | # No autojunk in Python 2.6 and lower 28 | _SequenceMatcher.__init__(self, isjunk, a, b) 29 | else: 30 | _SequenceMatcher.__init__(self, isjunk, a, b, autojunk) 31 | 32 | def find_longest_match(self, alo, ahi, blo, bhi): 33 | """Find longest matching block in a[alo:ahi] and b[blo:bhi]. 34 | 35 | Wrapper for the C implementation of this function. 36 | """ 37 | besti, bestj, bestsize = _cdifflib.find_longest_match(self, alo, ahi, blo, bhi) 38 | return _Match(besti, bestj, bestsize) 39 | 40 | def set_seq1(self, a): 41 | """Same as SequenceMatcher.set_seq1, but converts non-list inputs to a 42 | list and checks that the items are hashable.""" 43 | if a is self.a: 44 | return 45 | self.a = a 46 | if not isinstance(self.a, list): 47 | self.a = list(self.a) 48 | # Types must be hashable to work in the c layer. This will raise if 49 | # list items are *not* hashable. 50 | [hash(x) for x in self.a] 51 | self.matching_blocks = self.opcodes = None 52 | 53 | def set_seq2(self, b): 54 | """Same as SequenceMatcher.set_seq2, but uses the c chainb 55 | implementation. 
56 | """ 57 | if b is self.b and hasattr(self, "isbjunk"): 58 | return 59 | self.b = b 60 | if not isinstance(self.a, list): 61 | self.a = list(self.a) 62 | if not isinstance(self.b, list): 63 | self.b = list(self.b) 64 | 65 | # Types must be hashable to work in the c layer. These checks will 66 | # raise the correct error if they are *not* hashable. 67 | [hash(x) for x in self.a] 68 | [hash(x) for x in self.b] 69 | 70 | self.matching_blocks = self.opcodes = None 71 | self.fullbcount = None 72 | junk, popular = _cdifflib.chain_b(self) 73 | assert hasattr(junk, "__contains__") 74 | assert hasattr(popular, "__contains__") 75 | self.isbjunk = junk.__contains__ 76 | self.isbpopular = popular.__contains__ 77 | # We use this to speed up find_longest_match a smidge 78 | 79 | def get_matching_blocks(self): 80 | """Same as SequenceMatcher.get_matching_blocks, but calls through to a 81 | faster loop for find_longest_match. The rest is the same. 82 | """ 83 | if self.matching_blocks is not None: 84 | return self.matching_blocks 85 | 86 | matching_blocks = _cdifflib.matching_blocks(self) 87 | matching_blocks.append((len(self.a), len(self.b), 0)) 88 | self.matching_blocks = list(map(_Match._make, matching_blocks)) 89 | 90 | return self.matching_blocks 91 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools >= 61.0", "pytest", "ruff", "twine"] 3 | build-backend = "setuptools.build_meta" 4 | 5 | [project] 6 | name = "cdifflib" 7 | version = "1.2.9" 8 | authors = [ 9 | { name="Matthew Duggan", email="mgithub@guarana.org" }, 10 | ] 11 | description = "C implementation of parts of difflib" 12 | readme = "README.md" 13 | license = { file="LICENSE" } 14 | requires-python = ">=3.4" 15 | keywords=["difflib", "c", "diff"] 16 | classifiers = [ 17 | "Development Status :: 5 - Production/Stable", 18 | "Environment :: 
Console", 19 | "Intended Audience :: Developers", 20 | "License :: OSI Approved :: BSD License", 21 | "Operating System :: MacOS :: MacOS X", 22 | "Operating System :: Microsoft :: Windows", 23 | "Operating System :: POSIX", 24 | "Programming Language :: Python", 25 | "Topic :: Software Development", 26 | "Topic :: Text Processing :: General", 27 | ] 28 | 29 | [project.urls] 30 | "Homepage" = "https://github.com/mduggan/cdifflib" 31 | "Bug Tracker" = "https://github.com/mduggan/cdifflib/issues" 32 | 33 | # 34 | # TODO: Switch to this method for building instead of using setup.py. 35 | # This requires setuptools >= 74.1 which is a bit too new as of 2025 36 | # 37 | #[tool.setuptools] 38 | #ext-modules = [ 39 | # {name = "cdifflib", sources = ["_cdifflib.c", "_cdifflib3.c"]} 40 | #] 41 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, Extension 2 | 3 | # 4 | # TODO: Delete this file entirely once support for `ext-modules` in 5 | # pyproject.toml is mature. It just replicates the same information 6 | # otherwise. 
7 | # 8 | 9 | ext_modules = [Extension("_cdifflib", sources=["_cdifflib.c", "_cdifflib3.c"]),] 10 | 11 | with open("README.md") as f: 12 | long_description = f.read() 13 | 14 | setup( 15 | name="cdifflib", 16 | version="1.2.9", 17 | description="C implementation of parts of difflib", 18 | long_description=long_description, 19 | long_description_content_type="text/markdown", 20 | ext_modules=ext_modules, 21 | py_modules=["cdifflib"], 22 | author="Matthew Duggan", 23 | author_email="mgithub@guarana.org", 24 | license="BSD", 25 | url="https://github.com/mduggan/cdifflib", 26 | keywords="difflib c diff", 27 | classifiers=[ 28 | "Development Status :: 5 - Production/Stable", 29 | "Environment :: Console", 30 | "Intended Audience :: Developers", 31 | "License :: OSI Approved :: BSD License", 32 | "Operating System :: MacOS :: MacOS X", 33 | "Operating System :: Microsoft :: Windows", 34 | "Operating System :: POSIX", 35 | "Programming Language :: Python", 36 | "Topic :: Software Development", 37 | "Topic :: Text Processing :: General", 38 | ], 39 | ) 40 | -------------------------------------------------------------------------------- /tests/cdifflib_tests.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import random 3 | import pytest 4 | 5 | import os 6 | import sys 7 | 8 | sys.path.append(os.path.join(os.path.dirname(__file__), "..")) 9 | 10 | from cdifflib import CSequenceMatcher 11 | from difflib import SequenceMatcher 12 | 13 | 14 | ## 15 | # A function to count lines of difference using a provided SequenceMatcher 16 | # class. 
17 | # 18 | def linecount(smclass, s1, s2): 19 | linecount = 0 20 | sm = smclass(None, s1, s2, autojunk=False) 21 | i = 0 22 | j = 0 23 | for ai, bj, n in sm.get_matching_blocks(): 24 | linecount += max(ai - i, bj - j) 25 | i, j = ai + n, bj + n 26 | return linecount 27 | 28 | 29 | def profile_sequence_matcher(smclass, a, b, n): 30 | for i in range(0, n): 31 | out = linecount(smclass, a, b) 32 | print("Diff from %s is %d" % (smclass.__name__, out)) 33 | 34 | 35 | def print_stats(prof): 36 | print("Top 10 by cumtime:") 37 | prof.sort_stats("cumulative") 38 | prof.print_stats(10) 39 | print("Top 10 by selftime:") 40 | prof.sort_stats("time") 41 | prof.print_stats(10) 42 | 43 | 44 | def generate_similar_streams(nlines, ndiffs): 45 | lines = [None] * nlines 46 | chars = [chr(x) for x in range(32, 126)] 47 | for l in range(len(lines)): 48 | lines[l] = "".join([random.choice(chars) for x in range(60)]) 49 | orig = list(lines) 50 | for r in range(ndiffs): 51 | row = random.randint(0, len(lines) - 1) 52 | lines[row] = "".join([random.choice(chars) for x in range(60)]) 53 | return orig, lines 54 | 55 | 56 | def assertNearlyEqual(x, y): 57 | """Simple test to make sure floats are close""" 58 | assert abs(x - y) < 1e-5 59 | 60 | 61 | @pytest.fixture() 62 | def test_streams(): 63 | random.seed(1234) 64 | streama, streamb = generate_similar_streams(2000, 200) 65 | yield streama, streamb 66 | 67 | 68 | def testCDifflibVsDifflibRandom(test_streams): 69 | """Test cdifflib gets same answer as difflib on semi-random sequence of lines""" 70 | cdiff = linecount(CSequenceMatcher, test_streams[0], test_streams[1]) 71 | diff = linecount(SequenceMatcher, test_streams[0], test_streams[1]) 72 | assert cdiff != 0 73 | assert cdiff == diff 74 | 75 | 76 | def testCDifflibVsDifflibIdentical(test_streams): 77 | """Test cdifflib gets 0 difference on the same sequence of lines""" 78 | cdiff = linecount(CSequenceMatcher, test_streams[0], test_streams[0]) 79 | assert cdiff == 0 80 | cdiff = 
linecount(CSequenceMatcher, test_streams[1], test_streams[1]) 81 | assert cdiff == 0 82 | 83 | 84 | def testCDifflibWithEmptyInput(test_streams): 85 | """Test cdifflib gets correct difference vs empty stream""" 86 | cdiff = linecount(CSequenceMatcher, [], []) 87 | assert cdiff == 0 88 | cdiff = linecount(CSequenceMatcher, test_streams[0], []) 89 | assert cdiff == len(test_streams[0]) 90 | cdiff = linecount(CSequenceMatcher, [], test_streams[1]) 91 | assert cdiff == len(test_streams[1]) 92 | 93 | 94 | def testCDifflibWithBadTypes(test_streams): 95 | """Check cdifflib raises the same type complaints as difflib""" 96 | with pytest.raises(TypeError): 97 | linecount(CSequenceMatcher, None, test_streams[1]) 98 | with pytest.raises(TypeError): 99 | linecount(SequenceMatcher, None, test_streams[1]) 100 | with pytest.raises(TypeError): 101 | linecount(CSequenceMatcher, test_streams[0], 1) 102 | with pytest.raises(TypeError): 103 | linecount(SequenceMatcher, test_streams[0], 1) 104 | with pytest.raises(TypeError): 105 | linecount(SequenceMatcher, test_streams[0], [{}, {}]) 106 | with pytest.raises(TypeError): 107 | linecount(SequenceMatcher, [set([])], [1]) 108 | 109 | 110 | def testCDifflibWithNonLists(test_streams): 111 | """Check cdifflib handles non-list types the same as difflib""" 112 | cdiff = linecount(CSequenceMatcher, "not a list", "also not a list") 113 | diff = linecount(SequenceMatcher, "not a list", "also not a list") 114 | assert diff == cdiff 115 | assert cdiff == 5 116 | 117 | def gena(): 118 | for x in test_streams[0]: 119 | yield x 120 | 121 | def genb(): 122 | for x in test_streams[1]: 123 | yield x 124 | 125 | cdiff = linecount(CSequenceMatcher, gena(), genb()) 126 | # actually difflib doesn't handle generators, just check cdiff result. 
127 | assert cdiff > 0 128 | 129 | 130 | def testCDifflibWithBug5Data(): 131 | """Check cdifflib returns the same result for bug #5 132 | (autojunk handling issues)""" 133 | import testdata 134 | 135 | # note: convert both to lists for Python 3.3 136 | sm = SequenceMatcher(None, testdata.a5, testdata.b5) 137 | difflib_matches = list(sm.get_matching_blocks()) 138 | 139 | sm = CSequenceMatcher(None, testdata.a5, testdata.b5) 140 | cdifflib_matches = list(sm.get_matching_blocks()) 141 | 142 | assert difflib_matches == cdifflib_matches 143 | 144 | 145 | def testSeq1ResetsCorrectly(): 146 | s = CSequenceMatcher(None, "abcd", "bcde") 147 | assertNearlyEqual(s.ratio(), 0.75) 148 | s.set_seq1("bcde") 149 | assertNearlyEqual(s.ratio(), 1.0) 150 | 151 | 152 | def main(): 153 | from optparse import OptionParser 154 | import time 155 | 156 | parser = OptionParser( 157 | description="Test the C version of difflib. Either " 158 | "specify files, or leave empty for auto-generated " 159 | "random lines", 160 | usage="Usage: %prog [options] [file1 file2]", 161 | ) 162 | parser.add_option( 163 | "-n", 164 | "--niter", 165 | dest="niter", 166 | type="int", 167 | help="num of iterations (default=%default)", 168 | default=1, 169 | ) 170 | parser.add_option( 171 | "-l", 172 | "--lines", 173 | dest="lines", 174 | type="int", 175 | help="num of lines to generate if no files specified (default=%default)", 176 | default=20000, 177 | ) 178 | parser.add_option( 179 | "-d", 180 | "--diffs", 181 | dest="diffs", 182 | type="int", 183 | help="num of random lines to change if no files specified (default=%default)", 184 | default=200, 185 | ) 186 | parser.add_option( 187 | "-p", 188 | "--profile", 189 | dest="profile", 190 | default=False, 191 | action="store_true", 192 | help="run in the python profiler and print results", 193 | ) 194 | parser.add_option( 195 | "-c", 196 | "--compare", 197 | dest="compare", 198 | default=False, 199 | action="store_true", 200 | help="also run the non-c difflib to 
compare outputs", 201 | ) 202 | parser.add_option( 203 | "-y", 204 | "--yep", 205 | dest="yep", 206 | default=False, 207 | action="store_true", 208 | help="use yep to profile the c code", 209 | ) 210 | 211 | (opts, args) = parser.parse_args() 212 | 213 | start = int(time.time()) 214 | 215 | if opts.niter < 1: 216 | parser.error("Need to do at least 1 iteration..") 217 | 218 | if args: 219 | if len(args) != 2: 220 | parser.error("Need exactly 2 files to compare.") 221 | try: 222 | print("Reading input files...") 223 | s1 = open(args[0]).readlines() 224 | s2 = open(args[1]).readlines() 225 | except (IOError, OSError): 226 | parser.error("Couldn't load input files %s and %s" % (args[0], args[1])) 227 | else: 228 | print("Generating random similar streams...") 229 | s1, s2 = generate_similar_streams(opts.lines, opts.diffs) 230 | 231 | # shonky, but saves time.. 232 | sys.path.append("build/lib.linux-x86_64-2.7/") 233 | sys.path.append("build/lib.linux-x86_64-2.7-pydebug/") 234 | sys.path.append("build/lib.macosx-10.6-intel-2.7") 235 | 236 | if opts.yep: 237 | import yep 238 | 239 | yep.start("cdifflib.prof") 240 | 241 | if opts.profile: 242 | import cProfile 243 | import pstats 244 | 245 | fn = "cdifflib_%d.prof" % start 246 | print("Profiling cdifflib.CSequenceMatcher...") 247 | cProfile.runctx( 248 | "p(sm,a,b,n)", 249 | dict(p=profile_sequence_matcher), 250 | dict(a=s1, b=s2, n=opts.niter, sm=CSequenceMatcher), 251 | fn, 252 | ) 253 | print_stats(pstats.Stats(fn)) 254 | 255 | if opts.compare: 256 | fn = "difflib_%d.prof" % start 257 | print("Profiling difflib.SequenceMatcher...") 258 | cProfile.runctx( 259 | "p(sm,a,b,n)", 260 | dict(p=profile_sequence_matcher), 261 | dict(a=s1, b=s2, n=opts.niter, sm=SequenceMatcher), 262 | fn, 263 | ) 264 | print_stats(pstats.Stats(fn)) 265 | 266 | else: 267 | print("Running cdifflib.CSequenceMatcher %d times..." 
% opts.niter) 268 | profile_sequence_matcher(CSequenceMatcher, s1, s2, opts.niter) 269 | if opts.compare: 270 | print("Running difflib.SequenceMatcher %d times..." % opts.niter) 271 | profile_sequence_matcher(SequenceMatcher, s1, s2, opts.niter) 272 | 273 | if opts.yep: 274 | yep.stop() 275 | 276 | 277 | if __name__ == "__main__": 278 | main() 279 | -------------------------------------------------------------------------------- /tests/testdata.py: -------------------------------------------------------------------------------- 1 | """ 2 | Test data from bug https://github.com/mduggan/cdifflib/issues/5 3 | 4 | Revealed some bugs with autojunk handling 5 | """ 6 | a5 = [ 7 | 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 8 | 12, 124, 16, 12, 12, 12, 108, 588, 1316, 12, 8, 42, 6, 168, 36, 12, 10, 10, 9 | 158, 36, 10, 24, 152, 914, 84, 216, 4, 10, 254, 8, 40, 54, 20, 12, 54, 38, 10 | 10, 8, 310, 6, 580, 28, 20, 44, 12, 24, 34, 44, 4, 20, 8, 16, 14, 12, 8, 11 | 12, 20, 14, 28, 12, 24, 6, 12, 372, 544, 1212, 28, 64, 12, 16, 16, 34, 146, 12 | 70, 284, 110, 206, 354, 612, 16, 12, 18, 6, 18, 6, 6, 20, 6, 12, 12, 12, 13 | 20, 12, 12, 12, 20, 12, 12, 358, 258, 12, 54, 20, 8, 8, 6, 16, 12, 6, 112, 14 | 130, 16, 8, 26, 8, 8, 44, 44, 22, 88, 314, 394, 588, 122, 6, 644, 6, 32, 15 | 24, 924, 10, 66, 22, 270, 16, 1340, 2408, 54, 452, 158, 1950, 382, 594, 38, 16 | 110, 106, 40, 56, 5302, 1398, 6, 1016, 814, 46, 112, 14, 6, 14, 12, 6, 6, 17 | 46, 16, 80, 80, 68, 84, 82, 6, 1224, 518 18 | ] 19 | 20 | b5 = [ 21 | 284, 6, 528, 64, 16, 230, 254, 6, 162, 350, 28, 22, 88, 18, 136, 64, 36, 22 | 32, 102, 14, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 23 | 12, 12, 12, 12, 12, 12, 124, 16, 12, 12, 12, 108, 588, 1312, 12, 8, 42, 6, 24 | 168, 36, 12, 10, 10, 158, 36, 10, 26, 154, 906, 82, 214, 4, 10, 242, 8, 38, 25 | 54, 20, 14, 54, 38, 10, 8, 314, 6, 574, 28, 20, 38, 12, 24, 22, 44, 4, 20, 26 | 8, 16, 12, 12, 8, 12, 20, 14, 24, 12, 24, 6, 12, 382, 
544, 1212, 28, 64, 27 | 12, 16, 16, 34, 146, 70, 284, 110, 206, 354, 612, 16, 12, 18, 6, 18, 6, 6, 28 | 20, 6, 12, 12, 12, 20, 12, 12, 12, 20, 12, 12, 358, 258, 12, 54, 20, 8, 8, 29 | 6, 16, 12, 6, 112, 130, 16, 8, 26, 8, 8, 44, 44, 22, 88, 314, 394, 588, 30 | 122, 6, 644, 6, 32, 24, 924, 10, 66, 22, 270, 16, 1340, 2408, 54, 452, 158, 31 | 1950, 382, 594, 38, 110, 106, 40, 56, 5302, 1398, 6, 1016, 814, 46, 112, 32 | 14, 6, 14, 12, 6 33 | ] 34 | --------------------------------------------------------------------------------
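The indexing step performed by the C function `chain_b` above can be easier to follow in plain Python. The following is a minimal sketch, not the project's code: the function name `chain_b_sketch` and its return shape are assumptions made for illustration (the real C function stores `b2j` on the matcher object and returns only the junk and popular sets). It follows the same three stages as the C listing: build the element-to-positions index, drop elements flagged by `isjunk`, then drop "popular" elements when `autojunk` is on and the sequence has at least 200 items.

```python
from difflib import SequenceMatcher


def chain_b_sketch(b, isjunk=None, autojunk=True):
    """Pure-Python outline of the indexing done by the C chain_b function.

    Hypothetical helper for illustration only; it is not part of cdifflib.
    """
    # Map each element of b to the list of positions where it occurs.
    b2j = {}
    for i, elt in enumerate(b):
        b2j.setdefault(elt, []).append(i)

    # Elements the caller's predicate marks as junk are removed from the index.
    junk = set()
    if isjunk is not None:
        for elt in list(b2j):
            if isjunk(elt):
                junk.add(elt)
                del b2j[elt]

    # With autojunk on and at least 200 items, elements occurring more than
    # n // 100 + 1 times are treated as "popular" and removed as well.
    popular = set()
    n = len(b)
    if autojunk and n >= 200:
        ntest = n // 100 + 1
        for elt, idxs in list(b2j.items()):
            if len(idxs) > ntest:
                popular.add(elt)
                del b2j[elt]

    return b2j, junk, popular


if __name__ == "__main__":
    b = ["x"] * 50 + [str(i) for i in range(200)] + [" "]
    b2j, junk, popular = chain_b_sketch(b, isjunk=lambda s: s == " ")
    # Compare against the standard library's own index for the same input.
    ref = SequenceMatcher(lambda s: s == " ", [], b)
    print(b2j == ref.b2j, junk == ref.bjunk, popular == ref.bpopular)
```

Comparing against `difflib.SequenceMatcher` (whose `b2j`, `bjunk` and `bpopular` attributes hold the equivalent structures) is a convenient way to check a sketch like this without building the extension.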