├── .gitignore
├── .travis.yml
├── LICENSE
├── Makefile
├── README.md
├── __init__.py
├── _cdifflib.c
├── _cdifflib3.c
├── cdifflib.py
├── pyproject.toml
├── setup.py
└── tests
    ├── cdifflib_tests.py
    └── testdata.py
/.gitignore:
--------------------------------------------------------------------------------
1 | CDiffLib.egg-info
2 | *.pyc
3 | _cdifflib*.so
4 | build/
5 | dist/
6 | wheelhouse/
7 | venv/
8 | .pypirc
9 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | sudo: false
3 | python:
4 | - "2.7"
5 | - "3.3"
6 | - "3.4"
7 | - "3.5"
8 | - "3.6"
9 | os:
10 | - linux
11 | # - osx # Unfortunately Py2.7 seems broken on travis as of 2017-07
12 | install:
13 | - python setup.py install
14 | script:
15 | - python setup.py test
16 | notifications:
17 | email:
18 | on_success: change
19 | on_failure: always
20 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2013, Matthew Duggan
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without modification,
5 | are permitted provided that the following conditions are met:
6 |
7 | * Redistributions of source code must retain the above copyright notice, this
8 | list of conditions and the following disclaimer.
9 |
10 | * Redistributions in binary form must reproduce the above copyright notice, this
11 | list of conditions and the following disclaimer in the documentation and/or
12 | other materials provided with the distribution.
13 |
14 | * Neither the name of the {organization} nor the names of its
15 | contributors may be used to endorse or promote products derived from
16 | this software without specific prior written permission.
17 |
18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
19 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
20 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
22 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
23 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
24 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
25 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
26 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
27 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
28 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | SHELL := /bin/bash
2 |
3 | default: venv build install test
4 |
5 | build:
6 | source .venv/bin/activate && python -m build -s -w
7 |
8 | install:
9 | source .venv/bin/activate && pip install dist/cdifflib-*.tar.gz
10 |
11 | venv:
12 | python3 -m venv .venv
13 | source .venv/bin/activate && pip install build ruff mypy twine pytest
14 |
15 | test:
16 | source .venv/bin/activate && python -m pytest tests/cdifflib_tests.py
17 |
18 | clean:
19 | rm -rf build/
20 | rm -rf dist/
21 | rm -rf CDiffLib.egg-info
22 | rm -f _cdifflib.so
23 | rm -f *.pyc
24 | rm -f tests/*.pyc
25 | rm -rf __pycache__
26 | rm -rf tests/__pycache__
27 | rm -rf .venv/
28 |
29 | PYVERSIONS = 3.9 3.10 3.11 3.12 3.13
30 |
31 | multidist:
32 | source .venv/bin/activate && python -m build -s
33 | $(foreach pyver,$(PYVERSIONS),rm -rf venv-tmp-$(pyver) && python$(pyver) -m venv venv-tmp-$(pyver) && source venv-tmp-$(pyver)/bin/activate && pip install build && python -m build && rm -rf venv-tmp-$(pyver))
34 | twine check dist/*
35 |
36 | upload:
37 | twine upload dist/*
38 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | cdifflib
2 | ========
3 | [Build status](https://travis-ci.org/mduggan/cdifflib/)
4 |
5 | Python [difflib](http://docs.python.org/2/library/difflib.html) sequence
6 | matcher reimplemented in C.
7 |
8 | Only parts of the matcher are reimplemented in C. The module provides a
9 | `CSequenceMatcher` type, which inherits most methods from `difflib.SequenceMatcher`.
10 |
11 | `cdifflib` is about 4x the speed of the pure python `difflib` when diffing
12 | large streams.
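
The speed difference depends on the input, so it is worth measuring on your own data. The following is a minimal timing sketch, not the project's benchmark: `make_inputs` and `best_time` are hypothetical helpers, the input data is synthetic, and `CSequenceMatcher` is only timed if `cdifflib` can be imported.

```python
import difflib
import random
import timeit

try:
    from cdifflib import CSequenceMatcher  # optional C-accelerated matcher
except ImportError:
    CSequenceMatcher = None  # only the pure-Python class will be timed


def make_inputs(n=1000, seed=0):
    """Build two similar lists of lines; up to 10% of the lines in b are edited."""
    rng = random.Random(seed)
    a = ["line %d\n" % rng.randrange(n) for _ in range(n)]
    b = list(a)
    for _ in range(n // 10):
        b[rng.randrange(n)] = "changed %d\n" % rng.randrange(n)
    return a, b


def best_time(matcher_cls, a, b, repeat=3):
    """Return the fastest of `repeat` full opcode computations, in seconds."""
    def run():
        matcher_cls(None, a, b).get_opcodes()
    return min(timeit.repeat(run, number=1, repeat=repeat))


a, b = make_inputs()
py_seconds = best_time(difflib.SequenceMatcher, a, b)
print("difflib.SequenceMatcher: %.4f s" % py_seconds)
if CSequenceMatcher is not None:
    c_seconds = best_time(CSequenceMatcher, a, b)
    print("CSequenceMatcher:        %.4f s" % c_seconds)
```

Taking the minimum of several runs reduces noise from other processes; larger or more repetitive inputs may show a different ratio.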
13 |
14 | Limitations
15 | -----------
16 | The C part of the code can only work on `list` rather than generic iterables,
17 | so anything that isn't a `list` will be converted to `list` in the
18 | `CSequenceMatcher` constructor. This may cause undesirable behavior if you're
19 | not expecting it.
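
To illustrate what that conversion can mean in practice, here is a small sketch using only the standard library. The helper `as_list` is hypothetical and merely imitates the behaviour described above; it is not cdifflib's actual code.

```python
import difflib


def as_list(seq):
    """Hypothetical helper: keep lists as they are, copy anything else into a list."""
    return seq if isinstance(seq, list) else list(seq)


# A one-shot generator is consumed by the conversion.
lines = (line for line in ["a\n", "b\n", "c\n"])
converted = as_list(lines)
leftover = list(lines)  # the generator is exhausted, so nothing is left

# A string becomes a list of single characters.
chars = as_list("abc")

# The converted list can be diffed as usual.
matcher = difflib.SequenceMatcher(None, converted, ["a\n", "c\n"])
ratio = matcher.ratio()
```

If the original iterable is needed again after diffing, converting it to a list yourself and passing that list avoids the surprise.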
20 |
21 | Works with Python 2.7 and 3.6 (Should work on all 3.3+)
22 |
23 | Usage
24 | -----
25 | Can be used just like the `difflib.SequenceMatcher` as long as you pass lists. These examples are right out of the [difflib docs](http://docs.python.org/2/library/difflib.html):
26 | ```Python
27 | >>> from cdifflib import CSequenceMatcher
28 | >>> s = CSequenceMatcher(None, ' abcd', 'abcd abcd')
29 | >>> s.find_longest_match(0, 5, 0, 9)
30 | Match(a=1, b=0, size=4)
31 | >>> s = CSequenceMatcher(lambda x: x == " ",
32 | ... "private Thread currentThread;",
33 | ... "private volatile Thread currentThread;")
34 | >>> print(round(s.ratio(), 3))
35 | 0.866
36 | ```
37 |
38 | It's completely compatible, so you can replace the difflib version on startup
39 | and then other libraries will use CSequenceMatcher too, e.g.:
40 | ```Python
41 | from cdifflib import CSequenceMatcher
42 | import difflib
43 | difflib.SequenceMatcher = CSequenceMatcher
44 | import library_that_uses_difflib
45 |
46 | # Now the library will transparently be using the C SequenceMatcher - other
47 | # things remain the same
48 | library_that_uses_difflib.do_some_diffing()
49 | ```
50 |
51 |
52 | Making
53 | ------
54 | Set up dev environment:
55 | ```
56 | make venv
57 | source .venv/bin/activate
58 | ```
59 |
60 | To build/install into the venv:
61 | ```
62 | make build
63 | make install
64 | ```
65 |
66 | To test:
67 | ```
68 | make test
69 | ```
70 |
71 | License etc
72 | -----------
73 | This code lives at https://github.com/mduggan/cdifflib. See LICENSE for the license.
74 |
75 |
76 | Changelog
77 | ---------
78 | * 1.2.9 - Repackage again, no code change (#13)
79 | * 1.2.8 - Bump to fix version number in py file, no code change
80 | * 1.2.7 - Update for newer pythons (#12)
81 | * 1.2.6 - Clear state correctly when replacing seq1 (#10)
82 | * 1.2.5 - Fix some memory leaks (#7)
83 | * 1.2.4 - Repackage yet again using twine for pypi upload (no binary changes)
84 | * 1.2.3 - Repackage again with changelog update and corrected src package (no binary changes)
85 | * 1.2.2 - Repackage to add README.md in a way pypi supports (no binary changes)
86 | * 1.2.1 - Fix bug for longer sequences with "autojunk"
87 | * 1.2.0 - Python 3 support for other versions
88 | * 1.1.0 - Added Python 3.6 support (thanks Bclavie)
89 | * 1.0.4 - Changes to make it compile on MSVC++ compiler, no change for other platforms
90 | * 1.0.2 - Bugfix - also replace set_seq1 implementation so `difflib.compare` works with a `CSequenceMatcher`
91 | * 1.0.1 - Implement more bits in c to squeeze a bit more speed out
92 | * 1.0.0 - First release
93 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mduggan/cdifflib/6e70fcff18194e0159a23845beb13ac7875fa067/__init__.py
--------------------------------------------------------------------------------
/_cdifflib.c:
--------------------------------------------------------------------------------
1 | #include <Python.h>
2 |
3 | #if PY_MAJOR_VERSION == 2
4 |
5 | //
6 | // A simple wrapper to see if two Python list entries are "Python equal".
7 | //
8 | static __inline int
9 | list_items_eq(PyObject *a, int ai, PyObject *b, int bi)
10 | {
11 | PyObject *o1 = PyList_GET_ITEM(a, ai);
12 | PyObject *o2 = PyList_GET_ITEM(b, bi);
13 | int result = PyObject_RichCompareBool(o1, o2, Py_EQ);
14 | return result;
15 | }
16 |
17 | //
18 | // A simple wrapper to call a callable python object with an argument and
19 | // return the result as a boolean.
20 | //
21 | static __inline int
22 | call_obj(PyObject *callable, PyObject *arg)
23 | {
24 | PyObject *result;
25 | int retval;
26 | if (!callable)
27 | return 0;
28 | assert(PyCallable_Check(callable));
29 | result = PyObject_CallFunctionObjArgs(callable, arg, NULL);
30 | retval = result ? PyObject_IsTrue(result) : 0; /* NULL means the call raised */
31 | Py_XDECREF(result);
32 | return retval;
33 | }
34 |
35 | static void
36 | _find_longest_match_worker(
37 | PyObject *self,
38 | PyObject *a,
39 | PyObject *b,
40 | PyObject *isbjunk,
41 | const int alo,
42 | const int ahi,
43 | const int blo,
44 | const int bhi,
45 | long *_besti,
46 | long *_bestj,
47 | long *_bestsize
48 | )
49 | {
50 | int besti = alo;
51 | int bestj = blo;
52 | int bestsize = 0;
53 | int i;
54 |
55 | assert(self && a && b);
56 |
57 | // Degenerate case: evaluate an empty range: there is no match.
58 | if (alo == ahi || blo == bhi) {
59 | *_besti = alo;
60 | *_bestj = blo;
61 | *_bestsize = 0;
62 | return;
63 | }
64 |
65 | //printf("longest match helper\n");
66 | {
67 | PyObject *b2j = PyObject_GetAttrString(self, "b2j");
68 | PyObject *j2len = PyDict_New();
69 | PyObject *newj2len = PyDict_New();
70 |
71 | assert(PyDict_CheckExact(b2j));
72 |
73 | //
74 | // This loop creates a lot of python objects only to read back
75 | // their values inside the same loop. It should be faster using a
76 | // simpler data structure to do the same thing, but I rewrote it
77 | // with a c++ unordered_map and it was half the speed :(
78 | //
79 | for (i = alo; i < ahi; i++)
80 | {
81 | PyObject *tmp;
82 | PyObject *oj = PyDict_GetItem(b2j, PyList_GET_ITEM(a, i));
83 |
84 | /* oj is a list of indexes in b at which the line a[i] appears, or
85 | NULL if it does not appear */
86 | if (oj != NULL)
87 | {
88 | int ojlen, oji;
89 | assert(PyList_Check(oj));
90 | ojlen = (int)PyList_GET_SIZE(oj);
91 | for (oji = 0; oji < ojlen; oji++)
92 | {
93 | PyObject *j2len_j, *jint, *kint, *jminus1;
94 | int j;
95 | int k = 1;
96 |
97 | jint = PyList_GET_ITEM(oj, oji);
98 | assert(PyInt_CheckExact(jint));
99 | j = (int)PyInt_AS_LONG(jint);
100 |
101 | if (j < blo)
102 | continue;
103 | if (j >= bhi)
104 | break;
105 |
106 | jminus1 = PyInt_FromLong(j-1);
107 | j2len_j = PyDict_GetItem(j2len, jminus1);
108 | Py_DECREF(jminus1);
109 | if (j2len_j)
110 | k += (int)PyInt_AS_LONG(j2len_j);
111 |
112 | // this looks like an allocation, but k is usually low
113 | kint = PyInt_FromLong(k);
114 | PyDict_SetItem(newj2len, jint, kint);
115 | Py_DECREF(kint);
116 |
117 | if (k > bestsize) {
118 | besti = i-k;
119 | bestj = j-k;
120 | bestsize = k;
121 | }
122 | }
123 | }
124 |
125 | // Cycle j2len and newj2len
126 | tmp = j2len;
127 | j2len = newj2len;
128 | newj2len = tmp;
129 | PyDict_Clear(newj2len);
130 | }
131 |
132 | /* besti and bestj are offset by 1 if set in the loop above */
133 | if (bestsize)
134 | {
135 | besti++;
136 | bestj++;
137 | }
138 |
139 | /* Done with these now. */
140 | Py_DECREF(j2len);
141 | Py_DECREF(newj2len);
142 | Py_DECREF(b2j);
143 | }
144 |
145 | //printf("twiddle values %d %d %d %d %d %d\n", besti, alo, ahi, bestj, blo, bhi);
146 | while (besti > alo && bestj > blo &&
147 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) &&
148 | list_items_eq(a, besti-1, b, bestj-1))
149 | {
150 | besti--;
151 | bestj--;
152 | bestsize++;
153 | }
154 |
155 | //printf("twiddle values 2\n");
156 | while (besti+bestsize < ahi && bestj+bestsize < bhi &&
157 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) &&
158 | list_items_eq(a, besti+bestsize, b, bestj+bestsize))
159 | {
160 | bestsize++;
161 | }
162 |
163 |
164 | //printf("twiddle values 3\n");
165 | while (besti > alo && bestj > blo &&
166 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) &&
167 | list_items_eq(a, besti-1, b, bestj-1))
168 | {
169 | besti--;
170 | bestj--;
171 | bestsize++;
172 | }
173 |
174 | //printf("twiddle values 4\n");
175 | while (besti+bestsize < ahi && bestj+bestsize < bhi &&
176 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) &&
177 | list_items_eq(a, besti+bestsize, b, bestj+bestsize))
178 | {
179 | bestsize++;
180 | }
181 |
182 | //printf("helper done\n");
183 | *_besti = besti;
184 | *_bestj = bestj;
185 | *_bestsize = bestsize;
186 | }
187 |
188 |
189 | //
190 | // A very simple C reimplementation of Python 2.7's
191 | // difflib.SequenceMatcher.find_longest_match()
192 | //
193 | // The algorithm is identical (right down to using Python dicts and lists for
194 | // local variables), but the c version runs in 1/4 the time.
195 | //
196 | static PyObject *
197 | find_longest_match(PyObject *module, PyObject *args)
198 | {
199 | long alo, ahi, blo, bhi;
200 | long besti, bestj, bestsize;
201 | PyObject *self, *a, *b, *isbjunk;
202 |
203 | if (!PyArg_ParseTuple(args, "Ollll", &self, &alo, &ahi, &blo, &bhi)) {
204 | PyErr_SetString(PyExc_ValueError, "find_longest_match parameters not as expected");
205 | return NULL;
206 | }
207 |
208 | assert(self);
209 |
210 | //printf("check junk\n");
211 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */
212 | {
213 | PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk");
214 | if (nojunk && PyObject_IsTrue(nojunk))
215 | {
216 | isbjunk = NULL;
217 | }
218 | else
219 | {
220 | PyErr_Clear();
221 | isbjunk = PyObject_GetAttrString(self, "isbjunk");
222 | assert(isbjunk);
223 | if (!PyCallable_Check(isbjunk)) {
224 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable");
225 | return NULL;
226 | }
227 | }
228 | if (nojunk)
229 | Py_DECREF(nojunk);
230 | }
231 |
232 | //printf("get members\n");
233 | // FIXME: Really should support non-list sequences for a and b
234 | a = PyObject_GetAttrString(self, "a");
235 | b = PyObject_GetAttrString(self, "b");
236 | if (!PyList_Check(a) || !PyList_Check(b))
237 | return PyErr_Format(PyExc_ValueError, "cdifflib only supports lists for both sequences");
238 |
239 | // This function actually does the work, the rest is just window dressing.
240 | _find_longest_match_worker(self, a, b, isbjunk, alo, ahi, blo, bhi, &besti, &bestj, &bestsize);
241 |
242 | //printf("done\n");
243 |
244 | Py_DECREF(a);
245 | Py_DECREF(b);
246 | if (isbjunk)
247 | Py_DECREF(isbjunk);
248 |
249 | return Py_BuildValue("lll", besti, bestj, bestsize);
250 | }
251 |
252 | /*
253 | def __helper(self, alo, ahi, blo, bhi, answer):
254 | i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
255 | # a[alo:i] vs b[blo:j] unknown
256 | # a[i:i+k] same as b[j:j+k]
257 | # a[i+k:ahi] vs b[j+k:bhi] unknown
258 | if k:
259 | if alo < i and blo < j:
260 | self.__helper(alo, i, blo, j, answer)
261 | answer.append(x)
262 | if i+k < ahi and j+k < bhi:
263 | self.__helper(i+k, ahi, j+k, bhi, answer)
264 | */
265 |
266 | static void
267 | matching_block_helper(PyObject *self, PyObject *a, PyObject *b, PyObject *isjunk, PyObject *answer, const long alo, const long ahi, const long blo, const long bhi)
268 | {
269 | long i, j, k;
270 | //printf("matching_block_helper 1\n");
271 | _find_longest_match_worker(self, a, b, isjunk, alo, ahi, blo, bhi, &i, &j, &k);
272 | //printf("matching_block_helper 2\n");
273 |
274 | if (k) {
275 | PyObject *p = Py_BuildValue("(lll)", i, j, k);
276 | if (alo < i && blo < j)
277 | matching_block_helper(self, a, b, isjunk, answer, alo, i, blo, j);
278 | PyList_Append(answer, p);
279 | Py_DECREF(p);
280 | if (i+k < ahi && j+k < bhi)
281 | matching_block_helper(self, a, b, isjunk, answer, i+k, ahi, j+k, bhi);
282 | }
283 | //printf("matching_block_helper 3\n");
284 | }
285 |
286 | static PyObject *
287 | matching_blocks(PyObject *module, PyObject *args)
288 | {
289 | PyObject *self, *a, *b, *isbjunk, *matching;
290 | long la, lb;
291 |
292 | if (!PyArg_ParseTuple(args, "O", &self)) {
293 | PyErr_SetString(PyExc_ValueError, "expected one argument, self");
294 | return NULL;
295 | }
296 |
297 | //printf("matching_blocks 1\n");
298 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */
299 | {
300 | PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk");
301 | if (nojunk && PyObject_IsTrue(nojunk))
302 | {
303 | isbjunk = NULL;
304 | }
305 | else
306 | {
307 | PyErr_Clear();
308 | isbjunk = PyObject_GetAttrString(self, "isbjunk");
309 | assert(isbjunk);
310 | if (!PyCallable_Check(isbjunk)) {
311 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable");
312 | return NULL;
313 | }
314 | }
315 | if (nojunk)
316 | Py_DECREF(nojunk);
317 | }
318 |
319 | // FIXME: Really should support non-list sequences for a and b
320 | //printf("matching_blocks 2\n");
321 | a = PyObject_GetAttrString(self, "a");
322 | b = PyObject_GetAttrString(self, "b");
323 | if (!PyList_Check(a) || !PyList_Check(b)) {
324 | PyErr_SetString(PyExc_ValueError, "cdifflib only supports lists for both sequences");
325 | return NULL;
326 | }
327 |
328 | //printf("matching_blocks 3\n");
329 | la = PyList_GET_SIZE(a);
330 | lb = PyList_GET_SIZE(b);
331 |
332 | matching = PyList_New(0);
333 |
334 | matching_block_helper(self, a, b, isbjunk, matching, 0, la, 0, lb);
335 |
336 | //printf("matching_blocks 4\n");
337 | if (isbjunk)
338 | Py_DECREF(isbjunk);
339 | Py_DECREF(a);
340 | Py_DECREF(b);
341 | // don't decrement matching, put it straight in to the return val
342 | return Py_BuildValue("N", matching);
343 | }
344 |
345 |
346 | static PyObject *
347 | chain_b(PyObject *module, PyObject *args)
348 | {
349 | long n;
350 | Py_ssize_t i;
351 |
352 | // These are temporary and are decremented after use
353 | PyObject *b, *isjunk, *fast_b, *self;
354 |
355 | // These are needed through the function and are decremented at the end
356 | PyObject *junk = NULL, *popular = NULL, *b2j = NULL, *retval = NULL, *autojunk = NULL;
357 |
358 | //printf("chain_b\n");
359 |
360 | if (!PyArg_ParseTuple(args, "O", &self))
361 | goto error;
362 |
363 | b = PyObject_GetAttrString(self, "b");
364 | if (b == NULL || b == Py_None)
365 | goto error;
366 | b2j = PyDict_New();
367 | PyObject_SetAttrString(self, "b2j", b2j);
368 |
369 | /* construct b2j here */
370 | //printf("construct b2j\n");
371 | assert(PySequence_Check(b));
372 | fast_b = PySequence_Fast(b, "accessing sequence 2");
373 | Py_DECREF(b);
374 | n = PySequence_Fast_GET_SIZE(fast_b);
375 | for (i = 0; i < n; i++)
376 | {
377 | PyObject *iint;
378 | PyObject *indices;
379 | PyObject *elt = PySequence_Fast_GET_ITEM(fast_b, i);
380 | assert(elt && elt != Py_None);
381 | indices = PyDict_GetItem(b2j, elt);
382 | assert(indices == NULL || indices != Py_None);
383 | if (indices == NULL)
384 | {
385 | if (PyErr_Occurred())
386 | {
387 | if (!PyErr_ExceptionMatches(PyExc_KeyError))
388 | {
389 | Py_DECREF(fast_b);
390 | goto error;
391 | }
392 | PyErr_Clear();
393 | }
394 |
395 | indices = PyList_New(0);
396 | PyDict_SetItem(b2j, elt, indices);
397 | Py_DECREF(indices);
398 | }
399 | iint = PyInt_FromLong(i);
400 | PyList_Append(indices, iint);
401 | Py_DECREF(iint);
402 | }
403 | Py_DECREF(fast_b);
404 |
405 | assert(!PyErr_Occurred());
406 |
407 | //printf("determine junk\n");
408 | junk = PySet_New(NULL);
409 | isjunk = PyObject_GetAttrString(self, "isjunk");
410 | if (isjunk != NULL && isjunk != Py_None)
411 | {
412 | PyObject *keys = PyDict_Keys(b2j);
413 | PyObject *fastkeys;
414 | assert(PySequence_Check(keys));
415 | fastkeys = PySequence_Fast(keys, "dict keys");
416 | Py_DECREF(keys);
417 | /* call isjunk here */
418 | for (i = 0; i < PySequence_Fast_GET_SIZE(fastkeys); i++)
419 | {
420 | PyObject *elt = PySequence_Fast_GET_ITEM(fastkeys, i);
421 | if (call_obj(isjunk, elt))
422 | {
423 | PySet_Add(junk, elt);
424 | PyDict_DelItem(b2j, elt);
425 | }
426 | }
427 | Py_DECREF(fastkeys);
428 | Py_DECREF(isjunk);
429 | }
430 |
431 | /* build autojunk here */
432 | //printf("build autojunk\n");
433 | popular = PySet_New(NULL);
434 | autojunk = PyObject_GetAttrString(self, "autojunk");
435 | assert(autojunk != NULL);
436 | if (PyObject_IsTrue(autojunk) && n >= 200) {
437 | long ntest = n/100 + 1;
438 | PyObject *items = PyDict_Items(b2j);
439 | long b2jlen = PyList_GET_SIZE(items);
440 | for (i = 0; i < b2jlen; i++)
441 | {
442 | PyObject *tuple = PyList_GET_ITEM(items, i);
443 | PyObject *elt = PyTuple_GET_ITEM(tuple, 0);
444 | PyObject *idxs = PyTuple_GET_ITEM(tuple, 1);
445 |
446 | assert(PyList_Check(idxs));
447 |
448 | if (PyList_GET_SIZE(idxs) > ntest)
449 | {
450 | PySet_Add(popular, elt);
451 | PyDict_DelItem(b2j, elt);
452 | }
453 | }
454 | Py_DECREF(items);
455 | }
456 |
457 | retval = Py_BuildValue("OO", junk, popular);
458 | assert(!PyErr_Occurred());
459 |
460 | error:
461 | if (b2j)
462 | Py_DECREF(b2j);
463 | if (junk)
464 | Py_DECREF(junk);
465 | if (popular)
466 | Py_DECREF(popular);
467 | if (autojunk)
468 | Py_DECREF(autojunk);
469 | return retval;
470 | }
471 |
472 | //
473 | // Define functions in this module
474 | //
475 | static PyMethodDef CDiffLibMethods[4] = {
476 | {"find_longest_match", find_longest_match, METH_VARARGS,
477 | "c implementation of difflib.SequenceMatcher.find_longest_match"},
478 | {"chain_b", chain_b, METH_VARARGS,
479 | "c implementation of most of difflib.SequenceMatcher.__chain_b"},
480 | {"matching_blocks", matching_blocks, METH_VARARGS,
481 | "c implementation of part of difflib.SequenceMatcher.get_matching_blocks"},
482 | {NULL, NULL, 0, NULL} /* Sentinel */
483 | };
484 |
485 | //
486 | // Module init entrypoint.
487 | //
488 | PyMODINIT_FUNC
489 | init_cdifflib(void)
490 | {
491 | PyObject *m;
492 |
493 | m = Py_InitModule("_cdifflib", CDiffLibMethods);
494 | if (m == NULL)
495 | return;
496 | // No special initialisation to do at the moment..
497 | }
498 |
499 | #endif // PY_MAJOR_VERSION == 2
500 |
--------------------------------------------------------------------------------
/_cdifflib3.c:
--------------------------------------------------------------------------------
1 | #include <Python.h>
2 |
3 | #if PY_MAJOR_VERSION == 3
4 |
5 | //
6 | // A simple wrapper to see if two Python list entries are "Python equal".
7 | //
8 | static __inline int
9 | list_items_eq(PyObject *a, int ai, PyObject *b, int bi)
10 | {
11 | PyObject *o1 = PyList_GET_ITEM(a, ai);
12 | PyObject *o2 = PyList_GET_ITEM(b, bi);
13 | int result = PyObject_RichCompareBool(o1, o2, Py_EQ);
14 | return result;
15 | }
16 |
17 | //
18 | // A simple wrapper to call a callable python object with an argument and
19 | // return the result as a boolean.
20 | //
21 | static __inline int
22 | call_obj(PyObject *callable, PyObject *arg)
23 | {
24 | PyObject *result;
25 | int retval;
26 | if (!callable)
27 | return 0;
28 | assert(PyCallable_Check(callable));
29 | result = PyObject_CallFunctionObjArgs(callable, arg, NULL);
30 | retval = result ? PyObject_IsTrue(result) : 0; /* NULL means the call raised */
31 | Py_XDECREF(result);
32 | return retval;
33 | }
34 |
35 | static void
36 | _find_longest_match_worker(
37 | PyObject *self,
38 | PyObject *a,
39 | PyObject *b,
40 | PyObject *isbjunk,
41 | const int alo,
42 | const int ahi,
43 | const int blo,
44 | const int bhi,
45 | long *_besti,
46 | long *_bestj,
47 | long *_bestsize
48 | )
49 | {
50 | int besti = alo;
51 | int bestj = blo;
52 | int bestsize = 0;
53 | int i;
54 |
55 | assert(self && a && b);
56 |
57 | // Degenerate case: evaluate an empty range: there is no match.
58 | if (alo == ahi || blo == bhi) {
59 | *_besti = alo;
60 | *_bestj = blo;
61 | *_bestsize = 0;
62 | return;
63 | }
64 |
65 | //printf("longest match helper\n");
66 | {
67 | PyObject *b2j = PyObject_GetAttrString(self, "b2j");
68 | PyObject *j2len = PyDict_New();
69 | PyObject *newj2len = PyDict_New();
70 |
71 | assert(PyDict_CheckExact(b2j));
72 |
73 | //
74 | // This loop creates a lot of python objects only to read back
75 | // their values inside the same loop. It should be faster using a
76 | // simpler data structure to do the same thing, but I rewrote it
77 | // with a c++ unordered_map and it was half the speed :(
78 | //
79 | for (i = alo; i < ahi; i++)
80 | {
81 | PyObject *tmp;
82 | PyObject *oj = PyDict_GetItem(b2j, PyList_GET_ITEM(a, i));
83 |
84 | /* oj is a list of indexes in b at which the line a[i] appears, or
85 | NULL if it does not appear */
86 | if (oj != NULL)
87 | {
88 | int ojlen, oji;
89 | assert(PyList_Check(oj));
90 | ojlen = (int)PyList_GET_SIZE(oj);
91 | for (oji = 0; oji < ojlen; oji++)
92 | {
93 | PyObject *j2len_j, *jint, *kint, *jminus1;
94 | int j;
95 | int k = 1;
96 |
97 | jint = PyList_GET_ITEM(oj, oji);
98 | assert(PyLong_CheckExact(jint));
99 | j = (int)PyLong_AsLong(jint);
100 |
101 | if (j < blo)
102 | continue;
103 | if (j >= bhi)
104 | break;
105 |
106 | jminus1 = PyLong_FromLong(j-1);
107 | j2len_j = PyDict_GetItem(j2len, jminus1);
108 | Py_DECREF(jminus1);
109 | if (j2len_j)
110 | k += (int)PyLong_AsLong(j2len_j);
111 |
112 | // this looks like an allocation, but k is usually low
113 | kint = PyLong_FromLong(k);
114 | PyDict_SetItem(newj2len, jint, kint);
115 | Py_DECREF(kint);
116 |
117 | if (k > bestsize) {
118 | besti = i-k;
119 | bestj = j-k;
120 | bestsize = k;
121 | }
122 | }
123 | }
124 |
125 | // Cycle j2len and newj2len
126 | tmp = j2len;
127 | j2len = newj2len;
128 | newj2len = tmp;
129 | PyDict_Clear(newj2len);
130 | }
131 |
132 | /* besti and bestj are offset by 1 if set in the loop above */
133 | if (bestsize)
134 | {
135 | besti++;
136 | bestj++;
137 | }
138 |
139 | /* Done with these now. */
140 | Py_DECREF(j2len);
141 | Py_DECREF(newj2len);
142 | Py_DECREF(b2j);
143 | }
144 |
145 | //printf("twiddle values %d %d %d %d %d %d\n", besti, alo, ahi, bestj, blo, bhi);
146 | while (besti > alo && bestj > blo &&
147 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) &&
148 | list_items_eq(a, besti-1, b, bestj-1))
149 | {
150 | besti--;
151 | bestj--;
152 | bestsize++;
153 | }
154 |
155 | //printf("twiddle values 2\n");
156 | while (besti+bestsize < ahi && bestj+bestsize < bhi &&
157 | !call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) &&
158 | list_items_eq(a, besti+bestsize, b, bestj+bestsize))
159 | {
160 | bestsize++;
161 | }
162 |
163 |
164 | //printf("twiddle values 3\n");
165 | while (besti > alo && bestj > blo &&
166 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj-1)) &&
167 | list_items_eq(a, besti-1, b, bestj-1))
168 | {
169 | besti--;
170 | bestj--;
171 | bestsize++;
172 | }
173 |
174 | //printf("twiddle values 4\n");
175 | while (besti+bestsize < ahi && bestj+bestsize < bhi &&
176 | call_obj(isbjunk, PyList_GET_ITEM(b, bestj+bestsize)) &&
177 | list_items_eq(a, besti+bestsize, b, bestj+bestsize))
178 | {
179 | bestsize++;
180 | }
181 |
182 | //printf("helper done\n");
183 | *_besti = besti;
184 | *_bestj = bestj;
185 | *_bestsize = bestsize;
186 | }
187 |
188 |
189 | //
190 | // A very simple C reimplementation of Python 2.7's
191 | // difflib.SequenceMatcher.find_longest_match()
192 | //
193 | // The algorithm is identical (right down to using Python dicts and lists for
194 | // local variables), but the c version runs in 1/4 the time.
195 | //
196 | static PyObject *
197 | find_longest_match(PyObject *module, PyObject *args)
198 | {
199 | long alo, ahi, blo, bhi;
200 | long besti, bestj, bestsize;
201 | PyObject *self, *a, *b, *isbjunk;
202 |
203 | if (!PyArg_ParseTuple(args, "Ollll", &self, &alo, &ahi, &blo, &bhi)) {
204 | PyErr_SetString(PyExc_ValueError, "find_longest_match parameters not as expected");
205 | return NULL;
206 | }
207 |
208 | assert(self);
209 |
210 | //printf("check junk\n");
211 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */
212 | {
213 | PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk");
214 | if (nojunk && PyObject_IsTrue(nojunk))
215 | {
216 | isbjunk = NULL;
217 | }
218 | else
219 | {
220 | PyErr_Clear();
221 | isbjunk = PyObject_GetAttrString(self, "isbjunk");
222 | assert(isbjunk);
223 | if (!PyCallable_Check(isbjunk)) {
224 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable");
225 | return NULL;
226 | }
227 | }
228 | if (nojunk)
229 | Py_DECREF(nojunk);
230 | }
231 |
232 | //printf("get members\n");
233 | // FIXME: Really should support non-list sequences for a and b
234 | a = PyObject_GetAttrString(self, "a");
235 | b = PyObject_GetAttrString(self, "b");
236 | if (!PyList_Check(a) || !PyList_Check(b))
237 | return PyErr_Format(PyExc_ValueError, "cdifflib only supports lists for both sequences");
238 |
239 | // This function actually does the work, the rest is just window dressing.
240 | _find_longest_match_worker(self, a, b, isbjunk, alo, ahi, blo, bhi, &besti, &bestj, &bestsize);
241 |
242 | //printf("done\n");
243 |
244 | Py_DECREF(a);
245 | Py_DECREF(b);
246 | if (isbjunk)
247 | Py_DECREF(isbjunk);
248 |
249 | return Py_BuildValue("lll", besti, bestj, bestsize);
250 | }
251 |
252 | /*
253 | def __helper(self, alo, ahi, blo, bhi, answer):
254 | i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
255 | # a[alo:i] vs b[blo:j] unknown
256 | # a[i:i+k] same as b[j:j+k]
257 | # a[i+k:ahi] vs b[j+k:bhi] unknown
258 | if k:
259 | if alo < i and blo < j:
260 | self.__helper(alo, i, blo, j, answer)
261 | answer.append(x)
262 | if i+k < ahi and j+k < bhi:
263 | self.__helper(i+k, ahi, j+k, bhi, answer)
264 | */
265 |
266 | static void
267 | matching_block_helper(PyObject *self, PyObject *a, PyObject *b, PyObject *isjunk, PyObject *answer, const long alo, const long ahi, const long blo, const long bhi)
268 | {
269 | long i, j, k;
270 | //printf("matching_block_helper 1\n");
271 | _find_longest_match_worker(self, a, b, isjunk, alo, ahi, blo, bhi, &i, &j, &k);
272 | //printf("matching_block_helper 2\n");
273 |
274 | if (k) {
275 | PyObject *p = Py_BuildValue("(lll)", i, j, k);
276 | if (alo < i && blo < j)
277 | matching_block_helper(self, a, b, isjunk, answer, alo, i, blo, j);
278 | PyList_Append(answer, p);
279 | Py_DECREF(p);
280 | if (i+k < ahi && j+k < bhi)
281 | matching_block_helper(self, a, b, isjunk, answer, i+k, ahi, j+k, bhi);
282 | }
283 | //printf("matching_block_helper 3\n");
284 | }
285 |
286 | static PyObject *
287 | matching_blocks(PyObject *module, PyObject *args)
288 | {
289 | PyObject *self, *a, *b, *isbjunk, *matching;
290 | long la, lb;
291 |
292 | if (!PyArg_ParseTuple(args, "O", &self)) {
293 | PyErr_SetString(PyExc_ValueError, "expected one argument, self");
294 | return NULL;
295 | }
296 |
297 | //printf("matching_blocks 1\n");
298 | /* Slight speedup - if we have no junk, don't bother calling isbjunk lots */
299 | {
300 | PyObject *nojunk = PyObject_GetAttrString(self, "_nojunk");
301 | if (nojunk && PyObject_IsTrue(nojunk))
302 | {
303 | isbjunk = NULL;
304 | }
305 | else
306 | {
307 | PyErr_Clear();
308 | isbjunk = PyObject_GetAttrString(self, "isbjunk");
309 | assert(isbjunk);
310 | if (!PyCallable_Check(isbjunk)) {
311 | PyErr_SetString(PyExc_RuntimeError, "isbjunk not callable");
312 | return NULL;
313 | }
314 | }
315 | if (nojunk)
316 | Py_DECREF(nojunk);
317 | }
318 |
319 | // FIXME: Really should support non-list sequences for a and b
320 | //printf("matching_blocks 2\n");
321 | a = PyObject_GetAttrString(self, "a");
322 | b = PyObject_GetAttrString(self, "b");
323 | if (!PyList_Check(a) || !PyList_Check(b)) {
324 | PyErr_SetString(PyExc_ValueError, "cdifflib only supports lists for both sequences");
325 | return NULL;
326 | }
327 |
328 | //printf("matching_blocks 3\n");
329 | la = PyList_GET_SIZE(a);
330 | lb = PyList_GET_SIZE(b);
331 |
332 | matching = PyList_New(0);
333 |
334 | matching_block_helper(self, a, b, isbjunk, matching, 0, la, 0, lb);
335 |
336 | //printf("matching_blocks 4\n");
337 | if (isbjunk)
338 | Py_DECREF(isbjunk);
339 | Py_DECREF(a);
340 | Py_DECREF(b);
341 | // don't decrement matching, put it straight in to the return val
342 | return Py_BuildValue("N", matching);
343 | }
344 |
345 |
346 | static PyObject *
347 | chain_b(PyObject *module, PyObject *args)
348 | {
349 | long n;
350 | Py_ssize_t i;
351 |
352 | // These are temporary and are decremented after use
353 | PyObject *b, *isjunk, *fast_b, *self;
354 |
355 | // These are needed through the function and are decremented at the end
356 | PyObject *junk = NULL, *popular = NULL, *b2j = NULL, *retval = NULL, *autojunk = NULL;
357 |
358 | //printf("chain_b\n");
359 |
360 | if (!PyArg_ParseTuple(args, "O", &self))
361 | goto error;
362 |
363 | b = PyObject_GetAttrString(self, "b");
364 | if (b == NULL || b == Py_None)
365 | goto error;
366 | b2j = PyDict_New();
367 | PyObject_SetAttrString(self, "b2j", b2j);
368 |
369 | /* construct b2j here */
370 | //printf("construct b2j\n");
371 | assert(PySequence_Check(b));
372 | fast_b = PySequence_Fast(b, "accessing sequence 2");
373 | Py_DECREF(b);
374 | n = PySequence_Fast_GET_SIZE(fast_b);
375 | for (i = 0; i < n; i++)
376 | {
377 | PyObject *iint;
378 | PyObject *indices;
379 | PyObject *elt = PySequence_Fast_GET_ITEM(fast_b, i);
380 | assert(elt && elt != Py_None);
381 | indices = PyDict_GetItem(b2j, elt);
382 | assert(indices == NULL || indices != Py_None);
383 | if (indices == NULL)
384 | {
385 | if (PyErr_Occurred())
386 | {
387 | if (!PyErr_ExceptionMatches(PyExc_KeyError))
388 | {
389 | Py_DECREF(fast_b);
390 | goto error;
391 | }
392 | PyErr_Clear();
393 | }
394 |
395 | indices = PyList_New(0);
396 | PyDict_SetItem(b2j, elt, indices);
397 | Py_DECREF(indices);
398 | }
399 | iint = PyLong_FromLong(i);
400 | PyList_Append(indices, iint);
401 | Py_DECREF(iint);
402 | }
403 | Py_DECREF(fast_b);
404 |
405 | assert(!PyErr_Occurred());
406 |
407 | //printf("determine junk\n");
408 | junk = PySet_New(NULL);
409 | isjunk = PyObject_GetAttrString(self, "isjunk");
410 | if (isjunk != NULL && isjunk != Py_None)
411 | {
412 | PyObject *keys = PyDict_Keys(b2j);
413 | PyObject *fastkeys;
414 | assert(PySequence_Check(keys));
415 | fastkeys = PySequence_Fast(keys, "dict keys");
416 | Py_DECREF(keys);
417 | /* call isjunk here */
418 | for (i = 0; i < PySequence_Fast_GET_SIZE(fastkeys); i++)
419 | {
420 | PyObject *elt = PySequence_Fast_GET_ITEM(fastkeys, i);
421 | if (call_obj(isjunk, elt))
422 | {
423 | PySet_Add(junk, elt);
424 | PyDict_DelItem(b2j, elt);
425 | }
426 | }
427 | Py_DECREF(fastkeys);
428 | Py_DECREF(isjunk);
429 | }
430 |
431 | /* build autojunk here */
432 | //printf("build autojunk\n");
433 | popular = PySet_New(NULL);
434 | autojunk = PyObject_GetAttrString(self, "autojunk");
435 | assert(autojunk != NULL);
436 | if (PyObject_IsTrue(autojunk) && n >= 200) {
437 | long ntest = n/100 + 1;
438 | PyObject *items = PyDict_Items(b2j);
439 | long b2jlen = PyList_GET_SIZE(items);
440 | for (i = 0; i < b2jlen; i++)
441 | {
442 | PyObject *tuple = PyList_GET_ITEM(items, i);
443 | PyObject *elt = PyTuple_GET_ITEM(tuple, 0);
444 | PyObject *idxs = PyTuple_GET_ITEM(tuple, 1);
445 |
446 | assert(PyList_Check(idxs));
447 |
448 | if (PyList_GET_SIZE(idxs) > ntest)
449 | {
450 | PySet_Add(popular, elt);
451 | PyDict_DelItem(b2j, elt);
452 | }
453 | }
454 | Py_DECREF(items);
455 | }
456 |
457 | retval = Py_BuildValue("OO", junk, popular);
458 | assert(!PyErr_Occurred());
459 |
460 | error:
461 | if (b2j)
462 | Py_DECREF(b2j);
463 | if (junk)
464 | Py_DECREF(junk);
465 | if (popular)
466 | Py_DECREF(popular);
467 | if (autojunk)
468 | Py_DECREF(autojunk);
469 | return retval;
470 | }
471 |
472 | //
473 | // Define functions in this module
474 | //
475 |
476 | static PyMethodDef CDiffLibMethods[4] = {
477 | {"find_longest_match", find_longest_match, METH_VARARGS,
478 | "c implementation of difflib.SequenceMatcher.find_longest_match"},
479 | {"chain_b", chain_b, METH_VARARGS,
480 | "c implementation of most of difflib.SequenceMatcher.__chain_b"},
481 | {"matching_blocks", matching_blocks, METH_VARARGS,
482 | "c implementation of part of difflib.SequenceMatcher.get_matching_blocks"},
483 | {NULL, NULL, 0, NULL} /* Sentinel */
484 | };
485 |
486 | static struct PyModuleDef _cdifflib = {
487 | PyModuleDef_HEAD_INIT,
488 | "_cdifflib",
489 | "C Implementation of Python3's difflib",
490 | -1,
491 | CDiffLibMethods
492 | };
493 |
494 | // Init module
495 |
496 | PyMODINIT_FUNC PyInit__cdifflib(void)
497 | {
498 | return PyModule_Create(&_cdifflib);
499 | }
500 |
501 | #endif // PY_MAJOR_VERSION == 3
502 |
--------------------------------------------------------------------------------
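The three phases of `chain_b` above (build `b2j`, apply the caller's `isjunk`, then apply the autojunk heuristic) can be outlined in pure Python. This is only a sketch for reading alongside the C code, not the extension's implementation: the function name `chain_b_sketch` and the sample data are made up for illustration, and the result is compared against the `b2j`, `bjunk` and `bpopular` attributes of the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def chain_b_sketch(b, isjunk=None, autojunk=True):
    """Pure-Python outline of the three phases in the C chain_b function.

    Hypothetical helper for illustration; it is not part of cdifflib.
    """
    # Phase 1: map each element of b to the list of indices where it occurs.
    b2j = {}
    for i, elt in enumerate(b):
        b2j.setdefault(elt, []).append(i)

    # Phase 2: move elements the caller flags as junk out of b2j.
    junk = set()
    if isjunk is not None:
        for elt in list(b2j):
            if isjunk(elt):
                junk.add(elt)
                del b2j[elt]

    # Phase 3: with autojunk on and at least 200 items, drop "popular"
    # elements that occur more than n // 100 + 1 times.
    popular = set()
    n = len(b)
    if autojunk and n >= 200:
        ntest = n // 100 + 1
        for elt, idxs in list(b2j.items()):
            if len(idxs) > ntest:
                popular.add(elt)
                del b2j[elt]

    return b2j, junk, popular


# Sample data: 250 items where "x" occurs 50 times and every other item once.
b = ["x" if i % 5 == 0 else "line %d" % i for i in range(250)]
b2j, junk, popular = chain_b_sketch(b, isjunk=lambda s: s == "line 1")

# The standard library computes the same structures when seq2 is set.
reference = SequenceMatcher(lambda s: s == "line 1", [], b)
```

With 250 items the threshold is `250 // 100 + 1 == 3`, so only `"x"` is treated as popular; sequences shorter than 200 items never trigger the heuristic.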
/cdifflib.py:
--------------------------------------------------------------------------------
1 | """
2 | Module cdifflib -- c implementation of difflib.
3 |
4 | Class CSequenceMatcher:
5 | A faster version of difflib.SequenceMatcher. Reimplements the bottleneck
6 | functions (find_longest_match, __chain_b and the core of
7 | get_matching_blocks) in native C. The rest is inherited.
8 | """
9 |
10 | __all__ = ["CSequenceMatcher", "__version__"]
11 |
12 | __version__ = "1.2.9"
13 |
14 | import sys
15 | from difflib import SequenceMatcher as _SequenceMatcher
16 | from difflib import Match as _Match
17 | import _cdifflib
18 |
19 |
20 | class CSequenceMatcher(_SequenceMatcher):
21 | def __init__(self, isjunk=None, a="", b="", autojunk=True):
22 | """Construct a CSequenceMatcher.
23 |
24 | Simply wraps the difflib.SequenceMatcher.
25 | """
26 | if sys.version_info[0] == 2 and sys.version_info[1] < 7:
27 | # No autojunk in Python 2.6 and lower
28 | _SequenceMatcher.__init__(self, isjunk, a, b)
29 | else:
30 | _SequenceMatcher.__init__(self, isjunk, a, b, autojunk)
31 |
32 | def find_longest_match(self, alo, ahi, blo, bhi):
33 | """Find longest matching block in a[alo:ahi] and b[blo:bhi].
34 |
35 | Wrapper for the C implementation of this function.
36 | """
37 | besti, bestj, bestsize = _cdifflib.find_longest_match(self, alo, ahi, blo, bhi)
38 | return _Match(besti, bestj, bestsize)
39 |
40 | def set_seq1(self, a):
41 | """Same as SequenceMatcher.set_seq1, but converts non-list inputs to a
42 | list and checks that every item is hashable."""
43 | if a is self.a:
44 | return
45 | self.a = a
46 | if not isinstance(self.a, list):
47 | self.a = list(self.a)
48 | # Types must be hashable to work in the c layer. This will raise if
49 | # list items are *not* hashable.
50 | [hash(x) for x in self.a]
51 | self.matching_blocks = self.opcodes = None
52 |
53 | def set_seq2(self, b):
54 | """Same as SequenceMatcher.set_seq2, but uses the C chain_b
55 | implementation.
56 | """
57 | if b is self.b and hasattr(self, "isbjunk"):
58 | return
59 | self.b = b
60 | if not isinstance(self.a, list):
61 | self.a = list(self.a)
62 | if not isinstance(self.b, list):
63 | self.b = list(self.b)
64 |
65 | # Types must be hashable to work in the c layer. These checks will
66 | # raise the correct error if they are *not* hashable.
67 | [hash(x) for x in self.a]
68 | [hash(x) for x in self.b]
69 |
70 | self.matching_blocks = self.opcodes = None
71 | self.fullbcount = None
72 | junk, popular = _cdifflib.chain_b(self)
73 | assert hasattr(junk, "__contains__")
74 | assert hasattr(popular, "__contains__")
75 | self.isbjunk = junk.__contains__
76 | self.isbpopular = popular.__contains__
77 | # We use this to speed up find_longest_match a smidge
78 |
79 | def get_matching_blocks(self):
80 | """Same as SequenceMatcher.get_matching_blocks, but calls through to a
81 | faster loop for find_longest_match. The rest is the same.
82 | """
83 | if self.matching_blocks is not None:
84 | return self.matching_blocks
85 |
86 | matching_blocks = _cdifflib.matching_blocks(self)
87 | matching_blocks.append((len(self.a), len(self.b), 0))
88 | self.matching_blocks = matching_blocks
89 |
90 | return list(map(_Match._make, self.matching_blocks))
91 |
--------------------------------------------------------------------------------
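A minimal usage sketch for the wrapper class above. `CSequenceMatcher` takes the same constructor arguments as `difflib.SequenceMatcher`, so it can be swapped in directly. The `try`/`except ImportError` fallback is an assumption added here so the snippet also runs where the C extension has not been built; it is not something the package itself does.

```python
from difflib import SequenceMatcher

try:
    # The accelerated class, if the C extension has been built and installed.
    from cdifflib import CSequenceMatcher
except ImportError:
    # Assumption for this sketch only: fall back to the pure-Python class.
    CSequenceMatcher = SequenceMatcher

a = "abcd"
b = "bcde"

fast = CSequenceMatcher(None, a, b)
slow = SequenceMatcher(None, a, b)

# Normalise to plain tuples so the two results compare equal regardless of
# whether each entry is a Match namedtuple or a bare tuple.
fast_blocks = [tuple(m) for m in fast.get_matching_blocks()]
slow_blocks = [tuple(m) for m in slow.get_matching_blocks()]
fast_ratio = fast.ratio()
```

Both matchers find the single block `"bcd"` (three items starting at `a[1]` and `b[0]`) followed by the zero-length sentinel, which gives a ratio of `2 * 3 / 8 == 0.75`.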
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools >= 61.0", "pytest", "ruff", "twine"]
3 | build-backend = "setuptools.build_meta"
4 |
5 | [project]
6 | name = "cdifflib"
7 | version = "1.2.9"
8 | authors = [
9 | { name="Matthew Duggan", email="mgithub@guarana.org" },
10 | ]
11 | description = "C implementation of parts of difflib"
12 | readme = "README.md"
13 | license = { file="LICENSE" }
14 | requires-python = ">=3.4"
15 | keywords=["difflib", "c", "diff"]
16 | classifiers = [
17 | "Development Status :: 5 - Production/Stable",
18 | "Environment :: Console",
19 | "Intended Audience :: Developers",
20 | "License :: OSI Approved :: BSD License",
21 | "Operating System :: MacOS :: MacOS X",
22 | "Operating System :: Microsoft :: Windows",
23 | "Operating System :: POSIX",
24 | "Programming Language :: Python",
25 | "Topic :: Software Development",
26 | "Topic :: Text Processing :: General",
27 | ]
28 |
29 | [project.urls]
30 | "Homepage" = "https://github.com/mduggan/cdifflib"
31 | "Bug Tracker" = "https://github.com/mduggan/cdifflib/issues"
32 |
33 | #
34 | # TODO: Switch to this method for building instead of using setup.py.
35 | # This requires setuptools >= 74.1 which is a bit too new as of 2025
36 | #
37 | #[tool.setuptools]
38 | #ext-modules = [
39 | # {name = "_cdifflib", sources = ["_cdifflib.c", "_cdifflib3.c"]}
40 | #]
41 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, Extension
2 |
3 | #
4 | # TODO: Delete this file entirely once support for `ext-modules` in
5 | # pyproject.toml is mature. It just replicates the same information
6 | # otherwise.
7 | #
8 |
9 | ext_modules = [Extension("_cdifflib", sources=["_cdifflib.c", "_cdifflib3.c"]),]
10 |
11 | with open("README.md") as f:
12 | long_description = f.read()
13 |
14 | setup(
15 | name="cdifflib",
16 | version="1.2.9",
17 | description="C implementation of parts of difflib",
18 | long_description=long_description,
19 | long_description_content_type="text/markdown",
20 | ext_modules=ext_modules,
21 | py_modules=["cdifflib"],
22 | author="Matthew Duggan",
23 | author_email="mgithub@guarana.org",
24 | license="BSD",
25 | url="https://github.com/mduggan/cdifflib",
26 | keywords="difflib c diff",
27 | classifiers=[
28 | "Development Status :: 5 - Production/Stable",
29 | "Environment :: Console",
30 | "Intended Audience :: Developers",
31 | "License :: OSI Approved :: BSD License",
32 | "Operating System :: MacOS :: MacOS X",
33 | "Operating System :: Microsoft :: Windows",
34 | "Operating System :: POSIX",
35 | "Programming Language :: Python",
36 | "Topic :: Software Development",
37 | "Topic :: Text Processing :: General",
38 | ],
39 | )
40 |
--------------------------------------------------------------------------------
/tests/cdifflib_tests.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import random
3 | import pytest
4 |
5 | import os
6 | import sys
7 |
8 | sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
9 |
10 | from cdifflib import CSequenceMatcher
11 | from difflib import SequenceMatcher
12 |
13 |
14 | ##
15 | # A function to count lines of difference using a provided SequenceMatcher
16 | # class.
17 | #
18 | def linecount(smclass, s1, s2):
19 | linecount = 0
20 | sm = smclass(None, s1, s2, autojunk=False)
21 | i = 0
22 | j = 0
23 | for ai, bj, n in sm.get_matching_blocks():
24 | linecount += max(ai - i, bj - j)
25 | i, j = ai + n, bj + n
26 | return linecount
27 |
28 |
29 | def profile_sequence_matcher(smclass, a, b, n):
30 | for i in range(0, n):
31 | out = linecount(smclass, a, b)
32 | print("Diff from %s is %d" % (smclass.__name__, out))
33 |
34 |
35 | def print_stats(prof):
36 | print("Top 10 by cumtime:")
37 | prof.sort_stats("cumulative")
38 | prof.print_stats(10)
39 | print("Top 10 by selftime:")
40 | prof.sort_stats("time")
41 | prof.print_stats(10)
42 |
43 |
44 | def generate_similar_streams(nlines, ndiffs):
45 | lines = [None] * nlines
46 | chars = [chr(x) for x in range(32, 126)]
47 | for l in range(len(lines)):
48 | lines[l] = "".join([random.choice(chars) for x in range(60)])
49 | orig = list(lines)
50 | for r in range(ndiffs):
51 | row = random.randint(0, len(lines) - 1)
52 | lines[row] = "".join([random.choice(chars) for x in range(60)])
53 | return orig, lines
54 |
55 |
56 | def assertNearlyEqual(x, y):
57 | """Simple test to make sure floats are close"""
58 | assert abs(x - y) < 1e-5
59 |
60 |
61 | @pytest.fixture()
62 | def test_streams():
63 | random.seed(1234)
64 | streama, streamb = generate_similar_streams(2000, 200)
65 | yield streama, streamb
66 |
67 |
68 | def testCDifflibVsDifflibRandom(test_streams):
69 | """Test cdifflib gets same answer as difflib on semi-random sequence of lines"""
70 | cdiff = linecount(CSequenceMatcher, test_streams[0], test_streams[1])
71 | diff = linecount(SequenceMatcher, test_streams[0], test_streams[1])
72 | assert cdiff != 0
73 | assert cdiff == diff
74 |
75 |
76 | def testCDifflibVsDifflibIdentical(test_streams):
77 | """Test cdifflib gets 0 difference on the same sequence of lines"""
78 | cdiff = linecount(CSequenceMatcher, test_streams[0], test_streams[0])
79 | assert cdiff == 0
80 | cdiff = linecount(CSequenceMatcher, test_streams[1], test_streams[1])
81 | assert cdiff == 0
82 |
83 |
84 | def testCDifflibWithEmptyInput(test_streams):
85 | """Test cdifflib gets correct difference vs empty stream"""
86 | cdiff = linecount(CSequenceMatcher, [], [])
87 | assert cdiff == 0
88 | cdiff = linecount(CSequenceMatcher, test_streams[0], [])
89 | assert cdiff == len(test_streams[0])
90 | cdiff = linecount(CSequenceMatcher, [], test_streams[1])
91 | assert cdiff == len(test_streams[1])
92 |
93 |
94 | def testCDifflibWithBadTypes(test_streams):
95 | """Check cdifflib raises the same type complaints as difflib"""
96 | with pytest.raises(TypeError):
97 | linecount(CSequenceMatcher, None, test_streams[1])
98 | with pytest.raises(TypeError):
99 | linecount(SequenceMatcher, None, test_streams[1])
100 | with pytest.raises(TypeError):
101 | linecount(CSequenceMatcher, test_streams[0], 1)
102 | with pytest.raises(TypeError):
103 | linecount(SequenceMatcher, test_streams[0], 1)
104 | with pytest.raises(TypeError):
105 | linecount(SequenceMatcher, test_streams[0], [{}, {}])
106 | with pytest.raises(TypeError):
107 | linecount(SequenceMatcher, [set([])], [1])
108 |
109 |
110 | def testCDifflibWithNonLists(test_streams):
111 | """Check cdifflib handles non-list types the same as difflib"""
112 | cdiff = linecount(CSequenceMatcher, "not a list", "also not a list")
113 | diff = linecount(SequenceMatcher, "not a list", "also not a list")
114 | assert diff == cdiff
115 | assert cdiff == 5
116 |
117 | def gena():
118 | for x in test_streams[0]:
119 | yield x
120 |
121 | def genb():
122 | for x in test_streams[1]:
123 | yield x
124 |
125 | cdiff = linecount(CSequenceMatcher, gena(), genb())
126 | # actually difflib doesn't handle generators, just check cdiff result.
127 | assert cdiff > 0
128 |
129 |
130 | def testCDifflibWithBug5Data():
131 | """Check cdifflib returns the same result for bug #5
132 | (autojunk handling issues)"""
133 | import testdata
134 |
135 | # note: convert both to lists for Python 3.3
136 | sm = SequenceMatcher(None, testdata.a5, testdata.b5)
137 | difflib_matches = list(sm.get_matching_blocks())
138 |
139 | sm = CSequenceMatcher(None, testdata.a5, testdata.b5)
140 | cdifflib_matches = list(sm.get_matching_blocks())
141 |
142 | assert difflib_matches == cdifflib_matches
143 |
144 |
145 | def testSeq1ResetsCorrectly():
146 | s = CSequenceMatcher(None, "abcd", "bcde")
147 | assertNearlyEqual(s.ratio(), 0.75)
148 | s.set_seq1("bcde")
149 | assertNearlyEqual(s.ratio(), 1.0)
150 |
151 |
152 | def main():
153 | from optparse import OptionParser
154 | import time
155 |
156 | parser = OptionParser(
157 | description="Test the C version of difflib. Either "
158 | "specify files, or leave empty for auto-generated "
159 | "random lines",
160 | usage="Usage: %prog [options] [file1 file2]",
161 | )
162 | parser.add_option(
163 | "-n",
164 | "--niter",
165 | dest="niter",
166 | type="int",
167 | help="num of iterations (default=%default)",
168 | default=1,
169 | )
170 | parser.add_option(
171 | "-l",
172 | "--lines",
173 | dest="lines",
174 | type="int",
175 | help="num of lines to generate if no files specified (default=%default)",
176 | default=20000,
177 | )
178 | parser.add_option(
179 | "-d",
180 | "--diffs",
181 | dest="diffs",
182 | type="int",
183 | help="num of random lines to change if no files specified (default=%default)",
184 | default=200,
185 | )
186 | parser.add_option(
187 | "-p",
188 | "--profile",
189 | dest="profile",
190 | default=False,
191 | action="store_true",
192 | help="run in the python profiler and print results",
193 | )
194 | parser.add_option(
195 | "-c",
196 | "--compare",
197 | dest="compare",
198 | default=False,
199 | action="store_true",
200 | help="also run the non-c difflib to compare outputs",
201 | )
202 | parser.add_option(
203 | "-y",
204 | "--yep",
205 | dest="yep",
206 | default=False,
207 | action="store_true",
208 | help="use yep to profile the c code",
209 | )
210 |
211 | (opts, args) = parser.parse_args()
212 |
213 | start = int(time.time())
214 |
215 | if opts.niter < 1:
216 | parser.error("Need to do at least 1 iteration.")
217 |
218 | if args:
219 | if len(args) != 2:
220 | parser.error("Need exactly 2 files to compare.")
221 | try:
222 | print("Reading input files...")
223 | s1 = open(args[0]).readlines()
224 | s2 = open(args[1]).readlines()
225 | except (IOError, OSError):
226 | parser.error("Couldn't load input files %s and %s" % (args[0], args[1]))
227 | else:
228 | print("Generating random similar streams...")
229 | s1, s2 = generate_similar_streams(opts.lines, opts.diffs)
230 |
231 | # shonky, but saves time..
232 | sys.path.append("build/lib.linux-x86_64-2.7/")
233 | sys.path.append("build/lib.linux-x86_64-2.7-pydebug/")
234 | sys.path.append("build/lib.macosx-10.6-intel-2.7")
235 |
236 | if opts.yep:
237 | import yep
238 |
239 | yep.start("cdifflib.prof")
240 |
241 | if opts.profile:
242 | import cProfile
243 | import pstats
244 |
245 | fn = "cdifflib_%d.prof" % start
246 | print("Profiling cdifflib.CSequenceMatcher...")
247 | cProfile.runctx(
248 | "p(sm,a,b,n)",
249 | dict(p=profile_sequence_matcher),
250 | dict(a=s1, b=s2, n=opts.niter, sm=CSequenceMatcher),
251 | fn,
252 | )
253 | print_stats(pstats.Stats(fn))
254 |
255 | if opts.compare:
256 | fn = "difflib_%d.prof" % start
257 | print("Profiling difflib.SequenceMatcher...")
258 | cProfile.runctx(
259 | "p(sm,a,b,n)",
260 | dict(p=profile_sequence_matcher),
261 | dict(a=s1, b=s2, n=opts.niter, sm=SequenceMatcher),
262 | fn,
263 | )
264 | print_stats(pstats.Stats(fn))
265 |
266 | else:
267 | print("Running cdifflib.CSequenceMatcher %d times..." % opts.niter)
268 | profile_sequence_matcher(CSequenceMatcher, s1, s2, opts.niter)
269 | if opts.compare:
270 | print("Running difflib.SequenceMatcher %d times..." % opts.niter)
271 | profile_sequence_matcher(SequenceMatcher, s1, s2, opts.niter)
272 |
273 | if opts.yep:
274 | yep.stop()
275 |
276 |
277 | if __name__ == "__main__":
278 | main()
279 |
--------------------------------------------------------------------------------
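The `linecount` helper in the tests above measures the difference between two sequences by summing the gaps between consecutive matching blocks. The worked example below repeats that helper against the standard library matcher so it runs without the extension; the sample lists `old` and `new` are invented for illustration.

```python
from difflib import SequenceMatcher


def linecount(smclass, s1, s2):
    """Count differing lines: sum the gaps between consecutive matching blocks."""
    total = 0
    sm = smclass(None, s1, s2, autojunk=False)
    i = j = 0
    for ai, bj, n in sm.get_matching_blocks():
        # Anything between the end of the previous block and the start of
        # this one is unmatched on at least one side.
        total += max(ai - i, bj - j)
        i, j = ai + n, bj + n
    return total


old = ["alpha", "beta", "gamma", "delta"]
new = ["alpha", "BETA", "gamma", "delta", "epsilon"]

changed = linecount(SequenceMatcher, old, new)
against_empty = linecount(SequenceMatcher, [], old)
```

Here the matching blocks are `(0, 0, 1)`, `(2, 2, 2)` and the sentinel `(4, 5, 0)`, so `changed` is 2: one gap for the replaced `"beta"` line and one for the appended `"epsilon"`. Comparing against an empty sequence counts every line, matching the expectation in `testCDifflibWithEmptyInput`.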
/tests/testdata.py:
--------------------------------------------------------------------------------
1 | """
2 | Test data from bug https://github.com/mduggan/cdifflib/issues/5
3 |
4 | Revealed some bugs with autojunk handling
5 | """
6 | a5 = [
7 | 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
8 | 12, 124, 16, 12, 12, 12, 108, 588, 1316, 12, 8, 42, 6, 168, 36, 12, 10, 10,
9 | 158, 36, 10, 24, 152, 914, 84, 216, 4, 10, 254, 8, 40, 54, 20, 12, 54, 38,
10 | 10, 8, 310, 6, 580, 28, 20, 44, 12, 24, 34, 44, 4, 20, 8, 16, 14, 12, 8,
11 | 12, 20, 14, 28, 12, 24, 6, 12, 372, 544, 1212, 28, 64, 12, 16, 16, 34, 146,
12 | 70, 284, 110, 206, 354, 612, 16, 12, 18, 6, 18, 6, 6, 20, 6, 12, 12, 12,
13 | 20, 12, 12, 12, 20, 12, 12, 358, 258, 12, 54, 20, 8, 8, 6, 16, 12, 6, 112,
14 | 130, 16, 8, 26, 8, 8, 44, 44, 22, 88, 314, 394, 588, 122, 6, 644, 6, 32,
15 | 24, 924, 10, 66, 22, 270, 16, 1340, 2408, 54, 452, 158, 1950, 382, 594, 38,
16 | 110, 106, 40, 56, 5302, 1398, 6, 1016, 814, 46, 112, 14, 6, 14, 12, 6, 6,
17 | 46, 16, 80, 80, 68, 84, 82, 6, 1224, 518
18 | ]
19 |
20 | b5 = [
21 | 284, 6, 528, 64, 16, 230, 254, 6, 162, 350, 28, 22, 88, 18, 136, 64, 36,
22 | 32, 102, 14, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
23 | 12, 12, 12, 12, 12, 12, 124, 16, 12, 12, 12, 108, 588, 1312, 12, 8, 42, 6,
24 | 168, 36, 12, 10, 10, 158, 36, 10, 26, 154, 906, 82, 214, 4, 10, 242, 8, 38,
25 | 54, 20, 14, 54, 38, 10, 8, 314, 6, 574, 28, 20, 38, 12, 24, 22, 44, 4, 20,
26 | 8, 16, 12, 12, 8, 12, 20, 14, 24, 12, 24, 6, 12, 382, 544, 1212, 28, 64,
27 | 12, 16, 16, 34, 146, 70, 284, 110, 206, 354, 612, 16, 12, 18, 6, 18, 6, 6,
28 | 20, 6, 12, 12, 12, 20, 12, 12, 12, 20, 12, 12, 358, 258, 12, 54, 20, 8, 8,
29 | 6, 16, 12, 6, 112, 130, 16, 8, 26, 8, 8, 44, 44, 22, 88, 314, 394, 588,
30 | 122, 6, 644, 6, 32, 24, 924, 10, 66, 22, 270, 16, 1340, 2408, 54, 452, 158,
31 | 1950, 382, 594, 38, 110, 106, 40, 56, 5302, 1398, 6, 1016, 814, 46, 112,
32 | 14, 6, 14, 12, 6
33 | ]
34 |
--------------------------------------------------------------------------------