├── .gitignore ├── README.md ├── benchmarking └── isolated_benchmark.py ├── demo.py ├── pure_cpp_parallelism.cpp ├── setup.py └── subinterpreter_parallelism.cpp /.gitignore: -------------------------------------------------------------------------------- 1 | .venv 2 | build/ 3 | dist/ 4 | benchmark.py 5 | *.egg-info/ 6 | __pycache__/ 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # True parallelism in pure-Python code 2 | 3 | Accompanying article: [rishiraj.me/articles/2024-04/python_subinterpreter_parallelism](https://rishiraj.me/articles/2024-04/python_subinterpreter_parallelism) 4 | 5 | This is a proof-of-concept implementation of running "pure" Python code truly in parallel in a single CPython process. This is achieved using the new [per-interpreter GIL](https://peps.python.org/pep-0684/) construct from Python 3.12. 6 | 7 | Here is a rough summary of what I have achieved: 8 | 9 | 1. Create a local CPython function `subinterpreter_parallelism.parallel` which allows users to call arbitrary Python functions using multiple threads, each starting up its own sub-interpreter with its own GIL. 10 | 1. This function takes in a variable number of lists, where each list consists of ["module_name", "function_name", args]. 11 | 1. Internally, we iterate over each of these lists and spawn a new thread, each with its own interpreter, to run those requests in parallel. 12 | 1. Since we can't easily share objects between these interpreters, I've opted to take the module name & function name as strings from the user and pass them as arguments to the thread. For the Python function's arguments, I pickle the args object to a `std::string` using the `pickle` module. 13 | 1. The spawned threads then create a new interpreter with its own GIL & run the function. 14 | 1. 
Once the function completes execution in the thread, it delivers the result back to the calling Python function through a promise. 15 | 1. The Python function then unpickles the results as the different futures get resolved, and adds them to a list. 16 | 1. This list is finally returned to the user when all the function calls have completed. 17 | 18 | ## Caveats 19 | 20 | 1. This might not work well, or at all, with various other commonly used Python libraries (e.g. NumPy) in its current state. This is because, by default, all C/C++ extension modules are initialized [without support for multiple interpreters](https://docs.python.org/3/c-api/module.html#c.Py_mod_multiple_interpreters). This holds true for all modules created using [Cythonize](https://github.com/cython/cython/blob/368bbde62565f8798e061754caf60c94107f2d8c/Cython/Compiler/ModuleNode.py#L3547) (like NumPy), as of April 2024. This is because C extension libraries regularly interact with low-level APIs (like `PyGIL_*`) which are known to not work well with multiple sub-interpreters. Refer to the [caveats section of the Python documentation](https://docs.python.org/3/c-api/init.html#bugs-and-caveats). Hopefully, more libraries add support for this paradigm as it gains adoption. 21 | 2. Pure C++ code should still perform much better for highly CPU-bound tasks, due to the overhead associated with Python being an interpreted language. 22 | 3. Since very little is shared between interpreters in my setup, things like logging configuration, imports etc. need to be explicitly provided in the functions being run in parallel. 23 | 24 | Note that this is just an experimental project done over a weekend that might be of interest to others interested in parallelism & Python's evolution. 25 | 26 | 27 | ## Installation 28 | Please use Python 3.12 or above for testing this out. 29 | The commands given here are for Linux & might require tweaking on other operating systems. 30 | 31 | 1. 
Create & activate a virtual environment. 32 | `python3.12 -m venv .venv && source .venv/bin/activate` 33 | 34 | 1. Ensure the `benchmarking` directory from this folder is treated as a Python source directory. 35 | ```export PYTHONPATH=$PYTHONPATH:`pwd`/benchmarking``` 36 | 37 | 1. Ensure that setuptools is available locally. 38 | `pip install setuptools` 39 | 40 | 1. Set up the Python extensions locally. 41 | `python3 setup.py install` 42 | 43 | 1. Run the demo code to validate that things are working fine. 44 | `python3 demo.py` 45 | 46 | ## Usage 47 | ```py 48 | from subinterpreter_parallelism import parallel 49 | 50 | # Run 3 threads of pure Python functions in parallel using sub-interpreters. 51 | result = parallel(['module1', 'func1', (arg11, arg12, arg13)], 52 | ['module2', 'func2', (arg21, arg22)], 53 | ['module3', 'func3', tuple()]) 54 | ``` 55 | 56 | 57 | ## Statistics 58 | Using normal Python threads, we can't gain any performance improvement for CPU-bound tasks in CPython due to GIL contention. Hence, comparing parallelism using a simple factorial function, we get the following statistics: 59 | 60 | | Method | Total time taken | 61 | | - | - | 62 | | Multi-processing | 15.07s | 63 | | Sub-interpreters | 11.48s | 64 | | C++ extension code (with GIL relinquished) | 0.74s | 65 | 66 | Out of the total time taken for running functions in parallel using sub-interpreters, we see the following breakdown of time taken at each step: 67 | | Step | Time taken (ms) | 68 | | - | - | 69 | | Creating interpreters | 17 | 70 | | Imports & pickling/unpickling | 35 | 71 | | Function call | 2020 | 72 | | Ending interpreters | 2.7 | 73 | 74 | 75 | ## Takeaways 76 | 77 | 1. Using sub-interpreter parallelism, I was able to verify that the Python process constantly hits close to full CPU utilization across all cores (1995% CPU utilization for a machine with 20 cores). Note that with regular Python threads, CPU utilization hovers around 100% as expected (i.e. almost full utilization of a single core). 
78 | 2. Significantly (>20%) better performance with sub-interpreter parallelism compared to multi-processing. 79 | 3. Due to the inherent slowness associated with an interpreted language, it's still better to implement the CPU-bound part of the functionality in C++ using Python extensions. 80 | 4. With Python 3.13, much of this work would become redundant as interpreters would be made part of the stdlib itself. However, it's still fascinating to see how we can achieve similar results in Python 3.12. 81 | 82 | -------------------------------------------------------------------------------- /benchmarking/isolated_benchmark.py: -------------------------------------------------------------------------------- 1 | import time 2 | import logging 3 | 4 | def py_factorial(n): 5 | begin = time.time() 6 | k = 1 7 | for i in range(n,1,-1): 8 | k *= i 9 | k %= 1000000007 10 | logging.warning(f"[PY_FACTORIAL] py_factorial function internals took {(time.time() - begin)*1000} ms") 11 | return k 12 | -------------------------------------------------------------------------------- /demo.py: -------------------------------------------------------------------------------- 1 | from subinterpreter_parallelism import parallel 2 | from pure_cpp_parallelism import c_factorial 3 | from random import randint 4 | import isolated_benchmark 5 | from multiprocessing import Process 6 | from threading import Thread 7 | import logging 8 | import time 9 | 10 | logging.basicConfig( 11 | format='[%(asctime)s.%(msecs)03d] %(message)s', 12 | datefmt='%Y-%m-%d %H:%M:%S') 13 | 14 | # Number of task runs to spawn. 15 | NUM_TASKS = 10 16 | 17 | # Increase MIN & MAX to increase time taken by each task (should be < 1000000000) 18 | MIN = 90000000 19 | MAX = 90000100 20 | 21 | # Benchmark Python function against multiple Python processes. 
22 | logging.warning(f"Multi-processing mode started for {NUM_TASKS} regular Python tasks") 23 | start = time.time() 24 | 25 | processes = [Process(target=isolated_benchmark.py_factorial, args=(randint(MIN, MAX),)) for _ in range(NUM_TASKS)] 26 | for p in processes: 27 | p.start() 28 | for p in processes: 29 | p.join() 30 | 31 | end = time.time() 32 | logging.warning(f"Multi-processing mode ended for {NUM_TASKS} regular Python tasks in {end-start} seconds") 33 | 34 | # Benchmark Python function against the custom handwritten sub-interpreter parallelism module. 35 | logging.warning(f"Sub-interpreter parallelism mode started for {NUM_TASKS} regular Python tasks") 36 | start = time.time() 37 | 38 | result = parallel(*[['isolated_benchmark', 'py_factorial', (randint(MIN,MAX),)] for _ in range(NUM_TASKS)]) 39 | 40 | end = time.time() 41 | logging.warning(f"Sub-interpreter parallelism mode ended with {result=} in {end-start} seconds") 42 | 43 | # Benchmark similar C++ extension function. 44 | logging.warning(f"Multi-threaded C++ code execution from Python function started for {NUM_TASKS} tasks") 45 | start = time.time() 46 | threads = [Thread(target=c_factorial, args=(randint(MIN,MAX),)) for _ in range(NUM_TASKS)] 47 | for thread in threads: 48 | thread.start() 49 | for thread in threads: 50 | thread.join() 51 | end = time.time() 52 | logging.warning(f"Multi-threaded C++ code execution from Python function ended in {end-start} seconds") 53 | 54 | # Benchmark Python function against a single thread. 55 | start = time.time() 56 | logging.warning("Single threaded mode for 1 regular Python task begins") 57 | result = isolated_benchmark.py_factorial(randint(MIN, MAX)) 58 | end = time.time() 59 | logging.warning(f"Single threaded mode ended for 1 regular Python task with {result=} in {end-start} seconds") 60 | 61 | # Benchmark Python function against multiple threads. 
62 | logging.warning(f"Multi-threaded mode for {NUM_TASKS} regular Python functions begins") 63 | start = time.time() 64 | threads = [Thread(target=isolated_benchmark.py_factorial, args=(randint(MIN,MAX),)) for _ in range(NUM_TASKS)] 65 | for thread in threads: 66 | thread.start() 67 | for thread in threads: 68 | thread.join() 69 | end = time.time() 70 | logging.warning(f"Multi-threaded mode ended for {NUM_TASKS} regular Python functions in {end-start} seconds") 71 | -------------------------------------------------------------------------------- /pure_cpp_parallelism.cpp: -------------------------------------------------------------------------------- 1 | #define PY_SSIZE_T_CLEAN 2 | #include <Python.h> 3 | #include <sys/time.h> 4 | 5 | PyDoc_STRVAR(c_parallelsim_fns_documentation, 6 | "Doc for c_parallelism implemented in C"); 7 | 8 | /** 9 | * Returns the factorial % 1e9+7 for a given number. 10 | * This is just used to demonstrate any expensive operation. 11 | */ 12 | int factorial(int val) { 13 | int i; 14 | struct timeval t1, t2; 15 | time_t ltime; 16 | struct tm *tm; 17 | double elapsedTime; 18 | long long int s = 1; 19 | gettimeofday(&t1, NULL); 20 | 21 | /** ------- ACTUAL BUSINESS LOGIC BEGINS ------------------ */ 22 | for (i = val; i >= 1; --i) { 23 | s *= i; 24 | s %= 1000000007; 25 | } 26 | /** ------- ACTUAL BUSINESS LOGIC ENDS ------------------ */ 27 | 28 | // Calculate elapsed time & log current time 29 | gettimeofday(&t2, NULL); 30 | elapsedTime = (t2.tv_sec - t1.tv_sec) * 1000.0; // sec to ms 31 | elapsedTime += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms 32 | ltime = time(NULL); 33 | tm = localtime(&ltime); 34 | printf("[%04d-%02d-%02d %02d:%02d:%02d C API] Time taken is %f ms\n", tm->tm_year+1900, tm->tm_mon+1, 35 | tm->tm_mday, tm->tm_hour, tm->tm_min, tm->tm_sec, elapsedTime); 36 | return s; 37 | } 38 | 39 | PyObject* c_factorial(PyObject* self, PyObject* args) { 40 | int num; 41 | int sol; 42 | if (!PyArg_ParseTuple(args, "i", &num)) { 43 | return NULL; 44 | } 45 | 
46 | Py_BEGIN_ALLOW_THREADS 47 | sol = factorial(num); 48 | Py_END_ALLOW_THREADS 49 | 50 | PyObject* object = PyLong_FromLong(sol); 51 | return object; 52 | } 53 | 54 | static PyMethodDef c_parallelsim_fns[] = { 55 | {"c_factorial", c_factorial, METH_VARARGS, "Calculate factorial in C layer"}, 56 | {NULL, NULL, 0, NULL} /* Sentinel */ 57 | }; 58 | 59 | PyMODINIT_FUNC 60 | PyInit_pure_cpp_parallelism(void) 61 | { 62 | static struct PyModuleDef moduledef = { 63 | PyModuleDef_HEAD_INIT, /* m_base */ 64 | "pure_cpp_parallelism", /* m_name */ 65 | c_parallelsim_fns_documentation, /* m_doc */ 66 | -1, /* m_size */ 67 | c_parallelsim_fns, /* m_methods */ 68 | nullptr, /* m_slots */ 69 | nullptr, /* m_traverse */ 70 | nullptr, /* m_clear */ 71 | nullptr /* m_free */ 72 | }; 73 | return PyModule_Create(&moduledef); 74 | } 75 | 76 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | from setuptools import setup, Extension 3 | 4 | def main(): 5 | setup(name="subinterpreter_parallelism", 6 | language="c++", 7 | version="1.0.0", 8 | description="Description", 9 | author="Rishi", 10 | author_email="contact@rishiraj.me", 11 | ext_modules=[Extension("subinterpreter_parallelism", ["subinterpreter_parallelism.cpp"], language='c++')]) 12 | 13 | setup(name="pure_cpp_parallelism", 14 | language="c++", 15 | version="1.0.0", 16 | description="Description", 17 | author="Rishi", 18 | author_email="contact@rishiraj.me", 19 | ext_modules=[Extension("pure_cpp_parallelism", ["pure_cpp_parallelism.cpp"], language='c++')]) 20 | 21 | 22 | if __name__ == "__main__": 23 | os.environ["CC"] = "g++" 24 | main() 25 | -------------------------------------------------------------------------------- /subinterpreter_parallelism.cpp: -------------------------------------------------------------------------------- 1 | #define PY_SSIZE_T_CLEAN 2 | #include <Python.h> 3 | #include <iostream> 4 
| #include <string> 5 | #include <thread> 6 | #include <future> 7 | #include <vector> 8 | #include <sstream> 9 | #include <stdexcept> 10 | #include <exception> 11 | 12 | // Uncomment below line for getting debug logs while running parallelism function. 13 | // #define DEBUG 1 14 | 15 | ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// 16 | 17 | #ifdef DEBUG 18 | # define PY_LOG(title, py_obj) std::cerr << title << ": "; PyObject_Print(py_obj, stderr, 0); std::cerr << std::endl; 19 | #else 20 | # define PY_LOG(title, py_obj) do { } while(0); 21 | #endif 22 | 23 | // Load pickling/un-pickling functions. 24 | #define LOAD_PICKLE_UNPICKLE \ 25 | PyObject* pkl_mod_name = PyUnicode_FromString("pickle"); \ 26 | PyObject* pkl_mod = PyImport_Import(pkl_mod_name); \ 27 | PyObject* pkl_dump_func = PyObject_GetAttrString(pkl_mod, "dumps"); \ 28 | PyObject* pkl_load_func = PyObject_GetAttrString(pkl_mod, "loads"); 29 | 30 | // Clear all pickling/un-pickling functions. 31 | #define UNLOAD_PICKLE_UNPICKLE Py_DECREF(pkl_mod_name); Py_DECREF(pkl_mod); Py_DECREF(pkl_dump_func); Py_DECREF(pkl_load_func); 32 | 33 | // Create a PyObject* "obj" by unpickling "pickled" 34 | #define UNPICKLE_PYOBJECT(obj, pickled) \ 35 | PyObject* pickled_bytes = PyBytes_FromString(pickled); \ 36 | PyObject* obj = PyObject_CallFunctionObjArgs(pkl_load_func, pickled_bytes, nullptr); \ 37 | Py_DECREF(pickled_bytes); 38 | 39 | // Create a std::string "str" by pickling PyObject* "object" 40 | #define PICKLE_PYOBJECT(str, object) \ 41 | PyObject* protocol = PyLong_FromLong(0); \ 42 | PyObject* decoded_bytes = PyObject_CallFunctionObjArgs(pkl_dump_func, object, protocol, nullptr); \ 43 | std::string str = PyBytes_AsString(decoded_bytes); \ 44 | Py_DECREF(protocol); \ 45 | Py_DECREF(decoded_bytes); 46 | 47 | // Check if any Python error has occurred. If so, set exception in the promise & end Python interpreter. 
48 | #define CHECK_FOR_ERROR(PROMISE) \ 49 | if (PyErr_Occurred()) { \ 50 | PyObject* err = PyErr_GetRaisedException(); \ 51 | PyObject* err_str = PyObject_Repr(err); \ 52 | auto exc = std::runtime_error(PyUnicode_AsUTF8(err_str)); \ 53 | PROMISE.set_exception(std::make_exception_ptr(exc)); \ 54 | PyErr_Clear(); \ 55 | /** Note that ideally, we should have run Py_DECREFs here*/ \ 56 | Py_EndInterpreter(interp_tstate); \ 57 | return; \ 58 | } 59 | 60 | ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// 61 | 62 | PyDoc_STRVAR(subinterpreter_parallelsim_fns_documentation, 63 | "Module for implementing 'true' parallelism using sub-interpreters (needs Python 3.12+)"); 64 | 65 | 66 | ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// 67 | 68 | /** 69 | * Run a particular function in a new sub-interpreter. 70 | * This function should always be called from a new thread. 71 | * mod : Name of module 72 | * func: Name of function in module to run. 73 | * args: Pickled tuple of arguments. 74 | * prom: Promise which is resolved with the pickled repr of the output when the function completes. 75 | */ 76 | void run_parallely_in_subinterp(std::string mod, std::string func, std::string args, std::promise<std::string>&& prom) { 77 | // Create a new sub-interpreter with its own GIL. 78 | PyThreadState *interp_tstate = nullptr; 79 | 80 | PyInterpreterConfig sub_interp_config = { 81 | .use_main_obmalloc = 0, 82 | .allow_fork = 0, 83 | .allow_exec = 0, 84 | .allow_threads = 0, 85 | .allow_daemon_threads = 0, 86 | .check_multi_interp_extensions = 1, 87 | .gil = PyInterpreterConfig_OWN_GIL, 88 | }; 89 | PyStatus status = Py_NewInterpreterFromConfig(&interp_tstate, &sub_interp_config); 90 | if (PyStatus_Exception(status)) { prom.set_exception(std::make_exception_ptr(std::runtime_error("Failed to create sub-interpreter"))); return; } 91 | // Load pickling/unpickling functions. 92 | LOAD_PICKLE_UNPICKLE 93 | CHECK_FOR_ERROR(prom) 94 | 95 | // Load the Python module. 
96 | PyObject* py_mod_str = PyUnicode_FromString(mod.c_str()); 97 | PyObject* py_mod = PyImport_Import(py_mod_str); 98 | PY_LOG("module", py_mod) 99 | CHECK_FOR_ERROR(prom) 100 | 101 | // Load the Python function from the module. 102 | PyObject* py_func = PyObject_GetAttrString(py_mod, func.c_str()); 103 | PY_LOG("function", py_func) 104 | CHECK_FOR_ERROR(prom) 105 | 106 | // Unpickle arguments 107 | UNPICKLE_PYOBJECT(py_args, args.c_str()) 108 | PY_LOG("arguments", py_args) 109 | CHECK_FOR_ERROR(prom) 110 | 111 | // Invoke the Python function with the given arguments. 112 | PyObject* py_result = PyObject_CallObject(py_func, py_args); 113 | PY_LOG("py_result", py_result) 114 | CHECK_FOR_ERROR(prom) 115 | 116 | // Pickle the result & send it as a promise resolution. 117 | PICKLE_PYOBJECT(result, py_result); 118 | prom.set_value(result); 119 | 120 | // Cleanup of objects created 121 | UNLOAD_PICKLE_UNPICKLE 122 | Py_DECREF(py_mod_str); 123 | Py_DECREF(py_mod); 124 | Py_DECREF(py_func); 125 | Py_DECREF(py_args); 126 | Py_DECREF(py_result); 127 | 128 | 129 | // End the interpreter 130 | Py_EndInterpreter(interp_tstate); 131 | } 132 | 133 | 134 | /** 135 | * Python function to run multiple arbitrary Python functions in parallel 136 | * using multiple threads & sub-interpreters with separate GILs. 137 | */ 138 | PyObject* subinterpreter_parallel(PyObject* self, PyObject* args) { 139 | // Load pickling functions. 140 | LOAD_PICKLE_UNPICKLE 141 | PY_LOG("original args", args) 142 | 143 | 144 | // Create vectors for storing threads, promises & futures. 145 | std::vector<std::thread> threads; 146 | std::vector<std::promise<std::string>> promises(PySequence_Length(args)); 147 | std::vector<std::future<std::string>> futures; 148 | for (auto& p: promises) { 149 | futures.push_back(p.get_future()); 150 | } 151 | 152 | // Iterate over all of the arguments. 
153 | PyObject* iterator = PyObject_GetIter(args); 154 | PyObject* item; 155 | int idx = 0; 156 | while ((item = PyIter_Next(iterator))) { 157 | PyObject* mod_name = PyList_GetItem(item, 0); 158 | PyObject* func_name = PyList_GetItem(item, 1); 159 | PyObject* py_args = PyList_GetItem(item, 2); 160 | 161 | // Convert Python objects to strings. 162 | auto mod_str = std::string(PyUnicode_AsUTF8(mod_name)); 163 | auto func_str = std::string(PyUnicode_AsUTF8(func_name)); 164 | PICKLE_PYOBJECT(arg_str, py_args); 165 | 166 | // Create thread to run idx-th request in a new thread using a new sub-interpreter with its own GIL. 167 | threads.push_back(std::thread(run_parallely_in_subinterp, mod_str, func_str, arg_str, std::move(promises[idx]))); 168 | idx++; 169 | 170 | // Perform cleanup. Note that PyList_GetItem returns borrowed references, 171 | // so mod_name, func_name & py_args must not be Py_DECREF'd here. 172 | Py_DECREF(item); 173 | 174 | 175 | } 176 | 177 | // Fetch results from each of the threads and join them to create the final result. 178 | std::ostringstream errors_buf; 179 | 180 | PyObject* list = PyList_New(threads.size()); 181 | for (size_t i = 0; i < threads.size(); ++i) { 182 | // Release GIL while waiting for thread. 183 | Py_BEGIN_ALLOW_THREADS 184 | threads[i].join(); 185 | Py_END_ALLOW_THREADS 186 | // Re-acquire GIL to perform Python operations. 187 | try { 188 | // Load the result & set it to the i-th member of the resultant list. 189 | std::string res = futures[i].get(); 190 | UNPICKLE_PYOBJECT(py_result, res.c_str()) 191 | PyList_SET_ITEM(list, i, py_result); 192 | } catch (const std::exception& e) { 193 | // If an error is received, add it to errors_buf. 194 | errors_buf << i << "th function call failed with exception: " << e.what() << "\n"; 195 | } 196 | } 197 | 198 | // Set the result as a runtime error if any of the threads failed, with all the error messages. 
199 | auto errors = errors_buf.str(); 200 | if (errors.size()) { 201 | PyErr_SetString(PyExc_RuntimeError, errors.c_str()); 202 | Py_DECREF(list); 203 | list = nullptr; 204 | } 205 | 206 | // Clear pickle functions. 207 | UNLOAD_PICKLE_UNPICKLE 208 | Py_DECREF(iterator); 209 | return list; 210 | } 211 | 212 | ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// 213 | 214 | // Module functions definition. 215 | static PyMethodDef subinterpreter_parallelsim_fns[] = { 216 | {"parallel", subinterpreter_parallel, METH_VARARGS, "Run things in parallel threads using different sub-interpreters.\nSince this uses multiple sub-interpreters, ensure that the function being run \ndoesn't require any already imported modules, or any other dependency on the current Python state.\nArguments can be multiple [module, function, args] triples.\nExample: parallel([mod1, fn1, arg1], [mod2, fn2, arg2])"}, 217 | {NULL, NULL, 0, NULL} /* Sentinel */ 218 | }; 219 | 220 | // Module definition 221 | PyMODINIT_FUNC 222 | PyInit_subinterpreter_parallelism(void) 223 | { 224 | static struct PyModuleDef moduledef = { 225 | PyModuleDef_HEAD_INIT, /* m_base */ 226 | "subinterpreter_parallelism", /* m_name */ 227 | subinterpreter_parallelsim_fns_documentation, /* m_doc */ 228 | -1, /* m_size */ 229 | subinterpreter_parallelsim_fns, /* m_methods */ 230 | nullptr, /* m_slots */ 231 | nullptr, /* m_traverse */ 232 | nullptr, /* m_clear */ 233 | nullptr /* m_free */ 234 | }; 235 | return PyModule_Create(&moduledef); 236 | } 237 | 238 | --------------------------------------------------------------------------------