├── .gitignore
├── README.md
├── database.py
├── test_parallel_writes.py
└── utils.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.sqlite
venv/
.pytest_cache/
__pycache__/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Simulating concurrent writes to sqlite3 with multiprocessing and pytest ##

I have a frequently-called process that I would like to record some metrics about. I've decided that the process will report its metrics once it has completed its main tasks.

I do not expect more than 1000 calls to this process per day.

The schema for the table is simple, the data does not need to be exposed to any other application, and I would like a minimum of connection issues (transient though they may be).

How about **sqlite3**?

**POTENTIAL PROBLEM: will this support concurrent writes?**

It *is* possible for 2 or more processes to finish and try to insert records into the database at the same time.

How many of these inserts can be handled successfully? What is the upper limit?

Let's ask this question in a more technically rigorous way: how many concurrent 1-record inserts can I run on a local sqlite3 database without throwing the `sqlite3.OperationalError: database is locked` exception?

That is: how do we test _multiple invocations_ of this function at the same time?

    def insert_row(record):
        with sqlite3.connect(PATH_TO_DB) as conn:
            c = conn.cursor()

            c.execute(
                """
                INSERT INTO messages
                (msg)
                VALUES
                (?);
                """,
                record,
            )
            conn.commit()


### multiprocessing: a Pythonic Path to Parallelization ###

Enter `multiprocessing.Pool.map()`. I'll refer to this as `map()` from here on out.

`map()` has two required arguments: _func_ and _iterable_. (There's a third, optional argument, `chunksize`, we won't touch on here.)
According to its Python 3.6 docstring:

    Apply `func` to each element in `iterable`, collecting the results
    in a list that is returned.

`insert_row` is the function that is **applied** in this case.

The iterable `args_list` is a list of one-element tuples. Each tuple contains a serialized UUID4 that will be inserted into the sqlite3 db in the `msg` column.

The call to `map()` farms the work out to the pool's processes and returns the results in a list, one item per input.

The function `insert_rows_in_parallel` looks like this:

    def insert_rows_in_parallel(args_list):
        num_procs = len(args_list)

        print(f"Spawning {num_procs} processes...")

        pool = multiprocessing.Pool(num_procs)

        # map() blocks until every insert_row call has returned.
        pool.map(insert_row, args_list)

        pool.close()
        pool.join()

        print(f"{num_procs} processes complete.")


The size of the `multiprocessing.Pool` is set by the number of elements in `args_list`.

`generate_example_rows()` is a helper function used to provide sample input data.

Creating a batch of 50 unique test records prepped for database insertion looks like:

    test_records = generate_example_rows(50)
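For reference, `generate_example_rows` lives in `utils.py` and is small enough to show in full:

    def generate_example_rows(num_records):
        return [(str(uuid.uuid4()),) for _ in range(num_records)]

So `generate_example_rows(2)` returns a list shaped like `[('<uuid4 string>',), ('<uuid4 string>',)]`, which is exactly the parameter-tuple form that `insert_row` expects.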
To simulate 50 processes reporting in simultaneously, spawn 50 parallel calls to `insert_row`:

    insert_rows_in_parallel(test_records)

There are a few other helper functions to do things like:

* create a fresh db: `database.create_table()`
* get the number of rows: `database.row_count()`

We now have the ability to:

* build and tear down the local db
* insert data, both sequentially _and_ in parallel
* verify record counts

Time to tie it all together.

### Using pytest to drive ###

`pytest` is a testing framework that is delightfully easy to get started with.

Each of my tests will be in the form of a function that lives in `test_parallel_writes.py`.

Here's a snippet to demonstrate.

    import pytest


    from database import create_table, row_count, insert_row
    from utils import insert_rows_in_parallel, generate_example_rows


    def test_adding_5_rows_in_parallel_to_new_db():
        # Create a new `messages` sqlite3 table,
        # dropping it if one already exists.
        create_table()

        assert row_count() == 0

        # Run 5 parallel instances of `insert_row`
        # by way of `insert_rows_in_parallel`
        insert_rows_in_parallel(generate_example_rows(5))

        assert row_count() == 5

If either `assert` fails, the whole test fails. Because setup and teardown are so cheap, we can verify that this works against a populated database as well.

    def test_adding_10000_rows_sequentially_then_100_rows_in_parallel():
        create_table()

        assert row_count() == 0

        for example_row in generate_example_rows(10000):
            insert_row(example_row)

        assert row_count() == 10000

        insert_rows_in_parallel(generate_example_rows(100))

        assert row_count() == 10100

This could certainly be done without `pytest`, but I find its conventions make the tests easy to follow.

### Conclusion: Will it Do the Trick? ###

I _did_ finally start hitting failures once I tried to feed it 500 new records simultaneously.

In this case, though, I think that is a tradeoff I am willing to live with.

I have increased confidence that my solution will work under a real workload. These tests also give me an idea of when I might start seeing failures. In the unlikely scenario that 500+ processes wanted to write to this database at the same time, there would be a potential for data loss.
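One knob worth noting before moving on: `sqlite3.connect()` takes a `timeout` argument (default 5 seconds) that controls how long a connection waits for a competing lock to clear before raising `sqlite3.OperationalError: database is locked`. Here's a minimal sketch of a more patient writer; the name `insert_row_patient` and the 30-second figure are my own choices, not something I benchmarked:

    import sqlite3

    from database import PATH_TO_DB

    def insert_row_patient(record):
        # Wait up to 30 seconds for other writers' locks to clear
        # before giving up with `database is locked`.
        with sqlite3.connect(PATH_TO_DB, timeout=30) as conn:
            conn.execute(
                """
                INSERT INTO messages
                (msg)
                VALUES
                (?);
                """,
                record,
            )

Raising the timeout trades latency for fewer failed inserts; it does not change sqlite3's one-writer-at-a-time rule.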
### BONUS SECTION: Enabling / Testing WAL Mode ###

After initially publishing this, I learned that sqlite3 supports a WAL (Write-Ahead Logging) mode.

Here are some reasons to give it a try, straight from the [sqlite3 documentation](https://www.sqlite.org/wal.html):

    There are advantages and disadvantages to using WAL instead of a rollback journal. Advantages include:

    * WAL is significantly faster in most scenarios.
    * WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
    * Disk I/O operations tends to be more sequential using WAL.
    * WAL uses many fewer fsync() operations and is thus less vulnerable to problems on systems where the fsync() system call is broken.

Enabling WAL mode is straightforward. I added an optional argument to `create_table`. The implementation now looks like:

    def create_table(enable_wal_mode=False):
        with sqlite3.connect(PATH_TO_DB) as conn:
            c = conn.cursor()

            c.execute("""DROP TABLE IF EXISTS messages;""")
            conn.commit()

            c.execute(
                """
                CREATE TABLE messages (
                    ts DATE DEFAULT (datetime('now','localtime')),
                    msg TEXT
                );
                """
            )
            conn.commit()

            if enable_wal_mode:
                c.execute("""pragma journal_mode=wal;""")
                conn.commit()

As I understand it, WAL mode mostly helps with concurrent _reads_. My testing bore this out: I wasn't able to successfully insert more rows in parallel using WAL mode than without. Still something to be cognizant of.
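If you want to confirm the pragma actually took effect, `journal_mode` can also be queried; it returns a single row naming the active mode. A quick sketch (the helper name is mine, not part of the repo):

    import sqlite3

    from database import PATH_TO_DB

    def current_journal_mode():
        with sqlite3.connect(PATH_TO_DB) as conn:
            # Returns e.g. 'wal' once WAL mode is enabled, or the
            # default 'delete' for a rollback-journal database.
            return conn.execute("pragma journal_mode;").fetchone()[0]

Note that WAL mode is persistent: once set, it sticks with the database file across connections, so `create_table` only needs to set it once.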
### DOUBLE-PLUS EXTRA BONUS SECTION: Parallelization using concurrent.futures ###

It's worth knowing there is _at least_ one more way of leveraging multiprocessing in the standard library: `concurrent.futures`.

I'll be using the `ProcessPoolExecutor` class to manage parallel execution like so:

    def insert_rows_in_parallel_cf(args_list):
        num_procs = len(args_list)

        print(f"Spawning {num_procs} processes...")

        with concurrent.futures.ProcessPoolExecutor(max_workers=num_procs) as executor:
            # executor.map() submits the work immediately, but exceptions
            # from the workers only surface when the results are consumed;
            # list() forces that to happen here.
            list(executor.map(insert_row, args_list))

        print(f"{num_procs} processes complete.")

That's one tidy API! Just another tool to consider when you need parallel execution.
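If you want per-insert error reporting instead of one all-or-nothing `map()`, the same module offers `submit()` plus `as_completed()`. Here's a sketch of how that could look; `insert_rows_reporting_failures` is a hypothetical name, not part of the repo:

    import concurrent.futures
    import sqlite3

    from database import insert_row

    def insert_rows_reporting_failures(args_list):
        failures = 0

        with concurrent.futures.ProcessPoolExecutor() as executor:
            futures = [executor.submit(insert_row, record) for record in args_list]

            for future in concurrent.futures.as_completed(futures):
                try:
                    # result() re-raises any exception the worker hit.
                    future.result()
                except sqlite3.OperationalError as exc:
                    failures += 1
                    print(f"Insert failed: {exc}")

        print(f"{len(args_list) - failures}/{len(args_list)} inserts succeeded.")

This keeps one failed insert from masking the others, at the cost of a little more ceremony.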
### Using the Code ###

If you would like to run this locally, clone down the repo, install `pytest` in a virtualenv, and run `pytest`.

    git clone git@github.com:joedougherty/sqlite3_concurrent_writes_test_suite.git
    cd sqlite3_concurrent_writes_test_suite
    python3 -m venv venv
    source venv/bin/activate
    python -m pip install pytest
    pytest

### Further Reading: ###

[multiprocessing.Pool.map documentation](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.map)

[https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md](https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md)

[https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/](https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/)

--------------------------------------------------------------------------------
/database.py:
--------------------------------------------------------------------------------
import sqlite3


PATH_TO_DB = "race_condition.sqlite"


def create_table(enable_wal_mode=False):
    with sqlite3.connect(PATH_TO_DB) as conn:
        c = conn.cursor()

        c.execute("""DROP TABLE IF EXISTS messages;""")
        conn.commit()

        c.execute(
            """
            CREATE TABLE messages (
                ts DATE DEFAULT (datetime('now','localtime')),
                msg TEXT
            );
            """
        )
        conn.commit()

        if enable_wal_mode:
            c.execute("""pragma journal_mode=wal;""")
            conn.commit()


def insert_row(record):
    with sqlite3.connect(PATH_TO_DB) as conn:
        c = conn.cursor()

        c.execute(
            """
            INSERT INTO messages
            (msg)
            VALUES
            (?);
            """,
            record,
        )
        conn.commit()


def row_count():
    with sqlite3.connect(PATH_TO_DB) as conn:
        c = conn.cursor()
        res = c.execute("""select count(*) from messages;""")
        return res.fetchone()[0]

--------------------------------------------------------------------------------
/test_parallel_writes.py:
--------------------------------------------------------------------------------
import pytest


from database import create_table, row_count, insert_row
from utils import (
    insert_rows_in_parallel,
    insert_rows_in_parallel_cf,
    generate_example_rows,
)


def test_adding_5_rows_in_parallel_to_new_db():
    # Create a new `messages` sqlite3 table,
    # dropping it if one already exists.
    create_table()

    assert row_count() == 0

    # Run 5 parallel instances of `insert_row`
    # by way of `insert_rows_in_parallel`
    insert_rows_in_parallel(generate_example_rows(5))

    assert row_count() == 5


def test_adding_50_rows_in_parallel_to_new_db():
    create_table()

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(50))

    assert row_count() == 50


def test_adding_250_rows_in_parallel_to_new_db():
    create_table()

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(250))

    assert row_count() == 250


def test_adding_250_rows_in_parallel_to_new_db_cf():
    create_table()

    assert row_count() == 0

    insert_rows_in_parallel_cf(generate_example_rows(250))

    assert row_count() == 250


def test_adding_50_rows_to_populated_db():
    # Row count from the previous test should remain unchanged.
    assert row_count() == 250

    insert_rows_in_parallel(generate_example_rows(50))

    assert row_count() == 300


def test_adding_10000_rows_sequentially_then_100_rows_in_parallel():
    create_table()

    assert row_count() == 0

    for example_row in generate_example_rows(10000):
        insert_row(example_row)

    assert row_count() == 10000

    insert_rows_in_parallel(generate_example_rows(100))

    assert row_count() == 10100


def test_adding_250_rows_in_parallel_to_new_db_wal_mode_enabled():
    # After creating the database, enable WAL mode.
    # https://www.sqlite.org/pragma.html#pragma_journal_mode
    create_table(enable_wal_mode=True)

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(250))

    assert row_count() == 250


def test_adding_500_rows_in_parallel_to_new_db_wal_mode_enabled():
    # After creating the database, enable WAL mode.
    # https://www.sqlite.org/pragma.html#pragma_journal_mode
    create_table(enable_wal_mode=True)

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(500))

    assert row_count() == 500

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import concurrent.futures
import multiprocessing
import uuid


from database import insert_row


def insert_rows_in_parallel(args_list):
    num_procs = len(args_list)

    print(f"Spawning {num_procs} processes...")

    pool = multiprocessing.Pool(num_procs)

    # map() blocks until every insert_row call has returned.
    pool.map(insert_row, args_list)

    pool.close()
    pool.join()

    print(f"{num_procs} processes complete.")


def insert_rows_in_parallel_cf(args_list):
    num_procs = len(args_list)

    print(f"Spawning {num_procs} processes...")

    with concurrent.futures.ProcessPoolExecutor(max_workers=num_procs) as executor:
        # executor.map() submits the work immediately, but exceptions
        # from the workers only surface when the results are consumed;
        # list() forces that to happen here.
        list(executor.map(insert_row, args_list))

    print(f"{num_procs} processes complete.")


def generate_example_rows(num_records):
    return [(str(uuid.uuid4()),) for _ in range(num_records)]

--------------------------------------------------------------------------------