├── .gitignore
├── README.md
├── database.py
├── test_parallel_writes.py
└── utils.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.sqlite
venv/
.pytest_cache/
__pycache__/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Simulating concurrent writes to sqlite3 with multiprocessing and pytest ##

I have a frequently-called process that I would like to record some metrics about. I've decided that the process will report its metrics once it has completed its main tasks.

I do not expect more than 1000 calls to this process per day.

The schema for the table is simple, the data does not need to be exposed to any other application, and I would like a minimum of connection issues (transient though they may be).

How about **sqlite3**?

**POTENTIAL PROBLEM: will this support concurrent writes?**

It *is* possible for 2 or more processes to finish and try to insert records into the database at the same time.

How many of these inserts can be handled successfully? What is the upper limit?

Let's ask this question in a more technically rigorous way: how many concurrent 1-record inserts can I run on a local sqlite3 database without throwing the `sqlite3.OperationalError: database is locked` exception?

That is: how do we test _multiple invocations_ of this function at the same time?

    def insert_row(record):
        with sqlite3.connect(PATH_TO_DB) as conn:
            c = conn.cursor()

            c.execute(
                """
                INSERT INTO messages
                (msg)
                VALUES
                (?);
                """,
                record,
            )
            conn.commit()


### multiprocessing: a Pythonic Path to Parallelization ###

Enter `multiprocessing.Pool.map()`. I'll refer to this as `map()` from here on out.

`map()` has two required arguments: _func_ and _iterable_. (There's a third, optional argument, `chunksize`, we won't touch on here.)
According to its Python 3.6 docstring:

    Apply `func` to each element in `iterable`, collecting the results
    in a list that is returned.

`insert_row` is the function that is **applied** in this case.

The iterable `args_list` is a list of one-element tuples. Each tuple contains a serialized UUID4 that will be inserted into the sqlite3 db in the `msg` column.

The call to `map()` farms the work out to the pool's processes and returns the results in a list, one item per input.

The function `insert_rows_in_parallel` looks like this:

    def insert_rows_in_parallel(args_list):
        num_procs = len(args_list)

        print(f"Spawning {num_procs} processes...")

        pool = multiprocessing.Pool(num_procs)

        # map() blocks until every insert_row call has returned.
        pool.map(insert_row, args_list)

        pool.close()
        pool.join()

        print(f"{num_procs} processes complete.")


The size of the `multiprocessing.Pool` is set by the number of elements in `args_list`.

`generate_example_rows()` is a helper function used to provide sample input data.

Creating a batch of 50 unique test records prepped for database insertion looks like:

    test_records = generate_example_rows(50)
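For reference, `generate_example_rows` lives in `utils.py` and is small enough to show in full:

    def generate_example_rows(num_records):
        return [(str(uuid.uuid4()),) for _ in range(num_records)]

So `generate_example_rows(2)` returns a list shaped like `[('<uuid4 string>',), ('<uuid4 string>',)]`, which is exactly the parameter-tuple form that `insert_row` expects.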
To simulate 50 processes reporting in simultaneously, spawn 50 parallel calls to `insert_row`:

    insert_rows_in_parallel(test_records)

There are a few other helper functions to do things like:

* create a fresh db: `database.create_table()`
* get the number of rows: `database.row_count()`

We now have the ability to:

* build and tear down the local db
* insert data, both sequentially _and_ in parallel
* verify record counts

Time to tie it all together.

### Using pytest to drive ###

`pytest` is a testing framework that is delightfully easy to get started with.

Each of my tests will be in the form of a function that lives in `test_parallel_writes.py`.

Here's a snippet to demonstrate.

    import pytest


    from database import create_table, row_count, insert_row
    from utils import insert_rows_in_parallel, generate_example_rows


    def test_adding_5_rows_in_parallel_to_new_db():
        # Create a new `messages` sqlite3 table,
        # dropping it if one already exists.
        create_table()

        assert row_count() == 0

        # Run 5 parallel instances of `insert_row`
        # by way of `insert_rows_in_parallel`
        insert_rows_in_parallel(generate_example_rows(5))

        assert row_count() == 5

If either `assert` fails, the whole test fails. Because setup and teardown are so cheap, we can verify that this works against a populated database as well.

    def test_adding_10000_rows_sequentially_then_100_rows_in_parallel():
        create_table()

        assert row_count() == 0

        for example_row in generate_example_rows(10000):
            insert_row(example_row)

        assert row_count() == 10000

        insert_rows_in_parallel(generate_example_rows(100))

        assert row_count() == 10100

This could certainly be done without `pytest`, but I find its conventions make the tests easy to follow.

### Conclusion: Will it Do the Trick? ###

I _did_ finally start hitting failures once I tried to feed it 500 new records simultaneously.

In this case, though, I think that is a tradeoff I am willing to live with.

I have increased confidence that my solution will work under a real workload. These tests also give me an idea of when I might start seeing failures. In the unlikely scenario that 500+ processes wanted to write to this database at the same time, there would be a potential for data loss.
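One knob worth noting before moving on: `sqlite3.connect()` takes a `timeout` argument (default 5 seconds) that controls how long a connection waits for a competing lock to clear before raising `sqlite3.OperationalError: database is locked`. Here's a minimal sketch of a more patient writer; the name `insert_row_patient` and the 30-second figure are my own choices, not something I benchmarked:

    import sqlite3

    from database import PATH_TO_DB

    def insert_row_patient(record):
        # Wait up to 30 seconds for other writers' locks to clear
        # before giving up with `database is locked`.
        with sqlite3.connect(PATH_TO_DB, timeout=30) as conn:
            conn.execute(
                """
                INSERT INTO messages
                (msg)
                VALUES
                (?);
                """,
                record,
            )

Raising the timeout trades latency for fewer failed inserts; it does not change sqlite3's one-writer-at-a-time rule.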
### BONUS SECTION: Enabling / Testing WAL Mode ###

After initially publishing this, I learned that sqlite3 supports a WAL (Write-Ahead Logging) mode.

Here are some reasons to give it a try, straight from the [sqlite3 documentation](https://www.sqlite.org/wal.html):

    There are advantages and disadvantages to using WAL instead of a rollback journal. Advantages include:

    * WAL is significantly faster in most scenarios.
    * WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
    * Disk I/O operations tends to be more sequential using WAL.
    * WAL uses many fewer fsync() operations and is thus less vulnerable to problems on systems where the fsync() system call is broken.

Enabling WAL mode is straightforward. I added an optional argument to `create_table`. The implementation now looks like:

    def create_table(enable_wal_mode=False):
        with sqlite3.connect(PATH_TO_DB) as conn:
            c = conn.cursor()

            c.execute("""DROP TABLE IF EXISTS messages;""")
            conn.commit()

            c.execute(
                """
                CREATE TABLE messages (
                    ts DATE DEFAULT (datetime('now','localtime')),
                    msg TEXT
                );
                """
            )
            conn.commit()

            if enable_wal_mode:
                c.execute("""pragma journal_mode=wal;""")
                conn.commit()

As I understand it, WAL mode mostly helps with concurrent _reads_. My testing bore this out: I wasn't able to successfully insert more rows in parallel using WAL mode than without. Still something to be cognizant of.
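If you want to confirm the pragma actually took effect, `journal_mode` can also be queried; it returns a single row naming the active mode. A quick sketch (the helper name is mine, not part of the repo):

    import sqlite3

    from database import PATH_TO_DB

    def current_journal_mode():
        with sqlite3.connect(PATH_TO_DB) as conn:
            # Returns e.g. 'wal' once WAL mode is enabled, or the
            # default 'delete' for a rollback-journal database.
            return conn.execute("pragma journal_mode;").fetchone()[0]

Note that WAL mode is persistent: once set, it sticks with the database file across connections, so `create_table` only needs to set it once.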
### DOUBLE-PLUS EXTRA BONUS SECTION: Parallelization using concurrent.futures ###

It's worth knowing there is _at least_ one more way of leveraging multiprocessing in the standard library: `concurrent.futures`.

I'll be using the `ProcessPoolExecutor` class to manage parallel execution like so:

    def insert_rows_in_parallel_cf(args_list):
        num_procs = len(args_list)

        print(f"Spawning {num_procs} processes...")

        with concurrent.futures.ProcessPoolExecutor(max_workers=num_procs) as executor:
            # executor.map() submits the work immediately, but exceptions
            # from the workers only surface when the results are consumed;
            # list() forces that to happen here.
            list(executor.map(insert_row, args_list))

        print(f"{num_procs} processes complete.")

That's one tidy API! Just another tool to consider when you need parallel execution.
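If you want per-insert error reporting instead of one all-or-nothing `map()`, the same module offers `submit()` plus `as_completed()`. Here's a sketch of how that could look; `insert_rows_reporting_failures` is a hypothetical name, not part of the repo:

    import concurrent.futures
    import sqlite3

    from database import insert_row

    def insert_rows_reporting_failures(args_list):
        failures = 0

        with concurrent.futures.ProcessPoolExecutor() as executor:
            futures = [executor.submit(insert_row, record) for record in args_list]

            for future in concurrent.futures.as_completed(futures):
                try:
                    # result() re-raises any exception the worker hit.
                    future.result()
                except sqlite3.OperationalError as exc:
                    failures += 1
                    print(f"Insert failed: {exc}")

        print(f"{len(args_list) - failures}/{len(args_list)} inserts succeeded.")

This keeps one failed insert from masking the others, at the cost of a little more ceremony.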
### Using the Code ###

If you would like to run this locally, clone down the repo, install `pytest` in a virtualenv, and run `pytest`.

    git clone git@github.com:joedougherty/sqlite3_concurrent_writes_test_suite.git
    cd sqlite3_concurrent_writes_test_suite
    python3 -m venv venv
    source venv/bin/activate
    python -m pip install pytest
    pytest

### Further Reading: ###

[multiprocessing.Pool.map documentation](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.map)

[https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md](https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md)

[https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/](https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/)

--------------------------------------------------------------------------------
/database.py:
--------------------------------------------------------------------------------
import sqlite3


PATH_TO_DB = "race_condition.sqlite"


def create_table(enable_wal_mode=False):
    with sqlite3.connect(PATH_TO_DB) as conn:
        c = conn.cursor()

        c.execute("""DROP TABLE IF EXISTS messages;""")
        conn.commit()

        c.execute(
            """
            CREATE TABLE messages (
                ts DATE DEFAULT (datetime('now','localtime')),
                msg TEXT
            );
            """
        )
        conn.commit()

        if enable_wal_mode:
            c.execute("""pragma journal_mode=wal;""")
            conn.commit()


def insert_row(record):
    with sqlite3.connect(PATH_TO_DB) as conn:
        c = conn.cursor()

        c.execute(
            """
            INSERT INTO messages
            (msg)
            VALUES
            (?);
            """,
            record,
        )
        conn.commit()


def row_count():
    with sqlite3.connect(PATH_TO_DB) as conn:
        c = conn.cursor()
        res = c.execute("""select count(*) from messages;""")
        return res.fetchone()[0]

--------------------------------------------------------------------------------
/test_parallel_writes.py:
--------------------------------------------------------------------------------
import pytest


from database import create_table, row_count, insert_row
from utils import (
    insert_rows_in_parallel,
    insert_rows_in_parallel_cf,
    generate_example_rows,
)


def test_adding_5_rows_in_parallel_to_new_db():
    # Create a new `messages` sqlite3 table,
    # dropping it if one already exists.
    create_table()

    assert row_count() == 0

    # Run 5 parallel instances of `insert_row`
    # by way of `insert_rows_in_parallel`
    insert_rows_in_parallel(generate_example_rows(5))

    assert row_count() == 5


def test_adding_50_rows_in_parallel_to_new_db():
    create_table()

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(50))

    assert row_count() == 50


def test_adding_250_rows_in_parallel_to_new_db():
    create_table()

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(250))

    assert row_count() == 250


def test_adding_250_rows_in_parallel_to_new_db_cf():
    create_table()

    assert row_count() == 0

    insert_rows_in_parallel_cf(generate_example_rows(250))

    assert row_count() == 250


def test_adding_50_rows_to_populated_db():
    # Row count from the previous test should remain unchanged.
    assert row_count() == 250

    insert_rows_in_parallel(generate_example_rows(50))

    assert row_count() == 300


def test_adding_10000_rows_sequentially_then_100_rows_in_parallel():
    create_table()

    assert row_count() == 0

    for example_row in generate_example_rows(10000):
        insert_row(example_row)

    assert row_count() == 10000

    insert_rows_in_parallel(generate_example_rows(100))

    assert row_count() == 10100


def test_adding_250_rows_in_parallel_to_new_db_wal_mode_enabled():
    # After creating the database, enable WAL mode.
    # https://www.sqlite.org/pragma.html#pragma_journal_mode
    create_table(enable_wal_mode=True)

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(250))

    assert row_count() == 250


def test_adding_500_rows_in_parallel_to_new_db_wal_mode_enabled():
    # After creating the database, enable WAL mode.
    # https://www.sqlite.org/pragma.html#pragma_journal_mode
    create_table(enable_wal_mode=True)

    assert row_count() == 0

    insert_rows_in_parallel(generate_example_rows(500))

    assert row_count() == 500

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import concurrent.futures
import multiprocessing
import uuid


from database import insert_row


def insert_rows_in_parallel(args_list):
    num_procs = len(args_list)

    print(f"Spawning {num_procs} processes...")

    pool = multiprocessing.Pool(num_procs)

    # map() blocks until every insert_row call has returned.
    pool.map(insert_row, args_list)

    pool.close()
    pool.join()

    print(f"{num_procs} processes complete.")


def insert_rows_in_parallel_cf(args_list):
    num_procs = len(args_list)

    print(f"Spawning {num_procs} processes...")

    with concurrent.futures.ProcessPoolExecutor(max_workers=num_procs) as executor:
        # executor.map() submits the work immediately, but exceptions
        # from the workers only surface when the results are consumed;
        # list() forces that to happen here.
        list(executor.map(insert_row, args_list))

    print(f"{num_procs} processes complete.")


def generate_example_rows(num_records):
    return [(str(uuid.uuid4()),) for _ in range(num_records)]

--------------------------------------------------------------------------------