├── JustInTimePaxos.pdf
├── JustInTimePaxos.tla
├── LICENSE
└── README.md

/JustInTimePaxos.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kuujo/just-in-time-paxos/1b5e8fb2c6b7067aa9617a7a2ae0b156d7d8baef/JustInTimePaxos.pdf

--------------------------------------------------------------------------------
/JustInTimePaxos.tla:
--------------------------------------------------------------------------------
-------------------------- MODULE JustInTimePaxos --------------------------

(*
Defines the Just-In-Time Paxos (JITPaxos) protocol. JITPaxos is a
variant of the Paxos consensus protocol designed for environments where
process clocks are synchronized with high precision. The protocol
relies on synchronized clocks to establish a global total ordering of
events, avoiding coordination between replicas when requests arrive in
the expected order, and reconciling requests only when they arrive out
of order. This allows JITPaxos to reach consensus within a single
round trip in the normal case, falling back to traditional replication
strategies only when required.
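
For illustration only (a sketch that is not part of the original text; the
operator name InOrder is hypothetical and is not defined or used by this
module), the chronological-order test described above can be read as:

    InOrder(log, ts) == \A i \in DOMAIN log : log[i].timestamp < ts

An empty log accepts any timestamp because the quantifier ranges over an
empty domain. HandleClientRequest below makes this comparison against the
highest timestamp in the replica's log and treats an equal timestamp as a
duplicate.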

Summary:
`^ \begin{describe}{*}
\item[*] View-based protocol
\item[*] Views are identified by a monotonically increasing, globally
unique identifier
\item[*] Each view assigns a primary plus a set of replicas that form
the quorum
\item[*] Clients timestamp each request and send in parallel to all
replicas in the quorum
\item[*] Each replica appends requests to a log in chronological order,
and the primary executes requests
\item[*] If a request is received out of chronological order it is
rejected
\item[*] Replies to clients include a checksum of the log on each
replica
\item[*] If the client receives a reply indicating the request was
received out of chronological order or if a checksum does not match
the primary's checksum, the client initiates a reconciliation protocol
\item[*] To reconcile inconsistencies in the log, replicas pull logs
from the primary, and once logs have been reconciled the original
request is acknowledged
\item[*] Once the client receives matching acknowledgements from all
the replicas in the quorum a request is committed
\item[*] View changes select the most recent log from a majority of the
replicas to ensure the initial view log contains all committed requests
from prior views
\end{describe} ^'

JITPaxos uses a view-based approach to elect a primary and reconcile
logs across views. Views are identified by a monotonically increasing,
globally unique view ID. Each view deterministically assigns a quorum,
and within the quorum a primary replica responsible for executing
client requests and reconciling inconsistencies in the logs of the
remaining replicas. JITPaxos replicas do not coordinate with each
other in the normal case. Clients send timestamped requests in
parallel to every replica in the view's quorum.
When a replica receives a client request, if the request is received in
chronological order, it's appended to the replica's log. If a request is
received out of order (i.e. the request timestamp is less than the last
timestamp in the replica's log), the request is rejected by the
replica. Clients are responsible for identifying inconsistencies in
the quorum's logs and initiating the reconciliation protocol. To help
clients identify inconsistencies, replicas return a checksum
representing the contents of the log up to the request point with each
client reply. If a client's request is received out of chronological
order, or if the checksums provided by the quorum do not match, the
client must initiate the reconciliation protocol to reconcile the
inconsistencies in the quorum's logs.

When requests are received out-of-order, the reconciliation protocol
works to re-order requests using the view's primary as a reference.
When a client initiates the reconciliation protocol for an inconsistent
replica, the replica stops accepting client requests and sends a repair
request to the primary. The primary responds with the subset of the
log not yet reconciled on the replica, and the replica replaces the
out-of-order entries in its log. Once the replica's log has been
reconciled with the primary, it can acknowledge the reconciled request
and begin accepting requests again. Once a client has reconciled all
the divergent replicas and has received acknowledgement from each of
the replicas in the quorum, the request can be committed.

View primaries and quorums are evenly distributed amongst view IDs.
View changes can be initiated to change the primary or the set of
replicas in the quorum. When a view change is initiated, each replica
sends its log to the primary for the initiated view.
Once the primary 79 | has received logs from a majority of replicas, it initializes the view 80 | with the log from the most recent in-sync replica, broadcasting the log 81 | to its peers. The use of quorums to determine both the commitment of a 82 | request and the initialization of new views ensures that each view log 83 | contains all prior committed requests. 84 | *) 85 | 86 | EXTENDS Naturals, Reals, Sequences, FiniteSets, TLC 87 | 88 | \* The set of JITPaxos replicas 89 | CONSTANT Replicas 90 | 91 | \* The set of JITPaxos clients 92 | CONSTANT Clients 93 | 94 | \* The set of possible values 95 | CONSTANT Values 96 | 97 | \* An empty value 98 | CONSTANT Nil 99 | 100 | \* Request/response types 101 | CONSTANTS 102 | MClientRequest, 103 | MClientReply, 104 | MReconcileRequest, 105 | MReconcileReply, 106 | MRepairRequest, 107 | MRepairReply, 108 | MViewChangeRequest, 109 | MViewChangeReply, 110 | MStartViewRequest 111 | 112 | \* Replica statuses 113 | CONSTANTS 114 | SInSync, 115 | SRepair, 116 | SViewChange 117 | ---- 118 | 119 | (***************************************************************************) 120 | (* This section specifies the message types and schemas used in this spec. *) 121 | (* *) 122 | (* ReqIDs == [c \in Clients |-> i \in (1..)] *) 123 | (* *) 124 | (* ViewIDs == [r \in Replicas |-> i \in (1..)] *) 125 | (* *) 126 | (* Logs == [r \in Replicas |-> [i \in (1..) |-> Value]] *) 127 | (* *) 128 | (* Indexes == [r \in Replicas |-> i \in (1..)] *) 129 | (* *) 130 | (* Timestamps == [r \in Replicas |-> i \in (1..)] *) 131 | (* *) 132 | (* Checksums == [r \in Replicas |-> [i \in (1..) 
|-> t \in Timestamps]] *) 133 | (* *) 134 | (* ClientRequest *) 135 | (* [ src |-> c \in Clients, *) 136 | (* dest |-> r \in Replicas, *) 137 | (* type |-> MClientRequest, *) 138 | (* viewID |-> i \in ViewIDs, *) 139 | (* reqID |-> i \in ReqIDs, *) 140 | (* value |-> v \in Values, *) 141 | (* timestamp |-> t \in Timestamps ] *) 142 | (* *) 143 | (* ClientReply *) 144 | (* [ src |-> r \in Replicas, *) 145 | (* dest |-> c \in Clients, *) 146 | (* req |-> (ClientRequest), *) 147 | (* type |-> MClientReply, *) 148 | (* viewID |-> i \in ViewIDs, *) 149 | (* index |-> i \in Indexes, *) 150 | (* checksum |-> c \in Checksums, *) 151 | (* value |-> v \in Values, *) 152 | (* timestamp |-> t \in Timestamps, *) 153 | (* succeeded |-> TRUE \/ FALSE ] *) 154 | (* *) 155 | (* ReconcileRequest *) 156 | (* [ src |-> c \in Clients, *) 157 | (* dest |-> r \in Replicas, *) 158 | (* type |-> MReconcileRequest, *) 159 | (* viewID |-> i \in ViewIDs, *) 160 | (* reqID |-> i \in ReqIDs, *) 161 | (* index |-> i \in Indexes ] *) 162 | (* *) 163 | (* ReconcileReply *) 164 | (* [ src |-> r \in Replicas, *) 165 | (* dest |-> c \in Clients, *) 166 | (* req |-> (ClientRequest), *) 167 | (* type |-> MReconcileReply, *) 168 | (* viewID |-> i \in ViewIDs, *) 169 | (* index |-> i \in Indexes, *) 170 | (* checksum |-> c \in Checksums, *) 171 | (* value |-> v \in Values, *) 172 | (* timestamp |-> t \in Timestamps, *) 173 | (* succeeded |-> TRUE \/ FALSE ] *) 174 | (* *) 175 | (* RepairRequest *) 176 | (* [ src |-> r \in Replicas, *) 177 | (* dest |-> r \in Replicas, *) 178 | (* req |-> (ClientRequest), *) 179 | (* type |-> MRepairRequest, *) 180 | (* viewID |-> i \in ViewIDs, *) 181 | (* index |-> i \in Indexes ] *) 182 | (* *) 183 | (* RepairReply *) 184 | (* [ src |-> r \in Replicas, *) 185 | (* dest |-> r \in Replicas, *) 186 | (* req |-> (ClientRequest), *) 187 | (* type |-> MRepairReply, *) 188 | (* viewID |-> i \in ViewIDs, *) 189 | (* index |-> i \in Indexes, *) 190 | (* log |-> l \in Logs ] *) 
(*                                                                         *)
(* ViewChangeRequest                                                       *)
(*   [ src    |-> r \in Replicas,                                          *)
(*     dest   |-> r \in Replicas,                                          *)
(*     type   |-> MViewChangeRequest,                                      *)
(*     viewID |-> i \in ViewIDs ]                                          *)
(*                                                                         *)
(* ViewChangeReply                                                         *)
(*   [ src       |-> r \in Replicas,                                       *)
(*     dest      |-> r \in Replicas,                                       *)
(*     type      |-> MViewChangeReply,                                     *)
(*     viewID    |-> i \in ViewIDs,                                        *)
(*     logViewID |-> i \in ViewIDs,                                        *)
(*     log       |-> l \in Logs ]                                          *)
(*                                                                         *)
(* StartViewRequest                                                        *)
(*   [ src    |-> r \in Replicas,                                          *)
(*     dest   |-> r \in Replicas,                                          *)
(*     type   |-> MStartViewRequest,                                       *)
(*     viewID |-> i \in ViewIDs,                                           *)
(*     log    |-> l \in Logs ]                                             *)
(***************************************************************************)

----

\* The set of all messages on the network
VARIABLE messages

\* The total number of messages sent
VARIABLE messageCount

\* The total number of steps executed
VARIABLE stepCount

messageVars == <<messages, messageCount, stepCount>>

(* Local client state *)

\* Strictly increasing representation of synchronized time
VARIABLE cTime

\* The highest known view ID for a client
VARIABLE cViewID

\* Client request IDs
VARIABLE cReqID

\* A client response buffer
VARIABLE cReps

\* A set of all commits - used for model checking
VARIABLE cCommits

clientVars == <<cTime, cViewID, cReqID, cReps, cCommits>>

(* Local replica state *)

\* The current status of a replica
VARIABLE rStatus

\* The current view ID for a replica
VARIABLE rViewID

\* A replica's commit log
VARIABLE rLog

\* A replica's sync index
VARIABLE rSyncIndex

\* The view ID for the log
VARIABLE rLogViewID

\* The set of view change replies
VARIABLE rViewChangeReps

replicaVars == <<rStatus, rViewID, rLog, rSyncIndex, rLogViewID, rViewChangeReps>>

vars == <<messageVars, clientVars, replicaVars>>

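\* A minimal type-invariant sketch for the variables declared above. It is
\* an illustrative addition rather than part of the original spec: the name
\* TypeOK and the chosen bounds are assumptions read off Init and the
\* actions later in this module, and they have not been model checked here.
TypeOK ==
    /\ messageCount \in Nat
    /\ stepCount \in Nat
    /\ cTime \in Nat
    /\ cViewID \in [Clients -> Nat]
    /\ cReqID \in [Clients -> Nat]
    /\ rStatus \in [Replicas -> {SInSync, SRepair, SViewChange}]
    /\ rViewID \in [Replicas -> Nat]
    /\ rSyncIndex \in [Replicas -> Nat]
    /\ rLogViewID \in [Replicas -> Nat]

\* To try it with TLC, TypeOK could be listed as an INVARIANT in the model
\* configuration alongside any other invariants being checked.
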
----

(*
This section provides utilities for implementing the spec.
*)

\* Creates a sequence from set 'S'
RECURSIVE SeqFromSet(_)
SeqFromSet(S) ==
    IF S = {} THEN
        << >>
    ELSE LET x == CHOOSE x \in S : TRUE
         IN  << x >> \o SeqFromSet(S \ {x})

\* Folds 'Op' over the elements of set 'S', starting from 'value'
RECURSIVE SetReduce(_, _, _)
SetReduce(Op(_, _), S, value) ==
    IF S = {} THEN
        value
    ELSE
        LET s == CHOOSE s \in S : TRUE
        IN  SetReduce(Op, S \ {s}, Op(s, value))

\* Computes the greatest value in set 'S'
Max(S) == CHOOSE x \in S : \A y \in S : x >= y

\* Computes the sum of numbers in set 'S'
Sum(S) == LET _op(a, b) == a + b
          IN  SetReduce(_op, S, 0)

\* The values of a sequence
Range(s) == {s[i] : i \in DOMAIN s}

----

(*
This section provides helpers for the protocol.
*)

\* A fixed sequence of replicas (the order is whatever CHOOSE selects, not a sort)
replicas == SeqFromSet(Replicas)

\* The primary index for view 'v'
PrimaryIndex(v) == (v%Len(replicas)) + (IF v >= Len(replicas) THEN 1 ELSE 0)

\* The primary for view 'v'
Primary(v) == replicas[PrimaryIndex(v)]

\* Quorum is the quorum for a given view
Quorum(v) ==
    LET
        quorumSize == Len(replicas) \div 2 + 1
        index(i)   == PrimaryIndex(v) + (i - 1)
        member(i)  == IF index(i) > Len(replicas) THEN replicas[index(i)%Len(replicas)] ELSE replicas[index(i)]
    IN
        {member(i) : i \in 1..quorumSize}

\* A boolean indicating whether the given set is a quorum
IsQuorum(S) == Cardinality(S) * 2 >= Cardinality(Replicas)

\* A boolean indicating whether the given set is a quorum that includes the given replica
IsLocalQuorum(r, S) == IsQuorum(S) /\ r \in S

----

(*
This section models the network.

Messages between processes are unordered and can be dropped by the network
at any time.
*)

\* Send a set of messages
Sends(ms) ==
    /\ messages' = messages \cup ms
    /\ messageCount' = messageCount + Cardinality(ms)
    /\ stepCount' = stepCount + 1

\* Send a message
Send(m) == Sends({m})

\* Ack a message
Ack(m) ==
    /\ messages' = messages \ {m}
    /\ messageCount' = messageCount + 1
    /\ stepCount' = stepCount + 1

\* Ack a message and send a set of messages
AckAndSends(m, ms) ==
    /\ messages' = (messages \cup ms) \ {m}
    /\ messageCount' = messageCount + Cardinality(ms)
    /\ stepCount' = stepCount + 1

\* Ack and send a message
AckAndSend(m, n) == AckAndSends(m, {n})

\* Reply to a message with a set of responses
Replies(req, reps) == AckAndSends(req, reps)

\* Reply to a message
Reply(req, resp) == AckAndSend(req, resp)

----

(*
This section models JITPaxos clients.
*)

\* Client 'c' sends value 'v' to every replica in the quorum of its current view
ClientRequest(c, v) ==
    /\ cTime' = cTime + 1
    /\ cReqID' = [cReqID EXCEPT ![c] = cReqID[c] + 1]
    /\ Sends({[src       |-> c,
               dest      |-> r,
               type      |-> MClientRequest,
               viewID    |-> cViewID[c],
               reqID     |-> cReqID'[c],
               value     |-> v,
               timestamp |-> cTime'] : r \in Quorum(cViewID[c])})
    /\ UNCHANGED <<cViewID, cReps, cCommits, replicaVars>>

\* Client 'c' handles a response 'm' from replica 'r'
HandleClientReply(c, r, m) ==
    \* If the reply view ID does not match the request view ID, update the client's view.
393 | /\ \/ /\ m.viewID # m.req.viewID 394 | /\ \/ /\ cViewID[c] < m.viewID 395 | /\ cViewID' = [cViewID EXCEPT ![c] = m.viewID] 396 | \/ /\ cViewID[c] >= m.viewID 397 | /\ UNCHANGED <> 398 | /\ Ack(m) 399 | /\ UNCHANGED <> 400 | \* If the request and reply views match and the reply view matches the client's view, 401 | \* aggregate the replies for the associated client request. 402 | \/ /\ m.viewID = m.req.viewID 403 | /\ m.viewID = cViewID[c] 404 | /\ \/ /\ m.succeeded 405 | /\ cReps' = [cReps EXCEPT ![c] = 406 | (cReps[c] \ {n \in cReps[c] : /\ n.src = m.src 407 | /\ n.viewID = cViewID[c] 408 | /\ n.req.reqID = m.req.reqID 409 | /\ ~n.succeeded}) \cup {m}] 410 | \/ /\ ~m.succeeded 411 | /\ ~\E n \in cReps[c] : /\ n.src = m.src 412 | /\ n.viewID = cViewID[c] 413 | /\ n.req.reqID = m.req.reqID 414 | /\ n.succeeded 415 | /\ cReps' = [cReps EXCEPT ![c] = cReps[c] \cup {m}] 416 | /\ LET reps == {n \in cReps'[c] : /\ n.viewID = cViewID[c] 417 | /\ n.req.reqID = m.req.reqID} 418 | isQuorum == {n.src : n \in {n \in reps : n.succeeded}} = Quorum(cViewID[c]) 419 | isCommitted == /\ \A n \in reps : n.succeeded 420 | /\ Cardinality({n.checksum : n \in reps}) = 1 421 | hasPrimary == \E n \in reps : n.src = Primary(cViewID[c]) /\ n.succeeded 422 | IN 423 | \* If a quorum of successful replies have been received and the checksums 424 | \* match, add the primary reply to commits. 425 | \/ /\ isQuorum 426 | /\ isCommitted 427 | /\ LET commit == CHOOSE n \in reps : n.src = Primary(cViewID[c]) 428 | IN cCommits' = [cCommits EXCEPT ![c] = cCommits[c] \cup {commit}] 429 | /\ Ack(m) 430 | \* If some reply failed or was returned with an incorrect checksum, 431 | \* send a ReconcileRequest to the inconsistent node to force it to 432 | \* reconcile its log with the primary's log. 
433 | \/ /\ ~isCommitted 434 | /\ \/ /\ hasPrimary 435 | /\ LET primaryRep == CHOOSE n \in reps : /\ n.src = Primary(cViewID[c]) 436 | /\ n.succeeded 437 | retryReps == {n \in reps : 438 | /\ n.src # Primary(cViewID[c]) 439 | /\ n.checksum # primaryRep.checksum} 440 | IN AckAndSends(m, {[src |-> c, 441 | dest |-> r, 442 | type |-> MReconcileRequest, 443 | viewID |-> cViewID[c], 444 | reqID |-> m.req.reqID, 445 | index |-> primaryRep.index] : n \in retryReps}) 446 | \/ /\ ~hasPrimary 447 | /\ Ack(m) 448 | /\ UNCHANGED <> 449 | \* If a quorum has not yet been reached, wait for more replies. 450 | \/ /\ ~isQuorum 451 | /\ isCommitted 452 | /\ Ack(m) 453 | /\ UNCHANGED <> 454 | /\ UNCHANGED <> 455 | /\ UNCHANGED <> 456 | 457 | HandleReconcileReply(c, r, m) == HandleClientReply(c, r, m) 458 | 459 | ---- 460 | 461 | (* 462 | This section models JITPaxos replicas. 463 | *) 464 | 465 | \* Replica 'r' handles client 'c' request 'm' 466 | HandleClientRequest(r, c, m) == 467 | \* Client requests can only be handled if the replica is in-sync. 468 | /\ rStatus[r] = SInSync 469 | \* If the client's view matches the replica's view, process the client's request. 470 | /\ \/ /\ m.viewID = rViewID[r] 471 | /\ LET lastTimestamp == Max({rLog[r][i].timestamp : i \in DOMAIN rLog[r]} \cup {0}) 472 | IN 473 | \* If the request timestamp is greater than the highest log timestamp, 474 | \* append the entry to the log and return a successful response with 475 | \* the appended entry index. 
476 | /\ \/ /\ m.timestamp > lastTimestamp 477 | /\ rLog' = [rLog EXCEPT ![r] = 478 | Append(rLog[r], [value |-> m.value, 479 | timestamp |-> m.timestamp])] 480 | /\ Reply(m, [src |-> r, 481 | dest |-> c, 482 | req |-> m, 483 | type |-> MClientReply, 484 | viewID |-> rViewID[r], 485 | index |-> Len(rLog'[r]), 486 | checksum |-> rLog'[r], 487 | value |-> m.value, 488 | timestamp |-> m.timestamp, 489 | succeeded |-> TRUE]) 490 | \* If the request timestamp matches the highest log timestamp, treat the 491 | \* request as a duplicate. Return a successful response indicating the 492 | \* entry was appended. 493 | \/ /\ m.timestamp = lastTimestamp 494 | /\ Reply(m, [src |-> r, 495 | dest |-> c, 496 | req |-> m, 497 | type |-> MClientReply, 498 | viewID |-> rViewID[r], 499 | index |-> Len(rLog[r]), 500 | checksum |-> rLog[r], 501 | value |-> m.value, 502 | timestamp |-> m.timestamp, 503 | succeeded |-> TRUE]) 504 | /\ UNCHANGED <> 505 | \* If the request timestamp is less than the highest log timestamp, 506 | \* reject the request. 507 | \/ /\ m.timestamp < lastTimestamp 508 | /\ Reply(m, [src |-> r, 509 | dest |-> c, 510 | req |-> m, 511 | type |-> MClientReply, 512 | viewID |-> rViewID[r], 513 | index |-> Len(rLog[r]), 514 | checksum |-> rLog[r], 515 | value |-> m.value, 516 | timestamp |-> m.timestamp, 517 | succeeded |-> FALSE]) 518 | /\ UNCHANGED <> 519 | /\ UNCHANGED <> 520 | \* If the client's view is greater than the replica's view, reject the client's 521 | \* request with the outdated view ID and enter the view change protocol. 
522 | \/ /\ m.viewID > rViewID[r] 523 | /\ rViewID' = [rViewID EXCEPT ![r] = m.viewID] 524 | /\ rStatus' = [rStatus EXCEPT ![r] = SViewChange] 525 | /\ rViewChangeReps' = [rViewChangeReps EXCEPT ![r] = {}] 526 | /\ Replies(m, {[src |-> r, 527 | dest |-> c, 528 | req |-> m, 529 | type |-> MClientReply, 530 | viewID |-> rViewID[r], 531 | succeeded |-> FALSE], 532 | [src |-> r, 533 | dest |-> Primary(m.viewID), 534 | type |-> MViewChangeReply, 535 | viewID |-> m.viewID, 536 | logViewID |-> rLogViewID[r], 537 | log |-> rLog[r]]}) 538 | /\ UNCHANGED <> 539 | \* If the client's view is less than the replica's view, reject the client's request 540 | \* with the updated view ID to force the client to retry. 541 | \/ /\ m.viewID < rViewID[r] 542 | /\ Reply(m, [src |-> r, 543 | dest |-> c, 544 | req |-> m, 545 | type |-> MClientReply, 546 | viewID |-> rViewID[r], 547 | succeeded |-> FALSE]) 548 | /\ UNCHANGED <> 549 | /\ UNCHANGED <> 550 | 551 | HandleReconcileRequest(r, c, m) == 552 | /\ rStatus[r] = SInSync 553 | /\ rViewID[r] = m.viewID 554 | /\ \/ /\ rSyncIndex[r] >= m.index 555 | /\ Reply(m, [src |-> r, 556 | dest |-> c, 557 | req |-> m, 558 | type |-> MReconcileReply, 559 | viewID |-> rViewID[r], 560 | index |-> m.index, 561 | checksum |-> [i \in 1..m.index |-> rLog[r][i]], 562 | value |-> rLog[r][m.index].value, 563 | timestamp |-> rLog[r][m.index].timestamp, 564 | succeeded |-> TRUE]) 565 | /\ UNCHANGED <> 566 | \/ /\ rSyncIndex[r] < m.index 567 | /\ Primary(rViewID[r]) # r 568 | /\ rStatus' = [rStatus EXCEPT ![r] = SRepair] 569 | /\ AckAndSend(m, [src |-> r, 570 | dest |-> Primary(rViewID[r]), 571 | req |-> m, 572 | type |-> MRepairRequest, 573 | viewID |-> rViewID[r], 574 | index |-> m.index]) 575 | /\ UNCHANGED <> 576 | 577 | HandleRepairRequest(r, s, m) == 578 | /\ rStatus[r] = SInSync 579 | /\ rViewID[r] = m.viewID 580 | /\ Primary(rViewID[r]) = r 581 | /\ Reply(m, [src |-> r, 582 | dest |-> s, 583 | req |-> m.req, 584 | type |-> MRepairReply, 585 | viewID |-> 
rViewID[r], 586 | index |-> m.index, 587 | log |-> [i \in 1..m.index |-> rLog[r][i]]]) 588 | /\ UNCHANGED <> 589 | 590 | HandleRepairReply(r, s, m) == 591 | /\ rStatus[r] = SRepair 592 | /\ rViewID[r] = m.viewID 593 | /\ rStatus' = [rStatus EXCEPT ![r] = SInSync] 594 | /\ rLog' = [rLog EXCEPT ![r] = m.log \o SubSeq(rLog[r], Len(m.log), Len(rLog[r]))] 595 | /\ rSyncIndex' = [rSyncIndex EXCEPT ![r] = Len(rLog'[r])] 596 | /\ Reply(m, [src |-> r, 597 | dest |-> m.req.src, 598 | req |-> m.req, 599 | type |-> MReconcileReply, 600 | viewID |-> rViewID[r], 601 | index |-> m.index, 602 | checksum |-> m.log, 603 | value |-> m.log[m.index].value, 604 | timestamp |-> m.log[m.index].timestamp, 605 | succeeded |-> TRUE]) 606 | /\ UNCHANGED <> 607 | 608 | \* Replica 'r' requests a view change 609 | ChangeView(r) == 610 | /\ Sends({[src |-> r, 611 | dest |-> d, 612 | type |-> MViewChangeRequest, 613 | viewID |-> rViewID[r] + 1] : d \in Replicas}) 614 | /\ UNCHANGED <> 615 | 616 | \* Replica 'r' handles replica 's' view change request 'm' 617 | HandleViewChangeRequest(r, s, m) == 618 | /\ \/ /\ rViewID[r] < m.viewID 619 | /\ rViewID' = [rViewID EXCEPT ![r] = m.viewID] 620 | /\ rStatus' = [rStatus EXCEPT ![r] = SViewChange] 621 | /\ rViewChangeReps' = [rViewChangeReps EXCEPT ![r] = {}] 622 | /\ Reply(m, [src |-> r, 623 | dest |-> Primary(m.viewID), 624 | type |-> MViewChangeReply, 625 | viewID |-> m.viewID, 626 | logViewID |-> rLogViewID[r], 627 | log |-> rLog[r]]) 628 | \/ /\ rViewID[r] >= m.viewID 629 | /\ Ack(m) 630 | /\ UNCHANGED <> 631 | /\ UNCHANGED <> 632 | 633 | \* Replica 'r' handles replica 's' view change reply 'm' 634 | HandleViewChangeReply(r, s, m) == 635 | \* The view change protocol is run by the primary for the view. 
636 | /\ Primary(m.viewID) = r 637 | /\ rViewID[r] = m.viewID 638 | /\ rStatus[r] = SViewChange 639 | /\ rViewChangeReps' = [rViewChangeReps EXCEPT ![r] = rViewChangeReps[r] \cup {m}] 640 | /\ LET viewChanges == {v \in rViewChangeReps'[r] : v.viewID = rViewID[r]} 641 | IN 642 | \* In order to ensure the new view is initialized with the latest view, 643 | \* a quorum of view change replies must be received to guarantee the last 644 | \* activated view is present in the set of replies. 645 | \* If view change replies have been received from a majority of the replicas, 646 | \* initialize the view using the log from the highest activated view. 647 | \/ /\ IsLocalQuorum(r, {v.src : v \in viewChanges}) 648 | /\ LET latestViewID == Max({v.logViewID : v \in viewChanges}) 649 | latestChange == CHOOSE v \in viewChanges : 650 | /\ v.logViewID = latestViewID 651 | /\ v.src \in Quorum(latestViewID) 652 | IN AckAndSends(m, {[src |-> r, 653 | dest |-> d, 654 | type |-> MStartViewRequest, 655 | viewID |-> rViewID[r], 656 | log |-> latestChange.log] : d \in Replicas}) 657 | \* If view change replies have not yet been received from a quorum, record 658 | \* the view change reply and discard the message. 659 | \/ /\ ~IsLocalQuorum(r, {v.src : v \in viewChanges}) 660 | /\ Ack(m) 661 | /\ UNCHANGED <> 662 | 663 | \* Replica 'r' handles replica 's' start view request 'm' 664 | HandleStartViewRequest(r, s, m) == 665 | \* To activate a view, the replica must either not know of the view or already 666 | \* be participating in the view change protocol for the view. 667 | /\ \/ rViewID[r] < m.viewID 668 | \/ /\ rViewID[r] = m.viewID 669 | /\ rStatus[r] = SViewChange 670 | \* If the replica is part of the quorum for the activated view, update the log 671 | \* and record the activated view for use in the view change protocol. 
672 | /\ \/ /\ r \in Quorum(m.viewID) 673 | /\ rLog' = [rLog EXCEPT ![r] = m.log] 674 | /\ rLogViewID' = [rLogViewID EXCEPT ![r] = m.viewID] 675 | /\ rSyncIndex' = [rSyncIndex EXCEPT ![r] = Len(m.log)] 676 | \/ /\ r \notin Quorum(m.viewID) 677 | /\ UNCHANGED <> 678 | \* Update the replica's view ID and status and clean up view change state. 679 | /\ rViewID' = [rViewID EXCEPT ![r] = m.viewID] 680 | /\ rStatus' = [rStatus EXCEPT ![r] = SInSync] 681 | /\ LET viewChanges == {v \in rViewChangeReps[r] : v.viewID = rViewID[r]} 682 | IN rViewChangeReps' = [rViewChangeReps EXCEPT ![r] = rViewChangeReps[r] \ viewChanges] 683 | /\ Ack(m) 684 | /\ UNCHANGED <> 685 | 686 | ---- 687 | 688 | InitMessageVars == 689 | /\ messages = {} 690 | /\ messageCount = 0 691 | /\ stepCount = 0 692 | 693 | InitClientVars == 694 | /\ cTime = 0 695 | /\ cViewID = [c \in Clients |-> 1] 696 | /\ cReqID = [c \in Clients |-> 0] 697 | /\ cReps = [c \in Clients |-> {}] 698 | /\ cCommits = [c \in Clients |-> {}] 699 | 700 | InitReplicaVars == 701 | /\ rStatus = [r \in Replicas |-> SInSync] 702 | /\ rViewID = [r \in Replicas |-> 1] 703 | /\ rLog = [r \in Replicas |-> <<>>] 704 | /\ rSyncIndex = [r \in Replicas |-> 0] 705 | /\ rLogViewID = [r \in Replicas |-> 1] 706 | /\ rViewChangeReps = [r \in Replicas |-> {}] 707 | 708 | Init == 709 | /\ InitMessageVars 710 | /\ InitClientVars 711 | /\ InitReplicaVars 712 | 713 | ---- 714 | 715 | (* 716 | This section specifies the invariants for the protocol. 717 | *) 718 | 719 | \* The linearizability invariant verifies that once a client has received matching 720 | \* acks from a quorum and committed a value, thereafter the value is always present 721 | \* at the committed index on all in-sync replicas. 
722 | Linearizability == 723 | \A c \in Clients : 724 | \A e \in cCommits[c] : 725 | ~\E r \in Replicas : 726 | /\ rStatus[r] = SInSync 727 | /\ rViewID[r] >= e.viewID 728 | /\ r \in Quorum(rViewID[r]) 729 | /\ rLog[r][e.index].value # e.value 730 | 731 | ---- 732 | 733 | NextClientRequest == 734 | \E c \in Clients : 735 | \E v \in Values : 736 | ClientRequest(c, v) 737 | 738 | NextChangeView == 739 | \E r \in Replicas : 740 | ChangeView(r) 741 | 742 | NextHandleClientRequest == 743 | \E m \in messages : 744 | /\ m.type = MClientRequest 745 | /\ HandleClientRequest(m.dest, m.src, m) 746 | 747 | NextHandleClientReply == 748 | \E m \in messages : 749 | /\ m.type = MClientReply 750 | /\ HandleClientReply(m.dest, m.src, m) 751 | 752 | NextHandleReconcileRequest == 753 | \E m \in messages : 754 | /\ m.type = MReconcileRequest 755 | /\ HandleReconcileRequest(m.dest, m.src, m) 756 | 757 | NextHandleReconcileReply == 758 | \E m \in messages : 759 | /\ m.type = MReconcileReply 760 | /\ HandleReconcileReply(m.dest, m.src, m) 761 | 762 | NextHandleRepairRequest == 763 | \E m \in messages : 764 | /\ m.type = MRepairRequest 765 | /\ HandleRepairRequest(m.dest, m.src, m) 766 | 767 | NextHandleRepairReply == 768 | \E m \in messages : 769 | /\ m.type = MRepairReply 770 | /\ HandleRepairReply(m.dest, m.src, m) 771 | 772 | NextHandleViewChangeRequest == 773 | \E m \in messages : 774 | /\ m.type = MViewChangeRequest 775 | /\ HandleViewChangeRequest(m.dest, m.src, m) 776 | 777 | NextHandleViewChangeReply == 778 | \E m \in messages : 779 | /\ m.type = MViewChangeReply 780 | /\ HandleViewChangeReply(m.dest, m.src, m) 781 | 782 | NextHandleStartViewRequest == 783 | \E m \in messages : 784 | /\ m.type = MStartViewRequest 785 | /\ HandleStartViewRequest(m.dest, m.src, m) 786 | 787 | NextDropMessage == 788 | \E m \in messages : 789 | /\ Ack(m) 790 | /\ UNCHANGED <> 791 | 792 | Next == 793 | \/ NextClientRequest 794 | \/ NextChangeView 795 | \/ NextHandleClientRequest 796 | \/ 
NextHandleClientReply 797 | \/ NextHandleReconcileRequest 798 | \/ NextHandleReconcileReply 799 | \/ NextHandleRepairRequest 800 | \/ NextHandleRepairReply 801 | \/ NextHandleViewChangeRequest 802 | \/ NextHandleViewChangeReply 803 | \/ NextHandleStartViewRequest 804 | \/ NextDropMessage 805 | 806 | Spec == Init /\ [][Next]_vars 807 | 808 | ============================================================================= 809 | \* Modification History 810 | \* Last modified Thu Oct 01 11:01:54 PDT 2020 by jordanhalterman 811 | \* Created Fri Sep 18 22:45:21 PDT 2020 by jordanhalterman 812 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # JITPaxos 2 | 3 | This project provides a [TLA+] specification for the Just-In-Time Paxos consensus protocol. 4 | 5 | JITPaxos is a variant of the Paxos consensus protocol designed for environments where 6 | process clocks are synchronized with high precision. The protocol 7 | relies on synchronized clocks to establish a global total ordering of 8 | events, avoiding coordination between replicas when requests arrive in 9 | the expected order, and reconciling requests only when they arrive out 10 | of order. This allows JITPaxos to reach consensus within a single 11 | round trip in the normal case, falling back to traditional replication 12 | strategies only when required. 
13 | 14 | ## Summary 15 | * Protocol moves through a sequence of views 16 | * Each view deterministically assigns a primary plus a set of replicas that form a quorum 17 | * Client requests are timestamped using the client's local system clock and sent by the 18 | client in parallel to all replicas in the view's quorum 19 | * Replicas maintain a commit log, appending requests to the log in chronological order 20 | * The primary dictates the final order of the log and executes client requests 21 | * If a request is received out of chronological order it is rejected 22 | * Replicas reply to clients with a checksum of the local log 23 | * If the primary rejects a request, it must be retried with a new timestamp 24 | * If the primary accepts a request, the client compares checksums for each replica to the 25 | primary's checksum; if checksums do not match, the client initiates a reconciliation protocol 26 | * The reconciliation protocol reconciles inconsistencies in the log by replicating the 27 | primary's log to the quorum 28 | * Once the client receives matching acknowledgements from all the replicas in the quorum a request 29 | is committed 30 | * During view changes, the new primary selects the most recent log from a majority of replicas 31 | and replicates the log to the view quorum, ensuring the view is initialized with all commits 32 | from prior views 33 | * Clocks satisfy sequential consistency, and quorums satisfy linearizability across views 34 | 35 | ## Protocol 36 | JITPaxos uses a view-based approach to elect a primary and reconcile 37 | logs across views. Views are identified by a monotonically increasing, 38 | globally unique view ID. Each view deterministically assigns a quorum, 39 | and within the quorum a primary replica responsible for executing 40 | client requests and reconciling inconsistencies in the logs of the 41 | remaining replicas. JITPaxos replicas do not coordinate with each 42 | other in the normal case.
Clients send timestamped requests in 43 | parallel to every replica in the view's quorum. When a replica 44 | receives a client request, if the request is received in chronological 45 | order, it's appended to the replica's log. If a request is received 46 | out of order (i.e. the request timestamp is less than the last 47 | timestamp in the replica's log), the request is rejected by the 48 | replica. Clients are responsible for identifying inconsistencies in 49 | the quorum's logs and initiating the reconciliation protocol. To help 50 | clients identify inconsistencies, replicas return a checksum 51 | representing the contents of the log up to the request point with each 52 | client reply. If a client's request is received out of chronological 53 | order, or if the checksums provided by the quorum do not match, the 54 | client must initiate the reconciliation protocol to reconcile the 55 | inconsistencies in the quorum's logs. 56 | 57 | When requests are received out of order, the reconciliation protocol 58 | works to re-order requests using the view's primary as a reference. 59 | When a client initiates the reconciliation protocol for an inconsistent 60 | replica, the replica stops accepting client requests and sends a repair 61 | request to the primary. The primary responds with the subset of the 62 | log not yet reconciled on the replica, and the replica replaces the 63 | out-of-order entries in its log. Once the replica's log has been 64 | reconciled with the primary, it can acknowledge the reconciled request 65 | and begin accepting requests again. Once a client has reconciled all 66 | the divergent replicas and has received acknowledgement from each of 67 | the replicas in the quorum, the request can be committed. 68 | 69 | View primaries and quorums are evenly distributed amongst view IDs. 70 | View changes can be initiated to change the primary or the set of 71 | replicas in the quorum.
When a view change is initiated, each replica 72 | sends its log to the primary for the initiated view. Once the primary 73 | has received logs from a majority of replicas, it initializes the view 74 | with the log from the most recent in-sync replica, broadcasting the log 75 | to its peers. The use of quorums to determine both the commitment of a 76 | request and the initialization of new views ensures that each view log 77 | contains all prior committed requests. 78 | 79 | [NOPaxos]: https://www.usenix.org/system/files/conference/osdi16/osdi16-li.pdf 80 | [TLA+]: https://lamport.azurewebsites.net/tla/tla.html --------------------------------------------------------------------------------
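
The normal-case behaviour described in the Protocol section (append in timestamp order, reject out-of-order requests, reply with a log checksum, and let the client compare checksums against the primary's) can be sketched roughly as follows. This is a minimal illustration in Python, not the TLA+ specification: the `Replica` class, the `handle_request` and `needs_reconciliation` names, the `(accepted, checksum)` reply shape, and the use of SHA-256 over the whole log are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica holding a log of (timestamp, value) entries."""
    log: list = field(default_factory=list)

    def checksum(self) -> str:
        # Illustrative checksum of the current log contents (assumption:
        # the specification may compute its checksum differently).
        return hashlib.sha256(repr(self.log).encode()).hexdigest()

    def handle_request(self, timestamp: int, value: str):
        # A request is out of order when its timestamp is less than the
        # last timestamp in the log; such requests are rejected.
        if self.log and timestamp < self.log[-1][0]:
            return False, self.checksum()
        self.log.append((timestamp, value))
        return True, self.checksum()


def needs_reconciliation(primary_reply, replica_replies) -> bool:
    # The client reconciles when any replica rejected the request or
    # returned a checksum that differs from the primary's checksum.
    _, primary_sum = primary_reply
    return any(not ok or s != primary_sum for ok, s in replica_replies)


if __name__ == "__main__":
    primary, r1, r2 = Replica(), Replica(), Replica()
    r2.handle_request(20, "b")            # r2 has already logged a later request
    p = primary.handle_request(10, "a")
    replies = [r1.handle_request(10, "a"), r2.handle_request(10, "a")]
    print("reconcile:", needs_reconciliation(p, replies))
```

In this sketch a replica that has already logged a later timestamp rejects the request, which is enough for the client to start reconciliation for that replica. The repair exchange with the primary and the view change are left out.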