├── .gitignore ├── LICENSE ├── README.md ├── Zab.pdf ├── Zab.tla ├── doc-in-chinese ├── README.md └── picture │ ├── pic_commit_wrong.png │ └── pic_double_propose.png └── results └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # .gitignore achieved from https://github.com/github/gitignore 2 | 3 | # Prerequisites 4 | *.d 5 | 6 | # Compiled Object files 7 | *.slo 8 | *.lo 9 | *.o 10 | *.obj 11 | 12 | # Precompiled Headers 13 | *.gch 14 | *.pch 15 | 16 | # Compiled Dynamic libraries 17 | *.so 18 | *.dylib 19 | *.dll 20 | 21 | # Fortran module files 22 | *.mod 23 | *.smod 24 | 25 | # Compiled Static libraries 26 | *.lai 27 | *.la 28 | *.a 29 | *.lib 30 | 31 | # Executables 32 | *.exe 33 | *.out 34 | *.app 35 | 36 | # Used in this project 37 | *.toolbox 38 | ZabWithoutQ.tla 39 | ZabWithoutQ.pdf 40 | experiment/ 41 | schedule 42 | Zab_oldversion.tla 43 | Zab_oldversion.pdf 44 | doc-in-chinese/picture/pic_double_same_transaction.PNG 45 | doc-in-chinese/picture/pic_recovery.PNG 46 | 47 | 48 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Binyu Huang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Zab-tla 2 | 3 | ## Overview 4 | This project provides a formal specification and verification, using TLA+, of the ZooKeeper Atomic Broadcast (Zab) consensus protocol proposed in *Junqueira F P, Reed B C, Serafini M. Zab: High-performance broadcast for primary-backup systems[C]//2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN). IEEE, 2011: 245-256*. 5 | 6 | We have written a formal [specification](Zab.tla) for Zab using the TLA+ Toolbox, and we have model-checked it at a limited scale to verify the correctness of Zab. 7 | 8 | We have also completed a [spec](https://github.com/BinyuHuang-nju/zookeeper/tree/master/zookeeper-specifications) for ZAB 1.0, the version used in ZooKeeper production. 9 | 10 | Because the paper simplifies the description of the Zab algorithm, some details in the specification were modified or added. If you have any questions, please let us know. 11 | 12 | You can find this document in Chinese in [doc-in-chinese](doc-in-chinese/README.md). 13 | 14 | 15 | ## Requirements 16 | TLA+ Toolbox version 1.7.0 17 | 18 | ## Run 19 | Create a specification from [Zab.tla](Zab.tla) and run models in the following way. 20 | The spec can be divided into five modules: 21 | - Phase0. Leader Election 22 | - Phase1.
Discovery 23 | - Phase2. Synchronization 24 | - Phase3. Broadcast 25 | - Abnormal situations like failure, network disconnection 26 | 27 | ### Assign constants 28 | After creating a new model and choosing *Temporal formula* with value *Spec*, we first assign most of the constants. 29 | Set the CONSTANTS for server states as model values: *FOLLOWING*, *LEADING*, and *LOOKING*. 30 | Set the CONSTANTS for server zabStates as model values: *ELECTION*, *DISCOVERY*, *SYNCHRONIZATION*, and *BROADCAST*. 31 | Set the CONSTANTS for message types as model values: *CEPOCH*, *NEWEPOCH*, *ACKEPOCH*, *NEWLEADER*, *ACKLD*, *COMMITLD*, *PROPOSE*, *ACK*, and *COMMIT*. 32 | 33 | ### Assign the remaining constants affecting state space 34 | Then assign CONSTANT *Server* as a symmetric set of model values (such as {s1, s2, s3}). 35 | To limit the state space, assign CONSTANT *Parameters* as a record whose domain contains *MaxTimeoutFailures*, *MaxTransactionNum*, *MaxEpoch*, and *MaxRestarts*. For example: [MaxTimeoutFailures |-> 3, MaxTransactionNum |-> 5, MaxEpoch |-> 3, MaxRestarts |-> 2]. 36 | 37 | ### Assign invariants 38 | We disable the *'Deadlock'* option. 39 | We add the invariants defined in the spec to *'Invariants'* to check whether the model can reach an invalid state: *ShouldNotBeTriggered*, *Leadership1*, *Leadership2*, *PrefixConsistency*, *Integrity*, *Agreement*, *TotalOrder*, *LocalPrimaryOrder*, *GlobalPrimaryOrder*, and *PrimaryIntegrity*. 40 | The meanings of these invariants are described below. Except for the first four, all invariants are defined in the paper. 41 | - **ShouldNotBeTriggered**: Some conditions should not be triggered while the model is running. For example, a follower should not receive NEWLEADER when its zabState is not SYNCHRONIZATION.
42 | - **Leadership**: There is at most one established leader in a given epoch. (Established means the leader has updated its f.a and switched its zabState to SYNCHRONIZATION.) 43 | - **PrefixConsistency**: Transactions that have been committed in history are the same on every server. 44 | - **Integrity**: If some follower delivers one transaction, some primary must have broadcast it. 45 | - **Agreement**: If some follower *f1* delivers transaction *a* and some follower *f2* delivers transaction *b*, then *f2* delivers *a* or *f1* delivers *b*. 46 | - **TotalOrder**: If some server delivers *a* before *b*, then any server that delivers *b* must also deliver *a* and deliver *a* before *b*. 47 | - **LocalPrimaryOrder**: If a primary broadcasts *a* before it broadcasts *b*, then a follower that delivers *b* must also deliver *a* before *b*. 48 | - **GlobalPrimaryOrder**: If a server *f* delivers both *a* with epoch *e* and *b* with epoch *e'*, and *e* < *e'*, then *f* must deliver *a* before *b*. 49 | - **PrimaryIntegrity**: If primary *p* broadcasts *a* and some follower *f* delivers *b* such that *b* has an epoch smaller than the epoch of *p*, then *p* must deliver *b* before it broadcasts *a*. 50 | 51 | ### Assign additional TLC options 52 | We set the number of worker threads to 10 (if that is unavailable on your system, just decrease it). 53 | The checking mode can be either *Model-checking mode* or *Simulation mode*. 54 | - Model-checking mode: A traversal method like BFS. The diameter in the results represents the maximum depth reached during traversal. All intermediate results are saved locally as binary files and occupy a lot of space if the run is long. 55 | - Simulation mode: Each time, TLC randomly chooses a path and runs through it until it terminates or reaches the maximum trace length, then randomly chooses another path. Currently we set *Maximum length of the trace* to 100.
56 | Here we mainly use simulation mode to look for deep bugs, which are hard to find in model-checking mode. 57 | 58 | ### Results 59 | You can find our verification [result](results/README.md) from TLC model checking. 60 | 61 | ## Abstraction in specification 62 | >The Zab protocol in the paper does not focus on leader election, so we abstract the process of leader election in the spec. Our spec can simulate non-Byzantine faults. In addition, what we pay attention to is the consistency of system state, so we abstract or omit some parts of the actual implementation, such as replying results to the client, heartbeat transmission, and so on. None of these modules affects the consistency of Zab. 63 | 64 | ### Abstraction to Election 65 | The paper pays no attention to leader election, but the election module cannot be omitted if the model is to run. To simplify election, we use a global variable *leaderOracle* that all servers can access. Two actions let a server complete election: *UpdateLeader* and *FollowLeader*. *UpdateLeader* updates leaderOracle. *FollowLeader* helps a server find the leader and follow it. 66 | 67 | ### Abstraction to communication medium 68 | Communication in Zab is based on TCP channels, so there is no packet loss, duplication, or reordering in message delivery. We use the *Sequences* module in TLA+ to simulate a channel that delivers messages in order. So there is a certain difference between our communication channel and an end-to-end TCP channel. 69 | We believe it can simulate message delay when a server does not perform the action of receiving messages, and it can simulate a process failure when a server does not perform any action. 70 | 71 | ### Abstraction and omission to actions unrelated to system state 72 | What we care about is the consistency of the state in the system. We do not care about details like client requests, the system's reply to the client, or a server applying transactions to its replica.
Therefore, we simplify the process of client requests and omit the reply to the client. We assume that each committed transaction is delivered to the replica immediately, so we can treat variable history[i][1..commitIndex] as the transaction sequence that server *i* delivers to the corresponding replica. 73 | 74 | ## Differences between spec and paper 75 | >This section describes the differences between the protocol in the paper and our specification. We incorporate our own ideas into the spec. 76 | 77 | ### (Issue 1) Line: 377, Action: UpdateLeader, FollowLeader 78 | Since the paper pays no attention to leader election, we use the global variable *leaderOracle* to simplify the election module. In action *UpdateLeader*, we let a server in *LOOKING* become the new leader and update *leaderOracle*. In action *FollowLeader*, we let a server in *LOOKING* switch its state to *FOLLOWING* or *LEADING* according to *leaderOracle*. 79 | 80 | ### (Issue 2) Line: 495, Action: LeaderProcessCEPOCH, LeaderProcessACKEPOCH, LeaderProcessACKLD 81 | The pseudocode of the paper always mentions broadcasting a certain message, like *NEWEPOCH*, *NEWLEADER*, or *PROPOSE*, to *Q*. But which servers *Q* represents at each stage is vague, and the paper does not state it. Define *learners* as the set of servers that have established a connection with a certain leader, *cepochRecv* as the set of servers this leader has received *CEPOCH* from, *ackeRecv* as the set of servers this leader has received *ACKEPOCH* from, and *ackldRecv* as the set of servers this leader has received *ACKLD* from. Then we know that *ackldRecv* ⊆ *ackeRecv* ⊆ *cepochRecv* ⊆ *learners*. It is clearly wrong to let the leader broadcast *COMMITLD* to servers in *cepochRecv*, because *cepochRecv* may contain some follower that has not received *NEWLEADER*. So it is very important to define each '*Q*' in the paper clearly.
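
The containment chain above can be written down as a candidate invariant and handed to TLC. The following is a minimal sketch, not part of Zab.tla: `SidsOf`, `QNesting`, and `QNestingInv` are hypothetical names introduced here, and the sketch assumes, as the spec's variable comments say, that *cepochRecv*, *ackeRecv*, and *ackldRecv* are sets of records carrying a `sid` field. It leaves out the last link (*cepochRecv* ⊆ *learners*) because the spec keeps an entry with `connected |-> FALSE` after a learner is removed. Whether the invariant actually holds depends on the leader actions and would have to be checked.

```tla
\* Hypothetical helper (not in Zab.tla): the server ids recorded in a set
\* of records that each have a `sid` field.
SidsOf(recvSet) == { r.sid : r \in recvSet }

\* Hypothetical candidate invariant for one leader i:
\* every server that sent ACKLD also sent ACKEPOCH, and every server
\* that sent ACKEPOCH also sent CEPOCH.
QNesting(i) ==
    /\ SidsOf(ackldRecv[i]) \subseteq SidsOf(ackeRecv[i])
    /\ SidsOf(ackeRecv[i])  \subseteq SidsOf(cepochRecv[i])

\* Only servers currently in state LEADING are constrained.
QNestingInv == \A i \in Server : IsLeader(i) => QNesting(i)
```

If these definitions were placed inside the module, *QNestingInv* could be added to *'Invariants'* alongside the others.
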
82 | Here we use *cepochRecv* as *Q* in *step l.1.1* to broadcast *NEWEPOCH*, *ackeRecv* as *Q* in *step l.2.1* to broadcast *NEWLEADER*, and *ackldRecv* as *Q* in *step l.2.2* to broadcast *COMMITLD*. 83 | 84 | ### (Issue 3) Line: 889, Action: LeaderBroadcastPROPOSE, LeaderProcessACK 85 | Besides *NEWEPOCH*, *NEWLEADER* and *COMMITLD*, the leader has to broadcast *PROPOSE* and *COMMIT* in the *BROADCAST* stage. According to *step l.3.3* and *step l.3.4* in the pseudocode of the paper, we initially assumed that the leader broadcasts *PROPOSE* or *COMMIT* to servers in *ackldRecv*. This produces a bug in which a follower receives the commit of a txn that does not exist in its history. 86 | ![pic_commit_bug](doc-in-chinese/picture/pic_commit_wrong.png) 87 | This is because a follower will not receive *PROPOSE* until it has received *COMMITLD*, according to *step l.3.4* in the paper. 88 | What we do in the spec is that when the leader broadcasts *PROPOSE*, *Q* is *ackeRecv*, and when the leader broadcasts *COMMIT*, *Q* is *ackldRecv*. So any follower that receives *PROPOSE* must have received *NEWLEADER* before, and any follower that receives *COMMIT* must have received *COMMITLD* before. 89 | So the leader should not directly reply with *NEWEPOCH* and *NEWLEADER* when it receives *CEPOCH* in *BROADCAST*, as described in *step l.3.3*. As in the previous stages, the leader will not reply with *NEWLEADER* until it receives *ACKEPOCH*. 90 | So here, *COMMITLD* commits the txns in *NEWLEADER* and perhaps in several *PROPOSE* messages, because a follower may reply with *ACKLD* late and may receive several *PROPOSE* messages without the corresponding *COMMIT*. In this way we found a bug in the pseudocode of the paper. 91 | 92 | ### (Issue 4) Line: 921, Action: FollowerProcessPROPOSE 93 | If the action in which the leader processes *REQUEST* and the action in which the leader broadcasts *PROPOSE* are not performed atomically, there exists another bug in which a follower receives duplicate proposals of the same txn.
94 | ![pic_double_propose](doc-in-chinese/picture/pic_double_propose.png) 95 | We can see that follower C receives the txn with zxid <1,1> twice, which causes a conflict. 96 | What we do in the spec is that when a follower receives *PROPOSE*, if the zxid is the next zxid after lastZxid in history, the follower accepts this txn. Otherwise, the follower ignores this txn, because this txn must already exist in history. 97 | Here we can see that when a follower receives *PROPOSE*, either the txn exists in history, or the zxid of the txn is the next zxid in history. 98 | -------------------------------------------------------------------------------- /Zab.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BinyuHuang-nju/zab-tla/6accd5689b6f9ea018d184bfbe6695a998dd8891/Zab.pdf -------------------------------------------------------------------------------- /Zab.tla: -------------------------------------------------------------------------------- 1 | -------------------------------- MODULE Zab --------------------------------- 2 | (* This is the formal specification for the Zab consensus algorithm, 3 | in DSN'2011, which means Zab pre-1.0.*) 4 | EXTENDS Integers, FiniteSets, Sequences, Naturals, TLC 5 | ----------------------------------------------------------------------------- 6 | \* The set of servers 7 | CONSTANT Server 8 | \* States of server 9 | CONSTANTS LOOKING, FOLLOWING, LEADING 10 | \* Zab states of server 11 | CONSTANTS ELECTION, DISCOVERY, SYNCHRONIZATION, BROADCAST 12 | \* Message types 13 | CONSTANTS CEPOCH, NEWEPOCH, ACKEPOCH, NEWLEADER, ACKLD, COMMITLD, PROPOSE, ACK, COMMIT 14 | \* [MaxTimeoutFailures, MaxTransactionNum, MaxEpoch, MaxRestarts] 15 | CONSTANT Parameters 16 | 17 | MAXEPOCH == 10 18 | NullPoint == CHOOSE p: p \notin Server 19 | Quorums == {Q \in SUBSET Server: Cardinality(Q)*2 > Cardinality(Server)} 20 | ----------------------------------------------------------------------------- 21 | \* Variables that all servers use.
22 | VARIABLES state, \* State of server, in {LOOKING, FOLLOWING, LEADING}. 23 | zabState, \* Current phase of server, in 24 | \* {ELECTION, DISCOVERY, SYNCHRONIZATION, BROADCAST}. 25 | acceptedEpoch, \* Epoch of the last LEADERINFO packet accepted, 26 | \* namely f.p in paper. 27 | currentEpoch, \* Epoch of the last NEWLEADER packet accepted, 28 | \* namely f.a in paper. 29 | history, \* History of servers: sequence of transactions, 30 | \* containing: [zxid, value, ackSid, epoch]. 31 | lastCommitted \* Maximum index and zxid known to be committed, 32 | \* namely 'lastCommitted' in Leader. Starts from 0, 33 | \* and increases monotonically before restarting. 34 | 35 | \* Variables only used for leader. 36 | VARIABLES learners, \* Set of servers leader connects. 37 | cepochRecv, \* Set of learners leader has received CEPOCH from. 38 | \* Set of record [sid, connected, epoch], 39 | \* where epoch means f.p from followers. 40 | ackeRecv, \* Set of learners leader has received ACKEPOCH from. 41 | \* Set of record 42 | \* [sid, connected, peerLastEpoch, peerHistory], 43 | \* to record f.a and h(f) from followers. 44 | ackldRecv, \* Set of learners leader has received ACKLD from. 45 | \* Set of record [sid, connected]. 46 | sendCounter \* Count of txns leader has broadcast. 47 | 48 | \* Variables only used for follower. 49 | VARIABLES leaderAddr \* If follower has connected with leader. 50 | \* If follower lost connection, then null. 51 | 52 | \* Variable representing oracle of leader. 53 | VARIABLE leaderOracle \* Current oracle. 54 | 55 | \* Variables about network channel. 56 | VARIABLE msgs \* Simulates network channel. 57 | \* msgs[i][j] means the input buffer of server j 58 | \* from server i. 59 | 60 | \* Variables only used in verifying properties. 61 | VARIABLES epochLeader, \* Set of leaders in every epoch. 62 | proposalMsgsLog, \* Set of all broadcast messages. 63 | violatedInvariants \* Check whether there are conditions 64 | \* contrary to the facts. 
65 | 66 | \* Variable used for recording critical data, 67 | \* to constrain state space or update values. 68 | VARIABLE recorder \* Consists: members of Parameters and pc, values. 69 | \* Form is record: 70 | \* [pc, nTransaction, maxEpoch, nTimeout, nRestart, nClientRequest] 71 | 72 | serverVars == <> 74 | 75 | leaderVars == <> 77 | 78 | followerVars == leaderAddr 79 | 80 | electionVars == leaderOracle 81 | 82 | msgVars == msgs 83 | 84 | verifyVars == <> 85 | 86 | vars == <> 88 | ----------------------------------------------------------------------------- 89 | \* Return the maximum value from the set S 90 | Maximum(S) == IF S = {} THEN -1 91 | ELSE CHOOSE n \in S: \A m \in S: n >= m 92 | \* Return the minimum value from the set S 93 | Minimum(S) == IF S = {} THEN -1 94 | ELSE CHOOSE n \in S: \A m \in S: n <= m 95 | 96 | \* Check server state 97 | IsLeader(s) == state[s] = LEADING 98 | IsFollower(s) == state[s] = FOLLOWING 99 | IsLooking(s) == state[s] = LOOKING 100 | 101 | \* Check if s is a quorum 102 | IsQuorum(s) == s \in Quorums 103 | 104 | IsMyLearner(i, j) == j \in learners[i] 105 | IsMyLeader(i, j) == leaderAddr[i] = j 106 | HasNoLeader(i) == leaderAddr[i] = NullPoint 107 | HasLeader(i) == leaderAddr[i] /= NullPoint 108 | ----------------------------------------------------------------------------- 109 | \* FALSE: zxid1 <= zxid2; TRUE: zxid1 > zxid2 110 | ZxidCompare(zxid1, zxid2) == \/ zxid1[1] > zxid2[1] 111 | \/ /\ zxid1[1] = zxid2[1] 112 | /\ zxid1[2] > zxid2[2] 113 | 114 | ZxidEqual(zxid1, zxid2) == zxid1[1] = zxid2[1] /\ zxid1[2] = zxid2[2] 115 | 116 | TxnZxidEqual(txn, z) == txn.zxid[1] = z[1] /\ txn.zxid[2] = z[2] 117 | 118 | TxnEqual(txn1, txn2) == /\ ZxidEqual(txn1.zxid, txn2.zxid) 119 | /\ txn1.value = txn2.value 120 | 121 | EpochPrecedeInTxn(txn1, txn2) == txn1.zxid[1] < txn2.zxid[1] 122 | ----------------------------------------------------------------------------- 123 | \* Actions about recorder 124 | GetParameter(p) == IF p \in DOMAIN 
Parameters THEN Parameters[p] ELSE 0 125 | GetRecorder(p) == IF p \in DOMAIN recorder THEN recorder[p] ELSE 0 126 | 127 | RecorderGetHelper(m) == (m :> recorder[m]) 128 | RecorderIncHelper(m) == (m :> recorder[m] + 1) 129 | 130 | RecorderIncTimeout == RecorderIncHelper("nTimeout") 131 | RecorderGetTimeout == RecorderGetHelper("nTimeout") 132 | RecorderIncRestart == RecorderIncHelper("nRestart") 133 | RecorderGetRestart == RecorderGetHelper("nRestart") 134 | RecorderSetTransactionNum(pc) == ("nTransaction" :> 135 | IF pc[1] = "LeaderProcessRequest" THEN 136 | LET s == CHOOSE i \in Server: 137 | \A j \in Server: Len(history'[i]) >= Len(history'[j]) 138 | IN Len(history'[s]) 139 | ELSE recorder["nTransaction"]) 140 | RecorderSetMaxEpoch(pc) == ("maxEpoch" :> 141 | IF pc[1] = "LeaderProcessCEPOCH" THEN 142 | LET s == CHOOSE i \in Server: 143 | \A j \in Server: acceptedEpoch'[i] >= acceptedEpoch'[j] 144 | IN acceptedEpoch'[s] 145 | ELSE recorder["maxEpoch"]) 146 | RecorderSetRequests(pc) == ("nClientRequest" :> 147 | IF pc[1] = "LeaderProcessRequest" THEN 148 | recorder["nClientRequest"] + 1 149 | ELSE recorder["nClientRequest"] ) 150 | RecorderSetPc(pc) == ("pc" :> pc) 151 | RecorderSetFailure(pc) == CASE pc[1] = "Timeout" -> RecorderIncTimeout @@ RecorderGetRestart 152 | [] pc[1] = "LeaderTimeout" -> RecorderIncTimeout @@ RecorderGetRestart 153 | [] pc[1] = "FollowerTimeout" -> RecorderIncTimeout @@ RecorderGetRestart 154 | [] pc[1] = "Restart" -> RecorderIncTimeout @@ RecorderIncRestart 155 | [] OTHER -> RecorderGetTimeout @@ RecorderGetRestart 156 | 157 | UpdateRecorder(pc) == recorder' = RecorderSetFailure(pc) @@ RecorderSetTransactionNum(pc) 158 | @@ RecorderSetMaxEpoch(pc) @@ RecorderSetPc(pc) 159 | @@ RecorderSetRequests(pc) @@ recorder 160 | UnchangeRecorder == UNCHANGED recorder 161 | 162 | CheckParameterHelper(n, p, Comp(_,_)) == IF p \in DOMAIN Parameters 163 | THEN Comp(n, Parameters[p]) 164 | ELSE TRUE 165 | CheckParameterLimit(n, p) == 
CheckParameterHelper(n, p, LAMBDA i, j: i < j) 166 | 167 | CheckTimeout == CheckParameterLimit(recorder.nTimeout, "MaxTimeoutFailures") 168 | CheckTransactionNum == CheckParameterLimit(recorder.nTransaction, "MaxTransactionNum") 169 | CheckEpoch == CheckParameterLimit(recorder.maxEpoch, "MaxEpoch") 170 | CheckRestart == /\ CheckTimeout 171 | /\ CheckParameterLimit(recorder.nRestart, "MaxRestarts") 172 | 173 | CheckStateConstraints == CheckTimeout /\ CheckTransactionNum /\ CheckEpoch /\ CheckRestart 174 | ----------------------------------------------------------------------------- 175 | \* Actions about network 176 | PendingCEPOCH(i, j) == /\ msgs[j][i] /= << >> 177 | /\ msgs[j][i][1].mtype = CEPOCH 178 | PendingNEWEPOCH(i, j) == /\ msgs[j][i] /= << >> 179 | /\ msgs[j][i][1].mtype = NEWEPOCH 180 | PendingACKEPOCH(i, j) == /\ msgs[j][i] /= << >> 181 | /\ msgs[j][i][1].mtype = ACKEPOCH 182 | PendingNEWLEADER(i, j) == /\ msgs[j][i] /= << >> 183 | /\ msgs[j][i][1].mtype = NEWLEADER 184 | PendingACKLD(i, j) == /\ msgs[j][i] /= << >> 185 | /\ msgs[j][i][1].mtype = ACKLD 186 | PendingCOMMITLD(i, j) == /\ msgs[j][i] /= << >> 187 | /\ msgs[j][i][1].mtype = COMMITLD 188 | PendingPROPOSE(i, j) == /\ msgs[j][i] /= << >> 189 | /\ msgs[j][i][1].mtype = PROPOSE 190 | PendingACK(i, j) == /\ msgs[j][i] /= << >> 191 | /\ msgs[j][i][1].mtype = ACK 192 | PendingCOMMIT(i, j) == /\ msgs[j][i] /= << >> 193 | /\ msgs[j][i][1].mtype = COMMIT 194 | \* Add a message to msgs - add a message m to msgs. 195 | Send(i, j, m) == msgs' = [msgs EXCEPT ![i][j] = Append(msgs[i][j], m)] 196 | \* Remove a message from msgs - discard head of msgs. 197 | Discard(i, j) == msgs' = IF msgs[i][j] /= << >> THEN [msgs EXCEPT ![i][j] = Tail(msgs[i][j])] 198 | ELSE msgs 199 | \* Combination of Send and Discard - discard head of msgs[j][i] and add m into msgs. 200 | Reply(i, j, m) == msgs' = [msgs EXCEPT ![j][i] = Tail(msgs[j][i]), 201 | ![i][j] = Append(msgs[i][j], m)] 202 | \* Shuffle input buffer. 
203 | Clean(i, j) == msgs' = [msgs EXCEPT ![j][i] = << >>, ![i][j] = << >>] 204 | CleanInputBuffer(S) == msgs' = [s \in Server |-> 205 | [v \in Server |-> IF v \in S THEN << >> 206 | ELSE msgs[s][v] ] ] 207 | \* Leader broadcasts a message PROPOSE to all other servers in Q. 208 | \* Note: In paper, Q is fuzzy. We think servers who leader broadcasts NEWLEADER to 209 | \* should receive every PROPOSE. So we consider ackeRecv as Q. 210 | \* Since we let ackeRecv = Q, there may exist some follower receiving COMMIT before 211 | \* COMMITLD, and zxid in COMMIT later than zxid in COMMITLD. To avoid this situation, 212 | \* if f \in ackeRecv but \notin ackldRecv, f should not receive COMMIT until 213 | \* f \in ackldRecv and receives COMMITLD. 214 | Broadcast(i, m) == 215 | LET ackeRecv_quorum == {a \in ackeRecv[i]: a.connected = TRUE } 216 | sid_ackeRecv == { a.sid: a \in ackeRecv_quorum } 217 | IN msgs' = [msgs EXCEPT ![i] = [v \in Server |-> IF /\ v \in sid_ackeRecv 218 | /\ v \in learners[i] 219 | /\ v /= i 220 | THEN Append(msgs[i][v], m) 221 | ELSE msgs[i][v] ] ] 222 | \* Since leader decides to broadcasts message COMMIT when processing ACK, so 223 | \* we need to discard ACK and broadcast COMMIT. 224 | \* Here Q is ackldRecv, because we assume that f should not receive COMMIT until 225 | \* f receives COMMITLD. 226 | DiscardAndBroadcast(i, j, m) == 227 | LET ackldRecv_quorum == {a \in ackldRecv[i]: a.connected = TRUE } 228 | sid_ackldRecv == { a.sid: a \in ackldRecv_quorum } 229 | IN msgs' = [msgs EXCEPT ![j][i] = Tail(msgs[j][i]), 230 | ![i] = [v \in Server |-> IF /\ v \in sid_ackldRecv 231 | /\ v \in learners[i] 232 | /\ v /= i 233 | THEN Append(msgs[i][v], m) 234 | ELSE msgs[i][v] ] ] 235 | \* Leader broadcasts LEADERINFO to all other servers in cepochRecv. 
236 | DiscardAndBroadcastNEWEPOCH(i, j, m) == 237 | LET new_cepochRecv_quorum == {c \in cepochRecv'[i]: c.connected = TRUE } 238 | new_sid_cepochRecv == { c.sid: c \in new_cepochRecv_quorum } 239 | IN msgs' = [msgs EXCEPT ![j][i] = Tail(msgs[j][i]), 240 | ![i] = [v \in Server |-> IF /\ v \in new_sid_cepochRecv 241 | /\ v \in learners[i] 242 | /\ v /= i 243 | THEN Append(msgs[i][v], m) 244 | ELSE msgs[i][v] ] ] 245 | \* Leader broadcasts NEWLEADER to all other servers in ackeRecv. 246 | DiscardAndBroadcastNEWLEADER(i, j, m) == 247 | LET new_ackeRecv_quorum == {a \in ackeRecv'[i]: a.connected = TRUE } 248 | new_sid_ackeRecv == { a.sid: a \in new_ackeRecv_quorum } 249 | IN msgs' = [msgs EXCEPT ![j][i] = Tail(msgs[j][i]), 250 | ![i] = [v \in Server |-> IF /\ v \in new_sid_ackeRecv 251 | /\ v \in learners[i] 252 | /\ v /= i 253 | THEN Append(msgs[i][v], m) 254 | ELSE msgs[i][v] ] ] 255 | \* Leader broadcasts COMMITLD to all other servers in ackldRecv. 256 | DiscardAndBroadcastCOMMITLD(i, j, m) == 257 | LET new_ackldRecv_quorum == {a \in ackldRecv'[i]: a.connected = TRUE } 258 | new_sid_ackldRecv == { a.sid: a \in new_ackldRecv_quorum } 259 | IN msgs' = [msgs EXCEPT ![j][i] = Tail(msgs[j][i]), 260 | ![i] = [v \in Server |-> IF /\ v \in new_sid_ackldRecv 261 | /\ v \in learners[i] 262 | /\ v /= i 263 | THEN Append(msgs[i][v], m) 264 | ELSE msgs[i][v] ] ] 265 | ----------------------------------------------------------------------------- 266 | \* Define initial values for all variables 267 | InitServerVars == /\ state = [s \in Server |-> LOOKING] 268 | /\ zabState = [s \in Server |-> ELECTION] 269 | /\ acceptedEpoch = [s \in Server |-> 0] 270 | /\ currentEpoch = [s \in Server |-> 0] 271 | /\ history = [s \in Server |-> << >>] 272 | /\ lastCommitted = [s \in Server |-> [ index |-> 0, 273 | zxid |-> <<0, 0>> ] ] 274 | 275 | InitLeaderVars == /\ learners = [s \in Server |-> {}] 276 | /\ cepochRecv = [s \in Server |-> {}] 277 | /\ ackeRecv = [s \in Server |-> {}] 278 | /\ 
ackldRecv = [s \in Server |-> {}] 279 | /\ sendCounter = [s \in Server |-> 0] 280 | 281 | InitFollowerVars == leaderAddr = [s \in Server |-> NullPoint] 282 | 283 | InitElectionVars == leaderOracle = NullPoint 284 | 285 | InitMsgVars == msgs = [s \in Server |-> [v \in Server |-> << >>] ] 286 | 287 | InitVerifyVars == /\ proposalMsgsLog = {} 288 | /\ epochLeader = [i \in 1..MAXEPOCH |-> {} ] 289 | /\ violatedInvariants = [stateInconsistent |-> FALSE, 290 | proposalInconsistent |-> FALSE, 291 | commitInconsistent |-> FALSE, 292 | ackInconsistent |-> FALSE, 293 | messageIllegal |-> FALSE ] 294 | 295 | InitRecorder == recorder = [nTimeout |-> 0, 296 | nTransaction |-> 0, 297 | maxEpoch |-> 0, 298 | nRestart |-> 0, 299 | pc |-> <<"Init">>, 300 | nClientRequest |-> 0] 301 | 302 | Init == /\ InitServerVars 303 | /\ InitLeaderVars 304 | /\ InitFollowerVars 305 | /\ InitElectionVars 306 | /\ InitVerifyVars 307 | /\ InitMsgVars 308 | /\ InitRecorder 309 | ----------------------------------------------------------------------------- 310 | \* Utils in state switching 311 | FollowerShutdown(i) == 312 | /\ state' = [state EXCEPT ![i] = LOOKING] 313 | /\ zabState' = [zabState EXCEPT ![i] = ELECTION] 314 | /\ leaderAddr' = [leaderAddr EXCEPT ![i] = NullPoint] 315 | 316 | LeaderShutdown(i) == 317 | /\ LET S == learners[i] 318 | IN /\ state' = [s \in Server |-> IF s \in S THEN LOOKING ELSE state[s] ] 319 | /\ zabState' = [s \in Server |-> IF s \in S THEN ELECTION ELSE zabState[s] ] 320 | /\ leaderAddr' = [s \in Server |-> IF s \in S THEN NullPoint ELSE leaderAddr[s] ] 321 | /\ CleanInputBuffer(S) 322 | /\ learners' = [learners EXCEPT ![i] = {}] 323 | 324 | SwitchToFollower(i) == 325 | /\ state' = [state EXCEPT ![i] = FOLLOWING] 326 | /\ zabState' = [zabState EXCEPT ![i] = DISCOVERY] 327 | 328 | SwitchToLeader(i) == 329 | /\ state' = [state EXCEPT ![i] = LEADING] 330 | /\ zabState' = [zabState EXCEPT ![i] = DISCOVERY] 331 | /\ learners' = [learners EXCEPT ![i] = {i}] 332 | /\ 
cepochRecv' = [cepochRecv EXCEPT ![i] = { [ sid |-> i, 333 | connected |-> TRUE, 334 | epoch |-> acceptedEpoch[i] ] }] 335 | /\ ackeRecv' = [ackeRecv EXCEPT ![i] = { [ sid |-> i, 336 | connected |-> TRUE, 337 | peerLastEpoch |-> currentEpoch[i], 338 | peerHistory |-> history[i] ] }] 339 | /\ ackldRecv' = [ackldRecv EXCEPT ![i] = { [ sid |-> i, 340 | connected |-> TRUE ] }] 341 | /\ sendCounter' = [sendCounter EXCEPT ![i] = 0] 342 | 343 | RemoveCepochRecv(set, sid) == 344 | LET sid_cepochRecv == {s.sid: s \in set} 345 | IN IF sid \notin sid_cepochRecv THEN set 346 | ELSE LET info == CHOOSE s \in set: s.sid = sid 347 | new_info == [ sid |-> sid, 348 | connected |-> FALSE, 349 | epoch |-> info.epoch ] 350 | IN (set \ {info}) \union {new_info} 351 | 352 | RemoveAckeRecv(set, sid) == 353 | LET sid_ackeRecv == {s.sid: s \in set} 354 | IN IF sid \notin sid_ackeRecv THEN set 355 | ELSE LET info == CHOOSE s \in set: s.sid = sid 356 | new_info == [ sid |-> sid, 357 | connected |-> FALSE, 358 | peerLastEpoch |-> info.peerLastEpoch, 359 | peerHistory |-> info.peerHistory ] 360 | IN (set \ {info}) \union {new_info} 361 | 362 | RemoveAckldRecv(set, sid) == 363 | LET sid_ackldRecv == {s.sid: s \in set} 364 | IN IF sid \notin sid_ackldRecv THEN set 365 | ELSE LET info == CHOOSE s \in set: s.sid = sid 366 | new_info == [ sid |-> sid, 367 | connected |-> FALSE ] 368 | IN (set \ {info}) \union {new_info} 369 | 370 | RemoveLearner(i, j) == 371 | /\ learners' = [learners EXCEPT ![i] = @ \ {j}] 372 | /\ cepochRecv' = [cepochRecv EXCEPT ![i] = RemoveCepochRecv(@, j) ] 373 | /\ ackeRecv' = [ackeRecv EXCEPT ![i] = RemoveAckeRecv(@, j) ] 374 | /\ ackldRecv' = [ackldRecv EXCEPT ![i] = RemoveAckldRecv(@, j) ] 375 | ----------------------------------------------------------------------------- 376 | \* Actions of abnormal situations and election 377 | UpdateLeader(i) == 378 | /\ IsLooking(i) 379 | /\ leaderOracle /= i 380 | /\ leaderOracle' = i 381 | /\ SwitchToLeader(i) 382 | /\ UNCHANGED <> 
384 | /\ UpdateRecorder(<<"UpdateLeader", i>>) 385 | 386 | FollowLeader(i) == 387 | /\ IsLooking(i) 388 | /\ leaderOracle /= NullPoint 389 | /\ \/ /\ leaderOracle = i 390 | /\ SwitchToLeader(i) 391 | \/ /\ leaderOracle /= i 392 | /\ SwitchToFollower(i) 393 | /\ UNCHANGED leaderVars 394 | /\ UNCHANGED <> 396 | /\ UpdateRecorder(<<"FollowLeader", i>>) 397 | 398 | \* Follower fails to connect to leader and turns to LOOKING. 399 | FollowerTimeout(i) == 400 | /\ CheckTimeout \* test restrictions of timeout_1 401 | /\ IsFollower(i) 402 | /\ HasNoLeader(i) 403 | /\ FollowerShutdown(i) 404 | /\ CleanInputBuffer({i}) 405 | /\ UNCHANGED <> 407 | /\ UpdateRecorder(<<"FollowerTimeout", i>>) 408 | 409 | \* Leader loses support from a quorum and turns to LOOKING. 410 | LeaderTimeout(i) == 411 | /\ CheckTimeout \* test restrictions of timeout_2 412 | /\ IsLeader(i) 413 | /\ \lnot IsQuorum(learners[i]) 414 | /\ LeaderShutdown(i) 415 | /\ UNCHANGED <> 418 | /\ UpdateRecorder(<<"LeaderTimeout", i>>) 419 | 420 | \* Timeout between leader and follower.
421 | Timeout(i, j) == 422 | /\ CheckTimeout \* test restrictions of timeout_3 423 | /\ IsLeader(i) /\ IsMyLearner(i, j) 424 | /\ IsFollower(j) /\ IsMyLeader(j, i) 425 | /\ RemoveLearner(i, j) 426 | /\ FollowerShutdown(j) 427 | /\ Clean(i, j) 428 | /\ UNCHANGED <> 430 | /\ UpdateRecorder(<<"Timeout", i, j>>) 431 | 432 | Restart(i) == 433 | /\ CheckRestart \* test restrictions of restart 434 | /\ \/ /\ IsLooking(i) 435 | /\ UNCHANGED <> 437 | \/ /\ IsFollower(i) 438 | /\ LET connectedWithLeader == HasLeader(i) 439 | IN \/ /\ connectedWithLeader 440 | /\ LET leader == leaderAddr[i] 441 | IN 442 | /\ FollowerShutdown(i) 443 | /\ RemoveLearner(leader, i) 444 | /\ Clean(leader, i) 445 | \/ /\ ~connectedWithLeader 446 | /\ FollowerShutdown(i) 447 | /\ CleanInputBuffer({i}) 448 | /\ UNCHANGED <> 449 | \/ /\ IsLeader(i) 450 | /\ LeaderShutdown(i) 451 | /\ UNCHANGED <> 452 | /\ lastCommitted' = [lastCommitted EXCEPT ![i] = [ index |-> 0, 453 | zxid |-> <<0, 0>> ] ] 454 | /\ UNCHANGED <> 456 | /\ UpdateRecorder(<<"Restart", i>>) 457 | ----------------------------------------------------------------------------- 458 | (* Establish connection between leader and follower. 
*) 459 | ConnectAndFollowerSendCEPOCH(i, j) == 460 | /\ IsLeader(i) /\ \lnot IsMyLearner(i, j) 461 | /\ IsFollower(j) /\ HasNoLeader(j) /\ leaderOracle = i 462 | /\ learners' = [learners EXCEPT ![i] = @ \union {j}] 463 | /\ leaderAddr' = [leaderAddr EXCEPT ![j] = i] 464 | /\ Send(j, i, [ mtype |-> CEPOCH, 465 | mepoch |-> acceptedEpoch[j] ]) \* contains f.p 466 | /\ UNCHANGED <> 468 | /\ UpdateRecorder(<<"ConnectAndFollowerSendCEPOCH", i, j>>) 469 | 470 | CepochRecvQuorumFormed(i) == LET sid_cepochRecv == {c.sid: c \in cepochRecv[i]} 471 | IN IsQuorum(sid_cepochRecv) 472 | CepochRecvBecomeQuorum(i) == LET sid_cepochRecv == {c.sid: c \in cepochRecv'[i]} 473 | IN IsQuorum(sid_cepochRecv) 474 | 475 | UpdateCepochRecv(oldSet, sid, peerEpoch) == 476 | LET sid_set == {s.sid: s \in oldSet} 477 | IN IF sid \in sid_set 478 | THEN LET old_info == CHOOSE info \in oldSet: info.sid = sid 479 | new_info == [ sid |-> sid, 480 | connected |-> TRUE, 481 | epoch |-> peerEpoch ] 482 | IN ( oldSet \ {old_info} ) \union {new_info} 483 | ELSE LET follower_info == [ sid |-> sid, 484 | connected |-> TRUE, 485 | epoch |-> peerEpoch ] 486 | IN oldSet \union {follower_info} 487 | 488 | \* Determine new e' in this round from a quorum of CEPOCH. 489 | DetermineNewEpoch(i) == 490 | LET epoch_cepochRecv == {c.epoch: c \in cepochRecv'[i]} 491 | IN Maximum(epoch_cepochRecv) + 1 492 | 493 | (* Leader waits for receiving FOLLOWERINFO from a quorum including itself, 494 | and chooses a new epoch e' as its own epoch and broadcasts NEWEPOCH. *) 495 | LeaderProcessCEPOCH(i, j) == 496 | /\ CheckEpoch \* test restrictions of max epoch 497 | /\ IsLeader(i) 498 | /\ PendingCEPOCH(i, j) 499 | /\ LET msg == msgs[j][i][1] 500 | infoOk == IsMyLearner(i, j) 501 | IN /\ infoOk 502 | /\ \/ \* 1. 
has not broadcast NEWEPOCH 503 | /\ ~CepochRecvQuorumFormed(i) 504 | /\ \/ /\ zabState[i] = DISCOVERY 505 | /\ UNCHANGED violatedInvariants 506 | \/ /\ zabState[i] /= DISCOVERY 507 | /\ PrintT("Exception: CepochRecvQuorumFormed false," \o 508 | " while zabState not DISCOVERY.") 509 | /\ violatedInvariants' = [violatedInvariants 510 | EXCEPT !.stateInconsistent = TRUE] 511 | /\ cepochRecv' = [cepochRecv EXCEPT ![i] = UpdateCepochRecv(@, j, msg.mepoch) ] 512 | /\ \/ \* 1.1. cepochRecv becomes quorum, 513 | \* then determine e' and broadcasts NEWEPOCH in Q. 514 | /\ CepochRecvBecomeQuorum(i) 515 | /\ acceptedEpoch' = [acceptedEpoch EXCEPT ![i] = DetermineNewEpoch(i)] 516 | /\ LET m == [ mtype |-> NEWEPOCH, 517 | mepoch |-> acceptedEpoch'[i] ] 518 | IN DiscardAndBroadcastNEWEPOCH(i, j, m) 519 | \/ \* 1.2. cepochRecv still not quorum. 520 | /\ ~CepochRecvBecomeQuorum(i) 521 | /\ Discard(j, i) 522 | /\ UNCHANGED acceptedEpoch 523 | \/ \* 2. has broadcast NEWEPOCH 524 | /\ CepochRecvQuorumFormed(i) 525 | /\ cepochRecv' = [cepochRecv EXCEPT ![i] = UpdateCepochRecv(@, j, msg.mepoch) ] 526 | /\ Reply(i, j, [ mtype |-> NEWEPOCH, 527 | mepoch |-> acceptedEpoch[i] ]) 528 | /\ UNCHANGED <> 529 | /\ UNCHANGED <> 532 | /\ UpdateRecorder(<<"LeaderProcessCEPOCH", i, j>>) 533 | 534 | (* Follower receives LEADERINFO. If newEpoch >= acceptedEpoch, then follower 535 | updates acceptedEpoch and sends ACKEPOCH back, containing currentEpoch and 536 | history. After this, zabState turns to SYNC. *) 537 | FollowerProcessNEWEPOCH(i, j) == 538 | /\ IsFollower(i) 539 | /\ PendingNEWEPOCH(i, j) 540 | /\ LET msg == msgs[j][i][1] 541 | infoOk == IsMyLeader(i, j) 542 | stateOk == zabState[i] = DISCOVERY 543 | epochOk == msg.mepoch >= acceptedEpoch[i] 544 | IN /\ infoOk 545 | /\ \/ \* 1. 
Normal case 546 | /\ epochOk 547 | /\ \/ /\ stateOk 548 | /\ acceptedEpoch' = [acceptedEpoch EXCEPT ![i] = msg.mepoch] 549 | /\ LET m == [ mtype |-> ACKEPOCH, 550 | mepoch |-> currentEpoch[i], 551 | mhistory |-> history[i] ] 552 | IN Reply(i, j, m) 553 | /\ zabState' = [zabState EXCEPT ![i] = SYNCHRONIZATION] 554 | /\ UNCHANGED violatedInvariants 555 | \/ /\ ~stateOk 556 | /\ PrintT("Exception: Follower receives NEWEPOCH," \o 557 | " whileZabState not DISCOVERY.") 558 | /\ violatedInvariants' = [violatedInvariants 559 | EXCEPT !.stateInconsistent = TRUE] 560 | /\ Discard(j, i) 561 | /\ UNCHANGED <> 562 | /\ UNCHANGED <> 564 | \/ \* 2. Abnormal case - go back to election 565 | /\ ~epochOk 566 | /\ FollowerShutdown(i) 567 | /\ LET leader == leaderAddr[i] 568 | IN /\ Clean(i, leader) 569 | /\ RemoveLearner(leader, i) 570 | /\ UNCHANGED <> 571 | /\ UNCHANGED <> 573 | /\ UpdateRecorder(<<"FollowerProcessNEWEPOCH", i, j>>) 574 | 575 | AckeRecvQuorumFormed(i) == LET sid_ackeRecv == {a.sid: a \in ackeRecv[i]} 576 | IN IsQuorum(sid_ackeRecv) 577 | AckeRecvBecomeQuorum(i) == LET sid_ackeRecv == {a.sid: a \in ackeRecv'[i]} 578 | IN IsQuorum(sid_ackeRecv) 579 | 580 | UpdateAckeRecv(oldSet, sid, peerEpoch, peerHistory) == 581 | LET sid_set == {s.sid: s \in oldSet} 582 | follower_info == [ sid |-> sid, 583 | connected |-> TRUE, 584 | peerLastEpoch |-> peerEpoch, 585 | peerHistory |-> peerHistory ] 586 | IN IF sid \in sid_set 587 | THEN LET old_info == CHOOSE info \in oldSet: info.sid = sid 588 | IN (oldSet \ {old_info}) \union {follower_info} 589 | ELSE oldSet \union {follower_info} 590 | 591 | \* for checking invariants 592 | RECURSIVE SetPacketsForChecking(_,_,_,_,_,_) 593 | SetPacketsForChecking(set, src, ep, his, cur, end) == 594 | IF cur > end THEN set 595 | ELSE LET m_proposal == [ source |-> src, 596 | epoch |-> ep, 597 | zxid |-> his[cur].zxid, 598 | data |-> his[cur].value ] 599 | IN SetPacketsForChecking((set \union {m_proposal}), src, ep, his, cur + 1, end) 600 | 601 
|
602 | LastZxidOfHistory(his) == IF Len(his) = 0 THEN <<0, 0>>
603 |                           ELSE his[Len(his)].zxid
604 |
605 | \* TRUE: f1.a > f2.a or (f1.a = f2.a and f1.zxid >= f2.zxid)
606 | MoreResentOrEqual(ss1, ss2) == \/ ss1.currentEpoch > ss2.currentEpoch
607 |                                \/ /\ ss1.currentEpoch = ss2.currentEpoch
608 |                                   /\ ~ZxidCompare(ss2.lastZxid, ss1.lastZxid)
609 |
610 | \* Determine initial history Ie' in this round from a quorum of ACKEPOCH.
611 | DetermineInitialHistory(i) ==
612 |         LET set == ackeRecv'[i]
613 |             ss_set == { [ sid |-> a.sid,
614 |                           currentEpoch |-> a.peerLastEpoch,
615 |                           lastZxid |-> LastZxidOfHistory(a.peerHistory) ]
616 |                         : a \in set }
617 |             selected == CHOOSE ss \in ss_set:
618 |                             \A ss1 \in (ss_set \ {ss}): MoreResentOrEqual(ss, ss1)
619 |             info == CHOOSE f \in set: f.sid = selected.sid
620 |         IN info.peerHistory
621 |
622 | RECURSIVE InitAcksidHelper(_,_)
623 | InitAcksidHelper(txns, src) == IF Len(txns) = 0 THEN << >>
624 |                                ELSE LET oldTxn == txns[1]
625 |                                         newTxn == [ zxid |-> oldTxn.zxid,
626 |                                                     value |-> oldTxn.value,
627 |                                                     ackSid |-> {src},
628 |                                                     epoch |-> oldTxn.epoch ]
629 |                                     IN <> \o InitAcksidHelper( Tail(txns), src)
630 |
631 | \* Atomically let all txns in initial history contain self's acks.
632 | InitAcksid(i, his) == InitAcksidHelper(his, i)
633 |
634 | (* Leader waits for receiving ACKEPOCH from a quorum, and determines initialHistory
635 |    from the history of the server with the most recent state summary among them.
636 |    After this, leader's zabState turns to SYNCHRONIZATION. *)
637 | LeaderProcessACKEPOCH(i, j) ==
638 |         /\ IsLeader(i)
639 |         /\ PendingACKEPOCH(i, j)
640 |         /\ LET msg == msgs[j][i][1]
641 |                infoOk == IsMyLearner(i, j)
642 |            IN /\ infoOk
643 |               /\ \/ \* 1. has broadcast NEWLEADER
644 |                     /\ AckeRecvQuorumFormed(i)
645 |                     /\ ackeRecv' = [ackeRecv EXCEPT ![i] = UpdateAckeRecv(@, j,
646 |                                             msg.mepoch, msg.mhistory) ]
647 |                     /\ LET toSend == history[i] \* contains (Ie', Be')
648 |                            m == [ mtype |-> NEWLEADER,
649 |                                   mepoch |-> acceptedEpoch[i],
650 |                                   mhistory |-> toSend ]
651 |                            set_forChecking == SetPacketsForChecking({ }, i,
652 |                                             acceptedEpoch[i], toSend, 1, Len(toSend))
653 |                        IN
654 |                        /\ Reply(i, j, m)
655 |                        /\ proposalMsgsLog' = proposalMsgsLog \union set_forChecking
656 |                     /\ UNCHANGED <>
658 |                  \/ \* 2. has not broadcast NEWLEADER
659 |                     /\ ~AckeRecvQuorumFormed(i)
660 |                     /\ \/ /\ zabState[i] = DISCOVERY
661 |                           /\ UNCHANGED violatedInvariants
662 |                        \/ /\ zabState[i] /= DISCOVERY
663 |                           /\ PrintT("Exception: AckeRecvQuorumFormed false," \o
664 |                                     " while zabState not DISCOVERY.")
665 |                           /\ violatedInvariants' = [violatedInvariants EXCEPT
666 |                                             !.stateInconsistent = TRUE]
667 |                     /\ ackeRecv' = [ackeRecv EXCEPT ![i] = UpdateAckeRecv(@, j,
668 |                                             msg.mepoch, msg.mhistory) ]
669 |                     /\ \/ \* 2.1. ackeRecv becomes quorum, determine Ie'
670 |                           \* and broadcasts NEWLEADER in Q.
(l.1.2 + l.2.1) 671 | /\ AckeRecvBecomeQuorum(i) 672 | /\ \* Update f.a 673 | LET newLeaderEpoch == acceptedEpoch[i] IN 674 | /\ currentEpoch' = [currentEpoch EXCEPT ![i] = newLeaderEpoch] 675 | /\ epochLeader' = [epochLeader EXCEPT ![newLeaderEpoch] 676 | = @ \union {i} ] \* for checking invariants 677 | /\ \* Determine initial history Ie' 678 | LET initialHistory == DetermineInitialHistory(i) IN 679 | history' = [history EXCEPT ![i] = InitAcksid(i, initialHistory) ] 680 | /\ \* Update zabState 681 | zabState' = [zabState EXCEPT ![i] = SYNCHRONIZATION] 682 | /\ \* Broadcast NEWLEADER with (e', Ie') 683 | LET toSend == history'[i] 684 | m == [ mtype |-> NEWLEADER, 685 | mepoch |-> acceptedEpoch[i], 686 | mhistory |-> toSend ] 687 | set_forChecking == SetPacketsForChecking({ }, i, 688 | acceptedEpoch[i], toSend, 1, Len(toSend)) 689 | IN 690 | /\ DiscardAndBroadcastNEWLEADER(i, j, m) 691 | /\ proposalMsgsLog' = proposalMsgsLog \union set_forChecking 692 | \/ \* 2.2. ackeRecv still not quorum. 693 | /\ ~AckeRecvBecomeQuorum(i) 694 | /\ Discard(j, i) 695 | /\ UNCHANGED <> 697 | /\ UNCHANGED <> 699 | /\ UpdateRecorder(<<"LeaderProcessACKEPOCH", i, j>>) 700 | ----------------------------------------------------------------------------- 701 | (* Follower receives NEWLEADER. Update f.a and history. *) 702 | FollowerProcessNEWLEADER(i, j) == 703 | /\ IsFollower(i) 704 | /\ PendingNEWLEADER(i, j) 705 | /\ LET msg == msgs[j][i][1] 706 | infoOk == IsMyLeader(i, j) 707 | epochOk == acceptedEpoch[i] = msg.mepoch 708 | stateOk == zabState[i] = SYNCHRONIZATION 709 | IN /\ infoOk 710 | /\ \/ \* 1. f.p not equals e', starts a new iteration. 711 | /\ ~epochOk 712 | /\ FollowerShutdown(i) 713 | /\ LET leader == leaderAddr[i] 714 | IN /\ Clean(i, leader) 715 | /\ RemoveLearner(leader, i) 716 | /\ UNCHANGED <> 717 | \/ \* 2. f.p equals e'. 
718 | /\ epochOk 719 | /\ \/ /\ stateOk 720 | /\ UNCHANGED violatedInvariants 721 | \/ /\ ~stateOk 722 | /\ PrintT("Exception: Follower receives NEWLEADER," \o 723 | " whileZabState not SYNCHRONIZATION.") 724 | /\ violatedInvariants' = [violatedInvariants 725 | EXCEPT !.stateInconsistent = TRUE] 726 | /\ currentEpoch' = [currentEpoch EXCEPT ![i] = acceptedEpoch[i]] 727 | /\ history' = [history EXCEPT ![i] = msg.mhistory] \* no need to care ackSid 728 | /\ LET m == [ mtype |-> ACKLD, 729 | mzxid |-> LastZxidOfHistory(history'[i]) ] 730 | IN Reply(i, j, m) 731 | /\ UNCHANGED <> 733 | /\ UNCHANGED <> 735 | /\ UpdateRecorder(<<"FollowerProcessNEWLEADER", i, j>>) 736 | 737 | AckldRecvQuorumFormed(i) == LET sid_ackldRecv == {a.sid: a \in ackldRecv[i]} 738 | IN IsQuorum(sid_ackldRecv) 739 | AckldRecvBecomeQuorum(i) == LET sid_ackldRecv == {a.sid: a \in ackldRecv'[i]} 740 | IN IsQuorum(sid_ackldRecv) 741 | 742 | UpdateAckldRecv(oldSet, sid) == 743 | LET sid_set == {s.sid: s \in oldSet} 744 | follower_info == [ sid |-> sid, 745 | connected |-> TRUE ] 746 | IN IF sid \in sid_set 747 | THEN LET old_info == CHOOSE info \in oldSet: info.sid = sid 748 | IN (oldSet \ {old_info}) \union {follower_info} 749 | ELSE oldSet \union {follower_info} 750 | 751 | LastZxid(i) == LastZxidOfHistory(history[i]) 752 | 753 | RECURSIVE UpdateAcksidHelper(_,_,_) 754 | UpdateAcksidHelper(txns, target, endZxid) == 755 | IF Len(txns) = 0 THEN << >> 756 | ELSE LET oldTxn == txns[1] 757 | IN IF ZxidCompare(oldTxn.zxid, endZxid) THEN txns 758 | ELSE LET newTxn == [ zxid |-> oldTxn.zxid, 759 | value |-> oldTxn.value, 760 | ackSid |-> IF target \in oldTxn.ackSid 761 | THEN oldTxn.ackSid 762 | ELSE oldTxn.ackSid \union {target}, 763 | epoch |-> oldTxn.epoch ] 764 | IN <> \o UpdateAcksidHelper( Tail(txns), target, endZxid) 765 | 766 | \* Atomically add ackSid of one learner according to zxid in ACKLD. 
767 | UpdateAcksid(his, target, endZxid) == UpdateAcksidHelper(his, target, endZxid) 768 | 769 | (* Leader waits for receiving ACKLD from a quorum including itself, 770 | and broadcasts COMMITLD and turns to BROADCAST. *) 771 | LeaderProcessACKLD(i, j) == 772 | /\ IsLeader(i) 773 | /\ PendingACKLD(i, j) 774 | /\ LET msg == msgs[j][i][1] 775 | infoOk == IsMyLearner(i, j) 776 | IN /\ infoOk 777 | /\ \/ \* 1. has not broadcast COMMITLD 778 | /\ ~AckldRecvQuorumFormed(i) 779 | /\ \/ /\ zabState[i] = SYNCHRONIZATION 780 | /\ UNCHANGED violatedInvariants 781 | \/ /\ zabState[i] /= SYNCHRONIZATION 782 | /\ PrintT("Exception: AckldRecvQuorumFormed false," \o 783 | " while zabState not SYNCHRONIZATION.") 784 | /\ violatedInvariants' = [violatedInvariants 785 | EXCEPT !.stateInconsistent = TRUE] 786 | /\ ackldRecv' = [ackldRecv EXCEPT ![i] = UpdateAckldRecv(@, j) ] 787 | /\ history' = [history EXCEPT ![i] = UpdateAcksid(@, j, msg.mzxid)] 788 | /\ \/ \* 1.1. ackldRecv becomes quorum, 789 | \* then broadcasts COMMITLD and turns to BROADCAST. 790 | /\ AckldRecvBecomeQuorum(i) 791 | /\ lastCommitted' = [lastCommitted EXCEPT 792 | ![i] = [ index |-> Len(history[i]), 793 | zxid |-> LastZxid(i) ] ] 794 | /\ zabState' = [zabState EXCEPT ![i] = BROADCAST] 795 | /\ LET m == [ mtype |-> COMMITLD, 796 | mzxid |-> LastZxid(i) ] 797 | IN DiscardAndBroadcastCOMMITLD(i, j, m) 798 | \/ \* 1.2. ackldRecv still not quorum. 799 | /\ ~AckldRecvBecomeQuorum(i) 800 | /\ Discard(j, i) 801 | /\ UNCHANGED <> 802 | \/ \* 2. 
has broadcast COMMITLD 803 | /\ AckldRecvQuorumFormed(i) 804 | /\ \/ /\ zabState[i] = BROADCAST 805 | /\ UNCHANGED violatedInvariants 806 | \/ /\ zabState[i] /= BROADCAST 807 | /\ PrintT("Exception: AckldRecvQuorumFormed true," \o 808 | " while zabState not BROADCAST.") 809 | /\ violatedInvariants' = [violatedInvariants 810 | EXCEPT !.stateInconsistent = TRUE] 811 | /\ ackldRecv' = [ackldRecv EXCEPT ![i] = UpdateAckldRecv(@, j) ] 812 | /\ history' = [history EXCEPT ![i] = UpdateAcksid(@, j, msg.mzxid)] 813 | /\ Reply(i, j, [ mtype |-> COMMITLD, 814 | mzxid |-> lastCommitted[i].zxid ]) 815 | /\ UNCHANGED <> 816 | /\ UNCHANGED <> 818 | /\ UpdateRecorder(<<"LeaderProcessACKLD", i, j>>) 819 | 820 | RECURSIVE ZxidToIndexHepler(_,_,_,_) 821 | ZxidToIndexHepler(his, zxid, cur, appeared) == 822 | IF cur > Len(his) THEN cur 823 | ELSE IF TxnZxidEqual(his[cur], zxid) 824 | THEN CASE appeared = TRUE -> -1 825 | [] OTHER -> Minimum( { cur, 826 | ZxidToIndexHepler(his, zxid, cur + 1, TRUE) } ) 827 | ELSE ZxidToIndexHepler(his, zxid, cur + 1, appeared) 828 | 829 | \* return -1: this zxid appears at least twice. Len(his) + 1: does not exist. 830 | \* 1 - Len(his): exists and appears just once. 831 | ZxidToIndex(his, zxid) == IF ZxidEqual( zxid, <<0, 0>> ) THEN 0 832 | ELSE IF Len(his) = 0 THEN 1 833 | ELSE LET len == Len(his) IN 834 | IF \E idx \in 1..len: TxnZxidEqual(his[idx], zxid) 835 | THEN ZxidToIndexHepler(his, zxid, 1, FALSE) 836 | ELSE len + 1 837 | 838 | (* Follower receives COMMITLD. Commit all txns. 
*) 839 | FollowerProcessCOMMITLD(i, j) == 840 | /\ IsFollower(i) 841 | /\ PendingCOMMITLD(i, j) 842 | /\ LET msg == msgs[j][i][1] 843 | infoOk == IsMyLeader(i, j) 844 | index == IF ZxidEqual(msg.mzxid, <<0, 0>>) THEN 0 845 | ELSE ZxidToIndex(history[i], msg.mzxid) 846 | logOk == index >= 0 /\ index <= Len(history[i]) 847 | IN /\ infoOk 848 | /\ \/ /\ logOk 849 | /\ UNCHANGED violatedInvariants 850 | \/ /\ ~logOk 851 | /\ PrintT("Exception: zxid in COMMITLD not exists in history.") 852 | /\ violatedInvariants' = [violatedInvariants 853 | EXCEPT !.proposalInconsistent = TRUE] 854 | /\ lastCommitted' = [lastCommitted EXCEPT ![i] = [ index |-> index, 855 | zxid |-> msg.mzxid ] ] 856 | /\ zabState' = [zabState EXCEPT ![i] = BROADCAST] 857 | /\ Discard(j, i) 858 | /\ UNCHANGED <> 860 | /\ UpdateRecorder(<<"FollowerProcessCOMMITLD", i, j>>) 861 | ---------------------------------------------------------------------------- 862 | IncZxid(s, zxid) == IF currentEpoch[s] = zxid[1] THEN <> 863 | ELSE <> 864 | 865 | (* Leader receives client request. 866 | Note: In production, any server in traffic can receive requests and 867 | forward it to leader if necessary. We choose to let leader be 868 | the sole one who can receive write requests, to simplify spec 869 | and keep correctness at the same time. *) 870 | LeaderProcessRequest(i) == 871 | /\ CheckTransactionNum \* test restrictions of transaction num 872 | /\ IsLeader(i) 873 | /\ zabState[i] = BROADCAST 874 | /\ LET request_value == GetRecorder("nClientRequest") \* unique value 875 | newTxn == [ zxid |-> IncZxid(i, LastZxid(i)), 876 | value |-> request_value, 877 | ackSid |-> {i}, 878 | epoch |-> currentEpoch[i] ] 879 | IN history' = [history EXCEPT ![i] = Append(@, newTxn) ] 880 | /\ UNCHANGED <> 882 | /\ UpdateRecorder(<<"LeaderProcessRequest", i>>) 883 | 884 | \* Latest counter existing in history. 
885 | CurrentCounter(i) == IF LastZxid(i)[1] = currentEpoch[i] THEN LastZxid(i)[2] 886 | ELSE 0 887 | 888 | (* Leader broadcasts PROPOSE when sendCounter < currentCounter. *) 889 | LeaderBroadcastPROPOSE(i) == 890 | /\ IsLeader(i) 891 | /\ zabState[i] = BROADCAST 892 | /\ sendCounter[i] < CurrentCounter(i) \* there exists proposal to be sent 893 | /\ LET toSendCounter == sendCounter[i] + 1 894 | toSendZxid == <> 895 | toSendIndex == ZxidToIndex(history[i], toSendZxid) 896 | toSendTxn == history[i][toSendIndex] 897 | m_proposal == [ mtype |-> PROPOSE, 898 | mzxid |-> toSendTxn.zxid, 899 | mdata |-> toSendTxn.value ] 900 | m_proposal_forChecking == [ source |-> i, 901 | epoch |-> currentEpoch[i], 902 | zxid |-> toSendTxn.zxid, 903 | data |-> toSendTxn.value ] 904 | IN /\ sendCounter' = [sendCounter EXCEPT ![i] = toSendCounter] 905 | /\ Broadcast(i, m_proposal) 906 | /\ proposalMsgsLog' = proposalMsgsLog \union {m_proposal_forChecking} 907 | /\ UNCHANGED <> 909 | /\ UpdateRecorder(<<"LeaderBroadcastPROPOSE", i>>) 910 | 911 | IsNextZxid(curZxid, nextZxid) == 912 | \/ \* first PROPOSAL in this epoch 913 | /\ nextZxid[2] = 1 914 | /\ curZxid[1] < nextZxid[1] 915 | \/ \* not first PROPOSAL in this epoch 916 | /\ nextZxid[2] > 1 917 | /\ curZxid[1] = nextZxid[1] 918 | /\ curZxid[2] + 1 = nextZxid[2] 919 | 920 | (* Follower processes PROPOSE, saves it in history and replies ACK. 
*) 921 | FollowerProcessPROPOSE(i, j) == 922 | /\ IsFollower(i) 923 | /\ PendingPROPOSE(i, j) 924 | /\ LET msg == msgs[j][i][1] 925 | infoOk == IsMyLeader(i, j) 926 | isNext == IsNextZxid(LastZxid(i), msg.mzxid) 927 | newTxn == [ zxid |-> msg.mzxid, 928 | value |-> msg.mdata, 929 | ackSid |-> {}, 930 | epoch |-> currentEpoch[i] ] 931 | m_ack == [ mtype |-> ACK, 932 | mzxid |-> msg.mzxid ] 933 | IN /\ infoOk 934 | /\ \/ /\ isNext 935 | /\ history' = [history EXCEPT ![i] = Append(@, newTxn)] 936 | /\ Reply(i, j, m_ack) 937 | /\ UNCHANGED violatedInvariants 938 | \/ /\ ~isNext 939 | /\ LET index == ZxidToIndex(history[i], msg.mzxid) 940 | exist == index > 0 /\ index <= Len(history[i]) 941 | IN \/ /\ exist 942 | /\ UNCHANGED violatedInvariants 943 | \/ /\ ~exist 944 | /\ PrintT("Exception: Follower receives PROPOSE, while" \o 945 | " txn is neither the next nor exists in history.") 946 | /\ violatedInvariants' = [violatedInvariants EXCEPT 947 | !.proposalInconsistent = TRUE] 948 | /\ Discard(j, i) 949 | /\ UNCHANGED history 950 | /\ UNCHANGED <> 952 | /\ UpdateRecorder(<<"FollowerProcessPROPOSE", i, j>>) 953 | 954 | LeaderTryToCommit(s, index, zxid, newTxn, follower) == 955 | LET allTxnsBeforeCommitted == lastCommitted[s].index >= index - 1 956 | \* Only when all proposals before zxid has been committed, 957 | \* this proposal can be permitted to be committed. 958 | hasAllQuorums == IsQuorum(newTxn.ackSid) 959 | \* In order to be committed, a proposal must be accepted 960 | \* by a quorum. 961 | ordered == lastCommitted[s].index + 1 = index 962 | \* Commit proposals in order. 963 | IN \/ /\ \* Current conditions do not satisfy committing the proposal. 
964 | \/ ~allTxnsBeforeCommitted 965 | \/ ~hasAllQuorums 966 | /\ Discard(follower, s) 967 | /\ UNCHANGED <> 968 | \/ /\ allTxnsBeforeCommitted 969 | /\ hasAllQuorums 970 | /\ \/ /\ ~ordered 971 | /\ PrintT("Warn: Committing zxid " \o ToString(zxid) \o " not first.") 972 | /\ violatedInvariants' = [violatedInvariants EXCEPT 973 | !.commitInconsistent = TRUE] 974 | \/ /\ ordered 975 | /\ UNCHANGED violatedInvariants 976 | /\ lastCommitted' = [lastCommitted EXCEPT ![s] = [ index |-> index, 977 | zxid |-> zxid ] ] 978 | /\ LET m_commit == [ mtype |-> COMMIT, 979 | mzxid |-> zxid ] 980 | IN DiscardAndBroadcast(s, follower, m_commit) 981 | 982 | 983 | LastAckIndexFromFollower(i, j) == 984 | LET set_index == {idx \in 1..Len(history[i]): j \in history[i][idx].ackSid } 985 | IN Maximum(set_index) 986 | 987 | (* Leader Keeps a count of acks for a particular proposal, and try to 988 | commit the proposal. If committed, COMMIT of proposal will be broadcast. *) 989 | LeaderProcessACK(i, j) == 990 | /\ IsLeader(i) 991 | /\ PendingACK(i, j) 992 | /\ LET msg == msgs[j][i][1] 993 | infoOk == IsMyLearner(i, j) 994 | index == ZxidToIndex(history[i], msg.mzxid) 995 | exist == index >= 1 /\ index <= Len(history[i]) \* proposal exists in history 996 | outstanding == lastCommitted[i].index < Len(history[i]) \* outstanding not null 997 | hasCommitted == ~ZxidCompare(msg.mzxid, lastCommitted[i].zxid) 998 | ackIndex == LastAckIndexFromFollower(i, j) 999 | monotonicallyInc == \/ ackIndex = -1 1000 | \/ ackIndex + 1 = index 1001 | IN /\ infoOk 1002 | /\ \/ /\ exist 1003 | /\ monotonicallyInc 1004 | /\ LET txn == history[i][index] 1005 | txnAfterAddAck == [ zxid |-> txn.zxid, 1006 | value |-> txn.value, 1007 | ackSid |-> txn.ackSid \union {j} , 1008 | epoch |-> txn.epoch ] 1009 | IN 1010 | /\ history' = [history EXCEPT ![i][index] = txnAfterAddAck ] 1011 | /\ \/ /\ \* Note: outstanding is 0. 1012 | \* / proposal has already been committed. 
1013 | \/ ~outstanding 1014 | \/ hasCommitted 1015 | /\ Discard(j, i) 1016 | /\ UNCHANGED <> 1017 | \/ /\ outstanding 1018 | /\ ~hasCommitted 1019 | /\ LeaderTryToCommit(i, index, msg.mzxid, txnAfterAddAck, j) 1020 | \/ /\ \/ ~exist 1021 | \/ ~monotonicallyInc 1022 | /\ PrintT("Exception: No such zxid. " \o 1023 | " / ackIndex doesn't inc monotonically.") 1024 | /\ violatedInvariants' = [violatedInvariants 1025 | EXCEPT !.ackInconsistent = TRUE] 1026 | /\ Discard(j, i) 1027 | /\ UNCHANGED <> 1028 | /\ UNCHANGED <> 1030 | /\ UpdateRecorder(<<"LeaderProcessACK", i, j>>) 1031 | 1032 | (* Follower processes COMMIT. *) 1033 | FollowerProcessCOMMIT(i, j) == 1034 | /\ IsFollower(i) 1035 | /\ PendingCOMMIT(i, j) 1036 | /\ LET msg == msgs[j][i][1] 1037 | infoOk == IsMyLeader(i, j) 1038 | pending == lastCommitted[i].index < Len(history[i]) 1039 | IN /\ infoOk 1040 | /\ \/ /\ ~pending 1041 | /\ PrintT("Warn: Committing zxid without seeing txn.") 1042 | /\ UNCHANGED <> 1043 | \/ /\ pending 1044 | /\ LET firstElement == history[i][lastCommitted[i].index + 1] 1045 | match == ZxidEqual(firstElement.zxid, msg.mzxid) 1046 | IN 1047 | \/ /\ ~match 1048 | /\ PrintT("Exception: Committing zxid not equals" \o 1049 | " next pending txn zxid.") 1050 | /\ violatedInvariants' = [violatedInvariants EXCEPT 1051 | !.commitInconsistent = TRUE] 1052 | /\ UNCHANGED lastCommitted 1053 | \/ /\ match 1054 | /\ lastCommitted' = [lastCommitted EXCEPT ![i] = 1055 | [ index |-> lastCommitted[i].index + 1, 1056 | zxid |-> firstElement.zxid ] ] 1057 | /\ UNCHANGED violatedInvariants 1058 | /\ Discard(j, i) 1059 | /\ UNCHANGED <> 1061 | /\ UpdateRecorder(<<"FollowerProcessCOMMIT", i, j>>) 1062 | ---------------------------------------------------------------------------- 1063 | (* Used to discard some messages which should not exist in network channel. 1064 | This action should not be triggered. 
*) 1065 | FilterNonexistentMessage(i) == 1066 | /\ \E j \in Server \ {i}: /\ msgs[j][i] /= << >> 1067 | /\ LET msg == msgs[j][i][1] 1068 | IN 1069 | \/ /\ IsLeader(i) 1070 | /\ LET infoOk == IsMyLearner(i, j) 1071 | IN 1072 | \/ msg.mtype = NEWEPOCH 1073 | \/ msg.mtype = NEWLEADER 1074 | \/ msg.mtype = COMMITLD 1075 | \/ msg.mtype = PROPOSE 1076 | \/ msg.mtype = COMMIT 1077 | \/ /\ ~infoOk 1078 | /\ \/ msg.mtype = CEPOCH 1079 | \/ msg.mtype = ACKEPOCH 1080 | \/ msg.mtype = ACKLD 1081 | \/ msg.mtype = ACK 1082 | \/ /\ IsFollower(i) 1083 | /\ LET infoOk == IsMyLeader(i, j) 1084 | IN 1085 | \/ msg.mtype = CEPOCH 1086 | \/ msg.mtype = ACKEPOCH 1087 | \/ msg.mtype = ACKLD 1088 | \/ msg.mtype = ACK 1089 | \/ /\ ~infoOk 1090 | /\ \/ msg.mtype = NEWEPOCH 1091 | \/ msg.mtype = NEWLEADER 1092 | \/ msg.mtype = COMMITLD 1093 | \/ msg.mtype = PROPOSE 1094 | \/ msg.mtype = COMMIT 1095 | \/ IsLooking(i) 1096 | /\ Discard(j, i) 1097 | /\ violatedInvariants' = [violatedInvariants EXCEPT !.messageIllegal = TRUE] 1098 | /\ UNCHANGED <> 1100 | /\ UnchangeRecorder 1101 | ---------------------------------------------------------------------------- 1102 | \* Defines how the variables may transition. 
1103 | Next ==
1104 |         (* Election *)
1105 |         \/ \E i \in Server: UpdateLeader(i)
1106 |         \/ \E i \in Server: FollowLeader(i)
1107 |         (* Abnormal situations like failure, network disconnection *)
1108 |         \/ \E i \in Server: FollowerTimeout(i)
1109 |         \/ \E i \in Server: LeaderTimeout(i)
1110 |         \/ \E i, j \in Server: Timeout(i, j)
1111 |         \/ \E i \in Server: Restart(i)
1112 |         (* Zab module - Discovery and Synchronization part *)
1113 |         \/ \E i, j \in Server: ConnectAndFollowerSendCEPOCH(i, j)
1114 |         \/ \E i, j \in Server: LeaderProcessCEPOCH(i, j)
1115 |         \/ \E i, j \in Server: FollowerProcessNEWEPOCH(i, j)
1116 |         \/ \E i, j \in Server: LeaderProcessACKEPOCH(i, j)
1117 |         \/ \E i, j \in Server: FollowerProcessNEWLEADER(i, j)
1118 |         \/ \E i, j \in Server: LeaderProcessACKLD(i, j)
1119 |         \/ \E i, j \in Server: FollowerProcessCOMMITLD(i, j)
1120 |         (* Zab module - Broadcast part *)
1121 |         \/ \E i \in Server: LeaderProcessRequest(i)
1122 |         \/ \E i \in Server: LeaderBroadcastPROPOSE(i)
1123 |         \/ \E i, j \in Server: FollowerProcessPROPOSE(i, j)
1124 |         \/ \E i, j \in Server: LeaderProcessACK(i, j)
1125 |         \/ \E i, j \in Server: FollowerProcessCOMMIT(i, j)
1126 |         (* An action used to judge whether there are redundant messages in network *)
1127 |         \/ \E i \in Server: FilterNonexistentMessage(i)
1128 |
1129 | Spec == Init /\ [][Next]_vars
1130 | ----------------------------------------------------------------------------
1131 | \* Define safety properties of Zab.
1132 |
1133 | ShouldNotBeTriggered == \A p \in DOMAIN violatedInvariants: violatedInvariants[p] = FALSE
1134 |
1135 | \* There is at most one established leader for a certain epoch.
1136 | Leadership1 == \A i, j \in Server: 1137 | /\ IsLeader(i) /\ zabState[i] \in {SYNCHRONIZATION, BROADCAST} 1138 | /\ IsLeader(j) /\ zabState[j] \in {SYNCHRONIZATION, BROADCAST} 1139 | /\ currentEpoch[i] = currentEpoch[j] 1140 | => i = j 1141 | 1142 | Leadership2 == \A epoch \in 1..MAXEPOCH: Cardinality(epochLeader[epoch]) <= 1 1143 | 1144 | \* PrefixConsistency: The prefix that have been committed 1145 | \* in history in any process is the same. 1146 | PrefixConsistency == \A i, j \in Server: 1147 | LET smaller == Minimum({lastCommitted[i].index, lastCommitted[j].index}) 1148 | IN \/ smaller = 0 1149 | \/ /\ smaller > 0 1150 | /\ \A index \in 1..smaller: 1151 | TxnEqual(history[i][index], history[j][index]) 1152 | 1153 | \* Integrity: If some follower delivers one transaction, then some primary has broadcast it. 1154 | Integrity == \A i \in Server: 1155 | /\ IsFollower(i) 1156 | /\ lastCommitted[i].index > 0 1157 | => \A idx \in 1..lastCommitted[i].index: \E proposal \in proposalMsgsLog: 1158 | LET txn_proposal == [ zxid |-> proposal.zxid, 1159 | value |-> proposal.data ] 1160 | IN TxnEqual(history[i][idx], txn_proposal) 1161 | 1162 | \* Agreement: If some follower f delivers transaction a and some follower f' delivers transaction b, 1163 | \* then f' delivers a or f delivers b. 1164 | Agreement == \A i, j \in Server: 1165 | /\ IsFollower(i) /\ lastCommitted[i].index > 0 1166 | /\ IsFollower(j) /\ lastCommitted[j].index > 0 1167 | => 1168 | \A idx1 \in 1..lastCommitted[i].index, idx2 \in 1..lastCommitted[j].index : 1169 | \/ \E idx_j \in 1..lastCommitted[j].index: 1170 | TxnEqual(history[j][idx_j], history[i][idx1]) 1171 | \/ \E idx_i \in 1..lastCommitted[i].index: 1172 | TxnEqual(history[i][idx_i], history[j][idx2]) 1173 | 1174 | \* Total order: If some follower delivers a before b, then any process that delivers b 1175 | \* must also deliver a and deliver a before b. 
1176 | TotalOrder == \A i, j \in Server: 1177 | LET committed1 == lastCommitted[i].index 1178 | committed2 == lastCommitted[j].index 1179 | IN committed1 >= 2 /\ committed2 >= 2 1180 | => \A idx_i1 \in 1..(committed1 - 1) : \A idx_i2 \in (idx_i1 + 1)..committed1 : 1181 | LET logOk == \E idx \in 1..committed2 : 1182 | TxnEqual(history[i][idx_i2], history[j][idx]) 1183 | IN \/ ~logOk 1184 | \/ /\ logOk 1185 | /\ \E idx_j2 \in 1..committed2 : 1186 | /\ TxnEqual(history[i][idx_i2], history[j][idx_j2]) 1187 | /\ \E idx_j1 \in 1..(idx_j2 - 1): 1188 | TxnEqual(history[i][idx_i1], history[j][idx_j1]) 1189 | 1190 | \* Local primary order: If a primary broadcasts a before it broadcasts b, then a follower that 1191 | \* delivers b must also deliver a before b. 1192 | LocalPrimaryOrder == LET p_set(i, e) == {p \in proposalMsgsLog: /\ p.source = i 1193 | /\ p.epoch = e } 1194 | txn_set(i, e) == { [ zxid |-> p.zxid, 1195 | value |-> p.data ] : p \in p_set(i, e) } 1196 | IN \A i \in Server: \A e \in 1..currentEpoch[i]: 1197 | \/ Cardinality(txn_set(i, e)) < 2 1198 | \/ /\ Cardinality(txn_set(i, e)) >= 2 1199 | /\ \E txn1, txn2 \in txn_set(i, e): 1200 | \/ TxnEqual(txn1, txn2) 1201 | \/ /\ ~TxnEqual(txn1, txn2) 1202 | /\ LET TxnPre == IF ZxidCompare(txn1.zxid, txn2.zxid) THEN txn2 ELSE txn1 1203 | TxnNext == IF ZxidCompare(txn1.zxid, txn2.zxid) THEN txn1 ELSE txn2 1204 | IN \A j \in Server: /\ lastCommitted[j].index >= 2 1205 | /\ \E idx \in 1..lastCommitted[j].index: 1206 | TxnEqual(history[j][idx], TxnNext) 1207 | => \E idx2 \in 1..lastCommitted[j].index: 1208 | /\ TxnEqual(history[j][idx2], TxnNext) 1209 | /\ idx2 > 1 1210 | /\ \E idx1 \in 1..(idx2 - 1): 1211 | TxnEqual(history[j][idx1], TxnPre) 1212 | 1213 | \* Global primary order: A follower f delivers both a with epoch e and b with epoch e', and e < e', 1214 | \* then f must deliver a before b. 
1215 | GlobalPrimaryOrder == \A i \in Server: lastCommitted[i].index >= 2
1216 |                          => \A idx1, idx2 \in 1..lastCommitted[i].index:
1217 |                                 \/ ~EpochPrecedeInTxn(history[i][idx1], history[i][idx2])
1218 |                                 \/ /\ EpochPrecedeInTxn(history[i][idx1], history[i][idx2])
1219 |                                    /\ idx1 < idx2
1220 |
1221 | \* Primary integrity: If primary p broadcasts a and some follower f delivers b such that b has epoch
1222 | \* smaller than epoch of p, then p must deliver b before it broadcasts a.
1223 | PrimaryIntegrity == \A i, j \in Server: /\ IsLeader(i) /\ IsMyLearner(i, j)
1224 |                                         /\ IsFollower(j) /\ IsMyLeader(j, i)
1225 |                                         /\ zabState[i] = BROADCAST
1226 |                                         /\ zabState[j] = BROADCAST
1227 |                                         /\ lastCommitted[j].index >= 1
1228 |                         => \A idx_j \in 1..lastCommitted[j].index:
1229 |                                 \/ history[j][idx_j].zxid[1] >= currentEpoch[i]
1230 |                                 \/ /\ history[j][idx_j].zxid[1] < currentEpoch[i]
1231 |                                    /\ \E idx_i \in 1..lastCommitted[i].index:
1232 |                                         TxnEqual(history[i][idx_i], history[j][idx_j])
1233 | =============================================================================
1234 | \* Modification History
1235 | \* Last modified Sat Dec 11 22:31:08 CST 2021 by Dell
1236 | \* Created Thu Dec 02 20:49:23 CST 2021 by Dell
1237 |
--------------------------------------------------------------------------------
/doc-in-chinese/README.md:
--------------------------------------------------------------------------------
1 | # TLA+ Specification of Zab: Documentation
2 |
3 | ## Overview
4 | - This work is inspired by the paper *Junqueira F P, Reed B C, Serafini M. Zab: High-performance broadcast for primary-backup systems[C]//2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
IEEE, 2011: 245-256.*启发。本实验根据该论文描述的Zab协议进行了Zab的TLA+规约。 5 | - 我们对Zab使用TLA+工具做了形式化规约,并在此基础上做了一定量的模型检验来验证Zab的正确性。 6 | - 当前我们同样完成了对最新版的ZooKeeper开源项目中的[ZAB 1.0](https://github.com/BinyuHuang-nju/zookeeper/tree/master/zookeeper-specifications)的形式化规约。 7 | - 由于论文中对Zab算法描述的精简和细节上的省略,在进行规约时协议中的一些细节由本实验作者进行修改和增加。如有疑问,欢迎指出。 8 | 9 | ## 支撑平台 10 | TLA+ toolbox 版本1.7.0 11 | 12 | ## 运行 13 | >具体配置环境手段见主目录README相应模块。 14 | 15 | 创建规约并以正常方式建模并运行模型。 16 | 我们通过该规约,能够很清晰地将规约划分为五个模块: 17 | - Phase0. 选主阶段 18 | - Phase1. 发现阶段 19 | - Phase2. 同步阶段 20 | - Phase3. 广播阶段 21 | - 各种异常情况如节点失效、网络断连等 22 | 23 | 24 | 25 | ## 规约中的抽象 26 | >论文中的Zab协议不关注选主过程,我们对选主过程进行抽象;且规约中能够模拟系统中的非拜占庭错误;此外,我们关注的是系统内状态的一致,从而抽象或省略了如向客户端回复结果等一些实际实现中的处理。 27 | 28 | ### 对选主过程的抽象 29 | 由于论文不关注选主,但是为了使模型可被运行,选主模块是不可或缺的。我们用动作*UpdateLeader*和*FollowLeader*来完成选主。 30 | 31 | ### 对通信介质的抽象 32 | Zab使用的通信信道是TCP信道,所以消息传递不会出现丢包、冗余、乱序的情况,我们用*Sequence*模块来模拟满足按序接收消息的信道,故这里的信道和端到端的TCP信道存在一定差异。我们认为某个服务器不执行接收消息的动作可以模拟消息延迟,某个服务器不执行任何动作可以模拟单机失败。 33 | 34 | ### 对与系统状态无关的动作的抽象和省略 35 | 我们关心的是系统内状态的一致,不关心客户端(client)向系统请求执行和系统向客户端回复结果等的细节,以及各个服务器向副本(复制状态机)提交事务(deliver transaction)的细节。因此我们粗化了*ClientRequest*,省略了向客户端的回复。我们假设每一个可提交的事务会被立即提交到副本中,故可用变量*history[i][1..commitIndex]*来模拟节点*i*向副本提交的事务序列。 36 | 37 | ## 与论文的差异 38 | >该部分描述的是规约与论文的协议之间的差异。我们将我们自己的做法融入进规约中。 39 | 40 | ### Issue 1 Line: 377, Action: UpdateLeader, FollowLeader 41 | 由于论文不关注选主模块,我们使用全局变量*leaderOracle*来简化选主模块。在动作*UpdateLeader*中,我们让*LOOKING*状态的服务器变成新的领导者,并更新*leaderOracle*。在动作*FollowLeader*中,我们让处于*LOOKING*状态的服务器根据*leaderOracle*切换状态到*FOLLOWING*或*LEADING*。 42 | 43 | 44 | ### Issue 2 Line: 495, Action: LeaderProcessCEPOCH, LeaderProcessACKEPOCH, LeaderProcessACKLD 45 | 在论文的伪代码中,总是提到领导者向*Q*广播某个消息,如*NEWEPOCH*、*NEWLEADER*、*PROPOSE*等等。但是*Q*在每个阶段所代表哪些服务器,这是非常模糊的,在论文中也没有进一步说明。我们将*learners*定义为与某个领导者建立连接的服务器集合,*cepochRecv*为领导者接收到*CEPOCH*的服务器集合,*ackeRecv*为领导者接收到*ACKEPOCH*的服务器集合,*ackldRecv*为领导者接收到*ACKLD*的服务器集合,则我们知道*ackldRecv* ⊆ *ackeRecv* ⊆ *cepochRecv* ⊆ 
*learners*。又很显然让领导者向*epochRecv*中的服务器广播*COMMITLD*是很显然错误的,因为*cepochRecv*中可能存在跟随者还没接收到*NEWLEADER*。所以在论文中定义每一个*Q*代表的集合是非常有必要的。 46 | 这里我们让*cepochRecv*在*step l.1.1*中作为*Q*来广播*NEWECPOCH*,*ackeRecv* 在*step l.2.1*中作为*Q*来广播*NEWLEADER*,*ackldRecv*作为*Q*在*step l.2.2* 中广播*COMMITLD*。 47 | 48 | ### Issue 3 Line: 889, Action: LeaderBroadcastPROPOSE, LeaderProcessACK 49 | 事实上,除了*NEWEPOCH*、*NEWLEADER*和*COMMITLD*,领导者还需要在*BROADCAST*阶段广播*PROPOSE*和*COMMIT*。根据论文伪代码中的*step l.3.3*和*step l.3.4*,我们最初认为领导者广播这两个类型的消息的集合都是*ackldRecv*,这是很自然的。但这会带来一个bug,其中跟随者接收到了本地日志中不存在的事务的*COMMIT*。 50 | ![pic_commit_bug](picture/pic_commit_wrong.png) 51 | 这是因为根据论文中的*step l.3.4*,跟随者直到接收到了*COMMITLD*,才会接收到*PROPOSE*。 52 | 于是我们在规约中的做法是,当领导者广播*PROPOSE*时,*Q*是*ackeRecv*,当领导者广播*COMMIT*时,*Q*是*ackldRecv*。因此,任何收到*PROPOSE*的跟随者肯定在这之前接收过*COMMITLD*。 53 | 所以,我们也不应该在*step l.3.3*中,当领导者接收到*CEPOCH*时直接回复*NEWEPOCH*和*NEWLEADER*,而是和之前的阶段一样,回复*NEWEPOCH*,然后等待接收到*ACKEPOCH*后再回复*NEWLEADER*,这样才能陆续更新*ackeRecv*和*ackldRecv*。 54 | 此外,这里的*COMMITLD*所提交的事务可能不仅仅包含*NEWLEADER*中的事务,还会包含一些*PROPOSE*中的事务的提交。这也是因为*PROPOSE*和*COMMIT*的广播集合不同,一些还在恢复阶段的跟随者可能不会接收到对应的*COMMIT*,而是在*COMMITLD*中一并被提交了。 55 | 所以这里我们成功找到了论文伪代码中的一个bug。 56 | 57 | ### Issue 4 Line: 921, Action: FollowerProcessPROPOSE 58 | 在规约中,我们假设领导者处理来自客户端的请求和领导者广播*PROPOSE*这两个动作不是原子执行的,这样会存在另一个bug,即跟随者可能会收到两个对同一个事务的提议。 59 | ![pic_double_propose](picture/pic_double_propose.png) 60 | 从该图中我们可以看到,对zxid为<1,1>的事务,跟随者收到了两次提议。 61 | 我们在规约中的做法是,当跟随者接收到*PROPOSE*时,会进行判断,如果该消息中的zxid是当前日志中的lastZxid的下一个zxid,则接收这个提议,否则说明跟随者本地早已存在这个事务,直接忽略该提议即可。 62 | 在这里我们可以发现,跟随者接收到*PROPOSE*时,要么是该事务已存在于本地,要么是该事务的txn是本地中的下一项zxid。 -------------------------------------------------------------------------------- /doc-in-chinese/picture/pic_commit_wrong.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BinyuHuang-nju/zab-tla/6accd5689b6f9ea018d184bfbe6695a998dd8891/doc-in-chinese/picture/pic_commit_wrong.png 
--------------------------------------------------------------------------------
/doc-in-chinese/picture/pic_double_propose.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinyuHuang-nju/zab-tla/6accd5689b6f9ea018d184bfbe6695a998dd8891/doc-in-chinese/picture/pic_double_propose.png
--------------------------------------------------------------------------------
/results/README.md:
--------------------------------------------------------------------------------
1 | # Result of Verification
2 | > Our verification of Zab using model checking is in progress, and we have obtained part of the data set.
3 | > We show our results in this document, which is currently incomplete.
4 | 
5 | ## Experiment configuration
6 | 
7 | Our statistical results include: the diameter of the traversed state space, the number of states that have been traversed, the number of distinct states that have been discovered, and the time spent in the experiment.
8 | 
9 | The machine used in the experiment has a 2.40 GHz, 10-core CPU and 64 GB of memory. The TLC version number is 1.7.0.
10 | 
11 | ## State space constraints in model checking
12 | 
13 | Due to the state space explosion in model checking and the complex actions of the Zab protocol, as well as the unlimited number of rounds and unlimited length of history, it is impossible to traverse all states.
14 | We try to configure the models so that they can explore a larger state space. See CONSTANT *Parameters* for details.
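
Bounds of this kind are typically expressed as a TLC state constraint. The module below is only a sketch under stated assumptions, not the constraint used in `Zab.tla`: the field names `MaxElectionNum` and `MaxTransactionNum` are borrowed from the parameter names in the results tables, the module name `BoundSketch` and the operator `StateConstraint` are hypothetical, and `history[i]` is assumed to be a sequence.

```tla
---------------------------- MODULE BoundSketch ----------------------------
EXTENDS Naturals, Sequences

CONSTANTS Server,       \* set of servers
          Parameters    \* assumed record, e.g. [MaxElectionNum |-> 2, MaxTransactionNum |-> 2]

VARIABLES currentEpoch, \* currentEpoch[i]: current epoch of server i
          history       \* history[i]: sequence of transactions held by server i

\* Hypothetical constraint: bounds the number of epochs and the history length
\* so that the state space explored by TLC stays finite.
StateConstraint ==
    \A i \in Server :
        /\ currentEpoch[i] <= Parameters.MaxElectionNum
        /\ Len(history[i]) <= Parameters.MaxTransactionNum
=============================================================================
```

In a TLC configuration file an operator like this would be listed under `CONSTRAINT`. Larger bounds let the model reach more states at the cost of longer checking time.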
15 | 
16 | ## Verification results of model checking
17 | | Mode | TLC model | Diameter | num of states | time of checking(hh:mm:ss) |
18 | | ----- | ---------------------- | ------------- | ------------------ | ------------------ |
19 | | BFS | (2 servers,3 rounds,2 transactions) | 59 | 7758091583 | 17:28:17 |
20 | | Simulation | (2 servers,3 rounds,2 transactions) | - | 6412825222 | 17:07:20 |
21 | | BFS | (3 servers,2 rounds,2 transactions) | 19 | 4275801206 | 09:40:08 |
22 | | Simulation | (3 servers,2 rounds,2 transactions) | - | 10899460942 | 20:15:11 |
23 | | BFS | (3 servers,2 rounds,3 transactions) | 22 | 8740566213 | 17:49:09 |
24 | | Simulation | (3 servers,2 rounds,3 transactions) | - | 9639135842 | 20:10:00 |
25 | | BFS | (3 servers,3 rounds,2 transactions) | 21 | 7079744342 | 14:17:45 |
26 | | Simulation | (3 servers,3 rounds,2 transactions) | - | 6254964039 | 15:08:42 |
27 | | BFS | (4 servers,3 rounds,2 transactions) | 16 | 5634112480 | 15:42:09 |
28 | | Simulation | (4 servers,3 rounds,2 transactions) | - | 3883461291 | 15:52:03 |
29 | 
30 | ## Verification results with parameters (count of servers, MaxTotalRestartNum, MaxElectionNum, MaxTransactionNum)
31 | | Mode | TLC model | Diameter | num of states | time of checking(hh:mm:ss) |
32 | | ----- | ---------------------- | ------------- | ------------------ | ------------------ |
33 | | BFS | (2,2,3,2,termination) | 55 | 10772649 | 00:02:21 |
34 | | BFS | (3,1,1,2) | 45 | 9602018536 | 31:01:57 |
35 | 
36 | 
--------------------------------------------------------------------------------