├── .gitignore
├── Dockerfile
├── README.md
├── client
│   ├── client.go
│   └── main.go
├── config.yml
├── docker
│   └── swimring-cluster.sh
├── glide.lock
├── glide.yaml
├── hashring
│   ├── hashring.go
│   └── rbtree.go
├── main.go
├── membership
│   ├── disseminator.go
│   ├── gossip.go
│   ├── join.go
│   ├── member.go
│   ├── memberlist.go
│   ├── node.go
│   ├── ping.go
│   ├── protocol_handlers.go
│   └── state_transitions.go
├── request_coordinator.go
├── storage
│   ├── kvstore.go
│   └── request_handlers.go
├── swimring.go
└── util
    └── util.go

/.gitignore:
--------------------------------------------------------------------------------
1 | # Compiled Object files, Static and Dynamic libs (Shared Objects)
2 | *.o
3 | *.a
4 | *.so
5 | 
6 | # Folders
7 | _obj
8 | _test
9 | 
10 | # Architecture specific extensions/prefixes
11 | *.[568vq]
12 | [568vq].out
13 | 
14 | *.cgo1.go
15 | *.cgo2.c
16 | _cgo_defun.c
17 | _cgo_gotypes.go
18 | _cgo_export.*
19 | 
20 | _testmain.go
21 | 
22 | *.exe
23 | *.test
24 | *.prof
25 | 
26 | .vscode/
27 | vendor/
28 | swimring
29 | client/client
30 | 
31 | *.log
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM golang
2 | 
3 | COPY docker/swimring-cluster.sh /usr/local/bin/swimring-cluster
4 | COPY . /go/src/github.com/hungys/swimring
5 | COPY config.yml /go/bin
6 | 
7 | WORKDIR /go/src/github.com/hungys/swimring
8 | RUN curl https://glide.sh/get | sh
9 | RUN glide install
10 | RUN go install github.com/hungys/swimring
11 | 
12 | WORKDIR /go/bin
13 | 
14 | ENTRYPOINT swimring-cluster
15 | 
16 | EXPOSE 7000 7001
17 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | SwimRing
2 | ========
3 | 
4 | SwimRing is our final project for the Fault Tolerant Computing course at NCTU. The main goal is to build a minimal **distributed**, **fault-tolerant** key-value store with the *SWIM Gossip Protocol* and a *Consistent Hash Ring*. The project is implemented in Go, and much of the design and architecture is adapted from [Uber](http://www.uber.com)'s open source project - [Ringpop](https://github.com/uber/ringpop-go).
5 | 
6 | Note that this project is **NOT** intended to be deployed in a production environment.
7 | 
8 | # System Design
9 | 
10 | ## Partitioning
11 | 
12 | One of the key features of SwimRing is the ability to scale incrementally. The horizontal partitioning mechanism, also known as **sharding**, is used to achieve this. SwimRing partitions data across the cluster using *consistent hashing*, a special kind of hashing in which resizing the hash table only requires K/n keys to be remapped on average, where K is the number of keys and n is the number of slots. In consistent hashing, the output range of a hash function is treated as a fixed circular space or **ring**. Each node in the cluster is assigned a random value within this space, which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position.
13 | 
14 | One obvious challenge of basic consistent hashing is that the random position assignment of each node on the ring leads to non-uniform data and load distribution. SwimRing solves this by adopting the approach proposed in [Dynamo](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf): adding *virtual nodes* (vnodes) to the hash ring so that each real node is assigned multiple positions on the ring.
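As a rough illustration of the idea only (not the actual implementation in the repository's `hashring` package, which stores positions in a red-black tree for O(log N) lookups), a ring with virtual nodes can be sketched with a sorted slice; the type and helper names below are made up for this example:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps hashed positions (one per virtual node) back to server addresses.
type ring struct {
	positions []uint32          // sorted vnode positions on the ring
	owners    map[uint32]string // vnode position -> server address
	vnodes    int               // virtual nodes per server
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// add places several positions for one server on the ring to smooth out the distribution.
func (r *ring) add(server string) {
	for i := 0; i < r.vnodes; i++ {
		pos := hashOf(fmt.Sprintf("%s%d", server, i))
		r.owners[pos] = server
		r.positions = append(r.positions, pos)
	}
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
}

// lookup hashes the key and walks clockwise to the first vnode at or after that
// position, wrapping around to the beginning of the ring if necessary.
func (r *ring) lookup(key string) string {
	if len(r.positions) == 0 {
		return ""
	}
	pos := hashOf(key)
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= pos })
	if i == len(r.positions) {
		i = 0
	}
	return r.owners[r.positions[i]]
}

func main() {
	r := &ring{owners: map[uint32]string{}, vnodes: 5}
	r.add("192.168.1.63:7001")
	r.add("192.168.1.63:8001")
	fmt.Println(r.lookup("some-key")) // address of the node that owns "some-key"
}
```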
15 | 
16 | In SwimRing, every node plays the same role and differs only in the data items it actually stores; that is, there is no need to distinguish between master and slave roles, and every node can act as a coordinator to handle and forward incoming requests. With this design there is no single point of failure, although an existing system such as [ZooKeeper](https://zookeeper.apache.org/), or a consensus algorithm such as [Paxos](http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F/notes/paxos-simple.pdf), could still be used to elect a primary coordinator.
17 | 
18 | ## Replication
19 | 
20 | SwimRing uses replication to achieve high availability and durability. Each data item is replicated at N nodes, where N is the *replication factor* configured by the administrator. When a write request arrives at a coordinator, the coordinator hashes the key to determine which node is the primary replica. The other N-1 replicas are chosen by picking the N-1 **successors** of the primary replica on the ring. The write request is then forwarded to all the replicas to update the data item. Like [Cassandra](http://cassandra.apache.org/), SwimRing provides different levels of read/write consistency, including *ONE*, *QUORUM*, and *ALL*. If the write consistency level specified by the client is *QUORUM*, for example, the response is not sent back to the client until more than half of the writes have completed.
21 | 
22 | For a read request, the most recent data item (based on timestamp) is returned to the client. To ensure that all replicas hold the most recent version of frequently-read data, the coordinator also contacts and compares the data from all replicas in the background. If the replicas are inconsistent, a **read-repair** process is executed to bring the out-of-date replicas up to date.
23 | 
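To make the consistency levels concrete, the coordinator's write path can be sketched as follows. This is only an illustration of the quorum rule described above, not the code in `request_coordinator.go`; the helper names and the fixed five-second timeout are assumptions for this sketch:

```go
package main

import (
	"errors"
	"time"
)

// acksNeeded returns how many replica ACKs a write must collect before the
// coordinator replies to the client, given the consistency level.
func acksNeeded(level string, replicas int) int {
	switch level {
	case "ONE":
		return 1
	case "ALL":
		return replicas
	default: // "QUORUM": more than half of the replicas
		return replicas/2 + 1
	}
}

// coordinateWrite forwards a write to every replica concurrently and returns
// as soon as enough ACKs have been received for the requested level.
func coordinateWrite(replicas []string, level string, write func(addr string) error) error {
	need := acksNeeded(level, len(replicas))
	results := make(chan error, len(replicas))

	for _, addr := range replicas {
		go func(addr string) { results <- write(addr) }(addr)
	}

	acks := 0
	for i := 0; i < len(replicas); i++ {
		select {
		case err := <-results:
			if err == nil {
				acks++
				if acks >= need {
					return nil // enough replicas acknowledged the write
				}
			}
		case <-time.After(5 * time.Second):
			return errors.New("write timed out")
		}
	}
	return errors.New("not enough replicas acknowledged the write")
}
```

For a 3-way replicated key written at *QUORUM*, `acksNeeded` is 2, so the client gets a response once two of the three replicas have acknowledged; the remaining write still completes in the background.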
24 | ## Membership / Failure Detection
25 | 
26 | Membership information in SwimRing is maintained in a ring-like structure and disseminated to other nodes by the *Gossip* module, which is also used for failure detection. The gossip protocol in SwimRing is based on [SWIM](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.5253&rep=rep1&type=pdf). Failure detection is done by periodic random probing (*ping*). If the probed node fails to send an ACK back within a reasonable time, an indirect probe (*ping-req*) is attempted: a configurable number of random nodes are asked to probe the same node, in case network issues caused our own probe to fail. If both our probe and the indirect probes fail within a reasonable time, the node is marked *suspect* and this knowledge is gossiped to the cluster. A suspected node is still considered a member of the cluster. If the suspected node does not dispute the suspicion within a configurable period of time, it is finally considered faulty, and this state is then gossiped to the cluster.
27 | 
28 | ## Local Persistence
29 | 
30 | The SwimRing system relies on the local file system for data persistence. Since it is not the main focus of this project, the storage engine is simplified to provide only basic crash recovery. When a write request arrives, the data item is first written to an **append-only** *commit log* on disk and then to the in-memory *memtable*. SwimRing periodically checkpoints the *memtable* by writing a snapshot to a *dump file* on disk. To recover from a node crash (for example, one caused by a power failure), the *dump file* is loaded back into the *memtable* and the *commit log* is replayed.
31 | 
32 | # Get Started
33 | 
34 | To get SwimRing,
35 | 
36 | ```bash
37 | $ go get github.com/hungys/swimring
38 | ```
39 | 
40 | Then switch to your GOPATH and build the project,
41 | 
42 | ```bash
43 | $ cd $GOPATH/src/github.com/hungys/swimring
44 | $ glide install
45 | $ go build
46 | ```
47 | 
48 | To launch a node, first make sure `config.yml` is configured correctly, then run,
49 | 
50 | ```bash
51 | (term1) $ ./swimring
52 | ```
53 | 
54 | This command loads its configuration from `config.yml`, including the RPC listen ports, bootstrap seeds, timeouts, and the **replication factor** (make sure this value is no more than the total number of nodes in the cluster).
55 | 
56 | For testing purposes, you can form a cluster locally and use command line flags to specify the external and internal RPC ports manually,
57 | 
58 | ```bash
59 | (term2) $ ./swimring -export=8000 -inport=8001
60 | (term3) $ ./swimring -export=9000 -inport=9001
61 | (term4) $ ./swimring -export=10000 -inport=10001
62 | (term5) $ ./swimring -export=11000 -inport=11001
63 | (term6) $ ./swimring -export=12000 -inport=12001
64 | ```
65 | 
66 | Now a cluster of 6 SwimRing nodes is running.
67 | 
68 | To build the client program,
69 | 
70 | ```bash
71 | $ cd client
72 | $ go build
73 | ```
74 | 
75 | By default, the client connects to port 7000 (Node 1 in this example), and both the read and write consistency levels are set to *QUORUM*. See `./client -h` for more options,
76 | 
77 | ```bash
78 | $ ./client -h
79 | Usage of ./client:
80 |   -host string
81 |         address of server node (default "127.0.0.1")
82 |   -port int
83 |         port number of server node (default 7000)
84 |   -rl string
85 |         read consistency level (default "QUORUM"): ONE, QUORUM, ALL
86 |   -wl string
87 |         write consistency level (default "QUORUM"): ONE, QUORUM, ALL
88 | ```
89 | 
90 | Now you can use `get <key>`, `put <key> <value>`, and `del <key>` to operate on the database. To check the current status of the cluster, simply use the `stat` command.
91 | 
92 | ```
93 | $ ./client
94 | connected to 127.0.0.1:7000
95 | > get 1
96 | error: key not found
97 | > put 1 1
98 | ok
99 | > get 1
100 | 1
101 | > stat
102 | +--------------------+--------+-----------+
103 | |      ADDRESS       | STATUS | KEY COUNT |
104 | +--------------------+--------+-----------+
105 | | 192.168.1.63:7001  | alive  | 1         |
106 | | 192.168.1.63:8001  | alive  | 0         |
107 | | 192.168.1.63:9001  | alive  | 0         |
108 | | 192.168.1.63:10001 | alive  | 1         |
109 | | 192.168.1.63:11001 | alive  | 0         |
110 | | 192.168.1.63:12001 | alive  | 1         |
111 | +--------------------+--------+-----------+
112 | ```
113 | 
114 | ## Docker container
115 | 
116 | We also provide a Dockerfile for deploying SwimRing. To build the Docker image,
117 | 
118 | ```bash
119 | $ docker build -t swimring .
120 | ```
121 | 
122 | Then you can run SwimRing in Docker containers. To configure the **seeds** (nodes that help a new node join the cluster), simply specify their addresses and ports in the environment variable `SEEDS`.
123 | 124 | For example, the following commands will launch a cluster of 4 nodes locally, 125 | 126 | ```bash 127 | $ docker run -d -p 5001:7000 --name s1 swimring 128 | $ docker run -d -p 5002:7000 -e 'SEEDS=["172.17.0.2:7001"]' --name s2 swimring 129 | $ docker run -d -p 5003:7000 -e 'SEEDS=["172.17.0.2:7001"]' --name s3 swimring 130 | $ docker run -d -p 5004:7000 -e 'SEEDS=["172.17.0.2:7001"]' --name s4 swimring 131 | ``` 132 | 133 | # Reference 134 | 135 | - Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." *ACM SIGOPS Operating Systems Review* 44.2 (2010): 35-40. 136 | - DeCandia, Giuseppe, et al. "Dynamo: amazon's highly available key-value store." *ACM SIGOPS Operating Systems Review*. Vol. 41. No. 6. ACM, 2007. 137 | - Das, Abhinandan, Indranil Gupta, and Ashish Motivala. "Swim: Scalable weakly-consistent infection-style process group membership protocol." *Dependable Systems and Networks*, 2002. DSN 2002. Proceedings. International Conference on. IEEE, 2002. 138 | - Ringpop by Uber. http://uber.github.io/ringpop/ 139 | -------------------------------------------------------------------------------- /client/client.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "errors" 5 | "fmt" 6 | "net/rpc" 7 | "strconv" 8 | "strings" 9 | ) 10 | 11 | const ( 12 | // ONE is the weakest consistency level. 13 | // For read request, returns value when the first response arrived. 14 | // For write request, returns when the first ACK received. 15 | ONE = "ONE" 16 | // QUORUM is the moderate consistency level. 17 | // For read request, returns value when the quorum set of replicas all responded. 18 | // For write request, returns when the quorum set of replicas all responded ACKs. 19 | QUORUM = "QUORUM" 20 | // ALL is the strongest consistency level. 21 | // For read request, returns value when all replicas responded. 22 | // For write request, returns when all replicas all responded ACKs. 23 | ALL = "ALL" 24 | // GetOp is the name of the service method for Get. 25 | GetOp = "SwimRing.Get" 26 | // PutOp is the name of the service method for Put. 27 | PutOp = "SwimRing.Put" 28 | // DeleteOp is the name of the service method for Delete. 29 | DeleteOp = "SwimRing.Delete" 30 | // StatOp is the name of the service method for Stat. 31 | StatOp = "SwimRing.Stat" 32 | ) 33 | 34 | // SwimringClient is a RPC client for connecting to SwimRing server. 35 | type SwimringClient struct { 36 | address string 37 | port int 38 | client *rpc.Client 39 | 40 | readLevel string 41 | writeLevel string 42 | } 43 | 44 | // GetRequest is the payload of Get. 45 | type GetRequest struct { 46 | Level string 47 | Key string 48 | } 49 | 50 | // GetResponse is the payload of the response of Get. 51 | type GetResponse struct { 52 | Key, Value string 53 | } 54 | 55 | // PutRequest is the payload of Put. 56 | type PutRequest struct { 57 | Level string 58 | Key, Value string 59 | } 60 | 61 | // PutResponse is the payload of the response of Put. 62 | type PutResponse struct{} 63 | 64 | // DeleteRequest is the payload of Delete. 65 | type DeleteRequest struct { 66 | Level string 67 | Key string 68 | } 69 | 70 | // DeleteResponse is the payload of the response of Delete. 71 | type DeleteResponse struct{} 72 | 73 | // StateRequest is the payload of Stat. 74 | type StateRequest struct{} 75 | 76 | // StateResponse is the payload of the response of Stat. 
77 | type StateResponse struct { 78 | Nodes []NodeStat 79 | } 80 | 81 | // NodeStat stores the information of a Node 82 | type NodeStat struct { 83 | Address string 84 | Status string 85 | KeyCount int 86 | } 87 | 88 | // NodeStats is an array of NodeStat 89 | type NodeStats []NodeStat 90 | 91 | // NewSwimringClient returns a new SwimringClient instance. 92 | func NewSwimringClient(address string, port int) *SwimringClient { 93 | c := &SwimringClient{ 94 | address: address, 95 | port: port, 96 | readLevel: ALL, 97 | writeLevel: ALL, 98 | } 99 | 100 | return c 101 | } 102 | 103 | // SetReadLevel sets the readLevel to specific level. 104 | func (c *SwimringClient) SetReadLevel(level string) { 105 | c.readLevel = level 106 | } 107 | 108 | // SetWriteLevel sets the writeLevel to specific level. 109 | func (c *SwimringClient) SetWriteLevel(level string) { 110 | c.writeLevel = level 111 | } 112 | 113 | // Connect establishes a connection to remote RPC server. 114 | func (c *SwimringClient) Connect() error { 115 | var err error 116 | c.client, err = rpc.Dial("tcp", fmt.Sprintf("%s:%d", c.address, c.port)) 117 | if err != nil { 118 | return err 119 | } 120 | 121 | return nil 122 | } 123 | 124 | // Get calls the remote Get method and returns the requested value. 125 | func (c *SwimringClient) Get(key string) (string, error) { 126 | if c.client == nil { 127 | return "", errors.New("not connected") 128 | } 129 | 130 | req := &GetRequest{ 131 | Key: key, 132 | Level: c.readLevel, 133 | } 134 | resp := &GetResponse{} 135 | 136 | err := c.client.Call(GetOp, req, resp) 137 | if err != nil { 138 | return "", err 139 | } 140 | 141 | return resp.Value, nil 142 | } 143 | 144 | // Put calls the remote Put method to update for specific key. 145 | func (c *SwimringClient) Put(key, value string) error { 146 | if c.client == nil { 147 | return errors.New("not connected") 148 | } 149 | 150 | req := &PutRequest{ 151 | Key: key, 152 | Value: value, 153 | Level: c.writeLevel, 154 | } 155 | resp := &PutResponse{} 156 | 157 | err := c.client.Call(PutOp, req, resp) 158 | if err != nil { 159 | return err 160 | } 161 | 162 | return nil 163 | } 164 | 165 | // Delete calls the remote Delete method to remove specific key. 166 | func (c *SwimringClient) Delete(key string) error { 167 | if c.client == nil { 168 | return errors.New("not connected") 169 | } 170 | 171 | req := &DeleteRequest{ 172 | Key: key, 173 | Level: c.writeLevel, 174 | } 175 | resp := &DeleteResponse{} 176 | 177 | err := c.client.Call(DeleteOp, req, resp) 178 | if err != nil { 179 | return err 180 | } 181 | 182 | return nil 183 | } 184 | 185 | // Stat calls the remote Stat method to gather Nodes' information. 
186 | func (c *SwimringClient) Stat() (NodeStats, error) { 187 | if c.client == nil { 188 | return nil, errors.New("not connected") 189 | } 190 | 191 | req := &StateRequest{} 192 | resp := &StateResponse{} 193 | 194 | err := c.client.Call(StatOp, req, resp) 195 | if err != nil { 196 | return nil, err 197 | } 198 | 199 | return NodeStats(resp.Nodes), nil 200 | } 201 | 202 | func (ns NodeStats) Len() int { 203 | return len(ns) 204 | } 205 | 206 | func (ns NodeStats) Less(i, j int) bool { 207 | itokens := strings.Split(ns[i].Address, ":") 208 | jtokens := strings.Split(ns[j].Address, ":") 209 | 210 | if itokens[0] != jtokens[0] { 211 | return itokens[0] < jtokens[0] 212 | } 213 | 214 | iport, _ := strconv.Atoi(itokens[1]) 215 | jport, _ := strconv.Atoi(jtokens[1]) 216 | return iport < jport 217 | } 218 | 219 | func (ns NodeStats) Swap(i, j int) { 220 | ns[i], ns[j] = ns[j], ns[i] 221 | } 222 | -------------------------------------------------------------------------------- /client/main.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "bufio" 5 | "errors" 6 | "flag" 7 | "fmt" 8 | "os" 9 | "sort" 10 | "strconv" 11 | "strings" 12 | 13 | "github.com/hungys/swimring/util" 14 | "github.com/olekukonko/tablewriter" 15 | ) 16 | 17 | const ( 18 | GetCmd = "get" 19 | PutCmd = "put" 20 | DeleteCmd = "del" 21 | StatCmd = "stat" 22 | ExitCmd = "exit" 23 | ) 24 | 25 | var client *SwimringClient 26 | 27 | func main() { 28 | var serverAddr string 29 | var serverPort int 30 | var readLevel, writeLevel string 31 | 32 | flag.StringVar(&serverAddr, "host", "127.0.0.1", "address of server node") 33 | flag.IntVar(&serverPort, "port", 7000, "port number of server node") 34 | flag.StringVar(&readLevel, "rl", QUORUM, "read consistency level") 35 | flag.StringVar(&writeLevel, "wl", QUORUM, "write consistency level") 36 | flag.Parse() 37 | 38 | client = NewSwimringClient(serverAddr, serverPort) 39 | client.SetReadLevel(readLevel) 40 | client.SetWriteLevel(writeLevel) 41 | 42 | err := client.Connect() 43 | if err != nil { 44 | fmt.Printf("error: unable to connect to %s:%d\n", serverAddr, serverPort) 45 | os.Exit(0) 46 | } 47 | fmt.Printf("connected to %s:%d\n", serverAddr, serverPort) 48 | 49 | reader := bufio.NewReader(os.Stdin) 50 | for { 51 | fmt.Print("> ") 52 | command, _ := reader.ReadString('\n') 53 | if err := processCommand(strings.Trim(command, " \r\n")); err != nil { 54 | fmt.Println(err.Error()) 55 | } 56 | } 57 | } 58 | 59 | func processCommand(line string) error { 60 | tokens := util.SafeSplit(line) 61 | 62 | if len(tokens) == 0 { 63 | return nil 64 | } 65 | 66 | switch tokens[0] { 67 | case GetCmd: 68 | processGet(tokens) 69 | case PutCmd: 70 | processPut(tokens) 71 | case DeleteCmd: 72 | processDelete(tokens) 73 | case StatCmd: 74 | processStat(tokens) 75 | case ExitCmd: 76 | os.Exit(0) 77 | default: 78 | return errors.New("unknown command") 79 | } 80 | 81 | return nil 82 | } 83 | 84 | func processGet(tokens []string) { 85 | if len(tokens) != 2 { 86 | fmt.Println("usage: get ") 87 | return 88 | } 89 | 90 | val, err := client.Get(tokens[1]) 91 | if err != nil { 92 | fmt.Printf("error: %s\n", err.Error()) 93 | return 94 | } 95 | 96 | fmt.Println(val) 97 | } 98 | 99 | func processPut(tokens []string) { 100 | if len(tokens) != 3 { 101 | fmt.Println("usage: put ") 102 | return 103 | } 104 | 105 | err := client.Put(tokens[1], tokens[2]) 106 | if err != nil { 107 | fmt.Printf("error: %s\n", err.Error()) 108 | return 109 | } 110 | 111 | 
fmt.Println("ok") 112 | } 113 | 114 | func processDelete(tokens []string) { 115 | if len(tokens) != 2 { 116 | fmt.Println("usage: del ") 117 | return 118 | } 119 | 120 | err := client.Delete(tokens[1]) 121 | if err != nil { 122 | fmt.Printf("error: %s\n", err.Error()) 123 | return 124 | } 125 | 126 | fmt.Println("ok") 127 | } 128 | 129 | func processStat(tokens []string) { 130 | nodes, err := client.Stat() 131 | if err != nil { 132 | fmt.Printf("error: %s\n", err.Error()) 133 | return 134 | } 135 | 136 | var data [][]string 137 | sort.Sort(nodes) 138 | 139 | for _, node := range nodes { 140 | var n []string 141 | n = append(n, node.Address) 142 | n = append(n, node.Status) 143 | n = append(n, strconv.Itoa(node.KeyCount)) 144 | data = append(data, n) 145 | } 146 | 147 | table := tablewriter.NewWriter(os.Stdout) 148 | table.SetHeader([]string{"Address", "Status", "Key Count"}) 149 | 150 | for _, d := range data { 151 | table.Append(d) 152 | } 153 | table.Render() 154 | } 155 | -------------------------------------------------------------------------------- /config.yml: -------------------------------------------------------------------------------- 1 | ExternalPort: 7000 2 | InternalPort: 7001 3 | JoinTimeout: 1000 4 | SuspectTimeout: 5000 5 | PingTimeout: 1500 6 | PingRequestTimeout: 5000 7 | MinProtocolPeriod: 200 8 | PingRequestSize: 3 9 | VirtualNodeSize: 5 10 | KVSReplicaPoints: 3 11 | BootstrapNodes: [":7001"] -------------------------------------------------------------------------------- /docker/swimring-cluster.sh: -------------------------------------------------------------------------------- 1 | if [ -z "$SEEDS" ]; then 2 | echo "No seeds specified, being my own seed..." 3 | SEEDS = "[\":7001\"]" 4 | fi 5 | sed -i -e "s/BootstrapNodes: \[\":7001\"\]/BootstrapNodes: $SEEDS/" /go/bin/config.yml 6 | 7 | /go/bin/swimring -------------------------------------------------------------------------------- /glide.lock: -------------------------------------------------------------------------------- 1 | hash: dd6123f1d7c4fb49f1e3b1d5ef7af68ee2001e268746640b85d965332efdf668 2 | updated: 2016-07-12T15:00:17.353811197+08:00 3 | imports: 4 | - name: github.com/dgryski/go-farm 5 | version: f94e93e5a2b3a579af0f5367c84b54cb9aa00074 6 | - name: github.com/mattn/go-runewidth 7 | version: d6bea18f789704b5f83375793155289da36a3c7f 8 | - name: github.com/olekukonko/tablewriter 9 | version: daf2955e742cf123959884fdff4685aa79b63135 10 | - name: github.com/op/go-logging 11 | version: b2cb9fa56473e98db8caba80237377e83fe44db5 12 | - name: gopkg.in/yaml.v2 13 | version: a83829b6f1293c91addabc89d0571c246397bbf4 14 | testImports: [] 15 | -------------------------------------------------------------------------------- /glide.yaml: -------------------------------------------------------------------------------- 1 | package: github.com/hungys/swimring 2 | import: 3 | - package: github.com/dgryski/go-farm 4 | - package: github.com/olekukonko/tablewriter 5 | - package: github.com/op/go-logging 6 | version: ^1.0.0 7 | - package: gopkg.in/yaml.v2 8 | -------------------------------------------------------------------------------- /hashring/hashring.go: -------------------------------------------------------------------------------- 1 | package hashring 2 | 3 | import ( 4 | "fmt" 5 | "sync" 6 | 7 | "github.com/op/go-logging" 8 | ) 9 | 10 | var logger = logging.MustGetLogger("hashring") 11 | 12 | // HashRing stores strings on a consistent hash ring. 
HashRing internally uses 13 | // a Red-Black Tree to achieve O(log N) lookup and insertion time. 14 | type HashRing struct { 15 | sync.RWMutex 16 | 17 | hashfunc func(string) int 18 | replicaPoints int 19 | 20 | serverSet map[string]struct{} 21 | tree *redBlackTree 22 | } 23 | 24 | // NewHashRing instantiates and returns a new HashRing. 25 | func NewHashRing(hashfunc func([]byte) uint32, replicaPoints int) *HashRing { 26 | r := &HashRing{ 27 | replicaPoints: replicaPoints, 28 | hashfunc: func(str string) int { 29 | return int(hashfunc([]byte(str))) 30 | }, 31 | } 32 | 33 | r.serverSet = make(map[string]struct{}) 34 | r.tree = &redBlackTree{} 35 | return r 36 | } 37 | 38 | // AddServer adds a server and its replicas onto the HashRing. 39 | func (r *HashRing) AddServer(address string) bool { 40 | r.Lock() 41 | ok := r.addServerNoLock(address) 42 | r.Unlock() 43 | return ok 44 | } 45 | 46 | func (r *HashRing) addServerNoLock(address string) bool { 47 | if _, ok := r.serverSet[address]; ok { 48 | return false 49 | } 50 | 51 | r.addVirtualNodesNoLock(address) 52 | logger.Noticef("Server %s added to ring", address) 53 | return true 54 | } 55 | 56 | func (r *HashRing) addVirtualNodesNoLock(server string) { 57 | r.serverSet[server] = struct{}{} 58 | for i := 0; i < r.replicaPoints; i++ { 59 | address := fmt.Sprintf("%s%v", server, i) 60 | key := r.hashfunc(address) 61 | r.tree.Insert(key, server) 62 | logger.Debugf("Virtual node %d added for %s", key, server) 63 | } 64 | } 65 | 66 | // RemoveServer removes a server and its replicas from the HashRing. 67 | func (r *HashRing) RemoveServer(address string) bool { 68 | r.Lock() 69 | ok := r.removeServerNoLock(address) 70 | r.Unlock() 71 | return ok 72 | } 73 | 74 | func (r *HashRing) removeServerNoLock(address string) bool { 75 | if _, ok := r.serverSet[address]; !ok { 76 | return false 77 | } 78 | 79 | r.removeVirtualNodesNoLock(address) 80 | logger.Noticef("Server %s removed from ring", address) 81 | return true 82 | } 83 | 84 | func (r *HashRing) removeVirtualNodesNoLock(server string) { 85 | delete(r.serverSet, server) 86 | for i := 0; i < r.replicaPoints; i++ { 87 | address := fmt.Sprintf("%s%v", server, i) 88 | key := r.hashfunc(address) 89 | r.tree.Delete(key) 90 | logger.Debugf("Virtual node %d removed for %s", key, server) 91 | } 92 | } 93 | 94 | // AddRemoveServers adds and removes servers and all replicas associated to those 95 | // servers to and from the HashRing. Returns whether the HashRing has changed. 96 | func (r *HashRing) AddRemoveServers(add []string, remove []string) bool { 97 | r.Lock() 98 | result := r.addRemoveServersNoLock(add, remove) 99 | r.Unlock() 100 | return result 101 | } 102 | 103 | func (r *HashRing) addRemoveServersNoLock(add []string, remove []string) bool { 104 | changed := false 105 | 106 | for _, server := range add { 107 | if r.addServerNoLock(server) { 108 | changed = true 109 | } 110 | } 111 | 112 | for _, server := range remove { 113 | if r.removeServerNoLock(server) { 114 | changed = true 115 | } 116 | } 117 | 118 | return changed 119 | } 120 | 121 | func (r *HashRing) copyServersNoLock() []string { 122 | var servers []string 123 | for server := range r.serverSet { 124 | servers = append(servers, server) 125 | } 126 | return servers 127 | } 128 | 129 | // Lookup returns the owner of the given key and whether the HashRing contains 130 | // the key at all. 
131 | func (r *HashRing) Lookup(key string) (string, bool) { 132 | strs := r.LookupN(key, 1) 133 | if len(strs) == 0 { 134 | return "", false 135 | } 136 | 137 | logger.Debugf("Lookup(%s)=%s", key, strs[0]) 138 | return strs[0], true 139 | } 140 | 141 | // LookupN returns the N servers that own the given key. Duplicates in the form 142 | // of virtual nodes are skipped to maintain a list of unique servers. If there 143 | // are less servers than N, we simply return all existing servers. 144 | func (r *HashRing) LookupN(key string, n int) []string { 145 | r.RLock() 146 | servers := r.lookupNNoLock(key, n) 147 | r.RUnlock() 148 | 149 | logger.Debugf("LookupN(%s) = %v", key, servers) 150 | return servers 151 | } 152 | 153 | func (r *HashRing) lookupNNoLock(key string, n int) []string { 154 | if n >= len(r.serverSet) { 155 | return r.copyServersNoLock() 156 | } 157 | 158 | hash := r.hashfunc(key) 159 | unique := make(map[string]struct{}) 160 | 161 | r.tree.LookupNUniqueAt(n, hash, unique) 162 | if len(unique) < n { 163 | r.tree.LookupNUniqueAt(n, 0, unique) 164 | } 165 | 166 | var servers []string 167 | for server := range unique { 168 | servers = append(servers, server) 169 | } 170 | return servers 171 | } 172 | -------------------------------------------------------------------------------- /hashring/rbtree.go: -------------------------------------------------------------------------------- 1 | package hashring 2 | 3 | // redBlackTree is an implemantation of a Red Black Tree 4 | type redBlackTree struct { 5 | root *redBlackNode 6 | size int 7 | } 8 | 9 | // redBlackNode is a node of the redBlackTree 10 | type redBlackNode struct { 11 | val int 12 | str string 13 | left *redBlackNode 14 | right *redBlackNode 15 | red bool 16 | } 17 | 18 | // Size returns the number of nodes in the redBlackTree 19 | func (t *redBlackTree) Size() int { 20 | return t.size 21 | } 22 | 23 | // Child returns the left or right node of the redBlackTree 24 | func (n *redBlackNode) Child(right bool) *redBlackNode { 25 | if right { 26 | return n.right 27 | } 28 | return n.left 29 | } 30 | 31 | func (n *redBlackNode) setChild(right bool, node *redBlackNode) { 32 | if right { 33 | n.right = node 34 | } else { 35 | n.left = node 36 | } 37 | } 38 | 39 | // returns true if redBlackNode is red 40 | func isRed(node *redBlackNode) bool { 41 | return node != nil && node.red 42 | } 43 | 44 | func singleRotate(oldroot *redBlackNode, dir bool) *redBlackNode { 45 | newroot := oldroot.Child(!dir) 46 | 47 | oldroot.setChild(!dir, newroot.Child(dir)) 48 | newroot.setChild(dir, oldroot) 49 | 50 | oldroot.red = true 51 | newroot.red = false 52 | 53 | return newroot 54 | } 55 | 56 | func doubleRotate(root *redBlackNode, dir bool) *redBlackNode { 57 | root.setChild(!dir, singleRotate(root.Child(!dir), !dir)) 58 | return singleRotate(root, dir) 59 | } 60 | 61 | // Insert inserts a value and string into the tree 62 | // Returns true on succesful insertion, false if duplicate exists 63 | func (t *redBlackTree) Insert(val int, str string) (ret bool) { 64 | if t.root == nil { 65 | t.root = &redBlackNode{val: val, str: str} 66 | ret = true 67 | } else { 68 | var head = &redBlackNode{} 69 | 70 | var dir = true 71 | var last = true 72 | 73 | var parent *redBlackNode // parent 74 | var gparent *redBlackNode // grandparent 75 | var ggparent = head // great grandparent 76 | var node = t.root 77 | 78 | ggparent.right = t.root 79 | 80 | for { 81 | if node == nil { 82 | // insert new node at bottom 83 | node = &redBlackNode{val: val, str: str, red: 
true} 84 | parent.setChild(dir, node) 85 | ret = true 86 | } else if isRed(node.left) && isRed(node.right) { 87 | // flip colors 88 | node.red = true 89 | node.left.red, node.right.red = false, false 90 | } 91 | // fix red violation 92 | if isRed(node) && isRed(parent) { 93 | dir2 := ggparent.right == gparent 94 | 95 | if node == parent.Child(last) { 96 | ggparent.setChild(dir2, singleRotate(gparent, !last)) 97 | } else { 98 | ggparent.setChild(dir2, doubleRotate(gparent, !last)) 99 | } 100 | } 101 | 102 | cmp := node.val - val 103 | 104 | // stop if found 105 | if cmp == 0 { 106 | break 107 | } 108 | 109 | last = dir 110 | dir = cmp < 0 111 | 112 | // update helpers 113 | if gparent != nil { 114 | ggparent = gparent 115 | } 116 | gparent = parent 117 | parent = node 118 | 119 | node = node.Child(dir) 120 | } 121 | 122 | t.root = head.right 123 | } 124 | 125 | // make root black 126 | t.root.red = false 127 | 128 | if ret { 129 | t.size++ 130 | } 131 | 132 | return ret 133 | } 134 | 135 | // Delete removes a value from the redBlackTree 136 | // Returns true on succesful deletion, false if val is not in tree 137 | func (t *redBlackTree) Delete(val int) bool { 138 | if t.root == nil { 139 | return false 140 | } 141 | 142 | var head = &redBlackNode{red: true} // fake red node to push down 143 | var node = head 144 | var parent *redBlackNode //parent 145 | var gparent *redBlackNode //grandparent 146 | var found *redBlackNode 147 | 148 | var dir = true 149 | 150 | node.right = t.root 151 | 152 | for node.Child(dir) != nil { 153 | last := dir 154 | 155 | // update helpers 156 | gparent = parent 157 | parent = node 158 | node = node.Child(dir) 159 | 160 | cmp := node.val - val 161 | 162 | dir = cmp < 0 163 | 164 | // save node if found 165 | if cmp == 0 { 166 | found = node 167 | } 168 | 169 | // pretend to push red node down 170 | if !isRed(node) && !isRed(node.Child(dir)) { 171 | if isRed(node.Child(!dir)) { 172 | sr := singleRotate(node, dir) 173 | parent.setChild(last, sr) 174 | parent = sr 175 | } else { 176 | sibling := parent.Child(!last) 177 | if sibling != nil { 178 | if !isRed(sibling.Child(!last)) && !isRed(sibling.Child(last)) { 179 | // flip colors 180 | parent.red = false 181 | sibling.red, node.red = true, true 182 | } else { 183 | dir2 := gparent.right == parent 184 | 185 | if isRed(sibling.Child(last)) { 186 | gparent.setChild(dir2, doubleRotate(parent, last)) 187 | } else if isRed(sibling.Child(!last)) { 188 | gparent.setChild(dir2, singleRotate(parent, last)) 189 | } 190 | 191 | gpc := gparent.Child(dir2) 192 | gpc.red = true 193 | node.red = true 194 | gpc.left.red, gpc.right.red = false, false 195 | } 196 | } 197 | } 198 | } 199 | } 200 | 201 | // get rid of node if we've found one 202 | if found != nil { 203 | found.val = node.val 204 | found.str = node.str 205 | parent.setChild(parent.right == node, node.Child(node.left == nil)) 206 | t.size-- 207 | } 208 | 209 | t.root = head.right 210 | if t.root != nil { 211 | t.root.red = false 212 | } 213 | 214 | return found != nil 215 | } 216 | 217 | func (n *redBlackNode) search(val int) (string, bool) { 218 | if n.val == val { 219 | return n.str, true 220 | } else if val < n.val { 221 | if n.left != nil { 222 | return n.left.search(val) 223 | } 224 | } else if n.right != nil { 225 | return n.right.search(val) 226 | } 227 | return "", false 228 | } 229 | 230 | // Search searches for a value in the redBlackTree, returns the string and true 231 | // if found or the empty string and false if val is not in the tree. 
232 | func (t *redBlackTree) Search(val int) (string, bool) {
233 | 	if t.root == nil {
234 | 		return "", false
235 | 	}
236 | 	return t.root.search(val)
237 | }
238 | 
239 | // LookupNUniqueAt iterates through the tree from the node with value val, and
240 | // returns the next n unique strings. This function is not guaranteed to
241 | // return n strings.
242 | func (t *redBlackTree) LookupNUniqueAt(n int, val int, result map[string]struct{}) {
243 | 	findNUniqueAbove(t.root, n, val, result)
244 | }
245 | 
246 | // findNUniqueAbove is a recursive search that finds n unique strings
247 | // with a value greater than or equal to val
248 | func findNUniqueAbove(node *redBlackNode, n int, val int, result map[string]struct{}) {
249 | 	if len(result) >= n || node == nil {
250 | 		return
251 | 	}
252 | 
253 | 	// skip left branch when all its values are smaller than val
254 | 	if node.val >= val {
255 | 		findNUniqueAbove(node.left, n, val, result)
256 | 	}
257 | 
258 | 	// Make sure to stop when we have n unique strings
259 | 	if len(result) >= n {
260 | 		return
261 | 	}
262 | 
263 | 	if node.val >= val {
264 | 		result[node.str] = struct{}{}
265 | 	}
266 | 
267 | 	findNUniqueAbove(node.right, n, val, result)
268 | }
269 | 
--------------------------------------------------------------------------------
/main.go:
--------------------------------------------------------------------------------
1 | package main
2 | 
3 | import (
4 | 	"flag"
5 | 	"io/ioutil"
6 | 	"os"
7 | 	"strings"
8 | 
9 | 	"gopkg.in/yaml.v2"
10 | 
11 | 	"github.com/hungys/swimring/util"
12 | 	"github.com/op/go-logging"
13 | )
14 | 
15 | var logger = logging.MustGetLogger("swimring")
16 | 
17 | func main() {
18 | 	var externalPort, internalPort int
19 | 	var localIPAddr string
20 | 
21 | 	initializeLogger()
22 | 	localIPAddr = util.GetLocalIP()
23 | 	config := loadConfig()
24 | 
25 | 	flag.IntVar(&externalPort, "export", config.ExternalPort, "port number for external request")
26 | 	flag.IntVar(&internalPort, "inport", config.InternalPort, "port number for internal protocol communication")
27 | 	flag.Parse()
28 | 
29 | 	config.Host = localIPAddr
30 | 	config.ExternalPort = externalPort
31 | 	config.InternalPort = internalPort
32 | 
33 | 	logger.Infof("IP address: %s", localIPAddr)
34 | 	logger.Infof("External port: %d", config.ExternalPort)
35 | 	logger.Infof("Internal port: %d", config.InternalPort)
36 | 	logger.Infof("Bootstrap nodes: %v", config.BootstrapNodes)
37 | 
38 | 	swimring := NewSwimRing(config)
39 | 	swimring.Bootstrap()
40 | 
41 | 	select {}
42 | }
43 | 
44 | func initializeLogger() {
45 | 	var format = logging.MustStringFormatter(
46 | 		`%{color}%{time:15:04:05.000} %{shortpkg} ▶ %{level:.4s}%{color:reset} %{message}`,
47 | 	)
48 | 
49 | 	backend := logging.NewLogBackend(os.Stderr, "", 0)
50 | 	backendFormatter := logging.NewBackendFormatter(backend, format)
51 | 	logging.SetBackend(backendFormatter)
52 | }
53 | 
54 | func loadConfig() *configuration {
55 | 	logger.Info("Loading configurations from config.yml")
56 | 
57 | 	config := &configuration{
58 | 		Host:               "0.0.0.0",
59 | 		ExternalPort:       7000,
60 | 		InternalPort:       7001,
61 | 		JoinTimeout:        1000,
62 | 		SuspectTimeout:     5000,
63 | 		PingTimeout:        1500,
64 | 		PingRequestTimeout: 5000,
65 | 		MinProtocolPeriod:  200,
66 | 		PingRequestSize:    3,
67 | 		VirtualNodeSize:    5,
68 | 		KVSReplicaPoints:   3,
69 | 		BootstrapNodes:     []string{},
70 | 	}
71 | 
72 | 	data, err := ioutil.ReadFile("config.yml")
73 | 	if err != nil {
74 | 		logger.Warning("Cannot load config.yml")
75 | 	}
76 | 
77 | 	err = yaml.Unmarshal(data, config)
78 | 	if err != nil {
79 | 		logger.Error("Failed to unmarshal
config.yml") 80 | } 81 | 82 | for i, addr := range config.BootstrapNodes { 83 | if strings.HasPrefix(addr, ":") { 84 | config.BootstrapNodes[i] = util.GetLocalIP() + addr 85 | } 86 | } 87 | 88 | return config 89 | } 90 | -------------------------------------------------------------------------------- /membership/disseminator.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import "sync" 4 | 5 | const defaultPFactor int = 15 6 | 7 | type pChange struct { 8 | Change 9 | p int 10 | } 11 | 12 | type disseminator struct { 13 | node *Node 14 | changes map[string]*pChange 15 | 16 | maxP int 17 | pFactor int 18 | 19 | sync.RWMutex 20 | } 21 | 22 | func newDisseminator(n *Node) *disseminator { 23 | d := &disseminator{ 24 | node: n, 25 | changes: make(map[string]*pChange), 26 | maxP: defaultPFactor, 27 | pFactor: defaultPFactor, 28 | } 29 | 30 | return d 31 | } 32 | 33 | // MembershipAsChanges returns a Change array containing all the members 34 | // in memberlist of Node. 35 | func (d *disseminator) MembershipAsChanges() (changes []Change) { 36 | d.Lock() 37 | 38 | for _, member := range d.node.memberlist.Members() { 39 | changes = append(changes, Change{ 40 | Address: member.Address, 41 | Incarnation: member.Incarnation, 42 | Source: d.node.Address(), 43 | SourceIncarnation: d.node.Incarnation(), 44 | Status: member.Status, 45 | }) 46 | } 47 | 48 | d.Unlock() 49 | 50 | return changes 51 | } 52 | 53 | // IssueAsSender collects all changes a node needs when sending a ping or 54 | // ping-req. The second return value is a callback that raises the piggyback 55 | // counters of the given changes. 56 | func (d *disseminator) IssueAsSender() (changes []Change, bumpPiggybackCounters func()) { 57 | changes = d.issueChanges() 58 | return changes, func() { 59 | d.bumpPiggybackCounters(changes) 60 | } 61 | } 62 | 63 | // IssueAsReceiver collects all changes a node needs when responding to a ping 64 | // or ping-req. Unlike IssueAsSender, IssueAsReceiver automatically increments 65 | // the piggyback counters because it's difficult to find out whether a response 66 | // reaches the client. The second return value indicates whether a full sync 67 | // is triggered. 
68 | func (d *disseminator) IssueAsReceiver(senderAddress string, senderIncarnation int64, senderChecksum uint32) (changes []Change) { 69 | changes = d.filterChangesFromSender(d.issueChanges(), senderAddress, senderIncarnation) 70 | 71 | d.bumpPiggybackCounters(changes) 72 | 73 | if len(changes) > 0 || d.node.memberlist.Checksum() == senderChecksum { 74 | return changes 75 | } 76 | 77 | return d.MembershipAsChanges() 78 | } 79 | 80 | func (d *disseminator) filterChangesFromSender(cs []Change, source string, incarnation int64) []Change { 81 | for i := 0; i < len(cs); i++ { 82 | if incarnation == cs[i].SourceIncarnation && source == cs[i].Source { 83 | cs[i], cs[len(cs)-1] = cs[len(cs)-1], cs[i] 84 | cs = cs[:len(cs)-1] 85 | i-- 86 | } 87 | } 88 | return cs 89 | } 90 | 91 | func (d *disseminator) bumpPiggybackCounters(changes []Change) { 92 | d.Lock() 93 | for _, change := range changes { 94 | c, ok := d.changes[change.Address] 95 | if !ok { 96 | continue 97 | } 98 | 99 | c.p++ 100 | if c.p >= d.maxP { 101 | delete(d.changes, c.Address) 102 | } 103 | } 104 | d.Unlock() 105 | } 106 | 107 | func (d *disseminator) issueChanges() []Change { 108 | d.Lock() 109 | 110 | result := []Change{} 111 | for _, change := range d.changes { 112 | result = append(result, change.Change) 113 | } 114 | 115 | d.Unlock() 116 | 117 | return result 118 | } 119 | 120 | // RecordChange stores the Change into the disseminator. 121 | func (d *disseminator) RecordChange(change Change) { 122 | d.Lock() 123 | d.changes[change.Address] = &pChange{change, 0} 124 | d.Unlock() 125 | } 126 | 127 | // ClearChange removes the Change record of specific address. 128 | func (d *disseminator) ClearChange(address string) { 129 | d.Lock() 130 | delete(d.changes, address) 131 | d.Unlock() 132 | } 133 | 134 | // ClearChanges clears all the Changes from disseminator. 135 | func (d *disseminator) ClearChanges() { 136 | d.Lock() 137 | d.changes = make(map[string]*pChange) 138 | d.Unlock() 139 | } 140 | -------------------------------------------------------------------------------- /membership/gossip.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import ( 4 | "sync" 5 | "time" 6 | ) 7 | 8 | type gossip struct { 9 | node *Node 10 | 11 | status struct { 12 | stopped bool 13 | sync.RWMutex 14 | } 15 | 16 | minProtocolPeriod time.Duration 17 | 18 | protocol struct { 19 | numPeriods int 20 | lastPeriod time.Time 21 | lastRate time.Duration 22 | sync.RWMutex 23 | } 24 | } 25 | 26 | func newGossip(node *Node, minProtocolPeriod time.Duration) *gossip { 27 | gossip := &gossip{ 28 | node: node, 29 | minProtocolPeriod: minProtocolPeriod, 30 | } 31 | 32 | gossip.SetStopped(true) 33 | 34 | return gossip 35 | } 36 | 37 | // Stopped returns whether or not the gossip sub-protocol is stopped. 38 | func (g *gossip) Stopped() bool { 39 | g.status.RLock() 40 | stopped := g.status.stopped 41 | g.status.RUnlock() 42 | 43 | return stopped 44 | } 45 | 46 | // SetStopped sets the gossip sub-protocol to stopped or not stopped. 47 | func (g *gossip) SetStopped(stopped bool) { 48 | g.status.Lock() 49 | g.status.stopped = stopped 50 | g.status.Unlock() 51 | } 52 | 53 | // Start start the gossip protocol. 54 | func (g *gossip) Start() { 55 | if !g.Stopped() { 56 | return 57 | } 58 | 59 | g.SetStopped(false) 60 | g.RunProtocolPeriodLoop() 61 | 62 | logger.Notice("Gossip protocol started") 63 | } 64 | 65 | // Stop start the gossip protocol. 
66 | func (g *gossip) Stop() { 67 | if g.Stopped() { 68 | return 69 | } 70 | 71 | g.SetStopped(true) 72 | 73 | logger.Notice("Gossip protocol stopped") 74 | } 75 | 76 | // ProtocolPeriod run a gossip protocol period. 77 | func (g *gossip) ProtocolPeriod() { 78 | g.node.pingNextMember() 79 | } 80 | 81 | // RunProtocolPeriodLoop run the gossip protocol period loop. 82 | func (g *gossip) RunProtocolPeriodLoop() { 83 | go func() { 84 | for !g.Stopped() { 85 | delay := g.minProtocolPeriod 86 | g.ProtocolPeriod() 87 | time.Sleep(delay) 88 | } 89 | }() 90 | } 91 | -------------------------------------------------------------------------------- /membership/join.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import ( 4 | "errors" 5 | "time" 6 | ) 7 | 8 | func sendJoin(node *Node, target string, timeout time.Duration) (*JoinResponse, error) { 9 | if target == node.Address() { 10 | logger.Error("Cannot join local node") 11 | return nil, errors.New("cannot join local node") 12 | } 13 | 14 | req := &JoinRequest{ 15 | Source: node.address, 16 | Incarnation: node.Incarnation(), 17 | Timeout: timeout, 18 | } 19 | resp := &JoinResponse{} 20 | 21 | errCh := make(chan error, 1) 22 | go func() { 23 | client, err := node.memberlist.MemberClient(target) 24 | if err != nil { 25 | errCh <- err 26 | return 27 | } 28 | 29 | errCh <- client.Call("Protocol.Join", req, resp) 30 | }() 31 | 32 | var err error 33 | select { 34 | case err = <-errCh: 35 | case <-time.After(timeout): 36 | logger.Error("Join request timeout") 37 | err = errors.New("join timeout") 38 | } 39 | 40 | if err != nil { 41 | return nil, err 42 | } 43 | 44 | return resp, err 45 | } 46 | -------------------------------------------------------------------------------- /membership/member.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import ( 4 | "math/rand" 5 | "sync" 6 | ) 7 | 8 | const ( 9 | Alive = "alive" 10 | Suspect = "suspect" 11 | Faulty = "faulty" 12 | ) 13 | 14 | type Member struct { 15 | sync.RWMutex 16 | Address string 17 | Status string 18 | Incarnation int64 19 | } 20 | 21 | func shuffle(members []*Member) []*Member { 22 | newMembers := make([]*Member, len(members), cap(members)) 23 | newIndexes := rand.Perm(len(members)) 24 | 25 | for o, n := range newIndexes { 26 | newMembers[n] = members[o] 27 | } 28 | 29 | return newMembers 30 | } 31 | 32 | func (m *Member) nonLocalOverride(change Change) bool { 33 | if change.Incarnation > m.Incarnation { 34 | return true 35 | } 36 | 37 | if change.Incarnation < m.Incarnation { 38 | return false 39 | } 40 | 41 | return statePrecedence(change.Status) > statePrecedence(m.Status) 42 | } 43 | 44 | func (m *Member) localOverride(local string, change Change) bool { 45 | if m.Address != local { 46 | return false 47 | } 48 | return change.Status == Faulty || change.Status == Suspect 49 | } 50 | 51 | func statePrecedence(s string) int { 52 | switch s { 53 | case Alive: 54 | return 0 55 | case Suspect: 56 | return 1 57 | case Faulty: 58 | return 2 59 | default: 60 | return -1 61 | } 62 | } 63 | 64 | func (m *Member) isReachable() bool { 65 | return m.Status == Alive || m.Status == Suspect 66 | } 67 | 68 | type Change struct { 69 | Source string 70 | SourceIncarnation int64 71 | Address string 72 | Incarnation int64 73 | Status string 74 | } 75 | -------------------------------------------------------------------------------- /membership/memberlist.go: 
-------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import ( 4 | "bytes" 5 | "fmt" 6 | "math/rand" 7 | "net/rpc" 8 | "sort" 9 | "sync" 10 | "time" 11 | 12 | "github.com/dgryski/go-farm" 13 | ) 14 | 15 | type memberlist struct { 16 | node *Node 17 | local *Member 18 | 19 | members struct { 20 | list []*Member 21 | byAddress map[string]*Member 22 | rpcClients map[string]*rpc.Client 23 | checksum uint32 24 | sync.RWMutex 25 | } 26 | 27 | sync.Mutex 28 | } 29 | 30 | type memberlistIter struct { 31 | m *memberlist 32 | currentIndex int 33 | currentRound int 34 | } 35 | 36 | func newMemberlist(n *Node) *memberlist { 37 | m := &memberlist{ 38 | node: n, 39 | } 40 | 41 | m.members.byAddress = make(map[string]*Member) 42 | m.members.rpcClients = make(map[string]*rpc.Client) 43 | 44 | return m 45 | } 46 | 47 | func newMemberlistIter(m *memberlist) *memberlistIter { 48 | iter := &memberlistIter{ 49 | m: m, 50 | currentIndex: -1, 51 | currentRound: 0, 52 | } 53 | 54 | iter.m.Shuffle() 55 | 56 | return iter 57 | } 58 | 59 | // Next returns the next pingable member in the member list, if it 60 | // visits all members but none are pingable returns nil, false. 61 | func (i *memberlistIter) Next() (*Member, bool) { 62 | numOfMembers := i.m.NumMembers() 63 | visited := make(map[string]bool) 64 | 65 | for len(visited) < numOfMembers { 66 | i.currentIndex++ 67 | 68 | if i.currentIndex >= i.m.NumMembers() { 69 | i.currentIndex = 0 70 | i.currentRound++ 71 | i.m.Shuffle() 72 | } 73 | 74 | member := i.m.MemberAt(i.currentIndex) 75 | visited[member.Address] = true 76 | 77 | if i.m.Pingable(*member) { 78 | return member, true 79 | } 80 | } 81 | 82 | return nil, false 83 | } 84 | 85 | // Checksum returns the checksum of the memberlist. 86 | func (m *memberlist) Checksum() uint32 { 87 | m.members.Lock() 88 | checksum := m.members.checksum 89 | m.members.Unlock() 90 | 91 | return checksum 92 | } 93 | 94 | // ComputeChecksum computes the checksum of the memberlist. 95 | func (m *memberlist) ComputeChecksum() { 96 | m.members.Lock() 97 | checksum := farm.Fingerprint32([]byte(m.genChecksumString())) 98 | m.members.checksum = checksum 99 | m.members.Unlock() 100 | } 101 | 102 | func (m *memberlist) genChecksumString() string { 103 | var strings sort.StringSlice 104 | 105 | for _, member := range m.members.list { 106 | s := fmt.Sprintf("%s,%s,%v", member.Address, member.Status, member.Incarnation) 107 | strings = append(strings, s) 108 | } 109 | 110 | strings.Sort() 111 | 112 | buffer := bytes.NewBuffer([]byte{}) 113 | for _, str := range strings { 114 | buffer.WriteString(str) 115 | buffer.WriteString("|") 116 | } 117 | 118 | return buffer.String() 119 | } 120 | 121 | // Member returns the member at a specific address. 122 | func (m *memberlist) Member(address string) (*Member, bool) { 123 | m.members.RLock() 124 | member, ok := m.members.byAddress[address] 125 | m.members.RUnlock() 126 | 127 | return member, ok 128 | } 129 | 130 | // MemberClient returns the RPC client of the member at a specific address, 131 | // and it will dial to RPC server if client is not in rpcClients map. 
132 | func (m *memberlist) MemberClient(address string) (*rpc.Client, error) { 133 | m.members.Lock() 134 | client, ok := m.members.rpcClients[address] 135 | m.members.Unlock() 136 | 137 | if ok { 138 | return client, nil 139 | } 140 | 141 | logger.Debugf("Dialing to RPC server: %s", address) 142 | client, err := rpc.Dial("tcp", address) 143 | if err == nil { 144 | logger.Debugf("RPC connection established: %s", address) 145 | m.members.Lock() 146 | m.members.rpcClients[address] = client 147 | m.members.Unlock() 148 | } else { 149 | logger.Debugf("Cannot connect to RPC server: %s", address) 150 | } 151 | 152 | return client, err 153 | } 154 | 155 | // CloseMemberClient removes the client instance of the member at a specific address. 156 | func (m *memberlist) CloseMemberClient(address string) { 157 | m.members.Lock() 158 | delete(m.members.rpcClients, address) 159 | m.members.Unlock() 160 | } 161 | 162 | // MemberAt returns the i-th member in the list. 163 | func (m *memberlist) MemberAt(i int) *Member { 164 | m.members.RLock() 165 | member := m.members.list[i] 166 | m.members.RUnlock() 167 | 168 | return member 169 | } 170 | 171 | // NumMembers returns the number of members in the memberlist. 172 | func (m *memberlist) NumMembers() int { 173 | m.members.RLock() 174 | n := len(m.members.list) 175 | m.members.RUnlock() 176 | 177 | return n 178 | } 179 | 180 | // NumPingableMembers returns the number of pingable members in the memberlist. 181 | func (m *memberlist) NumPingableMembers() (n int) { 182 | m.members.Lock() 183 | for _, member := range m.members.list { 184 | if m.Pingable(*member) { 185 | n++ 186 | } 187 | } 188 | m.members.Unlock() 189 | 190 | return 191 | } 192 | 193 | // Members returns all the members in the memberlist. 194 | func (m *memberlist) Members() (members []Member) { 195 | m.members.RLock() 196 | for _, member := range m.members.list { 197 | members = append(members, *member) 198 | } 199 | m.members.RUnlock() 200 | 201 | return 202 | } 203 | 204 | // Pingable returns whether or not a member is pingable. 205 | func (m *memberlist) Pingable(member Member) bool { 206 | return member.Address != m.local.Address && member.isReachable() 207 | } 208 | 209 | // RandomPingableMembers returns the number of pingable members in the memberlist. 210 | func (m *memberlist) RandomPingableMembers(n int, excluding map[string]bool) []*Member { 211 | var members []*Member 212 | 213 | m.members.RLock() 214 | for _, member := range m.members.list { 215 | if m.Pingable(*member) && !excluding[member.Address] { 216 | members = append(members, member) 217 | } 218 | } 219 | m.members.RUnlock() 220 | 221 | members = shuffle(members) 222 | 223 | if n > len(members) { 224 | return members 225 | } 226 | return members[:n] 227 | } 228 | 229 | // Reincarnate sets the status of the node to Alive and updates the incarnation 230 | // number. It adds the change to the disseminator as well. 231 | func (m *memberlist) Reincarnate() []Change { 232 | return m.MarkAlive(m.node.Address(), time.Now().Unix()) 233 | } 234 | 235 | // MarkAlive sets the status of the node at specific address to Alive and 236 | // updates the incarnation number. It adds the change to the disseminator as well. 237 | func (m *memberlist) MarkAlive(address string, incarnation int64) []Change { 238 | return m.MakeChange(address, incarnation, Alive) 239 | } 240 | 241 | // MarkSuspect sets the status of the node at specific address to Suspect and 242 | // updates the incarnation number. It adds the change to the disseminator as well. 
243 | func (m *memberlist) MarkSuspect(address string, incarnation int64) []Change { 244 | return m.MakeChange(address, incarnation, Suspect) 245 | } 246 | 247 | // MarkFaulty sets the status of the node at specific address to Faulty and 248 | // updates the incarnation number. It adds the change to the disseminator as well. 249 | func (m *memberlist) MarkFaulty(address string, incarnation int64) []Change { 250 | return m.MakeChange(address, incarnation, Faulty) 251 | } 252 | 253 | // MakeChange makes a change to the memberlist. 254 | func (m *memberlist) MakeChange(address string, incarnation int64, status string) []Change { 255 | if m.local == nil { 256 | m.local = &Member{ 257 | Address: m.node.Address(), 258 | Incarnation: 0, 259 | Status: Alive, 260 | } 261 | } 262 | 263 | changes := m.Update([]Change{Change{ 264 | Source: m.local.Address, 265 | SourceIncarnation: m.local.Incarnation, 266 | Address: address, 267 | Incarnation: incarnation, 268 | Status: status, 269 | }}) 270 | 271 | return changes 272 | } 273 | 274 | // Update updates the memberlist with the slice of changes, applying selectively. 275 | func (m *memberlist) Update(changes []Change) (applied []Change) { 276 | if m.node.Stopped() || len(changes) == 0 { 277 | return nil 278 | } 279 | 280 | m.Lock() 281 | m.members.Lock() 282 | 283 | for _, change := range changes { 284 | member, ok := m.members.byAddress[change.Address] 285 | 286 | if !ok { 287 | if m.applyChange(change) { 288 | applied = append(applied, change) 289 | } 290 | continue 291 | } 292 | 293 | if member.localOverride(m.node.Address(), change) { 294 | overrideChange := Change{ 295 | Source: change.Source, 296 | SourceIncarnation: change.SourceIncarnation, 297 | Address: change.Address, 298 | Incarnation: time.Now().Unix(), 299 | Status: Alive, 300 | } 301 | 302 | if m.applyChange(overrideChange) { 303 | applied = append(applied, overrideChange) 304 | } 305 | 306 | continue 307 | } 308 | 309 | if member.nonLocalOverride(change) { 310 | if m.applyChange(change) { 311 | applied = append(applied, change) 312 | } 313 | } 314 | } 315 | 316 | m.members.Unlock() 317 | 318 | if len(applied) > 0 { 319 | m.ComputeChecksum() 320 | m.node.handleChanges(applied) 321 | m.node.swimring.HandleChanges(applied) 322 | } 323 | 324 | m.Unlock() 325 | return applied 326 | } 327 | 328 | // AddJoinList adds the list to the membership with the Update function. 329 | // However, as a side effect, Update adds changes to the disseminator as well. 330 | // Since we don't want to disseminate the potentially very large join lists, 331 | // we clear all the changes from the disseminator, except for the one change 332 | // that refers to the make-alive of this node. 
333 | func (m *memberlist) AddJoinList(list []Change) { 334 | applied := m.Update(list) 335 | for _, member := range applied { 336 | if member.Address == m.node.Address() { 337 | continue 338 | } 339 | m.node.disseminator.ClearChange(member.Address) 340 | } 341 | } 342 | 343 | func (m *memberlist) getJoinPosition() int { 344 | l := len(m.members.list) 345 | if l == 0 { 346 | return l 347 | } 348 | return rand.Intn(l) 349 | } 350 | 351 | func (m *memberlist) applyChange(change Change) bool { 352 | member, ok := m.members.byAddress[change.Address] 353 | 354 | if !ok { 355 | member = &Member{ 356 | Address: change.Address, 357 | Status: change.Status, 358 | Incarnation: change.Incarnation, 359 | } 360 | 361 | if member.Address == m.node.Address() { 362 | m.local = member 363 | } 364 | 365 | m.members.byAddress[change.Address] = member 366 | i := m.getJoinPosition() 367 | m.members.list = append(m.members.list[:i], append([]*Member{member}, m.members.list[i:]...)...) 368 | 369 | logger.Noticef("Server %s added to memberlist", member.Address) 370 | } 371 | 372 | member.Lock() 373 | member.Status = change.Status 374 | member.Incarnation = change.Incarnation 375 | member.Unlock() 376 | 377 | logger.Noticef("%s is marked as %s node", member.Address, change.Status) 378 | 379 | return true 380 | } 381 | 382 | // Shuffle shuffles the memberlist. 383 | func (m *memberlist) Shuffle() { 384 | m.members.Lock() 385 | m.members.list = shuffle(m.members.list) 386 | m.members.Unlock() 387 | } 388 | -------------------------------------------------------------------------------- /membership/node.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import ( 4 | "errors" 5 | "net/rpc" 6 | "sync" 7 | "time" 8 | 9 | "github.com/hungys/swimring/util" 10 | "github.com/op/go-logging" 11 | ) 12 | 13 | var logger = logging.MustGetLogger("membership") 14 | 15 | var ( 16 | // ErrNodeNotReady is returned when a remote request is being handled while the node is not yet ready 17 | ErrNodeNotReady = errors.New("node is not ready to handle requests") 18 | ) 19 | 20 | type changeHandler interface { 21 | HandleChanges(changes []Change) 22 | } 23 | 24 | // Options is a configuration struct passed into NewNode constructor. 
25 | type Options struct { 26 | JoinTimeout, SuspectTimeout, PingTimeout, PingRequestTimeout, MinProtocolPeriod time.Duration 27 | 28 | PingRequestSize int 29 | BootstrapNodes []string 30 | } 31 | 32 | func defaultOptions() *Options { 33 | opts := &Options{ 34 | JoinTimeout: 1000 * time.Millisecond, 35 | SuspectTimeout: 5 * time.Second, 36 | PingTimeout: 1500 * time.Millisecond, 37 | PingRequestTimeout: 5000 * time.Millisecond, 38 | MinProtocolPeriod: 200 * time.Millisecond, 39 | PingRequestSize: 3, 40 | } 41 | 42 | return opts 43 | } 44 | 45 | func mergeDefaultOptions(opts *Options) *Options { 46 | def := defaultOptions() 47 | 48 | if opts == nil { 49 | return def 50 | } 51 | 52 | opts.JoinTimeout = util.SelectDurationOpt(opts.JoinTimeout, def.JoinTimeout) 53 | opts.SuspectTimeout = util.SelectDurationOpt(opts.SuspectTimeout, def.SuspectTimeout) 54 | opts.PingTimeout = util.SelectDurationOpt(opts.PingTimeout, def.PingTimeout) 55 | opts.PingRequestTimeout = util.SelectDurationOpt(opts.PingRequestTimeout, def.PingRequestTimeout) 56 | opts.MinProtocolPeriod = util.SelectDurationOpt(opts.MinProtocolPeriod, def.MinProtocolPeriod) 57 | opts.PingRequestSize = util.SelectIntOpt(opts.PingRequestSize, def.PingRequestSize) 58 | 59 | return opts 60 | } 61 | 62 | // Node is a SWIM member. 63 | type Node struct { 64 | address string 65 | 66 | status struct { 67 | stopped, destroyed, pinging, ready bool 68 | sync.RWMutex 69 | } 70 | 71 | swimring changeHandler 72 | memberlist *memberlist 73 | memberiter *memberlistIter 74 | disseminator *disseminator 75 | stateTransitions *stateTransitions 76 | gossip *gossip 77 | protocolHandlers *ProtocolHandlers 78 | 79 | joinTimeout, suspectTimeout, pingTimeout, pingRequestTimeout time.Duration 80 | 81 | pingRequestSize int 82 | bootstrapNodes []string 83 | } 84 | 85 | // NewNode returns a new SWIM node. 86 | func NewNode(swimring changeHandler, address string, opts *Options) *Node { 87 | opts = mergeDefaultOptions(opts) 88 | 89 | node := &Node{ 90 | address: address, 91 | } 92 | 93 | node.swimring = swimring 94 | node.memberlist = newMemberlist(node) 95 | node.memberiter = newMemberlistIter(node.memberlist) 96 | node.disseminator = newDisseminator(node) 97 | node.stateTransitions = newStateTransitions(node) 98 | node.gossip = newGossip(node, opts.MinProtocolPeriod) 99 | node.protocolHandlers = NewProtocolHandler(node) 100 | 101 | node.joinTimeout = opts.JoinTimeout 102 | node.suspectTimeout = opts.SuspectTimeout 103 | node.pingTimeout = opts.PingTimeout 104 | node.pingRequestTimeout = opts.PingRequestTimeout 105 | node.pingRequestSize = opts.PingRequestSize 106 | node.bootstrapNodes = opts.BootstrapNodes 107 | 108 | return node 109 | } 110 | 111 | // Address returns the address of the SWIM node. 112 | func (n *Node) Address() string { 113 | return n.address 114 | } 115 | 116 | // Members returns all the members in Node's memberlist. 117 | func (n *Node) Members() []Member { 118 | return n.memberlist.Members() 119 | } 120 | 121 | // MemberClient returns the RPC client of the member at a specific address, 122 | // and it will dial to RPC server if client is not in rpcClients map. 
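//
// A simplified usage sketch (the address is a placeholder; the real ping path
// in ping.go also piggybacks membership changes and a checksum):
//
//     client, err := node.MemberClient("10.0.0.2:7001")
//     if err == nil {
//         resp := &Ping{}
//         err = client.Call("Protocol.Ping", &Ping{Source: node.Address()}, resp)
//     }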
123 | func (n *Node) MemberClient(address string) (*rpc.Client, error) { 124 | return n.memberlist.MemberClient(address) 125 | } 126 | 127 | // MemberReachable returns whether or not the member is reachable 128 | func (n *Node) MemberReachable(address string) bool { 129 | member, ok := n.memberlist.Member(address) 130 | if !ok { 131 | return false 132 | } 133 | 134 | return member.isReachable() 135 | } 136 | 137 | // Start starts the SWIM protocol and all sub-protocols. 138 | func (n *Node) Start() { 139 | n.gossip.Start() 140 | n.stateTransitions.Enable() 141 | 142 | n.status.Lock() 143 | n.status.stopped = false 144 | n.status.Unlock() 145 | 146 | logger.Noticef("Local node %s started", n.Address()) 147 | } 148 | 149 | // Stop stops the SWIM protocol and all sub-protocols. 150 | func (n *Node) Stop() { 151 | n.gossip.Stop() 152 | n.stateTransitions.Disable() 153 | 154 | n.status.Lock() 155 | n.status.stopped = true 156 | n.status.Unlock() 157 | 158 | logger.Noticef("Local node %s stopped", n.Address()) 159 | } 160 | 161 | // Stopped returns whether or not the SWIM protocol is currently stopped. 162 | func (n *Node) Stopped() bool { 163 | n.status.RLock() 164 | stopped := n.status.stopped 165 | n.status.RUnlock() 166 | 167 | return stopped 168 | } 169 | 170 | // Destroy stops the SWIM protocol and all sub-protocols. 171 | func (n *Node) Destroy() { 172 | n.status.Lock() 173 | if n.status.destroyed { 174 | n.status.Unlock() 175 | return 176 | } 177 | n.status.destroyed = true 178 | n.status.Unlock() 179 | 180 | n.Stop() 181 | 182 | logger.Noticef("Local node %s destroyed", n.Address()) 183 | } 184 | 185 | // Destroyed returns whether or not the node has been destroyed. 186 | func (n *Node) Destroyed() bool { 187 | n.status.RLock() 188 | destroyed := n.status.destroyed 189 | n.status.RUnlock() 190 | 191 | return destroyed 192 | } 193 | 194 | // Ready returns whether or not the node has bootstrapped and is ready for use. 195 | func (n *Node) Ready() bool { 196 | n.status.RLock() 197 | ready := n.status.ready 198 | n.status.RUnlock() 199 | 200 | return ready 201 | } 202 | 203 | // Incarnation returns the incarnation number of the Node. 204 | func (n *Node) Incarnation() int64 { 205 | if n.memberlist != nil && n.memberlist.local != nil { 206 | n.memberlist.local.RLock() 207 | incarnation := n.memberlist.local.Incarnation 208 | n.memberlist.local.RUnlock() 209 | return incarnation 210 | } 211 | return -1 212 | } 213 | 214 | // Bootstrap joins the Node to a cluster. 215 | func (n *Node) Bootstrap() ([]string, error) { 216 | logger.Notice("Bootstrapping local node...") 217 | 218 | n.memberlist.Reincarnate() 219 | nodesJoined := n.joinCluster() 220 | n.gossip.Start() 221 | 222 | n.status.Lock() 223 | n.status.ready = true 224 | n.status.Unlock() 225 | 226 | return nodesJoined, nil 227 | } 228 | 229 | // RegisterRPCHandlers registers the RPC handlers for internal SWIM protocol. 
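//
// This is how registerInternalRPCHandlers in swimring.go wires the node into
// the internal RPC server; a condensed sketch, assuming listener is an already
// bound net.Listener:
//
//     server := rpc.NewServer()
//     node.RegisterRPCHandlers(server)
//     go server.Accept(listener)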
230 | func (n *Node) RegisterRPCHandlers(server *rpc.Server) error { 231 | server.RegisterName("Protocol", n.protocolHandlers) 232 | logger.Info("SWIM protocol RPC handlers registered") 233 | return nil 234 | } 235 | 236 | func (n *Node) handleChanges(changes []Change) { 237 | for _, change := range changes { 238 | n.disseminator.RecordChange(change) 239 | 240 | switch change.Status { 241 | case Alive: 242 | n.stateTransitions.Cancel(change) 243 | case Suspect: 244 | n.stateTransitions.ScheduleSuspectToFaulty(change) 245 | } 246 | } 247 | } 248 | 249 | func (n *Node) pinging() bool { 250 | n.status.RLock() 251 | pinging := n.status.pinging 252 | n.status.RUnlock() 253 | 254 | return pinging 255 | } 256 | 257 | func (n *Node) setPinging(pinging bool) { 258 | n.status.Lock() 259 | n.status.pinging = pinging 260 | n.status.Unlock() 261 | } 262 | 263 | func (n *Node) pingNextMember() { 264 | if n.pinging() { 265 | return 266 | } 267 | 268 | member, ok := n.memberiter.Next() 269 | if !ok { 270 | return 271 | } 272 | 273 | n.setPinging(true) 274 | defer n.setPinging(false) 275 | 276 | res, err := sendDirectPing(n, member.Address, n.pingTimeout) 277 | if err == nil { 278 | n.memberlist.Update(res.Changes) 279 | return 280 | } 281 | 282 | n.memberlist.CloseMemberClient(member.Address) 283 | targetReached, _ := sendIndirectPing(n, member.Address, n.pingRequestSize, n.pingRequestTimeout) 284 | 285 | if !targetReached { 286 | if member.Status != Suspect { 287 | logger.Errorf("Cannot reach %s, mark it suspect", member.Address) 288 | } 289 | n.memberlist.MarkSuspect(member.Address, member.Incarnation) 290 | return 291 | } 292 | } 293 | 294 | func (n *Node) joinCluster() []string { 295 | var nodesJoined []string 296 | var wg sync.WaitGroup 297 | 298 | logger.Infof("Trying to join the cluster...") 299 | for _, target := range n.bootstrapNodes { 300 | wg.Add(1) 301 | 302 | go func(target string) { 303 | defer wg.Done() 304 | res, err := sendJoin(n, target, n.joinTimeout) 305 | 306 | if err != nil { 307 | return 308 | } 309 | 310 | logger.Noticef("Join %s successfully, %d peers found", target, len(res.Membership)) 311 | n.memberlist.AddJoinList(res.Membership) 312 | nodesJoined = append(nodesJoined, target) 313 | }(target) 314 | } 315 | 316 | wg.Wait() 317 | 318 | return nodesJoined 319 | } 320 | -------------------------------------------------------------------------------- /membership/ping.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import ( 4 | "errors" 5 | "sync" 6 | "time" 7 | ) 8 | 9 | func sendDirectPing(node *Node, target string, timeout time.Duration) (*Ping, error) { 10 | changes, bumpPiggybackCounters := node.disseminator.IssueAsSender() 11 | 12 | res, err := sendPingWithChanges(node, target, changes, timeout) 13 | if err != nil { 14 | return res, err 15 | } 16 | 17 | bumpPiggybackCounters() 18 | 19 | return res, err 20 | } 21 | 22 | func sendPingWithChanges(node *Node, target string, changes []Change, timeout time.Duration) (*Ping, error) { 23 | req := &Ping{ 24 | Checksum: node.memberlist.Checksum(), 25 | Changes: changes, 26 | Source: node.Address(), 27 | SourceIncarnation: node.Incarnation(), 28 | } 29 | 30 | errCh := make(chan error, 1) 31 | resp := &Ping{} 32 | go func() { 33 | client, err := node.memberlist.MemberClient(target) 34 | if err != nil { 35 | errCh <- err 36 | return 37 | } 38 | 39 | if client != nil { 40 | errCh <- client.Call("Protocol.Ping", req, resp) 41 | } 42 | }() 43 | 44 | var err error 45 | select 
{ 46 | case err = <-errCh: 47 | case <-time.After(timeout): 48 | logger.Warningf("Ping to %s timeout", target) 49 | err = errors.New("ping timeout") 50 | } 51 | 52 | if err != nil { 53 | return nil, err 54 | } 55 | 56 | return resp, err 57 | } 58 | 59 | func sendIndirectPing(node *Node, target string, amount int, timeout time.Duration) (reached bool, errs []error) { 60 | resCh := sendPingRequests(node, target, amount, timeout) 61 | 62 | for result := range resCh { 63 | switch res := result.(type) { 64 | case *PingResponse: 65 | if res.Ok { 66 | return true, errs 67 | } 68 | case error: 69 | errs = append(errs, res) 70 | } 71 | } 72 | 73 | return false, errs 74 | } 75 | 76 | func sendPingRequests(node *Node, target string, amount int, timeout time.Duration) <-chan interface{} { 77 | peers := node.memberlist.RandomPingableMembers(amount, map[string]bool{target: true}) 78 | 79 | var wg sync.WaitGroup 80 | resCh := make(chan interface{}, amount) 81 | 82 | for _, peer := range peers { 83 | wg.Add(1) 84 | 85 | go func(peer Member) { 86 | defer wg.Done() 87 | 88 | res, err := sendPingRequest(node, peer.Address, target, timeout) 89 | if err != nil { 90 | resCh <- err 91 | return 92 | } 93 | 94 | resCh <- res 95 | }(*peer) 96 | } 97 | 98 | go func() { 99 | wg.Wait() 100 | close(resCh) 101 | }() 102 | 103 | return resCh 104 | } 105 | 106 | func sendPingRequest(node *Node, peer string, target string, timeout time.Duration) (*PingResponse, error) { 107 | changes, bumpPiggybackCounters := node.disseminator.IssueAsSender() 108 | req := &PingRequest{ 109 | Source: node.Address(), 110 | SourceIncarnation: node.Incarnation(), 111 | Checksum: node.memberlist.Checksum(), 112 | Changes: changes, 113 | Target: target, 114 | } 115 | 116 | errCh := make(chan error, 1) 117 | resp := &PingResponse{} 118 | go func() { 119 | client, err := node.memberlist.MemberClient(peer) 120 | if err != nil { 121 | errCh <- err 122 | return 123 | } 124 | 125 | if client != nil { 126 | err = client.Call("Protocol.PingRequest", req, resp) 127 | if err != nil { 128 | errCh <- err 129 | return 130 | } 131 | } 132 | 133 | bumpPiggybackCounters() 134 | errCh <- nil 135 | }() 136 | 137 | var err error 138 | select { 139 | case err = <-errCh: 140 | if err == nil { 141 | node.memberlist.Update(resp.Changes) 142 | } 143 | return resp, err 144 | case <-time.After(timeout): 145 | logger.Warningf("Ping request to %s timeout", target) 146 | return nil, errors.New("ping request timeout") 147 | } 148 | } 149 | -------------------------------------------------------------------------------- /membership/protocol_handlers.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import "time" 4 | 5 | // ProtocolHandlers defines a set of RPC handlers for SWIM gossip protocol. 6 | type ProtocolHandlers struct { 7 | node *Node 8 | } 9 | 10 | // Ping is the payload of ping and ping response. 11 | type Ping struct { 12 | Changes []Change 13 | Checksum uint32 14 | Source string 15 | SourceIncarnation int64 16 | } 17 | 18 | // PingRequest is the payload of ping request. 19 | type PingRequest struct { 20 | Source string 21 | SourceIncarnation int64 22 | Target string 23 | Checksum uint32 24 | Changes []Change 25 | } 26 | 27 | // PingResponse is the payload of the response of ping request. 28 | type PingResponse struct { 29 | Ok bool 30 | Target string 31 | Changes []Change 32 | } 33 | 34 | // JoinRequest is the payload of join request. 
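//
// A hedged sketch of how a joining node might issue this request over the
// "Protocol" RPC service (this approximates what sendJoin does; client and
// timeout are assumed to already exist):
//
//     req := &JoinRequest{Source: node.Address(), Incarnation: node.Incarnation(), Timeout: timeout}
//     resp := &JoinResponse{}
//     err := client.Call("Protocol.Join", req, resp)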
35 | type JoinRequest struct { 36 | Source string 37 | Incarnation int64 38 | Timeout time.Duration 39 | } 40 | 41 | // JoinResponse is the payload of the response of join request. 42 | type JoinResponse struct { 43 | Coordinator string 44 | Membership []Change 45 | Checksum uint32 46 | } 47 | 48 | // NewProtocolHandler returns a new ProtocolHandlers. 49 | func NewProtocolHandler(n *Node) *ProtocolHandlers { 50 | p := &ProtocolHandlers{ 51 | node: n, 52 | } 53 | 54 | return p 55 | } 56 | 57 | // Ping handles the incoming Ping. 58 | func (p *ProtocolHandlers) Ping(req *Ping, resp *Ping) error { 59 | if !p.node.Ready() { 60 | return ErrNodeNotReady 61 | } 62 | 63 | p.node.memberlist.Update(req.Changes) 64 | 65 | changes := p.node.disseminator.IssueAsReceiver(req.Source, req.SourceIncarnation, req.Checksum) 66 | 67 | resp.Checksum = p.node.memberlist.Checksum() 68 | resp.Changes = changes 69 | resp.Source = p.node.Address() 70 | resp.SourceIncarnation = p.node.Incarnation() 71 | 72 | return nil 73 | } 74 | 75 | // PingRequest handles the incoming PingRequest. It helps the source node to send 76 | // Ping to the target. 77 | func (p *ProtocolHandlers) PingRequest(req *PingRequest, resp *PingResponse) error { 78 | if !p.node.Ready() { 79 | return ErrNodeNotReady 80 | } 81 | 82 | p.node.memberlist.Update(req.Changes) 83 | 84 | logger.Infof("Handling ping request to %s (from %s)", req.Target, req.Source) 85 | 86 | res, err := sendDirectPing(p.node, req.Target, p.node.pingTimeout) 87 | pingOk := err == nil 88 | 89 | if pingOk { 90 | p.node.memberlist.Update(res.Changes) 91 | } 92 | 93 | changes := p.node.disseminator.IssueAsReceiver(req.Source, req.SourceIncarnation, req.Checksum) 94 | 95 | resp.Target = req.Target 96 | resp.Ok = pingOk 97 | resp.Changes = changes 98 | 99 | return nil 100 | } 101 | 102 | // Join handles the incoming Join request. 103 | func (p *ProtocolHandlers) Join(req *JoinRequest, resp *JoinResponse) error { 104 | logger.Infof("Handling join request from %s", req.Source) 105 | 106 | resp.Coordinator = p.node.Address() 107 | resp.Membership = p.node.disseminator.MembershipAsChanges() 108 | resp.Checksum = p.node.memberlist.Checksum() 109 | 110 | return nil 111 | } 112 | -------------------------------------------------------------------------------- /membership/state_transitions.go: -------------------------------------------------------------------------------- 1 | package membership 2 | 3 | import ( 4 | "sync" 5 | "time" 6 | ) 7 | 8 | type transitionTimer struct { 9 | *time.Timer 10 | state string 11 | } 12 | 13 | type stateTransitions struct { 14 | sync.Mutex 15 | 16 | node *Node 17 | timers map[string]*transitionTimer 18 | enabled bool 19 | } 20 | 21 | func newStateTransitions(n *Node) *stateTransitions { 22 | return &stateTransitions{ 23 | node: n, 24 | timers: make(map[string]*transitionTimer), 25 | enabled: true, 26 | } 27 | } 28 | 29 | // ScheduleSuspectToFaulty starts the suspect timer. After the Suspect timeout, 30 | // the node will be marked as a faulty node. 
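//
// This is driven by Node.handleChanges: a Suspect change schedules the timer,
// and a later Alive change for the same address cancels it before it fires,
// for example:
//
//     node.stateTransitions.ScheduleSuspectToFaulty(change) // change.Status == Suspect
//     node.stateTransitions.Cancel(change)                  // change.Status == Alive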
31 | func (s *stateTransitions) ScheduleSuspectToFaulty(change Change) { 32 | s.Lock() 33 | s.schedule(change, Suspect, s.node.suspectTimeout, func() { 34 | logger.Warningf("Suspect timer expired, mark %s as faulty node", change.Address) 35 | s.node.memberlist.MarkFaulty(change.Address, change.Incarnation) 36 | }) 37 | logger.Infof("Suspect timer for %s scheduled", change.Address) 38 | s.Unlock() 39 | } 40 | 41 | func (s *stateTransitions) schedule(change Change, state string, timeout time.Duration, transition func()) { 42 | if !s.enabled { 43 | return 44 | } 45 | 46 | if s.node.Address() == change.Address { 47 | return 48 | } 49 | 50 | if timer, ok := s.timers[change.Address]; ok { 51 | if timer.state == state { 52 | return 53 | } 54 | timer.Stop() 55 | } 56 | 57 | timer := time.AfterFunc(timeout, func() { 58 | transition() 59 | }) 60 | 61 | s.timers[change.Address] = &transitionTimer{ 62 | Timer: timer, 63 | state: state, 64 | } 65 | } 66 | 67 | // Cancel cancels the scheduled transition for the change. 68 | func (s *stateTransitions) Cancel(change Change) { 69 | s.Lock() 70 | 71 | if timer, ok := s.timers[change.Address]; ok { 72 | timer.Stop() 73 | delete(s.timers, change.Address) 74 | } 75 | 76 | s.Unlock() 77 | } 78 | 79 | // Enable enables state transition controller. 80 | func (s *stateTransitions) Enable() { 81 | s.Lock() 82 | s.enabled = true 83 | s.Unlock() 84 | 85 | logger.Notice("State transitions enabled") 86 | } 87 | 88 | // Disable cancels all scheduled state transitions and disables the state transition controller. 89 | func (s *stateTransitions) Disable() { 90 | s.Lock() 91 | 92 | s.enabled = false 93 | for address, timer := range s.timers { 94 | timer.Stop() 95 | delete(s.timers, address) 96 | } 97 | 98 | s.Unlock() 99 | 100 | logger.Notice("State transitions disabled") 101 | } 102 | -------------------------------------------------------------------------------- /request_coordinator.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "errors" 5 | "math" 6 | "sync" 7 | "time" 8 | 9 | "github.com/hungys/swimring/membership" 10 | "github.com/hungys/swimring/storage" 11 | ) 12 | 13 | const ( 14 | // ONE is the weakest consistency level. 15 | // For read request, returns value when the first response arrived. 16 | // For write request, returns when the first ACK received. 17 | ONE = "ONE" 18 | // QUORUM is the moderate consistency level. 19 | // For read request, returns value when the quorum set of replicas all responded. 20 | // For write request, returns when the quorum set of replicas all responded ACKs. 21 | QUORUM = "QUORUM" 22 | // ALL is the strongest consistency level. 23 | // For read request, returns value when all replicas responded. 24 | // For write request, returns when all replicas all responded ACKs. 25 | ALL = "ALL" 26 | // GetOp is the name of the service method for Get. 27 | GetOp = "KVS.Get" 28 | // PutOp is the name of the service method for Put. 29 | PutOp = "KVS.Put" 30 | // DeleteOp is the name of the service method for Delete. 31 | DeleteOp = "KVS.Delete" 32 | // StatOp is the name of the service method for Stat. 33 | StatOp = "KVS.Stat" 34 | ) 35 | 36 | // RequestCoordinator is the coordinator for all the incoming external request. 37 | type RequestCoordinator struct { 38 | sr *SwimRing 39 | } 40 | 41 | // GetRequest is the payload of Get. 42 | type GetRequest struct { 43 | Level string 44 | Key string 45 | } 46 | 47 | // GetResponse is the payload of the response of Get. 
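//
// External clients reach the coordinator through the "SwimRing" RPC service
// registered in registerExternalRPCHandlers; a hedged client-side sketch (the
// address is a placeholder, and a real client declares matching request and
// response types in its own package):
//
//     client, _ := rpc.Dial("tcp", "10.0.0.1:7000")
//     resp := &GetResponse{}
//     err := client.Call("SwimRing.Get", &GetRequest{Level: "QUORUM", Key: "foo"}, resp)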
48 | type GetResponse struct { 49 | Key, Value string 50 | } 51 | 52 | // PutRequest is the payload of Put. 53 | type PutRequest struct { 54 | Level string 55 | Key, Value string 56 | } 57 | 58 | // PutResponse is the payload of the response of Put. 59 | type PutResponse struct{} 60 | 61 | // DeleteRequest is the payload of Delete. 62 | type DeleteRequest struct { 63 | Level string 64 | Key string 65 | } 66 | 67 | // DeleteResponse is the payload of the response of Delete. 68 | type DeleteResponse struct{} 69 | 70 | // StateRequest is the payload of Stat. 71 | type StateRequest struct{} 72 | 73 | // StateResponse is the payload of the response of Stat. 74 | type StateResponse struct { 75 | Nodes []NodeStat 76 | } 77 | 78 | // NodeStat stores the information of a Node 79 | type NodeStat struct { 80 | Address string 81 | Status string 82 | KeyCount int 83 | } 84 | 85 | // NewRequestCoordinator returns a new RequestCoordinator. 86 | func NewRequestCoordinator(sr *SwimRing) *RequestCoordinator { 87 | rc := &RequestCoordinator{ 88 | sr: sr, 89 | } 90 | 91 | return rc 92 | } 93 | 94 | // Get handles the incoming Get request. It first looks up for the owner replicas 95 | // of the given key, forwards request to all replicas and deals with them according to 96 | // consistency level. Read repair is initiated if necessary. 97 | func (rc *RequestCoordinator) Get(req *GetRequest, resp *GetResponse) error { 98 | logger.Debugf("Coordinating external request Get(%s, %s)", req.Key, req.Level) 99 | 100 | internalReq := &storage.GetRequest{ 101 | Key: req.Key, 102 | } 103 | 104 | replicas := rc.sr.ring.LookupN(req.Key, rc.sr.config.KVSReplicaPoints) 105 | resCh := rc.sendRPCRequests(replicas, GetOp, internalReq) 106 | resp.Key = req.Key 107 | 108 | ackNeed := rc.numOfRequiredACK(req.Level) 109 | ackReceived := 0 110 | ackOk := 0 111 | latestTimestamp := int64(0) 112 | latestValue := "" 113 | 114 | var resList []*storage.GetResponse 115 | 116 | for result := range resCh { 117 | switch res := result.(type) { 118 | case *storage.GetResponse: 119 | resList = append(resList, res) 120 | 121 | ackReceived++ 122 | if res.Ok { 123 | ackOk++ 124 | } 125 | 126 | if res.Ok && res.Value.Timestamp > latestTimestamp { 127 | latestTimestamp = res.Value.Timestamp 128 | latestValue = res.Value.Value 129 | } 130 | 131 | if ackReceived >= ackNeed { 132 | go rc.readRepair(resList, req.Key, latestValue, latestTimestamp, ackOk, resCh) 133 | 134 | if ackOk == 0 { 135 | logger.Debugf("No ACK with Ok received for Get(%s): %s", req.Key, res.Message) 136 | return errors.New(res.Message) 137 | } 138 | 139 | resp.Value = latestValue 140 | return nil 141 | } 142 | case error: 143 | continue 144 | } 145 | } 146 | 147 | logger.Errorf("Cannot reach consistency requirements for Get(%s, %s)", req.Key, req.Level) 148 | return errors.New("cannot reach consistency level") 149 | } 150 | 151 | // Put handles the incoming Put request. It first looks up for the owner replicas 152 | // of the given key, forwards request to all replicas and deals with them according to 153 | // consistency level. 
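//
// A worked example of the ACK accounting, assuming KVSReplicaPoints is 3:
// ONE needs 1 response, QUORUM needs int(math.Floor(3.0/2))+1 = 2, and ALL
// needs 3; the call returns as soon as that many replica responses arrive and
// at least one of them reports Ok.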
154 | func (rc *RequestCoordinator) Put(req *PutRequest, resp *PutResponse) error { 155 | logger.Debugf("Coordinating external request Put(%s, %s, %s)", req.Key, req.Value, req.Level) 156 | 157 | internalReq := &storage.PutRequest{ 158 | Key: req.Key, 159 | Value: req.Value, 160 | } 161 | 162 | replicas := rc.sr.ring.LookupN(req.Key, rc.sr.config.KVSReplicaPoints) 163 | resCh := rc.sendRPCRequests(replicas, PutOp, internalReq) 164 | 165 | ackNeed := rc.numOfRequiredACK(req.Level) 166 | ackReceived := 0 167 | ackOk := 0 168 | 169 | for result := range resCh { 170 | switch res := result.(type) { 171 | case *storage.PutResponse: 172 | ackReceived++ 173 | if res.Ok { 174 | ackOk++ 175 | } 176 | 177 | if ackReceived >= ackNeed { 178 | if ackOk == 0 { 179 | logger.Debugf("No ACK with Ok received for Put(%s, %s): %s", req.Key, req.Value, res.Message) 180 | return errors.New(res.Message) 181 | } 182 | return nil 183 | } 184 | case error: 185 | continue 186 | } 187 | } 188 | 189 | logger.Errorf("Cannot reach consistency requirements for Put(%s, %s, %s)", req.Key, req.Value, req.Level) 190 | return errors.New("cannot reach consistency level") 191 | } 192 | 193 | // Delete handles the incoming Delete request. It first looks up for the owner replicas 194 | // of the given key, forwards request to all replicas and deals with them according to 195 | // consistency level. 196 | func (rc *RequestCoordinator) Delete(req *DeleteRequest, resp *DeleteResponse) error { 197 | logger.Debugf("Coordinating external request Delete(%s, %s)", req.Key, req.Level) 198 | 199 | internalReq := &storage.DeleteRequest{ 200 | Key: req.Key, 201 | } 202 | 203 | replicas := rc.sr.ring.LookupN(req.Key, rc.sr.config.KVSReplicaPoints) 204 | resCh := rc.sendRPCRequests(replicas, DeleteOp, internalReq) 205 | 206 | ackNeed := rc.numOfRequiredACK(req.Level) 207 | ackReceived := 0 208 | ackOk := 0 209 | 210 | for result := range resCh { 211 | switch res := result.(type) { 212 | case *storage.DeleteResponse: 213 | ackReceived++ 214 | if res.Ok { 215 | ackOk++ 216 | } 217 | 218 | if ackReceived >= ackNeed { 219 | if ackOk == 0 { 220 | logger.Debugf("No ACK with Ok received for Delete(%s): %s", req.Key, res.Message) 221 | return errors.New(res.Message) 222 | } 223 | return nil 224 | } 225 | case error: 226 | continue 227 | } 228 | } 229 | 230 | logger.Errorf("Cannot reach consistency requirements for Delete(%s, %s)", req.Key, req.Level) 231 | return errors.New("cannot reach consistency level") 232 | } 233 | 234 | // Stat handles the incoming Stat request. 
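//
// A hedged client-side sketch (client is assumed to be an *rpc.Client already
// connected to the external port):
//
//     resp := &StateResponse{}
//     if err := client.Call("SwimRing.Stat", &StateRequest{}, resp); err == nil {
//         for _, n := range resp.Nodes {
//             fmt.Printf("%s %s %d\n", n.Address, n.Status, n.KeyCount)
//         }
//     }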
235 | func (rc *RequestCoordinator) Stat(req *StateRequest, resp *StateResponse) error { 236 | logger.Debug("Coordinating external request Stat()") 237 | 238 | internalReq := &storage.StatRequest{} 239 | 240 | members := rc.sr.node.Members() 241 | resCh := make(chan interface{}, len(members)) 242 | var wg sync.WaitGroup 243 | 244 | for _, member := range members { 245 | wg.Add(1) 246 | 247 | go func(member membership.Member) { 248 | defer wg.Done() 249 | 250 | stat := NodeStat{ 251 | Address: member.Address, 252 | Status: member.Status, 253 | KeyCount: 0, 254 | } 255 | 256 | res, err := rc.sendRPCRequest(member.Address, StatOp, internalReq) 257 | if err == nil { 258 | stat.KeyCount = res.(*storage.StatResponse).Count 259 | } 260 | 261 | resCh <- stat 262 | }(member) 263 | } 264 | 265 | go func() { 266 | wg.Wait() 267 | close(resCh) 268 | }() 269 | 270 | for result := range resCh { 271 | resp.Nodes = append(resp.Nodes, result.(NodeStat)) 272 | } 273 | 274 | return nil 275 | } 276 | 277 | func (rc *RequestCoordinator) sendRPCRequests(replicas []string, op string, req interface{}) <-chan interface{} { 278 | var wg sync.WaitGroup 279 | resCh := make(chan interface{}, len(replicas)) 280 | 281 | for _, replica := range replicas { 282 | wg.Add(1) 283 | 284 | go func(address string) { 285 | defer wg.Done() 286 | 287 | res, err := rc.sendRPCRequest(address, op, req) 288 | if err != nil { 289 | resCh <- err 290 | return 291 | } 292 | 293 | resCh <- res 294 | }(replica) 295 | } 296 | 297 | go func() { 298 | wg.Wait() 299 | close(resCh) 300 | }() 301 | 302 | return resCh 303 | } 304 | 305 | func (rc *RequestCoordinator) sendRPCRequest(server string, op string, req interface{}) (interface{}, error) { 306 | if !rc.sr.node.MemberReachable(server) { 307 | return nil, errors.New("not reachable") 308 | } 309 | 310 | var resp interface{} 311 | switch op { 312 | case GetOp: 313 | resp = &storage.GetResponse{} 314 | case PutOp: 315 | resp = &storage.PutResponse{} 316 | case DeleteOp: 317 | resp = &storage.DeleteResponse{} 318 | case StatOp: 319 | resp = &storage.StatResponse{} 320 | } 321 | 322 | errCh := make(chan error, 1) 323 | go func() { 324 | client, err := rc.sr.node.MemberClient(server) 325 | if err != nil { 326 | errCh <- err 327 | return 328 | } 329 | 330 | if client != nil { 331 | logger.Infof("Sending %s request to %s", op, server) 332 | errCh <- client.Call(op, req, resp) 333 | } 334 | }() 335 | 336 | var err error 337 | select { 338 | case err = <-errCh: 339 | if err != nil { 340 | logger.Errorf("%s response from %s: %s", op, server, err.Error()) 341 | } else { 342 | logger.Infof("%s response from %s: ok", op, server) 343 | } 344 | case <-time.After(1500 * time.Millisecond): 345 | logger.Warningf("%s request to %s: timeout", op, server) 346 | err = errors.New("request timeout") 347 | } 348 | 349 | if err != nil { 350 | return nil, err 351 | } 352 | 353 | return resp, err 354 | } 355 | 356 | func (rc *RequestCoordinator) numOfRequiredACK(level string) int { 357 | switch level { 358 | case ONE: 359 | return 1 360 | case QUORUM: 361 | return int(math.Floor(float64(rc.sr.config.KVSReplicaPoints)/2)) + 1 362 | case ALL: 363 | return rc.sr.config.KVSReplicaPoints 364 | } 365 | 366 | return rc.sr.config.KVSReplicaPoints 367 | } 368 | 369 | func (rc *RequestCoordinator) readRepair(resList []*storage.GetResponse, key string, value string, timestamp int64, okCount int, resCh <-chan interface{}) { 370 | latestTimestamp := timestamp 371 | latestValue := value 372 | ackOk := okCount 373 | 374 | for result := 
range resCh { 375 | switch res := result.(type) { 376 | case *storage.GetResponse: 377 | resList = append(resList, res) 378 | 379 | if res.Ok { 380 | ackOk++ 381 | } 382 | 383 | if res.Ok && res.Value.Timestamp > latestTimestamp { 384 | latestTimestamp = res.Value.Timestamp 385 | latestValue = res.Value.Value 386 | } 387 | case error: 388 | continue 389 | } 390 | } 391 | 392 | if ackOk == 0 { 393 | return 394 | } 395 | 396 | for _, res := range resList { 397 | if !res.Ok || res.Value.Value != latestValue { 398 | logger.Debugf("Initiating read repair for %s: (%s, %s)", res.Node, key, latestValue) 399 | go rc.sendRPCRequest(res.Node, PutOp, &storage.PutRequest{ 400 | Key: key, 401 | Value: latestValue, 402 | }) 403 | } 404 | } 405 | } 406 | -------------------------------------------------------------------------------- /storage/kvstore.go: -------------------------------------------------------------------------------- 1 | package storage 2 | 3 | import ( 4 | "bufio" 5 | "errors" 6 | "io/ioutil" 7 | "net/rpc" 8 | "os" 9 | "regexp" 10 | "strconv" 11 | "strings" 12 | "sync" 13 | "time" 14 | 15 | "github.com/op/go-logging" 16 | ) 17 | 18 | var logger = logging.MustGetLogger("storage") 19 | 20 | // KVStore is a key-value storage engine. 21 | type KVStore struct { 22 | sync.RWMutex 23 | address string 24 | memtable map[string]*KVEntry 25 | 26 | requestHandlers *RequestHandlers 27 | 28 | commitLogName, dumpFileName string 29 | mapSize, boundarySize, dumpsIndex int 30 | } 31 | 32 | // KVEntry is a storage unit for a value. 33 | type KVEntry struct { 34 | Value string 35 | Timestamp int64 36 | Exist int 37 | } 38 | 39 | // NewKVStore returns a new KVStore instance. 40 | func NewKVStore(address string) *KVStore { 41 | kvs := &KVStore{ 42 | address: address, 43 | mapSize: 0, 44 | boundarySize: 128, 45 | dumpsIndex: 1, 46 | } 47 | kvs.memtable = make(map[string]*KVEntry) 48 | kvs.commitLogName = strings.Replace(address, ":", "_", -1) + "_commit.log" 49 | kvs.dumpFileName = strings.Replace(address, ":", "_", -1) + "_dump.log" 50 | 51 | requestHandlers := NewRequestHandler(kvs) 52 | kvs.requestHandlers = requestHandlers 53 | 54 | if _, err := os.Stat(kvs.commitLogName); os.IsNotExist(err) { 55 | os.Create(kvs.commitLogName) 56 | } 57 | 58 | kvs.repairDB() 59 | go kvs.flushToDumpFile() 60 | 61 | return kvs 62 | } 63 | 64 | // Get returns the KVEntry of the given key. 65 | func (k *KVStore) Get(key string) (*KVEntry, error) { 66 | k.Lock() 67 | value, ok := k.memtable[key] 68 | k.Unlock() 69 | 70 | if !ok || value.Exist == 0 { 71 | return nil, errors.New("key not found") 72 | } 73 | return value, nil 74 | } 75 | 76 | // Put updates the value for the given key. 77 | func (k *KVStore) Put(key, value string) error { 78 | entry := KVEntry{Value: value, Timestamp: time.Now().UnixNano(), Exist: 1} 79 | 80 | k.Lock() 81 | k.appendToCommitLog(key, &entry) 82 | k.memtable[key] = &entry 83 | k.Unlock() 84 | 85 | logger.Infof("Key-value pair (%s, %s) updated to memtable", key, value) 86 | 87 | return nil 88 | } 89 | 90 | // Delete removes the entry of the given key. 91 | func (k *KVStore) Delete(key string) error { 92 | value, _ := k.Get(key) 93 | 94 | if value == nil { 95 | return errors.New("key not found") 96 | } 97 | value = &KVEntry{Value: "", Timestamp: time.Now().UnixNano(), Exist: 0} 98 | 99 | k.Lock() 100 | k.appendToCommitLog(key, value) 101 | k.memtable[key] = value 102 | k.Unlock() 103 | 104 | return nil 105 | } 106 | 107 | // Count returns the number of entries in local KVS. 
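//
// Note that Count also includes tombstoned entries (Exist == 0), because
// Delete keeps a deletion marker in the memtable rather than removing the key.
// A minimal usage sketch of the store itself (the address is a placeholder):
//
//     kvs := NewKVStore("127.0.0.1:7001")
//     _ = kvs.Put("foo", "bar")
//     entry, err := kvs.Get("foo") // entry.Value == "bar" when err == nil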
108 | func (k *KVStore) Count() int { 109 | return len(k.memtable) 110 | } 111 | 112 | // RegisterRPCHandlers registers the internal RPC handlers. 113 | func (k *KVStore) RegisterRPCHandlers(server *rpc.Server) error { 114 | server.RegisterName("KVS", k.requestHandlers) 115 | logger.Info("Internal KVS request RPC handlers registered") 116 | return nil 117 | } 118 | 119 | func (k *KVStore) appendToCommitLog(key string, entry *KVEntry) error { 120 | fLogfile, err := os.OpenFile(k.commitLogName, os.O_APPEND|os.O_WRONLY, 0644) 121 | if err != nil { 122 | logger.Error(err.Error()) 123 | return err 124 | } 125 | 126 | k.writeKeyValueToFile(fLogfile, key, entry) 127 | logger.Infof("Key-value pair (%s, %s) appended to commit log", key, entry.Value) 128 | 129 | return nil 130 | } 131 | 132 | func (k *KVStore) flushToDumpFile() error { 133 | for { 134 | time.Sleep(30 * time.Second) 135 | 136 | k.Lock() 137 | f, _ := os.Create(k.dumpFileName) 138 | for key, value := range k.memtable { 139 | k.writeKeyValueToFile(f, key, value) 140 | } 141 | f, _ = os.Create(k.commitLogName) 142 | k.Unlock() 143 | 144 | logger.Notice("Memtable dumped to disk") 145 | logger.Info("Commit log cleared") 146 | } 147 | } 148 | 149 | func (k *KVStore) repairDB() { 150 | files, _ := ioutil.ReadDir("./") 151 | matchpattern := strings.Replace(k.address, ":", "_", -1) + "_(dump|commit).log" 152 | 153 | logger.Notice("Trying to repair key value storage...") 154 | 155 | for _, file := range files { 156 | if ok, _ := regexp.Match(matchpattern, []byte(file.Name())); ok { 157 | fLog, err := os.Open(file.Name()) 158 | if err != nil { 159 | return 160 | } 161 | f := bufio.NewReader(fLog) 162 | for { 163 | key, value, timestamp, exist, err := k.getNextKeyValueFromFile(f) 164 | if err != nil { 165 | break 166 | } 167 | 168 | if cur, ok := k.memtable[key]; ok { 169 | if cur.Timestamp < timestamp { 170 | cur.Timestamp = timestamp 171 | cur.Exist = exist 172 | cur.Value = value 173 | } 174 | } else { 175 | tmpKVEntry := &KVEntry{Value: value, Timestamp: timestamp, Exist: exist} 176 | k.memtable[key] = tmpKVEntry 177 | } 178 | } 179 | } 180 | } 181 | } 182 | 183 | func (k *KVStore) getNextKeyValueFromFile(f *bufio.Reader) (string, string, int64, int, error) { 184 | var nextLenStr string 185 | var err error 186 | if nextLenStr, err = f.ReadString(' '); err != nil { 187 | return "", "", 0, 0, err 188 | } 189 | nextLenStr = nextLenStr[:len(nextLenStr)-1] 190 | nextLen, _ := strconv.Atoi(nextLenStr) 191 | readKey := "" 192 | for len(readKey) != nextLen { 193 | tmp := make([]byte, nextLen-len(readKey)) 194 | f.Read(tmp) 195 | readKey += string(tmp[:]) 196 | } 197 | nextLenStr, _ = f.ReadString(' ') 198 | nextLenStr, _ = f.ReadString(' ') 199 | nextLenStr = nextLenStr[:len(nextLenStr)-1] 200 | nextLen, _ = strconv.Atoi(nextLenStr) 201 | readValue := make([]byte, nextLen) 202 | if _, err = f.Read(readValue); err != nil { 203 | return "", "", 0, 0, err 204 | } 205 | 206 | readTimestamp, err := f.ReadString(' ') 207 | readTimestamp, err = f.ReadString(' ') 208 | readTimestamp = readTimestamp[:len(readTimestamp)-1] 209 | timestamp, _ := strconv.ParseInt(readTimestamp, 10, 64) 210 | 211 | readExist, err := f.ReadString('\n') 212 | readExist = readExist[:len(readExist)-1] 213 | exist, _ := strconv.Atoi(readExist) 214 | 215 | return string(readKey[:]), string(readValue[:]), timestamp, exist, nil 216 | } 217 | 218 | func (k *KVStore) writeKeyValueToFile(f *os.File, key string, value *KVEntry) error { 219 | keyString := strconv.Itoa(len(key)) + " " + key 
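    // The record appended below has the layout
    // "<keyLen> <key> <valueLen> <value> <timestamp> <exist>\n"; for example,
    // Put("foo", "bar") at an illustrative timestamp 1500000000000000000 yields:
    //
    //     3 foo 3 bar 1500000000000000000 1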
220 | if _, err := f.WriteString(keyString + " "); err != nil { 221 | logger.Error(err.Error()) 222 | } 223 | 224 | valueString := strconv.Itoa(len(value.Value)) + " " + value.Value 225 | if _, err := f.WriteString(valueString + " "); err != nil { 226 | logger.Error(err.Error()) 227 | } 228 | 229 | timeStampString := strconv.FormatInt(value.Timestamp, 10) 230 | if _, err := f.WriteString(timeStampString + " "); err != nil { 231 | logger.Error(err.Error()) 232 | } 233 | 234 | existString := strconv.Itoa(value.Exist) 235 | if _, err := f.WriteString(existString + "\n"); err != nil { 236 | logger.Error(err.Error()) 237 | } 238 | 239 | return nil 240 | } 241 | -------------------------------------------------------------------------------- /storage/request_handlers.go: -------------------------------------------------------------------------------- 1 | package storage 2 | 3 | // RequestHandlers defines a set of RPC handlers for internal KVS request. 4 | type RequestHandlers struct { 5 | kvs *KVStore 6 | } 7 | 8 | // GetRequest is the payload of Get. 9 | type GetRequest struct { 10 | Key string 11 | } 12 | 13 | // GetResponse is the payload of the response of Get. 14 | type GetResponse struct { 15 | Ok bool 16 | Message string 17 | 18 | Node string 19 | Key string 20 | Value KVEntry 21 | } 22 | 23 | // PutRequest is the payload of Put. 24 | type PutRequest struct { 25 | Key, Value string 26 | } 27 | 28 | // PutResponse is the payload of the response of Put. 29 | type PutResponse struct { 30 | Ok bool 31 | Message string 32 | } 33 | 34 | // DeleteRequest is the payload of Delete. 35 | type DeleteRequest struct { 36 | Key string 37 | } 38 | 39 | // DeleteResponse is the payload of the response of Delete. 40 | type DeleteResponse struct { 41 | Ok bool 42 | Message string 43 | } 44 | 45 | // StatRequest is the payload of Stat. 46 | type StatRequest struct{} 47 | 48 | // StatResponse is the payload of the response of Stat. 49 | type StatResponse struct { 50 | Ok bool 51 | Count int 52 | } 53 | 54 | // NewRequestHandler returns a new RequestHandlers. 55 | func NewRequestHandler(kvs *KVStore) *RequestHandlers { 56 | rh := &RequestHandlers{ 57 | kvs: kvs, 58 | } 59 | 60 | return rh 61 | } 62 | 63 | // Get handles the incoming Get request. 64 | func (rh *RequestHandlers) Get(req *GetRequest, resp *GetResponse) error { 65 | logger.Infof("Handling intrnal request Get(%s)", req.Key) 66 | 67 | value, err := rh.kvs.Get(req.Key) 68 | resp.Node = rh.kvs.address 69 | if err != nil { 70 | resp.Ok = false 71 | resp.Message = err.Error() 72 | return nil 73 | } 74 | 75 | resp.Ok = true 76 | resp.Key = req.Key 77 | resp.Value = *value 78 | return nil 79 | } 80 | 81 | // Put handles the incoming Put request. 82 | func (rh *RequestHandlers) Put(req *PutRequest, resp *PutResponse) error { 83 | logger.Infof("Handling intrnal request Put(%s, %s)", req.Key, req.Value) 84 | 85 | err := rh.kvs.Put(req.Key, req.Value) 86 | if err != nil { 87 | resp.Ok = false 88 | resp.Message = err.Error() 89 | return nil 90 | } 91 | 92 | resp.Ok = true 93 | return nil 94 | } 95 | 96 | // Delete handles the incoming Delete request. 97 | func (rh *RequestHandlers) Delete(req *DeleteRequest, resp *DeleteResponse) error { 98 | logger.Infof("Handling intrnal request Delete(%s)", req.Key) 99 | 100 | err := rh.kvs.Delete(req.Key) 101 | if err != nil { 102 | resp.Ok = false 103 | resp.Message = err.Error() 104 | return nil 105 | } 106 | 107 | resp.Ok = true 108 | return nil 109 | } 110 | 111 | // Stat handles the incoming Stat request. 
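//
// The request coordinator invokes this over the internal "KVS" RPC service,
// roughly as follows (condensed; client is assumed to be an *rpc.Client for a
// cluster member):
//
//     resp := &StatResponse{}
//     err := client.Call("KVS.Stat", &StatRequest{}, resp)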
112 | func (rh *RequestHandlers) Stat(req *StatRequest, resp *StatResponse) error { 113 | logger.Info("Handling intrnal request Stat()") 114 | 115 | resp.Ok = true 116 | resp.Count = rh.kvs.Count() 117 | 118 | return nil 119 | } 120 | -------------------------------------------------------------------------------- /swimring.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "fmt" 5 | "net" 6 | "net/rpc" 7 | "sync" 8 | "time" 9 | 10 | "github.com/dgryski/go-farm" 11 | "github.com/hungys/swimring/hashring" 12 | "github.com/hungys/swimring/membership" 13 | "github.com/hungys/swimring/storage" 14 | ) 15 | 16 | type configuration struct { 17 | Host string 18 | ExternalPort int `yaml:"ExternalPort"` 19 | InternalPort int `yaml:"InternalPort"` 20 | 21 | JoinTimeout int `yaml:"JoinTimeout"` 22 | SuspectTimeout int `yaml:"SuspectTimeout"` 23 | PingTimeout int `yaml:"PingTimeout"` 24 | PingRequestTimeout int `yaml:"PingRequestTimeout"` 25 | 26 | MinProtocolPeriod int `yaml:"MinProtocolPeriod"` 27 | PingRequestSize int `yaml:"PingRequestSize"` 28 | 29 | VirtualNodeSize int `yaml:"VirtualNodeSize"` 30 | KVSReplicaPoints int `yaml:"KVSReplicaPoints"` 31 | 32 | BootstrapNodes []string `yaml:"BootstrapNodes"` 33 | } 34 | 35 | // SwimRing is a local key-value store replica consisting of a SWIM node, 36 | // a consistent hash ring and a storage engine. 37 | type SwimRing struct { 38 | config *configuration 39 | 40 | status status 41 | statusMutex sync.RWMutex 42 | 43 | node *membership.Node 44 | ring *hashring.HashRing 45 | kvs *storage.KVStore 46 | rc *RequestCoordinator 47 | } 48 | 49 | type status uint 50 | 51 | const ( 52 | created status = iota 53 | initialized 54 | ready 55 | destroyed 56 | ) 57 | 58 | // NewSwimRing returns a new SwimRing instance. 59 | func NewSwimRing(config *configuration) *SwimRing { 60 | sr := &SwimRing{ 61 | config: config, 62 | } 63 | sr.setStatus(created) 64 | 65 | return sr 66 | } 67 | 68 | func (sr *SwimRing) init() error { 69 | address := fmt.Sprintf("%s:%d", sr.config.Host, sr.config.InternalPort) 70 | 71 | sr.node = membership.NewNode(sr, address, &membership.Options{ 72 | JoinTimeout: time.Duration(sr.config.JoinTimeout) * time.Millisecond, 73 | SuspectTimeout: time.Duration(sr.config.SuspectTimeout) * time.Millisecond, 74 | PingTimeout: time.Duration(sr.config.PingTimeout) * time.Millisecond, 75 | PingRequestTimeout: time.Duration(sr.config.PingRequestTimeout) * time.Millisecond, 76 | MinProtocolPeriod: time.Duration(sr.config.MinProtocolPeriod) * time.Millisecond, 77 | PingRequestSize: sr.config.PingRequestSize, 78 | BootstrapNodes: sr.config.BootstrapNodes, 79 | }) 80 | 81 | sr.ring = hashring.NewHashRing(farm.Fingerprint32, sr.config.VirtualNodeSize) 82 | sr.kvs = storage.NewKVStore(address) 83 | sr.rc = NewRequestCoordinator(sr) 84 | 85 | sr.setStatus(initialized) 86 | 87 | return nil 88 | } 89 | 90 | // Status returns the status of the current SwimRing instance. 91 | func (sr *SwimRing) Status() status { 92 | sr.statusMutex.RLock() 93 | r := sr.status 94 | sr.statusMutex.RUnlock() 95 | return r 96 | } 97 | 98 | func (sr *SwimRing) setStatus(s status) { 99 | sr.statusMutex.Lock() 100 | sr.status = s 101 | sr.statusMutex.Unlock() 102 | } 103 | 104 | // Bootstrap starts communication for this SwimRing instance. 105 | // 106 | // It first checks if the instance is initialized, then registers RPC handlers, 107 | // and calls Bootstap method of Node instance. 
108 | // 109 | // After all the operations, the SwimRing instance enters ready state. 110 | func (sr *SwimRing) Bootstrap() ([]string, error) { 111 | if sr.Status() < initialized { 112 | err := sr.init() 113 | if err != nil { 114 | return nil, err 115 | } 116 | } 117 | 118 | sr.registerInternalRPCHandlers() 119 | sr.registerExternalRPCHandlers() 120 | joined, err := sr.node.Bootstrap() 121 | if err != nil { 122 | sr.setStatus(initialized) 123 | } 124 | 125 | sr.setStatus(ready) 126 | 127 | return joined, nil 128 | } 129 | 130 | // HandleChanges reveives the change events emitted from memberlist, 131 | // then add/remove servers to/from hashring correspondingly. 132 | func (sr *SwimRing) HandleChanges(changes []membership.Change) { 133 | var serversToAdd, serversToRemove []string 134 | 135 | for _, change := range changes { 136 | switch change.Status { 137 | case membership.Alive, membership.Suspect: 138 | serversToAdd = append(serversToAdd, change.Address) 139 | case membership.Faulty: 140 | // serversToRemove = append(serversToRemove, change.Address) 141 | } 142 | } 143 | 144 | sr.ring.AddRemoveServers(serversToAdd, serversToRemove) 145 | } 146 | 147 | func (sr *SwimRing) registerInternalRPCHandlers() error { 148 | addr, err := net.ResolveTCPAddr("tcp", fmt.Sprintf(":%d", sr.config.InternalPort)) 149 | if err != nil { 150 | return err 151 | } 152 | 153 | conn, err := net.ListenTCP("tcp", addr) 154 | if err != nil { 155 | return err 156 | } 157 | 158 | server := rpc.NewServer() 159 | sr.node.RegisterRPCHandlers(server) 160 | sr.kvs.RegisterRPCHandlers(server) 161 | go server.Accept(conn) 162 | 163 | logger.Noticef("Internal RPC server listening at port %d...", sr.config.InternalPort) 164 | 165 | return nil 166 | } 167 | 168 | func (sr *SwimRing) registerExternalRPCHandlers() error { 169 | addr, err := net.ResolveTCPAddr("tcp", fmt.Sprintf(":%d", sr.config.ExternalPort)) 170 | if err != nil { 171 | return err 172 | } 173 | 174 | conn, err := net.ListenTCP("tcp", addr) 175 | if err != nil { 176 | return err 177 | } 178 | 179 | server := rpc.NewServer() 180 | server.RegisterName("SwimRing", sr.rc) 181 | logger.Info("External KVS request RPC handlers registered") 182 | go server.Accept(conn) 183 | 184 | logger.Noticef("External RPC server listening at port %d...", sr.config.ExternalPort) 185 | 186 | return nil 187 | } 188 | -------------------------------------------------------------------------------- /util/util.go: -------------------------------------------------------------------------------- 1 | package util 2 | 3 | import ( 4 | "net" 5 | "strings" 6 | "time" 7 | ) 8 | 9 | const loopbackIP = "127.0.0.1" 10 | 11 | // SelectIntOpt takes an option and a default value and returns the default value if 12 | // the option is equal to zero, and the option otherwise. 13 | func SelectIntOpt(opt, def int) int { 14 | if opt == 0 { 15 | return def 16 | } 17 | return opt 18 | } 19 | 20 | // SelectDurationOpt takes an option and a default value and returns the default value if 21 | // the option is equal to zero, and the option otherwise. 22 | func SelectDurationOpt(opt, def time.Duration) time.Duration { 23 | if opt == time.Duration(0) { 24 | return def 25 | } 26 | return opt 27 | } 28 | 29 | // GetLocalIP returns the local IP address. 
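//
// It prefers the first non-loopback IPv4 address reported by
// net.InterfaceAddrs and falls back to 127.0.0.1 otherwise; a hedged usage
// sketch from a caller's point of view (the port is a placeholder):
//
//     address := fmt.Sprintf("%s:%d", util.GetLocalIP(), 7001)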
30 | func GetLocalIP() string {
31 |     addrs, err := net.InterfaceAddrs()
32 |     if err != nil {
33 |         return loopbackIP
34 |     }
35 | 
36 |     for _, addr := range addrs {
37 |         if ipnet, ok := addr.(*net.IPNet); ok && !ipnet.IP.IsLoopback() {
38 |             if ipnet.IP.To4() != nil {
39 |                 return ipnet.IP.String()
40 |             }
41 |         }
42 |     }
43 | 
44 |     return loopbackIP
45 | }
46 | 
47 | // SafeSplit splits the given string by spaces and handles quotation marks.
48 | func SafeSplit(s string) []string {
49 |     split := strings.Split(s, " ")
50 | 
51 |     var result []string
52 |     var inquote string
53 |     var block string
54 |     for _, i := range split {
55 |         if inquote == "" {
56 |             if strings.HasPrefix(i, "'") || strings.HasPrefix(i, "\"") {
57 |                 inquote = string(i[0])
58 |                 block = strings.TrimPrefix(i, inquote) + " "
59 |             } else {
60 |                 result = append(result, i)
61 |             }
62 |         } else {
63 |             if !strings.HasSuffix(i, inquote) {
64 |                 block += i + " "
65 |             } else {
66 |                 block += strings.TrimSuffix(i, inquote)
67 |                 inquote = ""
68 |                 result = append(result, block)
69 |                 block = ""
70 |             }
71 |         }
72 |     }
73 | 
74 |     return result
75 | }
76 | 
--------------------------------------------------------------------------------