├── .gitignore ├── Makefile ├── README.md ├── active ├── 0000-proposal-template.md ├── 0001-proposal-process.md ├── 0023-rocksdb-message-persistence.md ├── 0024-assets │ ├── .DS_Store │ ├── emqx3-session-resumption.png │ ├── emqx4-session-takeover.png │ └── emqx5-hybrid-session-takeover-resumption.png ├── 0024-hybrid-session-takeover-and-resumption.md ├── 0025-ds-for-retainer.md ├── 0028-assets │ └── general-design.png ├── 0028-durable-shared-subscriptions.md ├── 0029-assets │ ├── cluster-link-route-repl-msg-fwd.png │ └── cluster-link-route-repl.png └── 0029-cluster-linking.md ├── implemented ├── 0002-new-config-syntax.md ├── 0003-new-config-files.md ├── 0004-assets │ ├── agent-fsm.png │ ├── agent-fsm.uml │ ├── replicant-fsm.png │ ├── replicant-fsm.uml │ ├── replication-msc.png │ └── replication-msc.uml ├── 0004-async-mnesia-change-log-replication.md ├── 0007-rocksdb-mnesia-backend.md ├── 0010-improved-monitoring.md ├── 0010-runq-based-overload-protection.md ├── 0011-assets │ ├── current-implementation.png │ ├── current-implementation.uml │ ├── flows.png │ ├── flows.uml │ ├── init-fsm.png │ └── init-fsm.uml ├── 0011-jq-in-rule-engine.md ├── 0011-persistent-sessions.md ├── 0012-cluster-call.md ├── 0013-gen-swagger-spec.md ├── 0014-rolling-cluster-upgrade.md ├── 0015-unified-authentication.md ├── 0016-emqx-conf.md ├── 0018-unified-authorization.md ├── 0019-plugins.md ├── 0020-assets │ ├── evacuation-coordinator-statuses-enforce.png │ ├── evacuation-coordinator-statuses-enforce.uml │ ├── evacuation-coordinator-statuses-rebalance.png │ ├── evacuation-coordinator-statuses-rebalance.uml │ ├── eviction-agent.png │ ├── node-rebalance-algo0.png │ ├── node-rebalance-algo1.png │ ├── node-rebalance-algo2.png │ └── node-rebalance.png ├── 0020-node-rebalance.md ├── 0021-assets │ ├── aws-mqtt-file-delivery.png │ ├── aws-mqtt-file-delivery.uml │ ├── flow-abort.png │ ├── flow-abort.uml │ ├── flow-async-1.png │ ├── flow-async-1.uml │ ├── flow-happy-path.png │ ├── flow-happy-path.uml 
│ ├── flow-restart.png │ ├── flow-restart.uml │ ├── flow-sync-1.png │ ├── flow-sync-1.uml │ ├── flow-sync-2.png │ └── flow-sync-2.uml ├── 0021-transfer-files-over-mqtt.md ├── 0022-forward-check-install-upgrade-script.md ├── 0023-assets │ └── trace-export-example.png ├── 0023-opentelemetry-trace-integration.md ├── 0026-schema-validation.md ├── 0027-message-transform.md ├── 0030-clientinfo-authentication-rules.md └── 0031-immutable-config-base-for-cluster.hocon └── rejected ├── 0005-stateless-brokers.md ├── 0005-stateless-brokers.png ├── 0006-plugins-on-dashboard.md ├── 0008-community-plugins.md └── 0009-consul-cluster-discovery.md /.gitignore: -------------------------------------------------------------------------------- 1 | /.idea/ 2 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | PICS=$(patsubst %.uml,%.png,$(wildcard */*-assets/*.uml)) 2 | 3 | .PHONY: all 4 | all: $(PICS) 5 | 6 | %.png: %.uml 7 | plantuml $< 8 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # EMQ X Improvement Proposals (EIP) 2 | 3 | This repository contains the EMQ X Improvement Proposals (EIPs), which document 4 | the ideas, designs, or implementation details of new features. All the EIPs are in 5 | Markdown (`*.md`) format. 6 | 7 | New EIPs should first go to the `active` directory by creating a pull request 8 | and asking for approval. After the feature is implemented, the EIP will be moved into 9 | the `implemented` directory. 10 | 11 | Before submitting an EIP, please read the 12 | [0000-proposal-template](active/0000-proposal-template.md), which is a template 13 | demonstrating the format of EIPs. 14 | 15 | ## Creating UML diagrams 16 | 17 | It is possible to add UML diagrams using [PlantUML](https://plantuml.com/).
18 | In order to do this, create a directory called `active/XXXX-assets` (replace 19 | `XXXX` with the EIP number), and put the files there. All files should have the 20 | `.uml` extension. 21 | 22 | Then run `make` to generate the images. 23 | -------------------------------------------------------------------------------- /active/0000-proposal-template.md: -------------------------------------------------------------------------------- 1 | # An Example of EMQX Improvement Proposal 2 | 3 | ## Changelog 4 | 5 | * 2020-10-21: @emqxplus Initial draft 6 | * 2020-02-05: @terry-xiaoyu Restructure 7 | * 2021-02-21: @zmstone Add 'Declined Alternatives' section 8 | 9 | ## Abstract 10 | 11 | A short (~200 word) description of the technical issue being addressed. 12 | 13 | ## Motivation 14 | 15 | This section should clearly explain why the functionality proposed by this EIP 16 | is necessary. EIP submissions without sufficient motivation may be rejected 17 | outright. 18 | 19 | ## Design 20 | 21 | This section should describe the design of the feature in detail. If it is a 22 | change to the architecture, some diagrams may be necessary. 23 | 24 | ## Configuration Changes 25 | 26 | This section should list all the changes to the configuration files (if any). 27 | 28 | ## Backwards Compatibility 29 | 30 | This section should show how the feature is kept backwards compatible. 31 | If it cannot be compatible with previous EMQX versions, explain how you 32 | propose to deal with the incompatibilities. 33 | 34 | ## Document Changes 35 | 36 | If there is any document change, give a brief description of it here. 37 | 38 | ## Testing Suggestions 39 | 40 | The final implementation must include unit tests or Common Test code. If some 41 | additional tests, such as integration or benchmarking tests, need to be done 42 | manually, list them here. 43 | 44 | ## Declined Alternatives 45 | 46 | This section lists the alternatives that were discussed but considered worse than the current one.
47 | It's to help people understand how we reached the current state and also to 48 | prevent going through the discussion again when an old alternative is brought 49 | up again in the future. 50 | 51 | -------------------------------------------------------------------------------- /active/0001-proposal-process.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0001-proposal-process.md -------------------------------------------------------------------------------- /active/0024-assets/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/.DS_Store -------------------------------------------------------------------------------- /active/0024-assets/emqx3-session-resumption.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/emqx3-session-resumption.png -------------------------------------------------------------------------------- /active/0024-assets/emqx4-session-takeover.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/emqx4-session-takeover.png -------------------------------------------------------------------------------- /active/0024-assets/emqx5-hybrid-session-takeover-resumption.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/emqx5-hybrid-session-takeover-resumption.png -------------------------------------------------------------------------------- 
/active/0024-hybrid-session-takeover-and-resumption.md: -------------------------------------------------------------------------------- 1 | # Hybrid MQTT Session Takeover and Resumption 2 | 3 | ## Changelog 4 | 5 | * 2023-06-27: @emqplus Initial commit 6 | 7 | ## Abstract 8 | 9 | This proposal presents a hybrid architecture for MQTT session takeover and resumption in EMQX version 5. 10 | 11 | ## Motivation 12 | 13 | MQTT is widely used in IoT systems, where devices may experience intermittent connectivity or transient network outages. In such scenarios, MQTT session resumption and takeover are essential features that help maintain reliable communication between the devices and the MQTT broker. I propose a new design for MQTT session takeover and resumption in a later EMQX 5 release. 14 | 15 | ## Acronyms Used 16 | 17 | | Name | Acronym | Description | 18 | | --- | --- | --- | 19 | | Node1 | N1 | Node 1 in a cluster | 20 | | Node2 | N2 | Node 2 in a cluster | 21 | | PubSub | PS | PubSub service on a node | 22 | | Connection1 | C1 | MQTT Connection 1 | 23 | | Connection2 | C2 | MQTT Connection 2 | 24 | | Session1 | S1 | MQTT Session 1 | 25 | | Session2 | S2 | MQTT Session 2 | 26 | | Channel1 | CH1 | MQTT Channel 1 | 27 | | Channel2 | CH2 | MQTT Channel 2 | 28 | 29 | ## MQTT Session Resumption in EMQX 3 30 | 31 | EMQX 3 uses two separate Erlang processes for an MQTT connection and session: 32 | 33 | - When a client disconnects, the connection process is shut down while the session process stays alive and keeps the session state; 34 | - When a new client connects, a new connection process is spawned and resumes the session state from the existing session process. 35 | 36 | The process for session resumption is as follows: 37 | 38 | ![Session Resumption in EMQX 3](0024-assets/emqx3-session-resumption.png) 39 | 40 | ``` 41 | title Session Resumption in EMQX 3 42 | 43 | C2(N2)->S1(N1): Resume 44 | S1(N1)->C1(N1): Discard 45 | C1(N1)->C1(N1):
Shutdown 46 | PS(N1)->S1(N1): Deliver 47 | S1(N1)->C2(N2): Deliver 48 | ``` 49 | 50 | After resumption, the session and connection processes are located on two different nodes. On delivery, MQTT messages will have to be forwarded between the nodes. 51 | 52 | **Pros** 53 | 54 | - Simple and fast to resume an MQTT session 55 | 56 | **Cons** 57 | 58 | - Higher memory usage with two Erlang processes 59 | - Low performance for MQTT message delivery 60 | 61 | ## MQTT Session Takeover in EMQX 4 62 | 63 | EMQX 4 uses one Erlang process called the MQTT channel to wrap all the states of the connection and session. 64 | 65 | - When a persistent MQTT client disconnects, the channel process changes to the `disconnected` status and keeps the session state; 66 | - When a new client connects, a new channel process is created and takes over the session state from the old one; 67 | - The old channel process is shut down when the takeover process is done. 68 | 69 | The process for session takeover is as follows: 70 | 71 | ![Session Takeover in EMQX 4](0024-assets/emqx4-session-takeover.png) 72 | 73 | ``` 74 | title Session Takeover in EMQX 4 75 | 76 | CH2(N2)->CH1(N1): Takeover begin 77 | CH1(N1)-->CH2(N2): Session state 78 | CH2(N2)->PS(N2): Subscribe 79 | CH2(N2)->CH1(N1): Takeover end 80 | CH1(N1)->PS(N1): Unsubscribe 81 | CH1(N1)-->CH2(N2): Pending messages 82 | CH1(N1)->CH1(N1): Shutdown 83 | CH2(N2)->CH2(N2): Merge messages 84 | CH2(N2)->CH2(N2): Resume 85 | ``` 86 | 87 | **Pros** 88 | 89 | - One Erlang process for an MQTT client, less memory usage, and fast message delivery. 90 | 91 | **Cons** 92 | 93 | - The takeover operation is somewhat expensive and uses more CPU. 94 | - High CPU usage will make the broker unserviceable when too many takeovers occur. 95 | 96 | ## Hybrid Takeover and Resumption in EMQX 5 97 | 98 | I propose a hybrid session takeover and resumption with an MQTT-aware load balancer in EMQX 5.
99 | 100 | - Still use one channel process to wrap the MQTT connection and session state; 101 | - When an MQTT client disconnects, the persistent session state will be kept in the channel process; 102 | - When a new client connects from the same node, resume the session by passing the socket control and connection state to the channel process; 103 | - If the new client connects from a different node, take over the session as in EMQX 4. 104 | 105 | ![Hybrid Session Takeover and Resumption in EMQX 5](0024-assets/emqx5-hybrid-session-takeover-resumption.png) 106 | 107 | ``` 108 | title Hybrid Session Takeover and Resumption in EMQX 5 109 | 110 | alt same node 111 | CH2(N1)->CH1(N1): Resume with new connection/socket 112 | CH1(N1)-->CH2(N1): Success 113 | CH2(N1)->CH2(N1): Shutdown 114 | else another node 115 | CH2(N2)->CH1(N1): Takeover begin 116 | CH1(N1)-->CH2(N2): Session state 117 | CH2(N2)->PS(N2): Subscribe 118 | CH2(N2)->CH1(N1): Takeover end 119 | CH1(N1)->PS(N1): Unsubscribe 120 | CH1(N1)-->CH2(N2): Pending messages 121 | CH1(N1)->CH1(N1): Shutdown 122 | CH2(N2)->CH2(N2): Merge messages 123 | CH2(N2)->CH2(N2): Resume 124 | end 125 | ``` 126 | 127 | **Pros** 128 | 129 | - Saves CPU when session takeovers occur frequently 130 | - No need to unsubscribe and resubscribe 131 | 132 | **Cons** 133 | 134 | - Somewhat more complex to test and debug 135 | 136 | ## Configuration Changes 137 | 138 | N/A 139 | 140 | ## Backwards Compatibility 141 | 142 | N/A 143 | 144 | ## Document Changes 145 | 146 | Documentation related to MQTT session management should be updated. 147 | 148 | ## Testing Suggestions 149 | 150 | - Add more test cases for session resumption from the same node.
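
The same-node versus other-node branching of the hybrid design can be illustrated with a small model. This is only a sketch under stated assumptions: `Channel` and `connect` are hypothetical names invented for the example, plain Python objects stand in for Erlang channel processes, and the subscribe/unsubscribe steps of the takeover path are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical stand-in for a channel process (not an EMQX API)."""
    client_id: str
    node: str
    session: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)
    alive: bool = True


def connect(existing: Channel, new: Channel) -> tuple[str, Channel]:
    """Return (mode, surviving_channel) for a reconnecting client."""
    if existing.node == new.node:
        # Same node: the existing channel adopts the new connection/socket,
        # and the freshly created channel shuts down.
        new.alive = False
        return "resume", existing
    # Another node: copy the session state, merge pending messages,
    # and shut down the old channel.
    new.session = dict(existing.session)
    new.pending = existing.pending + new.pending
    existing.alive = False
    return "takeover", new
```

In the resume branch no session state is copied, which is where the CPU saving claimed in the Pros list would come from.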
151 | 152 | -------------------------------------------------------------------------------- /active/0025-ds-for-retainer.md: -------------------------------------------------------------------------------- 1 | # Durable storage for MQTT retained messages 2 | 3 | ## Changelog 4 | 5 | * 2024-01-18: @savonarola Initial draft 6 | * 2024-01-22: @savonarola Add more optimizations for the "straightforward approach with optimizations"; declare it as the chosen one 7 | 8 | ## Abstract 9 | 10 | We implement [reliable persistent storage](0023-rocksdb-message-persistence.md) for the messages participating in publish/subscribe operations. We want to have a similar storage for the retained messages. 11 | 12 | ## Motivation 13 | 14 | Retained messages are an important feature of MQTT. For example, they may be used as a state or configuration storage for devices without persistent storage. The current implementation has significant limitations: 15 | * Retained messages are stored in a mnesia table, so scalability is limited for such message insertions. 16 | * To provide fast lookup, the message table is also stored in memory, so scalability is also limited from the memory consumption point of view. 17 | * Current indexing capabilities have some limitations: 18 | * Appropriate indices should be manually specified for nonstandard topic schemes. 19 | * The reindex process is not automatic and quite complicated. 20 | * Indices consume a significant amount of memory. 21 | 22 | We want to get rid of many of these limitations by reusing the durable storage (DS) concept and implementations from general message persistence: 23 | * We want reliable and efficient off-memory storage for retained messages. 24 | * We want a more effective mechanism for indexing retained messages, requiring less memory and being more automatic (like LTS). 25 | * We want a more flexible mechanism for changing the storage schema and retention policies. 
26 | * We want to take advantage of any implemented DS backends for storing retained messages. 27 | 28 | ## Design 29 | 30 | ### Straightforward approach 31 | 32 | The straightforward approach is to use just the same DS as for the regular message replay: 33 | * have a different DB for retained messages; 34 | * append (`store_batch`) retained messages into the DB; 35 | * interpret empty bodies as tombstones; 36 | * on message lookup, calculate streams and immediately fold them to reduce each topic to the last message. 37 | 38 | #### Possible callback implementation 39 | 40 | The retainer must provide the following callbacks: 41 | 42 | ```erlang 43 | -callback delete_message(context(), topic()) -> ok. 44 | %% store_batch([tombstone_message()]) 45 | 46 | -callback store_retained(context(), message()) -> ok. 47 | %% store_batch([message()]) 48 | 49 | -callback read_message(context(), topic()) -> {ok, list()}. 50 | %% get_streams for a concrete topic & fold over a concrete topic 51 | 52 | -callback page_read(context(), topic(), non_neg_integer(), non_neg_integer()) -> 53 | {ok, list()}. 54 | %% get_streams for a filter & fold over all topics matching filter (!!) 55 | 56 | -callback match_messages(context(), topic(), cursor()) -> {ok, list(), cursor()}. 57 | %% get_streams for a filter & fold over all topics matching filter (!!) 58 | 59 | -callback clear_expired(context()) -> ok. 60 | %% drop generations 61 | 62 | -callback clean(context()) -> ok. 63 | %% drop all generations 64 | 65 | -callback size(context()) -> non_neg_integer(). 66 | %% get_streams over all topics & fold (!!) 67 | ``` 68 | 69 | #### Problems of the straightforward approach 70 | 71 | * `page_read` is used from the dashboard and is often used with the `#` topic. Implementing the callback efficiently is impossible: we would have to fold over _all_ topics, sort them somehow, and only then cut out the required page. 72 | * The same is true for the `size` callback.
73 | * The same is true for the `match_messages` callback. However, _currently_ we assume that the whole set of retained messages for a subscription is small enough to fit into memory. 74 | 75 | ### Straightforward approach with optimizations 76 | 77 | An obvious optimization is to use a slightly different key schema in the LTS storage implementation for retained messages: not 78 | ``` 79 | ts_epoch | lts_index_bits | ts_rest | message_id => message 80 | ``` 81 | 82 | but just 83 | 84 | ``` 85 | lts_index_bits | topic => message 86 | ``` 87 | 88 | to have the generation automatically "compacted". 89 | 90 | Other optimizations: 91 | * We do not use generations for retained messages. 92 | * We implement alternative sharding based on the topic, not on the client id/node id. 93 | * To simplify replay, we may encapsulate streams for different shards into a single iterator. 94 | 95 | Thus each topic is stored uniquely in the storage, and we do not need to fold over all topics to implement the `page_read` and `size` callbacks. 96 | 97 | With this approach, we may implement the `match_messages` callback without folding, in constant space. 98 | We use iterator state as `context()`. 99 | 100 | `page_read` may be implemented in the same manner; however, we need to "scroll" the iterator to the required page. 101 | 102 | ### Topic indexing approach 103 | 104 | The alternative approach is that we use the DS only for _topic indexing_. That is, we store not 105 | ``` 106 | lts_index_bits | topic => message 107 | ``` 108 | but 109 | ``` 110 | lts_index_bits | topic => #message{} 111 | ``` 112 | 113 | At the same time, we have a separate storage for the messages themselves, and we store them by topic. 114 | ``` 115 | topic => message 116 | ``` 117 | 118 | The storage 119 | * is an ordered key-value storage, where the key is a topic, and the value is the message. 120 | * is not sharded/generational; only one message per topic is stored.
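
To make the "ordered key-value storage with one message per topic" idea concrete, here is a toy in-memory model. It is a sketch only: `RetainedStore` and its method signatures are hypothetical and merely borrow the callback names for readability; a real backend would be a persistent ordered store rather than a Python list and dict.

```python
import bisect


class RetainedStore:
    """Toy ordered key-value store keeping at most one message per topic."""

    def __init__(self):
        self._topics = []    # sorted list of topics (the ordered keys)
        self._messages = {}  # topic -> payload

    def store(self, topic: str, payload: bytes) -> None:
        if payload == b"":
            # An empty body acts as a tombstone: drop the retained message.
            if topic in self._messages:
                del self._messages[topic]
                self._topics.pop(bisect.bisect_left(self._topics, topic))
            return
        if topic not in self._messages:
            bisect.insort(self._topics, topic)
        # Overwrite in place, so the store never holds two messages per topic.
        self._messages[topic] = payload

    def read_message(self, topic: str):
        return self._messages.get(topic)

    def size(self) -> int:
        return len(self._topics)

    def page_read(self, page: int, limit: int):
        # Pages are 1-based slices over the ordered keys.
        start = (page - 1) * limit
        return [(t, self._messages[t]) for t in self._topics[start:start + limit]]
```

Because every topic appears exactly once and keys are ordered, `size` is a key count and `page_read` is a slice; neither requires folding over message history.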
121 | 122 | With this approach, the callbacks `page_read` and `size` are trivially implemented. Also, `read_message` is implemented by reading the message from the KV storage by a key. 123 | 124 | #### Problems of the topic indexing approach 125 | 126 | Although the topic indexing approach is more flexible (e.g., we may index messages in replicated RocksDB DS but keep them in FoundationDB KV), it has some problems: 127 | * Things get entangled. The storage implementation passed to DS should cooperate with "additional" KV storage 128 | of the retainer. 129 | * We need to tie many things together: DS, DS storage implementation, KV storage. 130 | 131 | ### Other approaches 132 | 133 | We may create completely standalone storage for retained messages, not using high-level DS at all, but using only low-level DS primitives, like LTS tries and bitfield mappers. 134 | 135 | ### Additional opportunities 136 | 137 | In straightforward approaches, we may still keep the TS part in the storage but additionally introduce some kind of a "secondary index" where we keep the timestamped key by topic/clientid/etc. 138 | 139 | The "secondary index" will still allow us to keep the storage "compacted" by topic/clientid/etc: we will delete the old timestamped key when we store a new one. 140 | 141 | E.g., to have compaction by topic: 142 | 143 | 1. We want to insert a message `#message{topic="a/b/c", ts=TS1, ...} = message1`. 144 | 1. We insert the message into the storage `high_bits(TS1) | lts(topic) | low_bits(TS1) => message1`. 145 | 1. We insert the key `topic => high_bits(TS1) | lts(topic) | low_bits(TS1)` into the "secondary index". 146 | 147 | Then, we want to insert the new message with the same topic: 148 | 149 | 1. We want to insert the message `#message{topic="a/b/c", ts=TS2, ...} = message2`. 150 | 1. We insert the message into the storage `high_bits(TS2) | lts(topic) | low_bits(TS2) => message2`. 151 | 1. We get the old key `high_bits(TS1) | lts(topic) | low_bits(TS1)` from the "secondary index" by `topic`. 152 | 1.
We insert the key `topic => high_bits(TS2) | lts(topic) | low_bits(TS2)` into the "secondary index". 153 | 1. We delete `message1` by the fetched old key `high_bits(TS1) | lts(topic) | low_bits(TS1)` from the storage. 154 | 155 | This will have the advantage of still being able to "subscribe" to any topic pattern with a regular DS replayer, with the semantics: "give me the actual message(data) and all the ongoing updates." In turn, it may be helpful for subscriptions to some kinds of events, like session registrations, takeovers, etc. 156 | 157 | ### Conclusion 158 | 159 | We choose the straightforward approach with optimizations. 160 | It allows us to reuse the existing DS implementations and abstractions. Also, with this approach, we may implement all the operations in constant space, which will be a significant improvement over the current implementation. 161 | 162 | ## Configuration Changes 163 | 164 | Currently, we have only one type of storage for retained messages: 165 | 166 | ``` 167 | retainer { 168 | ... 169 | backend { 170 | type = built_in_database 171 | ... 172 | } 173 | } 174 | ``` 175 | 176 | Like in message persistence, we will have: 177 | 178 | ``` 179 | retainer { 180 | ... 181 | backends { 182 | built_in_database { 183 | enabled = false 184 | } 185 | fdb { 186 | enabled = true 187 | ds { 188 | ... 189 | # options for emqx_ds:open/2 190 | } 191 | ... other options 192 | } 193 | } 194 | } 195 | ``` 196 | 197 | ## Backwards Compatibility 198 | 199 | No backwards compatibility issues are expected. Retainer configs having old `backend` will use the old storage, and those having `backends` will use the new one. 200 | 201 | ## Testing Suggestions 202 | 203 | ## Declined Alternatives 204 | 205 | * Straightforward approach (without optimizations). 206 | * Topic-only indexing approach. 
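
The compaction-by-topic steps listed under "Additional opportunities" can be traced with a minimal model. This is a sketch under stated assumptions: `CompactedStorage` and `make_key` are hypothetical names, the topic string itself stands in for `lts(topic)`, and the 16-bit split between high and low timestamp bits is arbitrary.

```python
class CompactedStorage:
    """Toy model of timestamped storage plus a topic -> key secondary index."""

    def __init__(self):
        self.storage = {}    # timestamped key -> message
        self.secondary = {}  # topic -> current timestamped key

    @staticmethod
    def make_key(topic: str, ts: int):
        # Stand-in for `high_bits(TS) | lts(topic) | low_bits(TS)`.
        # Assumption: the low part is 16 bits wide, the topic replaces lts(topic).
        return (ts >> 16, topic, ts & 0xFFFF)

    def insert(self, topic: str, ts: int, message) -> None:
        key = self.make_key(topic, ts)
        # 1. Store the message under its timestamped key.
        self.storage[key] = message
        # 2. Fetch the previous key for this topic, if any.
        old_key = self.secondary.get(topic)
        # 3. Point the secondary index at the new key.
        self.secondary[topic] = key
        # 4. Delete the superseded message.
        if old_key is not None and old_key != key:
            del self.storage[old_key]
```

After two inserts for the same topic, only the later message remains in `storage`, while the keys keep their timestamp ordering for replay.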
207 | 208 | -------------------------------------------------------------------------------- /active/0028-assets/general-design.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0028-assets/general-design.png -------------------------------------------------------------------------------- /active/0028-durable-shared-subscriptions.md: -------------------------------------------------------------------------------- 1 | # Durable Shared Subscriptions 2 | 3 | ## Changelog 4 | 5 | * 2024-05-10: @savonarola Initial draft 6 | * 2024-06-28: @savonarola 7 | * Add the Agent abstraction 8 | * Describe the two-side communication sequence between an Agent and the SGL 9 | * Describe the stream reassignment algorithm 10 | * 2025-01-08: @savonarola 11 | * Change the naming to use fewer abbreviations and more descriptive names 12 | * Extend the introductory section a bit to make the general problem clear 13 | * Add a section with the general layer structure 14 | * Update the interaction details according to the new simpler design 15 | 16 | ## Abstract 17 | 18 | We describe the implementation of shared subscriptions based on the concept of durable storage. 19 | 20 | ## Motivation 21 | 22 | Since we have a durable storage-based implementation for the regular subscriptions, we want to extend its advantages to the shared subscriptions. That is, mainly: 23 | * To have messages persisted, that is, not lost regardless of the crashes or absence of consumers. 24 | * To be able to replay messages from the past. 25 | 26 | ## Design 27 | 28 | ### General Idea 29 | 30 | Shared subscriptions (or queues) are a feature that allows multiple consumers to consume messages from some topic filter cooperatively.
31 | 32 | Several consumers subscribe to a special topic `$shared/GROUP_ID/TOPIC_FILTER` (or `$queue/GROUP_ID/TOPIC_FILTER` in the case of queues) and consume messages from `TOPIC_FILTER`. Each message goes to a single consumer. 33 | 34 | With the durable backend (Durable Storage backend, DS), all messages pertain to the ordered _streams_ of messages. Streams may be read sequentially (possibly skipping some messages). Streams may be finalized (i.e., closed) and so fully consumed. 35 | 36 | Since the streams are a means of sharding messages, the natural idea is to use the same sharding mechanism for shared subscriptions. That is, assign disjoint subsets of streams to different consumers and let them consume their streams in parallel. 37 | 38 | We need an entity responsible for distributing such streams across consumers. 39 | We implement such an entity as a cluster-unique process called **Shared Group Leader** or simply **Leader**. 40 | 41 | The Leader is spawned by the node-local leader registry when the first consumer connects to the group subscription. 42 | 43 | The global registration mechanism is based on the DS precondition feature, which allows the creation/deletion of message entries in the DS atomically. 44 | The current leader owns a special data entry in the DS and periodically renews the ownership. 45 | If the leader dies (e.g. its node goes down), the connected sessions eventually detect the leader loss and ask the node-local leader registry to find the leader again. 46 | The leader registry either reclaims the leader's data in the DS and spawns a new leader or finds the data already reclaimed by another leader and responds to the session with the new leader's location. 47 | 48 | 49 | Leaders' data is also stored in the DS. Note that the DS for the Leader registration and other data is completely separate from the DS for the messages. 50 | 51 | The Leader keeps track of topics belonging to the group, their streams, and stream progress.
It is the only entity that tracks the replay progress of these streams. 52 | 53 | The consumers are persistent sessions. They connect to the Leader via the encapsulated **Agent** entity, and the Leader grants them streams to consume. Sessions consume these streams together with their proper streams but do not persist the progress. Instead, they report the progress to the Leader. 54 | 55 | The Leader is responsible for reassigning streams to the other sessions in case a consumer session disconnects and for reassigning streams to the new consumers. 56 | 57 | ### Layer design 58 | 59 | The high-level layers are: 60 | * Session *Shared Subscription Handler* 61 | * Shared Subscription *Agent* 62 | * Shared Subscription *Borrower* 63 | * Shared Subscription *Leader* 64 | 65 | #### Session Shared Subscription Handler 66 | 67 | Session Shared Subscription Handler (or simply Shared Subscription Handler) is the session-side facade for the shared subscriptions. 68 | It is a counterpart of the module responsible for the regular (private) session subscriptions. 69 | 70 | The Shared Subscription Handler 71 | * Handles the `on_subscribe`/`on_unsubscribe` events from the session, creating/deleting subscriptions in the session's state and forwarding the requests to the Agent. 72 | * Receives stream granting/revocation messages from the Agent and injects stream states into the session's state and the scheduler. 73 | * Receives stream consumption progress updates and sends them to the Agent. 74 | 75 | So, the Shared Subscription Handler knows how the session works but nothing about how the streams are obtained and managed. This knowledge is encapsulated in the **Agent** abstraction. 76 | 77 | #### Shared Subscription Agent 78 | 79 | The Agent is the entity that provides the interface for the Shared Subscription Handler to obtain stream granting/revocation events and report stream consumption progress.
80 | 81 | For the community edition, the durable shared subscription feature is unavailable. The Agent is implemented as a stub that does not perform any actions, so sessions' subscriptions and unsubscriptions have no effect. 82 | A client having a durable session still may subscribe to some shared subscription topic, but no corresponding messages will be delivered. 83 | 84 | For the enterprise edition, the Agent actually communicates with the Leaders, receives streams for consumption, and reports stream consumption progress. 85 | 86 | Technically, the Agent itself does not have much communication logic, because it handles _all_ shared subscriptions of a single session. So its responsibility is to maintain a collection of Shared Subscription Borrowers and to forward events belonging to the particular shared subscription to the corresponding Borrower. 87 | 88 | #### Shared Subscription Borrower 89 | 90 | Borrower is the entity within the Agent. It is responsible for a single shared subscription. It talks to the Leader, receives streams for subscription, and reports stream consumption progress. 91 | 92 | It is important that the Borrowers within the session's Agent are isolated from each other and are _not identified_ by the group ID + topic filter. In the case of a quick unsubscribe/subscribe sequence, there may be multiple Borrowers within the same Agent talking to the same Leader: one connecting to the Leader and the others finalizing the previous subscriptions. 93 | 94 | #### Shared Subscription Leader 95 | 96 | The Leader is the entity that is responsible for a single shared subscription group. The Leader 97 | * Tracks and renews streams for the shared subscription's topic filter. 98 | * Tracks the connected Borrowers. 99 | * Assigns streams to and revokes streams from the Borrowers. 100 | * Receives stream consumption progress updates from the Borrowers. 101 | * Persists the shared subscription's state (e.g. stream consumption progress).
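
The Leader's job of handing disjoint subsets of streams to Borrowers can be illustrated as follows. The assignment policy is not specified in this part of the proposal, so the sketch assumes a plain round-robin split purely for illustration; `assign_streams` is a hypothetical helper, not part of any EMQX module.

```python
def assign_streams(streams, borrowers):
    """Split streams round-robin; returns a dict of borrower -> list of streams."""
    assignment = {borrower: [] for borrower in borrowers}
    if not borrowers:
        # With no connected Borrowers, every stream stays unassigned.
        return assignment
    for index, stream in enumerate(sorted(streams)):
        # Each stream goes to exactly one Borrower, so the subsets are disjoint.
        assignment[borrowers[index % len(borrowers)]].append(stream)
    return assignment
```

Whatever policy is used in practice, the invariant is the same: every stream has at most one owner at a time, and a stream changes owner only after it has been revoked from the previous one.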
102 | 103 | #### Layer interaction 104 | 105 | ![general-design](./0028-assets/general-design.png) 106 | 107 | The Shared Subscription Handler, Agent, and Borrowers are nested session-side entities: The Shared Subscription Handler encapsulates an Agent, which encapsulates a collection of Borrowers. Communication between them is done via simple function calls. 108 | 109 | Note that Borrowers are the innermost entities, so the messages to and from the Leader are opaquely propagated through the Agent and Shared Subscription Handler layers. 110 | 111 | The Leader resides in a separate process, so it communicates with Borrowers via a completely asynchronous, message-based protocol. The Leader is a cluster-unique process, so it may belong to any node in the cluster, e.g. a different node from the one where a session resides. 112 | 113 | ### Protocol between Borrower and Leader 114 | 115 | The most complicated part is the asynchronous protocol between a Borrower and a Leader. The other interactions (Agent and Borrower, Shared Subscription Handler and Agent) are mostly forwarding events and callbacks. 116 | 117 | On the Borrower side, we have 118 | * A state machine for the Borrower's state as a whole. 119 | * A collection of state machines for each stream granted to the Borrower. 120 | 121 | #### Borrower's statuses 122 | 123 | The Borrower's statuses are the following: 124 | * `connecting` - the Borrower is created (a client subscribed to a shared subscription or restored an existing subscription). 125 | It looks for a Leader by periodically sending `find_leader` messages. 126 | * `connected` - the Borrower is connected to the Leader, receiving streams (or revoke commands) and reporting progress. 127 | * `unsubscribing` - the session unsubscribed from the shared subscription. The Borrower waits for consistent progress from the session, reports it, and terminates.
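
The Borrower statuses form a one-way progression from creation to termination, which a small transition table can capture. This is a sketch only: the event labels (`leader_found`, `unsubscribe`, `progress_reported`) are hypothetical names chosen for the example, and only status-changing events are modeled (pings and progress updates are left out).

```python
from enum import Enum


class Status(Enum):
    CONNECTING = "connecting"
    CONNECTED = "connected"
    UNSUBSCRIBING = "unsubscribing"
    DESTROYED = "destroyed"


# (current status, event) -> next status.
# Event names are illustrative labels, not protocol message names.
TRANSITIONS = {
    (Status.CONNECTING, "leader_found"): Status.CONNECTED,
    (Status.CONNECTED, "unsubscribe"): Status.UNSUBSCRIBING,
    (Status.UNSUBSCRIBING, "progress_reported"): Status.DESTROYED,
}


def step(status: Status, event: str) -> Status:
    """Return the next status, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"no transition from {status.value} on {event}") from None
```

Since the table has no entry leading back to an earlier status, a Borrower that needs to start over has to be replaced by a fresh one rather than rewound.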

There are no cyclic status transitions; the statuses change as
`[new]` -> `connecting` -> `connected` -> `unsubscribing` -> `[destroyed]`

If a Borrower detects an inconsistent state (e.g., an unexpected message from the Leader), it terminates itself and asks the enclosing Agent to recreate it from scratch. The new Borrower will obtain a new identifier, and the Leader will see it as a completely new Borrower.

The Borrower has the following timers:
* In `connecting` state, there is a periodic find leader timer. It is used to reissue the `find_leader` message if the Leader is not found.
* In `connected` state, there is a periodic ping timer and a ping response timeout timer. On ping timer, a ping message is sent to the Leader. If there is no response within the ping timeout, the Borrower invalidates itself (stops and asks the enclosing Agent to recreate it from scratch).
* In `unsubscribing` state, there is an unsubscribe timeout timer. If the Borrower does not receive the final consistent progress from the session within the timeout, it reports incomplete progress and terminates.

### Individual stream states

Each stream has its own state. The possible stream states are:
* `granted` - the stream is granted to the Borrower.
* `revoking` - the stream is being revoked from the Borrower by the Leader.

Stream state changes are also without cyclic transitions; they are `[absent]` -> `granted` -> `revoking` -> `[absent]`.

A stream becomes `granted` when the Leader assigns it to the Borrower (a `grant` event is received).

A stream becomes `revoking` when the Leader revokes it from the Borrower (a `revoke` event is received).
On revoke, the stream is marked as `unsubscribed` in the enclosing session but still belongs to the Borrower.
The Borrower waits for the final consistent progress from the session.
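
The two acyclic state machines described above can be sketched as small transition tables. This is a minimal Python illustration, not the Erlang implementation: the function names are hypothetical, the `terminate` shortcut to `destroyed` is an assumption, and `revoked` is taken to be the Leader's final confirmation event for a stream. Events that do not trigger a transition (for example a re-sent `grant`) simply leave the state unchanged here; how the real Borrower treats them is not modeled.

```python
# Borrower statuses in the only order they may be visited.
BORROWER_ORDER = ["new", "connecting", "connected", "unsubscribing", "destroyed"]

# Per-stream transitions: (current state, event from the Leader) -> next state.
STREAM_TRANSITIONS = {
    ("absent", "grant"): "granted",
    ("granted", "revoke"): "revoking",
    ("revoking", "revoked"): "absent",
}


def next_borrower_status(current, terminate=False):
    """Advance a Borrower one step; it can never move backwards."""
    if current == "destroyed":
        raise ValueError("a destroyed Borrower has no further transitions")
    if terminate:
        # Assumption: termination is possible from any live status.
        return "destroyed"
    return BORROWER_ORDER[BORROWER_ORDER.index(current) + 1]


def apply_stream_event(state, event):
    """Return the stream state after an event, or the same state if the
    event does not trigger a transition in this simplified model."""
    return STREAM_TRANSITIONS.get((state, event), state)
```

Because neither table contains a backward edge, recovery from an inconsistent state has to happen by destroying the Borrower and creating a new one, which matches the recreate-from-scratch approach described above.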

The stream is removed when a `revoked` event is received from the Leader.
This means that the Leader confirms that the final progress has been received.

### Messages/callbacks between the Borrower and the Leader

#### `connecting` state

From Borrower:
* `leader_wanted` — a request to find the Leader for the shared subscription.
  Since the Borrower is not connected to the Leader yet, it sends this message to a node-local leader registry. The registry will find the Leader and the Leader will respond with a `leader_connect_response` message.

From Leader to Borrower:
* `leader_connect_response` — the Leader responds to the `leader_wanted` message. The response contains the Leader's id.

From the enclosing Agent/Session:
* `on_disconnect`, `on_unsubscribe` — since we have no streams, we send a `disconnect` message to the Leader and terminate the Borrower.

#### `connected` and `unsubscribing` states

From Borrower:
* `ping` — a periodic ping message to keep the connection alive.
* `disconnect` — a message to disconnect from the Leader. The message contains the latest progress of all granted streams.
* `update_progress` — a message to update the progress of the stream consumption.
* `revoke_finished` — a message to notify the Leader that the stream revocation is finished.

From Leader to Borrower:
* `ping_response` — a response to the `ping` message.
* `grant` — a message that the Leader grants a stream to the Borrower.
    * in `unsubscribing` state it is ignored.
    * in `connected` state, the granted stream is added to the Borrower's stream set and an event is returned to the enclosing session to install the stream.
* `revoke` — a message that the Leader revokes a stream from the Borrower.
    * in `unsubscribing` state it is ignored.
    * in `connected` state, the stream is marked as `revoking` and an event is returned to the enclosing session to unsubscribe from the stream. We still keep the stream in the Borrower's stream set until the final progress is received.
* `revoked` — a message that the Leader confirms that the final progress is received.
    * in `unsubscribing` state it is ignored.
    * in `connected` state, the stream is removed from the Borrower's stream set. We respond to the Leader with a `revoke_finished` message.

From the enclosing Agent/Session:
* `on_disconnect` — we send the current progress to the Leader and terminate the Borrower.
* `on_unsubscribe` — we move the Borrower to the `unsubscribing` state.
* `on_stream_progress` — we send the progress to the Leader via the `update_progress` message.

#### All state messages

`invalidate` — a message that the Leader wants to invalidate the Borrower. The Borrower terminates itself and asks the enclosing Agent to recreate it from scratch.

### Leader's logic

The Leader maintains:
* The renewed set of streams for the topic filter of the shared subscription.
* The progress of each stream.
* The set of connected Borrowers.
* The assignment of streams to the Borrowers.

The stream assignment to a Borrower has the following statuses:
* `granting` — the stream is being granted to the Borrower.
* `granted` — the stream is assigned to the Borrower.
* `revoking` — the stream is being revoked from the Borrower.
* `revoked` — the stream is revoked from the Borrower.

Periodically, or after some events, the Leader runs the stream reassignment process.

The stream reassignment process is the following:
* We renew the set of streams for the topic filter.
* We check the total number of streams and the registered Borrowers.
* We calculate the desired number of streams per Borrower.
* For Borrowers having more streams than desired, we revoke some of their streams.
* For Borrowers having fewer streams than desired, we grant some free streams (not assigned to any Borrower).

The granting process is the following:
* We create the stream assignment `stream <-> borrower_id` in the `granting` status.
* We send the `grant` message to the Borrower together with the stream and its progress.
* We resend the `grant` message on timeout.
* After the `grant` message is received by the Borrower, it starts to send stream progress.
* On receiving the progress, we consider the stream granted and update the stream assignment status to `granted`.

The revoking process is the following:
* We move the stream assignment `stream <-> borrower_id` to the `revoking` status.
* We send the `revoke` message to the Borrower.
* We resend the `revoke` message on timeout.
* After the `revoke` message is received by the Borrower, it starts to finalize the stream consumption.
* When we receive the progress from the Borrower with the stream's final progress, we move the stream assignment status to `revoked`.
* We send the `revoked` message to the Borrower.
* We resend the `revoked` message on timeout.
* After the `revoked` message is received by the Borrower, it deletes all the stream-related data and responds with a `revoke_finished` message.
* On receiving the `revoke_finished` message, the Leader deletes the stream assignment.

### Configuration Changes

### Backwards Compatibility

One of the main difficulties is the coexistence of durable shared subscriptions with regular shared subscriptions. For example, consuming messages by an in-memory session from a shared group backed by durable storage.
245 | 246 | ### Document Changes 247 | 248 | ### Testing Suggestions 249 | 250 | ### Declined Alternatives 251 | 252 | Previous PoC implementation appeared to be too complex both for implementation and for understanding. 253 | * There was not one-to-one Borrower <-> Subscription correspondence. That made resubscribing complicated and led to much complex logic in the Shared Subscription Handler. 254 | * Consequently, the Borrowers handled invalidation and resubscription themselves. Their state machine was larger and had cycles. 255 | * The Borrowers and the Leader did not have separate communication levels (connection maintenance vs. stream assignment and progress reporting). Instead, the Leader and the Borrower exchanged versioned sets of streams, which also appeared to be too complex. 256 | 257 | -------------------------------------------------------------------------------- /active/0029-assets/cluster-link-route-repl-msg-fwd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0029-assets/cluster-link-route-repl-msg-fwd.png -------------------------------------------------------------------------------- /active/0029-assets/cluster-link-route-repl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0029-assets/cluster-link-route-repl.png -------------------------------------------------------------------------------- /implemented/0002-new-config-syntax.md: -------------------------------------------------------------------------------- 1 | # New Configuration Syntax for EMQ X v5.0 2 | 3 | ## Abstract 4 | 5 | Introduce a new configuration format in EMQX v5.0 release. 
6 | 7 | ## Motivation 8 | 9 | - The `k=v` format is too verbose 10 | - Should support hierarchy and list 11 | 12 | ## Design 13 | 14 | ### HOCON Style 15 | 16 | Node/cluster config in HOCON style: 17 | 18 | ```hocon 19 | node { 20 | name: "emqx@127.0.0.1" 21 | cookie: emqxsecretcookie 22 | data_dir: "{{ platform_data_dir }}" 23 | global_gc_interval: 15m 24 | crash_dump: "{{ platform_log_dir }}/crash.dump" 25 | } 26 | 27 | cluster { 28 | name: emqxcl 29 | proto_dist: inet_tcp 30 | discovery: manual 31 | autoheal: on 32 | autoclean: 5m 33 | } 34 | ``` 35 | 36 | Listener config in HOCON style: 37 | 38 | ```hocon 39 | listener.tcp { 40 | bind: "0.0.0.0:1883" 41 | zone: default 42 | acceptors: 8 43 | max_conn_rate: 1000 44 | max_connections: 1024000 45 | } 46 | 47 | listener.tcp { 48 | bind: "127.0.0.1:11883" 49 | zone: internal 50 | acceptors: 4 51 | max_connections: 1024000 52 | max_conn_rate: 1000 53 | active_n: 1000 54 | tcp_options: ${tcp.options} //Substitution 55 | } 56 | 57 | listener.ssl { 58 | bind: 8883 59 | zone: default 60 | acceptors: 16 61 | max_conn_rate: 1000 62 | max_connections: 102400 63 | include "ssl.conf" //Include 64 | } 65 | 66 | listener.ws { 67 | bind: 8083 68 | zone: default 69 | acceptors: 4 70 | max_conn_rate: 1000 71 | max_connections: 102400 72 | mqtt_path: /mqtt 73 | } 74 | 75 | listener.wss { 76 | bind: 8084 77 | zone: default 78 | enable_ssl: on 79 | acceptors: 4 80 | max_conn_rate: 1000 81 | max_connections: 102400 82 | mqtt_path: /mqtt 83 | } 84 | 85 | tcp.options { 86 | backlog: 512 87 | send_timeout: 5s 88 | send_timeout_close: on 89 | recbuf: 64KB 90 | sndbuf: 64KB 91 | buffer: 16KB 92 | } 93 | ``` 94 | 95 | ### YAML 96 | 97 | YAML is a declined option. We have decided to use HOCON as a configuration format. 98 | 99 | ## Rationale 100 | 101 | ## Implementation 102 | 103 | We need a parser that reads HOCON file into nested map. 

`foo.conf`
```
a = 1
b = { p = 2 }
```

```
hocon:load("foo.conf").
#{a => 1, b => #{p => 2}}
```

We may also need each value to carry metadata.
Add `filename` and `line` by default to print helpful error info.
We should also support injecting metadata from the API,
e.g. if the dashboard has a config editor, we may want to know who updated the value.

```
hocon:load("foo.conf", #{include_metadata => {true, [{'changed_at', 100}, {'changed_by', "kiyofuji"}]}}).
#{a => #{ value => 1,
          metadata => #{ filename => "foo.conf",
                         line => 1,
                         'changed_at' => 100,
                         'changed_by' => "kiyofuji" }},
  b => #{ value => #{ p => #{ value => 2,
                              metadata => #{ filename => "foo.conf",
                                             line => 2,
                                             'changed_at' => 100,
                                             'changed_by' => "kiyofuji" }}},
          metadata => #{ filename => "foo.conf",
                         line => 2,
                         'changed_at' => 100,
                         'changed_by' => "kiyofuji" }}}
```

The map is then passed into cuttlefish,
where validation of values and so on takes place in the same way as in v4.x.
Hence, we also need to modify cuttlefish to accept the above maps as input,
and add metadata (filename and line) into error info.
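
The value-with-metadata shape shown above can be illustrated with a small Python sketch. It does not parse HOCON: it assumes a parser has already produced `(path, value, line)` triples for the leaf values. The function name `build_config_map` and the rule that an intermediate object takes the position of the first leaf seen under it are assumptions for illustration only.

```python
def build_config_map(entries, filename, extra=None):
    """Build a nested dict in which every node is
    {"value": ..., "metadata": {...}} (hypothetical helper).

    entries:  iterable of (path, value, line); path is a list of keys
    filename: name of the file the entries came from
    extra:    optional metadata injected by the caller (e.g. changed_by)
    """
    extra = extra or {}
    root = {}
    for path, value, line in entries:
        meta = {"filename": filename, "line": line, **extra}
        node = root
        for key in path[:-1]:
            # Assumption: an intermediate object inherits the position
            # of the first leaf that creates it.
            child = node.setdefault(key, {"value": {}, "metadata": dict(meta)})
            node = child["value"]
        node[path[-1]] = {"value": value, "metadata": meta}
    return root


# The two settings of foo.conf, with the line each one appears on.
config = build_config_map(
    [(["a"], 1, 1), (["b", "p"], 2, 2)],
    "foo.conf",
    extra={"changed_at": 100, "changed_by": "kiyofuji"},
)
```

Keeping the metadata next to each value means a later validation step can report the file and line of an offending setting without re-reading the source file.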
143 | 144 | ## References 145 | 146 | - [HOCON Config](https://github.com/lightbend/config) 147 | - [SAP Integrations and Data Management](https://help.sap.com/viewer/50c996852b32456c96d3161a95544cdb/1905/en-US/25550740941d434b8c003347601af0ac.html) 148 | - [HashiCorp Resources](https://www.terraform.io/docs/configuration/syntax.html) 149 | 150 | -------------------------------------------------------------------------------- /implemented/0004-assets/agent-fsm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0004-assets/agent-fsm.png -------------------------------------------------------------------------------- /implemented/0004-assets/agent-fsm.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | 3 | catchup: Replaying transactions from the rlog 4 | switchover: Transient state: buffering realtime events\nwhile still replaying records from the rlog 5 | normal: Forwarding the rlog table\nmnesia events 6 | 7 | [*] --> catchup 8 | catchup --> switchover : reached a record in the rlog\nthat is newer than\nnow() - SafeInterval 9 | switchover --> normal : reached the end of rlog 10 | 11 | @enduml 12 | -------------------------------------------------------------------------------- /implemented/0004-assets/replicant-fsm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0004-assets/replicant-fsm.png -------------------------------------------------------------------------------- /implemented/0004-assets/replicant-fsm.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | 3 | bootstrap: Receiving all the records\n from the core node 4 | local_replay: Replaying transactions\n that have been buffered locally\n 
during bootstrap 5 | normal: Remote transactions are applied\ndirectly to the replica 6 | 7 | [*] --> bootstrap : local checkpoint not\n found or too old 8 | [*] --> normal : local checkpoint\n is compatible 9 | bootstrap --> local_replay : received bootstrap_complete 10 | local_replay --> normal : reached the end of the local rlog 11 | 12 | @enduml 13 | -------------------------------------------------------------------------------- /implemented/0004-assets/replication-msc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0004-assets/replication-msc.png -------------------------------------------------------------------------------- /implemented/0004-assets/replication-msc.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | scale 1024 width 3 | 4 | participant "RLOG server" as server #ffc 5 | participant "RLOG agent" as agent #ffc 6 | participant "RLOG bootstrapper" as boot_serv #ffc 7 | 8 | participant "RLOG replica" as repl #ccf 9 | participant "bootstrap client" as boot_client #ccf 10 | 11 | activate server 12 | activate repl 13 | 14 | group Agent initialization 15 | repl -> server : {connect, LocalCheckpointTS} 16 | note over server : LocalCheckpointTS is too old.\n Client needs to bootstrap 17 | server -\\ agent : spawn(now() - SafeInterval) 18 | activate agent 19 | repl <- server : {need_bootstrap, AgentPID} 20 | end 21 | 22 | group Bootstraper initialization 23 | hnote over repl : bootstrap 24 | 25 | repl -\\ boot_client : spawn() 26 | activate boot_client 27 | 28 | boot_client -> server : {bootstrap, self()} 29 | server -\\ boot_serv : spawn(RemotePid) 30 | activate boot_serv 31 | 32 | boot_serv -> boot_serv : mnesia:dirty_all_keys\nfor each table in shard 33 | 34 | server -> boot_client : {ok, Pid} 35 | end 36 | 37 | group Bootstrap 38 | note over boot_serv : Iterate through 
the\ncached keys 39 | loop 40 | boot_serv -> boot_client : {batch, [{Tab, Record}]} 41 | boot_client -> boot_client : import batch to the\ntable replica 42 | boot_serv <- boot_client : ok 43 | end 44 | 45 | note over agent : At the same time... 46 | 47 | loop 48 | agent -> repl : {batch, [MnesiaOps]} 49 | repl -> repl : cache batch to the local rlog 50 | end 51 | 52 | boot_serv -> boot_client : bootstrap_complete 53 | deactivate boot_serv 54 | boot_client -> repl : bootstrap_complete 55 | deactivate boot_client 56 | end 57 | 58 | group local_replay 59 | hnote over repl : local_replay 60 | 61 | note over repl : Iterate through the\ncached transactions 62 | 63 | loop 64 | agent -> repl : {batch, [MnesiaOps]} 65 | repl -> repl : cache batch in the local rlog 66 | 67 | repl -> repl : Import ops from the local rlog\nto the local replica 68 | end 69 | 70 | note over repl : Reached the end of\nthe local rlog 71 | end 72 | 73 | hnote over repl : normal 74 | 75 | loop 76 | agent -> repl : {batch, [MnesiaOps]} 77 | repl -> repl : Import batch to the\nlocal replica 78 | end 79 | 80 | @enduml -------------------------------------------------------------------------------- /implemented/0004-async-mnesia-change-log-replication.md: -------------------------------------------------------------------------------- 1 | # Async Mnesia transaction replication in EMQ X 5.0 2 | 3 | ## Change log 4 | 5 | * 2021-02-21: @zmstone Add more details 6 | * 2021-03-01: @k32 Minor fixes 7 | * 2021-03-05: @k32 Add more test scenarios and elaborate on the push model. 8 | * 2021-03-11: @k32 Elaborate transaction key generation 9 | * 2021-03-11: @k32 Add MSC diagrams for the bootstrap process 10 | 11 | ## Abstract 12 | 13 | Escape from Erlang distribution mesh network, embrace `gen_rpc`. 14 | 15 | ## Motivation 16 | 17 | The current replication (Mnesia) is based on full-mesh Erlang distribution which 18 | does not scale well and has the risk of split-brain. 

## Design

### Log-based replication for Mnesia

Log-based replication is the most commonly used approach in distributed
databases.

Typically when strong consistency is required, database operations or
transactions will have to be serialized by an elected leader, which means all
nodes will have to delegate the operations through the leader.
The drawback of this approach is that the leader will easily become a bottleneck
when the cluster size grows.

For key-value stores, one way to solve it is to shard the database: in Riak
and Cassandra, for example, nodes form a hash ring and only manage keys hashed
to their ranges. The DB entrypoint may also not have to be the leader, or there
is simply no leader at all. As soon as this happens, the consistency is no
longer 'strong' and there is a need to resolve conflicts, e.g. when two clients
try to write the same key concurrently and hit two different nodes in the
cluster which do not sync with each other.

If the value is a primitive value set operation, typically last-write-wins is
good enough to resolve conflicts. If the writes update a small part of an
object, CRDTs come to the rescue.

While there is still a lack of full investigation into how much of the data
in EMQ X requires CRDTs to get away from ACID transactions, the two types
of data below seem to have a simple enough schema for last-write-wins.

* Routing tables `emqx_route`, `emqx_trie` and `emqx_trie_node`.
* Global channel registry table `emqx_channel_registry`.

After all, we use Mnesia dirty APIs to write some of the tables.
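
As a minimal illustration of the last-write-wins resolution mentioned above, the following Python sketch merges two replicas of a key-value table. The `(timestamp, node)` tag used to order writes and the name `lww_merge` are assumptions for the example; it does not describe how EMQ X resolves conflicts.

```python
def lww_merge(replica_a, replica_b):
    """Merge two replicas with last-write-wins semantics.

    Each replica maps key -> (tag, value), where tag = (timestamp, node)
    is an assumed ordering tag; the write with the larger tag wins.
    """
    merged = dict(replica_a)
    for key, (tag, value) in replica_b.items():
        if key not in merged or tag > merged[key][0]:
            merged[key] = (tag, value)
    return merged


# Two nodes accepted concurrent writes to the same key "k".
node1 = {"k": ((1, "n1"), "v1")}
node2 = {"k": ((2, "n2"), "v2"), "j": ((1, "n2"), "x")}

# The result does not depend on the merge direction.
same = lww_merge(node1, node2) == lww_merge(node2, node1)
```

The merge only picks a whole value per key, which is why it suits simple records but not writes that update a small part of an object.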

### Async-replication of Mnesia changes

TODO: check if dirty operations in a transaction trigger activity logging

* Log Mnesia changes in the Mnesia cluster

A pseudo implementation of the transaction layer:

```
transaction(Fun, Args) ->
    Fun2 = fun() ->
              ok = Fun(Args),
              %% get_mnesia_activity/0, generate_key/2, shards/0 and the other
              %% helpers are pseudocode, not existing APIs.
              Changes = get_mnesia_activity(),
              Key = generate_key(erlang:timestamp(), node()),
              %% Note: Real code should avoid traversing the Changes list multiple times:
              [ok = write_ops_to_another_table(Shard, Key, find_ops_for_shard(Changes, Shard)) || Shard <- shards()],
              ok
           end,
    {atomic, ok} = mnesia:transaction(Fun2).
```

Where `Changes` is essentially a list of table operations like:

```
[ {{TableName, Key}, Record, write},
  {{TableName, Key}, Record, delete}
]
```

Changes are pushed to the replicant nodes over `gen_rpc`.
Replicant nodes issue a `watch` call to one of the core nodes.
The core node creates an agent process that issues `gen_rpc` calls to the replicant nodes using data about transactions that were recorded to the rlog.
Once the replication is close to the end of the rlog, the agent process subscribes to the realtime stream of mnesia events for the rlog table, and starts feeding the replicant with a realtime stream of ops.
The time threshold to identify 'close to the end of rlog' should be configurable, and the realtime stream should start after (possibly with a bit of overlap) the agent reaches the last historical event.

Note that the `generate_key` function in the above pseudocode can affect the overall design of the system in a rather fundamental way, so let's consider some alternatives.

#### Alternative 1: using monotonic timestamp + node id for the key

This kind of key guarantees uniqueness, but not ordering.
It can be used to prevent lock conflicts while writing to the rlog table, but it MUST NOT be used to establish the order of events.
Consider the following situation: there are two core nodes `N1` and `N2`, and a replicant node `N3`.
Let's use `N1`'s clock as the reference.
Suppose `N2`'s clock drift is `Dt > 0`.
Consider the following timeline of events:

1. At `T1`, `N2` commits a transaction that sets the value of key `K` to `V1`, with the transaction key `{T1 + Dt, N2}`.
1. At `T2`, `N1` commits another transaction that sets the value of `K` to `V2`, with the transaction key `{T2, N1}`.
   Suppose `T1 < T2` and `T2 - T1 < Dt`.
1. `N3` replays the records in the rlog table, and it first replays the transaction `{T2, N1}`, then `{T1 + Dt, N2}`.
   This leads to inconsistency: the value of `K` on the core nodes equals `V2`, but on the replicant nodes it equals `V1`.

Therefore traversing the rlog table in the natural key order can lead to inconsistencies and must not be used.
The rlog table must be used for mnesia events only, and the actual contents of this table can be discarded.
In order to keep the historical transactions used to bootstrap the replicant, a separate storage is needed.
It could be a disk log that contains the rlog table events.
(TODO: How to deal with the gaps in the transaction log while the core node is down?)

#### Alternative 2: using a globally (or partially) ordered transaction key

A naive implementation of the global ordering key can be the following: there is a global server generating transaction keys.
The keys look like this: `{GenerationId, Counter}`.
For each shard, there is one core node that runs such a server.
This node is elected using a textbook distributed consensus algorithm.
Every time the shard key server restarts, the `GenerationId` counter increments.
`Counter` is an integer that is incremented every time anyone talks to the server, and resets to 0 when the server restarts.

The obvious downside of this naive implementation is the additional latency added to the transaction.
It's an open question whether or not this additional latency will hurt the throughput of the database in a significant way.

The benefit of this solution is that it allows using the contents of the `rlog` table in a meaningful way.

### Actors

#### RLOG Server

RLOG server is a `gen_server` process that runs on the core node.
It is responsible for the initial communication with the RLOG replica processes, and for spawning RLOG agent and RLOG bootstrapper processes.

#### RLOG Agent

RLOG agent is a `gen_statem` process that runs on the core node.
This process's lifetime is tied to the lifetime of the remote RLOG replica process.
It is responsible for pushing the transaction ops from the core node to the replicant node.
The agent operates in two modes: `catchup` where it reads transactions from the rlog, and `normal` where it forwards realtime mnesia events.
There should be a third transient state called `switchover` where the agent subscribes to the mnesia events, while still consuming from the rlog.
This is to ensure overlap between the stored transactions and the events.

![Agent FSM](0004-assets/agent-fsm.png)

#### RLOG Replica

RLOG replica is a `gen_statem` process that runs on the replicant node.
It is spawned during the node startup under the `rlog` supervisor, and is restarted indefinitely.
It talks to the RLOG server in its `init` callback, and establishes a connection to the remote RLOG agent process.
In some cases it also creates a bootstrap client process and manages it.
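
A minimal Python sketch of the Replica's state transitions, following the states named in `0004-assets/replicant-fsm.uml` (`bootstrap`, `local_replay`, `normal`). The function names, the boolean `checkpoint_compatible` argument and the event name `local_rlog_end` are illustrative assumptions; the real process is a `gen_statem` written in Erlang.

```python
# (current state, event) -> next state, as drawn in the replicant FSM diagram.
REPLICA_TRANSITIONS = {
    ("bootstrap", "bootstrap_complete"): "local_replay",
    # "local_rlog_end" is a hypothetical name for
    # "reached the end of the local rlog".
    ("local_replay", "local_rlog_end"): "normal",
}


def initial_state(checkpoint_compatible):
    """Pick the start state from the local checkpoint: bootstrap is needed
    when the checkpoint is missing or too old."""
    return "normal" if checkpoint_compatible else "bootstrap"


def on_event(state, event):
    """Return the next state; unrelated events leave the state unchanged."""
    return REPLICA_TRANSITIONS.get((state, event), state)


# A replica with no usable checkpoint walks through all three states.
state = initial_state(checkpoint_compatible=False)
state = on_event(state, "bootstrap_complete")
state = on_event(state, "local_rlog_end")
```

A replica whose checkpoint is still compatible skips both transient states and starts directly in `normal`.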

![Replicant FSM](0004-assets/replicant-fsm.png)

Full process of shard replication:

![Replication MSC](0004-assets/replication-msc.png)

#### RLOG bootstrapper (client/server)

RLOG bootstrapper is a temporary `gen_server` process that runs on both core and replicant nodes during replica initialization.
The RLOG bootstrapper server runs the `mnesia:dirty_all_keys` operation on the tables within the shard, and then iterates through the cached keys.
For each table and key pair it performs a `mnesia:dirty_read` operation and caches the result.
If the value for the key is missing, the record is ignored.
Records are sent to the remote bootstrapper client process in batches.
The bootstrapper client applies batches to the local table replica using dirty operations.

### Bootstrapping Empty Nodes

The transaction logs should have a limited retention, so it is impossible to keep all the changes from the very beginning.

An empty node will have to fetch all the records from transaction log before applying the real-time change logs.

Bootstrapping can be done using dirty operations.
Transaction log has an interesting property: replaying it can heal a partially corrupted replica.
Transaction log replay can fix missing or reordered updates and deletes, as long as the replica has been consistent prior to the first replayed transaction.
This healing property of the TLOG can be used to bootstrap the replica using only dirty operations. (TODO: prove it)
One downside of this approach is that the replica contains subtle inconsistencies during the replay, and cannot be used until the replay process finishes.
It should be mandatory to shut down business applications while bootstrap and syncing are going on.
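
The healing property described above can be illustrated on a toy example. This is a Python sketch under simplifying assumptions (a single table, ops that fully overwrite or delete a key, and a log that starts at the moment the bootstrap begins); the names `apply_op` and `replay` are hypothetical. It shows one case and is not a proof of the general property.

```python
def apply_op(table, op):
    """Apply one logged operation: ("write", key, value) or ("delete", key, None)."""
    kind, key, value = op
    if kind == "write":
        table[key] = value
    else:
        table.pop(key, None)


def replay(table, log):
    """Replay the whole log, in order, on top of the given table."""
    for op in log:
        apply_op(table, op)
    return table


# State of the core node when the bootstrap starts, and the ops that are
# committed while the bootstrap is running.
core = {"a": 1, "b": 2, "c": 3}
log = [("write", "a", 10), ("delete", "b", None), ("write", "d", 4)]

# A dirty bootstrap copies keys one by one while the ops are being applied,
# so the copy mixes old and new values.
replica = {}
replica["a"] = core["a"]      # copied before the first op: stale value 1
apply_op(core, log[0])
apply_op(core, log[1])        # "b" is gone by the time it would be read
replica["c"] = core["c"]
apply_op(core, log[2])        # "d" appears after the key list was taken

# Replaying the log from the start of the bootstrap repairs the copy.
replay(replica, log)
print(replica == core)
```

Before the replay the copy holds a stale `a` and lacks `d`, which matches the note above that the replica cannot be used until the replay finishes.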

### Zombie fencing in push model

In push model, a replicant node should make sure *not* to ingest transaction pushes from a stale core node which may have a zombie agent lingering around.
I.e., a replicant node should 'remember' which node it is watching, and upon receiving transactions from an unknown node,
it should reply with a rejection error message for the push calls.

With this implementation, there should not be a need for the core nodes to coordinate with each other.

### Preventing infinite bootstrap / catchup loop

There is a problem of catchup never completing when bootstrap takes longer than the rlog retention time.
In order to work around this problem, the replicant node shall start consuming transactions from the rlog in parallel with bootstrapping.

## Configuration Changes

Two new configuration entries need to be added to `emqx.conf`:

1. `node_role`: enum [`core`, `replicant`]
2. `core_nodes`: a list of core nodes for a `replicant` node to 'watch'
   and from which transaction logs are fetched.

## Backwards Compatibility

A `replicant` node should never originate data `write`s and `delete`s.
Because the nodes are still all clustered using Erlang distribution,
some of the `rpc`s (such as `cluster_call`) should not be made
towards the replicant nodes if they are intended for writes.

## Document Changes

1. New clustering setup guide
2. Update configuration doc for new config entries

## Testing Suggestions

1. Regression: clustering test in github actions.
1. Functionality: generate data operations (write and delete),
   apply operations and compare data integrity between core and replicant nodes
1. Performance: benchmark throughput and latency
1. Regression: test clock skews.

   1. Create a cluster of two core nodes (A and B) and a replicant node C.
   1. Set time to the future on one of the core nodes, say A
   1. Restart the replicant node, make sure A node detects that first and removes the routes to C
   1. Immediately connect some clients to the replicant node
   1. Check that the replicant node didn't lose its own routes after replaying the transactions from the rlog

## Declined Alternatives

* `riak_core` was the original proposal. It was declined because the change is
  considered too radical for the next release. We may revisit it in the future.

* Bootstrapping replicant nodes using mnesia checkpoints is the easiest option that guarantees consistent results.
  Mnesia checkpoint is a standard feature that is used, for example, to perform backups.
  A core node can activate a local checkpoint containing all the tables needed for the shard, and then iterate through it during the bootstrap process.
  Records from the mnesia checkpoint can be sent in batches using the same protocol as the online replication.

  This solution has downsides, though:

  + Checkpoints take a non-trivial amount of resources and may slow down Mnesia: in order to make the checkpoint consistent, mnesia spawns a retainer process and installs a hook before the transaction commits.
    This hook forwards values of the records to the retainer process, before they are overwritten.
    The retainer process saves the values of the old records in a separate table.
    Given that the snapshot is going to be updated with all the recent transactions as soon as bootstrapping completes, this is excessive.

  + Checkpoint API in mnesia is designed in imperative style, and lifetime management of the checkpoints can be nontrivial, considering that core nodes can restart, replicant nodes can reconnect, and so on.

* Pull-based transaction replication model, where agent processes don't subscribe to mnesia events, but replicant nodes periodically poll for new records in the transaction log on the core nodes.

--------------------------------------------------------------------------------
/implemented/0007-rocksdb-mnesia-backend.md:
--------------------------------------------------------------------------------
## Rocksdb Mnesia backend for EMQ X v5.0

```
Author: Zaiming Shi
Status: Draft
Type: Design
Created: 2020-10-27
EMQ X Version: 5.0
Post-History:
```

## Abstract

Add [mnesia_rocksdb](https://github.com/aeternity/mnesia_rocksdb) as a mnesia
backend, so that we can get away from troubles created by dets.

--------------------------------------------------------------------------------
/implemented/0010-improved-monitoring.md:
--------------------------------------------------------------------------------
# Improve performance monitoring

## Change log

* 2021-03-03: @k32 Initial draft

## Abstract

Integrate a modified version of [system_monitor](https://github.com/klarna-incubator/system_monitor/) into EMQX to collect `process_info` data in the background.

## Motivation

Investigation of performance bottlenecks can be greatly simplified by utilizing BEAM VM introspection functions, such as `processes` and `process_info`.
Well-known libraries like `recon` and `observer` make use of this data. However, these libraries don't collect historical data.

Historical data about Erlang processes is of special interest during analysis of bottlenecks.
Also, sometimes the designer needs to investigate a one-time, non-reproducible event.
The `system_monitor` application runs in the background all the time, collecting information about the activities in the BEAM VM.
19 | Therefore it has a better chance of capturing the relevant data. 20 | 21 | ## Design 22 | 23 | Currently, `system_monitor` is designed to publish the telemetry to Kafka, which is not suitable for EMQX. 24 | This design can be simplified, and the telemetry data should be written to the local log files managed by OTP kernel `logger` instead. 25 | "Abnormal node state" detection can be incorporated into `system_monitor` to reduce the size of the log files: `system_monitor` should only log data when BEAM schedulers are saturated. 26 | 27 | Example log entry format: 28 | 29 | ``` 30 | 31 | [#{app_memory => [{unknown,3507672},{system_monitor,2093320}], %% List of top OTP applications by memory consumption 32 | app_top => %% List of top OTP applications by reduction consumption 33 | [{system_monitor,0.9504843084075939}, 34 | {unknown,0.04951569159240604}], 35 | proc_top => %% List of top N erlang processes with the largest memory, reduction or mailbox size: 36 | {{1614,779722,299981}, 37 | [#erl_top{pid = "<0.10.0>",dreductions = 1.4991432396385467, 38 | dmemory = 0.0,reductions = 4817775,memory = 1115180, 39 | message_queue_len = 0, 40 | current_function = {erl_prim_loader,loop,3}, 41 | initial_call = {erlang,apply,2}, 42 | registered_name = erl_prim_loader,stack_size = 7, 43 | heap_size = 17731,total_heap_size = 139267, 44 | current_stacktrace = [{erl_prim_loader,loop,3,[]}], 45 | group_leader = "<0.0.0>"}, 46 | #erl_top{pid = "<0.44.0>",dreductions = 1.4991432396385467, 47 | dmemory = 0.0,reductions = 100455,memory = 460260, 48 | message_queue_len = 0, 49 | current_function = {gen_server,loop,7}, 50 | initial_call = {erlang,apply,2}, 51 | registered_name = application_controller,stack_size = 8, 52 | heap_size = 10958,total_heap_size = 57380, 53 | current_stacktrace = [{gen_server,loop,7, 54 | [{file,"gen_server.erl"},{line,437}]}], 55 | group_leader = "<0.152.0>"}, 56 | #erl_top{pid = "<0.50.0>",dreductions = 1.4991432396385467, 57 | dmemory = 0.0,reductions = 
169004,memory = 142796, 58 | message_queue_len = 0, 59 | current_function = {code_server,loop,1}, 60 | initial_call = {erlang,apply,2}, 61 | registered_name = code_server,stack_size = 5, 62 | heap_size = 6772,total_heap_size = 17730, 63 | current_stacktrace = [{code_server,loop,1, 64 | [{file,"code_server.erl"},{line,151}]}], 65 | group_leader = "<0.152.0>"}, 66 | #erl_top{pid = "<0.151.0>",dreductions = 202.3843373512038, 67 | dmemory = 0.0,reductions = 15929,memory = 26612, 68 | message_queue_len = 0, 69 | current_function = {user_drv,server_loop,6}, 70 | initial_call = {user_drv,server,2}, 71 | registered_name = user_drv,stack_size = 10,heap_size = 2586, 72 | total_heap_size = 3196, 73 | current_stacktrace = [{user_drv,server_loop,6, 74 | [{file,"user_drv.erl"},{line,191}]}], 75 | group_leader = "<0.152.0>"}, 76 | ... 77 | ``` 78 | 79 | Alternatively, logs can be written in a binary form, to save space. 80 | Also different kinds of messages can be written to different log files instead of a single one. 81 | 82 | Finally, `system_monitor` should be added as a release app to the EMQX relx configuration. 83 | 84 | ## Configuration Changes 85 | 86 | TBD. The following parameters might be configurable: 87 | 88 | - "Abnormal load" threshold 89 | 90 | - Frequency of data collection 91 | 92 | - Log retention parameters 93 | 94 | ## Backwards Compatibility 95 | 96 | This change is backward-compatible 97 | 98 | ## Document Changes 99 | 100 | Contents of the new logs should be documented. 101 | 102 | ## Testing Suggestions 103 | 104 | `system_monitor` has unit tests. 105 | 106 | ## Declined Alternatives 107 | -------------------------------------------------------------------------------- /implemented/0010-runq-based-overload-protection.md: -------------------------------------------------------------------------------- 1 | # Run queue based overload protection. 
2 | 3 | ## Changelog 4 | * 2021-03-11: @qzhuyan Initial Draft 5 | * 2021-04-08: @qzhuyan Don't hibernate process when overloaded. 6 | 7 | ## Abstract 8 | 9 | Run queue based overload protection for EMQX. 10 | 11 | EMQX needs some mechanism to cool down the node when the node is overloaded. 12 | 13 | Runq (Erlang VM run queue) is a performance-critical metric. 14 | It is a sign of overload when the runq length stays greater than the number of schedulers for a long period. 15 | 16 | This document describes a mechanism by which EMQX can cool itself down by monitoring the runq. 17 | 18 | ## Motivation 19 | 20 | Some users report that EMQX runs at 100% CPU load while an etop trace shows the runq of a 2-core node staying around 100 21 | for more than 5 minutes, until the user steps in and restarts the nodes. 22 | 23 | Ideally, the runq should not be greater than the number of schedulers. The runq can grow 24 | for a short period of time to handle peak traffic but should not stay high for longer than 1 minute. 25 | 26 | Erlang is a soft real-time system: a long run queue can prevent time-critical processes from being scheduled on time, which may 27 | lead to timeouts in upper-layer protocols. High CPU usage may also prevent users from logging in for O&M operations and leave nodes in a zombie state. 28 | 29 | EMQX needs some mechanism to cool itself down before it runs into the situation above. 30 | 31 | Most importantly, operators need to be notified that the system is overloaded so that they can take action, such as scaling up the node or cluster. 32 | 33 | ## Design 34 | 35 | ### Run queue monitoring and flagging of long runq 36 | 37 | An isolated process runs under the top supervisor with high scheduling priority and polls the runq value every 3 seconds. 38 | 39 | If the runq is longer than 10x (number of online schedulers) 40 | for the last 5 polls, it should: 41 | 42 | 1. Raise the overload flag by registering a process name such as '__long_runq__' 43 | 44 | 1. 
Monitoring systems should be notified, for example through logging and metrics counter updates. 45 | 46 | 1. An alert should be triggered if the flag is not cleared after [X timer] minutes. 47 | 48 | ### When node needs cooldown 49 | 50 | Each application of EMQX should decide for itself how to deal with the overload. 51 | 52 | All performance-critical code paths should check the overload flag and react to it. 53 | 54 | Suggested actions: 55 | 56 | - Reject new connections to keep the existing connections alive. 57 | 58 | consequences: 59 | 60 | New clients would retry and may get redirected by the load balancer to other nodes in the cluster. 61 | 62 | - Defer processing of non-real-time requests such as Web UI requests. 63 | 64 | consequences: 65 | 66 | The Web UI would be slow to respond and operators would have to wait. 67 | 68 | - Stop spawning more parallel workers. 69 | 70 | consequences: 71 | might hurt latency. 72 | 73 | - Do not trigger active GC. 74 | 75 | - [Unclear] Redirect requests to other nodes. 76 | 77 | It is unclear whether the protocol level has a spec for this. 78 | 79 | - Update backend status for polling from load balancer. 80 | 81 | consequences: 82 | 83 | The load balancer could lower the weight of this node in the pool 84 | or even remove the node from the pool for new connections. 85 | 86 | - Different QoS messages would be handled differently. 87 | 88 | For example, drop low priority messages. 89 | 90 | - Stop hibernating processes. 91 | 92 | gen_server processes should prefer not to enter the hibernate state. 93 | 94 | ### When runq is back to normal. 95 | 96 | '__long_runq__' should be unregistered. 97 | 98 | 99 | ## Configuration Changes 100 | 101 | 1. disable/enable overload protection 102 | 103 | 1. The [timer X] 104 | 105 | 1. Configuration for overload handling in different APPs. 106 | 107 | ## Backwards Compatibility 108 | 109 | N/A 110 | 111 | ## Document Changes 112 | 113 | 1. Description of overload protection. 114 | 115 | 1. Operation 116 | 117 | 1. 
Monitoring and Alerting 118 | 119 | ## Testing Suggestions 120 | 121 | Apart from regular performance tests, overload tests should be performed with at least 5X of regular rate. 122 | 123 | ## Declined Alternatives 124 | 125 | N/A 126 | -------------------------------------------------------------------------------- /implemented/0011-assets/current-implementation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0011-assets/current-implementation.png -------------------------------------------------------------------------------- /implemented/0011-assets/current-implementation.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | 3 | participant Client as client 4 | participant "Connection\nProcess" as connection 5 | participant emqx_cm_locker as locker 6 | box "Mnesia replication\nin the cluster" 7 | database Mnesia as mnesia 8 | end box 9 | box "Possibly Other Node" 10 | participant "emqx_cm_locker" as other_locker 11 | participant "Connection" as other_connection 12 | end box 13 | 14 | client -> connection : connect 15 | group emqx_cm : open_session 16 | connection -> locker : lock session id 17 | locker <-> other_locker : lock session id 18 | connection <-> mnesia : lookup session id 19 | group emqx_cm : takeover_session 20 | connection -> other_connection : begin takeover 21 | connection <- other_connection : "#session{}" 22 | rnote over other_connection 23 | Buffer 24 | messages 25 | endrnote 26 | rnote over connection 27 | Take over 28 | subscriptions. 
29 | endrnote 30 | connection -> other_connection : end takeover 31 | other_connection -> connection : all pending msgs 32 | end 33 | rnote over other_connection 34 | Unsubscribe 35 | and terminate 36 | endrnote 37 | connection -> mnesia : register channel\n(emqx_cm_registry) 38 | connection -> locker : release lock 39 | locker <-> other_locker : release lock 40 | end 41 | 42 | client <- connection : connack 43 | client <- connection : all pending msgs 44 | 45 | @enduml 46 | 47 | -------------------------------------------------------------------------------- /implemented/0011-assets/flows.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0011-assets/flows.png -------------------------------------------------------------------------------- /implemented/0011-assets/flows.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | 3 | actor Subscriber as sub 4 | participant Connection as subcon 5 | participant Broker as broker 6 | box "Mnesia replication\nin the cluster" 7 | database Mnesia as mnesia 8 | end box 9 | box "Possibly other node" 10 | participant Writer as writer 11 | participant Connection as pubcon 12 | end box 13 | actor Publisher as pub 14 | 15 | == Clean start == 16 | sub -> subcon : Connect\n(Session-Expiry > 0,\nClean-Start: 1) 17 | group ClientID lock aquired (in cluster) 18 | subcon -> mnesia : Register (fresh) sessionID\nbased on clientID 19 | subcon -> mnesia : Clean start\n(discard old sessionID) 20 | end 21 | 22 | == Subscribe == 23 | sub -> subcon : Subscribe 24 | subcon -> broker : Subscribe 25 | group Cluster-global transaction 26 | broker -> mnesia : Store topic filter in trie and\nsession table 27 | end group 28 | subcon -> sub : suback 29 | 30 | == Publish == 31 | pub -> pubcon : publish 32 | pubcon -> mnesia : lookup topic in trie for\npersistent session 33 | 
pubcon -> mnesia : persist message 34 | pubcon -> writer : message 35 | writer -> mnesia : lookup sessionID 36 | writer -> mnesia : persist session\nmessage details 37 | rnote over pubcon 38 | Still responsive 39 | endrnote 40 | writer -> subcon : message (RPC or direct send) 41 | subcon -> sub : message 42 | writer -> pubcon : ack 43 | pubcon -> pub : puback 44 | subcon -> mnesia : mark as delivered 45 | 46 | 47 | == Persistent resume (connection gone) == 48 | sub -> subcon : Connect\n(Session-Expiry > 0,\nClean-Start: 0) 49 | group ClientID lock aquired (in cluster) 50 | subcon -> mnesia : Register under the same sessionID as before 51 | subcon -> mnesia : Get state 52 | end group 53 | group Recovery state machine 54 | subcon -> mnesia : Get all pending messages 55 | subcon -> sub : Pending messages 56 | subcon -> mnesia : Mark as delivered 57 | group Cluster-global transaction 58 | subcon -> broker : Subscribe to topics (transaction) 59 | end group 60 | group For all writers in parallel 61 | subcon -> writer : Sync marker 62 | rnote over subcon 63 | Drop all incoming messages from writer 64 | These messages will eventually come 65 | from the DB 66 | end rnote 67 | writer -> subcon : Sync marker 68 | group Wait (poll) for marker in DB (in pending messages) 69 | rnote over subcon 70 | Buffer messages from writer 71 | end rnote 72 | subcon -> mnesia : Get pending messages from writer 73 | subcon -> sub : pending messages 74 | subcon -> mnesia : Mark as delivered 75 | end group 76 | writer -> mnesia : Sync marker 77 | mnesia -> subcon : Sync marker 78 | subcon -> sub : buffered messages from writer 79 | subcon -> mnesia : Mark as delivered 80 | end group 81 | rnote over subcon 82 | Normal operations 83 | end rnote 84 | 85 | @enduml 86 | -------------------------------------------------------------------------------- /implemented/0011-assets/init-fsm.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0011-assets/init-fsm.png -------------------------------------------------------------------------------- /implemented/0011-assets/init-fsm.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | 3 | 4 | state init as "Init session" 5 | init: * Get pending messages from db 6 | init: * Subscribe to topics 7 | init: * Send pending messages to client 8 | init: * Send marker to writers 9 | init: * Discard messages from RTF if not marker 10 | 11 | state writers as "For all nodes" { 12 | state sync1 as "Sync DB flow (DBF) with writer" 13 | sync1: * Poll for pending messages in db 14 | sync1: * Send messages earlier than marker from db to client 15 | 16 | state sync2 as "Finalize sync with writer" 17 | sync2: * Send buffered messages to client 18 | 19 | [*] --> sync1 : Marker received from RTF 20 | sync1 --> sync2 : Marker received from DBF 21 | } 22 | 23 | init --> writers 24 | 25 | @enduml 26 | -------------------------------------------------------------------------------- /implemented/0011-jq-in-rule-engine.md: -------------------------------------------------------------------------------- 1 | # An Example of EMQ X Improvement Proposal 2 | 3 | ## Change log 4 | 5 | * 2020-03-12: @terry-xiaoyu first draft 6 | * 2020-03-21: @terry-xiaoyu add section for JQ NIF 7 | 8 | ## Abstract 9 | 10 | Introduce [JQ](https://stedolan.github.io/jq/) syntax to the SQL of emqx rule 11 | engine. 12 | 13 | ## Motivation 14 | 15 | The emqx rule engine now supports a subset of SQL syntax for creating rules. We 16 | also provide a set of of SQL functions for manipulating the data structures. 17 | 18 | A common use case is to transform JSON strings from MQTT messages. 
19 | The current solution is to first decode the JSON string to Erlang terms using 20 | the SQL function `json_decode/1`, assign the result to a temporary variable, and then 21 | do the transformations. As a handy way of decoding a JSON object and getting the 22 | value of a key, we could use the dot syntax like `payload.x`. 23 | 24 | The problem is that the SQL functions we provide are too limited to transform 25 | complex JSON strings. For instance, if we want to run a lambda for each 26 | element of an array, we need a `map/2` or `reduce/3` function, and an SQL 27 | syntax for defining a lambda to do the transformation. JQ has done great work 28 | on this, in a concise syntax. Because JQ is so widely used that it has become 29 | the de facto standard for processing JSON strings, it would be nice if we 30 | introduced it to the emqx rule engine. 31 | 32 | ## Design 33 | 34 | The SQL syntax of the emqx rule engine now supports a very limited set of operators 35 | to retrieve and set JSON values. 36 | 37 | For example, the following code snippet retrieves a nested value from an MQTT message 38 | payload: 39 | 40 | ``` 41 | SELECT payload.a.b[1].c 42 | Input: {"a": {"b": [{"c": 1}]}} 43 | Output: 1 44 | ``` 45 | 46 | And for updating a JSON object, we override the `AS` operator of SQL syntax. 47 | 48 | The following example updates the value with key `c` from `1` to `0`: 49 | 50 | ``` 51 | SELECT 0 as payload.a.b[1].c 52 | Input: {"a": {"b": [{"c": 1}]}} 53 | Output: {"a": {"b": [{"c": 0}]}} 54 | ``` 55 | 56 | This is straightforward but far from "enough" for processing JSON strings. 
57 | JQ provides a simple and clean syntax for the frequently used operations, e.g.: 58 | 59 | ``` 60 | jq '[.user, .projects[]]' 61 | Input: {"user":"stedolan", "projects": ["jq", "wikiflow"]} 62 | Output: ["stedolan", "jq", "wikiflow"] 63 | ``` 64 | 65 | Where `[]` is a "collector" for collecting multiple elements into a single array, 66 | and the operator `.projects[]` gets all of the values from the array `projects` 67 | as multiple results. This would require several lines of code if we don't use 68 | JQ: first we need to get the values of "user" and "projects" and then flatten the 69 | results into a single list. 70 | 71 | And here's a more complex example that reduces an array to accumulate the numbers: 72 | 73 | ``` 74 | jq 'reduce .[] as $item ([]; . + [$item.a]) | {"acc": .}' 75 | Input: [{"a": 1}, {"a": 2}, {"a": 3}] 76 | Output: {"acc": [1,2,3]} 77 | ``` 78 | 79 | ### Timeout 80 | 81 | Having a timeout when executing JQ code in the rule engine is important because 82 | JQ programs can potentially execute forever (JQ is a Turing complete 83 | programming language that allows recursive functions). JQ programs that execute 84 | forever or for a very long time are probably buggy and could cause performance 85 | and debugging problems. 86 | 87 | Additionally, if one allowed JQ programs to execute forever, one would need 88 | a way to terminate them, for example, if a user wants to manually terminate a JQ 89 | program. This could be tricky and time consuming to implement as one would need 90 | an interface to monitor JQ programs and terminate specific ones. 91 | 92 | ### Suggested JQ Syntax in Rule SQL 93 | 94 | We'd better use JQ along with the rule SQL, as it is common to use the output of 95 | an SQL function as the input of JQ, or to assign the output of JQ to an SQL 96 | variable and then SELECT it as part of the SQL result. 97 | 98 | One way is to use JQ as a normal SQL function, e.g.: 99 | 100 | ``` 101 | SELECT 102 | jq('reduce .[] as $item ([]; . 
+ [$item.a]) | {"acc": .}', 103 | payload) as result 104 | ``` 105 | 106 | The above suggestion has been added to the rule engine in the 5.0 release of 107 | EMQX. The second argument in the implementation can be a non-string value as in 108 | the example above. The second argument can also be a JSON value encoded as a 109 | string, in which case the function will automatically transform the argument to 110 | the encoded value before it is sent to the JQ program. An implicit timeout which 111 | can be configured with the `rule_engine.jq_function_default_timeout` setting is 112 | used to timeout the JQ function after a certain amount of time. A JQ function 113 | that takes three arguments has also been added in the 5.0 release. The third 114 | argument can be used to explicitly specify a timeout in milliseconds. 115 | 116 | To make the code cleaner, we could create an SQL keyword for JQ, this way 117 | we can remove the surrounding quotes out of the JQ filters: 118 | 119 | ``` 120 | SELECT 121 | JQ payload 122 | DO 123 | reduce .[] as $item ([]; . + [$item.a]) 124 | | {"acc": .} 125 | END 126 | ``` 127 | 128 | As JQ can read input from the environment variables, we could simplify it more 129 | by setting all the available SQL variables into JQ filters as ENVs: 130 | 131 | ``` 132 | SELECT 133 | JQ 134 | $ENV.payload 135 | | reduce .[] as $item ([]; . + [$item.a]) 136 | | {"acc": .} 137 | END 138 | ``` 139 | 140 | The `JQ` clause doesn't have to be used in the `SELECT` clause of SQL. If we put 141 | `JQ` clause before the `SELECT`, it would look even better: 142 | 143 | ``` 144 | JQ 145 | .payload 146 | | reduce .[] as $item ([]; . + [$item.a]) 147 | | {"acc": .} 148 | SELECT 149 | acc[1] as first 150 | ``` 151 | 152 | This way we use the output of the `JQ` as the input of `SELECT`. The above code 153 | snippet will output `{"first": 1}`. 154 | 155 | The only problem now is we can not utilize the existing SQL functions. 
156 | We can address that by simply putting the `JQ` clause after the `SELECT`: 157 | 158 | ``` 159 | SELECT 160 | decode(payload) as p 161 | JQ 162 | .p 163 | | reduce .[] as $item ([]; . + [$item.a]) 164 | | {"acc": .} 165 | | {first: (.acc|.[0])} 166 | ``` 167 | 168 | This example does exactly what the previous example does. The SELECT clause can 169 | only output a single map result, and the map will then be piped to the JQ clause. 170 | 171 | Another option is to support multiple languages for writing rule engine 172 | programs. One of those languages could be JQ and one could be the current SQL 173 | based language. With this suggestion the user would need to select which 174 | language to use. This could be done with a drop-down list in the GUI for 175 | adding rules. Here are some advantages of this suggestion: 176 | 177 | * It is future proof as it would be easy to add a third language in the future 178 | * The syntax of each individual language would be less cluttered as one would 179 | not need to combine the SQL based language with JQ 180 | * Implementation is simple (one does not need to extend the SQL based language) 181 | 182 | If this suggestion is chosen, one could also add a feature that would make it 183 | possible to pipe the output of one rule engine program to another. For example, 184 | the user might want to pipe the output of a JQ rule engine program to one 185 | written in the SQL based language. This could be useful, for example, if one 186 | wants to use the FOREACH statement to output multiple messages from a single 187 | message, or if one wants to use a function that exists in the SQL based language 188 | but not in JQ. 189 | 190 | ### Introduce JQ as Port, NIF or Compiled BEAM Code 191 | 192 | JQ is written in portable C as a single binary, taking the filter as a command line argument, 193 | reading input from stdin and writing results to stdout.
It supports Linux, OS X, 194 | FreeBSD, Solaris, and Windows for now, so the simplest way is to package the `jq` 195 | binary along with all of the emqx installation packages, and talk to `jq` using 196 | an [erlang port](http://erlang.org/doc/tutorial/c_port.html). Someone who 197 | is building emqx from the source code of the emqx repo can put the jq binary into 198 | the right path according to the configuration. The current Erlang JQ library 199 | (see Section "JQ NIF/Port Library") uses a long running port program that uses 200 | an LRU cache to cache compiled programs for increased efficiency. 201 | 202 | The second approach is a NIF, with the drawbacks of more changes to the code (compile 203 | the code to a dynamic C library rather than a single binary), and reduced safety (it 204 | brings down the entire erlang system on crash, and may hold up the erlang 205 | scheduler if it returns too late). But this way has the benefit of efficiency, 206 | and no independent `jq` binary is required. The NIF approach has also been 207 | implemented in the Erlang JQ library but it currently (2022-07-10) lacks support 208 | for timeouts (which is tricky to implement as it requires non-trivial 209 | modification to the main jq library). 210 | 211 | The third approach is to compile JQ programs directly to BEAM bytecode.
Here 212 | are some of the benefits one might get from doing this compared to the other 213 | approaches: 214 | 215 | * Speed - No context switching (between BEAM code and port or dirty NIF thread), 216 | and running JITed BEAM code will probably be faster than running interpreted 217 | JQ byte code 218 | * Fairness - BEAM code is run on a main Erlang scheduler and preempted in the 219 | same way as compiled Erlang code 220 | * Tools - This would play well with the Erlang VM's tools for tracing and 221 | performance measuring and so on 222 | * Safety - Running BEAM code is safer than a NIF as a NIF can crash the VM, 223 | cause hard to debug problems, and leak memory 224 | 225 | One could get some of the benefits mentioned above by making the NIF 226 | implementation yielding. However this would be a lot of work (even though some 227 | of the work could be automated with 228 | [YCF](https://github.com/kjellwinblad/yielding_c_fun)), and would 229 | make upgrades of the JQ library more difficult. 230 | 231 | Compiling JQ code to BEAM code would probably be quite straightforward. Both JQ 232 | and Erlang are functional languages. One option is to make use of the JQ 233 | compiler (which can be exposed to Erlang as a Port program) to transform JQ 234 | code to JQ bytecode, and then one only has to implement a transformation of JQ 235 | bytecode to BEAM bytecode. This is probably easier than transforming JQ code 236 | to BEAM bytecode without an intermediate step. 237 | 238 | ### JQ NIF/Port Library 239 | 240 | 241 | An Erlang JQ library has been created at "https://github.com/emqx/jq". The 242 | library interface supports both a NIF based backend and a port based backend. The 243 | user can configure which backend to use. At the time of writing (2022-07-10), 244 | the JQ function in the rule engine can only use the port based backend as this 245 | backend is currently the only one that supports timeouts.
There is a plan to 246 | also support timeouts in the NIF based backend. When that is possible, EMQX 247 | users should be given the option to configure which backend to use. Here are 248 | some examples that show how the library can be used: 249 | 250 | ``` 251 | rebar3 shell 252 | ... 253 | 254 | 1> jqerl:process_json(<<".a">>, <<"{\"b\": 1}">>). 255 | {ok,[<<"null">>]} 256 | 257 | 2> jqerl:process_json(<<".a">>, <<"{\"a\": {\"b\": {\"c\": 1}}}">>). 258 | {ok,[<<"{\"b\":{\"c\":1}}">>]} 259 | 260 | 3> jqerl:process_json(<<".a|.[]">>, <<"{\"a\": [1,2,3]}">>). 261 | {ok,[<<"1">>,<<"2">>,<<"3">>]} 262 | ``` 263 | 264 | If everything is OK, the API `jqerl:parse/2` always returns a list as the second 265 | element of the tuple, because jq may have multiple outputs. 266 | 267 | If there's an error in the jq filter or the JSON string, `{error, Reason}` will 268 | be returned: 269 | 270 | ``` 271 | 1> jqerl:parse(<<".a">>, <<"{\"a\": ">>). 272 | {error,{jq_err_parse,<<"Unfinished JSON term at EOF at line 1, column 6 (while parsing '{\"a\": ')">>}} 273 | ``` 274 | 275 | ## Configuration Changes 276 | 277 | The `jq/2` function that was introduced in the EMQX 5.0 release reads the 278 | configuration setting `rule_engine.jq_function_default_timeout` to get the 279 | default timeout in milliseconds. We may also introduce a setting for 280 | configuring which backend to use (the port based one or the NIF based one) when 281 | we have implemented the timeout feature in the NIF backend of the Erlang JQ 282 | library. 283 | 284 | ## Backwards Compatibility 285 | 286 | There are no backward compatibility problems. 287 | 288 | ## Document Changes 289 | 290 | JQ can be used with the following syntax: 291 | 292 | ``` 293 | SELECT 294 | * 295 | JQ 296 | {u: .username, c: .clientid} 297 | FROM 298 | "t/1" 299 | ``` 300 | 301 | This outputs `{"u": "Shawn", "c": "00001"}` if the user "Shawn" with client-id 302 | "00001" publishes a message to "t/1".
303 | 304 | You can learn more about JQ [here](https://stedolan.github.io/jq/). 305 | 306 | ## Testing Suggestions 307 | 308 | Benchmarking for processing large JSON strings using the new JQ syntax is 309 | required. 310 | 311 | ## Declined Alternatives 312 | 313 | Another way to process JSON strings in SQL is to provide more SQL functions or 314 | SQL keywords, just like how the jq does. But this would be too complicated and 315 | the syntax we created is hard to beat the JQ. The idea is that it's better to 316 | use the well-tested library to do the work, rather than spend time reinventing 317 | the wheels. 318 | 319 | ## Status 2022-06-09: JQ Function Added to the Rule Engine in the EMQX 5.0 Release 320 | 321 | 322 | In the EMQX 5.0 release, we have introduced JQ functions to EMQX's SQL based 323 | rule engine language. This is the first suggestion discussed in the "Suggested 324 | JQ Syntax in Rule SQL" section above. Please read the first part of the 325 | "Suggested JQ Syntax in Rule SQL" Section for more details about the added 326 | functions. 327 | 328 | This way of introducing JQ to the rule engine was chosen as it makes it possible 329 | to use JQ in the rule engine without invalidating any of the other suggestions 330 | for extending the syntax of the rule engine. 331 | -------------------------------------------------------------------------------- /implemented/0012-cluster-call.md: -------------------------------------------------------------------------------- 1 | # Cluster call transaction 2 | 3 | ## Changelog 4 | 5 | * 2021-08-11: @zhongwencool Initial draft. 6 | * 2021-08-30: @zhongwencool add "How to confirm that the commit has succeeded?" section. 7 | 8 | ## Abstract 9 | 10 | When EMQX updates the cluster resources via HTTP API, it first updates the local node resources, and then updates all other nodes via RPC Multi Call to ensure the consistency of resources (configuration) in the cluster. 
11 | 12 | **In order to ensure consistency, it must ensure that the updates will eventually be applied on all nodes in the cluster**. 13 | 14 | ## Motivation 15 | 16 | The current solution is to first update the resources of the local node, and then make synchronous RPC calls to update the resources of the other nodes. 17 | 18 | Resource updates may be lost during the RPC call: 19 | 20 | - If there is a network disturbance during RPC, it may cause the RPC to fail. 21 | - If a remote operation to update a resource fails, there is no retry or any other remedy, causing inconsistent configuration in the cluster. 22 | - If multiple updates are performed concurrently, it may happen that node 1 performs updates in order 1, 2, 3, but node 2 updates in order 1, 3, 2. There is no order guarantee. 23 | - Lack of replay. If a node is down for a while, there is no replay of historical events to catch up with the changes that happened during the downtime. 24 | 25 | ## Design 26 | 27 | This proposal uses mnesia to record the execution status of each MFA, to ensure the eventual consistency of resources & data in the cluster. 28 | 29 | This proposal is not applicable to high frequency request calls, because all updates are performed in strict order. 30 | 31 | ### mnesia table structure 32 | 33 | ```erlang 34 | -record(cluster_rpc_mfa, {tnx_id :: pos_integer(), mfa :: mfa(), created_at :: calendar:datetime(), initiator :: node()}). 35 | -record(cluster_rpc_commit, {node :: node(), tnx_id :: pos_integer()}). 
36 | mnesia(boot) -> 37 | ok = ekka_mnesia:create_table(?CLUSTER_MFA, [ 38 | {type, ordered_set}, 39 | {disc_copies, [node()]}, 40 | {rlog_shard, ?COMMON_SHARD}, 41 | {record_name, cluster_rpc_mfa}, 42 | {attributes, record_info(fields, cluster_rpc_mfa)}]), 43 | ok = ekka_mnesia:create_table(?CLUSTER_COMMIT, [ 44 | {type, set}, 45 | {disc_copies, [node()]}, 46 | {rlog_shard, ?COMMON_SHARD}, 47 | {record_name, cluster_rpc_commit}, 48 | {attributes, record_info(fields, cluster_rpc_commit)}]); 49 | mnesia(copy) -> 50 | ok = ekka_mnesia:copy_table(?CLUSTER_MFA, disc_copies), 51 | ok = ekka_mnesia:copy_table(?CLUSTER_COMMIT, disc_copies). 52 | ``` 53 | 54 | - `tnx_id` increases strictly by 1, and all transactions must be executed in strict order. For example, if node 1 has executed transactions 1, 2, 3 but node 2 keeps failing on transaction 2 after executing transaction 1, node 2 keeps retrying transaction 2 until it succeeds, and only then executes transaction 3. 55 | 56 | - `cluster_rpc_commit`: Records the maximum `tnx_id` executed by the node. All transactions with an id up to and including this one have been executed successfully on this node. 57 | - `cluster_rpc_mfa` (`ordered_set`): Records the MFA for each `tnx_id`. The latest 100 completed records are kept for querying and troubleshooting. 58 | 59 | ### Flow 60 | 61 | 1. `emqx_cluster_rpc` registers as a `gen_server` on each node, subscribes to simple events of the mnesia table, and is responsible for the execution of all transactions. 62 | 63 | 2. On init, the `handler` catches up to the latest `tnx_id`: if the node's `tnx_id` is 5 but the latest MFA's `tnx_id` is 10, it runs MFAs 6 to 10 in five separate transactions. 64 | 65 | ```erlang 66 | init([Node, RetryMs]) -> 67 | {ok, _} = mnesia:subscribe({table, ?CLUSTER_MFA, simple}), 68 | {ok, #{node => Node, retry_interval => RetryMs}, {continue, ?CATCH_UP}}. 69 | 70 | handle_continue(?CATCH_UP, State) -> 71 | {noreply, State, catch_up(State)}. 
72 | 73 | catch_up(#{node := Node, retry_interval := RetryMs} = State) -> 74 | case transaction(fun get_next_mfa/1, [Node]) of 75 | {atomic, caught_up} -> ?TIMEOUT; 76 | {atomic, {still_lagging, NextId, MFA}} -> 77 | case apply_mfa(NextId, MFA) of 78 | ok -> 79 | case transaction(fun commit/2, [Node, NextId]) of 80 | {atomic, ok} -> catch_up(State); 81 | Error -> 82 | ?SLOG(error, #{ 83 | msg => "mnesia write transaction failed", 84 | node => Node, 85 | nextId => NextId, 86 | error => Error}), 87 | RetryMs 88 | end; 89 | _Error -> RetryMs 90 | end; 91 | {aborted, Reason} -> 92 | ?SLOG(error, #{ 93 | msg => "get_next_mfa transaction failed", 94 | node => Node, error => Reason}), 95 | RetryMs 96 | end. 97 | get_next_mfa(Node) -> 98 | NextId = 99 | case mnesia:wread({?CLUSTER_COMMIT, Node}) of 100 | [] -> 101 | LatestId = get_latest_id(), 102 | TnxId = max(LatestId - 1, 0), 103 | commit(Node, TnxId), 104 | ?SLOG(notice, #{ 105 | msg => "New node first catch up and start commit.", 106 | node => Node, tnx_id => TnxId}), 107 | TnxId; 108 | [#cluster_rpc_commit{tnx_id = LastAppliedID}] -> LastAppliedID + 1 109 | end, 110 | case mnesia:read(?CLUSTER_MFA, NextId) of 111 | [] -> caught_up; 112 | [#cluster_rpc_mfa{mfa = MFA}] -> {still_lagging, NextId, MFA} 113 | end. 114 | 115 | get_latest_id() -> 116 | case mnesia:last(?CLUSTER_MFA) of 117 | '$end_of_table' -> 0; 118 | Id -> Id 119 | end. 120 | ``` 121 | 122 | 3. If a new update operation is added, the `handler` receives a write event for the `cluster_rpc_mfa` table. 123 | It then runs the "read the next record" -> "execute action" -> "commit" loop. Each iteration is triggered by a mnesia event, and the content of the event can be ignored. 124 | 125 | ```erlang 126 | handle_info({mnesia_table_event, _}, State) -> 127 | {noreply, State, catch_up(State)}. 128 | ``` 129 | 130 | 4. The initial transaction must be executed in the `emqx_cluster_rpc` process.
If this transaction succeeds, the call returns success directly. If the transaction fails, the call aborts with failure. 131 | 132 | ```erlang 133 | handle_call({initiate, MFA}, _From, State = #{node := Node}) -> 134 | case transaction(fun init_mfa/2, [Node, MFA]) of 135 | {atomic, {ok, TnxId}} -> 136 | {reply, {ok, TnxId}, State, {continue, ?CATCH_UP}}; 137 | {aborted, Reason} -> 138 | {reply, {error, Reason}, State, {continue, ?CATCH_UP}} 139 | end. 140 | 141 | init_mfa(Node, MFA) -> 142 | mnesia:write_lock_table(?CLUSTER_MFA), 143 | LatestId = get_latest_id(), 144 | ok = do_catch_up_in_one_trans(LatestId, Node), 145 | TnxId = LatestId + 1, 146 | mnesia:write(#cluster_rpc_commit{node = Node, tnx_id = TnxId}), 147 | mnesia:write(#cluster_rpc_mfa{tnx_id = TnxId, mfa = MFA, initiator = Node, created_at = erlang:localtime()}), 148 | ok = apply_mfa(MFA), {ok, TnxId}. %% return the id that handle_call/3 matches on 149 | 150 | do_catch_up_in_one_trans(LatestId, Node) -> 151 | case do_catch_up(LatestId, Node) of 152 | caught_up -> ok; 153 | ok -> do_catch_up_in_one_trans(LatestId, Node); 154 | _ -> mnesia:abort("catch up failed") 155 | end. 156 | ``` 157 | 158 | **Risk point**: If previously unfinished MFAs are executed successfully inside the `init_mfa` transaction but the latest MFA fails and aborts the transaction, the commits for those earlier MFAs are rolled back as well, so they will be executed again later. Therefore, MFAs must be idempotent. 159 | 160 | Suppose nodes A and B have both completed transaction 4 and then race to create transaction 5. A wins and commits transaction 5, so B can only create transaction 6. If the event for transaction 5 has not yet reached node B when B creates transaction 6, B would execute transaction 6 before transaction 5. Therefore, the transaction that creates 6 must first apply any transactions that are still incomplete on the node (`do_catch_up_in_one_trans`).
161 | 162 | As an alternative to this solution, `init_mfa` can check whether there are still transactions that have not been applied and, if so, return an error directly. 163 | 164 | 5. Only the latest 100 completed records are kept for querying and troubleshooting. 165 | 166 | 6. The MFA function must return `ok|{ok, any()}` if it is executed successfully. Any other result marks it as failed, and it is retried later. 167 | 168 | ```erlang 169 | apply_mfa({M, F, A}) -> 170 | try erlang:apply(M, F, A) 171 | catch E:R -> {error, {E, R}} 172 | end. 173 | ``` 174 | 175 | 7. How to confirm that the commit has succeeded? 176 | 177 | Callers differ in how they define success: a strict requirement is that all nodes succeed before success is returned, while a more lenient one is that only the call on the local node succeeds. Therefore, the API takes a `SucceedNum` parameter (1 to `all`) that specifies how many nodes must succeed before the caller is told that the call succeeded. If fewer than the specified number of nodes succeed within the specified time, the nodes that have not succeeded are returned. 178 | 179 | PS: If set to `all`, the call will never succeed while any existing node is offline. 180 | 181 | Implementation details: after the initiator applies the MFA successfully, the result is not returned to the caller immediately. Instead, the process periodically checks whether the number of successful nodes in the mfa table meets the requirement and returns success once the condition is met. If the requirement is not met before the timeout, the failed nodes are returned. 182 | 183 | Alternative: add a `from` field (the caller PID) to the MFA table to identify the caller. After the MFA has been executed successfully, a success message is sent to that PID.
This alternative has one drawback: since it is impossible to monitor other processes, with 3 nodes the caller receives 3 success messages, but if the required number of successes is 2, one message is left in the mailbox with nothing to handle it. 184 | 185 | 8. What can be done when a transaction keeps failing? 186 | If a node keeps retrying an MFA unsuccessfully, subsequent transactions on this node are not processed. Once it is clear that the error cannot be resolved, the user can manually instruct the affected node to skip the failing commit. 187 | 188 | ### API Design 189 | 190 | ```erlang 191 | -spec(emqx_cluster_rpc:call(M,F,A, SucceedNum) -> {ok,TnxId,Result}|{error,Reason} when 192 | M :: module(), 193 | F :: atom(), 194 | A :: [term()], 195 | SucceedNum :: pos_integer() | all, 196 | Result :: ok |{ok, term()}, 197 | TnxId :: pos_integer()). 198 | -spec(emqx_cluster_rpc:skip_failed_commit(node()) -> ok). 199 | -spec(emqx_cluster_rpc:reset() -> ok). 200 | -spec(emqx_cluster_rpc:status() -> [#{tnx_id => pos_integer(), mfa => mfa(), 201 | pending_node => [node()], 202 | initiator => node(), 203 | created_at => calendar:datetime()}]). 204 | ``` 205 | 206 | ## Configuration Changes 207 | 208 | ```yaml 209 | node.cluster_call { 210 | ## Time interval to retry after a failed call 211 | ## 212 | ## @doc node.cluster_call.retry_interval 213 | ## ValueType: Duration 214 | ## Default: 1s 215 | retry_interval = 1s 216 | ## Retain the maximum number of completed transactions (for queries) 217 | ## 218 | ## @doc node.cluster_call.max_history 219 | ## ValueType: Integer 220 | ## Range: [1, 500] 221 | ## Default: 100 222 | max_history = 100 223 | ## Time interval to clear completed but stale transactions.
224 | ## Ensure that the number of completed transactions is less than the max_history 225 | ## 226 | ## @doc node.cluster_call.cleanup_interval 227 | ## ValueType: Duration 228 | ## Default: 5m 229 | cleanup_interval = 5m 230 | } 231 | ``` 232 | 233 | 234 | 235 | ## Backwards Compatibility 236 | 237 | N/A 238 | 239 | ## Document Changes 240 | 241 | N/A 242 | 243 | ## Testing Suggestions 244 | 245 | The final implementation must include unit test or common test code. If some 246 | more tests such as integration test or benchmarking test that need to be done 247 | manually, list them here. 248 | 249 | ## Declined Alternatives 250 | 251 | Here goes which alternatives were discussed but considered worse than the current. 252 | It's to help people understand how we reached the current state and also to 253 | prevent going through the discussion again when an old alternative is brought 254 | up again in the future. 255 | 256 | -------------------------------------------------------------------------------- /implemented/0013-gen-swagger-spec.md: -------------------------------------------------------------------------------- 1 | # An Example of EMQ X Improvement Proposal 2 | 3 | ## Changelog 4 | 5 | * 2021-09-02: @zhongwencool init draft 6 | 7 | ## Abstract 8 | 9 | The HTTP REST API implementation generates swagger spec directly from the code, without the need to maintain an additional swagger spec document. 10 | 11 | ## Motivation 12 | 13 | To implement the HTTP REST API interface, the developer needs to maintain a separate swagger spec in top of the implementation code, which is completely separate from the code and is difficult to update and maintain. Manually updating the swagger spec is intricate and error-prone. This proposal proposes to generate swagger spec by code, and swagger schema can reuse the schema in hocon, so that it is also convenient to do entry checking by automatic schema. 
14 | 15 | ## Design 16 | 17 | ### Basic Structure 18 | 19 | We first define a basic structure. 20 | 21 | ```erlang 22 | paths() -> ["/user/:user_id"]. 23 | 24 | schema("/user/:user_id") -> 25 | #{operationId => user}. 26 | ``` 27 | 28 | It lists all the paths of this spec and assigns each a unique `operationId`. This value names the corresponding function in the code. 29 | 30 | ### Operations 31 | 32 | For each path, we define operations (HTTP methods) that can be used to access that path. A single path can support multiple operations, for example, `GET /users` to get a list of users and `POST /users` to add a new user. A unique operation is defined as the combination of a path and an HTTP method. Minimal example of an operation: 33 | 34 | ```erlang 35 | schema("/user/:user_id/:fingerprint") -> 36 | #{ 37 | operationId => user, 38 | get => #{responses => 39 | #{200 => 40 | hoconsc:mk(hoconsc:ref(?MODULE, "user"), 41 | #{description => <<"return self user information">>})}} 42 | }. 43 | ``` 44 | 45 | Operations also support some optional elements for documentation purposes: `summary, description, tags`.
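To make the relationship between a path, its operations, and the `operationId` more concrete, here is a minimal, self-contained sketch. It is not the proposal's implementation: the module name `users_api_sketch` and the `dispatch/3` helper are hypothetical and only stand in for the routing that the HTTP framework would perform, and response schemas are left out so that the module compiles without `hoconsc`.

```erlang
-module(users_api_sketch).
-export([paths/0, schema/1, users/2, dispatch/3]).

%% All paths served by this (hypothetical) API module.
paths() -> ["/users"].

%% One path with two operations; `users` is the operationId and therefore
%% the name of the handler function below. Only the optional documentation
%% elements (summary, description, tags) are shown here.
schema("/users") ->
    #{operationId => users,
      get => #{summary => <<"List users">>,
               description => <<"Return all known users">>,
               tags => [<<"users">>]},
      post => #{summary => <<"Add a user">>,
                description => <<"Create a new user">>,
                tags => [<<"users">>]}}.

%% Handler named after the operationId, one clause per HTTP method.
users(get, _Request) ->
    {200, []};
users(post, #{body := Body}) ->
    {201, Body}.

%% Hypothetical stand-in for the framework's routing: look up the
%% operationId for the path and call the handler if the method is declared.
dispatch(Path, Method, Request) ->
    case schema(Path) of
        #{operationId := OperationId} = Spec when is_map_key(Method, Spec) ->
            ?MODULE:OperationId(Method, Request);
        _ ->
            {405, #{message => <<"method not allowed">>}}
    end.
```

In this sketch, `GET /users` and `POST /users` are two distinct operations on the same path, and both are served by the single function `users/2`.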
46 | 47 | ### Query String in Paths 48 | 49 | Query string parameters are declared with `in => query`, and path parameters with `in => path`: 50 | 51 | ```erlang 52 | #{put => 53 | #{ 54 | parameters => [ 55 | {user_id, hoconsc:mk(string(), #{in => path, description => <<"The client ID of your Emqx app">>, example => <<"a long client id">>})}, 56 | {per_page, mk(range(1, 50), #{required => true, in => query, example => "10"})}, 57 | {is_admin, mk(boolean(), #{required => true, in => query, example => "true"})}, 58 | {oneof_test_in_query, mk(hoconsc:union([string(), integer()]), #{in => query, example => "a_good_oneof_in_query"})} 59 | ]}} 60 | ``` 61 | 62 | This generates the following swagger JSON: 63 | 64 | ```json 65 | parameters: [ 66 | {example: "a_good_oneof_in_query", 67 | in: "query", 68 | name: "oneof_test_in_query", 69 | schema: { 70 | oneOf: [ 71 | {example: 100,type: "integer"}, 72 | {example: "string example",type: "string"}]}}, 73 | {example: "true",in: "query",name: "is_admin",required: true, 74 | schema: {example: true,type: "boolean"}}, 75 | {example: "10", in: "query", name: "per_page", required: true, 76 | schema: {example: 1, maximum: 50, minimum: 1, type: "integer"}}, 77 | {description: "The client ID of your Emqx app", example: "a long client id", in: "path", name: "user_id", 78 | required: true, 79 | schema: {example: "string example",type: "string"}}]. 80 | ``` 81 | 82 | ### Request Body 83 | 84 | Request bodies are typically used with “create” and “update” operations (POST, PUT, PATCH). For example, when creating a resource using POST or PUT, the request body usually contains the representation of the resource to be created. OpenAPI 3.0 provides the `requestBody` keyword to describe request bodies.
85 | 86 | ```erlang 87 | #{requestBody => 88 | #{ 89 | client_secret => mk(string(), #{description => <<"The OAuth app client secret for which to create the token.">>, maxLength => 40}), 90 | scopes => mk(hoconsc:array(string()), #{<<"description">> => "A list of scopes that this authorization is in.", example => ["public_repo", "user"], nullable => true}), 91 | test => mk(hoconsc:enum([test, good]), #{<<"description">> => "good", example => test}), 92 | note => mk(string(), #{description => <<"A note to remind you what the OAuth token is for.">>, example => <<"Update all gems">>}), 93 | note_url => mk(string(), #{description => <<"A URL to remind you what app the OAuth token is for.">>}), 94 | page => mk(range(1, 100), #{description => <<"Page Description.">>}), 95 | ip => mk(emqx_schema:ip_port(), #{description => <<"ip:port">>, example => "127.0.0.1:8081"}), 96 | oneof_test => mk(hoconsc:union([range(1, 100), infinity, hoconsc:ref(?MODULE, client_id)]), #{description => "oneof description", example => "1"}) 97 | }}. 98 | ``` 99 | 100 | RequestBody is json object. If we have too much nesting, we can use `hoconsc:ref/2` to make the code a little clearer. such as: 101 | 102 | `hoconsc:ref(?MODULE, client_id)` will call `?MODULE:fields(client_id)` to get specific schema. 103 | 104 | ### Responses 105 | 106 | ```erlang 107 | #{responses => 108 | #{ 109 | 200 => mk(hoconsc:ref(?MODULE, "authorization"), #{description => <<"if returning an existing token">>}), 110 | 422 => mk(hoconsc:ref(?MODULE, "validation_failed"), #{}), 111 | 400 => mk(hoconsc:array(hoconsc:ref(?MODULE, "authorization")), #{}), 112 | 401 => #{ 113 | total_count => mk(integer(), #{required => true}), 114 | artifacts => mk(hoconsc:array(hoconsc:ref(?MODULE, "authorization")), #{}) 115 | }, 116 | 203 => maps:from_list(emqx_schema:fields("authorization")) 117 | }}. 
118 | ``` 119 | 120 | This generates the following swagger.json: 121 | 122 | ```json 123 | responses: { 124 | 200: { 125 | description: "if returning an existing token", 126 | content: {application/json: {schema: {$ref: "#/components/schemas/emqx_swagger_api.authorization"}}}}, 127 | 203: {description: "",content: {application/json: {schema: {properties: {cache: 128 | {$ref:"#/components/schemas/emqx_schema.cache"}, 129 | deny_action: { 130 | default: "ignore", 131 | enum: ["ignore","disconnect"], 132 | type: "string"}, 133 | no_match: {default: "allow",enum: ["allow","deny"],type: "string"}}, 134 | type: "object"}}}}, 135 | 400: { 136 | description: "", 137 | content: { 138 | application/json: { 139 | schema: {items: {$ref: "#/components/schemas/emqx_swagger_api.authorization"}, 140 | type: "array"}}}}, 141 | 401: {description: "",content: {application/json: {schema: {required: ["total_count"], 142 | properties: { 143 | artifacts: {items: {$ref: "#/components/schemas/emqx_swagger_api.authorization"},type: "array"}, 144 | total_count: {example: 100,type: "integer"}}, 145 | type: "object"}}}}, 146 | 422: {description: "",content: 147 | {application/json: {schema: {$ref: "#/components/schemas/emqx_swagger_api.validation_failed"}}}}} 148 | }. 149 | ``` 150 | 151 | Only the JSON format is needed for now. 152 | 153 | 1. The developer then writes the implementation code. Because input validation has already been done according to the spec above, the implementation can use the parameters directly without further verification. 154 | 155 | ```erlang 156 | user(put, #{body := Params}) -> 157 | #{<<"ip">> := {IP, Port}} = Params, %% {IP, Port} has already been converted by the schema. 158 | .... 159 | {200, #{...response json..}}. 160 | 161 | ``` 162 | 163 | We no longer need to update swagger.json manually. 164 | 165 | We don't check the response schema. It is only used to generate swagger.json.
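The generated JSON shown in this section suggests a simple structural mapping from schema types to swagger schema objects: `range(1, 50)` becomes an integer with `minimum` and `maximum`, a union becomes `oneOf`, an array becomes `items`, and an enum becomes a string with an `enum` list. The sketch below illustrates that mapping only. The module name and the tagged tuples such as `{range, Min, Max}` are hypothetical, simplified descriptors and not the terms `hoconsc` actually produces; `example` fields are omitted.

```erlang
-module(swagger_type_sketch).
-export([to_schema/1]).

%% Map a simplified, hypothetical type descriptor to a swagger schema map.
-spec to_schema(term()) -> map().
to_schema(string) ->
    #{type => <<"string">>};
to_schema(integer) ->
    #{type => <<"integer">>};
to_schema(boolean) ->
    #{type => <<"boolean">>};
to_schema({range, Min, Max}) when is_integer(Min), is_integer(Max), Min =< Max ->
    %% A bounded integer keeps its bounds in the generated schema.
    #{type => <<"integer">>, minimum => Min, maximum => Max};
to_schema({enum, Symbols}) ->
    %% Enum symbols are emitted as strings.
    #{type => <<"string">>, enum => [atom_to_binary(S, utf8) || S <- Symbols]};
to_schema({array, Item}) ->
    #{type => <<"array">>, items => to_schema(Item)};
to_schema({union, Members}) ->
    %% Each union member becomes one alternative under oneOf.
    #{oneOf => [to_schema(M) || M <- Members]}.
```

For instance, `to_schema({range, 1, 50})` yields a map with `minimum => 1` and `maximum => 50`, matching the shape of the `per_page` parameter in the generated output.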
166 | 167 | ## Configuration Changes 168 | 169 | This section should list all the changes to the configuration files (if any). 170 | 171 | ## Backwards Compatibility 172 | 173 | This sections should shows how to make the feature is backwards compatible. 174 | If it can not be compatible with the previous emqx versions, explain how do you 175 | propose to deal with the incompatibilities. 176 | 177 | ## Document Changes 178 | 179 | If there is any document change, give a brief description of it here. 180 | 181 | ## Testing Suggestions 182 | 183 | The final implementation must include unit test or common test code. If some 184 | more tests such as integration test or benchmarking test that need to be done 185 | manually, list them here. 186 | 187 | ## Declined Alternatives 188 | 189 | Here goes which alternatives were discussed but considered worse than the current. 190 | It's to help people understand how we reached the current state and also to 191 | prevent going through the discussion again when an old alternative is brought 192 | up again in the future. 193 | 194 | -------------------------------------------------------------------------------- /implemented/0014-rolling-cluster-upgrade.md: -------------------------------------------------------------------------------- 1 | # Rolling cluster upgrade 2 | 3 | ## Change log 4 | 5 | * 2021-09-21: @k32 Initial draft 6 | * 2021-09-23: @k32 Applied remarks 7 | * 2021-09-28: @k32 Applied remarks 8 | 9 | ## Abstract 10 | 11 | Currently EMQ X upgrade procedure has two modes of upgrade: 12 | 13 | 1. Patch releases. 14 | Upgrading between the patch versions is done by patching the live nodes using Erlang hot code patching feature. 15 | The patch is loaded to the existing running nodes. 16 | This is a low-maintenance upgrade path, however severely limited in the scope of the changes. 17 | 18 | 1. 
Minor and major releases require taking down the entire cluster and redirecting the MQTT traffic to a different cluster running the new version of the software. 19 | This approach introduces a lot of operational overhead. 20 | 21 | EMQ X should support a rolling upgrade of the cluster, in which nodes (or pods) are taken down and replaced one by one, for minor version upgrades. 22 | Patch version upgrades will remain the same. 23 | 24 | This EIP introduces guidelines for writing backward- and forward-compatible code. 25 | It also documents the necessary changes to the existing code base. 26 | 27 | ## Motivation 28 | 29 | In order to make upgrading the cluster smoother, EMQ X should support rolling cluster upgrades and green-blue deployments. 30 | It should be possible to upgrade the cluster without taking it down entirely, by making sure that different minor versions of the code can communicate in both directions. 31 | 32 | ## Design 33 | 34 | Patch version upgrades will not be changed. 35 | They will follow the existing instruction: https://docs.emqx.io/en/broker/v4.3/advanced/relup.html 36 | 37 | Live upgrade paths will be limited to the `major.minor.* -> major.minor+1.*` or `major.minor.* -> major.minor.*` formulas. 38 | 39 | A new concept of a "backplane protocol" is introduced. 40 | 41 | ### Upgrade procedure 42 | 43 | Cluster upgrade should be split into roughly three stages: 44 | 45 | 1. Optional step: inject the forward compatibility support into the old release. 46 | This may be needed because we can't predict in advance what will be changed in the next release. 47 | The upgrade code should be able to inject pre-upgrade beams to the old release. 48 | This part of the upgrade procedure should be idempotent and reversible. 49 | It will work much like a patch version upgrade, using `appup` files in the same way as now. 50 | 51 | During this stage, all nodes in the cluster are running the same `major.minor` version of the code. 52 | 53 | 1.
Rolling upgrade of the cluster that involves taking nodes down and replacing them with the newer version. 54 | This part of the upgrade is reversible. 55 | 56 | During this stage, different nodes in the cluster are running different minor versions of the code. 57 | Appup files are not used, because the nodes are created from scratch. 58 | 59 | 1. Once all the nodes are upgraded, a data migration can start. 60 | This part of the upgrade is not reversible, because the migration process is destructive. 61 | It is triggered by an explicit command from the operator, and runs asynchronously (TODO: describe the user interface for doing this). 62 | (TODO: come up with a testing strategy that verifies handling of the partially migrated data) 63 | 64 | Deprecated API modules can be removed in the next release. 65 | 66 | There are two major areas that need to be considered to support this kind of upgrade: 67 | 68 | 1. RPC compatibility 69 | 1. Mnesia schema backward-compatibility 70 | 71 | ### RPC compatibility 72 | 73 | In order to simplify the reasoning about the backplane API backward-compatibility, all the functions that may be called remotely should be identified and gathered in specialized modules. 74 | These modules should have names ending in `_proto_v1` (where v1 is the protocol version). 75 | These modules should be immutable, and ideally they should not contain any business logic: only a (gen_)rpc call or cast. 76 | When a new version of the API module is created, the previous one should be kept. 77 | Sending messages directly to remote processes is prohibited. 78 | Instead, a helper function in the API module should be introduced that does a cast. 79 | 80 | Example: 81 | 82 | ```erlang 83 | -module(foo_proto_v2). 84 | 85 | -export([ foo/2 86 | , bar/2 87 | ]). 88 | 89 | -spec foo(node(), atom()) -> ok. 90 | foo(Node, Arg1) -> 91 | rpc:call(Node, foo_proto_impl, foo, [Arg1]). 92 | 93 | -spec bar(node(), atom()) -> ok. 94 | bar(Node, Arg1) -> 95 | ApiVersion = 2, %% Optional.
In case the same implementation is reused in different API versions 96 | rpc:call(Node, foo_proto_impl, bar, [ApiVersion, Arg1]). 97 | ``` 98 | 99 | #### Protocol version negotiation 100 | 101 | The `bproto` application should provide an API function that returns the latest version of the protocol supported by the remote node. 102 | This function should be called before establishing any session. 103 | This information can be cached for the entire session, because the upgrade is done by taking down the entire node, so the protocol version cannot suddenly change. 104 | 105 | #### Static checks 106 | 107 | The following xref checks should be written: 108 | 109 | 1. Only the functions in `.*_proto_v[0-9]+` modules are allowed to call the `rpc:call` and `gen_rpc:call` functions 110 | 1. Each API function does an RPC to an existing function 111 | 112 | ### Design of the helper bproto application 113 | 114 | We propose to create a new helper application that helps to maintain the API backward-compatibility. 115 | This application will contain the code that performs static checks of the API modules. 116 | It will also contain the runtime for version negotiation. 117 | 118 | A helper mnesia table tracking the upgrade state and release version for each node can be used to perform the checks before proceeding to the next stage of the upgrade. 119 | 120 | ### Mnesia schema compatibility 121 | 122 | Fields of the tables can't be removed (until the last stage of the upgrade?). 123 | Non-trivial changes to the schema should be performed in stages: 124 | 125 | 1. The new version of the code should be able to work with the old schema. 126 | 1. Once all the nodes in the cluster are updated, start an async process of migrating the data to the new table. 127 | 1. Once the data has been migrated and checks pass, the old table can be removed. 128 | 1. In the next release the code supporting the old schema can be dropped.
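While the asynchronous migration of stage 2 is running, a given key may still live in the old table or may already have been moved to the new one. The following sketch shows one way code could cope with that state using plain mnesia calls. It is an illustration under assumptions, not part of the proposal: the module name, the function names, and the caller-supplied `Convert` fun (which maps an old record to the new layout) are all hypothetical.

```erlang
-module(staged_migration_sketch).
-export([read_both/4, migrate_key/4]).

%% Read Key while a migration is in progress: prefer the new table and
%% fall back to the old one, converting old records on the fly.
%% Must be called inside a mnesia transaction.
read_both(NewTab, OldTab, Key, Convert) ->
    case mnesia:read(NewTab, Key) of
        [] -> [Convert(Old) || Old <- mnesia:read(OldTab, Key)];
        NewRecords -> NewRecords
    end.

%% Move a single key in its own transaction. Data that is already
%% present in the new table wins over the old copy.
migrate_key(NewTab, OldTab, Key, Convert) ->
    mnesia:transaction(fun() ->
        case mnesia:read(NewTab, Key, write) of
            [] ->
                lists:foreach(
                  fun(Old) -> ok = mnesia:write(NewTab, Convert(Old), write) end,
                  mnesia:read(OldTab, Key, write));
            _AlreadyMigrated ->
                ok
        end,
        %% The old copy is deleted in the same transaction as the write.
        mnesia:delete(OldTab, Key, write)
    end).
```

Because each key is moved in a separate small transaction, readers that use `read_both/4` see either the old or the new copy of a record, never neither.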
129 | 130 | #### Table migration 131 | 132 | The `bproto` application takes care of the static checks. 133 | Writing a migration in a way that avoids whole-table locks is a complicated process, and should be done on a case-by-case basis. 134 | The migration process could utilize mnesia transactions to traverse the tables entry-by-entry and move the records to the new table, deleting the old record. 135 | The read code should read both tables. 136 | 137 | ## Configuration Changes 138 | 139 | n/a 140 | 141 | ## Backwards Compatibility 142 | 143 | This change is backward-compatible. 144 | 145 | ## Document Changes 146 | 147 | Document cluster upgrade procedure. 148 | 149 | ## Testing Suggestions 150 | 151 | 1. Change the CI, so some or all cluster test suites run on a cluster consisting of nodes running two different versions of EMQ X. 152 | 153 | ## Declined Alternatives 154 | 155 | ### Offline cluster upgrade via backup transformation 156 | 157 | Reason to reject: changed business requirements. 158 | 159 | ### Versioning of the individual API functions 160 | 161 | Annotations (or edoc tags) can be used to specify the API functions: 162 | 163 | ```erlang 164 | -module(foo_api). 165 | 166 | -introduced_in({foo/3, {5,0,0}}). 167 | foo(A, B, C) -> 168 | rpc:call(foo_api_impl, foo, [A, B, C]). 169 | 170 | -introduced_in({bar/3, {5,0,0}}). 171 | -deprecated_in({bar/3, {5,1,0}}). 172 | bar(A, B, C) -> 173 | rpc:call(foo_api_impl, bar, [A, B, C]). 174 | ``` 175 | 176 | ```erlang 177 | -module(foo_api_impl). 178 | 179 | foo(A, B, C) -> 180 | ... 181 | 182 | bar(A, B, C) -> 183 | ...
184 | ``` 185 | 186 | Compatibility of the typespecs of the API modules: 187 | - Function domain is not reduced 188 | - Function co-domain is not extended 189 | 190 | The last check uses custom code, that iterates through function arguments and return types and calls `erl_types:t_is_subtype/2` function for the corresponding types in the new and the old minor version of the application (determined by the `app.src` file). 191 | 192 | Reasons to reject: it does not address bi-directional calls. 193 | Mitigation of this problem is too complicated. 194 | 195 | ### Having bpapi application in a separate repo 196 | 197 | EMQ X dependencies perform RPCs too. 198 | The idea was to make them compatible using the same principles. 199 | 200 | Reason to reject: reduce scope of the feature. 201 | -------------------------------------------------------------------------------- /implemented/0015-unified-authentication.md: -------------------------------------------------------------------------------- 1 | # Unified Authentication in EMQ X 5.0 2 | 3 | ## Change log 4 | 5 | * 2021-05-17: @zhouzb Initial draft 6 | * 2021-10-04: @zmstone Sync the doc from internal updated 0012 from doc 7 | * 2022-08-10: @savonarola Update and move to implemented 8 | 9 | ## Abstract 10 | 11 | This proposal introduces a new design for EMQ X 5.0 authentication, 12 | which aims to provide: for EMQ X users, a better user experience with more configurable interfaces, 13 | and for EMQ X developers, a better development framework without repeating themselves. 14 | 15 | ## Motivation 16 | 17 | EMQ X authentication is implemented by the hook-callback for hook-point `client.authenticate`. 
18 | Up to v4.3, EMQ X supported 8 different authentication plugins, namely: 19 | 20 | ``` 21 | emqx_auth_http 22 | emqx_auth_jwt 23 | emqx_auth_ldap 24 | emqx_auth_mnesia 25 | emqx_auth_mongo 26 | emqx_auth_mysql 27 | emqx_auth_pgsql 28 | emqx_auth_redis 29 | ``` 30 | 31 | Some of the pain points in the old implementation: 32 | 33 | 1. The authentication plugins are implemented more or less the same, 34 | and work more or less the same too. There is a lack of abstraction for the common parts, 35 | causing developers to repeat themselves when adding new features or fixing issues. 36 | 1. There is a lack of a nice re-configure interface. The only way to configure a plugin 37 | is to update the config file, and reload (stop, start) the plugin. 38 | 1. If more than one auth plugin is enabled, there is no deterministic order in 39 | which the different backends are checked. 40 | 1. Enabled authentication plugins are collectively considered one global instance, 41 | so there is a lack of granularity for scoped control levels, e.g. per-zone or per-listener. 42 | 43 | To address these pain points in 5.0, we propose the enhancements below. 44 | 45 | ## Design 46 | 47 | ### One app for all 48 | 49 | One `emqx_authn` app to unify the management of all different backends (except for ldap being postponed for now). 50 | 51 | ### The same hook-point 52 | 53 | In this design, there is no intention to change how EMQ X hooks work, 54 | the new app `emqx_authn` will continue to make use of the `client.authenticate` hook-point, 55 | only to dispatch auth requests to the underlying backends inside one single hook call. 56 | 57 | ### Composable authn "chain" 58 | 59 | We should allow users to compose (configure) a "chain" of backends with a defined order in which the 60 | checks are performed one after another.
Each check against a backend in the chain may yield 3 different 61 | results for one-request authentication: 62 | 63 | - `ignore` indicates that no auth information was found, hence the chain should 64 | continue validating the client against the rest of the backends in the chain. 65 | - `{ok, Info}` means the login is accepted, hence the auth calls terminate here, 66 | where `Info` may contain additional information, such as whether the user is a super-user. 67 | - `{error, Reason}` indicates that the client's login should be denied. 68 | 69 | NOTE: for temporary errors, such as a database connection issue, the error is logged, 70 | and the auth result is `ignore` so as to move forward to the next node in the chain. 71 | 72 | NOTE: if there is no `ok` (accepted) result after the full chain is exhausted, the login is rejected. 73 | 74 | NOTE: an empty chain allows anonymous access. 75 | 76 | For enhanced authentication, such as `scram`, there can be messages after the first request, 77 | hence the backend may return `{continue, Data}`, 78 | where `Data` is to be kept by the connection process as handling context for the following messages. 79 | 80 | ### Fine-grained configuration levels 81 | 82 | By default, an EMQ X user can configure one global chain which applies to all MQTT listeners. 83 | We should however also allow a per-listener configuration to override the global chain. 84 | Together with firewall rules, this will allow users to have different auth solutions for 85 | MQTT services facing different groups of clients coming from their designated networks. 86 | 87 | ### Reconfigurable on the fly 88 | 89 | The changes in the auth chain or the backends should be applied on-the-fly, 90 | i.e. without having to restart the `emqx_authn` application.
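The chain semantics described under "Composable authn chain" can be summarised in a short, self-contained sketch. This is not the `emqx_authn` implementation: the module name is hypothetical, authenticators are modelled as plain funs, the `not_authorized` reason and the `is_superuser` key are placeholders, and treating an exception raised by a backend as a "temporary error" is an assumption made for illustration.

```erlang
-module(authn_chain_sketch).
-export([authenticate/2]).

-type result() :: ignore | {ok, map()} | {error, term()} | {continue, term()}.
-type authenticator() :: fun((map()) -> result()).

%% Run the credential through the chain in its configured order.
-spec authenticate([authenticator()], map()) ->
          {ok, map()} | {error, term()} | {continue, term()}.
authenticate([], _Credential) ->
    %% An empty chain allows anonymous access.
    {ok, #{is_superuser => false}};
authenticate(Chain, Credential) ->
    run(Chain, Credential).

run([], _Credential) ->
    %% The chain is exhausted without an accepting backend: reject.
    {error, not_authorized};
run([Authenticator | Rest], Credential) ->
    case safe_check(Authenticator, Credential) of
        ignore -> run(Rest, Credential);
        %% {ok, _}, {error, _} and {continue, _} all end the walk.
        Final -> Final
    end.

%% Assumption: a crashing backend counts as a temporary error, which is
%% logged and then treated as `ignore`.
safe_check(Authenticator, Credential) ->
    try
        Authenticator(Credential)
    catch
        Class:Reason ->
            logger:warning("authn backend failed: ~p:~p", [Class, Reason]),
            ignore
    end.
```

Keeping the empty-chain case separate from chain exhaustion is what lets an unconfigured system accept anonymous clients while a configured chain rejects anything that no backend accepted.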
91 | 92 | 93 | ## Configuration 94 | 95 | - Example config for built_in_database (mnesia) username/password based global auth 96 | 97 | ``` 98 | authentication { 99 | backend: 'built_in_database', 100 | mechanism: "password_based", 101 | ... 102 | user_id_type: clientid 103 | } 104 | ``` 105 | 106 | - Example 'chain' config 107 | 108 | ``` 109 | authentication = [ 110 | { 111 | backend: 'built_in_database', 112 | mechanism: "password_based", 113 | ... 114 | user_id_type: clientid 115 | }, 116 | { 117 | algorithm = "hmac-based" 118 | mechanism = "jwt" 119 | secret = "emqxsecret" 120 | "secret_base64_encoded" = false 121 | use_jwks = false 122 | verify_claims {} 123 | } 124 | ] 125 | ``` 126 | 127 | - Example config for built_in_database (mnesia) username/password based per-listener auth 128 | 129 | ``` 130 | listener.tcp.default { 131 | ... 132 | authentication: { 133 | backend: "built_in_database", 134 | type: "password_based", 135 | user_id_type: username 136 | } 137 | ... 138 | } 139 | ``` 140 | 141 | - Example config for built_in_database (mnesia) username/password based per-gateway/per-listener auth 142 | 143 | ``` 144 | gateway.stomp { 145 | ... 146 | # Specific global authenticator for all STOMP listeners 147 | authentication = { 148 | backend: "built_in_database", 149 | type: "password_based", 150 | user_id_type: username 151 | } 152 | 153 | listeners.tcp.default { 154 | ... 155 | # Specific authenticator for the specified STOMP listener 156 | authentication = { 157 | backend: "built_in_database", 158 | type: "password_based", 159 | user_id_type: username 160 | } 161 | } 162 | } 163 | ``` 164 | 165 | Gateways allow only single authenticator in the chain. 166 | 167 | - Disable authentication for a specific listener 168 | 169 | ``` 170 | listener.tcp.default { 171 | ... 172 | enable_authn = false 173 | ... 
174 | } 175 | ``` 176 | 177 | ## APIs 178 | 179 | 180 | ### Global auth chain APIs 181 | 182 | - Get global auth chain 183 | 184 | ``` 185 | GET /authentication 186 | ``` 187 | 188 | - Add authenticator to the global chain 189 | 190 | ``` 191 | POST /authentication 192 | { 193 | "backend": "built_in_database", 194 | "type": "password_based", 195 | ... 196 | } 197 | ``` 198 | 199 | - Manage individual authenticators in the global chain 200 | 201 | ``` 202 | GET /authentication/:id 203 | 204 | DELETE /authentication/:id 205 | 206 | PUT /authentication/password_based:built_in_database 207 | { 208 | ... 209 | } 210 | ``` 211 | 212 | Where `id` is of format `{mechanism}:{backend}`, e.g. `password_based:built_in_database`. 213 | 214 | The `PUT` body should be constructed according to the config schema. 215 | 216 | ### Per-listener auth chain APIs 217 | 218 | For per-listener authentication chains, the APIs are mostly the same 219 | as the ones for the global chain, only the path is prefixed with `/listeners/:listener_id`. 220 | 221 | ``` 222 | POST /listeners/:listener_id/authentication 223 | GET /listeners/:listener_id/authentication 224 | 225 | GET /listeners/:listener_id/authentication/:id 226 | DELETE /listeners/:listener_id/authentication/:id 227 | PUT /listeners/:listener_id/authentication/:id 228 | ``` 229 | 230 | A listener name is of format `protocol:id`, which is assigned in the config file, e.g. 231 | 232 | ``` 233 | listeners.tcp.default { 234 | bind = ... 235 | } 236 | ``` 237 | 238 | The name of this listener is `tcp:default`. 239 | 240 | Gateway endpoints are: 241 | 242 | ``` 243 | /gateway/:protocol/authentication 244 | /gateway/:protocol/listeners/:listener_id/authentication 245 | ``` 246 | 247 | ### Re-positioning APIs 248 | 249 | ``` 250 | POST /authentication/:id/move 251 | POST /listeners/:listener_id/authentication/:id/move 252 | ``` 253 | 254 | With a JSON body to indicate where the authenticator is to be re-positioned. 
255 | The positions can be `top` (front of the list), `bottom` (the rear of the list), 256 | or `before` / `after` another ID. 257 | 258 | for example: 259 | ``` 260 | curl -X 'POST' \ 261 | 'http://localhost:18083/api/v5/authentication/jwt/move' \ 262 | -H 'accept: */*' \ 263 | -H 'Content-Type: application/json' \ 264 | -d '{ 265 | "position": "before:password_based:built_in_database" 266 | }' 267 | ``` 268 | 269 | ### User management APIs 270 | 271 | We should also support CRUD APIs for user management, with below endpoints. 272 | 273 | - Manage users 274 | 275 | ``` 276 | POST /authentication/password_based:built_in_database/users 277 | { 278 | ... 279 | } 280 | GET /authentication/password_based:built_in_database/users 281 | ``` 282 | 283 | - Manage individual users 284 | ``` 285 | GET /authentication/password_based:built_in_database/users/:user_id 286 | 287 | PUT /authentication/password_based:built_in_database/users/:user_id 288 | { 289 | 290 | } 291 | 292 | DELETE /authentication/password_based:built_in_database/users/:user_id 293 | ``` 294 | 295 | The authenticator ID is made generic although 5.0, 296 | only the built-in database (Mnesia) is supported. 297 | That is, only `password_based:built_in_database` is valid for `:id` so far. 
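Returning to the re-positioning API above, the effect of a `position` value on the chain order can be sketched as below. This is an illustrative Python sketch under assumptions, not the server implementation: the function name `move_authenticator` and the error handling are hypothetical. The one subtle point it shows is that authenticator IDs may themselves contain a colon, so `before:` / `after:` must be split on the first colon only.

```python
from typing import List


def move_authenticator(chain: List[str], auth_id: str, position: str) -> List[str]:
    """Return a new chain with `auth_id` moved to `position`."""
    if auth_id not in chain:
        raise ValueError(f"unknown authenticator: {auth_id}")
    rest = [item for item in chain if item != auth_id]
    if position == "top":
        return [auth_id] + rest
    if position == "bottom":
        return rest + [auth_id]
    # Split on the first colon only: IDs such as
    # "password_based:built_in_database" contain a colon themselves.
    relation, _, target = position.partition(":")
    if relation not in ("before", "after") or target not in rest:
        raise ValueError(f"invalid position: {position}")
    index = rest.index(target) + (1 if relation == "after" else 0)
    return rest[:index] + [auth_id] + rest[index:]
```

With a chain of `password_based:built_in_database` followed by `jwt`, the `curl` request shown earlier (`"position": "before:password_based:built_in_database"` on `jwt`) corresponds to moving `jwt` to the front.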
298 | 299 | The corresponding per-listener and per-gateway endpoints are: 300 | 301 | ``` 302 | POST /listeners/:listener_id/authentication/:id/users 303 | GET /listeners/:listener_id/authentication/:id/users 304 | 305 | GET /listeners/:listener_id/authentication/:id/users/:user_id 306 | PUT /listeners/:listener_id/authentication/:id/users/:user_id 307 | DELETE /listeners/:listener_id/authentication/:id/users/:user_id 308 | 309 | POST /gateway/:name/authentication/users 310 | GET /gateway/:name/authentication/users 311 | 312 | GET /gateway/:name/authentication/users/:user_id 313 | PUT /gateway/:name/authentication/users/:user_id 314 | DELETE /gateway/:name/authentication/users/:user_id 315 | 316 | POST /gateway/:name/listeners/:id/authentication/users 317 | GET /gateway/:name/listeners/:id/authentication/users 318 | 319 | GET /gateway/:name/listeners/:id/authentication/users/:user_id 320 | PUT /gateway/:name/listeners/:id/authentication/users/:user_id 321 | DELETE /gateway/:name/listeners/:id/authentication/users/:user_id 322 | ``` 323 | 324 | ## Testing suggestions 325 | 326 | There should be three levels of tests. 327 | 328 | * Unit tests for module level tests 329 | * Regular common tests (maybe with mocks if necessary) to test full flows 330 | * Integrated common tests to verify the code against external auth providers running in docker containers 331 | -------------------------------------------------------------------------------- /implemented/0016-emqx-conf.md: -------------------------------------------------------------------------------- 1 | # EMQX Configuration Manager 2 | 3 | ## Changelog 4 | 5 | * 2021-10-12: @zhongwencool init draft 6 | 7 | ## Abstract 8 | 9 | This proposal introduces a new Erlang application to handle EMQ X's configuration management with a focus on config live-reloads, and cluster wide config change syncs. 10 | 11 | ## Motivation 12 | Prior to 5.0, EMQ X's configuration management was quite static. 
13 | * The user interfaces for config changes are environment variables or a text editor for the config files. 14 | * To load changed configs, it usually requires restarting an application, or reloading a plugin, or sometimes even restarting the node. 15 | * When managing a cluster, one would have to update config files one node after another. 16 | * Mnesia was used to store some of the configs (such as rule-engine resources) in order to get them replicated, which made it less configurable because it was not possible to bootstrap such configs from a file which can be prepared before the node boots. Instead, one would have to wait for the node to boot, and then call HTTP API to make the changes. 17 | 18 | In this proposal, we try to address the pain-points by 19 | - Supporting HTTP APIs to perform live config changes and reload. 20 | - Persisting changes made from HTTP APIs on disk in HOCON format. 21 | - Maintaining consistency across the nodes in the cluster. For example authentication & authorisation (ACL) configs, and rule-engine rules. 22 | 23 | In 5.0, no config is stored in Mnesia, however such changes are not in the scope of this EIP. 24 | Some configs may not make sense to be the same for all nodes, so we should also allow local node overrides. such as `rate_limit` settings for nodes per their hardware capacity. 25 | 26 | ## Design 27 | 28 | ### Configuration files 29 | 30 | #### emqx.conf 31 | 32 | EMQ X reads `emqx.conf` for converting this hocon file into Erlang format at startup, and `emqx.conf` explicitly include `override.conf` at the end of the file. 33 | 34 | ```erlang 35 | include "data/configs/cluster-override.conf" 36 | include "data/configs/local-override.conf" 37 | ``` 38 | 39 | - If the user wants to manually modify a node's configuration item before startup, it can be appended to the end of the `emqx.conf`, or use `include "data/configs/user_default.conf`, and for the same configuration, the later value will overwrite the earlier one. 
40 | - If the user specifies to read environment variables for a configuration item, this value is read-only at runtime and will not be modified. In other words, the environment variables are always taken at the end of the `emqx.conf` file and has the highest priority. 41 | - For now, we need to integrate all the configurations into `emqx.conf`. But it is planned that `emqx.conf` will no longer contain configurations that use default values in the next phase(propose another EIP), the default is only embedded in the code. Also, users can view all configuration items via the HTTP API (described later). This would have 2 benefits: 42 | - In subsequent version upgrades, adding/removing/updating configurations will not be overwritten by `emqx.conf`. 43 | - This allows the user to focus only on the configuration that has been modified. 44 | 45 | #### emqx_cluster_override.conf 46 | 47 | - This file can only be modified via the API and manual modification of file directly is not supported. 48 | 49 | - When updating configurations that must require consistency across the cluster, they are persisted to this file. 50 | - The node will copy this file from the longest surviving core node before initializing the configuration, this file will be added to initialize the configuration together. 51 | - When the node is updated, the configuration within the cluster is updated via cluster call and persisted to this file. 52 | - This file must be kept the same for all nodes, we will add an extra process to check the content of the file periodically, and alarm after 3 continuous differences are found. 53 | 54 | #### emqx_local_override.conf 55 | 56 | - This file can only be modified via the API and manual modification of file directly is not supported. 57 | 58 | - When the configuration of a specific node is updated via HTTP API, it will be persisted to this file. 
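The layering described in this section (`emqx.conf`, then the cluster override, then the local override, with environment variables taking the highest priority) can be illustrated with a small merge sketch. It is written in Python under stated assumptions: the function names `deep_merge` and `effective_config` are hypothetical, the config keys are made up for illustration, and nested objects are assumed to merge key by key, as HOCON does for duplicate object keys.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Merge `override` into a copy of `base`; later values win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested objects
        else:
            merged[key] = value  # scalars and lists are replaced outright
    return merged


def effective_config(*layers: Dict[str, Any]) -> Dict[str, Any]:
    """Apply layers from lowest to highest priority."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Illustrative layers only; the keys below are not taken from the real schema.
emqx_conf = {"limits": {"max_conn": 1000, "max_rate": 500}}
cluster_override = {"limits": {"max_rate": 800}}
local_override = {"limits": {"max_conn": 200}}
env_vars = {}

config = effective_config(emqx_conf, cluster_override, local_override, env_vars)
```

Because the local override is applied after the cluster override, a value set locally shadows any later cluster-wide change to the same key, which is why resetting the local value first matters.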
59 | 60 | #### emqx_conf application 61 | Before, the initialization of the configuration file, cluster_call, was done through the `emqx_machine`, we will split this part of the functionality from the` emqx_machine` and make a new application `emqx_conf` 62 | 63 | The role of this application is to 64 | - Convert the configuration from HOCON format to Erlang sys.config format at initialization. 65 | - Manage live-updates and deletions of the configurations, and replicate across the cluster. 66 | 67 | Other apps that want to update the configuration must call through the `emqx_conf`'s' API, which cannot call emqx API directly. 68 | The specific flow is: 69 | 70 | ```bash 71 | Other Apps(eg: emqx_resource) => emqx_conf => emqx API. 72 | ``` 73 | 74 | ### HTTP API design 75 | 76 | #### Get the whole configurations. 77 | 78 | ```erlang 79 | #{ 80 | get => #{ 81 | description => <<"Get all the configurations of a given node, or all nodes in the cluster.">>, 82 | parameters => [ 83 | {node, hoconsc:mk(typerefl:atom(), 84 | #{in => query, required => false, example => <<"emqx@127.0.0.1">>, 85 | desc => <<"Node name. When this parameter is not provided, configs for all nodes in the cluster are returned">>})}, 86 | {debug, hoconsc:mk(typerefl:boolean(), #{in => query, required => false, 87 | desc => <<"Carries debug (metadata) information, such as file name and line number">>})}], 88 | responses => #{ 89 | 200 => #{"$node" => configs_list()} 90 | } 91 | } 92 | }; 93 | ``` 94 | 95 | - Returns what the current value/documentation of all configuration items is, group by nodename. 96 | - `debug=true`, will return all the meta data, such as line number, default, document, easy to locate the problem. 
97 | 98 | #### Update specific configuration 99 | 100 | ```erlang 101 | schema("/configs/:rootname") -> 102 | #{ 103 | get => #{ 104 | description => <<"Get the sub-configurations">>, 105 | parameters => [ 106 | {node, hoconsc:mk(atom(), #{in => query, required => false})}, 107 | {debug, hoconsc:mk(typerefl:boolean(), #{in => query, required => false})} 108 | ], 109 | responses => #{ 110 | 200 => #{<<"$node">> => config_list()}, 111 | 404 => emqx_dashboard_swagger:error_codes(['NOT_FOUND'], <<"config not found">>) 112 | } 113 | }, 114 | put => #{ 115 | description => <<"Update the sub-configurations">>, 116 | parameters => [{node, hoconsc:mk(atom(), #{in => query, required => false})}], 117 | requestBody => config_list(), 118 | responses => #{ 119 | 200 => #{<<"$node">> => config_list()}, 120 | 400 => emqx_dashboard_swagger:error_codes(['UPDATE_FAILED']) 121 | } 122 | } 123 | }. 124 | ``` 125 | 126 | - get specific configuation, such as: `/configs/emqx_dashboard` will return : 127 | There should be a `sensitive` flag in the schema for sensitive fields, and the value should be reported back as `"******"` in the API, such as password. 128 | 129 | ```erlang 130 | #{'emqx@127.0.0.1' => 131 | #{default_password => "****", 132 | default_username => "admin", 133 | listeners => 134 | [#{backlog => 512,inet6 => false,ipv6_v6only => false, 135 | max_connections => 512,num_acceptors => 4,port => 18083, 136 | protocol => http,send_timeout => 5000}], 137 | sample_interval => 10,token_expired_time => 3600000}, 138 | 'emqx1@127.0.0.1' => ... 139 | } 140 | ``` 141 | 142 | - Update specific configuation without the 'node' query string will modify the configuration of all nodes in the cluster and persist it in `emqx_cluster_override.conf`. 143 | 144 | - Update specific configuation with `node='xxx@xx.xx.xx'` in the query string, only the configuration of the specified node will be modified, persisted to `emqx_local_override.conf`. 
145 | 146 | - If we have already modified a configuration in `emqx_local_override.conf` successfully, trying to update this value in `emqx_cluster_override.conf` again will return a failure, and the user will be instructed to reset this configuration from local before the update can succeed. Otherwise, since the priority of `emqx_local` is higher than that of `emqx_cluster`, the changes made in `emqx_cluster` will not take effect. 147 | 148 | - Update requests carry the latest value back, and if the update fails, the response also explains what the current value is. 149 | 150 | #### Reset specific configuration 151 | 152 | ```erlang 153 | schema("/configs_reset/:rootname") -> 154 | #{put => #{ 155 | description => <<"Reset the sub-configurations">>, 156 | parameters => [{node, hoconsc:mk(atom(), #{in => query, required => false})}], 157 | requestBody => config_list(), 158 | responses => #{ 159 | 200 => #{<<"$node">> => config_list()}, 160 | 400 => emqx_dashboard_swagger:error_codes(['REST_FAILED']) 161 | } 162 | } 163 | ``` 164 | 165 | - We can't delete a configuration item, we can only reset it. 166 | - Resetting a specific configuration without the `node` query string deletes the configuration in `emqx_cluster_override.conf`. 167 | - Resetting a specific configuration with `node='xxx@xx.xx.xx'` in the query string deletes only the configuration of the specified node in `emqx_local_override.conf`. 168 | 169 | ## Configuration Changes 170 | 171 | This section should list all the changes to the configuration files (if any). 172 | 173 | ## Backwards Compatibility 174 | 175 | This section should show how to make the feature backwards compatible. 176 | If it can not be compatible with the previous emqx versions, explain how you 177 | propose to deal with the incompatibilities. 178 | 179 | ## Document Changes 180 | 181 | If there is any document change, give a brief description of it here. 
182 | 183 | ## Testing Suggestions 184 | 185 | The final implementation must include unit test or common test code. If some 186 | more tests such as integration test or benchmarking test that need to be done 187 | manually, list them here. 188 | 189 | ## Declined Alternatives 190 | 191 | The `emqx_cluster_override.conf` and `emqx_local_override.conf` can't be directly modified by handle, we can also store this information in mnesia. But it is not convenient for users to see it. 192 | 193 | -------------------------------------------------------------------------------- /implemented/0018-unified-authorization.md: -------------------------------------------------------------------------------- 1 | # Unified Authorization in EMQ X 5.0 2 | 3 | ## Change log 4 | 5 | * 2021-05: @Rory-Z 6 | * 2021-10-29: @zmstone Sync the doc from draft in internal wiki 7 | * 2022-08-10: @savonarola Update and move to implemented 8 | 9 | ## Abstract 10 | 11 | 12 | In EMQ X 4.x, authorization (ACL) is provided by the `emqx_auth_xxx` applications as small Erlang applications, 13 | while it was nice having the flexibility, the pain-points are: 14 | 15 | - Scattered configuration management for EMQ X users 16 | - Repeated (copy-paste) work for EMQ X developers 17 | - Non-deterministic ordering of the hook callback registration leading to non-deterministic ordering of the ACL rules 18 | 19 | This proposal introduces a new design for EMQ X 5.0 authorization (ACL), which aims to provide: 20 | 21 | - For EMQ X users, a better user experience with more configurable interfaces 22 | - For EMQ X developers, a better development framework without repeating themselves 23 | 24 | ## Terms 25 | 26 | * ACL: Access Control List which defines a set of 'rules' for message publish and topic subscribe requests from MQTT clients; 27 | * Authorization: another word for ACL; 28 | * Source: an ACL rule data provider, such as 'file', 'http', 'mongo' and 'mysql' etc. 
29 | * Chain: an ordered list of Sources 30 | 31 | ## High level requirements 32 | 33 | * Multiple sources for ACL rule persistence 34 | - File 35 | - MySQL 36 | - PostgreSQ 37 | - Redis 38 | - MongoDB 39 | - Mnesia (built-in-database) 40 | - WebServer (http) 41 | 42 | * Fallback action if no rule matches a request (publish or subscribe) 43 | - deny 44 | - allow 45 | - disconnect 46 | 47 | * Rule cache 48 | In 4.x series, the rules are cached in client's process dictionary. 49 | There is no intention to change such behaviour in 5.0 50 | 51 | * Allow more than one source to form the chain 52 | - The chain should have a determined order. Unlike the situation in 4.x the ACL check order depends on the plugin start/restart order 53 | - Only one instance is allowed for each type of source, e.g. one should not be allowed to configure more than one `file` type source or `http` type source 54 | - Provide APIs to adjust the ordering of the chained sources 55 | 56 | * ACL for gateways, CoAP, MQTT-SN, exproto, Stomp (but not LwM2M) 57 | 58 | * Management API to upload rules for `file` type ACL source 59 | 60 | 61 | ## Design 62 | 63 | Config proposal 64 | 65 | ``` 66 | authorization { 67 | no_match = allow | deny 68 | deny_action = disconnect | ignore 69 | 70 | cache { 71 | enable = true 72 | max_size = 32 73 | ttl = 30m 74 | } 75 | sources: [ 76 | { 77 | type = file 78 | enable = true 79 | path = "etc/example.conf" 80 | }, 81 | { 82 | type = mysql 83 | enable = true 84 | database = mqtt 85 | username = root 86 | password = xxx 87 | pool_size = 8 88 | query = "select * from table1 where clientid = xxx" 89 | } 90 | ] 91 | } 92 | ``` 93 | 94 | ### File 95 | 96 | #### config 97 | ``` 98 | { 99 | type = file 100 | enable = true 101 | path = "/path/to/example.conf" 102 | } 103 | ``` 104 | 105 | ### File content example (same as in 4.x) 106 | 107 | ``` 108 | {allow, {username, "^dashboard?"}, subscribe, ["$SYS/#"]}. 109 | {allow, {ipaddr, "127.0.0.1"}, pubsub, ["$SYS/#", "#"]}. 
110 | ``` 111 | 112 | ### MySQL 113 | ``` 114 | { 115 | type = mysql 116 | enable = true 117 | server = "127.0.0.1:3306" 118 | database = mqtt 119 | pool_size = 1 120 | username = root 121 | password = public 122 | auto_reconnect = true 123 | ssl = { 124 | enable = true 125 | cacertfile = xxx.ca 126 | certfile = xxx.cert 127 | keyfile = xxx.key 128 | } 129 | query: "select ipaddress, username, clientid, action, permission, topic from mqtt_authz where ipaddr = '${peerhost}' or username = '${username}' or clientid = '${clientid}'" 130 | } 131 | ``` 132 | 133 | ### PostgresSQL 134 | ``` 135 | { 136 | type = postgresql 137 | enable = true 138 | server = "127.0.0.1:5432" 139 | database = mqtt 140 | pool_size = 1 141 | username = root 142 | password = public 143 | auto_reconnect = true 144 | ssl = { 145 | enable = true 146 | cacertfile = xxx.ca 147 | certfile = xxx.cert 148 | keyfile = xxx.key 149 | } 150 | query: "select ipaddress, username, clientid, action, permission, topic from mqtt_authz where ipaddr = '${peerhost}' or username = '${username}' or clientid = '${clientid}'" 151 | 152 | } 153 | ``` 154 | 155 | ### Redis 156 | ``` 157 | { 158 | type = redis 159 | enable = true 160 | redis_type = single 161 | server = "127.0.0.1:6379" 162 | database = 0 163 | pool_size = 1 164 | password = public 165 | auto_reconnect = true 166 | ssl = {enable = false} 167 | cmd = "HGETALL mqtt_authz:${username}" 168 | } 169 | ``` 170 | 171 | ### MongoDB 172 | ``` 173 | { 174 | type = mongodb 175 | enable = true 176 | mongo_type = single 177 | server = "127.0.0.1:27017" 178 | pool_size = 1 179 | database = mqtt 180 | ssl = {enable = false} 181 | collection = mqtt_authz 182 | selector: { "$or": [ { "username": "${username}" }, { "clientid": "${clientid}" } ] } 183 | } 184 | ``` 185 | 186 | ### Management APIs 187 | 188 | #### Get root level settings 189 | 190 | ``` 191 | GET /authorization/settings 192 | RESP: 193 | { 194 | "no_match": "allow" | "deny", 195 | "deny_action": "disconnect" 
| "ignore", 196 | "cache" { 197 | "enable": true, 198 | "max_size": 32, 199 | "ttl": "30m" 200 | } 201 | } 202 | ``` 203 | 204 | #### Update root level settings 205 | ``` 206 | PUT /authorization/settings 207 | BODY: 208 | { 209 | "no_match": "allow" | "deny", 210 | "deny_action": "disconnect" | "ignore", 211 | "cache": { 212 | "enable": true, 213 | "max_size": 32, 214 | "ttl": "30m" 215 | } 216 | } 217 | ``` 218 | 219 | #### Create ACL data sources 220 | ``` 221 | POST /authorization/sources 222 | BODY: 223 | { "type": xxx, ... } 224 | ``` 225 | 226 | #### Get ACL data sources 227 | ``` 228 | GET /authorization/sources 229 | RESP: 230 | [{ "type": xxx }, { "type": xxx }] 231 | ``` 232 | 233 | #### Get detailed source config per type 234 | 235 | ``` 236 | GET /authorization/sources/{type} # mysql,redis,mongodb,postgresql,http.... 237 | RESP: 238 | {"type": "mysql", ...} 239 | ``` 240 | 241 | #### Update (reload) source config per type 242 | 243 | When needed, the underlying resource such as MySQL connection pool should be restarted when handing such update requests. 244 | 245 | ``` 246 | PUT /authorization/sources/{type} # mysql,redis,mongodb,postgresql,http.... 247 | BODY: 248 | {"type": "mysql", ...} 249 | ``` 250 | 251 | #### Delete a source cofnig per type 252 | 253 | ``` 254 | DELETE /authorization/sources/{type} # mysql,redis,mongodb,postgresql,http.... 255 | ``` 256 | 257 | #### Adjust source's position in the chain 258 | 259 | ``` 260 | POST /authorization/sources/{type}/move # mysql,redis,mongodb,postgresql,http.... 261 | { "position": "top" | "bottom" | "after:{type}" | "before:{type}" } 262 | ``` 263 | 264 | #### APIs to manage `file` type source 265 | 266 | ``` 267 | GET /authorization/sources/file 268 | RESP: 269 | { "type": "file", "rules": "...", "path": "..." } 270 | ``` 271 | 272 | ``` 273 | PUT /authorization/sources/file 274 | BODY: 275 | { "type": "file", "rules": "...", "path": "..." 
} 276 | ``` 277 | -------------------------------------------------------------------------------- /implemented/0019-plugins.md: -------------------------------------------------------------------------------- 1 | # EMQ X 5.0 plugins 2 | 3 | ## Change log 4 | 5 | * 2021-11: Guowei Li 6 | * 2021-12-07: @zmstone move it to GitHub 7 | 8 | ## Abstract 9 | 10 | This EIP documents the implementation proposal EMQ X 5.0 plugins. 11 | 12 | ## Background 13 | 14 | Prior to EMQ X 5.0, most of the features are implemented as plugins. A plugin is an Erlang application which registers a set of call back APIs at certain pre-defined hook points. 15 | 16 | The applications are configured and loaded separately after the node is booted. 17 | 18 | In 5.0, although most of the features such as authentication, authorisation, and rule-engine are still implemented by registering hooks, the management of the features are no longer as the old plugins. 19 | 20 | Prior to 4.3, most of the plugins are hosted in their own Git repos. This was changed in 4.3 as an umbrella project. 21 | 22 | All these changes we have made in the past are to provide better user experience as well as development experiences. But the flexibility of plugin applications and how they can be hosted in separate git repos still have their advantages and we should continue supporting it. 23 | 24 | The challenges ahead are 25 | 26 | - We want to continue supporting old plugins, but how to minimise the effort to migrate to 5.0. 27 | Although the hook points are still there, the config format is slightly different, and the schema is also different. (cuttlefish → HOCON) 28 | - How to provide a better management interfaces for external plugins e.g. management API or even UI. 29 | - Callbacks under the same hook point are ordered by their priority, however most of the plugins (internal plugins included) today are using the default value 0, meaning it's ordered by luck. 
30 | 31 | Worth mentioning, since 4.3.2, EMQ X for the first time supported drop-in installation of external plugins. That is, plugins can be compiled and packaged separately (instead of being released as a part of the EMQ X package), see [Develop and deploy EMQ X plugin for Enterprise 4.3](https://emqx.atlassian.net/wiki/spaces/EMQX/blog/2021/05/23/168591472) 32 | 33 | ## References 34 | 35 | - RabbitMQ: [Plugin Development Basics — RabbitMQ](https://www.rabbitmq.com/plugin-development.html) 36 | - Grafana: [Pie Chart](https://grafana.com/grafana/plugins/grafana-piechart-panel/) 37 | - WordPress: [WordPress Plugins](https://wordpress.org/plugins/) 38 | - RabbitMQ Plugin release packages: [Releases · rabbitmq/rabbitmq-management-exchange](https://github.com/rabbitmq/rabbitmq-management-exchange/releases) 39 | 40 | ### High level requirements 41 | 42 | 1. Hook points should be compatible with 4.x 43 | 1. Clearer ordering of all hook callbacks under a given hook point, for example, internal hooks use priority 1000, and external plugins are forced to provide a priority number. If they wish to be ordered before internal plugins they can choose to use a number smaller than 1000. 44 | 1. All built-in features such as authentication and authorisation are not presented as plugins except for 45 | 1. LDAP authentication 46 | 1. PSK authentication 47 | 48 | ### Plugin types 49 | 50 | There are two different kinds of plugins, 'prebuilt' and 'external'. 51 | Pre-built plugins are released as a part of the EMQ X (CE or EE) official release package. 52 | External plugins are developed and released independently. 53 | 54 | ### External plugin security concerns 55 | 56 | An external plugin is loaded and executed as any other EMQ X component without any 57 | access restriction, or scope confinement. 
58 | 59 | EMQ X team's long term plan is to introduce a code review & build platform 60 | (like an app market place) so EMQ X CE users and EE customers can have a trusted 61 | source to download the packages. 62 | 63 | Before the review & build process is in place, 64 | EMQ X's users and customers can only be advised to take 65 | extra care when loading a plugin developed by a third party. 66 | 67 | ### Basic steps to install an external plugin 68 | 69 | - Download compiled zip package 70 | - Upload to a specific dir 71 | - Execute a command to validate & install & enable & uninstall the plugin 72 | 73 | ### Manage plugins from Dashboard UI 74 | 75 | - Manage installation from Dashboard GUI 76 | - upload (and extract, but not persist it) 77 | - install 78 | - uninstall 79 | - Manage a list of installed plugins, supported actions: 80 | - List view 81 | - Show running status: "running" or "stopped". 82 | Status should be presented per-node, e.g. `"status": "running"` for the current node (serving the API), or `"node_status": [{"node": "node1", "status": "running"}, ...]` for a summary view of all nodes in the cluster. 83 | - support actions: "start" or "stop" 84 | 85 | ## Plugin package 86 | 87 | A plugin package is a zip file made of two files inside: 88 | 89 | * A tar file for the compiled beams (and maybe source code too), 90 | * A metadata file in JSON format 91 | 92 | ### Plugin tar 93 | 94 | The tar should be of layout 95 | 96 | ``` 97 | ├── emqx_extplug1 98 | │   ├── LICENSE 99 | │   ├── Makefile 100 | │   ├── README.md 101 | │   ├── ebin 102 | │   │ ├── emqx_extplug1.app 103 | │   │   └── emqx_extplug1.beam 104 | │   ├── etc 105 | │   │   └── emqx_extplug1.conf 106 | │   ├── priv 107 | │   │   └── .. # maybe 108 | │   ├── rebar.config 109 | │   └── src 110 | │   └── ... # maybe 111 | ├── extplug1_dep1 112 | │   ├── LICENSE 113 | ... 
114 | ``` 115 | 116 | ### Plugin metadata 117 | 118 | Inside the plugin zip-package, there should be a JSON file to help describe, identify and validate the package. 119 | 120 | - Name (same as the Erlang application name, it has to be globally unique) 121 | - Version 122 | - Build-datetime 123 | - sha256-checksum-for-tar 124 | - Authors 125 | - Free text Name & Contact information such as email or website 126 | - Builder 127 | - Name 128 | - Contact 129 | - Optional: Builder's website (to find e.g. public key) 130 | - Optional: Builder's signing signature for the package 131 | - URL to source code 132 | - What functionalities (one or more of below) 133 | - authentication 134 | - authorisation 135 | - data_persistence 136 | - rule_engine_extension 137 | - Compatibility 138 | - Compatible with EMQ X version(s), the implicit lower bound of the supported version range is `5.0.0`, version comparison operators are also supported: `~>`, `>=`, `>`, `<=`, `<`, `==` 139 | ref: https://github.com/erlang/rebar3/blob/c102379385013896711bba3969f280f851c67cc7/src/rebar_packages.erl#L376-L392 140 | - Supported OTP releases (has to be the same as EMQ X's supported OTP versions) 141 | 142 | We will perhaps need a rebar3 plugin to help generate the metadata file. 143 | 144 | ## User Interface 145 | 146 | ### Management **APIs** 147 | 148 | - List plugins 149 | 150 | ``` 151 | GET /plugins 152 | [{"metadata": 153 | { "name": "emqx_foobar", 154 | "description": "EMQ X plugin to implement foobar feature", 155 | "version": "0.1.0", 156 | ... 157 | }, 158 | "status": "running" // disabled | running | stopped 159 | "node_status": [...] 160 | },..] 161 | ``` 162 | 163 | The `disabled` state is recognised when the plugin is installed (unzipped), but not configured to be loaded. 164 | 165 | - Get one plugin 166 | 167 | ``` 168 | GET /plugins/{name} 169 | {"metadata": 170 | { "name": "emqx_foobar", 171 | "description": "EMQ X plugin to implement foobar feature", 172 | "version": "0.1.0", 173 | ... 
174 | } 175 | "status": "running" 176 | "node_status": [...] 177 | } 178 | ``` 179 | 180 | - Upload a package 181 | 182 | This API uploads and extracts a package 183 | 184 | NOTE: since this API is a potential security threat, it is disabled by default. 185 | To enable it, one have change the configuration to enable. 186 | Such config change is not possible to be done via http API, only by changing 187 | the conifg file, or from `emqx_ctl` command line. 188 | 189 | ``` 190 | POST /plugins/upload 191 | request: binary data 192 | response: OK | error 193 | ``` 194 | 195 | - Enable / Start / Stop a plugin 196 | 197 | ``` 198 | PUT /plugins/{name} 199 | { 200 | "status": running | stopped | disabled 201 | } 202 | 203 | PUT /nodes/{node}/plugins/{name} 204 | { 205 | "status": running | stopped | disabled 206 | } 207 | ``` 208 | 209 | - Delete a plugin 210 | 211 | ``` 212 | DELETE /plugins/{name} 213 | DELETE /nodes/{node}/plugins/{name} 214 | ``` 215 | -------------------------------------------------------------------------------- /implemented/0020-assets/evacuation-coordinator-statuses-enforce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/evacuation-coordinator-statuses-enforce.png -------------------------------------------------------------------------------- /implemented/0020-assets/evacuation-coordinator-statuses-enforce.uml: -------------------------------------------------------------------------------- 1 | @startuml evacuation-coordinator-statuses-enforce 2 | skinparam monochrome true 3 | skinparam ranksep 20 4 | skinparam dpi 150 5 | skinparam arrowThickness 0.7 6 | skinparam packageTitleAlignment left 7 | skinparam usecaseBorderThickness 0.4 8 | skinparam defaultFontSize 12 9 | 10 | (disabled) --> (evicting_conns) 11 | (evicting_conns) --> (evicting_conns) 12 | (evicting_conns) --> (disabled) 13 | (evicting_conns) 
--> (waiting_takeover) 14 | (waiting_takeover) --> (evicting_sessions) 15 | (waiting_takeover) --> (disabled) 16 | (evicting_sessions) --> (evicting_sessions) 17 | (evicting_sessions) --> (prohibiting) 18 | (evicting_sessions) --> (disabled) 19 | (prohibiting) --> (disabled) 20 | @enduml 21 | -------------------------------------------------------------------------------- /implemented/0020-assets/evacuation-coordinator-statuses-rebalance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/evacuation-coordinator-statuses-rebalance.png -------------------------------------------------------------------------------- /implemented/0020-assets/evacuation-coordinator-statuses-rebalance.uml: -------------------------------------------------------------------------------- 1 | @startuml evacuation-coordinator-statuses-rebalance 2 | skinparam monochrome true 3 | skinparam dpi 150 4 | skinparam arrowThickness 0.7 5 | skinparam usecaseBorderThickness 0.4 6 | skinparam defaultFontSize 12 7 | 8 | 9 | (disabled) --> (wait_health_check) 10 | (wait_health_check) --> (evicting_conns) 11 | (wait_health_check) --> (disabled) 12 | (evicting_conns) --> (evicting_conns) 13 | (evicting_conns) --> (wait_takeover) 14 | (evicting_conns) --> (disabled) 15 | (wait_takeover) --> (evicting_sessions) 16 | (wait_takeover) --> (disabled) 17 | (evicting_sessions) --> (evicting_sessions) 18 | (evicting_sessions) --> (disabled) 19 | @enduml 20 | 21 | -------------------------------------------------------------------------------- /implemented/0020-assets/eviction-agent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/eviction-agent.png -------------------------------------------------------------------------------- 
/implemented/0020-assets/node-rebalance-algo0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance-algo0.png -------------------------------------------------------------------------------- /implemented/0020-assets/node-rebalance-algo1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance-algo1.png -------------------------------------------------------------------------------- /implemented/0020-assets/node-rebalance-algo2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance-algo2.png -------------------------------------------------------------------------------- /implemented/0020-assets/node-rebalance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance.png -------------------------------------------------------------------------------- /implemented/0021-assets/aws-mqtt-file-delivery.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/aws-mqtt-file-delivery.png -------------------------------------------------------------------------------- /implemented/0021-assets/aws-mqtt-file-delivery.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | !theme blueprint 3 | == Use DescribeStream to get stream data == 4 | Client -> Server: Subscribe('/description') 5 | Server 
--> Client: Sub Ack 6 | Client -> Server: Subscribe('/rejected') 7 | Server --> Client: Sub Ack 8 | Client -> Server: Publish('/describe') 9 | note right 10 | DescribeStreamRequest 11 | { 12 | "c": "ec944cfb-1e3c-49ac-97de-9dc4aaad0039" 13 | } 14 | "c" client token field 15 | end note 16 | Server --> Client: Publish('/description') 17 | note left 18 | DescribeStreamResponse 19 | { 20 | "c": "ec944cfb-1e3c-49ac-97de-9dc4aaad0039", 21 | "s": 1, 22 | "d": "This is the description of stream ABC.", 23 | "r": [ 24 | { 25 | "f": 0, 26 | "z": 131072 27 | }, 28 | { 29 | "f": 1, 30 | "z": 51200 31 | } 32 | ] 33 | } 34 | "c": client token field 35 | "s": stream version as an integer 36 | "r": contains a list of the files in the stream. 37 | "f": stream file ID as an integer. 38 | "z": stream file size in number of bytes. 39 | "d": description of the stream. 40 | end note 41 | == Get data blocks from a stream file == 42 | Client -> Server: Subscribe('/data') 43 | Server --> Client: Sub Ack 44 | Client -> Server: Publish('/get') 45 | note right 46 | GetStreamRequest 47 | { 48 | "c": "1bb8aaa1-5c18-4d21-80c2-0b44fee10380", 49 | "s": 1, 50 | "f": 0, 51 | "l": 4096, 52 | "o": 2, 53 | "n": 100, 54 | "b": "..." 55 | } 56 | [optional] "c": client token field 57 | [optional] "s": stream version field 58 | "f": stream file ID 59 | "l": data block size in bytes 60 | [optional] "o": offset of the block in the stream file 61 | [optional] "n": number of blocks requested 62 | [optional] "b": bitmap that represents the blocks being requested 63 | end note 64 | Server --> Client: Publish('/data') 65 | note left 66 | GetStreamResponse 67 | { 68 | "c": "1bb8aaa1-5c18-4d21-80c2-0b44fee10380", 69 | "f": 0, 70 | "l": 4096, 71 | "i": 2, 72 | "p": "..." 
73 | } 74 | "c": client token field 75 | "f": ID of the stream file 76 | "l": size of the data block payload in bytes 77 | "i": ID of the data block contained in the payload 78 | "p": the data block payload (base64) 79 | end note 80 | Server --> Client: Publish('/data') 81 | note left 82 | GetStreamResponse 83 | end note 84 | Server --> Client: Publish('/data') 85 | note left 86 | GetStreamResponse 87 | end note 88 | @enduml 89 | -------------------------------------------------------------------------------- /implemented/0021-assets/flow-abort.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-abort.png -------------------------------------------------------------------------------- /implemented/0021-assets/flow-abort.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | !theme blueprint 3 | Client -> Broker: PUBLISH('$file/{fileId}/0/{sha256}') 4 | note right 5 | Payload: 6 | end note 7 | Broker -> Broker: store {filepath}/{filename} at 0, 1kB 8 | Broker --> Client: PUBACK 0x00 9 | Client -> Broker: PUBLISH('$file/{fileId}/abort') 10 | Broker -> Broker: delete {filepath}/{filename} 11 | Broker --> Client: PUBACK 0x00 12 | @enduml 13 | -------------------------------------------------------------------------------- /implemented/0021-assets/flow-async-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-async-1.png -------------------------------------------------------------------------------- /implemented/0021-assets/flow-async-1.uml: -------------------------------------------------------------------------------- 1 | @startuml flow-async-1 2 | !theme blueprint 3 | Client -> Broker: PUBLISH(pktid=1, 
topic=$file-async/[COMMAND]) 4 | Broker --> Client: PUBACK(pktid=1, rc=0) 5 | note right 6 | Operation start 7 | end note 8 | Client -> Broker: PUBLISH(pktid=2, topic=...) 9 | Broker --> Client: PUBACK(pktid=2, rc=...) 10 | Broker -> "$file-response/{clientId}": PUBLISH $file-response/{clientId} 11 | note left 12 | Operation end 13 | end note 14 | note right 15 | { 16 | "vsn": "0.1", 17 | "topic": "$file-async/[COMMAND]", 18 | "packet_id": 1, 19 | "reason_code": 0, 20 | "reason_description": "success" 21 | } 22 | end note 23 | @enduml 24 | -------------------------------------------------------------------------------- /implemented/0021-assets/flow-happy-path.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-happy-path.png -------------------------------------------------------------------------------- /implemented/0021-assets/flow-happy-path.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | !theme blueprint 3 | Client -> Client: generate UUID=8568BA42-.. 
4 | Client -> Broker: PUBLISH('$file/8568BA42-../init') 5 | note right 6 | { 7 | "name": "ml-logs-data.log", 8 | "expire_at": 1696659943, 9 | "size": 3075 10 | } 11 | end note 12 | Broker --> Client: PUBACK 0x00 13 | Client -> Client: read segment #0, calculate sha256 14 | Client -> Broker: PUBLISH('$file/8568BA42-../0/{sha256}') 15 | note right 16 | Payload: 17 | end note 18 | Broker -> Broker: verify checksum - ok 19 | Broker -> Broker: store logs/data-log at 0, 1kB 20 | Broker --> Client: PUBACK 0x00 21 | Client -> Client: read segment #1, calculate sha256 22 | Client -> Broker: PUBLISH('$file/8568BA42-../1024/{sha256}') 23 | note right 24 | Payload: 25 | end note 26 | Client -> Client: read segment #2, calculate sha256 27 | Client -> Broker: PUBLISH('$file/8568BA42-../2048/{sha256}') 28 | note right 29 | Payload: 30 | end note 31 | Client -> Client: read segment #3, calculate sha256 32 | Client -> Broker: PUBLISH('$file/8568BA42-../3072/{sha256}') 33 | note right 34 | Payload: 35 | end note 36 | Broker -> Broker: verify checksum - ok 37 | Broker -> Broker: verify checksum - ok 38 | Broker -> Broker: verify checksum - ok 39 | Broker -> Broker: store logs/data-log at 1024, 2051 bytes 40 | Broker --> Client: PUBACK 0x00 41 | Broker --> Client: PUBACK 0x00 42 | Broker --> Client: PUBACK 0x00 43 | Client -> Broker: PUBLISH('$file/8568BA42-../fin/3075/{sha256}') 44 | Broker -> Broker: verify checksum - ok 45 | Broker -> Broker: finalize logs/data-log 46 | Broker -> Storage: upload logs/data-log 47 | Storage --> Broker: upload ok 48 | Broker --> Client: PUBACK 0x00 49 | Broker --> Broker: cleanup 50 | @enduml 51 | -------------------------------------------------------------------------------- /implemented/0021-assets/flow-restart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-restart.png 
-------------------------------------------------------------------------------- /implemented/0021-assets/flow-restart.uml: -------------------------------------------------------------------------------- 1 | @startuml 2 | !theme blueprint 3 | Client -> Broker: PUBLISH('$file/{fileId}/0/{sha256}') 4 | note right 5 | Payload: 6 | end note 7 | Broker --> Broker: verify checksum - failed 8 | Broker --> Client: PUBACK 0x80 9 | Client -> Broker: PUBLISH('$file/{fileId}/0/{sha256}') 10 | note right 11 | Payload: 12 | end note 13 | Broker --> Broker: verify checksum - ok 14 | Broker -> Broker: store {filepath}/{filename} at 0, 1kB 15 | Broker --> Client: PUBACK 0x00 16 | @enduml 17 | -------------------------------------------------------------------------------- /implemented/0021-assets/flow-sync-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-sync-1.png -------------------------------------------------------------------------------- /implemented/0021-assets/flow-sync-1.uml: -------------------------------------------------------------------------------- 1 | @startuml flow-sync-1 2 | !theme blueprint 3 | Client -> Broker: PUBLISH(pktid=1, topic=$file/[COMMAND]) 4 | note right 5 | Operation start 6 | end note 7 | Broker --> Client: PUBACK(pktid=1, rc=0) 8 | note right 9 | Operation end 10 | end note 11 | @enduml 12 | -------------------------------------------------------------------------------- /implemented/0021-assets/flow-sync-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-sync-2.png -------------------------------------------------------------------------------- /implemented/0021-assets/flow-sync-2.uml: 
-------------------------------------------------------------------------------- 1 | @startuml flow-sync-2 2 | !theme blueprint 3 | Client -> Broker: PUBLISH(pktid=1, topic=$file/[COMMAND]) 4 | note right 5 | Operation start 6 | end note 7 | Client -> Broker: PUBLISH(pktid=2, topic=...) 8 | Broker --> Client: PUBACK(pktid=2, rc=...) 9 | Broker --> Client: PUBACK(pktid=1, rc=0) 10 | note right 11 | Operation end 12 | end note 13 | @enduml 14 | -------------------------------------------------------------------------------- /implemented/0021-transfer-files-over-mqtt.md: -------------------------------------------------------------------------------- 1 | # File transfer over MQTT 2 | 3 | ## Changelog 4 | 5 | * 2022-09-20: @qzhuyan, @id, @zmstone Initial draft 6 | 7 | ## Abstract 8 | 9 | This document defines protocol to send files from MQTT clients to MQTT server. It is using only `PUBLISH` and `PUBACK` messages and does not require MQTT 5.0 features like topic aliases. 10 | 11 | ## Motivation 12 | 13 | EMQX customers are asking for file transfer functionality from IoT devices to their cloud (primary use case), and from cloud to IoT devices over MQTT. Right now they are uploading files from devices via FTP or HTTPS (e.g. 
to S3), but this approach has downsides: 14 | 15 | * FTP and HTTP servers usually struggle to keep up with large number of simultaneous bandwidth-intensive connections 16 | * packet loss or reconnect forces clients to restart the transfer 17 | * devices which already talk MQTT need to integrate with one more SDK, address authentication and authorization, and potentially go through an additional round of security audit 18 | 19 | Known cases of device-to-cloud file transfer: 20 | 21 | * [CAN bus](https://en.wikipedia.org/wiki/CAN_bus) data 22 | * Image taken by industry camera for Quality Assurance 23 | * Large data file collected from forklift 24 | * Video and audio data from truck cars, and video data captured by inbound unloading cameras 25 | * Vehicle real-time logging, telemetry, messaging 26 | * Upload collected ML logs 27 | 28 | Known cases of cloud-to-device file transfer: 29 | 30 | * Upload AI/ML models 31 | * Firmware upgrades 32 | 33 | Even though devices could already send binary data in MQTT packets, it is not trivial to guarantee reliable file transfer without clear expectations from client and server side. 34 | 35 | ## Terminology 36 | 37 | The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [BCP 14](https://www.rfc-editor.org/bcp/bcp14) [[RFC2119](https://www.rfc-editor.org/rfc/rfc2119)] [[RFC8174](https://www.rfc-editor.org/rfc/rfc8174)] when, and only when, they appear in all capitals, as shown here. 
38 | 39 | The following terms are used as described in [MQTT Version 5.0 Specification](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html): 40 | * Application Message 41 | * Server 42 | * Client 43 | * Topic 44 | * Topic Name 45 | * Topic Filter 46 | * MQTT Control Packet 47 | 48 | *At least once*: a message can be delivered many times, but cannot be lost 49 | 50 | ## Requirements 51 | 52 | * The protocol MUST use only PUBLISH type of MQTT Control Packet 53 | * The protocol MUST support transfer of file segments 54 | * Server MUST be able to verify integrity of each file segment and of the whole file 55 | * Client MAY know total file size when initiating the transfer 56 | * Client MAY abort file transfer 57 | * Server MAY ask the client to pause file transfer 58 | * Server MAY ask the client to abort file transfer 59 | * The protocol MUST guarantee "At least once" delivery 60 | * Server MUST NOT support subscription on topics dedicated for file transfer 61 | 62 | ## AWS IoT MQTT-based file delivery (reference design) 63 | 64 | As an example of existing implementation we can look at AWS IoT Core [which provides functionality](https://docs.aws.amazon.com/iot/latest/developerguide/mqtt-based-file-delivery.html) to [deliver files to IoT devices](https://docs.aws.amazon.com/iot/latest/developerguide/mqtt-based-file-delivery-in-devices.html): 65 | 66 | ![](0021-assets/aws-mqtt-file-delivery.png) 67 | 68 | ## Design 69 | 70 | ### Overview 71 | 72 | * Files are split in segments, segments can be of arbitrary length 73 | * Client SHOULD generate unique identifier for each file being transferred and use it as `fileId` in Topic Name (UUID according to [RFC 4122](https://www.rfc-editor.org/rfc/rfc4122) is recommended) 74 | * Client SHOULD consider `fileId` as a unique identifier for the file transfer, and MUST NOT reuse it for other file transfers 75 | * Broker SHOULD consider `clientId` + `fileId` pair as a Broker-wide unique identifier for the file transfer 76 
| * Client MAY calculate SHA-256 checksum of the segment it's about to send and send it as part of Topic Name 77 | * Client MAY calculate SHA-256 checksum of the file it's about to send and include it in the `init` message payload or send it as part of the `fin` message 78 | * If Client chooses to provide checksum for file segments, whole file, or both, it MUST use [SHA-256](https://www.rfc-editor.org/rfc/rfc6234) 79 | * If checksum is included in the `init` message payload, the Broker MUST use it to verify integrity of the file after receiving the `fin` message for the corresponding file transfer 80 | * If checksum is included in topic name, Broker MUST use it to verify integrity of corresponding data: 81 | * segment, if it's a segment transfer message 82 | * whole file, if it's a `fin` message 83 | * If checksum verification fails, Broker MUST reject the corresponding data 84 | * Client MUST use a Topic starting with `$file/` (`$file-async/`) to transfer files. 85 | * Broker MUST NOT let clients subscribe to Topics starting with `$file/` (`$file-async/`) 86 | * Segment length can be calculated on the server side by subtracting the length of the [Variable Header](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901025) from the [Remaining Length](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901105) field that is in the [Fixed Header](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901021) 87 | 88 | ### Protocol flow 89 | 90 | Data is transferred in PUBLISH packets in the following order: 91 | 92 | 1. `$file[-async]/{fileId}/init` 93 | 94 | ``` 95 | { 96 | "name": "ml-logs-data.log", 97 | "size": 12345, 98 | "checksum": "1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef", 99 | "expire_at": 1696659943, 100 | "segments_ttl": 600 101 | } 102 | ``` 103 | 104 | 2. `$file[-async]/{fileId}/{offset}[/{checksum}]` 105 | 106 | ``` 107 | 108 | ``` 109 | 110 | 3. 
`$file[-async]/{fileId}/{offset}[/{checksum}]` 111 | 112 | ``` 113 | 114 | ``` 115 | 116 | 4. ... 117 | 118 | 5. `$file[-async]/{fileId}/[fin/{fileSize}[/{checksum}] | abort]` 119 | 120 | No payload 121 | 122 | 123 | ### Sync mode vs Async mode 124 | 125 | File transfer individual commands may be handled in two modes: synchronous and asynchronous. The mode is chosen by the client, by sending commands either to `$file/...` (sync mode) or `$file-async/...` (async mode) topics. 126 | 127 | Client is free to mix arbitrarily sync and async commands in the same file transfer session. 128 | 129 | #### Synchronous mode 130 | 131 | In synchronous mode, the successful/unsuccessful status of individual operations is communicated to the client via Reason Code field of MQTT `PUBACK` messages. See [Reason codes](#reason-codes) for details. 132 | 133 | ![](./0021-assets/flow-sync-1.png) 134 | 135 | Caveats: 136 | 137 | * Some operations (`fin` command) may take considerable time to complete. So if a client wants to utilize the session while waiting for the result, it should either use async mode or implement some kind of asynchronous logic itself, that is to deal with several unacked PUBLISH messages in parallel, like this: 138 | 139 | ![](./0021-assets/flow-sync-2.png) 140 | * MQTTv3 clients do not support reason codes, so async mode is the preferred option for them. 141 | 142 | #### Asynchronous mode 143 | 144 | In asynchronous mode, the logic of command handling is the following: 145 | * Client sends a command to `$file-async/...` topic. 146 | * Broker responds with `PUBACK`. Nonzero Reason Code indicates immediate failure. 147 | * Zero Reason Code indicates that the command has been accepted for processing. The client is expected to wait for the actual result of the command via `$file-response/{clientId}` topic. 
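As a rough illustration of the two modes, the sketch below builds command topics for either mode, plus the default response topic a client would subscribe to before issuing async commands. The helper names `command_topic` and `response_topic` are hypothetical and not part of this specification; only the topic layout is taken from the text above.

```python
def command_topic(file_id: str, command: str, use_async: bool = False) -> str:
    """Build the topic for a file transfer command.

    `command` is everything after the file id, e.g. "init", "0" (a segment
    offset), "fin/3075" or "abort". Hypothetical helper, for illustration only.
    """
    prefix = "$file-async" if use_async else "$file"
    return f"{prefix}/{file_id}/{command}"


def response_topic(client_id: str) -> str:
    """Default topic on which results of async commands are published."""
    return f"$file-response/{client_id}"


# Sync and async commands may be mixed within the same transfer.
print(command_topic("8568BA42", "init"))
print(command_topic("8568BA42", "fin/3075", use_async=True))
print(response_topic("device-1"))
```

A client using async mode would subscribe to `response_topic(client_id)` first and only then publish to the `$file-async/...` topics.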
148 | 149 | ![](./0021-assets/flow-async-1.png) 150 | 151 | The result of the command is a JSON document with the following fields: 152 | ```json 153 | { 154 | "vsn": "0.1", 155 | "topic": "$file-async/[COMMAND]", 156 | "packet_id": 1, 157 | "reason_code": 0, 158 | "reason_description": "success" 159 | } 160 | ``` 161 | 162 | | Field | Description | 163 | |-------|-------------| 164 | | `vsn` | response document format version | 165 | | `topic` | the topic of the command that the response is for, e.g. `$file-async/somefileid/init` | 166 | | `packet_id` | the MQTT packet id of the command packet (`PUBLISH`) that the response is for | 167 | | `reason_code` | the result code of the command execution. See [Reason Codes](#reason-codes) for details | 168 | | `reason_description` | the human-readable description of the Reason Code | 169 | 170 | JSON Schema is available [here](https://github.com/emqx/mqtt-file-transfer-schema/blob/v1.1.0/response.json). 171 | 172 | Notes: 173 | * The operation result is always sent to the response topic (both in case of immediate failure and in case of actual processing). So the response topic may be used as the only source of information about the operation result. Also, this is the only option available for MQTTv3 clients. 174 | * The response topic is client-specific. The client should subscribe to it before sending the command. 175 | * The client may override the response topic using the `REQUEST/RESPONSE` [pattern](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901253). 176 | 177 | #### Reason Codes 178 | 179 | In the subsequent sections, the _reason code_ of a file transfer operation means the _final_ reason code of the command execution, that is, the Reason Code of `PUBACK` packet in the sync mode or the `reason_code` response document field value in the async mode. 180 | 181 | ### Protocol messages 182 | 183 | #### `$file[-async]/{fileId}/init` message 184 | 185 | Initialize the file transfer. 
Server is expected to store metadata from the payload in the session along with `{fileId}` as a reference for the rest of file transfer. 186 | 187 | * Qos=1 188 | * Payload Format Indicator=0x01 189 | * `{fileId}` is corresponding file UUID 190 | * Payload is a JSON document 191 | 192 | Getting a successful reason code from the broker means that the file transfer has been initialized successfully, and the metadata has been persisted in the storage. 193 | 194 | Broker MAY refuse to accept the file transfer in case of the metadata conflict, e.g. if the transfer with the same `{fileId}` from the same Client has different `name` or `checksum` value. Client is expected to start the transfer with a different `{fileId}`. 195 | 196 | Broker MAY abort incomplete file transfers after their respective sessions have been discarded, and clean up any resources associated with them. 197 | 198 | Broker MAY refuse the file transfer if the `fileId` is too long, but generally `fileId`s of up to 255 bytes (in UTF-8 encoding) should be safe to use. 199 | 200 | ##### `init` payload JSON Schema 201 | 202 | Available [here](https://github.com/emqx/mqtt-file-transfer-schema/blob/v1.1.0/init.json). 203 | 204 | * Broker MAY use `name` value as a filename in a file system 205 | 206 | This generally means that it SHOULD NOT contain path separators, and SHOULD NOT contain characters or sequences 207 | of characters that are not allowed in filenames in the file system where the file is going to be stored. Also, 208 | the filename SHOULD be limited to 255 bytes (in UTF-8 encoding). 209 | 210 | * Broker SHOULD consider `size` value as informational only, given it's not required to be provided by the client 211 | 212 | Mandatory file size should be specified in the `fin` message Topic anyway, and may be different from the value 213 | provided in the `size` field. The `size` field may be used for example to calculate the progress of the transfer, 214 | which thus may be inaccurate. 
215 | 216 | * Broker SHOULD have default setting for `segments_ttl` 217 | 218 | * Broker MAY delete segments of unfinished file transfers when their TTL has expired 219 | 220 | * Broker MAY NOT honor `segments_ttl` value that is either too large or too small 221 | 222 | What means _too large_ or _too small_ is up to the Broker implementation and/or configuration. 223 | 224 | #### `$file[-async]/{fileId}/{offset}[/{checksum}]` message 225 | 226 | One such message for each file segment. 227 | 228 | * Qos=1 229 | * Payload Format Indicator=0x00 230 | * Payload is file segment bytes 231 | * `{offset}` is byte offset of the given segment 232 | * optional `{checksum}` is SHA-256 checksum of the file segment 233 | 234 | Getting a successful reason code from the Broker means that the file segment has been verified (if checksum was provided) and successfully persisted in the storage. 235 | 236 | #### `$file/{fileId}/fin/{fileSize}[/{checksum}]` message 237 | 238 | All file segments have been successfully transferred. 239 | 240 | * Qos=1 241 | * no payload 242 | * optional `{checksum}` is SHA-256 checksum of the file 243 | 244 | Getting a successful reason code from the Broker means that the file being transferred is ready to be used. This implies a lot of things: 245 | * Broker has verified that it has corresponding metadata for the file 246 | * Broker has verified that it has all the segments of the file up to `{fileSize}` persisted in the storage 247 | * Broker has verified the file integrity (if checksum was provided) 248 | * Broker has published the file along with its metadata to the location where it can be accessed by other users 249 | 250 | In cases when checksum was provided both in the `init` message and in the `fin` message, Broker MUST ignore the former and use the latter. 251 | 252 | Clients MUST expect that handling of the `fin` message may take considerable time, depending on the file size and the 253 | Broker implementation or configuration. 
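The segment and `fin` messages described above can be sketched on the client side as follows. This is a minimal sketch, not a reference client: `plan_transfer` is a hypothetical helper, the checksums are assumed to be hex-encoded (as the `checksum` value in the `init` example suggests), and the `init` message and the actual MQTT publishing (QoS 1) are left out.

```python
import hashlib


def plan_transfer(file_id: str, data: bytes, segment_size: int = 1024):
    """Return (topic, payload) pairs for all segment messages plus `fin`.

    Each segment topic carries the byte offset and the SHA-256 checksum of
    that segment; the `fin` topic carries the total size and the checksum
    of the whole file. Hex encoding of the digests is an assumption.
    """
    messages = []
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        checksum = hashlib.sha256(segment).hexdigest()
        messages.append((f"$file/{file_id}/{offset}/{checksum}", segment))

    file_checksum = hashlib.sha256(data).hexdigest()
    # The `fin` message has no payload.
    messages.append((f"$file/{file_id}/fin/{len(data)}/{file_checksum}", b""))
    return messages
```

For a 3075-byte file and 1024-byte segments this yields segments at offsets 0, 1024, 2048 and 3072, followed by `fin/3075`, which matches the offsets used in the happy-path diagram.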
254 | 255 | #### `$file/{fileId}/abort` message 256 | 257 | Client wants to abort the transfer. 258 | 259 | * Qos=1 260 | * no payload 261 | 262 | ### Durability 263 | 264 | This specification does not define how reliably the file transfer data SHOULD be persisted. It is up to the Broker implementation what specific durability guarantees it provides (e.g. datasync or replication factor). However, Broker is expected to support transfers that are interrupted by a network failure, Broker restart, or Client reconnect. 265 | 266 | ### Reason codes 267 | 268 | | Reason code | MQTT Name | Meaning in file transfer context | 269 | |-------------|-------------------------------|-----------------------------------------------------| 270 | | OMIT | | Same as 0x00 | 271 | | 0x00 | Success | File segment has been successfully persisted | 272 | | 0x10 | No matching subscribers | Server asks Client to retransmit all segments | 273 | | 0x80 | Unspecified error | For segment transmission, Server asks Client to retransmit the segment. For `fin`, Server asks Client to retransmit all segments | 274 | | 0x83 | Implementation specific error | Server asks Client to cancel the transfer | 275 | | 0x97 | Quota exceeded | Server asks Client to pause the transfer | 276 | 277 | #### 0x97, "Quota exceeded", "Pause Transfer" 278 | 279 | Client is expected to wait before trying to retransmit file segment again. 280 | 281 | ### PUBACK from MQTT servers < v5.0 282 | 283 | `PUBACK` messages prior to MQTT v5.0 do not carry Reason Code, so the only way for the client to know the result of the operation is to use async mode and wait for the response on the response topic. 
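To make the reason code table concrete, here is a small sketch of how a client might turn a final reason code into its next step. The action names and both functions are hypothetical; the mapping itself follows the table above, and raising an error for unlisted codes is an assumption, since the table does not say how to treat them.

```python
import json
from typing import Optional

# Client-side actions derived from the reason code table (names are hypothetical).
CONTINUE = "continue"
RETRANSMIT_SEGMENT = "retransmit_segment"
RETRANSMIT_ALL = "retransmit_all_segments"
CANCEL = "cancel_transfer"
PAUSE = "pause_transfer"


def client_action(reason_code: Optional[int], is_fin: bool = False) -> str:
    """Map the final reason code of a segment or `fin` command to an action.

    `None` stands for an omitted reason code, which is treated as 0x00.
    """
    if reason_code is None or reason_code == 0x00:
        return CONTINUE
    if reason_code == 0x10:
        return RETRANSMIT_ALL
    if reason_code == 0x80:
        return RETRANSMIT_ALL if is_fin else RETRANSMIT_SEGMENT
    if reason_code == 0x83:
        return CANCEL
    if reason_code == 0x97:
        return PAUSE
    # Assumption: codes outside the table are surfaced to the caller.
    raise ValueError(f"reason code {reason_code:#04x} is not listed in the table")


def final_reason_code(response_payload: bytes) -> int:
    """Extract `reason_code` from an async-mode response document."""
    return json.loads(response_payload)["reason_code"]
```

In sync mode the input would be the Reason Code of the `PUBACK`; in async mode, and for clients older than MQTT v5.0, it would come from `final_reason_code` applied to the message received on the response topic.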
284 | 285 | ### Happy path 286 | 287 | ![](0021-assets/flow-happy-path.png) 288 | 289 | ### Transfer abort initiated by client 290 | 291 | ![](0021-assets/flow-abort.png) 292 | 293 | ### Transfer restart initiated by server 294 | 295 | ![](0021-assets/flow-restart.png) 296 | 297 | ## Backwards Compatibility 298 | 299 | Full backward compatibility with MQTT 5.x and MQTT 3.x. 300 | 301 | ## Innovation Opportunities 302 | 303 | * Integration with rule engine 304 | * Pluggable storage backends 305 | * Multiple backend writes to a single file 306 | * Do the entire MQTT message logging 307 | * QUIC pure binary stream support 308 | * ACL enables client-side control of file size 309 | * Bulk upload at EMQX 310 | * Multi-node local cache utilization similar to HDFS 311 | * Cheaper (in computational costs) checksum algorithm 312 | * Server-to-client file transfer protocol 313 | 314 | ## Declined Alternatives 315 | 316 | * Use of MQTT extension headers 317 | * Poor compatibility and complex application layer implementation 318 | * Capability negotiation 319 | * Poor client compatibility, complex application layer implementation 320 | * Client supplied file name as file identifier instead of UUID 321 | * potential security issues 322 | * potential name clashing issues 323 | * checksum value instead of UUID 324 | * will not work if client does not have full file when initiating the transfer 325 | * checksum as part of the payload, not as part of topic name 326 | * Requires specification for payload serialization format leading to more complicated client code 327 | * File segment offset as part of the payload 328 | * Same as above, Requires specification for payload serialization format and more complicated code 329 | * Using constant segment size (per file) and sequential segment number instead of byte offset to reduce packet size 330 | * Clients may want to change segment size dynamically to account for changes in network properties (e.g. 
moving from faster to slower and spotty network, or vice versa) 331 | 332 | -------------------------------------------------------------------------------- /implemented/0022-forward-check-install-upgrade-script.md: -------------------------------------------------------------------------------- 1 | # Forward compatibility check for install upgrade script (5.0) 2 | 3 | ## Changelog 4 | 5 | * 2022-12-07: @thalesmg Initial draft 6 | * 2022-12-13: @thalesmg Added comment about downgrading 7 | 8 | ## Abstract 9 | 10 | This proposes a way to circumvent the limitation that the hot upgrade 11 | installation scripts that are executed cannot be changed after a 12 | package is released, making it impossible to fix bugs or change 13 | upgrade logic. 14 | 15 | ## Motivation 16 | 17 | This section should clearly explain why the functionality proposed by this EIP 18 | is necessary. EIP submissions without sufficient motivation may be rejected 19 | outright. 20 | 21 | Currently, when performing a hot upgrade, the scripts that are run are 22 | those from the currently installed EMQX version. They include a validation 23 | that prevents upgrades between different minor versions. If we want to 24 | allow an upgrade from, say, 5.0.x to 5.1.y, then the scripts in 25 | already released packages will deny such an operation. Also, if an 26 | upgrade installation script in the current version contains a bug, it will 27 | never be able to execute properly without manual patching. 28 | 29 | By attempting to execute the scripts from the _target version_, we may 30 | add fixes and new validations to new EMQX versions and have them 31 | executed by older versions. 32 | 33 | ## Design 34 | 35 | ### Current upgrade procedure 36 | 37 | 1. A tarball with a conventional filename format 38 | (`-.tar.gz`) is placed in the `releases` 39 | directory of the currently running EMQX installation. Typically 40 | this means `/usr/lib/emqx/releases`. 41 | 2. The user runs `emqx {install,upgrade,unpack} -`. 42 | 3. 
The `emqx` script then calls `install_upgrade.escript` **from the 43 | currently running version** with some info alongside the desired 44 | operation and new version. 45 | 4. The script then processes the desired operation. 46 | 47 | Currently, a versioned copy of `install_upgrade.escript` and `emqx` 48 | are already installed with EMQX in the `bin` directory. That means 49 | `install_upgrade.escript-` and `emqx-`, 50 | respectively. 51 | 52 | ### Proposed new procedure 53 | 54 | 1. First of all, we check if the currently installed and the target 55 | release profiles match. For example, if the enterprise edition is 56 | currently installed and the user attempts to hot-upgrade using a 57 | community edition package (`emqx-.tar.gz`) or vice-versa, 58 | the operation should abort. 59 | - This can be done by checking the `$IS_ENTERPRISE` variable that 60 | is set in `emqx_vars` and loaded by `bin/emqx` against the 61 | package filename: `emqx-.tar.gz` for community edition, 62 | `emqx-enterprise-.tar.gz` for enterprise edition. 63 | 2. In the `emqx` script, we parse the desired operation and target 64 | version and check if scripts called 65 | `install_upgrade.escript-` and `emqx-` exist in the `bin` directory. 67 | - If it does, it means that the target version was already unpacked 68 | at some point, and we just execute `emqx-` 69 | passing it the necessary info. 70 | - We also need to check inside the `emqx` script if it is the 71 | `emqx-` script itself, to avoid an infinite loop. 72 | 3. If any such file does not exist, then, without executing the 73 | currently installed `install_upgrade.escript` file: 74 | 1. We check if the `-.tar.gz` file is at 75 | the expected location (`releases`) and it's readable, bailing 76 | out otherwise. 77 | 2. We extract _only_ the `install_upgrade.escript-` 78 | and `emqx-` files to the `bin` directory. 79 | 3. Then we just call `emqx-` with the same 80 | arguments: `emqx- {install,unpack,upgrade} 81 | `. 
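The decision flow of the proposed procedure can be sketched as a small pure function. This is only an illustration, not the actual `bin/emqx` logic: the function name `plan_upgrade`, the returned action labels, and the assumption that versioned scripts are suffixed with the target version are all hypothetical.

```python
from typing import Iterable

# Enterprise packages are recognised by their filename prefix (step 1 above).
ENTERPRISE_PREFIX = "emqx-enterprise-"


def package_is_enterprise(package_filename: str) -> bool:
    """Return True if the package filename denotes the enterprise edition."""
    return package_filename.startswith(ENTERPRISE_PREFIX)


def plan_upgrade(
    is_enterprise: bool,
    package_filename: str,
    target_version: str,
    bin_files: Iterable[str],
    running_script: str,
) -> str:
    """Decide what the `emqx` script should do next (hypothetical labels).

    Returns one of: "abort", "run", "delegate", "extract_then_delegate".
    """
    # Step 1: installed and target release profiles must match.
    if is_enterprise != package_is_enterprise(package_filename):
        return "abort"

    # Assumption: versioned scripts carry the target version as a suffix.
    versioned_emqx = f"emqx-{target_version}"
    versioned_escript = f"install_upgrade.escript-{target_version}"

    # Loop guard: we already are the target version's script.
    if running_script == versioned_emqx:
        return "run"

    # Step 2: both scripts were unpacked earlier, so just hand over.
    present = set(bin_files)
    if versioned_emqx in present and versioned_escript in present:
        return "delegate"

    # Step 3: extract only the two scripts from the package, then hand over.
    return "extract_then_delegate"
```

Keeping the decision separate from the file system and tarball handling makes the profile check and the infinite-loop guard easy to test in isolation.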
82 | 83 | For _downgrading_, since a newer script might have bug fixes and also 84 | have laxer restrictions on the source and target versions (for 85 | example, an older script might forbid 5.0 <-> 5.1, but the newer one 86 | allows it), we have to use the _newer_ script as well. Otherwise the 87 | user might risk getting stuck at the newer version without being able 88 | to downgrade. 89 | 90 | ## Configuration Changes 91 | 92 | No configuration changes needed. 93 | 94 | ## Backwards Compatibility 95 | 96 | EMQX 5.0, at the time of writing, has not been tracking appup 97 | changes for hot upgrades, so this change shouldn't pose backwards 98 | compatibility issues. 99 | 100 | ## Document Changes 101 | 102 | The documentation for 5.0 currently doesn't even have a section for 103 | hot upgrades, so it'll need to be ported from 4.x. 104 | 105 | ## Testing Suggestions 106 | 107 | The final implementation must include unit test or common test code. If 108 | further tests, such as integration or benchmarking tests, need to be done 109 | manually, list them here. 110 | 111 | ## Declined Alternatives 112 | 113 | No prior alternatives discussed.
114 | -------------------------------------------------------------------------------- /implemented/0023-assets/trace-export-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0023-assets/trace-export-example.png -------------------------------------------------------------------------------- /implemented/0023-opentelemetry-trace-integration.md: -------------------------------------------------------------------------------- 1 | # OpenTelemetry Traces Integration 2 | 3 | ## Changelog 4 | - 2023-09-28: Initial draft 5 | - 2023-10-04: 6 | - Apply review remarks (define trace spans for the first iterations) 7 | - Add description of `emqx_external_trace` behaviour 8 | - 2023-12-26: 9 | - Update the document to match the actual implementation and move to implemented 10 | 11 | ## Abstract 12 | 13 | This document describes the design proposal for the EMQX OpenTelemetry integration. 14 | 15 | More details about related components, concepts and conventions can be found in the following resources: 16 | 17 | - [OpenTelemetry Erlang lib documentation](https://opentelemetry.io/docs/instrumentation/erlang/) 18 | - [OpenTelemetry main trace components overview](https://opentelemetry.io/docs/concepts/signals/traces) 19 | - [MQTT trace context specification](https://w3c.github.io/trace-context-mqtt/) 20 | - [Trace context header specification](https://www.w3.org/TR/trace-context-1/#tracestate-header) 21 | - [OpenTelemetry messaging system semantic conventions](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md) 22 | - [Draft PR](https://github.com/emqx/emqx/pull/11696) 23 | 24 | ## Motivation 25 | 26 | OpenTelemetry distributed trace integration is part of the EMQX product roadmap. 27 | 28 | ## Design 29 | 30 | ### Core concepts and tracing scope 31 | 32 | The core traceable entity for EMQX is a message.
It means that one trace should be associated with one message. 33 | 34 | For example, a single HTTP request is a common traceable entity for a HTTP server. HTTP server instrumented with OpenTelemetry receives a HTTP request, extracts trace context from headers (e.g., `traceparent`, `tracestate` headers) and traces any processing steps (spans) up to sending a HTTP response back to the client, associating all the spans with the same trace ID (1 request - 1 trace ID). 35 | HTTP client after receiving the response may proceed executing some subsequent operations, tracing and linking them to the same trace ID. 36 | 37 | Somewhat analogously, EMQX instrumented with OpenTelemetry, is expected to receive a published message, extract trace context (e.g., `traceparent`, `tracestate` User-Properties), and trace some/all processing steps under the same trace ID. 38 | Producer/consumer of the message may proceed tracing any subsequent operations relating them to the same trace ID. 39 | 40 | These traced steps (or spans) should include the following (in the first iteration): 41 | 42 | - Process a published message (traced by a node that received a published message). 43 | This span starts when PUBLISH packet is received and parsed by a connection process and ends when the message is dispatched to local subscribers and/or forwarded to other nodes (forwarding is async by default). 44 | - Send a published message to a subscriber (traced by all nodes that have matched subscribers). 45 | This span is traced by each connection process (so there will be one span per each subscriber). It will be started when 'deliver` message is received by a connection controlling process and ended when outgoing packet is serialized and sent to the socket port. 46 | 47 | NOTE: the above list may be extended/changed in the next iterations. 
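As an illustration of the "extract trace context" step, the sketch below parses a W3C `traceparent` value out of the User-Properties of a published message. It is not the EMQX implementation (which relies on the Erlang OpenTelemetry libraries); the function and type names are hypothetical, User-Properties are simplified to a plain mapping, and only the version `00` layout is accepted.

```python
import re
from typing import Mapping, NamedTuple, Optional

# W3C trace context, version 00: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def extract_trace_context(user_properties: Mapping[str, str]) -> Optional[TraceContext]:
    """Return the trace context carried by a message, or None if absent/invalid."""
    value = user_properties.get("traceparent")
    if value is None:
        return None
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero identifiers are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

Spans created for the same message on different nodes would then reuse `trace_id`, which is what ties them into a single trace.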
48 | 49 | ![An actual EMQX trace example as exported by POC implementation](0023-assets/trace-export-example.png) 50 | 51 | Any other processing steps/events like client connection, authentication, subscription are currently not considered for OpenTelemetry tracing due to the following reasons: 52 | 53 | - these actions are not directly associated with the main traceable entity (message) defined above 54 | - these actions seem not absolutely suitable for distributed tracing, they can be probably traced only as internal EMQX events 55 | 56 | ### Implementation details 57 | 58 | Erlang OpenTelemetry lib heavily relies on propagating trace context by means of process dictionary. 59 | Obviously, this works fine when function calls are being traced within the context of the same process and needs little efforts when the context is to be propagated to a new spawned process. 60 | 61 | However, this approach is not absolutely suitable for EMQX distributed architecture: 62 | 63 | - correlated spans can be executed on different nodes and/or by different processes 64 | - a batch of items relating to different traces can be processed together as a single unit of work, e.g., `emqx_connection`, `emqx_channel` modules process deliver messages in batches, where each message would have a unique trace ID if tracing is enabled. 65 | 66 | That’s why the proposed implementation mostly relies on propagating the tracing context as a part of the message itself, which has the following advantages: 67 | 68 | - inter-cluster communication doesn’t require any changes to support trace context propagation and is backward compatible (trace context is added to a reserved `#emqx_message.extra` field) 69 | - tracing individual messages processed in batches is possible and doesn’t require any significant changes in the current implementation. 
70 | 71 | API and context propagation examples (see: [full implementation](https://github.com/emqx/emqx/pull/11984/files#diff-73384b930f330bcf64fb285a0bbcdce0edd015f6c7598e847da49d46b878ebe4): 72 | 73 | ```erlang 74 | 75 | put_ctx_to_msg(OtelCtx, Msg = #message{extra = Extra}) when is_map(Extra) -> 76 | Msg#message{extra = Extra#{?EMQX_OTEL_CTX => OtelCtx}}; 77 | %% extra field has not being used previously and defaulted to an empty list, it's safe to overwrite it 78 | put_ctx_to_msg(OtelCtx, Msg) when is_record(Msg, message) -> 79 | Msg#message{extra = #{?EMQX_OTEL_CTX => OtelCtx}}. 80 | 81 | get_ctx_from_msg(#message{extra = Extra}) -> 82 | from_extra(Extra). 83 | 84 | get_ctx_from_packet(#mqtt_packet{variable = #mqtt_packet_publish{properties = #{internal_extra := Extra}}}) -> 85 | from_extra(Extra); 86 | get_ctx_from_packet(_) -> 87 | undefined. 88 | 89 | from_extra(#{?EMQX_OTEL_CTX := OtelCtx}) -> 90 | OtelCtx; 91 | from_extra(_) -> 92 | undefined. 93 | ``` 94 | 95 | Some drawbacks of the proposed implementation should also be mentioned: 96 | 97 | - internal tracing API (as of now, implemented in `emqx_otel_trace` module) is not decoupled from the rest of the code base: each traceable action (span) is traced by a specific function and all these functions are quite specific. For example, they may extract/propagate the context differently and/or rely on the previous (parent) span. For now, it doesn’t seem feasible to create a generic trace wrapper that can trace an arbitrary function. 98 | 99 | #### emqx_external_trace behaviour 100 | 101 | Most (currently all) trace spans are expected to be added to the core `emqx` OTP application. However, `emqx` application mustn't depend on `opentelemetry` libs/apps. 102 | Moreover, we already have `emqx_opentelemetry` OTP application that implements OpenTelementry metrics, schema, configuration, etc. 
103 | In order to keep `emqx` application decoupled from `opentelemetry` specific code, it's proposed to introduce `emqx_external_trace` module in `emqx` application. 104 | The module will include necessary callbacks that an actual trace backend must implement. It will also implement `register_provider/1`, `unregister_provider/1` functions, so that `opentelemetry` backend trace module can register itself as a trace provider. 105 | 106 | `apps/emqx/src/emqx_external_trace.erl`: 107 | ```erlang 108 | -module(emqx_external_trace). 109 | 110 | -callback trace_process_publish(Packet, Channel, fun((Packet, Channel) -> Res)) -> Res when 111 | Packet :: emqx_types:packet(), 112 | Channel :: emqx_channel:channel(), 113 | Res :: term(). 114 | 115 | ... 116 | 117 | -define(PROVIDER, {?MODULE, trace_provider}). 118 | 119 | -define(with_provider(IfRegisitered, IfNotRegisired), 120 | case persistent_term:get(?PROVIDER, undefined) of 121 | undefined -> 122 | IfNotRegisired; 123 | Provider -> 124 | Provider:IfRegisitered 125 | end 126 | ). 127 | 128 | %%-------------------------------------------------------------------- 129 | %% provider API 130 | %%-------------------------------------------------------------------- 131 | 132 | -spec register_provider(module()) -> ok | {error, term()}. 133 | register_provider(Module) when is_atom(Module) -> 134 | case is_valid_provider(Module) of 135 | true -> 136 | persistent_term:put(?PROVIDER, Module); 137 | false -> 138 | {error, invalid_provider} 139 | end. 140 | 141 | -spec unregister_provider(module()) -> ok | {error, term()}. 142 | unregister_provider(Module) -> 143 | case persistent_term:get(?PROVIDER, undefined) of 144 | Module -> 145 | persistent_term:erase(?PROVIDER), 146 | ok; 147 | _ -> 148 | {error, not_registered} 149 | end. 
150 | 151 | %%-------------------------------------------------------------------- 152 | %% trace API 153 | %%-------------------------------------------------------------------- 154 | 155 | -spec trace_process_publish(Packet, Channel, fun((Packet, Channel) -> Res)) -> Res when 156 | Packet :: emqx_types:packet(), 157 | Channel :: emqx_channel:channel(), 158 | Res :: term(). 159 | trace_process_publish(Packet, Channel, ProcessFun) -> 160 | ?with_provider(?FUNCTION_NAME(Packet, Channel, ProcessFun), ProcessFun(Packet, Channel)). 161 | 162 | ``` 163 | 164 | ### External trace context propagation 165 | 166 | If EMQX receives trace context in a published message, e.g., `traceparent`/`tracestate` User-property for MQTT v5.0, it must be sent unaltered when forwarding the Application Message to a Client to conform with [MQTT specification 3.3.2.3.7](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901116). 167 | 168 | This also perfectly follows [OpenTelemetry semantics for messaging systems](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#context-propagation): 169 | 170 | > Messaging systems themselves may trace messages as the messages travels from producers to consumers. Such tracing would cover the transport layer but would not help in correlating producers with consumers. To be able to directly correlate producers with consumers, another context that is propagated with the message is required. 171 | > 172 | > A message creation context allows correlating producers with consumers of a message and model the dependencies between them, regardless of the underlying messaging transport mechanism and its instrumentation. 173 | > 174 | > The message creation context is created by the producer and should be propagated to the consumer(s). 175 | > 176 | > A producer SHOULD attach a message creation context to each message. 
If possible, the message creation context SHOULD be attached in such a way that it cannot be changed by intermediaries. 177 | 178 | In fact, EMQX is capable of participating in distributed tracing out of the box (without OpenTelemetry instrumentation), simply because it implements the above MQTT specification requirement and propagates User-Properties from a publisher to a subscriber. 179 | 180 | However, if no trace context is received but a message should still be traced, one of the following options should be chosen: 181 | 182 | - create an internal trace context, trace only internal EMQX events, and do not propagate the context to receivers and/or external data systems (if bridges are set up) 183 | - create an internal trace context and propagate it to receivers and/or external data systems. 184 | 185 | The option shall be configurable and default to not propagating internally created trace context (controlled via the `opentelemetry.traces.filter.trace_all` configuration parameter). 186 | 187 | ### Attributes 188 | 189 | OpenTelemetry defines [some conventions](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#messaging-attributes) (status: Experimental at the time of writing this document). 190 | 191 | The attributes are grouped under several namespaces: 192 | 193 | - `messaging.*` 194 | - `network.*` 195 | - `server.*` 196 | 197 | The implementation shall follow these conventions, but as of now, only a small subset of attributes is added. 198 | The attributes can be extended in the future upon request. 199 | 200 | ### Sampling 201 | 202 | OpenTelemetry sampling is described in great depth in the [official documentation](https://opentelemetry.io/docs/concepts/sampling). 203 | 204 | The Erlang opentelemetry lib implements only head sampling. Head sampling implies that a sampling decision is made as early as possible, e.g., by following a configured percentage of traces to sample (100% by default).
A decision to sample or drop a span or trace is not made by inspecting the trace as a whole. 205 | 206 | Sampling rate option should be added to EMQX configuration. 207 | 208 | Tail sampling, which makes a sampling decision after all the spans are done, would need to be implemented by extending the opentelemetry lib. 209 | 210 | Examples of tail sampling capabilities: 211 | 212 | - sample traces based on their latency (e.g. sample only traces that take more than 5ms) 213 | - sample traces only if they contain an error, a specific event or attribute value 214 | 215 | The first iteration of EMQX OpenTelemetry integration doesn't implement any sampling. This feature can be considered for development in the next EMQX releases. 216 | 217 | ### Filtering 218 | 219 | The goal of filtering is similar to that of sampling: to reduce the number of traces. 220 | However, filtering is considered an EMQX/MQTT-specific extension that doesn’t necessarily follow OpenTelemetry sampling concepts. 221 | 222 | NOTE: filters can be implemented using `otel_sampler` behavior, but it doesn’t seem to have any advantages. 223 | It is suggested to implement configurable filtering rules, so that a user can control which messages should be traced. It must be possible to leave filtering rules blank, so that all the incoming messages are traced (if tracing itself is enabled). 224 | 225 | The filters may be similar to ones used in EMQX tracing: 226 | 227 | - Client ID 228 | - Topic 229 | - IP address 230 | 231 | The filtering rules should probably not be too complex, to minimize the performance impact. 232 | 233 | The first iteration of EMQX OpenTelemetry integration defines only one boolean filter: `trace_all`. If it is enabled, all published messages are traced, and a new trace ID is generated if it can't be extracted from the message. 234 | Otherwise, only messages published with trace context are traced.
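The effect of the `trace_all` filter can be summarised in a few lines. This is a sketch of the decision only, with a hypothetical function name; the real implementation delegates ID generation to the OpenTelemetry library rather than generating IDs itself.

```python
import secrets
from typing import Optional


def resolve_trace_id(incoming_trace_id: Optional[str], trace_all: bool) -> Optional[str]:
    """Return the trace ID to use for a published message, or None to skip tracing."""
    if incoming_trace_id is not None:
        # The publisher supplied trace context: always trace under its ID.
        return incoming_trace_id
    if trace_all:
        # No context supplied, but everything is traced: create a new 128-bit ID.
        return secrets.token_hex(16)
    # No context and trace_all disabled: the message is not traced.
    return None
```

With `trace_all` disabled, tracing is opt-in per message: only publishers that attach a trace context produce traces.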
235 | 236 | ## Configuration Changes 237 | 238 | The existing EMQX OpenTelemetry schema (defined in emqx_otel_schema module) must be extended to include trace specific configuration. 239 | 240 | Current HOCON config example: 241 | ``` 242 | opentelemetry { 243 | enable = true 244 | exporter {endpoint = "http://172.18.0.2:4317", interval = 10s} 245 | } 246 | ``` 247 | 248 | Suggested HOCON config example: 249 | ``` 250 | opentelemetry { 251 | metrics {enable = true} # must be backward-compatible with opentelemetry.enable 252 | exporter {endpoint = "http://172.18.0.2:4317"} 253 | trace { 254 | enable = true 255 | filter {} 256 | ... 257 | } 258 | } 259 | ``` 260 | 261 | ## Backward Compatibility 262 | 263 | All changes are backward compatible. 264 | 265 | ## Testing 266 | 267 | Besides integration/unit tests, it is necessary to make performance tests/profiling to measure the impact of tracing on EMQX performance. 268 | -------------------------------------------------------------------------------- /implemented/0026-schema-validation.md: -------------------------------------------------------------------------------- 1 | # Built-in Schema Validation 2 | 3 | ## Changelog 4 | 5 | * 2024-05-22: @thalesmg Rename feature to "schema validation". 6 | * 2023-10-31: @zmstone Initial draft message validation, filter and transformation. 7 | * 2022-12-19: @zmstone Limit the scope to filter and validation. 8 | 9 | ## Abstract 10 | 11 | A new feature for EMQX v5 to validate messages. 12 | 13 | ## Motivation 14 | 15 | ### Schema Validation 16 | 17 | Technically, by using EMQX rules engine, it is possible to validate the incoming 18 | MQTT messages before sending the message off to data bridges. 19 | 20 | For instance, one can write a rules engine SQL like below to 21 | drop messages which are not avro encoded and also when the 'val' field is not greater than 10. 
22 | 23 | ``` 24 | SELECT schema_decode(my_avro_schema, payload) as decoded from t/# where decoded.val > 10 25 | ``` 26 | 27 | However, rules are currently only used in data bridges; they do not interfere with MQTT message pub/sub. 28 | For example, if there is a client subscribed to `t/#`, it would still receive the Avro-encoded message. 29 | 30 | If we also want to stop the message being published, then there would be a need for a message republish action, 31 | and the publisher and subscriber will likely be forced to use different topics. 32 | e.g. the publisher publishes to `t1/...`, and the republish action sends it to `t2/...` for subscribers to receive. 33 | 34 | This can work, but it is not easy to use. 35 | 36 | ### Message filter 37 | 38 | Message filtering can be achieved by configuring a schema validation rule to drop invalid messages. 39 | 40 | ## Design 41 | 42 | This data validation step should happen after authorization and before the rules engine. 43 | It can probably make use of the 'message.publish' hook with a priority which 44 | positions the callback after authorization, but before data bridges. 45 | 46 | ### High level design 47 | 48 | * Make use of the rules engine SQL, but without the `FROM` clause as it's always the MQTT message being the input. 49 | 50 | Each data validation rule should have a unique ID. Each rule should also have an input topic-filter name which 51 | can be used to build a topic index for quick look-up.
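To make the topic look-up concrete, here is a minimal sketch of matching a published topic against the topic filters of the configured validations, preserving configuration order. A linear scan stands in for the topic index, the function names are hypothetical, and the special handling of `$`-prefixed topics is omitted.

```python
from typing import List, Sequence, Tuple


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Match an MQTT topic against a filter with `+` and `#` wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent and everything below it.
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def matching_validations(
    validations: Sequence[Tuple[str, Sequence[str]]], topic: str
) -> List[str]:
    """Return names of validations whose filters match, in configured order."""
    return [
        name
        for name, filters in validations
        if any(topic_matches(f, topic) for f in filters)
    ]
```

For example, with validations configured on `t/#` and on `t/1`, a message published to `t/1` selects both, in the order they were configured.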
52 | 53 | The common parts can be described as hocon config below: 54 | 55 | ### Schema Validation 56 | 57 | ``` 58 | validations = [ 59 | { 60 | name = validation1 61 | tags = [] 62 | description = "" 63 | enable = true # or 'false' 64 | type = validation 65 | description = "drop message if it is not compatible with my avro schema or if payload.value is less than 0" 66 | topics = "t/#" # or topics = ["t/1", "t/2"] 67 | strategy = any_pass # or all_pass 68 | failure_action = disconnect # (disconnect also implies 'drop') or 'drop' to only drop the message 69 | log_failure_at = none # debug, notice, info, warning, error 70 | checks = [ 71 | { 72 | # message payload to be validated against JSON schema 73 | type = json # can also be 'avro' or 'protobuf' 74 | schema = "my-json-schema-registration-id" 75 | } 76 | { 77 | # an SQL statement that evaluates to an empty result indicates validation failure 78 | # in this example, if payload.value is less than 0, the message is considered invalid 79 | type = sql 80 | sql = "SELECT * WHERE payload.value >= 0" 81 | } 82 | ] 83 | } 84 | ] 85 | ``` 86 | 87 | If more than one validation matches a message, all matching validations should be executed 88 | in the configured order. 89 | For example, if one is configured with `topics = "t/#"` and another with `topics = "t/1"`, 90 | when a message is published to `t/1`, both validations should be triggered. 91 | 92 | ## Configuration Changes 93 | 94 | A new config root named `message_validation` is to be added. 95 | 96 | ``` 97 | message_validation { 98 | validations = [ 99 | ...
100 | ] 101 | } 102 | ``` 103 | 104 | ## APIs 105 | 106 | - GET /schema_validations 107 | To list all the schema validations 108 | 109 | - GET /schema_validations?topic=t/#&schema_name=jsonsch1&schema_type=json 110 | Fetch validations based on filter 111 | 112 | - PUT /schema_validations 113 | To update a validation 114 | 115 | - POST /schema_validations 116 | To create a new validation 117 | 118 | ## Observability 119 | 120 | - There should be metrics created for each processor name. 121 | - OpenTelemetry tracing context in message properties should be preserved. 122 | - Client disconnect and message drop events should be traceable in the EMQX builtin tracing. 123 | - Emit a new event e.g. `schema.validation_failure`. 124 | This allows users to handle the validation failures in Rule-Engine. 125 | For instance, publish the message to a different topic. 126 | 127 | ## Backwards Compatibility 128 | 129 | New feature. 130 | 131 | ## Document Changes 132 | 133 | - Docs should be created to guide users to use the new feature. 134 | - New APIs in swagger spec 135 | - New config schema for configuration manual doc 136 | 137 | ## Testing Suggestions 138 | 139 | A significant part of this feature is purely functional, which makes it well suited to unit tests. 140 | 141 | Integration tests should be more focused on the management APIs. 142 | 143 | ## Declined Alternatives 144 | 145 | - Extend authz to support SQL. 146 | This was declined because authz's principal is MQTT topics; we do not want to introduce data to it. 147 | - Add new actions to rules for 'deny' and 'disconnect' actions. 148 | This was declined because 149 | - Rules lack ordering, while validation and transformation are to be chained.
150 | - We might implement it as a part of the rule engine internally, but at least from the user's perspective, it should be an independent feature (not like an extension of rules) 151 | -------------------------------------------------------------------------------- /implemented/0027-message-transform.md: -------------------------------------------------------------------------------- 1 | # Built-in Message Transformation 2 | 3 | ## Changelog 4 | 5 | * 2024-08-13: @zmstone Update to reflect the actual implementation. 6 | * 2023-10-31: @zmstone Initial draft for message validation, filter and transformation (as a part of EIP-0024). 7 | * 2022-12-19: @zmstone Limit the scope to message filter and transformation. 8 | 9 | ## Abstract 10 | 11 | An extension of EMQX Enterprise to support action-less message filter and transformation. 12 | 13 | ## Motivation 14 | 15 | 16 | Currently, if one wants to transform a message before the message is published, the only option is to 17 | make use of the Rule Engine's republish action. 18 | 19 | This requires the clients to publish to one topic, and subscribe to another. 20 | 21 | ## Design 22 | 23 | Each transformation consists of 24 | 25 | - A set of topic match patterns 26 | - A list of operations for data mutation 27 | - Payload decode/encode rules 28 | - Error handling strategy 29 | 30 | The transformation is done in the following steps for each message. 31 | 32 | - Match topic (filter). Only messages whose topic matches the configured topic (filter) are processed. 33 | - Payload decode. When the transformation is about message payload, the payload has to be decoded first, e.g. from JSON or Avro. 34 | - Mutation. Evaluate the expression to mutate the input data. 35 | - Payload encode. When the transformation is about message payload, the payload has to be encoded back to the desired format of the subscribers. 36 | 37 | We pre-define a set of keys (message attributes) which can be the subjects of data mutation.
38 | 39 | - Topic 40 | - Payload 41 | - QoS 42 | - Retain flag 43 | - User properties 44 | 45 | Each transformation is a [Variform expression](https://docs.emqx.com/en/emqx/latest/configuration/configuration.html#variform-expressions) 46 | which can be used to perform string operations. 47 | 48 | ## Configuration Changes 49 | 50 | Add a `message_transformation` config root. 51 | 52 | ``` 53 | message_transformation { 54 | transformations = [ 55 | { 56 | name = trans1 57 | description = "" 58 | failure_action = ignore 59 | log_failure {level = warning} 60 | operations = [ 61 | { 62 | key = topic 63 | ## prepend client ID to the publishing topic 64 | value = "concat([clientid,'/',topic])" 65 | } 66 | ] 67 | payload_decoder {type = none} 68 | payload_encoder {type = none} 69 | topics = [ 70 | "#" 71 | ] 72 | } 73 | ] 74 | } 75 | ``` 76 | 77 | ### Expression Examples 78 | 79 | - Add client ID as the prefix of publishing topics. 80 | 81 | ``` 82 | { 83 | key = topic 84 | value = "concat([clientid,'/',topic])" 85 | } 86 | ``` 87 | 88 | - Add client TLS certificate's OU as an MQTT message user property 89 | 90 | ``` 91 | { 92 | key = "user_property.ou" 93 | # if client_attrs.cert_dn is initialized, extract the OU 94 | # otherwise user_property.ou is set as 'none' 95 | value = "coalesce(regex_extract(client_attrs.cert_dn,'OU=([a-zA-Z0-9_-]+)'), 'none')" 96 | } 97 | ``` 98 | 99 | ### Declined Alternatives 100 | 101 | - Employ rule SQL expressions 102 | - Message multiplication (transform one message to many) 103 | -------------------------------------------------------------------------------- /implemented/0030-clientinfo-authentication-rules.md: -------------------------------------------------------------------------------- 1 | # Authenticate MQTT clients with composable rules against client info 2 | 3 | ## Changelog 4 | 5 | * 2024-06-27: @zmstone Initial draft 6 | * 2024-09-20: @zmstone Update to reflect the latest design 7 | 8 | ## Abstract 9 | 10 | Implement a new
feature which can authenticate MQTT clients based on a set of composable rules against client properties and attributes. 11 | 12 | ## Motivation 13 | 14 | EMQX has a comprehensive authentication chain, most of which requires out-of-band requests to an external service such as an HTTP server or a database. However, there are some scenarios where the authentication decision can be made based on the client properties and attributes, such as client ID, username, and TLS certificate. 15 | 16 | For certain use cases, it is more efficient to authenticate clients based on the client properties and attributes. 17 | Some examples: 18 | 19 | - Quickly deny clients which have no username. 20 | - Only allow clients with a certain client ID prefix to connect. 21 | - Username prefix must match the OU (Organizational Unit) in the TLS certificate. 22 | - Password is a hash of the client ID and a secret key defined in an environment variable. 23 | 24 | Such rules can be added to the authentication chain (often to the head of it), to effectively fence off clients that do not meet the criteria. They can also be used standalone to authenticate clients if the checks are sufficient. 25 | 26 | ## Design 27 | 28 | In addition to the current `Password Based`, `JWT` and `SCRAM`, we add a new authentication mechanism called `Client Info`. 29 | The `Client Info` mechanism has no external dependencies, but should have a set of configurable checks. 30 | 31 | The checks can be composed similarly to the `ACL` rules; externally, they are formatted as a HOCON array of objects, where each object is a check. 32 | The checks are evaluated in order, and the first check that matches the client info will be used to authenticate the client.
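The evaluation order can be illustrated with a short sketch. It models each check as a list of predicates plus a result and is not the EMQX implementation, where the conditions are Variform expressions; the names `authenticate` and `EXAMPLE_CHECKS` are hypothetical.

```python
from typing import Callable, List, Mapping, Sequence, Tuple

ClientInfo = Mapping[str, str]
Predicate = Callable[[ClientInfo], bool]
# A check is a list of conditions (all must hold) and the result it yields.
Check = Tuple[Sequence[Predicate], str]


def authenticate(checks: Sequence[Check], client_info: ClientInfo) -> str:
    """Return the result of the first matching check, or 'ignore' if none match."""
    for predicates, result in checks:
        if all(predicate(client_info) for predicate in predicates):
            return result
    # No check matched: defer to the next authenticator in the chain.
    return "ignore"


EXAMPLE_CHECKS: List[Check] = [
    # Allow clients whose username starts with 'super-'.
    ([lambda c: c["username"].startswith("super-")], "allow"),
    # Deny clients with an empty username and a client ID starting with 'v1-'.
    (
        [
            lambda c: c["username"] == "",
            lambda c: c["clientid"].split("-")[0] == "v1",
        ],
        "deny",
    ),
]
```

Because evaluation stops at the first match, the order of checks matters: a broad `allow` placed first would shadow a more specific `deny` below it.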
33 | 34 | Here is an example of the configuration: 35 | 36 | ``` 37 | checks = [ 38 | # Allow clients whose username starts with 'super-' 39 | { 40 | is_match = "regex_match(username, '^super-.+$')" 41 | result = allow 42 | }, 43 | # Deny clients with an empty username whose client ID starts with 'v1-' 44 | { 45 | # When is_match is an array, it yields 'true' if all individual checks yield 'true' 46 | is_match = ["str_eq(username, '')", "str_eq(nth(1,tokens(clientid,'-')), 'v1')"] 47 | result = deny 48 | } 49 | # If all checks are exhausted without an 'allow' or a 'deny' result 50 | # this authenticator results in `ignore`, so that evaluation continues with the next authenticator in the chain 51 | ] 52 | ``` 53 | 54 | Each check object has two fields: 55 | 56 | - `is_match`: One or more boolean expressions. An expression matches when it evaluates to `true`; any other string value is treated as `false`. 57 | - `result`: either `allow`, `deny` or `ignore`. 58 | 59 | ### Logical Operators 60 | 61 | There is no explicit logical `AND` or `OR` operator support for checks and match conditions, but the following rules apply: 62 | 63 | - Since each `check` can yield a `result`, one may consider the elements of the `checks` array to be connected by a logical `||` (`OR`) operator. 64 | - When `is_match` is an array, it yields `true` if all individual checks yield `true`, so one may consider the elements of the `is_match` array to be connected by a logical `&&` (`AND`) operator. 65 | 66 | ### Functions 67 | 68 | The `is_match` expressions are Variform expressions, as used in other parts of EMQX; they are evaluated in the context of the client info. 69 | 70 | Find more information about the Variform expressions in the [Variform documentation](https://docs.emqx.com/en/emqx/v5.8/configuration/configuration.html#variform-expressions) 71 | 72 | ### Predefined Variables 73 | 74 | - `username`: the username of the client. 75 | - `clientid`: the client ID of the client. 76 | - `peerhost`: the IP address of the client. 77 | - `cert_subject`: the subject of the TLS certificate.
- `cert_common_name`: the common name (CN) of the TLS certificate.
- `client_attrs.*`: the client attributes of the client. See more in the [Client Attributes documentation](https://docs.emqx.com/en/emqx/v5.8/client-attributes/client-attributes.html#mqtt-client-attributes)

## Configuration Changes

This section should list all the changes to the configuration files (if any).

## Backwards Compatibility

This is a new feature and should not affect any existing configurations.

## Document Changes

A new section should be added to the authentication documentation to describe the new `Client Info` authentication mechanism.

## Testing Suggestions

Test coverage should be added to cover the new `Client Info` authentication mechanism.

## Declined Alternatives

--------------------------------------------------------------------------------
/implemented/0031-immutable-config-base-for-cluster.hocon:
--------------------------------------------------------------------------------
# Immutable config base for cluster.hocon

## Changelog

* 2024-11-13: @zmstone Initial draft

## Abstract

Make EMQX configuration files great again.

## Motivation

The config overriding rule since EMQX 5.1 is not easy to understand and manage.
The current (as of 5.8) config layers are as follows (in order of overriding precedence from high to low):

- Environment variables (highest precedence)
  Environment variables which start with `EMQX_` are translated to config keys and sit on top of the override chain.
  Since the environment variables are usually set by the system administrator, they are given the highest precedence.
- File `etc/emqx.conf` (medium precedence)
  This file holds all the manually crafted configurations.
  The content of this file is not mutable by the EMQX software (in fact it can be a read-only file).
  Since this is manually crafted, it is given a higher precedence than `cluster.hocon`.
- File `data/configs/cluster.hocon` (low precedence)
  This config file is mostly hidden from the users. It holds the config changes made from the UI, API or CLI.

One may argue that this is not the perfect order of overriding, but it has been the fixed order in EMQX since 5.1 and we cannot change it at will. That is, this proposal is not about changing the overriding order.

The ordering itself is not the problem. The problem is the lack of support for manually crafted config files under `cluster.hocon`.
Since `emqx.conf` is the only option, people started putting their custom configurations in `etc/emqx.conf`, but also want to use the UI/API/CLI to change some of the configurations.
As a result, changes made from the UI/API/CLI override the existing merged config at runtime, but get overridden by `etc/emqx.conf` when the node restarts.

This is a particular problem for the EMQX Kubernetes operator, because it encourages users to put all configs in one YAML block which gets mapped to `etc/emqx.conf` when bootstrapping the deployment.
Changes made to the resource are then applied by calling the API, which only temporarily changes the configurations until the pod restarts.

## Design

Add a conventional layer named `base.hocon` to the config overriding chain, which sits under `cluster.hocon`.
So the new overriding chain will be:

- Environment variables (highest precedence)
- File `etc/emqx.conf` (medium precedence)
- File `data/configs/cluster.hocon` (low precedence)
- File `etc/base.hocon` (lowest precedence)

## Configuration Changes

Add a new line to `cluster.hocon` at the top of it:

```
include "etc/base.hocon"
```

The path to `base.hocon` depends on the packaging flavor of EMQX.

- docker: `/opt/emqx/etc/base.hocon`
- RPM/DEB: `/etc/emqx/base.hocon`

## Backwards Compatibility

Since HOCON's `include` directive is used, the existing configurations will not be affected.
Also, `include` is silently ignored if the file does not exist, so the new file can be added without breaking existing installations.

## Document Changes

Configuration documentation should be updated to reflect the new file.

## Testing Suggestions

Since we risk duplicating configurations in `base.hocon` and `cluster.hocon`, we should cover the test scenarios where the contents of `base.hocon` and `cluster.hocon` are very large and have many overlapping configurations.

- Functionality-wise, the configuration should be verified to be merged correctly.
- Performance-wise, a large amount of config should not affect the startup time of the node.

## Declined Alternatives

--------------------------------------------------------------------------------
/rejected/0005-stateless-brokers.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/rejected/0005-stateless-brokers.png

--------------------------------------------------------------------------------
/rejected/0006-plugins-on-dashboard.md:
--------------------------------------------------------------------------------
# Stateless Brokers in EMQ X v5.0

```
Author: Shawn
Status: Draft
Type: Design
Created: 2020-10-27
EMQ X Version: 5.0
Post-History:
```

## Abstract

## Motivation

## Rationale

## Architecture

Please attach the architecture diagrams.

## Design

## Discussion

## References

--------------------------------------------------------------------------------
/rejected/0008-community-plugins.md:
--------------------------------------------------------------------------------
## Community Plugins

```
Author: Yudai Kiyofuji
Status: Draft
First Created: 2021-02-21
EMQ X Version: 4.3
```

## Change log

* 2021-02-23: @z8674558 Initial draft
* 2021-02-25: @z8674558 Add proposal on elixir plugins

## Abstract

This proposal suggests ways to encourage plugin development by the community.

## Motivation

Allowing people to develop their own plugins is a good way for EMQX to gain popularity.
To achieve this, it would be good to `approve` some community plugins and let people use them.

## Design

At the root of emqx.git, we add the file `community-plugins`,
where we list approved community plugins.
(The advantage of having it in a separate file is that it keeps the number of such lines in `rebar.config.erl` to a minimum,
which may otherwise cause more conflicts when porting changes to enterprise.)

```erlang
{erlang_plugins, [{foo_plugin, {git, "https://github.com", {tag, "1.0.0"}}}]}.
```

And when a user would like to use one of them,
they can do so by setting the env variable `EMQX_COMMUNITY_PLUGINS=foo_plugin`.
Then `rebar.config.erl` reads the file and the environment variable to include the specified ones.

### Elixir Plugins

Considering the recent popularity of Elixir, we have decided to continue supporting Elixir plugins in v4.3.
At the end of the `community-plugins` file, there should be

```erlang
{elixir_plugins, [{bar_plugin, {git, "https://github.com", {tag, "1.0.0"}}}]}.
```

## Configuration Changes



## Backwards Compatibility


## Document Changes

In `emqx-doc`, there should be detailed information
on how to use third-party plugins.

Add detailed information on how one can develop their own plugins
in `emqx-plugin-template` and `emqx-elixir-plugin`.

## Testing Suggestions

Suppose we have approved a third-party plugin `emqx-some-plugin`.
Since we have an umbrella project in v4.3,
the developers of `emqx-some-plugin` are going to run the tests
by placing it in emqx.git, for example in the `_checkouts` dir.

In `emqx-some-plugin`'s CI, they have to fetch emqx.git and then run the tests.

--------------------------------------------------------------------------------
/rejected/0009-consul-cluster-discovery.md:
--------------------------------------------------------------------------------
# Support Consul for cluster discovery

```
Author:
Status: Draft
Created: 2021-03-05
```

## Abstract

Cluster discovery should support Consul as an option to store and look up node details.

## Motivation

Consul is another popular key-value storage system (similar to etcd). This will add another valuable clustering option to emqx.

## Design

Each emqx node would write a key on a given path that would be provided in the configuration and make an API call
to Consul, e.g. `PUT /kv/emqx/` - [Consul key create API](https://www.consul.io/api-docs/kv#create-update-key).

Each emqx node would then be able to query Consul for other keys on that path, e.g. `GET /kv/emqx/?keys` -
[Consul get multiple keys docs](https://www.consul.io/api-docs/kv#keys-response)

With the node information, clustering would proceed in the standard emqx fashion.
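
The register-then-discover flow above can be sketched in Python without touching the network. This is a hedged illustration, not part of the proposal: the function names are hypothetical, the key layout `<prefix>/<node name>` is an assumption, and the `/v1` segment reflects the version prefix of Consul's HTTP API. The sketch only builds the request URLs and parses the JSON array of key names that a `?keys` query returns.

```python
import json
from typing import List
from urllib.parse import quote


def register_url(server: str, prefix: str, node: str) -> str:
    """URL a node would PUT to in order to announce itself (assumed key layout)."""
    return f"{server}/v1/kv/{prefix}/{quote(node, safe='@')}"


def discover_url(server: str, prefix: str) -> str:
    """URL a node would GET to list all key names under the prefix."""
    return f"{server}/v1/kv/{prefix}/?keys"


def nodes_from_keys(body: str, prefix: str) -> List[str]:
    """Extract node names from the JSON array of key names in a `?keys` response."""
    head = prefix + "/"
    return [
        key[len(head):]
        for key in json.loads(body)
        if key.startswith(head) and key != head
    ]
```

A node would issue a `PUT` to `register_url(...)` on startup, then periodically `GET` `discover_url(...)` and pass the response body to `nodes_from_keys` to obtain the peers to cluster with.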

## Configuration Changes

`consul` would be added to the list of clustering options and config values would look like:

```
cluster.discovery = consul
cluster.consul.server = http://127.0.0.1:8500
cluster.consul.prefix = emqcl
```

We may also want to look at adding support for Consul authentication -
[auth docs](https://www.consul.io/api-docs#authentication)

```
cluster.consul.token =
```

## Backwards Compatibility

There should be no issue with backwards compatibility as this is a new feature that would
not impact the performance of any previous features.

## Document Changes

Documentation will need to be updated to include the new clustering option.

## Testing Suggestions

Could be tested the same as etcd.

## Declined Alternatives

None

--------------------------------------------------------------------------------