├── .gitignore
├── Makefile
├── README.md
├── active
    ├── 0000-proposal-template.md
    ├── 0001-proposal-process.md
    ├── 0023-rocksdb-message-persistence.md
    ├── 0024-assets
    │   ├── .DS_Store
    │   ├── emqx3-session-resumption.png
    │   ├── emqx4-session-takeover.png
    │   └── emqx5-hybrid-session-takeover-resumption.png
    ├── 0024-hybrid-session-takeover-and-resumption.md
    ├── 0025-ds-for-retainer.md
    ├── 0028-assets
    │   └── general-design.png
    ├── 0028-durable-shared-subscriptions.md
    ├── 0029-assets
    │   ├── cluster-link-route-repl-msg-fwd.png
    │   └── cluster-link-route-repl.png
    └── 0029-cluster-linking.md
├── implemented
    ├── 0002-new-config-syntax.md
    ├── 0003-new-config-files.md
    ├── 0004-assets
    │   ├── agent-fsm.png
    │   ├── agent-fsm.uml
    │   ├── replicant-fsm.png
    │   ├── replicant-fsm.uml
    │   ├── replication-msc.png
    │   └── replication-msc.uml
    ├── 0004-async-mnesia-change-log-replication.md
    ├── 0007-rocksdb-mnesia-backend.md
    ├── 0010-improved-monitoring.md
    ├── 0010-runq-based-overload-protection.md
    ├── 0011-assets
    │   ├── current-implementation.png
    │   ├── current-implementation.uml
    │   ├── flows.png
    │   ├── flows.uml
    │   ├── init-fsm.png
    │   └── init-fsm.uml
    ├── 0011-jq-in-rule-engine.md
    ├── 0011-persistent-sessions.md
    ├── 0012-cluster-call.md
    ├── 0013-gen-swagger-spec.md
    ├── 0014-rolling-cluster-upgrade.md
    ├── 0015-unified-authentication.md
    ├── 0016-emqx-conf.md
    ├── 0018-unified-authorization.md
    ├── 0019-plugins.md
    ├── 0020-assets
    │   ├── evacuation-coordinator-statuses-enforce.png
    │   ├── evacuation-coordinator-statuses-enforce.uml
    │   ├── evacuation-coordinator-statuses-rebalance.png
    │   ├── evacuation-coordinator-statuses-rebalance.uml
    │   ├── eviction-agent.png
    │   ├── node-rebalance-algo0.png
    │   ├── node-rebalance-algo1.png
    │   ├── node-rebalance-algo2.png
    │   └── node-rebalance.png
    ├── 0020-node-rebalance.md
    ├── 0021-assets
    │   ├── aws-mqtt-file-delivery.png
    │   ├── aws-mqtt-file-delivery.uml
    │   ├── flow-abort.png
    │   ├── flow-abort.uml
    │   ├── flow-async-1.png
    │   ├── flow-async-1.uml
    │   ├── flow-happy-path.png
    │   ├── flow-happy-path.uml
    │   ├── flow-restart.png
    │   ├── flow-restart.uml
    │   ├── flow-sync-1.png
    │   ├── flow-sync-1.uml
    │   ├── flow-sync-2.png
    │   └── flow-sync-2.uml
    ├── 0021-transfer-files-over-mqtt.md
    ├── 0022-forward-check-install-upgrade-script.md
    ├── 0023-assets
    │   └── trace-export-example.png
    ├── 0023-opentelemetry-trace-integration.md
    ├── 0026-schema-validation.md
    ├── 0027-message-transform.md
    ├── 0030-clientinfo-authentication-rules.md
    └── 0031-immutable-config-base-for-cluster.hocon
└── rejected
    ├── 0005-stateless-brokers.md
    ├── 0005-stateless-brokers.png
    ├── 0006-plugins-on-dashboard.md
    ├── 0008-community-plugins.md
    └── 0009-consul-cluster-discovery.md


/.gitignore:
--------------------------------------------------------------------------------
1 | /.idea/
2 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | PICS=$(patsubst %.uml,%.png,$(wildcard */*-assets/*.uml))
2 | 
3 | .PHONY: all
4 | all: $(PICS)
5 | 
6 | %.png: %.uml
7 | 	plantuml $<
8 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # EMQ X Improvement Proposals (EIP)
 2 | 
 3 | This repository contains the EMQ X Improvement Proposals (EIPs), which document
 4 | the ideas, designs, or implementation details of new features. All the EIPs are in
 5 | Markdown (`*.md`) format.
 6 | 
 7 | New EIPs should first go to the `active` directory: create a pull request
 8 | and ask for approval. After the feature is implemented, the EIP is moved into
 9 | the `implemented` directory.
10 | 
11 | Before submitting an EIP, please read the
12 | [0000-proposal-template](active/0000-proposal-template.md), which is a template
13 | demonstrating the format of EIPs.
14 | 
15 | ## Creating UML diagrams
16 | 
17 | It is possible to add UML diagrams using [PlantUML](https://plantuml.com/).
18 | In order to do this, create a directory called `active/XXXX-assets` (replace
19 | `XXXX` with the EIP number), and put the files there. All files should have
20 | the `.uml` extension.
21 | 
22 | Then run `make` to generate the images.
23 | 


--------------------------------------------------------------------------------
/active/0000-proposal-template.md:
--------------------------------------------------------------------------------
 1 | # An Example of EMQX Improvement Proposal
 2 | 
 3 | ## Changelog
 4 | 
 5 | * 2020-10-21: @emqxplus Initial draft
 6 | * 2020-02-05: @terry-xiaoyu Restructure
 7 | * 2021-02-21: @zmstone Add 'Declined Alternatives' section
 8 | 
 9 | ## Abstract
10 | 
11 | A short (~200 word) description of the technical issue being addressed.
12 | 
13 | ## Motivation
14 | 
15 | This section should clearly explain why the functionality proposed by this EIP
16 | is necessary. EIP submissions without sufficient motivation may be rejected
17 | outright.
18 | 
19 | ## Design
20 | 
21 | This section should describe the design of the feature in detail. If it is a
22 | change to the architecture, some diagrams may be necessary.
23 | 
24 | ## Configuration Changes
25 | 
26 | This section should list all the changes to the configuration files (if any).
27 | 
28 | ## Backwards Compatibility
29 | 
30 | This section should show how the feature is kept backwards compatible.
31 | If it cannot be compatible with previous EMQX versions, explain how you
32 | propose to deal with the incompatibilities.
33 | 
34 | ## Document Changes
35 | 
36 | If there is any document change, give a brief description of it here.
37 | 
38 | ## Testing Suggestions
39 | 
40 | The final implementation must include unit test or common test code. If further
41 | tests, such as integration tests or benchmarks, need to be run
42 | manually, list them here.
43 | 
44 | ## Declined Alternatives
45 | 
46 | List the alternatives that were discussed but considered worse than the current proposal.
47 | This is to help people understand how we reached the current state and also to
48 | prevent going through the discussion again when an old alternative is brought
49 | up again in the future.
50 | 
51 | 


--------------------------------------------------------------------------------
/active/0001-proposal-process.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0001-proposal-process.md


--------------------------------------------------------------------------------
/active/0024-assets/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/.DS_Store


--------------------------------------------------------------------------------
/active/0024-assets/emqx3-session-resumption.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/emqx3-session-resumption.png


--------------------------------------------------------------------------------
/active/0024-assets/emqx4-session-takeover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/emqx4-session-takeover.png


--------------------------------------------------------------------------------
/active/0024-assets/emqx5-hybrid-session-takeover-resumption.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0024-assets/emqx5-hybrid-session-takeover-resumption.png


--------------------------------------------------------------------------------
/active/0024-hybrid-session-takeover-and-resumption.md:
--------------------------------------------------------------------------------
  1 | # Hybrid MQTT Session Takeover and Resumption
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2023-06-27: @emqplus Initial commit
  6 | 
  7 | ## Abstract
  8 | 
  9 | This proposal presents a hybrid architecture for MQTT session takeover and resumption in EMQX version 5.
 10 | 
 11 | ## Motivation
 12 | 
 13 | MQTT is widely used in IoT systems, where devices may experience intermittent connectivity or transient network outages. In such scenarios, MQTT session resumption and takeover are essential features for maintaining reliable communication between the devices and the MQTT broker. I propose a new design for MQTT session takeover and resumption in a later EMQX 5 release.
 14 | 
 15 | ## Acronyms Used
 16 | 
 17 | | Name | Acronym | Description |
 18 | | --- | --- | --- |
 19 | | Node1 | N1 | Node 1 in a cluster |
 20 | | Node2 | N2 | Node 2 in a cluster |
 21 | | PubSub | PS | PubSub service on a node |
 22 | | Connection1 | C1 | MQTT Connection 1 |
 23 | | Connection2 | C2 | MQTT Connection 2 |
 24 | | Session1 | S1 | MQTT Session 1 |
 25 | | Session2 | S2 | MQTT Session 2 |
 26 | | Channel1 | CH1 | MQTT Channel 1 |
 27 | | Channel2 | CH2 | MQTT Channel 2 |
 28 | 
 29 | ## MQTT Session Resumption in EMQX 3
 30 | 
 31 | EMQX 3 uses two separate Erlang processes for an MQTT connection and session:
 32 | 
 33 | - When a client disconnects, the connection process will be shut down while the session process stays alive and keeps the session state;
 34 | - When a new client connects, a new connection process will be spawned and resume the session state from the existing session process.
 35 | 
 36 | The process for session resumption is as follows:
 37 | 
 38 | ![Session Resumption in EMQX 3](0024-assets/emqx3-session-resumption.png)
 39 | 
 40 | ```
 41 | title Session Resumption in EMQX 3
 42 | 
 43 | C2(N2)->S1(N1): Resume
 44 | S1(N1)->C1(N1): Discard
 45 | C1(N1)->C1(N1): Shutdown
 46 | PS(N1)->S1(N1): Deliver
 47 | S1(N1)->C2(N2): Deliver
 48 | ```
 49 | 
 50 | After resumption, the session and connection processes are located on two different nodes. On delivery, MQTT messages will have to be forwarded between the nodes.
 51 | 
 52 | **Pros**
 53 | 
 54 | - Simple and fast to resume an MQTT session
 55 | 
 56 | **Cons**
 57 | 
 58 | - Higher memory usage, with two Erlang processes per client
 59 | - Lower MQTT message delivery performance, because messages may have to be forwarded between nodes
 60 | 
 61 | ## MQTT Session Takeover in EMQX 4
 62 | 
 63 | EMQX 4 uses one Erlang process called the MQTT channel to wrap all the states of the connection and session.
 64 | 
 65 | - When a persistent MQTT client disconnects, the channel process changes to the `disconnected` status and keeps the session state;
 66 | - When a new client connects, a new channel process will be created and take over the session state from the old one;
 67 | - The old channel process will be shut down when the takeover process is done.
 68 | 
 69 | The process for session takeover is as follows:
 70 | 
 71 | ![Session Takeover in EMQX 4](0024-assets/emqx4-session-takeover.png)
 72 | 
 73 | ```
 74 | title Session Takeover in EMQX 4
 75 | 
 76 | CH2(N2)->CH1(N1): Takeover begin
 77 | CH1(N1)-->CH2(N2): Session state
 78 | CH2(N2)->PS(N2): Subscribe
 79 | CH2(N2)->CH1(N1): Takeover end
 80 | CH1(N1)->PS(N1): Unsubscribe
 81 | CH1(N1)-->CH2(N2): Pending messages
 82 | CH1(N1)->CH1(N1): Shutdown
 83 | CH2(N2)->CH2(N2): Merge messages
 84 | CH2(N2)->CH2(N2): Resume
 85 | ```
 86 | 
 87 | **Pros**
 88 | 
 89 | - One Erlang process for an MQTT client, less memory usage, and fast message delivery.
 90 | 
 91 | **Cons**
 92 | 
 93 | - The takeover operation is relatively expensive and uses more CPU.
 94 | - High CPU usage will make the broker unserviceable when too many takeovers occur.
 95 | 
 96 | ## Hybrid Takeover and Resumption in EMQX 5
 97 | 
 98 | I propose a hybrid session takeover and resumption with an MQTT-aware load balancer in EMQX 5.
 99 | 
100 | - Still use one channel process to wrap the MQTT connection and session state;
101 | - When an MQTT client disconnects, the persistent session state will be kept in the channel process;
102 | - When a new client connects from the same node, resume the session by passing the socket control and connection state to the channel process;
103 | - If the new client connects from a different node, take over the session as in EMQX 4.
104 | 
105 | ![Hybrid Session Takeover and Resumption in EMQX 5](0024-assets/emqx5-hybrid-session-takeover-resumption.png)
106 | 
107 | ```
108 | title Hybrid Session Takeover and Resumption in EMQX 5
109 | 
110 | alt same node
111 |     CH2(N1)->CH1(N1): Resume with new connection/socket
112 |     CH1(N1)-->CH2(N1): Success
113 |     CH2(N1)->CH2(N1): Shutdown
114 | else another node
115 |     CH2(N2)->CH1(N1): Takeover begin
116 |     CH1(N1)-->CH2(N2): Session state
117 |     CH2(N2)->PS(N2): Subscribe
118 |     CH2(N2)->CH1(N1): Takeover end
119 |     CH1(N1)->PS(N1): Unsubscribe
120 |     CH1(N1)-->CH2(N2): Pending messages
121 |     CH1(N1)->CH1(N1): Shutdown
122 |     CH2(N2)->CH2(N2): Merge messages
123 |     CH2(N2)->CH2(N2): Resume
124 | end
125 | ```
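
The two branches of the diagram can be sketched as a small decision function. This is only an illustrative sketch in Python (EMQX itself is written in Erlang); `Channel` and `reconnect_steps` are hypothetical names, and the returned strings simply restate the diagram's messages.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Hypothetical stand-in for a channel process holding session state."""
    client_id: str
    node: str


def reconnect_steps(old: Channel, new_node: str) -> list[str]:
    """Return the steps for a reconnect, following the diagram's two branches."""
    if old.node == new_node:
        # Same node: hand the new socket to the existing channel process.
        return [
            "CH2 -> CH1: resume with new connection/socket",
            "CH1 -> CH2: success",
            "CH2: shutdown",
        ]
    # Different node: fall back to the EMQX 4 style takeover.
    return [
        "CH2 -> CH1: takeover begin",
        "CH1 -> CH2: session state",
        "CH2 -> PS(new node): subscribe",
        "CH2 -> CH1: takeover end",
        "CH1 -> PS(old node): unsubscribe",
        "CH1 -> CH2: pending messages",
        "CH1: shutdown",
        "CH2: merge messages",
        "CH2: resume",
    ]


old = Channel(client_id="c1", node="N1")
same_node = reconnect_steps(old, "N1")
other_node = reconnect_steps(old, "N2")
```

The same-node branch involves no PubSub traffic at all, which is where the CPU saving claimed below would come from.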
126 | 
127 | **Pros**
128 | 
129 | - Save CPU usage when session takeover occurs frequently
130 | - Don’t need to unsubscribe and resubscribe
131 | 
132 | **Cons**
133 | 
134 | - Somewhat more complex to test and debug
135 | 
136 | ## Configuration Changes
137 | 
138 | N/A
139 | 
140 | ## Backwards Compatibility
141 | 
142 | N/A
143 | 
144 | ## Document Changes
145 | 
146 | Documentation related to MQTT session management should be updated.
147 | 
148 | ## Testing Suggestions
149 | 
150 | - Add more test cases for session resumption from the same node.
151 | 
152 | 


--------------------------------------------------------------------------------
/active/0025-ds-for-retainer.md:
--------------------------------------------------------------------------------
  1 | # Durable storage for MQTT retained messages
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2024-01-18: @savonarola Initial draft
  6 | * 2024-01-22: @savonarola Add more optimizations for the "straightforward approach with optimizations"; declare it as the chosen one
  7 | 
  8 | ## Abstract
  9 | 
 10 | We implement [reliable persistent storage](0023-rocksdb-message-persistence.md) for the messages participating in publish/subscribe operations. We want to have a similar storage for the retained messages.
 11 | 
 12 | ## Motivation
 13 | 
 14 | Retained messages are an important feature of MQTT. For example, they may be used as a state or configuration storage for devices without persistent storage. The current implementation has significant limitations:
 15 | * Retained messages are stored in a mnesia table, so scalability is limited for such message insertions.
 16 | * To provide fast lookup, the message table is also stored in memory, so scalability is also limited from the memory consumption point of view.
 17 | * Current indexing capabilities have some limitations:
 18 |     * Appropriate indices should be manually specified for nonstandard topic schemes.
 19 |     * The reindex process is not automatic and quite complicated.
 20 |     * Indices consume a significant amount of memory.
 21 | 
 22 | We want to get rid of many of these limitations by reusing the durable storage (DS) concept and implementations from general message persistence:
 23 | * We want reliable and efficient off-memory storage for retained messages.
 24 | * We want a more effective mechanism for indexing retained messages, requiring less memory and being more automatic (like LTS).
 25 | * We want a more flexible mechanism for changing the storage schema and retention policies.
 26 | * We want to take advantage of any implemented DS backends for storing retained messages.
 27 | 
 28 | ## Design
 29 | 
 30 | ### Straightforward approach
 31 | 
 32 | The straightforward approach is to use just the same DS as for the regular message replay:
 33 | * have a different DB for retained messages;
 34 | * append (`store_batch`) retained messages into the DB;
 35 | * interpret empty bodies as tombstones;
 36 | * on message lookup, calculate streams and immediately fold them to reduce each topic to the last message.
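
The lookup step above can be sketched as a fold over an append-only stream. This is a minimal illustration in Python rather than Erlang; `Message` and `fold_retained` are hypothetical names and are not part of the `emqx_ds` API.

```python
from typing import Iterable, NamedTuple


class Message(NamedTuple):
    topic: str
    payload: bytes


def fold_retained(stream: Iterable[Message]) -> dict[str, Message]:
    """Reduce an append-only stream to the latest live message per topic."""
    latest: dict[str, Message] = {}
    for msg in stream:
        if msg.payload == b"":
            # An empty body is a tombstone: it deletes the retained message.
            latest.pop(msg.topic, None)
        else:
            # A later message for the same topic replaces the earlier one.
            latest[msg.topic] = msg
    return latest


stream = [
    Message("a/b", b"1"),
    Message("a/c", b"2"),
    Message("a/b", b"3"),
    Message("a/c", b""),
]
retained = fold_retained(stream)
```

Note that the fold has to visit every stored message and keep one entry per live topic in memory, which is acceptable for a single topic but costly for broad filters.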
 37 | 
 38 | #### Possible callback implementation
 39 | 
 40 | The retainer must provide the following callbacks:
 41 | 
 42 | ```erlang
 43 | -callback delete_message(context(), topic()) -> ok.
 44 | %% store_batch([tombstone_message()])
 45 | 
 46 | -callback store_retained(context(), message()) -> ok.
 47 | %% store_batch([message()])
 48 | 
 49 | -callback read_message(context(), topic()) -> {ok, list()}.
 50 | %% get_streams for a concrete topic & fold over a concrete topic
 51 | 
 52 | -callback page_read(context(), topic(), non_neg_integer(), non_neg_integer()) ->
 53 |     {ok, list()}.
 54 | %% get_streams for a filter & fold over all topics matching filter (!!)
 55 | 
 56 | -callback match_messages(context(), topic(), cursor()) -> {ok, list(), cursor()}.
 57 | %% get_streams for a filter & fold over all topics matching filter (!!)
 58 | 
 59 | -callback clear_expired(context()) -> ok.
 60 | %% drop generations
 61 | 
 62 | -callback clean(context()) -> ok.
 63 | %% drop all generations
 64 | 
 65 | -callback size(context()) -> non_neg_integer().
 66 | %% get_streams over all topics & fold (!!)
 67 | ```
 68 | 
 69 | #### Problems of the straightforward approach
 70 | 
 71 | * `page_read` is used from the dashboard and is often used with the `#` topic. Implementing the callback efficiently is impossible — we should fold over _all_ topics, sort somehow, and only then cut out the required page.
 72 | * The same is true for the `size` callback.
 73 | * The same is true for the `match_messages` callback. However, _currently_ we admit that the whole set of retained messages for a subscription is small enough to fit into memory.
 74 | 
 75 | ### Straightforward approach with optimizations
 76 | 
 77 | An obvious optimization is to use a slightly different key schema in the LTS storage implementation for retained messages: not
 78 | ```
 79 | ts_epoch | lts_index_bits | ts_rest | message_id => message
 80 | ```
 81 | 
 82 | but just
 83 | 
 84 | ```
 85 | lts_index_bits | topic => message
 86 | ```
 87 | 
 88 | so that the stored data is automatically "compacted": a new message overwrites the previous one for the same topic.
 89 | 
 90 | Other optimizations:
 91 | * We do not use generations for retained messages.
 92 | * We implement alternative sharding based on the topic, not on the client id/node id.
 93 | * To simplify replay, we may encapsulate streams for different shards into a single iterator.
 94 | 
 95 | Thus each topic is stored uniquely in the storage, and we do not need to fold over all topics to implement the `page_read` and `size` callbacks.
 96 | 
 97 | With this approach, we may implement the `match_messages` callback without folding, in constant space.
 98 | We use iterator state as `context()`.
 99 | 
100 | `page_read` may be implemented in the same manner; however, we need to "scroll" the iterator to the required page.
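
A rough sketch of such "scrolling", in Python for brevity. Everything here is an assumption made for illustration: a plain dict stands in for the ordered `topic => message` storage, pages are 1-based, and `topic_matches` is a simplified filter matcher that ignores the special handling of `$`-prefixed topics.

```python
from itertools import islice


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Simplified MQTT filter matching with `+` and `#` wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def page_read(store: dict, topic_filter: str, page: int, limit: int) -> list:
    """Scroll a lazy, topic-ordered iterator to the requested page."""
    # A real ordered store would iterate keys in order without sorting.
    matching = (
        store[topic] for topic in sorted(store) if topic_matches(topic_filter, topic)
    )
    start = (page - 1) * limit
    # islice skips the first `start` matches and keeps at most `limit` of them.
    return list(islice(matching, start, start + limit))


store = {"a/1": "m1", "a/2": "m2", "b/1": "m3", "a/3": "m4"}
second_page = page_read(store, "a/+", page=2, limit=2)
```

Only one page of messages is held at a time, but reaching page `N` still requires skipping the `N - 1` pages before it.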
101 | 
102 | ### Topic indexing approach
103 | 
104 | The alternative approach is that we use the DS only for _topic indexing_. That is, we store not
105 | ```
106 | lts_index_bits | topic => message
107 | ```
108 | but
109 | ```
110 | lts_index_bits | topic => #message{}
111 | ```
112 | 
113 | At the same time, we have a separate storage for the messages themselves, and we store them by topic.
114 | ```
115 | topic => message
116 | ```
117 | 
118 | The storage
119 | * is an ordered key-value storage, where the key is a topic, and the value is the message.
120 | * is not sharded/generational; only one message per topic is stored.
121 | 
122 | With this approach, the callbacks `page_read` and `size` are trivially implemented. Also, `read_message` is implemented by reading the message from the KV storage by a key.
123 | 
124 | #### Problems of the topic indexing approach
125 | 
126 | Although topic indexing is more flexible (e.g., we may index messages in replicated RocksDB DS but keep them in FoundationDB KV), it has some problems:
127 | * Things get entangled. The storage implementation passed to DS should cooperate with the "additional" KV storage
128 | of the retainer.
129 | * We need to tie many things together: DS, the DS storage implementation, and the KV storage.
130 | 
131 | ### Other approaches
132 | 
133 | We may create completely standalone storage for retained messages, not using high-level DS at all, but using only low-level DS primitives, like LTS tries and bitfield mappers.
134 | 
135 | ### Additional opportunities
136 | 
137 | In the straightforward approaches, we may still keep the TS part in the storage but additionally introduce some kind of "secondary index" where we keep the timestamped key by topic/clientid/etc.
138 | 
139 | The "secondary index" will still allow us to keep the storage "compacted" by topic/clientid/etc.: we will delete the old timestamped key when we store a new one.
140 | 
141 | E.g., to have compaction by topic:
142 | 
143 | 1. We want to insert a message `#message{topic="a/b/c", ts=TS1, ...} = message1`.
144 | 1. We insert the message into the storage `high_bits(TS1) | lts(topic) | low_bits(TS1) => message1`.
145 | 1. We insert the key `topic => high_bits(TS1) | lts(topic) | low_bits(TS1)` into the "secondary index".
146 | 
147 | Then, we want to insert the new message with the same topic:
148 | 
149 | 1. We want to insert the message `#message{topic="a/b/c", ts=TS2, ...} = message2`.
150 | 1. We insert the message into the storage `high_bits(TS2) | lts(topic) | low_bits(TS2) => message2`.
151 | 1. We get the old key `high_bits(TS1) | lts(topic) | low_bits(TS1)` from the "secondary index" by `topic`.
152 | 1. We insert the key `topic => high_bits(TS2) | lts(topic) | low_bits(TS2)` into the "secondary index".
153 | 1. We delete `message1` by the fetched old key `high_bits(TS1) | lts(topic) | low_bits(TS1)` from the storage.
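
The two insert sequences above can be sketched as follows. This is an illustrative Python model, not the DS implementation: `CompactingStore` and `make_key` are hypothetical names, the 16-bit split between high and low timestamp bits is an arbitrary assumption, and the topic string itself stands in for `lts(topic)`.

```python
LOW_BITS = 16  # assumed split point between high and low timestamp bits


def make_key(topic: str, ts: int) -> tuple:
    """Model `high_bits(TS) | lts(topic) | low_bits(TS)` as a tuple."""
    return (ts >> LOW_BITS, topic, ts & ((1 << LOW_BITS) - 1))


class CompactingStore:
    def __init__(self):
        self.storage = {}  # timestamped key -> message
        self.index = {}    # "secondary index": topic -> current timestamped key

    def put(self, topic: str, ts: int, message: str) -> None:
        key = make_key(topic, ts)
        # Insert the message under its timestamped key.
        self.storage[key] = message
        # Fetch the old key (if any) and point the index at the new one.
        old_key = self.index.get(topic)
        self.index[topic] = key
        # Delete the superseded message so that one entry per topic remains.
        if old_key is not None and old_key != key:
            del self.storage[old_key]


store = CompactingStore()
store.put("a/b/c", 1000, "message1")
store.put("a/b/c", 70000, "message2")
```

After the second `put`, only `message2` remains in `storage`, while the key still carries the timestamp parts that a regular replayer relies on.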
154 | 
155 | This will have the advantage of still being able to "subscribe" to any topic pattern with a regular DS replayer, with the semantics: "give me the actual message(data) and all the ongoing updates." In turn, it may be helpful for subscriptions to some kinds of events, like session registrations, takeovers, etc.
156 | 
157 | ### Conclusion
158 | 
159 | We choose the straightforward approach with optimizations.
160 | It allows us to reuse the existing DS implementations and abstractions. Also, with this approach, we may implement all the operations in constant space, which will be a significant improvement over the current implementation.
161 | 
162 | ## Configuration Changes
163 | 
164 | Currently, we have only one type of storage for retained messages:
165 | 
166 | ```
167 | retainer {
168 |     ...
169 |     backend {
170 |         type = built_in_database
171 |         ...
172 |     }
173 | }
174 | ```
175 | 
176 | Like in message persistence, we will have:
177 | 
178 | ```
179 | retainer {
180 |     ...
181 |     backends {
182 |         built_in_database {
183 |             enabled = false
184 |         }
185 |         fdb {
186 |             enabled = true
187 |             ds {
188 |                 ...
189 |                 # options for emqx_ds:open/2
190 |             }
191 |             ... other options
192 |         }
193 |     }
194 | }
195 | ```
196 | 
197 | ## Backwards Compatibility
198 | 
199 | No backwards compatibility issues are expected. Retainer configs having old `backend` will use the old storage, and those having `backends` will use the new one.
200 | 
201 | ## Testing Suggestions
202 | 
203 | ## Declined Alternatives
204 | 
205 | * Straightforward approach (without optimizations).
206 | * Topic-only indexing approach.
207 | 
208 | 


--------------------------------------------------------------------------------
/active/0028-assets/general-design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0028-assets/general-design.png


--------------------------------------------------------------------------------
/active/0028-durable-shared-subscriptions.md:
--------------------------------------------------------------------------------
  1 | # Durable Shared Subscriptions
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2024-05-10: @savonarola Initial draft
  6 | * 2024-06-28: @savonarola
  7 |     * Add the Agent abstraction
  8 |     * Describe the two-side communication sequence between an Agent and the SGL
  9 |     * Describe the stream reassignment algorithm
 10 | * 2025-01-08: @savonarola
 11 |     * Change the naming to contain fewer abbreviations and more comprehensive names
 12 |     * Extend the introductory section a bit to make the general problem clear
 13 |     * Add a section with the general layer structure
 14 |     * Update the interaction details according to the new simpler design
 15 | 
 16 | ## Abstract
 17 | 
 18 | We describe the implementation of shared subscriptions based on the concept of durable storage.
 19 | 
 20 | ## Motivation
 21 | 
 22 | Since we have a durable storage-based implementation for regular subscriptions, we want to extend its advantages to shared subscriptions. That is, mainly:
 23 | * To have messages persisted, that is, not lost regardless of the crashes or absence of consumers.
 24 | * To be able to replay messages from the past.
 25 | 
 26 | ## Design
 27 | 
 28 | ### General Idea
 29 | 
 30 | Shared subscriptions (or queues) are a feature that allows multiple consumers to consume messages from some topic filter cooperatively.
 31 | 
 32 | Several consumers subscribe to a special topic `$shared/GROUP_ID/TOPIC_FILTER` (or `$queue/GROUP_ID/TOPIC_FILTER` in the case of queues) and consume messages from `TOPIC_FILTER`. Each message goes to a single consumer.
 33 | 
 34 | With the durable backend (Durable Storage backend, DS), all messages pertain to the ordered _streams_ of messages. Streams may be read sequentially (possibly skipping some messages). Streams may be finalized (i.e., closed) and so fully consumed.
 35 | 
 36 | Since the streams are means of sharding messages, the natural idea is to use the same sharding mechanism for shared subscriptions. That is, assign disjoint subsets of streams to different consumers and let them consume their streams in parallel.
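
As a minimal illustration of "disjoint subsets of streams", the sketch below distributes streams round-robin in Python. It is not the stream reassignment algorithm of this proposal; `assign_streams` and the stream/consumer names are hypothetical.

```python
def assign_streams(streams: list[str], consumers: list[str]) -> dict[str, list[str]]:
    """Split streams into disjoint subsets, one subset per consumer."""
    assignment: dict[str, list[str]] = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    # Sorting makes the result deterministic regardless of input order.
    for i, stream in enumerate(sorted(streams)):
        assignment[consumers[i % len(consumers)]].append(stream)
    return assignment


assignment = assign_streams(["s1", "s2", "s3", "s4", "s5"], ["c1", "c2"])
```

Every stream ends up with exactly one consumer, so each message is read by a single member of the group.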
 37 | 
 38 | We need an entity responsible for distributing such streams across consumers.
 39 | We implement this entity as a cluster-unique process called **Shared Group Leader** or simply **Leader**.
 40 | 
 41 | The Leader is spawned by the node-local leader registry when the first consumer connects to the group subscription.
 42 | 
 43 | The global registration mechanism is based on the DS precondition feature, which allows the creation/deletion of message entries in the DS atomically.
 44 | The current leader owns a special data entry in the DS and periodically renews the ownership.
 45 | If the leader dies (e.g. its node goes down), the connected sessions eventually detect the leader loss and ask the node-local leader registry to find the leader again.
 46 | The leader registry either reclaims the leader's data in the DS and spawns a new leader or finds the data already reclaimed by another leader and responds to the session with the new leader's location.
 47 | 
 48 | 
 49 | Leaders' data is also stored in the DS. Note that the DS for the Leader registration and other data is completely separate from the DS for the messages.
 50 | 
 51 | The Leader keeps track of topics belonging to the group, their streams, and stream progress. It is the only entity that tracks the replay progress of these streams.
 52 | 
 53 | The consumers are persistent sessions. They connect to the Leader via the encapsulated **Agent** entity, and the Leader grants them streams to consume. Sessions consume these streams together with their own streams but do not persist the progress. Instead, they report the progress to the Leader.
 54 | 
 55 | The Leader is responsible for reassigning streams to the other sessions in case a consumer session disconnects and for reassigning streams to the new consumers.
 56 | 
 57 | ### Layer design
 58 | 
 59 | The high-level layers are:
 60 | * Session *Shared Subscription Handler*
 61 | * Shared Subscription *Agent*
 62 | * Shared Subscription *Borrower*
 63 | * Shared Subscription *Leader*
 64 | 
 65 | #### Session Shared Subscription Handler
 66 | 
 67 | Session Shared Subscription Handler (or simply Shared Subscription Handler) is the session-side facade for the shared subscriptions.
 68 | It is a counterpart of the module responsible for the regular (private) session subscriptions.
 69 | 
 70 | The Shared Subscription Handler
 71 | * Handles the `on_subscribe`/`on_unsubscribe` events from the session, creating/deleting subscriptions in the session's state and forwarding the requests to the Agent.
 72 | * Receives stream granting/revocation messages from the Agent and injects stream states into the session's state and the scheduler.
 73 | * Receives stream consumption progress updates and sends them to the Agent.
 74 | 
 75 | So, the Shared Subscription Handler knows how the session works but nothing about how the streams are obtained and managed. This knowledge is encapsulated in the **Agent** abstraction.
 76 | 
 77 | #### Shared Subscription Agent
 78 | 
 79 | The Agent is the entity that provides the interface for the Shared Subscription Handler to obtain stream granting/revocation events and reports stream consumption progress.
 80 | 
 81 | For the community edition, the durable shared subscription feature is unavailable. The Agent is implemented as a stub that does not perform any actions, so sessions' subscriptions and unsubscriptions have no effect.
 82 | A client with a durable session may still subscribe to a shared subscription topic, but no corresponding messages will be delivered.
 83 | 
 84 | For the enterprise edition, the Agent actually communicates with the Leaders, receives streams for consumption, and reports stream consumption progress.
 85 | 
 86 | Technically, the Agent itself does not have much communication logic, because it handles _all_ shared subscriptions of a single session. So its responsibility is to maintain a collection of Shared Subscription Borrowers and to forward events belonging to the particular shared subscription to the corresponding Borrower.
 87 | 
 88 | #### Shared Subscription Borrower
 89 | 
 90 | The Borrower is an entity within the Agent. It is responsible for a single shared subscription. It talks to the Leader, receives streams for subscription, and reports stream consumption progress.
 91 | 
 92 | It is important that the Borrowers within the session's Agent are isolated from each other and are _not identified_ by the group ID + topic filter. In case of a quick unsubscribe/subscribe sequence, there may be multiple Borrowers within the same Agent talking to the same Leader: one connecting to the Leader and the others finalizing the previous subscriptions.
 93 | 
 94 | #### Shared Subscription Leader
 95 | 
 96 | The Leader is the entity that is responsible for a single shared subscription group. The Leader
 97 | * Tracks and renews streams for the shared subscription's topic filter.
 98 | * Tracks the connected Borrowers.
 99 | * Assigns and revokes streams to the Borrowers.
100 | * Receives stream consumption progress updates from the Borrowers.
101 | * Persists the shared subscription's state (e.g. stream consumption progress).
102 | 
103 | #### Layer interaction
104 | 
105 | ![general-design](./0028-assets/general-design.png)
106 | 
107 | The Shared Subscription Handler, Agent, and Borrowers are nested session-side entities: The Shared Subscription Handler encapsulates an Agent, which encapsulates a collection of Borrowers. Communication between them is done via simple function calls.
108 | 
109 | Note that Borrowers are the innermost entities, so messages to and from the Leader are propagated opaquely through the Agent and Shared Subscription Handler layers.
110 | 
111 | The Leader resides in a separate process, so it communicates with Borrowers via a completely asynchronous message-based protocol. The Leader is a cluster-unique process, so it may run on any node in the cluster, e.g. a different node from the one where a session resides.
112 | 
113 | ### Protocol between Borrower and Leader
114 | 
115 | The most complicated part is the asynchronous protocol between a Borrower and a Leader. The other interactions (Agent and Borrower, Shared Subscription Handler and Agent) are mostly forwarding events and callbacks.
116 | 
117 | On the Borrower side, we have
118 | * A state machine for the Borrower's state as a whole.
119 | * A collection of state machines for each stream granted to the Borrower.
120 | 
121 | #### Borrower's statuses
122 | 
123 | The Borrower's statuses are the following:
124 | * `connecting` - the Borrower is created (a client subscribed to a shared subscription or restored an existing subscription).
125 | It looks for a Leader by periodically sending `find_leader` messages.
126 | * `connected` - the Borrower is connected to the Leader, receiving streams (or revoke commands) and reporting progress.
127 | * `unsubscribing` - the session unsubscribed from the shared subscription. The Borrower is waiting for consistent progress from the session, reports it, and terminates.
128 | 
129 | There are no cyclic status transitions; the statuses change as follows:
130 | `[new]` -> `connecting` -> `connected` -> `unsubscribing` -> `[destroyed]`
131 | 
132 | If a Borrower detects an inconsistent state (e.g., an unexpected message from the Leader), it terminates itself and asks the enclosing Agent to recreate it from scratch. The new Borrower will obtain a new identifier, and the Leader will see it as a completely new Borrower.
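A minimal Python sketch of the acyclic status machine, assuming hypothetical names (`Status`, `ALLOWED`, `BorrowerStatus`, `InconsistentState`). It also assumes that termination is reachable from every status, since disconnects and invalidation can end a Borrower at any point.

```python
from enum import Enum


class Status(Enum):
    CONNECTING = "connecting"
    CONNECTED = "connected"
    UNSUBSCRIBING = "unsubscribing"
    DESTROYED = "destroyed"


# Each status lists the statuses it may move to; there is no way back.
ALLOWED = {
    Status.CONNECTING: {Status.CONNECTED, Status.DESTROYED},
    Status.CONNECTED: {Status.UNSUBSCRIBING, Status.DESTROYED},
    Status.UNSUBSCRIBING: {Status.DESTROYED},
    Status.DESTROYED: set(),
}


class InconsistentState(Exception):
    """Signals that the enclosing Agent should recreate the Borrower."""


class BorrowerStatus:
    def __init__(self):
        self.status = Status.CONNECTING

    def move_to(self, new_status):
        if new_status not in ALLOWED[self.status]:
            raise InconsistentState(f"{self.status.value} -> {new_status.value}")
        self.status = new_status
```

An attempt to go back, for example from `unsubscribing` to `connected`, raises `InconsistentState`. In this sketch that stands in for the Borrower terminating itself and being recreated with a new identifier.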
133 | 
134 | The Borrower has the following timers:
135 | * In the `connecting` state, there is a periodic find-leader timer. It is used to reissue the `find_leader` message if the Leader is not found.
136 | * In the `connected` state, there is a periodic ping timer and a ping response timeout timer. When the ping timer fires, a ping message is sent to the Leader. If there is no response within the ping timeout, the Borrower invalidates itself (stops and asks the enclosing Agent to recreate it from scratch).
137 | * In the `unsubscribing` state, there is an unsubscribe timeout timer. If the Borrower does not receive the final consistent progress from the session within the timeout, it reports incomplete progress and terminates.
138 | 
139 | ### Individual stream states
140 | 
141 | Each stream has its own state, which is one of the following:
142 | * `granted` - the stream is granted to the Borrower.
143 | * `revoking` - the stream is being revoked from the Borrower by the Leader.
144 | 
145 | Stream state transitions are also acyclic: `[absent]` -> `granted` -> `revoking` -> `[absent]`.
146 | 
147 | A stream becomes `granted` when the Leader assigns it to the Borrower (a `grant` event is received).
148 | 
149 | A stream becomes `revoking` when the Leader revokes it from the Borrower (a `revoke` event is received).
150 | On revoke, the stream is marked as `unsubscribed` in the enclosing session but still belongs to the Borrower.
151 | The Borrower waits for the final consistent progress from the session.
152 | 
153 | The stream is removed when a `revoked` event is received from the Leader.
154 | This means that the Leader confirms that the final progress is received.
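The per-stream lifecycle on the Borrower side can be sketched as follows. The `StreamSet` class is hypothetical, and the handling of repeated messages (the Leader resends on timeout) is an assumption of this sketch rather than something the text specifies.

```python
class StreamSet:
    """Tracks the state of every stream currently held by one Borrower."""

    def __init__(self):
        self.states = {}  # stream -> "granted" | "revoking"

    def on_grant(self, stream):
        # Assumption: a repeated `grant` keeps whatever state the stream has.
        self.states.setdefault(stream, "granted")

    def on_revoke(self, stream):
        # The stream stays in the set until the Leader confirms with `revoked`.
        if stream in self.states:
            self.states[stream] = "revoking"

    def on_revoked(self, stream):
        # The Leader has the final progress: forget the stream and acknowledge.
        self.states.pop(stream, None)
        return "revoke_finished"
```

A stream absent from `states` corresponds to the `[absent]` pseudo-state, so the only path through the dictionary is `granted`, then `revoking`, then removal.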
155 | 
156 | ### Messages/callbacks between the Borrower and the Leader
157 | 
158 | #### `connecting` state
159 | 
160 | From Borrower:
161 | * `leader_wanted` — a request to find the Leader for the shared subscription.
162 | Since the Borrower is not connected to the Leader yet, it sends this message to a node-local leader registry. The registry will find the Leader and the Leader will respond with a `leader_connect_response` message.
163 | 
164 | From Leader to Borrower:
165 | * `leader_connect_response` — the Leader responds to the `leader_wanted` message. The response contains the Leader's id.
166 | 
167 | From the enclosing Agent/Session:
168 | * `on_disconnect`, `on_unsubscribe` — since we have no streams, we send `disconnect` message to the Leader and terminate the Borrower.
169 | 
170 | #### `connected` and `unsubscribing` states
171 | 
172 | From Borrower:
173 | * `ping` — a periodic ping message to keep the connection alive.
174 | * `disconnect` — a message to disconnect from the Leader. The message contains the latest progress of all granted streams.
175 | * `update_progress` — a message to update the progress of the stream consumption.
176 | * `revoke_finished` — a message to notify the Leader that the stream revocation is finished.
177 | 
178 | From Leader to Borrower:
179 | * `ping_response` — a response to the `ping` message.
180 | * `grant` — a message that the Leader grants a stream to the Borrower.
181 |   * in `unsubscribing` state it is ignored.
182 |   * in `connected` state, the granted stream is added to the Borrower's stream set and an event is returned to the enclosing session to install the stream.
183 | * `revoke` — a message that the Leader revokes a stream from the Borrower.
184 |   * in `unsubscribing` state it is ignored.
185 |   * in `connected` state, the stream is marked as `revoking` and an event is returned to the enclosing session to unsubscribe from the stream. The stream stays in the Borrower's stream set until the final progress is received.
186 | * `revoked` — a message that the Leader confirms that the final progress is received.
187 |   * in `unsubscribing` state it is ignored.
188 |   * in `connected` state, the stream is removed from the Borrower's stream set. We respond to the Leader with a `revoke_finished` message.
189 | 
190 | From the enclosing Agent/Session:
191 | * `on_disconnect` — we send the current progress to the Leader and terminate the Borrower.
192 | * `on_unsubscribe` — we move the Borrower to the `unsubscribing` state.
193 | * `on_stream_progress` — we send the progress to the Leader via the `update_progress` message.
194 | 
195 | #### All state messages
196 | 
197 | `invalidate` — a message that the Leader wants to invalidate the Borrower. The Borrower terminates itself and asks the enclosing Agent to recreate it from scratch.
198 | 
199 | ### Leader's logic
200 | 
201 | The Leader maintains:
202 | * The renewed set of streams for the topic filter of the shared subscription.
203 | * The progress of each stream.
204 | * The set of connected Borrowers.
205 | * The assignment of streams to the Borrowers.
206 | 
207 | The stream assignment to a borrower has the following statuses:
208 | * `granting` — the stream is being granted to the Borrower.
209 | * `granted` — the stream is assigned to the Borrower.
210 | * `revoking` — the stream is being revoked from the Borrower.
211 | * `revoked` — the stream is revoked from the Borrower.
212 | 
213 | Periodically, or after some events, the Leader runs the stream reassignment process.
214 | 
215 | The stream reassignment process is the following:
216 | * We renew the set of streams for the topic filter.
217 | * We check the total number of streams and the registered Borrowers.
218 | * We calculate the desired number of streams per Borrower.
219 | * For Borrowers having more streams than desired, we revoke some of their streams.
220 | * For Borrowers having fewer streams than desired, we grant some free streams (not assigned to any Borrower).
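The reassignment steps above can be sketched in Python. The function name `plan_reassignment`, the `stream -> borrower` dictionary, and the rounding rule (ceiling of streams per Borrower) are assumptions of this example; the text does not say how the desired number is computed.

```python
import math


def plan_reassignment(streams, assignments, borrowers):
    """Return (to_revoke, to_grant) as lists of (stream, borrower) pairs.

    `assignments` maps each assigned stream to its Borrower; streams that are
    missing from it are considered free.
    """
    if not borrowers:
        return [], []
    # Assumed policy: nobody should hold more than the rounded-up fair share.
    desired = math.ceil(len(streams) / len(borrowers))

    owned = {borrower: [] for borrower in borrowers}
    for stream, borrower in assignments.items():
        if borrower in owned:
            owned[borrower].append(stream)

    # Borrowers above the desired count give up their surplus streams.
    to_revoke = [
        (stream, borrower)
        for borrower, held in owned.items()
        for stream in held[desired:]
    ]

    # Borrowers below the desired count receive streams nobody holds yet.
    free = [stream for stream in streams if stream not in assignments]
    to_grant = []
    for borrower, held in owned.items():
        missing = desired - len(held)
        while missing > 0 and free:
            to_grant.append((free.pop(0), borrower))
            missing -= 1
    return to_revoke, to_grant
```

A stream that is being revoked only becomes free once its revocation has finished, so in this sketch it is handed to another Borrower on a later reassignment run, not in the same one.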
221 | 
222 | The granting process is the following:
223 | * We create the stream assignment `stream <-> borrower_id` in the `granting` status.
224 | * We send the `grant` message to the Borrower together with the stream and its progress.
225 | * We resend the `grant` message on timeout.
226 | * After the Borrower receives the `grant` message, it starts to send stream progress.
227 | * On receiving the progress, we consider the stream granted and update the stream assignment status to `granted`.
228 | 
229 | The revoking process is the following:
230 | * We move the stream assignment `stream <-> borrower_id` to the `revoking` status.
231 | * We send the `revoke` message to the Borrower.
232 | * We resend the `revoke` message on timeout.
233 | * After the `revoke` message is received by the Borrower, it starts to finalize the stream consumption.
234 | * When we receive the progress from the Borrower with the stream final progress, we move the stream assignment status to `revoked`.
235 | * We send the `revoked` message to the Borrower.
236 | * We resend the `revoked` message on timeout.
237 | * After the Borrower receives the `revoked` message, it deletes all the stream-related data and responds with a `revoke_finished` message.
238 | * On receiving the `revoke_finished` message, the Leader deletes the stream assignment.
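The granting and revoking processes together form one small state machine per stream assignment on the Leader side. A sketch as a transition table: the message names (`grant`, `revoke`, `revoked`, `revoke_finished`) come from the text, while the event labels `assign`, `reassign`, `timeout`, `progress` and `final_progress` are made up for the example.

```python
# (current status, event) -> (next status, message sent to the Borrower)
TRANSITIONS = {
    ("absent", "assign"): ("granting", "grant"),
    ("granting", "timeout"): ("granting", "grant"),      # resend `grant`
    ("granting", "progress"): ("granted", None),
    ("granted", "reassign"): ("revoking", "revoke"),
    ("revoking", "timeout"): ("revoking", "revoke"),     # resend `revoke`
    ("revoking", "final_progress"): ("revoked", "revoked"),
    ("revoked", "timeout"): ("revoked", "revoked"),      # resend `revoked`
    ("revoked", "revoke_finished"): ("absent", None),    # assignment deleted
}


def run(events, status="absent"):
    """Feed events to one assignment; return its final status and sent messages."""
    sent = []
    for event in events:
        status, message = TRANSITIONS[(status, event)]
        if message is not None:
            sent.append(message)
    return status, sent
```

Any `(status, event)` pair missing from the table raises `KeyError`, which in this sketch plays the role of an unexpected message.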
239 | 
240 | ### Configuration Changes
241 | 
242 | ### Backwards Compatibility
243 | 
244 | One of the main difficulties is the coexistence of durable shared subscriptions with regular shared subscriptions, for example, when an in-memory session consumes messages from a shared group backed by durable storage.
245 | 
246 | ### Document Changes
247 | 
248 | ### Testing Suggestions
249 | 
250 | ### Declined Alternatives
251 | 
252 | The previous PoC implementation proved too complex both to implement and to understand.
253 | * There was no one-to-one Borrower <-> Subscription correspondence. That made resubscribing complicated and led to a lot of complex logic in the Shared Subscription Handler.
254 | * Consequently, the Borrowers handled invalidation and resubscription themselves. Their state machine was larger and had cycles.
255 | * The Borrowers and the Leader did not have separate communication levels (connection maintenance vs. stream assignment and progress reporting). Instead, the Leader and the Borrower exchanged versioned sets of streams, which also appeared to be too complex.
256 | 
257 | 


--------------------------------------------------------------------------------
/active/0029-assets/cluster-link-route-repl-msg-fwd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0029-assets/cluster-link-route-repl-msg-fwd.png


--------------------------------------------------------------------------------
/active/0029-assets/cluster-link-route-repl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/active/0029-assets/cluster-link-route-repl.png


--------------------------------------------------------------------------------
/implemented/0002-new-config-syntax.md:
--------------------------------------------------------------------------------
  1 | # New Configuration Syntax for EMQ X v5.0
  2 | 
  3 | ## Abstract
  4 | 
  5 | Introduce a new configuration format in the EMQX v5.0 release.
  6 | 
  7 | ## Motivation
  8 | 
  9 | - The `k=v` format is too verbose
 10 | - Should support hierarchy and list
 11 | 
 12 | ## Design
 13 | 
 14 | ### HOCON Style
 15 | 
 16 | Node/cluster config in HOCON style:
 17 | 
 18 | ```hocon
 19 | node {
 20 |   name: "emqx@127.0.0.1"
 21 |   cookie: emqxsecretcookie
 22 |   data_dir: "{{ platform_data_dir }}"
 23 |   global_gc_interval: 15m
 24 |   crash_dump: "{{ platform_log_dir }}/crash.dump"
 25 | }
 26 | 
 27 | cluster {
 28 |   name: emqxcl
 29 |   proto_dist: inet_tcp
 30 |   discovery: manual
 31 |   autoheal: on
 32 |   autoclean: 5m
 33 | }
 34 | ```
 35 | 
 36 | Listener config in HOCON style:
 37 | 
 38 | ```hocon
 39 | listener.tcp {
 40 |   bind: "0.0.0.0:1883"
 41 |   zone: default
 42 |   acceptors: 8
 43 |   max_conn_rate: 1000
 44 |   max_connections: 1024000
 45 | }
 46 | 
 47 | listener.tcp {
 48 |   bind: "127.0.0.1:11883"
 49 |   zone: internal
 50 |   acceptors: 4
 51 |   max_connections: 1024000
 52 |   max_conn_rate: 1000
 53 |   active_n: 1000
 54 |   tcp_options: ${tcp.options} //Substitution
 55 | }
 56 | 
 57 | listener.ssl {
 58 |   bind: 8883
 59 |   zone: default
 60 |   acceptors: 16
 61 |   max_conn_rate: 1000
 62 |   max_connections: 102400
 63 |   include "ssl.conf" //Include
 64 | }
 65 | 
 66 | listener.ws {
 67 |   bind: 8083
 68 |   zone: default
 69 |   acceptors: 4
 70 |   max_conn_rate: 1000
 71 |   max_connections: 102400
 72 |   mqtt_path: /mqtt
 73 | }
 74 | 
 75 | listener.wss {
 76 |   bind: 8084
 77 |   zone: default
 78 |   enable_ssl: on
 79 |   acceptors: 4
 80 |   max_conn_rate: 1000
 81 |   max_connections: 102400
 82 |   mqtt_path: /mqtt
 83 | }
 84 | 
 85 | tcp.options {
 86 |   backlog: 512
 87 |   send_timeout: 5s
 88 |   send_timeout_close: on
 89 |   recbuf: 64KB
 90 |   sndbuf: 64KB
 91 |   buffer: 16KB
 92 | }
 93 | ```
 94 | 
 95 | ### YAML
 96 | 
 97 | YAML is a declined option. We have decided to use HOCON as a configuration format.
 98 | 
 99 | ## Rationale
100 | 
101 | ## Implementation
102 | 
103 | We need a parser that reads a HOCON file into a nested map.
104 | 
105 | `foo.conf`
106 | ```
107 | a = 1
108 | b = { p = 2 }
109 | ```
110 | 
111 | ```
112 | hocon:load("foo.conf").
113 | #{a => 1, b => #{p => 2}}
114 | ```
115 | 
116 | We may also need each value to carry metadata.
117 | Add `filename` and `line` by default to print helpful error info.
118 | We should also support injecting metadata from the API,
119 | e.g. if the dashboard has a config editor, we may want to know who updated the value.
120 | 
121 | ```
122 | hocon:load("foo.conf", #{include_metadata => {true, [{'changed_at', 100}, {'changed_by', "kiyofuji"}]}}).
123 | #{a => #{ value => 1,
124 |           metadata => #{ filename => "foo.conf",
125 |                          line => 1,
126 |                          'changed_at' => 100,
127 |                          'changed_by' => "kiyofuji" }},
128 |   b => #{ value => #{ p => #{ value => 2,
129 |                               metadata => #{ filename => "foo.conf", 
130 |                                              line => 2, 
131 |                                              'changed_at' => 100, 
132 |                                              'changed_by' => "kiyofuji" }}}, 
133 |           metadata => #{ filename => "foo.conf", 
134 |                          line => 2, 
135 |                          'changed_at' => 100, 
136 |                          'changed_by' => "kiyofuji" }}}
137 | ```
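The wrapping shown above can be sketched in Python to make the recursion explicit. This is an illustration only: `with_metadata` is a hypothetical helper, and it assumes the parser has already produced a tree in which every key maps to a `(value, line)` pair.

```python
def with_metadata(tree, filename, extra=None):
    """Wrap every value of a parsed config tree with its metadata.

    `tree` maps each key to a `(value, line)` pair; a nested object is a dict
    of the same shape. `extra` holds metadata injected by the caller.
    """
    extra = dict(extra or {})
    result = {}
    for key, (value, line) in tree.items():
        if isinstance(value, dict):
            value = with_metadata(value, filename, extra)
        result[key] = {
            "value": value,
            "metadata": {"filename": filename, "line": line, **extra},
        }
    return result


# The `foo.conf` example: `a = 1` is on line 1, `b = { p = 2 }` is on line 2.
parsed = {"a": (1, 1), "b": ({"p": (2, 2)}, 2)}
wrapped = with_metadata(parsed, "foo.conf", {"changed_at": 100, "changed_by": "kiyofuji"})
```

Injected metadata is copied into every level, which matches the shape of the Erlang map above.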
138 | 
139 | The map is then passed into cuttlefish,
140 | where validation of values and so on takes place in the same way as in v4.x.
141 | Hence, we also need to modify cuttlefish to accept the above maps as input
142 | and to add metadata (filename and line) to the error info.
143 | 
144 | ## References
145 | 
146 | - [HOCON Config](https://github.com/lightbend/config)
147 | - [SAP Integrations and Data Management](https://help.sap.com/viewer/50c996852b32456c96d3161a95544cdb/1905/en-US/25550740941d434b8c003347601af0ac.html)
148 | - [HashiCorp Resources](https://www.terraform.io/docs/configuration/syntax.html)
149 | 
150 | 


--------------------------------------------------------------------------------
/implemented/0004-assets/agent-fsm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0004-assets/agent-fsm.png


--------------------------------------------------------------------------------
/implemented/0004-assets/agent-fsm.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | 
 3 | catchup: Replaying transactions from the rlog
 4 | switchover: Transient state: buffering realtime events\nwhile still replaying records from the rlog
 5 | normal: Forwarding the rlog table\nmnesia events
 6 | 
 7 | [*] --> catchup
 8 | catchup --> switchover : reached a record in the rlog\nthat is newer than\nnow() - SafeInterval
 9 | switchover --> normal : reached the end of rlog
10 | 
11 | @enduml
12 | 


--------------------------------------------------------------------------------
/implemented/0004-assets/replicant-fsm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0004-assets/replicant-fsm.png


--------------------------------------------------------------------------------
/implemented/0004-assets/replicant-fsm.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | 
 3 | bootstrap: Receiving all the records\n from the core node
 4 | local_replay: Replaying transactions\n that have been buffered locally\n during bootstrap
 5 | normal: Remote transactions are applied\ndirectly to the replica
 6 | 
 7 | [*] --> bootstrap : local checkpoint not\n found or too old
 8 | [*] --> normal : local checkpoint\n is compatible
 9 | bootstrap --> local_replay : received bootstrap_complete
10 | local_replay --> normal : reached the end of the local rlog
11 | 
12 | @enduml
13 | 


--------------------------------------------------------------------------------
/implemented/0004-assets/replication-msc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0004-assets/replication-msc.png


--------------------------------------------------------------------------------
/implemented/0004-assets/replication-msc.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | scale 1024 width
 3 | 
 4 | participant "RLOG server" as server #ffc
 5 | participant "RLOG agent" as agent #ffc
 6 | participant "RLOG bootstrapper" as boot_serv #ffc
 7 | 
 8 | participant "RLOG replica" as repl #ccf
 9 | participant "bootstrap client" as boot_client #ccf
10 | 
11 | activate server
12 | activate repl
13 | 
14 | group Agent initialization
15 |   repl -> server : {connect, LocalCheckpointTS}
16 |   note over server : LocalCheckpointTS is too old.\n Client needs to bootstrap
17 |   server -\\ agent : spawn(now() - SafeInterval)
18 |   activate agent
19 |   repl <- server : {need_bootstrap, AgentPID}
20 | end
21 | 
22 | group Bootstrapper initialization
23 |   hnote over repl : bootstrap
24 | 
25 |   repl -\\ boot_client : spawn()
26 |   activate boot_client
27 | 
28 |   boot_client -> server : {bootstrap, self()}
29 |   server -\\ boot_serv : spawn(RemotePid)
30 |   activate boot_serv
31 | 
32 |   boot_serv -> boot_serv : mnesia:dirty_all_keys\nfor each table in shard
33 | 
34 |   server -> boot_client : {ok, Pid}
35 | end
36 | 
37 | group Bootstrap
38 |   note over boot_serv : Iterate through the\ncached keys
39 |   loop
40 |     boot_serv -> boot_client : {batch, [{Tab, Record}]}
41 |     boot_client -> boot_client : import batch to the\ntable replica
42 |     boot_serv <- boot_client : ok
43 |   end
44 | 
45 |   note over agent : At the same time...
46 | 
47 |   loop
48 |     agent -> repl : {batch, [MnesiaOps]}
49 |     repl -> repl : cache batch to the local rlog
50 |   end
51 | 
52 |   boot_serv -> boot_client : bootstrap_complete
53 |   deactivate boot_serv
54 |   boot_client -> repl : bootstrap_complete
55 |   deactivate boot_client
56 | end
57 | 
58 | group local_replay
59 |   hnote over repl : local_replay
60 | 
61 |   note over repl : Iterate through the\ncached transactions
62 | 
63 |   loop
64 |     agent -> repl : {batch, [MnesiaOps]}
65 |     repl -> repl : cache batch in the local rlog
66 | 
67 |     repl -> repl : Import ops from the local rlog\nto the local replica
68 |   end
69 | 
70 |   note over repl : Reached the end of\nthe local rlog
71 | end
72 | 
73 | hnote over repl : normal
74 | 
75 | loop
76 |   agent -> repl : {batch, [MnesiaOps]}
77 |   repl -> repl : Import batch to the\nlocal replica
78 | end
79 | 
80 | @enduml


--------------------------------------------------------------------------------
/implemented/0004-async-mnesia-change-log-replication.md:
--------------------------------------------------------------------------------
  1 | # Async Mnesia transaction replication in EMQ X 5.0
  2 | 
  3 | ## Change log
  4 | 
  5 | * 2021-02-21: @zmstone Add more details
  6 | * 2021-03-01: @k32 Minor fixes
  7 | * 2021-03-05: @k32 Add more test scenarios and elaborate on the push model.
  8 | * 2021-03-11: @k32 Elaborate transaction key generation
  9 | * 2021-03-11: @k32 Add MSC diagrams for the bootstrap process
 10 | 
 11 | ## Abstract
 12 | 
 13 | Escape from Erlang distribution mesh network, embrace `gen_rpc`.
 14 | 
 15 | ## Motivation
 16 | 
 17 | The current replication (Mnesia) is based on full-mesh Erlang distribution which
 18 | does not scale well and has the risk of split-brain.
 19 | 
 20 | ## Design
 21 | 
 22 | ### Log-based replication for Mnesia
 23 | 
 24 | Log-based replication is the most commonly used approach in distributed
 25 | databases.
 26 | 
 27 | Typically when strong consistency is required, database operations or
 28 | transactions will have to be serialized by an elected leader which means all
 29 | nodes will have to delegate the operations through the leader.
 30 | The drawback of this approach is that the leader will easily become a bottleneck
 31 | when the cluster size grows.
 32 | 
 33 | For key-value stores, one way to solve this is to shard the database. In Riak
 34 | and Cassandra, for example, nodes form a hash ring and only manage keys hashed
 35 | to their ranges. The DB entrypoint may not have to be the leader, or there may
 36 | be no leader at all. As soon as this happens, the consistency is no longer
 37 | 'strong' and conflicts need to be resolved, e.g. when two clients try to write
 38 | the same key concurrently and hit two different nodes in the cluster which do
 39 | not sync with each other.
 40 | 
 41 | If the write is a primitive value set operation, last-write-wins is typically
 42 | good enough to resolve conflicts. If the writes update a small part of an
 43 | object, CRDTs come to the rescue.
 44 | 
 45 | While we still lack a full investigation of how much of the data in EMQ X
 46 | requires CRDTs to get away from ACID transactions, the two types of data below
 47 | seem to have a simple enough schema for last-write-wins.
 48 | 
 49 | * Routing tables `emqx_route`, `emqx_trie` and `emqx_trie_node`.
 50 | * Global channel registry table `emqx_channel_registry`.
 51 | 
 52 | After all, we use Mnesia dirty APIs to write some of the tables.
 53 | 
 54 | ### Async-replication of Mnesia changes
 55 | 
 56 | TODO: check if dirty operations in a transaction trigger activity logging
 57 | 
 58 | * Log Mnesia changes in the Mnesia cluster
 59 | 
 60 | A pseudo implementation of the transaction layer:
 61 | 
 62 | ```
 63 | transaction(Fun, Args) ->
 64 |   Fun2 = fun() ->
 65 |     ok = Fun(Args),
 66 |     Changes = get_mnesia_activity(),
 67 |     Key = generate_key(erlang:timestamp(), node()),
 68 |     %% Note: Real code should avoid traversing the Changes list multiple times:
 69 |     [ok = write_ops_to_another_table(Shard, Key, find_ops_for_shard(Changes, Shard)) || Shard <- shards()],
 70 |     ok
 71 |   end,
 72 |   %% Commit the original changes and the per-shard rlog entries atomically:
 73 |   {atomic, ok} = mnesia:transaction(Fun2).
 74 | ```
 75 | 
 76 | Where `Changes` is essentially a list of table operations like:
 77 | 
 78 | ```
 79 | [ {{TableName, Key}, Record, write},
 80 |   {{TableName, Key}, Record, delete}
 81 | ]
 82 | ```
 83 | 
 84 | Changes are pushed to the replicant nodes over `gen_rpc`.
 85 | Replicant nodes issue a `watch` call to one of the core nodes.
 86 | The core node creates an agent process that issues `gen_rpc` calls to the replicant nodes using data about transactions that were recorded to the rlog.
 87 | Once the replication is close to the end of the rlog, the agent process subscribes to the realtime stream of mnesia events for the rlog table, and starts feeding the replicant with realtime stream of OPs.
 88 | The time threshold to identify 'close to the end of rlog' should be configurable, and the realtime stream should start after the agent reaches the last historical event, possibly with some overlap.
 89 | 
 90 | Note that `generate_key` function in the above pseudocode can affect the overall design of the system in a rather fundamental way, so let's consider some alternatives.
 91 | 
 92 | #### Alternative 1: using monotonic timestamp + node id for the key
 93 | 
 94 | This kind of key guarantees uniqueness, but not ordering.
 95 | It can be used to prevent lock conflicts while writing to the rlog table, but it MUST NOT be used to establish the order of events.
 96 | Consider the following situation: there are two core nodes `N1` and `N2`, and a replicant node `N3`.
 97 | Let's use `N1`'s clock as the reference.
 98 | Suppose `N2`'s clock drift is `Dt > 0`.
 99 | Consider the following timeline of events:
100 | 
101 | 1. At `T1`, `N2` commits a transaction that sets value of key `K` to `V1`, with the transaction key `{T1 + Dt, N2}`.
102 | 1. At `T2`, `N1` commits another transaction that sets value of `K` to `V2`, with the transaction key `{T2, N1}`.
103 |    Suppose `0 < T2 - T1 < Dt`.
104 | 1. `N3` replays the records in the rlog table, and it first replays the transaction `{T2, N1}`, then `{T1 + Dt, N2}`.
105 |    This leads to inconsistency: value of `K` on the core nodes equals `V2`, but on the replicant nodes it equals `V1`.
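The timeline above can be replayed in a few lines of Python. The concrete numbers (`T1 = 100`, `T2 = 103`, `Dt = 5`) and the dictionary standing in for the rlog table are assumptions chosen for illustration.

```python
def replay(rlog):
    """Apply writes in transaction-key order, as a naive rlog traversal would."""
    state = {}
    for _txn_key, (key, value) in sorted(rlog.items()):
        state[key] = value
    return state


T1, T2, DT = 100, 103, 5  # 0 < T2 - T1 < DT

# Commits in real-time order: N2 writes V1 first, then N1 writes V2.
commits = [
    ((T1 + DT, "N2"), ("K", "V1")),  # N2's clock runs DT ahead
    ((T2, "N1"), ("K", "V2")),
]

core_state = {}
for _txn_key, (key, value) in commits:
    core_state[key] = value          # core nodes see the real commit order

replica_state = replay(dict(commits))  # the replicant sorts by transaction key
```

Here `core_state` ends up with `V2` while `replica_state` ends up with `V1`, because `{103, N1}` sorts before `{105, N2}`.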
106 | 
107 | Therefore traversing the rlog table in the natural key order can lead to inconsistencies and must not be used.
108 | The rlog table must be used for mnesia events only, and the actual contents of this table can be discarded.
109 | In order to keep the historical transactions used to bootstrap the replicant, a separate storage is needed.
110 | It could be a disk log that contains the rlog table events.
111 | (TODO: How to deal with the gaps in the transaction log while the core node is down?)
112 | 
113 | #### Alternative 2: using a globally (or partially) ordered transaction key
114 | 
115 | A naive implementation of the global ordering key can be the following: there is a global server generating transaction keys.
116 | The keys look like this: `{GenerationId, Counter}`.
117 | For each shard, there is one core node that runs such a server.
118 | This node is elected using a textbook distributed consensus algorithm.
119 | Every time the shard key server restarts, `GenerationId` counter increments.
120 | `Counter` is an integer that is incremented every time anyone talks to the server, and resets to 0 when the server restarts.
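A minimal Python sketch of such a key server. The `ShardKeyServer` class is hypothetical; leader election is left out, and `restart` simply assumes that the last `GenerationId` survives the restart somehow.

```python
class ShardKeyServer:
    """Hands out `(GenerationId, Counter)` transaction keys for one shard."""

    def __init__(self, generation_id):
        self.generation_id = generation_id
        self.counter = 0  # resets to 0 on every (re)start

    def next_key(self):
        key = (self.generation_id, self.counter)
        self.counter += 1
        return key

    def restart(self):
        # Assumption: the previous generation is known after the restart.
        return ShardKeyServer(self.generation_id + 1)
```

Tuples compare element by element, so every key issued after a restart sorts after every key issued before it, even though the counter started again from 0.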
121 | 
122 | The obvious downside of this naive implementation is the additional latency added to the transaction.
123 | It's an open question whether or not this additional latency will hurt the throughput of the database in a significant way.
124 | 
125 | The benefit of this solution is that it allows the contents of the `rlog` table to be used in a meaningful way.
126 | 
127 | ### Actors
128 | 
129 | #### RLOG Server
130 | 
131 | RLOG server is a `gen_server` process that runs on the core node.
132 | It is responsible for the initial communication with the RLOG replica processes, and spawning RLOG agent and RLOG bootstrapper processes.
133 | 
134 | #### RLOG Agent
135 | 
136 | RLOG agent is a `gen_statem` process that runs on the core node.
137 | This process's lifetime is tied to the lifetime of the remote RLOG replica process.
138 | It is responsible for pushing the transaction ops from the core node to the replicant node.
139 | The agent operates in two modes: `catchup` where it reads transactions from the rlog, and `normal` where it forwards realtime mnesia events.
140 | There should be a third transient state called `switchover` where the agent subscribes to the mnesia events, while still consuming from the rlog.
141 | This is to ensure overlap between the stored transactions and the events.
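The three states can be sketched as a pure transition function in Python, following the transition conditions given in `0004-assets/agent-fsm.uml`. The function name and its parameters are assumptions of the sketch; the real agent is a `gen_statem` process.

```python
def next_agent_state(state, record_ts, now, safe_interval, at_end_of_rlog):
    """Return the agent's state after it has processed one rlog record."""
    if state == "catchup" and record_ts > now - safe_interval:
        # Close enough to the end: subscribe to realtime events, keep replaying.
        return "switchover"
    if state == "switchover" and at_end_of_rlog:
        # Stored transactions are exhausted; only realtime events remain.
        return "normal"
    return state
```

The transient `switchover` state is what guarantees the overlap: realtime events are already being buffered while the last stored records are still being replayed.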
142 | 
143 | ![Agent FSM](0004-assets/agent-fsm.png)
144 | 
145 | #### RLOG Replica
146 | 
147 | RLOG replica is a `gen_statem` process that runs on the replicant node.
148 | It spawns during the node startup under the `rlog` supervisor, and is restarted indefinitely.
149 | It talks to the RLOG server in its `init` callback, and establishes connection to the remote RLOG agent process.
150 | In some cases it also creates a bootstrap client process and manages it.
151 | 
152 | ![Replicant FSM](0004-assets/replicant-fsm.png)
153 | 
154 | Full process of shard replication:
155 | 
156 | ![Replication MSC](0004-assets/replication-msc.png)
157 | 
158 | #### RLOG bootstrapper (client/server)
159 | 
160 | RLOG bootstrapper is a temporary `gen_server` process that runs on both core and replicant nodes during replica initialization.
161 | RLOG bootstrapper server runs `mnesia:dirty_all_keys` operation on the tables within the shard, and then iterates through the cached keys.
162 | For each table and key pair it performs `mnesia:dirty_read` operation and caches the result.
163 | If the value for the key is missing, such record is ignored.
164 | Records are sent to the remote bootstrapper client process in batches.
165 | Bootstrapper client applies batches to the local table replica using dirty operations.
166 | 
167 | ### Bootstrapping Empty Nodes
168 | 
169 | The transaction logs should have a limited retention, so it is impossible to keep all the changes from the very beginning.
170 | 
171 | An empty node will have to fetch all the records from transaction log before applying the real-time change logs.
172 | 
173 | Bootstrapping can be done using dirty operations.
174 | Transaction log has an interesting property: replaying it can heal a partially corrupted replica.
175 | Transaction log replay can fix missing or reordered updates and deletes, as long as the replica has been consistent prior to the first replayed transaction.
176 | This healing property of the TLOG can be used to bootstrap the replica using only dirty operations. (TODO: prove it)
177 | One downside of this approach is that the replica contains subtle inconsistencies during the replay, and cannot be used until the replay process finishes.
178 | It should be mandatory to shut down business applications while bootstrap and syncing are going on.
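The healing property can be illustrated with a toy model in Python. This is an example, not a proof: the operation format `(kind, key, value)` and the specific race between the dirty copy and the writes are assumptions.

```python
def apply_op(table, op):
    kind, key, value = op
    if kind == "write":
        table[key] = value
    elif kind == "delete":
        table.pop(key, None)  # deleting a missing key is a no-op
    else:
        raise ValueError(f"unknown operation: {kind}")


def replay(table, log):
    for op in log:
        apply_op(table, op)
    return table


core_at_start = {"a": 1, "b": 1}
log = [
    ("write", "a", 2),
    ("delete", "b", None),
    ("write", "c", 3),
    ("write", "a", 4),
]

# The dirty bootstrap raced with the writes: "a" was copied after the first
# write, "b" before its delete, and "c" was not in the cached key list.
replica_after_bootstrap = {"a": 2, "b": 1}

core_final = replay(dict(core_at_start), log)
replica_final = replay(dict(replica_after_bootstrap), log)
```

Both tables converge to the same contents because each operation overwrites or removes a whole record, so replaying the ordered log erases whatever intermediate value the dirty copy picked up. Between the bootstrap and the end of the replay, the replica still holds the stale `b` record, which is the temporary inconsistency mentioned above.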
179 | 
180 | ### Zombie fencing in push model
181 | 
182 | In the push model, a replicant node should make sure *not* to ingest transaction pushes from a stale core node which may have a zombie agent lingering around.
183 | That is, a replicant node should 'remember' which node it is watching, and upon receiving transactions from an unknown node,
184 | it should reply to the push calls with a rejection error message.
185 | 
186 | With this implementation, there should not be a need for the core nodes to coordinate with each other.
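A minimal Python sketch of the fencing check, assuming a hypothetical `Replicant` class and plain strings for node names and replies:

```python
class Replicant:
    """Accepts transaction pushes only from the core node it is watching."""

    def __init__(self):
        self.watched_node = None
        self.applied = []

    def watch(self, core_node):
        # Remember the node chosen for this replication session.
        self.watched_node = core_node

    def handle_push(self, from_node, batch):
        if from_node != self.watched_node:
            # A zombie agent on a stale core node: refuse without applying.
            return "rejected"
        self.applied.extend(batch)
        return "ok"
```

The decision uses only state local to the replicant, which is why the core nodes do not need to coordinate with each other.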
187 | 
188 | ### Preventing infinite bootstrap / catchup loop
189 | 
190 | There is a problem of catchup never completing, when bootstrap takes longer than the rlog retention time.
191 | In order to work around this problem the replicant node shall start consuming transactions from the rlog in parallel with bootstrapping.
192 | 
193 | ## Configuration Changes
194 | 
195 | Two new configuration entries need to be added to `emqx.conf`:
196 | 
197 | 1. `node_role`: enum [`core`, `replicant`]
198 | 2. `core_nodes`: a list of core nodes for a `replicant` node to 'watch'
199 |    and from which transaction logs are fetched.
200 | 
201 | ## Backwards Compatibility
202 | 
203 | A `replicant` node should never originate data `write`s and `delete`s.
204 | However, all the nodes are still clustered using Erlang
205 | distribution, so some of the `rpc`s (such as cluster_call) should not be made
206 | towards the replicant nodes if they are intended for writes.
207 | 
208 | ## Document Changes
209 | 
210 | 1. New clustering setup guide
211 | 2. Update configuration doc for new config entries
212 | 
213 | ## Testing Suggestions
214 | 
215 | 1. Regression: clustering test in github actions.
216 | 1. Functionality: generate data operations (write and delete),
217 |    apply operations and compare data integrity between core and replicant nodes
218 | 1. Performance: benchmark throughput and latency
219 | 1. Regression: test clock skews.
220 | 
221 |    1. Create a cluster of two core nodes (A and B) and a replicant node C.
222 |    1. Set time to the future on one of the core nodes, say A
223 |    1. Restart the replicant node, make sure A node detects that first and removes the routes to C
224 |    1. Immediately connect some clients to the replicant node
225 |    1. Check that the replicant node didn't lose its own routes after replaying the transactions from the rlog
226 | 
227 | ## Declined Alternatives
228 | 
229 | * `riak_core` was the original proposal. It was declined because the change is
230 |   considered too radical for the next release. We may re-visit it in the future.
231 | 
232 | * Bootstrapping replicant nodes using mnesia checkpoints is the easiest option, that guarantees consistent results.
233 |   Mnesia checkpoint is a standard feature that is used, for example, to perform backups.
234 |   Core node can activate a local checkpoint containing all the tables needed for the shard, and then iterate through it during bootstrap process.
235 |   Records from the mnesia checkpoint can be sent in batches using the same protocol as the online replication.
236 | 
237 |   This solution has downsides, though:
238 | 
239 |   + Checkpoints take non-trivial amount of resources and may slow down Mnesia: in order to make the checkpoint consistent, mnesia spawns a retainer process and installs a hook before the transaction commits.
240 |     This hook forwards values of the records to the retainer process, before they are overwritten.
241 |     Retainer process saves the values of the old records in a separate table.
242 |     Given that the snapshot is going to be updated with all the recent transactions as soon as bootstrapping completes, this is excessive.
243 | 
244 |   + Checkpoint API in mnesia is designed in imperative style, and lifetime management of the checkpoints can be nontrivial, considering that core nodes can restart, replicant nodes can reconnect, and so on.
245 | 
246 | * Pull-based transaction replication model, where agent processes don't subscribe to mnesia events, but replicant nodes periodically poll for new records in the transaction log on the core nodes.
247 | 


--------------------------------------------------------------------------------
/implemented/0007-rocksdb-mnesia-backend.md:
--------------------------------------------------------------------------------
 1 | ## Rocksdb Mnesia backend for EMQ X v5.0
 2 | 
 3 | ```
 4 | Author: Zaiming Shi<stone@emqx.io>
 5 | Status: Draft
 6 | Type: Design
 7 | Created: 2020-10-27
 8 | EMQ X Version: 5.0
 9 | Post-History:
10 | ```
11 | 
12 | ## Abstract
13 | 
14 | Add [mnesia_rocksdb](https://github.com/aeternity/mnesia_rocksdb) as a mnesia
15 | backend, so that we can get away from troubles created by dets.
16 | 
17 | 


--------------------------------------------------------------------------------
/implemented/0010-improved-monitoring.md:
--------------------------------------------------------------------------------
  1 | # Improve performance monitoring
  2 | 
  3 | ## Change log
  4 | 
  5 | * 2021-03-03: @k32 Initial draft
  6 | 
  7 | ## Abstract
  8 | 
  9 | Integrate a changed version of [system_monitor](https://github.com/klarna-incubator/system_monitor/) into EMQX to collect `process_info` data in the background.
 10 | 
 11 | ## Motivation
 12 | 
 13 | Investigation of performance bottlenecks can be greatly simplified by utilizing BEAM VM introspection functions, such as `processes` and `process_info`.
 14 | Well-known libraries like `recon` and `observer` make use of this data. However, these libraries don't collect historical data.
 15 | 
 16 | Historical data about Erlang processes is of special interest during analysis of bottlenecks.
 17 | Also, sometimes the designer needs to investigate a one-time, non-reproducible event.
 18 | `system_monitor` application runs in the background all the time, collecting information about the activities in the BEAM VM.
 19 | Therefore it has a better chance of capturing the relevant data.
 20 | 
 21 | ## Design
 22 | 
 23 | Currently, `system_monitor` is designed to publish the telemetry to Kafka, which is not suitable for EMQX.
 24 | This design can be simplified, and the telemetry data should be written to the local log files managed by OTP kernel `logger` instead.
 25 | "Abnormal node state" detection can be incorporated into `system_monitor` to reduce the size of the log files: `system_monitor` should only log data when BEAM schedulers are saturated.
 26 | 
 27 | Example log entry format:
 28 | 
 29 | ```
 30 | 
 31 | [#{app_memory => [{unknown,3507672},{system_monitor,2093320}],  %% List of top OTP applications by memory consumption
 32 |    app_top => %% List of top OTP applications by reduction consumption
 33 |        [{system_monitor,0.9504843084075939},
 34 |         {unknown,0.04951569159240604}],
 35 |    proc_top => %% List of top N erlang processes with the largest memory, reduction or mailbox size:
 36 |        {{1614,779722,299981},
 37 |         [#erl_top{pid = "<0.10.0>",dreductions = 1.4991432396385467,
 38 |                   dmemory = 0.0,reductions = 4817775,memory = 1115180,
 39 |                   message_queue_len = 0,
 40 |                   current_function = {erl_prim_loader,loop,3},
 41 |                   initial_call = {erlang,apply,2},
 42 |                   registered_name = erl_prim_loader,stack_size = 7,
 43 |                   heap_size = 17731,total_heap_size = 139267,
 44 |                   current_stacktrace = [{erl_prim_loader,loop,3,[]}],
 45 |                   group_leader = "<0.0.0>"},
 46 |          #erl_top{pid = "<0.44.0>",dreductions = 1.4991432396385467,
 47 |                   dmemory = 0.0,reductions = 100455,memory = 460260,
 48 |                   message_queue_len = 0,
 49 |                   current_function = {gen_server,loop,7},
 50 |                   initial_call = {erlang,apply,2},
 51 |                   registered_name = application_controller,stack_size = 8,
 52 |                   heap_size = 10958,total_heap_size = 57380,
 53 |                   current_stacktrace = [{gen_server,loop,7,
 54 |                                                     [{file,"gen_server.erl"},{line,437}]}],
 55 |                   group_leader = "<0.152.0>"},
 56 |          #erl_top{pid = "<0.50.0>",dreductions = 1.4991432396385467,
 57 |                   dmemory = 0.0,reductions = 169004,memory = 142796,
 58 |                   message_queue_len = 0,
 59 |                   current_function = {code_server,loop,1},
 60 |                   initial_call = {erlang,apply,2},
 61 |                   registered_name = code_server,stack_size = 5,
 62 |                   heap_size = 6772,total_heap_size = 17730,
 63 |                   current_stacktrace = [{code_server,loop,1,
 64 |                                                      [{file,"code_server.erl"},{line,151}]}],
 65 |                   group_leader = "<0.152.0>"},
 66 |          #erl_top{pid = "<0.151.0>",dreductions = 202.3843373512038,
 67 |                   dmemory = 0.0,reductions = 15929,memory = 26612,
 68 |                   message_queue_len = 0,
 69 |                   current_function = {user_drv,server_loop,6},
 70 |                   initial_call = {user_drv,server,2},
 71 |                   registered_name = user_drv,stack_size = 10,heap_size = 2586,
 72 |                   total_heap_size = 3196,
 73 |                   current_stacktrace = [{user_drv,server_loop,6,
 74 |                                                   [{file,"user_drv.erl"},{line,191}]}],
 75 |                   group_leader = "<0.152.0>"},
 76 | ...
 77 | ```
 78 | 
 79 | Alternatively, logs can be written in a binary form, to save space.
 80 | Also different kinds of messages can be written to different log files instead of a single one.
 81 | 
 82 | Finally, `system_monitor` should be added as a release app to the EMQX relx configuration.
 83 | 
 84 | ## Configuration Changes
 85 | 
 86 | TBD. The following parameters might be configurable:
 87 | 
 88 | - "Abnormal load" threshold
 89 | 
 90 | - Frequency of data collection
 91 | 
 92 | - Log retention parameters
 93 | 
 94 | ## Backwards Compatibility
 95 | 
 96 | This change is backward-compatible
 97 | 
 98 | ## Document Changes
 99 | 
100 | Contents of the new logs should be documented.
101 | 
102 | ## Testing Suggestions
103 | 
104 | `system_monitor` has unit tests.
105 | 
106 | ## Declined Alternatives
107 | 


--------------------------------------------------------------------------------
/implemented/0010-runq-based-overload-protection.md:
--------------------------------------------------------------------------------
  1 | # Run queue based overload protection.
  2 | 
  3 | ## Changelog
  4 | * 2021-03-11: @qzhuyan Initial Draft
  5 | * 2021-04-08: @qzhuyan Don't hibernate process when overloaded.
  6 | 
  7 | ## Abstract
  8 | 
  9 | Run queue based overload protection for EMQX.
 10 | 
 11 | EMQX needs some mechanism to cool down the node when the node is overloaded.
 12 | 
 13 | Runq (Erlang VM run queue) is a performance critical metric.
 14 | It is a sign of overloading when the runq number is greater than the number of schedulers for a long period.
 15 | 
 16 | This document describes a mechanism by which EMQX can cool itself down by monitoring runq.
 17 | 
 18 | ## Motivation
 19 | 
 20 | Some users report that EMQX runs at 100% CPU load, and an etop trace shows that the runq of a 2-core node is around 100
 21 | and stays there for more than 5 minutes, until the user steps in and restarts the nodes.
 22 | 
 23 | Ideally, the runq should not be greater than the number of schedulers. The runq can grow
 24 | for a short period of time to handle peak traffic but should not stay high for longer than 1 minute.
 25 | 
 26 | Erlang is a soft real-time system. A long run queue can prevent time-critical processes from being scheduled on time, which may
 27 | lead to timeouts in upper-layer protocols. High CPU usage may also prevent users from logging in for O&M operations and leave nodes in a zombie state.
 28 | 
 29 | EMQX needs some mechanism to cool itself down before it runs into the situation above.
 30 | 
 31 | And most importantly, operators need to be notified that the system is overloaded so that they can take actions such as scaling up the node/cluster.
 32 | 
 33 | ## Design
 34 | 
 35 | ### Run queue monitoring and flagging of long runq
 36 | 
 37 | An isolated process runs under the top supervisor with high scheduling priority and polls the runq value every 3 seconds.
 38 | 
 39 | If the runq is longer than 10x (number of online schedulers)
 40 | for the last 5 polls, it should:
 41 | 
 42 | 1. Raise the overload flag by registering a process name such as '__long_runq__'.
 43 | 
 44 | 1. Notify monitoring systems, for example through logging and metrics counter updates.
 45 | 
 46 | 1. Trigger an alert if the flag is not cleared after [X timer] mins.
 47 | 
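The detection rule above can be sketched as a small sliding-window check. The class name `RunqMonitor` is hypothetical, and the clearing rule (drop the flag as soon as one poll is at or below the threshold) is an assumption, since the proposal does not define "back to normal" precisely. A real implementation would be an Erlang process that polls the runq on a 3-second timer.

```python
from collections import deque


class RunqMonitor:
    """Flags overload when the last `window` polls all exceed the threshold."""

    def __init__(self, schedulers_online, factor=10, window=5):
        self.threshold = factor * schedulers_online
        self.samples = deque(maxlen=window)
        self.overloaded = False

    def poll(self, runq_len):
        """Record one runq sample and return the current overload flag."""
        self.samples.append(runq_len)
        window_full = len(self.samples) == self.samples.maxlen
        # Assumption: a single normal sample is enough to clear the flag.
        self.overloaded = window_full and all(
            sample > self.threshold for sample in self.samples
        )
        return self.overloaded
```

With 2 online schedulers the threshold is 20, so five consecutive polls above 20 raise the flag and the next poll at or below 20 clears it.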
 48 | ### When the node needs to cool down
 49 | 
 50 | Each application of EMQX should decide for itself how to deal with the overload.
 51 | 
 52 | The overload flag should be checked in all performance-critical code paths, and the code should react to it.
 53 | 
 54 | Suggested actions:
 55 | 
 56 | - Reject new connections to keep the existing connections alive.
 57 | 
 58 |   consequences:
 59 | 
 60 |     New clients would retry and may get redirected by the load balancer to other nodes in the cluster.
 61 | 
 62 | - Defer processing of non-real-time requests such as Web UI requests.
 63 | 
 64 |   consequences:
 65 | 
 66 |     The Web UI would be slow to respond and operators would have to wait.
 67 | 
 68 | - Stop spawning more parallel workers.
 69 | 
 70 |   consequences:
 71 |     might hurt latency.
 72 | 
 73 | - Do not trigger active GC.
 74 | 
 75 | - [Unclear] Redirect requests to other nodes.
 76 | 
 77 |   It is unclear whether there is a protocol-level spec for this.
 78 | 
 79 | - Update backend status for polling from load balancer.
 80 | 
 81 |   consequences:
 82 | 
 83 |     The load balancer could lower the weight of this node in the pool
 84 |     or even remove the node from the pool for new connections.
 85 | 
 86 | - Different QoS messages would be handled differently.
 87 | 
 88 |   For example, low-priority messages could be dropped.
 89 |   
 90 | - Stop hibernating processes.
 91 | 
 92 |   gen_server should prefer not to go into the hibernate state.
 93 | 
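Two of the suggested actions can be sketched as simple checks of the overload flag on a hot path. The function names and return values here are hypothetical and only illustrate the shape of the decision. Each application would choose its own reactions.

```python
def on_new_connection(overloaded):
    """Reject new connections while overloaded to protect existing ones."""
    return "reject" if overloaded else "accept"


def after_request(overloaded, idle):
    """Hibernate only when idle and not overloaded.

    Assumption: going into and out of hibernation costs extra work,
    so it is skipped while the overload flag is raised.
    """
    if idle and not overloaded:
        return "hibernate"
    return "continue"
```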
 94 | ### When runq is back to normal.
 95 | 
 96 | '__long_runq__' should be unregistered.
 97 | 
 98 | 
 99 | ## Configuration Changes
100 | 
101 | 1. disable/enable overload protection
102 | 
103 | 1. The [timer X]
104 | 
105 | 1. Configuration for overload handling in different APPs.
106 | 
107 | ## Backwards Compatibility
108 | 
109 | N/A
110 | 
111 | ## Document Changes
112 | 
113 | 1. Description of overload protection.
114 | 
115 | 1. Operation
116 | 
117 | 1. Monitoring and Alerting
118 | 
119 | ## Testing Suggestions
120 | 
121 | Apart from regular performance tests, overload tests should be performed with at least 5X of regular rate.
122 | 
123 | ## Declined Alternatives
124 | 
125 | N/A
126 | 


--------------------------------------------------------------------------------
/implemented/0011-assets/current-implementation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0011-assets/current-implementation.png


--------------------------------------------------------------------------------
/implemented/0011-assets/current-implementation.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | 
 3 | participant Client                 as client
 4 | participant "Connection\nProcess"  as connection
 5 | participant emqx_cm_locker         as locker
 6 | box "Mnesia replication\nin the cluster"
 7 |   database Mnesia                  as mnesia
 8 | end box
 9 | box "Possibly Other Node"
10 |   participant "emqx_cm_locker"     as other_locker
11 |   participant "Connection"         as other_connection
12 | end box
13 | 
14 | client -> connection               : connect
15 | group emqx_cm                      : open_session
16 |   connection -> locker             : lock session id
17 |   locker <-> other_locker          : lock session id
18 |   connection <-> mnesia            : lookup session id
19 |   group emqx_cm                    : takeover_session
20 |     connection -> other_connection : begin takeover
21 |     connection <- other_connection : "#session{}"
22 |     rnote over other_connection
23 |       Buffer
24 |       messages
25 |     endrnote
26 |     rnote over connection
27 |       Take over
28 |       subscriptions.
29 |     endrnote
30 |     connection -> other_connection : end takeover
31 |     other_connection -> connection : all pending msgs
32 |   end
33 |   rnote over other_connection
34 |     Unsubscribe
35 |     and terminate
36 |   endrnote
37 |   connection -> mnesia             : register channel\n(emqx_cm_registry)
38 |   connection -> locker             : release lock
39 |   locker <-> other_locker          : release lock
40 | end
41 | 
42 | client <- connection      : connack
43 | client <- connection      : all pending msgs
44 | 
45 | @enduml
46 | 
47 | 


--------------------------------------------------------------------------------
/implemented/0011-assets/flows.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0011-assets/flows.png


--------------------------------------------------------------------------------
/implemented/0011-assets/flows.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | 
 3 | actor Subscriber as sub
 4 | participant Connection as subcon
 5 | participant Broker as broker
 6 | box "Mnesia replication\nin the cluster"
 7 |   database Mnesia as mnesia
 8 | end box
 9 | box "Possibly other node"
10 |   participant Writer as writer
11 |   participant Connection as pubcon
12 | end box
13 | actor Publisher as pub
14 | 
15 | == Clean start ==
16 | sub -> subcon : Connect\n(Session-Expiry > 0,\nClean-Start: 1)
17 | group ClientID lock acquired (in cluster)
18 |   subcon -> mnesia : Register (fresh) sessionID\nbased on clientID
19 |   subcon -> mnesia : Clean start\n(discard old sessionID)
20 | end
21 | 
22 | == Subscribe ==
23 | sub -> subcon : Subscribe
24 | subcon -> broker : Subscribe
25 | group Cluster-global transaction
26 |   broker -> mnesia : Store topic filter in trie and\nsession table
27 | end group
28 | subcon -> sub : suback
29 | 
30 | == Publish ==
31 | pub -> pubcon : publish
32 | pubcon -> mnesia : lookup topic in trie for\npersistent session
33 | pubcon -> mnesia : persist message
34 | pubcon -> writer : message
35 | writer -> mnesia : lookup sessionID
36 | writer -> mnesia : persist session\nmessage details
37 | rnote over pubcon
38 |   Still responsive
39 | endrnote
40 | writer -> subcon : message (RPC or direct send)
41 | subcon -> sub : message
42 | writer -> pubcon : ack
43 | pubcon -> pub : puback
44 | subcon -> mnesia : mark as delivered
45 | 
46 | 
47 | == Persistent resume (connection gone) ==
48 | sub -> subcon : Connect\n(Session-Expiry > 0,\nClean-Start: 0)
49 | group ClientID lock acquired (in cluster)
50 |   subcon -> mnesia : Register under the same sessionID as before
51 |   subcon -> mnesia : Get state
52 | end group
53 | group Recovery state machine
54 |   subcon -> mnesia : Get all pending messages
55 |   subcon -> sub : Pending messages
56 |   subcon -> mnesia : Mark as delivered
57 |   group Cluster-global transaction
58 |     subcon -> broker : Subscribe to topics (transaction)
59 |   end group
60 |   group For all writers in parallel
61 |     subcon -> writer : Sync marker
62 |     rnote over subcon
63 |       Drop all incoming messages from writer
64 |       These messages will eventually come
65 |       from the DB
66 |     end rnote
67 |     writer -> subcon : Sync marker
68 |     group Wait (poll) for marker in DB (in pending messages)
69 |       rnote over subcon
70 |         Buffer messages from writer
71 |       end rnote
72 |       subcon -> mnesia : Get pending messages from writer
73 |       subcon -> sub : pending messages
74 |       subcon -> mnesia : Mark as delivered
75 |     end group
76 |     writer -> mnesia : Sync marker
77 |     mnesia -> subcon : Sync marker
78 |     subcon -> sub : buffered messages from writer
79 |     subcon -> mnesia : Mark as delivered
80 |   end group
81 |   rnote over subcon
82 |     Normal operations
83 |   end rnote
84 | 
85 | @enduml
86 | 


--------------------------------------------------------------------------------
/implemented/0011-assets/init-fsm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0011-assets/init-fsm.png


--------------------------------------------------------------------------------
/implemented/0011-assets/init-fsm.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | 
 3 | 
 4 | state init as "Init session"
 5 | init: * Get pending messages from db
 6 | init: * Subscribe to topics
 7 | init: * Send pending messages to client
 8 | init: * Send marker to writers
 9 | init: * Discard messages from RTF if not marker
10 | 
11 | state writers as "For all nodes" {
12 |   state sync1 as "Sync DB flow (DBF) with writer"
13 |   sync1: * Poll for pending messages in db
14 |   sync1: * Send messages earlier than marker from db to client
15 | 
16 |   state sync2 as "Finalize sync with writer"
17 |   sync2: * Send buffered messages to client
18 | 
19 |   [*] --> sync1  : Marker received from RTF
20 |   sync1 --> sync2 : Marker received from DBF
21 | }
22 | 
23 | init --> writers
24 | 
25 | @enduml
26 | 


--------------------------------------------------------------------------------
/implemented/0011-jq-in-rule-engine.md:
--------------------------------------------------------------------------------
  1 | # An Example of EMQ X Improvement Proposal
  2 | 
  3 | ## Change log
  4 | 
  5 | * 2020-03-12: @terry-xiaoyu first draft
  6 | * 2020-03-21: @terry-xiaoyu add section for JQ NIF
  7 | 
  8 | ## Abstract
  9 | 
 10 | Introduce [JQ](https://stedolan.github.io/jq/) syntax to the SQL of emqx rule
 11 | engine.
 12 | 
 13 | ## Motivation
 14 | 
 15 | The emqx rule engine now supports a subset of SQL syntax for creating rules. We
 16 | also provide a set of SQL functions for manipulating the data structures.
 17 | 
 18 | A common use case is to transform JSON strings from MQTT messages.
 19 | The current solution is to first decode the JSON string to Erlang terms using
 20 | the sql function `json_decode/1`, assign the result to a temporary variable, and then
 21 | do the transformations. As a handy way to decode a JSON object and get the
 22 | value of a key, we can use the dot syntax, like `payload.x`.
 23 | 
 24 | The problem is that the sql functions we provide are too limited to transform
 25 | complex JSON strings. For instance, if we want to run a lambda for each
 26 | element of an array, we need a `map/2` or `reduce/3` function, and a sql
 27 | syntax for defining a lambda to do the transformation. JQ has done great work
 28 | on this, in a concise syntax. Because JQ is so widely used that it has become
 29 | the de facto standard for processing JSON strings, it would be nice if we
 30 | introduced it to the emqx rule engine.
 31 | 
 32 | ## Design
 33 | 
 34 | The SQL syntax of the emqx rule engine now supports a very limited set of operators
 35 | to retrieve and set the JSON values.
 36 | 
 37 | For example the following code snippet retrieves the value from an MQTT message
 38 | recursively:
 39 | 
 40 | ```
 41 |         SELECT payload.a.b[1].c
 42 |  Input: {"a": {"b": [{"c": 1}]}}
 43 | Output: 1
 44 | ```
 45 | 
 46 | And for updating a JSON object, we override the `AS` operator of SQL syntax.
 47 | 
 48 | The following example updates the value of key `c` from `1` to `0`:
 49 | 
 50 | ```
 51 |         SELECT 0 as payload.a.b[1].c
 52 |  Input: {"a": {"b": [{"c": 1}]}}
 53 | Output: {"a": {"b": [{"c": 0}]}}
 54 | ```
 55 | 
 56 | This is straightforward but far from "enough" for processing JSON strings.
 57 | JQ provides a simple and clean syntax for the frequently used operations, e.g.:
 58 | 
 59 | ```
 60 |         jq '[.user, .projects[]]'
 61 |  Input: {"user":"stedolan", "projects": ["jq", "wikiflow"]}
 62 | Output: ["stedolan", "jq", "wikiflow"]
 63 | ```
 64 | 
 65 | Where `[]` is a "collector" for collecting multiple elements into a single array,
 66 | and the operator `.projects[]` gets all of the values from the array `projects`
 67 | as multiple results. This would require several lines of code if we don't use
 68 | JQ: first we need to get the values of "user" and "projects" and then flatten the
 69 | results into a single list.
 70 | 
 71 | And here's a more complex example that reduces an array to accumulate the numbers:
 72 | 
 73 | ```
 74 |         jq 'reduce .[] as $item ([]; . + [$item.a]) | {"acc": .}'
 75 | Input:  [{"a": 1}, {"a": 2}, {"a": 3}]
 76 | Output: {"acc": [1,2,3]}
 77 | ```
 78 | 
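For comparison, the two JQ filters above correspond roughly to the following hand-written steps. This is only an illustration in Python with hypothetical function names. It says nothing about how the rule engine evaluates JQ.

```python
def collect_user_and_projects(doc):
    """Rough equivalent of jq '[.user, .projects[]]'."""
    # Get "user", get every element of "projects", flatten into one list.
    return [doc["user"], *doc["projects"]]


def accumulate_a(items):
    """Rough equivalent of jq 'reduce .[] as $item ([]; . + [$item.a]) | {"acc": .}'."""
    acc = []  # initial accumulator, the `[]` before the semicolon
    for item in items:
        acc = acc + [item["a"]]  # the update step `. + [$item.a]`
    return {"acc": acc}
```

Both functions reproduce the inputs and outputs shown in the JQ examples, but each new transformation needs new hand-written code, which is the gap the JQ syntax is meant to close.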
 79 | ### Timeout
 80 | 
 81 | Having a timeout when executing JQ code in the rule engine is important because
 82 | JQ programs can potentially execute forever (JQ is a Turing complete
 83 | programming language that allows recursive functions). JQ programs that execute
 84 | forever or a very long time are probably buggy and could cause performance
 85 | and debugging problems.
 86 | 
 87 | Additionally, if one allowed JQ programs to execute forever, one would need
 88 | a way to terminate them, for example, if a user wants to manually terminate a JQ
 89 | program. This could be tricky and time consuming to implement as one would need
 90 | an interface to monitor JQ programs and terminate specific ones.
 91 | 
 92 | ### Suggested JQ Syntax in Rule SQL
 93 | 
 94 | We'd better use JQ along with the rule SQL, as it is common to use the output of
 95 | an SQL function as the input of JQ, or to assign the output of JQ to an SQL
 96 | variable and then SELECT it as part of the SQL result.
 97 | 
 98 | One way is to use JQ as a normal SQL function, e.g.:
 99 | 
100 | ```
101 | SELECT
102 |     jq('reduce .[] as $item ([]; . + [$item.a]) | {"acc": .}',
103 |         payload) as result
104 | ```
105 | 
106 | The above suggestion has been added to the rule engine in the 5.0 release of
107 | EMQX. The second argument in the implementation can be a non-string value as in
108 | the example above. The second argument can also be a JSON value encoded as a
109 | string, in which case the function will automatically transform the argument to
110 | the encoded value before it is sent to the JQ program. An implicit timeout which
111 | can be configured with the `rule_engine.jq_function_default_timeout` setting is
112 | used to time out the JQ function after a certain amount of time. A JQ function
113 | that takes three arguments has also been added in the 5.0 release. The third
114 | argument can be used to explicitly specify a timeout in milliseconds.
115 | 
116 | To make the code cleaner, we could create an SQL keyword for JQ. This way
117 | we can remove the surrounding quotes from the JQ filters:
118 | 
119 | ```
120 | SELECT
121 |     JQ payload
122 |     DO
123 |         reduce .[] as $item ([]; . + [$item.a])
124 |         | {"acc": .}
125 |     END
126 | ```
127 | 
128 | As JQ can read input from the environment variables, we could simplify it more
129 | by setting all the available SQL variables into JQ filters as ENVs:
130 | 
131 | ```
132 | SELECT
133 |     JQ
134 |         $ENV.payload
135 |         | reduce .[] as $item ([]; . + [$item.a])
136 |         | {"acc": .}
137 |     END
138 | ```
139 | 
140 | The `JQ` clause doesn't have to be used in the `SELECT` clause of SQL. If we put
141 | `JQ` clause before the `SELECT`, it would look even better:
142 | 
143 | ```
144 | JQ
145 |     .payload
146 |     | reduce .[] as $item ([]; . + [$item.a])
147 |     | {"acc": .}
148 | SELECT
149 |     acc[1] as first
150 | ```
151 | 
152 | This way we use the output of the `JQ` as the input of `SELECT`. The above code
153 | snippet will output `{"first": 1}`.
154 | 
155 | The only problem now is that we cannot utilize the existing SQL functions.
156 | So we change it again by simply putting the `JQ` clause after the `SELECT`:
157 | 
158 | ```
159 | SELECT
160 |     decode(payload) as p
161 | JQ
162 |     .p
163 |     | reduce .[] as $item ([]; . + [$item.a])
164 |     | {"acc": .}
165 |     | {first: (.acc|.[0])}
166 | ```
167 | 
168 | This example does exactly what the previous example does. The SELECT clause can
169 | only output a single map result, and the map will then be piped to the JQ clause.
170 | 
171 | Another option is to support multiple languages for writing rule engine
172 | programs. One of those languages could be JQ and one could be the current SQL
173 | based language. With this suggestion the user would need to select which
174 | language to use. This could be done with a drop-down list in the GUI for
175 | adding rules. Here are some advantages with this suggestion:
176 | 
177 | * It is future proof as it would be easy to add a third language in the future
178 | * The syntax of each individual language would be less cluttered as one would
179 |   not need to combine the SQL based language with JQ
180 | * Implementation is simple (one does not need to extend the SQL based language)
181 | 
182 | If this suggestion is chosen, one could also add a feature that would make it
183 | possible to pipe the output of one rule engine program to another. For example,
184 | the user might want to pipe the output of a JQ rule engine program to one
185 | written in the SQL based language. This could be useful, for example, if one
186 | wants to use the FOREACH statement to output multiple messages from a single
187 | message, or if one wants to use a function that exists in the SQL based language
188 | but not in JQ.
189 | 
190 | ### Introduce JQ as Port, NIF or Compiled BEAM Code
191 | 
192 | JQ is written in portable C and ships as a single binary that takes the filter as a
193 | command line argument, reads input from stdin, and writes results to stdout. It supports Linux, OS X,
194 | FreeBSD, Solaris, and Windows for now, so the simplest way is to package the `jq`
195 | binary along with all of the emqx installation packages, and talk to `jq` using
196 | an [erlang port](http://erlang.org/doc/tutorial/c_port.html). Anyone who
197 | builds emqx from the source code of the emqx repo can put the jq binary into
198 | the right path according to the configuration. The current Erlang JQ library
199 | (see Section "JQ NIF/Port Library") uses a long running port program that uses
200 | an LRU cache to cache compiled programs for increased efficiency.
201 | 
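The caching idea can be sketched as a small least-recently-used cache keyed by the filter source. This is a generic illustration, not the code of the Erlang JQ library: the class name `ProgramCache` and the `compile_fn` callback are hypothetical stand-ins for whatever the port program does to compile a JQ filter.

```python
from collections import OrderedDict


class ProgramCache:
    """Keeps at most `capacity` compiled programs, evicting the least recently used."""

    def __init__(self, capacity, compile_fn):
        self._capacity = capacity
        self._compile = compile_fn
        self._entries = OrderedDict()

    def get(self, source):
        if source in self._entries:
            # Cache hit: mark as most recently used and skip compilation.
            self._entries.move_to_end(source)
            return self._entries[source]
        compiled = self._compile(source)
        self._entries[source] = compiled
        if len(self._entries) > self._capacity:
            # Evict the entry that has gone unused for the longest time.
            self._entries.popitem(last=False)
        return compiled
```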
202 | The second approach is a NIF, with the drawbacks of more code changes (compiling
203 | the code to a dynamic C library rather than a single binary) and reduced safety (a crash
204 | brings down the entire erlang system, and a call may hold up the erlang
205 | scheduler if it returns too late). But this way has the benefit of efficiency,
206 | and no independent `jq` binary is required. The NIF approach has also been
207 | implemented in the Erlang JQ library but it currently (2022-07-10) lacks support
208 | for timeouts (which is tricky to implement as it would require non-trivial
209 | modification to the main jq library).
210 | 
211 | The third approach is to compile JQ programs directly to BEAM bytecode. Here
212 | are some of the benefits one might get from doing this compared to the other
213 | approaches:
214 | 
215 | * Speed - No context switching (between BEAM code and port or dirty NIF thread),
216 |   and running JITed BEAM code will probably be faster than running interpreted
217 |   JQ byte code
218 | * Fairness - BEAM code is run on a main Erlang scheduler and preempted in the
219 |   same way as compiled Erlang code
220 | * Tools - This would play well with the Erlang VM's tools for tracing and
221 |   performance measuring and so on
222 | * Safety - Running BEAM code is safer than a NIF as a NIF can crash the VM,
223 |   cause hard to debug problems, and leak memory
224 | 
225 | One could get some of the benefits mentioned above by making the NIF
226 | implementation yielding. However this would be a lot of work (even though some
227 | of the work could be automated with
228 | [YCF](https://github.com/kjellwinblad/yielding_c_fun)), and would
229 | make upgrades of the JQ library more difficult.
230 | 
231 | Compiling JQ code to BEAM code would probably be quite straightforward. Both JQ
232 | and Erlang are functional languages. One option is to make use of the JQ
233 | compiler (which can be exposed to Erlang as a Port program) to transform JQ
234 | code to JQ bytecode, and then one only has to implement a transformation of JQ
235 | bytecode to BEAM bytecode. This is probably easier than transforming JQ code
236 | to BEAM bytecode without an intermediate step.
237 | 
238 | ### JQ NIF/Port Library
239 | 
240 | 
241 | An Erlang JQ library has been created at "https://github.com/emqx/jq". The
242 | library interface supports both a NIF based backend and a port based backend. The
243 | user can configure which backend to use. At the time of writing (2022-07-10),
244 | the JQ function in the rule engine can only use the port based backend as this
245 | backend is currently the only one that supports timeouts. There is a plan to
246 | also support timeouts in the NIF based backend. When that is possible, EMQX
247 | users should be given the option to configure which backend to use. Here are
248 | some examples that show how the library can be used:
249 | 
250 | ```
251 | rebar3 shell
252 | ...
253 | 
254 | 1> jqerl:process_json(<<".a">>, <<"{\"b\": 1}">>).
255 | {ok,[<<"null">>]}
256 | 
257 | 2> jqerl:process_json(<<".a">>, <<"{\"a\": {\"b\": {\"c\": 1}}}">>).
258 | {ok,[<<"{\"b\":{\"c\":1}}">>]}
259 | 
260 | 3> jqerl:process_json(<<".a|.[]">>, <<"{\"a\": [1,2,3]}">>).
261 | {ok,[<<"1">>,<<"2">>,<<"3">>]}
262 | ```
263 | 
264 | If everything is OK, the API `jqerl:parse/2` always returns a list as the second
265 | element of the tuple, because jq may have multiple outputs.
266 | 
267 | If there's some error in the jq filter or the json string, `{error, Reason}` will
268 | be returned:
269 | 
270 | ```
271 | 1> jqerl:parse(<<".a">>, <<"{\"a\": ">>).
272 | {error,{jq_err_parse,<<"Unfinished JSON term at EOF at line 1, column 6 (while parsing '{\"a\": ')">>}}
273 | ```
274 | 
275 | ## Configuration Changes
276 | 
277 | The `jq/2` function that was introduced in the EMQX 5.0 release reads the
278 | configuration setting `rule_engine.jq_function_default_timeout` to get the
279 | default timeout in milliseconds. We may also introduce a setting for
280 | configuring which backend to use (the port based one or the NIF based one) when
281 | we have implemented the timeout feature in NIF backend of the Erlang JQ
282 | library.  
283 | 
284 | ## Backwards Compatibility
285 | 
286 | There are no backward compatibility problems.
287 | 
288 | ## Document Changes
289 | 
290 | JQ can be used with the following syntax:
291 | 
292 | ```
293 | SELECT
294 |     *
295 | JQ
296 |     {u: .username, c: .clientid}
297 | FROM
298 |     "t/1"
299 | ```
300 | 
301 | This outputs `{"u": "Shawn", "c": "00001"}` if the user "Shawn" with client-id
302 | "00001" publishes a message to "t/1".
303 | 
304 | You can learn more about JQ [here](https://stedolan.github.io/jq/).
305 | 
306 | ## Testing Suggestions
307 | 
308 | Benchmarking for processing large JSON strings using the new JQ syntax is
309 | required.
310 | 
311 | ## Declined Alternatives
312 | 
313 | Another way to process JSON strings in SQL is to provide more SQL functions or
314 | SQL keywords, similar to what jq does. But this would be too complicated, and
315 | any syntax we created would struggle to beat JQ. The idea is that it's better to
316 | use a well-tested library to do the work than to spend time reinventing
317 | the wheel.
318 | 
319 | ## Status 2022-06-09: JQ Function Added to the Rule Engine in the EMQX 5.0 Release
320 | 
321 | 
322 | In the EMQX 5.0 release, we have introduced JQ functions to EMQX's SQL based
323 | rule engine language. This is the first suggestion discussed in the "Suggested
324 | JQ Syntax in Rule SQL" section above. Please read the first part of the
325 | "Suggested JQ Syntax in Rule SQL" Section for more details about the added
326 | functions.
327 | 
328 | This way of introducing JQ to the rule engine was chosen as it makes it possible
329 | to use JQ in the rule engine without invalidating any of the other suggestions
330 | for extending the syntax of the rule engine.
331 | 


--------------------------------------------------------------------------------
/implemented/0012-cluster-call.md:
--------------------------------------------------------------------------------
  1 | # Cluster call transaction
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2021-08-11: @zhongwencool Initial draft.
  6 | * 2021-08-30: @zhongwencool add "How to confirm that the commit has succeeded?" section.
  7 | 
  8 | ## Abstract
  9 | 
 10 | When EMQX updates cluster resources via the HTTP API, it first updates the local node's resources and then updates all other nodes via an RPC multicall to ensure the consistency of resources (configuration) in the cluster.
 11 | 
 12 | **To ensure consistency, the updates must eventually be applied on all nodes in the cluster**.
 13 | 
 14 | ## Motivation
 15 | 
 16 | The current solution first updates the resources of the local node and, once that succeeds, makes synchronous RPC calls to update the resources of the other nodes.
 17 | 
 18 | Resource updates may be lost during the RPC call.
 19 | 
 20 | - If there is a network disturbance during RPC, it may cause the RPC to fail.
 21 | - If a remote operation to update a resource fails, there is no retry or any other remedy, causing inconsistent configuration in the cluster.
 22 | - If multiple updates are performed concurrently, it may happen that node 1 performs updates in order 1, 2, 3, but node 2 updates in order 1, 3, 2. There is no order guarantee.
 23 | - Lack of replay. If a node is down for a while, there is a lack of history event replay to catch up with the changes happened during the down time.
 24 | 
 25 | ## Design
 26 | 
 27 | This proposal uses mnesia to record the execution status of MFA, to ensure the consistency of the final resources & data in the cluster.
 28 | 
 29 | This proposal is not applicable to high frequency request calls, because all updates are performed in strict order.
 30 | 
 31 | ### mnesia table structure
 32 | 
 33 | ```erlang
 34 | -record(cluster_rpc_mfa, {tnx_id :: pos_integer(), mfa :: mfa(), created_at :: calendar:datetime(), initiator :: node()}).
 35 | -record(cluster_rpc_commit, {node :: node(), tnx_id :: pos_integer()}).
 36 | mnesia(boot) ->
 37 |     ok = ekka_mnesia:create_table(?CLUSTER_MFA, [
 38 |         {type, ordered_set},
 39 |         {disc_copies, [node()]},
 40 |         {rlog_shard, ?COMMON_SHARD},
 41 |         {record_name, cluster_rpc_mfa},
 42 |         {attributes, record_info(fields, cluster_rpc_mfa)}]),
 43 |     ok = ekka_mnesia:create_table(?CLUSTER_COMMIT, [
 44 |         {type, set},
 45 |         {disc_copies, [node()]},
 46 |         {rlog_shard, ?COMMON_SHARD},
 47 |         {record_name, cluster_rpc_commit},
 48 |         {attributes, record_info(fields, cluster_rpc_commit)}]);
 49 | mnesia(copy) ->
 50 |     ok = ekka_mnesia:copy_table(cluster_rpc_mfa, disc_copies),
 51 |     ok = ekka_mnesia:copy_table(cluster_rpc_commit, disc_copies).
 52 | ```
 53 | 
 54 | - `tnx_id` increases strictly by 1, and all transactions must be executed in strict order. If node 1 has executed transactions 1, 2, 3, but node 2 keeps failing on transaction 2 after executing transaction 1, node 2 will keep retrying transaction 2 until it succeeds before executing transaction 3.
 55 | 
 56 | - `cluster_rpc_commit`: Records the maximum `tnx_id` executed by the node. All transactions up to and including this id have been executed successfully on this node.
 57 | - `cluster_rpc_mfa` (`ordered_set`): Records the MFA for each `tnx_id`. The latest 100 completed records are kept for querying and troubleshooting.
 58 | 
 59 | ### Flow
 60 | 
 61 | 1. `emqx_cluster_rpc` registers as a `gen_server` on each node, subscribes to simple events of the mnesia table, and is responsible for the execution of all transactions.
 62 | 
 63 | 2. On init, the `handler` catches up to the latest `tnx_id`. If the node's `tnx_id` is 5 but the latest MFA's `tnx_id` is 10, it will try to run the MFAs from 6 to 10 in 5 transactions.
 64 | 
 65 |    ```erlang
 66 |    init([Node, RetryMs]) ->
 67 |        {ok, _} = mnesia:subscribe({table, ?CLUSTER_MFA, simple}),
 68 |        {ok, #{node => Node, retry_interval => RetryMs}, {continue, ?CATCH_UP}}.
 69 |    
 70 |    handle_continue(?CATCH_UP, State) ->
 71 |        {noreply, State, catch_up(State)}.
 72 |    
 73 |    catch_up(#{node := Node, retry_interval := RetryMs} = State) ->
 74 |        case transaction(fun get_next_mfa/1, [Node]) of
 75 |            {atomic, caught_up} -> ?TIMEOUT;
 76 |            {atomic, {still_lagging, NextId, MFA}} ->
 77 |                case apply_mfa(NextId, MFA) of
 78 |                    ok ->
 79 |                        case transaction(fun commit/2, [Node, NextId]) of
 80 |                            {atomic, ok} -> catch_up(State);
 81 |                            Error ->
 82 |                                ?SLOG(error, #{
 83 |                                    msg => "mnesia write transaction failed",
 84 |                                    node => Node,
 85 |                                    nextId => NextId,
 86 |                                    error => Error}),
 87 |                                RetryMs
 88 |                        end;
 89 |                    _Error -> RetryMs
 90 |                end;
 91 |            {aborted, Reason} ->
 92 |                ?SLOG(error, #{
 93 |                    msg => "get_next_mfa transaction failed",
 94 |                    node => Node, error => Reason}),
 95 |                RetryMs
 96 |        end.
 97 |    get_next_mfa(Node) ->
 98 |        NextId =
 99 |            case mnesia:wread({?CLUSTER_COMMIT, Node}) of
100 |                [] ->
101 |                    LatestId = get_latest_id(),
102 |                    TnxId = max(LatestId - 1, 0),
103 |                    commit(Node, TnxId),
104 |                    ?SLOG(notice, #{
105 |                        msg => "New node first catch up and start commit.",
106 |                        node => Node, tnx_id => TnxId}),
107 |                    TnxId;
108 |                [#cluster_rpc_commit{tnx_id = LastAppliedID}] -> LastAppliedID + 1
109 |            end,
110 |        case mnesia:read(?CLUSTER_MFA, NextId) of
111 |            [] -> caught_up;
112 |            [#cluster_rpc_mfa{mfa = MFA}] -> {still_lagging, NextId, MFA}
113 |        end.
114 |    
115 |    get_latest_id() ->
116 |        case mnesia:last(?CLUSTER_MFA) of
117 |            '$end_of_table' -> 0;
118 |            Id -> Id
119 |        end.
120 |    ```
121 | 
122 | 3. If a new update operation is added, the `handler` will receive a write event of the `cluster_rpc_mfa` table.
123 |    It then runs a "read the next record" -> "execute action" -> "commit" loop, with each iteration triggered by mnesia events. The content of the events can be ignored.
124 | 
125 |    ```erlang
126 |    handle_info({mnesia_table_event, _}, State) ->
127 |        {noreply, State, catch_up(State)};
128 |    ```
129 | 
130 | 4. The initial transaction must be executed in the `emqx_cluster_rpc` process. If this transaction succeeds, the call returns success directly; if the transaction fails, the call aborts with failure.
131 | 
132 |    ```erlang
133 |    handle_call({initiate, MFA}, _From, State = #{node := Node}) ->
134 |        case transaction(fun init_mfa/2, [Node, MFA]) of
135 |            {atomic, {ok, TnxId}} ->
136 |                {reply, {ok, TnxId}, State, {continue, ?CATCH_UP}};
137 |            {aborted, Reason} ->
138 |                {reply, {error, Reason}, State, {continue, ?CATCH_UP}}
139 |        end;
140 |    
141 |    init_mfa(Node, MFA) ->
142 |        mnesia:write_lock_table(?CLUSTER_MFA),
143 |        LatestId = get_latest_id(),
144 |        ok = do_catch_up_in_one_trans(LatestId, Node),
145 |        TnxId = LatestId + 1,
146 |        mnesia:write(#cluster_rpc_commit{node = Node, tnx_id = TnxId}),
147 |        mnesia:write(#cluster_rpc_mfa{tnx_id = TnxId, mfa = MFA, initiator = Node, created_at = erlang:localtime()}),
148 |        ok = apply_mfa(MFA),
149 |        {ok, TnxId}. %% matches the {atomic, {ok, TnxId}} clause in handle_call
150 |    do_catch_up_in_one_trans(LatestId, Node) ->
151 |        case do_catch_up(LatestId, Node) of
152 |            caught_up -> ok;
153 |            ok -> do_catch_up_in_one_trans(LatestId, Node);
154 |            _ -> mnesia:abort("catch up failed")
155 |        end.
156 |    ```
157 | 
158 |    **Risk point**: If a previously unfinished MFA is executed successfully inside the `init_mfa` transaction, but the latest MFA fails and causes an abort, the commit of the previously unfinished MFA is rolled back as well, so that MFA will be executed again later. Therefore, MFAs must be idempotent.
159 | 
160 |    If nodes A and B have both completed transaction 4 and race to create transaction 5, A wins and commits transaction 5, so B can only create transaction 6. If the event for transaction 5 has not yet reached node B when B creates transaction 6, transaction 6 would be executed on B before transaction 5. So, inside the transaction that creates transaction 6, we must first check for and apply any uncompleted transactions (`do_catch_up_in_one_trans`).
161 | 
162 |    As an alternative to this solution, `init_mfa` could return an error directly if there are still transactions that have not been applied.
163 | 
164 | 5. Only the latest 100 completed records are kept, for querying and troubleshooting.
165 | 
166 | 6. The MFA function must return `ok|{ok, any()}` if it is executed successfully; otherwise it is marked as failed and retried later.
167 | 
168 |    ```erlang
169 |    apply_mfa({M, F, A}) ->
170 |        try erlang:apply(M, F, A)
171 |        catch E:R -> {error, {E, R}}
172 |        end.
173 |    ```
174 | 
175 | 7. How to confirm that the commit has succeeded?
176 | 
177 |    Callers differ in what they require for success: a strict requirement is that all nodes succeed before success is returned, while a more lenient requirement is that only the call on the local node succeeds. Therefore, the API takes a `SucceedNum` parameter (from 1 to `all`) that specifies how many nodes must succeed before success is reported to the caller. If the specified number of nodes has not succeeded within the specified time, the nodes that have not executed successfully are returned.
178 | 
179 |    Note: if set to `all`, the call will never succeed while any of the existing nodes is offline.
180 | 
181 |    Implementation details: after the initiator's first apply succeeds, do not return the result to the caller immediately. Instead, periodically check whether the number of successful nodes recorded in the mfa table meets the requirement, and return success once the condition is met. If the requirement is not met before the timeout, the failed nodes are returned.
182 | 
183 |    Alternative: add a `from` field (the caller's PID) to the MFA table to indicate who the caller is, and send a success message to that PID after each successful execution of the MFA. This approach has one drawback: since it is impossible to monitor other processes, with 3 nodes the caller receives 3 success messages, so if the required number of successes is 2, one message is left in the mailbox with nothing to handle it.
184 |    
185 | 8. What are the fixes for the case that the transaction keeps failing?
186 |    If a node keeps retrying an MFA unsuccessfully, subsequent transactions for this node will not be processed. Once it is clear that the error cannot be handled, the customer can manually instruct the corresponding node to skip the failing commit.
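
The `SucceedNum` check described in step 7 can be sketched as a pure function. This is only an illustration under stated assumptions: the module name `succeed_num_sketch`, the function `check/3`, and the representation of the commit table as a list of `{Node, TnxId}` pairs are hypothetical and not part of the actual `emqx_cluster_rpc` implementation.

```erlang
-module(succeed_num_sketch).
-export([check/3]).

%% Hypothetical helper: decide whether enough nodes have committed TnxId.
%% Commits is a list of {Node, LastCommittedTnxId} pairs that stands in
%% for the contents of the commit table.
-spec check(pos_integer(), pos_integer() | all, [{node(), non_neg_integer()}]) ->
          ok | {pending, [node()]}.
check(TnxId, SucceedNum, Commits) ->
    %% Nodes whose last committed id has reached TnxId are done.
    Done = [N || {N, Id} <- Commits, Id >= TnxId],
    Pending = [N || {N, Id} <- Commits, Id < TnxId],
    Required = case SucceedNum of
                   all -> length(Commits);
                   Num when is_integer(Num) -> Num
               end,
    case length(Done) >= Required of
        true -> ok;
        false -> {pending, Pending}
    end.
```

A caller could poll such a function until it returns `ok` or the timeout expires, and report the pending nodes in the latter case.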
187 | 
188 | ### API Design
189 | 
190 | ```erlang
191 | -spec(emqx_cluster_rpc:call(M, F, A, SucceedNum) -> {ok, TnxId, Result} | {error, Reason} when
192 |                           M :: module(),
193 |                           F :: atom(),
194 |                           A :: [term()],
195 |                           SucceedNum :: pos_integer() | all,
196 |                           Result :: ok | {ok, term()},
197 |                           TnxId :: pos_integer(), Reason :: term()).
198 | -spec(emqx_cluster_rpc:skip_failed_commit(node()) -> ok).
199 | -spec(emqx_cluster_rpc:reset() -> ok).
200 | -spec(emqx_cluster_rpc:status() -> [#{tnx_id => pos_integer(), mfa => mfa(),
201 |                                       pending_node => [node()],
202 |                                       initiator => node(),
203 |                                       created_at => calendar:datetime()}]).
204 | ```
205 | 
206 | ## Configuration Changes
207 | 
208 | ```yaml
209 | node.cluster_call {
210 |     ## Time interval to retry after a failed call
211 |     ##
212 |     ## @doc node.cluster_call.retry_interval
213 |     ## ValueType: Duration
214 |     ## Default: 1s
215 |     retry_interval = 1s
216 |     ## Retain the maximum number of completed transactions (for queries)
217 |     ##
218 |     ## @doc node.cluster_call.max_history
219 |     ## ValueType: Integer
220 |     ## Range: [1, 500]
221 |     ## Default: 100
222 |     max_history = 100
223 |     ## Time interval to clear completed but stale transactions.
224 |     ## Ensure that the number of completed transactions is less than the max_history
225 |     ##
226 |     ## @doc node.cluster_call.cleanup_interval
227 |     ## ValueType: Duration
228 |     ## Default: 5m
229 |     cleanup_interval = 5m
230 |     }
231 | ```
232 | 
233 | 
234 | 
235 | ## Backwards Compatibility
236 | 
237 | N/A
238 | 
239 | ## Document Changes
240 | 
241 | N/A
242 | 
243 | ## Testing Suggestions
244 | 
245 | The final implementation must include unit test or common test code. If some
246 | more tests such as integration test or benchmarking test that need to be done
247 | manually, list them here.
248 | 
249 | ## Declined Alternatives
250 | 
251 | Here goes which alternatives were discussed but considered worse than the current.
252 | It's to help people understand how we reached the current state and also to
253 | prevent going through the discussion again when an old alternative is brought
254 | up again in the future.
255 | 
256 | 


--------------------------------------------------------------------------------
/implemented/0013-gen-swagger-spec.md:
--------------------------------------------------------------------------------
  1 | # An Example of EMQ X Improvement Proposal
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2021-09-02: @zhongwencool init draft
  6 | 
  7 | ## Abstract
  8 | 
  9 | The HTTP REST API implementation generates swagger spec directly from the code, without the need to maintain an additional swagger spec document.
 10 | 
 11 | ## Motivation
 12 | 
 13 | To implement the HTTP REST API, the developer currently needs to maintain a separate swagger spec on top of the implementation code. The spec is completely separate from the code, which makes it difficult to update and maintain; manually updating it is intricate and error-prone. This proposal generates the swagger spec from the code. The swagger schema can reuse the HOCON schema, which also makes it convenient to validate incoming requests automatically against the schema.
 14 | 
 15 | ## Design
 16 | 
 17 | ### Basic Structure
 18 | 
 19 | We define a basic structure.
 20 | 
 21 | ```erlang
 22 | paths() -> ["/user/:user_id"].
 23 | 
 24 | schema("/user/:user_id") ->
 25 |    #{operationId => user}.
 26 | ```
 27 | 
 28 | It defines all the paths of this spec and specifies a unique `operationId` for each path. This value is used to name the corresponding function in the code.
 29 | 
 30 | ### Operations
 31 | 
 32 | For each path, we define operations (HTTP methods) that can be used to access that path. A single path can support multiple operations, for example, `GET /users` to get a list of users and `POST /users` to add a new user. We define a unique operation as a combination of a path and an HTTP method. Minimal example of an operation:
 33 | 
 34 | ```erlang
 35 | schema("/user/:user_id/:fingerprint") ->
 36 |    #{
 37 |      operationId => user,
 38 |      get => #{responses =>
 39 |                #{200 =>
 40 |                   hoconsc:mk(hoconsc:ref(?MODULE, "user"),
 41 |                     #{description => <<"return the user's own information">>})}}
 42 |     }.
 43 | ```
 44 | 
 45 | Operations also support some optional elements for documentation purposes: `summary, description, tags`.
 46 | 
 47 | ### Query String in Paths
 48 | 
 49 | Query string parameters are defined with `in => query`:
 50 | 
 51 | ```erlang
 52 | #{put =>
 53 |   #{
 54 |    parameters => [
 55 |     {user_id, hoconsc:mk(string(), #{in => path, description => <<"The client ID of your Emqx app">>, example => <<"an long client id">>})},                
 56 |     {per_page, mk(range(1, 50), #{required => true, in => query, example => "10"})},
 57 |     {is_admin, mk(boolean(), #{required => true, in => query, example => "true"})},  
 58 |     {oneof_test_in_query, mk(hoconsc:union([string(), integer()]), #{in => query, example => "a_good_oneof_in_query"})}
 59 |    ]}}
 60 | ```
 61 | 
 62 | This generates the following swagger JSON:
 63 | 
 64 | ```json
 65 | parameters: [
 66 |    {example: "a_good_oneof_in_query",
 67 |     in: "query",
 68 |     name: "oneof_test_in_query",
 69 |     schema: {
 70 |       oneOf: [
 71 |         {example: 100,type: "integer"},
 72 |         {example: "string example",type: "string"}]}},
 73 |    {example: "true",in: "query",name: "is_admin",required: true,
 74 |     schema: {example: true,type: "boolean"}},
 75 |    {example: "10", in: "query", name: "per_page", required: true,
 76 |    schema: {example: 1, maximum: 50, minimum: 1, type: "integer"}},   
 77 |    {description: "The client ID of your Emqx app", example: "an long client id", in: "path", name: "client_id",
 78 |     required: true,
 79 |     schema: {example: "string example",type: "string"}}].
 80 | ```
 81 | 
 82 | ### Request Body
 83 | 
 84 | Request bodies are typically used with “create” and “update” operations (POST, PUT, PATCH). For example, when creating a resource using POST or PUT, the request body usually contains the representation of the resource to be created. OpenAPI 3.0 provides the `requestBody` keyword to describe request bodies.
 85 | 
 86 | ```erlang
 87 | #{requestBody =>
 88 |   #{
 89 |     client_secret => mk(string(), #{description => <<"The OAuth app client secret for which to create the token.">>, maxLength => 40}),
 90 |     scopes => mk(hoconsc:array(string()), #{<<"description">> => "A list of scopes that this authorization is in.", example => ["public_repo", "user"], nullable => true}),
 91 |     test => mk(hoconsc:enum([test, good]), #{<<"description">> => "good", example => test}),
 92 |     note => mk(string(), #{description => <<"A note to remind you what the OAuth token is for.">>, example => <<"Update all gems">>}),
 93 |     note_url => mk(string(), #{description => <<"A URL to remind you what app the OAuth token is for.">>}),
 94 |     page => mk(range(1, 100), #{description => <<"Page Description.">>}),
 95 |     ip => mk(emqx_schema:ip_port(), #{description => <<"ip:port">>, example => "127.0.0.1:8081"}),
 96 |     oneof_test => mk(hoconsc:union([range(1, 100), infinity, hoconsc:ref(?MODULE, client_id)]), #{description => "oneof description", example => "1"})
 97 |    }}.
 98 | ```
 99 | 
100 | The request body is a JSON object. If there is too much nesting, we can use `hoconsc:ref/2` to make the code a little clearer. For example:
101 | 
102 | `hoconsc:ref(?MODULE, client_id)` will call `?MODULE:fields(client_id)` to get specific schema.
103 | 
104 | ### Responses
105 | 
106 | ```erlang
107 | #{responses => 
108 |   #{
109 |    200 => mk(hoconsc:ref(?MODULE, "authorization"), #{description => <<"if returning an existing token">>}),
110 |    422 => mk(hoconsc:ref(?MODULE, "validation_failed"), #{}),
111 |    400 => mk(hoconsc:array(hoconsc:ref(?MODULE, "authorization")), #{}),
112 |    401 => #{
113 |             total_count => mk(integer(), #{required => true}),
114 |             artifacts => mk(hoconsc:array(hoconsc:ref(?MODULE, "authorization")), #{})
115 |            },
116 |    203 => maps:from_list(emqx_schema:fields("authorization"))
117 |   }}.
118 | ```
119 | 
120 | This generates the following swagger.json:
121 | 
122 | ```json
123 | responses: {
124 |   200: {
125 |     description: "if returning an existing token",
126 |     content: {application/json: {schema: {$ref: "#/components/schemas/emqx_swagger_api.authorization"}}}},
127 |   203: {description: "",content: {application/json: {schema: {properties: {cache:   
128 |                 {$ref:"#/components/schemas/emqx_schema.cache"},
129 |         deny_action: {
130 |             default: "ignore",
131 |             enum: ["ignore","disconnect"],
132 |             type: "string"},
133 |         no_match: {default: "allow",enum: ["allow","deny"],type: "string"}},
134 |         type: "object"}}}},
135 |    400: {
136 |      description: "",
137 |      content: {
138 |      application/json: {
139 |      schema: {items: {$ref: "#/components/schemas/emqx_swagger_api.authorization"},
140 |      type: "array"}}}},
141 |    401: {description: "",content: {application/json: {schema: {required: ["total_count"],
142 |          properties: {
143 |          artifacts: {items: {$ref: "#/components/schemas/emqx_swagger_api.authorization"},type: "array"},
144 |          total_count: {example: 100,type: "integer"}},
145 |          type: "object"}}}},  
146 |    422: {description: "",content: 
147 |          {application/json: {schema: {$ref: "#/components/schemas/emqx_swagger_api.validation_failed"}}}}}
148 | }.
149 | ```
150 | 
151 | Only the JSON format is needed for now.
152 | 
153 | 1. The developer writes the specific implementation code. Because the input has already been checked against the spec defined in the code above, the implementation can use the parameters directly without further verification.
154 | 
155 |    ```erlang
156 |    user(put, #{body := Params}) ->
157 |        #{<<"ip">> := {IP, Port}} = Params, %% {IP, Port} has already been converted by the schema.
158 |         ....
159 |        {200, #{...response json..}}.
160 |    
161 |    ```
162 | 
163 | We no longer need to update swagger.json manually. 
164 | 
165 | The response schema is not checked; it is only used to generate swagger.json.
166 | 
167 | ## Configuration Changes
168 | 
169 | This section should list all the changes to the configuration files (if any).
170 | 
171 | ## Backwards Compatibility
172 | 
173 | This section should show how to make the feature backwards compatible.
174 | If it can not be compatible with the previous emqx versions, explain how do you
175 | propose to deal with the incompatibilities.
176 | 
177 | ## Document Changes
178 | 
179 | If there is any document change, give a brief description of it here.
180 | 
181 | ## Testing Suggestions
182 | 
183 | The final implementation must include unit test or common test code. If some
184 | more tests such as integration test or benchmarking test that need to be done
185 | manually, list them here.
186 | 
187 | ## Declined Alternatives
188 | 
189 | Here goes which alternatives were discussed but considered worse than the current.
190 | It's to help people understand how we reached the current state and also to
191 | prevent going through the discussion again when an old alternative is brought
192 | up again in the future.
193 | 
194 | 


--------------------------------------------------------------------------------
/implemented/0014-rolling-cluster-upgrade.md:
--------------------------------------------------------------------------------
  1 | # Rolling cluster upgrade
  2 | 
  3 | ## Change log
  4 | 
  5 | * 2021-09-21: @k32 Initial draft
  6 | * 2021-09-23: @k32 Applied remarks
  7 | * 2021-09-28: @k32 Applied remarks
  8 | 
  9 | ## Abstract
 10 | 
 11 | Currently, the EMQ X upgrade procedure has two modes:
 12 | 
 13 | 1. Patch releases.
 14 |    Upgrading between the patch versions is done by patching the live nodes using Erlang hot code patching feature.
 15 |    The patch is loaded to the existing running nodes.
 16 |    This is a low-maintenance upgrade path, however severely limited in the scope of the changes.
 17 | 
 18 | 1. Minor and major releases require taking down the entire cluster and redirecting the MQTT traffic to a different cluster running the new version of the software.
 19 |    This approach introduces a lot of operational overhead.
 20 | 
 21 | EMQ X should support rolling upgrade of the cluster when nodes (or pods) are taken down and replaced one by one for minor version upgrades.
 22 | Patch version upgrades will remain the same.
 23 | 
 24 | This EIP introduces the guidelines for writing the backward- and forward-compatible code.
 25 | Also it documents the necessary changes to the existing code base.
 26 | 
 27 | ## Motivation
 28 | 
 29 | In order to make upgrading the cluster smoother, EMQ X should support rolling cluster upgrades and green-blue deployments.
 30 | It should be possible to upgrade the cluster without taking it down entirely, by making sure that different minor versions of the code can communicate in both directions.
 31 | 
 32 | ## Design
 33 | 
 34 | Patch version upgrades will not be changed.
 35 | They will follow the existing instruction: https://docs.emqx.io/en/broker/v4.3/advanced/relup.html
 36 | 
 37 | Live upgrade paths will be limited to `major.minor.* -> major.minor+1.*` or `major.minor.* -> major.minor.*` formulas.
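
The two allowed formulas can be expressed as a small predicate. The following is a minimal sketch, assuming versions are represented as `{Major, Minor, Patch}` tuples; the module name `upgrade_path_sketch` and the function `is_live_upgrade_allowed/2` are hypothetical names used for illustration only.

```erlang
-module(upgrade_path_sketch).
-export([is_live_upgrade_allowed/2]).

%% Versions are {Major, Minor, Patch} tuples (an assumption of this sketch).
-type version() :: {non_neg_integer(), non_neg_integer(), non_neg_integer()}.

-spec is_live_upgrade_allowed(version(), version()) -> boolean().
%% major.minor.* -> major.minor.*
is_live_upgrade_allowed({Major, Minor, _}, {Major, Minor, _}) ->
    true;
%% major.minor.* -> major.minor+1.*
is_live_upgrade_allowed({Major, Minor, _}, {Major, NewMinor, _}) ->
    NewMinor =:= Minor + 1;
%% Anything else (major version change, skipping a minor version, downgrade
%% of the minor version) is not a live upgrade path.
is_live_upgrade_allowed(_From, _To) ->
    false.
```

For example, under these assumptions `{4, 3, 5} -> {4, 4, 0}` is allowed, while `{4, 3, 5} -> {4, 5, 0}` and `{4, 3, 5} -> {5, 0, 0}` are not.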
 38 | 
 39 | A new concept of a "backplane protocol" is introduced.
 40 | 
 41 | ### Upgrade procedure
 42 | 
 43 | Cluster upgrade should be split in roughly three stages:
 44 | 
 45 | 1. Optional step: inject the forward compatibility support into the old release.
 46 |    This may be needed because we can't predict in advance what will be changed in the next release.
 47 |    The upgrade code should be able to inject pre-upgrade beams to the old release.
 48 |    This part of the upgrade procedure should be idempotent and reversible.
 49 |    It will work much like a patch version upgrade, using `appup` files in the same way as now.
 50 | 
 51 |    During this stage, all nodes in the cluster are running the same `major.minor` version of the code.
 52 | 
 53 | 1. Rolling upgrade of the cluster that involves taking nodes down and replacing them with the newer version.
 54 |    This part of the upgrade is reversible.
 55 | 
 56 |    During this stage, different nodes in the cluster are running different minor versions of the code.
 57 |    Appup files are not used, because the nodes are created from scratch.
 58 | 
 59 | 1. Once all the nodes are upgraded, a data migration can start.
 60 |    This part of the upgrade is not reversible, because the migration process is destructive.
 61 |    It is triggered by an explicit command from the operator, and runs asynchronously (TODO: describe the user interface for doing this).
 62 |    (TODO: come up with a testing strategy that verifies handling of the partially migrated data)
 63 | 
 64 | Deprecated API modules can be removed in the next release.
 65 | 
 66 | There are two major areas that need to be considered to support this kind of upgrade:
 67 | 
 68 | 1. RPC compatibility
 69 | 1. Mnesia schema backward-compatibility
 70 | 
 71 | ### RPC compatibility
 72 | 
 73 | In order to simplify the reasoning about the backplane API backward-compatibility, all the functions that may be called remotely should be identified and gathered in specialized modules.
 74 | These modules should be named `<application_name>_proto_v1` (where v1 is the protocol version).
 75 | These modules should be immutable, and ideally they should not contain any business logic: only a (gen_)rpc call or cast.
 76 | When a new version of the API module is created, the previous one should be kept.
 77 | Sending messages directly to remote processes is prohibited.
 78 | Instead, a helper function in the API module should be introduced that does a cast.
 79 | 
 80 | Example:
 81 | 
 82 | ```erlang
 83 | -module(foo_proto_v2).
 84 | 
 85 | -export([ foo/2
 86 |         , bar/2
 87 |         ]).
 88 | 
 89 | -spec foo(node(), atom()) -> ok.
 90 | foo(Node, Arg1) ->
 91 |     rpc:call(Node, foo_proto_impl, foo, [Arg1]).
 92 | 
 93 | -spec bar(node(), atom()) -> ok.
 94 | bar(Node, Arg1) ->
 95 |     ApiVersion = 2, %% Optional. In case the same implementation is reused in different API versions
 96 |     rpc:call(Node, foo_proto_impl, bar, [ApiVersion, Arg1]).
 97 | ```
 98 | 
 99 | #### Protocol version negotiation
100 | 
101 | The `bproto` application should provide an API function that returns the latest version of the protocol supported by the remote node.
102 | This function should be called before establishing any session.
103 | This information can be cached for the entire session, because the upgrade is done by taking down the entire node, so the protocol version cannot suddenly change.
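
A minimal sketch of how a version could be agreed on, assuming each node can report the list of protocol versions it supports. The module and function names (`bproto_negotiation_sketch`, `negotiate/2`) are hypothetical and not part of the proposal; the result is what a caller would cache for the session.

```erlang
-module(bproto_negotiation_sketch).
-export([negotiate/2]).

%% Hypothetical helper: pick the highest protocol version supported by
%% both the local node and the remote node.
-spec negotiate([pos_integer()], [pos_integer()]) ->
          {ok, pos_integer()} | {error, no_common_version}.
negotiate(LocalVersions, RemoteVersions) ->
    Common = [V || V <- LocalVersions, lists:member(V, RemoteVersions)],
    case Common of
        [] -> {error, no_common_version};
        _ -> {ok, lists:max(Common)}
    end.
```

The caller would then dispatch to the matching `<application_name>_proto_vN` module for the rest of the session.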
104 | 
105 | #### Static checks
106 | 
107 | The following xref checks should be written:
108 | 
109 | 1. Only the functions in `.*_proto_v[0-9]+` modules are allowed to call the `rpc:call` and `gen_rpc:call` functions
110 | 1. Each API function does an RPC to an existing function
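
The checks themselves are meant to be written with xref. The sketch below only illustrates the module naming rule behind the first check; `bproto_name_check_sketch` and `is_proto_module/1` are hypothetical names.

```erlang
-module(bproto_name_check_sketch).
-export([is_proto_module/1]).

%% Returns true when the module name follows the
%% <application_name>_proto_v<N> convention.
-spec is_proto_module(module()) -> boolean().
is_proto_module(Module) when is_atom(Module) ->
    Name = atom_to_list(Module),
    match =:= re:run(Name, "^.+_proto_v[0-9]+$", [{capture, none}]).
```

A real check would combine this predicate with the list of callers of `rpc:call` and `gen_rpc:call` reported by xref.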
111 | 
112 | ### Design of the helper bproto application
113 | 
114 | We propose to create a new helper application that helps to maintain the API backward-compatibility.
115 | This application will contain the code that performs static checks of the API modules.
116 | Also it will contain the runtime for version negotiation.
117 | 
118 | A helper mnesia table tracking the upgrade state and release version for each node can be used to perform the checks before proceeding to the next stage of the upgrade.
119 | 
120 | ### Mnesia schema compatibility
121 | 
122 | Fields of the tables can't be removed (until the last stage of the upgrade?).
123 | Non-trivial changes to the schema should be performed in stages:
124 | 
125 | 1. The new version of the code should be able to work with the old schema.
126 | 1. Once all the nodes in the cluster are updated, start an async process of migrating the data to the new table.
127 | 1. Once the data has been migrated and checks pass, the old table can be removed.
128 | 1. In the next release the code supporting the old schema can be dropped.
129 | 
130 | #### Table migration
131 | 
132 | The `bproto` application takes care of the static checks.
133 | Writing a migration in a way that avoids whole-table locks is a complicated process, and should be done on a case-by-case basis.
134 | The migration process could use mnesia transactions to traverse the tables entry-by-entry and move the records to the new table, deleting the old record.
135 | The read code should read both tables.
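
The entry-by-entry approach could look roughly like the sketch below. It is an illustration under assumptions, not the proposed implementation: the module name, the `Convert` callback, and the choice of one transaction per key are all hypothetical, and both tables are assumed to use their table name as the record name.

```erlang
-module(table_migration_sketch).
-export([migrate/3, read/3]).

%% Move records from OldTab to NewTab one key at a time.
%% Each key is handled in its own short transaction, so only that key is
%% locked rather than the whole table. Convert/1 is a caller-supplied
%% (hypothetical) function that turns an old record into a new one.
-spec migrate(atom(), atom(), fun((tuple()) -> tuple())) -> ok.
migrate(OldTab, NewTab, Convert) ->
    case mnesia:dirty_first(OldTab) of
        '$end_of_table' ->
            ok;
        Key ->
            {atomic, ok} = mnesia:transaction(fun() ->
                Old = mnesia:read(OldTab, Key, write),
                lists:foreach(
                    fun(Rec) -> mnesia:write(NewTab, Convert(Rec), write) end,
                    Old),
                mnesia:delete(OldTab, Key, write)
            end),
            migrate(OldTab, NewTab, Convert)
    end.

%% Read path used while the migration is running: look in the new table
%% first, then fall back to the old one. Both reads run in one
%% transaction so that they see a consistent view of the two tables.
-spec read(atom(), atom(), term()) -> [tuple()].
read(OldTab, NewTab, Key) ->
    {atomic, Recs} = mnesia:transaction(fun() ->
        case mnesia:read(NewTab, Key) of
            [] -> mnesia:read(OldTab, Key);
            Found -> Found
        end
    end),
    Recs.
```

Records read from the old table are returned in the old format, so the caller still has to handle both shapes until the migration completes.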
136 | 
137 | ## Configuration Changes
138 | 
139 | n/a
140 | 
141 | ## Backwards Compatibility
142 | 
143 | This change is backward-compatible.
144 | 
145 | ## Document Changes
146 | 
147 | Document cluster upgrade procedure.
148 | 
149 | ## Testing Suggestions
150 | 
151 | 1. Change the CI, so some or all cluster test suites run on a cluster consisting of nodes running two different versions of EMQ X.
152 | 
153 | ## Declined Alternatives
154 | 
155 | ### Offline cluster upgrade via backup transformation
156 | 
157 | Reason to reject: changed business requirements.
158 | 
159 | ### Versioning of the individual API functions
160 | 
161 | Annotations (or edoc tags) can be used to specify the API functions:
162 | 
163 | ```erlang
164 | -module(foo_api).
165 | 
166 | -introduced_in({foo/3, {5,0,0}}).
167 | foo(A, B, C) ->
168 |     rpc:call(foo_api_impl, foo, [A, B, C]).
169 | 
170 | -introduced_in({bar/3, {5,0,0}}).
171 | -deprecated_in({bar/3, {5,1,0}}).
172 | bar(A, B, C) ->
173 |     rpc:call(foo_api_impl, bar, [A, B, C]).
174 | ```
175 | 
176 | ```erlang
177 | -module(foo_api_impl).
178 | 
179 | foo(A, B, C) ->
180 |     ...
181 | 
182 | bar(A, B, C) ->
183 |     ...
184 | ```
185 | 
186 | Compatibility of the typespecs of the API modules:
187 | - Function domain is not reduced
188 | - Function co-domain is not extended
189 | 
190 | The last check uses custom code, that iterates through function arguments and return types and calls `erl_types:t_is_subtype/2` function for the corresponding types in the new and the old minor version of the application (determined by the `app.src` file).
191 | 
192 | Reasons to reject: it does not address bi-directional calls.
193 | Mitigation of this problem is too complicated.
194 | 
195 | ### Having bpapi application in a separate repo
196 | 
197 | EMQ X dependencies perform RPCs too.
198 | The idea was to make them compatible using the same principles.
199 | 
200 | Reason to reject: reduce scope of the feature.
201 | 


--------------------------------------------------------------------------------
/implemented/0015-unified-authentication.md:
--------------------------------------------------------------------------------
  1 | # Unified Authentication in EMQ X 5.0
  2 | 
  3 | ## Change log
  4 | 
  5 | * 2021-05-17: @zhouzb Initial draft
  6 | * 2021-10-04: @zmstone Sync the doc from internal updated 0012 from doc
  7 | * 2022-08-10: @savonarola Update and move to implemented
  8 | 
  9 | ## Abstract
 10 | 
 11 | This proposal introduces a new design for EMQ X 5.0 authentication,
 12 | which aims to provide: for EMQ X users, a better user experience with more configurable interfaces,
 13 | and for EMQ X developers, a better development framework without repeating themselves.
 14 | 
 15 | ## Motivation
 16 | 
 17 | EMQ X authentication is implemented by the hook-callback for hook-point `client.authenticate`.
 18 | Up to v4.3, EMQ X supported 8 different authentication plugins, namely:
 19 | 
 20 | ```
 21 | emqx_auth_http
 22 | emqx_auth_jwt
 23 | emqx_auth_ldap
 24 | emqx_auth_mnesia
 25 | emqx_auth_mongo
 26 | emqx_auth_mysql
 27 | emqx_auth_pgsql
 28 | emqx_auth_redis
 29 | ```
 30 | 
 31 | Some of the pain points in the old implementation:
 32 | 
 33 | 1. The authentication plugins are implemented more or less the same way,
 34 |    and work more or less the same way too. There is a lack of abstraction for the common parts,
 35 |    causing developers to repeat themselves when adding new features or fixing issues.
 36 | 1. There is no convenient reconfiguration interface: the only way to configure a plugin
 37 |    is to update the config file and reload (stop, start) the plugin.
 38 | 1. If more than one auth plugin is enabled, there is no deterministic order in
 39 |    which the different backends are checked.
 40 | 1. Enabled authentication plugins are collectively considered one global instance;
 41 |    there is no granularity for scoped control levels, e.g. per-zone or per-listener.
 42 | 
 43 | To address these pain points in 5.0, we propose the enhancements below.
 44 | 
 45 | ## Design
 46 | 
 47 | ### One app for all
 48 | 
 49 | One `emqx_authn` app to unify the management of all different backends (except for ldap being postponed for now).
 50 | 
 51 | ### The same hook-point
 52 | 
 53 | In this design, there is no intention to change how EMQ X hooks work.
 54 | The new app `emqx_authn` will continue to make use of the `client.authenticate` hook-point,
 55 | but will dispatch auth requests to the underlying backends inside one single hook call.
 56 | 
 57 | ### Composable authn "chain"
 58 | 
 59 | We should allow users to compose (configure) a "chain" of backends with a defined order in which the
 60 | checks are performed one after another. Each check against the backend in the chain may yield 3 different
 61 | results for one-request authentication:
 62 | 
 63 | - `ignore` indicates that no auth information was found, so validation of the client
 64 |    should continue against the rest of the backends in the chain.
 65 | - `{ok, Info}` indicates that the login is accepted, which terminates the auth calls here.
 66 |    `Info` may contain additional information, such as whether the user is a super-user.
 67 | - `{error, Reason}` indicates that the client's login should be denied.
 68 | 
 69 | NOTE: for temporary errors, such as a database connection issue, the error is logged,
 70 |       and the auth result is `ignore` so that evaluation moves on to the next backend in the chain.
 71 | 
 72 | NOTE: if there is no `ok` (accepted) result after a full chain exhaustion, the login is rejected.
 73 | 
 74 | NOTE: an empty chain allows anonymous access.
 75 | 
 76 | For enhanced authentication, such as `scram`, there can be messages after the first request,
 77 | hence the backend may return `{continue, Data}`,
 78 | where `Data` is to be kept by the connection process as handling context for the following messages.
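
The chain semantics described above can be sketched as follows. This is an illustration only: the module name, the representation of a backend as a fun, the `not_authorized` reason, and the empty `Info` map returned for anonymous access are assumptions, not the `emqx_authn` implementation.

```erlang
-module(authn_chain_sketch).
-export([authenticate/2]).

-type result() :: ignore | {ok, map()} | {continue, term()} | {error, term()}.
%% A backend is modelled as a fun taking the client credential (assumption).
-type authenticator() :: fun((map()) -> result()).

-spec authenticate(map(), [authenticator()]) ->
          {ok, map()} | {continue, term()} | {error, term()}.
authenticate(_Credential, []) ->
    %% An empty chain allows anonymous access.
    {ok, #{}};
authenticate(Credential, Chain) ->
    run(Credential, Chain).

run(_Credential, []) ->
    %% Chain exhausted without an accepted result: reject the login.
    {error, not_authorized};
run(Credential, [Authenticator | Rest]) ->
    try Authenticator(Credential) of
        ignore -> run(Credential, Rest);
        {ok, _Info} = Accepted -> Accepted;
        {continue, _Data} = Continue -> Continue;
        {error, _Reason} = Denied -> Denied
    catch
        _Class:Reason ->
            %% Temporary failure, e.g. a database connection issue:
            %% log it and treat the result as `ignore`.
            logger:warning("authenticator failed: ~p", [Reason]),
            run(Credential, Rest)
    end.
```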
 79 | 
 80 | ### Fine-grained configuration levels
 81 | 
 82 | By default, an EMQ X user can configure one global chain which applies to all MQTT listeners.
 83 | We should, however, also allow a per-listener configuration to override the global chain.
 84 | Together with firewall rules, this will allow users to have different auth solutions for
 85 | MQTT services facing different groups of clients coming from their designated networks.
 86 | 
 87 | ### Reconfigurable on the fly
 88 | 
 89 | The changes in the auth chain or the backends should be applied on-the-fly
 90 | i.e. without having to restart the `emqx_authn` application.
 91 | 
 92 | 
 93 | ## Configuration
 94 | 
 95 | - Example config for built_in_database (mnesia) username/password based global auth
 96 | 
 97 | ```
 98 | authentication {
 99 |   backend: 'built_in_database',
100 |   mechanism: "password_based",
101 |   ...
102 |   user_id_type: clientid
103 | }
104 | ```
105 | 
106 | - Example 'chain' config
107 | 
108 | ```
109 | authentication = [
110 |   {
111 |     backend: 'built_in_database',
112 |     mechanism: "password_based",
113 |     ...
114 |     user_id_type: clientid
115 |   },
116 |   {
117 |     algorithm = "hmac-based"
118 |     mechanism = "jwt"
119 |     secret = "emqxsecret"
120 |     "secret_base64_encoded" = false
121 |     use_jwks = false
122 |     verify_claims {}
123 |   }
124 | ]
125 | ```
126 | 
127 | - Example config for built_in_database (mnesia) username/password based per-listener auth
128 | 
129 | ```
130 | listener.tcp.default {
131 |   ...
132 |   authentication: {
133 |     backend: "built_in_database",
134 |     type: "password_based",
135 |     user_id_type: username
136 |   }
137 |   ...
138 | }
139 | ```
140 | 
141 | - Example config for built_in_database (mnesia) username/password based per-gateway/per-listener auth
142 | 
143 | ```
144 | gateway.stomp {
145 |   ...
146 |   # Specific global authenticator for all STOMP listeners
147 |   authentication = {
148 |     backend: "built_in_database",
149 |     type: "password_based",
150 |     user_id_type: username
151 |   }
152 | 
153 |   listeners.tcp.default {
154 |     ...
155 |     # Specific authenticator for the specified STOMP listener
156 |     authentication = {
157 |       backend: "built_in_database",
158 |       type: "password_based",
159 |       user_id_type: username
160 |     }
161 |   }
162 | }
163 | ```
164 | 
165 | Gateways allow only a single authenticator in the chain.
166 | 
167 | - Disable authentication for a specific listener
168 | 
169 | ```
170 | listener.tcp.default {
171 |   ...
172 |   enable_authn = false
173 |   ...
174 | }
175 | ```
176 | 
177 | ## APIs
178 | 
179 | 
180 | ### Global auth chain APIs
181 | 
182 | - Get global auth chain
183 | 
184 | ```
185 | GET /authentication
186 | ```
187 | 
188 | - Add authenticator to the global chain
189 | 
190 | ```
191 | POST /authentication
192 | {
193 |     "backend": "built_in_database",
194 |     "type": "password_based",
195 |     ...
196 | }
197 | ```
198 | 
199 | - Manage individual authenticators in the global chain
200 | 
201 | ```
202 | GET /authentication/:id
203 | 
204 | DELETE /authentication/:id
205 | 
206 | PUT /authentication/password_based:built_in_database
207 | {
208 |       ...
209 | }
210 | ```
211 | 
212 | Where `id` is of format `<Mechanism>:<Backend>`. e.g. `password_based:built_in_database`.
213 | 
214 | The `PUT` body should be constructed according to the config schema.
215 | 
216 | ### Per-listener auth chain APIs
217 | 
218 | For per-listener authentication chains, the APIs are mostly the same
219 | as the ones for the global chain, only the path is prefixed with `/listeners/:listener_id`.
220 | 
221 | ```
222 | POST /listeners/:listener_id/authentication
223 | GET /listeners/:listener_id/authentication
224 | 
225 | GET /listeners/:listener_id/authentication/:id
226 | DELETE /listeners/:listener_id/authentication/:id
227 | PUT /listeners/:listener_id/authentication/:id
228 | ```
229 | 
230 | A listener name is of format `protocol:id`, which is assigned in the config file, e.g.
231 | 
232 | ```
233 | listeners.tcp.default {
234 |     bind = ...
235 | }
236 | ```
237 | 
238 | The name of this listener is `tcp:default`.
239 | 
240 | Gateway endpoints are:
241 | 
242 | ```
243 | /gateway/:protocol/authentication
244 | /gateway/:protocol/listeners/:listener_id/authentication
245 | ```
246 | 
247 | ### Re-positioning APIs
248 | 
249 | ```
250 | POST /authentication/:id/move
251 | POST /listeners/:listener_id/authentication/:id/move
252 | ```
253 | 
254 | With a JSON body to indicate where the authenticator is to be re-positioned.
255 | The positions can be `top` (front of the list), `bottom` (the rear of the list),
256 | or `before` / `after` another ID.
257 | 
258 | For example:
259 | ```
260 | curl -X 'POST' \
261 |   'http://localhost:18083/api/v5/authentication/jwt/move' \
262 |   -H 'accept: */*' \
263 |   -H 'Content-Type: application/json' \
264 |   -d '{
265 |   "position": "before:password_based:built_in_database"
266 | }'
267 | ```
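
A sketch of how such a position could be parsed and applied to an ordered list of authenticator IDs. The module and function names are hypothetical, and the error terms are assumptions; only the `top`, `bottom`, `before:<ID>` and `after:<ID>` positions come from the text above.

```erlang
-module(authn_move_sketch).
-export([parse_position/1, move/3]).

-type position() :: top | bottom | {before, binary()} | {'after', binary()}.

%% Parse the "position" field of the JSON body. Only the first ":" is a
%% separator, because authenticator IDs themselves contain ":".
-spec parse_position(binary()) -> position() | {error, bad_position}.
parse_position(<<"top">>) -> top;
parse_position(<<"bottom">>) -> bottom;
parse_position(Bin) ->
    case binary:split(Bin, <<":">>) of
        [<<"before">>, Id] -> {before, Id};
        [<<"after">>, Id] -> {'after', Id};
        _ -> {error, bad_position}
    end.

%% Re-position Id inside Chain, keeping the order of all other entries.
-spec move(binary(), position(), [binary()]) -> {ok, [binary()]} | {error, term()}.
move(Id, Position, Chain) ->
    case lists:member(Id, Chain) of
        false -> {error, {not_found, Id}};
        true -> insert(Id, Position, lists:delete(Id, Chain))
    end.

insert(Id, top, Rest) ->
    {ok, [Id | Rest]};
insert(Id, bottom, Rest) ->
    {ok, Rest ++ [Id]};
insert(Id, {Where, Anchor}, Rest) when Where =:= before; Where =:= 'after' ->
    case lists:splitwith(fun(X) -> X =/= Anchor end, Rest) of
        {_, []} ->
            {error, {not_found, Anchor}};
        {Front, [Anchor | Back]} when Where =:= before ->
            {ok, Front ++ [Id, Anchor | Back]};
        {Front, [Anchor | Back]} ->
            {ok, Front ++ [Anchor, Id | Back]}
    end.
```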
268 | 
269 | ### User management APIs
270 | 
271 | We should also support CRUD APIs for user management, with the endpoints below.
272 | 
273 | - Manage users
274 | 
275 | ```
276 | POST /authentication/password_based:built_in_database/users
277 | {
278 |       ...
279 | }
280 | GET /authentication/password_based:built_in_database/users
281 | ```
282 | 
283 | - Manage individual users
284 | ```
285 | GET /authentication/password_based:built_in_database/users/:user_id
286 | 
287 | PUT /authentication/password_based:built_in_database/users/:user_id
288 | {
289 | 
290 | }
291 | 
292 | DELETE /authentication/password_based:built_in_database/users/:user_id
293 | ```
294 | 
295 | The authenticator ID is kept generic, although in 5.0
296 | only the built-in database (Mnesia) is supported.
297 | That is, only `password_based:built_in_database` is valid for `:id` so far.
298 | 
299 | The corresponding per-listener endpoints are:
300 | 
301 | ```
302 | POST /listeners/:listener_id/authentication/:id/users
303 | GET /listeners/:listener_id/authentication/:id/users
304 | 
305 | GET /listeners/:listener_id/authentication/:id/users/:user_id
306 | PUT /listeners/:listener_id/authentication/:id/users/:user_id
307 | DELETE /listeners/:listener_id/authentication/:id/users/:user_id
308 | 
309 | POST /gateway/:name/authentication/users
310 | GET /gateway/:name/authentication/users
311 | 
312 | GET /gateway/:name/authentication/users/:user_id
313 | PUT /gateway/:name/authentication/users/:user_id
314 | DELETE /gateway/:name/authentication/users/:user_id
315 | 
316 | POST /gateway/:name/listeners/:id/authentication/users
317 | GET /gateway/:name/listeners/:id/authentication/users
318 | 
319 | GET /gateway/:name/listeners/:id/authentication/users/:user_id
320 | PUT /gateway/:name/listeners/:id/authentication/users/:user_id
321 | DELETE /gateway/:name/listeners/:id/authentication/users/:user_id
322 | ```
323 | 
324 | ## Testing suggestions
325 | 
326 | There should be three levels of tests.
327 | 
328 | * Unit tests for module level tests
329 | * Regular common tests (maybe with mocks if necessary) to test full flows
330 | * Integration common tests that verify the code against external auth providers running in Docker containers
331 | 


--------------------------------------------------------------------------------
/implemented/0016-emqx-conf.md:
--------------------------------------------------------------------------------
  1 | # EMQX Configuration Manager
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2021-10-12: @zhongwencool init draft
  6 | 
  7 | ## Abstract
  8 | 
  9 | This proposal introduces a new Erlang application to handle EMQ X's configuration management with a focus on config live-reloads, and cluster wide config change syncs.
 10 | 
 11 | ## Motivation
 12 | Prior to 5.0, EMQ X's configuration management was quite static.
 13 | * The user interfaces for config changes are environment variables or a text editor for the config files.
 14 | * To load changed configs, it usually requires restarting an application, or reloading a plugin, or sometimes even restarting the node.
 15 | * When managing a cluster, one would have to update config files one node after another. 
 16 | * Mnesia was used to store some of the configs (such as rule-engine resources) in order to get them replicated, which made it less configurable because it was not possible to bootstrap such configs from a file which can be prepared before the node boots. Instead, one would have to wait for the node to boot, and then call HTTP API to make the changes.
 17 | 
 18 | In this proposal, we try to address the pain-points by
 19 | - Supporting HTTP APIs to perform live config changes and reload.
 20 | - Persisting changes made from HTTP APIs on disk in HOCON format. 
 21 | - Maintaining consistency across the nodes in the cluster. For example authentication & authorisation (ACL) configs, and rule-engine rules.
 22 | 
 23 | In 5.0, no config is stored in Mnesia; however, such changes are not in the scope of this EIP.
 24 | Some configs may not make sense to be the same for all nodes, so we should also allow local node overrides, such as `rate_limit` settings for nodes according to their hardware capacity.
 25 | 
 26 | ## Design
 27 | 
 28 | ### Configuration files
 29 | 
 30 | #### emqx.conf
 31 | 
 32 | At startup, EMQ X reads `emqx.conf` and converts this HOCON file into Erlang format. `emqx.conf` explicitly includes the override files at the end of the file:
 33 | 
 34 | ```
 35 | include "data/configs/cluster-override.conf"
 36 | include "data/configs/local-override.conf"
 37 | ```
 38 | 
 39 | - If the user wants to manually modify a node's configuration item before startup, it can be appended to the end of `emqx.conf`, or placed in a file included with `include "data/configs/user_default.conf"`. For the same configuration item, the later value overwrites the earlier one.
 40 | - If the user specifies that a configuration item is read from an environment variable, this value is read-only at runtime and will not be modified. In other words, environment variables are always applied after the end of the `emqx.conf` file and have the highest priority.
 41 | - For now, we need to integrate all the configurations into `emqx.conf`. But it is planned that, in the next phase (to be proposed in another EIP), `emqx.conf` will no longer contain configurations that use default values; the defaults will only be embedded in the code. Also, users can view all configuration items via the HTTP API (described later). This would have 2 benefits:
 42 |   - In subsequent version upgrades, adding/removing/updating configurations will not be overwritten by `emqx.conf`. 
 43 |   - This allows the user to focus only on the configuration that has been modified. 
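
The "later value overwrites the earlier one" rule can be illustrated with a deep merge over configuration layers. This is a sketch under assumptions: configs are modelled as nested Erlang maps, the module and function names are hypothetical, and the layer order (base `emqx.conf`, cluster override, local override, environment variables) is inferred from the include order and the priority rules described above.

```erlang
-module(conf_merge_sketch).
-export([merge_all/1, deep_merge/2]).

%% Merge Override into Base. Nested maps are merged recursively; for any
%% other value the one from Override wins.
-spec deep_merge(map(), map()) -> map().
deep_merge(Base, Override) when is_map(Base), is_map(Override) ->
    maps:fold(
        fun(Key, New, Acc) ->
            case Acc of
                #{Key := Old} when is_map(Old), is_map(New) ->
                    Acc#{Key := deep_merge(Old, New)};
                _ ->
                    Acc#{Key => New}
            end
        end,
        Base,
        Override).

%% Layers are listed from lowest to highest priority.
-spec merge_all([map()]) -> map().
merge_all(Layers) ->
    lists:foldl(fun(Layer, Acc) -> deep_merge(Acc, Layer) end, #{}, Layers).
```

For example, merging a base layer with a cluster override and then a local override yields a single map in which the local override's values take precedence.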
 44 | 
 45 | #### emqx_cluster_override.conf
 46 | 
 47 | - This file can only be modified via the API; editing the file manually is not supported.
 48 | 
 49 | - Configuration updates that require consistency across the cluster are persisted to this file.
 50 | - Before initializing its configuration, a node copies this file from the longest surviving core node and then loads it together with the rest of the configuration.
 51 | - When a node's configuration is updated, the configuration within the cluster is updated via cluster call and persisted to this file.
 52 | - This file must be kept the same on all nodes. We will add an extra process to check the content of the file periodically, and raise an alarm after 3 consecutive differences are found.
 53 | 
 54 | #### emqx_local_override.conf
 55 | 
 56 | - This file can only be modified via the API; editing the file manually is not supported.
 57 | 
 58 | - When the configuration of a specific node is updated via HTTP API, it will be persisted to this file.
 59 | 
 60 | #### emqx_conf application
 61 | Previously, the initialization of the configuration file and `cluster_call` were handled by `emqx_machine`. We will split this functionality out of `emqx_machine` into a new application, `emqx_conf`.
 62 | 
 63 | The role of this application is to
 64 | - Convert the configuration from HOCON format to Erlang sys.config format at initialization.
 65 | - Manage live-updates and deletions of the configurations, and replicate across the cluster.
 66 | 
 67 | Other apps that want to update the configuration must go through the `emqx_conf` API; they must not call the emqx API directly.
 68 | The specific flow is:
 69 | 
 70 | ```bash
 71 | Other Apps(eg: emqx_resource) => emqx_conf => emqx API.
 72 | ```
 73 | 
 74 | ### HTTP API design
 75 | 
 76 | #### Get the whole configurations.
 77 | 
 78 | ```erlang
 79 | #{
 80 |   get => #{
 81 |     description => <<"Get all the configurations of a given node, or all nodes in the cluster.">>,
 82 |     parameters => [
 83 |        {node, hoconsc:mk(typerefl:atom(),
 84 |           #{in => query, required => false, example => <<"emqx@127.0.0.1">>,
 85 |           desc => <<"Node name. When this parameter is not provided, configs for all nodes in the cluster are returned">>})},
 86 |        {debug, hoconsc:mk(typerefl:boolean(), #{in => query, required => false,
 87 |           desc => <<"Carries debug (metadata) information, such as file name and line number">>})}],
 88 |             responses => #{
 89 |                 200 => #{"$node" => configs_list()}
 90 |             }
 91 |         }
 92 |     };
 93 | ```
 94 | 
 95 | - Returns the current value and documentation of all configuration items, grouped by node name.
 96 | - With `debug=true`, all the metadata (such as line number, default value, and documentation) is returned, which makes it easier to locate problems.
 97 | 
 98 | #### Update specific configuration
 99 | 
100 | ```erlang
101 | schema("/configs/:rootname") ->
102 |     #{
103 |       get => #{
104 |             description => <<"Get the sub-configurations">>,
105 |             parameters => [
106 |               {node, hoconsc:mk(atom(), #{in => query, required => false})},
107 |               {debug, hoconsc:mk(typerefl:boolean(), #{in => query, required => false})}
108 |                           ],
109 |             responses => #{
110 |                 200 => #{<<"$node">> => config_list()},
111 |                 404 => emqx_dashboard_swagger:error_codes(['NOT_FOUND'], <<"config not found">>)
112 |             }
113 |         },
114 |         put => #{
115 |             description => <<"Update the sub-configurations">>,
116 |             parameters => [{node, hoconsc:mk(atom(), #{in => query, required => false})}],
117 |             requestBody => config_list(),
118 |             responses => #{
119 |                 200 => #{<<"$node">> => config_list()},
120 |                 400 => emqx_dashboard_swagger:error_codes(['UPDATE_FAILED'])
121 |             }
122 |         }
123 |     }.
124 | ```
125 | 
126 | - Get a specific configuration. There should be a `sensitive` flag in the schema for sensitive fields such as passwords, and their values should be reported back as `"******"` in the API.
127 |   For example, `/configs/emqx_dashboard` will return:
128 | 
129 |   ```erlang
130 |   #{'emqx@127.0.0.1' => 
131 |      #{default_password => "****",
132 |        default_username => "admin",
133 |        listeners =>
134 |          [#{backlog => 512,inet6 => false,ipv6_v6only => false,
135 |             max_connections => 512,num_acceptors => 4,port => 18083,
136 |             protocol => http,send_timeout => 5000}],
137 |        sample_interval => 10,token_expired_time => 3600000},
138 |     'emqx1@127.0.0.1' => ...
139 |   }   
140 |   ```
141 | 
142 | - Updating a specific configuration without the `node` query string modifies the configuration of all nodes in the cluster and persists it in `emqx_cluster_override.conf`.
143 | 
144 | - Updating a specific configuration with `node='xxx@xx.xx.xx'` in the query string modifies only the configuration of the specified node and persists it to `emqx_local_override.conf`.
145 | 
146 | - If we have already modified a configuration in `emqx_local_override.conf` successfully, trying to update this value in `emqx_cluster_override.conf` again will return a failure, and the user will be instructed to reset this configuration from local before the update can succeed. Otherwise, since the priority of `emqx_local` is higher than that of `emqx_cluster`, the changes made in `emqx_cluster` will not take effect.
147 | 
148 | - An update request returns the latest value. If the update fails, the response also states what the current value is.
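
The rule that a cluster-wide update is refused while a local override exists could be checked along these lines. The sketch assumes the local override file has been loaded into a nested map; the module name, the key-path representation, and the error term are hypothetical.

```erlang
-module(conf_override_check_sketch).
-export([check_cluster_update/2]).

%% KeyPath is a list of map keys, e.g. [emqx_dashboard, sample_interval].
%% Returns ok when the cluster-wide update may proceed, or an error that
%% tells the user which local value has to be reset first.
-spec check_cluster_update([term()], map()) ->
          ok | {error, {overridden_locally, [term()], term()}}.
check_cluster_update(KeyPath, LocalOverride) ->
    case find(KeyPath, LocalOverride) of
        {ok, Value} -> {error, {overridden_locally, KeyPath, Value}};
        error -> ok
    end.

%% Walk the nested map along KeyPath.
find([], Value) ->
    {ok, Value};
find([Key | Rest], Map) when is_map(Map) ->
    case maps:find(Key, Map) of
        {ok, Sub} -> find(Rest, Sub);
        error -> error
    end;
find(_KeyPath, _NotAMap) ->
    error.
```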
149 | 
150 | #### Reset specific configuration
151 | 
152 | ```erlang
153 | schema("/configs_reset/:rootname") ->
154 |   #{put => #{
155 |      description => <<"Reset the sub-configurations">>,
156 |      parameters => [{node, hoconsc:mk(atom(), #{in => query, required => false})}],
157 |      requestBody => config_list(),
158 |      responses => #{
159 |        200 => #{<<"$node">> => config_list()},
160 |        400 => emqx_dashboard_swagger:error_codes(['REST_FAILED'])
161 |        }
162 |    }}.
163 | ```
164 | 
165 | - A configuration item can't be deleted; it can only be reset.
166 | - Resetting a specific configuration without a query string deletes the configuration from `emqx_cluster_override.conf`.
167 | - Resetting a specific configuration with `node='xxx@xx.xx.xx'` in the query string deletes only the configuration of the specified node from `emqx_local_override.conf`.
168 | 
169 | ## Configuration Changes
170 | 
171 | This section should list all the changes to the configuration files (if any).
172 | 
173 | ## Backwards Compatibility
174 | 
175 | This section should show how the feature is kept backwards compatible.
176 | If it cannot be compatible with the previous emqx versions, explain how you
177 | propose to deal with the incompatibilities.
178 | 
179 | ## Document Changes
180 | 
181 | If there is any document change, give a brief description of it here.
182 | 
183 | ## Testing Suggestions
184 | 
185 | The final implementation must include unit test or common test code. If some
186 | more tests such as integration test or benchmarking test that need to be done
187 | manually, list them here.
188 | 
189 | ## Declined Alternatives
190 | 
191 | Since `emqx_cluster_override.conf` and `emqx_local_override.conf` can't be modified directly by hand, this information could also be stored in Mnesia. However, that would make it inconvenient for users to inspect.
192 | 
193 | 


--------------------------------------------------------------------------------
/implemented/0018-unified-authorization.md:
--------------------------------------------------------------------------------
  1 | # Unified Authorization in EMQ X 5.0
  2 | 
  3 | ## Change log
  4 | 
  5 | * 2021-05: @Rory-Z
  6 | * 2021-10-29: @zmstone Sync the doc from draft in internal wiki
  7 | * 2022-08-10: @savonarola Update and move to implemented
  8 | 
  9 | ## Abstract
 10 | 
 11 | 
 12 | In EMQ X 4.x, authorization (ACL) is provided by the `emqx_auth_xxx` plugins, each a small Erlang application.
 13 | While the flexibility was nice to have, the pain-points are:
 14 | 
 15 | - Scattered configuration management for EMQ X users
 16 | - Repeated (copy-paste) work for EMQ X developers
 17 | - Non-deterministic ordering of the hook callback registration leading to non-deterministic ordering of the ACL rules
 18 | 
 19 | This proposal introduces a new design for EMQ X 5.0 authorization (ACL), which aims to provide:
 20 | 
 21 | - For EMQ X users, a better user experience with more configurable interfaces
 22 | - For EMQ X developers, a better development framework without repeating themselves
 23 | 
 24 | ## Terms
 25 | 
 26 | * ACL: Access Control List which defines a set of 'rules' for message publish and topic subscribe requests from MQTT clients;
 27 | * Authorization: another word for ACL;
 28 | * Source: an ACL rule data provider, such as 'file', 'http', 'mongo' and 'mysql' etc.
 29 | * Chain: an ordered list of Sources
 30 | 
 31 | ## High level requirements
 32 | 
 33 | * Multiple sources for ACL rule persistence
 34 |   - File
 35 |   - MySQL
 36 |   - PostgreSQL
 37 |   - Redis
 38 |   - MongoDB
 39 |   - Mnesia (built-in-database)
 40 |   - WebServer (http)
 41 | 
 42 | * Fallback action if no rule matches a request (publish or subscribe)
 43 |   - deny
 44 |   - allow
 45 |   - disconnect
 46 | 
 47 | * Rule cache
 48 |   In the 4.x series, the rules are cached in the client's process dictionary.
 49 |   There is no intention to change this behaviour in 5.0.
 50 | 
 51 | * Allow more than one source to form the chain
 52 |   - The chain should have a determined order, unlike in 4.x, where the ACL check order depends on the plugin start/restart order
 53 |   - Only one instance is allowed for each type of source, e.g. one should not be allowed to configure more than one `file` type source or `http` type source
 54 |   - Provide APIs to adjust the ordering of the chained sources
 55 | 
 56 | * ACL for gateways, CoAP, MQTT-SN, exproto, Stomp (but not LwM2M)
 57 | 
 58 | * Management API to upload rules for `file` type ACL source
 59 | 
 60 | 
 61 | ## Design
 62 | 
 63 | Config proposal
 64 | 
 65 | ```
 66 | authorization {
 67 |   no_match = allow | deny
 68 |   deny_action = disconnect | ignore
 69 | 
 70 |   cache {
 71 |     enable = true
 72 |     max_size = 32
 73 |     ttl = 30m
 74 |   }
 75 |   sources: [
 76 |     {
 77 |       type = file
 78 |       enable = true
 79 |       path = "etc/example.conf"
 80 |     },
 81 |     {
 82 |       type = mysql
 83 |       enable = true
 84 |       database = mqtt
 85 |       username = root
 86 |       password = xxx
 87 |       pool_size = 8
 88 |       query = "select * from table1 where clientid = xxx"
 89 |     }
 90 |   ]
 91 | }
 92 | ```
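
How the ordered `sources` and the `no_match` fallback interact can be sketched as below. This is an illustration under assumptions: each source is modelled as a fun returning `allow`, `deny` or `nomatch`, the module name is hypothetical, and `deny_action` and the cache are left out.

```erlang
-module(authz_chain_sketch).
-export([authorize/4]).

-type verdict() :: allow | deny.
%% A source is modelled as a fun over (Client, Action, Topic) (assumption).
-type source() :: fun((map(), publish | subscribe, binary()) -> verdict() | nomatch).

%% Ask the sources in their configured order; the first one that has a
%% matching rule decides. If no source matches, fall back to `no_match`.
-spec authorize(map(), publish | subscribe, binary(),
                #{sources := [source()], no_match := verdict()}) -> verdict().
authorize(Client, Action, Topic, #{sources := Sources, no_match := NoMatch}) ->
    first_match(Client, Action, Topic, Sources, NoMatch).

first_match(_Client, _Action, _Topic, [], NoMatch) ->
    NoMatch;
first_match(Client, Action, Topic, [Source | Rest], NoMatch) ->
    case Source(Client, Action, Topic) of
        nomatch ->
            first_match(Client, Action, Topic, Rest, NoMatch);
        Verdict when Verdict =:= allow; Verdict =:= deny ->
            Verdict
    end.
```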
 93 | 
 94 | ### File
 95 | 
 96 | #### config
 97 | ```
 98 | {
 99 |   type = file
100 |   enable = true
101 |   path = "/path/to/example.conf"
102 | }
103 | ```
104 | 
105 | #### File content example (same as in 4.x)
106 | 
107 | ```
108 | {allow, {username, "^dashboard?"}, subscribe, ["$SYS/#"]}.
109 | {allow, {ipaddr, "127.0.0.1"}, pubsub, ["$SYS/#", "#"]}.
110 | ```
111 | 
112 | ### MySQL
113 | ```
114 | {
115 |   type = mysql
116 |   enable = true
117 |   server = "127.0.0.1:3306"
118 |   database = mqtt
119 |   pool_size = 1
120 |   username = root
121 |   password = public
122 |   auto_reconnect = true
123 |   ssl = {
124 |     enable = true
125 |     cacertfile = xxx.ca
126 |     certfile = xxx.cert
127 |     keyfile = xxx.key
128 |   }
129 |   query: "select ipaddress, username, clientid, action, permission, topic from mqtt_authz where ipaddr = '${peerhost}' or username = '${username}' or clientid = '${clientid}'"
130 | }
131 | ```
132 | 
133 | ### PostgreSQL
134 | ```
135 | {
136 |   type = postgresql
137 |   enable = true
138 |   server = "127.0.0.1:5432"
139 |   database = mqtt
140 |   pool_size = 1
141 |   username = root
142 |   password = public
143 |   auto_reconnect = true
144 |   ssl = {
145 |     enable = true
146 |     cacertfile = xxx.ca
147 |     certfile = xxx.cert
148 |     keyfile = xxx.key
149 |   }
150 |   query: "select ipaddress, username, clientid, action, permission, topic from mqtt_authz where ipaddr = '${peerhost}' or username = '${username}' or clientid = '${clientid}'"
151 | 
152 | }
153 | ```
154 | 
155 | ### Redis
156 | ```
157 | {
158 |   type = redis
159 |   enable = true
160 |   redis_type = single
161 |   server = "127.0.0.1:6379"
162 |   database = 0
163 |   pool_size = 1
164 |   password = public
165 |   auto_reconnect = true
166 |   ssl = {enable = false}
167 |   cmd = "HGETALL mqtt_authz:${username}"
168 | }
169 | ```
170 | 
171 | ### MongoDB
172 | ```
173 | {
174 |   type = mongodb
175 |   enable = true
176 |   mongo_type = single
177 |   server = "127.0.0.1:27017"
178 |   pool_size = 1
179 |   database = mqtt
180 |   ssl = {enable = false}
181 |   collection = mqtt_authz
182 |   selector: { "$or": [ { "username": "${username}" }, { "clientid": "${clientid}" } ] }
183 | }
184 | ```
185 | 
186 | ### Management APIs
187 | 
188 | #### Get root level settings
189 | 
190 | ```
191 | GET /authorization/settings
192 | RESP:
193 | {
194 |   "no_match": "allow" | "deny",
195 |   "deny_action": "disconnect" | "ignore",
196 |   "cache": {
197 |     "enable": true,
198 |     "max_size": 32,
199 |     "ttl": "30m"
200 |   }
201 | }
202 | ```
203 | 
204 | #### Update root level settings
205 | ```
206 | PUT /authorization/settings
207 | BODY:
208 | {
209 |   "no_match": "allow" | "deny",
210 |   "deny_action": "disconnect" | "ignore",
211 |   "cache": {
212 |     "enable": true,
213 |     "max_size": 32,
214 |     "ttl": "30m"
215 |   }
216 | }
217 | ```
218 | 
219 | #### Create ACL data sources
220 | ```
221 | POST /authorization/sources
222 | BODY:
223 | { "type": xxx, ... }
224 | ```
225 | 
226 | #### Get ACL data sources
227 | ```
228 | GET /authorization/sources
229 | RESP:
230 | [{ "type": xxx }, { "type": xxx }]
231 | ```
232 | 
233 | #### Get detailed source config per type
234 | 
235 | ```
236 | GET /authorization/sources/{type} # mysql,redis,mongodb,postgresql,http....
237 | RESP:
238 | {"type": "mysql", ...}
239 | ```
240 | 
241 | #### Update (reload) source config per type
242 | 
243 | When needed, the underlying resource, such as the MySQL connection pool, should be restarted when handling such update requests.
244 | 
245 | ```
246 | PUT /authorization/sources/{type} # mysql,redis,mongodb,postgresql,http....
247 | BODY:
248 | {"type": "mysql", ...}
249 | ```
250 | 
251 | #### Delete a source config per type
252 | 
253 | ```
254 | DELETE /authorization/sources/{type} # mysql,redis,mongodb,postgresql,http....
255 | ```
256 | 
257 | #### Adjust source's position in the chain
258 | 
259 | ```
260 | POST /authorization/sources/{type}/move # mysql,redis,mongodb,postgresql,http....
261 | { "position": "top" | "bottom" | "after:{type}" | "before:{type}" }
262 | ```
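
The `position` values accepted by the move API can be read as a plain list reordering. The following is a minimal Python sketch of that reading, not the actual EMQ X implementation; the function name `move_source` and the use of a list of type names to represent the chain are assumptions made for illustration.

```python
def move_source(sources, source_type, position):
    """Return a new chain with `source_type` moved as `position` requests.

    `sources` is a list of source type names, e.g. ["file", "mysql", "redis"].
    `position` is one of "top", "bottom", "after:{type}", "before:{type}".
    """
    if source_type not in sources:
        raise ValueError(f"unknown source: {source_type}")
    # Take the source out of the chain, then re-insert it at the new place.
    rest = [s for s in sources if s != source_type]
    if position == "top":
        return [source_type] + rest
    if position == "bottom":
        return rest + [source_type]
    relation, _, target = position.partition(":")
    if relation not in ("before", "after") or target not in rest:
        raise ValueError(f"bad position: {position}")
    index = rest.index(target) + (1 if relation == "after" else 0)
    return rest[:index] + [source_type] + rest[index:]
```

In this sketch, moving a source relative to itself is rejected, because the target is looked up only after the moved source has been removed from the chain.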
263 | 
264 | #### APIs to manage `file` type source
265 | 
266 | ```
267 | GET /authorization/sources/file
268 | RESP:
269 | { "type": "file", "rules": "...", "path": "..." }
270 | ```
271 | 
272 | ```
273 | PUT /authorization/sources/file
274 | BODY:
275 | { "type": "file", "rules": "...", "path": "..." }
276 | ```
277 | 


--------------------------------------------------------------------------------
/implemented/0019-plugins.md:
--------------------------------------------------------------------------------
  1 | # EMQ X 5.0 plugins
  2 | 
  3 | ## Change log
  4 | 
  5 | * 2021-11: Guowei Li
  6 | * 2021-12-07: @zmstone move it to GitHub
  7 | 
  8 | ## Abstract
  9 | 
 10 | This EIP documents the implementation proposal for EMQ X 5.0 plugins.
 11 | 
 12 | ## Background
 13 | 
 14 | Prior to EMQ X 5.0, most features were implemented as plugins. A plugin is an Erlang application that registers a set of callback APIs at certain pre-defined hook points.
 15 | 
 16 | The applications are configured and loaded separately after the node is booted.
 17 | 
 18 | In 5.0, although most features, such as authentication, authorisation, and the rule engine, are still implemented by registering hooks, they are no longer managed the way the old plugins were.
 19 | 
 20 | Prior to 4.3, most plugins were hosted in their own Git repos. This changed in 4.3, when they were brought together in an umbrella project.
 21 | 
 22 | All these past changes were made to improve both the user experience and the development experience. But the flexibility of plugin applications, and the ability to host them in separate Git repos, still have their advantages, and we should continue supporting them.
 23 | 
 24 | The challenges ahead are:
 25 | 
 26 | - We want to continue supporting old plugins, but how do we minimise the effort of migrating them to 5.0?
 27 | Although the hook points are still there, the config format is slightly different, and the schema is also different (cuttlefish → HOCON).
 28 | - How do we provide better management interfaces for external plugins, e.g. a management API or even a UI?
 29 | - Callbacks under the same hook point are ordered by their priority. However, most plugins (internal plugins included) use the default value 0 today, meaning their order is left to chance.
 30 | 
 31 | It is worth mentioning that, since 4.3.2, EMQ X has supported drop-in installation of external plugins for the first time. That is, plugins can be compiled and packaged separately (instead of being released as a part of the EMQ X package), see [Develop and deploy EMQ X plugin for Enterprise 4.3](https://emqx.atlassian.net/wiki/spaces/EMQX/blog/2021/05/23/168591472)
 32 | 
 33 | ## References
 34 | 
 35 | - RabbitMQ: [Plugin Development Basics — RabbitMQ](https://www.rabbitmq.com/plugin-development.html)
 36 | - Grafana: [Pie Chart](https://grafana.com/grafana/plugins/grafana-piechart-panel/)
 37 | - WordPress: [WordPress Plugins](https://wordpress.org/plugins/)
 38 | - RabbitMQ Plugin release packages: [Releases · rabbitmq/rabbitmq-management-exchange](https://github.com/rabbitmq/rabbitmq-management-exchange/releases)
 39 | 
 40 | ### High level requirements
 41 | 
 42 | 1. Hook points should be compatible with 4.x
 43 | 1. Clearer ordering of all hook callbacks under a given hook point. For example, internal hooks use priority 1000, and external plugins are forced to provide a priority number. If they wish to be ordered before internal plugins, they can choose a number smaller than 1000.
 44 | 1. All built-in features such as authentication and authorisation are not presented as plugins except for
 45 |     1. LDAP authentication
 46 |     1. PSK authentication
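
The ordering requirement above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here (internal hooks at 1000, external plugins must give a number, smaller numbers ordered first); the names `register` and `call_order` are hypothetical and do not refer to the EMQ X hooks API.

```python
INTERNAL_PRIORITY = 1000  # priority used by internal hooks, per the requirement above


def register(callbacks, name, priority=None, external=True):
    """Append a callback; external plugins must state a priority explicitly."""
    if external and priority is None:
        raise ValueError(f"external plugin {name!r} must provide a priority")
    if priority is None:
        priority = INTERNAL_PRIORITY
    callbacks.append((name, priority))


def call_order(callbacks):
    """Smaller numbers come first; sorted() is stable, so ties keep registration order."""
    return [name for name, _ in sorted(callbacks, key=lambda cb: cb[1])]
```

Because the sort is stable, callbacks that share a priority are ordered by registration time rather than by chance.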
 47 | 
 48 | ### Plugin types
 49 | 
 50 | There are two different kinds of plugins, 'pre-built' and 'external'.
 51 | Pre-built plugins are released as a part of the EMQ X (CE or EE) official release package.
 52 | External plugins are developed and released independently.
 53 | 
 54 | ### External plugin security concerns
 55 | 
 56 | An external plugin is loaded and executed like any other EMQ X component, without any
 57 | access restriction or scope confinement.
 58 | 
 59 | EMQ X team's long term plan is to introduce a code review & build platform
 60 | (like an app market place) so EMQ X CE users and EE customers can have a trusted
 61 | source to download the packages.
 62 | 
 63 | Until the review & build process is in place,
 64 | EMQ X's users and customers can only be advised to take
 65 | extra care when loading a plugin developed by a third party.
 66 | 
 67 | ### Basic steps to install an external plugin
 68 | 
 69 | - Download compiled zip package
 70 | - Upload to a specific dir
 71 | - Execute a command to validate & install & enable & uninstall the plugin
 72 | 
 73 | ### Manage plugins from Dashboard UI
 74 | 
 75 | - Manage installation from Dashboard GUI
 76 |     - upload (and extract, but not persist it)
 77 |     - install
 78 |     - uninstall
 79 | - Manage a list of installed plugins, supported actions:
 80 |     - List view
 81 |     - Show running status: "running" or "stopped".
 82 |       Status should be presented per-node, e.g. `"status": "running"` for the current node (serving the API), or `"node_status": [{"node": "node1", "status": "running"}, ...]` for a summary view of all nodes in the cluster.
 83 |     - support actions: "start" or "stop"
 84 | 
 85 | ## Plugin package
 86 | 
 87 | A plugin package is a zip file made of two files inside:
 88 | 
 89 | * A tar file for the compiled beams (and maybe source code too),
 90 | * A metadata file in JSON format
 91 | 
 92 | ### Plugin tar
 93 | 
 94 | The tar should be of layout
 95 | 
 96 | ```
 97 | ├── emqx_extplug1
 98 | │   ├── LICENSE
 99 | │   ├── Makefile
100 | │   ├── README.md
101 | │   ├── ebin
102 | │   │   ├── emqx_extplug1.app
103 | │   │   └── emqx_extplug1.beam
104 | │   ├── etc
105 | │   │   └── emqx_extplug1.conf
106 | │   ├── priv
107 | │   │   └── .. # maybe
108 | │   ├── rebar.config
109 | │   └── src
110 | │       └── ... # maybe
111 | ├── extplug1_dep1
112 | │   ├── LICENSE
113 | ...
114 | ```
115 | 
116 | ### Plugin metadata
117 | 
118 | Inside the plugin zip-package, there should be a JSON file to help describe, identify and validate the package.
119 | 
120 | - Name (same as the Erlang application name, it has to be globally unique)
121 | - Version
122 | - Build-datetime
123 | - sha256-checksum-for-tar
124 | - Authors
125 |     - Free text Name & Contact information such as email or website
126 | - Builder
127 |     - Name
128 |     - Contact
129 |     - Optional: Builder's website (to find e.g. public key)
130 |     - Optional: Builder's signing signature for the package
131 | - URL to source code
132 | - What functionalities (one or more of below)
133 |     - authentication
134 |     - authorisation
135 |     - data_persistence
136 |     - rule_engine_extension
137 | - Compatibility
138 |     - Compatible with EMQ X version(s). The implicit lower bound of the supported version range is `5.0.0`. Version comparison operators should also be supported: `~>`, `>=`, `>`, `<=`, `<`, `==`
139 |       ref: https://github.com/erlang/rebar3/blob/c102379385013896711bba3969f280f851c67cc7/src/rebar_packages.erl#L376-L392
140 |     - Supported OTP releases (has to be the same as EMQ X's supported OTP versions)
141 | 
142 | We will perhaps need a rebar3 plugin to help generate the metadata file.
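
As a rough illustration of how the metadata could be used to validate a package, the Python sketch below checks that a few fields are present and that the tar matches its SHA-256 checksum. The JSON key names (`name`, `version`, `sha256_checksum_for_tar`) are assumptions, since this proposal lists the fields but does not fix their spelling.

```python
import hashlib
import json

# Hypothetical key names; the proposal only lists the fields informally.
REQUIRED_KEYS = ("name", "version", "sha256_checksum_for_tar")


def validate_package(metadata_json, tar_bytes):
    """Return (ok, reason) for a metadata document and the raw tar contents."""
    metadata = json.loads(metadata_json)
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        return False, "missing metadata keys: " + ", ".join(missing)
    # Compare the recorded checksum with the one computed from the tar bytes.
    actual = hashlib.sha256(tar_bytes).hexdigest()
    if actual != metadata["sha256_checksum_for_tar"].lower():
        return False, "checksum mismatch"
    return True, "ok"
```

A real validator would also check the compatibility fields and, if present, the builder's signature; those steps are left out of this sketch.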
143 | 
144 | ## User Interface
145 | 
146 | ### Management **APIs**
147 | 
148 | - List plugins
149 | 
150 | ```
151 | GET /plugins
152 | [{"metadata":
153 |     { "name": "emqx_foobar",
154 |       "description": "EMQ X plugin to implement foobar feature",
155 |       "version": "0.1.0",
156 |       ...
157 |     },
158 |   "status": "running", // disabled | running | stopped
159 |   "node_status": [...]
160 | },..]
161 | ```
162 | 
163 | The `disabled` state is recognised when the plugin is installed (unzipped), but not configured to be loaded.
164 | 
165 | - Get one plugin
166 | 
167 | ```
168 | GET /plugins/{name}
169 | {"metadata":
170 |     { "name": "emqx_foobar",
171 |       "description": "EMQ X plugin to implement foobar feature",
172 |       "version": "0.1.0",
173 |       ...
174 |     },
175 |  "status": "running",
176 |  "node_status": [...]
177 | }
178 | ```
179 | 
180 | - Upload a package
181 | 
182 | This API uploads and extracts a package
183 | 
184 | NOTE: since this API is a potential security threat, it is disabled by default.
185 | To enable it, one has to change the configuration.
186 | This config change cannot be made via the HTTP API, only by changing
187 | the config file, or from the `emqx_ctl` command line.
188 | 
189 | ```
190 | POST /plugins/upload
191 | request: binary data
192 | response: OK | error
193 | ```
194 | 
195 | - Enable / Start / Stop a plugin
196 | 
197 | ```
198 | PUT /plugins/{name}
199 | {
200 |     "status": running | stopped | disabled
201 | }
202 | 
203 | PUT /nodes/{node}/plugins/{name}
204 | {
205 |     "status": running | stopped | disabled
206 | }
207 | ```
208 | 
209 | - Delete a plugin
210 | 
211 | ```
212 | DELETE /plugins/{name}
213 | DELETE /nodes/{node}/plugins/{name}
214 | ```
215 | 


--------------------------------------------------------------------------------
/implemented/0020-assets/evacuation-coordinator-statuses-enforce.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/evacuation-coordinator-statuses-enforce.png


--------------------------------------------------------------------------------
/implemented/0020-assets/evacuation-coordinator-statuses-enforce.uml:
--------------------------------------------------------------------------------
 1 | @startuml evacuation-coordinator-statuses-enforce
 2 | skinparam monochrome true
 3 | skinparam ranksep 20
 4 | skinparam dpi 150
 5 | skinparam arrowThickness 0.7
 6 | skinparam packageTitleAlignment left
 7 | skinparam usecaseBorderThickness 0.4
 8 | skinparam defaultFontSize 12
 9 | 
10 | (disabled) --> (evicting_conns)
11 | (evicting_conns) --> (evicting_conns)
12 | (evicting_conns) --> (disabled)
13 | (evicting_conns) --> (waiting_takeover)
14 | (waiting_takeover) --> (evicting_sessions)
15 | (waiting_takeover) --> (disabled)
16 | (evicting_sessions) --> (evicting_sessions)
17 | (evicting_sessions) --> (prohibiting)
18 | (evicting_sessions) --> (disabled)
19 | (prohibiting) --> (disabled)
20 | @enduml
21 | 


--------------------------------------------------------------------------------
/implemented/0020-assets/evacuation-coordinator-statuses-rebalance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/evacuation-coordinator-statuses-rebalance.png


--------------------------------------------------------------------------------
/implemented/0020-assets/evacuation-coordinator-statuses-rebalance.uml:
--------------------------------------------------------------------------------
 1 | @startuml evacuation-coordinator-statuses-rebalance
 2 | skinparam monochrome true
 3 | skinparam dpi 150
 4 | skinparam arrowThickness 0.7
 5 | skinparam usecaseBorderThickness 0.4
 6 | skinparam defaultFontSize 12
 7 | 
 8 | 
 9 | (disabled) --> (wait_health_check)
10 | (wait_health_check) --> (evicting_conns)
11 | (wait_health_check) --> (disabled)
12 | (evicting_conns) --> (evicting_conns)
13 | (evicting_conns) --> (wait_takeover)
14 | (evicting_conns) --> (disabled)
15 | (wait_takeover) --> (evicting_sessions)
16 | (wait_takeover) --> (disabled)
17 | (evicting_sessions) --> (evicting_sessions)
18 | (evicting_sessions) --> (disabled)
19 | @enduml
20 | 
21 | 


--------------------------------------------------------------------------------
/implemented/0020-assets/eviction-agent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/eviction-agent.png


--------------------------------------------------------------------------------
/implemented/0020-assets/node-rebalance-algo0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance-algo0.png


--------------------------------------------------------------------------------
/implemented/0020-assets/node-rebalance-algo1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance-algo1.png


--------------------------------------------------------------------------------
/implemented/0020-assets/node-rebalance-algo2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance-algo2.png


--------------------------------------------------------------------------------
/implemented/0020-assets/node-rebalance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0020-assets/node-rebalance.png


--------------------------------------------------------------------------------
/implemented/0021-assets/aws-mqtt-file-delivery.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/aws-mqtt-file-delivery.png


--------------------------------------------------------------------------------
/implemented/0021-assets/aws-mqtt-file-delivery.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | !theme blueprint
 3 | == Use DescribeStream to get stream data ==
 4 | Client -> Server: Subscribe('/description')
 5 | Server --> Client: Sub Ack
 6 | Client -> Server: Subscribe('/rejected')
 7 | Server --> Client: Sub Ack
 8 | Client -> Server: Publish('/describe')
 9 | note right
10 | DescribeStreamRequest
11 | {
12 |     "c": "ec944cfb-1e3c-49ac-97de-9dc4aaad0039"
13 | }
14 | "c": client token field
15 | end note
16 | Server --> Client: Publish('/description')
17 | note left
18 | DescribeStreamResponse
19 | {
20 |     "c": "ec944cfb-1e3c-49ac-97de-9dc4aaad0039",
21 |     "s": 1,
22 |     "d": "This is the description of stream ABC.",
23 |     "r": [
24 |         {
25 |             "f": 0,
26 |             "z": 131072
27 |         },
28 |         {
29 |             "f": 1,
30 |             "z": 51200
31 |         }
32 |     ]
33 | }
34 | "c": client token field
35 | "s": stream version as an integer
36 | "r": contains a list of the files in the stream.
37 |      "f": stream file ID as an integer.
38 |      "z": stream file size in number of bytes.
39 | "d": description of the stream.
40 | end note
41 | == Get data blocks from a stream file ==
42 | Client -> Server: Subscribe('/data')
43 | Server --> Client: Sub Ack
44 | Client -> Server: Publish('/get')
45 | note right
46 | GetStreamRequest
47 | {
48 |     "c": "1bb8aaa1-5c18-4d21-80c2-0b44fee10380",
49 |     "s": 1,
50 |     "f": 0,
51 |     "l": 4096,
52 |     "o": 2,
53 |     "n": 100,
54 |     "b": "..."
55 | }
56 | [optional] "c": client token field
57 | [optional] "s": stream version field
58 | "f": stream file ID
59 | "l": data block size in bytes
60 | [optional] "o": offset of the block in the stream file
61 | [optional] "n": number of blocks requested
62 | [optional] "b": bitmap that represents the blocks being requested
63 | end note
64 | Server --> Client: Publish('/data')
65 | note left
66 | GetStreamResponse
67 | {
68 |     "c": "1bb8aaa1-5c18-4d21-80c2-0b44fee10380",
69 |     "f": 0,
70 |     "l": 4096,
71 |     "i": 2,
72 |     "p": "..."
73 | }
74 | "c": client token field
75 | "f": ID of the stream file
76 | "l": size of the data block payload in bytes
77 | "i": ID of the data block contained in the payload
78 | "p": the data block payload (base64)
79 | end note
80 | Server --> Client: Publish('/data')
81 | note left
82 | GetStreamResponse
83 | end note
84 | Server --> Client: Publish('/data')
85 | note left
86 | GetStreamResponse
87 | end note
88 | @enduml
89 | 


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-abort.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-abort.png


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-abort.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | !theme blueprint
 3 | Client -> Broker: PUBLISH('$file/{fileId}/0/{sha256}')
 4 | note right
 5 | Payload: <binary blob 1kB>
 6 | end note
 7 | Broker -> Broker: store {filepath}/{filename} at 0, 1kB
 8 | Broker --> Client: PUBACK 0x00
 9 | Client -> Broker: PUBLISH('$file/{fileId}/abort')
10 | Broker -> Broker: delete {filepath}/{filename}
11 | Broker --> Client: PUBACK 0x00
12 | @enduml
13 | 


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-async-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-async-1.png


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-async-1.uml:
--------------------------------------------------------------------------------
 1 | @startuml flow-async-1
 2 | !theme blueprint
 3 | Client -> Broker: PUBLISH(pktid=1, topic=$file-async/[COMMAND])
 4 | Broker --> Client: PUBACK(pktid=1, rc=0)
 5 | note right
 6 | Operation start
 7 | end note
 8 | Client -> Broker: PUBLISH(pktid=2, topic=...)
 9 | Broker --> Client: PUBACK(pktid=2, rc=...)
10 | Broker -> "$file-response/{clientId}": PUBLISH $file-response/{clientId}
11 | note left
12 | Operation end
13 | end note
14 | note right
15 | {
16 |   "vsn": "0.1",
17 |   "topic": "$file-async/[COMMAND]",
18 |   "packet_id": 1,
19 |   "reason_code": 0,
20 |   "reason_description": "success"
21 | }
22 | end note
23 | @enduml
24 | 


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-happy-path.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-happy-path.png


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-happy-path.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | !theme blueprint
 3 | Client -> Client: generate UUID=8568BA42-..
 4 | Client -> Broker: PUBLISH('$file/8568BA42-../init')
 5 | note right
 6 | {
 7 |   "name": "ml-logs-data.log",
 8 |   "expire_at": 1696659943,
 9 |   "size": 3075
10 | }
11 | end note
12 | Broker --> Client: PUBACK 0x00
13 | Client -> Client: read segment #0, calculate sha256
14 | Client -> Broker: PUBLISH('$file/8568BA42-../0/{sha256}')
15 | note right
16 | Payload: <binary blob 1kB>
17 | end note
18 | Broker -> Broker: verify checksum - ok
19 | Broker -> Broker: store logs/data-log at 0, 1kB
20 | Broker --> Client: PUBACK 0x00
21 | Client -> Client: read segment #1, calculate sha256
22 | Client -> Broker: PUBLISH('$file/8568BA42-../1024/{sha256}')
23 | note right
24 | Payload: <binary blob 1kB>
25 | end note
26 | Client -> Client: read segment #2, calculate sha256
27 | Client -> Broker: PUBLISH('$file/8568BA42-../2048/{sha256}')
28 | note right
29 | Payload: <binary blob 1kB>
30 | end note
31 | Client -> Client: read segment #3, calculate sha256
32 | Client -> Broker: PUBLISH('$file/8568BA42-../3072/{sha256}')
33 | note right
34 | Payload: <binary blob 3 bytes>
35 | end note
36 | Broker -> Broker: verify checksum - ok
37 | Broker -> Broker: verify checksum - ok
38 | Broker -> Broker: verify checksum - ok
39 | Broker -> Broker: store logs/data-log at 1024, 2051 bytes
40 | Broker --> Client: PUBACK 0x00
41 | Broker --> Client: PUBACK 0x00
42 | Broker --> Client: PUBACK 0x00
43 | Client -> Broker: PUBLISH('$file/8568BA42-../fin/3075/{sha256}')
44 | Broker -> Broker: verify checksum - ok
45 | Broker -> Broker: finalize logs/data-log
46 | Broker -> Storage: upload logs/data-log
47 | Storage --> Broker: upload ok
48 | Broker --> Client: PUBACK 0x00
49 | Broker --> Broker: cleanup
50 | @enduml
51 | 


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-restart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-restart.png


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-restart.uml:
--------------------------------------------------------------------------------
 1 | @startuml
 2 | !theme blueprint
 3 | Client -> Broker: PUBLISH('$file/{fileId}/0/{sha256}')
 4 | note right
 5 | Payload: <binary blob 1kB>
 6 | end note
 7 | Broker --> Broker: verify checksum - failed
 8 | Broker --> Client: PUBACK 0x80
 9 | Client -> Broker: PUBLISH('$file/{fileId}/0/{sha256}')
10 | note right
11 | Payload: <binary blob 1kB>
12 | end note
13 | Broker --> Broker: verify checksum - ok
14 | Broker -> Broker: store {filepath}/{filename} at 0, 1kB
15 | Broker --> Client: PUBACK 0x00
16 | @enduml
17 | 


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-sync-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-sync-1.png


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-sync-1.uml:
--------------------------------------------------------------------------------
 1 | @startuml flow-sync-1
 2 | !theme blueprint
 3 | Client -> Broker: PUBLISH(pktid=1, topic=$file/[COMMAND])
 4 | note right
 5 | Operation start
 6 | end note
 7 | Broker --> Client: PUBACK(pktid=1, rc=0)
 8 | note right
 9 | Operation end
10 | end note
11 | @enduml
12 | 


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-sync-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0021-assets/flow-sync-2.png


--------------------------------------------------------------------------------
/implemented/0021-assets/flow-sync-2.uml:
--------------------------------------------------------------------------------
 1 | @startuml flow-sync-2
 2 | !theme blueprint
 3 | Client -> Broker: PUBLISH(pktid=1, topic=$file/[COMMAND])
 4 | note right
 5 | Operation start
 6 | end note
 7 | Client -> Broker: PUBLISH(pktid=2, topic=...)
 8 | Broker --> Client: PUBACK(pktid=2, rc=...)
 9 | Broker --> Client: PUBACK(pktid=1, rc=0)
10 | note right
11 | Operation end
12 | end note
13 | @enduml
14 | 


--------------------------------------------------------------------------------
/implemented/0021-transfer-files-over-mqtt.md:
--------------------------------------------------------------------------------
  1 | # File transfer over MQTT
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2022-09-20: @qzhuyan, @id, @zmstone Initial draft
  6 | 
  7 | ## Abstract
  8 | 
  9 | This document defines a protocol for sending files from MQTT clients to an MQTT server. It uses only `PUBLISH` and `PUBACK` messages and does not require MQTT 5.0 features like topic aliases.
 10 | 
 11 | ## Motivation
 12 | 
 13 | EMQX customers are asking for file transfer functionality from IoT devices to their cloud (primary use case), and from cloud to IoT devices over MQTT. Right now they are uploading files from devices via FTP or HTTPS (e.g. to S3), but this approach has downsides:
 14 | 
 15 | * FTP and HTTP servers usually struggle to keep up with a large number of simultaneous bandwidth-intensive connections
 16 | * packet loss or reconnect forces clients to restart the transfer
 17 | * devices which already talk MQTT need to integrate with one more SDK, address authentication and authorization, and potentially go through an additional round of security audit
 18 | 
 19 | Known cases of device-to-cloud file transfer:
 20 | 
 21 | * [CAN bus](https://en.wikipedia.org/wiki/CAN_bus) data
 22 | * Image taken by industry camera for Quality Assurance
 23 | * Large data file collected from forklift
 24 | * Video and audio data from truck cars, and video data captured by inbound unloading cameras
 25 | * Vehicle real-time logging, telemetry, messaging
 26 | * Upload collected ML logs
 27 | 
 28 | Known cases of cloud-to-device file transfer:
 29 | 
 30 | * Upload AI/ML models
 31 | * Firmware upgrades
 32 | 
 33 | Even though devices could already send binary data in MQTT packets, it is not trivial to guarantee reliable file transfer without clear expectations from client and server side.
 34 | 
 35 | ## Terminology
 36 | 
 37 | The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [BCP 14](https://www.rfc-editor.org/bcp/bcp14) [[RFC2119](https://www.rfc-editor.org/rfc/rfc2119)] [[RFC8174](https://www.rfc-editor.org/rfc/rfc8174)] when, and only when, they appear in all capitals, as shown here.
 38 | 
 39 | The following terms are used as described in [MQTT Version 5.0 Specification](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html):
 40 | * Application Message
 41 | * Server
 42 | * Client
 43 | * Topic
 44 | * Topic Name
 45 | * Topic Filter
 46 | * MQTT Control Packet
 47 | 
 48 | *At least once*: a message can be delivered many times, but cannot be lost
 49 | 
 50 | ## Requirements
 51 | 
 52 | * The protocol MUST use only PUBLISH type of MQTT Control Packet
 53 | * The protocol MUST support transfer of file segments
 54 | * Server MUST be able to verify integrity of each file segment and of the whole file
 55 | * Client MAY know total file size when initiating the transfer
 56 | * Client MAY abort file transfer
 57 | * Server MAY ask the client to pause file transfer
 58 | * Server MAY ask the client to abort file transfer
 59 | * The protocol MUST guarantee "At least once" delivery
 60 | * Server MUST NOT support subscription on topics dedicated for file transfer
 61 | 
 62 | ## AWS IoT MQTT-based file delivery (reference design)
 63 | 
 64 | As an example of an existing implementation, we can look at AWS IoT Core, [which provides functionality](https://docs.aws.amazon.com/iot/latest/developerguide/mqtt-based-file-delivery.html) to [deliver files to IoT devices](https://docs.aws.amazon.com/iot/latest/developerguide/mqtt-based-file-delivery-in-devices.html):
 65 | 
 66 | ![](0021-assets/aws-mqtt-file-delivery.png)
 67 | 
 68 | ## Design
 69 | 
 70 | ### Overview
 71 | 
 72 | * Files are split into segments; segments can be of arbitrary length
 73 | * Client SHOULD generate unique identifier for each file being transferred and use it as `fileId` in Topic Name (UUID according to [RFC 4122](https://www.rfc-editor.org/rfc/rfc4122) is recommended)
 74 | * Client SHOULD consider `fileId` as a unique identifier for the file transfer, and MUST NOT reuse it for other file transfers
 75 | * Broker SHOULD consider `clientId` + `fileId` pair as a Broker-wide unique identifier for the file transfer
 76 | * Client MAY calculate SHA-256 checksum of the segment it's about to send and send it as part of Topic Name
 77 | * Client MAY calculate SHA-256 checksum of the file it's about to send and include it in the `init` message payload or send it as part of the `fin` message
 78 | * If Client chooses to provide checksum for file segments, whole file, or both, it MUST use [SHA-256](https://www.rfc-editor.org/rfc/rfc6234)
 79 | * If checksum is included in the `init` message payload, the Broker MUST use it to verify integrity of the file after receiving the `fin` message for the corresponding file transfer
 80 | * If checksum is included in topic name, Broker MUST use it to verify integrity of corresponding data:
 81 |   * segment, if it's a segment transfer message
 82 |   * whole file, if it's a `fin` message
 83 | * If checksum verification fails, Broker MUST reject the corresponding data
 84 | * Client MUST use Topics starting with `$file/` (or `$file-async/`) to transfer files
 85 | * Broker MUST NOT let clients subscribe to Topics starting with `$file/` (or `$file-async/`)
 86 | * Segment length can be calculated on the server side by subtracting the length of the [Variable Header](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901025) from the [Remaining Length](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901105) field that is in the [Fixed Header](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901021)
 87 | 
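The segment length calculation in the last item can be sketched as follows. This is a minimal illustration, assuming a QoS 1 `PUBLISH` packet in the MQTT v5.0 layout (Topic Name, Packet Identifier, Property Length, properties); the helper names `varint_size` and `segment_length` are hypothetical and not part of this proposal.

```python
def varint_size(value):
    """Number of bytes MQTT uses to encode `value` as a Variable Byte Integer."""
    if value < 0 or value > 268_435_455:
        raise ValueError("out of range for a Variable Byte Integer")
    size = 1
    while value >= 128:
        value //= 128
        size += 1
    return size


def segment_length(remaining_length, topic, properties_length=0):
    """Payload length of a QoS 1 PUBLISH packet (assumed MQTT v5.0 layout)."""
    topic_bytes = len(topic.encode("utf-8"))
    variable_header = (
        2 + topic_bytes                    # Topic Name: 2-byte length prefix + UTF-8 bytes
        + 2                                # Packet Identifier (present because QoS > 0)
        + varint_size(properties_length)   # Property Length field itself
        + properties_length                # the encoded properties
    )
    return remaining_length - variable_header
```

For MQTT v3.1.1 packets there is no properties section, so the two property terms would be dropped.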
 88 | ### Protocol flow
 89 | 
 90 | Data is transferred in PUBLISH packets in the following order:
 91 | 
 92 | 1. `$file[-async]/{fileId}/init`
 93 | 
 94 |     ```
 95 |     {
 96 |       "name": "ml-logs-data.log",
 97 |       "size": 12345,
 98 |       "checksum": "1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef",
 99 |       "expire_at": 1696659943,
100 |       "segments_ttl": 600
101 |     }
102 |     ```
103 | 
104 | 2. `$file[-async]/{fileId}/{offset}[/{checksum}]`
105 | 
106 |     ```
107 |     <file segment data>
108 |     ```
109 | 
110 | 3. `$file[-async]/{fileId}/{offset}[/{checksum}]`
111 | 
112 |     ```
113 |     <file segment data>
114 |     ```
115 | 
116 | 4. ...
117 | 
118 | 5. `$file[-async]/{fileId}/[fin/{fileSize}[/{checksum}] | abort]`
119 | 
120 |     No payload
121 | 
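The ordering above can be sketched on the client side as follows. This is an illustration only: the function name `plan_transfer`, the fixed segment size, and the choice to always attach hex-encoded SHA-256 checksums are assumptions made for the example, and the `init` payload carries only `name` and `size`.

```python
import hashlib
import json
import uuid


def plan_transfer(data, name, segment_size=1024, async_mode=False, file_id=None):
    """Return the ordered (topic, payload) pairs for one file transfer."""
    prefix = "$file-async" if async_mode else "$file"
    file_id = file_id or str(uuid.uuid4())

    # 1. init: metadata as a JSON document
    meta = {"name": name, "size": len(data)}
    messages = [(f"{prefix}/{file_id}/init", json.dumps(meta).encode("utf-8"))]

    # 2..N. one message per segment, addressed by byte offset
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        checksum = hashlib.sha256(segment).hexdigest()
        messages.append((f"{prefix}/{file_id}/{offset}/{checksum}", segment))

    # N+1. fin: total size and whole-file checksum in the topic, no payload
    file_checksum = hashlib.sha256(data).hexdigest()
    messages.append((f"{prefix}/{file_id}/fin/{len(data)}/{file_checksum}", b""))
    return messages
```

Each pair would then be published with QoS 1 by whatever MQTT client library is in use.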
122 | 
123 | ### Sync mode vs Async mode
124 | 
125 | Individual file transfer commands may be handled in two modes: synchronous and asynchronous. The client chooses the mode by sending commands either to `$file/...` (sync mode) or `$file-async/...` (async mode) topics.
126 | 
127 | The client is free to mix sync and async commands arbitrarily within the same file transfer session.
128 | 
129 | #### Synchronous mode
130 | 
131 | In synchronous mode, the successful/unsuccessful status of individual operations is communicated to the client via Reason Code field of MQTT `PUBACK` messages. See [Reason codes](#reason-codes) for details.
132 | 
133 | ![](./0021-assets/flow-sync-1.png)
134 | 
135 | Caveats:
136 | 
137 | * Some operations (e.g. the `fin` command) may take considerable time to complete. If a client wants to keep using the session while waiting for the result, it should either use async mode or handle several unacknowledged PUBLISH messages in parallel itself, like this:
138 | 
139 | ![](./0021-assets/flow-sync-2.png)
140 | * MQTTv3 clients do not support reason codes, so async mode is the preferred option for them.
141 | 
142 | #### Asynchronous mode
143 | 
144 | In asynchronous mode, the logic of command handling is the following:
145 | * Client sends a command to `$file-async/...` topic.
146 | * Broker responds with `PUBACK`. Nonzero Reason Code indicates immediate failure.
147 | * Zero Reason Code indicates that the command has been accepted for processing. The client is expected to wait for the actual result of the command via `$file-response/{clientId}` topic.
148 | 
149 | ![](./0021-assets/flow-async-1.png)
150 | 
151 | The result of the command is a JSON document with the following fields:
152 | ```json
153 | {
154 |   "vsn": "0.1",
155 |   "topic": "$file-async/[COMMAND]",
156 |   "packet_id": 1,
157 |   "reason_code": 0,
158 |   "reason_description": "success"
159 | }
160 | ```
161 | 
162 | | Field | Description |
163 | |-------|-------------|
164 | | `vsn`  | response document format version |
165 | | `topic`  | the topic of the command that the response is for, e.g. `$file-async/somefileid/init` |
166 | | `packet_id`  | the MQTT packet id of the command packet (`PUBLISH`) that the response is for |
167 | | `reason_code`  | the result code of the command execution. See [Reason Codes](#reason-codes) for details |
168 | | `reason_description`  | the human-readable description of the Reason Code |
169 | 
170 | JSON Schema is available [here](https://github.com/emqx/mqtt-file-transfer-schema/blob/v1.1.0/response.json).
171 | 
172 | Notes:
173 | * The operation result is always sent to the response topic (both in case of immediate failure and in case of actual processing). So the response topic may be used as the only source of information about the operation result. Also, this is the only variant available for MQTTv3 clients.
174 | * The response topic is client-specific. The client should subscribe to it before sending the command.
175 | * The client may override the response topic using the MQTT v5.0 [Request/Response pattern](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901253).
176 | 
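A client-side handler for the response document might look like the sketch below. It assumes the client keeps a map of pending commands keyed by MQTT packet id; the function name `handle_response` and the `pending` structure are hypothetical and only illustrate how `topic`, `packet_id` and `reason_code` can be used together.

```python
import json


def handle_response(raw, pending):
    """Match a response document to a pending command and report its outcome.

    `pending` maps packet_id -> topic for commands still awaiting a result.
    Returns (topic, ok, description), or None if the response matches nothing.
    """
    doc = json.loads(raw)
    packet_id = doc["packet_id"]
    topic = pending.get(packet_id)
    # Packet ids are reused over time, so the topic is checked as well.
    if topic is None or topic != doc["topic"]:
        return None
    del pending[packet_id]
    return topic, doc["reason_code"] == 0, doc.get("reason_description", "")
```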
177 | #### Reason Codes
178 | 
179 | In the subsequent sections, the _reason code_ of a file transfer operation means the _final_ reason code of the command execution, that is, the Reason Code of `PUBACK` packet in the sync mode or the `reason_code` response document field value in the async mode.
180 | 
181 | ### Protocol messages
182 | 
183 | #### `$file[-async]/{fileId}/init` message
184 | 
185 | Initialize the file transfer. Server is expected to store metadata from the payload in the session along with `{fileId}` as a reference for the rest of file transfer.
186 | 
187 |   * Qos=1
188 |   * Payload Format Indicator=0x01
189 |   * `{fileId}` is corresponding file UUID
190 |   * Payload is a JSON document
191 | 
192 | Getting a successful reason code from the broker means that the file transfer has been initialized successfully, and the metadata has been persisted in the storage.
193 | 
194 | Broker MAY refuse to accept the file transfer in case of a metadata conflict, e.g. if a transfer with the same `{fileId}` from the same Client has a different `name` or `checksum` value. In that case, Client is expected to start the transfer with a different `{fileId}`.
195 | 
196 | Broker MAY abort incomplete file transfers after their respective sessions have been discarded, and clean up any resources associated with them.
197 | 
198 | Broker MAY refuse the file transfer if the `fileId` is too long, but generally `fileId`s of up to 255 bytes (in UTF-8 encoding) should be safe to use.
199 | 
200 | ##### `init` payload JSON Schema
201 | 
202 | Available [here](https://github.com/emqx/mqtt-file-transfer-schema/blob/v1.1.0/init.json).
203 | 
204 | * Broker MAY use `name` value as a filename in a file system
205 | 
206 |     This generally means that it SHOULD NOT contain path separators, and SHOULD NOT contain characters or sequences
207 |     of characters that are not allowed in filenames in the file system where the file is going to be stored. Also,
208 |     the filename SHOULD be limited to 255 bytes (in UTF-8 encoding).
209 | 
210 | * Broker SHOULD consider `size` value as informational only, given it's not required to be provided by the client
211 | 
212 |     Mandatory file size should be specified in the `fin` message Topic anyway, and may be different from the value
213 |     provided in the `size` field. The `size` field may be used for example to calculate the progress of the transfer,
214 |     which thus may be inaccurate.
215 | 
216 | * Broker SHOULD have default setting for `segments_ttl`
217 | 
218 | * Broker MAY delete segments of unfinished file transfers when their TTL has expired
219 | 
220 | * Broker MAY NOT honor `segments_ttl` value that is either too large or too small
221 | 
222 |     What counts as _too large_ or _too small_ is up to the Broker implementation and/or configuration.
223 | 
224 | #### `$file[-async]/{fileId}/{offset}[/{checksum}]` message
225 | 
226 | One such message for each file segment.
227 | 
228 |   * Qos=1
229 |   * Payload Format Indicator=0x00
230 |   * Payload is file segment bytes
231 |   * `{offset}` is byte offset of the given segment
232 |   * optional `{checksum}` is SHA-256 checksum of the file segment
233 | 
234 | Getting a successful reason code from the Broker means that the file segment has been verified (if checksum was provided) and successfully persisted in the storage.
235 | 
236 | #### `$file[-async]/{fileId}/fin/{fileSize}[/{checksum}]` message
237 | 
238 | All file segments have been successfully transferred.
239 | 
240 |   * Qos=1
241 |   * no payload
242 |   * optional `{checksum}` is SHA-256 checksum of the file
243 | 
244 | Getting a successful reason code from the Broker means that the file being transferred is ready to be used. This implies a lot of things:
245 |   * Broker has verified that it has corresponding metadata for the file
246 |   * Broker has verified that it has all the segments of the file up to `{fileSize}` persisted in the storage
247 |   * Broker has verified the file integrity (if checksum was provided)
248 |   * Broker has published the file along with its metadata to the location where it can be accessed by other users
249 | 
250 | In cases when checksum was provided both in the `init` message and in the `fin` message, Broker MUST ignore the former and use the latter.
251 | 
252 | Clients MUST expect that handling of the `fin` message may take considerable time, depending on the file size and the
253 | Broker implementation or configuration.
254 | 
255 | #### `$file[-async]/{fileId}/abort` message
256 | 
257 | Client wants to abort the transfer.
258 | 
259 |   * Qos=1
260 |   * no payload
261 | 
262 | ### Durability
263 | 
264 | This specification does not define how reliably the file transfer data SHOULD be persisted. It is up to the Broker implementation what specific durability guarantees it provides (e.g. datasync or replication factor). However, Broker is expected to support transfers that are interrupted by a network failure, Broker restart, or Client reconnect.
265 | 
266 | ### Reason codes
267 | 
268 | | Reason code | MQTT Name                     | Meaning in file transfer context                    |
269 | |-------------|-------------------------------|-----------------------------------------------------|
270 | | OMIT        |                               | Same as 0x00                                        |
271 | | 0x00        | Success                       | File segment has been successfully persisted        |
272 | | 0x10        | No matching subscribers       | Server asks Client to retransmit all segments       |
273 | | 0x80        | Unspecified error             | For segment transmission, Server asks Client to retransmit the segment. For `fin`, Server asks Client to retransmit all segments |
274 | | 0x83        | Implementation specific error | Server asks Client to cancel the transfer           |
275 | | 0x97        | Quota exceeded                | Server asks Client to pause the transfer            |
276 | 
277 | #### 0x97, "Quota exceeded", "Pause Transfer"
278 | 
279 | Client is expected to wait before trying to retransmit the file segment.
280 | 
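The table above can be read as a small decision function on the client side. The sketch below is one possible reading, not a normative mapping: the action names are invented for the example, and treating unknown codes as fatal is an assumption the table does not state.

```python
def next_action(reason_code, command):
    """Map a final reason code to the client action listed in the table.

    `command` is "segment" or "fin"; `reason_code` is None when the
    Reason Code was omitted from the PUBACK.
    """
    if command not in ("segment", "fin"):
        raise ValueError("the table only covers segment and fin commands")
    if reason_code is None or reason_code == 0x00:
        return "continue"
    if reason_code == 0x10:
        return "retransmit_all"
    if reason_code == 0x80:
        return "retransmit_segment" if command == "segment" else "retransmit_all"
    if reason_code == 0x83:
        return "cancel"
    if reason_code == 0x97:
        return "pause"
    # Assumption: codes not listed in the table are treated as fatal.
    return "cancel"
```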
281 | ### PUBACK from MQTT servers < v5.0
282 | 
283 | `PUBACK` messages prior to MQTT v5.0 do not carry Reason Code, so the only way for the client to know the result of the operation is to use async mode and wait for the response on the response topic.
284 | 
285 | ### Happy path
286 | 
287 | ![](0021-assets/flow-happy-path.png)
288 | 
289 | ### Transfer abort initiated by client
290 | 
291 | ![](0021-assets/flow-abort.png)
292 | 
293 | ### Transfer restart initiated by server
294 | 
295 | ![](0021-assets/flow-restart.png)
296 | 
297 | ## Backwards Compatibility
298 | 
299 | Full backward compatibility with MQTT 5.x and MQTT 3.x.
300 | 
301 | ## Innovation Opportunities
302 | 
303 | * Integration with rule engine
304 | * Pluggable storage backends
305 | * Multiple backend writes to a single file
306 | * Do the entire MQTT message logging
307 | * QUIC pure binary stream support
308 | * ACL enables client-side control of file size
309 | * Bulk upload at EMQX
310 | * Multi-node local cache utilization similar to HDFS
311 | * Cheaper (in computational costs) checksum algorithm
312 | * Server-to-client file transfer protocol
313 | 
314 | ## Declined Alternatives
315 | 
316 | * Use of MQTT extension headers
317 |   * Poor compatibility and complex application layer implementation
318 | * Capability negotiation
319 |   * Poor client compatibility, complex application layer implementation
320 | * Client supplied file name as file identifier instead of UUID
321 |   * potential security issues
322 |   * potential name clashing issues
323 | * checksum value instead of UUID
324 |   * will not work if client does not have full file when initiating the transfer
325 | * checksum as part of the payload, not as part of topic name
326 |   * Requires specification for payload serialization format leading to more complicated client code
327 | * File segment offset as part of the payload
328 |   * Same as above: requires a specification for the payload serialization format and leads to more complicated client code
329 | * Using constant segment size (per file) and sequential segment number instead of byte offset to reduce packet size
330 |   * Clients may want to change segment size dynamically to account for changes in network properties (e.g. moving from faster to slower and spotty network, or vice versa)
331 | 
332 | 


--------------------------------------------------------------------------------
/implemented/0022-forward-check-install-upgrade-script.md:
--------------------------------------------------------------------------------
  1 | # Forward compatibility check for install upgrade script (5.0)
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2022-12-07: @thalesmg Initial draft
  6 | * 2022-12-13: @thalesmg Added comment about downgrading
  7 | 
  8 | ## Abstract
  9 | 
 10 | This proposes a way to circumvent the limitation that the hot upgrade
 11 | installation scripts that are executed cannot be changed after a
 12 | package is released, making it impossible to fix bugs or change
 13 | upgrade logic.
 14 | 
 15 | ## Motivation
 16 | 
 21 | Currently, when performing a hot upgrade, the scripts that are run are
 22 | those from the currently installed EMQX version.  It has a validation
 23 | that prevents upgrade between different minor versions.  If we want to
 24 | allow an upgrade from, say, 5.0.x to 5.1.y, then the scripts in
 25 | already released packages will deny such an operation.  Also, if the
 26 | upgrade installation script of the current version contains a bug, it
 27 | will never be able to execute properly without manual patching.
 28 | 
 29 | By attempting to execute the scripts from the _target version_, we may
 30 | add fixes and new validations to new EMQX versions and have them
 31 | executed by older versions.
 32 | 
 33 | ## Design
 34 | 
 35 | ### Current upgrade procedure
 36 | 
 37 | 1. A compressed tarball with a conventional filename format
 38 |    (`<relname>-<version>.tar.gz`) is placed in the `releases`
 39 |    directory of the currently running EMQX installation.  Typically
 40 |    this means `/usr/lib/emqx/releases`.
 41 | 2. The user runs `emqx {install,upgrade,unpack} <relname>-<version>`.
 42 | 3. The `emqx` script then calls `install_upgrade.escript` **from the
 43 |    currently running version** with some info alongside the desired
 44 |    operation and new version.
 45 | 4. The script then processes the desired operation.
 46 | 
 47 | Currently, a versioned copy of `install_upgrade.escript` and `emqx`
 48 | are already installed with EMQX in the `bin` directory.  That means
 49 | `install_upgrade.escript-<version>` and `emqx-<version>`,
 50 | respectively.
 51 | 
 52 | ### Proposed new procedure
 53 | 
 54 | 1. First of all, we check if the currently installed and the target
 55 |    release profiles match.  For example, if the enterprise edition is
 56 |    currently installed and the user attempts to hot-upgrade using a
 57 |    community edition package (`emqx-<version>.tar.gz`) or vice-versa,
 58 |    the operation should abort.
 59 |    - This can be done by checking the `$IS_ENTERPRISE` variable that
 60 |      is set in `emqx_vars` and loaded by `bin/emqx` against the
 61 |      package filename: `emqx-<version>.tar.gz` for community edition,
 62 |      `emqx-enterprise-<version>.tar.gz` for enterprise edition.
 63 | 2. In the `emqx` script, we parse the desired operation and target
 64 |    version and check if scripts called
 65 |    `install_upgrade.escript-<target version>` and `emqx-<target
 66 |    version>` exist in the `bin` directory.
 67 |    - If they do, it means that the target version was already unpacked
 68 |      at some point, and we just execute `emqx-<target version>`
 69 |      passing it the necessary info.
 70 |    - We also need to check inside the `emqx` script if it is the
 71 |      `emqx-<target version>` script itself, to avoid an infinite loop.
 72 | 3. If either of these files does not exist, then, without executing
 73 |    the currently installed `install_upgrade.escript` file:
 74 |    1. We check if the `<relname>-<target version>.tar.gz` file is at
 75 |       the expected location (`releases`) and it's readable, bailing
 76 |       out otherwise.
 77 |    2. We extract _only_ the `install_upgrade.escript-<target version>`
 78 |       and `emqx-<target version>` files to the `bin` directory.
 79 |    3. Then we just call `emqx-<target version>` with the same
 80 |       arguments: `emqx-<target version> {install,unpack,upgrade}
 81 |       <target version>`.
 82 | 
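The decision logic of the steps above can be sketched as a pure function. This is an illustration only, not the actual `bin/emqx` shell logic: the function name `plan_upgrade`, its return values, and the idea of passing directory listings as sets are assumptions made for the example, and the readability check on the package is folded into membership in `release_files`.

```python
def plan_upgrade(bin_files, release_files, relname, target_version,
                 is_enterprise, running_script):
    """Decide the next step of a hot upgrade, following the proposed procedure.

    `bin_files` and `release_files` are sets of file names found in the
    `bin` and `releases` directories. Returns an (action, detail) tuple.
    """
    # Step 1: the installed profile must match the package profile.
    package_is_enterprise = relname == "emqx-enterprise"
    if package_is_enterprise != is_enterprise:
        return ("abort", "release profile mismatch")

    target_emqx = f"emqx-{target_version}"
    target_escript = f"install_upgrade.escript-{target_version}"

    # Loop guard: the target script itself must not re-dispatch.
    if running_script == target_emqx:
        return ("run_install_upgrade", target_escript)

    # Step 2: target scripts already unpacked, so just hand over to them.
    if {target_emqx, target_escript} <= bin_files:
        return ("exec", target_emqx)

    # Step 3: extract only the two scripts from the package, then hand over.
    package = f"{relname}-{target_version}.tar.gz"
    if package not in release_files:
        return ("abort", "package not found or unreadable")
    return ("extract_then_exec", target_emqx)
```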
 83 | For _downgrading_, since a newer script might have bug fixes and also
 84 | have laxer restrictions on the source and target versions (for
 85 | example, an older script might forbid 5.0 <-> 5.1, but the newer one
 86 | allows it), we have to use the _newer_ script as well.  Otherwise the
 87 | user might risk getting stuck at the newer version without being able
 88 | to downgrade.
 89 | 
 90 | ## Configuration Changes
 91 | 
 92 | No configuration changes needed.
 93 | 
 94 | ## Backwards Compatibility
 95 | 
 96 | At the time of writing, EMQX 5.0 has not been tracking appup
 97 | changes for hot upgrades, so this change shouldn't pose backwards
 98 | compatibility issues.
 99 | 
100 | ## Document Changes
101 | 
102 | The documentation for 5.0 currently doesn't even have a section for
103 | hot upgrades, so it'll need to be ported from 4.x.
104 | 
105 | ## Testing Suggestions
106 | 
107 | The final implementation must include unit tests or common test suites. If
108 | additional tests, such as integration tests or benchmarks, need to be done
109 | manually, list them here.
110 | 
111 | ## Declined Alternatives
112 | 
113 | No prior alternatives discussed.
114 | 


--------------------------------------------------------------------------------
/implemented/0023-assets/trace-export-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/implemented/0023-assets/trace-export-example.png


--------------------------------------------------------------------------------
/implemented/0023-opentelemetry-trace-integration.md:
--------------------------------------------------------------------------------
  1 | # OpenTelemetry Traces Integration
  2 | 
  3 | ## Changelog
  4 | - 2023-09-28: Initial draft
  5 | - 2023-10-04:
  6 |   - Apply review remarks (define trace spans for the first iterations)
  7 |   - Add description of `emqx_external_trace` behaviour
  8 | - 2023-12-26:
  9 |   - Update the document to match the actual implementation and move to implemented
 10 | 
 11 | ## Abstract
 12 | 
 13 | This document describes the design of the EMQX OpenTelemetry traces integration.
 14 | 
 15 | More details about related components, concepts and conventions can be found in the following resources:
 16 | 
 17 | - [OpenTelemetry Erlang lib documentation](https://opentelemetry.io/docs/instrumentation/erlang/)
 18 | - [OpenTelemetry main trace components overview](https://opentelemetry.io/docs/concepts/signals/traces)
 19 | - [MQTT trace context specification](https://w3c.github.io/trace-context-mqtt/)
 20 | - [Trace context header specification](https://www.w3.org/TR/trace-context-1/#tracestate-header)
 21 | - [OpenTelemetry messaging system semantic conventions](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md)
 22 | - [Draft PR](https://github.com/emqx/emqx/pull/11696)
 23 | 
 24 | ## Motivation
 25 | 
 26 | OpenTelemetry distributed trace integration is part of the EMQX product roadmap.
 27 | 
 28 | ## Design
 29 | 
 30 | ### Core concepts and tracing scope
 31 | 
 32 | The core traceable entity for EMQX is a message. It means that one trace should be associated with one message.
 33 | 
 34 | For example, a single HTTP request is a common traceable entity for an HTTP server. An HTTP server instrumented with OpenTelemetry receives an HTTP request, extracts the trace context from headers (e.g., `traceparent`, `tracestate` headers) and traces any processing steps (spans) up to sending an HTTP response back to the client, associating all the spans with the same trace ID (1 request - 1 trace ID).
 35 | After receiving the response, the HTTP client may proceed with subsequent operations, tracing and linking them to the same trace ID.
 36 | 
 37 | Somewhat analogously, EMQX instrumented with OpenTelemetry is expected to receive a published message, extract the trace context (e.g., `traceparent`, `tracestate` User-Properties), and trace some or all processing steps under the same trace ID.
 38 | The producer and consumer of the message may go on to trace any subsequent operations, relating them to the same trace ID.
 39 | 
 40 | These traced steps (or spans) should include the following (in the first iteration):
 41 | 
 42 | - Process a published message (traced by a node that received a published message).
 43 |   This span starts when PUBLISH packet is received and parsed by a connection process and ends when the message is dispatched to local subscribers and/or forwarded to other nodes (forwarding is async by default).
 44 | - Send a published message to a subscriber (traced by all nodes that have matched subscribers).
 45 |   This span is traced by each connection process (so there will be one span per subscriber). It starts when a `deliver` message is received by a connection controlling process and ends when the outgoing packet is serialized and sent to the socket port.
 46 | 
 47 | NOTE: the above list may be extended/changed in the next iterations.
 48 | 
 49 | ![An actual EMQX trace example as exported by POC implementation](0023-assets/trace-export-example.png)
 50 | 
 51 | Any other processing steps/events like client connection, authentication, subscription are currently not considered for OpenTelemetry tracing due to the following reasons:
 52 | 
 53 | - these actions are not directly associated with the main traceable entity (message) defined above
 54 | - these actions do not seem well suited for distributed tracing; they can probably be traced only as internal EMQX events
 55 | 
 56 | ### Implementation details
 57 | 
 58 | The Erlang OpenTelemetry lib relies heavily on propagating the trace context by means of the process dictionary.
 59 | This works fine when function calls are traced within the context of the same process, and needs little effort when the context is to be propagated to a newly spawned process.
 60 | 
 61 | However, this approach is not well suited to the EMQX distributed architecture:
 62 | 
 63 | - correlated spans can be executed on different nodes and/or by different processes
 64 | - a batch of items relating to different traces can be processed together as a single unit of work, e.g., `emqx_connection`, `emqx_channel` modules process deliver messages in batches, where each message would have a unique trace ID if tracing is enabled.
 65 | 
 66 | That’s why the proposed implementation mostly relies on propagating the tracing context as a part of the message itself, which has the following advantages:
 67 | 
 68 | - inter-cluster communication doesn’t require any changes to support trace context propagation and is backward compatible (trace context is added to a reserved `#emqx_message.extra` field)
 69 | - tracing individual messages processed in batches is possible and doesn’t require any significant changes in the current implementation.
 70 | 
 71 | API and context propagation examples (see the [full implementation](https://github.com/emqx/emqx/pull/11984/files#diff-73384b930f330bcf64fb285a0bbcdce0edd015f6c7598e847da49d46b878ebe4)):
 72 | 
 73 | ```erlang
 74 | 
 75 | put_ctx_to_msg(OtelCtx, Msg = #message{extra = Extra}) when is_map(Extra) ->
 76 |     Msg#message{extra = Extra#{?EMQX_OTEL_CTX => OtelCtx}};
 77 | %% The extra field has not been used previously and defaulted to an empty list, so it's safe to overwrite it
 78 | put_ctx_to_msg(OtelCtx, Msg) when is_record(Msg, message) ->
 79 |     Msg#message{extra = #{?EMQX_OTEL_CTX => OtelCtx}}.
 80 | 
 81 | get_ctx_from_msg(#message{extra = Extra}) ->
 82 |     from_extra(Extra).
 83 | 
 84 | get_ctx_from_packet(#mqtt_packet{variable = #mqtt_packet_publish{properties = #{internal_extra := Extra}}}) ->
 85 |     from_extra(Extra);
 86 | get_ctx_from_packet(_) ->
 87 |     undefined.
 88 | 
 89 | from_extra(#{?EMQX_OTEL_CTX := OtelCtx}) ->
 90 |     OtelCtx;
 91 | from_extra(_) ->
 92 |     undefined.
 93 | ```
 94 | 
 95 | Some drawbacks of the proposed implementation should also be mentioned:
 96 | 
 97 | - internal tracing API (as of now, implemented in `emqx_otel_trace` module) is not decoupled from the rest of the code base: each traceable action (span) is traced by a specific function and all these functions are quite specific. For example, they may extract/propagate the context differently and/or rely on the previous (parent) span. For now, it doesn’t seem feasible to create a generic trace wrapper that can trace an arbitrary function.
 98 | 
 99 | #### emqx_external_trace behaviour
100 | 
101 | Most (currently all) trace spans are expected to be added to the core `emqx` OTP application. However, `emqx` application mustn't depend on `opentelemetry` libs/apps.
102 | Moreover, we already have the `emqx_opentelemetry` OTP application that implements OpenTelemetry metrics, schema, configuration, etc.
103 | In order to keep `emqx` application decoupled from `opentelemetry` specific code, it's proposed to introduce `emqx_external_trace` module in `emqx` application.
104 | The module will include necessary callbacks that an actual trace backend must implement. It will also implement `register_provider/1`, `unregister_provider/1` functions, so that `opentelemetry` backend trace module can register itself as a trace provider.
105 | 
106 | `apps/emqx/src/emqx_external_trace.erl`:
107 | ```erlang
108 | -module(emqx_external_trace).
109 | 
110 | -callback trace_process_publish(Packet, Channel, fun((Packet, Channel) -> Res)) -> Res when
111 |     Packet :: emqx_types:packet(),
112 |     Channel :: emqx_channel:channel(),
113 |     Res :: term().
114 | 
115 | ...
116 | 
117 | -define(PROVIDER, {?MODULE, trace_provider}).
118 | 
119 | -define(with_provider(IfRegistered, IfNotRegistered),
120 |     case persistent_term:get(?PROVIDER, undefined) of
121 |         undefined ->
122 |             IfNotRegistered;
123 |         Provider ->
124 |             Provider:IfRegistered
125 |     end
126 | ).
127 | 
128 | %%--------------------------------------------------------------------
129 | %% provider API
130 | %%--------------------------------------------------------------------
131 | 
132 | -spec register_provider(module()) -> ok | {error, term()}.
133 | register_provider(Module) when is_atom(Module) ->
134 |     case is_valid_provider(Module) of
135 |         true ->
136 |             persistent_term:put(?PROVIDER, Module);
137 |         false ->
138 |             {error, invalid_provider}
139 |     end.
140 | 
141 | -spec unregister_provider(module()) -> ok | {error, term()}.
142 | unregister_provider(Module) ->
143 |     case persistent_term:get(?PROVIDER, undefined) of
144 |         Module ->
145 |             persistent_term:erase(?PROVIDER),
146 |             ok;
147 |         _ ->
148 |             {error, not_registered}
149 |     end.
150 | 
151 | %%--------------------------------------------------------------------
152 | %% trace API
153 | %%--------------------------------------------------------------------
154 | 
155 | -spec trace_process_publish(Packet, Channel, fun((Packet, Channel) -> Res)) -> Res when
156 |     Packet :: emqx_types:packet(),
157 |     Channel :: emqx_channel:channel(),
158 |     Res :: term().
159 | trace_process_publish(Packet, Channel, ProcessFun) ->
160 |     ?with_provider(?FUNCTION_NAME(Packet, Channel, ProcessFun), ProcessFun(Packet, Channel)).
161 | 
162 | ```
163 | 
164 | ### External trace context propagation
165 | 
166 | If EMQX receives trace context in a published message, e.g., a `traceparent`/`tracestate` User-Property for MQTT v5.0, it must be sent unaltered when forwarding the Application Message to a Client to conform with [MQTT specification 3.3.2.3.7](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901116).
167 | 
168 | This also perfectly follows [OpenTelemetry semantics for messaging systems](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#context-propagation):
169 | 
170 | > Messaging systems themselves may trace messages as the messages travels from producers to consumers. Such tracing would cover the transport layer but would not help in correlating producers with consumers. To be able to directly correlate producers with consumers, another context that is propagated with the message is required.
171 | >
172 | > A message creation context allows correlating producers with consumers of a message and model the dependencies between them, regardless of the underlying messaging transport mechanism and its instrumentation.
173 | >
174 | > The message creation context is created by the producer and should be propagated to the consumer(s).
175 | >
176 | > A producer SHOULD attach a message creation context to each message. If possible, the message creation context SHOULD be attached in such a way that it cannot be changed by intermediaries.
177 | 
178 | In fact, EMQX is capable of participating in distributed trace out of the box (without OpenTelemetry instrumentation), simply because it implements the above MQTT specification requirement and propagates User-Properties from a publisher to a subscriber.
179 | 
180 | However, if no trace context is received but a message should still be traced, one of the following options should be chosen:
181 | 
182 | - create internal trace context and trace only internal EMQX events and do not propagate the context to receivers and/or external data systems (if bridges are set up)
183 | - create internal trace context and propagate it to receivers and/or external data systems.
184 | 
185 | The option shall be configurable and default to not propagating internally created trace context (controlled via the `opentelemetry.traces.filter.trace_all` configuration parameter).
186 | 
187 | ### Attributes
188 | 
189 | OpenTelemetry defines [some conventions](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#messaging-attributes) (status: Experimental at the time of writing this document).
190 | 
191 | The attributes are grouped under several name-spaces:
192 | 
193 | - `messaging.*`
194 | - `network.*`
195 | - `server.*`
196 | 
197 | The implementation shall follow these conventions, but as of now, only a small subset of the attributes is added.
198 | The attributes can be extended in future upon request.
199 | 
200 | ### Sampling
201 | 
202 | OpenTelemetry sampling is described in great depth in the [official documentation](https://opentelemetry.io/docs/concepts/sampling).
203 | 
204 | Erlang opentelemetry lib implements only head sampling. Head sampling implies that a sampling decision is made as early as possible, e.g., by following a configured percentage of traces to sample (100% by default). A decision to sample or drop a span or trace is not made by inspecting the trace as a whole.
205 | 
206 | A sampling rate option should be added to the EMQX configuration.
207 | 
208 | Tail sampling that makes a sampling decision after all the spans are done would need to be implemented by extending opentelemetry lib.
209 | 
210 | Examples of tail sampling capabilities:
211 | 
212 | - sample traces based on their latency (e.g. sample only traces that take more than 5ms)
213 | - sample traces only if they contain an error, a specific event or attribute value
214 | 
215 | The first iteration of EMQX OpenTelemetry integration doesn't implement any sampling. This feature can be considered for development in the next EMQX releases.
216 | 
217 | ### Filtering
218 | 
219 | The goal of filtering is similar to that of sampling: to narrow down the number of traces.
220 | However, filtering is considered an EMQX/MQTT-specific extension that doesn't necessarily follow OpenTelemetry sampling concepts.
221 | 
222 | NOTE: filters can be implemented using the `otel_sampler` behavior, but doing so doesn't seem to have any advantages.
223 | It is suggested to implement configurable filtering rules, so that a user can control which messages should be traced. It must be possible to leave filtering rules blank, so that all the incoming messages are traced (if tracing itself is enabled).
224 | 
225 | The filters may be similar to ones used in EMQX tracing:
226 | 
227 | - Client ID
228 | - Topic
229 | - IP address
230 | 
231 | The filtering rules should probably not be too complex, to minimize the performance impact.
232 | 
233 | The first iteration of EMQX OpenTelemetry integration defines only one boolean filter: `trace_all`. If it is enabled, all published messages are traced, and a new trace ID is generated if it can't be extracted from the message.
234 | Otherwise, only messages published with trace context are traced.
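
The `trace_all` behavior described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `trace_decision` is hypothetical, the trace context is assumed to arrive as a W3C `traceparent` User-Property, and the real implementation lives in Erlang.

```python
import secrets


def trace_decision(user_properties: dict, trace_all: bool):
    """Return (should_trace, trace_id, internally_created) for a published message."""
    traceparent = user_properties.get("traceparent")
    if traceparent is not None:
        # W3C traceparent layout: version-traceid-parentid-flags
        parts = traceparent.split("-")
        if len(parts) == 4 and len(parts[1]) == 32:
            # Context supplied by the publisher: always traced.
            return True, parts[1], False
    if trace_all:
        # No usable context: generate a new 128-bit trace ID.
        return True, secrets.token_hex(16), True
    # No context and trace_all disabled: the message is not traced.
    return False, None, False
```

The third element of the result marks internally created contexts, which is the information needed to decide whether the context may be propagated to receivers.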
235 | 
236 | ## Configuration Changes
237 | 
238 | The existing EMQX OpenTelemetry schema (defined in emqx_otel_schema module) must be extended to include trace specific configuration.
239 | 
240 | Current HOCON config example:
241 | ```
242 | opentelemetry {
243 |   enable = true
244 |   exporter {endpoint = "http://172.18.0.2:4317", interval = 10s}
245 | }
246 | ```
247 | 
248 | Suggested HOCON config example:
249 | ```
250 | opentelemetry {
251 |   metrics {enable = true} # must be backward-compatible with opentelemetry.enable
252 |   exporter {endpoint = "http://172.18.0.2:4317"}
253 |   trace {
254 |     enable = true
255 |     filter {}
256 |     ...
257 |     }
258 | }
259 | ```
260 | 
261 | ## Backward Compatibility
262 | 
263 | All changes are backward compatible.
264 | 
265 | ## Testing
266 | 
267 | Besides integration/unit tests, it is necessary to run performance tests and profiling to measure the impact of tracing on EMQX performance.
268 | 


--------------------------------------------------------------------------------
/implemented/0026-schema-validation.md:
--------------------------------------------------------------------------------
  1 | # Built-in Schema Validation
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2024-05-22: @thalesmg Rename feature to "schema validation".
  6 | * 2023-10-31: @zmstone Initial draft message validation, filter and transformation.
  7 | * 2022-12-19: @zmstone Limit the scope to filter and validation.
  8 | 
  9 | ## Abstract
 10 | 
 11 | A new feature for EMQX v5 to validate messages.
 12 | 
 13 | ## Motivation
 14 | 
 15 | ### Schema Validation
 16 | 
 17 | Technically, by using EMQX rules engine, it is possible to validate the incoming
 18 | MQTT messages before sending the message off to data bridges.
 19 | 
 20 | For instance, one can write a rules engine SQL statement like the one below to
 21 | drop messages which are not Avro-encoded or whose 'val' field is not greater than 10.
 22 | 
 23 | ```
 24 | SELECT schema_decode(my_avro_schema, payload) as decoded from t/# where decoded.val > 10
 25 | ```
 26 | 
 27 | However, rules are currently only used in data bridges; they do not interfere with MQTT message pub/sub.
 28 | For example, a client subscribed to `t/#` would still receive the Avro-encoded message.
 29 | 
 30 | If we also want to stop the message from being published, a message republish action would be needed,
 31 | and the publisher and subscriber would likely be forced to use different topics,
 32 | e.g. the publisher publishes to `t1/...` and the republish action sends it to `t2/...` for subscribers to receive.
 33 | 
 34 | This can work, but it is not easy to use.
 35 | 
 36 | ### Message filter
 37 | 
 38 | Message filtering can be achieved by configuring schema validation rule to drop invalid messages.
 39 | 
 40 | ## Design
 41 | 
 42 | This data validation step should happen after authorization and before the rules engine.
 43 | It can probably make use of the 'message.publish' hook with a priority which
 44 | positions the callback after authorization, but before data bridges.
 45 | 
 46 | ### High level design
 47 | 
 48 | * Make use of the rules engine SQL, but without the `FROM` clause as it's always the MQTT message being the input.
 49 | 
 50 | Each data validation rule should have a unique ID. Each rule should also have an input topic-filter name which
 51 | can be used to build a topic index for quick look-up.
 52 | 
 53 | The common parts can be described as hocon config below:
 54 | 
 55 | ### Schema Validation
 56 | 
 57 | ```
 58 | validations = [
 59 |   {
 60 |     name = validation1
 61 |     tags = []
 62 |     description = ""
 63 |     enable = true # or 'false'
 64 |     type = validation
 65 |     description = "drop message if it is not compatible with my avro schema or if payload.value is less than 0"
 66 |     topics = "t/#" # or topics = ["t/1", "t/2"]
 67 |     strategy = any_pass # or all_pass
 68 |     failure_action = disconnect # (disconnect also implies 'drop') or 'drop' to only drop the message
 69 |     log_failure_at = none # debug, notice, info, warning, error
 70 |     checks = [
 71 |         {
 72 |             # message payload to be validated against JSON schema
 73 |             type = json # can also be 'avro' or 'protobuf'
 74 |             schema = "my-json-schema-registration-id"
 75 |         }
 76 |         {
 77 |             # A SQL statement that evaluates to an empty result indicates validation failure
 78 |             # in this example, if payload.value is less than 0, the message is considered invalid
 79 |             type = sql
 80 |             sql = "SELECT * WHERE payload.value >= 0"
 81 |         }
 82 |     ]
 83 |   }
 84 | ]
 85 | ```
 86 | 
 87 | If more than one validation matches a message, all matching validations should be executed
 88 | in the configured order.
 89 | For example, if one is configured with `topics = "t/#"` and another with `topics = "t/1"`,
 90 | when a message is published to `t/1`, both validations should be triggered.
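
The matching and ordering rules above can be sketched in Python as follows. This is a simplified illustration, not the actual implementation: the names `topic_matches` and `run_validations` are hypothetical, checks are modelled as plain predicates, `$`-prefixed topics are not treated specially, and stopping at the first failed validation is an assumption.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Match an MQTT topic against a filter with '+' and '#' wildcards."""
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, level in enumerate(f_levels):
        if level == "#":
            return True  # multi-level wildcard matches the remainder
        if i >= len(t_levels):
            return False
        if level != "+" and level != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


def run_validations(validations, topic, payload):
    """Run matching validations in configured order.

    Returns (name, failure_action) of the first failed validation, or None.
    """
    for v in validations:
        if not v.get("enable", True):
            continue
        topics = v["topics"]
        if isinstance(topics, str):
            topics = [topics]
        if not any(topic_matches(f, topic) for f in topics):
            continue
        results = [check(payload) for check in v["checks"]]
        combine = any if v["strategy"] == "any_pass" else all
        if not combine(results):
            return v["name"], v["failure_action"]
    return None
```

With one validation on `t/#` and another on `t/1`, a message published to `t/1` is checked by both, in the order they appear in the list.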
 91 | 
 92 | ## Configuration Changes
 93 | 
 94 | A new config root named `message_validation` is to be added.
 95 | 
 96 | ```
 97 | message_validation {
 98 |     validations = [
 99 |         ...
100 |     ]
101 | }
102 | ```
103 | 
104 | ## APIs
105 | 
106 | - GET /schema_validations
107 |   To list all the schema validations
108 | 
109 | - GET /schema_validations?topic=t/#&schema_name=jsonsch1&schema_type=json
110 |   Fetch validations based on filter
111 | 
112 | - PUT /schema_validations
113 |   To update a validation
114 | 
115 | - POST /schema_validations
116 |   To create a new validation
117 | 
118 | ## Observability
119 | 
120 | - There should be metrics created for each processor name.
121 | - OpenTelemetry tracing context in message properties should be preserved.
122 | - Client disconnect and message drop events should be traceable in the EMQX built-in tracing.
123 | - Emit a new event e.g. `schema.validation_failure`.
124 |   This allows users to handle the validation failures in Rule-Engine.
125 |   For instance, publish the message to a different topic.
126 | 
127 | ## Backwards Compatibility
128 | 
129 | New feature.
130 | 
131 | ## Document Changes
132 | 
133 | - Docs should be created to guide users to use the new feature.
134 | - New APIs in swagger spec
135 | - New config schema for configuration manual doc
136 | 
137 | ## Testing Suggestions
138 | 
139 | A significant part of this feature is purely functional, which makes it well suited to unit tests.
140 | 
141 | Integration tests should be more focused on the management APIs.
142 | 
143 | ## Declined Alternatives
144 | 
145 | - Extend authz to support SQL.
146 |   This was declined because authz's principal is MQTT topics; we do not want to introduce data to it.
147 | - Add new actions to rules for 'deny' and 'disconnect' actions.
148 |   This was declined because
149 |   - Rules lack ordering, while validation and transformation are to be chained.
150 |   - We might implement it as a part of the rule engine internally, but at least from the user's perspective, it should be an independent feature (not like an extension of rules)
151 | 


--------------------------------------------------------------------------------
/implemented/0027-message-transform.md:
--------------------------------------------------------------------------------
  1 | # Built-in Message Transformation
  2 | 
  3 | ## Changelog
  4 | 
  5 | * 2024-08-13: @zmstone Update to reflect the actual implementation.
  6 | * 2023-10-31: @zmstone Initial draft for message validation, filter and transformation (as a part of EIP-0024).
  7 | * 2022-12-19: @zmstone Limit the scope to message filter and transformation.
  8 | 
  9 | ## Abstract
 10 | 
 11 | An extension of EMQX Enterprise to support action-less message filter and transformation.
 12 | 
 13 | ## Motivation
 14 | 
 15 | 
 16 | Currently, if one wants to transform a message before the message is published, the only option is to
 17 | make use of the Rule Engine's republish action.
 18 | 
 19 | This requires the clients to publish to one topic, and subscribe to another.
 20 | 
 21 | ## Design
 22 | 
 23 | Each transformation consists of
 24 | 
 25 | - A set of topic match patterns
 26 | - A list of operations for data mutation
 27 | - Payload decode/encode rules
 28 | - Error handling strategy
 29 | 
 30 | The transformation is done in the steps below for each message.
 31 | 
 32 | - Match topic (filter). Only those messages of topic matching the configured topic (filter) are processed.
 33 | - Payload decode. When the transformation is about message payload, the payload has to be decoded first. e.g. from JSON, or avro.
 34 | - Mutation. Evaluate the expression to mutate the input data.
 35 | - Payload encode. When the transformation is about message payload, the payload has to be encoded back to the desired format of the subscribers.
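
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: the function name `transform` is hypothetical, the payload is assumed to be JSON, topic matching is passed in as a predicate, and operations are modelled as plain functions rather than Variform expressions.

```python
import json


def transform(message: dict, matches_topic, operations):
    """Apply (key, function) operations to a message dict.

    The message is expected to carry at least 'topic' and 'payload' (bytes).
    """
    # 1. Match topic: messages that do not match pass through untouched.
    if not matches_topic(message["topic"]):
        return message
    # 2. Payload decode (JSON here; the proposal also mentions Avro).
    ctx = dict(message, payload=json.loads(message["payload"]))
    # 3. Mutation: each operation computes a new value for one key
    #    from the current context, in the configured order.
    for key, fn in operations:
        ctx[key] = fn(ctx)
    # 4. Payload encode back to the format expected by subscribers.
    ctx["payload"] = json.dumps(ctx["payload"]).encode()
    return ctx
```

For example, an operation `("topic", lambda c: c["clientid"] + "/" + c["topic"])` prepends the client ID to the publishing topic.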
 36 | 
 37 | We pre-define a set of keys (message attributes) which can be the subjects of data mutation.
 38 | 
 39 | - Topic
 40 | - Payload
 41 | - QoS
 42 | - Retain flag
 43 | - User properties
 44 | 
 45 | Each transformation is a [Variform expression](https://docs.emqx.com/en/emqx/latest/configuration/configuration.html#variform-expressions)
 46 | which can be used to perform string operations.
 47 | 
 48 | ## Configuration Changes
 49 | 
 50 | Add a `message_transformation` config root.
 51 | 
 52 | ```
 53 | message_transformation {
 54 |   transformations = [
 55 |     {
 56 |       name = trans1
 57 |       description = ""
 58 |       failure_action = ignore
 59 |       log_failure {level = warning}
 60 |       operations = [
 61 |         {
 62 |           key = topic
 63 |           ## prepend client ID to the publishing topic
 64 |           value = "concat([clientid,'/',topic])"
 65 |         }
 66 |       ]
 67 |       payload_decoder {type = none}
 68 |       payload_encoder {type = none}
 69 |       topics = [
 70 |         "#"
 71 |       ]
 72 |     }
 73 |   ]
 74 | }
 75 | ```
 76 | 
 77 | ### Expression Examples
 78 | 
 79 | - Add client ID as the prefix of publishing topics.
 80 | 
 81 | ```
 82 | {
 83 |     key = topic
 84 |     value = "concat([clientid,'/',topic])"
 85 | }
 86 | ```
 87 | 
 88 | - Add the client TLS certificate's OU as an MQTT message user property
 89 | 
 90 | ```
 91 | {
 92 |     key = "user_property.ou"
 93 |     # if client_attrs.cert_dn is initialized, extract the OU
 94 |     # otherwise user_property.ou is set as 'none'
 95 |     value = "coalesce(regex_extract(client_attrs.cert_dn,'OU=([a-zA-Z0-9_-]+)'), 'none')"
 96 | }
 97 | ```
 98 | 
 99 | ### Declined Alternatives
100 | 
101 | - Employ rule SQL expressions
102 | - Message multiplication (transform one message to many)
103 | 


--------------------------------------------------------------------------------
/implemented/0030-clientinfo-authentication-rules.md:
--------------------------------------------------------------------------------
 1 | # Authenticate MQTT clients with composable rules against client info
 2 | 
 3 | ## Changelog
 4 | 
 5 | * 2024-06-27: @zmstone Initial draft
 6 | * 2024-09-20: @zmstone Update to reflect the latest design
 7 | 
 8 | ## Abstract
 9 | 
10 | Implement a new feature which can authenticate MQTT clients based on a set of composable rules against client properties and attributes.
11 | 
12 | ## Motivation
13 | 
14 | EMQX has a comprehensive authentication chain, most of whose authenticators require out-of-band requests to an external service such as an HTTP server or a database. However, there are some scenarios where the authentication decision can be made based on the client properties and attributes, such as client ID, username, and TLS certificate.
15 | 
16 | For certain use cases, it is more efficient to authenticate clients based on the client properties and attributes.
17 | Some examples:
18 | 
19 | - Quickly deny clients which have no username.
20 | - Only allow clients with a certain client ID prefix to connect.
21 | - Username prefix must match the OU (Organizational Unit) in the TLS certificate.
22 | - Password is a hash of the client ID and a secret key defined in an environment variable.
23 | 
24 | Such rules can be added to the authentication chain (often to the head of it), to effectively fence off clients that do not meet the criteria. Or used standalone to authenticate clients if the checks are sufficient.
25 | 
26 | ## Design
27 | 
28 | In addition to the current `Password Based`, `JWT` and `SCRAM`, we add a new authentication mechanism called `Client Info`.
29 | The `Client Info` mechanism has no external dependencies, but should have a set of configurable checks.
30 | 
31 | The checks can be composed similarly to the `ACL` rules. Externally, they are formatted as a HOCON array of objects, where each object is a check.
32 | The checks are evaluated in order, and the first check that matches the client info will be used to authenticate the client.
33 | 
34 | Here is an example of the configuration:
35 | 
36 | ```
37 | checks = [
38 |     # Allow clients with username starts with 'super-'
39 |     {
40 |         is_match = "regex_match(username, '^super-.+$')"
41 |         result = allow
42 |     },
43 |     # deny clients with empty username and client ID starts with 'v1-'
44 |     {
45 |         # When is_match is an array, it yields 'true' if all individual checks yield 'true'
46 |         is_match = ["str_eq(username, '')", "str_eq(nth(1,tokens(clientid,'-')), 'v1')"]
47 |         result = deny
48 |     }
49 |     # If all checks are exhausted without an 'allow' or a 'deny' result
50 |     # this authenticator results in `ignore` so to continue to the next authenticator in the chain
51 | ]
52 | ```
53 | 
54 | Each check object has two fields:
55 | 
56 | - `is_match`: One or more boolean expressions. An expression that evaluates to `true` is a match; any other string value is treated as `false`.
57 | - `result`: either `allow`, `deny` or `ignore`.
58 | 
59 | ### Logical Operators
60 | 
61 | There is no explicit logical `AND` or `OR` operator support for checks and match conditions, but the following rules apply:
62 | 
63 | - Since each `check` can yield a `result`, one may consider the elements of the `checks` array to be connected by a logical `||` (`OR`) operator.
64 | - When `is_match` is an array, it yields `true` if all individual checks yield `true`. One may consider the elements of the `is_match` array to be connected by a logical `&&` (`AND`) operator.
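
The evaluation order and the implicit `OR`/`AND` semantics can be sketched in Python as follows. This is only an illustration: the function name `authenticate` is hypothetical, and the `is_match` expressions are modelled as plain predicates over a client info dict rather than Variform expressions.

```python
import re


def authenticate(checks, client_info: dict) -> str:
    """Evaluate checks in order and return the result of the first match."""
    for check in checks:
        conditions = check["is_match"]
        if callable(conditions):
            conditions = [conditions]  # a single expression
        # Array form: every expression must hold (logical AND).
        if all(cond(client_info) for cond in conditions):
            return check["result"]
    # All checks exhausted: let the next authenticator in the chain decide.
    return "ignore"


# The two checks from the configuration example, written as predicates.
checks = [
    {"is_match": lambda c: re.match(r"^super-.+$", c["username"]) is not None,
     "result": "allow"},
    {"is_match": [lambda c: c["username"] == "",
                  lambda c: c["clientid"].split("-")[0] == "v1"],
     "result": "deny"},
]
```

A client with username `super-bob` is allowed by the first check, a client with an empty username and client ID `v1-abc` is denied by the second, and any other client falls through to `ignore`.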
65 | 
66 | ### Functions
67 | 
68 | The `is_match` expressions are Variform expressions, which are also used in other parts of EMQX. They are evaluated in the context of the client info.
69 | 
70 | Find more information about the Variform expressions in the [Variform documentation](https://docs.emqx.com/en/emqx/v5.8/configuration/configuration.html#variform-expressions)
71 | 
72 | ### Predefined Variables
73 | 
74 | - `username`: the username of the client.
75 | - `clientid`: the client ID of the client.
76 | - `peerhost`: the IP address of the client.
77 | - `cert_subject`: the subject of the TLS certificate.
78 | - `cert_common_name`: the Common Name (CN) of the TLS certificate.
79 | - `client_attrs.*`: the client attributes of the client. See more in the [Client Attributes documentation](https://docs.emqx.com/en/emqx/v5.8/client-attributes/client-attributes.html#mqtt-client-attributes)
80 | 
81 | ## Configuration Changes
82 | 
83 | This section should list all the changes to the configuration files (if any).
84 | 
85 | ## Backwards Compatibility
86 | 
87 | This is a new feature and should not affect any existing configurations.
88 | 
89 | ## Document Changes
90 | 
91 | A new section should be added to the authentication documentation to describe the new `Client Info` authentication mechanism.
92 | 
93 | ## Testing Suggestions
94 | 
95 | Test coverage should be added to cover the new `Client Info` authentication mechanism.
96 | 
97 | ## Declined Alternatives
98 | 


--------------------------------------------------------------------------------
/implemented/0031-immutable-config-base-for-cluster.hocon:
--------------------------------------------------------------------------------
 1 | # Immutable config base for cluster.hocon
 2 | 
 3 | ## Changelog
 4 | 
 5 | * 2024-11-13: @zmstone Initial draft
 6 | 
 7 | ## Abstract
 8 | 
 9 | Make EMQX configuration files great again.
10 | 
11 | ## Motivation
12 | 
13 | The config overriding rule since EMQX 5.1 is not easy to understand and manage.
14 | The current (as of 5.8) config layers are as follows (in order of overriding precedence, from high to low):
15 | 
16 | - Environment variables (highest precedence)
17 |   Environment variables which start with `EMQX_` are translated to config keys and sit on top of the override chain.
18 |   Since the environment variables are usually set by the system administrator, they are considered the highest precedence.
19 | - File `etc/emqx.conf` (medium precedence)
20 |   This file holds all the manually crafted configurations.
21 |   The content of this file is not mutable by the EMQX software (in fact it can be a read-only file).
22 |   Since this is manually crafted, it is considered a higher precedence than `cluster.hocon`.
23 | - File `data/configs/cluster.hocon` (low precedence)
24 |   This config file is mostly hidden from the users. It holds the config changes made from the UI, API or CLI.
25 | 
26 | One may argue that this is not the perfect order of overriding, but that's the fixed order in EMQX since 5.1 and we cannot change it at will. i.e. This proposal is not about changing the overriding order.
27 | 
28 | The ordering itself is not the problem. The problem is the lack of support for manually crafted config files under `cluster.hocon`.
29 | Since `emqx.conf` is the only option, people started putting their custom configurations in `etc/emqx.conf`, while also wanting to use the UI/API/CLI to change some of the configurations.
30 | As a result, changes made from the UI/API/CLI override the existing merged config at runtime, but get overridden by `etc/emqx.conf` when the node restarts.
31 | 
32 | This is a particular problem for the EMQX Kubernetes operator, because it encourages users to put all configs in one YAML block, which gets mapped to `etc/emqx.conf` when bootstrapping the deployment.
33 | Changes made to the resource are then applied by calling the API, which only temporarily changes the configurations until the pod restarts.
34 | 
35 | ## Design
36 | 
37 | Add a conventional layer named (`base.hocon`) to the config overriding chain, which sits under `cluster.hocon`.
38 | So the new overriding chain will be:
39 | 
40 | - Environment variables (highest precedence)
41 | - File `etc/emqx.conf` (medium precedence)
42 | - File `data/configs/cluster.hocon` (low precedence)
43 | - File `etc/base.hocon` (lowest precedence)
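
The effect of the new chain can be sketched in Python as a sequence of deep merges, applied from the lowest layer to the highest. This is a simplified model under stated assumptions: the names `deep_merge` and `effective_config` are hypothetical, each layer is modelled as an already-parsed nested dict, and HOCON-specific merge rules (such as array handling) are not reproduced.

```python
def deep_merge(low: dict, high: dict) -> dict:
    """Merge two config maps; values from 'high' win on conflict."""
    merged = dict(low)
    for key, value in high.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_config(base: dict, cluster: dict, emqx_conf: dict, env: dict) -> dict:
    """Apply layers from lowest to highest precedence."""
    result = {}
    for layer in (base, cluster, emqx_conf, env):
        result = deep_merge(result, layer)
    return result
```

In this model, a key set only in `base.hocon` stays in effect until it is changed from the UI/API/CLI (which writes to `cluster.hocon`), and such a change survives a restart because `base.hocon` sits below `cluster.hocon`.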
44 | 
45 | ## Configuration Changes
46 | 
47 | Add a new line to `cluster.hocon` at the top of it:
48 | 
49 | ```
50 | include "etc/base.hocon"
51 | ```
52 | 
53 | The path to base.hocon depends on the packaging flavor of EMQX.
54 | 
55 | -  docker: `/opt/emqx/etc/base.hocon`
56 | -  RPM/DEB: `/etc/emqx/base.hocon`
57 | 
58 | ## Backwards Compatibility
59 | 
60 | Since hocon's include directive is used, the existing configurations will not be affected.
61 | Also, include is silently ignored if the file does not exist, so the new file can be added without breaking the existing installations.
62 | 
63 | ## Document Changes
64 | 
65 | Configuration documentation should be updated to reflect the new file.
66 | 
67 | ## Testing Suggestions
68 | 
69 | Since we risk duplicating the configurations in base.hocon and cluster.hocon, we should cover the test scenarios where the contents of base.hocon and cluster.hocon are very large and have many overlapping configurations.
70 | 
71 | - Functionality wise, the configuration should be verified to be merged correctly.
72 | - Performance wise, large amount of configs should not affect the startup time of the node.
73 | 
74 | ## Declined Alternatives
75 | 


--------------------------------------------------------------------------------
/rejected/0005-stateless-brokers.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/emqx/eip/2202950d8518128ee29db82e2adb549f478093d6/rejected/0005-stateless-brokers.png


--------------------------------------------------------------------------------
/rejected/0006-plugins-on-dashboard.md:
--------------------------------------------------------------------------------
 1 | # Stateless Brokers in EMQ X v5.0
 2 | 
 3 | ```
 4 | Author: Shawn <liuxy@emqx.io>
 5 | Status: Draft
 6 | Type: Design
 7 | Created: 2020-10-27
 8 | EMQ X Version: 5.0
 9 | Post-History:
10 | ```
11 | 
12 | ## Abstract
13 | 
14 | ## Motivation
15 | 
16 | ## Rationale
17 | 
18 | ## Architecture
19 | 
20 | Please attach the architecture diagrams.
21 | 
22 | ## Design
23 | 
24 | ## Discussion
25 | 
26 | ## References
27 | 
28 | 


--------------------------------------------------------------------------------
/rejected/0008-community-plugins.md:
--------------------------------------------------------------------------------
 1 | ## Community Plugins
 2 | 
 3 | ```
 4 | Author: Yudai Kiyofuji <yudai.kiyofuji@emqx.io>
 5 | Status: Draft
 6 | First Created: 2021-02-21
 7 | EMQ X Version: 4.3
 8 | ```
 9 | 
10 | ## Change log
11 | 
12 | * 2021-02-23: @z8674558 Initial draft
13 | * 2021-02-25: @z8674558 Add proposal on elixir plugins
14 | 
15 | ## Abstract
16 | 
17 | This proposal suggests ways to encourage plugin development by the community.
18 | 
19 | ## Motivation
20 | 
21 | Allowing people to develop their own plugins is a good way for EMQX to gain popularity.
22 | To achieve this, it is nice to `approve` some community plugins and let people use them.
23 | 
24 | ## Design
25 | 
26 | At the root of emqx.git, we add the file `community-plugins`, 
27 | where we list approved community plugins.
28 | (the advantage of having it in a separate file is to keep minimum lines of such in `rebar.config.erl`
29 | which may otherwise cause more lines of conflicts when porting changes to enterprise.)
30 | 
31 | ```erlang
32 | {erlang_plugins, [{foo_plugin, {git, "https://github.com", {tag, "1.0.0"}}}]}.
33 | ```
34 | 
35 | And when a user would like to use one of them,
36 | they can do so by setting the env variable `EMQX_COMMUNITY_PLUGINS=foo_plugin`.
37 | Then `rebar.config.erl` reads the file and the environment variable to include the specified ones.
38 | 
39 | ### Elixir Plugins
40 | 
41 | Considering the recent popularity of Elixir, we have decided to continue supporting Elixir plugins in v4.3.
42 | At the end of `community-plugins` file, there should be
43 | 
44 | ```erlang
45 | {elixir_plugins, [{bar_plugin, {git, "https://github.com", {tag, "1.0.0"}}}]}.
46 | ```
47 | 
48 | ## Configuration Changes
49 | 
50 | 
51 | 
52 | ## Backwards Compatibility
53 | 
54 | 
55 | ## Document Changes
56 | 
57 | In `emqx-doc`, there should be detailed information
58 | on how to use third-party plugins.
59 | 
60 | Add detailed information on how one can develop their own plugins
61 | in `emqx-plugin-template` and `emqx-elixir-plugin`.
62 | 
63 | ## Testing Suggestions
64 | 
65 | Suppose we have approved a third-party plugin `emqx-some-plugin`.
66 | Since we have an umbrella project in v4.3,
67 | the developers of `emqx-some-plugin` are going to run the tests
68 | by placing it in emqx.git, for example in the `_checkouts` dir.
69 | 
70 | In `emqx-some-plugin`'s CI, they have to fetch emqx.git then run the test.
71 | 


--------------------------------------------------------------------------------
/rejected/0009-consul-cluster-discovery.md:
--------------------------------------------------------------------------------
 1 | # Support Consul for cluster discovery
 2 | 
 3 | ```
 4 | Author: <John Roesler> <johnrroesler@gmail.com>
 5 | Status: Draft
 6 | Created: 2021-03-05
 7 | ```
 8 | 
 9 | ## Abstract
10 | 
11 | Cluster discovery should support Consul as an option to store and lookup node details.
12 | 
13 | ## Motivation
14 | 
15 | Consul is another popular key value storage system (similar to etcd). This will add another valuable clustering option to emqx.
16 | 
17 | ## Design
18 | 
19 | Each emqx node would write a key on a given path that would be provided in the configuration and make an API call
20 | to Consul, e.g. `PUT /kv/emqx/<node-name>` - [Consul key create API](https://www.consul.io/api-docs/kv#create-update-key).
21 | 
22 | Each emqx node would then be able to query Consul for other keys on that path, e.g. `GET /kv/emqx/?keys` - 
23 | [Consul get multiple keys docs](https://www.consul.io/api-docs/kv#keys-response)
24 | 
25 | With the node information - clustering would proceed in the standard emqx fashion. 
26 | 
27 | ## Configuration Changes
28 | 
29 | `consul` would be added to the list of clustering options and config values would look like:
30 | 
31 | ```
32 | cluster.discovery = consul
33 | cluster.consul.server = http://127.0.0.1:8500
34 | cluster.consul.prefix = emqcl
35 | ```
36 | 
37 | We may also want to look at adding support for Consul authentication -
38 | [auth docs](https://www.consul.io/api-docs#authentication)
39 | 
40 | ```
41 | cluster.consul.token = <token value>
42 | ```
43 | 
44 | ## Backwards Compatibility
45 | 
46 | There should be no issue with backwards compatibility as this is a new feature that would
47 | not impact the performance of any previous features.
48 | 
49 | ## Document Changes
50 | 
51 | Documentation will need to be updated to include the new clustering option.
52 | 
53 | ## Testing Suggestions
54 | 
55 | Could be tested the same as etcd.
56 | 
57 | ## Declined Alternatives
58 | 
59 | None
60 | 


--------------------------------------------------------------------------------