├── LICENSE
├── README.md
├── TODO
├── authorization_and_authentication.md
├── backing_queue.md
├── basic_publish.md
├── channels.md
├── credit_flow.md
├── deliver_to_queues.md
├── exchange_decorators.md
├── interceptors.md
├── internal_events.md
├── mandatory_message_handling.md
├── metrics_and_management_plugin.md
├── mirroring.md
├── networking_and_connections.md
├── publisher_confirms.md
├── queue_decorators.md
├── queues_and_message_store.md
├── rabbit_boot_process.md
├── transactions_in_exchange_modules.md
├── uninterupted_cluster_upgrade.md
└── variable_queue.md

/LICENSE:
--------------------------------------------------------------------------------
1 | THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS 2 | CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS 3 | PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE 4 | WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS 5 | PROHIBITED. 6 | 7 | BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND 8 | AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS 9 | LICENSE MAY BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU 10 | THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH 11 | TERMS AND CONDITIONS. 12 | 13 | 1. Definitions 14 | 15 | "Adaptation" means a work based upon the Work, or upon the Work and 16 | other pre-existing works, such as a translation, adaptation, 17 | derivative work, arrangement of music or other alterations of a 18 | literary or artistic work, or phonogram or performance and includes 19 | cinematographic adaptations or any other form in which the Work may be 20 | recast, transformed, or adapted including in any form recognizably 21 | derived from the original, except that a work that constitutes a 22 | Collection will not be considered an Adaptation for the purpose of 23 | this License. For the avoidance of doubt, where the Work is a musical 24 | work, performance or phonogram, the synchronization of the Work in 25 | timed-relation with a moving image ("synching") will be considered an 26 | Adaptation for the purpose of this License. "Collection" means a 27 | collection of literary or artistic works, such as encyclopedias and 28 | anthologies, or performances, phonograms or broadcasts, or other works 29 | or subject matter other than works listed in Section 1(f) below, 30 | which, by reason of the selection and arrangement of their contents, 31 | constitute intellectual creations, in which the Work is included in 32 | its entirety in unmodified form along with one or more other 33 | contributions, each constituting separate and independent works in 34 | themselves, which together are assembled into a collective whole. A 35 | work that constitutes a Collection will not be considered an 36 | Adaptation (as defined below) for the purposes of this License. 37 | "Creative Commons Compatible License" means a license that is listed 38 | at https://creativecommons.org/compatiblelicenses that has been 39 | approved by Creative Commons as being essentially equivalent to this 40 | License, including, at a minimum, because that license: (i) contains 41 | terms that have the same purpose, meaning and effect as the License 42 | Elements of this License; and, (ii) explicitly permits the relicensing 43 | of adaptations of works made available under that license under this 44 | License or a Creative Commons jurisdiction license with the same 45 | License Elements as this License.
"Distribute" means to make 46 | available to the public the original and copies of the Work or 47 | Adaptation, as appropriate, through sale or other transfer of 48 | ownership. "License Elements" means the following high-level license 49 | attributes as selected by Licensor and indicated in the title of this 50 | License: Attribution, ShareAlike. "Licensor" means the individual, 51 | individuals, entity or entities that offer(s) the Work under the terms 52 | of this License. "Original Author" means, in the case of a literary 53 | or artistic work, the individual, individuals, entity or entities who 54 | created the Work or if no individual or entity can be identified, the 55 | publisher; and in addition (i) in the case of a performance the 56 | actors, singers, musicians, dancers, and other persons who act, sing, 57 | deliver, declaim, play in, interpret or otherwise perform literary or 58 | artistic works or expressions of folklore; (ii) in the case of a 59 | phonogram the producer being the person or legal entity who first 60 | fixes the sounds of a performance or other sounds; and, (iii) in the 61 | case of broadcasts, the organization that transmits the broadcast. 62 | "Work" means the literary and/or artistic work offered under the terms 63 | of this License including without limitation any production in the 64 | literary, scientific and artistic domain, whatever may be the mode or 65 | form of its expression including digital form, such as a book, 66 | pamphlet and other writing; a lecture, address, sermon or other work 67 | of the same nature; a dramatic or dramatico-musical work; a 68 | choreographic work or entertainment in dumb show; a musical 69 | composition with or without words; a cinematographic work to which are 70 | assimilated works expressed by a process analogous to cinematography; 71 | a work of drawing, painting, architecture, sculpture, engraving or 72 | lithography; a photographic work to which are assimilated works 73 | expressed by a process analogous to photography; a work of applied 74 | art; an illustration, map, plan, sketch or three-dimensional work 75 | relative to geography, topography, architecture or science; a 76 | performance; a broadcast; a phonogram; a compilation of data to the 77 | extent it is protected as a copyrightable work; or a work performed by 78 | a variety or circus performer to the extent it is not otherwise 79 | considered a literary or artistic work. "You" means an individual or 80 | entity exercising rights under this License who has not previously 81 | violated the terms of this License with respect to the Work, or who 82 | has received express permission from the Licensor to exercise rights 83 | under this License despite a previous violation. "Publicly Perform" 84 | means to perform public recitations of the Work and to communicate to 85 | the public those public recitations, by any means or process, 86 | including by wire or wireless means or public digital performances; to 87 | make available to the public Works in such a way that members of the 88 | public may access these Works from a place and at a place individually 89 | chosen by them; to perform the Work to the public by any means or 90 | process and the communication to the public of the performances of the 91 | Work, including by public digital performance; to broadcast and 92 | rebroadcast the Work by any means including signs, sounds or images. 
93 | "Reproduce" means to make copies of the Work by any means including 94 | without limitation by sound or visual recordings and the right of 95 | fixation and reproducing fixations of the Work, including storage of a 96 | protected performance or phonogram in digital form or other electronic 97 | medium. 2. Fair Dealing Rights. Nothing in this License is intended 98 | to reduce, limit, or restrict any uses free from copyright or rights 99 | arising from limitations or exceptions that are provided for in 100 | connection with the copyright protection under copyright law or other 101 | applicable laws. 102 | 103 | 3. License Grant. Subject to the terms and conditions of this License, 104 | Licensor hereby grants You a worldwide, royalty-free, non-exclusive, 105 | perpetual (for the duration of the applicable copyright) license to 106 | exercise the rights in the Work as stated below: 107 | 108 | to Reproduce the Work, to incorporate the Work into one or more 109 | Collections, and to Reproduce the Work as incorporated in the 110 | Collections; to create and Reproduce Adaptations provided that any 111 | such Adaptation, including any translation in any medium, takes 112 | reasonable steps to clearly label, demarcate or otherwise identify 113 | that changes were made to the original Work. For example, a 114 | translation could be marked "The original work was translated from 115 | English to Spanish," or a modification could indicate "The original 116 | work has been modified."; to Distribute and Publicly Perform the Work 117 | including as incorporated in Collections; and, to Distribute and 118 | Publicly Perform Adaptations. For the avoidance of doubt: 119 | 120 | Non-waivable Compulsory License Schemes. In those jurisdictions in 121 | which the right to collect royalties through any statutory or 122 | compulsory licensing scheme cannot be waived, the Licensor reserves 123 | the exclusive right to collect such royalties for any exercise by You 124 | of the rights granted under this License; Waivable Compulsory License 125 | Schemes. In those jurisdictions in which the right to collect 126 | royalties through any statutory or compulsory licensing scheme can be 127 | waived, the Licensor waives the exclusive right to collect such 128 | royalties for any exercise by You of the rights granted under this 129 | License; and, Voluntary License Schemes. The Licensor waives the right 130 | to collect royalties, whether individually or, in the event that the 131 | Licensor is a member of a collecting society that administers 132 | voluntary licensing schemes, via that society, from any exercise by 133 | You of the rights granted under this License. The above rights may be 134 | exercised in all media and formats whether now known or hereafter 135 | devised. The above rights include the right to make such modifications 136 | as are technically necessary to exercise the rights in other media and 137 | formats. Subject to Section 8(f), all rights not expressly granted by 138 | Licensor are hereby reserved. 139 | 140 | 4. Restrictions. The license granted in Section 3 above is expressly made subject to and limited by the following restrictions: 141 | 142 | You may Distribute or Publicly Perform the Work only under the terms 143 | of this License. You must include a copy of, or the Uniform Resource 144 | Identifier (URI) for, this License with every copy of the Work You 145 | Distribute or Publicly Perform. 
You may not offer or impose any terms 146 | on the Work that restrict the terms of this License or the ability of 147 | the recipient of the Work to exercise the rights granted to that 148 | recipient under the terms of the License. You may not sublicense the 149 | Work. You must keep intact all notices that refer to this License and 150 | to the disclaimer of warranties with every copy of the Work You 151 | Distribute or Publicly Perform. When You Distribute or Publicly 152 | Perform the Work, You may not impose any effective technological 153 | measures on the Work that restrict the ability of a recipient of the 154 | Work from You to exercise the rights granted to that recipient under 155 | the terms of the License. This Section 4(a) applies to the Work as 156 | incorporated in a Collection, but this does not require the Collection 157 | apart from the Work itself to be made subject to the terms of this 158 | License. If You create a Collection, upon notice from any Licensor You 159 | must, to the extent practicable, remove from the Collection any credit 160 | as required by Section 4(c), as requested. If You create an 161 | Adaptation, upon notice from any Licensor You must, to the extent 162 | practicable, remove from the Adaptation any credit as required by 163 | Section 4(c), as requested. You may Distribute or Publicly Perform an 164 | Adaptation only under the terms of: (i) this License; (ii) a later 165 | version of this License with the same License Elements as this 166 | License; (iii) a Creative Commons jurisdiction license (either this or 167 | a later license version) that contains the same License Elements as 168 | this License (e.g., Attribution-ShareAlike 3.0 US)); (iv) a Creative 169 | Commons Compatible License. If you license the Adaptation under one of 170 | the licenses mentioned in (iv), you must comply with the terms of that 171 | license. If you license the Adaptation under the terms of any of the 172 | licenses mentioned in (i), (ii) or (iii) (the "Applicable License"), 173 | you must comply with the terms of the Applicable License generally and 174 | the following provisions: (I) You must include a copy of, or the URI 175 | for, the Applicable License with every copy of each Adaptation You 176 | Distribute or Publicly Perform; (II) You may not offer or impose any 177 | terms on the Adaptation that restrict the terms of the Applicable 178 | License or the ability of the recipient of the Adaptation to exercise 179 | the rights granted to that recipient under the terms of the Applicable 180 | License; (III) You must keep intact all notices that refer to the 181 | Applicable License and to the disclaimer of warranties with every copy 182 | of the Work as included in the Adaptation You Distribute or Publicly 183 | Perform; (IV) when You Distribute or Publicly Perform the Adaptation, 184 | You may not impose any effective technological measures on the 185 | Adaptation that restrict the ability of a recipient of the Adaptation 186 | from You to exercise the rights granted to that recipient under the 187 | terms of the Applicable License. This Section 4(b) applies to the 188 | Adaptation as incorporated in a Collection, but this does not require 189 | the Collection apart from the Adaptation itself to be made subject to 190 | the terms of the Applicable License. 
If You Distribute, or Publicly 191 | Perform the Work or any Adaptations or Collections, You must, unless a 192 | request has been made pursuant to Section 4(a), keep intact all 193 | copyright notices for the Work and provide, reasonable to the medium 194 | or means You are utilizing: (i) the name of the Original Author (or 195 | pseudonym, if applicable) if supplied, and/or if the Original Author 196 | and/or Licensor designate another party or parties (e.g., a sponsor 197 | institute, publishing entity, journal) for attribution ("Attribution 198 | Parties") in Licensor's copyright notice, terms of service or by other 199 | reasonable means, the name of such party or parties; (ii) the title of 200 | the Work if supplied; (iii) to the extent reasonably practicable, the 201 | URI, if any, that Licensor specifies to be associated with the Work, 202 | unless such URI does not refer to the copyright notice or licensing 203 | information for the Work; and (iv) , consistent with Ssection 3(b), in 204 | the case of an Adaptation, a credit identifying the use of the Work in 205 | the Adaptation (e.g., "French translation of the Work by Original 206 | Author," or "Screenplay based on original Work by Original 207 | Author"). The credit required by this Section 4(c) may be implemented 208 | in any reasonable manner; provided, however, that in the case of a 209 | Adaptation or Collection, at a minimum such credit will appear, if a 210 | credit for all contributing authors of the Adaptation or Collection 211 | appears, then as part of these credits and in a manner at least as 212 | prominent as the credits for the other contributing authors. For the 213 | avoidance of doubt, You may only use the credit required by this 214 | Section for the purpose of attribution in the manner set out above 215 | and, by exercising Your rights under this License, You may not 216 | implicitly or explicitly assert or imply any connection with, 217 | sponsorship or endorsement by the Original Author, Licensor and/or 218 | Attribution Parties, as appropriate, of You or Your use of the Work, 219 | without the separate, express prior written permission of the Original 220 | Author, Licensor and/or Attribution Parties. Except as otherwise 221 | agreed in writing by the Licensor or as may be otherwise permitted by 222 | applicable law, if You Reproduce, Distribute or Publicly Perform the 223 | Work either by itself or as part of any Adaptations or Collections, 224 | You must not distort, mutilate, modify or take other derogatory action 225 | in relation to the Work which would be prejudicial to the Original 226 | Author's honor or reputation. Licensor agrees that in those 227 | jurisdictions (e.g. Japan), in which any exercise of the right granted 228 | in Section 3(b) of this License (the right to make Adaptations) would 229 | be deemed to be a distortion, mutilation, modification or other 230 | derogatory action prejudicial to the Original Author's honor and 231 | reputation, the Licensor will waive or not assert, as appropriate, 232 | this Section, to the fullest extent permitted by the applicable 233 | national law, to enable You to reasonably exercise Your right under 234 | Section 3(b) of this License (right to make Adaptations) but not 235 | otherwise. 236 | 237 | 5. 
Representations, Warranties and Disclaimer 238 | 239 | UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, 240 | LICENSOR OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR 241 | WARRANTIES OF ANY KIND CONCERNING THE WORK, EXPRESS, IMPLIED, 242 | STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF 243 | TITLE, MERCHANTIBILITY, FITNESS FOR A PARTICULAR PURPOSE, 244 | NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, 245 | OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT 246 | DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED 247 | WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU. 248 | 249 | 6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY 250 | APPLICABLE LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY 251 | LEGAL THEORY FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR 252 | EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, 253 | EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 254 | 255 | 7. Termination 256 | 257 | This License and the rights granted hereunder will terminate 258 | automatically upon any breach by You of the terms of this 259 | License. Individuals or entities who have received Adaptations or 260 | Collections from You under this License, however, will not have their 261 | licenses terminated provided such individuals or entities remain in 262 | full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 263 | will survive any termination of this License. Subject to the above 264 | terms and conditions, the license granted here is perpetual (for the 265 | duration of the applicable copyright in the Work). Notwithstanding the 266 | above, Licensor reserves the right to release the Work under different 267 | license terms or to stop distributing the Work at any time; provided, 268 | however that any such election will not serve to withdraw this License 269 | (or any other license that has been, or is required to be, granted 270 | under the terms of this License), and this License will continue in 271 | full force and effect unless terminated as stated above. 272 | 273 | 8. Miscellaneous 274 | 275 | Each time You Distribute or Publicly Perform the Work or a Collection, 276 | the Licensor offers to the recipient a license to the Work on the same 277 | terms and conditions as the license granted to You under this License. 278 | Each time You Distribute or Publicly Perform an Adaptation, Licensor 279 | offers to the recipient a license to the original Work on the same 280 | terms and conditions as the license granted to You under this License. 281 | If any provision of this License is invalid or unenforceable under 282 | applicable law, it shall not affect the validity or enforceability of 283 | the remainder of the terms of this License, and without further action 284 | by the parties to this agreement, such provision shall be reformed to 285 | the minimum extent necessary to make such provision valid and 286 | enforceable. No term or provision of this License shall be deemed 287 | waived and no breach consented to unless such waiver or consent shall 288 | be in writing and signed by the party to be charged with such waiver 289 | or consent. This License constitutes the entire agreement between the 290 | parties with respect to the Work licensed here. There are no 291 | understandings, agreements or representations with respect to the Work 292 | not specified here. 
Licensor shall not be bound by any additional 293 | provisions that may appear in any communication from You. This License 294 | may not be modified without the mutual written agreement of the 295 | Licensor and You. The rights granted under, and the subject matter 296 | referenced, in this License were drafted utilizing the terminology of 297 | the Berne Convention for the Protection of Literary and Artistic Works 298 | (as amended on September 28, 1979), the Rome Convention of 1961, the 299 | WIPO Copyright Treaty of 1996, the WIPO Performances and Phonograms 300 | Treaty of 1996 and the Universal Copyright Convention (as revised on 301 | July 24, 1971). These rights and subject matter take effect in the 302 | relevant jurisdiction in which the License terms are sought to be 303 | enforced according to the corresponding provisions of the 304 | implementation of those treaty provisions in the applicable national 305 | law. If the standard suite of rights granted under applicable 306 | copyright law includes additional rights not granted under this 307 | License, such additional rights are deemed to be included in the 308 | License; this License is not intended to restrict the license of any 309 | rights under applicable law. 310 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# RabbitMQ Internals #

This project aims to explain how RabbitMQ works internally. The goal
is to make it easier for newcomers to contribute to the project, and
at the same time to have a common repository of knowledge shared
across the project's contributors.

## Purpose ##

Most interesting modules in RabbitMQ projects have documentation
essays, sometimes quite extensive, at the top. The aim here is not to
duplicate what's there, but to provide the highest-level overview as
to the overall architecture.

## Guides ##

To understand how RabbitMQ's internals work, it is best to follow the
logic of how a message progresses through RabbitMQ as it is handled by
the broker; otherwise, you would end up navigating through many guides
without a clear context of what's going on, or without knowing what to
read next. Therefore we have prepared the following guides to help you
understand how RabbitMQ works:

### Basic Publish Guide ###

Here we follow the life of a message from the moment it is received
from the network until it has been routed by the exchanges. We take a
look at the various processing steps that happen to a message right
until it is delivered to one or perhaps many queues.

[Basic Publish](./basic_publish.md)

### Deliver To Queues Guide ###

After the message has been routed, the broker needs to deliver that
message to the respective queues. Not only does the message have to be
sent to queues; mandatory messages and publisher confirms also need to
be taken into account. The queue also needs to try to deliver the
message to prospective consumers; otherwise the message ends up
queued.

[Deliver To Queues](./deliver_to_queues.md)

### Queues and Message Store

Provides an overview of the Erlang processes that back queues
and how they interact with the message store, message index and so on.

[Queues and Message Store](./queues_and_message_store.md)

### Variable Queue Guide ###

Ultimately, messages end up queued in the
[backing queue](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_backing_queue.erl). From
here they can be retrieved, acked, purged, and so on. The most common
implementation of the backing queue behaviour is the
`rabbit_variable_queue`
[module](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_variable_queue.erl),
explained in the following guide:

[Variable Queue](./variable_queue.md)

### Mandatory Messages and Publisher Confirm Guides ###

As explained in the [Deliver To Queues](./deliver_to_queues.md) guide,
a channel has to handle messages published as mandatory and also take
care of publisher confirms. These processes are explained in the
following guides:

- [Mandatory Message Handling](./mandatory_message_handling.md)
- [Publisher Confirms](./publisher_confirms.md)

### Authentication and Authorization ###

As explained in the [Basic Publish](./basic_publish.md) guide, there
are rules that determine whether a message can be accepted by the
broker from a certain publisher. This is explained in the following
guide:

[Authorization and Authentication Backends](./authorization_and_authentication.md)

### Internal Event Subsystem

In some cases components in a running node communicate via events.
Some events are consumed by other nodes.

[Internal Events](./internal_events.md)

### Management Plugin ###

An architectural overview of the v3.6.7+ version of the management plugin.

[Metrics and Management Plugin](./metrics_and_management_plugin.md)

## Maturity and Completeness

These guides are not complete, haven't been edited, and are a work in
progress in general.

So if you find yourself wanting more detail, check the code first!

## License

(c) Pivotal Software Inc, 2015-2016

Released under the
[Creative Commons Attribution-ShareAlike 3.0 Unported](https://creativecommons.org/licenses/by-sa/3.0/)
license.

--------------------------------------------------------------------------------
/TODO:
--------------------------------------------------------------------------------
Overall architecture
- which things are processes, how do messages flow, how do plugins work,
  what are the major subsystems?

Profiling / debugging
- fprof and/or dbg

Shovel / federation
- History / rationales, how they work, Erlang client, future directions

--------------------------------------------------------------------------------
/authorization_and_authentication.md:
--------------------------------------------------------------------------------
# Authorization and Authentication Backends

This document describes the authentication and authorization machinery that
implements [access control](https://www.rabbitmq.com/access-control.html).

Authentication backends should not be confused with authentication mechanisms,
which are defined in some protocols supported by RabbitMQ.
For AMQP 0-9-1 authentication mechanisms, see the [documentation](https://www.rabbitmq.com/authentication.html).

## Definitions

Authentication and authorization are often confused or used interchangeably. That's
wrong and RabbitMQ separates the two. For the sake of simplicity, we'll define
authentication as "identifying who the user is" and authorization as
"determining what the user is and isn't allowed to do."


## Authentication Mechanisms

AMQP 0-9-1 supports multiple authentication **mechanisms**. The mechanisms decide how a
client connection authenticates, for example, what should be considered a
set of credentials.

In practice, in 99% of cases only two mechanisms are used:

* `PLAIN` (a set of credentials such as username and password)
* `EXTERNAL`, which assumes authentication happens out of band (not performed
  by RabbitMQ authN backends), usually [using x509 (TLS) certificates](https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_auth_mechanism_ssl).
  This mechanism ignores client-provided credentials and relies on TLS [peer certificate chain
  verification](https://tools.ietf.org/html/rfc6818).

When a client connection reaches the [authentication stage](https://github.com/rabbitmq/rabbitmq-server/blob/v3.7.2/src/rabbit_reader.erl#L1304), a mechanism requested by the client
and supported by the server is selected. The mechanism module then checks whether it can
be applied to a connection (e.g. the TLS-based mechanism will reject non-TLS connections).

An authentication mechanism is a module that implements the [rabbit_auth_mechanism](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/rabbit_auth_mechanism.erl) behaviour, which includes
3 functions:

* `init/1`: self-explanatory
* `should_offer/1`: if this mechanism is enabled, should it be offered for a given socket?
* `handle_response/2`: the core authentication logic of the mechanism

The `PLAIN` mechanism extracts client credentials and passes them to
a chain of authentication and authorization backends.


## Authentication and Authorization Backends

Authentication (authN) and authorization (authZ) backend(s) use
client-provided credentials to decide whether the client passes
authentication and should be granted access to the target virtual
host.

The above sentence implies that the `PLAIN` (or similar)
authentication mechanism is used and has already validated the presence of
client credentials.

Authentication and authorization backends form a chain of
responsibility: a set of backends is applied to the same set of client
credentials and as soon as one of them reports success, the entire
operation is considered to be successful.

Authentication and authorization backends can be provided by
plugins. They are modules that must implement the following
behaviours:

* `rabbit_authn_backend` for authentication ("authn") backends
* `rabbit_authz_backend` for authorization ("authz") backends

It is possible to implement both in a single module.
For example the `internal`, `ldap` and `http` backends do so.

It is possible to use multiple backends for authn or authz. Then the first
positive result returned by a backend in the chain is considered to be final.
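
As a conceptual sketch of that chain: the following is hypothetical code, not the
actual `rabbit_access_control` implementation. `try_backends/3` is made up for
illustration and uses the `user_login_authentication/2` callback described in the
next section:

``` erlang
%% Hypothetical sketch: try each backend in order and stop at the first
%% positive result. A refusal moves on to the next backend in the chain,
%% while an unexpected error aborts the whole operation.
try_backends([], Username, _AuthProps) ->
    {refused, "no backend accepted the credentials for ~s", [Username]};
try_backends([Backend | Rest], Username, AuthProps) ->
    case Backend:user_login_authentication(Username, AuthProps) of
        {ok, AuthUser}       -> {ok, AuthUser};
        {refused, _Fmt, _As} -> try_backends(Rest, Username, AuthProps);
        {error, _} = Err     -> Err
    end.
```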

### AuthN Backend

The `rabbit_authn_backend` behaviour defines the authentication process with a single function:

``` erlang
user_login_authentication(UserName, AuthProps) -> {ok, #auth_user{}} | {refused, Format, Args} | {error, Reason}
```

Where `UserName` is the name of the user who is trying to authenticate and
`AuthProps` is an authentication context (a proplist): e.g. it can be `[]` for x509 certificate-based
authentication or `[{password, Password}]` for a password-based one.

This function returns

* `{ok, #auth_user{}}` in case of successful authentication. The `#auth_user{}` record is then passed
  on to other modules, associated with the connection, etc.
* `{refused, Format, Args}` when user authentication fails. `Format` and `Args` are meant to be used
  with `io:format/2` and similar functions.
* `{error, Reason}` when an unexpected error occurs.

### AuthZ Backend

The `rabbit_authz_backend` behaviour defines functions that authorize access
to RabbitMQ resources, such as a `vhost`, `exchange`, `queue` or `topic`.

It contains the following functions:

``` erlang
% whether the user is allowed to access the broker at all.
user_login_authorization(UserName) -> {ok, Impl} | {ok, Impl, Tags} | {refused, Format, Args} | {error, Reason}.
% whether the user has access to a specific vhost.
check_vhost_access(#auth_user{}, Vhost, Context) -> boolean() | {error, Reason}.
% whether the user has access to a specific resource
check_resource_access(#auth_user{}, #resource{}, Permission, Context) -> boolean() | {error, Reason}.
% whether the user has access to a specific topic
check_topic_access(#auth_user{}, #resource{}, Permission, Context) -> boolean() | {error, Reason}.
% whether the backend supports state or credential expiration
state_can_expire() -> boolean()
% optional: update backend state (e.g. a new JWT token)
update_state(#auth_user{}, NewState) -> {ok, #auth_user{}} | {refused, Fmt, Args} | {error, Reason}.
```

Where

* `UserName`, `Format`, `Args`: see above.
* `Impl` is the internal state of the authorization backend. It will vary between backends and can be thought of
  as the backend's `State`.
* `Tags` is a list of user tags. Those are used by features such as policies, plugins such as management, and so on. Tags can be an empty list.
* `Vhost` is self-explanatory
* `Permission` is currently one of `configure`, `read`, or `write`
* `Context` is a map with additional information (like the peer address or routing key, or protocol-specific information like the MQTT client ID). It is slightly different for each of the three check functions.

The `#auth_user{}` record represents a user whenever we need to
check access to vhosts and resources.

This record has the following structure:
`#auth_user{ username :: binary(), impl :: any(), tags :: [any()] }`,

where `impl` is internal backend state and `tags` is a list of user tags (see above).

`impl` can be used to check resource access by querying an external data source or by performing
a check solely on the provided state (local data).

`#resource{ virtual_host :: binary(), kind :: queue|exchange|topic, name :: binary() }`
represents a resource (a queue, exchange, or topic) access to which is restricted.
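
To make the two behaviours concrete, here is a minimal, hypothetical backend
module that implements both in one place. The module name is made up and the
logic (accept every user, allow everything) is deliberately trivial; a real
backend would consult a data store in `user_login_authentication/2` and in the
three check functions:

``` erlang
-module(my_auth_backend).
-behaviour(rabbit_authn_backend).
-behaviour(rabbit_authz_backend).

-include_lib("rabbit_common/include/rabbit.hrl").

-export([user_login_authentication/2, user_login_authorization/1,
         check_vhost_access/3, check_resource_access/4,
         check_topic_access/4, state_can_expire/0]).

%% AuthN: accept any user. A real backend would verify the password
%% found in AuthProps against some data store.
user_login_authentication(Username, _AuthProps) ->
    {ok, #auth_user{username = Username, impl = none, tags = []}}.

%% AuthZ entry point: reuse the authN logic to produce Impl and Tags.
user_login_authorization(Username) ->
    case user_login_authentication(Username, []) of
        {ok, #auth_user{impl = Impl, tags = Tags}} -> {ok, Impl, Tags};
        Else                                       -> Else
    end.

%% Permissive checks: a real backend would inspect the vhost, the
%% #resource{} record, the requested Permission and the Context.
check_vhost_access(#auth_user{}, _Vhost, _Context)                -> true.
check_resource_access(#auth_user{}, #resource{}, _Perm, _Context) -> true.
check_topic_access(#auth_user{}, #resource{}, _Perm, _Context)    -> true.

%% Nothing in this backend's state ever expires.
state_can_expire() -> false.
```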

### Configuring Backends

Backends are configured the usual way and can have multiple "syntaxes"
(recognised forms):

``` erlang
% To enable a single backend:
{rabbit, [{auth_backends, [my_auth_backend]}]}.
% To check several backends: if one refuses, check the next.
{rabbit, [{auth_backends, [my_auth_backend, my_other_auth_backend]}]}.
% To use different modules as the AuthN and AuthZ backends
{rabbit, [{auth_backends, [{my_authn_backend, my_authz_backend}]}]}.
% You can still fall back to another backend when using different modules
{rabbit, [{auth_backends, [{my_authn_backend, my_authz_backend}, my_other_auth_backend]}]}.
```

If a backend is defined by a tuple,
the first element will be used as the `AuthN` module and the second as the `AuthZ` one.
If it is defined by an atom, it will be used for both `AuthN` and `AuthZ`.

When a backend is defined by a list, the server will use the modules in the chain in order
until one of them returns a positive result or the list is exhausted (the Chain of Responsibility
pattern in object-oriented parlance).

If authentication is successful then the `AuthZ` backend from the same tuple ("chain element")
will be used for authorization checks later.

### Example Backends

* `rabbit_auth_backend_dummy`: a dummy no-op backend, only used as the most trivial example
* `rabbit_auth_backend_internal`: internal data store backend. See https://www.rabbitmq.com/access-control.html for more info
* `rabbit_auth_backend_ldap`: provides `AuthN` and `AuthZ` backends in a single module, backed by LDAP
* `rabbit_auth_backend_http`: provides `AuthN` and `AuthZ` backends in a single module, backed by an HTTP service
* `rabbit_auth_backend_amqp`: provides `AuthN` and `AuthZ` backends in a single module, backed by an AMQP 0-9-1 service that uses request/response ("RPC")
* `rabbit_auth_backend_oauth2`: provides `AuthN` and `AuthZ` backends in a single module that uses OAuth 2.0 (JWT) tokens. It is not specific to but was developed against [Cloud Foundry UAA](https://github.com/cloudfoundry/uaa).

### Permission caching

The results of the most recent permission checks are cached by each
channel. There is one cache for resource permissions and another for
topic permissions. When a `Resource/Topic + Context + Permission`
triplet is successfully authorized, it is added to the corresponding
cache. When the cache is full (currently max 12 entries in each), the
oldest entry is rotated out. The caches are cleared when the channel
process is hibernated. They are also cleared every `channel_tick_interval`
if any enabled backend supports state expiration.

To give an extreme and unrealistic example: when only the internal
backend is enabled, a client that only publishes to the same resource
at high frequency can continue to do so indefinitely even after its
permission for that resource is revoked (because the channel cache is
never cleared and never rotates).

Separately there is also a [caching authN/authZ
backend](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbitmq_auth_backend_cache/README.md)
that provides a TTL-based caching layer for another backend. It is
useful for reducing network traffic and latency for backends that use
external services (like LDAP or HTTP).
Its time-based expiration
allows better control over when entries are evicted from the cache.

--------------------------------------------------------------------------------
/backing_queue.md:
--------------------------------------------------------------------------------
# Backing Queue #

RabbitMQ supports pluggable backing queues via modules implementing the
`rabbit_backing_queue` behaviour.

The backing queue `init/3` callback expects an `async_callback()`
parameter, which is a `fun` that takes the backing queue
state and returns a new state. Keep reading to understand what all
this callback mumbo-jumbo means.

**TL;DR:** due to the two problems explained below, this callback
takes care of executing certain functions in the context of a
particular Erlang process.

## Process Dictionary Problem ##

Understanding how this callback works is vital since the persistence
layer of the backing queue makes heavy use of the _process dictionary_
and of `self()` to track who opened which file handle. What
this means is that even though the backing queue behaviour callbacks
appear to be referentially transparent, they are not. Behind
the scenes, some of the backing queue behaviour callbacks will `put`/`get`
values to/from the process dictionary, but if one of said callbacks is
executed in a different process context, those values won't be
found in the process dictionary, and havoc ensues.

## File Handle Cache Problem ##

The same applies to the `file_handle_cache`, which tracks who owns which
file handle by calling `self()` inside its function implementations
instead of, for example, expecting a `Pid` as a parameter. The call to
`self()` again violates referential transparency: the function's
behaviour now depends on the process context in which it is
called. This means that closing file handles must be done from the
same caller that issued the file open.
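
Here is a toy illustration of the process dictionary problem (the function and
key names are made up; this is not RabbitMQ code): the very same closure sees
different data depending on which process runs it, because `put/2` writes to
the dictionary of the calling process:

```erlang
%% Hypothetical demo: the closure finds the value in the process that
%% called put/2, and 'undefined' in any other process.
pd_demo() ->
    put(fd_count, 1),
    Lookup = fun () -> get(fd_count) end,
    1 = Lookup(),                                  %% same process: 1
    Self = self(),
    spawn(fun () -> Self ! {elsewhere, Lookup()} end),
    receive {elsewhere, Value} -> Value end.       %% other process: undefined
```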

## How Things Work ##

The function `rabbit_amqqueue_process:bq_init/3` takes care of
initializing the backing queue implementation, whether it is
`rabbit_variable_queue`, `rabbit_mirror_queue_master`,
`rabbit_priority_queue` or your own backing queue behaviour
implementation.

The async callback passed into `BQ:init` is defined as:

```erlang
fun (Mod, Fun) ->
        rabbit_amqqueue:run_backing_queue(Self, Mod, Fun)
end
```

This `fun` takes a module argument, which is usually an atom
referring to the backing queue module being used, for example
`rabbit_variable_queue` or `rabbit_mirror_queue_master`. The second
argument expected by this callback is a `fun` that will be passed
along to `rabbit_amqqueue:run_backing_queue/3`. Now let's see what
`rabbit_amqqueue:run_backing_queue` does.

### rabbit_amqqueue:run_backing_queue ###

The function body is like this:

```erlang
run_backing_queue(QPid, Mod, Fun) ->
    gen_server2:cast(QPid, {run_backing_queue, Mod, Fun}).
```

It sends a `{run_backing_queue, Mod, Fun}` message to whatever process
was provided as `QPid`. **This is important**, since that process'
context is the one that will get its process dictionary modified
indirectly, and at the same time it will own the file handles when they are
opened by the _msg\_store_, for example.

Back in `rabbit_amqqueue_process` we see that this module has a
callback for the message mentioned above:

```erlang
handle_cast({run_backing_queue, Mod, Fun},
            State = #q{backing_queue = BQ, backing_queue_state = BQS}) ->
    noreply(State#q{backing_queue_state = BQ:invoke(Mod, Fun, BQS)});
```

This function takes care of extracting the current `backing_queue`
module and `backing_queue_state` from its own process state, and then
calling `BQ:invoke(Mod, Fun, BQS)`.

This is what `BQ:invoke/3` does:

```erlang
invoke(?MODULE, Fun, State) -> Fun(?MODULE, State);
invoke(      _,   _, State) -> State.
```

Invoke's implementation is pretty simple: if the `Mod` argument
provided to it matches the current module, in this example
`rabbit_variable_queue`, then `Fun` will be executed with
`rabbit_variable_queue` as the first parameter and the backing queue
`State` as the second argument. To reiterate, what's important to
understand is that `Fun` will be executed in the context of whatever
process `QPid` was referring to above. In the case we are analyzing,
this is the `rabbit_amqqueue_process` pid.

## What Fun ##

Now let's try to find out what `Fun` actually is. To get to this we
need to see how `rabbit_variable_queue` is initialized.

`rabbit_variable_queue:init/6` will call into
`msg_store_client_init/3`, passing our initial callback as the third
parameter (`msg_store_client_init/3` then expands into
`msg_store_client_init/4`). Let's refresh what that callback was:

```erlang
fun (Mod, Fun) ->
        rabbit_amqqueue:run_backing_queue(Self, Mod, Fun)
end
```

That callback will now be wrapped into yet another `fun` like this:

```erlang
fun () -> Callback(?MODULE, CloseFDsFun) end
```

To see that in context:

```erlang
msg_store_client_init(MsgStore, Ref, MsgOnDiskFun, Callback) ->
    CloseFDsFun = msg_store_close_fds_fun(MsgStore =:= ?PERSISTENT_MSG_STORE),
    rabbit_msg_store:client_init(MsgStore, Ref, MsgOnDiskFun,
                                 fun () -> Callback(?MODULE, CloseFDsFun) end).
```

So now we have a clue of what the `Fun` passed into our callback might
be. It is whatever `msg_store_close_fds_fun` returned as
`CloseFDsFun`. Let's check:

```erlang
msg_store_close_fds_fun(IsPersistent) ->
    fun (?MODULE, State = #vqstate { msg_store_clients = MSCState }) ->
        {ok, MSCState1} = msg_store_close_fds(MSCState, IsPersistent),
        State #vqstate { msg_store_clients = MSCState1 }
    end.
```

We get a `fun` that will only be executed if the `Mod` argument
matches, in this case `rabbit_variable_queue`. That `fun` takes as
its second argument our `rabbit_variable_queue` state.

In `msg_store_client_init/4` above we said that our initial callback
gets wrapped like this:

```erlang
fun () -> Callback(?MODULE, CloseFDsFun) end
```

This means that inside the msg_store, at various places, that `fun` closure
gets called without arguments, which in turn calls our callback with
the `CloseFDsFun`. We end up with something like what's below after
some expansions:

```erlang
fun (rabbit_variable_queue, Fun) ->
        rabbit_amqqueue:run_backing_queue(QPid, rabbit_variable_queue,
            fun (?MODULE, State = #vqstate { msg_store_clients = MSCState }) ->
                {ok, MSCState1} = msg_store_close_fds(MSCState, IsPersistent),
                State #vqstate { msg_store_clients = MSCState1 }
            end)
end
```

So our `rabbit_amqqueue_process` will ask the backing queue module
to invoke that expanded fun in the context of the
`rabbit_amqqueue_process` Pid:

```erlang
handle_cast({run_backing_queue, Mod, Fun},
            State = #q{backing_queue = BQ, backing_queue_state = BQS}) ->
    noreply(State#q{backing_queue_state = BQ:invoke(Mod, Fun, BQS)});
```

This very same technique is used in `rabbit_variable_queue:init/3` to
set up the functions that will write messages to disk (see
`rabbit_variable_queue:msgs_written_to_disk/3`) and the ones that will
write the message indexes to disk (see
`rabbit_variable_queue:msg_indices_written_to_disk/2`).

## It's About Context ##

From all these layers of indirection, what's important to understand
is that the `Pid` passed into `rabbit_amqqueue:run_backing_queue/3`
determines the context in which all the functions implementing message
persistence will be run. Unless your `rabbit_backing_queue` behaviour
implementation is just a proxy like that of `rabbit_priority_queue`,
you must take that `Pid` context into account, since it will hold the
file handle references, and its process dictionary will be the one where
the `file_handle_cache` stores its information.

If you want a second example of what we outlined above, take a look at
`rabbit_mirror_queue_slave:bq_init/3`, where the Pid provided to
`run_backing_queue/3` is the _slave_ Pid. The slave
process implements its own `handle_cast({run_backing_queue, Mod,
Fun}, State)` function clause, in which funs from
`rabbit_variable_queue` like `msg_store_close_fds_fun`,
`msgs_written_to_disk`, `msg_indices_written_to_disk` and
`msgs_and_indices_written_to_disk` will be run.

--------------------------------------------------------------------------------
/basic_publish.md:
--------------------------------------------------------------------------------
# Publishing Messages into RabbitMQ #

One of the best ways to cover the various parts of RabbitMQ's
architecture is to see what happens when a message gets published into
the broker. In this document we are going to visit the different
subsystems a message crosses inside the broker. Let's start with
`rabbit_reader`.

The `rabbit_reader` process is the one that takes care of reading data
from the network and forwarding it to the respective channel
process.
Messages get into the channel when the reader calls the
function `rabbit_channel:do_flow/3`. This function calls the
`credit_flow` module to record that a message was received from the
reader, so that the reader can eventually be throttled down in case the
publisher is sending messages faster than the broker can handle them
at a particular time. Read more about Credit Flow
[here](./credit_flow.md). More information about the Reader process
can be found in the
[Networking and Connections guide](./networking_and_connections.md#rabbit_reader).

## Arriving into the Channel Process ##

Once Credit Flow is accounted for, the `do_flow/3` function will
issue an asynchronous `gen_server:cast/2` into the channel process,
passing in this Erlang message: `{method, Method, Content,
flow}`. There we have the AMQP `Method`, the method `Content`, and
the atom `flow` indicating to the channel that credit flow is in use.

When the cast reaches the `handle_cast/2` function inside the channel
module, we are finally inside the channel process memory and execution
path. If `flow` was in use, as is the case here, the channel will
issue a `credit_flow:ack/1` to the reader process. Then the AMQP
method that's being processed will be passed to the
[Interceptors](./interceptors.md) defined for the channel, in case
there are any. After the Interceptors are done processing the AMQP
method, the channel process will continue processing the method;
in our case the function `handle_method/3` will be called, with a
`basic.publish` record.

## Inside basic.publish ##

`basic.publish` works by receiving an AMQP message, an exchange and a
routing key; it will use the exchange to route the message to one
or more queues, based on the routing key. Let's see how that's
accomplished.

The first thing the function does is to check the size of the message,
since RabbitMQ has an upper limit of 2GB for messages.

Then the function needs to build the resource record for the
exchange. Exchanges and queues are represented internally by a
resource record that keeps track of the name and the vhost where the
exchange or queue was declared. The record's type declaration looks like
this:

```erlang
#resource{virtual_host :: VirtualHost,
          kind :: Kind,
          name :: Name}
```

So if a message was published to the default vhost to an exchange
called `"my_exchange"`, we will end up with the following record:

```erlang
#resource{virtual_host = <<"/">>,
          kind = exchange,
          name = <<"my_exchange">>}
```

Resources like that one are used everywhere in RabbitMQ, so it's a
good idea to study their parts in the
[rabbit_types](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/rabbit_types.erl)
module where these declarations are defined.

Once we have the exchange record, `basic.publish` will use it to see
if the user publishing the message has write permissions for this
particular exchange by calling the function
`check_write_permitted/2`.
Read more about the different kinds of
permissions here:
[access-control](https://www.rabbitmq.com/access-control.html)

If the user does have permission to publish messages to this exchange,
the channel will query the Mnesia database to find out if
the exchange actually exists: the function
`rabbit_exchange:lookup_or_die/1` is called in order to retrieve the
actual exchange record from the database, and if the exchange is not
found, a channel error is raised by `lookup_or_die/1`. Keep in mind that
the exchange resource we mentioned above is one thing; the exchange
record stored in Mnesia is quite another. The latter
holds much more information about the actual exchange, such as its
type (direct, fanout, topic, etc.). Here's the exchange
record definition from `rabbit.hrl`:

```erlang
%% fields described as 'transient' here are cleared when writing to
%% rabbit_durable_
-record(exchange, {
  name, type, durable, auto_delete, internal, arguments, %% immutable
  scratches,    %% durable, explicitly updated via update_scratch/3
  policy,       %% durable, implicitly updated when policy changes
  decorators}). %% transient, recalculated in store/1 (i.e. recovery)
```

Then we need to check that the record returned by Mnesia is not an
internal exchange, otherwise an error will be raised and the publish
will fail.

The next thing to do is to validate the user id provided with the
basic publish, if any. If provided, this user id has to be validated
against the user that created the channel where the message is being
published. More details
[here](https://www.rabbitmq.com/validated-user-id.html)

Then we need to validate that the message expiration header provided by
the user, if any, is correct. More info about the Per-Message TTL
[here](https://www.rabbitmq.com/ttl.html#per-message-ttl)

Then it's time to check if the message was published as _Mandatory_ or
if the channel is in _Transaction_ or _Confirm Mode_. If this is the
case, the `publish_seqno` field in the channel state will be
incremented to account for the new publish that's being handled. This
Message Sequence Number will later be used to reply back to the
publisher in case the message was Mandatory and/or the channel was in
[Confirm Mode](https://www.rabbitmq.com/confirms.html). See also the
document [Delivering Messages to Queues](./deliver_to_queues.md).

After all these steps have been completed, it's time to route the AMQP
message, but in order to do that we first need to wrap the message
into a `#basic_message` record, and then pass it to the exchange and
queues as a `#delivery{}` record:

```erlang
-record(basic_message,
        {exchange_name,     %% The exchange where the message was received
         routing_keys = [], %% Routing keys used during publish
         content,           %% The message content
         id,                %% A `rabbit_guid:gen()` generated id
         is_persistent}).   %% Whether the message was published as persistent

-record(delivery,
        {mandatory,  %% Whether the message was published as mandatory
         confirm,    %% Whether the message needs confirming
         sender,     %% The pid of the process that created the delivery
         message,    %% The #basic_message record
         msg_seq_no, %% Msg Sequence Number from the channel publish_seqno field
         flow}).     %% Should flow control be used for this delivery
```

## Message Routing ##

The `#delivery` we just created in the previous step is now passed to
the exchange via the function `rabbit_exchange:route/2`. If the
exchange name used during `basic.publish` is the empty string
`<<"">>`, then the `default` exchange is assumed, and `route/2`
will just return the queue name associated with the routing key, per the
AMQP spec. If that's not the case, the delivery will be processed
first by the [exchange decorators](./exchange_decorators.md) that are
configured for the exchange that's handling the routing. The decorators
will send back a list of _destinations_. At this point the delivery will
finally reach the exchange, where the routing algorithm implemented by
the exchange will take place. This process will return a new list of
_destinations_, which will be merged and deduplicated with the list
returned earlier by the decorators. At this point, all the destinations
proposed by the
[Exchange To Exchange](https://www.rabbitmq.com/e2e.html) bindings are
also included in the list of destinations that will be returned to the
channel.

## Processing Routing Results ##

Now the channel has a list of queues to which it should deliver the
message. Before doing that, we need to see if the channel is in
transaction mode. If that's the case, the `#delivery` and the
list of queues are enqueued for later, until the transaction is
committed. Keep in mind that transaction support in RabbitMQ is a
very simple form of
[message batching](https://www.rabbitmq.com/semantics.html). If the
channel is not in transaction mode, the message will be delivered
to the queues returned by the routing function.

## Summary ##

We saw in this guide that messages arrive via the network into the
`rabbit_reader` process. This process forwards commands to
`rabbit_channel` processes, which take care of processing the various
AMQP methods. In this case, we looked at what happens when a message
is published into RabbitMQ. Once credit flow has been acked back to
the reader process, it's time to take care of handling the
message. First it will go to the interceptors, which might modify or
augment the AMQP method received from the _reader_. Then the channel
must make sure the message complies with the size limits set on the
broker side. Once that's done, we need to see if the user has
permission to publish messages to the selected exchange. If that's fine
and the `user_id` and `expiration` headers of the message are
validated, then it's time to route the message. The exchange that
handles the message will return a list of queues to which the
message must be delivered. At this point we are done with the
message and the channel is ready to keep processing commands.
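
As a recap of the routing step described above, here is a conceptual sketch.
This is not the actual `rabbit_exchange` code: `type_module/1` is a made-up
helper (the real server looks the exchange type up via a registry), and the
shape of the `decorators` field is simplified to a plain list of modules:

```erlang
%% Simplified sketch of rabbit_exchange:route/2-style logic: collect the
%% destinations proposed by the decorators, add the ones computed by the
%% exchange type's routing algorithm, then deduplicate the result.
route(#exchange{decorators = Decorators} = X, Delivery) ->
    DecoratorDests = lists:append(
                       [Decorator:route(X, Delivery) || Decorator <- Decorators]),
    TypeDests      = (type_module(X)):route(X, Delivery),
    lists:usort(DecoratorDests ++ TypeDests).
```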
200 | 201 | Now we can continue with the next guide and see what happens when 202 | messages are delivered to queues: 203 | [Delivering Messages to Queues](./deliver_to_queues.md) 204 | -------------------------------------------------------------------------------- /channels.md: -------------------------------------------------------------------------------- 1 | # Channels 2 | 3 | This guide provides an overview of the AMQP 0-9-1 channel implementation. 4 | Before you start, please take a look at the [Networking and Connections](./networking_and_connections.md) guide. 5 | -------------------------------------------------------------------------------- /credit_flow.md: -------------------------------------------------------------------------------- 1 | # Credit Flow # 2 | 3 | In order to prevent fast publishers from overflowing the broker with 4 | more messages than it can handle at any particular moment, RabbitMQ 5 | implements an internal mechanism called credit flow that will be used 6 | by the various systems inside RabbitMQ to throttle down publishers, 7 | while allowing the message consumers to catch up. In this guide we 8 | are going to see how credit flow works, and what we can do to tune its 9 | configuration for optimal behaviour. 10 | 11 | Since version 3.5.5, RabbitMQ includes a couple of new configuration 12 | values that let users fiddle with the internal credit flow 13 | settings. Understanding how these work according to your particular 14 | workload can help you get the most out of RabbitMQ in terms of 15 | performance, but beware: increasing these values just to see what 16 | happens can have adverse effects on how RabbitMQ is able to respond to 17 | message bursts, affecting the internal strategies that RabbitMQ has in 18 | order to deal with memory pressure. Handle with care. 19 | 20 | To understand the new credit flow settings, we first need to understand 21 | how the internals of RabbitMQ work with regard to message publishing 22 | and paging messages to disk. Let’s first see how message publishing 23 | works in RabbitMQ. 24 | 25 | ## Message Publishing ## 26 | 27 | To see how credit_flow and its settings affect publishing, let’s see 28 | how internal messages flow in RabbitMQ. Keep in mind that RabbitMQ is 29 | implemented in Erlang, where processes communicate by sending messages 30 | to each other. 31 | 32 | Whenever a RabbitMQ instance is running, there are probably hundreds 33 | of Erlang processes exchanging messages to communicate with each 34 | other. We have for example a reader process that reads AMQP frames 35 | from the network. Those frames are transformed into AMQP commands that 36 | are forwarded to the AMQP channel process. If this channel is handling 37 | a publish, it needs to ask a particular exchange for the list of 38 | queues where this message should end up going, which means the channel 39 | will deliver the message to each of those queues. Finally, if the AMQP 40 | message needs to be persisted, the msg_store process will receive it 41 | and write it to disk. So whenever we publish an AMQP message to 42 | RabbitMQ we have the following Erlang message flow[1]: 43 | 44 | ``` 45 | reader -> channel -> queue process -> message store. 46 | ``` 47 | 48 | In order to prevent any of those processes from overflowing the next 49 | one down the chain, we have a credit flow mechanism in place. Each 50 | process initially grants a certain amount of credit to the process that 51 | is sending it messages.
Once a process is able to handle N of 52 | those messages, it will grant more credit to the process that sent 53 | them. Under default credit flow settings (`credit_flow_default_credit` 54 | under `rabbitmq.config`) these values are 200 messages of initial 55 | credit, and after 50 messages processed by the receiving process, the 56 | process that sent the messages will be granted 50 more credits. 57 | 58 | Say we are publishing messages to RabbitMQ; this means the reader will 59 | be sending one Erlang message to the channel process per AMQP 60 | basic.publish received. Each of those messages will consume one of 61 | these credits from the channel. Once the channel is able to process 50 62 | of those messages, it will grant more credit to the reader. So far so 63 | good. 64 | 65 | In turn the channel will send the message to the queue process that 66 | matched the message routing rules. This will consume one credit from 67 | the credit granted by the queue process to the channel. After the 68 | queue process manages to handle 50 deliveries, it will grant 50 more 69 | credits to the channel. 70 | 71 | Finally, if a message is deemed to be persistent (it’s persistent and 72 | published to a durable queue), it will be sent to the message store; 73 | this will also consume credits from the ones granted by 74 | the message store to the queue process. In this case the initial 75 | values are different and handled by the `msg_store_credit_disc_bound` 76 | setting: 2000 messages of initial credit and 500 more credits after 77 | 500 messages are processed by the message store. 78 | 79 | So we know how internal messages flow inside RabbitMQ and when credit 80 | is granted to a process that’s above it in the message stream. The tricky 81 | part comes when credit stops being granted between processes. Under normal 82 | conditions a channel will process 50 messages from the reader, and 83 | then grant the reader 50 more credits, but keep in mind that a channel 84 | is not just handling publishes: it’s also sending messages to 85 | consumers, routing messages to queues and so on. 86 | 87 | What happens if the reader is sending messages to the channel at a 88 | higher speed than what the channel is able to process? If we reach this 89 | situation, then the channel will block the reader process, which will 90 | result in producers being throttled down by RabbitMQ. Under default 91 | settings, the reader will be blocked once it has sent 200 messages to the 92 | channel while the channel has not been able to process at least 50 of them, 93 | which is what is required to grant credit back to the reader. 94 | 95 | Again, under normal conditions, once the channel manages to go through 96 | the message backlog, it will grant more credit to the reader, but 97 | there’s a catch. What if the channel process is being blocked by the 98 | queue process, for similar reasons? Then the new credit that was 99 | supposed to go to the reader process will be deferred. The reader 100 | process will remain blocked. 101 | 102 | Once the queue process manages to go through the deliveries backlog 103 | from the channel, it will grant more credit to the channel, unblocking 104 | it, which will result in the channel granting more credit to the 105 | reader, unblocking it. Once again, that’s under normal conditions, 106 | but, you guessed it, what if the message store is blocking the queue 107 | process? Then credit to the channel will be deferred, so the channel will 108 | remain blocked, deferring credit to the reader and leaving the reader 109 | blocked.
At some point, the message store will grant credit to the 110 | queue process, which will grant credit back to the channel, and then 111 | the channel will finally grant credit to the reader and unblock the 112 | reader: 113 | 114 | ``` 115 | reader <--[grant]-- channel <--[grant]-- queue process <--[grant]-- message store 116 | ``` 117 | 118 | Having one channel and one queue process makes things easier to 119 | understand, but it might not reflect reality. It’s common for RabbitMQ 120 | users to have more than one channel publishing messages on the same 121 | connection. Even more common is to have one message being routed to 122 | more than one queue. What happens with the credit flow scheme we’ve 123 | just explained is that if one of those queues blocks the channel, then 124 | the reader will be blocked as well. 125 | 126 | The problem is that from a reader standpoint, when we read a frame 127 | from the network, we don’t even know which channel it belongs 128 | to. Keep in mind that channels are a logical concept on top of AMQP 129 | connections. So even if a new AMQP command will end up in a channel 130 | that is not blocking the reader, the reader has no way of knowing 131 | it. Note that we only block publishing connections; consumer 132 | connections are unaffected, since we want consumers to drain messages 133 | from queues. This is a good reason why it might be better to have 134 | connections dedicated to publishing messages, and connections 135 | dedicated to consuming only. 136 | 137 | In a similar fashion, whenever a channel is processing message 138 | publishes, it doesn’t know where messages will end up going until it 139 | performs routing. So a channel might be receiving a message that 140 | should end up in a queue that is not blocking the channel. Since at 141 | ingress time we don’t know any of this, the credit flow strategy 142 | in place is to block the reader until processes down the chain are 143 | able to handle new messages. 144 | 145 | One of the new settings introduced in RabbitMQ 3.5.5 is the ability to 146 | modify the values for `credit_flow_default_credit`. This setting takes 147 | a tuple of the form `{InitialCredit, 148 | MoreCreditAfter}`. `InitialCredit` is set to 200 by default, and 149 | `MoreCreditAfter` is set to 50. Depending on your particular workload, 150 | you need to decide if it’s worth bumping those values. Let’s see the 151 | message flow scheme again: 152 | 153 | ``` 154 | reader -> channel -> queue process -> message store. 155 | ``` 156 | 157 | Bumping the values for `{InitialCredit, MoreCreditAfter}` will mean 158 | that at any point in that chain we could end up with more messages 159 | than the broker can handle at that particular point 160 | in time. More messages means more RAM usage. The same can be said 161 | about `msg_store_credit_disc_bound`, but keep in mind that there’s 162 | only one message store[2] per RabbitMQ instance, and there can be many 163 | channels sending messages to the same queue process. So while a queue 164 | process has a value of 2000 as InitialCredit from the message store, 165 | that queue can be ingesting many times that value from different 166 | channel/connection sources. So 200 credits as the initial 167 | `credit_flow_default_credit` value could be seen as too conservative, 168 | but you need to determine whether, according to your workload, that’s still 169 | good enough or not.
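

As a back-of-the-envelope illustration of the `{InitialCredit, MoreCreditAfter}` bookkeeping described above, here is a toy sketch. This is not the real `credit_flow` module; it only models the counting scheme with the default `{200, 50}` values:

```erlang
%% Toy model of the {InitialCredit, MoreCreditAfter} scheme: the sender
%% spends one credit per message and blocks at zero; the receiver grants
%% 50 credits back after every 50 messages it manages to process.
-module(toy_credit).
-export([initial_credit/0, send/1, received/1]).

-define(INITIAL_CREDIT, 200).
-define(MORE_CREDIT_AFTER, 50).

initial_credit() -> ?INITIAL_CREDIT.

%% Sender side: one credit per message sent; at zero the sender must
%% block until the receiver grants more credit.
send(0)      -> blocked;
send(Credit) -> {ok, Credit - 1}.

%% Receiver side: count processed messages, granting credit back to the
%% sender after every ?MORE_CREDIT_AFTER of them.
received(Seen) when Seen + 1 >= ?MORE_CREDIT_AFTER ->
    {grant, ?MORE_CREDIT_AFTER, 0};
received(Seen) ->
    {no_grant, Seen + 1}.
```

The blocking chains described above fall out of this scheme: a process that cannot get its own messages processed downstream never reaches the point where it would grant credit upstream.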
170 | 171 | ## Message Paging ## 172 | 173 | Let’s take a look at how RabbitMQ queues store messages. When a 174 | message enters the queue, the queue needs to determine if the message 175 | should be persisted or not. If the message has to be persisted, then 176 | RabbitMQ will do so right away[3]. Now even if a message was persisted 177 | to disk, this doesn’t mean the message got removed from RAM, since 178 | RabbitMQ keeps a cache of messages in RAM for fast access when 179 | delivering messages to consumers. Whenever we are talking about paging 180 | messages out to disk, we are talking about what RabbitMQ does when it 181 | has to send messages from this cache to the file system. 182 | 183 | When RabbitMQ decides it needs to page messages to disk, it will call 184 | the function `reduce_memory_use` on the internal queue implementation in 185 | order to send messages to the file system. Messages are going to be 186 | paged out in batches; how big those batches are depends on the current 187 | memory pressure status. It basically works like this: 188 | 189 | The function `reduce_memory_use` will receive a number called target 190 | ram count which tells RabbitMQ that it should try to page out messages 191 | until only that many remain in RAM. Keep in mind that whether messages 192 | are persistent or not, they are still kept in RAM for fast delivery to 193 | consumers. Only when memory pressure kicks in are messages in 194 | memory paged out to disk. Quoting from our code comments: “The 195 | question of whether a message is in RAM and whether it is persistent 196 | are orthogonal”. 197 | 198 | The messages that are accounted for during this chunk 199 | calculation are those that are in RAM (in the aforementioned 200 | cache), plus the pending acks that are kept in RAM (i.e. 201 | messages that were delivered to consumers and are pending 202 | acknowledgment). If we have 20000 messages in RAM (cache + pending 203 | acks) and the target ram count is set to 8000, we will have to page 204 | out 12000 messages. This means paging will receive a quota of 12000 205 | messages. Each message paged out to disk will consume one unit from 206 | that quota, whether it’s a pending ack or an actual message from the 207 | cache. 208 | 209 | Once we know how many messages need to be paged out, we need to decide 210 | where we should page them from first: pending acks, or the message 211 | cache. If pending acks are growing faster than the message cache, i.e. 212 | more messages are being delivered to consumers than are being 213 | ingested, the algorithm will try to page out pending acks 214 | first, and then try to push messages from the cache to the file 215 | system. If the cache is growing faster than pending acks, then 216 | messages from the cache will be pushed out first. 217 | 218 | The catch here is that paging messages from pending acks (or the cache 219 | if that comes first) might result in the first part of the process 220 | consuming all the quota of messages that need to be pushed to disk. So 221 | if pending acks push 12000 acks to disk, as in our example, this 222 | means we won’t page out messages from the cache, and vice versa. 223 | 224 | This first part of the paging process sent a certain amount of 225 | messages to disk (between acks and messages paged from the cache). The messages 226 | that were paged out just had their contents paged out, but their 227 | position in the queue is still in RAM.
Now the queue needs to decide 228 | if this extra information that’s kept in RAM needs to be paged out as 229 | well, to further reduce memory usage. Here is where 230 | `msg_store_io_batch_size` finally comes into play (coupled with 231 | `msg_store_credit_disc_bound` as well). Let’s try to understand how 232 | they work. 233 | 234 | The settings for `msg_store_credit_disc_bound` affect how internal 235 | credit flow is handled when sending messages to disk. The 236 | `rabbitmq_msg_store` module implements a database that takes care of 237 | persisting messages to disk. Some details about the whys of this 238 | implementation can be found here: RabbitMQ, backing stores, databases 239 | and disks. 240 | 241 | The message store has a credit system for each of the clients that 242 | send writes to it. Every RabbitMQ queue would be a read/write client 243 | for this store. This credit mechanism prevents a 244 | particular writer from overflowing its inbox with messages. Assuming 245 | current default values, when a writer starts talking to the message 246 | store, it receives an initial credit of 2000 messages, and it will 247 | receive more credit once 500 messages are processed. When is this 248 | credit consumed then? Credit is consumed whenever we write to the 249 | message store, but that doesn’t happen for every message. The plot 250 | thickens. 251 | 252 | Since version 3.5.0 it’s possible to embed small messages into the 253 | queue index, instead of having to reach the message store for 254 | that. Messages that are smaller than a configurable setting (currently 255 | 4096 bytes) will go to the queue index when persisted, so those 256 | messages won’t consume this credit. Now, let’s see what happens with 257 | messages that do need to go to the message store. 258 | 259 | Whenever we publish a message that’s determined to be persistent 260 | (persistent messages published to a durable queue), that message 261 | will consume one of these credits. If a message has to be paged out to 262 | disk from the cache mentioned above, it will also consume one 263 | credit. So if during message paging we consume more credits than those 264 | currently available for our queue, the first half of the paging 265 | process might stop, since there’s no point in sending writes to the 266 | message store when it won’t accept them. This means that from the 267 | initial quota of 12000 that we would have had to page out, we only 268 | managed to process 2000 of them (assuming all of them need to go to 269 | the message store). 270 | 271 | So we managed to page out 2000 messages, but we still keep their 272 | positions in the queue in RAM. Now the paging process will determine if 273 | it needs to page out any of these message positions to disk as 274 | well. RabbitMQ will calculate how many of them can stay in RAM, and 275 | then it will try to page out the remainder to disk. For this 276 | second paging to happen, the number of messages that have to be paged 277 | to disk must be greater than `msg_store_io_batch_size`. The bigger 278 | this number is, the more message positions RabbitMQ will keep in RAM, 279 | so again, depending on your particular workload, you need to tune this 280 | parameter as well. 281 | 282 | Another thing we improved significantly in 3.5.5 is the performance of 283 | paging queue index contents to disk. If your messages are generally 284 | smaller than `queue_index_embed_msgs_below`, then you’ll see the 285 | benefit of these changes.
These changes also affect how message 286 | positions are paged out to disk, so you should see improvements in 287 | this area as well. So while having a low `msg_store_io_batch_size` 288 | might mean the queue index will have more work paging to disk, keep in 289 | mind this process has been optimized. 290 | 291 | ## Queue Mirroring ## 292 | 293 | To keep the descriptions above a bit simpler, we avoided bringing 294 | queue mirroring into the picture. Credit flow also affects mirroring 295 | from a channel's point of view. When a channel delivers AMQP messages to 296 | queues, it sends the message to each mirror, consuming one credit from 297 | each mirror process. If any of the mirrors is slow to process the 298 | message, then that particular mirror might be responsible for the 299 | channel being blocked. If the channel is being blocked by a mirror, 300 | and that queue mirror gets partitioned from the network, then the 301 | channel will be unblocked only after RabbitMQ detects the mirror's 302 | death. 303 | 304 | Credit flow also takes part when synchronising mirrored queues, but 305 | this is something you shouldn’t care too much about, mostly because 306 | there’s nothing you can do about it, since mirror synchronisation is 307 | handled entirely by RabbitMQ. 308 | 309 | ## Footnotes ## 310 | 311 | 1. A message can be delivered to more than one queue process. 312 | 2. There are two message stores, one for transient messages and one for persistent messages. 313 | 3. RabbitMQ will call fsync every 200 ms. 314 | -------------------------------------------------------------------------------- /deliver_to_queues.md: -------------------------------------------------------------------------------- 1 | # Deliver To Queues # 2 | 3 | In this document we will be going over quite a few RabbitMQ modules, 4 | since a message crosses all of these once it enters the broker: 5 | 6 | ``` 7 | rabbit_reader -> rabbit_channel -> rabbit_amqqueue -> delegate -> rabbit_amqqueue_process -> rabbit_backing_queue 8 | ``` 9 | 10 | Let's see this process in more detail. 11 | 12 | The process of delivering messages to queues starts during 13 | `basic.publish`, right after the channel receives the result from 14 | calling `rabbit_exchange:route/2`. 15 | 16 | First we need to look up the list of `#amqqueue` records based on the 17 | destinations obtained from `route/2`. These records will be passed to 18 | the function `rabbit_amqqueue:deliver/2`, where they will be used to 19 | obtain the _pids_ of the queue processes where the message is going to 20 | be delivered. Once the master and slave pids have been obtained, 21 | the message can start making its way to a queue process, 22 | which consists of two parts: accounting for credit flow, and casting 23 | the message into the queue process. 24 | 25 | If the message delivery arrived with `flow = true`, then `credit_flow` 26 | must be accounted for this message: one credit for each master pid 27 | where the message should arrive, plus one credit for each slave pid 28 | that receives the message. 29 | 30 | Then the message delivery will be sent to master pids and slave pids, 31 | via the `delegate` framework. The Erlang message will have this shape: 32 | 33 | ```erlang 34 | {deliver, %% message tag 35 | Delivery, %% The Delivery record 36 | SlaveWhenPublished} %% The Pid that received the message, was it a 37 | %% slave when the delivery was published?
This is 38 | %% used in case of slave promotion 39 | ``` 40 | 41 | You can learn more about the delegate framework 42 | [here](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/delegate.erl#L10). 43 | 44 | ## AMQQueue Process Message Handling ## 45 | 46 | At this point the message delivery will finally arrive at the queue 47 | process, implemented as a `gen_server2` callback inside the 48 | `rabbit_amqqueue_process` module. The message from the delegate 49 | framework will be received by the `handle_cast/2` callback. This 50 | callback will ack the `credit_flow` issued above, and it will 51 | monitor the message sender. The message sender is usually the 52 | `rabbit_channel` process that received the publish. This pid is tracked using 53 | the 54 | [pmon module](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/pmon.erl). The 55 | state is kept as part of the `senders` field in the gen_server state 56 | record. Once the message sender is accounted for, the delivery is 57 | passed to the function `deliver_or_enqueue/3`. That is where the 58 | message will either be sent to a consumer or enqueued into the backing 59 | queue. 60 | 61 | ### Mandatory Message Handling ### 62 | 63 | The first thing `deliver_or_enqueue/3` does is to account for the 64 | mandatory flag of the delivery. If the message was published as 65 | mandatory, then at this point the queue process will consider the 66 | message as routed to the queue. To that effect, the queue process will 67 | cast the message `{mandatory_received, MsgSeqNo}` to the channel pid 68 | that received the delivery. The channel process will proceed to 69 | forget the message, since from the point of view of mandatory message 70 | handling, there isn't anything left to do for that particular 71 | delivery. 72 | 73 | Take a look at the 74 | [mandatory message handling guide](./mandatory_message_handling.md) for 75 | more info. 76 | 77 | ### Message Confirm Handling ### 78 | 79 | When handling confirms we need to take two things into account: whether the 80 | queue is durable, and whether the message was published as persistent. If that's 81 | the case, then the queue process will keep track of the `MsgId` in 82 | order to later confirm the message back to the channel that received 83 | it from a producer. To achieve that, the queue process keeps track of 84 | a dictionary in the process state, using the `msg_id_to_channel` record 85 | field to hold it. As the name of the field implies, this dictionary 86 | maps _msg ids_ to _channels_. When a message is finally persisted to 87 | disk by the backing queue, the BQ will notify the queue process, 88 | which will send the confirm back to the channel using the 89 | `msg_id_to_channel` dictionary just mentioned. 90 | 91 | If the queue was non-durable, or the message was published as 92 | transient, then the queue process will proceed to issue a confirm back 93 | to the channel that sent the message in. 94 | 95 | The function `rabbit_misc:confirm_to_sender/2` is the one taking care 96 | of sending confirms back to channels. 97 | 98 | Take a look at the 99 | [publisher confirm handling guide](./publisher_confirms.md) for more info. 100 | 101 | ### Check for Message Duplicates ### 102 | 103 | The next step is to check if the message has been seen by the queue 104 | before. If the backing queue responds that the message is a duplicate, 105 | then processing stops right here, since there's nothing left to do 106 | for this delivery, so `deliver_or_enqueue/3` simply returns.
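

Putting the steps above together, the top of `deliver_or_enqueue/3` can be pictured roughly as follows. The helper names here are hypothetical and the real `rabbit_amqqueue_process` code threads quite a bit more state; this is only a schematic of the order of operations described in this guide:

```erlang
%% Schematic only: hypothetical helper names, not the actual code.
deliver_or_enqueue(Delivery = #delivery{message = Message}, Delivered, State) ->
    %% 1. mandatory handling: tell the channel the message was routed
    State1 = maybe_notify_mandatory(Delivery, State),
    %% 2. confirm handling: track the MsgId or confirm immediately
    State2 = maybe_track_confirm(Delivery, State1),
    %% 3. duplicate check: a duplicate leaves nothing else to do
    case is_duplicate(Message, State2) of
        {true, State3}  -> State3;
        {false, State3} ->
            %% 4. try a consumer first, enqueue otherwise (see below)
            attempt_delivery_or_enqueue(Delivery, Delivered, State3)
    end.
```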
107 | 108 | ### Attempt to Deliver the Message to a Consumer ### 109 | 110 | To try to send the message delivery to a consumer, the function 111 | `attempt_delivery/4` is called. This function will in turn call 112 | `rabbit_queue_consumers:delivery/3` which takes a `FetchFun`, the 113 | `QueueName`, and the `Consumers State` for this particular queue. The 114 | Fetch Fun will return the message that will be delivered to the 115 | consumer (if a consumer is available). This function deals with 116 | message acknowledgment from the point of view of the queue. If the 117 | consumer is in `ackmode = true`, then the message will be 118 | `publish_delivered` into the backing queue; otherwise the message will 119 | be discarded. 120 | 121 | Discarding a message involves confirming the message, in case that's 122 | required for this particular delivery, and telling the backing queue 123 | to discard it as well. 124 | 125 | Once the queue has attempted to deliver the message straight to a 126 | consumer, it will call the function `maybe_notify_decorators/2`, which 127 | takes care of telling the queue decorators that the consumer state 128 | might have changed. See the [queue decorators](./queue_decorators.md) 129 | guide for more information on how decorators work. 130 | 131 | `attempt_delivery/4` will then return to the 132 | `deliver_or_enqueue/3` function, telling it whether the message was 133 | `delivered` or is still `undelivered`. If the message was 134 | delivered to a consumer, then there's nothing else to do, and 135 | `deliver_or_enqueue/3` will simply return. Otherwise there's still 136 | more to do. 137 | 138 | ### Handling Undelivered Messages ### 139 | 140 | When handling undelivered messages, there's a special case that can 141 | be considered an optimization. If the queue has a 142 | [TTL](https://www.rabbitmq.com/ttl.html) of 0, and no 143 | [DLX](https://www.rabbitmq.com/dlx.html) has been set up, then there 144 | is no point in queueing this message, so it can be discarded in the 145 | same way as explained above. 146 | 147 | If a message cannot be discarded, then it has to be enqueued, so the 148 | queue process will `publish` the message into the backing queue. After 149 | the message has been published, we need to enforce the various 150 | policies that might apply to this queue, like `max-length` for 151 | example. This means we need to see if the queue head has to be 152 | dropped. Once that's enforced, we also have to check if we need 153 | to drop expired messages. Both these functions work in conjunction 154 | with the DLX feature mentioned above. At this point 155 | `deliver_or_enqueue/3` returns. 156 | 157 | ## Bookkeeping ## 158 | 159 | Even though we are done with the delivery once it has been handled by the 160 | respective queue processes where it was sent, we still need to perform 161 | some bookkeeping on the channel side. The `rabbit_amqqueue:deliver/2` 162 | function will return a list of `QPids` that received the 163 | message. This list of pids will now be used for bookkeeping. 164 | 165 | ### Queue Monitoring ### 166 | 167 | The first thing to do is to monitor the queue pids to which the 168 | message was delivered. This is done, among other things, to account for 169 | credit flow in case the queue goes down. We don't want to block the 170 | channel forever if a queue that's blocking it is actually down.
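

In plain Erlang terms, this monitoring follows the usual monitor/`'DOWN'` pattern. A minimal sketch of its shape (the helper and record field names here are approximations, not the exact channel code):

```erlang
%% Monitor every queue pid the message was delivered to, using pmon.
monitor_delivering_queues(QPids, State = #ch{queue_monitors = QMons}) ->
    State#ch{queue_monitors = pmon:monitor_all(QPids, QMons)}.

%% When a monitored queue dies, release any credit_flow state attributed
%% to it and let mandatory/confirm bookkeeping react to the crash.
handle_info({'DOWN', _MRef, process, QPid, Reason}, State) ->
    credit_flow:peer_down(QPid),
    {noreply, handle_publishing_queue_down(QPid, Reason, State)}.
```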
171 | 172 | Take a look at the `handle_info` channel callback for the case when a 173 | `DOWN` message is 174 | [received](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_channel.erl#L818). 175 | 176 | ### Process Mandatory Messages ### 177 | 178 | Here, if the message wasn't delivered to any queue, it's time to 179 | issue a `basic.return` back to the publisher that sent it. If the 180 | message was delivered to queues, then those `QPids` will be kept in 181 | a dictionary for later processing. 182 | 183 | As explained above, once the queue process receives the message 184 | delivery, it will take care of updating the `mandatory` 185 | dictionary on the channel's state. 186 | 187 | ### Process Confirms ### 188 | 189 | As with mandatory messages, if the message wasn't routed to 190 | any queue, then it's time to record the message as confirmed. If the 191 | message was delivered to some queues, then it will be tracked as 192 | unconfirmed until the queue updates the message status. 193 | 194 | ### Stats Update ### 195 | 196 | The final step for the channel is to account for stats, so it will 197 | update the exchange stats, indicating that a message has been routed, 198 | and then it will also update the queue stats, to indicate that a 199 | message was delivered to this or that queue. 200 | 201 | ## Summary ## 202 | 203 | Delivering a message to a RabbitMQ queue is quite an involved process, 204 | and we didn't even touch on queue mirroring! The main things to 205 | account for when handling a delivery are mandatory messages and 206 | message confirms. Both have to be handled accordingly, and the whole 207 | process is coordinated between the channel process and the queue 208 | process that receives the message. Other than that, the queue needs to 209 | see if the message can be delivered to a consumer or if it has to be 210 | enqueued for later. Once this is handled, the queue needs to enforce 211 | the various policies that can be applied to it, like TTLs, or 212 | max-lengths. 213 | 214 | To understand what happens once a message arrives at a queue, take a 215 | look at the [variable queue](./variable_queue.md) guide. 216 | -------------------------------------------------------------------------------- /exchange_decorators.md: -------------------------------------------------------------------------------- 1 | # Exchange Decorators # 2 | 3 | Exchange decorators are modules implementing a behaviour that lets 4 | you extend existing exchanges. For example, you might want to perform 5 | some actions only when the exchange is created, or deleted, but leave 6 | the whole routing logic to the underlying exchange. 7 | 8 | Decorators are usually associated with exchanges via policies. 9 | 10 | See the `active_for/1` 11 | [callback](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_exchange_decorator.erl#L70) 12 | to understand which functions on the exchange would be decorated. 13 | 14 | Take a look at the 15 | [Sharding Plugin](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbitmq_sharding/src/rabbit_sharding_exchange_decorator.erl) 16 | and the 17 | [Federation Plugin](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbitmq_federation/src/rabbit_federation_exchange.erl) 18 | to see how exchange decorators are implemented.
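

As a rough skeleton, a lifecycle-only decorator could look like the following. The callback names and arities are taken from the `rabbit_exchange_decorator` behaviour as of the 3.x series; treat the exact set as indicative and check the behaviour module linked above for the authoritative one:

```erlang
%% A minimal no-op exchange decorator sketch.
-module(my_noop_decorator).
-behaviour(rabbit_exchange_decorator).

-export([description/0, serialise_events/1, create/2, delete/3,
         policy_changed/2, add_binding/3, remove_bindings/3,
         route/2, active_for/1]).

description() -> [{description, <<"no-op exchange decorator">>}].

serialise_events(_X) -> false.

%% Lifecycle hooks: a real decorator could start or stop helper
%% processes here when the exchange is created or deleted.
create(_Tx, _X)                   -> ok.
delete(_Tx, _X, _Bs)              -> ok.
policy_changed(_OldX, _NewX)      -> ok.
add_binding(_Serial, _X, _B)      -> ok.
remove_bindings(_Serial, _X, _Bs) -> ok.

%% Extra destinations contributed by the decorator; none here.
route(_X, _Delivery) -> [].

%% 'noroute' means: decorate the lifecycle callbacks but leave the
%% routing logic entirely to the underlying exchange.
active_for(_X) -> noroute.
```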
19 | -------------------------------------------------------------------------------- /interceptors.md: -------------------------------------------------------------------------------- 1 | # Interceptors # 2 | 3 | Interceptors are modules implementing a behaviour that allows plugin 4 | authors to intercept and modify AMQP methods before they are handled 5 | by the channel process. They were originally created for the 6 | development of the 7 | [Sharding Plugin](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbitmq_sharding/README.extra.md#intercepted-channel-behaviour) 8 | to facilitate mapping queue names as specified by users vs. the actual 9 | names used by sharded queues. Another plugin using interceptors is the 10 | [Message Timestamp Plugin](https://github.com/rabbitmq/rabbitmq-message-timestamp) 11 | which injects timestamps into message properties during 12 | `basic.publish`. 13 | 14 | An interceptor must implement the `rabbit_channel_interceptor` 15 | behaviour. The most important callback is `intercept/3`, where an 16 | interceptor will be provided with the original AMQP method record that 17 | the channel should process, the AMQP method content, if any, and the 18 | interceptor state (see 19 | [init/1](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_channel_interceptor.erl#L36)). This 20 | callback should take the AMQP method that was passed to it, and the 21 | content, and modify them accordingly. For example, the Sharding Plugin 22 | will receive a `basic.consume` method for a sharded queue called 23 | `my_queue`, and it will map that name to the appropriate shard for the 24 | client that issued the `basic.consume` command, for example: 25 | `sharding: my_queue - 1`. This means that if the channel received the 26 | following record: 27 | 28 | ```erlang 29 | #'basic.consume'{queue = <<"my_queue">>} 30 | ``` 31 | 32 | Then the interceptor will pass back the following transformed method: 33 | 34 | ```erlang 35 | #'basic.consume'{queue = <<"sharding: my_queue - 1">>} 36 | ``` 37 | 38 | This process is transparent to the user and to the channel 39 | code. There's no need for RabbitMQ core developers to add extra 40 | functionality to the `rabbit_channel` in order to support sharding, 41 | since the interceptors take care of that. For example, if we need to 42 | inject a timestamp into each message that crosses the broker, instead 43 | of modifying the `rabbit_channel` code to do that, we can just create 44 | a new interceptor for the `basic.publish` method, and inject the 45 | desired timestamp there. 46 | 47 | Interceptors can do more than just modify AMQP methods; they can 48 | also forbid access to them. A good example is again the Sharding 49 | Plugin. If we have a sharded queue called `my_queue`, then it won't 50 | make much sense to allow users to declare queues with that name, so 51 | the sharding interceptor also intercepts the `queue.declare` method, 52 | but in this case, if the queue name provided matches that of a sharded 53 | queue, then a channel error is produced. 54 | 55 | Keep in mind that while we can enable several interceptors, only one 56 | interceptor can intercept a particular AMQP method; otherwise we would 57 | need to define interceptor priorities, plus a way to merge the 58 | results of their invocations.
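

To make this concrete, here is a rough sketch of a timestamping interceptor in the spirit of the Message Timestamp Plugin mentioned above. The callback set follows the `rabbit_channel_interceptor` behaviour; the module name and description are made up, and the actual content manipulation is elided since it depends on content codec helpers:

```erlang
%% Sketch of a basic.publish interceptor; see the behaviour module for
%% the authoritative callback signatures.
-module(my_timestamp_interceptor).
-behaviour(rabbit_channel_interceptor).

-export([description/0, intercept/3, applies_to/0, init/1]).

init(_Ch) -> undefined.

description() -> [{description, <<"adds a timestamp to published messages">>}].

%% Only basic.publish is intercepted; remember that only one enabled
%% interceptor may claim a given AMQP method.
applies_to() -> ['basic.publish'].

intercept(#'basic.publish'{} = Method, Content, _State) ->
    %% A real implementation would decode the content properties here,
    %% set the timestamp field, and re-encode them before returning.
    {Method, Content}.
```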
59 | 60 | ## Enabling Interceptors ## 61 | 62 | To enable interceptors, they have to be registered with the 63 | [rabbit_registry](./rabbit_registry.md), via a 64 | [boot step](./boot_steps.md): 65 | 66 | ```erlang 67 | -rabbit_boot_step({?MODULE, 68 | [{description, "sharding interceptor"}, 69 | {mfa, {rabbit_registry, register, 70 | [channel_interceptor, 71 | <<"sharding interceptor">>, ?MODULE]}}, 72 | {cleanup, {rabbit_registry, unregister, 73 | [channel_interceptor, 74 | <<"sharding interceptor">>]}}, 75 | {requires, rabbit_registry}, 76 | {enables, recovery}]}). 77 | ``` 78 | 79 | Once the interceptor is registered, only new channels will use it to 80 | intercept AMQP methods. Channels that were already running won't load 81 | the interceptor. In a similar fashion, if a plugin that provides 82 | interceptors is disabled, only new channels will stop using its 83 | interceptors; channels that were already running will keep using them. 84 | -------------------------------------------------------------------------------- /internal_events.md: -------------------------------------------------------------------------------- 1 | # Internal Events 2 | 3 | This document describes a mechanism RabbitMQ components use to notify 4 | each other of events. Note that this mechanism is not used to transfer 5 | messages between components or nodes. It is also entirely transient. 6 | 7 | ## Overview 8 | 9 | Client connections, channels, queues, consumers, and other parts of the system 10 | naturally generate events. Other parts of the system can be interested 11 | in observing those events. RabbitMQ has a very minimalistic mechanism that 12 | is used for internal event notifications, both within a single node and 13 | across nodes in a cluster. 14 | 15 | For example, when a policy is modified, RabbitMQ needs to apply it 16 | to matching queues and notify the queues that no longer match. 17 | These events are irrelevant to clients and have no relation to messaging 18 | protocols. 19 | 20 | ## Events, Metrics, Stats 21 | 22 | Perhaps the heaviest user of this notification subsystem, known 23 | as `rabbit_event`, is the [management plugin](./metrics_and_management_plugin.md). 24 | The management plugin's metrics are collected in a variety of ways but often 25 | transferred over the internal event subsystem. 26 | 27 | For example, when a connection is accepted, authenticated and access 28 | to the target virtual host is authorised, it will emit an event of type 29 | `connection_created`. When a connection is closed or fails for any reason, 30 | a `connection_closed` event is emitted. Events from connections, channels, queues, consumers, 31 | and so on are processed and stored as metrics 32 | to be later served over HTTP to the management plugin UI application. 33 | 34 | 35 | ## rabbit_event 36 | 37 | Both internal event publishers and consumers interact with the notification 38 | subsystem using a single module, `rabbit_event`. Publishers typically 39 | use the `rabbit_event:notify/2` function; consumers register 40 | [gen_event](https://learnyousomeerlang.com/event-handlers) event handlers. 41 | 42 | Every event is an instance of the `#event` record. 43 | An event has a name (e.g. `connection_created` or `queue_deleted`), a timestamp and a 44 | dictionary-like data structure (a proplist) for the payload. 45 | 46 | The mechanism is very minimalistic: every event handler receives a copy 47 | of every event and ignores those that are irrelevant to it.
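

For instance, a component emits an event with `rabbit_event:notify(connection_created, Props)`, where `Props` is a proplist, and a consumer installs a `gen_event` handler that pattern matches on the event types it cares about. A minimal sketch of such a handler follows; the `#event{}` record comes from `rabbit.hrl` and the include path is indicative:

```erlang
%% A gen_event handler that reacts to one event type and ignores the rest.
-module(my_event_handler).
-behaviour(gen_event).

-include_lib("rabbit_common/include/rabbit.hrl"). %% defines #event{}

-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

init([]) -> {ok, no_state}.

handle_event(#event{type = connection_created, props = Props}, State) ->
    io:format("connection created: ~p~n", [Props]),
    {ok, State};
handle_event(_Event, State) ->
    %% every handler sees every event; irrelevant ones are ignored
    {ok, State}.

handle_call(_Request, State)        -> {ok, not_understood, State}.
handle_info(_Info, State)           -> {ok, State}.
terminate(_Arg, _State)             -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```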
48 | 49 | 50 | ## Acting User Details 51 | 52 | Starting with RabbitMQ 3.7.0, internal components try to associate an 53 | acting user with each emitted event, where possible. For example, 54 | if a channel is opened on a connection, the acting user is the user 55 | of that connection. 56 | 57 | In some cases there is no acting user or it cannot be known, for example, 58 | when a connection is forcefully closed via CLI tools. In such cases 59 | dummy usernames are used, e.g. `rmq-internal` or `rmq-cli`. 60 | 61 | 62 | ## rabbitmq-event-exchange Plugin 63 | 64 | [rabbitmq-event-exchange](https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_event_exchange) is a plugin that consumes internal events 65 | and re-publishes them to a topic exchange, thus exposing the events 66 | to clients (applications). 67 | 68 | This can be used by monitoring and audit systems. 69 | -------------------------------------------------------------------------------- /mandatory_message_handling.md: -------------------------------------------------------------------------------- 1 | # Mandatory Message Handling # 2 | 3 | When we publish a message with the mandatory flag on, the broker 4 | must notify the publisher, via the `basic.return` AMQP command, if 5 | the message is not routed to any queue. In this guide we will 6 | see how the channel handles mandatory messages. 7 | 8 | ## Tracking Mandatory Messages ## 9 | 10 | Note: The implementation was changed in 11 | [3.8.0](https://github.com/rabbitmq/rabbitmq-server/pull/1831), 12 | removing the dtree structure described below. 13 | 14 | Mandatory messages are tracked in the `mandatory` field of the 15 | channel's state record. Messages are tracked using our own 16 | [dtree](https://github.com/rabbitmq/rabbitmq-server/blob/v3.7.28/src/dtree.erl) 17 | data structure. As explained in that module's documentation, entries in 18 | the _dual-index tree_ are stored using a primary key, a set of 19 | secondary keys, and a value. In the case of tracking mandatory 20 | messages we have: 21 | 22 | - primary key: the `MsgSeqNo` assigned to the message by the channel 23 | - secondary keys: the list of queue pids where the message was routed 24 | - value: the actual message 25 | 26 | Keep in mind that delivering the message to queues is an asynchronous 27 | operation; this means that in the `deliver_to_queues/2` function 28 | inside the channel, we just know to which queues the message was 29 | routed, but this doesn't mean it has arrived there. Therefore, only 30 | when a queue has accepted the message can we forget about it and 31 | consider it as routed. 32 | 33 | ## Forgetting About Mandatory Messages ## 34 | 35 | Once a queue has received a message (in other words, the message was 36 | successfully routed to the queue), the queue process will cast the 37 | following message back to the channel pid: 38 | 39 | ``` 40 | {mandatory_received, MsgSeqNo} 41 | ``` 42 | 43 | The channel will then proceed to use the `MsgSeqNo` to forget about 44 | the mandatory message it was tracking, by deleting it from the 45 | `mandatory` dtree in the channel state. 46 | 47 | Keep in mind that mandatory messages only require that a return is 48 | sent in case the message is unroutable, so it's safe to forget about 49 | it once the message has been routed to a queue. 50 | 51 | ## Sending Returns ## 52 | 53 | If a mandatory message cannot be routed to any queue, then we need to 54 | send `basic.return`s back to the publisher.
This is done in two 55 | different places, responding to different situations. 56 | 57 | The first one is the obvious one: if the message is not routed to any 58 | queue, then we can safely send a `basic.return` back. Take a look at 59 | the function `rabbit_channel:process_routing_mandatory/5` for more details. 60 | 61 | The other situation arises when a queue to which we have just published 62 | messages crashes before it is able to receive the message. As 63 | explained in the [deliver to queues](./deliver_to_queues.md) guide, we 64 | monitor the QPids where the message was delivered. If the monitor 65 | reports that the queue has crashed, then we will send a `basic.return` 66 | for all the messages that were delivered to the queues that 67 | crashed. Take a look at the function 68 | `rabbit_channel:handle_publishing_queue_down/3` for more information. 69 | 70 | ## Related guides ## 71 | 72 | - [basic publish](./basic_publish.md) 73 | - [deliver to queues](./deliver_to_queues.md) 74 | -------------------------------------------------------------------------------- /metrics_and_management_plugin.md: -------------------------------------------------------------------------------- 1 | # Metrics and Management Plugin Architecture (3.6.7+) 2 | 3 | This document describes key implementation aspects of the [RabbitMQ management plugin](https://www.rabbitmq.com/management.html) 4 | starting with version 3.6.7. Earlier versions of the plugin had a substantially different 5 | architecture. 6 | 7 | ## Overview 8 | 9 | Since 3.6.7 the management plugin has been re-designed to spread the memory 10 | used for statistics across the entire RabbitMQ cluster instead of aggregating 11 | it all in a single node. Doing this isn't free. There is a trade-off: higher 12 | metric latency and more processing in exchange for memory stability. 13 | 14 | 15 | ## Components 16 | 17 | There are four main components: 18 | 19 | * [Internal event notifications](./internal_events.md) 20 | * Core metrics 21 | * rabbitmq-management-agent 22 | * rabbitmq-management 23 | 24 | 25 | 26 | ## Core metrics 27 | 28 | Core metrics are implemented in the RabbitMQ server itself, consisting of 29 | a set of ETS tables storing either counters or proplists containing details 30 | or metrics of various entities. The schema of each table is documented in 31 | [rabbit_core_metrics.hrl](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/include/rabbit_core_metrics.hrl) 32 | in `rabbitmq-common`. 33 | 34 | These are mostly counters that are incremented in real time as message interactions occur 35 | in queues, channels, exchanges, etc. 36 | 37 | This replaces the previous approach of emitting events containing metrics 38 | at regular intervals. `created` and `deleted` events are still emitted; 39 | however, `stats` events have been removed. 40 | 41 | Because no unbounded queues are involved, this approach should have a fixed 42 | memory overhead in relation to the number of active entities in the system. 43 | 44 | 45 | 46 | ## Management Agent 47 | 48 | [rabbitmq-management-agent](https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_management_agent) is responsible for turning core metrics into 49 | data structures suitable for `rabbitmq-management` consumption. This is 50 | done on a per-node basis. There is no inter-node communication involved. 51 | 52 | The management agent runs a set of metrics collector processes. There is one 53 | process per core metrics table.
Each collector periodically reads its associated 54 | core metrics table and performs some table-specific processing which produces 55 | new data points to be inserted into the management metrics tables (defined in 56 | [rabbitmq_mgmt_metrics.hrl](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbitmq_management_agent/include/rabbit_mgmt_metrics.hrl)). 57 | The collection interval is determined by the smallest configured retention interval. 58 | 59 | In addition to the collector processes there is a garbage collection event 60 | handler that handles the `delete` events emitted by the various processes to ensure 61 | stats are completely cleared up. To make this efficient there is also a set of 62 | index tables (see `rabbitmq_mgmt_metrics.hrl`) that allow the gc process to 63 | remove all stats for a particular entity. 64 | 65 | The management agent plugin also hosts the `rabbitmq_mgmt_external_stats` process 66 | that periodically updates the core metrics tables with node-specific stats 67 | (such as free disk space or file descriptors available, data and log directory locations, et cetera). 68 | Arguably this should be moved to the core at some point. 69 | 70 | It is worth noting that the latency of metric processing is now related to the retention 71 | interval and is typically higher than in the previous version. To put it differently, it can 72 | take longer for the stats DB to have up-to-date information after a particular event occurs. 73 | This has no effect on the user, but test suites that use the HTTP API would often 74 | [need adapting](https://github.com/michaelklishin/rabbit-hole/blob/master/bin/ci/before_build.sh#L11). 75 | 76 | 77 | ### exometer_slide 78 | 79 | The [exometer_slide](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbitmq_management_agent/src/exometer_slide.erl) 80 | module is a key part of the management stats processing. 81 | It allows us to reasonably efficiently store a sliding window of incoming metrics 82 | and also perform various processing on this window. It was extracted from the 83 | [exometer_core](https://github.com/Feuerlabs/exometer_core) project but has 84 | since been heavily modified to fit our specific needs. 85 | 86 | One notable addition is the "incremental" slide type that is used to aggregate 87 | data from multiple sources. A typical example would be vhost message rates. 88 | 89 | 90 | ## HTTP API 91 | 92 | The `rabbitmq-management` plugin is now mostly a fairly thin HTTP API layer. 93 | 94 | It also handles the distributed querying and stats merging logic. When a stats 95 | request comes in, the plugin contacts each node in parallel for a set of "raw" 96 | stats (typically `exometer_slide` instances). It uses the [delegate](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/delegate.erl) 97 | module for this and has its own `delegate` supervision tree to avoid affecting 98 | the one used for core rabbit delegations. Once stats for 99 | each node have been collected, it merges the data and then processes it 100 | (for example, turning sliding window data points into rates) for API 101 | consumption. Most of this logic is implemented in the `rabbit_mgmt_db` module. 102 | 103 | This distributed querying/merging is arguably the most complex part of the stats 104 | system. 105 | 106 | ### Distributed Querying and Aggregation of Complex Stats 107 | 108 | Because the information returned by the HTTP API is fairly heavily augmented (e.g.
109 | a request for a queue would also contain channel details) we often have to 110 | perform multiple distributed queries in response to a stats request. 111 | For example, to get the channel details for a queue we first have to fetch the 112 | queue stats, inspect the consumers attached to that queue, then query for the 113 | channel details based on the consumer channel. 114 | 115 | There are also inefficiencies when listing entities whose number could 116 | be unbounded (queues, channels, exchanges and connections). 117 | As management allows sorting on almost any stat, including rates, we always 118 | need to fetch _all_ entity stats from each node, merge, sort, then typically 119 | return a smaller page of items to the API. For systems with lots of such 120 | entities this can become very inefficient, as potentially large amounts of data 121 | need to travel between nodes for each request. Therefore all requests that can 122 | return large numbers of entities go through an adaptive cache process that adjusts 123 | its cache expiry time based on how long it took to fetch all that data. This 124 | should provide some degree of protection against excessive entity listings. It 125 | would be prudent to reduce the frequency of these queries if at all possible 126 | in heavily loaded systems. 127 | -------------------------------------------------------------------------------- /mirroring.md: -------------------------------------------------------------------------------- 1 | The essay at the top of rabbit_mirror_queue_coordinator has quite a 2 | decent overview of how mirroring works. In order to avoid repetition, 3 | this will aim to be an even higher-level summary. 4 | 5 | How mirroring works 6 | ------------------- 7 | 8 | In very quick terms: the master is a rabbit_backing_queue (BQ) 9 | implementation "between" the rabbit_amqqueue_process and 10 | rabbit_variable_queue (VQ), while the slave is a full process 11 | implementing most of a queue (again in terms of VQ/BQ). They 12 | communicate via GM. Since the master can't receive messages in its own 13 | right, there is also an associated coordinator process. 14 | 15 | See rabbit_mirror_queue_coordinator for much more. 16 | 17 | How mirroring is controlled 18 | --------------------------- 19 | 20 | Policies call into queues to tell them their policy has changed, 21 | which calls into rabbit_mirror_queue_misc to update mirrors. Each 22 | mirroring mode is an implementation of the behaviour 23 | rabbit_mirror_queue_mode - rmq_misc selects the appropriate rmq_mode, 24 | asks it which nodes should have slaves, and starts and stops slaves as 25 | appropriate. 26 | 27 | Eager synchronisation 28 | --------------------- 29 | 30 | The master and all slaves need to come to a halt while synchronising: 31 | we assume that handling publishes, deliveries or acks while 32 | synchronisation is ongoing is too hard. Therefore although the master 33 | and slaves are gen_servers, they essentially go into 34 | manually-implemented selective "receive" loops while syncing, only 35 | responding to a small set of messages and letting others back up - 36 | flow control will typically stop publishers from publishing too 37 | much. While syncing, the processes do respond to info requests and 38 | emit info messages periodically, so that rabbitmqctl and management do 39 | not become unresponsive and outdated respectively, but otherwise they 40 | are dead to the world.
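

The manually-implemented selective "receive" mentioned above is a standard Erlang pattern: a loop whose `receive` clauses match only the sync-protocol messages (plus info requests), leaving everything else queued up in the mailbox. A schematic sketch with made-up message and function names, not the actual sync code:

```erlang
%% While syncing, only a small set of messages is handled; all others
%% back up in the mailbox until the loop exits (flow control throttles
%% publishers in the meantime).
sync_loop(State) ->
    receive
        {sync_msg, Msg} ->
            sync_loop(handle_sync_msg(Msg, State));
        {info, From} ->
            From ! {info_reply, info(State)},
            sync_loop(State);
        {sync_complete, Result} ->
            {done, Result, State}
    end.
```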
41 | 42 | Because of this, we need to take care not to interfere with the state 43 | of the master too much - leaving it with a different flow control 44 | state or set of monitors than it entered the sync process with would 45 | lead to subtle bugs. The master therefore spawns a local "syncer" 46 | process which handles communication with the slaves to sync them. 47 | 48 | See rabbit_mirror_queue_sync for more details on how exactly the sync 49 | protocol works. 50 | 51 | GM 52 | -- 53 | 54 | The gm.erl module contains an essay on how GM works. Unfortunately 55 | it's not easy to understand; a property which it shares with GM in 56 | general. 57 | 58 | The overall principle is fairly clear: GM processes form a ring, 59 | around which messages are broadcast. Each message goes round twice, 60 | once to publish and once to acknowledge. New members enter the ring at 61 | any point; if a member dies the ring needs to heal and ensure that 62 | messages which might have been lost are sent again. 63 | 64 | The last part is tricky. Much of the complexity in GM is around 65 | knowing which members have what knowledge, and how to bring members up 66 | to speed if one fails. 67 | 68 | Additionally, the fact that ring information travels through two 69 | routes (around the ring itself, and through Mnesia) makes it even 70 | harder to reason about. 71 | 72 | Finally, GM is not at all designed to cope with partial network 73 | partitions: if A is partitioned from B, then B can remove it from 74 | Mnesia, and that information can leak back to A via C. We currently 75 | don't handle this situation well; this is the biggest unsolved 76 | problem in RabbitMQ. 77 | 78 | It might be worth replacing GM altogether with something new. 79 | 80 | If modifying it, be very conservative about even small changes. 81 | -------------------------------------------------------------------------------- /networking_and_connections.md: -------------------------------------------------------------------------------- 1 | # Networking and Connections 2 | 3 | This guide provides an overview of how networking (TCP socket acceptors, listeners) and connections 4 | (for multiple protocols) are implemented in RabbitMQ. 5 | 6 | ## Boot Step(s) 7 | 8 | When RabbitMQ starts, it executes a directed graph of boot steps, which depend on each other. 9 | One of the steps is responsible for starting the networking-related machinery: 10 | 11 | ``` erlang 12 | -rabbit_boot_step({networking, 13 | [{mfa, {rabbit_networking, boot, []}}, 14 | {requires, log_relay}]}). 15 | ``` 16 | 17 | As you can see above, the function that kicks things off is `rabbit_networking:boot/0`. 18 | 19 | ### Listener Tracking 20 | 21 | Before we dive into `rabbit_networking:boot/0`, it should be explained 22 | that listeners are tracked using a Mnesia table, `rabbit_listener`. 23 | 24 | The purpose of tracking listeners is two-fold: 25 | 26 | * to make it possible to stop active listeners during shutdown 27 | * to make it possible to list them e.g. in the management UI 28 | 29 | The table is updated by `rabbit_networking:tcp_listener_started/3` and 30 | `rabbit_networking:tcp_listener_stopped/3`. 31 | 32 | ### Distribution Listener 33 | 34 | Erlang distribution TCP listener is also tracked: `rabbit_networking:boot/0` 35 | uses node name and `erl_epmd:port_please/2` to determine distribution port. 
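

For example, something along these lines can be used to ask the local `epmd` for the distribution port of a node whose short name is `rabbit` (the hostname is just an illustration; `erl_epmd:port_please/2` returns `{port, Port, Version}` or `noport`):

```erlang
%% Ask epmd on the given host for the distribution listener port of
%% the node registered under the name "rabbit".
case erl_epmd:port_please("rabbit", "localhost") of
    {port, Port, _Version} -> {ok, Port};
    noport                 -> {error, not_registered}
end.
```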
36 | 37 | ### Messaging Protocol Listeners 38 | 39 | Every supported protocol usually has one or two listeners: 40 | 41 | * Plain ("X over TCP") 42 | * TLS ("X over TLS") 43 | 44 | Listeners are collected from config file sections of RabbitMQ core 45 | and plugins that provide protocol support (e.g. STOMP). 46 | 47 | `rabbit_networking:boot_tcp/0` and `rabbit_networking:boot_ssl/0` start plain TCP and 48 | TLS listeners, respectively. 49 | 50 | 51 | ## Listener Process Tree 52 | 53 | RabbitMQ as of 3.6.0 uses [Ranch](https://github.com/ninenines/ranch) in [embedded mode](https://github.com/ninenines/ranch/blob/master/doc/src/guide/embedded.asciidoc) 54 | to accept TCP connections. 55 | 56 | A listener is represented by two processes, which are 57 | started under the `tcp_listener_sup` supervisor: 58 | 59 | * `tcp_listener` 60 | * `ranch_listener_sup` 61 | 62 | The former handles listener tracking (see above); the latter is 63 | a [Ranch listener](https://github.com/ninenines/ranch/blob/master/doc/src/guide/listeners.asciidoc) process. 64 | 65 | `tcp_listener_sup` itself is a child of `rabbit_sup`, the top-level 66 | RabbitMQ supervisor. 67 | 68 | Every listener has one or more acceptors (under `ranch_acceptors_sup`) 69 | and a supervisor for accepted client connections (under `ranch_conns_sup`). 70 | 71 | `ranch_conns_sup` supervises client connections, which will be covered in more 72 | detail in the following section. 73 | 74 | 75 | ## AMQP 0-9-1 Connection Process Tree 76 | 77 | Every AMQP 0-9-1 connection has a supervisor, `rabbit_connection_sup`, which is placed under 78 | `ranch_conns_sup` in the process tree. It supervises two processes: 79 | 80 | * `rabbit_reader`: an important module in the protocol, see below 81 | * `rabbit_connection_helper_sup`: supervises helper processes 82 | 83 | So the hierarchy of processes looks like this: 84 | 85 | ``` org-mode 86 | * rabbit_connection_sup 87 | ** rabbit_reader 88 | ** rabbit_connection_helper_sup 89 | *** rabbit_channel_sup_sup 90 | *** heartbeat_receiver 91 | *** heartbeat_sender 92 | *** rabbit_queue_collector 93 | ``` 94 | 95 | ### rabbit_reader 96 | 97 | `rabbit_reader` is one of the key modules in the AMQP 0-9-1 implementation. 98 | 99 | This is a `gen_server`-like process that handles binary data parsing, 100 | authentication and the connection negotiation state machine, and keeps track 101 | of channel processes. 102 | 103 | This module also handles protocol "hand-offs", such as the one to the AMQP 1.0 reader 104 | when an AMQP 1.0 client connects (despite being a completely different protocol, 105 | it uses the same port as AMQP 0-9-1). 106 | 107 | ### Auxiliary processes 108 | 109 | Every connection has several auxiliary processes supervised by 110 | `rabbit_connection_helper_sup`: 111 | 112 | * `rabbit_channel_sup_sup` 113 | * `heartbeat_receiver` (module: `rabbit_heartbeat`) 114 | * `heartbeat_sender` (module: `rabbit_heartbeat`) 115 | * `rabbit_queue_collector` 116 | 117 | #### rabbit_channel_sup_sup 118 | 119 | Top-level supervisor for channels. Every channel is represented by 120 | a group of processes under a supervisor (`rabbit_channel_sup`). 121 | 122 | #### Heartbeat Processes 123 | 124 | The heartbeat implementation uses two processes, one for sending heartbeats 125 | and another for handling client heartbeats. 126 | 127 | See `rabbit_heartbeat:start_heartbeat_sender/4` and `rabbit_heartbeat:stop_heartbeat_sender/4`.
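

As an illustration only (this is not how `rabbit_heartbeat` is structured), a heartbeat sender boils down to a timer loop that emits a heartbeat frame whenever no other traffic went out during the interval. The AMQP 0-9-1 heartbeat frame is a fixed 8-byte sequence: frame type 8, channel 0, zero-length payload, frame-end octet 206:

```erlang
%% Toy heartbeat sender loop; an 'activity' message from the writer
%% signals that some other frame was already sent, so this beat is
%% skipped. Not the actual rabbit_heartbeat implementation.
heartbeat_sender(Sock, IntervalMillis) ->
    receive
        activity -> ok
    after IntervalMillis ->
        ok = gen_tcp:send(Sock, <<8, 0:16, 0:32, 206>>)
    end,
    heartbeat_sender(Sock, IntervalMillis).
```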
128 |
129 | #### Queue Collector
130 |
131 | The queue collector keeps track of exclusive queues, whose lifecycle
132 | is tied to that of their declaring connection. Whenever a connection
133 | is closed or lost, all exclusive queues that belonged to it must be
134 | cleaned up.
135 |
136 | `rabbit_queue_collector` is a `gen_server` which handles the above.
137 |
138 |
139 | ## STOMP Connection Process Tree
140 |
141 | For STOMP, the TCP listener and Ranch supervision tree are similar to
142 | those of AMQP 0-9-1 (see above) except that the `tcp_listener_sup` supervisor
143 | is under `rabbit_stomp_sup`.
144 |
145 | Every STOMP client connection has a supervisor, `rabbit_stomp_client_sup`, which
146 | supervises two processes:
147 |
148 | * `rabbit_stomp_reader`
149 | * `rabbit_stomp_heartbeat_sup`
150 |
151 | `rabbit_stomp_reader` is similar to `rabbit_reader` but also
152 | handles parsed protocol commands (this structure is new in 3.6.0 and
153 | matches the one used by the MQTT plugin).
154 |
155 | Finally, `rabbit_stomp_heartbeat_sup` supervises heartbeat delivery,
156 | reusing `rabbit_heartbeat`.
157 |
158 |
159 | ## MQTT Connection Process Tree
160 |
161 | The MQTT TCP listener and Ranch supervision tree are effectively identical to
162 | those of STOMP (see above).
163 |
164 | Every MQTT client connection has a supervisor, `rabbit_mqtt_connection_sup`,
165 | which supervises two processes:
166 |
167 | * `rabbit_mqtt_reader` combines protocol parsing, state machine, and command handling
168 | * `rabbit_mqtt_keepalive_sup` that handles MQTT keep-alives (heartbeats),
169 | reusing `rabbit_heartbeat`
170 |
--------------------------------------------------------------------------------
/publisher_confirms.md:
--------------------------------------------------------------------------------
1 | # Publisher Confirms #
2 |
3 | Publisher Confirms are a way to tell publishers that messages have
4 | been accepted by the broker and that the broker now takes full
5 | responsibility for the message (i.e. it was written to disk if it was
6 | persistent, replicated if mirroring was enabled, and so on). Take a
7 | look at the
8 | [publisher confirms](https://www.rabbitmq.com/confirms.html)
9 | documentation in order to understand the feature.
10 |
11 | ## Tracking Confirms ##
12 |
13 | Note: the implementation was slightly changed in
14 | [3.8.0](https://github.com/rabbitmq/rabbitmq-server/pull/1893),
15 | replacing the dtree structure described below.
16 |
17 | Confirms work a bit differently from mandatory messages and
18 | `basic.return`. In the case of mandatory messages we only need to send
19 | a `basic.return` if the message can't be routed, but for publisher
20 | confirms we need to send an `ack` if the message was accepted by the
21 | broker and a `nack` if that's not the case. This means we need two
22 | fields in the channel's state record in order to track this
23 | information.
24 |
25 | The first one is a
26 | [dtree](https://github.com/rabbitmq/rabbitmq-server/blob/v3.7.28/src/dtree.erl)
27 | stored in the field `unconfirmed`, which keeps track of the `MsgSeqNo`
28 | associated with the QPids to which the message was delivered and
29 | the Exchange Name used to publish the message. As explained in the
30 | _dtree_ documentation, entries on the _dual-index tree_ are stored
31 | using a primary key, a set of secondary keys, and a value.
In the case
32 | of tracking unconfirmed messages we have:
33 |
34 | - primary key: the `MsgSeqNo` assigned to the message by the channel
35 | - secondary keys: the list of queue pids where the message was routed
36 | - value: the exchange where the message was published
37 |
38 | The second field used when dealing with confirms is called `confirmed`,
39 | which keeps a list of the messages that were delivered to queues and
40 | for which the queues have taken responsibility. The only thing left to
41 | do with these messages is for the channel to send the `acks` back to
42 | the publisher. This list tracks pairs of `{MsgSeqNo, XName}`.
43 |
44 | ## Marking Messages as Confirmed ##
45 |
46 | Once a queue has dealt with a message (for example persisted it to
47 | disk, in the case of persistent messages), it will send confirms
48 | back to the channel. This is done by the QPid by casting the following
49 | message back to the channel:
50 |
51 | ```erlang
52 | {confirm, MsgSeqNos, QPid}
53 | ```
54 |
55 | The channel will deal with this message in the proper `handle_cast/2`
56 | callback, where it will remove the `MsgSeqNos` from the `unconfirmed`
57 | dtree and then call `record_confirms/2`, passing in those
58 | `MsgSeqNos` and the related exchange names that were obtained from the
59 | dtree. Keep in mind that at this point the function `record_confirms/2`
60 | is only adding the messages to the `confirmed` list. The channel
61 | still needs to send the `acks` back to the publisher.
62 |
63 | Another place where confirms are recorded is in the function
64 | `process_routing_confirm/5`. If the message wasn't routed to any
65 | queue, then it will be immediately marked as confirmed by being added
66 | to the `confirmed` list; otherwise it is tracked in the
67 | `unconfirmed` dtree until a queue confirms the message.
68 |
69 | Finally, as explained in the
70 | [deliver to queues](./deliver_to_queues.md) guide, we monitor the
71 | QPids where the message was delivered. If the monitor reports that the
72 | queue has crashed the function `handle_publishing_queue_down/3` will
73 | be called. In the case of confirms there are two cases: if the queue
74 | had an abnormal exit, then `nacks` will be sent for messages that were
75 | routed to that particular queue that just went down. If the queue had
76 | a normal exit, then messages that went to that queue will be marked as
77 | confirmed.
78 |
79 | ## Sending Out Confirms ##
80 |
81 | Confirms or `acks` are sent back to the publisher whenever the channel
82 | processes a request. For example, once it has dealt with an AMQP method
83 | like `basic.publish`, the channel will then send confirms out, if any.
84 |
85 | For more details, take a look at the functions `reply/2`, `noreply/3`
86 | and `next_state/1` inside the `rabbit_channel` module.
87 |
88 | ## Sending Out Nacks ##
89 |
90 | Nacks are sent to publishers when a queue that should have handled the
91 | message has exited with an abnormal reason. Check the function
92 | `handle_publishing_queue_down/3` for more information.
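To tie the above together, here is a minimal sketch of how the `unconfirmed` dtree is driven (illustrative only; the real logic lives in `rabbit_channel`, and `dtree`'s exact API should be checked against the module):

```erlang
-module(confirm_tracking_sketch).
-export([on_publish/4, on_confirm/3]).

%% Publish in confirm mode: track the message under its MsgSeqNo (primary
%% key), the pids of the queues it was routed to (secondary keys) and the
%% exchange name (value).
on_publish(MsgSeqNo, QPids, XName, Unconfirmed) ->
    dtree:insert(MsgSeqNo, QPids, XName, Unconfirmed).

%% A queue casts {confirm, MsgSeqNos, QPid}: remove that queue from the
%% matching entries; entries left with no queues are fully confirmed and
%% come back as {MsgSeqNo, XName} pairs, ready for record_confirms/2.
on_confirm(MsgSeqNos, QPid, Unconfirmed) ->
    dtree:take(MsgSeqNos, QPid, Unconfirmed).
```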
93 |
94 | ## Related guides and documentation ##
95 |
96 | - [publisher confirms](https://www.rabbitmq.com/confirms.html)
97 | - [basic publish](./basic_publish.md)
98 | - [deliver to queues](./deliver_to_queues.md)
99 |
--------------------------------------------------------------------------------
/queue_decorators.md:
--------------------------------------------------------------------------------
1 | # Queue Decorators #
2 |
3 | Queue decorators are modules, implementing a behaviour, that let
4 | you extend queues. A decorator has a set of callbacks that are
5 | called from the queue process in response to events that might
6 | happen at the queue, like when messages are delivered to consumers,
7 | which might cause the list of active consumers to be updated.
8 |
9 | They were added to the broker as a way to handle
10 | [consumer priorities](https://www.rabbitmq.com/consumer-priority.html),
11 | and are used by the federation plugin to know when to move messages across
12 | [federated queues](https://www.rabbitmq.com/federated-queues.html).
13 |
14 | Decorators need to implement the `rabbit_queue_decorator`
15 | [behaviour](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_queue_decorator.erl)
16 | and are usually associated with queues via policies.
17 |
18 | A queue decorator can receive notifications of the following events:
19 |
20 | - Queue Startup
21 | - Queue Shutdown
22 | - Consumer State Changed (active consumers)
23 | - Queue Policy Changed
--------------------------------------------------------------------------------
/queues_and_message_store.md:
--------------------------------------------------------------------------------
1 | This file attempts to document the overall structure of a queue, and
2 | how persistence works.
3 |
4 | Each queue is a [gen_server2 Erlang process](https://learnyousomeerlang.com/clients-and-servers). The usual pattern of the API and
5 | implementation being in one file is not applied; `rabbit_amqqueue` is
6 | the API (a module) and `rabbit_amqqueue_process` is the implementation (a `gen_server2`).
7 |
8 | Startup
9 | -------
10 |
11 | The queue's supervisor initially starts the process as
12 | rabbit_prequeue. This is a gen_server which determines whether the
13 | process is an HA slave or a regular queue or master (see HA
14 | documentation), and whether it is starting afresh or needs to
15 | recover. This then uses the gen_server2 "become" mechanism to become
16 | the correct sort of process - for this document we'll deal with
17 | rabbit_amqqueue_process for regular queues.
18 |
19 | The queue process decides for itself what it should be (rather than
20 | having some library function that starts different types of processes)
21 | so that it can do the right thing if it crashes and is restarted by
22 | the supervisor - it might have been started as a master but need to
23 | restart as a slave after crashing, for example. Or vice-versa.
24 |
25 | Sub-modules
26 | -----------
27 |
28 | The queue process probably has the most code running in it of any
29 | process; the rabbit_amqqueue_process has had various subsystems broken
30 | out of it into separate modules over the years. The biggest such
31 | break-out is the queue implementation API, described by the
32 | rabbit_backing_queue behaviour.
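To give a flavour of that API surface, here is an illustrative slice of the behaviour (signatures simplified and types loosened; arities are from 3.6.x-era code and change between versions - the behaviour module itself is authoritative):

```
%% An illustrative slice of the rabbit_backing_queue behaviour (BQ).
%% The state is whatever the implementation wants, e.g. VQ's #vqstate{}.
-callback init(Queue :: term(), Recovery :: term(),
               AsyncCallback :: fun()) -> State :: term().
-callback publish(Msg :: term(), MsgProps :: term(),
                  IsDelivered :: boolean(), ChPid :: pid(),
                  Flow :: term(), State :: term()) -> term().
-callback fetch(AckRequired :: boolean(), State :: term()) ->
    {FetchResult :: term(), State1 :: term()}.
-callback ack(AckTags :: [term()], State :: term()) ->
    {MsgIds :: [term()], State1 :: term()}.
-callback len(State :: term()) -> non_neg_integer().
-callback is_empty(State :: term()) -> boolean().
-callback purge(State :: term()) -> {Count :: non_neg_integer(), term()}.
```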
33 |
34 | The aim of the code within rabbit_amqqueue_process is therefore mainly
35 | to take the abstract queue implementation and make it support AMQPish
36 | features, by handling consumers, implementing features like TTL and max
37 | length in terms of lower level APIs, and coordinating everything.
38 |
39 | Recently all the consumer-handling code was moved into
40 | rabbit_queue_consumers.
41 |
42 | rabbit_backing_queue
43 | --------------------
44 |
45 | The behaviour rabbit_backing_queue (BQ) implements a Rabbit-ish queue
46 | with persistence and so on. The module rabbit_variable_queue (VQ) is
47 | the major implementation of this behaviour.
48 |
49 | This split was introduced with the "new" persister in 2.0.0. At the
50 | time this was done so the old persister could be offered as a backup
51 | (rabbit_invariable_queue) if serious bugs were found in the new
52 | implementation. rabbit_invariable_queue is long gone but the mechanism
53 | to configure an alternate module is still there. At various times
54 | there have been proposals to provide alternate queue implementations
55 | (using Mnesia, SQL etc) but this never came to anything. (One
56 | rationale for optional e.g. SQL-based queues is that they would make
57 | queue-browsing, atomic transactions and so on trivial, at the cost of
58 | performance.)
59 |
60 | The BQ behaviour had a secondary use that has turned out to be
61 | important - it provides an API where we can insert a proxy to modify
62 | how the queue behaves by intercepting calls and deferring to
63 | VQ. Currently there are two such proxies: rabbit_mirror_queue_master
64 | (see HA documentation) and rabbit_priority_queue (which implements
65 | priority queues by providing one BQ implemented in terms of several
66 | BQs).
67 |
68 | rabbit_variable_queue
69 | ---------------------
70 |
71 | So this is the meat of the queue implementation. This implements a
72 | queue in terms of various sub-queues, with various degrees of
73 | paged-out-ness.
74 |
75 | publish -> [q1 -> q2 -> delta -> q3 -> q4] -> consumer
76 |
77 | q1 and q4 contain "alpha" messages, meaning messages are entirely
78 | within RAM. q2 and q3 contain "beta" and "gamma" messages, meaning
79 | they have metadata in RAM (message ID, position etc) and contents on
80 | disk. Finally, delta messages are on disk only. Many of the subqueues
81 | can be empty so that messages do not need to pass through all states
82 | if the queue is short.
83 |
84 | The essay at the top of rabbit_variable_queue goes into a great deal
85 | more detail on this.
86 |
87 | Most of the complexity of VQ deals with moving messages between the
88 | various queues in an optimised way. The actual persistence is handled
89 | by rabbit_queue_index (QI) and rabbit_msg_store.
90 |
91 | rabbit_queue_index
92 | ------------------
93 |
94 | QI contains metadata that needs to be held per queue even if one
95 | message is published to multiple queues - publication records with a
96 | small amount of metadata, and delivery / acknowledgement records. In
97 | 3.5.0 the QI was extended to directly handle persistence of tiny
98 | messages to improve performance by reducing the number of I/O ops we
99 | do. The QI exists as "segment" files containing a log of the actions
100 | which have taken place for an ordered segment (i.e. part) of the
101 | queue, and an out-of-order journal which we write to any time anything
102 | happens.
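To make the segment layout concrete: each segment file covers a fixed, contiguous range of sequence ids, so locating the records for a queue position is simple arithmetic. A sketch, assuming the 16384-entries-per-segment layout the QI has historically used:

```
%% Map a queue position (SeqId) to its segment file number and to the
%% entry's position relative to the start of that segment.
-define(SEGMENT_ENTRY_COUNT, 16384).

seq_id_to_seg_and_rel_seq_id(SeqId) ->
    {SeqId div ?SEGMENT_ENTRY_COUNT, SeqId rem ?SEGMENT_ENTRY_COUNT}.
```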
103 |
104 | Note that everything up to this point is within the main queue
105 | process.
106 |
107 | rabbit_msg_store
108 | ----------------
109 |
110 | #### The following note applies to versions prior to 3.7
111 |
112 | -----------------
113 |
114 | There are also two msg_store processes per broker - one for transient
115 | messages and one for persistent ones (the transient one can be deleted at startup).
116 |
117 | -----------------
118 |
119 | Since version 3.7, message stores are organised per vhost; see
120 | [per-vhost message store](#per-vhost-message-store) below.
121 |
122 | -----------------
123 |
124 | The msg_store is a disk-based reference-counting key-value store,
125 | storing messages in log-structured files. Again, see its module for
126 | more details.
127 |
128 | If one message is published to multiple queues, they will all submit
129 | it to the message store, and the store will detect the non-first
130 | requests to record the message and just increment the reference count.
131 |
132 | The message store is designed to allow clients (i.e. queues) to read
133 | from store files directly without calling into the message store
134 | process. Only writes go via the process. There are a number of shared
135 | ETS tables to coordinate what's going on.
136 |
137 | We go to some effort to avoid unnecessary work. For example, the
138 | message store maintains a table of "flying writes" - writes which have
139 | been submitted by queues but not yet actioned. If a request to delete
140 | a message is enqueued before the message is actually written, the
141 | write is cancelled.
142 |
143 | The message store needs an index, from message-id to {file, offset,
144 | etc}. This is also pluggable. The default index is implemented in ETS
145 | and so each message has an in-memory cost.
146 |
147 | The message store index also contains reference counters for messages
148 | and serves as a synchronization point between queues, the message store process
149 | and the GC process. The message store inserts new entries into the index and updates
150 | reference counters, the GC process updates file locations and removes entries
151 | using `delete_object`, and queue processes only read entries.
152 |
153 | Reference-counter updates, file location updates and deletes from the index
154 | should be atomic.
155 |
156 | Message store logic assumes that lookup operations for non-existent message
157 | locations (if a message is not yet written to a file) are cheap.
158 |
159 | See the [message store index behaviour module](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/rabbit_msg_store_index.erl) for more details.
160 |
161 | The message store also needs to be garbage collected. There's an extra
162 | process for GC (so that GC can lock some files and the message store
163 | can concurrently serve from the rest). Within the message store, "GC"
164 | boils down to combining together two files, both of which are known to
165 | have over 50% messages where the ref count has gone to 0. See the
166 | `rabbit_msg_store_gc` module for more details on how that works.
167 |
168 |
169 | Per-vhost message store
170 | ------------------------
171 |
172 | *Per-vhost message store was introduced in version 3.7*
173 |
174 | ### Process structure
175 |
176 | Since version 3.7, queue and message store processes are grouped in
177 | per-vhost supervision trees.
178 |
179 | The goal here is to isolate processes managing data (like queues and message stores)
180 | on different vhosts from each other.
181 | So when there is an issue in one vhost, others can function without interruption.
182 | Vhosts that experienced errors can restart and recover their data or stay "down"
183 | for some time until an operator intervenes and fixes the error.
184 |
185 | The data directories are also isolated per-vhost. Each vhost has its own data
186 | directory with all the queues and message stores in it.
187 |
188 | The supervision tree for two vhosts and two queues per vhost would look like:
189 |
190 | ```
191 |
192 | rabbit_sup
193 | |
194 | |
195 | --- ...
196 | |
197 | |
198 | --- rabbit_vhost_sup_sup
199 | |
200 | |
201 | --- - supervision tree for vhost_1
202 | | |
203 | | |
204 | | ---
205 | | |
206 | | |
207 | | ---
208 | | |
209 | | |
210 | | ---
211 | | |
212 | | |
213 | | --- - persistent message store for vhost_1
214 | | |
215 | | |
216 | | --- - transient message store for vhost_1
217 | | |
218 | | |
219 | | --- - supervisor to contain queues for vhost_1
220 | | |
221 | | |
222 | | --- - vhost_1/queue_1 supervisor
223 | | | |
224 | | | |
225 | | | - vhost_1/queue_1 process
226 | | |
227 | | |
228 | | --- - vhost_1/queue_2 supervisor
229 | | |
230 | | |
231 | | - vhost_1/queue_2 process
232 | |
233 | |
234 | --- - supervision tree for vhost_2
235 | |
236 | |
237 | ---
238 | |
239 | |
240 | ---
241 | |
242 | |
243 | ---
244 | |
245 | |
246 | --- - persistent message store for vhost_2
247 | |
248 | |
249 | --- - transient message store for vhost_2
250 | |
251 | |
252 | --- - supervisor to contain queues for vhost_2
253 | |
254 | |
255 | --- - vhost_2/queue_1 supervisor
256 | | |
257 | | |
258 | | - vhost_2/queue_1 process
259 | |
260 | |
261 | --- - vhost_2/queue_2 supervisor
262 | |
263 | |
264 | - vhost_2/queue_2 process
265 |
266 | ```
267 | The process names in the tree represent their controlling modules; these processes are not registered.
268 |
269 | As you can see, each vhost has its own pair of message stores and all the vhost
270 | queue processes are grouped in the vhost queues supervisor (`rabbit_amqqueue_sup_sup`).
271 |
272 | #### Recovery
273 |
274 | If a queue process fails, it can be restored without impacting other queues.
275 |
276 | If a message store fails, the entire vhost message store will be restarted,
277 | including both message stores and all the vhost queues.
278 | This is because of callback-based publish acknowledgements: if a message store
279 | restarts while queue processes keep going, some messages can never
280 | be acknowledged.
281 |
282 | The vhost restart process follows the same recovery steps as a node start.
283 |
284 | #### More about vhost processes and modules
285 |
286 | ##### rabbit_vhost_sup_sup
287 | --------------------------
288 |
289 | A `simple_one_for_one` supervisor. Serves as a container for vhosts.
290 | Has an API for starting and stopping vhost supervisors, retrieving a vhost supervisor
291 | by name, and checking if a vhost is alive.
292 |
293 | Also manages an ETS table, containing an index of vhost processes.
294 |
295 | The module is aware of the `vhost_restart_strategy` setting, which controls whether a single
296 | vhost failure and inability to restart should take down the entire node.
297 |
298 | If the `rabbit_vhost_sup_sup` supervisor crashes, the node will be shut down.
299 |
300 |
301 | ##### rabbit_vhost_sup_wrapper
302 | ------------------------------
303 |
304 | An intermediate supervisor to control vhost restarts.
305 | It allows several restarts (3 in 5 minutes).
306 | 3 restarts - to handle failures in both message stores;
307 | 5 minutes - so that if there is a data corruption error, there is enough time to hit
308 | the error during recovery, so the supervisor will not retry recoveries forever.
309 |
310 | After the maximum number of restarts it gives up with a `shutdown` message, which is interpreted
311 | by the `rabbit_vhost_sup_sup` supervisor according to the configured `vhost_restart_strategy`.
312 |
313 | The wrapper makes sure that `rabbit_vhost_sup` is started before the recovery process
314 | and is empty, because the recovery process will dynamically add children to `rabbit_vhost_sup`.
315 |
316 | Should this process fail, the vhost will not be restarted. If an exit signal is
317 | not `normal` or `shutdown`, the `rabbit_vhost_sup_sup` process will crash,
318 | which will take down the node.
319 |
320 |
321 | ##### rabbit_vhost_process
322 | --------------------------
323 |
324 | An entity process for a vhost. It manages the vhost recovery process on start and
325 | notifies that the vhost is down on terminate.
326 |
327 | The aliveness status of this process is used to check that the vhost is "alive".
328 |
329 | This process will also terminate the vhost supervision tree if the vhost is deleted
330 | from the database.
331 |
332 |
333 | ##### rabbit_vhost_sup
334 | ----------------------
335 |
336 | A container supervisor for a vhost's data store processes, such as message stores,
337 | queues and recovery terms.
338 |
339 | The restart strategy is `one_for_all`, which will restart the vhost should any
340 | message store process fail. This will restart all the vhost queues.
341 |
342 | Should this process crash, the vhost will be restarted (up to 3 times in 5 minutes)
343 | using the recovery process.
344 |
345 | ### Data storage
346 |
347 | Each vhost's data is stored in a separate directory.
348 | The directory name for a vhost is `<mnesia_dir>/msg_stores/vhosts/<vhost_hash>`,
349 | where `<mnesia_dir>` is the configured RabbitMQ data directory (the `RABBITMQ_MNESIA_DIR` variable)
350 | and `<vhost_hash>` is a hash of the vhost name. The hash is used to comply with
351 | file name restrictions.
352 |
353 | A vhost name hash can be generated using the `rabbit_vhost:dir/1` function.
354 |
355 | A vhost directory path can be generated using the `rabbit_vhost:msg_store_dir_path/1` function.
356 |
357 | Each vhost directory contains all its message store and queue directories.
358 |
359 | Example directory structure of a message store (with one vhost for simplicity):
360 |
361 | ```
362 | mnesia_dir
363 | |
364 | |
365 | --- ...
366 | |
367 | |
368 | --- msg_stores
369 | |
370 | |
371 | --- vhosts
372 | |
373 | |
374 | --- <vhost name hash>
375 | |
376 | |
377 | --- .vhost - a file, containing the vhost name
378 | |
379 | |
380 | --- recovery.dets
381 | |
382 | |
383 | --- msg_store_persistent - persistent message store
384 | | |
385 | | |
386 | | --- ... - the message store data files
387 | |
388 | |
389 | --- msg_store_transient - transient message store
390 | | |
391 | | |
392 | | --- ...
393 | |
394 | |
395 | --- queues
396 | |
397 | |
398 | --- <queue name hash 1>
399 | | |
400 | | |
401 | | --- .queue_name - a file, containing the vhost and the queue name
402 | | |
403 | | |
404 | | --- ... - the queue data files
405 | |
406 | |
407 | --- <queue name hash 2>
408 | |
409 | |
410 | --- .queue_name
411 | |
412 | |
413 | --- ...
414 | ```
415 |
416 | Each vhost directory contains a `.vhost` file with the name of the vhost. The file
417 | can be used for troubleshooting when the RabbitMQ node cannot be used to
418 | generate the vhost directory name.
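The hash is cheap to compute by hand, which helps with the troubleshooting scenario above. A sketch, assuming the md5-to-base-36 encoding used by `rabbit_vhost:dir/1` (verify against the module before relying on it):

```
%% Derive a filesystem-safe directory name for a vhost: hash the name
%% and render the 128-bit digest in base 36.
vhost_dir_name(VHost) when is_binary(VHost) ->
    <<Num:128>> = erlang:md5(VHost),
    lists:flatten(io_lib:format("~.36B", [Num])).
```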
419 |
420 | Each vhost has its own recovery DETS table.
421 |
422 | Queue directory names are also generated using a hash function.
423 |
424 | Each queue directory contains a `.queue_name` file with the queue and the vhost names.
425 |
--------------------------------------------------------------------------------
/rabbit_boot_process.md:
--------------------------------------------------------------------------------
1 | Original: https://github.com/videlalvaro/rabbit-internals/blob/master/rabbit_boot_process.md
2 |
3 | ## RabbitMQ Boot Process ##
4 |
5 | RabbitMQ is designed as an Erlang/OTP application which means that during startup it will be initialized as such. The function `rabbit:start/2` will be called, which lives in the file `rabbit.erl` where the [application behaviour](http://erlang.org/doc/apps/kernel/application.html#Module:start-2) is implemented.
6 |
7 | When RabbitMQ starts running it goes through a series of what are called __boot steps__ that take care of initializing all the core components of the broker in a specific order. The whole boot step concept is –as far as I can tell– something unique to RabbitMQ. The idea behind it is that each subsystem that forms part of RabbitMQ as a whole will declare which other systems it depends on and, if it's successfully started, which other systems it will enable. For example, there's no point in accepting client connections if the layer that routes messages to queues is not enabled.
8 |
9 | The implementation is very elegant: it relies on adding custom attributes to Erlang modules that declare how to start a boot step, which boot steps it depends on and which boot steps it enables. Here's an example:
10 |
11 |     -rabbit_boot_step({recovery,
12 |                        [{description, "exchange, queue and binding recovery"},
13 |                         {mfa, {rabbit, recover, []}},
14 |                         {requires, empty_db_check},
15 |                         {enables, routing_ready}]}).
16 |
17 | Here the step name is `recovery`, which, as the description says, manages _"exchange, queue and binding recovery"_. It requires the `empty_db_check` boot step and enables the `routing_ready` boot step. As you can see, there's an `mfa` argument which specifies the `Module`, `Function` and `Arguments` to call in order to start this boot step.
18 |
19 | So far this seems very simple and can even make us doubt the usefulness of such an approach: why is there a need for boot steps at all? Why isn't there just a call to functions one after the other and that's it? Well, it is not that simple.
20 |
21 | Boot steps can be separated into groups. A group of boot steps will enable a certain other group. For example `routing_ready` is actually enabled by many other boot steps, not just `recovery`. One such step is `empty_db_check`, which ensures that Mnesia, Erlang's built-in distributed database, has the default data, like the default `guest` user for example. We can also see that the `recovery` boot step depends on `empty_db_check`, so this logic takes care of running them in an order that satisfies their interdependencies.
22 |
23 | There are boot steps that neither enable nor require others to be run. They are used to signal that a group of boot steps has happened as a whole, so the next group can start running. For example, we have the external infrastructure step:
24 |
25 |     {external_infrastructure,
26 |      [{description,"external infrastructure ready"}]}
27 |
28 | As you can see, it lacks the `requires` and the `enables` properties.
But since many steps declare that they enable it, `external_infrastructure` won't be run until those steps are run. Also, many steps that come after in the chain require `external_infrastructure` to have run before, so they won't be started either until it has been processed.
29 |
30 | But the story doesn't end here. RabbitMQ can be extended with plugins that add new exchanges or authentication methods to the broker. Taking the exchanges as an example, each exchange type is registered into the broker via the `rabbit_registry` module, which means the `rabbit_registry` has to be started __before__ we can register a plugin. If we want to add new exchanges we don't have to worry about when they will be started by the broker, nor do we have to care about managing the functional dependencies of our exchange. We just add a `-rabbit_boot_step` declaration to our exchange module where we say that our custom exchange depends on `rabbit_registry`, et voilà, the exchange will be ready to use.
31 |
32 | There's more to it too. In the same way your custom exchange can add its own boot steps to hook into the server boot process, you can add extra boot steps that perform some work in between RabbitMQ's predefined boot steps. Keep in mind that you have to know what you are doing if you are going to plug into the RabbitMQ boot process.
33 |
34 | Now, if you have been doing some Erlang programming you may be wondering at this point how this even works at all. Erlang modules can have attributes, like the list of exported functions, or the declaration of which behaviour is implemented by the module, but there's no mention anywhere in the Erlang documentation of `boot_steps` and of course there's nothing about `-rabbit_boot_steps`. How do they work then?
35 |
36 | When the broker is starting it builds a list of all the modules defined in the loaded applications. Once the list of modules is ready it's scanned for attributes called `rabbit_boot_steps`. If there are any, they are added to a new list. This list is further processed and converted into a [directed acyclic graph](http://en.wikipedia.org/wiki/Directed_acyclic_graph) which is used to maintain an order between the boot steps, that is, the boot steps are ordered according to their dependencies. Here, I think, is where the elegance of this solution lies: add declarations to modules in the form of custom module attributes, scan for them and do something smart with the information. This speaks to the flexibility of Erlang as a language.
37 |
38 | ## Individual boot steps in detail ##
39 |
40 | Here's a graphic that shows the boot steps and their interconnections. An arrow from boot step __A__ to boot step __B__ means that __A__ enables __B__. A line with no arrows on both ends from __A__ to __B__ means that __A__ is required by __B__. You can open the image file in a separate window to see it [full size](http://github.com/videlalvaro/rabbit-internals/raw/master/images/boot_steps.png).
41 |
42 | ![demo](http://github.com/videlalvaro/rabbit-internals/raw/master/images/boot_steps.png "Rabbit Boot Steps")
43 |
44 | As we can see there, the boot steps are grouped. Everything starts at the `pre_boot` step, continues at the `external_infrastructure` step and so on. Between `pre_boot` and `external_infrastructure` other steps occur that contribute to enabling `external_infrastructure`. Now let's give a brief description of what happens in each of them.
45 |
46 | ### pre_boot ###
47 |
48 | The `pre_boot` step signals the start of the boot process.
After it happens RabbitMQ will start processing the other boot steps like `file_handle_cache`.
49 |
50 | ### external_infrastructure ###
51 |
52 | The `file_handle_cache` is used to manage file handles to synchronize reads and writes to them. See `file_handle_cache.erl` for an in-depth explanation of its purpose.
53 |
54 | The next step that starts is the `worker_pool`. The worker pool process manages a pool of up to `N` workers, where `N` is the return value of `erlang:system_info(schedulers)`. It's used to parallelize function calls across the pool.
55 |
56 | Then the turn goes to the `database` step. This one is used to prepare the [Mnesia](http://www.erlang.org/doc/man/mnesia.html) database which is used by RabbitMQ to track exchange metadata, users, vhosts, bindings, etc.
57 |
58 | The `codec_correctness_check` is used to ensure that the AMQP binary generator is working properly, that is, that it will generate the right protocol frames.
59 |
60 | Once all the previous steps have run, the `external_infrastructure` step will be processed, signaling to the boot process that it can continue with the following steps.
61 |
62 | ### kernel_ready ###
63 |
64 | Once the external infrastructure is ready RabbitMQ will proceed with booting its own kernel. The first step will be the `rabbit_registry`, which keeps a registry of plugins and their modules. For example it maps authentication mechanisms to modules with the actual implementation. The same thing is done from _exchange type_ to _exchange type implementation_. This means that if a message is published to an exchange of type _direct_, the registry will be responsible for telling the broker where the routing logic for the direct exchange resides, returning the module name.
65 |
66 | After the `rabbit_registry` is ready, it's time to start the authentication modules. RabbitMQ will go through each of them, starting them and making them available. Some steps here are `rabbit_auth_mechanism_amqplain`, `rabbit_auth_mechanism_plain` and so on. If there's a plugin implementing an authentication mechanism, then it will be started at this point.
67 |
68 | The next step is `rabbit_event`, which handles event notification for statistics collection. For example, when a new channel is created, a notification like `rabbit_event:notify(channel_created, infos(?CREATION_EVENT_KEYS, State))` is fired.
69 |
70 | Then it is time for `rabbit_log` to start, which manages the logging inside RabbitMQ. This process will delegate logging calls to the native error_logger module.
71 |
72 | The same procedure used to enable the authentication mechanisms is now repeated for the exchanges. Steps like `rabbit_exchange_type_direct` or `rabbit_exchange_type_fanout` are executed here. If you installed plugins with custom exchange types, they will be registered at this point.
73 |
74 | Now it is time to run the `kernel_ready` step in order to continue initializing the core of RabbitMQ.
75 |
76 | ### core_initialized ###
77 |
78 | The first step of this group is the `rabbit_alarm`, which starts the memory alarm handler. It will perform alarm management for different events that may happen during the broker's life. For example, if memory use is about to surpass the `memory_high_watermark` setting, this module will fire an event.
79 |
80 | Next is the `rabbit_node_monitor`, which notifies other nodes in the cluster about its own node's presence. It also takes care of dealing with the situation of another node dying.
81 |
82 | Then it is the turn of the `delegate_sup` step.
This supervisor will start a pool of children that will be used to parallelize calls to processes. For example, when routing messages, the delegates take care of sending the messages to each of the queues that ought to receive the message.
83 |
84 | The next step to be started is the `guid_generator`, which, as its name implies, is used as a _Globally Unique Identifier Server_. This process is called, for example, when the server needs to generate random queue names, consumer tags, etc.
85 |
86 | Next on the list is the `rabbit_memory_monitor`, which monitors queue memory usage. It will take care of flushing messages to disk when a queue reaches a certain level of memory use.
87 |
88 | Finally the `core_initialized` step will be run and the boot step process will continue with the routing infrastructure.
89 |
90 | ### routing_ready ###
91 |
92 | At this stage RabbitMQ will start to fill up the Mnesia tables with information regarding the exchanges, routing and bindings. In order to do so, first the step `empty_db_check` is run. This step will check that the database has the required information inside; otherwise it will insert it. At this point the default `guest` user will be created.
93 |
94 | Once the database is properly set up, the `recovery` step is run. This step will recover the bindings between queues and exchanges. This is the point where the actual queue processes are started.
95 |
96 | After the queues are running, the boot steps that involve the mirrored queues will be called. Once the mirrored queues are ready the `routing_ready` step will take place and the boot step procedure will continue.
97 |
98 | ### log_relay ###
99 |
100 | Before RabbitMQ is ready to start accepting clients it is time to start the `rabbit_error_logger`, which is done during the `log_relay` boot step; from here the `networking` step will be ready to run.
101 |
102 | ### networking ###
103 |
104 | The `networking` step will start all the supervisors that are concerned with the different listeners specified in the application configuration. A `tcp_listener_sup` will be started for each interface/port combination on which RabbitMQ is listening. The SSL listeners will be started and the broker will be ready to accept TCP connections.
105 |
106 | ### direct_client ###
107 |
108 | RabbitMQ is nearly done with the boot process. The `direct_client` step is used to start the supervisor tree that takes care of accepting _direct client connections_. The direct client is used for AMQP connections that use the Erlang distribution protocol. Once this is finished it is time to proceed to the final step.
109 |
110 | ### notify_cluster ###
111 |
112 | At this point RabbitMQ is ready to start munching messages. The only thing that remains to do is to notify other nodes in the cluster of its own presence. That is accomplished via the `notify_cluster` step.
113 |
114 | ## Summary ##
115 |
116 | If you have read this far you can see that starting an application like RabbitMQ is not an easy task. Thanks to the __boot steps__ technique the process can be managed in such a way that the interdependencies between processes can be satisfied without sacrificing sanity. What's even more impressive is that this technique can be used to extend the broker in a way that goes beyond what the original developers planned for the server.
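As a footnote, the attribute scanning described earlier is small enough to sketch (a rough illustration, not the actual `rabbit` module code):

    %% Collect every -rabbit_boot_step attribute from the modules of all
    %% loaded applications; each attribute value is a list of
    %% {StepName, Properties} tuples, ready to be turned into a graph.
    boot_steps() ->
        [Step || {App, _Desc, _Vsn} <- application:loaded_applications(),
                 {ok, Mods}         <- [application:get_key(App, modules)],
                 Mod                <- Mods,
                 {rabbit_boot_step, Steps} <- Mod:module_info(attributes),
                 Step               <- Steps].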
117 |
--------------------------------------------------------------------------------
/transactions_in_exchange_modules.md:
--------------------------------------------------------------------------------
1 | # What are those transactions inside the exchange callback modules? #
2 |
3 | Many callbacks inside the `rabbit_exchange_type` behaviour expect a
4 | `tx()` parameter which is defined as follows:
5 |
6 | ```erlang
7 | -type(tx() :: 'transaction' | 'none').
8 | ```
9 |
10 | Then, for example, `create` is defined like:
11 |
12 | ```erlang
13 | %% called after declaration and recovery
14 | -callback create(tx(), rabbit_types:exchange()) -> 'ok'.
15 | ```
16 |
17 | The question is, what's the purpose of that transaction parameter?
18 |
19 | This is related to how RabbitMQ runs Mnesia transactions for its
20 | internal bookkeeping:
21 |
22 | [rabbit_misc:execute_mnesia_transaction/2](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/rabbit_misc.erl#L586)
23 |
24 | As you can see in that code, there's a PrePostCommitFun which is
25 | called in the Mnesia transaction context, and again after the transaction
26 | has run.
27 |
28 | So here, for example: in
29 | [rabbit_exchange:declare/7](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_exchange.erl#L143)
30 | the create callback from the exchange is called inside a Mnesia
31 | transaction, and outside of it afterwards.
32 |
33 | You can see this in action, and understand its usefulness, when
34 | considering an exchange like the topic exchange, which keeps track of
35 | its own data structures:
36 |
37 | [rabbit_exchange_type_topic:delete/3](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_exchange_type_topic.erl#L49)
38 | [rabbit_exchange_type_topic:add_binding/3](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_exchange_type_topic.erl#L59)
39 | [rabbit_exchange_type_topic:remove_bindings/3](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_exchange_type_topic.erl#L64)
40 |
41 | Deleting the exchange, adding or removing bindings, are all done
42 | inside a Mnesia transaction for consistency reasons.
--------------------------------------------------------------------------------
/uninterupted_cluster_upgrade.md:
--------------------------------------------------------------------------------
1 | # Uninterrupted cluster upgrade between minor releases
2 |
3 | ## Status as of November 2016
4 |
5 | Currently, when you want to upgrade RabbitMQ from a minor version to
6 | the next minor version (e.g. 3.5.x to 3.6.x), you need to shut down the
7 | entire cluster. A RabbitMQ cluster with a mix of multiple minor versions
8 | is unsupported: that's when we introduce incompatible changes, in
9 | particular to the Mnesia schema.
10 |
11 | The same rules apply for upgrades between major versions.
12 |
13 | On rare occasions, like between 3.6.5 and 3.6.6, we need to introduce a
14 | breaking change and again, you must shut down the entire cluster to
15 | upgrade.
16 |
17 | ## Plan to fix the situation
18 |
19 | ### Scope
20 |
21 | This project targets the following goals:
22 |
23 | 1. Being able to run any minor versions from the same major branch mixed
24 | in the same cluster.
25 | 2. Being able to gradually upgrade all nodes of a cluster to a later
26 | minor version and still benefit from the new features and bugfixes at
27 | the end of the process.
28 |
29 | The first item is a prerequisite to the second item.
30 |
31 | This project does not try to make upgrades between major versions
32 | possible without terminating a cluster: thus, breaking changes requiring
33 | a cluster shutdown may happen even once this project is complete.
34 |
35 | ### Compatibility between brokers
36 |
37 | #### Elements to consider
38 |
39 | To be able to run different versions of RabbitMQ inside a single
40 | cluster, all nodes must understand and emit data in a common format.
41 |
42 | Here are the elements shared or exchanged by nodes:
43 |
44 | * record definitions; they are the building blocks of inter-process
45 | communication and of the Mnesia schema below;
46 | * the Mnesia schema;
47 | * messages exchanged between nodes;
48 | * the plugins' ABI.
49 |
50 | Today, we can't achieve compatibility because when we need to e.g.
51 | expand a record, we add a new field, which makes it incompatible with
52 | a previous version of that record. A plugin must even be recompiled
53 | against the new record definition to work again.
54 |
55 | Therefore, records and general messages must be redesigned to be
56 | extensible without breaking compatibility.
57 |
58 | #### Extensible and backward-compatible records
59 |
60 | We need to rethink how our records are designed to:
61 |
62 | * allow modifications of a record without breaking a table schema
63 | (if there is one associated);
64 | * permit the use of old and new records inside old and new code.
65 |
66 | We can use an Erlang map inside a record to allow extensibility,
67 | yet retain backward compatibility: new keys would hold new
68 | information, while old keys could still exist. And using maps
69 | still allows pattern matching.
70 |
71 | Here is an example with the `#amqqueue` record from the 3.6.x branch,
72 | focused on the lists of queue slaves:
73 |
74 | ```erlang
75 | #amqqueue{
76 | name,
77 | slaves = [],
78 | sync_slaves = []
79 | }.
80 | ```
81 |
82 | In 3.7.x, we add another list related to queue slaves:
83 |
84 | ```erlang
85 | #amqqueue{
86 | name,
87 | slaves = [],
88 | sync_slaves = [],
89 | slaves_pending_shutdown = []
90 | }.
91 | ```
92 |
93 | And we already know we may need yet another list to track slaves pending
94 | startup.
95 |
96 | Instead, we could have used an extensible record in the 3.6.x branch of the form:
97 |
98 | ```erlang
99 | #amqqueue{
100 | name,
101 | features = #{
102 | slaves_list => #{ % The value could be anything: a record, a map, ...
103 | slaves => [],
104 | sync_slaves => []
105 | }
106 | }
107 | }.
108 | ```
109 |
110 | And in 3.7.x, it could have been extended like this:
111 |
112 | ```erlang
113 | #amqqueue{
114 | name,
115 | features = #{
116 | slaves_list => #{
117 | slaves => [],
118 | sync_slaves => [],
119 | slaves_pending_shutdown => []
120 | }
121 | % 'slaves_list' could exist, if the record was converted for
122 | % instance, but would have a lower precedence.
123 | }
124 | }.
125 | ```
126 |
127 | This particular change fits the existing map so we don't need to
128 | introduce a new map. However, the new code must support the absence of
129 | the `slaves_pending_shutdown` field and the old code must not trip up on
130 | this unknown field.
131 |
132 | If, in 3.7.x, the representation of slaves had required a complete
133 | revamp, a new map entry could have been introduced:
134 |
135 | ```erlang
136 | #amqqueue{
137 | name,
138 | features = #{
139 | slaves_list_v2 => #{
140 | rabbit@host1 => #{
141 | pid => Pid,
142 | synced => true,
143 | state => ready
144 | },
145 | rabbit@host2 => #{
146 | pid => Pid,
147 | synced => false,
148 | state => pending_startup
149 | }
150 | }
151 | % 'slaves_list' could exist, if the record was converted for
152 | % instance, but would have a lower precedence.
153 | }
154 | }.
155 | ```
156 |
157 | In 3.6.x, the code would look for `slaves_list` in the `features` map.
158 | In 3.7.x, the code would look for `slaves_list_v2` and fall back on
159 | `slaves_list` if the former key is missing (meaning it's an older
160 | record):
161 |
162 | ```erlang
163 | do_things(#amqqueue{features = #{slaves_list_v2 := Slaves}}) ->
164 |     % Do things with the new format of slaves list.
165 |     really_do_things(Slaves);
166 | do_things(#amqqueue{features = #{slaves_list := Slaves}}) ->
167 |     % Convert the old format of slaves list and do things with it.
168 |     Slaves1 = % ...
169 |     really_do_things(Slaves1).
170 | ```
171 |
172 | In the end, no matter the complexity of the change, in this case the
173 | record is unchanged and thus the Mnesia table schema remains the same.
174 |
175 | #### Feature flags
176 |
177 | Because we want to have different versions of RabbitMQ in the same
178 | cluster, new nodes must not produce new records while older nodes are
179 | still around.
180 |
181 | Just looking at the version of running nodes is not enough either
182 | because there could be stopped nodes. Furthermore, if new code is
183 | backported to an older release for whatever reason, a new record could
184 | be supported by a non-contiguous set of versions.
185 |
186 | Other projects such as ZFS resolve that by using *feature flags*. A
187 | given version of ZFS code has support for a certain list of features and
188 | a filesystem has a list of features enabled. A ZFS implementation can
189 | look at the features enabled on a particular filesystem:
190 |
191 | * If a feature supported by the implementation is disabled, it continues
192 | to use the old format when writing data.
193 | * If a feature enabled on the filesystem is not supported by the
194 | implementation, it refuses to mount the filesystem.
195 |
196 | The user is responsible for enabling features when he is sure he won't
197 | have to mount a filesystem with an older implementation.
198 |
199 | We can use the same principle with RabbitMQ. Each version comes with a
200 | list of supported "features" and the list of enabled features is stored
201 | in Mnesia.
202 |
203 | When a new node starts, it looks at the enabled features. If it doesn't
204 | support one of them, it refuses to boot. If it supports more features
205 | than enabled, it makes sure to never produce data which would rely on
206 | disabled features because other nodes might not support that or the user
207 | may want to roll back to an older version.
208 |
209 | The user can enable new features when the cluster is ready. All nodes
210 | must support the new feature to allow it to be enabled. If that is not
211 | the case, the feature is not enabled.
212 |
213 | Once new features are enabled, nodes can produce newer data. At this
214 | point, it means old and new data are in flight (e.g. in queues in
215 | memory). That's why the code must still support both formats/messages.
216 |
217 | If we take the same `#amqqueue` record example:
218 |
219 | * RabbitMQ 3.6.x would have the following feature flags:
220 |
221 | ```erlang
222 | SupportedFeatures = [
223 | amqqueue_slave_list
224 | ].
225 | ```
226 |
227 | * RabbitMQ 3.7.x would have the following feature flags:
228 |
229 | ```erlang
230 | SupportedFeatures = [
231 | amqqueue_slave_list,
232 | amqqueue_slave_list_v2
233 | ].
234 | ```
235 |
236 | * Initially, only the `amqqueue_slave_list` feature would be enabled,
237 | while the cluster is running RabbitMQ 3.6.x.
238 |
239 | After RabbitMQ is upgraded from 3.6.x to 3.7.0, it would not produce
240 | `#amqqueue` records with the `slaves_list_v2` map entry yet. It would
241 | continue to use the old `slaves_list` map entry. However, once the
242 | feature is enabled, it would produce the new entry, while still keeping
243 | support for the old entry which might still be in flight.
244 |
245 | #### When to get rid of old data support
246 |
247 | The RabbitMQ code must keep old code around for the entire life of the
248 | major branch.
249 |
250 | In the next major branch, we may remove old code because mixing major
251 | versions remains unsupported.
252 |
253 | However, all features must remain in the list of supported features,
254 | even if the code to handle them disappeared. This is because enabled
255 | features in an existing cluster must be present in the list of supported
256 | features.
257 |
258 | #### Note about the performance
259 |
260 | Checking feature flags in Mnesia for every operation would be
261 | expensive. A possible solution is to cache the information in the
262 | process, either in the state or the dictionary. Then, once an operator
263 | decides to enable features, processes could be notified so they refresh
264 | their cached list of enabled features.
265 |
266 | ### Upgrading a cluster
267 |
268 | With backward-compatible code, extensible records and feature flags in
269 | place, upgrading a cluster would consist of:
270 |
271 | 1. Installing the new version of RabbitMQ on all nodes. This means:
272 |
273 | * stop the broker
274 | * install the new version
275 | * restart the broker
276 |
277 | At this point, no new feature flag is enabled. The user benefits only from
278 | the changes which do not depend on new features. He can decide to
279 | roll back to a previous version of RabbitMQ.
280 |
281 | 2. Once all nodes are running the latest code, the user can enable new
282 | features.
283 |
284 | He can decide to enable all of them at once or just one. This is useful
285 | if he hits a problem fixed by a particular feature and wants to verify
286 | the issue is actually solved. This can be useful to us too during
287 | development.
288 |
289 | Brokers produce and exchange new records, while still handling old
290 | ones. Rollback is not possible anymore.
291 |
292 | ### Working with breaking changes
293 |
294 | #### From a developer point of view
295 |
296 | When working on the core of RabbitMQ or a plugin, whether it is a tier-1
297 | plugin or not, a developer must be rigorous about changing shared and
298 | exchanged data formats. When he wants to introduce a breaking change, he
299 | will have to:
300 |
301 | * add and document a new feature flag;
302 | * update the code so old and new formats are both handled.
303 |
304 | The code snippets above give an overview of that.
305 |
306 | Specifically for plugins, they may come with a list of feature flags
307 | they require to run.
This could be in addition to or instead of the
308 | RabbitMQ version check.
309 |
310 | > **TODO:** The implementation details remain to be designed.
311 |
312 | #### From an operator point of view
313 |
314 | An operator will need commands to manage feature flags during a cluster
315 | upgrade:
316 |
317 | * list feature flags;
318 | * get information about a feature flag;
319 | * enable one, many or all feature flags.
320 |
321 | When listing feature flags or querying information about them, the
322 | following elements will be of interest:
323 | * the name of the feature;
324 | * a description of the change;
325 | * whether the feature is enabled;
326 | * (optional) a list of RabbitMQ versions where the feature flag was
327 | introduced.
328 |
329 | > **TODO:** The implementation details remain to be designed.
330 |
331 | ## Future out-of-scope ideas
332 |
333 | ### Downgrading a node
334 |
335 | Downgrading a node means that features not supported by the targeted old
336 | version must be disabled. Disabling a feature means that:
337 |
338 | * Nodes need to produce old records again.
339 | * New records still in flight need to be converted back to their old
340 | version.
341 |
342 | The latter point would need code to find in-flight data and convert it.
343 | Once this is done, the old version of RabbitMQ can be deployed exactly
344 | like the new one was installed. Thus the order of operations is simply
345 | reversed.
346 |
347 | Obviously the difficulty is in the "find and convert data" part. This
348 | would be impossible for certain features.
349 |
--------------------------------------------------------------------------------
/variable_queue.md:
--------------------------------------------------------------------------------
1 | # Variable Queue #
2 |
3 | ## Publishing messages ##
4 |
5 | When a message is published to the queue, the first thing we have to
6 | do is to determine if the message is persistent. We track this
7 | information in the `msg_status` record:
8 |
9 | ```erlang
10 | -record(msg_status,
11 | { seq_id,
12 | msg_id,
13 | msg,
14 | is_persistent,
15 | is_delivered,
16 | msg_in_store,
17 | index_on_disk,
18 | persist_to,
19 | msg_props
20 | }).
21 | ```
22 |
23 | Message statuses are kept in a record where the `is_persistent` field
24 | is set to `true` if the queue is durable and the message was published
25 | as persistent:
26 |
27 | ```erlang
28 | is_persistent = IsDurable andalso IsPersistent
29 | ```
30 |
31 | If it was determined that the message needs persistence, then it will
32 | be immediately written to disk, either to the message store or the
33 | queue index, depending on the message size (see
34 | `queue_index_embed_msgs_below`).
35 |
36 | Internally the `variable_queue` keeps messages in four `queue` data
37 | structures. They are a variation of Erlang's _queue_ module, with
38 | some extensions that allow getting the queue length in constant
39 | time. These four queues are identified in the variable queue state as
40 | `q1`, `q2`, `q3` and `q4`. The need for these four queues becomes
41 | apparent once disk paging is taken into account.
42 |
43 | `q4` keeps track of the oldest messages, that is, those at the front
44 | of the queue, those that will be delivered first to consumers.
45 |
46 | `q3` only has messages when there has been some disk paging due to
47 | memory pressure, or if we have a queue that has recovered contents
48 | from disk, due to a broker restart for instance.
This means that
49 | messages that once were in `q4` only, now have had their content
50 | pushed to disk, and their references are now kept in `q3`. So when a
51 | message arrives at the variable queue, we need to determine if the
52 | message needs to be inserted at the back of `q4`, or somewhere else.
53 |
54 | If `q3` is empty, this means we haven't paged queue contents to disk,
55 | so the messages at the front of the queue are still in `q4`, and the
56 | last message arriving to the queue is still in `q4` as well. So new
57 | messages can be inserted at the back of `q4`. Now, if `q3` has
58 | messages in it, this means at some point we have paged to disk, so
59 | some messages that were at the rear of `q4` are in `q3` now. This
60 | means a new message _can't_ be inserted into `q4`, otherwise we will
61 | lose message ordering; therefore, if `q3` has messages, new messages
62 | go into `q1`.
63 |
64 | ```erlang
65 | case ?QUEUE:is_empty(Q3) of
66 |     false -> State1 #vqstate { q1 = ?QUEUE:in(m(MsgStatus1), Q1) };
67 |     true  -> State1 #vqstate { q4 = ?QUEUE:in(m(MsgStatus1), Q4) }
68 | end,
69 | ```
70 |
71 | ## Fetching Messages ##
72 |
73 | Messages are fetched by calling `queue_out/1`, which retrieves
74 | messages from `q4` or, if `q4` is empty, from
75 | `q3`.
76 |
77 | For `q3` to have messages, it means that at some point messages were
78 | paged to disk due to memory pressure which led to
79 | `push_alphas_to_betas/2` being called. Another way for `q3` to get
80 | messages is when we load messages from disk into memory, for example
81 | when `maybe_deltas_to_betas/1` is called; this can happen when we are
82 | recovering queue contents during queue initialization, or also when we
83 | try to load more messages from disk so they can be delivered to
84 | clients.
85 |
86 | When there are no more messages in `q4`, `fetch_from_q3/1` is called
87 | trying to obtain messages from `q3`. If `q3` is empty, then the queue
88 | must be empty. Remember that if `q3` wasn't empty, then new messages
89 | arriving into the queue were put into `q1`.
90 |
91 | If `q3` wasn't empty, but the message fetched was the last one there,
92 | we must see if we need to load more messages from disk, or if we need
93 | to migrate `q1` messages into `q4`.
94 |
95 | So let's say we fetched the last message from `q3` and we know there
96 | are no more messages on disk (delta count = 0). If there were new
97 | publishes while `q3` had messages, those messages are in `q1`, so we
98 | need to move messages from `q1` into `q4`. Why? Remember that during
99 | publishing messages are queued into `q4` when `q3` is empty, otherwise
100 | they go into `q1`. Imagine that we were publishing messages into `q4`,
101 | then at some point we had to `push_alphas_to_betas`, which means some
102 | `q4` messages were moved into `q3`. Now that `q3` has some messages,
103 | _new_ messages are put into `q1`, but from the point of view of an
104 | external user of the backing queue, `q3` messages come before
105 | those in `q1`, i.e. they are at the front of the queue. So when there
106 | are no more messages in `q3`, we can start consuming those in `q1`,
107 | but since `queue_out/1` only fetches messages from `q4`, we move `q1`
108 | contents there.
109 |
110 | Now let's say there were remaining messages on disk: instead of moving
111 | messages from `q1` into `q4`, we have to load messages from disk into
112 | `q3`. This is accomplished by calling `maybe_deltas_to_betas/1`.
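The dispatch just described is compact enough to sketch (simplified from the real `rabbit_variable_queue` code; bookkeeping such as length counters is omitted):

```erlang
%% Try the in-RAM head of the queue (q4) first; fall back to q3, which
%% may in turn load more messages in from disk.
queue_out(State = #vqstate { q4 = Q4 }) ->
    case ?QUEUE:out(Q4) of
        {empty, _Q4} ->
            fetch_from_q3(State);
        {{value, MsgStatus}, Q4a} ->
            {{value, MsgStatus}, State #vqstate { q4 = Q4a }}
    end.
```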
`maybe_deltas_to_betas/1` reads the index and then, depending on what
it finds there, loads messages at the rear of `q3`. If there are no
more messages on disk, then the messages that are in `q2` are moved to
the rear of `q3`. When we look into message paging we will see why we
move `q2` into `q3` only when there are no more paged messages, and
why `q2` messages go to the back of `q3`. For now keep in mind that
all this message shuffling is done to ensure message ordering from an
external observer's point of view. `q2` only has messages if `q1` had
messages before, and `q1` pushes messages to `q2` only when some
messages have been paged before, so whatever is on disk comes before
whatever is in `q2`.

Remember, messages in `q1` are recently published messages that went
there because `q3` had messages, so those are the last ones we should
deliver.

## Paging messages to disk ##

Disk paging starts when the function `reduce_memory_use/1` is
called. This function calls `push_alphas_to_betas/2` to start sending
messages to disk. _Alpha_ messages are those where both the contents
and the queue position are held in RAM, and we are trying to convert
them to _betas_, i.e. messages where we still keep the position of the
message in RAM, but send the contents to disk.

When paging to disk we first try to page those messages that are going
to be delivered later, those that from a client's point of view are at
the rear of the queue, so we start with `q1`'s contents. If there are
messages on disk (because we have paged out queue contents already, or
because we didn't load all the queue contents into memory), then
messages are moved from `q1` into `q2`, otherwise they go into
`q3`. Then we move messages from `q4` into `q3`.

Keep in mind that we move messages based on a _quota_ that's
calculated by taking into account how many messages are in RAM versus
how many messages `set_ram_duration_target/2` decided need to be paged
out to disk. If after moving messages from _alpha_ to _beta_ this
quota hasn't been consumed entirely, we then have to push messages
from _beta_ to _delta_. _Delta_ messages are those whose content and
position are only held on disk.

To page to disk those messages whose position in the queue is still in
RAM we call `push_betas_to_deltas/2`. We first page messages from `q3`
to disk, and then we page messages from `q2`, but there's a
catch. Keep in mind that we might not want to page every single
message out to disk. `q3` holds messages that are closer to the front
of the queue than those in `q2`, so `q3` messages are paged in reverse
order, that is, those messages at the rear of `q3` are sent to disk
first, with the idea that if the quota of messages that need paging is
reached, then we keep in RAM the messages that will soon be sent to
clients.
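Putting the two phases together, the paging cascade could be sketched
like this (hypothetical code; `paging_quota/1` is a stand-in for
however the quota is derived from the RAM counts and the
`set_ram_duration_target/2` decision):

```erlang
%% Spend the quota on alpha->beta conversions first; any quota left
%% over is then spent pushing betas (q2/q3) out to delta.
reduce_memory_use(State0) ->
    Quota0 = paging_quota(State0),    %% assumed helper, not the real API
    {Quota1, State1} = push_alphas_to_betas(Quota0, State0),
    case Quota1 > 0 of
        true  -> push_betas_to_deltas(Quota1, State1);
        false -> State1
    end.
```

The cascade mirrors the description above: message contents are paged
out first, and queue positions follow only if the quota still isn't
consumed.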
## Example ##

**publish msgs `[1, 2, 3, 4, 5, 6, 7, 8, 9]`**:

```
Q4: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Q3: []

Q2: []

Q1: []

Delta: []
```

**publish msgs `[10]`**:

```
Q4: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Q3: []

Q2: []

Q1: []

Delta: []
```

**push_alphas_to_betas**:

```
Q4: [1, 2, 3, 4, 5, 6, 7]

Q3: [8, 9, 10]

Q2: []

Q1: []

Delta: []
```

**publish msgs `[11, 12, 13, 14, 15]`**:

```
Q4: [1, 2, 3, 4, 5, 6, 7]

Q3: [8, 9, 10]

Q2: []

Q1: [11, 12, 13, 14, 15]

Delta: []
```

**push_alphas_to_betas**:

```
Q4: [1, 2, 3, 4]

Q3: [5, 6, 7, 8, 9, 10, 11, 12, 13]

Q2: []

Q1: [14, 15]

Delta: []
```

**push_betas_to_deltas**:

```
Q4: [1, 2, 3, 4]

Q3: [5, 6, 7, 8, 9]

Q2: []

Q1: [14, 15]

Delta: [10, 11, 12, 13]
```

**publish msgs `[16, 17, 18, 19, 20]`**:

```
Q4: [1, 2, 3, 4]

Q3: [5, 6, 7, 8, 9]

Q2: []

Q1: [14, 15, 16, 17, 18, 19, 20]

Delta: [10, 11, 12, 13]
```

**push_alphas_to_betas**:

```
Q4: [1]

Q3: [2, 3, 4, 5, 6, 7, 8, 9]

Q2: [14, 15, 16, 17]

Q1: [18, 19, 20]

Delta: [10, 11, 12, 13]
```

**fetch 3 messages**:

```
Q4: []

Q3: [4, 5, 6, 7, 8, 9]

Q2: [14, 15, 16, 17]

Q1: [18, 19, 20]

Delta: [10, 11, 12, 13]
```

**fetch 5 messages**:

```
Q4: []

Q3: [9]

Q2: [14, 15, 16, 17]

Q1: [18, 19, 20]

Delta: [10, 11, 12, 13]
```

**fetch 1 message**:

```
Q4: []

Q3: []

Q2: [14, 15, 16, 17]

Q1: [18, 19, 20]

Delta: [10, 11, 12, 13]
```

**Q3 became empty, but we have msgs on disk/delta, so we call
maybe_deltas_to_betas to load messages from delta into Q3**:

```
Q4: []

Q3: [10, 11, 12, 13]

Q2: [14, 15, 16, 17]

Q1: [18, 19, 20]

Delta: []
```

**maybe_deltas_to_betas saw that there are no more messages on disk,
so it joins Q3 with Q2, with Q2's messages going to the rear of Q3**:

```
Q4: []

Q3: [10, 11, 12, 13, 14, 15, 16, 17]

Q2: []

Q1: [18, 19, 20]

Delta: []
```

**publish msgs `[21, 22, 23, 24, 25]`**:

```
Q4: []

Q3: [10, 11, 12, 13, 14, 15, 16, 17]

Q2: []

Q1: [18, 19, 20, 21, 22, 23, 24, 25]

Delta: []
```

**fetch 8 messages**:

```
Q4: []

Q3: []

Q2: []

Q1: [18, 19, 20, 21, 22, 23, 24, 25]

Delta: []
```

**Q3 is empty and Delta is empty as well, so it's time to move Q1's
messages into Q4**:

```
Q4: [18, 19, 20, 21, 22, 23, 24, 25]

Q3: []

Q2: []

Q1: []

Delta: []
```

**fetch 1 message**:
```
Q4: [19, 20, 21, 22, 23, 24, 25]

Q3: []

Q2: []

Q1: []

Delta: []
```

and so on.
--------------------------------------------------------------------------------