├── 2021-10-07-TWG-meeting.pdf
├── Draft.md
├── context-split-4.1-alt.pdf
├── context-split-4.1.pdf
├── context-split-errata.pdf
├── hw-proposal_V2.pdf
└── issue-154.pdf

/2021-10-07-TWG-meeting.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/2021-10-07-TWG-meeting.pdf
--------------------------------------------------------------------------------
/Draft.md:
--------------------------------------------------------------------------------
# Topologies (Draft)
--------------------

## Generic considerations

In the current version of the standard, (virtual) topologies are not directly
usable objects.
Topology *information* is attached to a communicator and is accessible
through a set of procedures.
Topologies officially come in three types:
1. Cartesian topologies (`MPI_CART`)
2. general graph topologies (`MPI_GRAPH`)
3. distributed graph topologies (`MPI_DIST_GRAPH`). Distributed graph topologies
   were introduced to alleviate the scalability issues that exist in the
   (earlier) non-distributed version.

The topology type can be queried with a call to the `MPI_TOPO_TEST(comm, status)`
procedure (the returned status can also be `MPI_UNDEFINED`).
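For reference, the sketch below shows how this query works in the current interface: the topology type is a property of the communicator, and a communicator created without a topology reports `MPI_UNDEFINED`.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size, dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Comm cart_comm;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /* reorder */ 1, &cart_comm);

    int status;
    MPI_Topo_test(cart_comm, &status);       /* status == MPI_CART */
    MPI_Topo_test(MPI_COMM_WORLD, &status);  /* status == MPI_UNDEFINED */
    printf("MPI_COMM_WORLD has %s topology attached\n",
           status == MPI_UNDEFINED ? "no" : "a");

    MPI_Comm_free(&cart_comm);
    MPI_Finalize();
    return 0;
}
```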
Two other types of virtual topologies are also more or less officially supported:

4. bipartite graphs, through the properties of intercommunicators
5. neighborhood graphs: actually a subset of general graphs (because of the
   symmetric adjacency matrix requirement)

- **Question** Should these last two types be added to the previous list explicitly?
- **Question** Remove `MPI_CART`? (covered by `MPI_DIST_GRAPH`)
  - Generalize Cartesian to Euclidean space (support of diagonals)
  - Elliptic topologies?

A topology is really a *topological* space; that is especially true for
`MPI_DIST_GRAPH`. More precisely, `MPI_CART` is a *metric* space.

If the interface is redesigned: should the new objects align with their mathematical
counterparts (because MPI is aimed at scientists)?

This redesign is also the opportunity to switch from a flat programming model
to a more structured one, featuring neighborhoods and/or hierarchies for instance.

## Proposal

Promote topologies from implicit information attached to communicators to
full-fledged objects in the standard: `MPI_Topology`.

### Properties/characteristics of the object

Instead of being *advisory*, such a topology object should be:
- **mandatory** (for objects such as communicators, windows, files, and even groups
  or process sets? see below).
  Consequences:
  - `MPI_UNDEFINED` is no longer a valid status value after a call to `MPI_TOPO_TEST`.
  - A **default topology** is needed (in the World Process Model), in case no topology
    is specified explicitly; it should be of type general graph, or distributed graph
    with full connectivity (to account for the MPI flat programming model).
  - An interface has to be introduced, e.g. `MPI_Topo_attach(char *pset_name)` (see below).

- **restrictive**: a topology describes a structure, and MPI communication operations
  must comply with it. This would, for instance, eliminate the need for neighborhood
  collective operations (vs. "regular" collective operations; cf. the Traff et al.
  paper about collective communication and orthogonality).

- **explicit**: a set of procedures must be added in order to access/query/manipulate
  topologies. This would go beyond the current set of operations, which mainly
  work on the naming scheme/coordinates of the MPI processes in the topology.
  **Question**: what is the set of operations we want to support?

- **nature**: hardware vs. software (virtual).

### Relationship between topologies and other MPI objects

A possible way would be to adopt the same scheme as in Sessions:
process set -> group -> communicator.

Here, we start with a process set, an unstructured "object" with almost
no properties (a pset has a name and that's it), and turn it into something more
structured. For instance:

```mermaid
graph TD
    P["MPI process set (unstructured)"] --> G["an MPI Group"];
    T["MPI topology (a structure)"] --> G;
```

We can then apply the same logic to create other MPI objects/concepts:
- MPI Group + context ID = an MPI Communicator (with a topology, since the group had one)

And then:
- MPI Comm + memory region = an MPI Window (now featuring a topology structure to boot)
- MPI Comm + file handle = an MPI File (now featuring a topology structure to boot)

A sketch of this chain is given below.
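A minimal sketch of the pset + topology -> group -> communicator chain. `MPI_Topology`, `MPI_Topo_create_cart` and `MPI_Group_from_pset_topo` are invented placeholder names (modeled on the real MPI-4 `MPI_Group_from_session_pset`); only `MPI_Comm_create_from_group` and `MPI_Group_free` are standard procedures.

```c
#include <mpi.h>

/* HYPOTHETICAL sketch: process set + topology -> group -> communicator.
 * MPI_Topology, MPI_Topo_create_cart and MPI_Group_from_pset_topo do not
 * exist in the standard; they are placeholders for the draft's idea. */
int build_structured_comm(MPI_Session session, MPI_Comm *newcomm)
{
    MPI_Group group;
    MPI_Topology topo;                        /* hypothetical object */
    int dims[2] = {4, 4}, periods[2] = {1, 1};

    /* hypothetical: a standalone topology object, not tied to a communicator */
    MPI_Topo_create_cart(2, dims, periods, &topo);

    /* hypothetical: process set + topology = structured group */
    MPI_Group_from_pset_topo(session, "mpi://WORLD", topo, &group);

    /* real MPI-4: group + context ID = communicator; the topology
     * would be carried along, since the group had one */
    MPI_Comm_create_from_group(group, "org.mpi-forum.hw-topo.example",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, newcomm);

    MPI_Group_free(&group);
    return MPI_SUCCESS;
}
```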
### Naming schemes/coordinates

- Topologies could also possess a **naming scheme**, so that any MPI process can know
  its location in the structure. The default topology's naming scheme would be the
  linear, one-dimensional one.

- **Question** What about distance functions for topologies? What is the meaning of
  hops/neighbors/weights? (clear for Cartesian topologies, but for general ones...)

- Naming schemes should also be promoted so that the user can efficiently
  leverage them. Currently, they can be handled only through:
  - specific procedure parameters (e.g. the `reorder` parameter of `MPI_Cart_create`,
    the `key` parameter of `MPI_Comm_split`, etc.)
  - specific procedures (e.g. `MPI_Cart_shift`, `MPI_Cart_rank`, etc.)

- Naming schemes could be functions that replace the *reorder* parameter throughout
  the standard.
  * pros:
    - users can provide their own naming schemes (right now, reordering
      works as a black box, without any user control over it);
    - hardware information can be injected into the naming scheme by
      MPI implementers, but by application developers too.
  * cons:
    - breaks the current standard.

- Translation from one scheme to another:
  reordering ranks is actually switching from one naming scheme to another.
  Instead of a simple function parameter, can we have:
  `MPI_Reorder(IN int *tuple, OUT int *tuple);`
  **Note** This is actually more of a mapping function, and transformation != map.
  Functionality:

  1. Transform a topology into another one, or create a new topology from an existing
     one (or even from a set of input topologies);
  2. Map: determine the coordinates of a process in the new topology, given its
     coordinates in the old topology.

  **Question** Does reordering imply or necessitate isomorphism?

  How is this different from calling `MPI_Graph_map` N times?

**Note**
- mapping a virtual topology onto another virtual topology => reordering of ranks
- mapping a virtual topology onto a hardware one => assignment of resources

The result might be different in both cases!

- But then we need another function to enforce this new binding (e.g. `MPI_Bind`)?

### Hardware topologies support

- They could be considered as explicit, e.g. as another nature of `MPI_Topology`
  objects. An `MPI_Topology` object would then have a **nature**, virtual vs.
  hardware, each with its own set of types:

  | virtual    | hardware  |
  |------------|-----------|
  | Cartesian  | random    |
  | dist graph | fat-tree  |
  | bipartite  | dragonfly |
  | else?      | ?         |

  Do we impose not one but two topologies for each pset/group/comm
  (a virtual one + a physical one)?

- On the other hand, they could remain implicit, e.g. used as input for a
  translation/mapping function between naming schemes.

### Process sets vs. Resource sets

- **Process sets** contain *processes*, with a virtual topology.
  A process set cannot be larger than the set named mpi://WORLD.

- **Resource sets** contain *processors*, with a physical topology.
  A resource set can expand in the future of the application, especially
  if/when MPI processes can migrate/change their resource binding.

Mapping one topology onto the other is a graph embedding according to some metric
(e.g. bisection bandwidth): map one of many virtual topologies to the hardware one.
Two questions follow: did the map do something? And if yes, how well did it perform?
Hence a procedure on `MPI_Topology` objects such as:

`MPI_Topo_embedd(topo1, topo2, int *mapping_res, flag, quality)`

Test each virtual topology type against the hardware one and get the quality result,
to make the decision on which one to use (see the sketch below).

How to pass user information (pattern, frequency, etc.)?
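A minimal usage sketch of the proposed `MPI_Topo_embedd`, following the argument order above and assuming that `flag` reports whether an embedding was found and that `quality` is a comparable score; all of these semantics are assumptions, not standard MPI.

```c
#include <mpi.h>

/* HYPOTHETICAL: select the virtual topology that embeds best into the
 * hardware topology. MPI_Topology and MPI_Topo_embedd follow the draft's
 * sketch; their types and semantics are assumptions. */
int pick_best_topology(MPI_Topology candidates[], int n,
                       MPI_Topology hw_topo, int *mapping)
{
    int best = -1;
    double best_quality = -1.0;

    for (int i = 0; i < n; i++) {
        int flag;
        double quality;
        MPI_Topo_embedd(candidates[i], hw_topo, mapping, &flag, &quality);
        if (flag && quality > best_quality) {  /* embedding found and better */
            best_quality = quality;
            best = i;
        }
    }
    return best;  /* index of the topology to commit to (e.g. via MPI_Bind) */
}
```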
"It is erroneous to use the communicator newcomm as an input argument to other MPI 215 | functions before the `MPI_COMM_IDUP` operation completes.")~~ 216 | => Easy fix: specifiy in specs 217 | 218 | ~~2- Pre-exisiting communications (`MPI_COMM_SELF`, `MPI_COMM_WORLD`) should be considered as 219 | already committed and should not be committed again (no-op or erroneous if this 220 | kind of handle is used in the commit procedure)~~ 221 | => Easy fix: specifiy in specs 222 | 223 | 3- Question: are the only operations authorized on the communicator the ones 224 | registered before the call to wait/commit ? 225 | => TBD 226 | 227 | ~~4- Question: how to optimize with coalesced collective operations?~~ 228 | => Easy fix: Not part of the proposal 229 | 230 | 5- Process identification issue: when/if creating a new communication (request) with 231 | a call to a topology creation function, the reorder parameter can be set to 1. 232 | Then the MPI process ranks will probably be different from the ones in the initial communicator. 233 | The new ranks are supposed to be usable only once the new communicator is indeed available 234 | (after the commit/wait operation). How can they be used in the calls to the procedures that initialize 235 | the various persistent communication operations? 236 | => that's a tricky one, cf example V2 237 | 238 | 239 | ### Discussion 240 | Based on the second example (`example_v2.c`): 241 | 242 | 1- Naming scheme creation functions are synchronizing because if/when reordering is 243 | enabled, an MPI process might need the informations/parameters passed to the function 244 | by other parameters. This is the case in particular for distributed graphs naming schemes 245 | (a.k.a topology constructors) which might be synchronizing then (maybe not, depends 246 | on how the mapping should be done). 247 | This might not be necessary in the Cartesian case, as parameters 248 | are (should be) all the same on every involved MPI process. 249 | There could be exchange of information (e.g. to gather hardware information) but 250 | not necessarily with another MPI process but rather with a SW agent (that can even 251 | be supported by a progress thread). 252 | 253 | 2- In the case of Cartesian topologies, since the structure is known, it is easy for 254 | a process to identify its place in the topology. Its name a tuple (size = # of dimensions) 255 | and a process can have a "good vision" in the topology of another process' place. 256 | Also neighbors transitivity is possible because of the structure. 257 | In the case of graphs and distributed graphs, little can be said about a process' "location" 258 | in the topology once the naming scheme has been called/resolved. A process knows its 259 | neighbors and the only thing usable is the reflexivity of the neighborhood relationship. 260 | 261 | 3- Notion of local addressing/identifiers, without a meaning outside each process that 262 | determines the ordering of its neighbors. Reorder meaning in this case?? Myabe irrelevant. 263 | So, it the naming is local, the procedure can be local. But if more information is needed 264 | (e.g. values of parameters passed to the procedure by other processes), then synchronizing, 265 | therefore non local. 266 | 267 | 4- Data partition should not be happening *before* the `MPI_Wait` operation that creates/commits 268 | the new communicator. Issue: using the address of the data buffer before the processes know their 269 | respective roles is not realistic. 
### Issues/Open questions

1. ~~Side effect: lift some restrictions on `MPI_Comm_idup`
   (i.e. "It is erroneous to use the communicator newcomm as an input argument to
   other MPI functions before the `MPI_COMM_IDUP` operation completes.")~~
   => Easy fix: specify in the spec.

2. ~~Pre-existing communicators (`MPI_COMM_SELF`, `MPI_COMM_WORLD`) should be
   considered as already committed and should not be committed again (no-op or
   erroneous if this kind of handle is used in the commit procedure).~~
   => Easy fix: specify in the spec.

3. Question: are the only operations authorized on the communicator the ones
   registered before the call to wait/commit?
   => TBD

4. ~~Question: how to optimize with coalesced collective operations?~~
   => Easy fix: not part of the proposal.

5. Process identification issue: when/if creating a new communicator (request) with
   a call to a topology creation function, the reorder parameter can be set to 1.
   The MPI process ranks will then probably be different from the ones in the initial
   communicator. The new ranks are supposed to be usable only once the new
   communicator is indeed available (after the commit/wait operation). How can they
   be used in the calls to the procedures that initialize the various persistent
   communication operations?
   => that's a tricky one, cf. example V2.

### Discussion

Based on the second example (`example_v2.c`):

1. Naming scheme creation functions are synchronizing, because if/when reordering is
   enabled, an MPI process might need the information/parameters passed to the
   function by other processes. This is the case in particular for distributed graph
   naming schemes (a.k.a. topology constructors), which might then be synchronizing
   (maybe not; it depends on how the mapping should be done).
   This might not be necessary in the Cartesian case, as parameters are (or should be)
   all the same on every involved MPI process.
   There could be an exchange of information (e.g. to gather hardware information),
   but not necessarily with another MPI process: rather with a SW agent (which can
   even be supported by a progress thread).

2. In the case of Cartesian topologies, since the structure is known, it is easy for
   a process to identify its place in the topology. Its name is a tuple (size =
   number of dimensions), and a process can have a "good vision" of another process'
   place in the topology. Neighbor transitivity is also possible because of the
   structure. In the case of graphs and distributed graphs, little can be said about
   a process' "location" in the topology once the naming scheme has been
   called/resolved. A process knows its neighbors, and the only usable property is
   the symmetry of the neighborhood relationship.

3. Notion of local addressing/identifiers, without meaning outside each process,
   which determines the ordering of its neighbors. What does reorder mean in this
   case? Maybe it is irrelevant. So, if the naming is local, the procedure can be
   local. But if more information is needed (e.g. values of parameters passed to the
   procedure by other processes), then it is synchronizing, therefore non-local.

4. Data partitioning should not happen *before* the `MPI_Wait` operation that
   creates/commits the new communicator. Issue: using the address of the data buffer
   before the processes know their respective roles is not realistic.
   Maybe OK for collectives, but not for pt2pt and RMA.
   `MPI_Send_init` should be split into two parts.

   How about an `MPI_Register_buffer` function? In the pt2pt operation, pass
   `MPI_BUFFER_DEFERRED` (à la `MPI_IN_PLACE`) instead of the buffer address.
   Then, after the `MPI_Wait` (triggering the comm creation), call
   `MPI_Register_buffer(void *addr, MPI_Request req);`
   where req is the same handle that was output by the init call
   (e.g. `MPI_Send_init`).
   See [example_pt2pt.c](https://github.com/mpiwg-hw-topology/code-examples/blob/main/example_pt2pt.c)
   and the sketch after the remarks below.

   Remarks:
   * Orthogonal, self-contained feature.
   * `MPI_Register_buffer(MPI_Request req, void *addr);` instead of
     `MPI_Register_buffer(void *addr, MPI_Request req);`
   * Provide a list of buffer addresses:
     `MPI_Register_buffer(MPI_Request req, void *addr_array);`
     the argument is still of type `void *` but is in reality of type `void **`.
   * Do we also need to modify the `count` field of `MPI_XXX_init`?
     => potential issue with what we want to do w.r.t. topologies, since modifying
     the number of elements might substantially change the communication pattern.
   * What kind of use case is there for this feature (outside the topology case)?
   * Need to check whether it solves the issues with the `MPI_Bcast` example.
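A minimal sketch of the deferred-buffer idea, using the argument order from the second remark. `MPI_BUFFER_DEFERRED` and `MPI_Register_buffer` are the draft's proposals, not standard MPI; `comm_req` is assumed to be the request of the pending communicator creation.

```c
#include <mpi.h>
#include <stdlib.h>

#define N 1024

/* HYPOTHETICAL sketch: declare the pattern first, bind the buffer later.
 * MPI_BUFFER_DEFERRED and MPI_Register_buffer are proposals, not standard MPI. */
void deferred_buffer(MPI_Comm newcomm, MPI_Request comm_req, int dest)
{
    MPI_Request send_req;
    double *buf = NULL;

    /* pattern declared up front; the payload address is not known yet */
    MPI_Send_init(MPI_BUFFER_DEFERRED, N, MPI_DOUBLE, dest, 0,
                  newcomm, &send_req);

    /* communicator (and registered operations) become usable here */
    MPI_Wait(&comm_req, MPI_STATUS_IGNORE);

    /* partition data once roles are known, then bind the real address */
    buf = malloc(N * sizeof(double));
    MPI_Register_buffer(send_req, buf);

    MPI_Start(&send_req);
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    free(buf);
}
```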
OR:
~~an `MPI_Schedule_op` function, which would indicate which kind of operation is to
be performed on the communicator. Is only the info pertaining to the communication
pattern useful, or not?~~
Issue: we would need to provide an extensive list of supported operations; almost
the need for a metalanguage.

5. If the topology is made restrictive, then `MPI_Bcast` is similar in behaviour to
   pt2pt operations w.r.t. buffer addresses; moreover, some processes are not
   involved but call `MPI_Bcast` anyway => unused buffer address. For such a
   process, what is the meaning of `MPI_Bcast_init`?

--------------------------------------------------------------------------------
/context-split-4.1-alt.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/context-split-4.1-alt.pdf
--------------------------------------------------------------------------------
/context-split-4.1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/context-split-4.1.pdf
--------------------------------------------------------------------------------
/context-split-errata.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/context-split-errata.pdf
--------------------------------------------------------------------------------
/hw-proposal_V2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/hw-proposal_V2.pdf
--------------------------------------------------------------------------------
/issue-154.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/issue-154.pdf
--------------------------------------------------------------------------------