├── 2021-10-07-TWG-meeting.pdf
├── Draft.md
├── context-split-4.1-alt.pdf
├── context-split-4.1.pdf
├── context-split-errata.pdf
├── hw-proposal_V2.pdf
└── issue-154.pdf

/2021-10-07-TWG-meeting.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/2021-10-07-TWG-meeting.pdf
--------------------------------------------------------------------------------
/Draft.md:
--------------------------------------------------------------------------------
# Topologies (Draft)
--------------------

## Generic considerations

In the current version of the standard, (virtual) topologies are not directly
usable objects.
Topology *information* is attached to a communicator and is accessible
through a set of procedures.
Topologies officially come in three types:
1. Cartesian topologies (`MPI_CART`)
2. general graph topologies (`MPI_GRAPH`)
3. distributed graph topologies (`MPI_DIST_GRAPH`). Distributed graph topologies
   were introduced to alleviate the scalability issues that exist in the
   (earlier) non-distributed version.

The topology type can be queried with a call to the `MPI_TOPO_TEST(comm, status)`
procedure (the returned status can also be `MPI_UNDEFINED`).
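For reference, the sketch below shows how this query works in the current interface: the topology type is a property of the communicator, and a communicator created without a topology reports `MPI_UNDEFINED`.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size, dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Comm cart_comm;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /* reorder */ 1, &cart_comm);

    int status;
    MPI_Topo_test(cart_comm, &status);       /* status == MPI_CART */
    MPI_Topo_test(MPI_COMM_WORLD, &status);  /* status == MPI_UNDEFINED */
    printf("MPI_COMM_WORLD has %s topology attached\n",
           status == MPI_UNDEFINED ? "no" : "a");

    MPI_Comm_free(&cart_comm);
    MPI_Finalize();
    return 0;
}
```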
Two other types of virtual topologies are also more or less officially supported:

4. bipartite graphs, through the properties of intercommunicators
5. neighborhood graphs: actually a subset of general graphs (because of the
   symmetric adjacency matrix requirement)

- **Question** Should these last two types be added to the previous list explicitly?
- **Question** Remove `MPI_CART`? (covered by `MPI_DIST_GRAPH`)
  - Generalize Cartesian to Euclidean space (support of diagonals)
  - Elliptic topologies?

A topology is really a *topological* space; that is especially true for
`MPI_DIST_GRAPH`. More precisely, `MPI_CART` is a *metric* space.

If the interface is redesigned: should the new objects align with their mathematical
counterparts (because MPI is aimed at scientists)?

This redesign is also the opportunity to switch from a flat programming model
to a more structured one, featuring neighborhoods and/or hierarchies for instance.

## Proposal

Promote topologies from implicit information attached to communicators to
full-fledged objects in the standard: `MPI_Topology`.

### Properties/characteristics of the object

Instead of being *advisory*, such a topology object should be:
- **mandatory** (for objects such as communicators, windows, files, and even groups
  or process sets? see below).
  Consequences:
  - `MPI_UNDEFINED` is no longer a valid status value after a call to `MPI_TOPO_TEST`.
  - A **default topology** is needed (in the World Process Model), in case no topology
    is specified explicitly; it should be of type general graph, or distributed graph
    with full connectivity (to account for the MPI flat programming model).
  - An interface has to be introduced, e.g. `MPI_Topo_attach(char *pset_name)` (see below).

- **restrictive**: a topology describes a structure, and MPI communication operations
  must comply with it. This would, for instance, eliminate the need for neighborhood
  collective operations (vs. "regular" collective operations; cf. the Traff et al.
  paper about collective communication and orthogonality).

- **explicit**: a set of procedures must be added in order to access/query/manipulate
  topologies. This would go beyond the current set of operations, which mainly
  work on the naming scheme/coordinates of the MPI processes in the topology.
  **Question**: what is the set of operations we want to support?

- **nature**: hardware vs. software (virtual).

### Relationship between topologies and other MPI objects

A possible way would be to adopt the same scheme as in Sessions:
process set -> group -> communicator.

Here, we start with a process set, an unstructured "object" with almost
no properties (a pset has a name and that's it), and turn it into something more
structured. For instance:

```mermaid
graph TD
    P["MPI process set (unstructured)"] --> G["an MPI Group"];
    T["MPI topology (a structure)"] --> G;
```

We can then apply the same logic to create other MPI objects/concepts:
- MPI Group + context ID = an MPI Communicator (with a topology, since the group had one)

And then:
- MPI Comm + memory region = an MPI Window (now featuring a topology structure to boot)
- MPI Comm + file handle = an MPI File (now featuring a topology structure to boot)

A sketch of this chain is given below.
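A minimal sketch of the pset + topology -> group -> communicator chain. `MPI_Topology`, `MPI_Topo_create_cart` and `MPI_Group_from_pset_topo` are invented placeholder names (modeled on the real MPI-4 `MPI_Group_from_session_pset`); only `MPI_Comm_create_from_group` and `MPI_Group_free` are standard procedures.

```c
#include <mpi.h>

/* HYPOTHETICAL sketch: process set + topology -> group -> communicator.
 * MPI_Topology, MPI_Topo_create_cart and MPI_Group_from_pset_topo do not
 * exist in the standard; they are placeholders for the draft's idea. */
int build_structured_comm(MPI_Session session, MPI_Comm *newcomm)
{
    MPI_Group group;
    MPI_Topology topo;                        /* hypothetical object */
    int dims[2] = {4, 4}, periods[2] = {1, 1};

    /* hypothetical: a standalone topology object, not tied to a communicator */
    MPI_Topo_create_cart(2, dims, periods, &topo);

    /* hypothetical: process set + topology = structured group */
    MPI_Group_from_pset_topo(session, "mpi://WORLD", topo, &group);

    /* real MPI-4: group + context ID = communicator; the topology
     * would be carried along, since the group had one */
    MPI_Comm_create_from_group(group, "org.mpi-forum.hw-topo.example",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, newcomm);

    MPI_Group_free(&group);
    return MPI_SUCCESS;
}
```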
### Naming schemes/coordinates

- Topologies could also possess a **naming scheme**, so that any MPI process can know
  its location in the structure. The default topology's naming scheme would be the
  linear, one-dimensional one.

- **Question** What about distance functions for topologies? What is the meaning of
  hops/neighbors/weights? (clear for Cartesian topologies, but for general ones...)

- Naming schemes should also be promoted so that the user can efficiently
  leverage them. Currently, they can be handled only through:
  - specific procedure parameters (e.g. the `reorder` parameter of `MPI_Cart_create`,
    the `key` parameter of `MPI_Comm_split`, etc.)
  - specific procedures (e.g. `MPI_Cart_shift`, `MPI_Cart_rank`, etc.)

- Naming schemes could be functions that replace the *reorder* parameter throughout
  the standard.
  * pros:
    - users can provide their own naming schemes (right now, reordering
      works as a black box, without any user control over it);
    - hardware information can be injected into the naming scheme by
      MPI implementers, but by application developers too.
  * cons:
    - breaks the current standard.

- Translation from one scheme to another:
  reordering ranks is actually switching from one naming scheme to another.
  Instead of a simple function parameter, can we have:
  `MPI_Reorder(IN int *tuple, OUT int *tuple);`
  **Note** This is actually more of a mapping function, and transformation != map.
  Functionality:

  1. Transform a topology into another one, or create a new topology from an existing
     one (or even from a set of input topologies);
  2. Map: determine the coordinates of a process in the new topology, given its
     coordinates in the old topology.

  **Question** Does reordering imply or necessitate isomorphism?

  How is this different from calling `MPI_Graph_map` N times?

**Note**
- mapping a virtual topology onto another virtual topology => reordering of ranks
- mapping a virtual topology onto a hardware one => assignment of resources

The result might be different in both cases!

- But then we need another function to enforce this new binding (e.g. `MPI_Bind`)?

### Hardware topologies support

- They could be considered as explicit, e.g. as another nature of `MPI_Topology`
  objects. An `MPI_Topology` object would then have a **nature**, virtual vs.
  hardware, each with its own set of types:

  | virtual    | hardware  |
  |------------|-----------|
  | Cartesian  | random    |
  | dist graph | fat-tree  |
  | bipartite  | dragonfly |
  | else?      | ?         |

  Do we impose not one but two topologies for each pset/group/comm
  (a virtual one + a physical one)?

- On the other hand, they could remain implicit, e.g. used as input for a
  translation/mapping function between naming schemes.

### Process sets vs. Resource sets

- **Process sets** contain *processes*, with a virtual topology.
  A process set cannot be larger than the set named mpi://WORLD.

- **Resource sets** contain *processors*, with a physical topology.
  A resource set can expand in the future of the application, especially
  if/when MPI processes can migrate/change their resource binding.

Mapping one topology onto the other is a graph embedding according to some metric
(e.g. bisection bandwidth): map one of many virtual topologies to the hardware one.
Two questions follow: did the map do something? And if yes, how well did it perform?
Hence a procedure on `MPI_Topology` objects such as:

`MPI_Topo_embedd(topo1, topo2, int *mapping_res, flag, quality)`

Test each virtual topology type against the hardware one and get the quality result,
to make the decision on which one to use (see the sketch below).

How to pass user information (pattern, frequency, etc.)?
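A minimal usage sketch of the proposed `MPI_Topo_embedd`, following the argument order above and assuming that `flag` reports whether an embedding was found and that `quality` is a comparable score; all of these semantics are assumptions, not standard MPI.

```c
#include <mpi.h>

/* HYPOTHETICAL: select the virtual topology that embeds best into the
 * hardware topology. MPI_Topology and MPI_Topo_embedd follow the draft's
 * sketch; their types and semantics are assumptions. */
int pick_best_topology(MPI_Topology candidates[], int n,
                       MPI_Topology hw_topo, int *mapping)
{
    int best = -1;
    double best_quality = -1.0;

    for (int i = 0; i < n; i++) {
        int flag;
        double quality;
        MPI_Topo_embedd(candidates[i], hw_topo, mapping, &flag, &quality);
        if (flag && quality > best_quality) {  /* embedding found and better */
            best_quality = quality;
            best = i;
        }
    }
    return best;  /* index of the topology to commit to (e.g. via MPI_Bind) */
}
```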
"It is erroneous to use the communicator newcomm as an input argument to other MPI 215 | functions before the `MPI_COMM_IDUP` operation completes.")~~ 216 | => Easy fix: specifiy in specs 217 | 218 | ~~2- Pre-exisiting communications (`MPI_COMM_SELF`, `MPI_COMM_WORLD`) should be considered as 219 | already committed and should not be committed again (no-op or erroneous if this 220 | kind of handle is used in the commit procedure)~~ 221 | => Easy fix: specifiy in specs 222 | 223 | 3- Question: are the only operations authorized on the communicator the ones 224 | registered before the call to wait/commit ? 225 | => TBD 226 | 227 | ~~4- Question: how to optimize with coalesced collective operations?~~ 228 | => Easy fix: Not part of the proposal 229 | 230 | 5- Process identification issue: when/if creating a new communication (request) with 231 | a call to a topology creation function, the reorder parameter can be set to 1. 232 | Then the MPI process ranks will probably be different from the ones in the initial communicator. 233 | The new ranks are supposed to be usable only once the new communicator is indeed available 234 | (after the commit/wait operation). How can they be used in the calls to the procedures that initialize 235 | the various persistent communication operations? 236 | => that's a tricky one, cf example V2 237 | 238 | 239 | ### Discussion 240 | Based on the second example (`example_v2.c`): 241 | 242 | 1- Naming scheme creation functions are synchronizing because if/when reordering is 243 | enabled, an MPI process might need the informations/parameters passed to the function 244 | by other parameters. This is the case in particular for distributed graphs naming schemes 245 | (a.k.a topology constructors) which might be synchronizing then (maybe not, depends 246 | on how the mapping should be done). 247 | This might not be necessary in the Cartesian case, as parameters 248 | are (should be) all the same on every involved MPI process. 249 | There could be exchange of information (e.g. to gather hardware information) but 250 | not necessarily with another MPI process but rather with a SW agent (that can even 251 | be supported by a progress thread). 252 | 253 | 2- In the case of Cartesian topologies, since the structure is known, it is easy for 254 | a process to identify its place in the topology. Its name a tuple (size = # of dimensions) 255 | and a process can have a "good vision" in the topology of another process' place. 256 | Also neighbors transitivity is possible because of the structure. 257 | In the case of graphs and distributed graphs, little can be said about a process' "location" 258 | in the topology once the naming scheme has been called/resolved. A process knows its 259 | neighbors and the only thing usable is the reflexivity of the neighborhood relationship. 260 | 261 | 3- Notion of local addressing/identifiers, without a meaning outside each process that 262 | determines the ordering of its neighbors. Reorder meaning in this case?? Myabe irrelevant. 263 | So, it the naming is local, the procedure can be local. But if more information is needed 264 | (e.g. values of parameters passed to the procedure by other processes), then synchronizing, 265 | therefore non local. 266 | 267 | 4- Data partition should not be happening *before* the `MPI_Wait` operation that creates/commits 268 | the new communicator. Issue: using the address of the data buffer before the processes know their 269 | respective roles is not realistic. 
### Issues/Open questions

1. ~~Side effect: lift some restrictions on `MPI_Comm_idup`
   (i.e. "It is erroneous to use the communicator newcomm as an input argument to
   other MPI functions before the `MPI_COMM_IDUP` operation completes.")~~
   => Easy fix: specify in the spec.

2. ~~Pre-existing communicators (`MPI_COMM_SELF`, `MPI_COMM_WORLD`) should be
   considered as already committed and should not be committed again (no-op or
   erroneous if this kind of handle is used in the commit procedure).~~
   => Easy fix: specify in the spec.

3. Question: are the only operations authorized on the communicator the ones
   registered before the call to wait/commit?
   => TBD

4. ~~Question: how to optimize with coalesced collective operations?~~
   => Easy fix: not part of the proposal.

5. Process identification issue: when/if creating a new communicator (request) with
   a call to a topology creation function, the reorder parameter can be set to 1.
   The MPI process ranks will then probably be different from the ones in the initial
   communicator. The new ranks are supposed to be usable only once the new
   communicator is indeed available (after the commit/wait operation). How can they
   be used in the calls to the procedures that initialize the various persistent
   communication operations?
   => that's a tricky one, cf. example V2.

### Discussion

Based on the second example (`example_v2.c`):

1. Naming scheme creation functions are synchronizing, because if/when reordering is
   enabled, an MPI process might need the information/parameters passed to the
   function by other processes. This is the case in particular for distributed graph
   naming schemes (a.k.a. topology constructors), which might then be synchronizing
   (maybe not; it depends on how the mapping should be done).
   This might not be necessary in the Cartesian case, as parameters are (or should be)
   all the same on every involved MPI process.
   There could be an exchange of information (e.g. to gather hardware information),
   but not necessarily with another MPI process: rather with a SW agent (which can
   even be supported by a progress thread).

2. In the case of Cartesian topologies, since the structure is known, it is easy for
   a process to identify its place in the topology. Its name is a tuple (size =
   number of dimensions), and a process can have a "good vision" of another process'
   place in the topology. Neighbor transitivity is also possible because of the
   structure. In the case of graphs and distributed graphs, little can be said about
   a process' "location" in the topology once the naming scheme has been
   called/resolved. A process knows its neighbors, and the only usable property is
   the symmetry of the neighborhood relationship.

3. Notion of local addressing/identifiers, without meaning outside each process,
   which determines the ordering of its neighbors. What does reorder mean in this
   case? Maybe it is irrelevant. So, if the naming is local, the procedure can be
   local. But if more information is needed (e.g. values of parameters passed to the
   procedure by other processes), then it is synchronizing, therefore non-local.

4. Data partitioning should not happen *before* the `MPI_Wait` operation that
   creates/commits the new communicator. Issue: using the address of the data buffer
   before the processes know their respective roles is not realistic.
   Maybe OK for collectives, but not for pt2pt and RMA.
   `MPI_Send_init` should be split into two parts.

   How about an `MPI_Register_buffer` function? In the pt2pt operation, pass
   `MPI_BUFFER_DEFERRED` (à la `MPI_IN_PLACE`) instead of the buffer address.
   Then, after the `MPI_Wait` (triggering the comm creation), call
   `MPI_Register_buffer(void *addr, MPI_Request req);`
   where req is the same handle that was output by the init call
   (e.g. `MPI_Send_init`).
   See [example_pt2pt.c](https://github.com/mpiwg-hw-topology/code-examples/blob/main/example_pt2pt.c)
   and the sketch after the remarks below.

   Remarks:
   * Orthogonal, self-contained feature.
   * `MPI_Register_buffer(MPI_Request req, void *addr);` instead of
     `MPI_Register_buffer(void *addr, MPI_Request req);`
   * Provide a list of buffer addresses:
     `MPI_Register_buffer(MPI_Request req, void *addr_array);`
     the argument is still of type `void *` but is in reality of type `void **`.
   * Do we also need to modify the `count` field of `MPI_XXX_init`?
     => potential issue with what we want to do w.r.t. topologies, since modifying
     the number of elements might substantially change the communication pattern.
   * What kind of use case is there for this feature (outside the topology case)?
   * Need to check whether it solves the issues with the `MPI_Bcast` example.
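A minimal sketch of the deferred-buffer idea, using the argument order from the second remark. `MPI_BUFFER_DEFERRED` and `MPI_Register_buffer` are the draft's proposals, not standard MPI; `comm_req` is assumed to be the request of the pending communicator creation.

```c
#include <mpi.h>
#include <stdlib.h>

#define N 1024

/* HYPOTHETICAL sketch: declare the pattern first, bind the buffer later.
 * MPI_BUFFER_DEFERRED and MPI_Register_buffer are proposals, not standard MPI. */
void deferred_buffer(MPI_Comm newcomm, MPI_Request comm_req, int dest)
{
    MPI_Request send_req;
    double *buf = NULL;

    /* pattern declared up front; the payload address is not known yet */
    MPI_Send_init(MPI_BUFFER_DEFERRED, N, MPI_DOUBLE, dest, 0,
                  newcomm, &send_req);

    /* communicator (and registered operations) become usable here */
    MPI_Wait(&comm_req, MPI_STATUS_IGNORE);

    /* partition data once roles are known, then bind the real address */
    buf = malloc(N * sizeof(double));
    MPI_Register_buffer(send_req, buf);

    MPI_Start(&send_req);
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    free(buf);
}
```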
OR:
~~an `MPI_Schedule_op` function, which would indicate which kind of operation is to
be performed on the communicator. Is only the info pertaining to the communication
pattern useful, or not?~~
Issue: we would need to provide an extensive list of supported operations; almost
the need for a metalanguage.

5. If the topology is made restrictive, then `MPI_Bcast` is similar in behaviour to
   pt2pt operations w.r.t. buffer addresses; moreover, some processes are not
   involved but call `MPI_Bcast` anyway => unused buffer address. For such a
   process, what is the meaning of `MPI_Bcast_init`?

--------------------------------------------------------------------------------
/context-split-4.1-alt.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/context-split-4.1-alt.pdf
--------------------------------------------------------------------------------
/context-split-4.1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/context-split-4.1.pdf
--------------------------------------------------------------------------------
/context-split-errata.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/context-split-errata.pdf
--------------------------------------------------------------------------------
/hw-proposal_V2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/hw-proposal_V2.pdf
--------------------------------------------------------------------------------
/issue-154.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpiwg-hw-topology/Discussions-and-proposal/cf2c443e5587fbab326c7a58242559c83d14496d/issue-154.pdf
--------------------------------------------------------------------------------