├── CONTRIBUTING.md ├── LICENSE.txt ├── README.md ├── algo.cpp ├── inputs ├── bert32a100.json ├── gnmt.json └── resnet.json └── nlohmann └── json.hpp /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | This project welcomes contributions and suggestions. Most contributions require you to 4 | agree to a Contributor License Agreement (CLA) declaring that you have the right to, 5 | and actually do, grant us the rights to use your contribution. For details, visit 6 | https://cla.microsoft.com. 7 | 8 | When you submit a pull request, a CLA-bot will automatically determine whether you need 9 | to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the 10 | instructions provided by the bot. You will only need to do this once across all repositories using our CLA. 11 | 12 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 13 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 14 | or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Code for NeurIPS 2021 paper "Piper: Multidimensional Planner for DNN Parallelization" 2 | 3 | MIT License 4 | 5 | Copyright (c) 2021 - present Microsoft Corporation 6 | 7 | All rights reserved. 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy 10 | of this software and associated documentation files (the "Software"), to deal 11 | in the Software without restriction, including without limitation the rights 12 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 13 | copies of the Software, and to permit persons to whom the Software is 14 | furnished to do so, subject to the following conditions: 15 | 16 | The above copyright notice and this permission notice shall be included in all 17 | copies or substantial portions of the Software. 18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 22 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 23 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 24 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 25 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Piper: Multidimensional Planner for DNN Parallelization - code 2 | 3 | This code package contains algorithms (proof-of-concept implementation) and input files (profiled DNN models / workloads) from the paper "Piper: Multidimensional Planner for DNN Parallelization" published at NeurIPS 2021. 4 | It allows one to reproduce the results in the paper, as well as run the partitioning algorithms on other workloads. 5 | 6 | ## Input format 7 | 8 | All our algorithms take as input a JSON file with the following format (all fields are mandatory unless indicated otherwise). 
This format closely follows our model (see Section 3 "Problem Setup" in the paper): 9 | * `maxMemoryPerDevice` (floating-point): a memory size limit of a single accelerator, in bytes, 10 | * `maxDevices` (integer): number of accelerators (`k` from the paper), 11 | * `mbsInBatch` (integer): number of microbatches in a batch (`N` from the paper), 12 | * `bandwidth` (floating-point): bandwidth from each device to the outside, in bytes per second, 13 | * `nodes` (array): for each node (layer): 14 | * `id` (integer): unique ID of node, 15 | * `TMPCs` (dictionary): mapping from tensor-parallelism degree (`t`) to an array of TMPCs, each having: 16 | * `id` (string): name, 17 | * `timePerSample` (floating-point): compute latency (backward+forward, quantity `p` from the paper), 18 | * `parameterSize` (floating-point): size of weights (to be used in computing data-parallel resync costs, quantity `w` from the paper), 19 | * `memoryUsageA`, `memoryUsageB` (floating-point): memory usage coefficients `a` and `b` (see paper), 20 | * `syncTimeFw` (dictionary): mapping from tails of incoming edges to their parameters `c^fw` (see paper), 21 | * `syncTimeBw` (dictionary): mapping from heads of outgoing edges to their parameters `c^bw` (see paper), 22 | * `edges` (array): for each edge: 23 | * `sourceId` (integer): the ID of the tail of the edge (edge from `sourceId` to `destId`), 24 | * `destId` (integer): the ID of the head of the edge, 25 | * `communicationCost` (floating-point): cost of transfer over this edge (in bytes). 26 | 27 | Other debug information may be present in the input files, such as `name`s on nodes. A minimal example input is shown at the end of this README. 28 | 29 | ## Piper algorithm 30 | 31 | The solution is implemented in `algo.cpp`. It is a single C++ file (using one header-only library for JSON parsing) and can be compiled with a recent version of `gcc` by running e.g. `g++ -O3 algo.cpp -o algo.exe`. 32 | 33 | The compiled program runs experiments from the paper - see `main()` at the end of `algo.cpp`. 34 | It is possible to run only a subset of the evaluations by simply commenting out some lines in `main()`. 35 | The simplest mode of usage is shown in `single()`. 36 | The main example input file is `inputs/bert32a100.json`. 37 | 38 | ## Legal notices 39 | 40 | **Trademarks** 41 | This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies. 42 | 43 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 44 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 45 | or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. 46 | 47 | We use the [JSON for Modern C++](https://github.com/nlohmann/json) library, copyright (c) 2013-2020 Niels Lohmann, licensed under the MIT license. 
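
## Example input

The following is a minimal, hand-written input in the format described under "Input format" above. All numbers are purely illustrative (not profiled from any real model); real inputs such as `inputs/bert32a100.json` are much larger. As read by `algo.cpp`, the microbatch count is the field `mbsInBatch`, `syncTimeFw` is keyed by the IDs of a node's predecessors and `syncTimeBw` by its successors, and the `syncTime` values are added to edge `communicationCost`s, so they are also in bytes; this sketch should pass the checks in `checkInputCorrectness()`.

```json
{
  "maxMemoryPerDevice": 1.6e10,
  "maxDevices": 4,
  "mbsInBatch": 8,
  "bandwidth": 2.5e10,
  "nodes": [
    {
      "id": 0,
      "name": "layer0",
      "TMPCs": {
        "1": [
          {
            "id": "vanilla",
            "timePerSample": 0.004,
            "parameterSize": 2.0e8,
            "memoryUsageA": 1.5e9,
            "memoryUsageB": 3.0e9,
            "syncTimeFw": {},
            "syncTimeBw": {"1": 1.0e6}
          }
        ]
      }
    },
    {
      "id": 1,
      "name": "layer1",
      "TMPCs": {
        "1": [
          {
            "id": "vanilla",
            "timePerSample": 0.003,
            "parameterSize": 1.0e8,
            "memoryUsageA": 1.0e9,
            "memoryUsageB": 2.0e9,
            "syncTimeFw": {"0": 1.0e6},
            "syncTimeBw": {}
          }
        ]
      }
    }
  ],
  "edges": [
    {"sourceId": 0, "destId": 1, "communicationCost": 4.0e6}
  ]
}
```

To try such a file, save it under `inputs/` and adapt `main()` in `algo.cpp` (e.g. the `single()` mode mentioned above) to load it.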
48 | -------------------------------------------------------------------------------- /algo.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | 10 | #include "nlohmann/json.hpp" 11 | 12 | using namespace std; 13 | using json = nlohmann::json; 14 | 15 | // begin trivial helper stuff 16 | ostream& dbg = cerr; 17 | 18 | void fail (const string &s) { 19 | cout << "FAIL: " << s << endl; 20 | dbg << "FAIL: " << s << endl; 21 | exit(1); 22 | } 23 | 24 | void warn (const string &s) { 25 | dbg << "WARNING: " << s << endl; 26 | } 27 | 28 | #define DBG(vari) cerr<<"["<<__LINE__<<"] "<<#vari<<" = "<<(vari)< 31 | ostream& operator << (ostream &s, const vector &v) { 32 | for (const T &x : v) { 33 | s << x << " "; 34 | } 35 | return s; 36 | } 37 | 38 | template 39 | string to_string (const vector &v) { 40 | stringstream ss; 41 | ss << v; 42 | return ss.str(); 43 | } 44 | 45 | template 46 | void append (vector &v, const vector &w) { 47 | v.insert(v.end(), w.begin(), w.end()); 48 | } 49 | 50 | template 51 | inline void minify (T &x, const T &y) { 52 | x = min(x,y); 53 | } 54 | 55 | int ceildiv (int x, int y) { 56 | assert(y > 0); 57 | return (x + y - 1) / y; 58 | } 59 | 60 | constexpr double INFTY = 1e30; 61 | 62 | vector vectorOfSetBits (const vector &v) { 63 | vector res; 64 | for (int i = 0; i < v.size(); ++i) { 65 | if (v[i]) { 66 | res.push_back(i); 67 | } 68 | } 69 | return res; 70 | } 71 | 72 | string formatMemoryUsage (const double M) { 73 | stringstream ss; 74 | ss << setprecision(1) << fixed << (M / (1 << 30)) << " GB"; 75 | return ss.str(); 76 | } 77 | 78 | string formatTPS (double our) { 79 | stringstream ss; 80 | int prec = 1; 81 | // make it so that there are 3 nonzero digits displayed 82 | double powerOur = our * 10; 83 | while (powerOur < 1 && prec < 3) { 84 | prec++; 85 | powerOur *= 10; 86 | } 87 | ss << setprecision(prec+2) << fixed << our << " s"; 88 | return ss.str(); 89 | } 90 | 91 | string formatTPSIncrease (double our, double worse) { 92 | if (worse > INFTY/2) { 93 | return "OOM"; 94 | } 95 | stringstream ss; 96 | double percent = (worse / our - 1) * 100; 97 | if (percent < 100) { 98 | ss << setprecision(2) << fixed << percent << "\\%"; 99 | } else { 100 | double times = worse / our; 101 | ss << setprecision(2) << fixed << times << "x"; 102 | } 103 | return ss.str(); 104 | } 105 | 106 | string formatRuntime (double our) { 107 | stringstream ss; 108 | ss << setprecision(1) << fixed << our << " s"; 109 | return ss.str(); 110 | } 111 | 112 | double average(const vector &numbers) { 113 | if (numbers.empty()) { 114 | fail("trying to take average of an empty vector"); 115 | } 116 | return accumulate(numbers.begin(), numbers.end(), 0.0) / static_cast(numbers.size()); 117 | } 118 | 119 | double sampleStddev(const vector &numbers) { 120 | if (numbers.empty()) { 121 | fail("trying to take stddev of an empty vector"); 122 | } 123 | if (numbers.size() == 1) { 124 | fail("trying to take sample stddev of a single number"); 125 | } 126 | const double avg = average(numbers); 127 | double sum_of_squares = 0.0; 128 | for (double number : numbers) { 129 | sum_of_squares += pow(number - avg, 2); 130 | } 131 | return sqrt(sum_of_squares / (static_cast(numbers.size()) - 1)); 132 | } 133 | // end trivial helper stuff 134 | 135 | 136 | constexpr int IDEALS_LIMIT = 40'000; 137 | constexpr int IDEALS_EXPLORATION_LIMIT = 200'000; 138 | constexpr int DEVICES_LIMIT = 
10'000; // some loose upper bound on number of devices there can be in any reasonable input 139 | bool DATA_PARALLELISM_ALLOWED = true; 140 | bool TENSOR_PARALLELISM_ALLOWED = true; 141 | bool ACTIVATION_RECOMPUTATION_ALLOWED = true; 142 | bool ACTIVATION_RECOMPUTATION_FORCED = false; 143 | bool ACTIVATION_RECOMPUTATION_ALL_LAYERS_OR_NONE = false; 144 | constexpr bool KNAPSACK_FAST_HEURISTIC = true; 145 | bool OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION = false; 146 | bool DEBUG_DATA_PARALLEL_COSTS = false; 147 | constexpr bool FASTER_DP_IMPLEMENTATION = true; 148 | 149 | 150 | struct TMPC { 151 | // known from the context: 152 | // - node id (v) 153 | // - number of devices (k) 154 | double timePerSample; // p 155 | unordered_map syncTimeFw; // syncTimeFw[u] = sfw(u, this node) 156 | unordered_map syncTimeBw; // syncTimeBw[w] = sbw(this node, w) 157 | double parameterSize; // w (only used to compute data-parallel resync costs) 158 | double memoryUsageA, memoryUsageB; // usage is A*y + B, y being ceil(suffix-sum-of-dp-degrees / d) 159 | string id; 160 | }; 161 | 162 | 163 | void from_json (const json &j, TMPC &t) { 164 | j.at("timePerSample").get_to(t.timePerSample); 165 | map syncTimeFw, syncTimeBw; 166 | j.at("syncTimeFw").get_to(syncTimeFw); 167 | for (const pair &p : syncTimeFw) { 168 | // TODO verify that p.first is a number 169 | t.syncTimeFw[stoi(p.first)] = p.second; 170 | } 171 | j.at("syncTimeBw").get_to(syncTimeBw); 172 | for (const pair &p : syncTimeBw) { 173 | // TODO verify that p.first is a number 174 | t.syncTimeBw[stoi(p.first)] = p.second; 175 | } 176 | j.at("parameterSize").get_to(t.parameterSize); 177 | j.at("memoryUsageA").get_to(t.memoryUsageA); 178 | j.at("memoryUsageB").get_to(t.memoryUsageB); 179 | j.at("id").get_to(t.id); 180 | } 181 | 182 | 183 | struct Node { 184 | int id; // v 185 | unordered_map> TMPCs; // TMPCs[k] = vector of TMPCs for number k of devices 186 | string name; // just for debugging etc. 
187 | }; 188 | 189 | 190 | void from_json (const json &j, Node &n) { 191 | j.at("id").get_to(n.id); 192 | map> TMPCs; 193 | j.at("TMPCs").get_to(TMPCs); 194 | for (const pair> &p : TMPCs) { 195 | // TODO verify that p.first is a number 196 | n.TMPCs[stoi(p.first)] = p.second; 197 | } 198 | if (j.count("name")) { 199 | j.at("name").get_to(n.name); 200 | } 201 | } 202 | 203 | 204 | struct Edge { 205 | int sourceId; // u 206 | int destId; // v 207 | double communicationCost; // c(u,v), in bytes 208 | }; 209 | 210 | 211 | void from_json (const json &j, Edge &e) { 212 | j.at("sourceId").get_to(e.sourceId); 213 | j.at("destId").get_to(e.destId); 214 | j.at("communicationCost").get_to(e.communicationCost); 215 | } 216 | 217 | 218 | struct Instance { 219 | double maxMemoryPerDevice; 220 | int maxDevices; 221 | double bandwidth; // in bytes per second 222 | int mbsInBatch; 223 | vector nodes; 224 | vector edges; 225 | 226 | // filled with renumber() 227 | unordered_map newNumber; 228 | vector oldNumber; 229 | 230 | void linearize(); 231 | void checkInputCorrectness() const; 232 | bool isDAG() const; 233 | void renumber(); 234 | 235 | // for our experiments 236 | void insertTransformerLayer(); 237 | vector getTransformerIds() const; 238 | vector getNodeIdsToposorted() const; 239 | int getMaxTensorParallelDegree() const; 240 | }; 241 | 242 | 243 | void from_json (const json &j, Instance &ii) { 244 | j.at("maxMemoryPerDevice").get_to(ii.maxMemoryPerDevice); 245 | j.at("maxDevices").get_to(ii.maxDevices); 246 | j.at("bandwidth").get_to(ii.bandwidth); 247 | j.at("mbsInBatch").get_to(ii.mbsInBatch); 248 | j.at("nodes").get_to(ii.nodes); 249 | j.at("edges").get_to(ii.edges); 250 | 251 | ii.checkInputCorrectness(); 252 | ii.renumber(); 253 | ii.checkInputCorrectness(); 254 | } 255 | 256 | 257 | void Instance::linearize () { 258 | fail("not needed for this work, to be implemented when needed"); 259 | // remember, there should be no parallel edges even after adding the path 260 | } 261 | 262 | 263 | void Instance::checkInputCorrectness() const { 264 | if (maxDevices < 1 || maxDevices > DEVICES_LIMIT) { 265 | fail("wrong number of devices"); 266 | } 267 | if (bandwidth < 1e-9) { 268 | fail("wrong bandwidth"); 269 | } 270 | if (maxMemoryPerDevice < 1e-9) { 271 | fail("wrong maxMemoryPerDevice"); 272 | } 273 | if (mbsInBatch < 1) { 274 | fail("wrong mbsInBatch"); 275 | } 276 | if (nodes.empty()) { 277 | fail("no nodes in input"); 278 | } 279 | set> edgesAsPairs; 280 | unordered_map> incomingEdges, outgoingEdges; 281 | for (const Node &n : nodes) { 282 | incomingEdges[n.id] = {}; 283 | outgoingEdges[n.id] = {}; 284 | } 285 | if (edges.empty()) { 286 | fail("no edges in input"); 287 | } 288 | for (const Edge &e : edges) { 289 | if (e.communicationCost < 0) { 290 | fail("communication cost of edge is negative"); // zero is fine 291 | } 292 | if (edgesAsPairs.insert(make_pair(e.sourceId, e.destId)).second == false) { 293 | fail("duplicated edge"); 294 | } 295 | if (!incomingEdges.count(e.destId) || !outgoingEdges.count(e.sourceId)) { 296 | fail("edge endpoint is not in nodes"); 297 | } 298 | incomingEdges[e.destId].insert(e.sourceId); 299 | outgoingEdges[e.sourceId].insert(e.destId); 300 | } 301 | 302 | if (!isDAG()) { 303 | fail("instance is not acyclic"); 304 | } 305 | 306 | set nodeIds; 307 | for (const Node &n : nodes) { 308 | if (nodeIds.insert(n.id).second == false) { 309 | fail("duplicate node ID"); 310 | } 311 | // check TMPCs 312 | for (const auto &it : n.TMPCs) { 313 | if (it.first < 1 || it.first > 
DEVICES_LIMIT) { 314 | fail("key " + to_string(it.first) + " doesn't look like a number of devices"); 315 | } 316 | set TMPCIds; 317 | for (const TMPC &tmpc : it.second) { 318 | if (tmpc.id.empty()) { 319 | fail("TMPC ID empty"); 320 | } 321 | if (TMPCIds.insert(tmpc.id).second == false) { 322 | fail("TMPC IDs are not distinct within one node and number of devices"); 323 | } 324 | // check each TMPC 325 | if (tmpc.timePerSample < 0) { 326 | fail("wrong timePerSample"); 327 | } 328 | if (tmpc.parameterSize < 0) { 329 | fail("wrong parameterSize"); 330 | } 331 | if (tmpc.memoryUsageA < 0 || tmpc.memoryUsageB < 0) { 332 | fail("wrong memoryUsage"); 333 | } 334 | // syncTimes should contain only real edges 335 | for (const pair &p : tmpc.syncTimeFw) { 336 | if (p.second < 0) { 337 | fail("wrong syncTime"); 338 | } 339 | if (!incomingEdges[n.id].count(p.first)) { 340 | dbg << n.id << "<-" << p.first << endl; 341 | fail("edge present in syncTimes but not in edges"); 342 | } 343 | } 344 | for (const pair &p : tmpc.syncTimeBw) { 345 | if (p.second < 0) { 346 | fail("wrong syncTime"); 347 | } 348 | if (!outgoingEdges[n.id].count(p.first)) { 349 | dbg << n.id << "->" << p.first << endl; 350 | fail("edge present in syncTimes but not in edges"); 351 | } 352 | } 353 | // and they should contain all edges 354 | for (int u : incomingEdges[n.id]) { 355 | if (!tmpc.syncTimeFw.count(u)) { 356 | dbg << "incoming " << n.id << " <- " << u << endl; 357 | fail("edge missing from syncTimes"); 358 | } 359 | } 360 | for (int w : outgoingEdges[n.id]) { 361 | if (!tmpc.syncTimeBw.count(w)) { 362 | dbg << "outgoing " << n.id << " -> " << w << endl; 363 | fail("edge missing from syncTimes"); 364 | } 365 | } 366 | } 367 | if (KNAPSACK_FAST_HEURISTIC) { 368 | if (it.second.size() > 2) { 369 | fail("can't run the fast heuristic with > 2 TMPCs"); 370 | } 371 | } 372 | } 373 | if (!n.TMPCs.count(1) || n.TMPCs.at(1).empty()) { 374 | warn("no TMPC given for executing node " + to_string(n.id) + " w/o tensor parallelism?"); 375 | } 376 | 377 | } 378 | } 379 | 380 | 381 | bool Instance::isDAG() const { 382 | unordered_map indegree; 383 | unordered_map> outgoingEdges; 384 | for (const Edge &e : edges) { 385 | ++indegree[e.destId]; 386 | outgoingEdges[e.sourceId].push_back(e.destId); 387 | } 388 | vector deg0vertices; 389 | for (const Node &n : nodes) { 390 | if (indegree[n.id] == 0) { 391 | deg0vertices.push_back(n.id); 392 | } 393 | } 394 | int processed_vertices = 0; 395 | while (!deg0vertices.empty()) { 396 | int v = deg0vertices.back(); 397 | deg0vertices.pop_back(); 398 | ++processed_vertices; 399 | for (int w : outgoingEdges[v]) { 400 | --indegree[w]; 401 | if (indegree[w] == 0) { 402 | deg0vertices.push_back(w); 403 | } 404 | } 405 | } 406 | return processed_vertices == nodes.size(); 407 | } 408 | 409 | 410 | void Instance::renumber () { 411 | // renumber nodes as 0,1,2,... 
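    // (the rest of the code indexes plain vectors directly by node id, e.g. node[], incomingEdges[]/outgoingEdges[]
    //  and the ideal indicator vectors in Graph, so the ids must form the contiguous range 0..nodes.size()-1;
    //  oldNumber/newNumber keep the mapping so that renumberResultBack() can restore the original ids)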
412 | assert(oldNumber.empty()); // oldNumber.size() used as next id 413 | // build oldNumber and newNumber 414 | for (Node &n : nodes) { 415 | newNumber[n.id] = oldNumber.size(); 416 | oldNumber.push_back(n.id); 417 | } 418 | // now replace old ids with new ids everywhere 419 | for (Node &n : nodes) { 420 | n.id = newNumber[n.id]; 421 | for (auto &it : n.TMPCs) { 422 | for (auto &tmpc : it.second) { 423 | unordered_map newSyncTimeFw, newSyncTimeBw; 424 | for (const auto &p : tmpc.syncTimeFw) { 425 | newSyncTimeFw[newNumber[p.first]] = p.second; 426 | } 427 | for (const auto &p : tmpc.syncTimeBw) { 428 | newSyncTimeBw[newNumber[p.first]] = p.second; 429 | } 430 | tmpc.syncTimeFw = move(newSyncTimeFw); 431 | tmpc.syncTimeBw = move(newSyncTimeBw); 432 | } 433 | } 434 | } 435 | for (Edge &e : edges) { 436 | e.sourceId = newNumber[e.sourceId]; 437 | e.destId = newNumber[e.destId]; 438 | } 439 | } 440 | 441 | 442 | void Instance::insertTransformerLayer () { 443 | // make a larger DNN for our experiments 444 | 445 | // find a ParallelTransformerLayer 446 | unordered_set idsOfTransformerLayers; 447 | for (const Node &n : nodes) { 448 | if (n.name == "ParallelTransformerLayer") { 449 | idsOfTransformerLayers.insert(n.id); 450 | } 451 | } 452 | unordered_map prevTransformer, nextTransformer; 453 | for (const Edge &e : edges) { 454 | if (idsOfTransformerLayers.count(e.sourceId) && idsOfTransformerLayers.count(e.destId)) { 455 | prevTransformer[e.destId] = e.sourceId; 456 | nextTransformer[e.sourceId] = e.destId; 457 | } 458 | } 459 | int v = -1; 460 | for (int w : idsOfTransformerLayers) { 461 | if (prevTransformer.count(w) && nextTransformer.count(w)) { 462 | // found 463 | v = w; 464 | break; 465 | } 466 | } 467 | if (v == -1) { 468 | fail("no transformer with transformer precedessor and successor found"); 469 | } 470 | 471 | // replicate v 472 | static int freshNumber = 123456789; 473 | ++freshNumber; 474 | int nu = oldNumber.size(); 475 | oldNumber.push_back(freshNumber); 476 | newNumber[freshNumber] = nu; 477 | 478 | const int s = nextTransformer[v]; 479 | assert(nodes[v].id == v); 480 | assert(nodes[s].id == s); 481 | nodes.emplace_back(); 482 | nodes.back().id = nu; 483 | nodes.back().name = "ParallelTransformerLayer"; 484 | 485 | // redirect edge v->s to be v->nu 486 | // and add edge nu->s 487 | double communicationCost = 1e30; 488 | for (Edge &e : edges) { 489 | if (e.sourceId == v && e.destId == s) { 490 | e.destId = nu; 491 | communicationCost = e.communicationCost; 492 | } 493 | } 494 | assert(communicationCost < 1e29); 495 | edges.emplace_back(); 496 | edges.back().sourceId = nu; 497 | edges.back().destId = s; 498 | edges.back().communicationCost = communicationCost; 499 | 500 | for (auto &it : nodes[v].TMPCs) { 501 | for (auto &tmpc : it.second) { 502 | const double sbw = tmpc.syncTimeBw.at(s); 503 | tmpc.syncTimeBw.erase(s); 504 | tmpc.syncTimeBw.emplace(nu, sbw); 505 | } 506 | } 507 | for (auto &it : nodes[s].TMPCs) { 508 | for (auto &tmpc : it.second) { 509 | const double sfw = tmpc.syncTimeFw.at(v); 510 | tmpc.syncTimeFw.erase(v); 511 | tmpc.syncTimeFw.emplace(nu, sfw); 512 | } 513 | } 514 | // create TMPCs for nu 515 | nodes.back().TMPCs = nodes[v].TMPCs; 516 | for (auto &it : nodes.back().TMPCs) { 517 | for (auto &tmpc : it.second) { 518 | // leave all stuff as it was, just rewrite syncTimeFw and syncTimeBw 519 | const double sfw = tmpc.syncTimeFw.at(prevTransformer[v]); 520 | const double sbw = tmpc.syncTimeBw.at(nu); 521 | tmpc.syncTimeFw = {{v, sfw}}; 522 | tmpc.syncTimeBw = 
{{s, sbw}}; 523 | } 524 | } 525 | 526 | // TODO: ideally, here we should also: 527 | // for each other edge a->v, add a->nu 528 | // for each other edge v->b, add nu->b 529 | // (but, in the BERT model input, the former are irrelevant and latter are absent) 530 | 531 | checkInputCorrectness(); 532 | } 533 | 534 | 535 | vector Instance::getTransformerIds() const { 536 | unordered_set tranformerIdsUnordered; 537 | for (const Node &n : nodes) { 538 | if (n.name == "ParallelTransformerLayer") { 539 | tranformerIdsUnordered.insert(n.id); 540 | } 541 | } 542 | 543 | // return these in topological order 544 | unordered_map indegree; 545 | unordered_map> outgoingEdges; 546 | for (const Edge &e : edges) { 547 | ++indegree[e.destId]; 548 | outgoingEdges[e.sourceId].push_back(e.destId); 549 | } 550 | vector deg0vertices; 551 | for (const Node &n : nodes) { 552 | if (indegree[n.id] == 0) { 553 | deg0vertices.push_back(n.id); 554 | } 555 | } 556 | vector transformerIdsOrdered; 557 | while (!deg0vertices.empty()) { 558 | int v = deg0vertices.back(); 559 | deg0vertices.pop_back(); 560 | if (tranformerIdsUnordered.count(v)) { 561 | transformerIdsOrdered.push_back(v); 562 | } 563 | for (int w : outgoingEdges[v]) { 564 | --indegree[w]; 565 | if (indegree[w] == 0) { 566 | deg0vertices.push_back(w); 567 | } 568 | } 569 | } 570 | return transformerIdsOrdered; 571 | } 572 | 573 | 574 | vector Instance::getNodeIdsToposorted() const { 575 | unordered_map indegree; 576 | unordered_map> outgoingEdges; 577 | for (const Edge &e : edges) { 578 | ++indegree[e.destId]; 579 | outgoingEdges[e.sourceId].push_back(e.destId); 580 | } 581 | vector deg0vertices; 582 | for (const Node &n : nodes) { 583 | if (indegree[n.id] == 0) { 584 | deg0vertices.push_back(n.id); 585 | } 586 | } 587 | vector nodeIdsOrdered; 588 | while (!deg0vertices.empty()) { 589 | int v = deg0vertices.back(); 590 | deg0vertices.pop_back(); 591 | nodeIdsOrdered.push_back(v); 592 | for (int w : outgoingEdges[v]) { 593 | --indegree[w]; 594 | if (indegree[w] == 0) { 595 | deg0vertices.push_back(w); 596 | } 597 | } 598 | } 599 | return nodeIdsOrdered; 600 | } 601 | 602 | 603 | int Instance::getMaxTensorParallelDegree() const { 604 | // returns max t such that some node has some TMPC for tensor-parallel degree t 605 | int res = 1; 606 | for (const Node &n : nodes) { 607 | for (const auto &it : n.TMPCs) { 608 | res = max(res, it.first); 609 | } 610 | } 611 | return res; 612 | } 613 | 614 | 615 | struct ResultStage { 616 | vector nodes; // node ids (this is actually superfluous because of TMPCids) 617 | int dataParallelDegree; // d_i 618 | int tensorParallelDegree; // t_i 619 | unordered_map TMPCids; // TMPCids[v] = identifier of the TMPC used for node v 620 | }; 621 | 622 | 623 | struct Result { 624 | vector stages; 625 | string debugInfo; 626 | }; 627 | 628 | 629 | void to_json (json &j, const ResultStage &rs) { 630 | j = json{{"nodes", rs.nodes}, 631 | {"dataParallelDegree", rs.dataParallelDegree}, 632 | {"tensorParallelDegree", rs.tensorParallelDegree}, 633 | {"TMPCids", rs.TMPCids} 634 | }; 635 | } 636 | 637 | 638 | void to_json (json &j, const Result &r) { 639 | j = json{{"stages", r.stages}}; 640 | } 641 | 642 | 643 | struct Graph { 644 | const Instance &ins; // already renumbered (nodes 0,1,2,...) 
645 | const int boundOnS; // set as min(mbsInBatch, maxDevices) 646 | 647 | vector>> incomingEdges; // v -> vector of {u,c(u,v)} 648 | vector>> outgoingEdges; // v -> vector of {w,c(v,w)} 649 | vector node; // node[v] = pointer to node with (new) id v 650 | 651 | // ideals, represented as indicator vectors 652 | unordered_map,int> idealToId; // maps ideal to its ID 653 | vector> ideals; // maps ID to ideal 654 | vector idealsSortedBySize; // IDs of ideals, sorted by size 655 | 656 | // NOTE: throughout the code, we are actually dealing with DOWNSETS, not ideals. should Ctrl+Replace 657 | 658 | // pairs of ideals (that induce contiguous sets) 659 | vector> immediateSubIdeals; 660 | // immediateSubIdeals[id] = IDs of ideals that are immediate subsets of the ideal with ID id 661 | vector> subIdeals; 662 | // subideals[id] = IDs of ideals that are subsets of the ideal with ID id 663 | // (this takes O(numberOfIdealPairs) space; could be done on the fly in the DP, 664 | // but if one can't afford this memory then probably one can't afford the DP alg timewise) 665 | long long numberOfIdealPairs; 666 | 667 | Graph (const Instance &_ins); 668 | void generateIdeals (); 669 | void growIdeal (const vector &ideal, int myId); 670 | void prepareSubIdeals (); 671 | 672 | vector getContiguousSet (int id, int subId) const; 673 | vector>> getAllTPSsForIdealPair (int id, int subId); 674 | 675 | double getDataParallelResyncCost (double parameterSize, int d) const; 676 | 677 | // reconstruction stuff: 678 | ResultStage getTPSWitnessFor (int id, int subId, int t, int d, int s, double targetTPS); 679 | // some "global" variables used in the reconstruction phase 680 | bool reconstructionModeForGetAllTPSsForIdealPair; 681 | bool reconstructionModeForSolveKnapsack; 682 | int reconstructionT, reconstructionD, reconstructionS, reconstructionY; 683 | double reconstructionTPS; 684 | vector reconstructionTMPCindices; // output of solveKnapsack in reconstruction phase 685 | ResultStage reconstructionRS; // output of getTPSWitnessFor in reconstruction phase 686 | 687 | vector solveKnapsack (const vector>& TMPCTPS, 688 | const vector>& TMPCMemoryUsageA, 689 | const vector>& TMPCMemoryUsageB, 690 | int maxY); 691 | 692 | vector>> dp; // dp table 693 | int idOfFullSet; 694 | Result runDP(); 695 | 696 | void renumberResultBack (Result &r) const; 697 | 698 | double getTPSForResult (const Result &r) const; 699 | }; 700 | 701 | 702 | Graph::Graph (const Instance &_ins) : 703 | ins(_ins), 704 | boundOnS(min(_ins.mbsInBatch,_ins.maxDevices)), 705 | numberOfIdealPairs(0), 706 | reconstructionModeForGetAllTPSsForIdealPair(false), 707 | reconstructionModeForSolveKnapsack(false) { 708 | 709 | // precompute incomingEdges and outgoingEdges 710 | incomingEdges.resize(ins.nodes.size()); 711 | outgoingEdges.resize(ins.nodes.size()); 712 | for (const Edge &e : ins.edges) { 713 | incomingEdges[e.destId].emplace_back(e.sourceId, e.communicationCost); 714 | outgoingEdges[e.sourceId].emplace_back(e.destId, e.communicationCost); 715 | } 716 | // precompute node 717 | node.resize(ins.nodes.size()); 718 | for (const Node &n : ins.nodes) { 719 | node[n.id] = &n; 720 | } 721 | 722 | generateIdeals(); 723 | // immediateSubIdeals is prepared. 
now prepare subIdeals 724 | prepareSubIdeals(); 725 | } 726 | 727 | 728 | void Graph::generateIdeals () { 729 | if (!ideals.empty()) { 730 | fail("generating ideals twice?"); 731 | } 732 | 733 | // start with empty set 734 | const vector emptySet(ins.nodes.size(), false); 735 | idealToId[emptySet] = 0; 736 | ideals.push_back(emptySet); 737 | immediateSubIdeals.emplace_back(); 738 | growIdeal(emptySet, 0); 739 | 740 | dbg << ideals.size() << " ideals" << endl; 741 | if (ideals.size() > IDEALS_LIMIT) { 742 | fail("too many ideals (current limit set at " + to_string(IDEALS_LIMIT) + "); this isn't going to work..."); 743 | } 744 | 745 | // prepare idealsSortedBySize 746 | vector> sorter; // {} 747 | for (int i = 0; i < ideals.size(); ++i) { 748 | sorter.emplace_back(count(ideals[i].begin(), ideals[i].end(), true), i); 749 | } 750 | sort(sorter.begin(), sorter.end()); 751 | for (const auto &it : sorter) { 752 | idealsSortedBySize.push_back(it.second); 753 | } 754 | assert(idealsSortedBySize[0] == 0); 755 | } 756 | 757 | 758 | void Graph::growIdeal (const vector &ideal, int myId) { 759 | // myId == idealToId[ideal] 760 | // try to add every vertex 761 | for (int v = 0; v < ins.nodes.size(); ++v) { 762 | if (!ideal[v]) { 763 | // try ideal+v as a new ideal 764 | // check if valid: do all v's successors belong to ideal? 765 | bool valid = true; 766 | for (const pair& p : outgoingEdges[v]) { 767 | // v -> p.first 768 | if (!ideal[p.first]) { 769 | valid = false; 770 | break; 771 | } 772 | } 773 | if (valid) { 774 | vector newIdeal = ideal; 775 | newIdeal[v] = true; 776 | // check if newIdeal had already been generated 777 | if (!idealToId.count(newIdeal)) { 778 | int newId = ideals.size(); 779 | idealToId[newIdeal] = newId; 780 | ideals.push_back(newIdeal); 781 | if (ideals.size() >= IDEALS_EXPLORATION_LIMIT) { 782 | fail("already over " + to_string(IDEALS_EXPLORATION_LIMIT) + " ideals. 
this isn't going to work..."); 783 | } 784 | immediateSubIdeals.emplace_back(); 785 | growIdeal(newIdeal, newId); 786 | } 787 | immediateSubIdeals[idealToId[newIdeal]].push_back(myId); 788 | } 789 | } 790 | } 791 | } 792 | 793 | 794 | void Graph::prepareSubIdeals () { 795 | // subideals = transitive closure of immediateSubIdeals 796 | 797 | numberOfIdealPairs = 0; 798 | subIdeals.resize(ideals.size()); 799 | 800 | for (int id = 0; id < ideals.size(); ++id) { 801 | // we will generate subIdeals[id] using some BFS/DFS 802 | vector queue = {id}; 803 | unordered_set enqueuedIdeals = {id}; 804 | while (!queue.empty()) { 805 | int subId = queue.back(); 806 | queue.pop_back(); 807 | 808 | // now visiting subId 809 | if (subId != id) { 810 | subIdeals[id].push_back(subId); 811 | ++numberOfIdealPairs; 812 | } 813 | 814 | // expand further from subId 815 | for (int subSubId : immediateSubIdeals[subId]) { 816 | if (enqueuedIdeals.insert(subSubId).second == true) { 817 | // subSubId was not in enqueuedIdeals before 818 | queue.push_back(subSubId); 819 | } 820 | } 821 | } 822 | } 823 | 824 | dbg << numberOfIdealPairs << " ideal pairs" << endl; 825 | } 826 | 827 | 828 | // returns the difference ideals[id] \ ideals[subId] as vector 829 | vector Graph::getContiguousSet (int id, int subId) const { 830 | vector ideal = ideals[id], subIdeal = ideals[subId]; 831 | for (int v = 0; v < ins.nodes.size(); ++v) { 832 | if (subIdeal[v]) { 833 | ideal[v] = false; 834 | } 835 | } 836 | return ideal; 837 | } 838 | 839 | 840 | // returns the cost per sample (microbatch), in bytes 841 | // (so this should be still divided by the bandwidth) 842 | double Graph::getDataParallelResyncCost (double parameterSize, int d) const { 843 | return 4 * (d-1) * parameterSize / d / ins.mbsInBatch; 844 | // the factor 4 is following the implementation of the PipeDream planner 845 | // (modeling a distributed parameter server implementation of AllReduce) 846 | } 847 | 848 | 849 | Result Graph::runDP () { 850 | // initialize DP table: dp[ideal][k][s] 851 | // (partition ideal over <= k machines, sum-of-dp-degrees <= s) 852 | // note: both are AT MOST rather than EQUAL as in the paper 853 | dp.assign(ideals.size(), vector>( 854 | ins.maxDevices+1, vector( 855 | boundOnS+1, INFTY))); 856 | 857 | // case of the empty set (ideal with ID 0) 858 | // we initialize all dp[0][*][*] = 0 so that it is monotone wrt k and s 859 | for (int k = 0; k <= ins.maxDevices; ++k) { 860 | for (int s = 0; s <= boundOnS; ++s) { 861 | dp[0][k][s] = 0; 862 | } 863 | } 864 | 865 | // dpDrops[id][k] will be the list of s such that dp[id][k][s] < dp[id][k][s-1] 866 | // (used for FASTER_DP_IMPLEMENTATION) 867 | vector>> dpDrops(ideals.size(), vector>( 868 | ins.maxDevices+1, vector(1, 0))); // dpDrops[id][k] = {0} 869 | 870 | // profiling stuff 871 | double timeSpentInGetAllTPSsForIdealPair = 0.0, timeSpentInDPLoop = 0.0; 872 | 873 | // here we go! 874 | dbg << "running DP..." << endl; 875 | for (int id : idealsSortedBySize) { 876 | if (id == 0) continue; 877 | // we want to fill dp[id][*][*] (already initialized to INFTY). 
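    // the update performed below is, in effect:
    //   dp[id][k][s] = min over (subId, t, d) of max( dp[subId][k - t*d][s - d], TPS[t][d][s] ),
    // i.e. the bottleneck TPS is the slower of the newly formed stage id\subId
    // (run with t-way tensor and d-way data parallelism on t*d of the k devices)
    // and the best partition of the remaining subideal on the remaining k - t*d devices.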
878 | // we will loop over every subideal subId (their list is already 879 | // precomputed in subIdeals[id] for convenience) and account for its 880 | // contributions to dp[id][*][*] 881 | for (int subId : subIdeals[id]) { 882 | 883 | clock_t startTime = clock(); 884 | const vector>> TPS = getAllTPSsForIdealPair(id, subId); 885 | timeSpentInGetAllTPSsForIdealPair += (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 886 | // TPS[t][d][s] = min TPS when t-tensor and d-data-parallel partitioning 887 | // id\subId (across t*d devices), with sum-of-dp-degrees <= s 888 | 889 | startTime = clock(); 890 | for (int t = 1; t <= ins.maxDevices; ++t) if (!TPS[t].empty()) { 891 | for (int d = 1; d*t <= ins.maxDevices; ++d) if (ins.mbsInBatch % d == 0) { 892 | for (int k = d*t; k <= ins.maxDevices; ++k) { 893 | // id/subId on t*d devices, subId on k-t*d devices 894 | 895 | // now we need to perform the updates of dp[id][k][s] for all s 896 | 897 | if (!FASTER_DP_IMPLEMENTATION) { 898 | 899 | // straightforward implementation 900 | 901 | #pragma GCC unroll 16 902 | for (int s = d; s <= boundOnS; ++s) { 903 | minify(dp[id][k][s], max(dp[subId][k-t*d][s-d], TPS[t][d][s])); 904 | } 905 | 906 | } else { // FASTER_DP_IMPLEMENTATION 907 | 908 | // we do not update dp[id][k][s] for all s here, but only for 909 | // a selected subset that is sufficient 910 | // (namely, those s where dp[subId][k-t*d][s-d] has decreased (from [s-1-d])) 911 | // then, these updates will be propagated right after the loop over subId 912 | // in order to ensure monotonicity 913 | 914 | // dp[subId][k-t*d][s-d] is non-increasing wrt s 915 | // TPS[t][d][s] is non-decreasing wrt s 916 | // 1. find (using binary search) max index u >= 0 such that: 917 | // - f = dpDrops[subId][k-t*d][u] exists 918 | // - f+d <= boundOnS 919 | // - dp[subId][k-t*d][f] > TPS[t][d][f+d] 920 | // 2. do minify from 0 to u incl. (use just dp, not TPS) 921 | // 3. 
also do minify for u+1 if it is valid (two first points above) (use both dp and TPS here) 922 | int u = -1, uf = 0, ut = int(dpDrops[subId][k-t*d].size()) - 1; 923 | while (uf <= ut) { 924 | int mid = (uf + ut) / 2; 925 | int f = dpDrops[subId][k-t*d][mid]; 926 | if (f+d <= boundOnS && dp[subId][k-t*d][f] > TPS[t][d][f+d]) { 927 | u = mid; 928 | uf = mid+1; 929 | } else { 930 | ut = mid-1; 931 | } 932 | } 933 | #pragma GCC unroll 16 934 | for (int v = 0; v <= u; ++v) { 935 | const int f = dpDrops[subId][k-t*d][v]; 936 | minify(dp[id][k][f+d], dp[subId][k-t*d][f]); 937 | } 938 | if (u+1 < dpDrops[subId][k-t*d].size()) { 939 | int f = dpDrops[subId][k-t*d][u+1]; 940 | if (f+d <= boundOnS) { 941 | assert(dp[subId][k-t*d][f] <= TPS[t][d][f+d]); 942 | minify(dp[id][k][f+d], TPS[t][d][f+d]); 943 | } 944 | } 945 | 946 | } 947 | } 948 | } 949 | } 950 | timeSpentInDPLoop += (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 951 | } 952 | 953 | // ensure monotonicity 954 | // (only really needed if FASTER_DP_IMPLEMENTATION) 955 | for (int k = 0; k <= ins.maxDevices; ++k) { 956 | //assert(dpDrops[id][k] == vector(1,0)); 957 | for (int s = 0; s <= boundOnS; ++s) { 958 | if (k > 0) { 959 | minify(dp[id][k][s], dp[id][k-1][s]); 960 | } 961 | if (s > 0) { 962 | minify(dp[id][k][s], dp[id][k][s-1]); 963 | if (dp[id][k][s] + 1e-9 < dp[id][k][s-1]) { 964 | // significant drop 965 | dpDrops[id][k].push_back(s); 966 | } 967 | } 968 | } 969 | } 970 | } 971 | 972 | DBG(timeSpentInGetAllTPSsForIdealPair) 973 | DBG(timeSpentInDPLoop) 974 | 975 | // the solution (final TPS) is now known: it's max_k max_s dp[all nodes][k][s] 976 | 977 | // ID of ideal that contains all nodes 978 | idOfFullSet = idealToId.at(vector(ins.nodes.size(), true)); 979 | 980 | double finalTPS = INFTY; 981 | int devicesUsed = -1, sUsed = -1; 982 | for (int k = 0; k <= ins.maxDevices; ++k) { 983 | for (int s = 1; s <= boundOnS; ++s) { 984 | if (dp[idOfFullSet][k][s] + 1e-9 < finalTPS) { 985 | finalTPS = dp[idOfFullSet][k][s]; 986 | devicesUsed = k; 987 | sUsed = s; 988 | } 989 | } 990 | } 991 | if (finalTPS > INFTY/2) { 992 | // cannot partition the graph feasibly (in terms of memory usage) 993 | return Result(); // empty result 994 | } 995 | dbg << "max load = " << finalTPS << " using " << devicesUsed 996 | << " out of " << ins.maxDevices << " devices, and using sum-of-dp-degrees " 997 | << sUsed << " (batch size = " 998 | << ins.mbsInBatch << ")" << endl; 999 | // note: the reported number of devices and batch size might possibly be overshot 1000 | // (since we initialized all dp[0][*][*] = 0 at the beginning, 1001 | // and also since, if FASTER_DP_IMPLEMENTATION, 1002 | // we do minify(dp[id][k][s], dp[id][k-1][s]) and minify(dp[id][k][s], dp[id][k][s-1])) 1003 | // however, the strict inequality in the comparison above should prevent this, 1004 | // as this way we take the minimal (k,s) that attains this TPS 1005 | 1006 | // for debug/experiments only 1007 | vector transformerIds = ins.getTransformerIds(); 1008 | 1009 | // now we reconstruct the solution 1010 | Result result; 1011 | int curId = idOfFullSet, curK = devicesUsed, curS = sUsed; 1012 | while (curId != 0) { // curId is not empty set 1013 | assert(curK > 0); 1014 | assert(curS > 0); 1015 | // how does dp[curId][curK][curS] arise? 
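        // it must equal max(dp[subId][curK - t*d][curS - d], TPS[t][d][curS]) for some subideal subId
        // and degrees (t, d); the loop below searches for such a triple, records the corresponding stage,
        // and then continues the reconstruction from (subId, curK - t*d, curS - d)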
1016 | bool found = false; 1017 | for (int subId : subIdeals[curId]) { 1018 | const vector>> TPS = getAllTPSsForIdealPair(curId, subId); 1019 | // possible optimization: could only ask for s = curS 1020 | 1021 | for (int t = 1; t <= curK; ++t) if (!TPS[t].empty()) { 1022 | for (int d = 1; t*d <= curK && d <= curS; ++d) { 1023 | if (ins.mbsInBatch % d != 0) continue; // not really necessary 1024 | // curId\subId on t*d devices, subId on curK-t*d devices 1025 | if (1e-9 > abs(dp[curId][curK][curS] - max(dp[subId][curK-t*d][curS-d], TPS[t][d][curS]))) { 1026 | // found the next stage 1027 | found = true; 1028 | 1029 | ResultStage rs = getTPSWitnessFor(curId, subId, t, d, curS, TPS[t][d][curS]); 1030 | result.stages.push_back(rs); 1031 | 1032 | assert(rs.dataParallelDegree == d); 1033 | assert(rs.tensorParallelDegree == t); 1034 | 1035 | dbg << "formed a stage with nodes [" << rs.nodes << "] using d=" 1036 | << rs.dataParallelDegree << " and t=" << rs.tensorParallelDegree 1037 | << " yielding TPS = " << TPS[t][d][curS] << endl; 1038 | 1039 | int countTransformers = 0, countTransformersWithAR = 0; 1040 | for (const pair &p : rs.TMPCids) { 1041 | if (count(transformerIds.begin(), transformerIds.end(), p.first)) { 1042 | ++countTransformers; 1043 | if (p.second == "activation recomp") { 1044 | ++countTransformersWithAR; 1045 | } 1046 | } 1047 | } 1048 | dbg << "transformer layers (with act. recomp. / total) = " 1049 | << countTransformersWithAR << "/" 1050 | << countTransformers << endl << endl; 1051 | 1052 | curS = curS - d; 1053 | curK = curK - t*d; 1054 | curId = subId; 1055 | 1056 | break; 1057 | } 1058 | } 1059 | if (found) break; 1060 | } 1061 | if (found) break; 1062 | } 1063 | if (!found) { 1064 | fail("didn't find any reconstruction step to make?"); 1065 | } 1066 | } 1067 | if (curK > 0 || curS > 0) { 1068 | fail("k or s didn't fall to 0 by the end of reconstruction?"); 1069 | } 1070 | 1071 | const double verificationTPS = getTPSForResult(result); 1072 | if (abs(finalTPS - verificationTPS) > 1e-5) { 1073 | DBG(finalTPS) 1074 | DBG(verificationTPS) 1075 | fail("verification TPS is different"); 1076 | } 1077 | 1078 | // note: the result is in terms of the new numbers; if wanting to print it etc., 1079 | // then at the end, translate result from new numbers to old by calling: 1080 | // renumberResultBack(result); 1081 | 1082 | return result; 1083 | } 1084 | 1085 | 1086 | // TPS[t][d][s] = min TPS when t-tensor and d-data-parallel partitioning 1087 | // id\subId (across t*d devices), with sum-dp-degrees <= s 1088 | // TPS[t] can be empty for some t! 
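// (for each node of id\subId and each of its TMPCs for degree t, the TPS contribution is
//  (cross-boundary edge communication) / d / bandwidth + data-parallel resync / bandwidth + timePerSample / d;
//  solveKnapsack() then picks one TMPC per node minimizing the total, subject to
//  sum of (memoryUsageA * y + memoryUsageB) <= maxMemoryPerDevice, where y = ceil(s/d))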
1089 | // (note: this should be deterministic as it will be rerun during reconstruction) 1090 | vector>> Graph::getAllTPSsForIdealPair (int id, int subId) { 1091 | const vector subgraphVB = getContiguousSet(id, subId); 1092 | const vector subgraph = vectorOfSetBits(subgraphVB); 1093 | 1094 | vector>> result(ins.maxDevices+1); 1095 | 1096 | // t = degree of tensor parallelism 1097 | for (int t = 1; t <= ins.maxDevices; ++t) { 1098 | if (t > 1 && !TENSOR_PARALLELISM_ALLOWED) { 1099 | break; 1100 | } 1101 | 1102 | bool someNodeHasNoTMPCsForT = false; 1103 | for (int v : subgraph) { 1104 | if (!node[v]->TMPCs.count(t) || node[v]->TMPCs.at(t).empty()) { 1105 | someNodeHasNoTMPCsForT = true; 1106 | break; 1107 | } 1108 | } 1109 | if (someNodeHasNoTMPCsForT) { 1110 | // tensor parallelism of degree t is not supported 1111 | // for some node of this subgraph 1112 | continue; 1113 | } 1114 | 1115 | // initialize the result vector 1116 | result[t].assign(ins.maxDevices/t + 1, vector(boundOnS+1, INFTY)); 1117 | 1118 | vector*> TMPCs; // TMPCs[i] = TMPCs for node subgraph[i] on t devices 1119 | vector> TMPCEdgeCommCosts; 1120 | // TMPCEdgeCommCosts[i][l] = sum of all edge-associated communication costs 1121 | // for the l-th TMPC of node subgraph[i] (on t devices) 1122 | 1123 | // prepare TMPCs and TMPCEdgeCommCosts 1124 | for (int v : subgraph) { 1125 | TMPCs.push_back(&(node[v]->TMPCs.at(t))); 1126 | TMPCEdgeCommCosts.emplace_back(); 1127 | for (const TMPC &tmpc : *(TMPCs.back())) { 1128 | double tmpcEdgeCommCost = 0.0; 1129 | // need to take into account c(u,v) and sfw(u,v) 1130 | // (first is a property of the edge, second is a property of the TMPC) 1131 | // for all edges (u,v) incoming into v from outside S 1132 | for (const pair &p : incomingEdges[v]) { 1133 | // edge p.first -> v, of cost p.second 1134 | if (!subgraphVB[p.first]) { 1135 | tmpcEdgeCommCost += 2 * (p.second + tmpc.syncTimeFw.at(p.first)); 1136 | } 1137 | } 1138 | 1139 | // instead of "tmpc.syncTimeFw.at(p.first)" we could do this: 1140 | // for (const pair &p : tmpc.syncTimeFw) { 1141 | // if (!subgraphVB[p.first]) { 1142 | // tmpcEdgeCommCost += p.second; 1143 | // } 1144 | // } 1145 | // and then we could even turn syncTimeFw/Bw into vectors 1146 | 1147 | // and similarly for outgoing edges (c and sbw) 1148 | for (const pair &p : outgoingEdges[v]) { 1149 | // edge v -> p.first, of cost p.second 1150 | if (!subgraphVB[p.first]) { 1151 | tmpcEdgeCommCost += 2 * (p.second + tmpc.syncTimeBw.at(p.first)); 1152 | } 1153 | } 1154 | TMPCEdgeCommCosts.back().push_back(tmpcEdgeCommCost); 1155 | } 1156 | } 1157 | // TMPCs and TMPCEdgeCommCosts prepared 1158 | 1159 | // d = degree of data parallelism 1160 | for (int d = 1; d*t <= ins.maxDevices; ++d) if (ins.mbsInBatch % d == 0) { 1161 | 1162 | if (d > 1 && !DATA_PARALLELISM_ALLOWED) { 1163 | break; 1164 | } 1165 | 1166 | // at this point, we have multiple TMPCs for each node 1167 | // and we have to select one TMPC per node 1168 | // (for each s; can be different TMPC-sets for different s; 1169 | // but only memory usage depends on s) 1170 | 1171 | vector> TMPCTotalTPSContribution, TMPCMemoryUsageA, TMPCMemoryUsageB; 1172 | // TMPCTotalTPSContribution[i][l] = sum of all contributions to TPS (compute, comm) 1173 | // for the l-th TMPC of node subgraph[i] (on t-tensor and d-data parallelism) 1174 | // similarly TMPCMemoryUsageA, B 1175 | // prepare these: 1176 | for (int i = 0; i < subgraph.size(); ++i) { 1177 | const int v = subgraph[i]; 1178 | TMPCTotalTPSContribution.emplace_back(); 
1179 | TMPCMemoryUsageA.emplace_back(); 1180 | TMPCMemoryUsageB.emplace_back(); 1181 | for (int l = 0; l < TMPCs[i]->size(); ++l) { 1182 | const TMPC &tmpc = (*TMPCs[i])[l]; 1183 | TMPCMemoryUsageA.back().push_back(tmpc.memoryUsageA); 1184 | TMPCMemoryUsageB.back().push_back(tmpc.memoryUsageB); 1185 | // add up all the contributions of node v to the TPS: 1186 | const double communicationInBytes = 1187 | // 1. edge-related communication 1188 | TMPCEdgeCommCosts[i][l] / d 1189 | + 1190 | // 2. data parallelism resync communication 1191 | getDataParallelResyncCost(tmpc.parameterSize, d); 1192 | // 3. compute 1193 | const double compute = tmpc.timePerSample / d; 1194 | const double totalTPSContribution = communicationInBytes / ins.bandwidth + compute; 1195 | TMPCTotalTPSContribution.back().push_back(totalTPSContribution); 1196 | 1197 | if ((!ACTIVATION_RECOMPUTATION_ALLOWED && tmpc.id == "activation recomp") 1198 | || (ACTIVATION_RECOMPUTATION_FORCED && tmpc.id == "vanilla")) { 1199 | // only used for our specific experiments 1200 | // make this TMPC super unattractive 1201 | TMPCMemoryUsageB.back().back() = 1e60; 1202 | TMPCTotalTPSContribution.back().back() = 1e28; 1203 | } 1204 | } 1205 | } 1206 | 1207 | // we have now built a "knapsack" instance (well, one for each y = ceil(s/d)) 1208 | vector knapsackResults = solveKnapsack(TMPCTotalTPSContribution, 1209 | TMPCMemoryUsageA, 1210 | TMPCMemoryUsageB, 1211 | boundOnS / d + 1); 1212 | // knapsackResults[y] = best TPS (over all choices of TMPC-per-node) 1213 | // when t-tensor- and d-data-parallel partitioning 1214 | // id\subId (across t*d devices), with sum-of-dp-degrees <= d*y 1215 | for (int s = 0; s <= boundOnS; ++s) { 1216 | result[t][d][s] = knapsackResults[ceildiv(s, d)]; 1217 | } 1218 | 1219 | // stuff for the reconstruction phase 1220 | if (reconstructionModeForGetAllTPSsForIdealPair) { 1221 | reconstructionY = ceildiv(reconstructionS, reconstructionD); 1222 | if (t == reconstructionT && d == reconstructionD) { 1223 | // if reconstructionTPS == knapsackResults[reconstructionY]: 1224 | if (1e-9 > abs(reconstructionTPS - knapsackResults[reconstructionY])) { 1225 | // got it 1226 | // now we need to build our "output", which is reconstructionRS 1227 | // (make sure to clear or replace each field, since reconstructionRS can be dirty) 1228 | 1229 | reconstructionRS.nodes = subgraph; 1230 | reconstructionRS.dataParallelDegree = d; 1231 | reconstructionRS.tensorParallelDegree = t; 1232 | 1233 | // rerun knapsackResults in reconstruction mode 1234 | reconstructionModeForSolveKnapsack = true; 1235 | // this will fill reconstructionTMPCindices: 1236 | solveKnapsack(TMPCTotalTPSContribution, 1237 | TMPCMemoryUsageA, 1238 | TMPCMemoryUsageB, 1239 | boundOnS / d + 1); 1240 | // now we need to translate this to reconstructionRS.TMPCids 1241 | reconstructionRS.TMPCids.clear(); 1242 | for (int i = 0; i < subgraph.size(); ++i) { 1243 | const int v = subgraph[i]; 1244 | reconstructionRS.TMPCids[v] = (*TMPCs[i])[reconstructionTMPCindices[i]].id; 1245 | } 1246 | 1247 | reconstructionModeForSolveKnapsack = false; 1248 | // might as well return now 1249 | reconstructionModeForGetAllTPSsForIdealPair = false; 1250 | } 1251 | } 1252 | } 1253 | } 1254 | } 1255 | 1256 | return result; 1257 | } 1258 | 1259 | 1260 | // this function aims to solve the following knapsack problem variant for each y=1,...,maxY. 1261 | // we have n = TMPCTPS.size() nodes. 1262 | // for each node, we have some number of TMPCs; each has TMPCTPS, TMPCMemoryUsageA, TMPCMemoryUsageB. 
1263 | // select one TMPC per node so that the sum of TMPCMemoryUsageA*y + TMPCMemoryUsageB <= ins.maxMemoryPerDevice 1264 | // and so that the sum of TMPCTPS is minimized. 1265 | // return that minimum. 1266 | // (note: this should be deterministic as it will be rerun during reconstruction) 1267 | vector Graph::solveKnapsack (const vector>& TMPCTPS, 1268 | const vector>& TMPCMemoryUsageA, 1269 | const vector>& TMPCMemoryUsageB, 1270 | int maxY) { 1271 | const int n = TMPCTPS.size(); 1272 | vector result(maxY+1, INFTY); 1273 | 1274 | // we will just do things independently for each y. 1275 | // perhaps we can do something smarter in the future 1276 | 1277 | vector fastestIndex(n, 0); 1278 | for (int i = 0; i < n; ++i) { 1279 | for (int l = 1; l < TMPCTPS[i].size(); ++l) { 1280 | if (TMPCTPS[i][fastestIndex[i]] > TMPCTPS[i][l]) { 1281 | fastestIndex[i] = l; 1282 | } 1283 | } 1284 | } 1285 | 1286 | vector> TMPCMemoryUsage = TMPCMemoryUsageB; // just needed to give it the right shape 1287 | 1288 | for (int y = 1; y <= maxY; ++y) { 1289 | // update TMPCMemoryUsage 1290 | for (int i = 0; i < n; ++i) { 1291 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1292 | TMPCMemoryUsage[i][l] = TMPCMemoryUsageA[i][l] * y + TMPCMemoryUsageB[i][l]; 1293 | } 1294 | } 1295 | 1296 | // check whether choosing the fastest TMPC everywhere might give a memory-feasible result 1297 | // (then we'd just return it) 1298 | double memoryOfFastestSolution = 0.0; 1299 | for (int i = 0; i < n; ++i) { 1300 | memoryOfFastestSolution += TMPCMemoryUsage[i][fastestIndex[i]]; 1301 | } 1302 | if (memoryOfFastestSolution <= ins.maxMemoryPerDevice) { 1303 | // cool! just use that 1304 | result[y] = 0.0; 1305 | for (int i = 0; i < n; ++i) { 1306 | result[y] += TMPCTPS[i][fastestIndex[i]]; 1307 | } 1308 | 1309 | // reconstruction stuff (unfortunately the code repeats) 1310 | if (reconstructionModeForSolveKnapsack) { 1311 | if (reconstructionY == y) { 1312 | // fill reconstructionTMPCindices 1313 | reconstructionTMPCindices = fastestIndex; 1314 | dbg << "we picked the all-fastest knapsack solution, mem usage = " 1315 | << memoryOfFastestSolution << endl; 1316 | } 1317 | } 1318 | 1319 | continue; 1320 | } 1321 | 1322 | // or if choosing the lowest-memory TMPC everywhere is not even memory-feasible 1323 | // (then there is no feasible solution) 1324 | vector lowestMemoryIndex(n, 0); 1325 | for (int i = 0; i < n; ++i) { 1326 | for (int l = 1; l < TMPCTPS[i].size(); ++l) { 1327 | if (TMPCMemoryUsage[i][lowestMemoryIndex[i]] > TMPCMemoryUsage[i][l]) { 1328 | lowestMemoryIndex[i] = l; 1329 | } 1330 | } 1331 | } 1332 | double memoryOfLowestMemorySolution = 0.0; 1333 | for (int i = 0; i < n; ++i) { 1334 | memoryOfLowestMemorySolution += TMPCMemoryUsage[i][lowestMemoryIndex[i]]; 1335 | } 1336 | if (memoryOfLowestMemorySolution > ins.maxMemoryPerDevice) { 1337 | // no feasible solution. and there won't be one for larger y, either. so break 1338 | break; 1339 | } 1340 | 1341 | // the instance is non-trivial. now we run a heuristic 1342 | vector H = fastestIndex; // H[i] = index of the currently chosen TMPC for node i 1343 | // we start from the all-fastest solution 1344 | double memoryToShaveOff = memoryOfFastestSolution - ins.maxMemoryPerDevice; 1345 | // we need to reduce the memory usage by this much 1346 | 1347 | if (ACTIVATION_RECOMPUTATION_ALL_LAYERS_OR_NONE) { 1348 | // special thing for our experiments 1349 | 1350 | // at this point we know that the all-vanilla solution is not feasible 1351 | // but the all-AC one is. 
so just return the latter 1352 | H = lowestMemoryIndex; 1353 | memoryToShaveOff = memoryOfLowestMemorySolution - ins.maxMemoryPerDevice; 1354 | } else { 1355 | // normal execution 1356 | 1357 | if (!KNAPSACK_FAST_HEURISTIC) { 1358 | 1359 | while (memoryToShaveOff > 1e-9) { // # iterations <= total number of TMPCs 1360 | // we do so by repeatedly picking the best-bang-for-buck swap (change of some H[i]) 1361 | double leastBuckPerBang = INFTY; 1362 | int bestI = -1, bestL = -1; 1363 | // check all possible swaps 1364 | for (int i = 0; i < n; ++i) { 1365 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1366 | // considering the change "H[i] := l" 1367 | const double memorySavings = TMPCMemoryUsage[i][H[i]] - TMPCMemoryUsage[i][l]; 1368 | if (memorySavings > 1e-9) { 1369 | if (TMPCTPS[i][l] < TMPCTPS[i][H[i]]) { 1370 | fail("we are gaining both on memory and TPS?!"); 1371 | } 1372 | // important (somewhat): here we truncate large gains to what we actually need 1373 | const double bang = min(memorySavings, memoryToShaveOff); 1374 | const double buckPerBang = (TMPCTPS[i][l] - TMPCTPS[i][H[i]]) / bang; 1375 | if (leastBuckPerBang > buckPerBang) { 1376 | leastBuckPerBang = buckPerBang; 1377 | bestI = i; 1378 | bestL = l; 1379 | } 1380 | } 1381 | } 1382 | } 1383 | if (bestI == -1) { 1384 | fail("no good change was found?!"); 1385 | } 1386 | // now apply the best found change 1387 | const double memorySavings = TMPCMemoryUsage[bestI][H[bestI]] - TMPCMemoryUsage[bestI][bestL]; 1388 | memoryToShaveOff -= memorySavings; 1389 | H[bestI] = bestL; 1390 | } 1391 | 1392 | } else { // KNAPSACK_FAST_HEURISTIC 1393 | 1394 | // so far, for simplicity and speed, we have here 1395 | // a version that only works for <= 2 TMPCs per (v,t) 1396 | // (this is verified in checkInputCorrectness()) 1397 | 1398 | vector> sorter; 1399 | for (int i = 0; i < n; ++i) { 1400 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1401 | // considering the change "H[i] := l" 1402 | const double memorySavings = TMPCMemoryUsage[i][H[i]] - TMPCMemoryUsage[i][l]; 1403 | if (memorySavings > 1e-9) { 1404 | if (TMPCTPS[i][l] < TMPCTPS[i][H[i]]) { 1405 | fail("we are gaining both on memory and TPS?!"); 1406 | } 1407 | // here we don't truncate and might overshoot 1408 | // (see the similar point in the slower heuristic) 1409 | const double buckPerBang = (TMPCTPS[i][l] - TMPCTPS[i][H[i]]) / memorySavings; 1410 | sorter.emplace_back(buckPerBang, i, l); 1411 | } 1412 | } 1413 | } 1414 | sort(sorter.begin(), sorter.end()); 1415 | for (const auto &it : sorter) { 1416 | const int bestI = get<1>(it), bestL = get<2>(it); 1417 | // apply this change 1418 | const double memorySavings = TMPCMemoryUsage[bestI][H[bestI]] - TMPCMemoryUsage[bestI][bestL]; 1419 | memoryToShaveOff -= memorySavings; 1420 | H[bestI] = bestL; 1421 | if (memoryToShaveOff < 1e-9) { 1422 | break; // done 1423 | } 1424 | } 1425 | 1426 | } 1427 | } 1428 | // done - we have a feasible solution H 1429 | result[y] = 0.0; 1430 | for (int i = 0; i < n; ++i) { 1431 | result[y] += TMPCTPS[i][H[i]]; 1432 | } 1433 | 1434 | // reconstruction stuff 1435 | if (reconstructionModeForSolveKnapsack) { 1436 | if (reconstructionY == y) { 1437 | // fill reconstructionTMPCindices 1438 | reconstructionTMPCindices = H; 1439 | dbg << "nontrivial knapsack solution, mem usage = " << ins.maxMemoryPerDevice + memoryToShaveOff << endl; 1440 | } 1441 | } 1442 | } 1443 | 1444 | // debug stuff for verification that our knapsack heuristic 1445 | // almost always finds the optimal solution (see Appendix) 1446 | if 
(OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION) { 1447 | static long long executionCount = 0; 1448 | static ofstream knapsackFile("knapsacks.txt"); 1449 | ++executionCount; 1450 | constexpr int y = 2; 1451 | assert(y <= maxY); 1452 | if (rand() % 3 == 0) { 1453 | dbg << "executionCount = " << executionCount << endl; 1454 | knapsackFile << setprecision(15) << fixed << ins.maxMemoryPerDevice << endl; 1455 | knapsackFile << TMPCTPS.size() << endl; 1456 | for (int i = 0; i < TMPCTPS.size(); ++i) { 1457 | knapsackFile << TMPCTPS[i].size() << endl; 1458 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1459 | const double memUsageFor5 = y * TMPCMemoryUsageA[i][l] + TMPCMemoryUsageB[i][l]; 1460 | knapsackFile << setprecision(15) << fixed << TMPCTPS[i][l] << " " << memUsageFor5 << endl; 1461 | } 1462 | } 1463 | knapsackFile << setprecision(15) << fixed << result[y] << endl << endl; 1464 | } 1465 | } 1466 | 1467 | return result; 1468 | } 1469 | 1470 | 1471 | ResultStage Graph::getTPSWitnessFor (int id, int subId, int t, int d, int s, double targetTPS) { 1472 | reconstructionModeForGetAllTPSsForIdealPair = true; 1473 | reconstructionT = t; 1474 | reconstructionD = d; 1475 | reconstructionS = s; 1476 | reconstructionTPS = targetTPS; 1477 | getAllTPSsForIdealPair(id, subId); // this will fill reconstructionRS 1478 | reconstructionModeForGetAllTPSsForIdealPair = false; 1479 | return reconstructionRS; 1480 | } 1481 | 1482 | 1483 | void Graph::renumberResultBack (Result &r) const { 1484 | for (ResultStage &rs : r.stages) { 1485 | for (int &nodeId : rs.nodes) { 1486 | nodeId = ins.oldNumber[nodeId]; 1487 | } 1488 | unordered_map newTMPCids; 1489 | for (const pair &p : rs.TMPCids) { 1490 | newTMPCids[ins.oldNumber[p.first]] = p.second; 1491 | } 1492 | rs.TMPCids = move(newTMPCids); 1493 | } 1494 | } 1495 | 1496 | 1497 | double Graph::getTPSForResult (const Result &r) const { 1498 | // for sanity checks of returned solutions, 1499 | // and also to judge baselines 1500 | 1501 | if (r.stages.empty()) { 1502 | // infeasible/OOM/empty result 1503 | return INFTY; 1504 | } 1505 | 1506 | // first step: check that the solution is contiguous 1507 | // (and that there is some topological order in the contracted graph) 1508 | // and that every node belongs to exactly one subgraph 1509 | // and that we don't use too many devices 1510 | vector stageOfNode(ins.nodes.size(), -1); 1511 | int devicesUsed = 0, sumOfDpDegrees = 0; 1512 | for (int i = 0; i < r.stages.size(); ++i) { 1513 | for (int v : r.stages[i].nodes) { 1514 | if (stageOfNode[v] != -1) { 1515 | fail("duplicate node"); 1516 | } 1517 | stageOfNode[v] = i; 1518 | } 1519 | if (r.stages[i].dataParallelDegree < 1 || r.stages[i].dataParallelDegree > ins.maxDevices) { 1520 | fail("wrong data-parallel degree"); 1521 | } 1522 | if (ins.mbsInBatch % r.stages[i].dataParallelDegree != 0) { 1523 | fail("data-parallel degree must divide the number of microbatches in a batch"); 1524 | } 1525 | if (r.stages[i].tensorParallelDegree < 1 || r.stages[i].tensorParallelDegree > ins.maxDevices) { 1526 | fail("wrong tensor-parallel degree"); 1527 | } 1528 | devicesUsed += r.stages[i].dataParallelDegree * r.stages[i].tensorParallelDegree; 1529 | sumOfDpDegrees += r.stages[i].dataParallelDegree; 1530 | } 1531 | for (const Edge &e : ins.edges) { 1532 | if (stageOfNode[e.sourceId] > stageOfNode[e.destId]) { 1533 | fail("problem with contiguity (or stages given in wrong order)"); 1534 | } 1535 | } 1536 | for (int v = 0; v < ins.nodes.size(); ++v) { 1537 | if (-1 == stageOfNode[v]) { 1538 | 
fail("node does not appear in any subgraph"); 1539 | } 1540 | } 1541 | if (sumOfDpDegrees > ins.mbsInBatch) { 1542 | fail("sum of data-parallel degrees too large"); 1543 | } 1544 | if (devicesUsed > ins.maxDevices) { 1545 | fail("too many devices used"); 1546 | } 1547 | 1548 | for (int v = 0; v < ins.nodes.size(); ++v) { 1549 | if (v != ins.nodes[v].id) { 1550 | fail("some issue with numbering"); 1551 | } 1552 | } 1553 | 1554 | // now we want to compute the TPS, 1555 | // and also verify it's not OOM 1556 | double finalTPS = 0.0; 1557 | int suffixSumOfDataParallelDegrees = sumOfDpDegrees; 1558 | for (int i = 0; i < r.stages.size(); ++i) { 1559 | const ResultStage &rs = r.stages[i]; 1560 | 1561 | // we want to compute TPS for this stage 1562 | 1563 | const int d = rs.dataParallelDegree, t = rs.tensorParallelDegree; 1564 | 1565 | const int y = ceildiv(suffixSumOfDataParallelDegrees, d); 1566 | suffixSumOfDataParallelDegrees -= d; // update for next iteration 1567 | 1568 | // get TMPCs (also check TMPCids) 1569 | unordered_map TMPCs; 1570 | for (const pair &p : rs.TMPCids) { 1571 | const int v = p.first; 1572 | if (!count(rs.nodes.begin(), rs.nodes.end(), v)) { 1573 | fail("node appears in TMPCids but not in nodes"); 1574 | } 1575 | if (v < 0 || v >= ins.nodes.size()) { 1576 | fail("wrong node id in TMPCids"); 1577 | } 1578 | if (TMPCs.count(v)) { 1579 | fail("duplicate node in TMPCids"); 1580 | } 1581 | bool found = false; 1582 | if (!ins.nodes[v].TMPCs.count(t)) { 1583 | fail("no TMPCs for that node and that t"); 1584 | } 1585 | for (const TMPC &tmpc : ins.nodes[v].TMPCs.at(t)) { 1586 | if (tmpc.id == p.second) { 1587 | assert(!found); 1588 | found = true; 1589 | TMPCs[v] = &tmpc; 1590 | } 1591 | } 1592 | if (!found) { 1593 | fail("no TMPC with that ID was found"); 1594 | } 1595 | } 1596 | if (rs.nodes.size() > rs.TMPCids.size()) { 1597 | fail("more nodes in nodes than in TMPCids"); 1598 | } 1599 | // okay, TMPCs is populated. 
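// In brief, the per-stage cost model evaluated below: per-device memory usage is the sum, over
// the stage's nodes, of memoryUsageA * y + memoryUsageB, where y is the suffix sum of
// data-parallel degrees (this stage's and all later stages') divided by d and rounded up;
// exceeding ins.maxMemoryPerDevice is treated as OOM (INFTY). The stage's time per sample is
//     stageTPS = (edgeCommunication / d + dataParallelCommunication) / bandwidth + compute / d,
// and the overall TPS is the maximum stageTPS over all stages. Purely illustrative arithmetic
// with made-up numbers in consistent units: compute = 40, edgeCommunication = 10,
// dataParallelCommunication = 2, d = 4, bandwidth = 1 gives
// stageTPS = (10/4 + 2)/1 + 40/4 = 4.5 + 10 = 14.5.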
now compute TPS and memory usage 1600 | 1601 | double compute = 0.0, edgeCommunicationInBytes = 0.0, dataParallelCommunicationInBytes = 0.0, 1602 | memoryUsage = 0.0; 1603 | 1604 | for (int v : rs.nodes) { 1605 | compute += TMPCs[v]->timePerSample; 1606 | for (const pair &p : incomingEdges[v]) { 1607 | if (!count(rs.nodes.begin(), rs.nodes.end(), p.first)) { 1608 | edgeCommunicationInBytes += 2 * (p.second + TMPCs[v]->syncTimeFw.at(p.first)); 1609 | } 1610 | } 1611 | for (const pair &p : outgoingEdges[v]) { 1612 | if (!count(rs.nodes.begin(), rs.nodes.end(), p.first)) { 1613 | edgeCommunicationInBytes += 2 * (p.second + TMPCs[v]->syncTimeBw.at(p.first)); 1614 | } 1615 | } 1616 | dataParallelCommunicationInBytes += getDataParallelResyncCost(TMPCs[v]->parameterSize, d); 1617 | memoryUsage += TMPCs[v]->memoryUsageA * y + TMPCs[v]->memoryUsageB; 1618 | } 1619 | 1620 | if (memoryUsage > ins.maxMemoryPerDevice) { 1621 | // given solution is OOM 1622 | return INFTY; 1623 | } 1624 | 1625 | const double communicationInBytes = edgeCommunicationInBytes / d + dataParallelCommunicationInBytes; 1626 | 1627 | if (DEBUG_DATA_PARALLEL_COSTS) { 1628 | const double dataParallelCost = dataParallelCommunicationInBytes / ins.bandwidth; 1629 | const double theRestxxxxxxxxx = (edgeCommunicationInBytes / ins.bandwidth + compute) / d; 1630 | cerr << setprecision(10) << fixed; 1631 | DBG(dataParallelCost); 1632 | DBG(theRestxxxxxxxxx); 1633 | } 1634 | 1635 | const double stageTPS = communicationInBytes / ins.bandwidth + compute / d; 1636 | 1637 | finalTPS = max(finalTPS, stageTPS); 1638 | } 1639 | return finalTPS; 1640 | } 1641 | 1642 | 1643 | Result runPipeDream2BWPlanner (const Instance &ins, 1644 | bool useTensorParallelism, 1645 | bool tryPuttingNonTransformerNodesSeparately) { 1646 | vector transformers = ins.getTransformerIds(); 1647 | vector initialNodes, finalNodes; // before and after transformers, respectively 1648 | 1649 | while (transformers.size() + initialNodes.size() + finalNodes.size() < ins.nodes.size()) { 1650 | for (const Edge &e : ins.edges) { 1651 | const int u = e.sourceId, v = e.destId; 1652 | const bool initialU = count(initialNodes.begin(), initialNodes.end(), u); 1653 | const bool initialV = count(initialNodes.begin(), initialNodes.end(), v); 1654 | const bool finalU = count(finalNodes.begin(), finalNodes.end(), u); 1655 | const bool finalV = count(finalNodes.begin(), finalNodes.end(), v); 1656 | const bool transformerU = count(transformers.begin(), transformers.end(), u); 1657 | const bool transformerV = count(transformers.begin(), transformers.end(), v); 1658 | 1659 | if (!initialU && !transformerU && !finalU) { 1660 | if (initialV || transformerV) { 1661 | initialNodes.push_back(u); 1662 | } 1663 | } 1664 | if (!initialV && !transformerV && !finalV) { 1665 | if (transformerU || finalU) { 1666 | finalNodes.push_back(v); 1667 | } 1668 | } 1669 | } 1670 | } 1671 | 1672 | Graph g(ins); // needed to run getTPSForResult() 1673 | 1674 | double bestTPS = INFTY; 1675 | Result bestResult; 1676 | 1677 | for (int putNonTransformerNodesSeparately = 0; 1678 | putNonTransformerNodesSeparately <= tryPuttingNonTransformerNodesSeparately; 1679 | ++putNonTransformerNodesSeparately) { 1680 | 1681 | for (int stages = 1 + 2 * putNonTransformerNodesSeparately; 1682 | stages <= ins.maxDevices && stages <= transformers.size() + 2 * putNonTransformerNodesSeparately; 1683 | stages ++) { 1684 | 1685 | Result result; 1686 | result.stages.resize(stages); 1687 | 1688 | const int transformerStages = stages - 2 * 
putNonTransformerNodesSeparately, 1689 | firstTransformerStage = putNonTransformerNodesSeparately, 1690 | lastTransformerStage = stages - 1 - putNonTransformerNodesSeparately; 1691 | assert(lastTransformerStage - firstTransformerStage + 1 == transformerStages); 1692 | 1693 | // divide transformers equally-ish among stages 1694 | const int transformersPerStage = transformers.size() / transformerStages, 1695 | largerStages = transformers.size() % transformerStages; 1696 | int nextTransformerIndex = 0; 1697 | for (int st = firstTransformerStage; st <= lastTransformerStage; ++st) { 1698 | // last `largerStages` stages have one additional transformer each 1699 | const bool thisIsALargerStage = (st >= lastTransformerStage + 1 - largerStages); 1700 | for (int i = 0; i < transformersPerStage + thisIsALargerStage; ++i) { 1701 | result.stages[st].nodes.push_back(transformers[nextTransformerIndex]); 1702 | nextTransformerIndex++; 1703 | } 1704 | } 1705 | assert(nextTransformerIndex == transformers.size()); 1706 | // the initial nodes go to first stage, the final nodes go to last stage 1707 | append(result.stages[0].nodes, initialNodes); 1708 | append(result.stages.back().nodes, finalNodes); 1709 | result.debugInfo = (putNonTransformerNodesSeparately ? "1" : "0"); 1710 | 1711 | for (int d = 1; stages*d <= min(ins.maxDevices, ins.mbsInBatch); d ++) { 1712 | if (ins.mbsInBatch % d == 0) { 1713 | for (int t = 1; stages*d*t <= ins.maxDevices; t ++) { 1714 | if (t > 1) { 1715 | if (!useTensorParallelism) { 1716 | break; 1717 | } 1718 | // should check if all nodes support t-degree tensor parallelism 1719 | // we'll just check one node since in our inputs they all support some t or not 1720 | if (!ins.nodes[0].TMPCs.count(t)) { 1721 | continue; 1722 | } 1723 | } 1724 | 1725 | for (ResultStage &rs : result.stages) { 1726 | rs.dataParallelDegree = d; 1727 | rs.tensorParallelDegree = t; 1728 | } 1729 | 1730 | // PipeDream-2BW uses activation recomputation everywhere or nowhere 1731 | for (bool activationRecomputationEverywhere : {false, true}) { 1732 | for (ResultStage &rs : result.stages) { 1733 | for (int v : rs.nodes) { 1734 | rs.TMPCids[v] = activationRecomputationEverywhere ? "activation recomp" : "vanilla"; 1735 | } 1736 | } 1737 | 1738 | // result is built. 
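// Note on the search space enumerated by the surrounding loops: for each stage count, the
// planner tries every data-parallel degree d with stages*d <= min(maxDevices, mbsInBatch) and
// mbsInBatch % d == 0, every tensor-parallel degree t with stages*d*t <= maxDevices (when
// tensor parallelism is enabled and subject to the TMPC support check), and both the "vanilla"
// and "activation recomp" TMPC choices; each candidate is scored with getTPSForResult() just
// below and the lowest TPS is kept. With made-up values stages = 2, maxDevices = 8,
// mbsInBatch = 8, this would enumerate d in {1, 2, 4} and, for each d, t in {1, ..., 4/d}.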
try it 1739 | const double TPS = g.getTPSForResult(result); 1740 | if (TPS < bestTPS) { 1741 | bestTPS = TPS; 1742 | bestResult = result; 1743 | } 1744 | } 1745 | } 1746 | } 1747 | } 1748 | } 1749 | } 1750 | 1751 | if (bestTPS > INFTY/2) { 1752 | dbg << "2BW couldn't partition in any way (OOM)" << endl; 1753 | return bestResult; // empty result 1754 | } 1755 | 1756 | dbg << "best 2BW: TPS = " << bestTPS 1757 | << ", stages = " << bestResult.stages.size() 1758 | << ", d = " << bestResult.stages[0].dataParallelDegree 1759 | << ", t = " << bestResult.stages[0].tensorParallelDegree 1760 | << ", activation recomp = " << bestResult.stages[0].TMPCids.begin()->second 1761 | << ", put non-transformers separately = " << bestResult.debugInfo 1762 | << endl; 1763 | 1764 | //for (const ResultStage &rs : bestResult.stages) { 1765 | // dbg << rs.nodes << endl; 1766 | //} 1767 | 1768 | return bestResult; 1769 | } 1770 | 1771 | 1772 | 1773 | // run the PipeDream-2BW-like planner, 1774 | // but without treating transformer and non-transformer layers differently 1775 | Result runPipeDream2BWPlannerNonTransformer (const Instance &ins, 1776 | bool useTensorParallelism) { 1777 | Graph g(ins); // needed to run getTPSForResult() 1778 | 1779 | const vector nodeIds = ins.getNodeIdsToposorted(); 1780 | 1781 | double bestTPS = INFTY; 1782 | Result bestResult; 1783 | 1784 | for (int stages = 1; stages <= ins.maxDevices; ++stages) { 1785 | Result result; 1786 | result.stages.resize(stages); 1787 | 1788 | // divide layers equally-ish among stages 1789 | const int layersPerStage = nodeIds.size() / stages, 1790 | largerStages = nodeIds.size() % stages; 1791 | int nextStageIndex = 0; 1792 | for (int st = 0; st <= stages - 1; ++st) { 1793 | // last `largerStages` have one additional layer each 1794 | const bool thisIsALargerStage = (st >= stages - largerStages); 1795 | for (int i = 0; i < layersPerStage + thisIsALargerStage; ++i) { 1796 | result.stages[st].nodes.push_back(nodeIds[nextStageIndex]); 1797 | nextStageIndex++; 1798 | } 1799 | } 1800 | assert(nextStageIndex == nodeIds.size()); 1801 | 1802 | for (int d = 1; stages*d <= min(ins.maxDevices, ins.mbsInBatch); d ++) { 1803 | if (ins.mbsInBatch % d == 0) { 1804 | for (int t = 1; stages*d*t <= ins.maxDevices; t ++) { 1805 | if (t > 1) { 1806 | if (!useTensorParallelism) { 1807 | break; 1808 | } 1809 | // should check if all nodes support t-degree tensor parallelism 1810 | // we'll just check one node since in our inputs they all support some t or not 1811 | if (!ins.nodes[0].TMPCs.count(t)) { 1812 | continue; 1813 | } 1814 | } 1815 | 1816 | for (ResultStage &rs : result.stages) { 1817 | rs.dataParallelDegree = d; 1818 | rs.tensorParallelDegree = t; 1819 | } 1820 | 1821 | // PipeDream-2BW uses activation recomputation everywhere or nowhere 1822 | for (bool activationRecomputationEverywhere : {false, true}) { 1823 | for (ResultStage &rs : result.stages) { 1824 | for (int v : rs.nodes) { 1825 | rs.TMPCids[v] = activationRecomputationEverywhere ? "activation recomp" : "vanilla"; 1826 | } 1827 | } 1828 | 1829 | // result is built. 
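// Unlike runPipeDream2BWPlanner() above, this variant ignores layer types: it simply splits the
// topologically sorted layers as evenly as possible across the stages. It is the equi-partition
// baseline invoked by runResnet() and runGNMT() further below.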
try it 1830 | const double TPS = g.getTPSForResult(result); 1831 | if (TPS < bestTPS) { 1832 | bestTPS = TPS; 1833 | bestResult = result; 1834 | } 1835 | } 1836 | } 1837 | } 1838 | } 1839 | } 1840 | 1841 | 1842 | if (bestTPS > INFTY/2) { 1843 | dbg << "2BW couldn't partition in any way (OOM)" << endl; 1844 | return bestResult; // empty result 1845 | } 1846 | 1847 | dbg << "best 2BW: TPS = " << bestTPS 1848 | << ", stages = " << bestResult.stages.size() 1849 | << ", d = " << bestResult.stages[0].dataParallelDegree 1850 | << ", t = " << bestResult.stages[0].tensorParallelDegree 1851 | << ", activation recomp = " << bestResult.stages[0].TMPCids.begin()->second 1852 | << endl; 1853 | 1854 | //for (const ResultStage &rs : bestResult.stages) { 1855 | // dbg << rs.nodes << endl; 1856 | //} 1857 | 1858 | return bestResult; 1859 | } 1860 | 1861 | 1862 | 1863 | // build a single configuration, one of those that the PipeDream-2BW-like planner would consider 1864 | // (here code unfortunately repeats a lot from runPipeDream2BWPlanner) 1865 | double buildEquiPartitionResult (Instance &ins, 1866 | int d, 1867 | int t, 1868 | int stages, 1869 | bool putNonTransformerNodesSeparately, 1870 | bool activationRecomputationEverywhere) { 1871 | vector transformers = ins.getTransformerIds(); 1872 | vector initialNodes, finalNodes; // before and after transformers, respectively 1873 | 1874 | while (transformers.size() + initialNodes.size() + finalNodes.size() < ins.nodes.size()) { 1875 | for (const Edge &e : ins.edges) { 1876 | const int u = e.sourceId, v = e.destId; 1877 | const bool initialU = count(initialNodes.begin(), initialNodes.end(), u); 1878 | const bool initialV = count(initialNodes.begin(), initialNodes.end(), v); 1879 | const bool finalU = count(finalNodes.begin(), finalNodes.end(), u); 1880 | const bool finalV = count(finalNodes.begin(), finalNodes.end(), v); 1881 | const bool transformerU = count(transformers.begin(), transformers.end(), u); 1882 | const bool transformerV = count(transformers.begin(), transformers.end(), v); 1883 | 1884 | if (!initialU && !transformerU && !finalU) { 1885 | if (initialV || transformerV) { 1886 | initialNodes.push_back(u); 1887 | } 1888 | } 1889 | if (!initialV && !transformerV && !finalV) { 1890 | if (transformerU || finalU) { 1891 | finalNodes.push_back(v); 1892 | } 1893 | } 1894 | } 1895 | } 1896 | 1897 | Graph g(ins); // needed to run getTPSForResult() 1898 | assert(stages >= 1 + 2 * putNonTransformerNodesSeparately); 1899 | assert(stages <= ins.maxDevices && stages <= transformers.size() + 2 * putNonTransformerNodesSeparately); 1900 | 1901 | Result result; 1902 | result.stages.resize(stages); 1903 | 1904 | const int transformerStages = stages - 2 * putNonTransformerNodesSeparately, 1905 | firstTransformerStage = putNonTransformerNodesSeparately, 1906 | lastTransformerStage = stages - 1 - putNonTransformerNodesSeparately; 1907 | assert(lastTransformerStage - firstTransformerStage + 1 == transformerStages); 1908 | 1909 | // divide transformers equally-ish among stages 1910 | const int transformersPerStage = transformers.size() / transformerStages, 1911 | largerStages = transformers.size() % transformerStages; 1912 | int nextTransformerIndex = 0; 1913 | for (int st = firstTransformerStage; st <= lastTransformerStage; ++st) { 1914 | // last `largerStages` stages have one additional transformer each 1915 | const bool thisIsALargerStage = (st >= lastTransformerStage + 1 - largerStages); 1916 | for (int i = 0; i < transformersPerStage + thisIsALargerStage; ++i) { 1917 
| result.stages[st].nodes.push_back(transformers[nextTransformerIndex]); 1918 | nextTransformerIndex++; 1919 | } 1920 | } 1921 | assert(nextTransformerIndex == transformers.size()); 1922 | // the initial nodes go to first stage, the final nodes go to last stage 1923 | append(result.stages[0].nodes, initialNodes); 1924 | append(result.stages.back().nodes, finalNodes); 1925 | result.debugInfo = (putNonTransformerNodesSeparately ? "1" : "0"); 1926 | 1927 | assert(d >= 1); 1928 | assert(stages*d <= min(ins.maxDevices, ins.mbsInBatch)); 1929 | assert(t >= 1); 1930 | assert(stages*d*t <= ins.maxDevices); 1931 | assert(ins.mbsInBatch % d == 0); 1932 | 1933 | // should check if all nodes support t-degree tensor parallelism 1934 | // we'll just check one node since in our inputs they all support some t or not 1935 | assert(ins.nodes[0].TMPCs.count(t)); 1936 | 1937 | for (ResultStage &rs : result.stages) { 1938 | rs.dataParallelDegree = d; 1939 | rs.tensorParallelDegree = t; 1940 | } 1941 | 1942 | for (ResultStage &rs : result.stages) { 1943 | for (int v : rs.nodes) { 1944 | rs.TMPCids[v] = activationRecomputationEverywhere ? "activation recomp" : "vanilla"; 1945 | } 1946 | } 1947 | 1948 | // result is built. try it 1949 | const double TPS = g.getTPSForResult(result); 1950 | 1951 | // dbg output 1952 | dbg << "TPS = " << TPS << endl; 1953 | 1954 | return TPS; 1955 | } 1956 | 1957 | 1958 | 1959 | void runOurAlgoOnInstances (const vector &instances) { 1960 | for (const Instance &ins : instances) { 1961 | // our alg 1962 | clock_t startTime = clock(); 1963 | Graph g(ins); 1964 | Result our = g.runDP(); 1965 | double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 1966 | DBG(runtime) 1967 | if (our.stages.empty()) { 1968 | continue; // infeasible, not interesting 1969 | } 1970 | double ourTPS = g.getTPSForResult(our); 1971 | } 1972 | } 1973 | 1974 | 1975 | // read the BERT32 instance, then possibly add more transformer layers 1976 | Instance readBERTA100 (int noTransformers) { 1977 | assert(noTransformers >= 32); 1978 | json j; 1979 | ifstream jsonfile("inputs/bert32a100.json"); 1980 | jsonfile >> j; 1981 | Instance ins = j.get(); 1982 | for (int t = 33; t <= noTransformers; ++t) { 1983 | ins.insertTransformerLayer(); 1984 | } 1985 | vector transformers = ins.getTransformerIds(); 1986 | dbg << "transformer (renumbered) ids = [" << transformers << "]" << endl; 1987 | return ins; 1988 | } 1989 | 1990 | 1991 | Instance readGNMT () { 1992 | json j; 1993 | ifstream jsonfile("inputs/gnmt.json"); 1994 | jsonfile >> j; 1995 | Instance ins = j.get(); 1996 | return ins; 1997 | } 1998 | 1999 | 2000 | Instance readResnet () { 2001 | json j; 2002 | ifstream jsonfile("inputs/resnet.json"); 2003 | jsonfile >> j; 2004 | Instance ins = j.get(); 2005 | return ins; 2006 | } 2007 | 2008 | 2009 | // run just a single instance 2010 | void single () { 2011 | Instance ins = readBERTA100(32); 2012 | ins.maxMemoryPerDevice = 8.0 * (1 << 30); 2013 | ins.maxDevices = 512; 2014 | ins.mbsInBatch = 1920; 2015 | ins.bandwidth = 25.0 * (1LL << 30); 2016 | 2017 | runOurAlgoOnInstances({ins}); 2018 | } 2019 | 2020 | 2021 | void plots () { 2022 | for (double memGB : {1,2,8,80}) { 2023 | for (int bertSize : {32}) { 2024 | for (int batchSize : {1920}) { 2025 | stringstream ss; 2026 | ss << "bert" << bertSize << "-" << memGB << "GB-bs" << batchSize << ".csv"; 2027 | ofstream of(ss.str()); 2028 | of << ",Number of devices,color,\\sf{type},\\sf{throughput}\n"; 2029 | int cnt = 0; 2030 | 2031 | dbg << 
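// The CSV written by this function (one file per memory size / model size / batch size
// combination) feeds the Fig. 1 bar plots. Each row has the form
// "<index>,$k=...$,<color>,<type>,<throughput>", where <throughput> is ourTPS divided by the TPS
// of the respective variant ("no DP", "no TP", "no AR", "equi", ...); since a lower TPS is
// better here (it is a time, not a rate), that ratio is the variant's throughput relative to
// Piper, whose own row is fixed at 1.000.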
"=================================================" << endl; 2032 | dbg << "bertSize = " << bertSize << ", batch size = " << batchSize << endl; 2033 | dbg << "=================================================" << endl; 2034 | 2035 | for (int k : {8,32,64,128,512,1024,2048}) { 2036 | 2037 | dbg << "=================================================" << endl; 2038 | dbg << "now running k = " << k << endl; 2039 | dbg << "=================================================" << endl; 2040 | 2041 | Instance ins = readBERTA100(bertSize); 2042 | ins.maxMemoryPerDevice = (1 << 30) * 1.0 * memGB; 2043 | ins.maxDevices = k; 2044 | ins.bandwidth = 25.0 * (1 << 30); 2045 | ins.mbsInBatch = batchSize; 2046 | 2047 | if (ins.getMaxTensorParallelDegree() * batchSize < k) { 2048 | dbg << "skipping device count " << k << " since they cannot all be used" << endl; 2049 | continue; 2050 | } 2051 | 2052 | Graph g(ins); 2053 | 2054 | const Result ourResult = g.runDP(); 2055 | if (ourResult.stages.empty()) { 2056 | dbg << "skipping device count " << k << " since our solution is OOM" << endl; 2057 | continue; 2058 | } 2059 | const double ourTPS = g.getTPSForResult(ourResult); 2060 | assert(ourTPS < INFTY/2); 2061 | of << (cnt++) << ",$k=" << k << "$,"; 2062 | of << "#2b7bba,Piper,1.000\n"; 2063 | 2064 | DATA_PARALLELISM_ALLOWED = false; 2065 | const double noDP = g.getTPSForResult(g.runDP()); 2066 | DATA_PARALLELISM_ALLOWED = true; 2067 | of << (cnt++) << ",$k=" << k << "$,"; 2068 | of << "#DB7093,no DP," << setprecision(3) << fixed << (ourTPS / noDP) << "\n"; 2069 | 2070 | TENSOR_PARALLELISM_ALLOWED = false; 2071 | const double noTP = g.getTPSForResult(g.runDP()); 2072 | TENSOR_PARALLELISM_ALLOWED = true; 2073 | of << (cnt++) << ",$k=" << k << "$,"; 2074 | of << "#FFD700,no TP," << setprecision(3) << fixed << (ourTPS / noTP) << "\n"; 2075 | 2076 | ACTIVATION_RECOMPUTATION_ALLOWED = false; 2077 | const double noAR = g.getTPSForResult(g.runDP()); 2078 | ACTIVATION_RECOMPUTATION_ALLOWED = true; 2079 | of << (cnt++) << ",$k=" << k << "$,"; 2080 | of << "#008000,no AR," << setprecision(3) << fixed << (ourTPS / noAR) << "\n"; 2081 | 2082 | //const double PD2BW = g.getTPSForResult(runPipeDream2BWPlanner(ins, false, false)); 2083 | //of << (cnt++) << ",$k=" << k << "$,"; 2084 | //of << "#191970,equi-no TP-no sep," << setprecision(3) << fixed << (ourTPS / PD2BW) << "\n"; 2085 | 2086 | //const double PD2BWHR = g.getTPSForResult(runPipeDream2BWPlanner(ins, true, false)); 2087 | //of << (cnt++) << ",$k=" << k << "$,"; 2088 | //of << "#FF4500,equi-no sep," << setprecision(3) << fixed << (ourTPS / PD2BWHR) << "\n"; 2089 | 2090 | const double PD2BWsep = g.getTPSForResult(runPipeDream2BWPlanner(ins, false, true)); 2091 | of << (cnt++) << ",$k=" << k << "$,"; 2092 | of << "#191970,equi-no TP," << setprecision(3) << fixed << (ourTPS / PD2BWsep) << "\n"; 2093 | 2094 | const double PD2BWsepHR = g.getTPSForResult(runPipeDream2BWPlanner(ins, true, true)); 2095 | of << (cnt++) << ",$k=" << k << "$,"; 2096 | of << "#FF4500,equi," << setprecision(3) << fixed << (ourTPS / PD2BWsepHR) << "\n"; 2097 | } 2098 | } 2099 | } 2100 | } 2101 | } 2102 | 2103 | 2104 | void correlationExperiment () { 2105 | Instance ins = readBERTA100(32); 2106 | ins.maxDevices = 64; 2107 | ins.bandwidth = (300.0 + 25.0)/2 * (1 << 30); 2108 | vector results; 2109 | ins.mbsInBatch = 128; 2110 | results.push_back(buildEquiPartitionResult(ins, 32, 1, 2, false, true)); 2111 | results.push_back(buildEquiPartitionResult(ins, 16, 1, 4, false, true)); 2112 | 
results.push_back(buildEquiPartitionResult(ins, 8, 1, 8, false, true)); 2113 | results.push_back(buildEquiPartitionResult(ins, 4, 1, 16, false, true)); 2114 | results.push_back(buildEquiPartitionResult(ins, 2, 1, 32, false, true)); 2115 | ins.mbsInBatch = 512; 2116 | results.push_back(buildEquiPartitionResult(ins, 32, 1, 2, false, true)); 2117 | results.push_back(buildEquiPartitionResult(ins, 16, 1, 4, false, true)); 2118 | results.push_back(buildEquiPartitionResult(ins, 8, 1, 8, false, true)); 2119 | results.push_back(buildEquiPartitionResult(ins, 4, 1, 16, false, true)); 2120 | results.push_back(buildEquiPartitionResult(ins, 2, 1, 32, false, true)); 2121 | results.push_back(-1e30); 2122 | results.push_back(-1e30); 2123 | results.push_back(-1e30); 2124 | ins.mbsInBatch = 128; 2125 | results.push_back(buildEquiPartitionResult(ins, 32, 2, 1, false, true)); 2126 | results.push_back(buildEquiPartitionResult(ins, 16, 4, 1, false, true)); 2127 | results.push_back(buildEquiPartitionResult(ins, 8, 8, 1, false, true)); 2128 | ins.mbsInBatch = 512; 2129 | results.push_back(buildEquiPartitionResult(ins, 32, 2, 1, false, true)); 2130 | results.push_back(buildEquiPartitionResult(ins, 16, 4, 1, false, true)); 2131 | results.push_back(buildEquiPartitionResult(ins, 8, 8, 1, false, true)); 2132 | for (double res : results) { 2133 | cout << res << endl; 2134 | } 2135 | } 2136 | 2137 | 2138 | 2139 | 2140 | 2141 | void scalability () { 2142 | for (int bertSize : {32,48,64,96}) { 2143 | for (int batchSize : {480,512,1920,2048}) { 2144 | //if (bertSize == 96 && batchSize == 2048) continue; 2145 | Instance ins = readBERTA100(bertSize); 2146 | stringstream ss; 2147 | ss << "bert" << bertSize << "bs" << batchSize << "times.txt"; 2148 | ofstream of(ss.str()); 2149 | of << "k time stddev\n"; 2150 | for (int k : {8,16,32,64,128,256,512,1024,1536,2048}) { 2151 | const int tries = (k > 600) ? 
3 : 5; 2152 | vector times; 2153 | for (int t = 0; t < tries; ++t) { 2154 | ins.maxDevices = k; 2155 | ins.mbsInBatch = batchSize; 2156 | clock_t startTime = clock(); 2157 | Graph g(ins); 2158 | g.runDP(); 2159 | const double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 2160 | times.push_back(runtime); 2161 | } 2162 | of << k << " " << setprecision(3) << fixed << average(times) << " " << sampleStddev(times) << endl; 2163 | } 2164 | } 2165 | } 2166 | } 2167 | 2168 | 2169 | 2170 | void runAndWrite (Graph &g, const string &filenameCore) { 2171 | g.runDP(); 2172 | 2173 | ofstream ofDevices(filenameCore + "-devices.txt"); 2174 | ofDevices << "k TPS\n"; 2175 | for (int k = 8; k <= g.ins.maxDevices; ++k) { 2176 | const double TPS_k = g.dp[g.idOfFullSet][k][g.boundOnS]; 2177 | if (TPS_k > INFTY/2) { 2178 | continue; 2179 | } 2180 | ofDevices << k << " " << setprecision(7) << fixed << TPS_k << "\n"; 2181 | } 2182 | 2183 | ofstream ofSBound(filenameCore + "-sbound.txt"); 2184 | ofSBound << "s TPS\n"; 2185 | for (int s = 8; s <= g.boundOnS; ++s) { 2186 | const double TPS_s = g.dp[g.idOfFullSet][g.ins.maxDevices][s]; 2187 | if (TPS_s > INFTY/2) { 2188 | continue; 2189 | } 2190 | ofSBound << s << " " << setprecision(7) << fixed << TPS_s << "\n"; 2191 | } 2192 | } 2193 | 2194 | 2195 | // run the PipeDream-2BW-like planner 2196 | void runPD2BWAndWrite (Instance &ins, const string &filenameCore, bool useTensorParallelism, bool tryPuttingNonTransformerNodesSeparately) { 2197 | 2198 | const int backupMaxDevices = ins.maxDevices; 2199 | 2200 | ofstream ofDevices(filenameCore + "-devices.txt"); 2201 | ofDevices << "k TPS\n"; 2202 | for (int k : {8,16,32,64,128,256,512,768,1024,1024+256,1024+512,1024+512+256,2048}) { 2203 | ins.maxDevices = k; 2204 | Graph g(ins); 2205 | ofDevices << k << " " << setprecision(7) << fixed << g.getTPSForResult(runPipeDream2BWPlanner(ins, useTensorParallelism, tryPuttingNonTransformerNodesSeparately)) << endl; 2206 | } 2207 | 2208 | ins.maxDevices = backupMaxDevices; 2209 | 2210 | 2211 | ofstream ofBatch(filenameCore + "-batch.txt"); 2212 | ofBatch << "bs TPS\n"; 2213 | for (int bs : {8,16,32,64,128,256,256+128,512,512+128,512+256,512+256+128,1024}) { 2214 | ins.mbsInBatch = bs; 2215 | Graph g(ins); 2216 | ofBatch << bs << " " << setprecision(7) << fixed << g.getTPSForResult(runPipeDream2BWPlanner(ins, useTensorParallelism, tryPuttingNonTransformerNodesSeparately)) << endl; 2217 | } 2218 | } 2219 | 2220 | 2221 | // measuring runtime 2222 | void parallelizabilityExperiment (int bertSize, int memSize, int batchSize) { 2223 | Instance ins = readBERTA100(bertSize); 2224 | ins.maxDevices = 2048; 2225 | ins.maxMemoryPerDevice = memSize * 1.0 * (1 << 30); 2226 | ins.mbsInBatch = batchSize; 2227 | Graph g(ins); 2228 | 2229 | // Piper 2230 | runAndWrite(g, "parallel-piper-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2231 | 2232 | DATA_PARALLELISM_ALLOWED = false; 2233 | runAndWrite(g, "parallel-nodp-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2234 | DATA_PARALLELISM_ALLOWED = true; 2235 | 2236 | TENSOR_PARALLELISM_ALLOWED = false; 2237 | runAndWrite(g, "parallel-notp-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2238 | TENSOR_PARALLELISM_ALLOWED = true; 2239 | 2240 | ACTIVATION_RECOMPUTATION_ALLOWED = false; 2241 | runAndWrite(g, "parallel-noar-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2242 | 
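// Ablation pattern used in this function: switch one global flag off, re-run the DP via
// runAndWrite() (which writes the "k TPS" and "s TPS" curves), then switch the flag back on
// before the next ablation; the restore for ACTIVATION_RECOMPUTATION_ALLOWED follows right
// below. A further ablation would take the same shape (sketch only; SOME_FLAG is hypothetical,
// not a flag defined in this file):
//   SOME_FLAG = false;
//   runAndWrite(g, "parallel-noflag-" + to_string(bertSize) + "-" + to_string(memSize)
//               + "GB-bs" + to_string(batchSize));
//   SOME_FLAG = true;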
ACTIVATION_RECOMPUTATION_ALLOWED = true; 2243 | 2244 | //runPD2BWAndWrite(ins, "parallel-equi-no sep-no-TP-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), false, false); 2245 | //runPD2BWAndWrite(ins, "parallel-equi-no sep-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), true, false); 2246 | runPD2BWAndWrite(ins, "parallel-equi-no-TP-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), false, true); 2247 | runPD2BWAndWrite(ins, "parallel-equi-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), true, true); 2248 | } 2249 | 2250 | 2251 | void runResnet () { 2252 | vector> pairs = { 2253 | {8, 4.0}, {8, 8.0}, 2254 | {16, 1.5}, {16, 2.0}, {16, 4.0}, {16, 8.0}, 2255 | {32, 1.5}, {32, 2.0}, {32, 4.0}, {32, 8.0}, 2256 | {64, 1.0}, {64, 1.5}, {64, 2.0}, {64, 4.0}, 2257 | {128, 1.0}, {128, 1.5}, {128, 2.0}, {128, 4.0}, 2258 | {8, 16.0}, {16, 16.0}, {32, 16.0}, {64, 8.0}, {64, 16.0}, 2259 | {128, 8.0}, {128, 16.0} 2260 | }; 2261 | sort(pairs.begin(), pairs.end()); 2262 | for (pair mm : pairs) { 2263 | cerr << endl << endl << endl << endl; 2264 | DBG(mm.first); 2265 | DBG(mm.second); 2266 | Instance ins = readResnet(); 2267 | ins.maxMemoryPerDevice = mm.second * (1 << 30); 2268 | ins.maxDevices = mm.first; 2269 | //ins.mbsInBatch = 1920*4; 2270 | ins.bandwidth = 25.0 * (1LL << 30); 2271 | 2272 | // our alg 2273 | clock_t startTime = clock(); 2274 | Graph g(ins); 2275 | Result our = g.runDP(); 2276 | double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 2277 | DBG(runtime) 2278 | if (our.stages.empty()) { 2279 | continue; // infeasible, not interesting 2280 | } 2281 | double ourTPS = g.getTPSForResult(our); 2282 | 2283 | Result equi = runPipeDream2BWPlannerNonTransformer(ins, false); 2284 | double equiTPS = equi.stages.empty() ? 1e30 : g.getTPSForResult(equi); 2285 | double ratio = equi.stages.empty() ? 0 : ourTPS / equiTPS; 2286 | 2287 | cout << mm.first << " & " 2288 | << mm.second << " & " 2289 | << setprecision(3) << fixed << ourTPS << " & "; 2290 | if (equi.stages.empty()) cout << "OOM"; else cout << setprecision(3) << fixed << equiTPS; 2291 | cout << " & " 2292 | << setprecision(3) << fixed << ratio << "$\\times$ & " 2293 | << setprecision(1) << fixed << runtime << "s \\\\\n"; 2294 | } 2295 | } 2296 | 2297 | 2298 | void runGNMT () { 2299 | vector> pairs = { 2300 | {2, 2.5}, {2, 3.5}, 2301 | {4, 1.2}, {4, 2.5}, 2302 | {8, 0.6}, {8, 1.2}, {8, 2.4}, 2303 | {16, 0.3}, {16, 0.6}, {16, 1.2}, 2304 | {32, 0.3}, 2305 | {64, 0.3}, 2306 | {32, 0.8} 2307 | }; 2308 | sort(pairs.begin(), pairs.end()); 2309 | for (pair mm : pairs) { 2310 | cerr << endl << endl << endl << endl; 2311 | DBG(mm.first); 2312 | DBG(mm.second); 2313 | Instance ins = readGNMT(); 2314 | ins.maxMemoryPerDevice = mm.second * (1 << 30); 2315 | ins.maxDevices = mm.first; 2316 | //ins.mbsInBatch = 256; 2317 | ins.bandwidth = 25.0 * (1LL << 30); 2318 | 2319 | // our alg 2320 | clock_t startTime = clock(); 2321 | Graph g(ins); 2322 | Result our = g.runDP(); 2323 | double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 2324 | DBG(runtime) 2325 | if (our.stages.empty()) { 2326 | continue; // infeasible, not interesting 2327 | } 2328 | double ourTPS = g.getTPSForResult(our); 2329 | 2330 | Result equi = runPipeDream2BWPlannerNonTransformer(ins, false); 2331 | double equiTPS = equi.stages.empty() ? 1e30 : g.getTPSForResult(equi); 2332 | double ratio = equi.stages.empty() ? 
0 : ourTPS / equiTPS; 2333 | 2334 | cout << mm.first << " & " 2335 | << mm.second << " & " 2336 | << setprecision(3) << fixed << ourTPS << " & "; 2337 | if (equi.stages.empty()) cout << "OOM"; else cout << setprecision(3) << fixed << equiTPS; 2338 | cout << " & " 2339 | << setprecision(3) << fixed << ratio << "$\\times$ & " 2340 | << setprecision(1) << fixed << runtime << "s \\\\\n"; 2341 | } 2342 | } 2343 | 2344 | 2345 | int main (int argc, char **argv) { 2346 | plots(); // generates data for the bar plots (Fig. 1) 2347 | 2348 | parallelizabilityExperiment(32, 8, 960); // Fig. 2a 2349 | parallelizabilityExperiment(32, 8, 512); // Fig. 2b 2350 | 2351 | scalability(); // Fig. 3 2352 | 2353 | correlationExperiment(); // Fig. 4 (y-axis values) 2354 | 2355 | // experiment for Appendix C 2356 | OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION = true; 2357 | single(); 2358 | OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION = false; 2359 | 2360 | // additional experiments on Resnet50 and GNMT 2361 | runGNMT(); 2362 | cout << endl; 2363 | runResnet(); 2364 | } 2365 | -------------------------------------------------------------------------------- /inputs/gnmt.json: -------------------------------------------------------------------------------- 1 | { 2 | "maxDevices": 10, 3 | "maxMemoryPerDevice": 80000000000.0, 4 | "bandwidth": 25000000000.0, 5 | "mbsInBatch": 256, 6 | "nodes": [ 7 | { 8 | "id": 0, 9 | "name": "node4", 10 | "TMPCs": { 11 | "1": [ 12 | { 13 | "id": "vanilla", 14 | "timePerSample": 25.678, 15 | "parameterSize": 132382720.0, 16 | "memoryUsageA": 13107200.0, 17 | "memoryUsageB": 264765440.0, 18 | "syncTimeFw": {}, 19 | "syncTimeBw": { 20 | "1": 0.0 21 | } 22 | }, 23 | { 24 | "id": "activation recomp", 25 | "timePerSample": 25.818, 26 | "parameterSize": 132382720.0, 27 | "memoryUsageA": 1179648.0, 28 | "memoryUsageB": 276692992.0, 29 | "syncTimeFw": {}, 30 | "syncTimeBw": { 31 | "1": 0.0 32 | } 33 | } 34 | ] 35 | } 36 | }, 37 | { 38 | "id": 1, 39 | "name": "node5", 40 | "TMPCs": { 41 | "1": [ 42 | { 43 | "id": "vanilla", 44 | "timePerSample": 14.145, 45 | "parameterSize": 67174400.0, 46 | "memoryUsageA": 26214400.0, 47 | "memoryUsageB": 134348800.0, 48 | "syncTimeFw": { 49 | "0": 0.0, 50 | "2": 0.0 51 | }, 52 | "syncTimeBw": { 53 | "3": 0.0 54 | } 55 | }, 56 | { 57 | "id": "activation recomp", 58 | "timePerSample": 28.270999999999997, 59 | "parameterSize": 67174400.0, 60 | "memoryUsageA": 2359296.0, 61 | "memoryUsageB": 158203904.0, 62 | "syncTimeFw": { 63 | "0": 0.0, 64 | "2": 0.0 65 | }, 66 | "syncTimeBw": { 67 | "3": 0.0 68 | } 69 | } 70 | ] 71 | } 72 | }, 73 | { 74 | "id": 2, 75 | "name": "node2", 76 | "TMPCs": { 77 | "1": [ 78 | { 79 | "id": "vanilla", 80 | "timePerSample": 0.0, 81 | "parameterSize": 0.0, 82 | "memoryUsageA": 0.0, 83 | "memoryUsageB": 0.0, 84 | "syncTimeFw": {}, 85 | "syncTimeBw": { 86 | "1": 0.0, 87 | "33": 0.0 88 | } 89 | }, 90 | { 91 | "id": "activation recomp", 92 | "timePerSample": 0.0, 93 | "parameterSize": 0.0, 94 | "memoryUsageA": 0.0, 95 | "memoryUsageB": 0.0, 96 | "syncTimeFw": {}, 97 | "syncTimeBw": { 98 | "1": 0.0, 99 | "33": 0.0 100 | } 101 | } 102 | ] 103 | } 104 | }, 105 | { 106 | "id": 3, 107 | "name": "node6", 108 | "TMPCs": { 109 | "1": [ 110 | { 111 | "id": "vanilla", 112 | "timePerSample": 0.471, 113 | "parameterSize": 0.0, 114 | "memoryUsageA": 26214400.0, 115 | "memoryUsageB": 0.0, 116 | "syncTimeFw": { 117 | "1": 0.0 118 | }, 119 | "syncTimeBw": { 120 | "4": 0.0 121 | } 122 | }, 123 | { 124 | "id": "activation recomp", 125 | "timePerSample": 0.657, 126 | 
"parameterSize": 0.0, 127 | "memoryUsageA": 2359296.0, 128 | "memoryUsageB": 23855104.0, 129 | "syncTimeFw": { 130 | "1": 0.0 131 | }, 132 | "syncTimeBw": { 133 | "4": 0.0 134 | } 135 | } 136 | ] 137 | } 138 | }, 139 | { 140 | "id": 4, 141 | "name": "node7", 142 | "TMPCs": { 143 | "1": [ 144 | { 145 | "id": "vanilla", 146 | "timePerSample": 29.075000000000003, 147 | "parameterSize": 50364416.0, 148 | "memoryUsageA": 14155776.0, 149 | "memoryUsageB": 100728832.0, 150 | "syncTimeFw": { 151 | "3": 0.0 152 | }, 153 | "syncTimeBw": { 154 | "5": 0.0 155 | } 156 | }, 157 | { 158 | "id": "activation recomp", 159 | "timePerSample": 39.373000000000005, 160 | "parameterSize": 50364416.0, 161 | "memoryUsageA": 1274019.8399999999, 162 | "memoryUsageB": 113610588.16, 163 | "syncTimeFw": { 164 | "3": 0.0 165 | }, 166 | "syncTimeBw": { 167 | "5": 0.0 168 | } 169 | } 170 | ] 171 | } 172 | }, 173 | { 174 | "id": 5, 175 | "name": "node8", 176 | "TMPCs": { 177 | "1": [ 178 | { 179 | "id": "vanilla", 180 | "timePerSample": 0.0, 181 | "parameterSize": 0.0, 182 | "memoryUsageA": 13107200.0, 183 | "memoryUsageB": 0.0, 184 | "syncTimeFw": { 185 | "4": 0.0 186 | }, 187 | "syncTimeBw": { 188 | "6": 0.0, 189 | "9": 0.0 190 | } 191 | }, 192 | { 193 | "id": "activation recomp", 194 | "timePerSample": 0.0, 195 | "parameterSize": 0.0, 196 | "memoryUsageA": 1179648.0, 197 | "memoryUsageB": 11927552.0, 198 | "syncTimeFw": { 199 | "4": 0.0 200 | }, 201 | "syncTimeBw": { 202 | "6": 0.0, 203 | "9": 0.0 204 | } 205 | } 206 | ] 207 | } 208 | }, 209 | { 210 | "id": 6, 211 | "name": "node10", 212 | "TMPCs": { 213 | "1": [ 214 | { 215 | "id": "vanilla", 216 | "timePerSample": 0.28300000000000003, 217 | "parameterSize": 0.0, 218 | "memoryUsageA": 13107200.0, 219 | "memoryUsageB": 0.0, 220 | "syncTimeFw": { 221 | "5": 0.0 222 | }, 223 | "syncTimeBw": { 224 | "7": 0.0 225 | } 226 | }, 227 | { 228 | "id": "activation recomp", 229 | "timePerSample": 0.40700000000000003, 230 | "parameterSize": 0.0, 231 | "memoryUsageA": 1179648.0, 232 | "memoryUsageB": 11927552.0, 233 | "syncTimeFw": { 234 | "5": 0.0 235 | }, 236 | "syncTimeBw": { 237 | "7": 0.0 238 | } 239 | } 240 | ] 241 | } 242 | }, 243 | { 244 | "id": 7, 245 | "name": "node11", 246 | "TMPCs": { 247 | "1": [ 248 | { 249 | "id": "vanilla", 250 | "timePerSample": 20.436, 251 | "parameterSize": 33587200.0, 252 | "memoryUsageA": 14155776.0, 253 | "memoryUsageB": 67174400.0, 254 | "syncTimeFw": { 255 | "6": 0.0 256 | }, 257 | "syncTimeBw": { 258 | "8": 0.0 259 | } 260 | }, 261 | { 262 | "id": "activation recomp", 263 | "timePerSample": 27.323, 264 | "parameterSize": 33587200.0, 265 | "memoryUsageA": 1274019.8399999999, 266 | "memoryUsageB": 80056156.16, 267 | "syncTimeFw": { 268 | "6": 0.0 269 | }, 270 | "syncTimeBw": { 271 | "8": 0.0 272 | } 273 | } 274 | ] 275 | } 276 | }, 277 | { 278 | "id": 8, 279 | "name": "node12", 280 | "TMPCs": { 281 | "1": [ 282 | { 283 | "id": "vanilla", 284 | "timePerSample": 0.0, 285 | "parameterSize": 0.0, 286 | "memoryUsageA": 13107200.0, 287 | "memoryUsageB": 0.0, 288 | "syncTimeFw": { 289 | "7": 0.0 290 | }, 291 | "syncTimeBw": { 292 | "9": 0.0 293 | } 294 | }, 295 | { 296 | "id": "activation recomp", 297 | "timePerSample": 0.0, 298 | "parameterSize": 0.0, 299 | "memoryUsageA": 1179648.0, 300 | "memoryUsageB": 11927552.0, 301 | "syncTimeFw": { 302 | "7": 0.0 303 | }, 304 | "syncTimeBw": { 305 | "9": 0.0 306 | } 307 | } 308 | ] 309 | } 310 | }, 311 | { 312 | "id": 9, 313 | "name": "node14", 314 | "TMPCs": { 315 | "1": [ 316 | { 317 | "id": "vanilla", 318 | 
"timePerSample": 0.0, 319 | "parameterSize": 0.0, 320 | "memoryUsageA": 13107200.0, 321 | "memoryUsageB": 0.0, 322 | "syncTimeFw": { 323 | "8": 0.0, 324 | "5": 0.0 325 | }, 326 | "syncTimeBw": { 327 | "10": 0.0, 328 | "13": 0.0 329 | } 330 | }, 331 | { 332 | "id": "activation recomp", 333 | "timePerSample": 0.0, 334 | "parameterSize": 0.0, 335 | "memoryUsageA": 1179648.0, 336 | "memoryUsageB": 11927552.0, 337 | "syncTimeFw": { 338 | "8": 0.0, 339 | "5": 0.0 340 | }, 341 | "syncTimeBw": { 342 | "10": 0.0, 343 | "13": 0.0 344 | } 345 | } 346 | ] 347 | } 348 | }, 349 | { 350 | "id": 10, 351 | "name": "node15", 352 | "TMPCs": { 353 | "1": [ 354 | { 355 | "id": "vanilla", 356 | "timePerSample": 0.29600000000000004, 357 | "parameterSize": 0.0, 358 | "memoryUsageA": 13107200.0, 359 | "memoryUsageB": 0.0, 360 | "syncTimeFw": { 361 | "9": 0.0 362 | }, 363 | "syncTimeBw": { 364 | "11": 0.0 365 | } 366 | }, 367 | { 368 | "id": "activation recomp", 369 | "timePerSample": 0.42500000000000004, 370 | "parameterSize": 0.0, 371 | "memoryUsageA": 1179648.0, 372 | "memoryUsageB": 11927552.0, 373 | "syncTimeFw": { 374 | "9": 0.0 375 | }, 376 | "syncTimeBw": { 377 | "11": 0.0 378 | } 379 | } 380 | ] 381 | } 382 | }, 383 | { 384 | "id": 11, 385 | "name": "node16", 386 | "TMPCs": { 387 | "1": [ 388 | { 389 | "id": "vanilla", 390 | "timePerSample": 20.421, 391 | "parameterSize": 33587200.0, 392 | "memoryUsageA": 14155776.0, 393 | "memoryUsageB": 67174400.0, 394 | "syncTimeFw": { 395 | "10": 0.0 396 | }, 397 | "syncTimeBw": { 398 | "12": 0.0 399 | } 400 | }, 401 | { 402 | "id": "activation recomp", 403 | "timePerSample": 27.323999999999998, 404 | "parameterSize": 33587200.0, 405 | "memoryUsageA": 1274019.8399999999, 406 | "memoryUsageB": 80056156.16, 407 | "syncTimeFw": { 408 | "10": 0.0 409 | }, 410 | "syncTimeBw": { 411 | "12": 0.0 412 | } 413 | } 414 | ] 415 | } 416 | }, 417 | { 418 | "id": 12, 419 | "name": "node17", 420 | "TMPCs": { 421 | "1": [ 422 | { 423 | "id": "vanilla", 424 | "timePerSample": 0.0, 425 | "parameterSize": 0.0, 426 | "memoryUsageA": 13107200.0, 427 | "memoryUsageB": 0.0, 428 | "syncTimeFw": { 429 | "11": 0.0 430 | }, 431 | "syncTimeBw": { 432 | "13": 0.0 433 | } 434 | }, 435 | { 436 | "id": "activation recomp", 437 | "timePerSample": 0.0, 438 | "parameterSize": 0.0, 439 | "memoryUsageA": 1179648.0, 440 | "memoryUsageB": 11927552.0, 441 | "syncTimeFw": { 442 | "11": 0.0 443 | }, 444 | "syncTimeBw": { 445 | "13": 0.0 446 | } 447 | } 448 | ] 449 | } 450 | }, 451 | { 452 | "id": 13, 453 | "name": "node19", 454 | "TMPCs": { 455 | "1": [ 456 | { 457 | "id": "vanilla", 458 | "timePerSample": 0.0, 459 | "parameterSize": 0.0, 460 | "memoryUsageA": 13107200.0, 461 | "memoryUsageB": 0.0, 462 | "syncTimeFw": { 463 | "12": 0.0, 464 | "9": 0.0 465 | }, 466 | "syncTimeBw": { 467 | "14": 0.0, 468 | "17": 0.0 469 | } 470 | }, 471 | { 472 | "id": "activation recomp", 473 | "timePerSample": 0.0, 474 | "parameterSize": 0.0, 475 | "memoryUsageA": 1179648.0, 476 | "memoryUsageB": 11927552.0, 477 | "syncTimeFw": { 478 | "12": 0.0, 479 | "9": 0.0 480 | }, 481 | "syncTimeBw": { 482 | "14": 0.0, 483 | "17": 0.0 484 | } 485 | } 486 | ] 487 | } 488 | }, 489 | { 490 | "id": 14, 491 | "name": "node20", 492 | "TMPCs": { 493 | "1": [ 494 | { 495 | "id": "vanilla", 496 | "timePerSample": 0.28200000000000003, 497 | "parameterSize": 0.0, 498 | "memoryUsageA": 13107200.0, 499 | "memoryUsageB": 0.0, 500 | "syncTimeFw": { 501 | "13": 0.0 502 | }, 503 | "syncTimeBw": { 504 | "15": 0.0 505 | } 506 | }, 507 | { 508 | "id": 
"activation recomp", 509 | "timePerSample": 0.404, 510 | "parameterSize": 0.0, 511 | "memoryUsageA": 1179648.0, 512 | "memoryUsageB": 11927552.0, 513 | "syncTimeFw": { 514 | "13": 0.0 515 | }, 516 | "syncTimeBw": { 517 | "15": 0.0 518 | } 519 | } 520 | ] 521 | } 522 | }, 523 | { 524 | "id": 15, 525 | "name": "node21", 526 | "TMPCs": { 527 | "1": [ 528 | { 529 | "id": "vanilla", 530 | "timePerSample": 20.483, 531 | "parameterSize": 33587200.0, 532 | "memoryUsageA": 14155776.0, 533 | "memoryUsageB": 67174400.0, 534 | "syncTimeFw": { 535 | "14": 0.0 536 | }, 537 | "syncTimeBw": { 538 | "16": 0.0 539 | } 540 | }, 541 | { 542 | "id": "activation recomp", 543 | "timePerSample": 27.372, 544 | "parameterSize": 33587200.0, 545 | "memoryUsageA": 1274019.8399999999, 546 | "memoryUsageB": 80056156.16, 547 | "syncTimeFw": { 548 | "14": 0.0 549 | }, 550 | "syncTimeBw": { 551 | "16": 0.0 552 | } 553 | } 554 | ] 555 | } 556 | }, 557 | { 558 | "id": 16, 559 | "name": "node22", 560 | "TMPCs": { 561 | "1": [ 562 | { 563 | "id": "vanilla", 564 | "timePerSample": 0.0, 565 | "parameterSize": 0.0, 566 | "memoryUsageA": 13107200.0, 567 | "memoryUsageB": 0.0, 568 | "syncTimeFw": { 569 | "15": 0.0 570 | }, 571 | "syncTimeBw": { 572 | "17": 0.0 573 | } 574 | }, 575 | { 576 | "id": "activation recomp", 577 | "timePerSample": 0.0, 578 | "parameterSize": 0.0, 579 | "memoryUsageA": 1179648.0, 580 | "memoryUsageB": 11927552.0, 581 | "syncTimeFw": { 582 | "15": 0.0 583 | }, 584 | "syncTimeBw": { 585 | "17": 0.0 586 | } 587 | } 588 | ] 589 | } 590 | }, 591 | { 592 | "id": 17, 593 | "name": "node24", 594 | "TMPCs": { 595 | "1": [ 596 | { 597 | "id": "vanilla", 598 | "timePerSample": 0.0, 599 | "parameterSize": 0.0, 600 | "memoryUsageA": 13107200.0, 601 | "memoryUsageB": 0.0, 602 | "syncTimeFw": { 603 | "16": 0.0, 604 | "13": 0.0 605 | }, 606 | "syncTimeBw": { 607 | "18": 0.0, 608 | "21": 0.0 609 | } 610 | }, 611 | { 612 | "id": "activation recomp", 613 | "timePerSample": 0.0, 614 | "parameterSize": 0.0, 615 | "memoryUsageA": 1179648.0, 616 | "memoryUsageB": 11927552.0, 617 | "syncTimeFw": { 618 | "16": 0.0, 619 | "13": 0.0 620 | }, 621 | "syncTimeBw": { 622 | "18": 0.0, 623 | "21": 0.0 624 | } 625 | } 626 | ] 627 | } 628 | }, 629 | { 630 | "id": 18, 631 | "name": "node25", 632 | "TMPCs": { 633 | "1": [ 634 | { 635 | "id": "vanilla", 636 | "timePerSample": 0.28, 637 | "parameterSize": 0.0, 638 | "memoryUsageA": 13107200.0, 639 | "memoryUsageB": 0.0, 640 | "syncTimeFw": { 641 | "17": 0.0 642 | }, 643 | "syncTimeBw": { 644 | "19": 0.0 645 | } 646 | }, 647 | { 648 | "id": "activation recomp", 649 | "timePerSample": 0.4, 650 | "parameterSize": 0.0, 651 | "memoryUsageA": 1179648.0, 652 | "memoryUsageB": 11927552.0, 653 | "syncTimeFw": { 654 | "17": 0.0 655 | }, 656 | "syncTimeBw": { 657 | "19": 0.0 658 | } 659 | } 660 | ] 661 | } 662 | }, 663 | { 664 | "id": 19, 665 | "name": "node26", 666 | "TMPCs": { 667 | "1": [ 668 | { 669 | "id": "vanilla", 670 | "timePerSample": 20.647, 671 | "parameterSize": 33587200.0, 672 | "memoryUsageA": 14155776.0, 673 | "memoryUsageB": 67174400.0, 674 | "syncTimeFw": { 675 | "18": 0.0 676 | }, 677 | "syncTimeBw": { 678 | "20": 0.0 679 | } 680 | }, 681 | { 682 | "id": "activation recomp", 683 | "timePerSample": 27.666, 684 | "parameterSize": 33587200.0, 685 | "memoryUsageA": 1274019.8399999999, 686 | "memoryUsageB": 80056156.16, 687 | "syncTimeFw": { 688 | "18": 0.0 689 | }, 690 | "syncTimeBw": { 691 | "20": 0.0 692 | } 693 | } 694 | ] 695 | } 696 | }, 697 | { 698 | "id": 20, 699 | "name": "node27", 
700 | "TMPCs": { 701 | "1": [ 702 | { 703 | "id": "vanilla", 704 | "timePerSample": 0.0, 705 | "parameterSize": 0.0, 706 | "memoryUsageA": 13107200.0, 707 | "memoryUsageB": 0.0, 708 | "syncTimeFw": { 709 | "19": 0.0 710 | }, 711 | "syncTimeBw": { 712 | "21": 0.0 713 | } 714 | }, 715 | { 716 | "id": "activation recomp", 717 | "timePerSample": 0.0, 718 | "parameterSize": 0.0, 719 | "memoryUsageA": 1179648.0, 720 | "memoryUsageB": 11927552.0, 721 | "syncTimeFw": { 722 | "19": 0.0 723 | }, 724 | "syncTimeBw": { 725 | "21": 0.0 726 | } 727 | } 728 | ] 729 | } 730 | }, 731 | { 732 | "id": 21, 733 | "name": "node29", 734 | "TMPCs": { 735 | "1": [ 736 | { 737 | "id": "vanilla", 738 | "timePerSample": 0.0, 739 | "parameterSize": 0.0, 740 | "memoryUsageA": 13107200.0, 741 | "memoryUsageB": 0.0, 742 | "syncTimeFw": { 743 | "20": 0.0, 744 | "17": 0.0 745 | }, 746 | "syncTimeBw": { 747 | "22": 0.0, 748 | "25": 0.0 749 | } 750 | }, 751 | { 752 | "id": "activation recomp", 753 | "timePerSample": 0.0, 754 | "parameterSize": 0.0, 755 | "memoryUsageA": 1179648.0, 756 | "memoryUsageB": 11927552.0, 757 | "syncTimeFw": { 758 | "20": 0.0, 759 | "17": 0.0 760 | }, 761 | "syncTimeBw": { 762 | "22": 0.0, 763 | "25": 0.0 764 | } 765 | } 766 | ] 767 | } 768 | }, 769 | { 770 | "id": 22, 771 | "name": "node30", 772 | "TMPCs": { 773 | "1": [ 774 | { 775 | "id": "vanilla", 776 | "timePerSample": 0.28400000000000003, 777 | "parameterSize": 0.0, 778 | "memoryUsageA": 13107200.0, 779 | "memoryUsageB": 0.0, 780 | "syncTimeFw": { 781 | "21": 0.0 782 | }, 783 | "syncTimeBw": { 784 | "23": 0.0 785 | } 786 | }, 787 | { 788 | "id": "activation recomp", 789 | "timePerSample": 0.404, 790 | "parameterSize": 0.0, 791 | "memoryUsageA": 1179648.0, 792 | "memoryUsageB": 11927552.0, 793 | "syncTimeFw": { 794 | "21": 0.0 795 | }, 796 | "syncTimeBw": { 797 | "23": 0.0 798 | } 799 | } 800 | ] 801 | } 802 | }, 803 | { 804 | "id": 23, 805 | "name": "node31", 806 | "TMPCs": { 807 | "1": [ 808 | { 809 | "id": "vanilla", 810 | "timePerSample": 20.605, 811 | "parameterSize": 33587200.0, 812 | "memoryUsageA": 14155776.0, 813 | "memoryUsageB": 67174400.0, 814 | "syncTimeFw": { 815 | "22": 0.0 816 | }, 817 | "syncTimeBw": { 818 | "24": 0.0 819 | } 820 | }, 821 | { 822 | "id": "activation recomp", 823 | "timePerSample": 27.596, 824 | "parameterSize": 33587200.0, 825 | "memoryUsageA": 1274019.8399999999, 826 | "memoryUsageB": 80056156.16, 827 | "syncTimeFw": { 828 | "22": 0.0 829 | }, 830 | "syncTimeBw": { 831 | "24": 0.0 832 | } 833 | } 834 | ] 835 | } 836 | }, 837 | { 838 | "id": 24, 839 | "name": "node32", 840 | "TMPCs": { 841 | "1": [ 842 | { 843 | "id": "vanilla", 844 | "timePerSample": 0.0, 845 | "parameterSize": 0.0, 846 | "memoryUsageA": 13107200.0, 847 | "memoryUsageB": 0.0, 848 | "syncTimeFw": { 849 | "23": 0.0 850 | }, 851 | "syncTimeBw": { 852 | "25": 0.0 853 | } 854 | }, 855 | { 856 | "id": "activation recomp", 857 | "timePerSample": 0.0, 858 | "parameterSize": 0.0, 859 | "memoryUsageA": 1179648.0, 860 | "memoryUsageB": 11927552.0, 861 | "syncTimeFw": { 862 | "23": 0.0 863 | }, 864 | "syncTimeBw": { 865 | "25": 0.0 866 | } 867 | } 868 | ] 869 | } 870 | }, 871 | { 872 | "id": 25, 873 | "name": "node34", 874 | "TMPCs": { 875 | "1": [ 876 | { 877 | "id": "vanilla", 878 | "timePerSample": 0.0, 879 | "parameterSize": 0.0, 880 | "memoryUsageA": 13107200.0, 881 | "memoryUsageB": 0.0, 882 | "syncTimeFw": { 883 | "24": 0.0, 884 | "21": 0.0 885 | }, 886 | "syncTimeBw": { 887 | "26": 0.0, 888 | "29": 0.0 889 | } 890 | }, 891 | { 892 | "id": 
"activation recomp", 893 | "timePerSample": 0.0, 894 | "parameterSize": 0.0, 895 | "memoryUsageA": 1179648.0, 896 | "memoryUsageB": 11927552.0, 897 | "syncTimeFw": { 898 | "24": 0.0, 899 | "21": 0.0 900 | }, 901 | "syncTimeBw": { 902 | "26": 0.0, 903 | "29": 0.0 904 | } 905 | } 906 | ] 907 | } 908 | }, 909 | { 910 | "id": 26, 911 | "name": "node35", 912 | "TMPCs": { 913 | "1": [ 914 | { 915 | "id": "vanilla", 916 | "timePerSample": 0.28900000000000003, 917 | "parameterSize": 0.0, 918 | "memoryUsageA": 13107200.0, 919 | "memoryUsageB": 0.0, 920 | "syncTimeFw": { 921 | "25": 0.0 922 | }, 923 | "syncTimeBw": { 924 | "27": 0.0 925 | } 926 | }, 927 | { 928 | "id": "activation recomp", 929 | "timePerSample": 0.40900000000000003, 930 | "parameterSize": 0.0, 931 | "memoryUsageA": 1179648.0, 932 | "memoryUsageB": 11927552.0, 933 | "syncTimeFw": { 934 | "25": 0.0 935 | }, 936 | "syncTimeBw": { 937 | "27": 0.0 938 | } 939 | } 940 | ] 941 | } 942 | }, 943 | { 944 | "id": 27, 945 | "name": "node36", 946 | "TMPCs": { 947 | "1": [ 948 | { 949 | "id": "vanilla", 950 | "timePerSample": 21.167, 951 | "parameterSize": 33587200.0, 952 | "memoryUsageA": 14155776.0, 953 | "memoryUsageB": 67174400.0, 954 | "syncTimeFw": { 955 | "26": 0.0 956 | }, 957 | "syncTimeBw": { 958 | "28": 0.0 959 | } 960 | }, 961 | { 962 | "id": "activation recomp", 963 | "timePerSample": 28.222, 964 | "parameterSize": 33587200.0, 965 | "memoryUsageA": 1274019.8399999999, 966 | "memoryUsageB": 80056156.16, 967 | "syncTimeFw": { 968 | "26": 0.0 969 | }, 970 | "syncTimeBw": { 971 | "28": 0.0 972 | } 973 | } 974 | ] 975 | } 976 | }, 977 | { 978 | "id": 28, 979 | "name": "node37", 980 | "TMPCs": { 981 | "1": [ 982 | { 983 | "id": "vanilla", 984 | "timePerSample": 0.0, 985 | "parameterSize": 0.0, 986 | "memoryUsageA": 13107200.0, 987 | "memoryUsageB": 0.0, 988 | "syncTimeFw": { 989 | "27": 0.0 990 | }, 991 | "syncTimeBw": { 992 | "29": 0.0 993 | } 994 | }, 995 | { 996 | "id": "activation recomp", 997 | "timePerSample": 0.0, 998 | "parameterSize": 0.0, 999 | "memoryUsageA": 1179648.0, 1000 | "memoryUsageB": 11927552.0, 1001 | "syncTimeFw": { 1002 | "27": 0.0 1003 | }, 1004 | "syncTimeBw": { 1005 | "29": 0.0 1006 | } 1007 | } 1008 | ] 1009 | } 1010 | }, 1011 | { 1012 | "id": 29, 1013 | "name": "node39", 1014 | "TMPCs": { 1015 | "1": [ 1016 | { 1017 | "id": "vanilla", 1018 | "timePerSample": 0.0, 1019 | "parameterSize": 0.0, 1020 | "memoryUsageA": 13107200.0, 1021 | "memoryUsageB": 0.0, 1022 | "syncTimeFw": { 1023 | "28": 0.0, 1024 | "25": 0.0 1025 | }, 1026 | "syncTimeBw": { 1027 | "33": 0.0 1028 | } 1029 | }, 1030 | { 1031 | "id": "activation recomp", 1032 | "timePerSample": 0.0, 1033 | "parameterSize": 0.0, 1034 | "memoryUsageA": 1179648.0, 1035 | "memoryUsageB": 11927552.0, 1036 | "syncTimeFw": { 1037 | "28": 0.0, 1038 | "25": 0.0 1039 | }, 1040 | "syncTimeBw": { 1041 | "33": 0.0 1042 | } 1043 | } 1044 | ] 1045 | } 1046 | }, 1047 | { 1048 | "id": 30, 1049 | "name": "node41", 1050 | "TMPCs": { 1051 | "1": [ 1052 | { 1053 | "id": "vanilla", 1054 | "timePerSample": 0.5880000000000001, 1055 | "parameterSize": 132382720.0, 1056 | "memoryUsageA": 13107200.0, 1057 | "memoryUsageB": 264765440.0, 1058 | "syncTimeFw": {}, 1059 | "syncTimeBw": { 1060 | "33": 0.0 1061 | } 1062 | }, 1063 | { 1064 | "id": "activation recomp", 1065 | "timePerSample": 0.716, 1066 | "parameterSize": 132382720.0, 1067 | "memoryUsageA": 1179648.0, 1068 | "memoryUsageB": 276692992.0, 1069 | "syncTimeFw": {}, 1070 | "syncTimeBw": { 1071 | "33": 0.0 1072 | } 1073 | } 1074 | ] 
1075 | } 1076 | }, 1077 | { 1078 | "id": 31, 1079 | "name": "node40", 1080 | "TMPCs": { 1081 | "1": [ 1082 | { 1083 | "id": "vanilla", 1084 | "timePerSample": 0.0, 1085 | "parameterSize": 0.0, 1086 | "memoryUsageA": 0.0, 1087 | "memoryUsageB": 0.0, 1088 | "syncTimeFw": {}, 1089 | "syncTimeBw": { 1090 | "32": 0.0, 1091 | "38": 0.0, 1092 | "43": 0.0, 1093 | "49": 0.0, 1094 | "55": 0.0, 1095 | "61": 0.0, 1096 | "67": 0.0, 1097 | "73": 0.0 1098 | } 1099 | }, 1100 | { 1101 | "id": "activation recomp", 1102 | "timePerSample": 0.0, 1103 | "parameterSize": 0.0, 1104 | "memoryUsageA": 0.0, 1105 | "memoryUsageB": 0.0, 1106 | "syncTimeFw": {}, 1107 | "syncTimeBw": { 1108 | "32": 0.0, 1109 | "38": 0.0, 1110 | "43": 0.0, 1111 | "49": 0.0, 1112 | "55": 0.0, 1113 | "61": 0.0, 1114 | "67": 0.0, 1115 | "73": 0.0 1116 | } 1117 | } 1118 | ] 1119 | } 1120 | }, 1121 | { 1122 | "id": 32, 1123 | "name": "node42", 1124 | "TMPCs": { 1125 | "1": [ 1126 | { 1127 | "id": "vanilla", 1128 | "timePerSample": 0.0, 1129 | "parameterSize": 0.0, 1130 | "memoryUsageA": 0.0, 1131 | "memoryUsageB": 0.0, 1132 | "syncTimeFw": { 1133 | "31": 0.0 1134 | }, 1135 | "syncTimeBw": { 1136 | "33": 0.0 1137 | } 1138 | }, 1139 | { 1140 | "id": "activation recomp", 1141 | "timePerSample": 0.0, 1142 | "parameterSize": 0.0, 1143 | "memoryUsageA": 0.0, 1144 | "memoryUsageB": 0.0, 1145 | "syncTimeFw": { 1146 | "31": 0.0 1147 | }, 1148 | "syncTimeBw": { 1149 | "33": 0.0 1150 | } 1151 | } 1152 | ] 1153 | } 1154 | }, 1155 | { 1156 | "id": 33, 1157 | "name": "node43", 1158 | "TMPCs": { 1159 | "1": [ 1160 | { 1161 | "id": "vanilla", 1162 | "timePerSample": 38.296, 1163 | "parameterSize": 41979904.0, 1164 | "memoryUsageA": 27582976.0, 1165 | "memoryUsageB": 83959808.0, 1166 | "syncTimeFw": { 1167 | "30": 0.0, 1168 | "32": 0.0, 1169 | "29": 0.0, 1170 | "2": 0.0 1171 | }, 1172 | "syncTimeBw": { 1173 | "34": 0.0, 1174 | "35": 0.0 1175 | } 1176 | }, 1177 | { 1178 | "id": "activation recomp", 1179 | "timePerSample": 53.506, 1180 | "parameterSize": 41979904.0, 1181 | "memoryUsageA": 2482467.84, 1182 | "memoryUsageB": 109060316.16, 1183 | "syncTimeFw": { 1184 | "30": 0.0, 1185 | "32": 0.0, 1186 | "29": 0.0, 1187 | "2": 0.0 1188 | }, 1189 | "syncTimeBw": { 1190 | "34": 0.0, 1191 | "35": 0.0 1192 | } 1193 | } 1194 | ] 1195 | } 1196 | }, 1197 | { 1198 | "id": 34, 1199 | "name": "node44", 1200 | "TMPCs": { 1201 | "1": [ 1202 | { 1203 | "id": "vanilla", 1204 | "timePerSample": 0.0, 1205 | "parameterSize": 0.0, 1206 | "memoryUsageA": 13107200.0, 1207 | "memoryUsageB": 0.0, 1208 | "syncTimeFw": { 1209 | "33": 0.0 1210 | }, 1211 | "syncTimeBw": { 1212 | "36": 0.0 1213 | } 1214 | }, 1215 | { 1216 | "id": "activation recomp", 1217 | "timePerSample": 0.0, 1218 | "parameterSize": 0.0, 1219 | "memoryUsageA": 1179648.0, 1220 | "memoryUsageB": 11927552.0, 1221 | "syncTimeFw": { 1222 | "33": 0.0 1223 | }, 1224 | "syncTimeBw": { 1225 | "36": 0.0 1226 | } 1227 | } 1228 | ] 1229 | } 1230 | }, 1231 | { 1232 | "id": 35, 1233 | "name": "node46", 1234 | "TMPCs": { 1235 | "1": [ 1236 | { 1237 | "id": "vanilla", 1238 | "timePerSample": 0.0, 1239 | "parameterSize": 0.0, 1240 | "memoryUsageA": 524288.0, 1241 | "memoryUsageB": 0.0, 1242 | "syncTimeFw": { 1243 | "33": 0.0 1244 | }, 1245 | "syncTimeBw": { 1246 | "37": 0.0, 1247 | "42": 0.0, 1248 | "48": 0.0, 1249 | "54": 0.0, 1250 | "60": 0.0, 1251 | "66": 0.0, 1252 | "72": 0.0 1253 | } 1254 | }, 1255 | { 1256 | "id": "activation recomp", 1257 | "timePerSample": 0.0, 1258 | "parameterSize": 0.0, 1259 | "memoryUsageA": 47185.92, 1260 | 
"memoryUsageB": 477102.08, 1261 | "syncTimeFw": { 1262 | "33": 0.0 1263 | }, 1264 | "syncTimeBw": { 1265 | "37": 0.0, 1266 | "42": 0.0, 1267 | "48": 0.0, 1268 | "54": 0.0, 1269 | "60": 0.0, 1270 | "66": 0.0, 1271 | "72": 0.0 1272 | } 1273 | } 1274 | ] 1275 | } 1276 | }, 1277 | { 1278 | "id": 36, 1279 | "name": "node48", 1280 | "TMPCs": { 1281 | "1": [ 1282 | { 1283 | "id": "vanilla", 1284 | "timePerSample": 0.366, 1285 | "parameterSize": 0.0, 1286 | "memoryUsageA": 13107200.0, 1287 | "memoryUsageB": 0.0, 1288 | "syncTimeFw": { 1289 | "34": 0.0 1290 | }, 1291 | "syncTimeBw": { 1292 | "37": 0.0 1293 | } 1294 | }, 1295 | { 1296 | "id": "activation recomp", 1297 | "timePerSample": 0.47, 1298 | "parameterSize": 0.0, 1299 | "memoryUsageA": 1179648.0, 1300 | "memoryUsageB": 11927552.0, 1301 | "syncTimeFw": { 1302 | "34": 0.0 1303 | }, 1304 | "syncTimeBw": { 1305 | "37": 0.0 1306 | } 1307 | } 1308 | ] 1309 | } 1310 | }, 1311 | { 1312 | "id": 37, 1313 | "name": "node49", 1314 | "TMPCs": { 1315 | "1": [ 1316 | { 1317 | "id": "vanilla", 1318 | "timePerSample": 0.0, 1319 | "parameterSize": 0.0, 1320 | "memoryUsageA": 13107200.0, 1321 | "memoryUsageB": 0.0, 1322 | "syncTimeFw": { 1323 | "36": 0.0, 1324 | "35": 0.0 1325 | }, 1326 | "syncTimeBw": { 1327 | "39": 0.0 1328 | } 1329 | }, 1330 | { 1331 | "id": "activation recomp", 1332 | "timePerSample": 0.0, 1333 | "parameterSize": 0.0, 1334 | "memoryUsageA": 1179648.0, 1335 | "memoryUsageB": 11927552.0, 1336 | "syncTimeFw": { 1337 | "36": 0.0, 1338 | "35": 0.0 1339 | }, 1340 | "syncTimeBw": { 1341 | "39": 0.0 1342 | } 1343 | } 1344 | ] 1345 | } 1346 | }, 1347 | { 1348 | "id": 38, 1349 | "name": "node50", 1350 | "TMPCs": { 1351 | "1": [ 1352 | { 1353 | "id": "vanilla", 1354 | "timePerSample": 0.0, 1355 | "parameterSize": 0.0, 1356 | "memoryUsageA": 0.0, 1357 | "memoryUsageB": 0.0, 1358 | "syncTimeFw": { 1359 | "31": 0.0 1360 | }, 1361 | "syncTimeBw": { 1362 | "39": 0.0 1363 | } 1364 | }, 1365 | { 1366 | "id": "activation recomp", 1367 | "timePerSample": 0.0, 1368 | "parameterSize": 0.0, 1369 | "memoryUsageA": 0.0, 1370 | "memoryUsageB": 0.0, 1371 | "syncTimeFw": { 1372 | "31": 0.0 1373 | }, 1374 | "syncTimeBw": { 1375 | "39": 0.0 1376 | } 1377 | } 1378 | ] 1379 | } 1380 | }, 1381 | { 1382 | "id": 39, 1383 | "name": "node51", 1384 | "TMPCs": { 1385 | "1": [ 1386 | { 1387 | "id": "vanilla", 1388 | "timePerSample": 29.261, 1389 | "parameterSize": 50364416.0, 1390 | "memoryUsageA": 14155776.0, 1391 | "memoryUsageB": 100728832.0, 1392 | "syncTimeFw": { 1393 | "37": 0.0, 1394 | "38": 0.0 1395 | }, 1396 | "syncTimeBw": { 1397 | "40": 0.0 1398 | } 1399 | }, 1400 | { 1401 | "id": "activation recomp", 1402 | "timePerSample": 39.778, 1403 | "parameterSize": 50364416.0, 1404 | "memoryUsageA": 1274019.8399999999, 1405 | "memoryUsageB": 113610588.16, 1406 | "syncTimeFw": { 1407 | "37": 0.0, 1408 | "38": 0.0 1409 | }, 1410 | "syncTimeBw": { 1411 | "40": 0.0 1412 | } 1413 | } 1414 | ] 1415 | } 1416 | }, 1417 | { 1418 | "id": 40, 1419 | "name": "node52", 1420 | "TMPCs": { 1421 | "1": [ 1422 | { 1423 | "id": "vanilla", 1424 | "timePerSample": 0.0, 1425 | "parameterSize": 0.0, 1426 | "memoryUsageA": 13107200.0, 1427 | "memoryUsageB": 0.0, 1428 | "syncTimeFw": { 1429 | "39": 0.0 1430 | }, 1431 | "syncTimeBw": { 1432 | "41": 0.0, 1433 | "46": 0.0 1434 | } 1435 | }, 1436 | { 1437 | "id": "activation recomp", 1438 | "timePerSample": 0.0, 1439 | "parameterSize": 0.0, 1440 | "memoryUsageA": 1179648.0, 1441 | "memoryUsageB": 11927552.0, 1442 | "syncTimeFw": { 1443 | "39": 0.0 1444 | 
}, 1445 | "syncTimeBw": { 1446 | "41": 0.0, 1447 | "46": 0.0 1448 | } 1449 | } 1450 | ] 1451 | } 1452 | }, 1453 | { 1454 | "id": 41, 1455 | "name": "node54", 1456 | "TMPCs": { 1457 | "1": [ 1458 | { 1459 | "id": "vanilla", 1460 | "timePerSample": 0.386, 1461 | "parameterSize": 0.0, 1462 | "memoryUsageA": 13107200.0, 1463 | "memoryUsageB": 0.0, 1464 | "syncTimeFw": { 1465 | "40": 0.0 1466 | }, 1467 | "syncTimeBw": { 1468 | "42": 0.0 1469 | } 1470 | }, 1471 | { 1472 | "id": "activation recomp", 1473 | "timePerSample": 0.495, 1474 | "parameterSize": 0.0, 1475 | "memoryUsageA": 1179648.0, 1476 | "memoryUsageB": 11927552.0, 1477 | "syncTimeFw": { 1478 | "40": 0.0 1479 | }, 1480 | "syncTimeBw": { 1481 | "42": 0.0 1482 | } 1483 | } 1484 | ] 1485 | } 1486 | }, 1487 | { 1488 | "id": 42, 1489 | "name": "node55", 1490 | "TMPCs": { 1491 | "1": [ 1492 | { 1493 | "id": "vanilla", 1494 | "timePerSample": 0.0, 1495 | "parameterSize": 0.0, 1496 | "memoryUsageA": 13107200.0, 1497 | "memoryUsageB": 0.0, 1498 | "syncTimeFw": { 1499 | "41": 0.0, 1500 | "35": 0.0 1501 | }, 1502 | "syncTimeBw": { 1503 | "44": 0.0 1504 | } 1505 | }, 1506 | { 1507 | "id": "activation recomp", 1508 | "timePerSample": 0.0, 1509 | "parameterSize": 0.0, 1510 | "memoryUsageA": 1179648.0, 1511 | "memoryUsageB": 11927552.0, 1512 | "syncTimeFw": { 1513 | "41": 0.0, 1514 | "35": 0.0 1515 | }, 1516 | "syncTimeBw": { 1517 | "44": 0.0 1518 | } 1519 | } 1520 | ] 1521 | } 1522 | }, 1523 | { 1524 | "id": 43, 1525 | "name": "node56", 1526 | "TMPCs": { 1527 | "1": [ 1528 | { 1529 | "id": "vanilla", 1530 | "timePerSample": 0.0, 1531 | "parameterSize": 0.0, 1532 | "memoryUsageA": 0.0, 1533 | "memoryUsageB": 0.0, 1534 | "syncTimeFw": { 1535 | "31": 0.0 1536 | }, 1537 | "syncTimeBw": { 1538 | "44": 0.0 1539 | } 1540 | }, 1541 | { 1542 | "id": "activation recomp", 1543 | "timePerSample": 0.0, 1544 | "parameterSize": 0.0, 1545 | "memoryUsageA": 0.0, 1546 | "memoryUsageB": 0.0, 1547 | "syncTimeFw": { 1548 | "31": 0.0 1549 | }, 1550 | "syncTimeBw": { 1551 | "44": 0.0 1552 | } 1553 | } 1554 | ] 1555 | } 1556 | }, 1557 | { 1558 | "id": 44, 1559 | "name": "node57", 1560 | "TMPCs": { 1561 | "1": [ 1562 | { 1563 | "id": "vanilla", 1564 | "timePerSample": 29.324, 1565 | "parameterSize": 50364416.0, 1566 | "memoryUsageA": 14155776.0, 1567 | "memoryUsageB": 100728832.0, 1568 | "syncTimeFw": { 1569 | "42": 0.0, 1570 | "43": 0.0 1571 | }, 1572 | "syncTimeBw": { 1573 | "45": 0.0 1574 | } 1575 | }, 1576 | { 1577 | "id": "activation recomp", 1578 | "timePerSample": 39.877, 1579 | "parameterSize": 50364416.0, 1580 | "memoryUsageA": 1274019.8399999999, 1581 | "memoryUsageB": 113610588.16, 1582 | "syncTimeFw": { 1583 | "42": 0.0, 1584 | "43": 0.0 1585 | }, 1586 | "syncTimeBw": { 1587 | "45": 0.0 1588 | } 1589 | } 1590 | ] 1591 | } 1592 | }, 1593 | { 1594 | "id": 45, 1595 | "name": "node58", 1596 | "TMPCs": { 1597 | "1": [ 1598 | { 1599 | "id": "vanilla", 1600 | "timePerSample": 0.0, 1601 | "parameterSize": 0.0, 1602 | "memoryUsageA": 13107200.0, 1603 | "memoryUsageB": 0.0, 1604 | "syncTimeFw": { 1605 | "44": 0.0 1606 | }, 1607 | "syncTimeBw": { 1608 | "46": 0.0 1609 | } 1610 | }, 1611 | { 1612 | "id": "activation recomp", 1613 | "timePerSample": 0.0, 1614 | "parameterSize": 0.0, 1615 | "memoryUsageA": 1179648.0, 1616 | "memoryUsageB": 11927552.0, 1617 | "syncTimeFw": { 1618 | "44": 0.0 1619 | }, 1620 | "syncTimeBw": { 1621 | "46": 0.0 1622 | } 1623 | } 1624 | ] 1625 | } 1626 | }, 1627 | { 1628 | "id": 46, 1629 | "name": "node60", 1630 | "TMPCs": { 1631 | "1": [ 1632 | { 
1633 | "id": "vanilla", 1634 | "timePerSample": 0.0, 1635 | "parameterSize": 0.0, 1636 | "memoryUsageA": 13107200.0, 1637 | "memoryUsageB": 0.0, 1638 | "syncTimeFw": { 1639 | "45": 0.0, 1640 | "40": 0.0 1641 | }, 1642 | "syncTimeBw": { 1643 | "47": 0.0, 1644 | "52": 0.0 1645 | } 1646 | }, 1647 | { 1648 | "id": "activation recomp", 1649 | "timePerSample": 0.0, 1650 | "parameterSize": 0.0, 1651 | "memoryUsageA": 1179648.0, 1652 | "memoryUsageB": 11927552.0, 1653 | "syncTimeFw": { 1654 | "45": 0.0, 1655 | "40": 0.0 1656 | }, 1657 | "syncTimeBw": { 1658 | "47": 0.0, 1659 | "52": 0.0 1660 | } 1661 | } 1662 | ] 1663 | } 1664 | }, 1665 | { 1666 | "id": 47, 1667 | "name": "node61", 1668 | "TMPCs": { 1669 | "1": [ 1670 | { 1671 | "id": "vanilla", 1672 | "timePerSample": 0.376, 1673 | "parameterSize": 0.0, 1674 | "memoryUsageA": 13107200.0, 1675 | "memoryUsageB": 0.0, 1676 | "syncTimeFw": { 1677 | "46": 0.0 1678 | }, 1679 | "syncTimeBw": { 1680 | "48": 0.0 1681 | } 1682 | }, 1683 | { 1684 | "id": "activation recomp", 1685 | "timePerSample": 0.496, 1686 | "parameterSize": 0.0, 1687 | "memoryUsageA": 1179648.0, 1688 | "memoryUsageB": 11927552.0, 1689 | "syncTimeFw": { 1690 | "46": 0.0 1691 | }, 1692 | "syncTimeBw": { 1693 | "48": 0.0 1694 | } 1695 | } 1696 | ] 1697 | } 1698 | }, 1699 | { 1700 | "id": 48, 1701 | "name": "node62", 1702 | "TMPCs": { 1703 | "1": [ 1704 | { 1705 | "id": "vanilla", 1706 | "timePerSample": 0.0, 1707 | "parameterSize": 0.0, 1708 | "memoryUsageA": 13107200.0, 1709 | "memoryUsageB": 0.0, 1710 | "syncTimeFw": { 1711 | "47": 0.0, 1712 | "35": 0.0 1713 | }, 1714 | "syncTimeBw": { 1715 | "50": 0.0 1716 | } 1717 | }, 1718 | { 1719 | "id": "activation recomp", 1720 | "timePerSample": 0.0, 1721 | "parameterSize": 0.0, 1722 | "memoryUsageA": 1179648.0, 1723 | "memoryUsageB": 11927552.0, 1724 | "syncTimeFw": { 1725 | "47": 0.0, 1726 | "35": 0.0 1727 | }, 1728 | "syncTimeBw": { 1729 | "50": 0.0 1730 | } 1731 | } 1732 | ] 1733 | } 1734 | }, 1735 | { 1736 | "id": 49, 1737 | "name": "node63", 1738 | "TMPCs": { 1739 | "1": [ 1740 | { 1741 | "id": "vanilla", 1742 | "timePerSample": 0.0, 1743 | "parameterSize": 0.0, 1744 | "memoryUsageA": 0.0, 1745 | "memoryUsageB": 0.0, 1746 | "syncTimeFw": { 1747 | "31": 0.0 1748 | }, 1749 | "syncTimeBw": { 1750 | "50": 0.0 1751 | } 1752 | }, 1753 | { 1754 | "id": "activation recomp", 1755 | "timePerSample": 0.0, 1756 | "parameterSize": 0.0, 1757 | "memoryUsageA": 0.0, 1758 | "memoryUsageB": 0.0, 1759 | "syncTimeFw": { 1760 | "31": 0.0 1761 | }, 1762 | "syncTimeBw": { 1763 | "50": 0.0 1764 | } 1765 | } 1766 | ] 1767 | } 1768 | }, 1769 | { 1770 | "id": 50, 1771 | "name": "node64", 1772 | "TMPCs": { 1773 | "1": [ 1774 | { 1775 | "id": "vanilla", 1776 | "timePerSample": 29.527, 1777 | "parameterSize": 50364416.0, 1778 | "memoryUsageA": 14155776.0, 1779 | "memoryUsageB": 100728832.0, 1780 | "syncTimeFw": { 1781 | "48": 0.0, 1782 | "49": 0.0 1783 | }, 1784 | "syncTimeBw": { 1785 | "51": 0.0 1786 | } 1787 | }, 1788 | { 1789 | "id": "activation recomp", 1790 | "timePerSample": 40.251999999999995, 1791 | "parameterSize": 50364416.0, 1792 | "memoryUsageA": 1274019.8399999999, 1793 | "memoryUsageB": 113610588.16, 1794 | "syncTimeFw": { 1795 | "48": 0.0, 1796 | "49": 0.0 1797 | }, 1798 | "syncTimeBw": { 1799 | "51": 0.0 1800 | } 1801 | } 1802 | ] 1803 | } 1804 | }, 1805 | { 1806 | "id": 51, 1807 | "name": "node65", 1808 | "TMPCs": { 1809 | "1": [ 1810 | { 1811 | "id": "vanilla", 1812 | "timePerSample": 0.0, 1813 | "parameterSize": 0.0, 1814 | "memoryUsageA": 
13107200.0, 1815 | "memoryUsageB": 0.0, 1816 | "syncTimeFw": { 1817 | "50": 0.0 1818 | }, 1819 | "syncTimeBw": { 1820 | "52": 0.0 1821 | } 1822 | }, 1823 | { 1824 | "id": "activation recomp", 1825 | "timePerSample": 0.0, 1826 | "parameterSize": 0.0, 1827 | "memoryUsageA": 1179648.0, 1828 | "memoryUsageB": 11927552.0, 1829 | "syncTimeFw": { 1830 | "50": 0.0 1831 | }, 1832 | "syncTimeBw": { 1833 | "52": 0.0 1834 | } 1835 | } 1836 | ] 1837 | } 1838 | }, 1839 | { 1840 | "id": 52, 1841 | "name": "node67", 1842 | "TMPCs": { 1843 | "1": [ 1844 | { 1845 | "id": "vanilla", 1846 | "timePerSample": 0.0, 1847 | "parameterSize": 0.0, 1848 | "memoryUsageA": 13107200.0, 1849 | "memoryUsageB": 0.0, 1850 | "syncTimeFw": { 1851 | "51": 0.0, 1852 | "46": 0.0 1853 | }, 1854 | "syncTimeBw": { 1855 | "53": 0.0, 1856 | "58": 0.0 1857 | } 1858 | }, 1859 | { 1860 | "id": "activation recomp", 1861 | "timePerSample": 0.0, 1862 | "parameterSize": 0.0, 1863 | "memoryUsageA": 1179648.0, 1864 | "memoryUsageB": 11927552.0, 1865 | "syncTimeFw": { 1866 | "51": 0.0, 1867 | "46": 0.0 1868 | }, 1869 | "syncTimeBw": { 1870 | "53": 0.0, 1871 | "58": 0.0 1872 | } 1873 | } 1874 | ] 1875 | } 1876 | }, 1877 | { 1878 | "id": 53, 1879 | "name": "node68", 1880 | "TMPCs": { 1881 | "1": [ 1882 | { 1883 | "id": "vanilla", 1884 | "timePerSample": 0.384, 1885 | "parameterSize": 0.0, 1886 | "memoryUsageA": 13107200.0, 1887 | "memoryUsageB": 0.0, 1888 | "syncTimeFw": { 1889 | "52": 0.0 1890 | }, 1891 | "syncTimeBw": { 1892 | "54": 0.0 1893 | } 1894 | }, 1895 | { 1896 | "id": "activation recomp", 1897 | "timePerSample": 0.505, 1898 | "parameterSize": 0.0, 1899 | "memoryUsageA": 1179648.0, 1900 | "memoryUsageB": 11927552.0, 1901 | "syncTimeFw": { 1902 | "52": 0.0 1903 | }, 1904 | "syncTimeBw": { 1905 | "54": 0.0 1906 | } 1907 | } 1908 | ] 1909 | } 1910 | }, 1911 | { 1912 | "id": 54, 1913 | "name": "node69", 1914 | "TMPCs": { 1915 | "1": [ 1916 | { 1917 | "id": "vanilla", 1918 | "timePerSample": 0.0, 1919 | "parameterSize": 0.0, 1920 | "memoryUsageA": 13107200.0, 1921 | "memoryUsageB": 0.0, 1922 | "syncTimeFw": { 1923 | "53": 0.0, 1924 | "35": 0.0 1925 | }, 1926 | "syncTimeBw": { 1927 | "56": 0.0 1928 | } 1929 | }, 1930 | { 1931 | "id": "activation recomp", 1932 | "timePerSample": 0.0, 1933 | "parameterSize": 0.0, 1934 | "memoryUsageA": 1179648.0, 1935 | "memoryUsageB": 11927552.0, 1936 | "syncTimeFw": { 1937 | "53": 0.0, 1938 | "35": 0.0 1939 | }, 1940 | "syncTimeBw": { 1941 | "56": 0.0 1942 | } 1943 | } 1944 | ] 1945 | } 1946 | }, 1947 | { 1948 | "id": 55, 1949 | "name": "node70", 1950 | "TMPCs": { 1951 | "1": [ 1952 | { 1953 | "id": "vanilla", 1954 | "timePerSample": 0.0, 1955 | "parameterSize": 0.0, 1956 | "memoryUsageA": 0.0, 1957 | "memoryUsageB": 0.0, 1958 | "syncTimeFw": { 1959 | "31": 0.0 1960 | }, 1961 | "syncTimeBw": { 1962 | "56": 0.0 1963 | } 1964 | }, 1965 | { 1966 | "id": "activation recomp", 1967 | "timePerSample": 0.0, 1968 | "parameterSize": 0.0, 1969 | "memoryUsageA": 0.0, 1970 | "memoryUsageB": 0.0, 1971 | "syncTimeFw": { 1972 | "31": 0.0 1973 | }, 1974 | "syncTimeBw": { 1975 | "56": 0.0 1976 | } 1977 | } 1978 | ] 1979 | } 1980 | }, 1981 | { 1982 | "id": 56, 1983 | "name": "node71", 1984 | "TMPCs": { 1985 | "1": [ 1986 | { 1987 | "id": "vanilla", 1988 | "timePerSample": 29.506, 1989 | "parameterSize": 50364416.0, 1990 | "memoryUsageA": 14155776.0, 1991 | "memoryUsageB": 100728832.0, 1992 | "syncTimeFw": { 1993 | "54": 0.0, 1994 | "55": 0.0 1995 | }, 1996 | "syncTimeBw": { 1997 | "57": 0.0 1998 | } 1999 | }, 2000 | { 2001 | 
"id": "activation recomp", 2002 | "timePerSample": 40.149, 2003 | "parameterSize": 50364416.0, 2004 | "memoryUsageA": 1274019.8399999999, 2005 | "memoryUsageB": 113610588.16, 2006 | "syncTimeFw": { 2007 | "54": 0.0, 2008 | "55": 0.0 2009 | }, 2010 | "syncTimeBw": { 2011 | "57": 0.0 2012 | } 2013 | } 2014 | ] 2015 | } 2016 | }, 2017 | { 2018 | "id": 57, 2019 | "name": "node72", 2020 | "TMPCs": { 2021 | "1": [ 2022 | { 2023 | "id": "vanilla", 2024 | "timePerSample": 0.0, 2025 | "parameterSize": 0.0, 2026 | "memoryUsageA": 13107200.0, 2027 | "memoryUsageB": 0.0, 2028 | "syncTimeFw": { 2029 | "56": 0.0 2030 | }, 2031 | "syncTimeBw": { 2032 | "58": 0.0 2033 | } 2034 | }, 2035 | { 2036 | "id": "activation recomp", 2037 | "timePerSample": 0.0, 2038 | "parameterSize": 0.0, 2039 | "memoryUsageA": 1179648.0, 2040 | "memoryUsageB": 11927552.0, 2041 | "syncTimeFw": { 2042 | "56": 0.0 2043 | }, 2044 | "syncTimeBw": { 2045 | "58": 0.0 2046 | } 2047 | } 2048 | ] 2049 | } 2050 | }, 2051 | { 2052 | "id": 58, 2053 | "name": "node74", 2054 | "TMPCs": { 2055 | "1": [ 2056 | { 2057 | "id": "vanilla", 2058 | "timePerSample": 0.0, 2059 | "parameterSize": 0.0, 2060 | "memoryUsageA": 13107200.0, 2061 | "memoryUsageB": 0.0, 2062 | "syncTimeFw": { 2063 | "57": 0.0, 2064 | "52": 0.0 2065 | }, 2066 | "syncTimeBw": { 2067 | "59": 0.0, 2068 | "64": 0.0 2069 | } 2070 | }, 2071 | { 2072 | "id": "activation recomp", 2073 | "timePerSample": 0.0, 2074 | "parameterSize": 0.0, 2075 | "memoryUsageA": 1179648.0, 2076 | "memoryUsageB": 11927552.0, 2077 | "syncTimeFw": { 2078 | "57": 0.0, 2079 | "52": 0.0 2080 | }, 2081 | "syncTimeBw": { 2082 | "59": 0.0, 2083 | "64": 0.0 2084 | } 2085 | } 2086 | ] 2087 | } 2088 | }, 2089 | { 2090 | "id": 59, 2091 | "name": "node75", 2092 | "TMPCs": { 2093 | "1": [ 2094 | { 2095 | "id": "vanilla", 2096 | "timePerSample": 0.394, 2097 | "parameterSize": 0.0, 2098 | "memoryUsageA": 13107200.0, 2099 | "memoryUsageB": 0.0, 2100 | "syncTimeFw": { 2101 | "58": 0.0 2102 | }, 2103 | "syncTimeBw": { 2104 | "60": 0.0 2105 | } 2106 | }, 2107 | { 2108 | "id": "activation recomp", 2109 | "timePerSample": 0.525, 2110 | "parameterSize": 0.0, 2111 | "memoryUsageA": 1179648.0, 2112 | "memoryUsageB": 11927552.0, 2113 | "syncTimeFw": { 2114 | "58": 0.0 2115 | }, 2116 | "syncTimeBw": { 2117 | "60": 0.0 2118 | } 2119 | } 2120 | ] 2121 | } 2122 | }, 2123 | { 2124 | "id": 60, 2125 | "name": "node76", 2126 | "TMPCs": { 2127 | "1": [ 2128 | { 2129 | "id": "vanilla", 2130 | "timePerSample": 0.0, 2131 | "parameterSize": 0.0, 2132 | "memoryUsageA": 13107200.0, 2133 | "memoryUsageB": 0.0, 2134 | "syncTimeFw": { 2135 | "59": 0.0, 2136 | "35": 0.0 2137 | }, 2138 | "syncTimeBw": { 2139 | "62": 0.0 2140 | } 2141 | }, 2142 | { 2143 | "id": "activation recomp", 2144 | "timePerSample": 0.0, 2145 | "parameterSize": 0.0, 2146 | "memoryUsageA": 1179648.0, 2147 | "memoryUsageB": 11927552.0, 2148 | "syncTimeFw": { 2149 | "59": 0.0, 2150 | "35": 0.0 2151 | }, 2152 | "syncTimeBw": { 2153 | "62": 0.0 2154 | } 2155 | } 2156 | ] 2157 | } 2158 | }, 2159 | { 2160 | "id": 61, 2161 | "name": "node77", 2162 | "TMPCs": { 2163 | "1": [ 2164 | { 2165 | "id": "vanilla", 2166 | "timePerSample": 0.0, 2167 | "parameterSize": 0.0, 2168 | "memoryUsageA": 0.0, 2169 | "memoryUsageB": 0.0, 2170 | "syncTimeFw": { 2171 | "31": 0.0 2172 | }, 2173 | "syncTimeBw": { 2174 | "62": 0.0 2175 | } 2176 | }, 2177 | { 2178 | "id": "activation recomp", 2179 | "timePerSample": 0.0, 2180 | "parameterSize": 0.0, 2181 | "memoryUsageA": 0.0, 2182 | "memoryUsageB": 0.0, 2183 | 
"syncTimeFw": { 2184 | "31": 0.0 2185 | }, 2186 | "syncTimeBw": { 2187 | "62": 0.0 2188 | } 2189 | } 2190 | ] 2191 | } 2192 | }, 2193 | { 2194 | "id": 62, 2195 | "name": "node78", 2196 | "TMPCs": { 2197 | "1": [ 2198 | { 2199 | "id": "vanilla", 2200 | "timePerSample": 29.687, 2201 | "parameterSize": 50364416.0, 2202 | "memoryUsageA": 14155776.0, 2203 | "memoryUsageB": 100728832.0, 2204 | "syncTimeFw": { 2205 | "60": 0.0, 2206 | "61": 0.0 2207 | }, 2208 | "syncTimeBw": { 2209 | "63": 0.0 2210 | } 2211 | }, 2212 | { 2213 | "id": "activation recomp", 2214 | "timePerSample": 40.338, 2215 | "parameterSize": 50364416.0, 2216 | "memoryUsageA": 1274019.8399999999, 2217 | "memoryUsageB": 113610588.16, 2218 | "syncTimeFw": { 2219 | "60": 0.0, 2220 | "61": 0.0 2221 | }, 2222 | "syncTimeBw": { 2223 | "63": 0.0 2224 | } 2225 | } 2226 | ] 2227 | } 2228 | }, 2229 | { 2230 | "id": 63, 2231 | "name": "node79", 2232 | "TMPCs": { 2233 | "1": [ 2234 | { 2235 | "id": "vanilla", 2236 | "timePerSample": 0.0, 2237 | "parameterSize": 0.0, 2238 | "memoryUsageA": 13107200.0, 2239 | "memoryUsageB": 0.0, 2240 | "syncTimeFw": { 2241 | "62": 0.0 2242 | }, 2243 | "syncTimeBw": { 2244 | "64": 0.0 2245 | } 2246 | }, 2247 | { 2248 | "id": "activation recomp", 2249 | "timePerSample": 0.0, 2250 | "parameterSize": 0.0, 2251 | "memoryUsageA": 1179648.0, 2252 | "memoryUsageB": 11927552.0, 2253 | "syncTimeFw": { 2254 | "62": 0.0 2255 | }, 2256 | "syncTimeBw": { 2257 | "64": 0.0 2258 | } 2259 | } 2260 | ] 2261 | } 2262 | }, 2263 | { 2264 | "id": 64, 2265 | "name": "node81", 2266 | "TMPCs": { 2267 | "1": [ 2268 | { 2269 | "id": "vanilla", 2270 | "timePerSample": 0.0, 2271 | "parameterSize": 0.0, 2272 | "memoryUsageA": 13107200.0, 2273 | "memoryUsageB": 0.0, 2274 | "syncTimeFw": { 2275 | "63": 0.0, 2276 | "58": 0.0 2277 | }, 2278 | "syncTimeBw": { 2279 | "65": 0.0, 2280 | "70": 0.0 2281 | } 2282 | }, 2283 | { 2284 | "id": "activation recomp", 2285 | "timePerSample": 0.0, 2286 | "parameterSize": 0.0, 2287 | "memoryUsageA": 1179648.0, 2288 | "memoryUsageB": 11927552.0, 2289 | "syncTimeFw": { 2290 | "63": 0.0, 2291 | "58": 0.0 2292 | }, 2293 | "syncTimeBw": { 2294 | "65": 0.0, 2295 | "70": 0.0 2296 | } 2297 | } 2298 | ] 2299 | } 2300 | }, 2301 | { 2302 | "id": 65, 2303 | "name": "node82", 2304 | "TMPCs": { 2305 | "1": [ 2306 | { 2307 | "id": "vanilla", 2308 | "timePerSample": 0.387, 2309 | "parameterSize": 0.0, 2310 | "memoryUsageA": 13107200.0, 2311 | "memoryUsageB": 0.0, 2312 | "syncTimeFw": { 2313 | "64": 0.0 2314 | }, 2315 | "syncTimeBw": { 2316 | "66": 0.0 2317 | } 2318 | }, 2319 | { 2320 | "id": "activation recomp", 2321 | "timePerSample": 0.513, 2322 | "parameterSize": 0.0, 2323 | "memoryUsageA": 1179648.0, 2324 | "memoryUsageB": 11927552.0, 2325 | "syncTimeFw": { 2326 | "64": 0.0 2327 | }, 2328 | "syncTimeBw": { 2329 | "66": 0.0 2330 | } 2331 | } 2332 | ] 2333 | } 2334 | }, 2335 | { 2336 | "id": 66, 2337 | "name": "node83", 2338 | "TMPCs": { 2339 | "1": [ 2340 | { 2341 | "id": "vanilla", 2342 | "timePerSample": 0.0, 2343 | "parameterSize": 0.0, 2344 | "memoryUsageA": 13107200.0, 2345 | "memoryUsageB": 0.0, 2346 | "syncTimeFw": { 2347 | "65": 0.0, 2348 | "35": 0.0 2349 | }, 2350 | "syncTimeBw": { 2351 | "68": 0.0 2352 | } 2353 | }, 2354 | { 2355 | "id": "activation recomp", 2356 | "timePerSample": 0.0, 2357 | "parameterSize": 0.0, 2358 | "memoryUsageA": 1179648.0, 2359 | "memoryUsageB": 11927552.0, 2360 | "syncTimeFw": { 2361 | "65": 0.0, 2362 | "35": 0.0 2363 | }, 2364 | "syncTimeBw": { 2365 | "68": 0.0 2366 | } 2367 | } 2368 | 
] 2369 | } 2370 | }, 2371 | { 2372 | "id": 67, 2373 | "name": "node84", 2374 | "TMPCs": { 2375 | "1": [ 2376 | { 2377 | "id": "vanilla", 2378 | "timePerSample": 0.0, 2379 | "parameterSize": 0.0, 2380 | "memoryUsageA": 0.0, 2381 | "memoryUsageB": 0.0, 2382 | "syncTimeFw": { 2383 | "31": 0.0 2384 | }, 2385 | "syncTimeBw": { 2386 | "68": 0.0 2387 | } 2388 | }, 2389 | { 2390 | "id": "activation recomp", 2391 | "timePerSample": 0.0, 2392 | "parameterSize": 0.0, 2393 | "memoryUsageA": 0.0, 2394 | "memoryUsageB": 0.0, 2395 | "syncTimeFw": { 2396 | "31": 0.0 2397 | }, 2398 | "syncTimeBw": { 2399 | "68": 0.0 2400 | } 2401 | } 2402 | ] 2403 | } 2404 | }, 2405 | { 2406 | "id": 68, 2407 | "name": "node85", 2408 | "TMPCs": { 2409 | "1": [ 2410 | { 2411 | "id": "vanilla", 2412 | "timePerSample": 29.804000000000002, 2413 | "parameterSize": 50364416.0, 2414 | "memoryUsageA": 14155776.0, 2415 | "memoryUsageB": 100728832.0, 2416 | "syncTimeFw": { 2417 | "66": 0.0, 2418 | "67": 0.0 2419 | }, 2420 | "syncTimeBw": { 2421 | "69": 0.0 2422 | } 2423 | }, 2424 | { 2425 | "id": "activation recomp", 2426 | "timePerSample": 40.463, 2427 | "parameterSize": 50364416.0, 2428 | "memoryUsageA": 1274019.8399999999, 2429 | "memoryUsageB": 113610588.16, 2430 | "syncTimeFw": { 2431 | "66": 0.0, 2432 | "67": 0.0 2433 | }, 2434 | "syncTimeBw": { 2435 | "69": 0.0 2436 | } 2437 | } 2438 | ] 2439 | } 2440 | }, 2441 | { 2442 | "id": 69, 2443 | "name": "node86", 2444 | "TMPCs": { 2445 | "1": [ 2446 | { 2447 | "id": "vanilla", 2448 | "timePerSample": 0.0, 2449 | "parameterSize": 0.0, 2450 | "memoryUsageA": 13107200.0, 2451 | "memoryUsageB": 0.0, 2452 | "syncTimeFw": { 2453 | "68": 0.0 2454 | }, 2455 | "syncTimeBw": { 2456 | "70": 0.0 2457 | } 2458 | }, 2459 | { 2460 | "id": "activation recomp", 2461 | "timePerSample": 0.0, 2462 | "parameterSize": 0.0, 2463 | "memoryUsageA": 1179648.0, 2464 | "memoryUsageB": 11927552.0, 2465 | "syncTimeFw": { 2466 | "68": 0.0 2467 | }, 2468 | "syncTimeBw": { 2469 | "70": 0.0 2470 | } 2471 | } 2472 | ] 2473 | } 2474 | }, 2475 | { 2476 | "id": 70, 2477 | "name": "node88", 2478 | "TMPCs": { 2479 | "1": [ 2480 | { 2481 | "id": "vanilla", 2482 | "timePerSample": 0.0, 2483 | "parameterSize": 0.0, 2484 | "memoryUsageA": 13107200.0, 2485 | "memoryUsageB": 0.0, 2486 | "syncTimeFw": { 2487 | "69": 0.0, 2488 | "64": 0.0 2489 | }, 2490 | "syncTimeBw": { 2491 | "71": 0.0, 2492 | "76": 0.0 2493 | } 2494 | }, 2495 | { 2496 | "id": "activation recomp", 2497 | "timePerSample": 0.0, 2498 | "parameterSize": 0.0, 2499 | "memoryUsageA": 1179648.0, 2500 | "memoryUsageB": 11927552.0, 2501 | "syncTimeFw": { 2502 | "69": 0.0, 2503 | "64": 0.0 2504 | }, 2505 | "syncTimeBw": { 2506 | "71": 0.0, 2507 | "76": 0.0 2508 | } 2509 | } 2510 | ] 2511 | } 2512 | }, 2513 | { 2514 | "id": 71, 2515 | "name": "node89", 2516 | "TMPCs": { 2517 | "1": [ 2518 | { 2519 | "id": "vanilla", 2520 | "timePerSample": 0.345, 2521 | "parameterSize": 0.0, 2522 | "memoryUsageA": 13107200.0, 2523 | "memoryUsageB": 0.0, 2524 | "syncTimeFw": { 2525 | "70": 0.0 2526 | }, 2527 | "syncTimeBw": { 2528 | "72": 0.0 2529 | } 2530 | }, 2531 | { 2532 | "id": "activation recomp", 2533 | "timePerSample": 0.489, 2534 | "parameterSize": 0.0, 2535 | "memoryUsageA": 1179648.0, 2536 | "memoryUsageB": 11927552.0, 2537 | "syncTimeFw": { 2538 | "70": 0.0 2539 | }, 2540 | "syncTimeBw": { 2541 | "72": 0.0 2542 | } 2543 | } 2544 | ] 2545 | } 2546 | }, 2547 | { 2548 | "id": 72, 2549 | "name": "node90", 2550 | "TMPCs": { 2551 | "1": [ 2552 | { 2553 | "id": "vanilla", 2554 | 
"timePerSample": 0.0, 2555 | "parameterSize": 0.0, 2556 | "memoryUsageA": 13107200.0, 2557 | "memoryUsageB": 0.0, 2558 | "syncTimeFw": { 2559 | "71": 0.0, 2560 | "35": 0.0 2561 | }, 2562 | "syncTimeBw": { 2563 | "74": 0.0 2564 | } 2565 | }, 2566 | { 2567 | "id": "activation recomp", 2568 | "timePerSample": 0.0, 2569 | "parameterSize": 0.0, 2570 | "memoryUsageA": 1179648.0, 2571 | "memoryUsageB": 11927552.0, 2572 | "syncTimeFw": { 2573 | "71": 0.0, 2574 | "35": 0.0 2575 | }, 2576 | "syncTimeBw": { 2577 | "74": 0.0 2578 | } 2579 | } 2580 | ] 2581 | } 2582 | }, 2583 | { 2584 | "id": 73, 2585 | "name": "node91", 2586 | "TMPCs": { 2587 | "1": [ 2588 | { 2589 | "id": "vanilla", 2590 | "timePerSample": 0.0, 2591 | "parameterSize": 0.0, 2592 | "memoryUsageA": 0.0, 2593 | "memoryUsageB": 0.0, 2594 | "syncTimeFw": { 2595 | "31": 0.0 2596 | }, 2597 | "syncTimeBw": { 2598 | "74": 0.0 2599 | } 2600 | }, 2601 | { 2602 | "id": "activation recomp", 2603 | "timePerSample": 0.0, 2604 | "parameterSize": 0.0, 2605 | "memoryUsageA": 0.0, 2606 | "memoryUsageB": 0.0, 2607 | "syncTimeFw": { 2608 | "31": 0.0 2609 | }, 2610 | "syncTimeBw": { 2611 | "74": 0.0 2612 | } 2613 | } 2614 | ] 2615 | } 2616 | }, 2617 | { 2618 | "id": 74, 2619 | "name": "node92", 2620 | "TMPCs": { 2621 | "1": [ 2622 | { 2623 | "id": "vanilla", 2624 | "timePerSample": 76.825, 2625 | "parameterSize": 50364416.0, 2626 | "memoryUsageA": 14155776.0, 2627 | "memoryUsageB": 100728832.0, 2628 | "syncTimeFw": { 2629 | "72": 0.0, 2630 | "73": 0.0 2631 | }, 2632 | "syncTimeBw": { 2633 | "75": 0.0 2634 | } 2635 | }, 2636 | { 2637 | "id": "activation recomp", 2638 | "timePerSample": 87.436, 2639 | "parameterSize": 50364416.0, 2640 | "memoryUsageA": 1274019.8399999999, 2641 | "memoryUsageB": 113610588.16, 2642 | "syncTimeFw": { 2643 | "72": 0.0, 2644 | "73": 0.0 2645 | }, 2646 | "syncTimeBw": { 2647 | "75": 0.0 2648 | } 2649 | } 2650 | ] 2651 | } 2652 | }, 2653 | { 2654 | "id": 75, 2655 | "name": "node93", 2656 | "TMPCs": { 2657 | "1": [ 2658 | { 2659 | "id": "vanilla", 2660 | "timePerSample": 0.0, 2661 | "parameterSize": 0.0, 2662 | "memoryUsageA": 13107200.0, 2663 | "memoryUsageB": 0.0, 2664 | "syncTimeFw": { 2665 | "74": 0.0 2666 | }, 2667 | "syncTimeBw": { 2668 | "76": 0.0 2669 | } 2670 | }, 2671 | { 2672 | "id": "activation recomp", 2673 | "timePerSample": 0.0, 2674 | "parameterSize": 0.0, 2675 | "memoryUsageA": 1179648.0, 2676 | "memoryUsageB": 11927552.0, 2677 | "syncTimeFw": { 2678 | "74": 0.0 2679 | }, 2680 | "syncTimeBw": { 2681 | "76": 0.0 2682 | } 2683 | } 2684 | ] 2685 | } 2686 | }, 2687 | { 2688 | "id": 76, 2689 | "name": "node95", 2690 | "TMPCs": { 2691 | "1": [ 2692 | { 2693 | "id": "vanilla", 2694 | "timePerSample": 0.0, 2695 | "parameterSize": 0.0, 2696 | "memoryUsageA": 13107200.0, 2697 | "memoryUsageB": 0.0, 2698 | "syncTimeFw": { 2699 | "75": 0.0, 2700 | "70": 0.0 2701 | }, 2702 | "syncTimeBw": { 2703 | "77": 0.0 2704 | } 2705 | }, 2706 | { 2707 | "id": "activation recomp", 2708 | "timePerSample": 0.0, 2709 | "parameterSize": 0.0, 2710 | "memoryUsageA": 1179648.0, 2711 | "memoryUsageB": 11927552.0, 2712 | "syncTimeFw": { 2713 | "75": 0.0, 2714 | "70": 0.0 2715 | }, 2716 | "syncTimeBw": { 2717 | "77": 0.0 2718 | } 2719 | } 2720 | ] 2721 | } 2722 | }, 2723 | { 2724 | "id": 77, 2725 | "name": "node96", 2726 | "TMPCs": { 2727 | "1": [ 2728 | { 2729 | "id": "vanilla", 2730 | "timePerSample": 30.155, 2731 | "parameterSize": 132512000.0, 2732 | "memoryUsageA": 413696000.0, 2733 | "memoryUsageB": 265024000.0, 2734 | "syncTimeFw": { 2735 | 
"76": 0.0 2736 | }, 2737 | "syncTimeBw": {} 2738 | }, 2739 | { 2740 | "id": "activation recomp", 2741 | "timePerSample": 54.937, 2742 | "parameterSize": 132512000.0, 2743 | "memoryUsageA": 37232640.0, 2744 | "memoryUsageB": 641487360.0, 2745 | "syncTimeFw": { 2746 | "76": 0.0 2747 | }, 2748 | "syncTimeBw": {} 2749 | } 2750 | ] 2751 | } 2752 | } 2753 | ], 2754 | "edges": [ 2755 | { 2756 | "sourceId": 0, 2757 | "destId": 1, 2758 | "communicationCost": 13107200.0 2759 | }, 2760 | { 2761 | "sourceId": 2, 2762 | "destId": 1, 2763 | "communicationCost": 0.0 2764 | }, 2765 | { 2766 | "sourceId": 1, 2767 | "destId": 3, 2768 | "communicationCost": 26214400.0 2769 | }, 2770 | { 2771 | "sourceId": 3, 2772 | "destId": 4, 2773 | "communicationCost": 26214400.0 2774 | }, 2775 | { 2776 | "sourceId": 4, 2777 | "destId": 5, 2778 | "communicationCost": 14155776.0 2779 | }, 2780 | { 2781 | "sourceId": 5, 2782 | "destId": 6, 2783 | "communicationCost": 13107200.0 2784 | }, 2785 | { 2786 | "sourceId": 6, 2787 | "destId": 7, 2788 | "communicationCost": 13107200.0 2789 | }, 2790 | { 2791 | "sourceId": 7, 2792 | "destId": 8, 2793 | "communicationCost": 14155776.0 2794 | }, 2795 | { 2796 | "sourceId": 8, 2797 | "destId": 9, 2798 | "communicationCost": 13107200.0 2799 | }, 2800 | { 2801 | "sourceId": 5, 2802 | "destId": 9, 2803 | "communicationCost": 13107200.0 2804 | }, 2805 | { 2806 | "sourceId": 9, 2807 | "destId": 10, 2808 | "communicationCost": 13107200.0 2809 | }, 2810 | { 2811 | "sourceId": 10, 2812 | "destId": 11, 2813 | "communicationCost": 13107200.0 2814 | }, 2815 | { 2816 | "sourceId": 11, 2817 | "destId": 12, 2818 | "communicationCost": 14155776.0 2819 | }, 2820 | { 2821 | "sourceId": 12, 2822 | "destId": 13, 2823 | "communicationCost": 13107200.0 2824 | }, 2825 | { 2826 | "sourceId": 9, 2827 | "destId": 13, 2828 | "communicationCost": 13107200.0 2829 | }, 2830 | { 2831 | "sourceId": 13, 2832 | "destId": 14, 2833 | "communicationCost": 13107200.0 2834 | }, 2835 | { 2836 | "sourceId": 14, 2837 | "destId": 15, 2838 | "communicationCost": 13107200.0 2839 | }, 2840 | { 2841 | "sourceId": 15, 2842 | "destId": 16, 2843 | "communicationCost": 14155776.0 2844 | }, 2845 | { 2846 | "sourceId": 16, 2847 | "destId": 17, 2848 | "communicationCost": 13107200.0 2849 | }, 2850 | { 2851 | "sourceId": 13, 2852 | "destId": 17, 2853 | "communicationCost": 13107200.0 2854 | }, 2855 | { 2856 | "sourceId": 17, 2857 | "destId": 18, 2858 | "communicationCost": 13107200.0 2859 | }, 2860 | { 2861 | "sourceId": 18, 2862 | "destId": 19, 2863 | "communicationCost": 13107200.0 2864 | }, 2865 | { 2866 | "sourceId": 19, 2867 | "destId": 20, 2868 | "communicationCost": 14155776.0 2869 | }, 2870 | { 2871 | "sourceId": 20, 2872 | "destId": 21, 2873 | "communicationCost": 13107200.0 2874 | }, 2875 | { 2876 | "sourceId": 17, 2877 | "destId": 21, 2878 | "communicationCost": 13107200.0 2879 | }, 2880 | { 2881 | "sourceId": 21, 2882 | "destId": 22, 2883 | "communicationCost": 13107200.0 2884 | }, 2885 | { 2886 | "sourceId": 22, 2887 | "destId": 23, 2888 | "communicationCost": 13107200.0 2889 | }, 2890 | { 2891 | "sourceId": 23, 2892 | "destId": 24, 2893 | "communicationCost": 14155776.0 2894 | }, 2895 | { 2896 | "sourceId": 24, 2897 | "destId": 25, 2898 | "communicationCost": 13107200.0 2899 | }, 2900 | { 2901 | "sourceId": 21, 2902 | "destId": 25, 2903 | "communicationCost": 13107200.0 2904 | }, 2905 | { 2906 | "sourceId": 25, 2907 | "destId": 26, 2908 | "communicationCost": 13107200.0 2909 | }, 2910 | { 2911 | "sourceId": 26, 2912 | 
"destId": 27, 2913 | "communicationCost": 13107200.0 2914 | }, 2915 | { 2916 | "sourceId": 27, 2917 | "destId": 28, 2918 | "communicationCost": 14155776.0 2919 | }, 2920 | { 2921 | "sourceId": 28, 2922 | "destId": 29, 2923 | "communicationCost": 13107200.0 2924 | }, 2925 | { 2926 | "sourceId": 25, 2927 | "destId": 29, 2928 | "communicationCost": 13107200.0 2929 | }, 2930 | { 2931 | "sourceId": 31, 2932 | "destId": 32, 2933 | "communicationCost": 0.0 2934 | }, 2935 | { 2936 | "sourceId": 30, 2937 | "destId": 33, 2938 | "communicationCost": 13107200.0 2939 | }, 2940 | { 2941 | "sourceId": 32, 2942 | "destId": 33, 2943 | "communicationCost": 0.0 2944 | }, 2945 | { 2946 | "sourceId": 29, 2947 | "destId": 33, 2948 | "communicationCost": 13107200.0 2949 | }, 2950 | { 2951 | "sourceId": 2, 2952 | "destId": 33, 2953 | "communicationCost": 0.0 2954 | }, 2955 | { 2956 | "sourceId": 33, 2957 | "destId": 34, 2958 | "communicationCost": 27582976.0 2959 | }, 2960 | { 2961 | "sourceId": 33, 2962 | "destId": 35, 2963 | "communicationCost": 27582976.0 2964 | }, 2965 | { 2966 | "sourceId": 34, 2967 | "destId": 36, 2968 | "communicationCost": 13107200.0 2969 | }, 2970 | { 2971 | "sourceId": 36, 2972 | "destId": 37, 2973 | "communicationCost": 13107200.0 2974 | }, 2975 | { 2976 | "sourceId": 35, 2977 | "destId": 37, 2978 | "communicationCost": 524288.0 2979 | }, 2980 | { 2981 | "sourceId": 31, 2982 | "destId": 38, 2983 | "communicationCost": 0.0 2984 | }, 2985 | { 2986 | "sourceId": 37, 2987 | "destId": 39, 2988 | "communicationCost": 13107200.0 2989 | }, 2990 | { 2991 | "sourceId": 38, 2992 | "destId": 39, 2993 | "communicationCost": 0.0 2994 | }, 2995 | { 2996 | "sourceId": 39, 2997 | "destId": 40, 2998 | "communicationCost": 14155776.0 2999 | }, 3000 | { 3001 | "sourceId": 40, 3002 | "destId": 41, 3003 | "communicationCost": 13107200.0 3004 | }, 3005 | { 3006 | "sourceId": 41, 3007 | "destId": 42, 3008 | "communicationCost": 13107200.0 3009 | }, 3010 | { 3011 | "sourceId": 35, 3012 | "destId": 42, 3013 | "communicationCost": 524288.0 3014 | }, 3015 | { 3016 | "sourceId": 31, 3017 | "destId": 43, 3018 | "communicationCost": 0.0 3019 | }, 3020 | { 3021 | "sourceId": 42, 3022 | "destId": 44, 3023 | "communicationCost": 13107200.0 3024 | }, 3025 | { 3026 | "sourceId": 43, 3027 | "destId": 44, 3028 | "communicationCost": 0.0 3029 | }, 3030 | { 3031 | "sourceId": 44, 3032 | "destId": 45, 3033 | "communicationCost": 14155776.0 3034 | }, 3035 | { 3036 | "sourceId": 45, 3037 | "destId": 46, 3038 | "communicationCost": 13107200.0 3039 | }, 3040 | { 3041 | "sourceId": 40, 3042 | "destId": 46, 3043 | "communicationCost": 13107200.0 3044 | }, 3045 | { 3046 | "sourceId": 46, 3047 | "destId": 47, 3048 | "communicationCost": 13107200.0 3049 | }, 3050 | { 3051 | "sourceId": 47, 3052 | "destId": 48, 3053 | "communicationCost": 13107200.0 3054 | }, 3055 | { 3056 | "sourceId": 35, 3057 | "destId": 48, 3058 | "communicationCost": 524288.0 3059 | }, 3060 | { 3061 | "sourceId": 31, 3062 | "destId": 49, 3063 | "communicationCost": 0.0 3064 | }, 3065 | { 3066 | "sourceId": 48, 3067 | "destId": 50, 3068 | "communicationCost": 13107200.0 3069 | }, 3070 | { 3071 | "sourceId": 49, 3072 | "destId": 50, 3073 | "communicationCost": 0.0 3074 | }, 3075 | { 3076 | "sourceId": 50, 3077 | "destId": 51, 3078 | "communicationCost": 14155776.0 3079 | }, 3080 | { 3081 | "sourceId": 51, 3082 | "destId": 52, 3083 | "communicationCost": 13107200.0 3084 | }, 3085 | { 3086 | "sourceId": 46, 3087 | "destId": 52, 3088 | "communicationCost": 13107200.0 
3089 | }, 3090 | { 3091 | "sourceId": 52, 3092 | "destId": 53, 3093 | "communicationCost": 13107200.0 3094 | }, 3095 | { 3096 | "sourceId": 53, 3097 | "destId": 54, 3098 | "communicationCost": 13107200.0 3099 | }, 3100 | { 3101 | "sourceId": 35, 3102 | "destId": 54, 3103 | "communicationCost": 524288.0 3104 | }, 3105 | { 3106 | "sourceId": 31, 3107 | "destId": 55, 3108 | "communicationCost": 0.0 3109 | }, 3110 | { 3111 | "sourceId": 54, 3112 | "destId": 56, 3113 | "communicationCost": 13107200.0 3114 | }, 3115 | { 3116 | "sourceId": 55, 3117 | "destId": 56, 3118 | "communicationCost": 0.0 3119 | }, 3120 | { 3121 | "sourceId": 56, 3122 | "destId": 57, 3123 | "communicationCost": 14155776.0 3124 | }, 3125 | { 3126 | "sourceId": 57, 3127 | "destId": 58, 3128 | "communicationCost": 13107200.0 3129 | }, 3130 | { 3131 | "sourceId": 52, 3132 | "destId": 58, 3133 | "communicationCost": 13107200.0 3134 | }, 3135 | { 3136 | "sourceId": 58, 3137 | "destId": 59, 3138 | "communicationCost": 13107200.0 3139 | }, 3140 | { 3141 | "sourceId": 59, 3142 | "destId": 60, 3143 | "communicationCost": 13107200.0 3144 | }, 3145 | { 3146 | "sourceId": 35, 3147 | "destId": 60, 3148 | "communicationCost": 524288.0 3149 | }, 3150 | { 3151 | "sourceId": 31, 3152 | "destId": 61, 3153 | "communicationCost": 0.0 3154 | }, 3155 | { 3156 | "sourceId": 60, 3157 | "destId": 62, 3158 | "communicationCost": 13107200.0 3159 | }, 3160 | { 3161 | "sourceId": 61, 3162 | "destId": 62, 3163 | "communicationCost": 0.0 3164 | }, 3165 | { 3166 | "sourceId": 62, 3167 | "destId": 63, 3168 | "communicationCost": 14155776.0 3169 | }, 3170 | { 3171 | "sourceId": 63, 3172 | "destId": 64, 3173 | "communicationCost": 13107200.0 3174 | }, 3175 | { 3176 | "sourceId": 58, 3177 | "destId": 64, 3178 | "communicationCost": 13107200.0 3179 | }, 3180 | { 3181 | "sourceId": 64, 3182 | "destId": 65, 3183 | "communicationCost": 13107200.0 3184 | }, 3185 | { 3186 | "sourceId": 65, 3187 | "destId": 66, 3188 | "communicationCost": 13107200.0 3189 | }, 3190 | { 3191 | "sourceId": 35, 3192 | "destId": 66, 3193 | "communicationCost": 524288.0 3194 | }, 3195 | { 3196 | "sourceId": 31, 3197 | "destId": 67, 3198 | "communicationCost": 0.0 3199 | }, 3200 | { 3201 | "sourceId": 66, 3202 | "destId": 68, 3203 | "communicationCost": 13107200.0 3204 | }, 3205 | { 3206 | "sourceId": 67, 3207 | "destId": 68, 3208 | "communicationCost": 0.0 3209 | }, 3210 | { 3211 | "sourceId": 68, 3212 | "destId": 69, 3213 | "communicationCost": 14155776.0 3214 | }, 3215 | { 3216 | "sourceId": 69, 3217 | "destId": 70, 3218 | "communicationCost": 13107200.0 3219 | }, 3220 | { 3221 | "sourceId": 64, 3222 | "destId": 70, 3223 | "communicationCost": 13107200.0 3224 | }, 3225 | { 3226 | "sourceId": 70, 3227 | "destId": 71, 3228 | "communicationCost": 13107200.0 3229 | }, 3230 | { 3231 | "sourceId": 71, 3232 | "destId": 72, 3233 | "communicationCost": 13107200.0 3234 | }, 3235 | { 3236 | "sourceId": 35, 3237 | "destId": 72, 3238 | "communicationCost": 524288.0 3239 | }, 3240 | { 3241 | "sourceId": 31, 3242 | "destId": 73, 3243 | "communicationCost": 0.0 3244 | }, 3245 | { 3246 | "sourceId": 72, 3247 | "destId": 74, 3248 | "communicationCost": 13107200.0 3249 | }, 3250 | { 3251 | "sourceId": 73, 3252 | "destId": 74, 3253 | "communicationCost": 0.0 3254 | }, 3255 | { 3256 | "sourceId": 74, 3257 | "destId": 75, 3258 | "communicationCost": 14155776.0 3259 | }, 3260 | { 3261 | "sourceId": 75, 3262 | "destId": 76, 3263 | "communicationCost": 13107200.0 3264 | }, 3265 | { 3266 | "sourceId": 70, 3267 | 
"destId": 76, 3268 | "communicationCost": 13107200.0 3269 | }, 3270 | { 3271 | "sourceId": 76, 3272 | "destId": 77, 3273 | "communicationCost": 13107200.0 3274 | } 3275 | ] 3276 | } --------------------------------------------------------------------------------