├── CONTRIBUTING.md ├── LICENSE.txt ├── README.md ├── algo.cpp ├── inputs ├── bert32a100.json ├── gnmt.json └── resnet.json └── nlohmann └── json.hpp /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | This project welcomes contributions and suggestions. Most contributions require you to 4 | agree to a Contributor License Agreement (CLA) declaring that you have the right to, 5 | and actually do, grant us the rights to use your contribution. For details, visit 6 | https://cla.microsoft.com. 7 | 8 | When you submit a pull request, a CLA-bot will automatically determine whether you need 9 | to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the 10 | instructions provided by the bot. You will only need to do this once across all repositories using our CLA. 11 | 12 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 13 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 14 | or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Code for NeurIPS 2021 paper "Piper: Multidimensional Planner for DNN Parallelization" 2 | 3 | MIT License 4 | 5 | Copyright (c) 2021 - present Microsoft Corporation 6 | 7 | All rights reserved. 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy 10 | of this software and associated documentation files (the "Software"), to deal 11 | in the Software without restriction, including without limitation the rights 12 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 13 | copies of the Software, and to permit persons to whom the Software is 14 | furnished to do so, subject to the following conditions: 15 | 16 | The above copyright notice and this permission notice shall be included in all 17 | copies or substantial portions of the Software. 18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 22 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 23 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 24 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 25 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Piper: Multidimensional Planner for DNN Parallelization - code 2 | 3 | This code package contains algorithms (proof-of-concept implementation) and input files (profiled DNN models / workloads) from the paper "Piper: Multidimensional Planner for DNN Parallelization" published at NeurIPS 2021. 4 | It allows one to reproduce the results in the paper, as well as run the partitioning algorithms on other workloads. 5 | 6 | ## Input format 7 | 8 | All our algorithms take as input a JSON file with the following format (all fields are mandatory unless indicated otherwise). 
This format closely follows our model (see Section 3 "Problem Setup" in the paper): 9 | * `maxMemoryPerDevice` (floating-point): a memory size limit of a single accelerator, in bytes, 10 | * `maxDevices` (integer): number of accelerators (`k` from the paper), 11 | * `mbsInBatch` (integer): number of microbatches in a batch (`N` from the paper), 12 | * `bandwidth` (floating-point): bandwidth from each device to the outside, in bytes per second, 13 | * `nodes` (array): for each node (layer): 14 | * `id` (integer): unique ID of node, 15 | * `TMPCs` (dictionary): mapping from tensor-parallelism degree (`t`) to an array of TMPCs, each having: 16 | * `id` (string): name, 17 | * `timePerSample` (floating-point): compute latency (backward+forward, quantity `p` from the paper), 18 | * `parameterSize` (floating-point): size of weights (to be used in computing data-parallel resync costs, quantity `w` from the paper), 19 | * `memoryUsageA`, `memoryUsageB` (floating-point): memory usage coefficients `a` and `b` (see paper), 20 | * `syncTimeFw` (dictionary): mapping from tails of incoming edges to their parameters `c^fw` (see paper), 21 | * `syncTimeBw` (dictionary): mapping from heads of outgoing edges to their parameters `c^bw` (see paper), 22 | * `edges` (array): for each edge: 23 | * `sourceId` (integer): the ID of the tail of the edge (edge from `sourceId` to `destId`), 24 | * `destId` (integer): the ID of the head of the edge, 25 | * `communicationCost` (floating-point): cost of transfer over this edge (in bytes). 26 | 27 | Other debug information may be present in the input files, such as `name`s on nodes. A minimal example input is shown at the end of this README. 28 | 29 | ## Piper algorithm 30 | 31 | The solution is implemented in `algo.cpp`. It is a single C++ file (using one header-only library for JSON parsing) and can be compiled with a recent version of `gcc` by running e.g. `g++ -O3 algo.cpp -o algo.exe`. 32 | 33 | The compiled program runs experiments from the paper - see `main()` at the end of `algo.cpp`. 34 | It is possible to run only a subset of the evaluations by simply commenting out some lines in `main()`. 35 | The simplest mode of usage is shown in `single()`. 36 | The main example input file is `inputs/bert32a100.json`. 37 | 38 | ## Legal notices 39 | 40 | **Trademarks** 41 | This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies. 42 | 43 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 44 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 45 | or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. 46 | 47 | We use the [JSON for Modern C++](https://github.com/nlohmann/json) library, copyright (c) 2013-2020 Niels Lohmann, licensed under the MIT license. 
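
## Example input

The following is a minimal, hand-written input in the format described under "Input format" above. All numbers are purely illustrative (not profiled from any real model); real inputs such as `inputs/bert32a100.json` are much larger. As read by `algo.cpp`, the microbatch count is the field `mbsInBatch`, `syncTimeFw` is keyed by the IDs of a node's predecessors and `syncTimeBw` by its successors, and the `syncTime` values are added to edge `communicationCost`s, so they are also in bytes; this sketch should pass the checks in `checkInputCorrectness()`.

```json
{
  "maxMemoryPerDevice": 1.6e10,
  "maxDevices": 4,
  "mbsInBatch": 8,
  "bandwidth": 2.5e10,
  "nodes": [
    {
      "id": 0,
      "name": "layer0",
      "TMPCs": {
        "1": [
          {
            "id": "vanilla",
            "timePerSample": 0.004,
            "parameterSize": 2.0e8,
            "memoryUsageA": 1.5e9,
            "memoryUsageB": 3.0e9,
            "syncTimeFw": {},
            "syncTimeBw": {"1": 1.0e6}
          }
        ]
      }
    },
    {
      "id": 1,
      "name": "layer1",
      "TMPCs": {
        "1": [
          {
            "id": "vanilla",
            "timePerSample": 0.003,
            "parameterSize": 1.0e8,
            "memoryUsageA": 1.0e9,
            "memoryUsageB": 2.0e9,
            "syncTimeFw": {"0": 1.0e6},
            "syncTimeBw": {}
          }
        ]
      }
    }
  ],
  "edges": [
    {"sourceId": 0, "destId": 1, "communicationCost": 4.0e6}
  ]
}
```

To try such a file, save it under `inputs/` and adapt `main()` in `algo.cpp` (e.g. the `single()` mode mentioned above) to load it.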
48 | -------------------------------------------------------------------------------- /algo.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | 10 | #include "nlohmann/json.hpp" 11 | 12 | using namespace std; 13 | using json = nlohmann::json; 14 | 15 | // begin trivial helper stuff 16 | ostream& dbg = cerr; 17 | 18 | void fail (const string &s) { 19 | cout << "FAIL: " << s << endl; 20 | dbg << "FAIL: " << s << endl; 21 | exit(1); 22 | } 23 | 24 | void warn (const string &s) { 25 | dbg << "WARNING: " << s << endl; 26 | } 27 | 28 | #define DBG(vari) cerr<<"["<<__LINE__<<"] "<<#vari<<" = "<<(vari)< 31 | ostream& operator << (ostream &s, const vector &v) { 32 | for (const T &x : v) { 33 | s << x << " "; 34 | } 35 | return s; 36 | } 37 | 38 | template 39 | string to_string (const vector &v) { 40 | stringstream ss; 41 | ss << v; 42 | return ss.str(); 43 | } 44 | 45 | template 46 | void append (vector &v, const vector &w) { 47 | v.insert(v.end(), w.begin(), w.end()); 48 | } 49 | 50 | template 51 | inline void minify (T &x, const T &y) { 52 | x = min(x,y); 53 | } 54 | 55 | int ceildiv (int x, int y) { 56 | assert(y > 0); 57 | return (x + y - 1) / y; 58 | } 59 | 60 | constexpr double INFTY = 1e30; 61 | 62 | vector vectorOfSetBits (const vector &v) { 63 | vector res; 64 | for (int i = 0; i < v.size(); ++i) { 65 | if (v[i]) { 66 | res.push_back(i); 67 | } 68 | } 69 | return res; 70 | } 71 | 72 | string formatMemoryUsage (const double M) { 73 | stringstream ss; 74 | ss << setprecision(1) << fixed << (M / (1 << 30)) << " GB"; 75 | return ss.str(); 76 | } 77 | 78 | string formatTPS (double our) { 79 | stringstream ss; 80 | int prec = 1; 81 | // make it so that there are 3 nonzero digits displayed 82 | double powerOur = our * 10; 83 | while (powerOur < 1 && prec < 3) { 84 | prec++; 85 | powerOur *= 10; 86 | } 87 | ss << setprecision(prec+2) << fixed << our << " s"; 88 | return ss.str(); 89 | } 90 | 91 | string formatTPSIncrease (double our, double worse) { 92 | if (worse > INFTY/2) { 93 | return "OOM"; 94 | } 95 | stringstream ss; 96 | double percent = (worse / our - 1) * 100; 97 | if (percent < 100) { 98 | ss << setprecision(2) << fixed << percent << "\\%"; 99 | } else { 100 | double times = worse / our; 101 | ss << setprecision(2) << fixed << times << "x"; 102 | } 103 | return ss.str(); 104 | } 105 | 106 | string formatRuntime (double our) { 107 | stringstream ss; 108 | ss << setprecision(1) << fixed << our << " s"; 109 | return ss.str(); 110 | } 111 | 112 | double average(const vector &numbers) { 113 | if (numbers.empty()) { 114 | fail("trying to take average of an empty vector"); 115 | } 116 | return accumulate(numbers.begin(), numbers.end(), 0.0) / static_cast(numbers.size()); 117 | } 118 | 119 | double sampleStddev(const vector &numbers) { 120 | if (numbers.empty()) { 121 | fail("trying to take stddev of an empty vector"); 122 | } 123 | if (numbers.size() == 1) { 124 | fail("trying to take sample stddev of a single number"); 125 | } 126 | const double avg = average(numbers); 127 | double sum_of_squares = 0.0; 128 | for (double number : numbers) { 129 | sum_of_squares += pow(number - avg, 2); 130 | } 131 | return sqrt(sum_of_squares / (static_cast(numbers.size()) - 1)); 132 | } 133 | // end trivial helper stuff 134 | 135 | 136 | constexpr int IDEALS_LIMIT = 40'000; 137 | constexpr int IDEALS_EXPLORATION_LIMIT = 200'000; 138 | constexpr int DEVICES_LIMIT = 
10'000; // some loose upper bound on number of devices there can be in any reasonable input 139 | bool DATA_PARALLELISM_ALLOWED = true; 140 | bool TENSOR_PARALLELISM_ALLOWED = true; 141 | bool ACTIVATION_RECOMPUTATION_ALLOWED = true; 142 | bool ACTIVATION_RECOMPUTATION_FORCED = false; 143 | bool ACTIVATION_RECOMPUTATION_ALL_LAYERS_OR_NONE = false; 144 | constexpr bool KNAPSACK_FAST_HEURISTIC = true; 145 | bool OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION = false; 146 | bool DEBUG_DATA_PARALLEL_COSTS = false; 147 | constexpr bool FASTER_DP_IMPLEMENTATION = true; 148 | 149 | 150 | struct TMPC { 151 | // known from the context: 152 | // - node id (v) 153 | // - number of devices (k) 154 | double timePerSample; // p 155 | unordered_map syncTimeFw; // syncTimeFw[u] = sfw(u, this node) 156 | unordered_map syncTimeBw; // syncTimeBw[w] = sbw(this node, w) 157 | double parameterSize; // w (only used to compute data-parallel resync costs) 158 | double memoryUsageA, memoryUsageB; // usage is A*y + B, y being ceil(suffix-sum-of-dp-degrees / d) 159 | string id; 160 | }; 161 | 162 | 163 | void from_json (const json &j, TMPC &t) { 164 | j.at("timePerSample").get_to(t.timePerSample); 165 | map syncTimeFw, syncTimeBw; 166 | j.at("syncTimeFw").get_to(syncTimeFw); 167 | for (const pair &p : syncTimeFw) { 168 | // TODO verify that p.first is a number 169 | t.syncTimeFw[stoi(p.first)] = p.second; 170 | } 171 | j.at("syncTimeBw").get_to(syncTimeBw); 172 | for (const pair &p : syncTimeBw) { 173 | // TODO verify that p.first is a number 174 | t.syncTimeBw[stoi(p.first)] = p.second; 175 | } 176 | j.at("parameterSize").get_to(t.parameterSize); 177 | j.at("memoryUsageA").get_to(t.memoryUsageA); 178 | j.at("memoryUsageB").get_to(t.memoryUsageB); 179 | j.at("id").get_to(t.id); 180 | } 181 | 182 | 183 | struct Node { 184 | int id; // v 185 | unordered_map> TMPCs; // TMPCs[k] = vector of TMPCs for number k of devices 186 | string name; // just for debugging etc. 
187 | }; 188 | 189 | 190 | void from_json (const json &j, Node &n) { 191 | j.at("id").get_to(n.id); 192 | map> TMPCs; 193 | j.at("TMPCs").get_to(TMPCs); 194 | for (const pair> &p : TMPCs) { 195 | // TODO verify that p.first is a number 196 | n.TMPCs[stoi(p.first)] = p.second; 197 | } 198 | if (j.count("name")) { 199 | j.at("name").get_to(n.name); 200 | } 201 | } 202 | 203 | 204 | struct Edge { 205 | int sourceId; // u 206 | int destId; // v 207 | double communicationCost; // c(u,v), in bytes 208 | }; 209 | 210 | 211 | void from_json (const json &j, Edge &e) { 212 | j.at("sourceId").get_to(e.sourceId); 213 | j.at("destId").get_to(e.destId); 214 | j.at("communicationCost").get_to(e.communicationCost); 215 | } 216 | 217 | 218 | struct Instance { 219 | double maxMemoryPerDevice; 220 | int maxDevices; 221 | double bandwidth; // in bytes per second 222 | int mbsInBatch; 223 | vector nodes; 224 | vector edges; 225 | 226 | // filled with renumber() 227 | unordered_map newNumber; 228 | vector oldNumber; 229 | 230 | void linearize(); 231 | void checkInputCorrectness() const; 232 | bool isDAG() const; 233 | void renumber(); 234 | 235 | // for our experiments 236 | void insertTransformerLayer(); 237 | vector getTransformerIds() const; 238 | vector getNodeIdsToposorted() const; 239 | int getMaxTensorParallelDegree() const; 240 | }; 241 | 242 | 243 | void from_json (const json &j, Instance &ii) { 244 | j.at("maxMemoryPerDevice").get_to(ii.maxMemoryPerDevice); 245 | j.at("maxDevices").get_to(ii.maxDevices); 246 | j.at("bandwidth").get_to(ii.bandwidth); 247 | j.at("mbsInBatch").get_to(ii.mbsInBatch); 248 | j.at("nodes").get_to(ii.nodes); 249 | j.at("edges").get_to(ii.edges); 250 | 251 | ii.checkInputCorrectness(); 252 | ii.renumber(); 253 | ii.checkInputCorrectness(); 254 | } 255 | 256 | 257 | void Instance::linearize () { 258 | fail("not needed for this work, to be implemented when needed"); 259 | // remember, there should be no parallel edges even after adding the path 260 | } 261 | 262 | 263 | void Instance::checkInputCorrectness() const { 264 | if (maxDevices < 1 || maxDevices > DEVICES_LIMIT) { 265 | fail("wrong number of devices"); 266 | } 267 | if (bandwidth < 1e-9) { 268 | fail("wrong bandwidth"); 269 | } 270 | if (maxMemoryPerDevice < 1e-9) { 271 | fail("wrong maxMemoryPerDevice"); 272 | } 273 | if (mbsInBatch < 1) { 274 | fail("wrong mbsInBatch"); 275 | } 276 | if (nodes.empty()) { 277 | fail("no nodes in input"); 278 | } 279 | set> edgesAsPairs; 280 | unordered_map> incomingEdges, outgoingEdges; 281 | for (const Node &n : nodes) { 282 | incomingEdges[n.id] = {}; 283 | outgoingEdges[n.id] = {}; 284 | } 285 | if (edges.empty()) { 286 | fail("no edges in input"); 287 | } 288 | for (const Edge &e : edges) { 289 | if (e.communicationCost < 0) { 290 | fail("communication cost of edge is negative"); // zero is fine 291 | } 292 | if (edgesAsPairs.insert(make_pair(e.sourceId, e.destId)).second == false) { 293 | fail("duplicated edge"); 294 | } 295 | if (!incomingEdges.count(e.destId) || !outgoingEdges.count(e.sourceId)) { 296 | fail("edge endpoint is not in nodes"); 297 | } 298 | incomingEdges[e.destId].insert(e.sourceId); 299 | outgoingEdges[e.sourceId].insert(e.destId); 300 | } 301 | 302 | if (!isDAG()) { 303 | fail("instance is not acyclic"); 304 | } 305 | 306 | set nodeIds; 307 | for (const Node &n : nodes) { 308 | if (nodeIds.insert(n.id).second == false) { 309 | fail("duplicate node ID"); 310 | } 311 | // check TMPCs 312 | for (const auto &it : n.TMPCs) { 313 | if (it.first < 1 || it.first > 
DEVICES_LIMIT) { 314 | fail("key " + to_string(it.first) + " doesn't look like a number of devices"); 315 | } 316 | set TMPCIds; 317 | for (const TMPC &tmpc : it.second) { 318 | if (tmpc.id.empty()) { 319 | fail("TMPC ID empty"); 320 | } 321 | if (TMPCIds.insert(tmpc.id).second == false) { 322 | fail("TMPC IDs are not distinct within one node and number of devices"); 323 | } 324 | // check each TMPC 325 | if (tmpc.timePerSample < 0) { 326 | fail("wrong timePerSample"); 327 | } 328 | if (tmpc.parameterSize < 0) { 329 | fail("wrong parameterSize"); 330 | } 331 | if (tmpc.memoryUsageA < 0 || tmpc.memoryUsageB < 0) { 332 | fail("wrong memoryUsage"); 333 | } 334 | // syncTimes should contain only real edges 335 | for (const pair &p : tmpc.syncTimeFw) { 336 | if (p.second < 0) { 337 | fail("wrong syncTime"); 338 | } 339 | if (!incomingEdges[n.id].count(p.first)) { 340 | dbg << n.id << "<-" << p.first << endl; 341 | fail("edge present in syncTimes but not in edges"); 342 | } 343 | } 344 | for (const pair &p : tmpc.syncTimeBw) { 345 | if (p.second < 0) { 346 | fail("wrong syncTime"); 347 | } 348 | if (!outgoingEdges[n.id].count(p.first)) { 349 | dbg << n.id << "->" << p.first << endl; 350 | fail("edge present in syncTimes but not in edges"); 351 | } 352 | } 353 | // and they should contain all edges 354 | for (int u : incomingEdges[n.id]) { 355 | if (!tmpc.syncTimeFw.count(u)) { 356 | dbg << "incoming " << n.id << " <- " << u << endl; 357 | fail("edge missing from syncTimes"); 358 | } 359 | } 360 | for (int w : outgoingEdges[n.id]) { 361 | if (!tmpc.syncTimeBw.count(w)) { 362 | dbg << "outgoing " << n.id << " -> " << w << endl; 363 | fail("edge missing from syncTimes"); 364 | } 365 | } 366 | } 367 | if (KNAPSACK_FAST_HEURISTIC) { 368 | if (it.second.size() > 2) { 369 | fail("can't run the fast heuristic with > 2 TMPCs"); 370 | } 371 | } 372 | } 373 | if (!n.TMPCs.count(1) || n.TMPCs.at(1).empty()) { 374 | warn("no TMPC given for executing node " + to_string(n.id) + " w/o tensor parallelism?"); 375 | } 376 | 377 | } 378 | } 379 | 380 | 381 | bool Instance::isDAG() const { 382 | unordered_map indegree; 383 | unordered_map> outgoingEdges; 384 | for (const Edge &e : edges) { 385 | ++indegree[e.destId]; 386 | outgoingEdges[e.sourceId].push_back(e.destId); 387 | } 388 | vector deg0vertices; 389 | for (const Node &n : nodes) { 390 | if (indegree[n.id] == 0) { 391 | deg0vertices.push_back(n.id); 392 | } 393 | } 394 | int processed_vertices = 0; 395 | while (!deg0vertices.empty()) { 396 | int v = deg0vertices.back(); 397 | deg0vertices.pop_back(); 398 | ++processed_vertices; 399 | for (int w : outgoingEdges[v]) { 400 | --indegree[w]; 401 | if (indegree[w] == 0) { 402 | deg0vertices.push_back(w); 403 | } 404 | } 405 | } 406 | return processed_vertices == nodes.size(); 407 | } 408 | 409 | 410 | void Instance::renumber () { 411 | // renumber nodes as 0,1,2,... 
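    // (the rest of the code indexes plain vectors directly by node id, e.g. node[], incomingEdges[]/outgoingEdges[]
    //  and the ideal indicator vectors in Graph, so the ids must form the contiguous range 0..nodes.size()-1;
    //  oldNumber/newNumber keep the mapping so that renumberResultBack() can restore the original ids)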
412 | assert(oldNumber.empty()); // oldNumber.size() used as next id 413 | // build oldNumber and newNumber 414 | for (Node &n : nodes) { 415 | newNumber[n.id] = oldNumber.size(); 416 | oldNumber.push_back(n.id); 417 | } 418 | // now replace old ids with new ids everywhere 419 | for (Node &n : nodes) { 420 | n.id = newNumber[n.id]; 421 | for (auto &it : n.TMPCs) { 422 | for (auto &tmpc : it.second) { 423 | unordered_map newSyncTimeFw, newSyncTimeBw; 424 | for (const auto &p : tmpc.syncTimeFw) { 425 | newSyncTimeFw[newNumber[p.first]] = p.second; 426 | } 427 | for (const auto &p : tmpc.syncTimeBw) { 428 | newSyncTimeBw[newNumber[p.first]] = p.second; 429 | } 430 | tmpc.syncTimeFw = move(newSyncTimeFw); 431 | tmpc.syncTimeBw = move(newSyncTimeBw); 432 | } 433 | } 434 | } 435 | for (Edge &e : edges) { 436 | e.sourceId = newNumber[e.sourceId]; 437 | e.destId = newNumber[e.destId]; 438 | } 439 | } 440 | 441 | 442 | void Instance::insertTransformerLayer () { 443 | // make a larger DNN for our experiments 444 | 445 | // find a ParallelTransformerLayer 446 | unordered_set idsOfTransformerLayers; 447 | for (const Node &n : nodes) { 448 | if (n.name == "ParallelTransformerLayer") { 449 | idsOfTransformerLayers.insert(n.id); 450 | } 451 | } 452 | unordered_map prevTransformer, nextTransformer; 453 | for (const Edge &e : edges) { 454 | if (idsOfTransformerLayers.count(e.sourceId) && idsOfTransformerLayers.count(e.destId)) { 455 | prevTransformer[e.destId] = e.sourceId; 456 | nextTransformer[e.sourceId] = e.destId; 457 | } 458 | } 459 | int v = -1; 460 | for (int w : idsOfTransformerLayers) { 461 | if (prevTransformer.count(w) && nextTransformer.count(w)) { 462 | // found 463 | v = w; 464 | break; 465 | } 466 | } 467 | if (v == -1) { 468 | fail("no transformer with transformer precedessor and successor found"); 469 | } 470 | 471 | // replicate v 472 | static int freshNumber = 123456789; 473 | ++freshNumber; 474 | int nu = oldNumber.size(); 475 | oldNumber.push_back(freshNumber); 476 | newNumber[freshNumber] = nu; 477 | 478 | const int s = nextTransformer[v]; 479 | assert(nodes[v].id == v); 480 | assert(nodes[s].id == s); 481 | nodes.emplace_back(); 482 | nodes.back().id = nu; 483 | nodes.back().name = "ParallelTransformerLayer"; 484 | 485 | // redirect edge v->s to be v->nu 486 | // and add edge nu->s 487 | double communicationCost = 1e30; 488 | for (Edge &e : edges) { 489 | if (e.sourceId == v && e.destId == s) { 490 | e.destId = nu; 491 | communicationCost = e.communicationCost; 492 | } 493 | } 494 | assert(communicationCost < 1e29); 495 | edges.emplace_back(); 496 | edges.back().sourceId = nu; 497 | edges.back().destId = s; 498 | edges.back().communicationCost = communicationCost; 499 | 500 | for (auto &it : nodes[v].TMPCs) { 501 | for (auto &tmpc : it.second) { 502 | const double sbw = tmpc.syncTimeBw.at(s); 503 | tmpc.syncTimeBw.erase(s); 504 | tmpc.syncTimeBw.emplace(nu, sbw); 505 | } 506 | } 507 | for (auto &it : nodes[s].TMPCs) { 508 | for (auto &tmpc : it.second) { 509 | const double sfw = tmpc.syncTimeFw.at(v); 510 | tmpc.syncTimeFw.erase(v); 511 | tmpc.syncTimeFw.emplace(nu, sfw); 512 | } 513 | } 514 | // create TMPCs for nu 515 | nodes.back().TMPCs = nodes[v].TMPCs; 516 | for (auto &it : nodes.back().TMPCs) { 517 | for (auto &tmpc : it.second) { 518 | // leave all stuff as it was, just rewrite syncTimeFw and syncTimeBw 519 | const double sfw = tmpc.syncTimeFw.at(prevTransformer[v]); 520 | const double sbw = tmpc.syncTimeBw.at(nu); 521 | tmpc.syncTimeFw = {{v, sfw}}; 522 | tmpc.syncTimeBw = 
{{s, sbw}}; 523 | } 524 | } 525 | 526 | // TODO: ideally, here we should also: 527 | // for each other edge a->v, add a->nu 528 | // for each other edge v->b, add nu->b 529 | // (but, in the BERT model input, the former are irrelevant and latter are absent) 530 | 531 | checkInputCorrectness(); 532 | } 533 | 534 | 535 | vector Instance::getTransformerIds() const { 536 | unordered_set tranformerIdsUnordered; 537 | for (const Node &n : nodes) { 538 | if (n.name == "ParallelTransformerLayer") { 539 | tranformerIdsUnordered.insert(n.id); 540 | } 541 | } 542 | 543 | // return these in topological order 544 | unordered_map indegree; 545 | unordered_map> outgoingEdges; 546 | for (const Edge &e : edges) { 547 | ++indegree[e.destId]; 548 | outgoingEdges[e.sourceId].push_back(e.destId); 549 | } 550 | vector deg0vertices; 551 | for (const Node &n : nodes) { 552 | if (indegree[n.id] == 0) { 553 | deg0vertices.push_back(n.id); 554 | } 555 | } 556 | vector transformerIdsOrdered; 557 | while (!deg0vertices.empty()) { 558 | int v = deg0vertices.back(); 559 | deg0vertices.pop_back(); 560 | if (tranformerIdsUnordered.count(v)) { 561 | transformerIdsOrdered.push_back(v); 562 | } 563 | for (int w : outgoingEdges[v]) { 564 | --indegree[w]; 565 | if (indegree[w] == 0) { 566 | deg0vertices.push_back(w); 567 | } 568 | } 569 | } 570 | return transformerIdsOrdered; 571 | } 572 | 573 | 574 | vector Instance::getNodeIdsToposorted() const { 575 | unordered_map indegree; 576 | unordered_map> outgoingEdges; 577 | for (const Edge &e : edges) { 578 | ++indegree[e.destId]; 579 | outgoingEdges[e.sourceId].push_back(e.destId); 580 | } 581 | vector deg0vertices; 582 | for (const Node &n : nodes) { 583 | if (indegree[n.id] == 0) { 584 | deg0vertices.push_back(n.id); 585 | } 586 | } 587 | vector nodeIdsOrdered; 588 | while (!deg0vertices.empty()) { 589 | int v = deg0vertices.back(); 590 | deg0vertices.pop_back(); 591 | nodeIdsOrdered.push_back(v); 592 | for (int w : outgoingEdges[v]) { 593 | --indegree[w]; 594 | if (indegree[w] == 0) { 595 | deg0vertices.push_back(w); 596 | } 597 | } 598 | } 599 | return nodeIdsOrdered; 600 | } 601 | 602 | 603 | int Instance::getMaxTensorParallelDegree() const { 604 | // returns max t such that some node has some TMPC for tensor-parallel degree t 605 | int res = 1; 606 | for (const Node &n : nodes) { 607 | for (const auto &it : n.TMPCs) { 608 | res = max(res, it.first); 609 | } 610 | } 611 | return res; 612 | } 613 | 614 | 615 | struct ResultStage { 616 | vector nodes; // node ids (this is actually superfluous because of TMPCids) 617 | int dataParallelDegree; // d_i 618 | int tensorParallelDegree; // t_i 619 | unordered_map TMPCids; // TMPCids[v] = identifier of the TMPC used for node v 620 | }; 621 | 622 | 623 | struct Result { 624 | vector stages; 625 | string debugInfo; 626 | }; 627 | 628 | 629 | void to_json (json &j, const ResultStage &rs) { 630 | j = json{{"nodes", rs.nodes}, 631 | {"dataParallelDegree", rs.dataParallelDegree}, 632 | {"tensorParallelDegree", rs.tensorParallelDegree}, 633 | {"TMPCids", rs.TMPCids} 634 | }; 635 | } 636 | 637 | 638 | void to_json (json &j, const Result &r) { 639 | j = json{{"stages", r.stages}}; 640 | } 641 | 642 | 643 | struct Graph { 644 | const Instance &ins; // already renumbered (nodes 0,1,2,...) 
645 | const int boundOnS; // set as min(mbsInBatch, maxDevices) 646 | 647 | vector>> incomingEdges; // v -> vector of {u,c(u,v)} 648 | vector>> outgoingEdges; // v -> vector of {w,c(v,w)} 649 | vector node; // node[v] = pointer to node with (new) id v 650 | 651 | // ideals, represented as indicator vectors 652 | unordered_map,int> idealToId; // maps ideal to its ID 653 | vector> ideals; // maps ID to ideal 654 | vector idealsSortedBySize; // IDs of ideals, sorted by size 655 | 656 | // NOTE: throughout the code, we are actually dealing with DOWNSETS, not ideals. should Ctrl+Replace 657 | 658 | // pairs of ideals (that induce contiguous sets) 659 | vector> immediateSubIdeals; 660 | // immediateSubIdeals[id] = IDs of ideals that are immediate subsets of the ideal with ID id 661 | vector> subIdeals; 662 | // subideals[id] = IDs of ideals that are subsets of the ideal with ID id 663 | // (this takes O(numberOfIdealPairs) space; could be done on the fly in the DP, 664 | // but if one can't afford this memory then probably one can't afford the DP alg timewise) 665 | long long numberOfIdealPairs; 666 | 667 | Graph (const Instance &_ins); 668 | void generateIdeals (); 669 | void growIdeal (const vector &ideal, int myId); 670 | void prepareSubIdeals (); 671 | 672 | vector getContiguousSet (int id, int subId) const; 673 | vector>> getAllTPSsForIdealPair (int id, int subId); 674 | 675 | double getDataParallelResyncCost (double parameterSize, int d) const; 676 | 677 | // reconstruction stuff: 678 | ResultStage getTPSWitnessFor (int id, int subId, int t, int d, int s, double targetTPS); 679 | // some "global" variables used in the reconstruction phase 680 | bool reconstructionModeForGetAllTPSsForIdealPair; 681 | bool reconstructionModeForSolveKnapsack; 682 | int reconstructionT, reconstructionD, reconstructionS, reconstructionY; 683 | double reconstructionTPS; 684 | vector reconstructionTMPCindices; // output of solveKnapsack in reconstruction phase 685 | ResultStage reconstructionRS; // output of getTPSWitnessFor in reconstruction phase 686 | 687 | vector solveKnapsack (const vector>& TMPCTPS, 688 | const vector>& TMPCMemoryUsageA, 689 | const vector>& TMPCMemoryUsageB, 690 | int maxY); 691 | 692 | vector>> dp; // dp table 693 | int idOfFullSet; 694 | Result runDP(); 695 | 696 | void renumberResultBack (Result &r) const; 697 | 698 | double getTPSForResult (const Result &r) const; 699 | }; 700 | 701 | 702 | Graph::Graph (const Instance &_ins) : 703 | ins(_ins), 704 | boundOnS(min(_ins.mbsInBatch,_ins.maxDevices)), 705 | numberOfIdealPairs(0), 706 | reconstructionModeForGetAllTPSsForIdealPair(false), 707 | reconstructionModeForSolveKnapsack(false) { 708 | 709 | // precompute incomingEdges and outgoingEdges 710 | incomingEdges.resize(ins.nodes.size()); 711 | outgoingEdges.resize(ins.nodes.size()); 712 | for (const Edge &e : ins.edges) { 713 | incomingEdges[e.destId].emplace_back(e.sourceId, e.communicationCost); 714 | outgoingEdges[e.sourceId].emplace_back(e.destId, e.communicationCost); 715 | } 716 | // precompute node 717 | node.resize(ins.nodes.size()); 718 | for (const Node &n : ins.nodes) { 719 | node[n.id] = &n; 720 | } 721 | 722 | generateIdeals(); 723 | // immediateSubIdeals is prepared. 
now prepare subIdeals 724 | prepareSubIdeals(); 725 | } 726 | 727 | 728 | void Graph::generateIdeals () { 729 | if (!ideals.empty()) { 730 | fail("generating ideals twice?"); 731 | } 732 | 733 | // start with empty set 734 | const vector emptySet(ins.nodes.size(), false); 735 | idealToId[emptySet] = 0; 736 | ideals.push_back(emptySet); 737 | immediateSubIdeals.emplace_back(); 738 | growIdeal(emptySet, 0); 739 | 740 | dbg << ideals.size() << " ideals" << endl; 741 | if (ideals.size() > IDEALS_LIMIT) { 742 | fail("too many ideals (current limit set at " + to_string(IDEALS_LIMIT) + "); this isn't going to work..."); 743 | } 744 | 745 | // prepare idealsSortedBySize 746 | vector> sorter; // {} 747 | for (int i = 0; i < ideals.size(); ++i) { 748 | sorter.emplace_back(count(ideals[i].begin(), ideals[i].end(), true), i); 749 | } 750 | sort(sorter.begin(), sorter.end()); 751 | for (const auto &it : sorter) { 752 | idealsSortedBySize.push_back(it.second); 753 | } 754 | assert(idealsSortedBySize[0] == 0); 755 | } 756 | 757 | 758 | void Graph::growIdeal (const vector &ideal, int myId) { 759 | // myId == idealToId[ideal] 760 | // try to add every vertex 761 | for (int v = 0; v < ins.nodes.size(); ++v) { 762 | if (!ideal[v]) { 763 | // try ideal+v as a new ideal 764 | // check if valid: do all v's successors belong to ideal? 765 | bool valid = true; 766 | for (const pair& p : outgoingEdges[v]) { 767 | // v -> p.first 768 | if (!ideal[p.first]) { 769 | valid = false; 770 | break; 771 | } 772 | } 773 | if (valid) { 774 | vector newIdeal = ideal; 775 | newIdeal[v] = true; 776 | // check if newIdeal had already been generated 777 | if (!idealToId.count(newIdeal)) { 778 | int newId = ideals.size(); 779 | idealToId[newIdeal] = newId; 780 | ideals.push_back(newIdeal); 781 | if (ideals.size() >= IDEALS_EXPLORATION_LIMIT) { 782 | fail("already over " + to_string(IDEALS_EXPLORATION_LIMIT) + " ideals. 
this isn't going to work..."); 783 | } 784 | immediateSubIdeals.emplace_back(); 785 | growIdeal(newIdeal, newId); 786 | } 787 | immediateSubIdeals[idealToId[newIdeal]].push_back(myId); 788 | } 789 | } 790 | } 791 | } 792 | 793 | 794 | void Graph::prepareSubIdeals () { 795 | // subideals = transitive closure of immediateSubIdeals 796 | 797 | numberOfIdealPairs = 0; 798 | subIdeals.resize(ideals.size()); 799 | 800 | for (int id = 0; id < ideals.size(); ++id) { 801 | // we will generate subIdeals[id] using some BFS/DFS 802 | vector queue = {id}; 803 | unordered_set enqueuedIdeals = {id}; 804 | while (!queue.empty()) { 805 | int subId = queue.back(); 806 | queue.pop_back(); 807 | 808 | // now visiting subId 809 | if (subId != id) { 810 | subIdeals[id].push_back(subId); 811 | ++numberOfIdealPairs; 812 | } 813 | 814 | // expand further from subId 815 | for (int subSubId : immediateSubIdeals[subId]) { 816 | if (enqueuedIdeals.insert(subSubId).second == true) { 817 | // subSubId was not in enqueuedIdeals before 818 | queue.push_back(subSubId); 819 | } 820 | } 821 | } 822 | } 823 | 824 | dbg << numberOfIdealPairs << " ideal pairs" << endl; 825 | } 826 | 827 | 828 | // returns the difference ideals[id] \ ideals[subId] as vector 829 | vector Graph::getContiguousSet (int id, int subId) const { 830 | vector ideal = ideals[id], subIdeal = ideals[subId]; 831 | for (int v = 0; v < ins.nodes.size(); ++v) { 832 | if (subIdeal[v]) { 833 | ideal[v] = false; 834 | } 835 | } 836 | return ideal; 837 | } 838 | 839 | 840 | // returns the cost per sample (microbatch), in bytes 841 | // (so this should be still divided by the bandwidth) 842 | double Graph::getDataParallelResyncCost (double parameterSize, int d) const { 843 | return 4 * (d-1) * parameterSize / d / ins.mbsInBatch; 844 | // the factor 4 is following the implementation of the PipeDream planner 845 | // (modeling a distributed parameter server implementation of AllReduce) 846 | } 847 | 848 | 849 | Result Graph::runDP () { 850 | // initialize DP table: dp[ideal][k][s] 851 | // (partition ideal over <= k machines, sum-of-dp-degrees <= s) 852 | // note: both are AT MOST rather than EQUAL as in the paper 853 | dp.assign(ideals.size(), vector>( 854 | ins.maxDevices+1, vector( 855 | boundOnS+1, INFTY))); 856 | 857 | // case of the empty set (ideal with ID 0) 858 | // we initialize all dp[0][*][*] = 0 so that it is monotone wrt k and s 859 | for (int k = 0; k <= ins.maxDevices; ++k) { 860 | for (int s = 0; s <= boundOnS; ++s) { 861 | dp[0][k][s] = 0; 862 | } 863 | } 864 | 865 | // dpDrops[id][k] will be the list of s such that dp[id][k][s] < dp[id][k][s-1] 866 | // (used for FASTER_DP_IMPLEMENTATION) 867 | vector>> dpDrops(ideals.size(), vector>( 868 | ins.maxDevices+1, vector(1, 0))); // dpDrops[id][k] = {0} 869 | 870 | // profiling stuff 871 | double timeSpentInGetAllTPSsForIdealPair = 0.0, timeSpentInDPLoop = 0.0; 872 | 873 | // here we go! 874 | dbg << "running DP..." << endl; 875 | for (int id : idealsSortedBySize) { 876 | if (id == 0) continue; 877 | // we want to fill dp[id][*][*] (already initialized to INFTY). 
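    // the update performed below is, in effect:
    //   dp[id][k][s] = min over (subId, t, d) of max( dp[subId][k - t*d][s - d], TPS[t][d][s] ),
    // i.e. the bottleneck TPS is the slower of the newly formed stage id\subId
    // (run with t-way tensor and d-way data parallelism on t*d of the k devices)
    // and the best partition of the remaining subideal on the remaining k - t*d devices.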
878 | // we will loop over every subideal subId (their list is already 879 | // precomputed in subIdeals[id] for convenience) and account for its 880 | // contributions to dp[id][*][*] 881 | for (int subId : subIdeals[id]) { 882 | 883 | clock_t startTime = clock(); 884 | const vector>> TPS = getAllTPSsForIdealPair(id, subId); 885 | timeSpentInGetAllTPSsForIdealPair += (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 886 | // TPS[t][d][s] = min TPS when t-tensor and d-data-parallel partitioning 887 | // id\subId (across t*d devices), with sum-of-dp-degrees <= s 888 | 889 | startTime = clock(); 890 | for (int t = 1; t <= ins.maxDevices; ++t) if (!TPS[t].empty()) { 891 | for (int d = 1; d*t <= ins.maxDevices; ++d) if (ins.mbsInBatch % d == 0) { 892 | for (int k = d*t; k <= ins.maxDevices; ++k) { 893 | // id/subId on t*d devices, subId on k-t*d devices 894 | 895 | // now we need to perform the updates of dp[id][k][s] for all s 896 | 897 | if (!FASTER_DP_IMPLEMENTATION) { 898 | 899 | // straightforward implementation 900 | 901 | #pragma GCC unroll 16 902 | for (int s = d; s <= boundOnS; ++s) { 903 | minify(dp[id][k][s], max(dp[subId][k-t*d][s-d], TPS[t][d][s])); 904 | } 905 | 906 | } else { // FASTER_DP_IMPLEMENTATION 907 | 908 | // we do not update dp[id][k][s] for all s here, but only for 909 | // a selected subset that is sufficient 910 | // (namely, those s where dp[subId][k-t*d][s-d] has decreased (from [s-1-d])) 911 | // then, these updates will be propagated right after the loop over subId 912 | // in order to ensure monotonicity 913 | 914 | // dp[subId][k-t*d][s-d] is non-increasing wrt s 915 | // TPS[t][d][s] is non-decreasing wrt s 916 | // 1. find (using binary search) max index u >= 0 such that: 917 | // - f = dpDrops[subId][k-t*d][u] exists 918 | // - f+d <= boundOnS 919 | // - dp[subId][k-t*d][f] > TPS[t][d][f+d] 920 | // 2. do minify from 0 to u incl. (use just dp, not TPS) 921 | // 3. 
also do minify for u+1 if it is valid (two first points above) (use both dp and TPS here) 922 | int u = -1, uf = 0, ut = int(dpDrops[subId][k-t*d].size()) - 1; 923 | while (uf <= ut) { 924 | int mid = (uf + ut) / 2; 925 | int f = dpDrops[subId][k-t*d][mid]; 926 | if (f+d <= boundOnS && dp[subId][k-t*d][f] > TPS[t][d][f+d]) { 927 | u = mid; 928 | uf = mid+1; 929 | } else { 930 | ut = mid-1; 931 | } 932 | } 933 | #pragma GCC unroll 16 934 | for (int v = 0; v <= u; ++v) { 935 | const int f = dpDrops[subId][k-t*d][v]; 936 | minify(dp[id][k][f+d], dp[subId][k-t*d][f]); 937 | } 938 | if (u+1 < dpDrops[subId][k-t*d].size()) { 939 | int f = dpDrops[subId][k-t*d][u+1]; 940 | if (f+d <= boundOnS) { 941 | assert(dp[subId][k-t*d][f] <= TPS[t][d][f+d]); 942 | minify(dp[id][k][f+d], TPS[t][d][f+d]); 943 | } 944 | } 945 | 946 | } 947 | } 948 | } 949 | } 950 | timeSpentInDPLoop += (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 951 | } 952 | 953 | // ensure monotonicity 954 | // (only really needed if FASTER_DP_IMPLEMENTATION) 955 | for (int k = 0; k <= ins.maxDevices; ++k) { 956 | //assert(dpDrops[id][k] == vector(1,0)); 957 | for (int s = 0; s <= boundOnS; ++s) { 958 | if (k > 0) { 959 | minify(dp[id][k][s], dp[id][k-1][s]); 960 | } 961 | if (s > 0) { 962 | minify(dp[id][k][s], dp[id][k][s-1]); 963 | if (dp[id][k][s] + 1e-9 < dp[id][k][s-1]) { 964 | // significant drop 965 | dpDrops[id][k].push_back(s); 966 | } 967 | } 968 | } 969 | } 970 | } 971 | 972 | DBG(timeSpentInGetAllTPSsForIdealPair) 973 | DBG(timeSpentInDPLoop) 974 | 975 | // the solution (final TPS) is now known: it's max_k max_s dp[all nodes][k][s] 976 | 977 | // ID of ideal that contains all nodes 978 | idOfFullSet = idealToId.at(vector(ins.nodes.size(), true)); 979 | 980 | double finalTPS = INFTY; 981 | int devicesUsed = -1, sUsed = -1; 982 | for (int k = 0; k <= ins.maxDevices; ++k) { 983 | for (int s = 1; s <= boundOnS; ++s) { 984 | if (dp[idOfFullSet][k][s] + 1e-9 < finalTPS) { 985 | finalTPS = dp[idOfFullSet][k][s]; 986 | devicesUsed = k; 987 | sUsed = s; 988 | } 989 | } 990 | } 991 | if (finalTPS > INFTY/2) { 992 | // cannot partition the graph feasibly (in terms of memory usage) 993 | return Result(); // empty result 994 | } 995 | dbg << "max load = " << finalTPS << " using " << devicesUsed 996 | << " out of " << ins.maxDevices << " devices, and using sum-of-dp-degrees " 997 | << sUsed << " (batch size = " 998 | << ins.mbsInBatch << ")" << endl; 999 | // note: the reported number of devices and batch size might possibly be overshot 1000 | // (since we initialized all dp[0][*][*] = 0 at the beginning, 1001 | // and also since, if FASTER_DP_IMPLEMENTATION, 1002 | // we do minify(dp[id][k][s], dp[id][k-1][s]) and minify(dp[id][k][s], dp[id][k][s-1])) 1003 | // however, the strict inequality in the comparison above should prevent this, 1004 | // as this way we take the minimal (k,s) that attains this TPS 1005 | 1006 | // for debug/experiments only 1007 | vector transformerIds = ins.getTransformerIds(); 1008 | 1009 | // now we reconstruct the solution 1010 | Result result; 1011 | int curId = idOfFullSet, curK = devicesUsed, curS = sUsed; 1012 | while (curId != 0) { // curId is not empty set 1013 | assert(curK > 0); 1014 | assert(curS > 0); 1015 | // how does dp[curId][curK][curS] arise? 
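        // it must equal max(dp[subId][curK - t*d][curS - d], TPS[t][d][curS]) for some subideal subId
        // and degrees (t, d); the loop below searches for such a triple, records the corresponding stage,
        // and then continues the reconstruction from (subId, curK - t*d, curS - d)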
1016 | bool found = false; 1017 | for (int subId : subIdeals[curId]) { 1018 | const vector>> TPS = getAllTPSsForIdealPair(curId, subId); 1019 | // possible optimization: could only ask for s = curS 1020 | 1021 | for (int t = 1; t <= curK; ++t) if (!TPS[t].empty()) { 1022 | for (int d = 1; t*d <= curK && d <= curS; ++d) { 1023 | if (ins.mbsInBatch % d != 0) continue; // not really necessary 1024 | // curId\subId on t*d devices, subId on curK-t*d devices 1025 | if (1e-9 > abs(dp[curId][curK][curS] - max(dp[subId][curK-t*d][curS-d], TPS[t][d][curS]))) { 1026 | // found the next stage 1027 | found = true; 1028 | 1029 | ResultStage rs = getTPSWitnessFor(curId, subId, t, d, curS, TPS[t][d][curS]); 1030 | result.stages.push_back(rs); 1031 | 1032 | assert(rs.dataParallelDegree == d); 1033 | assert(rs.tensorParallelDegree == t); 1034 | 1035 | dbg << "formed a stage with nodes [" << rs.nodes << "] using d=" 1036 | << rs.dataParallelDegree << " and t=" << rs.tensorParallelDegree 1037 | << " yielding TPS = " << TPS[t][d][curS] << endl; 1038 | 1039 | int countTransformers = 0, countTransformersWithAR = 0; 1040 | for (const pair &p : rs.TMPCids) { 1041 | if (count(transformerIds.begin(), transformerIds.end(), p.first)) { 1042 | ++countTransformers; 1043 | if (p.second == "activation recomp") { 1044 | ++countTransformersWithAR; 1045 | } 1046 | } 1047 | } 1048 | dbg << "transformer layers (with act. recomp. / total) = " 1049 | << countTransformersWithAR << "/" 1050 | << countTransformers << endl << endl; 1051 | 1052 | curS = curS - d; 1053 | curK = curK - t*d; 1054 | curId = subId; 1055 | 1056 | break; 1057 | } 1058 | } 1059 | if (found) break; 1060 | } 1061 | if (found) break; 1062 | } 1063 | if (!found) { 1064 | fail("didn't find any reconstruction step to make?"); 1065 | } 1066 | } 1067 | if (curK > 0 || curS > 0) { 1068 | fail("k or s didn't fall to 0 by the end of reconstruction?"); 1069 | } 1070 | 1071 | const double verificationTPS = getTPSForResult(result); 1072 | if (abs(finalTPS - verificationTPS) > 1e-5) { 1073 | DBG(finalTPS) 1074 | DBG(verificationTPS) 1075 | fail("verification TPS is different"); 1076 | } 1077 | 1078 | // note: the result is in terms of the new numbers; if wanting to print it etc., 1079 | // then at the end, translate result from new numbers to old by calling: 1080 | // renumberResultBack(result); 1081 | 1082 | return result; 1083 | } 1084 | 1085 | 1086 | // TPS[t][d][s] = min TPS when t-tensor and d-data-parallel partitioning 1087 | // id\subId (across t*d devices), with sum-dp-degrees <= s 1088 | // TPS[t] can be empty for some t! 
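// (for each node of id\subId and each of its TMPCs for degree t, the TPS contribution is
//  (cross-boundary edge communication) / d / bandwidth + data-parallel resync / bandwidth + timePerSample / d;
//  solveKnapsack() then picks one TMPC per node minimizing the total, subject to
//  sum of (memoryUsageA * y + memoryUsageB) <= maxMemoryPerDevice, where y = ceil(s/d))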
1089 | // (note: this should be deterministic as it will be rerun during reconstruction) 1090 | vector>> Graph::getAllTPSsForIdealPair (int id, int subId) { 1091 | const vector subgraphVB = getContiguousSet(id, subId); 1092 | const vector subgraph = vectorOfSetBits(subgraphVB); 1093 | 1094 | vector>> result(ins.maxDevices+1); 1095 | 1096 | // t = degree of tensor parallelism 1097 | for (int t = 1; t <= ins.maxDevices; ++t) { 1098 | if (t > 1 && !TENSOR_PARALLELISM_ALLOWED) { 1099 | break; 1100 | } 1101 | 1102 | bool someNodeHasNoTMPCsForT = false; 1103 | for (int v : subgraph) { 1104 | if (!node[v]->TMPCs.count(t) || node[v]->TMPCs.at(t).empty()) { 1105 | someNodeHasNoTMPCsForT = true; 1106 | break; 1107 | } 1108 | } 1109 | if (someNodeHasNoTMPCsForT) { 1110 | // tensor parallelism of degree t is not supported 1111 | // for some node of this subgraph 1112 | continue; 1113 | } 1114 | 1115 | // initialize the result vector 1116 | result[t].assign(ins.maxDevices/t + 1, vector(boundOnS+1, INFTY)); 1117 | 1118 | vector*> TMPCs; // TMPCs[i] = TMPCs for node subgraph[i] on t devices 1119 | vector> TMPCEdgeCommCosts; 1120 | // TMPCEdgeCommCosts[i][l] = sum of all edge-associated communication costs 1121 | // for the l-th TMPC of node subgraph[i] (on t devices) 1122 | 1123 | // prepare TMPCs and TMPCEdgeCommCosts 1124 | for (int v : subgraph) { 1125 | TMPCs.push_back(&(node[v]->TMPCs.at(t))); 1126 | TMPCEdgeCommCosts.emplace_back(); 1127 | for (const TMPC &tmpc : *(TMPCs.back())) { 1128 | double tmpcEdgeCommCost = 0.0; 1129 | // need to take into account c(u,v) and sfw(u,v) 1130 | // (first is a property of the edge, second is a property of the TMPC) 1131 | // for all edges (u,v) incoming into v from outside S 1132 | for (const pair &p : incomingEdges[v]) { 1133 | // edge p.first -> v, of cost p.second 1134 | if (!subgraphVB[p.first]) { 1135 | tmpcEdgeCommCost += 2 * (p.second + tmpc.syncTimeFw.at(p.first)); 1136 | } 1137 | } 1138 | 1139 | // instead of "tmpc.syncTimeFw.at(p.first)" we could do this: 1140 | // for (const pair &p : tmpc.syncTimeFw) { 1141 | // if (!subgraphVB[p.first]) { 1142 | // tmpcEdgeCommCost += p.second; 1143 | // } 1144 | // } 1145 | // and then we could even turn syncTimeFw/Bw into vectors 1146 | 1147 | // and similarly for outgoing edges (c and sbw) 1148 | for (const pair &p : outgoingEdges[v]) { 1149 | // edge v -> p.first, of cost p.second 1150 | if (!subgraphVB[p.first]) { 1151 | tmpcEdgeCommCost += 2 * (p.second + tmpc.syncTimeBw.at(p.first)); 1152 | } 1153 | } 1154 | TMPCEdgeCommCosts.back().push_back(tmpcEdgeCommCost); 1155 | } 1156 | } 1157 | // TMPCs and TMPCEdgeCommCosts prepared 1158 | 1159 | // d = degree of data parallelism 1160 | for (int d = 1; d*t <= ins.maxDevices; ++d) if (ins.mbsInBatch % d == 0) { 1161 | 1162 | if (d > 1 && !DATA_PARALLELISM_ALLOWED) { 1163 | break; 1164 | } 1165 | 1166 | // at this point, we have multiple TMPCs for each node 1167 | // and we have to select one TMPC per node 1168 | // (for each s; can be different TMPC-sets for different s; 1169 | // but only memory usage depends on s) 1170 | 1171 | vector> TMPCTotalTPSContribution, TMPCMemoryUsageA, TMPCMemoryUsageB; 1172 | // TMPCTotalTPSContribution[i][l] = sum of all contributions to TPS (compute, comm) 1173 | // for the l-th TMPC of node subgraph[i] (on t-tensor and d-data parallelism) 1174 | // similarly TMPCMemoryUsageA, B 1175 | // prepare these: 1176 | for (int i = 0; i < subgraph.size(); ++i) { 1177 | const int v = subgraph[i]; 1178 | TMPCTotalTPSContribution.emplace_back(); 
1179 | TMPCMemoryUsageA.emplace_back(); 1180 | TMPCMemoryUsageB.emplace_back(); 1181 | for (int l = 0; l < TMPCs[i]->size(); ++l) { 1182 | const TMPC &tmpc = (*TMPCs[i])[l]; 1183 | TMPCMemoryUsageA.back().push_back(tmpc.memoryUsageA); 1184 | TMPCMemoryUsageB.back().push_back(tmpc.memoryUsageB); 1185 | // add up all the contributions of node v to the TPS: 1186 | const double communicationInBytes = 1187 | // 1. edge-related communication 1188 | TMPCEdgeCommCosts[i][l] / d 1189 | + 1190 | // 2. data parallelism resync communication 1191 | getDataParallelResyncCost(tmpc.parameterSize, d); 1192 | // 3. compute 1193 | const double compute = tmpc.timePerSample / d; 1194 | const double totalTPSContribution = communicationInBytes / ins.bandwidth + compute; 1195 | TMPCTotalTPSContribution.back().push_back(totalTPSContribution); 1196 | 1197 | if ((!ACTIVATION_RECOMPUTATION_ALLOWED && tmpc.id == "activation recomp") 1198 | || (ACTIVATION_RECOMPUTATION_FORCED && tmpc.id == "vanilla")) { 1199 | // only used for our specific experiments 1200 | // make this TMPC super unattractive 1201 | TMPCMemoryUsageB.back().back() = 1e60; 1202 | TMPCTotalTPSContribution.back().back() = 1e28; 1203 | } 1204 | } 1205 | } 1206 | 1207 | // we have now built a "knapsack" instance (well, one for each y = ceil(s/d)) 1208 | vector knapsackResults = solveKnapsack(TMPCTotalTPSContribution, 1209 | TMPCMemoryUsageA, 1210 | TMPCMemoryUsageB, 1211 | boundOnS / d + 1); 1212 | // knapsackResults[y] = best TPS (over all choices of TMPC-per-node) 1213 | // when t-tensor- and d-data-parallel partitioning 1214 | // id\subId (across t*d devices), with sum-of-dp-degrees <= d*y 1215 | for (int s = 0; s <= boundOnS; ++s) { 1216 | result[t][d][s] = knapsackResults[ceildiv(s, d)]; 1217 | } 1218 | 1219 | // stuff for the reconstruction phase 1220 | if (reconstructionModeForGetAllTPSsForIdealPair) { 1221 | reconstructionY = ceildiv(reconstructionS, reconstructionD); 1222 | if (t == reconstructionT && d == reconstructionD) { 1223 | // if reconstructionTPS == knapsackResults[reconstructionY]: 1224 | if (1e-9 > abs(reconstructionTPS - knapsackResults[reconstructionY])) { 1225 | // got it 1226 | // now we need to build our "output", which is reconstructionRS 1227 | // (make sure to clear or replace each field, since reconstructionRS can be dirty) 1228 | 1229 | reconstructionRS.nodes = subgraph; 1230 | reconstructionRS.dataParallelDegree = d; 1231 | reconstructionRS.tensorParallelDegree = t; 1232 | 1233 | // rerun knapsackResults in reconstruction mode 1234 | reconstructionModeForSolveKnapsack = true; 1235 | // this will fill reconstructionTMPCindices: 1236 | solveKnapsack(TMPCTotalTPSContribution, 1237 | TMPCMemoryUsageA, 1238 | TMPCMemoryUsageB, 1239 | boundOnS / d + 1); 1240 | // now we need to translate this to reconstructionRS.TMPCids 1241 | reconstructionRS.TMPCids.clear(); 1242 | for (int i = 0; i < subgraph.size(); ++i) { 1243 | const int v = subgraph[i]; 1244 | reconstructionRS.TMPCids[v] = (*TMPCs[i])[reconstructionTMPCindices[i]].id; 1245 | } 1246 | 1247 | reconstructionModeForSolveKnapsack = false; 1248 | // might as well return now 1249 | reconstructionModeForGetAllTPSsForIdealPair = false; 1250 | } 1251 | } 1252 | } 1253 | } 1254 | } 1255 | 1256 | return result; 1257 | } 1258 | 1259 | 1260 | // this function aims to solve the following knapsack problem variant for each y=1,...,maxY. 1261 | // we have n = TMPCTPS.size() nodes. 1262 | // for each node, we have some number of TMPCs; each has TMPCTPS, TMPCMemoryUsageA, TMPCMemoryUsageB. 
1263 | // select one TMPC per node so that the sum of TMPCMemoryUsageA*y + TMPCMemoryUsageB <= ins.maxMemoryPerDevice 1264 | // and so that the sum of TMPCTPS is minimized. 1265 | // return that minimum. 1266 | // (note: this should be deterministic as it will be rerun during reconstruction) 1267 | vector Graph::solveKnapsack (const vector>& TMPCTPS, 1268 | const vector>& TMPCMemoryUsageA, 1269 | const vector>& TMPCMemoryUsageB, 1270 | int maxY) { 1271 | const int n = TMPCTPS.size(); 1272 | vector result(maxY+1, INFTY); 1273 | 1274 | // we will just do things independently for each y. 1275 | // perhaps we can do something smarter in the future 1276 | 1277 | vector fastestIndex(n, 0); 1278 | for (int i = 0; i < n; ++i) { 1279 | for (int l = 1; l < TMPCTPS[i].size(); ++l) { 1280 | if (TMPCTPS[i][fastestIndex[i]] > TMPCTPS[i][l]) { 1281 | fastestIndex[i] = l; 1282 | } 1283 | } 1284 | } 1285 | 1286 | vector> TMPCMemoryUsage = TMPCMemoryUsageB; // just needed to give it the right shape 1287 | 1288 | for (int y = 1; y <= maxY; ++y) { 1289 | // update TMPCMemoryUsage 1290 | for (int i = 0; i < n; ++i) { 1291 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1292 | TMPCMemoryUsage[i][l] = TMPCMemoryUsageA[i][l] * y + TMPCMemoryUsageB[i][l]; 1293 | } 1294 | } 1295 | 1296 | // check whether choosing the fastest TMPC everywhere might give a memory-feasible result 1297 | // (then we'd just return it) 1298 | double memoryOfFastestSolution = 0.0; 1299 | for (int i = 0; i < n; ++i) { 1300 | memoryOfFastestSolution += TMPCMemoryUsage[i][fastestIndex[i]]; 1301 | } 1302 | if (memoryOfFastestSolution <= ins.maxMemoryPerDevice) { 1303 | // cool! just use that 1304 | result[y] = 0.0; 1305 | for (int i = 0; i < n; ++i) { 1306 | result[y] += TMPCTPS[i][fastestIndex[i]]; 1307 | } 1308 | 1309 | // reconstruction stuff (unfortunately the code repeats) 1310 | if (reconstructionModeForSolveKnapsack) { 1311 | if (reconstructionY == y) { 1312 | // fill reconstructionTMPCindices 1313 | reconstructionTMPCindices = fastestIndex; 1314 | dbg << "we picked the all-fastest knapsack solution, mem usage = " 1315 | << memoryOfFastestSolution << endl; 1316 | } 1317 | } 1318 | 1319 | continue; 1320 | } 1321 | 1322 | // or if choosing the lowest-memory TMPC everywhere is not even memory-feasible 1323 | // (then there is no feasible solution) 1324 | vector lowestMemoryIndex(n, 0); 1325 | for (int i = 0; i < n; ++i) { 1326 | for (int l = 1; l < TMPCTPS[i].size(); ++l) { 1327 | if (TMPCMemoryUsage[i][lowestMemoryIndex[i]] > TMPCMemoryUsage[i][l]) { 1328 | lowestMemoryIndex[i] = l; 1329 | } 1330 | } 1331 | } 1332 | double memoryOfLowestMemorySolution = 0.0; 1333 | for (int i = 0; i < n; ++i) { 1334 | memoryOfLowestMemorySolution += TMPCMemoryUsage[i][lowestMemoryIndex[i]]; 1335 | } 1336 | if (memoryOfLowestMemorySolution > ins.maxMemoryPerDevice) { 1337 | // no feasible solution. and there won't be one for larger y, either. so break 1338 | break; 1339 | } 1340 | 1341 | // the instance is non-trivial. now we run a heuristic 1342 | vector H = fastestIndex; // H[i] = index of the currently chosen TMPC for node i 1343 | // we start from the all-fastest solution 1344 | double memoryToShaveOff = memoryOfFastestSolution - ins.maxMemoryPerDevice; 1345 | // we need to reduce the memory usage by this much 1346 | 1347 | if (ACTIVATION_RECOMPUTATION_ALL_LAYERS_OR_NONE) { 1348 | // special thing for our experiments 1349 | 1350 | // at this point we know that the all-vanilla solution is not feasible 1351 | // but the all-AC one is. 
so just return the latter 1352 | H = lowestMemoryIndex; 1353 | memoryToShaveOff = memoryOfLowestMemorySolution - ins.maxMemoryPerDevice; 1354 | } else { 1355 | // normal execution 1356 | 1357 | if (!KNAPSACK_FAST_HEURISTIC) { 1358 | 1359 | while (memoryToShaveOff > 1e-9) { // # iterations <= total number of TMPCs 1360 | // we do so by repeatedly picking the best-bang-for-buck swap (change of some H[i]) 1361 | double leastBuckPerBang = INFTY; 1362 | int bestI = -1, bestL = -1; 1363 | // check all possible swaps 1364 | for (int i = 0; i < n; ++i) { 1365 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1366 | // considering the change "H[i] := l" 1367 | const double memorySavings = TMPCMemoryUsage[i][H[i]] - TMPCMemoryUsage[i][l]; 1368 | if (memorySavings > 1e-9) { 1369 | if (TMPCTPS[i][l] < TMPCTPS[i][H[i]]) { 1370 | fail("we are gaining both on memory and TPS?!"); 1371 | } 1372 | // important (somewhat): here we truncate large gains to what we actually need 1373 | const double bang = min(memorySavings, memoryToShaveOff); 1374 | const double buckPerBang = (TMPCTPS[i][l] - TMPCTPS[i][H[i]]) / bang; 1375 | if (leastBuckPerBang > buckPerBang) { 1376 | leastBuckPerBang = buckPerBang; 1377 | bestI = i; 1378 | bestL = l; 1379 | } 1380 | } 1381 | } 1382 | } 1383 | if (bestI == -1) { 1384 | fail("no good change was found?!"); 1385 | } 1386 | // now apply the best found change 1387 | const double memorySavings = TMPCMemoryUsage[bestI][H[bestI]] - TMPCMemoryUsage[bestI][bestL]; 1388 | memoryToShaveOff -= memorySavings; 1389 | H[bestI] = bestL; 1390 | } 1391 | 1392 | } else { // KNAPSACK_FAST_HEURISTIC 1393 | 1394 | // so far, for simplicity and speed, we have here 1395 | // a version that only works for <= 2 TMPCs per (v,t) 1396 | // (this is verified in checkInputCorrectness()) 1397 | 1398 | vector> sorter; 1399 | for (int i = 0; i < n; ++i) { 1400 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1401 | // considering the change "H[i] := l" 1402 | const double memorySavings = TMPCMemoryUsage[i][H[i]] - TMPCMemoryUsage[i][l]; 1403 | if (memorySavings > 1e-9) { 1404 | if (TMPCTPS[i][l] < TMPCTPS[i][H[i]]) { 1405 | fail("we are gaining both on memory and TPS?!"); 1406 | } 1407 | // here we don't truncate and might overshoot 1408 | // (see the similar point in the slower heuristic) 1409 | const double buckPerBang = (TMPCTPS[i][l] - TMPCTPS[i][H[i]]) / memorySavings; 1410 | sorter.emplace_back(buckPerBang, i, l); 1411 | } 1412 | } 1413 | } 1414 | sort(sorter.begin(), sorter.end()); 1415 | for (const auto &it : sorter) { 1416 | const int bestI = get<1>(it), bestL = get<2>(it); 1417 | // apply this change 1418 | const double memorySavings = TMPCMemoryUsage[bestI][H[bestI]] - TMPCMemoryUsage[bestI][bestL]; 1419 | memoryToShaveOff -= memorySavings; 1420 | H[bestI] = bestL; 1421 | if (memoryToShaveOff < 1e-9) { 1422 | break; // done 1423 | } 1424 | } 1425 | 1426 | } 1427 | } 1428 | // done - we have a feasible solution H 1429 | result[y] = 0.0; 1430 | for (int i = 0; i < n; ++i) { 1431 | result[y] += TMPCTPS[i][H[i]]; 1432 | } 1433 | 1434 | // reconstruction stuff 1435 | if (reconstructionModeForSolveKnapsack) { 1436 | if (reconstructionY == y) { 1437 | // fill reconstructionTMPCindices 1438 | reconstructionTMPCindices = H; 1439 | dbg << "nontrivial knapsack solution, mem usage = " << ins.maxMemoryPerDevice + memoryToShaveOff << endl; 1440 | } 1441 | } 1442 | } 1443 | 1444 | // debug stuff for verification that our knapsack heuristic 1445 | // almost always finds the optimal solution (see Appendix) 1446 | if 
(OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION) { 1447 | static long long executionCount = 0; 1448 | static ofstream knapsackFile("knapsacks.txt"); 1449 | ++executionCount; 1450 | constexpr int y = 2; 1451 | assert(y <= maxY); 1452 | if (rand() % 3 == 0) { 1453 | dbg << "executionCount = " << executionCount << endl; 1454 | knapsackFile << setprecision(15) << fixed << ins.maxMemoryPerDevice << endl; 1455 | knapsackFile << TMPCTPS.size() << endl; 1456 | for (int i = 0; i < TMPCTPS.size(); ++i) { 1457 | knapsackFile << TMPCTPS[i].size() << endl; 1458 | for (int l = 0; l < TMPCTPS[i].size(); ++l) { 1459 | const double memUsageFor5 = y * TMPCMemoryUsageA[i][l] + TMPCMemoryUsageB[i][l]; 1460 | knapsackFile << setprecision(15) << fixed << TMPCTPS[i][l] << " " << memUsageFor5 << endl; 1461 | } 1462 | } 1463 | knapsackFile << setprecision(15) << fixed << result[y] << endl << endl; 1464 | } 1465 | } 1466 | 1467 | return result; 1468 | } 1469 | 1470 | 1471 | ResultStage Graph::getTPSWitnessFor (int id, int subId, int t, int d, int s, double targetTPS) { 1472 | reconstructionModeForGetAllTPSsForIdealPair = true; 1473 | reconstructionT = t; 1474 | reconstructionD = d; 1475 | reconstructionS = s; 1476 | reconstructionTPS = targetTPS; 1477 | getAllTPSsForIdealPair(id, subId); // this will fill reconstructionRS 1478 | reconstructionModeForGetAllTPSsForIdealPair = false; 1479 | return reconstructionRS; 1480 | } 1481 | 1482 | 1483 | void Graph::renumberResultBack (Result &r) const { 1484 | for (ResultStage &rs : r.stages) { 1485 | for (int &nodeId : rs.nodes) { 1486 | nodeId = ins.oldNumber[nodeId]; 1487 | } 1488 | unordered_map newTMPCids; 1489 | for (const pair &p : rs.TMPCids) { 1490 | newTMPCids[ins.oldNumber[p.first]] = p.second; 1491 | } 1492 | rs.TMPCids = move(newTMPCids); 1493 | } 1494 | } 1495 | 1496 | 1497 | double Graph::getTPSForResult (const Result &r) const { 1498 | // for sanity checks of returned solutions, 1499 | // and also to judge baselines 1500 | 1501 | if (r.stages.empty()) { 1502 | // infeasible/OOM/empty result 1503 | return INFTY; 1504 | } 1505 | 1506 | // first step: check that the solution is contiguous 1507 | // (and that there is some topological order in the contracted graph) 1508 | // and that every node belongs to exactly one subgraph 1509 | // and that we don't use too many devices 1510 | vector stageOfNode(ins.nodes.size(), -1); 1511 | int devicesUsed = 0, sumOfDpDegrees = 0; 1512 | for (int i = 0; i < r.stages.size(); ++i) { 1513 | for (int v : r.stages[i].nodes) { 1514 | if (stageOfNode[v] != -1) { 1515 | fail("duplicate node"); 1516 | } 1517 | stageOfNode[v] = i; 1518 | } 1519 | if (r.stages[i].dataParallelDegree < 1 || r.stages[i].dataParallelDegree > ins.maxDevices) { 1520 | fail("wrong data-parallel degree"); 1521 | } 1522 | if (ins.mbsInBatch % r.stages[i].dataParallelDegree != 0) { 1523 | fail("data-parallel degree must divide the number of microbatches in a batch"); 1524 | } 1525 | if (r.stages[i].tensorParallelDegree < 1 || r.stages[i].tensorParallelDegree > ins.maxDevices) { 1526 | fail("wrong tensor-parallel degree"); 1527 | } 1528 | devicesUsed += r.stages[i].dataParallelDegree * r.stages[i].tensorParallelDegree; 1529 | sumOfDpDegrees += r.stages[i].dataParallelDegree; 1530 | } 1531 | for (const Edge &e : ins.edges) { 1532 | if (stageOfNode[e.sourceId] > stageOfNode[e.destId]) { 1533 | fail("problem with contiguity (or stages given in wrong order)"); 1534 | } 1535 | } 1536 | for (int v = 0; v < ins.nodes.size(); ++v) { 1537 | if (-1 == stageOfNode[v]) { 1538 | 
fail("node does not appear in any subgraph"); 1539 | } 1540 | } 1541 | if (sumOfDpDegrees > ins.mbsInBatch) { 1542 | fail("sum of data-parallel degrees too large"); 1543 | } 1544 | if (devicesUsed > ins.maxDevices) { 1545 | fail("too many devices used"); 1546 | } 1547 | 1548 | for (int v = 0; v < ins.nodes.size(); ++v) { 1549 | if (v != ins.nodes[v].id) { 1550 | fail("some issue with numbering"); 1551 | } 1552 | } 1553 | 1554 | // now we want to compute the TPS, 1555 | // and also verify it's not OOM 1556 | double finalTPS = 0.0; 1557 | int suffixSumOfDataParallelDegrees = sumOfDpDegrees; 1558 | for (int i = 0; i < r.stages.size(); ++i) { 1559 | const ResultStage &rs = r.stages[i]; 1560 | 1561 | // we want to compute TPS for this stage 1562 | 1563 | const int d = rs.dataParallelDegree, t = rs.tensorParallelDegree; 1564 | 1565 | const int y = ceildiv(suffixSumOfDataParallelDegrees, d); 1566 | suffixSumOfDataParallelDegrees -= d; // update for next iteration 1567 | 1568 | // get TMPCs (also check TMPCids) 1569 | unordered_map TMPCs; 1570 | for (const pair &p : rs.TMPCids) { 1571 | const int v = p.first; 1572 | if (!count(rs.nodes.begin(), rs.nodes.end(), v)) { 1573 | fail("node appears in TMPCids but not in nodes"); 1574 | } 1575 | if (v < 0 || v >= ins.nodes.size()) { 1576 | fail("wrong node id in TMPCids"); 1577 | } 1578 | if (TMPCs.count(v)) { 1579 | fail("duplicate node in TMPCids"); 1580 | } 1581 | bool found = false; 1582 | if (!ins.nodes[v].TMPCs.count(t)) { 1583 | fail("no TMPCs for that node and that t"); 1584 | } 1585 | for (const TMPC &tmpc : ins.nodes[v].TMPCs.at(t)) { 1586 | if (tmpc.id == p.second) { 1587 | assert(!found); 1588 | found = true; 1589 | TMPCs[v] = &tmpc; 1590 | } 1591 | } 1592 | if (!found) { 1593 | fail("no TMPC with that ID was found"); 1594 | } 1595 | } 1596 | if (rs.nodes.size() > rs.TMPCids.size()) { 1597 | fail("more nodes in nodes than in TMPCids"); 1598 | } 1599 | // okay, TMPCs is populated. 
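// In brief, the per-stage cost model evaluated below: per-device memory usage is the sum, over
// the stage's nodes, of memoryUsageA * y + memoryUsageB, where y is the suffix sum of
// data-parallel degrees (this stage's and all later stages') divided by d and rounded up;
// exceeding ins.maxMemoryPerDevice is treated as OOM (INFTY). The stage's time per sample is
//     stageTPS = (edgeCommunication / d + dataParallelCommunication) / bandwidth + compute / d,
// and the overall TPS is the maximum stageTPS over all stages. Purely illustrative arithmetic
// with made-up numbers in consistent units: compute = 40, edgeCommunication = 10,
// dataParallelCommunication = 2, d = 4, bandwidth = 1 gives
// stageTPS = (10/4 + 2)/1 + 40/4 = 4.5 + 10 = 14.5.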
now compute TPS and memory usage 1600 | 1601 | double compute = 0.0, edgeCommunicationInBytes = 0.0, dataParallelCommunicationInBytes = 0.0, 1602 | memoryUsage = 0.0; 1603 | 1604 | for (int v : rs.nodes) { 1605 | compute += TMPCs[v]->timePerSample; 1606 | for (const pair &p : incomingEdges[v]) { 1607 | if (!count(rs.nodes.begin(), rs.nodes.end(), p.first)) { 1608 | edgeCommunicationInBytes += 2 * (p.second + TMPCs[v]->syncTimeFw.at(p.first)); 1609 | } 1610 | } 1611 | for (const pair &p : outgoingEdges[v]) { 1612 | if (!count(rs.nodes.begin(), rs.nodes.end(), p.first)) { 1613 | edgeCommunicationInBytes += 2 * (p.second + TMPCs[v]->syncTimeBw.at(p.first)); 1614 | } 1615 | } 1616 | dataParallelCommunicationInBytes += getDataParallelResyncCost(TMPCs[v]->parameterSize, d); 1617 | memoryUsage += TMPCs[v]->memoryUsageA * y + TMPCs[v]->memoryUsageB; 1618 | } 1619 | 1620 | if (memoryUsage > ins.maxMemoryPerDevice) { 1621 | // given solution is OOM 1622 | return INFTY; 1623 | } 1624 | 1625 | const double communicationInBytes = edgeCommunicationInBytes / d + dataParallelCommunicationInBytes; 1626 | 1627 | if (DEBUG_DATA_PARALLEL_COSTS) { 1628 | const double dataParallelCost = dataParallelCommunicationInBytes / ins.bandwidth; 1629 | const double theRestxxxxxxxxx = (edgeCommunicationInBytes / ins.bandwidth + compute) / d; 1630 | cerr << setprecision(10) << fixed; 1631 | DBG(dataParallelCost); 1632 | DBG(theRestxxxxxxxxx); 1633 | } 1634 | 1635 | const double stageTPS = communicationInBytes / ins.bandwidth + compute / d; 1636 | 1637 | finalTPS = max(finalTPS, stageTPS); 1638 | } 1639 | return finalTPS; 1640 | } 1641 | 1642 | 1643 | Result runPipeDream2BWPlanner (const Instance &ins, 1644 | bool useTensorParallelism, 1645 | bool tryPuttingNonTransformerNodesSeparately) { 1646 | vector transformers = ins.getTransformerIds(); 1647 | vector initialNodes, finalNodes; // before and after transformers, respectively 1648 | 1649 | while (transformers.size() + initialNodes.size() + finalNodes.size() < ins.nodes.size()) { 1650 | for (const Edge &e : ins.edges) { 1651 | const int u = e.sourceId, v = e.destId; 1652 | const bool initialU = count(initialNodes.begin(), initialNodes.end(), u); 1653 | const bool initialV = count(initialNodes.begin(), initialNodes.end(), v); 1654 | const bool finalU = count(finalNodes.begin(), finalNodes.end(), u); 1655 | const bool finalV = count(finalNodes.begin(), finalNodes.end(), v); 1656 | const bool transformerU = count(transformers.begin(), transformers.end(), u); 1657 | const bool transformerV = count(transformers.begin(), transformers.end(), v); 1658 | 1659 | if (!initialU && !transformerU && !finalU) { 1660 | if (initialV || transformerV) { 1661 | initialNodes.push_back(u); 1662 | } 1663 | } 1664 | if (!initialV && !transformerV && !finalV) { 1665 | if (transformerU || finalU) { 1666 | finalNodes.push_back(v); 1667 | } 1668 | } 1669 | } 1670 | } 1671 | 1672 | Graph g(ins); // needed to run getTPSForResult() 1673 | 1674 | double bestTPS = INFTY; 1675 | Result bestResult; 1676 | 1677 | for (int putNonTransformerNodesSeparately = 0; 1678 | putNonTransformerNodesSeparately <= tryPuttingNonTransformerNodesSeparately; 1679 | ++putNonTransformerNodesSeparately) { 1680 | 1681 | for (int stages = 1 + 2 * putNonTransformerNodesSeparately; 1682 | stages <= ins.maxDevices && stages <= transformers.size() + 2 * putNonTransformerNodesSeparately; 1683 | stages ++) { 1684 | 1685 | Result result; 1686 | result.stages.resize(stages); 1687 | 1688 | const int transformerStages = stages - 2 * 
putNonTransformerNodesSeparately, 1689 | firstTransformerStage = putNonTransformerNodesSeparately, 1690 | lastTransformerStage = stages - 1 - putNonTransformerNodesSeparately; 1691 | assert(lastTransformerStage - firstTransformerStage + 1 == transformerStages); 1692 | 1693 | // divide transformers equally-ish among stages 1694 | const int transformersPerStage = transformers.size() / transformerStages, 1695 | largerStages = transformers.size() % transformerStages; 1696 | int nextTransformerIndex = 0; 1697 | for (int st = firstTransformerStage; st <= lastTransformerStage; ++st) { 1698 | // last `largerStages` stages have one additional transformer each 1699 | const bool thisIsALargerStage = (st >= lastTransformerStage + 1 - largerStages); 1700 | for (int i = 0; i < transformersPerStage + thisIsALargerStage; ++i) { 1701 | result.stages[st].nodes.push_back(transformers[nextTransformerIndex]); 1702 | nextTransformerIndex++; 1703 | } 1704 | } 1705 | assert(nextTransformerIndex == transformers.size()); 1706 | // the initial nodes go to first stage, the final nodes go to last stage 1707 | append(result.stages[0].nodes, initialNodes); 1708 | append(result.stages.back().nodes, finalNodes); 1709 | result.debugInfo = (putNonTransformerNodesSeparately ? "1" : "0"); 1710 | 1711 | for (int d = 1; stages*d <= min(ins.maxDevices, ins.mbsInBatch); d ++) { 1712 | if (ins.mbsInBatch % d == 0) { 1713 | for (int t = 1; stages*d*t <= ins.maxDevices; t ++) { 1714 | if (t > 1) { 1715 | if (!useTensorParallelism) { 1716 | break; 1717 | } 1718 | // should check if all nodes support t-degree tensor parallelism 1719 | // we'll just check one node since in our inputs they all support some t or not 1720 | if (!ins.nodes[0].TMPCs.count(t)) { 1721 | continue; 1722 | } 1723 | } 1724 | 1725 | for (ResultStage &rs : result.stages) { 1726 | rs.dataParallelDegree = d; 1727 | rs.tensorParallelDegree = t; 1728 | } 1729 | 1730 | // PipeDream-2BW uses activation recomputation everywhere or nowhere 1731 | for (bool activationRecomputationEverywhere : {false, true}) { 1732 | for (ResultStage &rs : result.stages) { 1733 | for (int v : rs.nodes) { 1734 | rs.TMPCids[v] = activationRecomputationEverywhere ? "activation recomp" : "vanilla"; 1735 | } 1736 | } 1737 | 1738 | // result is built. 
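// Note on the search space enumerated by the surrounding loops: for each stage count, the
// planner tries every data-parallel degree d with stages*d <= min(maxDevices, mbsInBatch) and
// mbsInBatch % d == 0, every tensor-parallel degree t with stages*d*t <= maxDevices (when
// tensor parallelism is enabled and subject to the TMPC support check), and both the "vanilla"
// and "activation recomp" TMPC choices; each candidate is scored with getTPSForResult() just
// below and the lowest TPS is kept. With made-up values stages = 2, maxDevices = 8,
// mbsInBatch = 8, this would enumerate d in {1, 2, 4} and, for each d, t in {1, ..., 4/d}.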
try it 1739 | const double TPS = g.getTPSForResult(result); 1740 | if (TPS < bestTPS) { 1741 | bestTPS = TPS; 1742 | bestResult = result; 1743 | } 1744 | } 1745 | } 1746 | } 1747 | } 1748 | } 1749 | } 1750 | 1751 | if (bestTPS > INFTY/2) { 1752 | dbg << "2BW couldn't partition in any way (OOM)" << endl; 1753 | return bestResult; // empty result 1754 | } 1755 | 1756 | dbg << "best 2BW: TPS = " << bestTPS 1757 | << ", stages = " << bestResult.stages.size() 1758 | << ", d = " << bestResult.stages[0].dataParallelDegree 1759 | << ", t = " << bestResult.stages[0].tensorParallelDegree 1760 | << ", activation recomp = " << bestResult.stages[0].TMPCids.begin()->second 1761 | << ", put non-transformers separately = " << bestResult.debugInfo 1762 | << endl; 1763 | 1764 | //for (const ResultStage &rs : bestResult.stages) { 1765 | // dbg << rs.nodes << endl; 1766 | //} 1767 | 1768 | return bestResult; 1769 | } 1770 | 1771 | 1772 | 1773 | // run the PipeDream-2BW-like planner, 1774 | // but without treating transformer and non-transformer layers differently 1775 | Result runPipeDream2BWPlannerNonTransformer (const Instance &ins, 1776 | bool useTensorParallelism) { 1777 | Graph g(ins); // needed to run getTPSForResult() 1778 | 1779 | const vector nodeIds = ins.getNodeIdsToposorted(); 1780 | 1781 | double bestTPS = INFTY; 1782 | Result bestResult; 1783 | 1784 | for (int stages = 1; stages <= ins.maxDevices; ++stages) { 1785 | Result result; 1786 | result.stages.resize(stages); 1787 | 1788 | // divide layers equally-ish among stages 1789 | const int layersPerStage = nodeIds.size() / stages, 1790 | largerStages = nodeIds.size() % stages; 1791 | int nextStageIndex = 0; 1792 | for (int st = 0; st <= stages - 1; ++st) { 1793 | // last `largerStages` have one additional layer each 1794 | const bool thisIsALargerStage = (st >= stages - largerStages); 1795 | for (int i = 0; i < layersPerStage + thisIsALargerStage; ++i) { 1796 | result.stages[st].nodes.push_back(nodeIds[nextStageIndex]); 1797 | nextStageIndex++; 1798 | } 1799 | } 1800 | assert(nextStageIndex == nodeIds.size()); 1801 | 1802 | for (int d = 1; stages*d <= min(ins.maxDevices, ins.mbsInBatch); d ++) { 1803 | if (ins.mbsInBatch % d == 0) { 1804 | for (int t = 1; stages*d*t <= ins.maxDevices; t ++) { 1805 | if (t > 1) { 1806 | if (!useTensorParallelism) { 1807 | break; 1808 | } 1809 | // should check if all nodes support t-degree tensor parallelism 1810 | // we'll just check one node since in our inputs they all support some t or not 1811 | if (!ins.nodes[0].TMPCs.count(t)) { 1812 | continue; 1813 | } 1814 | } 1815 | 1816 | for (ResultStage &rs : result.stages) { 1817 | rs.dataParallelDegree = d; 1818 | rs.tensorParallelDegree = t; 1819 | } 1820 | 1821 | // PipeDream-2BW uses activation recomputation everywhere or nowhere 1822 | for (bool activationRecomputationEverywhere : {false, true}) { 1823 | for (ResultStage &rs : result.stages) { 1824 | for (int v : rs.nodes) { 1825 | rs.TMPCids[v] = activationRecomputationEverywhere ? "activation recomp" : "vanilla"; 1826 | } 1827 | } 1828 | 1829 | // result is built. 
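// Unlike runPipeDream2BWPlanner() above, this variant ignores layer types: it simply splits the
// topologically sorted layers as evenly as possible across the stages. It is the equi-partition
// baseline invoked by runResnet() and runGNMT() further below.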
try it 1830 | const double TPS = g.getTPSForResult(result); 1831 | if (TPS < bestTPS) { 1832 | bestTPS = TPS; 1833 | bestResult = result; 1834 | } 1835 | } 1836 | } 1837 | } 1838 | } 1839 | } 1840 | 1841 | 1842 | if (bestTPS > INFTY/2) { 1843 | dbg << "2BW couldn't partition in any way (OOM)" << endl; 1844 | return bestResult; // empty result 1845 | } 1846 | 1847 | dbg << "best 2BW: TPS = " << bestTPS 1848 | << ", stages = " << bestResult.stages.size() 1849 | << ", d = " << bestResult.stages[0].dataParallelDegree 1850 | << ", t = " << bestResult.stages[0].tensorParallelDegree 1851 | << ", activation recomp = " << bestResult.stages[0].TMPCids.begin()->second 1852 | << endl; 1853 | 1854 | //for (const ResultStage &rs : bestResult.stages) { 1855 | // dbg << rs.nodes << endl; 1856 | //} 1857 | 1858 | return bestResult; 1859 | } 1860 | 1861 | 1862 | 1863 | // build a single configuration, one of those that the PipeDream-2BW-like planner would consider 1864 | // (here code unfortunately repeats a lot from runPipeDream2BWPlanner) 1865 | double buildEquiPartitionResult (Instance &ins, 1866 | int d, 1867 | int t, 1868 | int stages, 1869 | bool putNonTransformerNodesSeparately, 1870 | bool activationRecomputationEverywhere) { 1871 | vector transformers = ins.getTransformerIds(); 1872 | vector initialNodes, finalNodes; // before and after transformers, respectively 1873 | 1874 | while (transformers.size() + initialNodes.size() + finalNodes.size() < ins.nodes.size()) { 1875 | for (const Edge &e : ins.edges) { 1876 | const int u = e.sourceId, v = e.destId; 1877 | const bool initialU = count(initialNodes.begin(), initialNodes.end(), u); 1878 | const bool initialV = count(initialNodes.begin(), initialNodes.end(), v); 1879 | const bool finalU = count(finalNodes.begin(), finalNodes.end(), u); 1880 | const bool finalV = count(finalNodes.begin(), finalNodes.end(), v); 1881 | const bool transformerU = count(transformers.begin(), transformers.end(), u); 1882 | const bool transformerV = count(transformers.begin(), transformers.end(), v); 1883 | 1884 | if (!initialU && !transformerU && !finalU) { 1885 | if (initialV || transformerV) { 1886 | initialNodes.push_back(u); 1887 | } 1888 | } 1889 | if (!initialV && !transformerV && !finalV) { 1890 | if (transformerU || finalU) { 1891 | finalNodes.push_back(v); 1892 | } 1893 | } 1894 | } 1895 | } 1896 | 1897 | Graph g(ins); // needed to run getTPSForResult() 1898 | assert(stages >= 1 + 2 * putNonTransformerNodesSeparately); 1899 | assert(stages <= ins.maxDevices && stages <= transformers.size() + 2 * putNonTransformerNodesSeparately); 1900 | 1901 | Result result; 1902 | result.stages.resize(stages); 1903 | 1904 | const int transformerStages = stages - 2 * putNonTransformerNodesSeparately, 1905 | firstTransformerStage = putNonTransformerNodesSeparately, 1906 | lastTransformerStage = stages - 1 - putNonTransformerNodesSeparately; 1907 | assert(lastTransformerStage - firstTransformerStage + 1 == transformerStages); 1908 | 1909 | // divide transformers equally-ish among stages 1910 | const int transformersPerStage = transformers.size() / transformerStages, 1911 | largerStages = transformers.size() % transformerStages; 1912 | int nextTransformerIndex = 0; 1913 | for (int st = firstTransformerStage; st <= lastTransformerStage; ++st) { 1914 | // last `largerStages` stages have one additional transformer each 1915 | const bool thisIsALargerStage = (st >= lastTransformerStage + 1 - largerStages); 1916 | for (int i = 0; i < transformersPerStage + thisIsALargerStage; ++i) { 1917 
| result.stages[st].nodes.push_back(transformers[nextTransformerIndex]); 1918 | nextTransformerIndex++; 1919 | } 1920 | } 1921 | assert(nextTransformerIndex == transformers.size()); 1922 | // the initial nodes go to first stage, the final nodes go to last stage 1923 | append(result.stages[0].nodes, initialNodes); 1924 | append(result.stages.back().nodes, finalNodes); 1925 | result.debugInfo = (putNonTransformerNodesSeparately ? "1" : "0"); 1926 | 1927 | assert(d >= 1); 1928 | assert(stages*d <= min(ins.maxDevices, ins.mbsInBatch)); 1929 | assert(t >= 1); 1930 | assert(stages*d*t <= ins.maxDevices); 1931 | assert(ins.mbsInBatch % d == 0); 1932 | 1933 | // should check if all nodes support t-degree tensor parallelism 1934 | // we'll just check one node since in our inputs they all support some t or not 1935 | assert(ins.nodes[0].TMPCs.count(t)); 1936 | 1937 | for (ResultStage &rs : result.stages) { 1938 | rs.dataParallelDegree = d; 1939 | rs.tensorParallelDegree = t; 1940 | } 1941 | 1942 | for (ResultStage &rs : result.stages) { 1943 | for (int v : rs.nodes) { 1944 | rs.TMPCids[v] = activationRecomputationEverywhere ? "activation recomp" : "vanilla"; 1945 | } 1946 | } 1947 | 1948 | // result is built. try it 1949 | const double TPS = g.getTPSForResult(result); 1950 | 1951 | // dbg output 1952 | dbg << "TPS = " << TPS << endl; 1953 | 1954 | return TPS; 1955 | } 1956 | 1957 | 1958 | 1959 | void runOurAlgoOnInstances (const vector &instances) { 1960 | for (const Instance &ins : instances) { 1961 | // our alg 1962 | clock_t startTime = clock(); 1963 | Graph g(ins); 1964 | Result our = g.runDP(); 1965 | double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 1966 | DBG(runtime) 1967 | if (our.stages.empty()) { 1968 | continue; // infeasible, not interesting 1969 | } 1970 | double ourTPS = g.getTPSForResult(our); 1971 | } 1972 | } 1973 | 1974 | 1975 | // read the BERT32 instance, then possibly add more transformer layers 1976 | Instance readBERTA100 (int noTransformers) { 1977 | assert(noTransformers >= 32); 1978 | json j; 1979 | ifstream jsonfile("inputs/bert32a100.json"); 1980 | jsonfile >> j; 1981 | Instance ins = j.get(); 1982 | for (int t = 33; t <= noTransformers; ++t) { 1983 | ins.insertTransformerLayer(); 1984 | } 1985 | vector transformers = ins.getTransformerIds(); 1986 | dbg << "transformer (renumbered) ids = [" << transformers << "]" << endl; 1987 | return ins; 1988 | } 1989 | 1990 | 1991 | Instance readGNMT () { 1992 | json j; 1993 | ifstream jsonfile("inputs/gnmt.json"); 1994 | jsonfile >> j; 1995 | Instance ins = j.get(); 1996 | return ins; 1997 | } 1998 | 1999 | 2000 | Instance readResnet () { 2001 | json j; 2002 | ifstream jsonfile("inputs/resnet.json"); 2003 | jsonfile >> j; 2004 | Instance ins = j.get(); 2005 | return ins; 2006 | } 2007 | 2008 | 2009 | // run just a single instance 2010 | void single () { 2011 | Instance ins = readBERTA100(32); 2012 | ins.maxMemoryPerDevice = 8.0 * (1 << 30); 2013 | ins.maxDevices = 512; 2014 | ins.mbsInBatch = 1920; 2015 | ins.bandwidth = 25.0 * (1LL << 30); 2016 | 2017 | runOurAlgoOnInstances({ins}); 2018 | } 2019 | 2020 | 2021 | void plots () { 2022 | for (double memGB : {1,2,8,80}) { 2023 | for (int bertSize : {32}) { 2024 | for (int batchSize : {1920}) { 2025 | stringstream ss; 2026 | ss << "bert" << bertSize << "-" << memGB << "GB-bs" << batchSize << ".csv"; 2027 | ofstream of(ss.str()); 2028 | of << ",Number of devices,color,\\sf{type},\\sf{throughput}\n"; 2029 | int cnt = 0; 2030 | 2031 | dbg << 
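// The CSV written by this function (one file per memory size / model size / batch size
// combination) feeds the Fig. 1 bar plots. Each row has the form
// "<index>,$k=...$,<color>,<type>,<throughput>", where <throughput> is ourTPS divided by the TPS
// of the respective variant ("no DP", "no TP", "no AR", "equi", ...); since a lower TPS is
// better here (it is a time, not a rate), that ratio is the variant's throughput relative to
// Piper, whose own row is fixed at 1.000.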
"=================================================" << endl; 2032 | dbg << "bertSize = " << bertSize << ", batch size = " << batchSize << endl; 2033 | dbg << "=================================================" << endl; 2034 | 2035 | for (int k : {8,32,64,128,512,1024,2048}) { 2036 | 2037 | dbg << "=================================================" << endl; 2038 | dbg << "now running k = " << k << endl; 2039 | dbg << "=================================================" << endl; 2040 | 2041 | Instance ins = readBERTA100(bertSize); 2042 | ins.maxMemoryPerDevice = (1 << 30) * 1.0 * memGB; 2043 | ins.maxDevices = k; 2044 | ins.bandwidth = 25.0 * (1 << 30); 2045 | ins.mbsInBatch = batchSize; 2046 | 2047 | if (ins.getMaxTensorParallelDegree() * batchSize < k) { 2048 | dbg << "skipping device count " << k << " since they cannot all be used" << endl; 2049 | continue; 2050 | } 2051 | 2052 | Graph g(ins); 2053 | 2054 | const Result ourResult = g.runDP(); 2055 | if (ourResult.stages.empty()) { 2056 | dbg << "skipping device count " << k << " since our solution is OOM" << endl; 2057 | continue; 2058 | } 2059 | const double ourTPS = g.getTPSForResult(ourResult); 2060 | assert(ourTPS < INFTY/2); 2061 | of << (cnt++) << ",$k=" << k << "$,"; 2062 | of << "#2b7bba,Piper,1.000\n"; 2063 | 2064 | DATA_PARALLELISM_ALLOWED = false; 2065 | const double noDP = g.getTPSForResult(g.runDP()); 2066 | DATA_PARALLELISM_ALLOWED = true; 2067 | of << (cnt++) << ",$k=" << k << "$,"; 2068 | of << "#DB7093,no DP," << setprecision(3) << fixed << (ourTPS / noDP) << "\n"; 2069 | 2070 | TENSOR_PARALLELISM_ALLOWED = false; 2071 | const double noTP = g.getTPSForResult(g.runDP()); 2072 | TENSOR_PARALLELISM_ALLOWED = true; 2073 | of << (cnt++) << ",$k=" << k << "$,"; 2074 | of << "#FFD700,no TP," << setprecision(3) << fixed << (ourTPS / noTP) << "\n"; 2075 | 2076 | ACTIVATION_RECOMPUTATION_ALLOWED = false; 2077 | const double noAR = g.getTPSForResult(g.runDP()); 2078 | ACTIVATION_RECOMPUTATION_ALLOWED = true; 2079 | of << (cnt++) << ",$k=" << k << "$,"; 2080 | of << "#008000,no AR," << setprecision(3) << fixed << (ourTPS / noAR) << "\n"; 2081 | 2082 | //const double PD2BW = g.getTPSForResult(runPipeDream2BWPlanner(ins, false, false)); 2083 | //of << (cnt++) << ",$k=" << k << "$,"; 2084 | //of << "#191970,equi-no TP-no sep," << setprecision(3) << fixed << (ourTPS / PD2BW) << "\n"; 2085 | 2086 | //const double PD2BWHR = g.getTPSForResult(runPipeDream2BWPlanner(ins, true, false)); 2087 | //of << (cnt++) << ",$k=" << k << "$,"; 2088 | //of << "#FF4500,equi-no sep," << setprecision(3) << fixed << (ourTPS / PD2BWHR) << "\n"; 2089 | 2090 | const double PD2BWsep = g.getTPSForResult(runPipeDream2BWPlanner(ins, false, true)); 2091 | of << (cnt++) << ",$k=" << k << "$,"; 2092 | of << "#191970,equi-no TP," << setprecision(3) << fixed << (ourTPS / PD2BWsep) << "\n"; 2093 | 2094 | const double PD2BWsepHR = g.getTPSForResult(runPipeDream2BWPlanner(ins, true, true)); 2095 | of << (cnt++) << ",$k=" << k << "$,"; 2096 | of << "#FF4500,equi," << setprecision(3) << fixed << (ourTPS / PD2BWsepHR) << "\n"; 2097 | } 2098 | } 2099 | } 2100 | } 2101 | } 2102 | 2103 | 2104 | void correlationExperiment () { 2105 | Instance ins = readBERTA100(32); 2106 | ins.maxDevices = 64; 2107 | ins.bandwidth = (300.0 + 25.0)/2 * (1 << 30); 2108 | vector results; 2109 | ins.mbsInBatch = 128; 2110 | results.push_back(buildEquiPartitionResult(ins, 32, 1, 2, false, true)); 2111 | results.push_back(buildEquiPartitionResult(ins, 16, 1, 4, false, true)); 2112 | 
results.push_back(buildEquiPartitionResult(ins, 8, 1, 8, false, true)); 2113 | results.push_back(buildEquiPartitionResult(ins, 4, 1, 16, false, true)); 2114 | results.push_back(buildEquiPartitionResult(ins, 2, 1, 32, false, true)); 2115 | ins.mbsInBatch = 512; 2116 | results.push_back(buildEquiPartitionResult(ins, 32, 1, 2, false, true)); 2117 | results.push_back(buildEquiPartitionResult(ins, 16, 1, 4, false, true)); 2118 | results.push_back(buildEquiPartitionResult(ins, 8, 1, 8, false, true)); 2119 | results.push_back(buildEquiPartitionResult(ins, 4, 1, 16, false, true)); 2120 | results.push_back(buildEquiPartitionResult(ins, 2, 1, 32, false, true)); 2121 | results.push_back(-1e30); 2122 | results.push_back(-1e30); 2123 | results.push_back(-1e30); 2124 | ins.mbsInBatch = 128; 2125 | results.push_back(buildEquiPartitionResult(ins, 32, 2, 1, false, true)); 2126 | results.push_back(buildEquiPartitionResult(ins, 16, 4, 1, false, true)); 2127 | results.push_back(buildEquiPartitionResult(ins, 8, 8, 1, false, true)); 2128 | ins.mbsInBatch = 512; 2129 | results.push_back(buildEquiPartitionResult(ins, 32, 2, 1, false, true)); 2130 | results.push_back(buildEquiPartitionResult(ins, 16, 4, 1, false, true)); 2131 | results.push_back(buildEquiPartitionResult(ins, 8, 8, 1, false, true)); 2132 | for (double res : results) { 2133 | cout << res << endl; 2134 | } 2135 | } 2136 | 2137 | 2138 | 2139 | 2140 | 2141 | void scalability () { 2142 | for (int bertSize : {32,48,64,96}) { 2143 | for (int batchSize : {480,512,1920,2048}) { 2144 | //if (bertSize == 96 && batchSize == 2048) continue; 2145 | Instance ins = readBERTA100(bertSize); 2146 | stringstream ss; 2147 | ss << "bert" << bertSize << "bs" << batchSize << "times.txt"; 2148 | ofstream of(ss.str()); 2149 | of << "k time stddev\n"; 2150 | for (int k : {8,16,32,64,128,256,512,1024,1536,2048}) { 2151 | const int tries = (k > 600) ? 
3 : 5; 2152 | vector times; 2153 | for (int t = 0; t < tries; ++t) { 2154 | ins.maxDevices = k; 2155 | ins.mbsInBatch = batchSize; 2156 | clock_t startTime = clock(); 2157 | Graph g(ins); 2158 | g.runDP(); 2159 | const double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 2160 | times.push_back(runtime); 2161 | } 2162 | of << k << " " << setprecision(3) << fixed << average(times) << " " << sampleStddev(times) << endl; 2163 | } 2164 | } 2165 | } 2166 | } 2167 | 2168 | 2169 | 2170 | void runAndWrite (Graph &g, const string &filenameCore) { 2171 | g.runDP(); 2172 | 2173 | ofstream ofDevices(filenameCore + "-devices.txt"); 2174 | ofDevices << "k TPS\n"; 2175 | for (int k = 8; k <= g.ins.maxDevices; ++k) { 2176 | const double TPS_k = g.dp[g.idOfFullSet][k][g.boundOnS]; 2177 | if (TPS_k > INFTY/2) { 2178 | continue; 2179 | } 2180 | ofDevices << k << " " << setprecision(7) << fixed << TPS_k << "\n"; 2181 | } 2182 | 2183 | ofstream ofSBound(filenameCore + "-sbound.txt"); 2184 | ofSBound << "s TPS\n"; 2185 | for (int s = 8; s <= g.boundOnS; ++s) { 2186 | const double TPS_s = g.dp[g.idOfFullSet][g.ins.maxDevices][s]; 2187 | if (TPS_s > INFTY/2) { 2188 | continue; 2189 | } 2190 | ofSBound << s << " " << setprecision(7) << fixed << TPS_s << "\n"; 2191 | } 2192 | } 2193 | 2194 | 2195 | // run the PipeDream-2BW-like planner 2196 | void runPD2BWAndWrite (Instance &ins, const string &filenameCore, bool useTensorParallelism, bool tryPuttingNonTransformerNodesSeparately) { 2197 | 2198 | const int backupMaxDevices = ins.maxDevices; 2199 | 2200 | ofstream ofDevices(filenameCore + "-devices.txt"); 2201 | ofDevices << "k TPS\n"; 2202 | for (int k : {8,16,32,64,128,256,512,768,1024,1024+256,1024+512,1024+512+256,2048}) { 2203 | ins.maxDevices = k; 2204 | Graph g(ins); 2205 | ofDevices << k << " " << setprecision(7) << fixed << g.getTPSForResult(runPipeDream2BWPlanner(ins, useTensorParallelism, tryPuttingNonTransformerNodesSeparately)) << endl; 2206 | } 2207 | 2208 | ins.maxDevices = backupMaxDevices; 2209 | 2210 | 2211 | ofstream ofBatch(filenameCore + "-batch.txt"); 2212 | ofBatch << "bs TPS\n"; 2213 | for (int bs : {8,16,32,64,128,256,256+128,512,512+128,512+256,512+256+128,1024}) { 2214 | ins.mbsInBatch = bs; 2215 | Graph g(ins); 2216 | ofBatch << bs << " " << setprecision(7) << fixed << g.getTPSForResult(runPipeDream2BWPlanner(ins, useTensorParallelism, tryPuttingNonTransformerNodesSeparately)) << endl; 2217 | } 2218 | } 2219 | 2220 | 2221 | // measuring runtime 2222 | void parallelizabilityExperiment (int bertSize, int memSize, int batchSize) { 2223 | Instance ins = readBERTA100(bertSize); 2224 | ins.maxDevices = 2048; 2225 | ins.maxMemoryPerDevice = memSize * 1.0 * (1 << 30); 2226 | ins.mbsInBatch = batchSize; 2227 | Graph g(ins); 2228 | 2229 | // Piper 2230 | runAndWrite(g, "parallel-piper-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2231 | 2232 | DATA_PARALLELISM_ALLOWED = false; 2233 | runAndWrite(g, "parallel-nodp-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2234 | DATA_PARALLELISM_ALLOWED = true; 2235 | 2236 | TENSOR_PARALLELISM_ALLOWED = false; 2237 | runAndWrite(g, "parallel-notp-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2238 | TENSOR_PARALLELISM_ALLOWED = true; 2239 | 2240 | ACTIVATION_RECOMPUTATION_ALLOWED = false; 2241 | runAndWrite(g, "parallel-noar-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize)); 2242 | 
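// Ablation pattern used in this function: switch one global flag off, re-run the DP via
// runAndWrite() (which writes the "k TPS" and "s TPS" curves), then switch the flag back on
// before the next ablation; the restore for ACTIVATION_RECOMPUTATION_ALLOWED follows right
// below. A further ablation would take the same shape (sketch only; SOME_FLAG is hypothetical,
// not a flag defined in this file):
//   SOME_FLAG = false;
//   runAndWrite(g, "parallel-noflag-" + to_string(bertSize) + "-" + to_string(memSize)
//               + "GB-bs" + to_string(batchSize));
//   SOME_FLAG = true;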
ACTIVATION_RECOMPUTATION_ALLOWED = true; 2243 | 2244 | //runPD2BWAndWrite(ins, "parallel-equi-no sep-no-TP-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), false, false); 2245 | //runPD2BWAndWrite(ins, "parallel-equi-no sep-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), true, false); 2246 | runPD2BWAndWrite(ins, "parallel-equi-no-TP-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), false, true); 2247 | runPD2BWAndWrite(ins, "parallel-equi-" + to_string(bertSize) + "-" + to_string(memSize) + "GB-bs" + to_string(batchSize), true, true); 2248 | } 2249 | 2250 | 2251 | void runResnet () { 2252 | vector> pairs = { 2253 | {8, 4.0}, {8, 8.0}, 2254 | {16, 1.5}, {16, 2.0}, {16, 4.0}, {16, 8.0}, 2255 | {32, 1.5}, {32, 2.0}, {32, 4.0}, {32, 8.0}, 2256 | {64, 1.0}, {64, 1.5}, {64, 2.0}, {64, 4.0}, 2257 | {128, 1.0}, {128, 1.5}, {128, 2.0}, {128, 4.0}, 2258 | {8, 16.0}, {16, 16.0}, {32, 16.0}, {64, 8.0}, {64, 16.0}, 2259 | {128, 8.0}, {128, 16.0} 2260 | }; 2261 | sort(pairs.begin(), pairs.end()); 2262 | for (pair mm : pairs) { 2263 | cerr << endl << endl << endl << endl; 2264 | DBG(mm.first); 2265 | DBG(mm.second); 2266 | Instance ins = readResnet(); 2267 | ins.maxMemoryPerDevice = mm.second * (1 << 30); 2268 | ins.maxDevices = mm.first; 2269 | //ins.mbsInBatch = 1920*4; 2270 | ins.bandwidth = 25.0 * (1LL << 30); 2271 | 2272 | // our alg 2273 | clock_t startTime = clock(); 2274 | Graph g(ins); 2275 | Result our = g.runDP(); 2276 | double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 2277 | DBG(runtime) 2278 | if (our.stages.empty()) { 2279 | continue; // infeasible, not interesting 2280 | } 2281 | double ourTPS = g.getTPSForResult(our); 2282 | 2283 | Result equi = runPipeDream2BWPlannerNonTransformer(ins, false); 2284 | double equiTPS = equi.stages.empty() ? 1e30 : g.getTPSForResult(equi); 2285 | double ratio = equi.stages.empty() ? 0 : ourTPS / equiTPS; 2286 | 2287 | cout << mm.first << " & " 2288 | << mm.second << " & " 2289 | << setprecision(3) << fixed << ourTPS << " & "; 2290 | if (equi.stages.empty()) cout << "OOM"; else cout << setprecision(3) << fixed << equiTPS; 2291 | cout << " & " 2292 | << setprecision(3) << fixed << ratio << "$\\times$ & " 2293 | << setprecision(1) << fixed << runtime << "s \\\\\n"; 2294 | } 2295 | } 2296 | 2297 | 2298 | void runGNMT () { 2299 | vector> pairs = { 2300 | {2, 2.5}, {2, 3.5}, 2301 | {4, 1.2}, {4, 2.5}, 2302 | {8, 0.6}, {8, 1.2}, {8, 2.4}, 2303 | {16, 0.3}, {16, 0.6}, {16, 1.2}, 2304 | {32, 0.3}, 2305 | {64, 0.3}, 2306 | {32, 0.8} 2307 | }; 2308 | sort(pairs.begin(), pairs.end()); 2309 | for (pair mm : pairs) { 2310 | cerr << endl << endl << endl << endl; 2311 | DBG(mm.first); 2312 | DBG(mm.second); 2313 | Instance ins = readGNMT(); 2314 | ins.maxMemoryPerDevice = mm.second * (1 << 30); 2315 | ins.maxDevices = mm.first; 2316 | //ins.mbsInBatch = 256; 2317 | ins.bandwidth = 25.0 * (1LL << 30); 2318 | 2319 | // our alg 2320 | clock_t startTime = clock(); 2321 | Graph g(ins); 2322 | Result our = g.runDP(); 2323 | double runtime = (clock() - startTime) * 1.0 / CLOCKS_PER_SEC; 2324 | DBG(runtime) 2325 | if (our.stages.empty()) { 2326 | continue; // infeasible, not interesting 2327 | } 2328 | double ourTPS = g.getTPSForResult(our); 2329 | 2330 | Result equi = runPipeDream2BWPlannerNonTransformer(ins, false); 2331 | double equiTPS = equi.stages.empty() ? 1e30 : g.getTPSForResult(equi); 2332 | double ratio = equi.stages.empty() ? 
0 : ourTPS / equiTPS; 2333 | 2334 | cout << mm.first << " & " 2335 | << mm.second << " & " 2336 | << setprecision(3) << fixed << ourTPS << " & "; 2337 | if (equi.stages.empty()) cout << "OOM"; else cout << setprecision(3) << fixed << equiTPS; 2338 | cout << " & " 2339 | << setprecision(3) << fixed << ratio << "$\\times$ & " 2340 | << setprecision(1) << fixed << runtime << "s \\\\\n"; 2341 | } 2342 | } 2343 | 2344 | 2345 | int main (int argc, char **argv) { 2346 | plots(); // generates data for the bar plots (Fig. 1) 2347 | 2348 | parallelizabilityExperiment(32, 8, 960); // Fig. 2a 2349 | parallelizabilityExperiment(32, 8, 512); // Fig. 2b 2350 | 2351 | scalability(); // Fig. 3 2352 | 2353 | correlationExperiment(); // Fig. 4 (y-axis values) 2354 | 2355 | // experiment for Appendix C 2356 | OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION = true; 2357 | single(); 2358 | OUTPUT_KNAPSACK_INSTANCES_FOR_INSPECTION = false; 2359 | 2360 | // additional experiments on Resnet50 and GNMT 2361 | runGNMT(); 2362 | cout << endl; 2363 | runResnet(); 2364 | } 2365 | -------------------------------------------------------------------------------- /inputs/gnmt.json: -------------------------------------------------------------------------------- 1 | { 2 | "maxDevices": 10, 3 | "maxMemoryPerDevice": 80000000000.0, 4 | "bandwidth": 25000000000.0, 5 | "mbsInBatch": 256, 6 | "nodes": [ 7 | { 8 | "id": 0, 9 | "name": "node4", 10 | "TMPCs": { 11 | "1": [ 12 | { 13 | "id": "vanilla", 14 | "timePerSample": 25.678, 15 | "parameterSize": 132382720.0, 16 | "memoryUsageA": 13107200.0, 17 | "memoryUsageB": 264765440.0, 18 | "syncTimeFw": {}, 19 | "syncTimeBw": { 20 | "1": 0.0 21 | } 22 | }, 23 | { 24 | "id": "activation recomp", 25 | "timePerSample": 25.818, 26 | "parameterSize": 132382720.0, 27 | "memoryUsageA": 1179648.0, 28 | "memoryUsageB": 276692992.0, 29 | "syncTimeFw": {}, 30 | "syncTimeBw": { 31 | "1": 0.0 32 | } 33 | } 34 | ] 35 | } 36 | }, 37 | { 38 | "id": 1, 39 | "name": "node5", 40 | "TMPCs": { 41 | "1": [ 42 | { 43 | "id": "vanilla", 44 | "timePerSample": 14.145, 45 | "parameterSize": 67174400.0, 46 | "memoryUsageA": 26214400.0, 47 | "memoryUsageB": 134348800.0, 48 | "syncTimeFw": { 49 | "0": 0.0, 50 | "2": 0.0 51 | }, 52 | "syncTimeBw": { 53 | "3": 0.0 54 | } 55 | }, 56 | { 57 | "id": "activation recomp", 58 | "timePerSample": 28.270999999999997, 59 | "parameterSize": 67174400.0, 60 | "memoryUsageA": 2359296.0, 61 | "memoryUsageB": 158203904.0, 62 | "syncTimeFw": { 63 | "0": 0.0, 64 | "2": 0.0 65 | }, 66 | "syncTimeBw": { 67 | "3": 0.0 68 | } 69 | } 70 | ] 71 | } 72 | }, 73 | { 74 | "id": 2, 75 | "name": "node2", 76 | "TMPCs": { 77 | "1": [ 78 | { 79 | "id": "vanilla", 80 | "timePerSample": 0.0, 81 | "parameterSize": 0.0, 82 | "memoryUsageA": 0.0, 83 | "memoryUsageB": 0.0, 84 | "syncTimeFw": {}, 85 | "syncTimeBw": { 86 | "1": 0.0, 87 | "33": 0.0 88 | } 89 | }, 90 | { 91 | "id": "activation recomp", 92 | "timePerSample": 0.0, 93 | "parameterSize": 0.0, 94 | "memoryUsageA": 0.0, 95 | "memoryUsageB": 0.0, 96 | "syncTimeFw": {}, 97 | "syncTimeBw": { 98 | "1": 0.0, 99 | "33": 0.0 100 | } 101 | } 102 | ] 103 | } 104 | }, 105 | { 106 | "id": 3, 107 | "name": "node6", 108 | "TMPCs": { 109 | "1": [ 110 | { 111 | "id": "vanilla", 112 | "timePerSample": 0.471, 113 | "parameterSize": 0.0, 114 | "memoryUsageA": 26214400.0, 115 | "memoryUsageB": 0.0, 116 | "syncTimeFw": { 117 | "1": 0.0 118 | }, 119 | "syncTimeBw": { 120 | "4": 0.0 121 | } 122 | }, 123 | { 124 | "id": "activation recomp", 125 | "timePerSample": 0.657, 126 | 
"parameterSize": 0.0, 127 | "memoryUsageA": 2359296.0, 128 | "memoryUsageB": 23855104.0, 129 | "syncTimeFw": { 130 | "1": 0.0 131 | }, 132 | "syncTimeBw": { 133 | "4": 0.0 134 | } 135 | } 136 | ] 137 | } 138 | }, 139 | { 140 | "id": 4, 141 | "name": "node7", 142 | "TMPCs": { 143 | "1": [ 144 | { 145 | "id": "vanilla", 146 | "timePerSample": 29.075000000000003, 147 | "parameterSize": 50364416.0, 148 | "memoryUsageA": 14155776.0, 149 | "memoryUsageB": 100728832.0, 150 | "syncTimeFw": { 151 | "3": 0.0 152 | }, 153 | "syncTimeBw": { 154 | "5": 0.0 155 | } 156 | }, 157 | { 158 | "id": "activation recomp", 159 | "timePerSample": 39.373000000000005, 160 | "parameterSize": 50364416.0, 161 | "memoryUsageA": 1274019.8399999999, 162 | "memoryUsageB": 113610588.16, 163 | "syncTimeFw": { 164 | "3": 0.0 165 | }, 166 | "syncTimeBw": { 167 | "5": 0.0 168 | } 169 | } 170 | ] 171 | } 172 | }, 173 | { 174 | "id": 5, 175 | "name": "node8", 176 | "TMPCs": { 177 | "1": [ 178 | { 179 | "id": "vanilla", 180 | "timePerSample": 0.0, 181 | "parameterSize": 0.0, 182 | "memoryUsageA": 13107200.0, 183 | "memoryUsageB": 0.0, 184 | "syncTimeFw": { 185 | "4": 0.0 186 | }, 187 | "syncTimeBw": { 188 | "6": 0.0, 189 | "9": 0.0 190 | } 191 | }, 192 | { 193 | "id": "activation recomp", 194 | "timePerSample": 0.0, 195 | "parameterSize": 0.0, 196 | "memoryUsageA": 1179648.0, 197 | "memoryUsageB": 11927552.0, 198 | "syncTimeFw": { 199 | "4": 0.0 200 | }, 201 | "syncTimeBw": { 202 | "6": 0.0, 203 | "9": 0.0 204 | } 205 | } 206 | ] 207 | } 208 | }, 209 | { 210 | "id": 6, 211 | "name": "node10", 212 | "TMPCs": { 213 | "1": [ 214 | { 215 | "id": "vanilla", 216 | "timePerSample": 0.28300000000000003, 217 | "parameterSize": 0.0, 218 | "memoryUsageA": 13107200.0, 219 | "memoryUsageB": 0.0, 220 | "syncTimeFw": { 221 | "5": 0.0 222 | }, 223 | "syncTimeBw": { 224 | "7": 0.0 225 | } 226 | }, 227 | { 228 | "id": "activation recomp", 229 | "timePerSample": 0.40700000000000003, 230 | "parameterSize": 0.0, 231 | "memoryUsageA": 1179648.0, 232 | "memoryUsageB": 11927552.0, 233 | "syncTimeFw": { 234 | "5": 0.0 235 | }, 236 | "syncTimeBw": { 237 | "7": 0.0 238 | } 239 | } 240 | ] 241 | } 242 | }, 243 | { 244 | "id": 7, 245 | "name": "node11", 246 | "TMPCs": { 247 | "1": [ 248 | { 249 | "id": "vanilla", 250 | "timePerSample": 20.436, 251 | "parameterSize": 33587200.0, 252 | "memoryUsageA": 14155776.0, 253 | "memoryUsageB": 67174400.0, 254 | "syncTimeFw": { 255 | "6": 0.0 256 | }, 257 | "syncTimeBw": { 258 | "8": 0.0 259 | } 260 | }, 261 | { 262 | "id": "activation recomp", 263 | "timePerSample": 27.323, 264 | "parameterSize": 33587200.0, 265 | "memoryUsageA": 1274019.8399999999, 266 | "memoryUsageB": 80056156.16, 267 | "syncTimeFw": { 268 | "6": 0.0 269 | }, 270 | "syncTimeBw": { 271 | "8": 0.0 272 | } 273 | } 274 | ] 275 | } 276 | }, 277 | { 278 | "id": 8, 279 | "name": "node12", 280 | "TMPCs": { 281 | "1": [ 282 | { 283 | "id": "vanilla", 284 | "timePerSample": 0.0, 285 | "parameterSize": 0.0, 286 | "memoryUsageA": 13107200.0, 287 | "memoryUsageB": 0.0, 288 | "syncTimeFw": { 289 | "7": 0.0 290 | }, 291 | "syncTimeBw": { 292 | "9": 0.0 293 | } 294 | }, 295 | { 296 | "id": "activation recomp", 297 | "timePerSample": 0.0, 298 | "parameterSize": 0.0, 299 | "memoryUsageA": 1179648.0, 300 | "memoryUsageB": 11927552.0, 301 | "syncTimeFw": { 302 | "7": 0.0 303 | }, 304 | "syncTimeBw": { 305 | "9": 0.0 306 | } 307 | } 308 | ] 309 | } 310 | }, 311 | { 312 | "id": 9, 313 | "name": "node14", 314 | "TMPCs": { 315 | "1": [ 316 | { 317 | "id": "vanilla", 318 | 
"timePerSample": 0.0, 319 | "parameterSize": 0.0, 320 | "memoryUsageA": 13107200.0, 321 | "memoryUsageB": 0.0, 322 | "syncTimeFw": { 323 | "8": 0.0, 324 | "5": 0.0 325 | }, 326 | "syncTimeBw": { 327 | "10": 0.0, 328 | "13": 0.0 329 | } 330 | }, 331 | { 332 | "id": "activation recomp", 333 | "timePerSample": 0.0, 334 | "parameterSize": 0.0, 335 | "memoryUsageA": 1179648.0, 336 | "memoryUsageB": 11927552.0, 337 | "syncTimeFw": { 338 | "8": 0.0, 339 | "5": 0.0 340 | }, 341 | "syncTimeBw": { 342 | "10": 0.0, 343 | "13": 0.0 344 | } 345 | } 346 | ] 347 | } 348 | }, 349 | { 350 | "id": 10, 351 | "name": "node15", 352 | "TMPCs": { 353 | "1": [ 354 | { 355 | "id": "vanilla", 356 | "timePerSample": 0.29600000000000004, 357 | "parameterSize": 0.0, 358 | "memoryUsageA": 13107200.0, 359 | "memoryUsageB": 0.0, 360 | "syncTimeFw": { 361 | "9": 0.0 362 | }, 363 | "syncTimeBw": { 364 | "11": 0.0 365 | } 366 | }, 367 | { 368 | "id": "activation recomp", 369 | "timePerSample": 0.42500000000000004, 370 | "parameterSize": 0.0, 371 | "memoryUsageA": 1179648.0, 372 | "memoryUsageB": 11927552.0, 373 | "syncTimeFw": { 374 | "9": 0.0 375 | }, 376 | "syncTimeBw": { 377 | "11": 0.0 378 | } 379 | } 380 | ] 381 | } 382 | }, 383 | { 384 | "id": 11, 385 | "name": "node16", 386 | "TMPCs": { 387 | "1": [ 388 | { 389 | "id": "vanilla", 390 | "timePerSample": 20.421, 391 | "parameterSize": 33587200.0, 392 | "memoryUsageA": 14155776.0, 393 | "memoryUsageB": 67174400.0, 394 | "syncTimeFw": { 395 | "10": 0.0 396 | }, 397 | "syncTimeBw": { 398 | "12": 0.0 399 | } 400 | }, 401 | { 402 | "id": "activation recomp", 403 | "timePerSample": 27.323999999999998, 404 | "parameterSize": 33587200.0, 405 | "memoryUsageA": 1274019.8399999999, 406 | "memoryUsageB": 80056156.16, 407 | "syncTimeFw": { 408 | "10": 0.0 409 | }, 410 | "syncTimeBw": { 411 | "12": 0.0 412 | } 413 | } 414 | ] 415 | } 416 | }, 417 | { 418 | "id": 12, 419 | "name": "node17", 420 | "TMPCs": { 421 | "1": [ 422 | { 423 | "id": "vanilla", 424 | "timePerSample": 0.0, 425 | "parameterSize": 0.0, 426 | "memoryUsageA": 13107200.0, 427 | "memoryUsageB": 0.0, 428 | "syncTimeFw": { 429 | "11": 0.0 430 | }, 431 | "syncTimeBw": { 432 | "13": 0.0 433 | } 434 | }, 435 | { 436 | "id": "activation recomp", 437 | "timePerSample": 0.0, 438 | "parameterSize": 0.0, 439 | "memoryUsageA": 1179648.0, 440 | "memoryUsageB": 11927552.0, 441 | "syncTimeFw": { 442 | "11": 0.0 443 | }, 444 | "syncTimeBw": { 445 | "13": 0.0 446 | } 447 | } 448 | ] 449 | } 450 | }, 451 | { 452 | "id": 13, 453 | "name": "node19", 454 | "TMPCs": { 455 | "1": [ 456 | { 457 | "id": "vanilla", 458 | "timePerSample": 0.0, 459 | "parameterSize": 0.0, 460 | "memoryUsageA": 13107200.0, 461 | "memoryUsageB": 0.0, 462 | "syncTimeFw": { 463 | "12": 0.0, 464 | "9": 0.0 465 | }, 466 | "syncTimeBw": { 467 | "14": 0.0, 468 | "17": 0.0 469 | } 470 | }, 471 | { 472 | "id": "activation recomp", 473 | "timePerSample": 0.0, 474 | "parameterSize": 0.0, 475 | "memoryUsageA": 1179648.0, 476 | "memoryUsageB": 11927552.0, 477 | "syncTimeFw": { 478 | "12": 0.0, 479 | "9": 0.0 480 | }, 481 | "syncTimeBw": { 482 | "14": 0.0, 483 | "17": 0.0 484 | } 485 | } 486 | ] 487 | } 488 | }, 489 | { 490 | "id": 14, 491 | "name": "node20", 492 | "TMPCs": { 493 | "1": [ 494 | { 495 | "id": "vanilla", 496 | "timePerSample": 0.28200000000000003, 497 | "parameterSize": 0.0, 498 | "memoryUsageA": 13107200.0, 499 | "memoryUsageB": 0.0, 500 | "syncTimeFw": { 501 | "13": 0.0 502 | }, 503 | "syncTimeBw": { 504 | "15": 0.0 505 | } 506 | }, 507 | { 508 | "id": 
"activation recomp", 509 | "timePerSample": 0.404, 510 | "parameterSize": 0.0, 511 | "memoryUsageA": 1179648.0, 512 | "memoryUsageB": 11927552.0, 513 | "syncTimeFw": { 514 | "13": 0.0 515 | }, 516 | "syncTimeBw": { 517 | "15": 0.0 518 | } 519 | } 520 | ] 521 | } 522 | }, 523 | { 524 | "id": 15, 525 | "name": "node21", 526 | "TMPCs": { 527 | "1": [ 528 | { 529 | "id": "vanilla", 530 | "timePerSample": 20.483, 531 | "parameterSize": 33587200.0, 532 | "memoryUsageA": 14155776.0, 533 | "memoryUsageB": 67174400.0, 534 | "syncTimeFw": { 535 | "14": 0.0 536 | }, 537 | "syncTimeBw": { 538 | "16": 0.0 539 | } 540 | }, 541 | { 542 | "id": "activation recomp", 543 | "timePerSample": 27.372, 544 | "parameterSize": 33587200.0, 545 | "memoryUsageA": 1274019.8399999999, 546 | "memoryUsageB": 80056156.16, 547 | "syncTimeFw": { 548 | "14": 0.0 549 | }, 550 | "syncTimeBw": { 551 | "16": 0.0 552 | } 553 | } 554 | ] 555 | } 556 | }, 557 | { 558 | "id": 16, 559 | "name": "node22", 560 | "TMPCs": { 561 | "1": [ 562 | { 563 | "id": "vanilla", 564 | "timePerSample": 0.0, 565 | "parameterSize": 0.0, 566 | "memoryUsageA": 13107200.0, 567 | "memoryUsageB": 0.0, 568 | "syncTimeFw": { 569 | "15": 0.0 570 | }, 571 | "syncTimeBw": { 572 | "17": 0.0 573 | } 574 | }, 575 | { 576 | "id": "activation recomp", 577 | "timePerSample": 0.0, 578 | "parameterSize": 0.0, 579 | "memoryUsageA": 1179648.0, 580 | "memoryUsageB": 11927552.0, 581 | "syncTimeFw": { 582 | "15": 0.0 583 | }, 584 | "syncTimeBw": { 585 | "17": 0.0 586 | } 587 | } 588 | ] 589 | } 590 | }, 591 | { 592 | "id": 17, 593 | "name": "node24", 594 | "TMPCs": { 595 | "1": [ 596 | { 597 | "id": "vanilla", 598 | "timePerSample": 0.0, 599 | "parameterSize": 0.0, 600 | "memoryUsageA": 13107200.0, 601 | "memoryUsageB": 0.0, 602 | "syncTimeFw": { 603 | "16": 0.0, 604 | "13": 0.0 605 | }, 606 | "syncTimeBw": { 607 | "18": 0.0, 608 | "21": 0.0 609 | } 610 | }, 611 | { 612 | "id": "activation recomp", 613 | "timePerSample": 0.0, 614 | "parameterSize": 0.0, 615 | "memoryUsageA": 1179648.0, 616 | "memoryUsageB": 11927552.0, 617 | "syncTimeFw": { 618 | "16": 0.0, 619 | "13": 0.0 620 | }, 621 | "syncTimeBw": { 622 | "18": 0.0, 623 | "21": 0.0 624 | } 625 | } 626 | ] 627 | } 628 | }, 629 | { 630 | "id": 18, 631 | "name": "node25", 632 | "TMPCs": { 633 | "1": [ 634 | { 635 | "id": "vanilla", 636 | "timePerSample": 0.28, 637 | "parameterSize": 0.0, 638 | "memoryUsageA": 13107200.0, 639 | "memoryUsageB": 0.0, 640 | "syncTimeFw": { 641 | "17": 0.0 642 | }, 643 | "syncTimeBw": { 644 | "19": 0.0 645 | } 646 | }, 647 | { 648 | "id": "activation recomp", 649 | "timePerSample": 0.4, 650 | "parameterSize": 0.0, 651 | "memoryUsageA": 1179648.0, 652 | "memoryUsageB": 11927552.0, 653 | "syncTimeFw": { 654 | "17": 0.0 655 | }, 656 | "syncTimeBw": { 657 | "19": 0.0 658 | } 659 | } 660 | ] 661 | } 662 | }, 663 | { 664 | "id": 19, 665 | "name": "node26", 666 | "TMPCs": { 667 | "1": [ 668 | { 669 | "id": "vanilla", 670 | "timePerSample": 20.647, 671 | "parameterSize": 33587200.0, 672 | "memoryUsageA": 14155776.0, 673 | "memoryUsageB": 67174400.0, 674 | "syncTimeFw": { 675 | "18": 0.0 676 | }, 677 | "syncTimeBw": { 678 | "20": 0.0 679 | } 680 | }, 681 | { 682 | "id": "activation recomp", 683 | "timePerSample": 27.666, 684 | "parameterSize": 33587200.0, 685 | "memoryUsageA": 1274019.8399999999, 686 | "memoryUsageB": 80056156.16, 687 | "syncTimeFw": { 688 | "18": 0.0 689 | }, 690 | "syncTimeBw": { 691 | "20": 0.0 692 | } 693 | } 694 | ] 695 | } 696 | }, 697 | { 698 | "id": 20, 699 | "name": "node27", 
700 | "TMPCs": { 701 | "1": [ 702 | { 703 | "id": "vanilla", 704 | "timePerSample": 0.0, 705 | "parameterSize": 0.0, 706 | "memoryUsageA": 13107200.0, 707 | "memoryUsageB": 0.0, 708 | "syncTimeFw": { 709 | "19": 0.0 710 | }, 711 | "syncTimeBw": { 712 | "21": 0.0 713 | } 714 | }, 715 | { 716 | "id": "activation recomp", 717 | "timePerSample": 0.0, 718 | "parameterSize": 0.0, 719 | "memoryUsageA": 1179648.0, 720 | "memoryUsageB": 11927552.0, 721 | "syncTimeFw": { 722 | "19": 0.0 723 | }, 724 | "syncTimeBw": { 725 | "21": 0.0 726 | } 727 | } 728 | ] 729 | } 730 | }, 731 | { 732 | "id": 21, 733 | "name": "node29", 734 | "TMPCs": { 735 | "1": [ 736 | { 737 | "id": "vanilla", 738 | "timePerSample": 0.0, 739 | "parameterSize": 0.0, 740 | "memoryUsageA": 13107200.0, 741 | "memoryUsageB": 0.0, 742 | "syncTimeFw": { 743 | "20": 0.0, 744 | "17": 0.0 745 | }, 746 | "syncTimeBw": { 747 | "22": 0.0, 748 | "25": 0.0 749 | } 750 | }, 751 | { 752 | "id": "activation recomp", 753 | "timePerSample": 0.0, 754 | "parameterSize": 0.0, 755 | "memoryUsageA": 1179648.0, 756 | "memoryUsageB": 11927552.0, 757 | "syncTimeFw": { 758 | "20": 0.0, 759 | "17": 0.0 760 | }, 761 | "syncTimeBw": { 762 | "22": 0.0, 763 | "25": 0.0 764 | } 765 | } 766 | ] 767 | } 768 | }, 769 | { 770 | "id": 22, 771 | "name": "node30", 772 | "TMPCs": { 773 | "1": [ 774 | { 775 | "id": "vanilla", 776 | "timePerSample": 0.28400000000000003, 777 | "parameterSize": 0.0, 778 | "memoryUsageA": 13107200.0, 779 | "memoryUsageB": 0.0, 780 | "syncTimeFw": { 781 | "21": 0.0 782 | }, 783 | "syncTimeBw": { 784 | "23": 0.0 785 | } 786 | }, 787 | { 788 | "id": "activation recomp", 789 | "timePerSample": 0.404, 790 | "parameterSize": 0.0, 791 | "memoryUsageA": 1179648.0, 792 | "memoryUsageB": 11927552.0, 793 | "syncTimeFw": { 794 | "21": 0.0 795 | }, 796 | "syncTimeBw": { 797 | "23": 0.0 798 | } 799 | } 800 | ] 801 | } 802 | }, 803 | { 804 | "id": 23, 805 | "name": "node31", 806 | "TMPCs": { 807 | "1": [ 808 | { 809 | "id": "vanilla", 810 | "timePerSample": 20.605, 811 | "parameterSize": 33587200.0, 812 | "memoryUsageA": 14155776.0, 813 | "memoryUsageB": 67174400.0, 814 | "syncTimeFw": { 815 | "22": 0.0 816 | }, 817 | "syncTimeBw": { 818 | "24": 0.0 819 | } 820 | }, 821 | { 822 | "id": "activation recomp", 823 | "timePerSample": 27.596, 824 | "parameterSize": 33587200.0, 825 | "memoryUsageA": 1274019.8399999999, 826 | "memoryUsageB": 80056156.16, 827 | "syncTimeFw": { 828 | "22": 0.0 829 | }, 830 | "syncTimeBw": { 831 | "24": 0.0 832 | } 833 | } 834 | ] 835 | } 836 | }, 837 | { 838 | "id": 24, 839 | "name": "node32", 840 | "TMPCs": { 841 | "1": [ 842 | { 843 | "id": "vanilla", 844 | "timePerSample": 0.0, 845 | "parameterSize": 0.0, 846 | "memoryUsageA": 13107200.0, 847 | "memoryUsageB": 0.0, 848 | "syncTimeFw": { 849 | "23": 0.0 850 | }, 851 | "syncTimeBw": { 852 | "25": 0.0 853 | } 854 | }, 855 | { 856 | "id": "activation recomp", 857 | "timePerSample": 0.0, 858 | "parameterSize": 0.0, 859 | "memoryUsageA": 1179648.0, 860 | "memoryUsageB": 11927552.0, 861 | "syncTimeFw": { 862 | "23": 0.0 863 | }, 864 | "syncTimeBw": { 865 | "25": 0.0 866 | } 867 | } 868 | ] 869 | } 870 | }, 871 | { 872 | "id": 25, 873 | "name": "node34", 874 | "TMPCs": { 875 | "1": [ 876 | { 877 | "id": "vanilla", 878 | "timePerSample": 0.0, 879 | "parameterSize": 0.0, 880 | "memoryUsageA": 13107200.0, 881 | "memoryUsageB": 0.0, 882 | "syncTimeFw": { 883 | "24": 0.0, 884 | "21": 0.0 885 | }, 886 | "syncTimeBw": { 887 | "26": 0.0, 888 | "29": 0.0 889 | } 890 | }, 891 | { 892 | "id": 
"activation recomp", 893 | "timePerSample": 0.0, 894 | "parameterSize": 0.0, 895 | "memoryUsageA": 1179648.0, 896 | "memoryUsageB": 11927552.0, 897 | "syncTimeFw": { 898 | "24": 0.0, 899 | "21": 0.0 900 | }, 901 | "syncTimeBw": { 902 | "26": 0.0, 903 | "29": 0.0 904 | } 905 | } 906 | ] 907 | } 908 | }, 909 | { 910 | "id": 26, 911 | "name": "node35", 912 | "TMPCs": { 913 | "1": [ 914 | { 915 | "id": "vanilla", 916 | "timePerSample": 0.28900000000000003, 917 | "parameterSize": 0.0, 918 | "memoryUsageA": 13107200.0, 919 | "memoryUsageB": 0.0, 920 | "syncTimeFw": { 921 | "25": 0.0 922 | }, 923 | "syncTimeBw": { 924 | "27": 0.0 925 | } 926 | }, 927 | { 928 | "id": "activation recomp", 929 | "timePerSample": 0.40900000000000003, 930 | "parameterSize": 0.0, 931 | "memoryUsageA": 1179648.0, 932 | "memoryUsageB": 11927552.0, 933 | "syncTimeFw": { 934 | "25": 0.0 935 | }, 936 | "syncTimeBw": { 937 | "27": 0.0 938 | } 939 | } 940 | ] 941 | } 942 | }, 943 | { 944 | "id": 27, 945 | "name": "node36", 946 | "TMPCs": { 947 | "1": [ 948 | { 949 | "id": "vanilla", 950 | "timePerSample": 21.167, 951 | "parameterSize": 33587200.0, 952 | "memoryUsageA": 14155776.0, 953 | "memoryUsageB": 67174400.0, 954 | "syncTimeFw": { 955 | "26": 0.0 956 | }, 957 | "syncTimeBw": { 958 | "28": 0.0 959 | } 960 | }, 961 | { 962 | "id": "activation recomp", 963 | "timePerSample": 28.222, 964 | "parameterSize": 33587200.0, 965 | "memoryUsageA": 1274019.8399999999, 966 | "memoryUsageB": 80056156.16, 967 | "syncTimeFw": { 968 | "26": 0.0 969 | }, 970 | "syncTimeBw": { 971 | "28": 0.0 972 | } 973 | } 974 | ] 975 | } 976 | }, 977 | { 978 | "id": 28, 979 | "name": "node37", 980 | "TMPCs": { 981 | "1": [ 982 | { 983 | "id": "vanilla", 984 | "timePerSample": 0.0, 985 | "parameterSize": 0.0, 986 | "memoryUsageA": 13107200.0, 987 | "memoryUsageB": 0.0, 988 | "syncTimeFw": { 989 | "27": 0.0 990 | }, 991 | "syncTimeBw": { 992 | "29": 0.0 993 | } 994 | }, 995 | { 996 | "id": "activation recomp", 997 | "timePerSample": 0.0, 998 | "parameterSize": 0.0, 999 | "memoryUsageA": 1179648.0, 1000 | "memoryUsageB": 11927552.0, 1001 | "syncTimeFw": { 1002 | "27": 0.0 1003 | }, 1004 | "syncTimeBw": { 1005 | "29": 0.0 1006 | } 1007 | } 1008 | ] 1009 | } 1010 | }, 1011 | { 1012 | "id": 29, 1013 | "name": "node39", 1014 | "TMPCs": { 1015 | "1": [ 1016 | { 1017 | "id": "vanilla", 1018 | "timePerSample": 0.0, 1019 | "parameterSize": 0.0, 1020 | "memoryUsageA": 13107200.0, 1021 | "memoryUsageB": 0.0, 1022 | "syncTimeFw": { 1023 | "28": 0.0, 1024 | "25": 0.0 1025 | }, 1026 | "syncTimeBw": { 1027 | "33": 0.0 1028 | } 1029 | }, 1030 | { 1031 | "id": "activation recomp", 1032 | "timePerSample": 0.0, 1033 | "parameterSize": 0.0, 1034 | "memoryUsageA": 1179648.0, 1035 | "memoryUsageB": 11927552.0, 1036 | "syncTimeFw": { 1037 | "28": 0.0, 1038 | "25": 0.0 1039 | }, 1040 | "syncTimeBw": { 1041 | "33": 0.0 1042 | } 1043 | } 1044 | ] 1045 | } 1046 | }, 1047 | { 1048 | "id": 30, 1049 | "name": "node41", 1050 | "TMPCs": { 1051 | "1": [ 1052 | { 1053 | "id": "vanilla", 1054 | "timePerSample": 0.5880000000000001, 1055 | "parameterSize": 132382720.0, 1056 | "memoryUsageA": 13107200.0, 1057 | "memoryUsageB": 264765440.0, 1058 | "syncTimeFw": {}, 1059 | "syncTimeBw": { 1060 | "33": 0.0 1061 | } 1062 | }, 1063 | { 1064 | "id": "activation recomp", 1065 | "timePerSample": 0.716, 1066 | "parameterSize": 132382720.0, 1067 | "memoryUsageA": 1179648.0, 1068 | "memoryUsageB": 276692992.0, 1069 | "syncTimeFw": {}, 1070 | "syncTimeBw": { 1071 | "33": 0.0 1072 | } 1073 | } 1074 | ] 
1075 | } 1076 | }, 1077 | { 1078 | "id": 31, 1079 | "name": "node40", 1080 | "TMPCs": { 1081 | "1": [ 1082 | { 1083 | "id": "vanilla", 1084 | "timePerSample": 0.0, 1085 | "parameterSize": 0.0, 1086 | "memoryUsageA": 0.0, 1087 | "memoryUsageB": 0.0, 1088 | "syncTimeFw": {}, 1089 | "syncTimeBw": { 1090 | "32": 0.0, 1091 | "38": 0.0, 1092 | "43": 0.0, 1093 | "49": 0.0, 1094 | "55": 0.0, 1095 | "61": 0.0, 1096 | "67": 0.0, 1097 | "73": 0.0 1098 | } 1099 | }, 1100 | { 1101 | "id": "activation recomp", 1102 | "timePerSample": 0.0, 1103 | "parameterSize": 0.0, 1104 | "memoryUsageA": 0.0, 1105 | "memoryUsageB": 0.0, 1106 | "syncTimeFw": {}, 1107 | "syncTimeBw": { 1108 | "32": 0.0, 1109 | "38": 0.0, 1110 | "43": 0.0, 1111 | "49": 0.0, 1112 | "55": 0.0, 1113 | "61": 0.0, 1114 | "67": 0.0, 1115 | "73": 0.0 1116 | } 1117 | } 1118 | ] 1119 | } 1120 | }, 1121 | { 1122 | "id": 32, 1123 | "name": "node42", 1124 | "TMPCs": { 1125 | "1": [ 1126 | { 1127 | "id": "vanilla", 1128 | "timePerSample": 0.0, 1129 | "parameterSize": 0.0, 1130 | "memoryUsageA": 0.0, 1131 | "memoryUsageB": 0.0, 1132 | "syncTimeFw": { 1133 | "31": 0.0 1134 | }, 1135 | "syncTimeBw": { 1136 | "33": 0.0 1137 | } 1138 | }, 1139 | { 1140 | "id": "activation recomp", 1141 | "timePerSample": 0.0, 1142 | "parameterSize": 0.0, 1143 | "memoryUsageA": 0.0, 1144 | "memoryUsageB": 0.0, 1145 | "syncTimeFw": { 1146 | "31": 0.0 1147 | }, 1148 | "syncTimeBw": { 1149 | "33": 0.0 1150 | } 1151 | } 1152 | ] 1153 | } 1154 | }, 1155 | { 1156 | "id": 33, 1157 | "name": "node43", 1158 | "TMPCs": { 1159 | "1": [ 1160 | { 1161 | "id": "vanilla", 1162 | "timePerSample": 38.296, 1163 | "parameterSize": 41979904.0, 1164 | "memoryUsageA": 27582976.0, 1165 | "memoryUsageB": 83959808.0, 1166 | "syncTimeFw": { 1167 | "30": 0.0, 1168 | "32": 0.0, 1169 | "29": 0.0, 1170 | "2": 0.0 1171 | }, 1172 | "syncTimeBw": { 1173 | "34": 0.0, 1174 | "35": 0.0 1175 | } 1176 | }, 1177 | { 1178 | "id": "activation recomp", 1179 | "timePerSample": 53.506, 1180 | "parameterSize": 41979904.0, 1181 | "memoryUsageA": 2482467.84, 1182 | "memoryUsageB": 109060316.16, 1183 | "syncTimeFw": { 1184 | "30": 0.0, 1185 | "32": 0.0, 1186 | "29": 0.0, 1187 | "2": 0.0 1188 | }, 1189 | "syncTimeBw": { 1190 | "34": 0.0, 1191 | "35": 0.0 1192 | } 1193 | } 1194 | ] 1195 | } 1196 | }, 1197 | { 1198 | "id": 34, 1199 | "name": "node44", 1200 | "TMPCs": { 1201 | "1": [ 1202 | { 1203 | "id": "vanilla", 1204 | "timePerSample": 0.0, 1205 | "parameterSize": 0.0, 1206 | "memoryUsageA": 13107200.0, 1207 | "memoryUsageB": 0.0, 1208 | "syncTimeFw": { 1209 | "33": 0.0 1210 | }, 1211 | "syncTimeBw": { 1212 | "36": 0.0 1213 | } 1214 | }, 1215 | { 1216 | "id": "activation recomp", 1217 | "timePerSample": 0.0, 1218 | "parameterSize": 0.0, 1219 | "memoryUsageA": 1179648.0, 1220 | "memoryUsageB": 11927552.0, 1221 | "syncTimeFw": { 1222 | "33": 0.0 1223 | }, 1224 | "syncTimeBw": { 1225 | "36": 0.0 1226 | } 1227 | } 1228 | ] 1229 | } 1230 | }, 1231 | { 1232 | "id": 35, 1233 | "name": "node46", 1234 | "TMPCs": { 1235 | "1": [ 1236 | { 1237 | "id": "vanilla", 1238 | "timePerSample": 0.0, 1239 | "parameterSize": 0.0, 1240 | "memoryUsageA": 524288.0, 1241 | "memoryUsageB": 0.0, 1242 | "syncTimeFw": { 1243 | "33": 0.0 1244 | }, 1245 | "syncTimeBw": { 1246 | "37": 0.0, 1247 | "42": 0.0, 1248 | "48": 0.0, 1249 | "54": 0.0, 1250 | "60": 0.0, 1251 | "66": 0.0, 1252 | "72": 0.0 1253 | } 1254 | }, 1255 | { 1256 | "id": "activation recomp", 1257 | "timePerSample": 0.0, 1258 | "parameterSize": 0.0, 1259 | "memoryUsageA": 47185.92, 1260 | 
"memoryUsageB": 477102.08, 1261 | "syncTimeFw": { 1262 | "33": 0.0 1263 | }, 1264 | "syncTimeBw": { 1265 | "37": 0.0, 1266 | "42": 0.0, 1267 | "48": 0.0, 1268 | "54": 0.0, 1269 | "60": 0.0, 1270 | "66": 0.0, 1271 | "72": 0.0 1272 | } 1273 | } 1274 | ] 1275 | } 1276 | }, 1277 | { 1278 | "id": 36, 1279 | "name": "node48", 1280 | "TMPCs": { 1281 | "1": [ 1282 | { 1283 | "id": "vanilla", 1284 | "timePerSample": 0.366, 1285 | "parameterSize": 0.0, 1286 | "memoryUsageA": 13107200.0, 1287 | "memoryUsageB": 0.0, 1288 | "syncTimeFw": { 1289 | "34": 0.0 1290 | }, 1291 | "syncTimeBw": { 1292 | "37": 0.0 1293 | } 1294 | }, 1295 | { 1296 | "id": "activation recomp", 1297 | "timePerSample": 0.47, 1298 | "parameterSize": 0.0, 1299 | "memoryUsageA": 1179648.0, 1300 | "memoryUsageB": 11927552.0, 1301 | "syncTimeFw": { 1302 | "34": 0.0 1303 | }, 1304 | "syncTimeBw": { 1305 | "37": 0.0 1306 | } 1307 | } 1308 | ] 1309 | } 1310 | }, 1311 | { 1312 | "id": 37, 1313 | "name": "node49", 1314 | "TMPCs": { 1315 | "1": [ 1316 | { 1317 | "id": "vanilla", 1318 | "timePerSample": 0.0, 1319 | "parameterSize": 0.0, 1320 | "memoryUsageA": 13107200.0, 1321 | "memoryUsageB": 0.0, 1322 | "syncTimeFw": { 1323 | "36": 0.0, 1324 | "35": 0.0 1325 | }, 1326 | "syncTimeBw": { 1327 | "39": 0.0 1328 | } 1329 | }, 1330 | { 1331 | "id": "activation recomp", 1332 | "timePerSample": 0.0, 1333 | "parameterSize": 0.0, 1334 | "memoryUsageA": 1179648.0, 1335 | "memoryUsageB": 11927552.0, 1336 | "syncTimeFw": { 1337 | "36": 0.0, 1338 | "35": 0.0 1339 | }, 1340 | "syncTimeBw": { 1341 | "39": 0.0 1342 | } 1343 | } 1344 | ] 1345 | } 1346 | }, 1347 | { 1348 | "id": 38, 1349 | "name": "node50", 1350 | "TMPCs": { 1351 | "1": [ 1352 | { 1353 | "id": "vanilla", 1354 | "timePerSample": 0.0, 1355 | "parameterSize": 0.0, 1356 | "memoryUsageA": 0.0, 1357 | "memoryUsageB": 0.0, 1358 | "syncTimeFw": { 1359 | "31": 0.0 1360 | }, 1361 | "syncTimeBw": { 1362 | "39": 0.0 1363 | } 1364 | }, 1365 | { 1366 | "id": "activation recomp", 1367 | "timePerSample": 0.0, 1368 | "parameterSize": 0.0, 1369 | "memoryUsageA": 0.0, 1370 | "memoryUsageB": 0.0, 1371 | "syncTimeFw": { 1372 | "31": 0.0 1373 | }, 1374 | "syncTimeBw": { 1375 | "39": 0.0 1376 | } 1377 | } 1378 | ] 1379 | } 1380 | }, 1381 | { 1382 | "id": 39, 1383 | "name": "node51", 1384 | "TMPCs": { 1385 | "1": [ 1386 | { 1387 | "id": "vanilla", 1388 | "timePerSample": 29.261, 1389 | "parameterSize": 50364416.0, 1390 | "memoryUsageA": 14155776.0, 1391 | "memoryUsageB": 100728832.0, 1392 | "syncTimeFw": { 1393 | "37": 0.0, 1394 | "38": 0.0 1395 | }, 1396 | "syncTimeBw": { 1397 | "40": 0.0 1398 | } 1399 | }, 1400 | { 1401 | "id": "activation recomp", 1402 | "timePerSample": 39.778, 1403 | "parameterSize": 50364416.0, 1404 | "memoryUsageA": 1274019.8399999999, 1405 | "memoryUsageB": 113610588.16, 1406 | "syncTimeFw": { 1407 | "37": 0.0, 1408 | "38": 0.0 1409 | }, 1410 | "syncTimeBw": { 1411 | "40": 0.0 1412 | } 1413 | } 1414 | ] 1415 | } 1416 | }, 1417 | { 1418 | "id": 40, 1419 | "name": "node52", 1420 | "TMPCs": { 1421 | "1": [ 1422 | { 1423 | "id": "vanilla", 1424 | "timePerSample": 0.0, 1425 | "parameterSize": 0.0, 1426 | "memoryUsageA": 13107200.0, 1427 | "memoryUsageB": 0.0, 1428 | "syncTimeFw": { 1429 | "39": 0.0 1430 | }, 1431 | "syncTimeBw": { 1432 | "41": 0.0, 1433 | "46": 0.0 1434 | } 1435 | }, 1436 | { 1437 | "id": "activation recomp", 1438 | "timePerSample": 0.0, 1439 | "parameterSize": 0.0, 1440 | "memoryUsageA": 1179648.0, 1441 | "memoryUsageB": 11927552.0, 1442 | "syncTimeFw": { 1443 | "39": 0.0 1444 | 
}, 1445 | "syncTimeBw": { 1446 | "41": 0.0, 1447 | "46": 0.0 1448 | } 1449 | } 1450 | ] 1451 | } 1452 | }, 1453 | { 1454 | "id": 41, 1455 | "name": "node54", 1456 | "TMPCs": { 1457 | "1": [ 1458 | { 1459 | "id": "vanilla", 1460 | "timePerSample": 0.386, 1461 | "parameterSize": 0.0, 1462 | "memoryUsageA": 13107200.0, 1463 | "memoryUsageB": 0.0, 1464 | "syncTimeFw": { 1465 | "40": 0.0 1466 | }, 1467 | "syncTimeBw": { 1468 | "42": 0.0 1469 | } 1470 | }, 1471 | { 1472 | "id": "activation recomp", 1473 | "timePerSample": 0.495, 1474 | "parameterSize": 0.0, 1475 | "memoryUsageA": 1179648.0, 1476 | "memoryUsageB": 11927552.0, 1477 | "syncTimeFw": { 1478 | "40": 0.0 1479 | }, 1480 | "syncTimeBw": { 1481 | "42": 0.0 1482 | } 1483 | } 1484 | ] 1485 | } 1486 | }, 1487 | { 1488 | "id": 42, 1489 | "name": "node55", 1490 | "TMPCs": { 1491 | "1": [ 1492 | { 1493 | "id": "vanilla", 1494 | "timePerSample": 0.0, 1495 | "parameterSize": 0.0, 1496 | "memoryUsageA": 13107200.0, 1497 | "memoryUsageB": 0.0, 1498 | "syncTimeFw": { 1499 | "41": 0.0, 1500 | "35": 0.0 1501 | }, 1502 | "syncTimeBw": { 1503 | "44": 0.0 1504 | } 1505 | }, 1506 | { 1507 | "id": "activation recomp", 1508 | "timePerSample": 0.0, 1509 | "parameterSize": 0.0, 1510 | "memoryUsageA": 1179648.0, 1511 | "memoryUsageB": 11927552.0, 1512 | "syncTimeFw": { 1513 | "41": 0.0, 1514 | "35": 0.0 1515 | }, 1516 | "syncTimeBw": { 1517 | "44": 0.0 1518 | } 1519 | } 1520 | ] 1521 | } 1522 | }, 1523 | { 1524 | "id": 43, 1525 | "name": "node56", 1526 | "TMPCs": { 1527 | "1": [ 1528 | { 1529 | "id": "vanilla", 1530 | "timePerSample": 0.0, 1531 | "parameterSize": 0.0, 1532 | "memoryUsageA": 0.0, 1533 | "memoryUsageB": 0.0, 1534 | "syncTimeFw": { 1535 | "31": 0.0 1536 | }, 1537 | "syncTimeBw": { 1538 | "44": 0.0 1539 | } 1540 | }, 1541 | { 1542 | "id": "activation recomp", 1543 | "timePerSample": 0.0, 1544 | "parameterSize": 0.0, 1545 | "memoryUsageA": 0.0, 1546 | "memoryUsageB": 0.0, 1547 | "syncTimeFw": { 1548 | "31": 0.0 1549 | }, 1550 | "syncTimeBw": { 1551 | "44": 0.0 1552 | } 1553 | } 1554 | ] 1555 | } 1556 | }, 1557 | { 1558 | "id": 44, 1559 | "name": "node57", 1560 | "TMPCs": { 1561 | "1": [ 1562 | { 1563 | "id": "vanilla", 1564 | "timePerSample": 29.324, 1565 | "parameterSize": 50364416.0, 1566 | "memoryUsageA": 14155776.0, 1567 | "memoryUsageB": 100728832.0, 1568 | "syncTimeFw": { 1569 | "42": 0.0, 1570 | "43": 0.0 1571 | }, 1572 | "syncTimeBw": { 1573 | "45": 0.0 1574 | } 1575 | }, 1576 | { 1577 | "id": "activation recomp", 1578 | "timePerSample": 39.877, 1579 | "parameterSize": 50364416.0, 1580 | "memoryUsageA": 1274019.8399999999, 1581 | "memoryUsageB": 113610588.16, 1582 | "syncTimeFw": { 1583 | "42": 0.0, 1584 | "43": 0.0 1585 | }, 1586 | "syncTimeBw": { 1587 | "45": 0.0 1588 | } 1589 | } 1590 | ] 1591 | } 1592 | }, 1593 | { 1594 | "id": 45, 1595 | "name": "node58", 1596 | "TMPCs": { 1597 | "1": [ 1598 | { 1599 | "id": "vanilla", 1600 | "timePerSample": 0.0, 1601 | "parameterSize": 0.0, 1602 | "memoryUsageA": 13107200.0, 1603 | "memoryUsageB": 0.0, 1604 | "syncTimeFw": { 1605 | "44": 0.0 1606 | }, 1607 | "syncTimeBw": { 1608 | "46": 0.0 1609 | } 1610 | }, 1611 | { 1612 | "id": "activation recomp", 1613 | "timePerSample": 0.0, 1614 | "parameterSize": 0.0, 1615 | "memoryUsageA": 1179648.0, 1616 | "memoryUsageB": 11927552.0, 1617 | "syncTimeFw": { 1618 | "44": 0.0 1619 | }, 1620 | "syncTimeBw": { 1621 | "46": 0.0 1622 | } 1623 | } 1624 | ] 1625 | } 1626 | }, 1627 | { 1628 | "id": 46, 1629 | "name": "node60", 1630 | "TMPCs": { 1631 | "1": [ 1632 | { 
1633 | "id": "vanilla", 1634 | "timePerSample": 0.0, 1635 | "parameterSize": 0.0, 1636 | "memoryUsageA": 13107200.0, 1637 | "memoryUsageB": 0.0, 1638 | "syncTimeFw": { 1639 | "45": 0.0, 1640 | "40": 0.0 1641 | }, 1642 | "syncTimeBw": { 1643 | "47": 0.0, 1644 | "52": 0.0 1645 | } 1646 | }, 1647 | { 1648 | "id": "activation recomp", 1649 | "timePerSample": 0.0, 1650 | "parameterSize": 0.0, 1651 | "memoryUsageA": 1179648.0, 1652 | "memoryUsageB": 11927552.0, 1653 | "syncTimeFw": { 1654 | "45": 0.0, 1655 | "40": 0.0 1656 | }, 1657 | "syncTimeBw": { 1658 | "47": 0.0, 1659 | "52": 0.0 1660 | } 1661 | } 1662 | ] 1663 | } 1664 | }, 1665 | { 1666 | "id": 47, 1667 | "name": "node61", 1668 | "TMPCs": { 1669 | "1": [ 1670 | { 1671 | "id": "vanilla", 1672 | "timePerSample": 0.376, 1673 | "parameterSize": 0.0, 1674 | "memoryUsageA": 13107200.0, 1675 | "memoryUsageB": 0.0, 1676 | "syncTimeFw": { 1677 | "46": 0.0 1678 | }, 1679 | "syncTimeBw": { 1680 | "48": 0.0 1681 | } 1682 | }, 1683 | { 1684 | "id": "activation recomp", 1685 | "timePerSample": 0.496, 1686 | "parameterSize": 0.0, 1687 | "memoryUsageA": 1179648.0, 1688 | "memoryUsageB": 11927552.0, 1689 | "syncTimeFw": { 1690 | "46": 0.0 1691 | }, 1692 | "syncTimeBw": { 1693 | "48": 0.0 1694 | } 1695 | } 1696 | ] 1697 | } 1698 | }, 1699 | { 1700 | "id": 48, 1701 | "name": "node62", 1702 | "TMPCs": { 1703 | "1": [ 1704 | { 1705 | "id": "vanilla", 1706 | "timePerSample": 0.0, 1707 | "parameterSize": 0.0, 1708 | "memoryUsageA": 13107200.0, 1709 | "memoryUsageB": 0.0, 1710 | "syncTimeFw": { 1711 | "47": 0.0, 1712 | "35": 0.0 1713 | }, 1714 | "syncTimeBw": { 1715 | "50": 0.0 1716 | } 1717 | }, 1718 | { 1719 | "id": "activation recomp", 1720 | "timePerSample": 0.0, 1721 | "parameterSize": 0.0, 1722 | "memoryUsageA": 1179648.0, 1723 | "memoryUsageB": 11927552.0, 1724 | "syncTimeFw": { 1725 | "47": 0.0, 1726 | "35": 0.0 1727 | }, 1728 | "syncTimeBw": { 1729 | "50": 0.0 1730 | } 1731 | } 1732 | ] 1733 | } 1734 | }, 1735 | { 1736 | "id": 49, 1737 | "name": "node63", 1738 | "TMPCs": { 1739 | "1": [ 1740 | { 1741 | "id": "vanilla", 1742 | "timePerSample": 0.0, 1743 | "parameterSize": 0.0, 1744 | "memoryUsageA": 0.0, 1745 | "memoryUsageB": 0.0, 1746 | "syncTimeFw": { 1747 | "31": 0.0 1748 | }, 1749 | "syncTimeBw": { 1750 | "50": 0.0 1751 | } 1752 | }, 1753 | { 1754 | "id": "activation recomp", 1755 | "timePerSample": 0.0, 1756 | "parameterSize": 0.0, 1757 | "memoryUsageA": 0.0, 1758 | "memoryUsageB": 0.0, 1759 | "syncTimeFw": { 1760 | "31": 0.0 1761 | }, 1762 | "syncTimeBw": { 1763 | "50": 0.0 1764 | } 1765 | } 1766 | ] 1767 | } 1768 | }, 1769 | { 1770 | "id": 50, 1771 | "name": "node64", 1772 | "TMPCs": { 1773 | "1": [ 1774 | { 1775 | "id": "vanilla", 1776 | "timePerSample": 29.527, 1777 | "parameterSize": 50364416.0, 1778 | "memoryUsageA": 14155776.0, 1779 | "memoryUsageB": 100728832.0, 1780 | "syncTimeFw": { 1781 | "48": 0.0, 1782 | "49": 0.0 1783 | }, 1784 | "syncTimeBw": { 1785 | "51": 0.0 1786 | } 1787 | }, 1788 | { 1789 | "id": "activation recomp", 1790 | "timePerSample": 40.251999999999995, 1791 | "parameterSize": 50364416.0, 1792 | "memoryUsageA": 1274019.8399999999, 1793 | "memoryUsageB": 113610588.16, 1794 | "syncTimeFw": { 1795 | "48": 0.0, 1796 | "49": 0.0 1797 | }, 1798 | "syncTimeBw": { 1799 | "51": 0.0 1800 | } 1801 | } 1802 | ] 1803 | } 1804 | }, 1805 | { 1806 | "id": 51, 1807 | "name": "node65", 1808 | "TMPCs": { 1809 | "1": [ 1810 | { 1811 | "id": "vanilla", 1812 | "timePerSample": 0.0, 1813 | "parameterSize": 0.0, 1814 | "memoryUsageA": 
13107200.0, 1815 | "memoryUsageB": 0.0, 1816 | "syncTimeFw": { 1817 | "50": 0.0 1818 | }, 1819 | "syncTimeBw": { 1820 | "52": 0.0 1821 | } 1822 | }, 1823 | { 1824 | "id": "activation recomp", 1825 | "timePerSample": 0.0, 1826 | "parameterSize": 0.0, 1827 | "memoryUsageA": 1179648.0, 1828 | "memoryUsageB": 11927552.0, 1829 | "syncTimeFw": { 1830 | "50": 0.0 1831 | }, 1832 | "syncTimeBw": { 1833 | "52": 0.0 1834 | } 1835 | } 1836 | ] 1837 | } 1838 | }, 1839 | { 1840 | "id": 52, 1841 | "name": "node67", 1842 | "TMPCs": { 1843 | "1": [ 1844 | { 1845 | "id": "vanilla", 1846 | "timePerSample": 0.0, 1847 | "parameterSize": 0.0, 1848 | "memoryUsageA": 13107200.0, 1849 | "memoryUsageB": 0.0, 1850 | "syncTimeFw": { 1851 | "51": 0.0, 1852 | "46": 0.0 1853 | }, 1854 | "syncTimeBw": { 1855 | "53": 0.0, 1856 | "58": 0.0 1857 | } 1858 | }, 1859 | { 1860 | "id": "activation recomp", 1861 | "timePerSample": 0.0, 1862 | "parameterSize": 0.0, 1863 | "memoryUsageA": 1179648.0, 1864 | "memoryUsageB": 11927552.0, 1865 | "syncTimeFw": { 1866 | "51": 0.0, 1867 | "46": 0.0 1868 | }, 1869 | "syncTimeBw": { 1870 | "53": 0.0, 1871 | "58": 0.0 1872 | } 1873 | } 1874 | ] 1875 | } 1876 | }, 1877 | { 1878 | "id": 53, 1879 | "name": "node68", 1880 | "TMPCs": { 1881 | "1": [ 1882 | { 1883 | "id": "vanilla", 1884 | "timePerSample": 0.384, 1885 | "parameterSize": 0.0, 1886 | "memoryUsageA": 13107200.0, 1887 | "memoryUsageB": 0.0, 1888 | "syncTimeFw": { 1889 | "52": 0.0 1890 | }, 1891 | "syncTimeBw": { 1892 | "54": 0.0 1893 | } 1894 | }, 1895 | { 1896 | "id": "activation recomp", 1897 | "timePerSample": 0.505, 1898 | "parameterSize": 0.0, 1899 | "memoryUsageA": 1179648.0, 1900 | "memoryUsageB": 11927552.0, 1901 | "syncTimeFw": { 1902 | "52": 0.0 1903 | }, 1904 | "syncTimeBw": { 1905 | "54": 0.0 1906 | } 1907 | } 1908 | ] 1909 | } 1910 | }, 1911 | { 1912 | "id": 54, 1913 | "name": "node69", 1914 | "TMPCs": { 1915 | "1": [ 1916 | { 1917 | "id": "vanilla", 1918 | "timePerSample": 0.0, 1919 | "parameterSize": 0.0, 1920 | "memoryUsageA": 13107200.0, 1921 | "memoryUsageB": 0.0, 1922 | "syncTimeFw": { 1923 | "53": 0.0, 1924 | "35": 0.0 1925 | }, 1926 | "syncTimeBw": { 1927 | "56": 0.0 1928 | } 1929 | }, 1930 | { 1931 | "id": "activation recomp", 1932 | "timePerSample": 0.0, 1933 | "parameterSize": 0.0, 1934 | "memoryUsageA": 1179648.0, 1935 | "memoryUsageB": 11927552.0, 1936 | "syncTimeFw": { 1937 | "53": 0.0, 1938 | "35": 0.0 1939 | }, 1940 | "syncTimeBw": { 1941 | "56": 0.0 1942 | } 1943 | } 1944 | ] 1945 | } 1946 | }, 1947 | { 1948 | "id": 55, 1949 | "name": "node70", 1950 | "TMPCs": { 1951 | "1": [ 1952 | { 1953 | "id": "vanilla", 1954 | "timePerSample": 0.0, 1955 | "parameterSize": 0.0, 1956 | "memoryUsageA": 0.0, 1957 | "memoryUsageB": 0.0, 1958 | "syncTimeFw": { 1959 | "31": 0.0 1960 | }, 1961 | "syncTimeBw": { 1962 | "56": 0.0 1963 | } 1964 | }, 1965 | { 1966 | "id": "activation recomp", 1967 | "timePerSample": 0.0, 1968 | "parameterSize": 0.0, 1969 | "memoryUsageA": 0.0, 1970 | "memoryUsageB": 0.0, 1971 | "syncTimeFw": { 1972 | "31": 0.0 1973 | }, 1974 | "syncTimeBw": { 1975 | "56": 0.0 1976 | } 1977 | } 1978 | ] 1979 | } 1980 | }, 1981 | { 1982 | "id": 56, 1983 | "name": "node71", 1984 | "TMPCs": { 1985 | "1": [ 1986 | { 1987 | "id": "vanilla", 1988 | "timePerSample": 29.506, 1989 | "parameterSize": 50364416.0, 1990 | "memoryUsageA": 14155776.0, 1991 | "memoryUsageB": 100728832.0, 1992 | "syncTimeFw": { 1993 | "54": 0.0, 1994 | "55": 0.0 1995 | }, 1996 | "syncTimeBw": { 1997 | "57": 0.0 1998 | } 1999 | }, 2000 | { 2001 | 
"id": "activation recomp", 2002 | "timePerSample": 40.149, 2003 | "parameterSize": 50364416.0, 2004 | "memoryUsageA": 1274019.8399999999, 2005 | "memoryUsageB": 113610588.16, 2006 | "syncTimeFw": { 2007 | "54": 0.0, 2008 | "55": 0.0 2009 | }, 2010 | "syncTimeBw": { 2011 | "57": 0.0 2012 | } 2013 | } 2014 | ] 2015 | } 2016 | }, 2017 | { 2018 | "id": 57, 2019 | "name": "node72", 2020 | "TMPCs": { 2021 | "1": [ 2022 | { 2023 | "id": "vanilla", 2024 | "timePerSample": 0.0, 2025 | "parameterSize": 0.0, 2026 | "memoryUsageA": 13107200.0, 2027 | "memoryUsageB": 0.0, 2028 | "syncTimeFw": { 2029 | "56": 0.0 2030 | }, 2031 | "syncTimeBw": { 2032 | "58": 0.0 2033 | } 2034 | }, 2035 | { 2036 | "id": "activation recomp", 2037 | "timePerSample": 0.0, 2038 | "parameterSize": 0.0, 2039 | "memoryUsageA": 1179648.0, 2040 | "memoryUsageB": 11927552.0, 2041 | "syncTimeFw": { 2042 | "56": 0.0 2043 | }, 2044 | "syncTimeBw": { 2045 | "58": 0.0 2046 | } 2047 | } 2048 | ] 2049 | } 2050 | }, 2051 | { 2052 | "id": 58, 2053 | "name": "node74", 2054 | "TMPCs": { 2055 | "1": [ 2056 | { 2057 | "id": "vanilla", 2058 | "timePerSample": 0.0, 2059 | "parameterSize": 0.0, 2060 | "memoryUsageA": 13107200.0, 2061 | "memoryUsageB": 0.0, 2062 | "syncTimeFw": { 2063 | "57": 0.0, 2064 | "52": 0.0 2065 | }, 2066 | "syncTimeBw": { 2067 | "59": 0.0, 2068 | "64": 0.0 2069 | } 2070 | }, 2071 | { 2072 | "id": "activation recomp", 2073 | "timePerSample": 0.0, 2074 | "parameterSize": 0.0, 2075 | "memoryUsageA": 1179648.0, 2076 | "memoryUsageB": 11927552.0, 2077 | "syncTimeFw": { 2078 | "57": 0.0, 2079 | "52": 0.0 2080 | }, 2081 | "syncTimeBw": { 2082 | "59": 0.0, 2083 | "64": 0.0 2084 | } 2085 | } 2086 | ] 2087 | } 2088 | }, 2089 | { 2090 | "id": 59, 2091 | "name": "node75", 2092 | "TMPCs": { 2093 | "1": [ 2094 | { 2095 | "id": "vanilla", 2096 | "timePerSample": 0.394, 2097 | "parameterSize": 0.0, 2098 | "memoryUsageA": 13107200.0, 2099 | "memoryUsageB": 0.0, 2100 | "syncTimeFw": { 2101 | "58": 0.0 2102 | }, 2103 | "syncTimeBw": { 2104 | "60": 0.0 2105 | } 2106 | }, 2107 | { 2108 | "id": "activation recomp", 2109 | "timePerSample": 0.525, 2110 | "parameterSize": 0.0, 2111 | "memoryUsageA": 1179648.0, 2112 | "memoryUsageB": 11927552.0, 2113 | "syncTimeFw": { 2114 | "58": 0.0 2115 | }, 2116 | "syncTimeBw": { 2117 | "60": 0.0 2118 | } 2119 | } 2120 | ] 2121 | } 2122 | }, 2123 | { 2124 | "id": 60, 2125 | "name": "node76", 2126 | "TMPCs": { 2127 | "1": [ 2128 | { 2129 | "id": "vanilla", 2130 | "timePerSample": 0.0, 2131 | "parameterSize": 0.0, 2132 | "memoryUsageA": 13107200.0, 2133 | "memoryUsageB": 0.0, 2134 | "syncTimeFw": { 2135 | "59": 0.0, 2136 | "35": 0.0 2137 | }, 2138 | "syncTimeBw": { 2139 | "62": 0.0 2140 | } 2141 | }, 2142 | { 2143 | "id": "activation recomp", 2144 | "timePerSample": 0.0, 2145 | "parameterSize": 0.0, 2146 | "memoryUsageA": 1179648.0, 2147 | "memoryUsageB": 11927552.0, 2148 | "syncTimeFw": { 2149 | "59": 0.0, 2150 | "35": 0.0 2151 | }, 2152 | "syncTimeBw": { 2153 | "62": 0.0 2154 | } 2155 | } 2156 | ] 2157 | } 2158 | }, 2159 | { 2160 | "id": 61, 2161 | "name": "node77", 2162 | "TMPCs": { 2163 | "1": [ 2164 | { 2165 | "id": "vanilla", 2166 | "timePerSample": 0.0, 2167 | "parameterSize": 0.0, 2168 | "memoryUsageA": 0.0, 2169 | "memoryUsageB": 0.0, 2170 | "syncTimeFw": { 2171 | "31": 0.0 2172 | }, 2173 | "syncTimeBw": { 2174 | "62": 0.0 2175 | } 2176 | }, 2177 | { 2178 | "id": "activation recomp", 2179 | "timePerSample": 0.0, 2180 | "parameterSize": 0.0, 2181 | "memoryUsageA": 0.0, 2182 | "memoryUsageB": 0.0, 2183 | 
"syncTimeFw": { 2184 | "31": 0.0 2185 | }, 2186 | "syncTimeBw": { 2187 | "62": 0.0 2188 | } 2189 | } 2190 | ] 2191 | } 2192 | }, 2193 | { 2194 | "id": 62, 2195 | "name": "node78", 2196 | "TMPCs": { 2197 | "1": [ 2198 | { 2199 | "id": "vanilla", 2200 | "timePerSample": 29.687, 2201 | "parameterSize": 50364416.0, 2202 | "memoryUsageA": 14155776.0, 2203 | "memoryUsageB": 100728832.0, 2204 | "syncTimeFw": { 2205 | "60": 0.0, 2206 | "61": 0.0 2207 | }, 2208 | "syncTimeBw": { 2209 | "63": 0.0 2210 | } 2211 | }, 2212 | { 2213 | "id": "activation recomp", 2214 | "timePerSample": 40.338, 2215 | "parameterSize": 50364416.0, 2216 | "memoryUsageA": 1274019.8399999999, 2217 | "memoryUsageB": 113610588.16, 2218 | "syncTimeFw": { 2219 | "60": 0.0, 2220 | "61": 0.0 2221 | }, 2222 | "syncTimeBw": { 2223 | "63": 0.0 2224 | } 2225 | } 2226 | ] 2227 | } 2228 | }, 2229 | { 2230 | "id": 63, 2231 | "name": "node79", 2232 | "TMPCs": { 2233 | "1": [ 2234 | { 2235 | "id": "vanilla", 2236 | "timePerSample": 0.0, 2237 | "parameterSize": 0.0, 2238 | "memoryUsageA": 13107200.0, 2239 | "memoryUsageB": 0.0, 2240 | "syncTimeFw": { 2241 | "62": 0.0 2242 | }, 2243 | "syncTimeBw": { 2244 | "64": 0.0 2245 | } 2246 | }, 2247 | { 2248 | "id": "activation recomp", 2249 | "timePerSample": 0.0, 2250 | "parameterSize": 0.0, 2251 | "memoryUsageA": 1179648.0, 2252 | "memoryUsageB": 11927552.0, 2253 | "syncTimeFw": { 2254 | "62": 0.0 2255 | }, 2256 | "syncTimeBw": { 2257 | "64": 0.0 2258 | } 2259 | } 2260 | ] 2261 | } 2262 | }, 2263 | { 2264 | "id": 64, 2265 | "name": "node81", 2266 | "TMPCs": { 2267 | "1": [ 2268 | { 2269 | "id": "vanilla", 2270 | "timePerSample": 0.0, 2271 | "parameterSize": 0.0, 2272 | "memoryUsageA": 13107200.0, 2273 | "memoryUsageB": 0.0, 2274 | "syncTimeFw": { 2275 | "63": 0.0, 2276 | "58": 0.0 2277 | }, 2278 | "syncTimeBw": { 2279 | "65": 0.0, 2280 | "70": 0.0 2281 | } 2282 | }, 2283 | { 2284 | "id": "activation recomp", 2285 | "timePerSample": 0.0, 2286 | "parameterSize": 0.0, 2287 | "memoryUsageA": 1179648.0, 2288 | "memoryUsageB": 11927552.0, 2289 | "syncTimeFw": { 2290 | "63": 0.0, 2291 | "58": 0.0 2292 | }, 2293 | "syncTimeBw": { 2294 | "65": 0.0, 2295 | "70": 0.0 2296 | } 2297 | } 2298 | ] 2299 | } 2300 | }, 2301 | { 2302 | "id": 65, 2303 | "name": "node82", 2304 | "TMPCs": { 2305 | "1": [ 2306 | { 2307 | "id": "vanilla", 2308 | "timePerSample": 0.387, 2309 | "parameterSize": 0.0, 2310 | "memoryUsageA": 13107200.0, 2311 | "memoryUsageB": 0.0, 2312 | "syncTimeFw": { 2313 | "64": 0.0 2314 | }, 2315 | "syncTimeBw": { 2316 | "66": 0.0 2317 | } 2318 | }, 2319 | { 2320 | "id": "activation recomp", 2321 | "timePerSample": 0.513, 2322 | "parameterSize": 0.0, 2323 | "memoryUsageA": 1179648.0, 2324 | "memoryUsageB": 11927552.0, 2325 | "syncTimeFw": { 2326 | "64": 0.0 2327 | }, 2328 | "syncTimeBw": { 2329 | "66": 0.0 2330 | } 2331 | } 2332 | ] 2333 | } 2334 | }, 2335 | { 2336 | "id": 66, 2337 | "name": "node83", 2338 | "TMPCs": { 2339 | "1": [ 2340 | { 2341 | "id": "vanilla", 2342 | "timePerSample": 0.0, 2343 | "parameterSize": 0.0, 2344 | "memoryUsageA": 13107200.0, 2345 | "memoryUsageB": 0.0, 2346 | "syncTimeFw": { 2347 | "65": 0.0, 2348 | "35": 0.0 2349 | }, 2350 | "syncTimeBw": { 2351 | "68": 0.0 2352 | } 2353 | }, 2354 | { 2355 | "id": "activation recomp", 2356 | "timePerSample": 0.0, 2357 | "parameterSize": 0.0, 2358 | "memoryUsageA": 1179648.0, 2359 | "memoryUsageB": 11927552.0, 2360 | "syncTimeFw": { 2361 | "65": 0.0, 2362 | "35": 0.0 2363 | }, 2364 | "syncTimeBw": { 2365 | "68": 0.0 2366 | } 2367 | } 2368 | 
] 2369 | } 2370 | }, 2371 | { 2372 | "id": 67, 2373 | "name": "node84", 2374 | "TMPCs": { 2375 | "1": [ 2376 | { 2377 | "id": "vanilla", 2378 | "timePerSample": 0.0, 2379 | "parameterSize": 0.0, 2380 | "memoryUsageA": 0.0, 2381 | "memoryUsageB": 0.0, 2382 | "syncTimeFw": { 2383 | "31": 0.0 2384 | }, 2385 | "syncTimeBw": { 2386 | "68": 0.0 2387 | } 2388 | }, 2389 | { 2390 | "id": "activation recomp", 2391 | "timePerSample": 0.0, 2392 | "parameterSize": 0.0, 2393 | "memoryUsageA": 0.0, 2394 | "memoryUsageB": 0.0, 2395 | "syncTimeFw": { 2396 | "31": 0.0 2397 | }, 2398 | "syncTimeBw": { 2399 | "68": 0.0 2400 | } 2401 | } 2402 | ] 2403 | } 2404 | }, 2405 | { 2406 | "id": 68, 2407 | "name": "node85", 2408 | "TMPCs": { 2409 | "1": [ 2410 | { 2411 | "id": "vanilla", 2412 | "timePerSample": 29.804000000000002, 2413 | "parameterSize": 50364416.0, 2414 | "memoryUsageA": 14155776.0, 2415 | "memoryUsageB": 100728832.0, 2416 | "syncTimeFw": { 2417 | "66": 0.0, 2418 | "67": 0.0 2419 | }, 2420 | "syncTimeBw": { 2421 | "69": 0.0 2422 | } 2423 | }, 2424 | { 2425 | "id": "activation recomp", 2426 | "timePerSample": 40.463, 2427 | "parameterSize": 50364416.0, 2428 | "memoryUsageA": 1274019.8399999999, 2429 | "memoryUsageB": 113610588.16, 2430 | "syncTimeFw": { 2431 | "66": 0.0, 2432 | "67": 0.0 2433 | }, 2434 | "syncTimeBw": { 2435 | "69": 0.0 2436 | } 2437 | } 2438 | ] 2439 | } 2440 | }, 2441 | { 2442 | "id": 69, 2443 | "name": "node86", 2444 | "TMPCs": { 2445 | "1": [ 2446 | { 2447 | "id": "vanilla", 2448 | "timePerSample": 0.0, 2449 | "parameterSize": 0.0, 2450 | "memoryUsageA": 13107200.0, 2451 | "memoryUsageB": 0.0, 2452 | "syncTimeFw": { 2453 | "68": 0.0 2454 | }, 2455 | "syncTimeBw": { 2456 | "70": 0.0 2457 | } 2458 | }, 2459 | { 2460 | "id": "activation recomp", 2461 | "timePerSample": 0.0, 2462 | "parameterSize": 0.0, 2463 | "memoryUsageA": 1179648.0, 2464 | "memoryUsageB": 11927552.0, 2465 | "syncTimeFw": { 2466 | "68": 0.0 2467 | }, 2468 | "syncTimeBw": { 2469 | "70": 0.0 2470 | } 2471 | } 2472 | ] 2473 | } 2474 | }, 2475 | { 2476 | "id": 70, 2477 | "name": "node88", 2478 | "TMPCs": { 2479 | "1": [ 2480 | { 2481 | "id": "vanilla", 2482 | "timePerSample": 0.0, 2483 | "parameterSize": 0.0, 2484 | "memoryUsageA": 13107200.0, 2485 | "memoryUsageB": 0.0, 2486 | "syncTimeFw": { 2487 | "69": 0.0, 2488 | "64": 0.0 2489 | }, 2490 | "syncTimeBw": { 2491 | "71": 0.0, 2492 | "76": 0.0 2493 | } 2494 | }, 2495 | { 2496 | "id": "activation recomp", 2497 | "timePerSample": 0.0, 2498 | "parameterSize": 0.0, 2499 | "memoryUsageA": 1179648.0, 2500 | "memoryUsageB": 11927552.0, 2501 | "syncTimeFw": { 2502 | "69": 0.0, 2503 | "64": 0.0 2504 | }, 2505 | "syncTimeBw": { 2506 | "71": 0.0, 2507 | "76": 0.0 2508 | } 2509 | } 2510 | ] 2511 | } 2512 | }, 2513 | { 2514 | "id": 71, 2515 | "name": "node89", 2516 | "TMPCs": { 2517 | "1": [ 2518 | { 2519 | "id": "vanilla", 2520 | "timePerSample": 0.345, 2521 | "parameterSize": 0.0, 2522 | "memoryUsageA": 13107200.0, 2523 | "memoryUsageB": 0.0, 2524 | "syncTimeFw": { 2525 | "70": 0.0 2526 | }, 2527 | "syncTimeBw": { 2528 | "72": 0.0 2529 | } 2530 | }, 2531 | { 2532 | "id": "activation recomp", 2533 | "timePerSample": 0.489, 2534 | "parameterSize": 0.0, 2535 | "memoryUsageA": 1179648.0, 2536 | "memoryUsageB": 11927552.0, 2537 | "syncTimeFw": { 2538 | "70": 0.0 2539 | }, 2540 | "syncTimeBw": { 2541 | "72": 0.0 2542 | } 2543 | } 2544 | ] 2545 | } 2546 | }, 2547 | { 2548 | "id": 72, 2549 | "name": "node90", 2550 | "TMPCs": { 2551 | "1": [ 2552 | { 2553 | "id": "vanilla", 2554 | 
"timePerSample": 0.0, 2555 | "parameterSize": 0.0, 2556 | "memoryUsageA": 13107200.0, 2557 | "memoryUsageB": 0.0, 2558 | "syncTimeFw": { 2559 | "71": 0.0, 2560 | "35": 0.0 2561 | }, 2562 | "syncTimeBw": { 2563 | "74": 0.0 2564 | } 2565 | }, 2566 | { 2567 | "id": "activation recomp", 2568 | "timePerSample": 0.0, 2569 | "parameterSize": 0.0, 2570 | "memoryUsageA": 1179648.0, 2571 | "memoryUsageB": 11927552.0, 2572 | "syncTimeFw": { 2573 | "71": 0.0, 2574 | "35": 0.0 2575 | }, 2576 | "syncTimeBw": { 2577 | "74": 0.0 2578 | } 2579 | } 2580 | ] 2581 | } 2582 | }, 2583 | { 2584 | "id": 73, 2585 | "name": "node91", 2586 | "TMPCs": { 2587 | "1": [ 2588 | { 2589 | "id": "vanilla", 2590 | "timePerSample": 0.0, 2591 | "parameterSize": 0.0, 2592 | "memoryUsageA": 0.0, 2593 | "memoryUsageB": 0.0, 2594 | "syncTimeFw": { 2595 | "31": 0.0 2596 | }, 2597 | "syncTimeBw": { 2598 | "74": 0.0 2599 | } 2600 | }, 2601 | { 2602 | "id": "activation recomp", 2603 | "timePerSample": 0.0, 2604 | "parameterSize": 0.0, 2605 | "memoryUsageA": 0.0, 2606 | "memoryUsageB": 0.0, 2607 | "syncTimeFw": { 2608 | "31": 0.0 2609 | }, 2610 | "syncTimeBw": { 2611 | "74": 0.0 2612 | } 2613 | } 2614 | ] 2615 | } 2616 | }, 2617 | { 2618 | "id": 74, 2619 | "name": "node92", 2620 | "TMPCs": { 2621 | "1": [ 2622 | { 2623 | "id": "vanilla", 2624 | "timePerSample": 76.825, 2625 | "parameterSize": 50364416.0, 2626 | "memoryUsageA": 14155776.0, 2627 | "memoryUsageB": 100728832.0, 2628 | "syncTimeFw": { 2629 | "72": 0.0, 2630 | "73": 0.0 2631 | }, 2632 | "syncTimeBw": { 2633 | "75": 0.0 2634 | } 2635 | }, 2636 | { 2637 | "id": "activation recomp", 2638 | "timePerSample": 87.436, 2639 | "parameterSize": 50364416.0, 2640 | "memoryUsageA": 1274019.8399999999, 2641 | "memoryUsageB": 113610588.16, 2642 | "syncTimeFw": { 2643 | "72": 0.0, 2644 | "73": 0.0 2645 | }, 2646 | "syncTimeBw": { 2647 | "75": 0.0 2648 | } 2649 | } 2650 | ] 2651 | } 2652 | }, 2653 | { 2654 | "id": 75, 2655 | "name": "node93", 2656 | "TMPCs": { 2657 | "1": [ 2658 | { 2659 | "id": "vanilla", 2660 | "timePerSample": 0.0, 2661 | "parameterSize": 0.0, 2662 | "memoryUsageA": 13107200.0, 2663 | "memoryUsageB": 0.0, 2664 | "syncTimeFw": { 2665 | "74": 0.0 2666 | }, 2667 | "syncTimeBw": { 2668 | "76": 0.0 2669 | } 2670 | }, 2671 | { 2672 | "id": "activation recomp", 2673 | "timePerSample": 0.0, 2674 | "parameterSize": 0.0, 2675 | "memoryUsageA": 1179648.0, 2676 | "memoryUsageB": 11927552.0, 2677 | "syncTimeFw": { 2678 | "74": 0.0 2679 | }, 2680 | "syncTimeBw": { 2681 | "76": 0.0 2682 | } 2683 | } 2684 | ] 2685 | } 2686 | }, 2687 | { 2688 | "id": 76, 2689 | "name": "node95", 2690 | "TMPCs": { 2691 | "1": [ 2692 | { 2693 | "id": "vanilla", 2694 | "timePerSample": 0.0, 2695 | "parameterSize": 0.0, 2696 | "memoryUsageA": 13107200.0, 2697 | "memoryUsageB": 0.0, 2698 | "syncTimeFw": { 2699 | "75": 0.0, 2700 | "70": 0.0 2701 | }, 2702 | "syncTimeBw": { 2703 | "77": 0.0 2704 | } 2705 | }, 2706 | { 2707 | "id": "activation recomp", 2708 | "timePerSample": 0.0, 2709 | "parameterSize": 0.0, 2710 | "memoryUsageA": 1179648.0, 2711 | "memoryUsageB": 11927552.0, 2712 | "syncTimeFw": { 2713 | "75": 0.0, 2714 | "70": 0.0 2715 | }, 2716 | "syncTimeBw": { 2717 | "77": 0.0 2718 | } 2719 | } 2720 | ] 2721 | } 2722 | }, 2723 | { 2724 | "id": 77, 2725 | "name": "node96", 2726 | "TMPCs": { 2727 | "1": [ 2728 | { 2729 | "id": "vanilla", 2730 | "timePerSample": 30.155, 2731 | "parameterSize": 132512000.0, 2732 | "memoryUsageA": 413696000.0, 2733 | "memoryUsageB": 265024000.0, 2734 | "syncTimeFw": { 2735 | 
"76": 0.0 2736 | }, 2737 | "syncTimeBw": {} 2738 | }, 2739 | { 2740 | "id": "activation recomp", 2741 | "timePerSample": 54.937, 2742 | "parameterSize": 132512000.0, 2743 | "memoryUsageA": 37232640.0, 2744 | "memoryUsageB": 641487360.0, 2745 | "syncTimeFw": { 2746 | "76": 0.0 2747 | }, 2748 | "syncTimeBw": {} 2749 | } 2750 | ] 2751 | } 2752 | } 2753 | ], 2754 | "edges": [ 2755 | { 2756 | "sourceId": 0, 2757 | "destId": 1, 2758 | "communicationCost": 13107200.0 2759 | }, 2760 | { 2761 | "sourceId": 2, 2762 | "destId": 1, 2763 | "communicationCost": 0.0 2764 | }, 2765 | { 2766 | "sourceId": 1, 2767 | "destId": 3, 2768 | "communicationCost": 26214400.0 2769 | }, 2770 | { 2771 | "sourceId": 3, 2772 | "destId": 4, 2773 | "communicationCost": 26214400.0 2774 | }, 2775 | { 2776 | "sourceId": 4, 2777 | "destId": 5, 2778 | "communicationCost": 14155776.0 2779 | }, 2780 | { 2781 | "sourceId": 5, 2782 | "destId": 6, 2783 | "communicationCost": 13107200.0 2784 | }, 2785 | { 2786 | "sourceId": 6, 2787 | "destId": 7, 2788 | "communicationCost": 13107200.0 2789 | }, 2790 | { 2791 | "sourceId": 7, 2792 | "destId": 8, 2793 | "communicationCost": 14155776.0 2794 | }, 2795 | { 2796 | "sourceId": 8, 2797 | "destId": 9, 2798 | "communicationCost": 13107200.0 2799 | }, 2800 | { 2801 | "sourceId": 5, 2802 | "destId": 9, 2803 | "communicationCost": 13107200.0 2804 | }, 2805 | { 2806 | "sourceId": 9, 2807 | "destId": 10, 2808 | "communicationCost": 13107200.0 2809 | }, 2810 | { 2811 | "sourceId": 10, 2812 | "destId": 11, 2813 | "communicationCost": 13107200.0 2814 | }, 2815 | { 2816 | "sourceId": 11, 2817 | "destId": 12, 2818 | "communicationCost": 14155776.0 2819 | }, 2820 | { 2821 | "sourceId": 12, 2822 | "destId": 13, 2823 | "communicationCost": 13107200.0 2824 | }, 2825 | { 2826 | "sourceId": 9, 2827 | "destId": 13, 2828 | "communicationCost": 13107200.0 2829 | }, 2830 | { 2831 | "sourceId": 13, 2832 | "destId": 14, 2833 | "communicationCost": 13107200.0 2834 | }, 2835 | { 2836 | "sourceId": 14, 2837 | "destId": 15, 2838 | "communicationCost": 13107200.0 2839 | }, 2840 | { 2841 | "sourceId": 15, 2842 | "destId": 16, 2843 | "communicationCost": 14155776.0 2844 | }, 2845 | { 2846 | "sourceId": 16, 2847 | "destId": 17, 2848 | "communicationCost": 13107200.0 2849 | }, 2850 | { 2851 | "sourceId": 13, 2852 | "destId": 17, 2853 | "communicationCost": 13107200.0 2854 | }, 2855 | { 2856 | "sourceId": 17, 2857 | "destId": 18, 2858 | "communicationCost": 13107200.0 2859 | }, 2860 | { 2861 | "sourceId": 18, 2862 | "destId": 19, 2863 | "communicationCost": 13107200.0 2864 | }, 2865 | { 2866 | "sourceId": 19, 2867 | "destId": 20, 2868 | "communicationCost": 14155776.0 2869 | }, 2870 | { 2871 | "sourceId": 20, 2872 | "destId": 21, 2873 | "communicationCost": 13107200.0 2874 | }, 2875 | { 2876 | "sourceId": 17, 2877 | "destId": 21, 2878 | "communicationCost": 13107200.0 2879 | }, 2880 | { 2881 | "sourceId": 21, 2882 | "destId": 22, 2883 | "communicationCost": 13107200.0 2884 | }, 2885 | { 2886 | "sourceId": 22, 2887 | "destId": 23, 2888 | "communicationCost": 13107200.0 2889 | }, 2890 | { 2891 | "sourceId": 23, 2892 | "destId": 24, 2893 | "communicationCost": 14155776.0 2894 | }, 2895 | { 2896 | "sourceId": 24, 2897 | "destId": 25, 2898 | "communicationCost": 13107200.0 2899 | }, 2900 | { 2901 | "sourceId": 21, 2902 | "destId": 25, 2903 | "communicationCost": 13107200.0 2904 | }, 2905 | { 2906 | "sourceId": 25, 2907 | "destId": 26, 2908 | "communicationCost": 13107200.0 2909 | }, 2910 | { 2911 | "sourceId": 26, 2912 | 
"destId": 27, 2913 | "communicationCost": 13107200.0 2914 | }, 2915 | { 2916 | "sourceId": 27, 2917 | "destId": 28, 2918 | "communicationCost": 14155776.0 2919 | }, 2920 | { 2921 | "sourceId": 28, 2922 | "destId": 29, 2923 | "communicationCost": 13107200.0 2924 | }, 2925 | { 2926 | "sourceId": 25, 2927 | "destId": 29, 2928 | "communicationCost": 13107200.0 2929 | }, 2930 | { 2931 | "sourceId": 31, 2932 | "destId": 32, 2933 | "communicationCost": 0.0 2934 | }, 2935 | { 2936 | "sourceId": 30, 2937 | "destId": 33, 2938 | "communicationCost": 13107200.0 2939 | }, 2940 | { 2941 | "sourceId": 32, 2942 | "destId": 33, 2943 | "communicationCost": 0.0 2944 | }, 2945 | { 2946 | "sourceId": 29, 2947 | "destId": 33, 2948 | "communicationCost": 13107200.0 2949 | }, 2950 | { 2951 | "sourceId": 2, 2952 | "destId": 33, 2953 | "communicationCost": 0.0 2954 | }, 2955 | { 2956 | "sourceId": 33, 2957 | "destId": 34, 2958 | "communicationCost": 27582976.0 2959 | }, 2960 | { 2961 | "sourceId": 33, 2962 | "destId": 35, 2963 | "communicationCost": 27582976.0 2964 | }, 2965 | { 2966 | "sourceId": 34, 2967 | "destId": 36, 2968 | "communicationCost": 13107200.0 2969 | }, 2970 | { 2971 | "sourceId": 36, 2972 | "destId": 37, 2973 | "communicationCost": 13107200.0 2974 | }, 2975 | { 2976 | "sourceId": 35, 2977 | "destId": 37, 2978 | "communicationCost": 524288.0 2979 | }, 2980 | { 2981 | "sourceId": 31, 2982 | "destId": 38, 2983 | "communicationCost": 0.0 2984 | }, 2985 | { 2986 | "sourceId": 37, 2987 | "destId": 39, 2988 | "communicationCost": 13107200.0 2989 | }, 2990 | { 2991 | "sourceId": 38, 2992 | "destId": 39, 2993 | "communicationCost": 0.0 2994 | }, 2995 | { 2996 | "sourceId": 39, 2997 | "destId": 40, 2998 | "communicationCost": 14155776.0 2999 | }, 3000 | { 3001 | "sourceId": 40, 3002 | "destId": 41, 3003 | "communicationCost": 13107200.0 3004 | }, 3005 | { 3006 | "sourceId": 41, 3007 | "destId": 42, 3008 | "communicationCost": 13107200.0 3009 | }, 3010 | { 3011 | "sourceId": 35, 3012 | "destId": 42, 3013 | "communicationCost": 524288.0 3014 | }, 3015 | { 3016 | "sourceId": 31, 3017 | "destId": 43, 3018 | "communicationCost": 0.0 3019 | }, 3020 | { 3021 | "sourceId": 42, 3022 | "destId": 44, 3023 | "communicationCost": 13107200.0 3024 | }, 3025 | { 3026 | "sourceId": 43, 3027 | "destId": 44, 3028 | "communicationCost": 0.0 3029 | }, 3030 | { 3031 | "sourceId": 44, 3032 | "destId": 45, 3033 | "communicationCost": 14155776.0 3034 | }, 3035 | { 3036 | "sourceId": 45, 3037 | "destId": 46, 3038 | "communicationCost": 13107200.0 3039 | }, 3040 | { 3041 | "sourceId": 40, 3042 | "destId": 46, 3043 | "communicationCost": 13107200.0 3044 | }, 3045 | { 3046 | "sourceId": 46, 3047 | "destId": 47, 3048 | "communicationCost": 13107200.0 3049 | }, 3050 | { 3051 | "sourceId": 47, 3052 | "destId": 48, 3053 | "communicationCost": 13107200.0 3054 | }, 3055 | { 3056 | "sourceId": 35, 3057 | "destId": 48, 3058 | "communicationCost": 524288.0 3059 | }, 3060 | { 3061 | "sourceId": 31, 3062 | "destId": 49, 3063 | "communicationCost": 0.0 3064 | }, 3065 | { 3066 | "sourceId": 48, 3067 | "destId": 50, 3068 | "communicationCost": 13107200.0 3069 | }, 3070 | { 3071 | "sourceId": 49, 3072 | "destId": 50, 3073 | "communicationCost": 0.0 3074 | }, 3075 | { 3076 | "sourceId": 50, 3077 | "destId": 51, 3078 | "communicationCost": 14155776.0 3079 | }, 3080 | { 3081 | "sourceId": 51, 3082 | "destId": 52, 3083 | "communicationCost": 13107200.0 3084 | }, 3085 | { 3086 | "sourceId": 46, 3087 | "destId": 52, 3088 | "communicationCost": 13107200.0 
3089 | }, 3090 | { 3091 | "sourceId": 52, 3092 | "destId": 53, 3093 | "communicationCost": 13107200.0 3094 | }, 3095 | { 3096 | "sourceId": 53, 3097 | "destId": 54, 3098 | "communicationCost": 13107200.0 3099 | }, 3100 | { 3101 | "sourceId": 35, 3102 | "destId": 54, 3103 | "communicationCost": 524288.0 3104 | }, 3105 | { 3106 | "sourceId": 31, 3107 | "destId": 55, 3108 | "communicationCost": 0.0 3109 | }, 3110 | { 3111 | "sourceId": 54, 3112 | "destId": 56, 3113 | "communicationCost": 13107200.0 3114 | }, 3115 | { 3116 | "sourceId": 55, 3117 | "destId": 56, 3118 | "communicationCost": 0.0 3119 | }, 3120 | { 3121 | "sourceId": 56, 3122 | "destId": 57, 3123 | "communicationCost": 14155776.0 3124 | }, 3125 | { 3126 | "sourceId": 57, 3127 | "destId": 58, 3128 | "communicationCost": 13107200.0 3129 | }, 3130 | { 3131 | "sourceId": 52, 3132 | "destId": 58, 3133 | "communicationCost": 13107200.0 3134 | }, 3135 | { 3136 | "sourceId": 58, 3137 | "destId": 59, 3138 | "communicationCost": 13107200.0 3139 | }, 3140 | { 3141 | "sourceId": 59, 3142 | "destId": 60, 3143 | "communicationCost": 13107200.0 3144 | }, 3145 | { 3146 | "sourceId": 35, 3147 | "destId": 60, 3148 | "communicationCost": 524288.0 3149 | }, 3150 | { 3151 | "sourceId": 31, 3152 | "destId": 61, 3153 | "communicationCost": 0.0 3154 | }, 3155 | { 3156 | "sourceId": 60, 3157 | "destId": 62, 3158 | "communicationCost": 13107200.0 3159 | }, 3160 | { 3161 | "sourceId": 61, 3162 | "destId": 62, 3163 | "communicationCost": 0.0 3164 | }, 3165 | { 3166 | "sourceId": 62, 3167 | "destId": 63, 3168 | "communicationCost": 14155776.0 3169 | }, 3170 | { 3171 | "sourceId": 63, 3172 | "destId": 64, 3173 | "communicationCost": 13107200.0 3174 | }, 3175 | { 3176 | "sourceId": 58, 3177 | "destId": 64, 3178 | "communicationCost": 13107200.0 3179 | }, 3180 | { 3181 | "sourceId": 64, 3182 | "destId": 65, 3183 | "communicationCost": 13107200.0 3184 | }, 3185 | { 3186 | "sourceId": 65, 3187 | "destId": 66, 3188 | "communicationCost": 13107200.0 3189 | }, 3190 | { 3191 | "sourceId": 35, 3192 | "destId": 66, 3193 | "communicationCost": 524288.0 3194 | }, 3195 | { 3196 | "sourceId": 31, 3197 | "destId": 67, 3198 | "communicationCost": 0.0 3199 | }, 3200 | { 3201 | "sourceId": 66, 3202 | "destId": 68, 3203 | "communicationCost": 13107200.0 3204 | }, 3205 | { 3206 | "sourceId": 67, 3207 | "destId": 68, 3208 | "communicationCost": 0.0 3209 | }, 3210 | { 3211 | "sourceId": 68, 3212 | "destId": 69, 3213 | "communicationCost": 14155776.0 3214 | }, 3215 | { 3216 | "sourceId": 69, 3217 | "destId": 70, 3218 | "communicationCost": 13107200.0 3219 | }, 3220 | { 3221 | "sourceId": 64, 3222 | "destId": 70, 3223 | "communicationCost": 13107200.0 3224 | }, 3225 | { 3226 | "sourceId": 70, 3227 | "destId": 71, 3228 | "communicationCost": 13107200.0 3229 | }, 3230 | { 3231 | "sourceId": 71, 3232 | "destId": 72, 3233 | "communicationCost": 13107200.0 3234 | }, 3235 | { 3236 | "sourceId": 35, 3237 | "destId": 72, 3238 | "communicationCost": 524288.0 3239 | }, 3240 | { 3241 | "sourceId": 31, 3242 | "destId": 73, 3243 | "communicationCost": 0.0 3244 | }, 3245 | { 3246 | "sourceId": 72, 3247 | "destId": 74, 3248 | "communicationCost": 13107200.0 3249 | }, 3250 | { 3251 | "sourceId": 73, 3252 | "destId": 74, 3253 | "communicationCost": 0.0 3254 | }, 3255 | { 3256 | "sourceId": 74, 3257 | "destId": 75, 3258 | "communicationCost": 14155776.0 3259 | }, 3260 | { 3261 | "sourceId": 75, 3262 | "destId": 76, 3263 | "communicationCost": 13107200.0 3264 | }, 3265 | { 3266 | "sourceId": 70, 3267 | 
"destId": 76, 3268 | "communicationCost": 13107200.0 3269 | }, 3270 | { 3271 | "sourceId": 76, 3272 | "destId": 77, 3273 | "communicationCost": 13107200.0 3274 | } 3275 | ] 3276 | } --------------------------------------------------------------------------------