├── .gitignore ├── ABOUT.md ├── LICENSE ├── README.md └── img ├── figure2.svg ├── figure3.svg ├── figure4.svg ├── figure5.svg ├── figure6.svg ├── figure7.svg ├── figure8.svg └── figure9.svg /.gitignore: -------------------------------------------------------------------------------- 1 | UNTRACKED/ 2 | -------------------------------------------------------------------------------- /ABOUT.md: -------------------------------------------------------------------------------- 1 | # About this Repository 2 | 3 | I started off simply taking notes on the [TensorFlow white paper](http://download.tensorflow.org/paper/whitepaper2015.pdf), but as I worked I started putting more and more time into finding/linking to reference material in the TensorFlow documentation and resources. Additionally, I attempted to increase my understanding of the paper by re-phrasing certain sections as well as translating some of the algorithms from paragraph-form to ordered lists. 4 | 5 | Reading the white paper has improved my comfort with the APIs dramatically, and I hope that others will benefit from my notes. 6 | 7 | ## Features 8 | 9 | * Notes broken down section by section, as well as subsection by subsection 10 | * Relevant links to documentation, resources, and references throughout 11 | * SVG versions of figures/graphs 12 | * So many bullet points! 13 | 14 | ### To-do list 15 | 16 | * Create and utilize anchor tags throughout notes for self-navigating 17 | 18 | ## Contributions 19 | 20 | Please feel free to submit pull requests for corrections, improved readability, consistency in terminology, additional graphs or whatever else you can think of. 21 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Sam Abrahams 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TensorFlow White Paper Notes 2 | 3 | ## Features 4 | 5 | * Notes broken down section by section, as well as subsection by subsection 6 | * Relevant links to documentation, resources, and references throughout 7 | * SVG versions of figures/graphs 8 | * So many bullet points! 
9 | 10 | ### To-do list 11 | 12 | * Create and utilize anchor tags throughout notes for self-navigating 13 | 14 | _[White Paper available at this link](http://download.tensorflow.org/paper/whitepaper2015.pdf)_ 15 | 16 | - - - 17 | 18 | # TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems 19 | 20 | ## Abstract 21 | 22 | * [**TensorFlow**](https://www.tensorflow.org/) is both an interface for expressing machine learning algorithms and an implementation for executing them 23 | * Code can be transported across various machine architectures with little to no modification 24 | * Has been used at Google for all manner of machine learning tasks 25 | * [Reference implementation and API released under Apache 2.0 license](https://github.com/tensorflow/tensorflow) 26 | 27 | - - - 28 | 29 | ## 1 Introduction 30 | 31 | * Google Brain started in 2011, and [**DistBelief**](http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf) was its first-generation scalable, distributed machine learning system 32 | * DistBelief was used for a large number of research and commercial tasks 33 | * TensorFlow, Google's second-generation machine learning system, was designed from lessons learned in the process of engineering and using DistBelief 34 | * The TensorFlow API is used to describe a dataflow-like model, and the implementation then maps those models onto the underlying machine hardware 35 | * This allows users to have a single system that runs on a broad spectrum of machines, reducing the overhead caused by rewriting code for different hardware 36 | * Focus of development was to maintain flexibility for research purposes while attaining enough performance to be used in production 37 | * Can express various types of parallelism by replicating the dataflow model across multiple machines and running the replicas in parallel 38 | * Some functions within TensorFlow allow for less consistency in parallelism if desired 39 | * Larger, multi-machine deployments of TensorFlow can take advantage of these less-strict synchronization requirements 40 | * TensorFlow is more flexible, faster, and supports more machine learning models than DistBelief 41 | 42 | - - - 43 | 44 | ## 2 Programming Model and Basic Concepts 45 | 46 | * TensorFlow computations are represented by [_directed graphs_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Graph), which are composed of _nodes_ 47 | * Some nodes are able to maintain and update a persistent state and/or have branching and looping structures 48 | * This branching/looping is modeled similarly to [MSR's Naiad](http://research.microsoft.com:8082/pubs/201100/naiad_sosp2013.pdf) 49 | * Graphs are constructed using supported front-end languages (C++/Python as of writing) 50 | * A node has zero or more inputs/outputs, and it represents an [_operation_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Operation) 51 | * Values of 'normal' edges (the connection from one node's output to another node's input) are [_tensors_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Tensor), n-dimensional arrays. 52 | * The type of each element in the tensor is inferred while the graph is being constructed, prior to execution 53 | * There are 'special' edges, called [_control dependencies_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Graph#control_dependencies): no model data is transferred on these edges; rather, they indicate that the source node must finish execution before the destination node begins execution (see the short example after this list) 54 | * Can be thought of as a baton in a relay race. Attaching a control dependency means that the next node can't begin running until the previous node 'hands off' the baton. 55 | * Used by the client to enforce happens-before relations 56 | * Used in the reference implementation to manage memory usage 57 | 58 | 59 | 60 |
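To make the control-dependency idea concrete, here is a minimal sketch using the Python front-end. The variable and tensor names are arbitrary, and the snippet assumes the 0.x-era API (`tf.Session`, `tf.initialize_all_variables`):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, name="x")
step = tf.Variable(0, name="step")
increment_step = tf.assign_add(step, 1)

# A control-dependency edge: no tensor flows from `increment_step` to `y`,
# but `y` may not execute until the increment has finished.
with tf.control_dependencies([increment_step]):
    y = x * 2.0

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(y, feed_dict={x: 3.0}))  # 6.0 -- and `step` was incremented first
    print(sess.run(step))                   # 1
```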
### Operations and Kernels 61 | 62 | * Operations have names and represent an abstract computation, such as ["matrix multiply"](https://www.tensorflow.org/versions/master/api_docs/python/tf/matmul) or ["add"](https://www.tensorflow.org/versions/master/api_docs/python/tf/add) 63 | * Operations can optionally require _attributes_. Attributes must be explicitly provided or be possible to infer prior to running the graph 64 | * A common use of attributes is to declare which data type the operation is being performed with (e.g. float tensors vs. int32 tensors) 65 | * A _kernel_ is an implementation of an operation designed for a specific type of device, such as CPU or GPU 66 | * The TensorFlow library includes several built-in operations/kernels. The table below lists some of them: 67 | 68 | Category | Examples 69 | ---|--- 70 | Element-wise mathematical operations | Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal 71 | Array operations | Concat, Slice, Split, Constant, Rank, Shape, Shuffle 72 | Matrix operations | MatMul, MatrixInverse, MatrixDeterminant 73 | Stateful operations | Variable, Assign, AssignAdd 74 | Neural-net building blocks | SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool 75 | Checkpointing operations | Save, Restore 76 | Queue and synchronization operations | Enqueue, Dequeue, MutexAcquire, MutexRelease 77 | Control flow operations | Merge, Switch, Enter, Leave, NextIteration 78 | 79 | _Check out [this directory in the TensorFlow repository](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/kernels) for kernel implementations_ 80 | 81 | ### Sessions 82 | 83 | * Clients interact with TensorFlow by creating a [_Session_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session), which supports two main functions: _Extend_ and [_Run_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session#run) 84 | * The Extend method adds additional nodes and edges to the existing dataflow model 85 | * _Note: Extend is called automatically by TensorFlow, not directly by the client_ 86 | * Run takes as arguments a set of named nodes to be computed as well as an optional set of tensors to be used in place of certain node outputs. It then uses the graph to figure out all nodes required to compute the requested outputs and executes them in an order that respects their dependencies. 87 | * Most TensorFlow programs set up a graph within a Session once, and then run the full graph or subsets of the graph multiple times (see the example below).
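A minimal sketch of the Session workflow described above (the node names are arbitrary): Extend happens implicitly as nodes are added to the graph, and Run computes only what is needed for the requested outputs.

```python
import tensorflow as tf

a = tf.constant(2.0, name="a")
b = tf.constant(3.0, name="b")
c = tf.add(a, b, name="c")

with tf.Session() as sess:
    # Run the subgraph needed for `c`; the result is returned to the client.
    print(sess.run(c))        # 5.0
    # Multiple fetches can be requested in a single Run call.
    print(sess.run([a, c]))   # [2.0, 5.0]
```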
88 | 89 | ### Variables 90 | 91 | * A [_Variable_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Variable) is a handle to a persistent and mutable tensor which survives each execution of a graph 92 | * For ML tasks, learned parameters are usually held in TensorFlow Variables 93 | 94 | _See the official [How-To](https://www.tensorflow.org/versions/master/how_tos/variables/index.html) to learn more about TensorFlow Variables_ 95 | 96 | - - - 97 | 98 | ## 3 Implementation 99 | 100 | * There are three primary components in a TensorFlow system: the _client_, the _master_, and _worker processes_ 101 | * The client uses a Session interface to communicate with the master 102 | * The master schedules and coordinates worker processes and relays results back to the client 103 | * Worker processes are responsible for maintaining access to devices such as CPU/GPU cores and for executing graph nodes on their respective devices 104 | * There are both local and distributed implementations of TensorFlow, ~~but only the local version has been open-sourced as of writing~~ 105 | * **Update as of February 2016:** The initial open-source implementation of the TensorFlow distributed runtime is [available on the TensorFlow GitHub repository.](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/distributed_runtime) However, using it at this time requires building TensorFlow from source, and full API documentation is not yet available. 106 | 107 | ### Devices 108 | 109 | * Each device has both a device type and a name 110 | * Names are composed of the device's type, its index within its worker process, and (when used in a distributed setting) an identification of the job and task of the worker process 111 | * Example device names: 112 | Local: `/job:localhost/device:cpu:0` 113 | Distributed: `/job:worker/task:17/device:gpu:3` 114 | * A device object manages its device's memory and executes kernels as requested 115 | 116 | ### Tensors 117 | 118 | * Typed, multi-dimensional arrays 119 | * Memory management of tensors is handled automatically 120 | * Available types (from the [TensorFlow documentation](https://www.tensorflow.org/versions/master/programmers_guide/dims_types#data_types)): 121 | 122 | Data type | Python type | Description 123 | --- | --- | --- 124 | `DT_FLOAT` | `tf.float32` | 32-bit floating point 125 | `DT_DOUBLE` | `tf.float64` | 64-bit floating point 126 | `DT_INT8` | `tf.int8` | 8-bit signed integer 127 | `DT_INT16` | `tf.int16` | 16-bit signed integer 128 | `DT_INT32` | `tf.int32` | 32-bit signed integer 129 | `DT_INT64` | `tf.int64` | 64-bit signed integer 130 | `DT_UINT8` | `tf.uint8` | 8-bit unsigned integer 131 | `DT_STRING` | `tf.string` | Variable-length byte array. Each element of a Tensor is a byte array 132 | `DT_BOOL` | `tf.bool` | Boolean 133 | `DT_COMPLEX64` | `tf.complex64` | Complex number made of two 32-bit floating point parts: real and imaginary 134 | `DT_QINT8` | `tf.qint8` | 8-bit signed integer used in quantized Ops 135 | `DT_QINT32` | `tf.qint32` | 32-bit signed integer used in quantized Ops 136 | `DT_QUINT8` | `tf.quint8` | 8-bit unsigned integer used in quantized Ops 137 | 138 | ## 3.1 Single-Device Execution 139 | 140 | _**NOTE:** To reiterate: in this context, "single device" means using a single CPU core or single GPU, **not** a single machine. Similarly, "multi-device" does not refer to multiple machines, but to multiple CPU cores and/or GPUs.
See "3.3 Distributed Execution" for a discussion of multiple machines._ 141 | 142 | * Overview of the execution of a single-worker process, single-device job: 143 | 1. All nodes required to compute the desired output node(s) are determined 144 | 2. Each node is given a count of dependencies that need to be completed before it can begin execution 145 | 3. When a node's dependency count is zero, it is added to a ready queue 146 | 4. The ready queue delegates node kernel execution to device objects 147 | 5. When a node completes execution, the counts of all dependent nodes are decremented 148 | 6. Repeat steps 3-5 until the desired output is computed 149 | 150 | ## 3.2 Multi-Device Execution 151 | 152 | * There are two main challenges introduced when using multiple devices: 153 | * Deciding which device should process each node 154 | * Managing communication between devices as necessary after assigning nodes 155 | 156 | ### Node Placement 157 | 158 | * One of the main responsibilities of the TensorFlow implementation is to map computation onto available devices 159 | * The following is a simplified version of this mapping algorithm: 160 | 1. A cost model is input into the algorithm 161 | * The cost model contains estimates of the sizes of the input/output tensors (in bytes) and the estimated computation time for each node in the graph 162 | 2. Using the cost model, the algorithm simulates an execution of the graph to make node-placement decisions as described below: 163 | 1. Starting with the source nodes, a set of feasible devices is considered for each node ready to be executed 164 | * A "feasible" device is one that has a kernel implementation for the given operation 165 | * A node is ready for execution once its dependencies have finished running 166 | 2. If a node has multiple feasible devices, the computation time of the node is examined with respect to placing the node on each possible device 167 | * This examination takes into account the execution time of the operation (given the device type), as well as the costs of possibly introducing communication between devices. 168 | 3. The device that would finish the operation the soonest is selected as the node's device. 169 | 4. Repeat steps 1-3 for each node in the graph execution until all nodes have been allocated to devices 170 | 3. After the simulation, the real execution runs using the node-placement decisions made during the simulation 171 | * Section 4.3 will describe some extensions to help guide the placement algorithm 172 | * Improving the placement algorithm is an ongoing area of development as of writing 173 | 174 | _**NOTE:** At the moment, node placement is done by a [simple_placer class](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/simple_placer.h) which only considers explicit placement requirements provided by the user and implicit colocation constraints based on node type ([see documentation comments for details](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/simple_placer.h#L32-L41))_ 175 | 176 |
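The greedy simulation can be illustrated with a toy sketch. Everything below is made up for illustration (the three-node graph, the cost numbers, and the flat transfer cost); it is not TensorFlow's actual placer, which lives in the C++ runtime:

```python
# Toy greedy placement: pick, per node, the device that would finish it soonest.
nodes = ["matmul", "add", "relu"]                      # already in dependency order
deps = {"matmul": [], "add": ["matmul"], "relu": ["add"]}
compute_time = {                                       # (node, device) -> estimated seconds
    ("matmul", "gpu:0"): 0.0010, ("matmul", "cpu:0"): 0.0100,
    ("add",    "gpu:0"): 0.0002, ("add",    "cpu:0"): 0.0005,
    ("relu",   "gpu:0"): 0.0002, ("relu",   "cpu:0"): 0.0005,
}
transfer_time = 0.002                                  # flat cross-device copy cost

placement, finish = {}, {}
for n in nodes:
    best = None
    for d in ("cpu:0", "gpu:0"):
        comm = sum(transfer_time for p in deps[n] if placement[p] != d)
        ready = max((finish[p] for p in deps[n]), default=0.0)
        total = ready + comm + compute_time[(n, d)]
        if best is None or total < best[0]:
            best = (total, d)
    finish[n], placement[n] = best
print(placement)   # {'matmul': 'gpu:0', 'add': 'gpu:0', 'relu': 'gpu:0'}
```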
### Cross-Device Communication 177 | 178 | * After the nodes have been placed onto their respective devices, the execution graph is split into subgraphs, one per device 179 | * Any edge between nodes on different devices is replaced by two new edges: 180 | * The outputting node will have an edge between it and a new _Send_ node, placed within the subgraph of its device 181 | * The receiving node will have an edge between it and a new _Receive_ node, placed within the subgraph of its device 182 | * See Figure 4 for an illustration of adding Send/Receive nodes 183 | * The Send and Receive nodes coordinate data transfer across devices 184 | * This isolates cross-device communication to the implementation of the Send and Receive nodes 185 | * All users of a particular tensor on a particular device use a single Receive node, as opposed to having one Receive node per user per device. This minimizes data transmission between devices as well as memory allocated on the receiving device 186 | * This means that any given device should receive the output of any given operation only once, and should store that output only once in memory 187 | * This method of communication also allows individual node scheduling to be handled by the worker processes as opposed to the master 188 | * The Send and Receive nodes provide synchronization between worker processes and devices, which enables the master to issue only a single Run request per graph execution per worker process 189 | * This improves scalability and allows fine-grained control over node execution 190 | 191 | ## 3.3 Distributed Execution 192 | 193 | * Similar to multi-device execution. Send/Receive nodes that communicate across worker processes use mechanisms such as TCP or RDMA to transmit data from machine to machine 194 | 195 | ### Fault Tolerance 196 | 197 | * There are two main ways failures are detected: 198 | * Errors between Send and Receive nodes 199 | * Periodic health-checks from the master process to the worker processes 200 | * When a failure is detected, the entire graph execution is aborted and restarted 201 | * Because TensorFlow Variables persist across graph executions, there are mechanisms to save and restore their state 202 | * Each Variable node is connected to a Save node, which periodically writes the contents of the Variable to persistent storage 203 | * Additionally, each Variable is connected to a Restore node, which is enabled when the graph execution is restarted. The Restore node reads state data from persistent storage and applies it to the Variable node 204 |
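In the Python API, this checkpointing machinery is exposed through `tf.train.Saver`, which adds save/restore ops for the graph's Variables. A minimal sketch (the checkpoint path is arbitrary):

```python
import tensorflow as tf

weights = tf.Variable(tf.random_uniform([784, 100], -1, 1), name="weights")
saver = tf.train.Saver()  # adds save/restore ops for all Variables in the graph

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # ... training steps would run here ...
    saver.save(sess, "/tmp/model.ckpt")      # write Variable contents to disk

# Later, or after a failure, a new session can restore the saved state:
with tf.Session() as sess:
    saver.restore(sess, "/tmp/model.ckpt")   # no initializer needed after restore
```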
205 | - - - 206 | 207 | ## 4 Extensions 208 | 209 | _The following subsections describe advanced features and extensions of the programming model introduced in Section 2_ 210 | 211 | ## 4.1 Gradient Computation 212 | 213 | * TensorFlow has built-in gradient computation 214 | * If a tensor, _C_, depends on a set of previous nodes, the gradient of _C_ with respect to those previous nodes can be automatically computed with a built-in function, even if there are many layers in between them 215 | * See [`tf.gradients()`](https://www.tensorflow.org/versions/master/api_docs/python/train.html#gradients) for usage 216 | * Gradients are computed by creating additional nodes and edges in the graph as described below (see Figure 5): 217 | * When computing the gradient of a tensor, _C_, with respect to some dependency, _I_, TensorFlow first finds the forward path from _I_ to _C_ in the model graph. This is shown as the left-hand side of Figure 5 218 | * Once the path between the two is found, TensorFlow starts at _C_ and moves backward toward _I_. For every operation on this backward path, a node is added to the graph, composing the partial gradients of each added node via the [chain rule](https://en.wikipedia.org/wiki/Chain_rule). This is shown as the right-hand side of Figure 5 219 | * Partial gradients are computed via a "gradient function", which corresponds to an operation on the forward path. These gradient functions are provided alongside operation kernels 220 | * The gradient function takes as input the partial derivatives already computed along the backward path and, optionally, the inputs and/or outputs of the corresponding operation on the forward path 221 | * For example, in Figure 5, the _dReLU_ operation (gradient function for the Rectified Linear Unit operation) takes in the previously computed partial derivatives (indicated by arrows coming from "..."), as well as the inputs to the _ReLU_ operation (indicated by arrows coming from _Add_, as the outputs of _Add_ are the inputs of _ReLU_). _dReLU_ does not, however, take in the outputs from the _ReLU_ function (indicated by the grayed-out arrows coming from _ReLU_). Once the partial gradients are computed, _dReLU_ outputs them to the next gradient function, in this case _dAdd_ 222 | * The partial gradients for any node outputs that are **not** dependencies of _C_ are set to zero. This can occur when a node has multiple outputs, but only some of them connect to _C_ 223 | * This process continues until the partial derivatives of _C_ with respect to _I_ are found 224 | * Memory management is an active area of improvement for the automatic gradient computation algorithm. 225 | * Tensors early in the computation graph may be needed at the end of gradient calculation, causing them to stick around in GPU memory 226 | * Current options for memory management improvements include improved heuristics to determine graph execution order, recomputing tensors as opposed to storing them in memory, and utilizing host CPU memory instead of leaving long-lived tensors in GPU memory 227 |
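As a concrete illustration of the built-in gradient computation, the sketch below calls `tf.gradients()` on a small graph similar to Figure 1 (the shapes and the sum-based cost are arbitrary):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
W = tf.Variable(tf.random_uniform([784, 100], -1, 1), name="W")
b = tf.Variable(tf.zeros([100]), name="b")
relu = tf.nn.relu(tf.matmul(x, W) + b)
C = tf.reduce_sum(relu)   # a scalar cost to differentiate

# Adds the backward-path nodes (the dReLU, dAdd, dMatMul of Figure 5) to the
# graph and returns the tensors dC/dW and dC/db.
grad_W, grad_b = tf.gradients(C, [W, b])
```

Optimizer classes such as `tf.train.GradientDescentOptimizer` build on the same mechanism when `minimize()` is called.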
228 | ## 4.2 Partial Execution 229 | 230 | * TensorFlow has built-in functionality to run smaller chunks of a defined execution graph, as well as the ability to insert pre-defined data as a replacement for any edge in the graph 231 | * Each node in the graph is given a name upon instantiation, and each output of a node is referred to by number starting from zero 232 | * e.g. "bar:0" is the first output of node "bar" 233 | * The Session's [run method](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session#run) takes two arguments, `fetches` and `feed_dict`, which define the subgraph to be executed: 234 | * `fetches` is a list of desired operation nodes and/or output tensors to be executed. If outputs are requested, the Run function will return the calculated tensors to the client (assuming the Run function is successful) 235 | * `feed_dict` is a dictionary of optional inputs, which map named node outputs (_`name:port`_) to defined tensor values. This allows a user to effectively define the 'start' of a subgraph. Additionally, `feed_dict` is used to define data in Placeholder objects 236 | * The execution graph is then transformed based on the values fed to `fetches` and `feed_dict` 237 | * Each output tensor specified in `fetches` is connected to a **fetch** node, which stores and returns its value to the client once Run is successfully completed 238 | * Note: no fetch nodes are created for operation nodes named in `fetches`, as TensorFlow makes a distinction between operations and the outputs of those operations 239 | * An example of an operation a user may specify as a `fetches` parameter to a Run command is the operation returned by [`tf.initialize_all_variables`](https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#initialize_all_variables). The operation doesn't provide an output, but it is run in the execution graph 240 | * Each named node output (`node:port`) specified in `feed_dict` is replaced with a **feed** node, which takes on the value of the tensor mapped to the named output. Each node in the execution graph that depends on the named output will take in data from the feed node in its place 241 | * Because a feed node has no dependencies, it is the start of its own execution chain 242 | * Once the **fetch** and **feed** nodes have been inserted, TensorFlow determines which nodes need to be executed. It moves backwards, starting at the fetch nodes, and uses the dependencies of the graph to determine all nodes that must be executed in the modified graph in order to compute the desired outputs 243 |
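A small sketch of partial execution. Feeding a value for the intermediate output `"c:0"` prunes its upstream nodes from the executed subgraph, and outputs can be fetched by `name:port` as well as by Python handle (all names here are arbitrary):

```python
import tensorflow as tf

a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
c = tf.add(a, b, name="c")       # output "c:0"
d = tf.add(c, 10.0, name="d")    # output "d:0"

with tf.Session() as sess:
    print(sess.run(d))                            # 17.0 -- a, b, c all execute
    print(sess.run(d, feed_dict={"c:0": 100.0}))  # 110.0 -- a and b are pruned
    print(sess.run("c:0"))                        # 7.0  -- fetch by name
```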
244 | ## 4.3 Device Constraints 245 | 246 | * Users can provide partial constraints on which devices a given node may run on 247 | * Examples: only allowing a node to run on GPUs, specifying a specific worker process/task, or ensuring that it is grouped with specific nodes 248 | * Note: By default, GPUs are given priority for device placement if the given operation has both a CPU and a GPU kernel implementation 249 | * These constraints require modifications to the placement algorithm described in Section 3.2: 250 | * After finding the feasible set of devices for each node, TensorFlow uses the constraints to determine which nodes must be grouped on the same device 251 | * For each of these groups, TensorFlow computes the intersection of the feasible devices of the nodes in the group 252 | * Final selection of devices is determined using similar heuristics as described in Section 3.2, ensuring fast execution while taking device restrictions, such as memory, into account 253 | 254 | _Aside: I'm not sure if this functionality is available in the open source implementation of TensorFlow yet. As of now I can only find information regarding placing nodes on **specific** devices. [Read more about manual device placement here](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/g3doc/how_tos/using_gpu/index.md#Manual-device-placement). Let me know if you can find the documentation for this feature!_ 255 | _Update: it is possible to provide [partial constraints](https://www.tensorflow.org/versions/r0.11/how_tos/variables/index.html#device-placement), e.g. with `tf.device("/job:ps/task:7")` or with `tf.device("/gpu:0")` (see the sketch below)._ 256 | 257 |
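A short sketch of these user-provided constraints using `tf.device`. The job/task string assumes a distributed cluster and is purely illustrative; `allow_soft_placement` lets TensorFlow fall back to another device when the request cannot be satisfied:

```python
import tensorflow as tf

# Pin the parameters to a parameter-server task (requires a distributed cluster).
with tf.device("/job:ps/task:0"):
    W = tf.Variable(tf.random_uniform([784, 100], -1, 1), name="W")

# Ask for the matrix multiply to run on the first GPU, if one is available.
with tf.device("/gpu:0"):
    x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
    y = tf.matmul(x, W)

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True))
```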
## 4.4 Control Flow 258 | 259 | * TensorFlow incorporates a few primitive control flow operators which allow for the skipping of subgraph execution and the expression of iteration. Using these primitive operators, higher-level constructs such as `if` and `while` statements can be compiled into TensorFlow graphs 260 | * Each iteration in a loop is identified with a unique tag, and the present value of that execution state is represented by a frame 261 | * Inputs can enter an iteration whenever they are available 262 | * This allows multiple iterations to be executed concurrently 263 | * Because a loop may contain nodes placed on separate devices, managing the state of loops becomes a problem of distributed termination detection. That is, there needs to be a way to detect the termination of nodes across devices. 264 | * TensorFlow solves this by adding additional control nodes to the graph. These nodes manage the beginning and end of each loop iteration and determine when the loop should end. 265 | * At every iteration, the device that owns the control node for a loop sends out control messages to the other devices used in that loop 266 | * The implementation of TensorFlow also takes `if` statements into account when computing gradients, as it must include or omit nodes as necessary to properly step backwards through the execution graph 267 | 268 | ## 4.5 Input Operations 269 | 270 | * In addition to using the `feed_dict` parameter of the [`Session.run`](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session#run) method to manually feed in input data, TensorFlow supports reading tensor data directly from files 271 | * Using this feature can reduce data transfer overhead in a distributed setting (specifically when the client is on a different machine from the worker process): 272 | * Using `feed_dict` will cause data to first be sent from the storage system to the client, and then from the client to the worker process 273 | * Reading from files causes data to be sent directly from the storage system to the worker process 274 | * Data can be read in as individual examples or in batches of examples 275 | * TensorFlow classes for reading data: 276 | * Text/CSV: [`tf.TextLineReader`](https://www.tensorflow.org/versions/master/api_docs/python/tf/TextLineReader) 277 | * [Basics of CSV parsing in TensorFlow](https://www.tensorflow.org/versions/master/programmers_guide/reading_data#csv_files) 278 | * Fixed-length records: [`tf.FixedLengthRecordReader`](https://www.tensorflow.org/versions/master/api_docs/python/tf/FixedLengthRecordReader) 279 | * [Basics of fixed-length record parsing in TensorFlow](https://www.tensorflow.org/versions/master/programmers_guide/reading_data#fixed_length_records) 280 | * TensorFlow data format: [`tf.TFRecordReader`](https://www.tensorflow.org/versions/master/api_docs/python/io_ops.html#TFRecordReader) 281 | * [Basics of TensorFlow data format handling](https://www.tensorflow.org/versions/master/programmers_guide/reading_data#standard_tensorflow_format) 282 | 283 | ## 4.6 Queues 284 | 285 | * TensorFlow includes [Queues](https://www.tensorflow.org/versions/master/api_guides/python/io_ops#Queues), which allow for the enqueuing and dequeuing of tensor objects. This enables asynchronous graph execution and the handing off of data between concurrent nodes 286 | * Enqueue operations can block until there is space available in the queue 287 | * Dequeue operations can block until a minimum number of elements is available in the queue 288 | * [`FIFOQueue`](https://www.tensorflow.org/versions/master/api_docs/python/tf/FIFOQueue) is a standard 'first-in, first-out' queue 289 | * [`RandomShuffleQueue`](https://www.tensorflow.org/versions/master/api_docs/python/tf/RandomShuffleQueue) is a queue that dequeues its elements in a random order, which can be useful for machine learning algorithms that want to randomize training data 290 | * An example use of queues is to allow input data to be prefetched from the storage system while previous data is still being processed 291 |
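A minimal `FIFOQueue` sketch (capacity and values are arbitrary). In real input pipelines the enqueuing is usually driven by background threads via `tf.train.QueueRunner` rather than a Python loop:

```python
import tensorflow as tf

queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32])
x = tf.placeholder(tf.float32)
enqueue_op = queue.enqueue([x])
dequeue_op = queue.dequeue()

with tf.Session() as sess:
    for value in [1.0, 2.0, 3.0]:
        sess.run(enqueue_op, feed_dict={x: value})
    print(sess.run(dequeue_op))   # 1.0 -- first in, first out
    print(sess.run(dequeue_op))   # 2.0
```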
292 | ## 4.7 Containers 293 | 294 | * A _Container_ is the mechanism that manages long-lived mutable state (i.e. state that survives multiple executions of a graph) 295 | * Container objects hold the values of _Variable_ objects 296 | * There is a default container that persists until the process terminates 297 | * Other named containers may be initialized 298 | * Containers allow for the sharing of state between otherwise unrelated computation graphs defined on separate Sessions 299 | 300 | - - - 301 | 302 | ## 5 Optimizations 303 | 304 | _This section describes certain performance/resource usage optimizations used in the implementation of TensorFlow_ 305 | 306 | ## 5.1 Common Subexpression Elimination 307 | 308 | * Before execution, TensorFlow does a pass over the computation graph and reduces nodes with identical inputs and operations down to a single node. 309 | * This prevents redundant execution of the same computation 310 | 311 | ## 5.2 Controlling Data Communication and Memory Usage 312 | 313 | * Proper scheduling of operations can create dramatic improvements in data transfer and memory usage by reducing the amount of time intermediate data needs to be stored in memory 314 | * GPUs benefit from this a great deal, as they have scarce memory 315 | * Can also reduce competition amongst operations for network resources 316 | * One particular example used in the implementation of TensorFlow is the scheduling of Receive nodes (see "Cross-Device Communication" in Section 3.2) 317 | * Receive nodes, without thoughtful scheduling, may execute much earlier than necessary 318 | * This could cause data to be stored in the memory of devices for much longer 319 | * TensorFlow attempts to delay the execution of Receive nodes until just before their results are needed. 320 | 321 | ## 5.3 Asynchronous Kernels 322 | 323 | * TensorFlow supports non-blocking kernels, which are good for environments that can't afford having many active threads, and they allow nodes to wait for I/O or other events without blocking an execution thread 324 | * Normal, synchronous kernels complete their execution at the end of the Compute method 325 | * Asynchronous kernels use a slightly different interface: the Compute method is passed a lambda/callback that should be invoked at the end of the kernel's execution 326 | * Examples of asynchronous kernels built into TensorFlow: Receive, Enqueue, and Dequeue 327 | 328 | ## 5.4 Optimized Libraries for Kernel Implementations 329 | 330 | * TensorFlow makes use of several existing, optimized libraries for many kernel implementations 331 | * Library for linear algebra: 332 | * [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) 333 | * Libraries for matrix multiplication on different devices: 334 | * [BLAS](http://www.netlib.org/blas/) 335 | * [cuBLAS (CUDA BLAS)](https://developer.nvidia.com/cublas) 336 | * Libraries for convolutional kernels for deep neural networks: 337 | * [cuda-convnet](https://code.google.com/p/cuda-convnet/) 338 | * [cuDNN](https://developer.nvidia.com/cudnn) 339 | 340 | ## 5.5 Lossy Compression 341 | 342 | * Because some machine learning algorithms still work well with less precise arithmetic, TensorFlow often uses lossy compression on higher-precision numbers when sending data between devices 343 | * This most often happens when sending data between devices on different machines, but it sometimes compresses data sent between devices on the same machine 344 | * For example, a 32-bit floating point number may be converted into a (for all intents and purposes) 16-bit floating point number before being sent to another device, where it is converted back into a lossy version of the original 32-bit number 345 | * The 16-bit representation is really just a 32-bit floating point number with 16 bits of mantissa precision dropped 346 | * The conversion back to 32-bit floating point simply fills in zeros for the dropped mantissa bits 347 | * 'Filling in with zeros' is used as the 16-bit to 32-bit conversion method because it is faster than using stochastic rounding, even though the latter is more mathematically correct 348 |
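To make the truncation concrete, the NumPy snippet below (purely illustrative, not TensorFlow code) zeroes out the 16 low-order bits of a `float32`, mimicking the "drop the mantissa bits, fill with zeros on decompression" scheme described above:

```python
import numpy as np

x = np.array([3.14159265], dtype=np.float32)

# Reinterpret the float32 bits as uint32, clear the 16 least-significant bits
# (all of them mantissa bits), then reinterpret the result as float32 again.
truncated = (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

print(x[0])          # 3.1415927
print(truncated[0])  # 3.140625 -- close, but with reduced precision
```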
349 | - - - 350 | 351 | ## 6 Status and Experience 352 | 353 | * The system, documentation, and examples for TensorFlow can be found at [tensorflow.org](https://www.tensorflow.org/) 354 | * Currently, there are front-ends for Python and C++, and it's expected that more will be added over time (created by both internal Google users and the open-source community) 355 | 356 | ### Advice and Lessons Learned 357 | 358 | _The following are "words of wisdom" coming from the experience of porting Google's_ [Inception](http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf) _neural network into TensorFlow. After successfully doing so, the team was rewarded with a 6-fold improvement in training time over DistBelief's implementation. This advice will hopefully be useful to others as they build their own models._ 359 | 360 | 1. _Build tools to gain insight into the exact number of parameters in a given model_ 361 | * This can help you catch subtle flaws in a complex network architecture, such as operations and variables instantiated incorrectly 362 | 2. _Start small and scale up_ 363 | * The TensorFlow team started by importing a small network used by DistBelief 364 | * Debugging a small network gave insight into the edge cases of certain operations; doing the same on a larger network would have been nearly impossible 365 | 3. _Always ensure that the objective (loss function) matches between machine learning systems when learning is turned off_ 366 | * By setting the learning rate to zero (i.e. turning off learning), the TensorFlow team was able to identify unexpected behavior stemming from randomly initialized variables in the model 367 | 4. _Make a single machine implementation match before debugging a distributed implementation_ 368 | * This helped the TensorFlow team separate and debug differences in training performance between DistBelief and TensorFlow 369 | * Once the single machine implementation worked, they were able to find bugs related to race conditions and non-atomic operations in the distributed model 370 | 5. _Guard against numerical errors_ 371 | * Different numerical libraries handle non-finite floating point numbers differently 372 | * Checking for non-finite floating point values can allow one to detect errors in real time, guarding against numerical instability 373 | 6.
_Analyze pieces of a network and understand the magnitude of numerical error_ 374 | * By running subsections of the neural network on both DistBelief and TensorFlow in parallel, the team was able to ensure that the implemented algorithms were indeed identical 375 | * Note that because the networks used floating point numbers, there is a given amount of numerical error that should be expected and taken into account when comparing the two systems 376 | 377 | - - - 378 | 379 | ## 7 Common Programming Idioms 380 | 381 | _This section describes how TensorFlow's basic dataflow graphs can be used to speed up training neural network models on large datasets using techniques developed by the TensorFlow team._ 382 | 383 | _The techniques presented here assume that the model is using stochastic gradient descent with mini-batches of around 100-1000 examples_ 384 | 385 | ### Data Parallel Training 386 | 387 | * Users can parallelize the computation of the gradient, separating mini-batch elements onto different devices 388 | * For example, a mini-batch size of 1000 elements can be split into 10 smaller, parallel computations of 100 elements. After they all finish running, their results can be combined to achieve the same result as if the entire calculation was performed in a single, sequential computation 389 | * This translates to having many replicas of a computational subgraph and using a single client thread to coordinate the training loop of those replicas 390 | * The above approach can be modified further by making it asynchronous. Instead of having a single client thread, there are multiple clients (one for each subgraph replica), and each replica updates the trained parameters asynchronously 391 | * See Section 4 (pages 3-6) of [_Large Scale Distributed Deep Networks_](http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf) for further description of asynchronous approach 392 | 393 | ### Model Parallel Training 394 | 395 | * Can run separate portions of the computation graph on different devices simultaneously on the same batch of examples 396 | * See Figure 8 for a visual example of sequence-to-sequence learning parallelized across three devices 397 | 398 | ### Concurrent Steps for Model Computation Pipelining 399 | 400 | * Can also run a small set of concurrent training steps on a single device 401 | * Similar concept to asynchronous data parallelism, but the parallelism is only on a single device 402 | * This can "fill in the gaps" of device utilization, when parallel execution on all devices might not make full use of computation cycles 403 | * See Figure 9 for a visual example 404 | 405 | - - - 406 | 407 | ## 8 Performance 408 | 409 | _Stay tuned for future versions of the TensorFlow white paper, which will include performance evaluations for single machine and distributed implementations of TensorFlow_ 410 | 411 | - - - 412 | 413 | ## 9 Tools 414 | 415 | _This section discusses additional tools, developed by the TensorFlow team, that work alongside the graph modeling and execution features described above._ 416 | 417 | ## 9.1 TensorBoard: Visualization of Graph Structures and Summary Statistics 418 | 419 | _[TensorBoard](https://www.tensorflow.org/versions/master/resources/faq.html#tensorboard) was designed to help users visualize the structure of their graphs, as well as understand the behavior of their models_ 420 | 421 | ### Visualization of Computation Graphs 422 | 423 | * TensorBoard includes features that allow for more digestible 
visualizations 424 | * TensorBoard can take models with tens of thousands of nodes and collapse them into high-level blocks, highlighting subgraphs with identical structures 425 | * It separates out "high-degree" nodes from the rest of the graph to further reduce visual clutter 426 | * _Note: I haven't found a proper definition of "high-degree" nodes in TensorFlow. The paper says they "often serve book-keeping functions". I imagine they are operations similar to [`tf.initialize_all_variables`](https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#initialize_all_variables), which are necessary to run the execution graph in TensorFlow but aren't really part of the mathematical model_ 427 | * The TensorBoard visualization is interactive 428 | * Users can pan, zoom, and expand the collapsed blocks in order to get a lower-level view of the model 429 | 430 | ### Visualization of Summary Data 431 | 432 | * TensorBoard supports Summary operations that can be inserted into the execution graph to examine and track various values over time 433 | * _Scalar summaries_: e.g. average loss function over a set of examples, total execution time of the model 434 | * _Histogram-based summaries_: e.g. distribution of parameter values in a neural network layer 435 | * _Image-based summaries_: e.g. visualization of filter weights learned in a convolutional neural network 436 | * Typically, Summary nodes are set up to monitor specific interesting values and are executed periodically during normal graph execution 437 | * After Summary nodes are executed, the client writes the summary data to a log file. TensorBoard monitors this log to display summary information over time 438 | * The "time" used in the visualization of summary data can be: wall-clock time; absolute time; or "steps", the number of graph executions that have occurred since the first execution in the TensorFlow program 439 |
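A minimal sketch of the summary workflow using the 0.x-era Python API (the tag names, fake loss value, and log directory are arbitrary; later releases moved these functions under `tf.summary`):

```python
import tensorflow as tf

loss = tf.placeholder(tf.float32, name="loss")
weights = tf.Variable(tf.random_uniform([784, 100], -1, 1), name="weights")

tf.scalar_summary("avg_loss", loss)
tf.histogram_summary("weights", weights)
merged = tf.merge_all_summaries()

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    writer = tf.train.SummaryWriter("/tmp/tf_logs")   # TensorBoard reads this directory
    for step in range(100):
        # ... a real training step would go here; a fake loss stands in for it ...
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, step)
```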
440 | ## 9.2 Performance Tracing 441 | 442 | * A tool called EEG is used to examine fine-grained information about the ordering/performance of TensorFlow graphs 443 | * Works for both single machine and distributed implementations of TensorFlow 444 | * _Note: EEG is not open sourced as of writing_ 445 | * Helps users understand bottlenecks in a TensorFlow program 446 | 447 | _The following is a brief overview of what EEG does under the hood_ 448 | 449 | * Traces are collected via a number of sources including: 450 | * [Linux `ftrace`](http://elinux.org/Ftrace) 451 | * Internal Google tracing tools 452 | * [The CUDA Profiling Tools Interface](https://developer.nvidia.com/cuda-profiling-tools-interface) 453 | * The trace logs enable EEG to recreate the execution of the graph with microsecond-level precision. Events from the traces, within a given time range, are extracted and visualized according to the resolution of the client's user interface 454 | * The user can zoom in on portions of the graph, and EEG will update the visualization to show finer details 455 | * Any significant delays due to communication, synchronization, or direct memory access issues are identified and highlighted in the UI 456 | 457 | _Please see pages 14 and 15 of the [November 2015 white paper](http://download.tensorflow.org/paper/whitepaper2015.pdf) for a specific example of EEG visualization along with descriptions of the current UI_ 458 | 459 | - - - 460 | 461 | ## 10 Future Work 462 | 463 | _This section lists areas of improvement and extension for TensorFlow identified for consideration by the TensorFlow team_ 464 | 465 | _Extensions_: 466 | 467 | * A "function" mechanism, where users can specify subgraphs of TensorFlow execution to be reusable 468 | * In the TensorFlow team's design of this mechanism (not yet implemented), these kinds of functions would be reusable across front-end languages. That is, a user could define a subgraph function in Python and use it in C++ without redefining it 469 | 470 | _Improvements_: 471 | 472 | * Continue work on a just-in-time compiler that can take in an execution subgraph and output an optimized routine for that subgraph 473 | * Such a compiler might also take in some runtime profiling information as input 474 | * The compiler should be able to perform loop fusion, block and tile for locality, and specialize routines for particular shapes and sizes of tensors, along with other optimizations 475 | * Improving the node placement and node scheduling algorithms 476 | * Instead of using man-made heuristics, have the system learn how to make good placement/scheduling decisions 477 | 478 | - - - 479 | 480 | ## 11 Related Work 481 | 482 | ### Open source, single machine systems with portions of similar functionality 483 | 484 | _Systems designed primarily for neural networks:_ 485 | 486 | * [Caffe](http://caffe.berkeleyvision.org/) 487 | * [Chainer](http://chainer.org/) 488 | * [Computational Network Toolkit](https://cntk.codeplex.com/) 489 | * [Theano](http://deeplearning.net/software/theano/) 490 | * [Torch](http://torch.ch/) 491 | 492 | _Systems that support symbolic differentiation:_ 493 | 494 | * [Chainer](http://chainer.org/) 495 | * [Theano](http://deeplearning.net/software/theano/) 496 | 497 | _Systems with a core written in C++:_ 498 | 499 | * [Caffe](http://caffe.berkeleyvision.org/) 500 | 501 | ### Comparisons with DistBelief and Project Adam 502 | 503 | _Similarities shared with [DistBelief](http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf) and [Project Adam](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf):_ 504 | 505 | * They allow computations to be distributed across many devices on many machines 506 | * Users can specify models using relatively high-level descriptions 507 | 508 | _Differences between TensorFlow and DistBelief/Project Adam:_ 509 | 510 | * TensorFlow's dataflow graph model is more flexible and better at expressing more types of machine learning models and algorithms 511 | * Allows for the expression of stateful Parameter nodes as variables 512 | * Variable update operations are simply additional nodes in the graph 513 | * Both DistBelief and Project Adam have subsystems dedicated to handling parameters 514 | 515 | ### Comparison with
the [Halide](http://people.csail.mit.edu/fredo/tmp/Halide-5min.pdf) image processing system 516 | 517 | * Both use a similar intermediate data representation for their dataflow graphs 518 | * However, Halide has additional high-level knowledge of its operations, which it uses to generate optimized code that combines multiple operations 519 | * Halide is single-machine only 520 | * The TensorFlow team hopes to extend TensorFlow's capabilities to incorporate Halide's techniques in a distributed setting 521 | 522 | ### Related distributed dataflow graph systems 523 | 524 | _Systems that represent complex workflows as dataflow graphs_ 525 | 526 | * [Dryad](http://www.michaelisard.com/pubs/eurosys07.pdf) 527 | * [Flume](https://flume.apache.org/) 528 | 529 | _Systems that support data-dependent control flow_ 530 | 531 | * [CIEL](http://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/ciel.pdf) 532 | * Iteration is implemented as a directed acyclic graph (DAG) that dynamically unfolds 533 | * [Naiad](http://research.microsoft.com/en-us/projects/naiad/) 534 | * Iteration is implemented as a static graph with cycles 535 | 536 | _Systems optimized for accessing the same data repeatedly_ 537 | 538 | * [Spark](http://spark.apache.org/) 539 | * Uses resilient distributed datasets (RDDs) to cache outputs of earlier operations, in case they are needed again 540 | 541 | _Systems that execute dataflow graphs across heterogeneous devices, including GPUs_ 542 | 543 | * [Dandelion](http://research-srv.microsoft.com/pubs/201110/sosp13-dandelion-final.pdf) 544 | 545 | #### Features that TensorFlow incorporates from the above distributed systems 546 | 547 | _Feature implementations that are most similar to TensorFlow are listed after the feature_ 548 | 549 | * The dataflow scheduler (i.e. the module that selects the next node to run) 550 | * CIEL, Dryad, Flume, Spark 551 | * Distributed architecture: using a single, optimized dataflow graph and caching information about the graph to lower coordination overhead 552 | * Naiad 553 | * Works best when there is enough RAM in the cluster to hold all working variables/computations 554 | * Naiad, Spark 555 | * Iteration: multiple replicas of the same graph can execute at once while sharing the same set of persistent variables.
Replicas can share the variables asynchronously or use mechanisms such as queues to access them synchronously 556 | * Hybrid of many approaches 557 | * Iteration: a node only executes when all of its dependencies have completed 558 | * CIEL 559 | * Graph iteration is represented as a static, cyclic graph 560 | 561 | - - - 562 | 563 | ## 12 Conclusions 564 | 565 | * TensorFlow is a flexible dataflow graph programming model 566 | * There are both single machine and distributed implementations of TensorFlow 567 | * TensorFlow was developed using prior experience at Google, as well as methods used in other previous systems 568 | * An [open source implementation of TensorFlow is available](https://github.com/tensorflow/tensorflow) 569 | * As of writing, only the single-machine (local) implementation has been released by Google 570 | 571 | - - - 572 | 573 | ## Figures 574 | 575 | #### Figure 1: Example TensorFlow code fragment 576 | ```python 577 | import tensorflow as tf 578 | 579 | # 100-d vector, init to zeros 580 | b = tf.Variable(tf.zeros([100])) 581 | 582 | # 784x100 matrix with random values 583 | W = tf.Variable(tf.random_uniform([784,100], -1, 1)) 584 | 585 | # Placeholder for input 586 | x = tf.placeholder(name="x") 587 | 588 | # Rectified linear unit of (W*x + b) 589 | relu = tf.nn.relu(tf.matmul(W, x) + b) 590 | 591 | # Cost computed as a function of relu 592 | C = [...] 593 | 594 | # Instantiate a Session 595 | s = tf.Session() 596 | 597 | for step in xrange(0, 10): 598 | # Create a 100-d vector for input 599 | input = ...construct 100-D input array ... 600 | 601 | # Find the cost, using the constructed vector as the placeholder input 602 | result = s.run(C, feed_dict={x: input}) 603 | print step, result 604 | ``` 605 | 606 | #### Figure 2: Corresponding computation graph for Figure 1 607 | ![Figure 2](img/figure2.svg) 608 | 609 | #### Figure 3: Single machine (left) and distributed system (right) structure 610 | ![Figure 3](img/figure3.svg) 611 | 612 | #### Figure 4: Before and after insertion of Send/Receive nodes 613 | ![Figure 4](img/figure4.svg) 614 | 615 | #### Figure 5: Gradients computed for graph in Figure 2 616 | ![Figure 5](img/figure5.svg) 617 | 618 | #### Figure 6: Before and after graph transformation for partial execution 619 | ![Figure 6](img/figure6.svg) 620 | 621 | #### Figure 7: Synchronous and asynchronous data parallel training 622 | ![Figure 7](img/figure7.svg) 623 | 624 | #### Figure 8: Model parallel training 625 | ![Figure 8](img/figure8.svg) 626 | 627 | #### Figure 9: Concurrent steps 628 | ![Figure 9](img/figure9.svg) 629 |

_The `img/` directory contains the SVG sources for Figures 2-9._
| 627 | 629 | 631 | 634 | 636 | 637 | 638 | 640 | 642 | 645 | 647 | 649 | 650 | 651 | 655 | 657 | 660 | 662 | 663 | 664 | 665 | 668 | 671 | 672 | 674 | 676 | 679 | 681 | 682 | 683 | 685 | 687 | 690 | 692 | 694 | 695 | 696 | 698 | 701 | 704 | 707 | 709 | 711 | 712 | 713 | 715 | 718 | 721 | 724 | 726 | 729 | 730 | 731 | 733 | 736 | 739 | 742 | 744 | 747 | 748 | 749 | 751 | 752 | 754 | 756 | 758 | 760 | 763 | 764 | 765 | 767 | 768 | 770 | 772 | 774 | 776 | 778 | 779 | 780 | 782 | 783 | 785 | 787 | 789 | 791 | 792 | 793 | 794 | 797 | 798 | 799 | 801 | 804 | 805 | 806 | 808 | 811 | 812 | 813 | 815 | 818 | 819 | 820 | 823 | 826 | 828 | 831 | 834 | 837 | 839 | 842 | 844 | 847 | 850 | 851 | 853 | 855 | 858 | 860 | 863 | 864 | 865 | 866 | 869 | 872 | 874 | 876 | 878 | 880 | 882 | 884 | 886 | 888 | 891 | 894 | 897 | 899 | 902 | 905 | 908 | 910 | 913 | 914 | 915 | 918 | 919 | 921 | 924 | 927 | 928 | 929 | 931 | 934 | 937 | 939 | 941 | 943 | 945 | 947 | 949 | 951 | 953 | 956 | 959 | 962 | 964 | 967 | 970 | 973 | 975 | 978 | 979 | 980 | 983 | 984 | 986 | 989 | 992 | 993 | 994 | -------------------------------------------------------------------------------- /img/figure8.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 9 | 11 | 13 | 15 | 17 | 19 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 98 | 101 | 102 | 104 | 106 | 109 | 110 | 111 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 157 | 158 | 159 | 160 | 162 | 163 | 164 | 166 | 167 | 168 | 170 | 171 | 172 | 174 | 175 | 176 | 178 | 179 | 181 | 183 | 185 | 187 | 189 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 271 | 274 | 275 | 277 | 279 | 282 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 330 | 332 | 333 | 334 | 338 | 339 | 340 | 344 | 345 | 346 | 350 | 351 | 352 | 356 | 357 | 358 | 362 | 363 | 365 | 367 | 369 | 371 | 373 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 455 | 458 | 459 | 461 | 463 | 466 | 469 | 470 | 471 | 472 | 475 | 
478 | 479 | 480 | 482 | 483 | 484 | 486 | 487 | 488 | 490 | 491 | 492 | 494 | 495 | 496 | 498 | 499 | 501 | 502 | 504 | 505 | 507 | 510 | 512 | 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | -------------------------------------------------------------------------------- /img/figure9.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 15 | 17 | 19 | 21 | 23 | 26 | 28 | Update 29 | model 30 | P 31 | input 32 | Client 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 51 | 53 | 56 | Update 57 | model 58 | input 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 77 | 79 | 82 | Update 83 | model 84 | input 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 132 | 133 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 165 | 166 | 167 | 168 | 169 | 170 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 181 | 182 | 184 | 185 | 186 | 187 | 188 | --------------------------------------------------------------------------------