├── .gitignore
├── ABOUT.md
├── LICENSE
├── README.md
└── img
    ├── figure2.svg
    ├── figure3.svg
    ├── figure4.svg
    ├── figure5.svg
    ├── figure6.svg
    ├── figure7.svg
    ├── figure8.svg
    └── figure9.svg
/.gitignore:
--------------------------------------------------------------------------------
1 | UNTRACKED/
2 |
--------------------------------------------------------------------------------
/ABOUT.md:
--------------------------------------------------------------------------------
1 | # About this Repository
2 |
3 | I started off simply taking notes on the [TensorFlow white paper](http://download.tensorflow.org/paper/whitepaper2015.pdf), but as I worked I started putting more and more time into finding/linking to reference material in the TensorFlow documentation and resources. Additionally, I attempted to increase my understanding of the paper by re-phrasing certain sections as well as translating some of the algorithms from paragraph-form to ordered lists.
4 |
5 | Reading the white paper has improved my comfort with the APIs dramatically, and I hope that others will benefit from my notes.
6 |
7 | ## Features
8 |
9 | * Notes broken down section by section, as well as subsection by subsection
10 | * Relevant links to documentation, resources, and references throughout
11 | * SVG versions of figures/graphs
12 | * So many bullet points!
13 |
14 | ### To-do list
15 |
16 | * Create and utilize anchor tags throughout notes for self-navigating
17 |
18 | ## Contributions
19 |
20 | Please feel free to submit pull requests for corrections, improved readability, consistency in terminology, additional graphs or whatever else you can think of.
21 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2015 Sam Abrahams
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # TensorFlow White Paper Notes
2 |
3 | ## Features
4 |
5 | * Notes broken down section by section, as well as subsection by subsection
6 | * Relevant links to documentation, resources, and references throughout
7 | * SVG versions of figures/graphs
8 | * So many bullet points!
9 |
10 | ### To-do list
11 |
12 | * Create and utilize anchor tags throughout notes for self-navigating
13 |
14 | _[White Paper available at this link](http://download.tensorflow.org/paper/whitepaper2015.pdf)_
15 |
16 | - - -
17 |
18 | # TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
19 |
20 | ## Abstract
21 |
22 | * [**TensorFlow**](https://www.tensorflow.org/) is both an interface for expressing machine learning algorithms and an implementation to execute them
23 | * Code can be transported across various machine architectures with little to no changes to the code
24 | * Has been used at Google for all manner of machine learning tasks
25 | * [Reference implementation and API released under Apache 2.0 license](https://github.com/tensorflow/tensorflow)
26 |
27 | - - -
28 |
29 | ## 1 Introduction
30 |
31 | * Google Brain started in 2011, and [**DistBelief**](http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf) was its first-generation scalable, distributed machine learning system
32 | * DistBelief was used for a large number of research and commercial tasks
33 | * TensorFlow, Google's second-generation machine learning system, was designed from lessons learned in the process of engineering and using DistBelief
34 | * The TensorFlow API is used to describe a dataflow-like model, and the implementation then maps those models onto the underlying machine hardware
35 | 	* This allows users to have a single system that runs on a broad spectrum of machines, reducing the overhead caused by rewriting code for different hardware
36 | * Focus of development was to maintain flexibility for research purposes while attaining enough performance to be used in production
37 | * Can express various types of parallelism by replicating the dataflow model across multiple machines and running them in parallel
38 | * Some functions within TensorFlow allow for less consistency in parallelism if desired
39 | * Larger, multiple machine uses of TensorFlow can take advantage of less-strict synchronization requirements
40 | * TensorFlow is more flexible, faster, and supports more machine learning models than DistBelief
41 |
42 | - - -
43 |
44 | ## 2 Programming Model and Basic Concepts
45 |
46 | * TensorFlow computations are represented by [_directed graphs_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Graph), which are composed of _nodes_
47 | * Some nodes are able to maintain and update a persistent state and/or have some sort of branching and looping structures
48 | * This branching/looping is modeled similarly to [MSR's Naiad](http://research.microsoft.com:8082/pubs/201100/naiad_sosp2013.pdf)
49 | * Graphs are constructed using supported front-end languages (C++/Python as of writing)
50 | * A Node has zero or more inputs/outputs, and it represents an [_operation_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Operation)
51 | * Values of 'normal' edges (the connection between one node's output to another node's input) are [_tensors_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Tensor), n-dimensional arrays.
52 | * The type of each element in the tensor is inferred while the graph is being constructed, prior to execution
53 | * There are 'special' edges, called [_control dependencies_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Graph#control_dependencies): no model data is transferred on these edges, rather they indicate that the source node must finish execution before the destination node begins execution
54 | * Can be thought of as a baton in a relay race. Attaching a control dependency means that the next node can't begin running until the previous node 'hands off' the baton.
55 | * Used by client to enforce happens-before relations
56 | * Used in reference implementation to manage memory usage
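As a rough illustration (not TensorFlow code), a control dependency can be modeled as one more "must run before" constraint handed to a scheduler; the node names below are made up:

```python
from graphlib import TopologicalSorter

# Data edges and control edges both become ordering constraints for
# scheduling purposes; a control edge just carries no tensor.
# Hypothetical nodes: read_w must wait for assign_w to "hand off the baton".
deps = {
    "assign_w": set(),          # updates a variable
    "read_w": {"assign_w"},     # control dependency: read only after the update
    "loss": {"read_w"},
}

order = list(TopologicalSorter(deps).static_order())
```

Any valid schedule places `assign_w` before `read_w`, which is exactly the happens-before relation the client asked for.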
57 |
58 |
59 |
60 | ### Operations and Kernels
61 |
62 | * Operations have names and represent an abstract computation, such as ["matrix multiply"](https://www.tensorflow.org/versions/master/api_docs/python/tf/matmul) or ["add"](https://www.tensorflow.org/versions/master/api_docs/python/tf/add)
63 | * Operations can optionally require _attributes_. Attributes must be explicitly provided or be possible to infer prior to running the graph
64 | 	* A common use of attributes is to declare which data type the operation is being performed with (e.g. float32 tensors vs. int32 tensors)
65 | * A _kernel_ is an implementation of an operation designed for specific types of devices, such as CPU or GPU
66 | * The TensorFlow library includes several built-in operations/kernels. The table below lists some of them:
67 |
68 | Category | Examples
69 | ---|---
70 | Element-wise mathematical operations | Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal
71 | Array operations | Concat, Slice, Split, Constant, Rank, Shape, Shuffle
72 | Matrix operations | MatMul, MatrixInverse, MatrixDeterminant
73 | Stateful operations | Variable, Assign, AssignAdd
74 | Neural-net building blocks | SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool
75 | Checkpointing operations | Save, Restore
76 | Queue and synchronization operations | Enqueue, Dequeue, MutexAcquire, MutexRelease
77 | Control flow operations | Merge, Switch, Enter, Leave, NextIteration
78 |
79 | _Check out [this directory in the TensorFlow repository](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/kernels) for kernel implementations_
80 |
81 | ### Sessions
82 |
83 | * Clients interact with TensorFlow by creating a [_Session_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session), which supports two main functions: _Extend_ and [_Run_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session#run)
84 | * The Extend method adds additional nodes and edges to the existing dataflow model
85 | * _Note: Extend is called automatically by TensorFlow, not directly by the client_
86 | 	* Run takes as its arguments a set of named nodes to be computed, as well as an optional set of tensors to be used in place of certain node outputs. It then uses the graph to figure out all of the nodes required to compute the requested outputs and executes them in an order that respects their dependencies.
87 | * Most TensorFlow programs set up a graph within a Session once, and then run the full graph or subsets of the graph multiple times.
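A minimal pure-Python sketch of the Run behavior described above. The `ToySession` class, the graph encoding, and the node names are hypothetical stand-ins, not the TensorFlow API:

```python
class ToySession:
    def __init__(self, graph):
        # graph: node name -> (function, list of input node names)
        self.graph = graph

    def run(self, fetches, feed_dict=None):
        feed_dict = feed_dict or {}
        cache = dict(feed_dict)  # fed values replace those nodes' outputs

        def evaluate(name):
            # compute a node only if it isn't fed or already computed
            if name not in cache:
                fn, inputs = self.graph[name]
                cache[name] = fn(*(evaluate(i) for i in inputs))
            return cache[name]

        return [evaluate(name) for name in fetches]

graph = {
    "a": (lambda: 2.0, []),
    "b": (lambda: 3.0, []),
    "mul": (lambda x, y: x * y, ["a", "b"]),
    "add": (lambda x, y: x + y, ["mul", "b"]),
}
sess = ToySession(graph)
result = sess.run(["add"])   # computes a, b, mul, then add
```

Asking for `"add"` pulls in only the nodes it transitively depends on, and feeding `"mul"` a value would skip computing `a`, `b`, and `mul` entirely.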
88 |
89 | ### Variables
90 |
91 | * A [_Variable_](https://www.tensorflow.org/versions/master/api_docs/python/tf/Variable) is a handle to a persistent and mutable tensor which survives each execution of a graph
92 | * For ML tasks, learned parameters are usually held in TensorFlow Variables
93 |
94 | _See the official [How-To](https://www.tensorflow.org/versions/master/how_tos/variables/index.html) to learn more about TensorFlow Variables_
95 |
96 | - - -
97 |
98 | ## 3 Implementation
99 |
100 | * There are three primary components in a TensorFlow system: the _client_, the _master_, and _worker processes_
101 | * The client uses a Session interface to communicate with the master
102 | * The master schedules and coordinates worker processes and relays results back to the client
103 | * Worker processes are responsible for maintaining access to devices such as CPU/GPU cores and execute graph nodes on their respective devices
104 | * There are both local and distributed implementations of TensorFlow, ~~but only the local version has been open-sourced as of writing~~
105 | * **Update as of February 2016:** The initial open-source implementation of the TensorFlow distributed runtime is [available on the TensorFlow GitHub repository.](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/distributed_runtime) However, using it at this time requires building TensorFlow from source, and full API documentation is not yet available.
106 |
107 | ### Devices
108 |
109 | * Each device has both a device type and a name
110 | * Names are composed of the device's type, its index in a worker process, and (when used in a distributed setting) an identification of the job and task of the worker process
111 | * Example device names:
112 | Local: `/job:localhost/device:cpu:0`
113 | Distributed: `/job:worker/task:17/device:gpu:3`
114 | * A device object manages its device's memory and executes kernels as requested
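The name format above can be illustrated with a small parser (hypothetical helper, not part of TensorFlow):

```python
def parse_device_name(name):
    """Split a device name like '/job:worker/task:17/device:gpu:3'
    into its job/task identification and device type/index."""
    parts = {}
    for field in name.strip("/").split("/"):
        if field.startswith("device:"):
            _, dev_type, index = field.split(":")
            parts["type"] = dev_type
            parts["index"] = int(index)
        else:
            key, value = field.split(":")
            parts[key] = value
    return parts

info = parse_device_name("/job:worker/task:17/device:gpu:3")
```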
115 |
116 | ### Tensors
117 |
118 | * Typed, multi-dimensional array
119 | * Memory management of tensors is handled automatically
120 | * Available types (from the [TensorFlow documentation](https://www.tensorflow.org/versions/master/programmers_guide/dims_types#data_types)):
121 |
122 | Data type | Python type | Description
123 | --- | --- | ---
124 | `DT_FLOAT` | `tf.float32` | 32 bits floating point
125 | `DT_DOUBLE` | `tf.float64` | 64 bits floating point
126 | `DT_INT8` | `tf.int8` | 8 bits signed integer
127 | `DT_INT16` | `tf.int16` | 16 bits signed integer
128 | `DT_INT32` | `tf.int32` | 32 bits signed integer
129 | `DT_INT64` | `tf.int64` | 64 bits signed integer
130 | `DT_UINT8` | `tf.uint8` | 8 bits unsigned integer
131 | `DT_STRING` | `tf.string` | Variable length byte arrays. Each element of a Tensor is a byte array
132 | `DT_BOOL` | `tf.bool` | Boolean
133 | `DT_COMPLEX64` | `tf.complex64` | Complex number made of two 32 bits floating points: real and imaginary parts
134 | `DT_QINT8` | `tf.qint8` | 8 bits signed integer used in quantized Ops
135 | `DT_QINT32` | `tf.qint32` | 32 bits signed integer used in quantized Ops
136 | `DT_QUINT8` | `tf.quint8` | 8 bits unsigned integer used in quantized Ops
137 |
138 | ## 3.1 Single-Device Execution
139 |
140 | _**NOTE:** To reiterate: in this context, "single device" means using a single CPU core or single GPU, **not** a single machine. Similarly, "multi-device" does not refer to multiple machines, but to multiple CPU cores and/or GPUs. See "3.3 Distributed Execution" for multiple machine discussion._
141 |
142 | * Overview of the execution of a single-worker process, single-device job:
143 | 1. All nodes required to compute the desired output node(s) are determined
144 | 2. Each node is given a count of dependencies that need to be completed before it can begin execution
145 | 3. When a node's dependency count is zero, it is added to a ready queue
146 | 4. The ready queue delegates node kernel execution to device objects
147 | 	5. When a node completes execution, the counts of all dependent nodes are decremented
148 | 6. Repeat steps 3-5 until the desired output is computed
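The six steps above can be sketched in pure Python; the toy graph and node names are made up:

```python
from collections import deque

def execute(graph):
    # graph: node -> list of dependencies (step 1: the required nodes)
    pending = {node: len(deps) for node, deps in graph.items()}  # step 2
    dependents = {node: [] for node in graph}
    for node, deps in graph.items():
        for dep in deps:
            dependents[dep].append(node)

    # step 3: nodes with zero unmet dependencies seed the ready queue
    ready = deque(node for node, count in pending.items() if count == 0)
    executed = []
    while ready:
        node = ready.popleft()        # step 4: a device object runs the kernel
        executed.append(node)
        for dep in dependents[node]:  # step 5: decrement dependents' counts
            pending[dep] -= 1
            if pending[dep] == 0:
                ready.append(dep)     # step 6: loop until everything has run
    return executed

order = execute({"x": [], "w": [], "matmul": ["x", "w"], "relu": ["matmul"]})
```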
149 |
150 | ## 3.2 Multi-Device Execution
151 |
152 | * There are two main challenges introduced when using multiple devices:
153 | * Deciding which device should process each node
154 | * Managing communication between devices as necessary after assigning nodes
155 |
156 | ### Node Placement
157 |
158 | * One of the main responsibilities of the TensorFlow implementation is to map computation onto available devices
159 | * The following is a simplified version of this mapping algorithm:
160 | 1. A cost model is input into the algorithm
161 | 		* The cost model contains estimates of the sizes of the input/output tensors (in bytes) and estimated computation time for each node in the graph
162 | 2. Using the cost model, the algorithm simulates an execution of the graph to make node-placement decisions as described below:
163 | 1. Starting with the source nodes, a set of feasible devices is considered for each node ready to be executed
164 | * A "feasible" device is one that has a kernel implementation for the given operation
165 | * A node is ready for execution once its dependencies have finished running
166 | 2. If a node has multiple feasible devices, the computation time of the node is examined with respect to placing the node on each possible device
167 | * This examination takes into account the execution time of the operation (given the device type), as well as the costs of possibly introducing communication between devices.
168 | 3. The device that would finish the operation the soonest is selected as the node's device.
169 | 4. Repeat steps 1-3 for each node in the graph execution until all nodes have been allocated to devices
170 | 3. After the simulation, the real execution runs using the node-placement decisions made during the simulation
171 | * Section 4.3 will describe some extensions to help guide the placement algorithm
172 | * Improving the placement algorithm's development is an ongoing process as of writing
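A much-simplified sketch of the simulation above, assuming a toy cost model; the device names, costs, and graph are all invented for illustration:

```python
def place(nodes, feasible, compute_cost, deps, transfer_cost=1.0):
    """Greedy placement: simulate execution in topological order and
    give each node to the feasible device that finishes it soonest."""
    placement, finish, device_free = {}, {}, {}
    for node in nodes:  # nodes must be in dependency order
        best = None
        for dev in feasible[node]:
            start = device_free.get(dev, 0.0)
            for dep in deps[node]:
                ready = finish[dep]
                if placement[dep] != dev:   # cross-device edge: add comm cost
                    ready += transfer_cost
                start = max(start, ready)
            done = start + compute_cost[(node, dev)]
            if best is None or done < best[0]:
                best = (done, dev)
        finish[node] = best[0]
        placement[node] = best[1]           # soonest-finishing device wins
        device_free[best[1]] = best[0]
    return placement

feasible = {"in": ["cpu:0"], "conv": ["cpu:0", "gpu:0"]}
compute_cost = {("in", "cpu:0"): 1.0,
                ("conv", "cpu:0"): 10.0,
                ("conv", "gpu:0"): 2.0}
deps = {"in": [], "conv": ["in"]}
placement = place(["in", "conv"], feasible, compute_cost, deps)
```

Here the GPU wins the `conv` node even after paying the transfer cost, because its compute estimate is so much lower.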
173 |
174 | _**NOTE:** At the moment, node placement is done by a [simple_placer class](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/simple_placer.h) which only considers explicit placement requirements provided by the user and implicit colocation constraints based on node type ([see documentation comments for details](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/simple_placer.h#L32-L41))_
175 |
176 | ### Cross-Device Communication
177 |
178 | * After the nodes have been placed onto their respective devices, the execution graph is split into subgraphs, one per device
179 | * Any edge between nodes on different devices is replaced by two new edges:
180 | 	* The outputting node will have an edge between it and a new _Send_ node, placed within the subgraph of its device
181 | 	* The receiving node will have an edge between it and a new _Receive_ node, placed within the subgraph of its device
182 | * See Figure 4 for illustration of adding Send/Receive nodes
183 | * The Send and Receive nodes coordinate data transfer across devices
184 | * This isolates cross-device communication to the implementation of the Send and Receive nodes
185 | * All users of a particular tensor on a particular device use a single Receive node, as opposed to having one Receive node per user per device. This minimizes data transmission between devices as well as memory allocated on the receiving device
186 | * This means that any given device should receive the output of any given operation only once, and should store that output only once in memory
187 | * This method of communication also allows individual node scheduling to be handled by the worker processes as opposed to the master
188 | * The Send and Receive nodes provide synchronization between worker processes and devices, which enables the master to only issue a single Run request per graph execution per worker process
189 | * This improves scalability and fine-grain control over node execution
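The edge-splitting rule can be sketched in pure Python (hypothetical node naming), including the sharing of one Receive node per tensor per device:

```python
def insert_send_recv(edges, placement):
    """edges: list of (producer, consumer) pairs; placement: node -> device.
    Cross-device edges become producer -> Send and Receive -> consumer."""
    new_edges, recv_nodes = [], {}
    for src, dst in edges:
        if placement[src] == placement[dst]:
            new_edges.append((src, dst))   # same device: edge unchanged
            continue
        key = (src, placement[dst])        # one Receive per tensor + device
        if key not in recv_nodes:
            send = f"send_{src}_to_{placement[dst]}"
            recv = f"recv_{src}_on_{placement[dst]}"
            recv_nodes[key] = recv
            new_edges.append((src, send))
            new_edges.append((send, recv))
        new_edges.append((recv_nodes[key], dst))
    return new_edges

rewritten = insert_send_recv(
    [("a", "b"), ("a", "c")],
    {"a": "gpu:0", "b": "cpu:0", "c": "cpu:0"},
)
```

Both consumers on `cpu:0` share one Receive node for `a`'s output, so the tensor crosses the device boundary only once.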
190 |
191 | ## 3.3 Distributed Execution
192 |
193 | * Similar to multi-device execution. Send/Receive nodes that communicate across worker processes use mechanisms such as TCP or RDMA to transmit data from machine to machine
194 |
195 | ### Fault Tolerance
196 |
197 | * There are two main ways failures are detected:
198 | * Errors between Send and Receive nodes
199 | * Periodic health-checks from the master process to the worker processes
200 | * When a failure is detected, the entire graph execution is aborted and restarted
201 | * Because TensorFlow Variables persist across each graph execution, there are mechanisms to save and restore their state
202 | 	* Each Variable node is connected to a Save node, which periodically writes the contents of the Variable to persistent storage
203 | * Additionally, each Variable is connected to a Restore node, which is enabled when the graph execution is restarted. The Restore node reads state data from persistent storage and applies it to the Variable node
204 |
205 | - - -
206 |
207 | ## 4 Extensions
208 |
209 | _The following subsections describe advanced features and extensions of the programming model introduced in Section 2_
210 |
211 | ## 4.1 Gradient Computation
212 |
213 | * TensorFlow has built-in gradient computation
214 | * If a tensor, _C_, depends on a set of previous nodes, the gradient of _C_ with respect to those previous nodes can be automatically computed with a built-in function, even if there are many layers in between them
215 | * See [`tf.gradients()`](https://www.tensorflow.org/versions/master/api_docs/python/train.html#gradients) for usage
216 | * Gradients are computed by creating additional nodes and edges in the graph as described below (see Figure 5):
217 | * When computing the gradient of a tensor, _C_, with respect to some dependency, _I_, TensorFlow first finds the forward path from _I_ to _C_ in the model graph. This is shown as the left-hand side of Figure 5
218 | * Once the path between the two is found, TensorFlow starts at _C_ and moves backward toward _I_. For every operation on this backward path, a node is added to the graph, composing the partial gradients of each added node via the [chain rule](https://en.wikipedia.org/wiki/Chain_rule). This is shown as the right-hand side of Figure 5
219 | * Partial gradients are computed via a "gradient function", which corresponds to an operation on the forward path. These gradient functions are provided alongside operation kernels
220 | * The gradient function takes as input the partial derivatives already computed along the backwards path and, optionally, the inputs and/or outputs of the corresponding operation on the forward path
221 | 		* For example, in Figure 5, the _dReLU_ operation (gradient function for the Rectified Linear Unit operation) takes in the previously computed partial derivatives (indicated by arrows coming from "..."), as well as the inputs from the _ReLU_ operation (indicated by arrows coming from _Add_, as the outputs of _Add_ are the inputs of _ReLU_). _dReLU_ does not, however, take in the outputs from the _ReLU_ function (indicated by the grayed out arrows coming from _ReLU_). Once the partial gradients are computed, _dReLU_ outputs them to the next gradient function, in this case _dAdd_
222 | * The partial gradients for any node outputs that are **not** dependencies of _C_ are set to zero. This can occur when a node has multiple outputs, but only some of them connect to _C_
223 | * This process continues until the partial derivatives of _C_ with respect to _I_ are found
224 | * Memory management is an active area of improvement for the automatic gradient computation algorithm.
225 | * Tensors early in the computation graph may be needed at the end of gradient calculation, causing them to stick around in GPU memory
226 | * Current options for memory management improvements include improved heuristics to determine graph execution order, recomputing tensors as opposed to storing them in memory, and utilizing host CPU memory instead of leaving long-lived tensors in GPU memory
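A toy reverse-mode sketch of the process above on a two-op chain. The ops and their gradient functions are hand-written stand-ins, not TensorFlow's:

```python
def forward(x):
    """Forward pass for the hypothetical chain C = scale(square(x)).
    Record each op and its input, since gradient functions may need them."""
    tape = []
    y = x * x
    tape.append(("square", x))   # d(square)/dx = 2x (needs the forward input)
    c = 3.0 * y
    tape.append(("scale", y))    # d(scale)/dy = 3  (input not actually needed)
    return c, tape

def backward(tape):
    """Walk the tape in reverse, composing partials via the chain rule."""
    grad = 1.0                   # dC/dC
    for op, inp in reversed(tape):
        if op == "scale":
            grad = grad * 3.0
        elif op == "square":
            grad = grad * 2.0 * inp
    return grad

c, tape = forward(2.0)           # C = 3 * x^2 = 12 at x = 2
dc_dx = backward(tape)           # dC/dx = 6x = 12 at x = 2
```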
227 |
228 | ## 4.2 Partial Execution
229 |
230 | * TensorFlow has built-in functionality to run smaller chunks of a defined execution graph, as well as the ability to insert pre-defined data as a replacement for any edge in the graph
231 | * Each node in the graph is given a name upon instantiation, and each output of a node is referred to by number starting from zero
232 | * e.g. "bar:0" is the first output of node "bar"
233 | * The Session's [run method](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session#run) takes two arguments, `fetches` and `feed_dict`, which define the subgraph to be executed:
234 | * `fetches` is a list of desired operation nodes and/or output tensors to be executed. If outputs are requested, the Run function will return the calculated tensors to the client (assuming the Run function is successful)
235 | * `feed_dict` is a dictionary of optional inputs, which map named node outputs (_`name:port`_) to defined tensor values. This allows a user to effectively define the 'start' of a subgraph. Additionally, `feed_dict` is used to define data in Placeholder objects
236 | * The execution graph is then transformed based on the values fed to `fetches` and `feed_dict`
237 | * Each output tensor specified in `fetches` is connected to a **fetch** node, which stores and returns its value to the client once Run is successfully completed
238 | 	* Note: no fetch nodes are created for operation nodes named in `fetches`, as TensorFlow makes a distinction between operations and the outputs of those operations
239 | * An example of an operation a user may specify as a `fetches` parameter to a Run command is the operation returned by [`tf.initialize_all_variables`](https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#initialize_all_variables). The operation doesn't provide an output, but it is run in the execution graph
240 | * Each named node output (`node:port`) specified in `feed_dict` is replaced with a **feed** node, which takes on the value of the tensor mapped to the named output. Each node in the execution graph that depends on the named output will take in data from the feed node in its place
241 | * Because a feed node has no dependencies, it is the start of its own execution chain
242 | * Once the **fetch** and **feed** nodes have been inserted, TensorFlow determines which nodes need to be executed. It moves backwards, starting at the fetch nodes, and uses the dependencies of the graph to determine all nodes that must be executed in the modified graph in order to compute the desired outputs
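The backward pruning step can be sketched in pure Python; the graph and node names are hypothetical:

```python
def nodes_to_run(deps, fetches, feeds):
    """Walk backward from the fetches, stopping at fed node outputs,
    to find which nodes must execute. deps: node -> input node names."""
    needed, stack = set(), list(fetches)
    while stack:
        node = stack.pop()
        if node in needed or node in feeds:
            continue             # a fed node is replaced by a feed node
        needed.add(node)
        stack.extend(deps[node])
    return needed

deps = {"x": [], "w": [], "matmul": ["x", "w"], "softmax": ["matmul"]}
run_set = nodes_to_run(deps, fetches=["softmax"], feeds={"matmul": "fed_value"})
```

Feeding `matmul` cuts `x` and `w` out of the execution entirely, which is what makes partial execution cheap.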
243 |
244 | ## 4.3 Device Constraints
245 |
246 | * Users can provide partial constraints on nodes about which devices they can run on
247 | * Examples: only allowing a node to run on GPUs, specifying a specific worker process/task, or ensuring that it is grouped with specific nodes
248 | * Note: By default, GPUs are given priority for device placement if the given operation has both a CPU and a GPU kernel implementation
249 | * These constraints require modifications to the placement algorithm described in Section 3.2:
250 | * After finding the feasible set of devices for each node, TensorFlow uses the constraints to determine which nodes must be grouped on the same device
251 | * For each of these groups, TensorFlow computes the intersection of feasible devices for each node in the group
252 | * Final selection of devices is determined using similar heuristics as described in Section 3.2, ensuring fast execution while taking device restrictions, such as memory, into account
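The group-intersection step can be sketched as follows (hypothetical ops and devices):

```python
def group_feasible(feasible, groups):
    """feasible: node -> set of devices; groups: lists of nodes that
    constraints force onto the same device. Each group can only use
    devices feasible for every one of its members."""
    result = {}
    for group in groups:
        devices = set.intersection(*(feasible[n] for n in group))
        for node in group:
            result[node] = devices
    return result

feasible = {
    "embed": {"cpu:0", "gpu:0"},
    "lookup": {"cpu:0"},          # e.g. no GPU kernel for this op
}
constrained = group_feasible(feasible, [["embed", "lookup"]])
```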
253 |
254 | _Aside: I'm not sure if this functionality is available in the open source implementation of TensorFlow yet. As of now I can only find information regarding placing nodes on **specific** devices. [Read more about manual device placement here](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/g3doc/how_tos/using_gpu/index.md#Manual-device-placement). Let me know if you can find the documentation for this feature!_
255 | It is possible to provide [partial constraints](https://www.tensorflow.org/versions/r0.11/how_tos/variables/index.html#device-placement), e.g. `with tf.device("/job:ps/task:7")` or `with tf.device("/gpu:0")`.
256 |
257 | ## 4.4 Control Flow
258 |
259 | * TensorFlow incorporates a few primitive control flow operators which allow for the skipping of subgraph execution and the expression of iteration. Using these primitive operators, higher-level constructs such as `if` and `while` statements can be compiled into TensorFlow graphs
260 | * Each iteration in a loop is identified with a unique tag, and the present value of that execution state is represented by a frame
261 | * Inputs can enter an iteration whenever they are available
262 | * This allows multiple iterations to be executed concurrently
263 | * Because a loop may contain nodes placed on separate devices, managing the state of loops becomes a problem of distributed termination detection. That is, there needs to be a way to detect the termination of nodes across devices.
264 | * TensorFlow solves this by adding additional control nodes to the graph. These nodes manage the beginning and end of each loop iteration and determine when the loop should end.
265 | * At every iteration, the device that owns the control node for a loop sends out control messages to the other devices used in that loop
266 | * The implementation of TensorFlow also takes `if` statements into account when computing gradients, as it must include or omit nodes as necessary to properly step backwards through the execution graph
267 |
268 | ## 4.5 Input Operations
269 |
270 | * In addition to using the `feed_dict` parameter in the [`Session.run`](https://www.tensorflow.org/versions/master/api_docs/python/tf/Session#run) method to manually feed in input data, TensorFlow supports reading tensors in directly from files
271 | * Using this feature can reduce data transfer overhead when using TensorFlow on a distributed implementation (specifically when the client is on a different machine from the worker process):
272 | * Using `feed_dict` will cause data to first be sent from the storage system to the client, and then from client to the worker process
273 | * Reading from the file will cause data to be sent directly from the storage system to the worker process
274 | * Data can be read in as individual data examples or in batches of examples
275 | * TensorFlow classes for reading data:
276 | * Text/CSV: [`tf.TextLineReader`](https://www.tensorflow.org/versions/master/api_docs/python/tf/TextLineReader)
277 | * [Basics of CSV parsing in TensorFlow](https://www.tensorflow.org/versions/master/programmers_guide/reading_data#csv_files)
278 | * Fixed Length records: [`tf.FixedLengthRecordReader`](https://www.tensorflow.org/versions/master/api_docs/python/tf/FixedLengthRecordReader)
279 | * [Basics of fixed length record parsing in TensorFlow](https://www.tensorflow.org/versions/master/programmers_guide/reading_data#fixed_length_records)
280 | * TensorFlow data format: [`tf.TFRecordReader`](https://www.tensorflow.org/versions/master/api_docs/python/io_ops.html#TFRecordReader)
281 | * [Basics of TensorFlow data format handling](https://www.tensorflow.org/versions/master/programmers_guide/reading_data#standard_tensorflow_format)
282 |
283 | ## 4.6 Queues
284 |
285 | * TensorFlow includes [Queues](https://www.tensorflow.org/versions/master/api_guides/python/io_ops#Queues), which allow for the enqueuing and dequeuing of tensor objects. This enables asynchronous graph execution and the handing off of data between concurrent nodes
286 | * Enqueue operations can block until there is space available in the queue
287 | * Dequeue operations can block until a minimum number of elements are placed in the queue
288 | * [`FIFOQueue`](https://www.tensorflow.org/versions/master/api_docs/python/tf/FIFOQueue) is a standard 'first-in, first-out' queue
289 | * [`RandomShuffleQueue`](https://www.tensorflow.org/versions/master/api_docs/python/tf/RandomShuffleQueue) is a queue that randomly shuffles its elements periodically, which can be useful for machine learning algorithms that want to randomize training data
290 | * An example use of queues is to allow input data to be prefetched from the storage system while previous data is still being processed
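The prefetching pattern can be imitated with Python's standard thread-safe queue (a stand-in for illustration, not TensorFlow's queue ops):

```python
import queue
import threading

def producer(q, n):
    """Simulates input reading: loads 'examples' into a bounded queue."""
    for i in range(n):
        q.put(i)         # blocks if the queue is full (bounded capacity)
    q.put(None)          # sentinel: no more data

def consume(n, capacity=2):
    q = queue.Queue(maxsize=capacity)
    threading.Thread(target=producer, args=(q, n), daemon=True).start()
    results = []
    while (item := q.get()) is not None:   # blocks until an element arrives
        results.append(item)               # "training step" would go here
    return results

data = consume(5)
```

The bounded capacity is what makes this prefetching rather than unbounded buffering: the producer runs ahead of the consumer by at most `capacity` elements.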
291 |
292 | ## 4.7 Containers
293 |
294 | * A _Container_ is the mechanism that manages long-lived (i.e. survives multiple executions of a graph) mutable state
295 | * Container objects hold the values of _Variable_ objects
296 | * There is a default container that persists until the process terminates
297 | * Other named containers may be initialized
298 | * Containers allow for the sharing of state between otherwise unrelated computation graphs defined on separate Sessions
299 |
300 | - - -
301 |
302 | ## 5 Optimizations
303 |
304 | _This section describes certain performance/resource usage optimizations used in the implementation of TensorFlow_
305 |
306 | ## 5.1 Common Subexpression Elimination
307 |
308 | * Before execution, TensorFlow does a pass over the computation graph and reduces nodes with identical inputs and operations down to a single node.
309 | * This prevents redundant execution of the same computation
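A minimal sketch of such a pass, assuming nodes are visited in dependency order; the graph encoding and names are hypothetical:

```python
def eliminate_common_subexpressions(graph):
    """graph: name -> (op, tuple of input names), in dependency order.
    Nodes with the same op and the same (canonicalized) inputs collapse
    to one node; returns a mapping of eliminated -> surviving node."""
    canonical, replacement = {}, {}
    for name, (op, inputs) in graph.items():
        key = (op, tuple(replacement.get(i, i) for i in inputs))
        if key in canonical:
            replacement[name] = canonical[key]   # reuse the earlier node
        else:
            canonical[key] = name
    return replacement

graph = {
    "a": ("const", ()),
    "s1": ("square", ("a",)),
    "s2": ("square", ("a",)),   # identical op + inputs as s1: redundant
    "add": ("add", ("s1", "s2")),
}
dropped = eliminate_common_subexpressions(graph)
```

After the pass, `add` would read both of its inputs from `s1`, and `s2` is never executed.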
310 |
311 | ## 5.2 Controlling Data Communication and Memory Usage
312 |
313 | * Proper scheduling of operations can dramatically improve data transfer rates and memory usage by reducing the amount of time intermediate data needs to be stored in memory
314 | * GPUs benefit from this a great deal, as they have scarce memory
315 | * Can also reduce competition amongst operations to use network resources
316 | * One particular example used in the implementation of TensorFlow is the scheduling of Receive nodes (see "3.2 Cross-Device Communication")
317 | 	* Receive nodes, without thoughtful scheduling, may execute much earlier than necessary
318 | * This could cause data to be stored in the memory of devices for much longer
319 | * TensorFlow attempts to delay the execution of Recieve nodes until just before their results are needed.
320 |
321 | ## 5.3 Asynchronous Kernels
322 |
* TensorFlow supports non-blocking kernels, which are useful in environments that cannot afford to have many active threads; they allow nodes to wait for I/O or other events without tying up an execution thread
324 | * Normal, synchronous kernels complete their execution at the end of the Compute method
325 | * Asynchronous kernels use a slightly different interface: the Compute method is passed a lambda/callback that should be invoked at the end of the kernel's execution
* Examples of asynchronous kernels built into TensorFlow: Receive, Enqueue, and Dequeue
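The difference between the two interfaces can be sketched as follows (hypothetical method names; TensorFlow's actual kernel interface is C++): the synchronous kernel is done when its compute method returns, while the asynchronous kernel returns immediately and invokes a callback once the result is ready.

```python
import threading

class SyncKernel:
    def compute(self, x):
        return x * 2  # work is finished when compute() returns

class AsyncKernel:
    def compute_async(self, x, done):
        """Returns immediately; `done` is invoked when the result is
        ready, so no execution thread blocks during the wait."""
        def wait_then_finish():
            # stand-in for waiting on I/O or a Receive event
            done(x * 2)
        threading.Thread(target=wait_then_finish).start()

results = []
finished = threading.Event()

def on_done(value):
    results.append(value)
    finished.set()

AsyncKernel().compute_async(21, on_done)
finished.wait()  # the executor thread was free until this point
```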
327 |
328 | ## 5.4 Optimized Libraries for Kernel Implementations
329 |
330 | * TensorFlow makes use of several existing, optimized libraries for many kernel implementations
331 | * Library for linear algebra:
332 | * [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page)
* Libraries for matrix multiplication on different devices:
334 | * [BLAS](http://www.netlib.org/blas/)
335 | * [cuBLAS (CUDA BLAS)](https://developer.nvidia.com/cublas)
336 | * Libraries for convolutional kernels for deep neural networks:
337 | * [cuda-convnet](https://code.google.com/p/cuda-convnet/)
338 | * [cuDNN](https://developer.nvidia.com/cudnn)
339 |
340 | ## 5.5 Lossy Compression
341 |
342 | * Because some machine learning algorithms still work well with less precise arithmetic, TensorFlow often uses lossy compression on higher-precision numbers when sending data between devices
343 | * This most often happens when sending data between devices on different machines, but it sometimes compresses data sent on the same machine
344 | * For example, a 32-bit floating point number may be converted into a (for all intents and purposes) 16-bit floating point number before being sent to another device, where it is converted into a lossy version of the original 32-bit number
* The "16-bit" float here is really just a 32-bit IEEE floating point number with the low-order 16 bits of its mantissa dropped
* The conversion back to a 32-bit floating point number simply fills in zeros for the lost mantissa bits
* 'Filling in with zeros' is used as the 16-bit -> 32-bit conversion method because it is faster than stochastic rounding, even though the latter is more mathematically correct
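The truncate-and-refill scheme above can be demonstrated with Python's `struct` module (illustrative only, not TensorFlow's implementation): keep the top 16 bits of the IEEE-754 representation, then fill the dropped mantissa bits with zeros on the receiving side.

```python
import struct

def float32_bits(f):
    """Raw 32-bit IEEE-754 representation of a Python float."""
    return struct.unpack("<I", struct.pack("<f", f))[0]

def compress_to_16(f):
    """Keep the sign, exponent, and top mantissa bits only."""
    return float32_bits(f) >> 16

def decompress_to_32(hi16):
    """Fill in zeros for the dropped mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", hi16 << 16))[0]

x = 3.14159
lossy = decompress_to_32(compress_to_16(x))  # close to x, but not exact
```

Values whose low mantissa bits are already zero (such as 1.0) survive the round trip exactly; others lose a small amount of precision.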
348 |
349 | - - -
350 |
351 | ## 6 Status and Experience
352 |
353 | * The system, documentation, and examples for TensorFlow can be found at [tensorflow.org](https://www.tensorflow.org/)
354 | * Currently, there are front-ends for Python and C++, and it's expected that more will be added over time (created both from internal Google users and the open-source community)
355 |
356 | ### Advice and Lessons Learned
357 |
_The following are "words of wisdom" coming from the experience of porting Google's_ [Inception](http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf) _neural network into TensorFlow. After successfully doing so, the team was rewarded with a 6-fold speedup in training time over DistBelief's implementation. This advice will hopefully be useful to others as they build their own models._
359 |
360 | 1. _Build tools to gain insight into the exact number of parameters in a given model_
361 | * This can help you catch subtle flaws in a complex network architecture, such as operations and variables instantiated incorrectly
362 | 2. _Start small and scale up_
363 | * The TensorFlow team started by importing a small network used by DistBelief
* Debugging a small network gave insight into the edge cases of certain operations; tracking down the same issues in a larger network would have been nearly impossible
365 | 3. _Always ensure that the objective (loss function) matches between machine learning systems when learning is turned off_
366 | * By setting the learning rate to zero (i.e. turning off learning), the TensorFlow team was able to identify unexpected behavior stemming from randomly initialized variables in the model
367 | 4. _Make a single machine implementation match before debugging a distributed implementation_
368 | * This helped the TensorFlow team separate and debug differences in training performance between DistBelief and TensorFlow
369 | * Once the single machine implementation worked, they were able to find bugs related to race conditions and non-atomic operations in the distributed model
370 | 5. _Guard against numerical errors_
371 | * Different numerical libraries handle non-finite floating point numbers differently
372 | * Checking for non-finite floating point values can allow one to detect errors in real time, guarding against numerical instability
373 | 6. _Analyze pieces of a network and understand the magnitude of numerical error_
374 | * By running subsections of the neural network on both DistBelief and TensorFlow in parallel, the team was able to ensure that the implemented algorithms were indeed identical
375 | * Note that because the networks used floating point numbers, there is a given amount of numerical error that should be expected and taken into account when comparing the two systems
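Point 5 above can be sketched as a simple guard (illustrative; a real TensorFlow model would use a check operation inside the graph): scan values for NaN or infinity and fail loudly where the instability first appears.

```python
import math

def check_finite(values, name):
    """Raise if any value is NaN or infinite, so numerical
    instability is caught where it first appears."""
    bad = [v for v in values if not math.isfinite(v)]
    if bad:
        raise ValueError(f"{name} contains non-finite values: {bad[:3]}")
    return values

check_finite([0.5, -1.2, 3.0], "logits")  # passes silently
try:
    check_finite([1.0, float("inf"), float("nan")], "grads")
except ValueError as e:
    caught = str(e)
```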
376 |
377 | - - -
378 |
379 | ## 7 Common Programming Idioms
380 |
381 | _This section describes how TensorFlow's basic dataflow graphs can be used to speed up training neural network models on large datasets using techniques developed by the TensorFlow team._
382 |
_The techniques presented here assume that the model is using stochastic gradient descent with mini-batches of around 100-1000 examples._
384 |
385 | ### Data Parallel Training
386 |
387 | * Users can parallelize the computation of the gradient, separating mini-batch elements onto different devices
388 | * For example, a mini-batch size of 1000 elements can be split into 10 smaller, parallel computations of 100 elements. After they all finish running, their results can be combined to achieve the same result as if the entire calculation was performed in a single, sequential computation
389 | * This translates to having many replicas of a computational subgraph and using a single client thread to coordinate the training loop of those replicas
390 | * The above approach can be modified further by making it asynchronous. Instead of having a single client thread, there are multiple clients (one for each subgraph replica), and each replica updates the trained parameters asynchronously
391 | * See Section 4 (pages 3-6) of [_Large Scale Distributed Deep Networks_](http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf) for further description of asynchronous approach
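The equivalence claimed above can be checked numerically with a toy model (illustrative function names; here a 1-D least-squares model y = w*x): averaging the per-shard gradients of equal-sized shards reproduces the full mini-batch gradient.

```python
def gradient(w, batch):
    """d/dw of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
batch = [(float(x), 3.0 * float(x)) for x in range(1, 9)]  # data: y = 3x

# Single sequential computation over the full mini-batch of 8
full = gradient(w, batch)

# Split into 4 shards of 2, compute each independently, then average
shards = [batch[i:i + 2] for i in range(0, len(batch), 2)]
combined = sum(gradient(w, s) for s in shards) / len(shards)
```

In a real deployment each shard's gradient would be computed on a different device before the results are combined.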
392 |
393 | ### Model Parallel Training
394 |
395 | * Can run separate portions of the computation graph on different devices simultaneously on the same batch of examples
396 | * See Figure 8 for a visual example of sequence-to-sequence learning parallelized across three devices
397 |
398 | ### Concurrent Steps for Model Computation Pipelining
399 |
400 | * Can also run a small set of concurrent training steps on a single device
401 | * Similar concept to asynchronous data parallelism, but the parallelism is only on a single device
402 | * This can "fill in the gaps" of device utilization, when parallel execution on all devices might not make full use of computation cycles
403 | * See Figure 9 for a visual example
404 |
405 | - - -
406 |
407 | ## 8 Performance
408 |
409 | _Stay tuned for future versions of the TensorFlow white paper, which will include performance evaluations for single machine and distributed implementations of TensorFlow_
410 |
411 | - - -
412 |
413 | ## 9 Tools
414 |
415 | _This section discusses additional tools, developed by the TensorFlow team, that work alongside the graph modeling and execution features described above._
416 |
417 | ## 9.1 TensorBoard: Visualization of Graph Structures and Summary Statistics
418 |
419 | _[TensorBoard](https://www.tensorflow.org/versions/master/resources/faq.html#tensorboard) was designed to help users visualize the structure of their graphs, as well as understand the behavior of their models_
420 |
421 | ### Visualization of Computation Graphs
422 |
423 | * TensorBoard includes features that allow for more digestible visualizations
424 | * TensorBoard can take models with tens of thousands of nodes and collapse them into high-level blocks, highlighting subgraphs with identical structures
425 | * It separates out "high-degree" nodes from the rest of the graph to further reduce visual clutter
426 | * _Note: I haven't found a proper definition of "high-degree" nodes in TensorFlow. The paper says they "often serve book-keeping functions". I imagine they are operations similar to [`tf.initialize_all_variables`](https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#initialize_all_variables), which are necessary to run the execution graph in TensorFlow but aren't really part of the mathematical model_
427 | * The TensorBoard visualization is interactive
428 | * Users can pan, zoom, and expand the collapsed blocks in order to get a lower-level view of the model
429 |
430 | ### Visualization of Summary Data
431 |
432 | * TensorBoard supports Summary operations that can be inserted into the execution graph to examine and track various values over time
433 | * _Scalar summaries_: e.g. average loss function over set of examples, total execution time of the model
434 | * _Histogram based summaries_: e.g. distribution of parameter values in a neural network layer
435 | * _Image-based summaries_: e.g. visualization of filter weights learned in a convolutional neural network
* Typically, Summary nodes are set up to monitor specific interesting values and are executed periodically during normal graph execution
437 | * After Summary nodes are executed, the client writes the summary data to a log file. TensorBoard monitors this log to display summary information over time
* The "time" used in the visualization of summary data can be: wall-clock time; absolute time; or "steps", the number of graph executions that have occurred since the first execution in the TensorFlow program
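The periodic-logging pattern can be sketched in plain Python (a toy stand-in for the real summary ops and event files): run the summary only every N steps and append the result to a log that a viewer like TensorBoard would watch.

```python
log = []  # stand-in for the event file TensorBoard monitors

def add_scalar_summary(step, tag, value):
    log.append({"step": step, "tag": tag, "value": value})

loss = 10.0
for step in range(100):
    loss *= 0.95  # pretend each training step shrinks the loss
    if step % 10 == 0:  # summaries run periodically, not every step
        add_scalar_summary(step, "avg_loss", loss)
```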
439 |
440 | ## 9.2 Performance tracing
441 |
442 | * A tool called EEG is used to examine fine-grained information about the ordering/performance of TensorFlow graphs
443 | * Works for both single machine and distributed implementations of TensorFlow
444 | * _Note: EEG is not open sourced as of writing_
445 | * Helps users understand bottlenecks in a TensorFlow program
446 |
447 | _The following is a brief overview of what EEG does under the hood_
448 |
449 | * Traces are collected via a number of sources including:
450 | * [Linux `ftrace`](http://elinux.org/Ftrace)
451 | * Internal Google tracing tools
452 | * [The CUDA Profiling Tools Interface](https://developer.nvidia.com/cuda-profiling-tools-interface)
* The trace logs enable EEG to recreate the execution of the graph with microsecond-level precision. Events from the traces, within a time range, are extracted and visualized according to the resolution of the client's user interface
454 | * The user can zoom in on portions of the graph, and EEG will update the visualization to show finer details
455 | * Any significant delays due to communication, synchronization, or direct memory access issues are identified and highlighted in the UI
456 |
457 | _Please see pages 14 and 15 of the [November 2015 white paper](http://download.tensorflow.org/paper/whitepaper2015.pdf) to see a specific example of EEG visualization along with descriptions of the current UI_
458 |
459 | - - -
460 |
461 | ## 10 Future Work
462 |
463 | _This section lists areas of improvement and extension for TensorFlow identified for consideration by the TensorFlow team_
464 |
465 | _Extensions_:
466 |
467 | * A "function" mechanism, where users can specify subgraphs of TensorFlow execution to be reusable
468 | * In the TensorFlow team's design of this mechanism (not yet implemented), these kinds of functions can be reusable across front-end languages. That is, a user could define a subgraph function in Python and use it in C++ without redefining it
469 |
470 | _Improvements_:
471 |
472 | * Continue work on a just-in-time compiler that can take in an execution subgraph and output an optimized routine for that subgraph
473 | * Such a compiler might also take in some runtime profiling information as input
474 | * The compiler should be able to perform loop fusion, block and tile for locality, and specialize routines for particular shapes and sizes of tensors, along with other optimizations
475 | * Improving the node placement and node scheduling algorithms
476 | * Instead of using man-made heuristics, have the system learn how to make good placement/scheduling decisions
477 |
478 | - - -
479 |
480 | ## 11 Related Work
481 |
482 | ### Open source, single machine systems with portions of similar functionality
483 |
484 | _Systems designed primarily for neural networks:_
485 |
486 | * [Caffe](http://caffe.berkeleyvision.org/)
487 | * [Chainer](http://chainer.org/)
488 | * [Computational Network Toolkit](https://cntk.codeplex.com/)
489 | * [Theano](http://deeplearning.net/software/theano/)
490 | * [Torch](http://torch.ch/)
491 |
492 | _Systems that support symbolic differentiation:_
493 |
494 | * [Chainer](http://chainer.org/)
495 | * [Theano](http://deeplearning.net/software/theano/)
496 |
497 | _Systems with a core written in C++:_
498 |
499 | * [Caffe](http://caffe.berkeleyvision.org/)
500 |
501 | ### Comparisons with DistBelief and Project Adam
502 |
503 | _Similarities shared with [DistBelief](http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf) and [Project Adam](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf):_
504 |
505 | * They allow computations to be distributed across many devices on many machines
506 | * Users can specify models using relatively high-level descriptions
507 |
508 | _Differences between TensorFlow and DistBelief/Project Adam:_
509 |
* TensorFlow's dataflow graph model is more flexible and better able to express a wider variety of machine learning models and algorithms
511 | * Allows for the expression of stateful Parameter nodes as variables
512 | * Variable update operations are simply additional nodes in the graph
513 | * Both DistBelief and Project Adam have subsystems dedicated to handling parameters
514 |
### Comparison with the [Halide](http://people.csail.mit.edu/fredo/tmp/Halide-5min.pdf) image processing system

* Both use a similar intermediate representation for their dataflow graphs
* However, Halide has additional high-level knowledge of its operations, which it uses to generate optimized code that combines multiple operations
* Halide is single-machine only
* The TensorFlow team hopes to extend TensorFlow's capabilities to incorporate Halide's techniques in a distributed setting
521 |
522 | ### Related distributed dataflow graph systems
523 |
524 | _Systems that represent complex workflows as dataflow graphs_
525 |
526 | * [Dryad](http://www.michaelisard.com/pubs/eurosys07.pdf)
527 | * [Flume](https://flume.apache.org/)
528 |
529 | _Systems that support data-dependent control flow_
530 |
531 | * [CIEL](http://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/ciel.pdf)
532 | * Iteration is implemented as a directed acyclic graph (DAG) that dynamically unfolds
533 | * [Naiad](http://research.microsoft.com/en-us/projects/naiad/)
534 | * Iteration is implemented as a static graph with cycles
535 |
536 | _Systems optimized for accessing the same data repeatedly_
537 |
538 | * [Spark](http://spark.apache.org/)
539 | * Uses resilient distributed datasets (RDDs) to cache outputs of earlier operations, in case they are needed again
540 |
_Systems that execute dataflow graphs across heterogeneous devices, including GPUs_
542 |
543 | * [Dandelion](http://research-srv.microsoft.com/pubs/201110/sosp13-dandelion-final.pdf)
544 |
545 | #### Features that TensorFlow incorporates from the above distributed systems
546 |
547 | _Feature implementations that are most similar to TensorFlow are listed after the feature_
548 |
549 | * The dataflow scheduler (i.e. the module that selects the next node to run)
550 | * CIEL, Dryad, Flume, Spark
551 | * Distributed architecture: using a single, optimized dataflow graph and caching information about the graph to lower coordination overhead
552 | * Naiad
553 | * Works best when there is enough RAM in the cluster to hold all working variables/computations
554 | * Naiad, Spark
555 | * Iteration: multiple replicas of the same graph can execute at once while sharing the same set of persistent variables. Replicas can share the variables asynchronously or use mechanisms such as queues to access them synchronously
556 | * Hybrid of many approaches
557 | * Iteration: a node only executes when all of its dependencies have completed
558 | * CIEL
559 | * Graph iteration is represented as a static, cyclic graph
560 |
561 | - - -
562 |
563 | ## 12 Conclusions
564 |
565 | * TensorFlow is a flexible dataflow graph programming model
566 | * There are both single machine and distributed implementations of TensorFlow
* TensorFlow was developed using prior experience at Google, as well as methods used in previous systems
568 | * An [open source implementation of TensorFlow is available](https://github.com/tensorflow/tensorflow)
569 | * As of writing, only a single-device implementation has been released from Google
570 |
571 | - - -
572 |
573 | ## Figures
574 |
575 | #### Figure 1: Example TensorFlow code fragment
576 | ```python
577 | import tensorflow as tf
578 |
579 | # 100-d vector, init to zeros
b = tf.Variable(tf.zeros([100]))
581 |
582 | # 784x100 matrix with random values
583 | W = tf.Variable(tf.random_uniform([784,100], -1, 1))
584 |
585 | # Placeholder for input
x = tf.placeholder(name="x")
587 |
588 | # Rectified linear unit of (W*x +b)
589 | relu = tf.nn.relu(tf.matmul(W, x) + b)
590 |
591 | # Cost computed as a function of relu
592 | C = [...]
593 |
594 | # Instantiate a Session
595 | s = tf.Session()
596 |
597 | for step in xrange(0, 10):
598 | # Create a 100-d vector for input
599 | input = ...construct 100-D input array ...
600 |
601 | # Find the cost, using the constructed vector as the placeholder input
602 | result = s.run(C, feed_dict = {x: input})
603 | print step, result
604 | ```
605 |
606 | #### Figure 2: Corresponding computation graph for Figure 1
607 |
608 |
609 | #### Figure 3: Single machine (left) and distributed system (right) structure
610 |
611 |
#### Figure 4: Before and after insertion of Send/Receive nodes
613 |
614 |
615 | #### Figure 5: Gradients computed for graph in figure 2
616 |
617 |
618 | #### Figure 6: Before and after graph transformation for partial execution
619 |
620 |
621 | #### Figure 7: Synchronous and asynchronous data parallel training
622 |
623 |
624 | #### Figure 8: Model parallel training
625 |
626 |
627 | #### Figure 9: Concurrent steps
628 |
629 |
--------------------------------------------------------------------------------