├── .gitignore
├── .gitmodules
├── README(chs).md
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
.vscode
*.pyc
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
[submodule "code/framework"]
    path = code/framework
    url = https://github.com/vahidk/TensorflowFramework
--------------------------------------------------------------------------------
/README(chs).md:
--------------------------------------------------------------------------------
# Effective TensorFlow 2 中文版

目录
=================
## Part I: TensorFlow 2 基础
1. [TensorFlow 2 基础](#basics)
2. [广播操作](#broadcast)
3. [利用重载操作符](#overloaded_ops)
4. [控制流操作: 条件与循环](#control_flow)
5. [用 Python Ops 做原型和高级可视化](#python_ops)
6. [TensorFlow 中的数值稳定性](#stable)
---

_我们针对新发布的 TensorFlow 2.x API 更新了本教程。如果你想看 TensorFlow 1.x 版本的教程,请移步 [v1 branch](https://github.com/vahidk/EffectiveTensorflow/tree/v1)。_

_安装 TensorFlow 2.0 (alpha) 请参照[官方网站](https://www.tensorflow.org/install/pip):_
```
pip install tensorflow==2.0.0-alpha0
```

_我们会逐步扩充新的文章,并保持内容与最新的 TensorFlow API 同步。如果你有任何改进建议,欢迎提 issue、提交补丁或通过邮件联系我们。_

# Part I: TensorFlow 2.0 基础


## TensorFlow 基础

重新设计的 TensorFlow 2 带来了更易用的 API。如果你熟悉 NumPy,使用 TensorFlow 2 会让你感到得心应手。与完全基于静态符号计算图的 TensorFlow 1 不同,TensorFlow 2 把符号计算隐藏在幕后,用起来就像 NumPy 这样的命令式库。值得注意的是,这主要是接口层面的变化:TensorFlow 2 仍然保留了符号计算的优势,TensorFlow 1.x 能做的事情(例如自动微分、在 GPU/TPU 上大规模并行计算)它都能做。

让我们从一个简单的例子开始:把两个随机矩阵相乘。先看看用 NumPy 怎么实现:
```python
import numpy as np

x = np.random.normal(size=[10, 10])
y = np.random.normal(size=[10, 10])
z = np.dot(x, y)

print(z)
```

现在看看同样的计算在 TensorFlow 2.0 中怎么写:
```python
import tensorflow as tf

x = tf.random.normal([10, 10])
y = tf.random.normal([10, 10])
z = tf.matmul(x, y)

print(z)
```
与 NumPy 类似,TensorFlow 2 也会立即执行计算并返回结果。唯一的不同是 TensorFlow 使用 tf.Tensor 类型存储结果,调用 tf.Tensor.numpy() 成员函数即可方便地把它转换成 NumPy 数组:

```python
print(z.numpy())
```

为了理解符号计算的强大之处,我们再看另一个例子。假设我们有从某条曲线(例如 f(x) = 5x^2 + 3)上采集的样本点,并且想基于这些样本估计 f(x)。我们定义一个参数化函数 g(x, w) = w0 x^2 + w1 x + w2,其中 x 是输入,w 是隐含参数,目标是找到使 g(x, w) ≈ f(x) 的参数。这可以通过最小化如下损失函数来实现:L(w) = ∑ (f(x) - g(x, w))^2。虽然这个简单问题存在解析解,但我们更愿意使用可以应用于任意可微函数的通用方法,即随机梯度下降(SGD):只需在一批样本点上计算 L(w) 关于 w 的平均梯度,然后沿梯度的反方向更新参数即可。

那么,用 TensorFlow 怎么实现呢:

```python
import numpy as np
import tensorflow as tf

# 假设我们已知目标函数是二阶多项式,
# 我们分配一个长度为 3 的系数向量,并用随机噪声初始化。
w = tf.Variable(tf.random.normal([3, 1]))

# 使用学习率为 0.1 的 Adam 优化器来最小化损失
opt = tf.optimizers.Adam(0.1)

def model(x):
    # 定义 yhat 为 y 的估计值
    f = tf.stack([tf.square(x), x, tf.ones_like(x)], 1)
    yhat = tf.squeeze(tf.matmul(f, w), 1)
    return yhat

def compute_loss(y, yhat):
    # 损失定义为 y 与 yhat 之间的 L2 距离。
    # 另外对 w 加入一个收缩(正则)项,使得到的权重较小。
    loss = tf.nn.l2_loss(yhat - y) + 0.1 * tf.nn.l2_loss(w)
    return loss

def generate_data():
    # 根据真实函数生成一些训练样本
    x = np.random.uniform(-10.0, 10.0, size=100).astype(np.float32)
    y = 5 * np.square(x) + 3
    return x, y

def train_step():
    x, y = generate_data()

    def _loss_fn():
        yhat = model(x)
        loss = compute_loss(y, yhat)
        return loss

    opt.minimize(_loss_fn, [w])

for _ in range(1000):
    train_step()

print(w.numpy())
```
运行这段代码,你会看到与下面近似的结果:
```python
[4.9924135, 0.00040895029, 3.4504161]
```
这与真实参数已经相当接近了。
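除了把损失函数传给 opt.minimize,也可以用 tf.GradientTape 显式地计算梯度,再交给优化器更新。下面是一个等价训练步骤的简单示意(补充示例,复用上面定义的 model、compute_loss、generate_data 和 opt):

```python
def train_step_with_tape():
    x, y = generate_data()

    # 在 GradientTape 的作用域内记录前向计算
    with tf.GradientTape() as tape:
        yhat = model(x)
        loss = compute_loss(y, yhat)

    # 计算损失关于 w 的梯度,并交给优化器应用更新
    grads = tape.gradient(loss, [w])
    opt.apply_gradients(zip(grads, [w]))

# 用法与前面的 train_step 相同:在训练循环中反复调用即可
```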
注意,上面的代码是以命令式(eager)模式运行的,也就是说操作会被立即执行,这种方式效率并不高。TensorFlow 2.0 也可以把一段 Python 代码转换成计算图,从而在 GPU 和 TPU 上进行优化与高效并行。要获得这些好处,只需给 train_step 函数加上 tf.function 装饰器:

```python
@tf.function
def train_step():
    x, y = generate_data()

    def _loss_fn():
        yhat = model(x)
        loss = compute_loss(y, yhat)
        return loss

    opt.minimize(_loss_fn, [w])
```

tf.function 的强大之处在于,它还能把 while、for、if 这类基本的 Python 语句自动转换成原生的 TensorFlow 操作。这一点我们后面会详细介绍。

这只是 TensorFlow 能力的冰山一角。像优化拥有数百万参数的大型神经网络这类问题,在 TensorFlow 中只需几行代码就能高效实现。TensorFlow 还会处理跨设备、跨线程的扩展,并支持多种平台。

## 广播操作

TensorFlow 支持逐元素操作的广播机制。一般来说,执行加法或乘法之类的操作时,需要确保两个操作数的形状匹配,例如不能把形状为 [3, 2] 的 tensor 加到形状为 [3, 4] 的 tensor 上。但有一个特例:当某个操作数在某一维上的长度为 1 时,TensorFlow 会隐式地沿该维度平铺(tile)这个 tensor,使其形状与另一个操作数匹配。因此,把形状为 [3, 2] 的 tensor 加到形状为 [3, 1] 的 tensor 上是合法的(可参考 NumPy 的广播机制)。

```python
import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[1.], [2.]])
# c = a + tf.tile(b, [1, 2])
c = a + b

print(c)
```

广播让我们可以进行隐式的平铺,代码更短,内存效率也更高,因为不需要存储平铺操作的中间结果。一个很适合使用广播的场景是组合长度不同的特征:通常的做法是先平铺输入 tensor,再把结果拼接起来并施加某种非线性变换。这是很多神经网络结构里的常见模式:


```python
a = tf.random.uniform([5, 3, 5])
b = tf.random.uniform([5, 1, 6])

# 平铺 b,与 a 拼接后施加非线性变换
tiled_b = tf.tile(b, [1, 3, 1])
c = tf.concat([a, tiled_b], 2)
d = tf.keras.layers.Dense(10, activation=tf.nn.relu).apply(c)

print(d)
```

但利用广播可以做得更高效。我们利用 f(m(x + y)) 等价于 f(mx + my) 这一事实,先分别做线性变换,再借助广播隐式地完成拼接:

```python
pa = tf.keras.layers.Dense(10).apply(a)
pb = tf.keras.layers.Dense(10).apply(b)
d = tf.nn.relu(pa + pb)

print(d)
```

事实上这段代码相当通用,只要两个 tensor 之间可以广播,就能应用于任意形状的 tensor:

```python
def merge(a, b, units, activation=None):
    pa = tf.keras.layers.Dense(units).apply(a)
    pb = tf.keras.layers.Dense(units).apply(b)
    c = pa + pb
    if activation is not None:
        c = activation(c)
    return c
```

前面说的是广播的好处,那么它的坏处是什么呢?隐式的假设几乎总会让调试变得更麻烦。看看下面的例子:

```python
a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
c = tf.reduce_sum(a + b)

print(c)
```

你觉得 c 的值会是多少?如果你猜 6,那就错了,正确答案是 12。这是因为当两个 tensor 的秩不一致时,TensorFlow 会在逐元素操作之前自动扩展秩较低的那个 tensor:加法的结果是 [[2, 3], [3, 4]],对所有元素求和就得到 12。

避免这类问题的办法就是尽量显式。如果我们在 reduce 时指定了维度,发现这个 bug 就会容易得多:

```python
a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
c = tf.reduce_sum(a + b, 0)

print(c)
```

这里 c 的值是 [5, 7],从结果的形状就能立刻察觉到哪里不对。一条通用的经验法则是:在 reduce 类操作中以及使用 tf.squeeze 时,始终显式指定维度。
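如果不确定一次隐式广播到底做了什么,也可以先把广播显式化再检查。下面是一个补充的小示例,用 tf.broadcast_to 把两个操作数先扩展到同一形状,再观察求和的对象:

```python
import tensorflow as tf

a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])

# 显式地把两个操作数广播到 [2, 2]
a_b = tf.broadcast_to(a, [2, 2])
b_b = tf.broadcast_to(b, [2, 2])

print((a_b + b_b).numpy())                  # [[2. 3.] [3. 4.]]
print(tf.reduce_sum(a_b + b_b, 0).numpy())  # [5. 7.]
```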
## 利用重载操作符

与 NumPy 一样,TensorFlow 重载了一些 Python 操作符,使构建计算图更容易、代码更可读。

切片操作符就是其中之一,用它索引 tensor 非常方便:
```python
z = x[begin:end]  # z = tf.slice(x, [begin], [end-begin])
```
不过使用切片时要非常小心:它的效率很低,尤其是切片次数很多时,通常最好避免。为了理解它有多低效,来看一个例子。我们想手动把矩阵的各行累加起来:

```python
import tensorflow as tf
import time

x = tf.random.uniform([500, 10])

z = tf.zeros([10])

start = time.time()
for i in range(500):
    z += x[i]
print("Took %f seconds." % (time.time() - start))
```
在我的 MacBook Pro 上执行这段代码花了 0.045 秒,相当慢。原因在于我们调用了 500 次切片操作。更好的做法是用 tf.unstack 一次性把矩阵拆成一组向量:
```python
z = tf.zeros([10])
for x_i in tf.unstack(x):
    z += x_i
```
这只花了 0.01 秒。当然,完成这个简单归约的正确做法是使用 tf.reduce_sum:
```python
z = tf.reduce_sum(x, axis=0)
```
这只用了 0.0001 秒,比最初的实现快了几百倍。

TensorFlow 还重载了一系列算术和逻辑操作符:
```python
z = -x  # z = tf.negative(x)
z = x + y  # z = tf.add(x, y)
z = x - y  # z = tf.subtract(x, y)
z = x * y  # z = tf.multiply(x, y)
z = x / y  # z = tf.divide(x, y)
z = x // y  # z = tf.math.floordiv(x, y)
z = x % y  # z = tf.math.mod(x, y)
z = x ** y  # z = tf.pow(x, y)
z = x @ y  # z = tf.matmul(x, y)
z = x > y  # z = tf.greater(x, y)
z = x >= y  # z = tf.greater_equal(x, y)
z = x < y  # z = tf.less(x, y)
z = x <= y  # z = tf.less_equal(x, y)
z = abs(x)  # z = tf.abs(x)
z = x & y  # z = tf.logical_and(x, y)
z = x | y  # z = tf.logical_or(x, y)
z = x ^ y  # z = tf.math.logical_xor(x, y)
z = ~x  # z = tf.logical_not(x)
```

你也可以使用这些操作符的增强赋值形式,例如 `x += y` 和 `x **= 2` 也是合法的。

注意,Python 不允许重载 "and"、"or"、"not" 这几个关键字。

另外,等于(==)和不等于(!=)操作符虽然在 NumPy 中被重载了,但在 TensorFlow 中没有,请使用对应的函数版本 `tf.equal` 和 `tf.not_equal`。
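作为补充示例,下面演示这两个函数版本的比较操作,以及如何判断两个 tensor 是否完全相同:

```python
import tensorflow as tf

a = tf.constant([1, 2, 3])
b = tf.constant([1, 0, 3])

print(tf.equal(a, b).numpy())      # [ True False  True]
print(tf.not_equal(a, b).numpy())  # [False  True False]

# 判断两个 tensor 是否逐元素全部相等
print(tf.reduce_all(tf.equal(a, b)).numpy())  # False
```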
## 控制流操作:条件与循环

在构建复杂模型(比如递归神经网络)时,你可能需要通过条件和循环来控制操作的流程。这一节我们介绍一些常用的控制流操作。

假设你想根据一个谓词来决定是把两个给定的 tensor 相乘还是相加。这可以简单地用 Python 内置的 if 语句或 tf.cond 函数来实现:

```python
a = tf.constant(1)
b = tf.constant(2)

p = tf.constant(True)

# 或者:
# x = tf.cond(p, lambda: a + b, lambda: a * b)
x = a + b if p else a * b

print(x.numpy())
```
由于谓词为真,输出是相加的结果,即 3。

使用 TensorFlow 时,你大多数时候面对的是很大的 tensor,并希望以 batch 方式进行操作。与之相关的条件操作是 tf.where:它和 tf.cond 一样接受一个谓词,但会根据谓词在 batch 内逐元素地选择输出。
```python
a = tf.constant([1, 1])
b = tf.constant([2, 2])

p = tf.constant([True, False])

x = tf.where(p, a + b, a * b)

print(x.numpy())
```
结果是 [3, 2]。

另一个常用的控制流操作是 tf.while_loop,它允许在 TensorFlow 中构建动态循环,处理可变长度的序列。来看看如何用它生成斐波那契数列:

```python
@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)

    for i in range(2, n):
        a, b = b, a + b

    return b

n = tf.constant(5)
b = fibonacci(n)

print(b.numpy())
```
这会输出 5。注意 tf.function 装饰器会自动把这段 Python 代码转换成 tf.while_loop,因此我们不需要直接和底层的 TF API 打交道。

现在设想我们想保留整个斐波那契数列,那就需要修改循环体来记录历史值:
```python
@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)
    c = tf.constant([1, 1])

    for i in range(2, n):
        a, b = b, a + b
        c = tf.concat([c, [b]], 0)

    return c

n = tf.constant(5)
b = fibonacci(n)

print(b.numpy())
```

如果运行这段代码,TensorFlow 会报错,提示某个循环变量的形状发生了变化。
解决办法之一是使用 "shape invariants"(形状不变量),但这一功能只能在底层的 tf.while_loop API 中使用:


```python
n = tf.constant(5)

def cond(i, a, b, c):
    return i < n

def body(i, a, b, c):
    a, b = b, a + b
    c = tf.concat([c, [b]], 0)
    return i + 1, a, b, c

i, a, b, c = tf.while_loop(
    cond, body, (2, 1, 1, tf.constant([1, 1])),
    shape_invariants=(tf.TensorShape([]),
                      tf.TensorShape([]),
                      tf.TensorShape([]),
                      tf.TensorShape([None])))

print(c.numpy())
```
这样做不仅丑陋,而且效率也不高:我们构建了一堆用不上的中间 tensor。对于这种不断增长的数组,TensorFlow 提供了更好的方案,那就是 tf.TensorArray。用 tensor array 把同样的事情再做一遍:
```python
@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)

    c = tf.TensorArray(tf.int32, n)
    c = c.write(0, a)
    c = c.write(1, b)

    for i in range(2, n):
        a, b = b, a + b
        c = c.write(i, b)

    return c.stack()

n = tf.constant(5)
c = fibonacci(n)

print(c.numpy())
```
TensorFlow 的 while 循环和 tensor array 是构建复杂递归神经网络的重要工具。作为练习,试着用 tf.while_loop 实现 [beam search](https://en.wikipedia.org/wiki/Beam_search),再想想能否用 tensor array 把它实现得更高效?

## 用 Python Ops 做原型和高级可视化

TensorFlow 中的操作 kernel 完全用 C++ 编写,以保证效率。但用 C++ 写 TensorFlow kernel 相当麻烦,所以在花大量时间实现 kernel 之前,你可能想先快速验证一下想法,哪怕实现得低效一些。借助 tf.py_function(),你可以把任意一段 Python 代码变成 TensorFlow 操作。

例如,下面用 Python op 实现一个简单的 ReLU 非线性 kernel:
```python
import numpy as np
import tensorflow as tf
import uuid

def relu(inputs):
    # Define the op in python
    def _py_relu(x):
        return np.maximum(x, 0.)

    # Define the op's gradient in python
    def _py_relu_grad(x):
        return np.float32(x > 0)

    @tf.custom_gradient
    def _relu(x):
        y = tf.py_function(_py_relu, [x], tf.float32)

        def _relu_grad(dy):
            return dy * tf.py_function(_py_relu_grad, [x], tf.float32)

        return y, _relu_grad

    return _relu(inputs)
```
为了验证梯度的正确性,可以比较解析梯度和数值梯度:
```python
# 计算解析梯度
x = tf.random.normal([10], dtype=np.float32)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = relu(x)
g = tape.gradient(y, x)
print(g)

# 计算数值梯度
dx_n = 1e-5
dy_n = relu(x + dx_n) - relu(x)
g_n = dy_n / dx_n
print(g_n)
```
这两组数值应该非常接近。

注意这个实现效率很低,只适合用来做原型验证,因为 Python 代码无法并行,也不能在 GPU 上运行。验证想法可行之后,你多半会想用 C++ 重新实现这个计算 kernel。

实践中,我们经常用 Python op 在 TensorBoard 上做可视化。假设你在构建一个图像分类模型,想在训练过程中可视化模型的预测结果。TensorFlow 提供了 tf.summary.image() 来记录图像(在 TensorFlow 2 中需要先创建一个 summary writer):
```python
writer = tf.summary.create_file_writer("logs")
with writer.as_default():
    tf.summary.image("image", images, step=0)
```
但这样只能可视化输入图像。要想可视化预测结果,就得在图像上添加标注,这用现有的 TF 操作几乎做不到。更简单的办法是用 Python 完成绘图,再把它包装成一个 Python op:
```python
import io

import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import tensorflow as tf

def visualize_labeled_images(images, labels, step=0, max_outputs=3, name="image"):
    def _visualize_image(image, label):
        # 在 python 里完成绘图
        fig = plt.figure(figsize=(3, 3), dpi=80)
        ax = fig.add_subplot(111)
        ax.imshow(image[::-1, ...])
        ax.text(0, 0, str(label),
                horizontalalignment="left",
                verticalalignment="top")
        fig.canvas.draw()

        # 把绘图结果写入内存文件
        buf = io.BytesIO()
        data = fig.savefig(buf, format="png")
        buf.seek(0)

        # 用 Pillow 解码图像并转换为 numpy 数组
        # 注意 PIL 的 size 是 (宽, 高),reshape 时要用 (高, 宽)
        img = PIL.Image.open(buf)
        return np.array(img.getdata()).reshape(img.size[1], img.size[0], -1)

    def _visualize_images(images, labels):
        # 只显示 batch 中前若干张图
        outputs = []
        for i in range(max_outputs):
            output = _visualize_image(images[i], labels[i])
            outputs.append(output)
        return np.array(outputs, dtype=np.uint8)

    # 运行 python op
    figs = tf.py_function(_visualize_images, [images, labels], tf.uint8)
    return tf.summary.image(name, figs, step=step)
```

由于 summary 通常只是每隔一段时间(而不是每一步)才计算一次,所以实践中不必担心这个实现的效率问题。

## TensorFlow 中的数值稳定性

使用 TensorFlow、NumPy 这类数值计算库时要注意:写出数学上正确的代码并不一定能得到正确的结果,你还必须保证计算过程是数值稳定的。

举个例子:我们在小学就学过,只要 y 不为 0,x * y / y 就等于 x。但在实践中并不总是如此:
```python
import numpy as np

x = np.float32(1)

y = np.float32(1e-50)  # y 会被存成 0
z = x * y / y

print(z)  # prints nan
```

出错的原因是 y 对 float32 类型来说太小了,会被直接存成 0。y 过大时也会出现类似的问题:

```python
y = np.float32(1e39)  # y 会被存成 inf
z = x * y / y

print(z)  # prints nan
```

float32 能表示的最小正数是 1.4013e-45,任何比它更小的值都会被存成 0;同样,任何超过 3.40282e+38 的数都会被存成 inf。

```python
print(np.nextafter(np.float32(0), np.float32(1)))  # prints 1.4013e-45
print(np.finfo(np.float32).max)  # prints 3.40282e+38
```
为了保证计算稳定,应当避免绝对值过小或过大的数值。这听起来理所当然,但在 TensorFlow 中做梯度下降时,这类问题可能极难调试:你不仅要保证前向传播中的所有值都落在数据类型的有效范围内,还得保证反向传播(梯度计算)中同样如此。

来看一个真实的例子。我们想对一个 logits 向量计算 softmax,一个朴素(naive)的实现是这样的:
```python
import tensorflow as tf

def unstable_softmax(logits):
    exp = tf.exp(logits)
    return exp / tf.reduce_sum(exp)

print(unstable_softmax([1000., 0.]).numpy())  # prints [ nan, 0.]
```
注意,即便 logits 的数值并不算大,对它取指数也会得到巨大的结果,很容易超出 float32 的表示范围。这个朴素实现所能接受的最大 logit 是 ln(3.40282e+38) = 88.7,超过这个值就会得到 nan。

那么怎样让它变得稳定呢?办法其实很简单。容易看出 exp(x - c) / ∑ exp(x - c) = exp(x) / ∑ exp(x),也就是说从 logits 中减去任意常数,结果都不变。我们选择减去 logits 的最大值,这样指数函数的定义域就被限制在 [-inf, 0],其值域相应落在 [0.0, 1.0] 之间,正是我们想要的:

```python
import tensorflow as tf

def softmax(logits):
    exp = tf.exp(logits - tf.reduce_max(logits))
    return exp / tf.reduce_sum(exp)

print(softmax([1000., 0.]).numpy())  # prints [ 1., 0.]
```
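实践中通常不需要手写这个技巧:TensorFlow 内置的 tf.nn.softmax 在内部同样会先减去最大值来保证数值稳定。下面做个简单验证(补充示例):

```python
import tensorflow as tf

logits = tf.constant([1000., 0.])

# 内置实现本身就是数值稳定的
print(tf.nn.softmax(logits).numpy())  # [1. 0.]
```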
我们再看一个更复杂的情形。考虑一个分类问题:我们用 softmax 函数把 logits 转成概率,再把损失函数定义为预测值与标签之间的交叉熵。回忆一下,类别分布的交叉熵可以简单地定义为 xe(p, q) = -∑ p_i log(q_i)。一个朴素的实现如下:

```python
def unstable_softmax_cross_entropy(labels, logits):
    logits = tf.math.log(softmax(logits))
    return -tf.reduce_sum(labels * logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = unstable_softmax_cross_entropy(labels, logits)

print(xe.numpy())  # prints inf
```

注意在这种实现中,当 softmax 的输出趋近于 0 时,log 的输出会趋向负无穷,从而导致计算不稳定。把 softmax 展开并做一些化简,可以把它改写成稳定的版本:

```python
def softmax_cross_entropy(labels, logits):
    scaled_logits = logits - tf.reduce_max(logits)
    normalized_logits = scaled_logits - tf.reduce_logsumexp(scaled_logits)
    return -tf.reduce_sum(labels * normalized_logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = softmax_cross_entropy(labels, logits)

print(xe.numpy())  # prints 500.0
```

我们还可以验证梯度的计算也是正确的:
```python
with tf.GradientTape() as tape:
    tape.watch(logits)
    xe = softmax_cross_entropy(labels, logits)

g = tape.gradient(xe, logits)
print(g.numpy())  # prints [0.5, -0.5]
```
结果是正确的。

最后再次提醒:做梯度下降时要格外小心,确保每一层的函数值及其梯度都在有效范围内。指数和对数函数在不加处理地使用时尤其容易出问题,因为它们会把很小的数映射成极大的数,反之亦然。

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Effective TensorFlow 2

Table of Contents
=================
## Part I: TensorFlow 2 Fundamentals
1. [TensorFlow 2 Basics](#basics)
2. [Broadcasting the good and the ugly](#broadcast)
3. [Take advantage of the overloaded operators](#overloaded_ops)
4. [Control flow operations: conditionals and loops](#control_flow)
5. [Prototyping kernels and advanced visualization with Python ops](#python_ops)
6. [Numerical stability in TensorFlow](#stable)
---

_We updated the guide to follow the newly released TensorFlow 2.x API. If you want the original guide for TensorFlow 1.x see the [v1 branch](https://github.com/vahidk/EffectiveTensorflow/tree/v1)._

_To install TensorFlow 2.0 (alpha) follow the [instructions on the official website](https://www.tensorflow.org/install/pip):_
```
pip install tensorflow==2.0.0-alpha0
```

_We aim to gradually expand this series by adding new articles and keep the content up to date with the latest releases of TensorFlow API. If you have suggestions on how to improve this series or find the explanations ambiguous, feel free to create an issue, send patches, or reach out by email._

# Part I: TensorFlow 2.0 Fundamentals


## TensorFlow Basics

TensorFlow 2 went through a massive redesign to make the API more accessible and easier to use.
If you are familiar with NumPy you will find yourself right at home when using TensorFlow 2. Unlike TensorFlow 1, which was purely symbolic, TensorFlow 2 hides its symbolic nature under the hood to look like any other imperative library such as NumPy. It's important to note the change is mostly an interface change, and TensorFlow 2 is still able to take advantage of its symbolic machinery to do everything that TensorFlow 1.x can do (e.g. automatic differentiation and massively parallel computation on TPUs/GPUs).

Let's start with a simple example: we want to multiply two random matrices. First we look at an implementation done in NumPy:
```python
import numpy as np

x = np.random.normal(size=[10, 10])
y = np.random.normal(size=[10, 10])
z = np.dot(x, y)

print(z)
```

Now we perform the exact same computation this time in TensorFlow 2.0:
```python
import tensorflow as tf

x = tf.random.normal([10, 10])
y = tf.random.normal([10, 10])
z = tf.matmul(x, y)

print(z)
```
Similar to NumPy, TensorFlow 2 also immediately performs the computation and produces the result. The only difference is that TensorFlow uses the tf.Tensor type to store the results, which can be easily converted to NumPy by calling the tf.Tensor.numpy() member function:

```python
print(z.numpy())
```
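Going the other way is just as easy: NumPy arrays can be fed to TensorFlow ops directly, or converted explicitly. A small illustrative sketch (our addition, not part of the original guide):

```python
import numpy as np
import tensorflow as tf

# NumPy arrays are accepted by TensorFlow ops and can be converted explicitly.
a = np.random.normal(size=[10, 10]).astype(np.float32)
t = tf.convert_to_tensor(a)

print(t.shape)                  # (10, 10)
print(tf.matmul(t, t).numpy())  # and back to a NumPy array again
```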
To understand how powerful symbolic computation can be let's have a look at another example. Assume that we have samples from a curve (say f(x) = 5x^2 + 3) and we want to estimate f(x) based on these samples. We define a parametric function g(x, w) = w0 x^2 + w1 x + w2, which is a function of the input x and latent parameters w; our goal is then to find the latent parameters such that g(x, w) ≈ f(x). This can be done by minimizing the following loss function: L(w) = ∑ (f(x) - g(x, w))^2. Although there's a closed form solution for this simple problem, we opt to use a more general approach that can be applied to any arbitrary differentiable function, and that is using stochastic gradient descent. We simply compute the average gradient of L(w) with respect to w over a set of sample points and move in the opposite direction.

Here's how it can be done in TensorFlow:

```python
import numpy as np
import tensorflow as tf

# Assuming we know that the desired function is a polynomial of 2nd degree, we
# allocate a vector of size 3 to hold the coefficients and initialize it with
# random noise.
w = tf.Variable(tf.random.normal([3, 1]))

# We use the Adam optimizer with learning rate set to 0.1 to minimize the loss.
opt = tf.optimizers.Adam(0.1)

def model(x):
    # We define yhat to be our estimate of y.
    f = tf.stack([tf.square(x), x, tf.ones_like(x)], 1)
    yhat = tf.squeeze(tf.matmul(f, w), 1)
    return yhat

def compute_loss(y, yhat):
    # The loss is defined to be the l2 distance between our estimate of y and its
    # true value. We also added a shrinkage term, to ensure the resulting weights
    # would be small.
    loss = tf.nn.l2_loss(yhat - y) + 0.1 * tf.nn.l2_loss(w)
    return loss

def generate_data():
    # Generate some training data based on the true function
    x = np.random.uniform(-10.0, 10.0, size=100).astype(np.float32)
    y = 5 * np.square(x) + 3
    return x, y

def train_step():
    x, y = generate_data()

    def _loss_fn():
        yhat = model(x)
        loss = compute_loss(y, yhat)
        return loss

    opt.minimize(_loss_fn, [w])

for _ in range(1000):
    train_step()

print(w.numpy())
```
By running this piece of code you should see a result close to this:
```python
[4.9924135, 0.00040895029, 3.4504161]
```
Which is a relatively close approximation to our parameters.

Note that in the above code we are running TensorFlow in imperative mode (i.e. operations get instantly executed), which is not very efficient. TensorFlow 2.0 can also turn a given piece of python code into a graph which can then be optimized and efficiently parallelized on GPUs and TPUs. To get all those benefits we simply need to decorate the train_step function with the tf.function decorator:

```python
@tf.function
def train_step():
    x, y = generate_data()

    def _loss_fn():
        yhat = model(x)
        loss = compute_loss(y, yhat)
        return loss

    opt.minimize(_loss_fn, [w])
```

What's cool about tf.function is that it's also able to convert basic python statements like while, for and if into native TensorFlow functions. We will get to that later.
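As a quick illustration of that claim, here is a small sketch (our addition, not from the original guide) with a python loop and conditional inside a tf.function; AutoGraph rewrites them into graph control flow when the arguments are tensors:

```python
import tensorflow as tf

@tf.function
def count_even(n):
    # A plain python loop and if statement; AutoGraph converts them into
    # tf.while_loop / tf.cond because `n` and `i` are tensors.
    count = tf.constant(0)
    for i in tf.range(n):
        if tf.equal(i % 2, 0):
            count += 1
    return count

print(count_even(tf.constant(10)).numpy())  # 5
```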
This is just the tip of the iceberg for what TensorFlow can do. Many problems such as optimizing large neural networks with millions of parameters can be implemented efficiently in TensorFlow in just a few lines of code. TensorFlow takes care of scaling across multiple devices and threads, and supports a variety of platforms.

## Broadcasting the good and the ugly

TensorFlow supports broadcasting elementwise operations. Normally when you want to perform operations like addition and multiplication, you need to make sure that the shapes of the operands match, e.g. you can't add a tensor of shape [3, 2] to a tensor of shape [3, 4]. But there's a special case and that's when you have a singular dimension. TensorFlow implicitly tiles the tensor across its singular dimensions to match the shape of the other operand. So it's valid to add a tensor of shape [3, 2] to a tensor of shape [3, 1]:

```python
import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[1.], [2.]])
# c = a + tf.tile(b, [1, 2])
c = a + b

print(c)
```

Broadcasting allows us to perform implicit tiling which makes the code shorter, and more memory efficient, since we don't need to store the result of the tiling operation. One neat place that this can be used is when combining features of varying length. In order to concatenate features of varying length we commonly tile the input tensors, concatenate the result and apply some nonlinearity. This is a common pattern across a variety of neural network architectures:

```python
a = tf.random.uniform([5, 3, 5])
b = tf.random.uniform([5, 1, 6])

# concat a and b and apply nonlinearity
tiled_b = tf.tile(b, [1, 3, 1])
c = tf.concat([a, tiled_b], 2)
d = tf.keras.layers.Dense(10, activation=tf.nn.relu).apply(c)

print(d)
```

But this can be done more efficiently with broadcasting. We use the fact that f(m(x + y)) is equal to f(mx + my). So we can do the linear operations separately and use broadcasting to do implicit concatenation:

```python
pa = tf.keras.layers.Dense(10).apply(a)
pb = tf.keras.layers.Dense(10).apply(b)
d = tf.nn.relu(pa + pb)

print(d)
```

In fact this piece of code is pretty general and can be applied to tensors of arbitrary shape as long as broadcasting between tensors is possible:

```python
def merge(a, b, units, activation=None):
    pa = tf.keras.layers.Dense(units).apply(a)
    pb = tf.keras.layers.Dense(units).apply(b)
    c = pa + pb
    if activation is not None:
        c = activation(c)
    return c
```

So far we discussed the good part of broadcasting. But what's the ugly part you may ask? Implicit assumptions almost always make debugging harder to do. Consider the following example:

```python
a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
c = tf.reduce_sum(a + b)

print(c)
```

What do you think the value of c would be after evaluation? If you guessed 6, that's wrong. It's going to be 12. This is because when the ranks of two tensors don't match, TensorFlow automatically expands the first dimension of the tensor with the lower rank before the elementwise operation, so the result of the addition would be [[2, 3], [3, 4]], and reducing over all elements would give us 12.

The way to avoid this problem is to be as explicit as possible. Had we specified which dimension we would want to reduce across, catching this bug would have been much easier:

```python
a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
c = tf.reduce_sum(a + b, 0)

print(c)
```

Here the value of c would be [5, 7], and we immediately would guess based on the shape of the result that there's something wrong. A general rule of thumb is to always specify the dimensions in reduction operations and when using tf.squeeze.
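When an implicit broadcast surprises you, it can also help to make it explicit. The sketch below (our addition) uses tf.broadcast_to to materialize both operands before the reduction, so the shapes being summed are visible:

```python
import tensorflow as tf

a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])

# Materialize the broadcast explicitly to see what is actually being summed.
a_b = tf.broadcast_to(a, [2, 2])
b_b = tf.broadcast_to(b, [2, 2])

print((a_b + b_b).numpy())                  # [[2. 3.] [3. 4.]]
print(tf.reduce_sum(a_b + b_b, 0).numpy())  # [5. 7.]
```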
## Take advantage of the overloaded operators

Just like NumPy, TensorFlow overloads a number of python operators to make building graphs easier and the code more readable.

The slicing op is one of the overloaded operators that can make indexing tensors very easy:
```python
z = x[begin:end]  # z = tf.slice(x, [begin], [end-begin])
```
Be very careful when using this op though. The slicing op is very inefficient and often better avoided, especially when the number of slices is high. To understand how inefficient this op can be let's look at an example. We want to manually perform reduction across the rows of a matrix:
```python
import tensorflow as tf
import time

x = tf.random.uniform([500, 10])

z = tf.zeros([10])

start = time.time()
for i in range(500):
    z += x[i]
print("Took %f seconds." % (time.time() - start))
```
On my MacBook Pro, this took 0.045 seconds to run, which is quite slow. The reason is that we are calling the slice op 500 times, which is going to be very slow to run. A better choice would have been to use the tf.unstack op to slice the matrix into a list of vectors all at once:
```python
z = tf.zeros([10])
for x_i in tf.unstack(x):
    z += x_i
```
This took 0.01 seconds. Of course, the right way to do this simple reduction is to use the tf.reduce_sum op:
```python
z = tf.reduce_sum(x, axis=0)
```
This took 0.0001 seconds, which is hundreds of times faster than the original implementation.

TensorFlow also overloads a range of arithmetic and logical operators:
```python
z = -x  # z = tf.negative(x)
z = x + y  # z = tf.add(x, y)
z = x - y  # z = tf.subtract(x, y)
z = x * y  # z = tf.multiply(x, y)
z = x / y  # z = tf.divide(x, y)
z = x // y  # z = tf.math.floordiv(x, y)
z = x % y  # z = tf.math.mod(x, y)
z = x ** y  # z = tf.pow(x, y)
z = x @ y  # z = tf.matmul(x, y)
z = x > y  # z = tf.greater(x, y)
z = x >= y  # z = tf.greater_equal(x, y)
z = x < y  # z = tf.less(x, y)
z = x <= y  # z = tf.less_equal(x, y)
z = abs(x)  # z = tf.abs(x)
z = x & y  # z = tf.logical_and(x, y)
z = x | y  # z = tf.logical_or(x, y)
z = x ^ y  # z = tf.math.logical_xor(x, y)
z = ~x  # z = tf.logical_not(x)
```

You can also use the augmented version of these ops. For example `x += y` and `x **= 2` are also valid.

Note that Python doesn't allow overloading the "and", "or", and "not" keywords.

Other operators that aren't supported are the equality (==) and inequality (!=) operators, which are overloaded in NumPy but not in TensorFlow. Use the function versions, `tf.equal` and `tf.not_equal`, instead.
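A quick illustration of the function-style comparisons (our addition, not part of the original guide):

```python
import tensorflow as tf

a = tf.constant([1, 2, 3])
b = tf.constant([1, 0, 3])

print(tf.equal(a, b).numpy())      # [ True False  True]
print(tf.not_equal(a, b).numpy())  # [False  True False]
```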
## Control flow operations: conditionals and loops

When building complex models such as recurrent neural networks you may need to control the flow of operations through conditionals and loops. In this section we introduce a number of commonly used control flow ops.

Let's assume you want to decide whether to multiply or add two given tensors based on a predicate. This can be simply implemented with either python's built-in if statement or using the tf.cond function:
```python
a = tf.constant(1)
b = tf.constant(2)

p = tf.constant(True)

# Alternatively:
# x = tf.cond(p, lambda: a + b, lambda: a * b)
x = a + b if p else a * b

print(x.numpy())
```
Since the predicate is True in this case, the output would be the result of the addition, which is 3.

Most of the time when using TensorFlow you are using large tensors and want to perform operations in batch. A related conditional operation is tf.where, which like tf.cond takes a predicate, but selects the output based on the condition in batch.
```python
a = tf.constant([1, 1])
b = tf.constant([2, 2])

p = tf.constant([True, False])

x = tf.where(p, a + b, a * b)

print(x.numpy())
```
This will return [3, 2].

Another widely used control flow operation is tf.while_loop. It allows building dynamic loops in TensorFlow that operate on sequences of variable length. Let's see how we can generate the Fibonacci sequence with tf.while_loop:

```python
@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)

    for i in range(2, n):
        a, b = b, a + b

    return b

n = tf.constant(5)
b = fibonacci(n)

print(b.numpy())
```
This will print 5. Note that tf.function automatically converts the given python code to use tf.while_loop so we don't need to directly interact with the TF API.

Now imagine we want to keep the whole Fibonacci sequence. We may update our body to keep a record of the history of the values:
```python
@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)
    c = tf.constant([1, 1])

    for i in range(2, n):
        a, b = b, a + b
        c = tf.concat([c, [b]], 0)

    return c

n = tf.constant(5)
b = fibonacci(n)

print(b.numpy())
```

Now if you try running this, TensorFlow will complain that the shape of one of the loop variables is changing.
One way to fix this is to use "shape invariants", but this functionality is only available when using the low-level tf.while_loop API:


```python
n = tf.constant(5)

def cond(i, a, b, c):
    return i < n

def body(i, a, b, c):
    a, b = b, a + b
    c = tf.concat([c, [b]], 0)
    return i + 1, a, b, c

i, a, b, c = tf.while_loop(
    cond, body, (2, 1, 1, tf.constant([1, 1])),
    shape_invariants=(tf.TensorShape([]),
                      tf.TensorShape([]),
                      tf.TensorShape([]),
                      tf.TensorShape([None])))

print(c.numpy())
```

This is not only getting ugly, but is also pretty inefficient. Note that we are building a lot of intermediary tensors that we don't use. TensorFlow has a better solution for this kind of growing array. Meet tf.TensorArray. Let's do the same thing this time with tensor arrays:
```python
@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)

    c = tf.TensorArray(tf.int32, n)
    c = c.write(0, a)
    c = c.write(1, b)

    for i in range(2, n):
        a, b = b, a + b
        c = c.write(i, b)

    return c.stack()

n = tf.constant(5)
c = fibonacci(n)

print(c.numpy())
```
TensorFlow while loops and tensor arrays are essential tools for building complex recurrent neural networks. As an exercise try implementing [beam search](https://en.wikipedia.org/wiki/Beam_search) using tf.while_loops. Can you make it more efficient with tensor arrays?

## Prototyping kernels and advanced visualization with Python ops

Operation kernels in TensorFlow are entirely written in C++ for efficiency. But writing a TensorFlow kernel in C++ can be quite a pain. So, before spending hours implementing your kernel you may want to prototype something quickly, however inefficient. With tf.py_function() you can turn any piece of python code into a TensorFlow operation.

For example this is how you can implement a simple ReLU nonlinearity kernel in TensorFlow as a python op:
```python
import numpy as np
import tensorflow as tf
import uuid

def relu(inputs):
    # Define the op in python
    def _py_relu(x):
        return np.maximum(x, 0.)

    # Define the op's gradient in python
    def _py_relu_grad(x):
        return np.float32(x > 0)

    @tf.custom_gradient
    def _relu(x):
        y = tf.py_function(_py_relu, [x], tf.float32)

        def _relu_grad(dy):
            return dy * tf.py_function(_py_relu_grad, [x], tf.float32)

        return y, _relu_grad

    return _relu(inputs)
```

To verify that the gradients are correct you can compare the analytical and numerical gradients and check that the values are close:
```python
# Compute analytical gradient
x = tf.random.normal([10], dtype=np.float32)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = relu(x)
g = tape.gradient(y, x)
print(g)

# Compute numerical gradient
dx_n = 1e-5
dy_n = relu(x + dx_n) - relu(x)
g_n = dy_n / dx_n
print(g_n)
```
The numbers should be very close.

Note that this implementation is pretty inefficient, and is only useful for prototyping, since the python code is not parallelizable and won't run on a GPU. Once you've verified your idea, you'll definitely want to write it as a C++ kernel.

In practice we commonly use python ops to do visualization on TensorBoard. Consider the case that you are building an image classification model and want to visualize your model predictions during training. TensorFlow allows visualizing images with the tf.summary.image() function (in TensorFlow 2 summaries are written through a summary writer):
```python
writer = tf.summary.create_file_writer("logs")
with writer.as_default():
    tf.summary.image("image", images, step=0)
```
But this only visualizes the input image. In order to visualize the predictions you have to find a way to add annotations to the image which may be almost impossible with existing ops. An easier way to do this is to do the drawing in python, and wrap it in a python op:
```python
import io

import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import tensorflow as tf

def visualize_labeled_images(images, labels, step=0, max_outputs=3, name="image"):
    def _visualize_image(image, label):
        # Do the actual drawing in python
        fig = plt.figure(figsize=(3, 3), dpi=80)
        ax = fig.add_subplot(111)
        ax.imshow(image[::-1, ...])
        ax.text(0, 0, str(label),
                horizontalalignment="left",
                verticalalignment="top")
        fig.canvas.draw()

        # Write the plot as a memory file.
        buf = io.BytesIO()
        data = fig.savefig(buf, format="png")
        buf.seek(0)

        # Read the image and convert to numpy array.
        # Note that PIL's size is (width, height), so we reshape to (height, width).
        img = PIL.Image.open(buf)
        return np.array(img.getdata()).reshape(img.size[1], img.size[0], -1)

    def _visualize_images(images, labels):
        # Only display the given number of examples in the batch
        outputs = []
        for i in range(max_outputs):
            output = _visualize_image(images[i], labels[i])
            outputs.append(output)
        return np.array(outputs, dtype=np.uint8)

    # Run the python op.
    figs = tf.py_function(_visualize_images, [images, labels], tf.uint8)
    return tf.summary.image(name, figs, step=step)
```

Note that since summaries are usually only evaluated once in a while (not per step), this implementation may be used in practice without worrying about efficiency.

## Numerical stability in TensorFlow

When using any numerical computation library such as NumPy or TensorFlow, it's important to note that writing mathematically correct code doesn't necessarily lead to correct results. You also need to make sure that the computations are stable.

Let's start with a simple example. From primary school we know that x * y / y is equal to x for any non-zero value of y. But let's see if that's always true in practice:
```python
import numpy as np

x = np.float32(1)

y = np.float32(1e-50)  # y would be stored as zero
z = x * y / y

print(z)  # prints nan
```

The reason for the incorrect result is that y is simply too small for the float32 type. A similar problem occurs when y is too large:

```python
y = np.float32(1e39)  # y would be stored as inf
z = x * y / y

print(z)  # prints nan
```

The smallest positive value that the float32 type can represent is 1.4013e-45 and anything below that would be stored as zero. Also, any number beyond 3.40282e+38 would be stored as inf.

```python
print(np.nextafter(np.float32(0), np.float32(1)))  # prints 1.4013e-45
print(np.finfo(np.float32).max)  # prints 3.40282e+38
```
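If an intermediate computation genuinely needs values outside this range, one mitigation is to carry it out in a wider type such as float64 and cast back afterwards, at the cost of memory and speed. A small sketch of ours:

```python
import tensorflow as tf

x = tf.constant(1.0, dtype=tf.float64)
y = tf.constant(1e-50, dtype=tf.float64)  # representable in float64

z = x * y / y
print(z.numpy())                       # 1.0

# Cast back to float32 once the sensitive part of the computation is done.
print(tf.cast(z, tf.float32).numpy())  # 1.0
```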
To make sure that your computations are stable, you want to avoid values with very small or very large absolute value. This may sound very obvious, but these kinds of problems can become extremely hard to debug, especially when doing gradient descent in TensorFlow. This is because you not only need to make sure that all the values in the forward pass are within the valid range of your data types, but also you need to make sure of the same for the backward pass (during gradient computation).

Let's look at a real example. We want to compute the softmax over a vector of logits. A naive implementation would look something like this:
```python
import tensorflow as tf

def unstable_softmax(logits):
    exp = tf.exp(logits)
    return exp / tf.reduce_sum(exp)

print(unstable_softmax([1000., 0.]).numpy())  # prints [ nan, 0.]
```
Note that computing the exponential of even relatively small logits results in gigantic values that are out of the float32 range. The largest valid logit for our naive softmax implementation is ln(3.40282e+38) = 88.7; anything beyond that leads to a nan outcome.

But how can we make this more stable? The solution is rather simple. It's easy to see that exp(x - c) / ∑ exp(x - c) = exp(x) / ∑ exp(x). Therefore we can subtract any constant from the logits and the result would remain the same. We choose this constant to be the maximum of the logits. This way the domain of the exponential function would be limited to [-inf, 0], and consequently its range would be [0.0, 1.0], which is desirable:

```python
import tensorflow as tf

def softmax(logits):
    exp = tf.exp(logits - tf.reduce_max(logits))
    return exp / tf.reduce_sum(exp)

print(softmax([1000., 0.]).numpy())  # prints [ 1., 0.]
```
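In practice you rarely need to hand-roll this: TensorFlow's built-in tf.nn.softmax applies the same max-subtraction trick internally. A quick check (our addition):

```python
import tensorflow as tf

logits = tf.constant([1000., 0.])

# The built-in op is numerically stable out of the box.
print(tf.nn.softmax(logits).numpy())  # [1. 0.]
```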
Let's look at a more complicated case. Consider we have a classification problem. We use the softmax function to produce probabilities from our logits. We then define our loss function to be the cross entropy between our predictions and the labels. Recall that cross entropy for a categorical distribution can be simply defined as xe(p, q) = -∑ p_i log(q_i). So a naive implementation of the cross entropy would look like this:

```python
def unstable_softmax_cross_entropy(labels, logits):
    logits = tf.math.log(softmax(logits))
    return -tf.reduce_sum(labels * logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = unstable_softmax_cross_entropy(labels, logits)

print(xe.numpy())  # prints inf
```

Note that in this implementation as the softmax output approaches zero, the log's output approaches negative infinity, which causes instability in our computation. We can rewrite this by expanding the softmax and doing some simplifications:

```python
def softmax_cross_entropy(labels, logits):
    scaled_logits = logits - tf.reduce_max(logits)
    normalized_logits = scaled_logits - tf.reduce_logsumexp(scaled_logits)
    return -tf.reduce_sum(labels * normalized_logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = softmax_cross_entropy(labels, logits)

print(xe.numpy())  # prints 500.0
```

We can also verify that the gradients are computed correctly:
```python
with tf.GradientTape() as tape:
    tape.watch(logits)
    xe = softmax_cross_entropy(labels, logits)

g = tape.gradient(xe, logits)
print(g.numpy())  # prints [0.5, -0.5]
```
which is correct.

Let me remind you again that extra care must be taken when doing gradient descent to make sure that the range of your functions as well as the gradients for each layer are within a valid range. Exponential and logarithmic functions when used naively are especially problematic because they can map small numbers to enormous ones and the other way around.

--------------------------------------------------------------------------------