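├── ICLR2019 Oral- 图网络有多猛?.md
├── README.md
├── colab-talk-shenzhen-0615
    ├── Colab-Practice.ipynb
    └── TFUG-Colab-Talk.pdf
├── img
    └── 多目标优化Algorithm2.png
├── tf.function和Autograph使用指南-Part 1.md
├── tf.function和Autograph使用指南-Part 2.md
├── tf.function和Autograph使用指南-Part 3.md
└── 用多目标优化解决多任务学习.md

/ICLR2019 Oral- 图网络有多猛?.md:
--------------------------------------------------------------------------------
# ICLR 2019 Oral: How Powerful Are Graph Networks?

A while ago I came across a very interesting paper, [How Powerful are Graph Neural Networks?](https://arxiv.org/abs/1810.00826). It came back to mind recently while I was thinking about the Transformer architecture, so I went and watched the ICLR talk for it. Interestingly, after the talk one researcher asked the author what he thought of self-attention. The author did not answer in much detail: he only said that self-attention is not a maximally powerful GNN (a notion explained below), but that it has its uses when node features are rich.

Enough small talk; back to the paper.

The paper does the following:

1. Introduces the general framework of current GNNs and the WL test (Weisfeiler-Lehman test)
2. Defines what a powerful GNN is, proves which GNNs meet that standard, and gives a concrete implementation (GIN)
3. Discusses some operations that are not powerful

I think the paper is well written and easy to follow. The proofs are all in the appendix, and reading it gives a rough picture of graph networks. This post summarizes the results of the work and does not go into the proofs (honestly, I was too lazy to read them). My understanding is limited, so corrections are welcome.

## The general GNN framework

Notation:

- $V$, $E$: the sets of nodes and edges of the graph
- $h_v^k$: the feature vector of node $v$ at the $k$-th iteration ($k$-th layer)
- $N(v)$: all neighbours of node $v$

A graph consists of nodes ($V$) and edges ($E$). We write $X_v$ for the initial feature vector of node $v$ (for example, a one-hot vector). In a GNN we want to learn the information of the whole graph into the per-node feature vectors $h_v$, and then use them for node classification or for classifying the whole graph.

Seen this way, ordinary NLP tasks can also be cast as graphs: the nodes are words and the edges connect adjacent words. Classifying every word, i.e. sequence labelling (NER and similar tasks), is node classification, while ordinary classification tasks such as topic classification are graph classification.

A typical GNN iteratively aggregates neighbouring nodes. That is, the feature $h_v^k$ of a node at the $k$-th iteration depends on its own feature from the previous iteration, $h_v^{k-1}$, and on the features of all its neighbours from the previous iteration, $\{h_u^{k-1}, u \in N(v) \}$. The representation of the whole graph then combines the feature vectors of all nodes after the last iteration $K$. Concretely:

$$
\begin{aligned}
&a_v^{k} = AGGREGATE^k(\{h_u^{k-1}: u \in N(v) \}) \\
&h_v^k = COMBINE^k(h_v^{k-1}, a_v^k) \\
&h_G = READOUT(\{ h_v^K | v \in G \})
\end{aligned}
$$

The first equation gathers the neighbours' features, the second combines them with the node's own features to produce new features, and the third combines the final representations of all nodes into a graph representation. GNN variants mostly differ in their choice of $AGGREGATE$, $COMBINE$ and $READOUT$. In GraphSAGE (with the max-pooling aggregator), for example, $AGGREGATE$ is dense + ReLU + max pooling and $COMBINE$ is concat + dense.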
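The three equations above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`aggregate_sum`, `combine`, `gnn_layer`, `readout_sum`), the `eps` weight and the toy graph are all made up for the example, sum is used for both $AGGREGATE$ and $READOUT$, and $COMBINE$ is a plain weighted sum with no learned parameters.

```python
from typing import Dict, List

Vector = List[float]
Graph = Dict[int, List[int]]   # node id -> list of neighbour ids
Features = Dict[int, Vector]   # node id -> feature vector h_v


def aggregate_sum(h: Features, neighbours: List[int], dim: int) -> Vector:
    """AGGREGATE: element-wise sum of the neighbours' feature vectors."""
    total = [0.0] * dim
    for u in neighbours:
        total = [t + x for t, x in zip(total, h[u])]
    return total


def combine(h_v: Vector, a_v: Vector, eps: float = 0.0) -> Vector:
    """COMBINE (hypothetical choice): (1 + eps) * own features + aggregate."""
    return [(1.0 + eps) * x + a for x, a in zip(h_v, a_v)]


def gnn_layer(graph: Graph, h: Features, eps: float = 0.0) -> Features:
    """One iteration: compute h_v^k from h^{k-1} for every node v."""
    dim = len(next(iter(h.values())))
    return {v: combine(h[v], aggregate_sum(h, graph[v], dim), eps) for v in graph}


def readout_sum(h: Features) -> Vector:
    """READOUT: element-wise sum over all node features."""
    dim = len(next(iter(h.values())))
    total = [0.0] * dim
    for vec in h.values():
        total = [t + x for t, x in zip(total, vec)]
    return total


# Toy graph: node 0 is connected to nodes 1 and 2.
graph = {0: [1, 2], 1: [0], 2: [0]}
h0 = {v: [1.0] for v in graph}   # identical initial features X_v
h1 = gnn_layer(graph, h0)        # one round of neighbour aggregation
h_G = readout_sum(h1)            # graph-level representation
print(h1, h_G)
```

Even with identical initial features, one iteration already separates node 0 (two neighbours) from nodes 1 and 2 (one neighbour each), which is the kind of structural information the rest of the post is concerned with.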
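## WL Test

The WL test repeatedly aggregates node labels. Each iteration does two things:

1. Aggregates the labels of each node and its neighbours
2. Hashes the aggregated label into a new label

For two graphs, if at any iteration of the WL test the node labels of the two graphs disagree, the graphs are non-isomorphic.

Unlike the WL test, which maps each node to a one-hot label, a GNN maps each node to a low-dimensional dense vector.

For example, take the two sentence graphs "机器学习真有趣" ("machine learning is really fun") and "机器学习真无趣" ("machine learning is really boring"), with one node per character, and run the WL iterations on the node "习". The aggregated labels are:

```
iter 0: 习
iter 1: 学习真
iter 2: 器学习真有

iter 0: 习
iter 1: 学习真
iter 2: 器学习真无
```

At `iter 2` the labels of the "习" node differ, so the two graphs are non-isomorphic.

## What is a powerful GNN?

It does sound a bit melodramatic...

A powerful GNN is one that can tell different graphs apart by mapping them to different vectors in the same embedding space.

The author then proves two results:

1. The WL test is the benchmark for "powerful": a neighbourhood-aggregating GNN is at most as powerful as the WL test at distinguishing graphs
2. If the $AGGREGATE$, $COMBINE$ and $READOUT$ functions are all one-to-one (injective) mappings, the GNN is as powerful as the WL test

So any GNN that satisfies condition 2 is powerful. The author proposes a GNN that is both powerful and simple: the Graph Isomorphism Network (GIN).

## GIN

Next, the author proves the following lemma (the original is not restricted to graphs; it is slightly modified here for readability, so it is not rigorous!):

Suppose the nodes of graph $G$ are countable and the number of neighbours $N(v)$ of every node is bounded. Then there **exists** a function $f: V \rightarrow \mathbb{R}^n$ such that, for infinitely many choices of $\epsilon$, the function

$$h(v, N(v)) = (1+\epsilon)\cdot f(v) + \sum_{u \in N(v)}f(u)$$

is in one-to-one correspondence with the pair $(v, N(v))$ (why a sum? See Lemma 5 of the paper). Moreover, any function $g$ can be decomposed as $g(v, N(v)) = \phi(h(v, N(v)))$.

Once such a function exists, things are easy. By the universal approximation theorem a sufficiently large MLP can approximate such functions, so we simply throw an MLP at it, which conveniently also fits the composition $f\circ \phi$ (the $f$ of the next layer applied after the $\phi$ of the current one):

$$ h_v^k=MLP\left((1 + \epsilon ) \cdot h_v^{k-1} + \sum_{u \in N(v)}h_u^{k-1}\right) $$

Here $\epsilon$ can be fixed in advance or learned.

At this point both $AGGREGATE$ and $COMBINE$ are injective.

For $READOUT$ a plain sum is enough. To make the results a bit nicer, the author also concatenates the features from every layer:

$$ h_G = CONCAT(SUM(\{ h_v^k | v \in G \}) | k = 0, 1, \dots, K)$$

## GNNs that are not powerful enough

The following choices make a GNN less powerful:

- A single-layer perceptron (instead of an MLP)
- Replacing sum with mean pooling or max pooling
  - For mean: if the statistical and distributional information of the graph matters more than the graph structure, mean pooling can also work well.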
    Also, if node features vary widely and rarely repeat, mean is as powerful as sum
  - For max: if what matters most for the task is identifying the representative nodes, i.e. the "skeleton" of the data, rather than the graph structure, max pooling may also work well

## Experimental results

I won't go through these in detail. The result I found most interesting is Figure 4, where the author compares the sum, mean, max, MLP and single-layer perceptron variants on the **training set** to see whether they can match the fit of the WL subtree kernel; the results support the author's claims. As for generalization, the theory gives no guarantee, but the results are still quite good.

Original post: https://github.com/JayYip/deep-learning-nlp-notes/blob/master/ICLR2019%20Oral-%20%E5%9B%BE%E7%BD%91%E7%BB%9C%E6%9C%89%E5%A4%9A%E7%8C%9B%3F.md
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Deep Learning and NLP Notes

A place to keep my study notes on deep learning and NLP.

Zhihu column: https://zhuanlan.zhihu.com/c_1116453746221309952

To render the formulas correctly you need this plugin: https://chrome.google.com/webstore/detail/mathjax-plugin-for-github/ioemnmodlmafdkllaclgeombjnmnbima
--------------------------------------------------------------------------------
/colab-talk-shenzhen-0615/Colab-Practice.ipynb:
--------------------------------------------------------------------------------
{"cells":[{"metadata":{},"cell_type":"markdown","source":"# Colab Practice\nMost of the code in this notebook comes from: https://www.tensorflow.org/tutorials/keras/basic_text_classification"},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"! 
pip install tensorflow-gpu==2.0.0-alpha0","execution_count":1,"outputs":[{"output_type":"stream","text":"Collecting tensorflow-gpu==2.0.0-alpha0\n\u001b[?25l Downloading https://files.pythonhosted.org/packages/1a/66/32cffad095253219d53f6b6c2a436637bbe45ac4e7be0244557210dc3918/tensorflow_gpu-2.0.0a0-cp36-cp36m-manylinux1_x86_64.whl (332.1MB)\n\u001b[K 100% |████████████████████████████████| 332.1MB 79kB/s eta 0:00:01 1% |▋ | 6.1MB 68.9MB/s eta 0:00:05 25% |████████ | 83.2MB 54.1MB/s eta 0:00:05████ | 145.7MB 43.0MB/s eta 0:00:05 44% |██████████████▎ | 147.8MB 10.9MB/s eta 0:00:17 44% |██████████████▍ | 148.9MB 11.2MB/s eta 0:00:17 52% |████████████████▊ | 173.3MB 45.7MB/s eta 0:00:04 97% |███████████████████████████████ | 322.3MB 48.9MB/s eta 0:00:01\n\u001b[?25hRequirement already satisfied: keras-applications>=1.0.6 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (1.0.7)\nRequirement already satisfied: protobuf>=3.6.1 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (3.7.1)\nRequirement already satisfied: numpy<2.0,>=1.14.5 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (1.16.3)\nRequirement already satisfied: absl-py>=0.7.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (0.7.1)\nCollecting google-pasta>=0.1.2 (from tensorflow-gpu==2.0.0-alpha0)\n\u001b[?25l Downloading https://files.pythonhosted.org/packages/d0/33/376510eb8d6246f3c30545f416b2263eee461e40940c2a4413c711bdf62d/google_pasta-0.1.7-py3-none-any.whl (52kB)\n\u001b[K 100% |████████████████████████████████| 61kB 22.9MB/s ta 0:00:01\n\u001b[?25hRequirement already satisfied: astor>=0.6.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (0.7.1)\nCollecting tb-nightly<1.14.0a20190302,>=1.14.0a20190301 (from tensorflow-gpu==2.0.0-alpha0)\n\u001b[?25l Downloading 
https://files.pythonhosted.org/packages/a9/51/aa1d756644bf4624c03844115e4ac4058eff77acd786b26315f051a4b195/tb_nightly-1.14.0a20190301-py3-none-any.whl (3.0MB)\n\u001b[K 100% |████████████████████████████████| 3.0MB 1.9MB/s ta 0:00:01176kB 11.2MB/s eta 0:00:01 82% |██████████████████████████▌ | 2.5MB 10.0MB/s eta 0:00:01\n\u001b[?25hRequirement already satisfied: wheel>=0.26 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (0.31.1)\nRequirement already satisfied: gast>=0.2.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (0.2.2)\nRequirement already satisfied: termcolor>=1.1.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (1.1.0)\nCollecting tf-estimator-nightly<1.14.0.dev2019030116,>=1.14.0.dev2019030115 (from tensorflow-gpu==2.0.0-alpha0)\n\u001b[?25l Downloading https://files.pythonhosted.org/packages/13/82/f16063b4eed210dc2ab057930ac1da4fbe1e91b7b051a6c8370b401e6ae7/tf_estimator_nightly-1.14.0.dev2019030115-py2.py3-none-any.whl (411kB)\n\u001b[K 100% |████████████████████████████████| 419kB 30.0MB/s ta 0:00:01\n\u001b[?25hRequirement already satisfied: grpcio>=1.8.6 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (1.20.0)\nRequirement already satisfied: six>=1.10.0 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (1.12.0)\nRequirement already satisfied: keras-preprocessing>=1.0.5 in /opt/conda/lib/python3.6/site-packages (from tensorflow-gpu==2.0.0-alpha0) (1.0.9)\nRequirement already satisfied: h5py in /opt/conda/lib/python3.6/site-packages (from keras-applications>=1.0.6->tensorflow-gpu==2.0.0-alpha0) (2.9.0)\nRequirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from protobuf>=3.6.1->tensorflow-gpu==2.0.0-alpha0) (39.1.0)\nRequirement already satisfied: markdown>=2.6.8 in /opt/conda/lib/python3.6/site-packages (from 
tb-nightly<1.14.0a20190302,>=1.14.0a20190301->tensorflow-gpu==2.0.0-alpha0) (3.1)\nRequirement already satisfied: werkzeug>=0.11.15 in /opt/conda/lib/python3.6/site-packages (from tb-nightly<1.14.0a20190302,>=1.14.0a20190301->tensorflow-gpu==2.0.0-alpha0) (0.14.1)\nInstalling collected packages: google-pasta, tb-nightly, tf-estimator-nightly, tensorflow-gpu\nSuccessfully installed google-pasta-0.1.7 tb-nightly-1.14.0a20190301 tensorflow-gpu-2.0.0a0 tf-estimator-nightly-1.14.0.dev2019030115\n\u001b[33mYou are using pip version 19.0.3, however version 19.1.1 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"# import相关库\n"},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true},"cell_type":"code","source":"import tensorflow as tf\nfrom tensorflow import keras\nfrom tensorflow.keras.utils import get_file\nimport numpy as np","execution_count":2,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# 下载IMDB电影评论数据\n\n注意,在Kaggle Kernel里面numpy版本为1.16.3, `allow_pickle`的默认值被修改了, 会导致错误, 因此需要rework\n\nRef: https://github.com/keras-team/keras/pull/12714"},{"metadata":{"trusted":true},"cell_type":"code","source":"def load_data(path='imdb.npz', num_words=None, skip_top=0,\n maxlen=None, seed=113,\n start_char=1, oov_char=2, index_from=3, **kwargs):\n \"\"\"Loads the IMDB dataset.\n # Arguments\n path: where to cache the data (relative to `~/.keras/dataset`).\n num_words: max number of words to include. 
Words are ranked\n by how often they occur (in the training set) and only\n the most frequent words are kept\n skip_top: skip the top N most frequently occurring words\n (which may not be informative).\n maxlen: sequences longer than this will be filtered out.\n seed: random seed for sample shuffling.\n start_char: The start of a sequence will be marked with this character.\n Set to 1 because 0 is usually the padding character.\n oov_char: words that were cut out because of the `num_words`\n or `skip_top` limit will be replaced with this character.\n index_from: index actual words with this index and higher.\n # Returns\n Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.\n # Raises\n ValueError: in case `maxlen` is so low\n that no input sequence could be kept.\n Note that the 'out of vocabulary' character is only used for\n words that were present in the training set but are not included\n because they're not making the `num_words` cut here.\n Words that were not seen in the training set but are in the test set\n have simply been skipped.\n \"\"\"\n # Legacy support\n if 'nb_words' in kwargs:\n warnings.warn('The `nb_words` argument in `load_data` '\n 'has been renamed `num_words`.')\n num_words = kwargs.pop('nb_words')\n if kwargs:\n raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))\n\n path = get_file(path,\n origin='https://s3.amazonaws.com/text-datasets/imdb.npz',\n file_hash='599dadb1135973df5b59232a0e9a887c')\n with np.load(path, allow_pickle=True) as f:\n x_train, labels_train = f['x_train'], f['y_train']\n x_test, labels_test = f['x_test'], f['y_test']\n\n rng = np.random.RandomState(seed)\n indices = np.arange(len(x_train))\n rng.shuffle(indices)\n x_train = x_train[indices]\n labels_train = labels_train[indices]\n\n indices = np.arange(len(x_test))\n rng.shuffle(indices)\n x_test = x_test[indices]\n labels_test = labels_test[indices]\n\n xs = np.concatenate([x_train, x_test])\n labels = np.concatenate([labels_train, 
labels_test])\n\n if start_char is not None:\n xs = [[start_char] + [w + index_from for w in x] for x in xs]\n elif index_from:\n xs = [[w + index_from for w in x] for x in xs]\n\n if maxlen:\n xs, labels = _remove_long_seq(maxlen, xs, labels)\n if not xs:\n raise ValueError('After filtering for sequences shorter than maxlen=' +\n str(maxlen) + ', no sequence was kept. '\n 'Increase maxlen.')\n if not num_words:\n num_words = max([max(x) for x in xs])\n\n # by convention, use 2 as OOV word\n # reserve 'index_from' (=3 by default) characters:\n # 0 (padding), 1 (start), 2 (OOV)\n if oov_char is not None:\n xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]\n for x in xs]\n else:\n xs = [[w for w in x if skip_top <= w < num_words]\n for x in xs]\n\n idx = len(x_train)\n x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])\n x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])\n\n return (x_train, y_train), (x_test, y_test)","execution_count":3,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# get data\n\n(train_data, train_labels), (test_data, test_labels) = load_data(num_words=10000)\n\nprint(train_data[0], train_labels[0])\nprint('Number of training instances: {0}, number of testing instances: {1}'.format(train_data.shape[0], test_data.shape[0]))","execution_count":4,"outputs":[{"output_type":"stream","text":"Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz\n17465344/17464789 [==============================] - 0s 0us/step\n[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 
619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32] 1\nNumber of training instances: 25000, number of testing instances: 25000\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"我们看到的是词索引, 想要看到原本的词需要用词表找回原来的词语"},{"metadata":{"trusted":true},"cell_type":"code","source":"# get vocab\nword_to_id = keras.datasets.imdb.get_word_index()\nindex_from=3\nword_to_id = {k:(v+index_from) for k,v in word_to_id.items()}\nword_to_id[\"\"] = 0\nword_to_id[\"\"] = 1\nword_to_id[\"\"] = 2\nid_to_word = {value:key for key,value in word_to_id.items()}\n\nprint(' '.join([id_to_word[i] for i in train_data[0]]))","execution_count":5,"outputs":[{"output_type":"stream","text":"Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json\n1646592/1641221 [==============================] - 0s 0us/step\n this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also to the two little boy's that played the of norman and paul 
they were just brilliant children are often left out of the list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"接下来我们要对输入进行补全(padding)\n\n补全会导致一部分无用计算, 但是更加方便处理(思考题: 怎样减少无用计算?)"},{"metadata":{"trusted":true},"cell_type":"code","source":"train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n value=word_to_id[\"\"],\n padding='post',\n maxlen=256)\n\ntest_data = keras.preprocessing.sequence.pad_sequences(test_data,\n value=word_to_id[\"\"],\n padding='post',\n maxlen=256)","execution_count":6,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train_data.shape","execution_count":7,"outputs":[{"output_type":"execute_result","execution_count":7,"data":{"text/plain":"(25000, 256)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Problem: Use tf.data to implement input pipeline\n\n# placeholder for implementing dataset\ndef create_dataset_from_tensor_slices(X, y):\n return tf.data.Dataset.from_tensor_slices((np.array(X), np.array(y)))\n\ndef create_dataset_from_generator(X, y):\n def create_gen():\n for single_x, single_y in zip(X, y):\n yield (single_x, single_y)\n output_types = (tf.int32, tf.int32)\n output_shapes = ([256], [])\n return tf.data.Dataset.from_generator(create_gen, output_types=output_types, output_shapes=output_shapes)\n\ndef create_dataset_tfrecord(X, y, mode='train'):\n file_name = '{0}.tfrecord'.format(mode)\n \n # serialize features\n # WARNING: DO NOT WRITE MULTITPLE TIMES IN PRACTICE!!! 
IT'S SLOW!!!\n def _int64_list_feature(value):\n \"\"\"Returns an int64_list from a bool / enum / int / uint.\"\"\"\n return tf.train.Feature(int64_list=tf.train.Int64List(value=value))\n def _int64_feature(value):\n \"\"\"Returns an int64_list from a bool / enum / int / uint.\"\"\"\n return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))\n def serialize_fn(single_x, single_y):\n feature_tuples = {'feature': _int64_list_feature(single_x), 'label': _int64_feature(single_y)}\n example_proto = tf.train.Example(\n features=tf.train.Features(feature=feature_tuples))\n return example_proto.SerializeToString()\n # write to file\n with tf.io.TFRecordWriter(file_name) as writer:\n for single_x, single_y in zip(X, y):\n example = serialize_fn(single_x, single_y)\n writer.write(example)\n \n # read file\n dataset = tf.data.TFRecordDataset(file_name)\n def parse_fn(example_proto):\n feature_description = {'feature': tf.io.FixedLenFeature([256], tf.int64), 'label': tf.io.FixedLenFeature([], tf.int64)}\n feature_tuple = tf.io.parse_single_example(\n example_proto, feature_description)\n return feature_tuple['feature'], feature_tuple['label']\n dataset = dataset.map(parse_fn)\n return dataset\n\n# train_dataset = create_dataset_from_generator(train_data, train_labels)\n# test_dataset = create_dataset_from_generator(test_data, test_labels)\n\ntrain_dataset = create_dataset_tfrecord(train_data, train_labels)\ntest_dataset = create_dataset_tfrecord(test_data, test_labels, mode='test')\n\ntrain_dataset = train_dataset.shuffle(10000).batch(256).prefetch(100).repeat()\ntest_dataset = test_dataset.batch(256).prefetch(100)\n ","execution_count":8,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# 构建模型\n\n下面就是激动人心的时候了: 写一个文本分类模型!\n\n你需要改写一下下面的模型,让其准确率更高\n\n你可以尝试使用Dropout, CudnnGRU等更加fancy的方法"},{"metadata":{"trusted":true},"cell_type":"code","source":"# Problem: Implement a custom keras layer which has the identical effects of dense, but print the mean\n# of the 
variables if the mean value is greater than zero. Print for maximum 10 times.\n\n# placeholder for implementing using Functional API or Model Subclassing\nclass WeirdDense(tf.keras.layers.Layer):\n\n def __init__(self, output_dim, activation):\n super(WeirdDense, self).__init__()\n self.output_dim = output_dim\n self.activation = activation\n self.print_times = tf.Variable(0, dtype=tf.int32, trainable=False)\n \n\n def build(self, input_shape):\n # Create a trainable weight variable for this layer.\n self.w = self.add_weight(shape=(input_shape[-1], self.output_dim),\n initializer='random_normal',\n trainable=True)\n self.b = self.add_weight(shape=(self.output_dim,),\n initializer='random_normal',\n trainable=True)\n @tf.function\n def call(self, x):\n mean_val = tf.reduce_mean(self.w)\n if tf.greater(mean_val, 0):\n if tf.less_equal(self.print_times, 10):\n tf.print(mean_val)\n self.print_times.assign_add(1)\n\n return_tensor = self.activation(tf.matmul(x, self.w) + self.b)\n return return_tensor\n \n\n def compute_output_shape(self, input_shape):\n return (input_shape[0], self.output_dim)","execution_count":9,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# input shape is the vocabulary count used for the movie reviews (10,000 words)\nvocab_size = 10000\n\nmodel = keras.Sequential()\nmodel.add(keras.layers.Embedding(vocab_size, 16))\nmodel.add(keras.layers.GlobalAveragePooling1D())\nmodel.add(WeirdDense(16, activation=tf.nn.relu))\nmodel.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))\n\nmodel.summary()","execution_count":12,"outputs":[{"output_type":"stream","text":"Model: \"sequential_1\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding_1 (Embedding) (None, None, 16) 160000 \n_________________________________________________________________\nglobal_average_pooling1d_1 ( (None, 16) 0 
\n_________________________________________________________________\nweird_dense_1 (WeirdDense) (None, 16) 273 \n_________________________________________________________________\ndense_1 (Dense) (None, 1) 17 \n=================================================================\nTotal params: 160,290\nTrainable params: 160,289\nNon-trainable params: 1\n_________________________________________________________________\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"如果要使用Keras提供的训练、预测API, 你需要先compile模型, 然后调用该API"},{"metadata":{"trusted":true},"cell_type":"code","source":"model.compile(optimizer='adam',\n loss='binary_crossentropy',\n metrics=['acc'])\nhistory = model.fit(train_dataset,\n epochs=1,\n steps_per_epoch=100,\n validation_data=test_dataset,\n validation_steps=100,\n verbose=1)","execution_count":13,"outputs":[{"output_type":"stream","text":" 91/100 [==========================>...] - ETA: 0s - loss: 0.6912 - acc: 0.56832.54052e-05\n5.81007916e-05\n 97/100 [============================>.] - ETA: 0s - loss: 0.6908 - acc: 0.57829.01562744e-05\n0.000122097204\n 99/100 [============================>.] - ETA: 0s - loss: 0.6906 - acc: 0.58070.000152926776\n0.000183319906\n0.000183319906\n0.000183319906\n0.000183319906\n0.000183319906\n0.000183319906\n","name":"stdout"},{"output_type":"stream","text":"W0617 08:00:08.996987 139913903105408 training_generator.py:228] Your dataset ran out of data; interrupting training. Make sure that your dataset can generate at least `validation_steps * epochs` batches (in this case, 100 batches). 
You may need to use the repeat() function when building your dataset.\n","name":"stderr"},{"output_type":"stream","text":"100/100 [==============================] - 4s 43ms/step - loss: 0.6905 - acc: 0.5824 - val_loss: 0.6685 - val_acc: 0.7057\n","name":"stdout"}]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1}
--------------------------------------------------------------------------------
/colab-talk-shenzhen-0615/TFUG-Colab-Talk.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JayYip/deep-learning-nlp-notes/8600715b708eb638ddcadd6b6b751605e98882a7/colab-talk-shenzhen-0615/TFUG-Colab-Talk.pdf
--------------------------------------------------------------------------------
/img/多目标优化Algorithm2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JayYip/deep-learning-nlp-notes/8600715b708eb638ddcadd6b6b751605e98882a7/img/多目标优化Algorithm2.png
--------------------------------------------------------------------------------
/tf.function和Autograph使用指南-Part 1.md:
--------------------------------------------------------------------------------
# A Guide to tf.function and AutoGraph, Part 1

AutoGraph is a very promising tool provided by TF. It can translate code written in a subset of Python syntax into efficient graph code. Since TF defaults to eager execution from TF 2.0 onwards, AutoGraph lets us, **in the ideal case**, write code in eager style (convenient, flexible) and run it as a static graph (efficient, stable).

But! In practice, surprises are pretty much guaranteed. This post points out some of the odd behaviours of AutoGraph and `tf.function` so that you can use them with less pain.

This post assumes some experience with Python and TensorFlow.
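## Session execution

Readers with TF 1.x experience will be familiar with the computation graph (`tf.Graph`) and the session (`tf.Session`) that we ~~love and~~ hate. A typical workflow looks like this:

1. Create a graph and set it as the default graph for the current scope
2. Describe the computation with the TF API (e.g. `y = tf.matmul(a, x) + b`)
3. Decide on parameter sharing in advance and define the variable scopes accordingly
4. Create and configure a `tf.Session`
5. Pass the graph to the `tf.Session`
6. Initialize the variables
7. Use `tf.Session.run` to execute nodes of the graph; an executed node traces back through all the nodes it depends on and **runs the computation**.

Here is a code example of that workflow:

```python
import tensorflow as tf  # TF 1.x

g = tf.Graph()  # create the graph
with g.as_default():  # set it as the default graph
    a = tf.constant([[10,10],[11.,1.]])
    x = tf.constant([[1.,0.],[0.,1.]])
    b = tf.Variable(12.)
    y = tf.matmul(a, x) + b  # describe the computation
    init_op = tf.global_variables_initializer()  # node to execute

    # The session is created while g is still the default graph,
    # so it runs the nodes defined above.
    with tf.Session() as sess:  # configure the session
        sess.run(init_op)  # run the node
        print(sess.run(y))  # print the result
```

In TF 2.0 eager execution is the default and computations run immediately, which means we no longer need to:

- define a graph
- run it in a session
- initialize variables
- define variable sharing through scopes
- use `tf.control_dependencies` to declare dependencies between nodes that are not directly connected

We can write it like ordinary Python (or PyTorch) code and it runs as written:

```python
import tensorflow as tf  # TF 2.x

a = tf.constant([[10,10],[11.,1.]])
x = tf.constant([[1.,0.],[0.,1.]])
b = tf.Variable(12.)
y = tf.matmul(a, x) + b
print(y.numpy())
```

In general, eager code is slower than graph code doing the same operations, because many graph optimizations can only be applied to a dataflow graph.

To build a traditional computation graph in TF 2.0, we need `tf.function`.

## Functions, not sessions

One of the major changes in TF 2.0 is the [removal of `tf.Session`](https://github.com/tensorflow/community/blob/master/rfcs/20180918-functions-not-sessions-20.md) (cue applause). This change pushes users towards a better way of organizing code: no more fiddly `tf.Session` to run things, just plain Python functions plus a simple decorator.

In TF 2.0, to build a computation graph we only need to add the `@tf.function` decorator to a Python function.

> As mentioned above, graphs run more efficiently, but a speed-up is not guaranteed. In general, the more complex the graph, the larger the speed-up. For complex graphs, such as training a deep learning model, the gain can be huge. (Translator's note: I think this depends on the actual workload. If part of the computation has a complex graph and that complexity brings extra [memory consumption](https://mxnet.incubator.apache.org/versions/master/architecture/note_memory.html) or computation, the speed-up will be noticeable. But in many cases, such as a typical CNN, most of the cost lies not in the complexity of the graph but in operations like convolution and matrix multiplication, and the speed-up will be small. This view still needs to be verified.)

The tool that automatically converts Python code into graph code is called AutoGraph.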

In TF 2.0, when a function is decorated with `@tf.function`, AutoGraph is invoked automatically to convert the Python function into an executable graph representation.

## tf.function: what actually happens?

The first time a function decorated with `@tf.function` is called, the following happens:

- The function is executed and traced. As in TensorFlow 1.x, eager execution is disabled inside the function, so every `tf.` API call only defines a node that produces a `tf.Tensor` output
- AutoGraph detects the Python constructs that can be converted to an equivalent graph representation (`while` → `tf.while_loop`, `for` → `tf.while_loop`, `if` → `tf.cond`, `assert` → `tf.Assert`, ...)
- To preserve execution order, `tf.control_dependencies` is added automatically after every statement, so that line `i` is guaranteed to have run when line `i+1` executes. At this point the graph is fully defined
- A unique ID is created from the function name and the input arguments and associated with the graph. The graph is cached in a map: `map[id] = graph`
- If the ID matches, later calls to the function reuse that graph directly

The next sections show how to convert the TF 1.x snippet first to an eager version and then to a graph version.

## Converting to eager execution

To use `tf.function`, the first step is to move the TF 1.x graph-definition code into a Python function.

```python
def f():
    a = tf.constant([[10,10],[11.,1.]])
    x = tf.constant([[1.,0.],[0.,1.]])
    b = tf.Variable(12.)
    y = tf.matmul(a, x) + b
    return y
```

Because TF 2.0 is eager by default, we can call the function directly (no `tf.Session` needed):

```python
print(f().numpy())
```

And we get the output:

```
[[22. 22.]
 [23. 13.]]
```

## From eager to tf.function

We can decorate `f` with `@tf.function` directly. On top of the original `f` we add the greatest debugging technique in the universe, `print`, to get a better look at what is going on.

```python
@tf.function
def f():
    a = tf.constant([[10,10],[11.,1.]])
    x = tf.constant([[1.,0.],[0.,1.]])
    b = tf.Variable(12.)
    y = tf.matmul(a, x) + b
    print("PRINT: ", y)
    tf.print("TF-PRINT: ", y)
    return y

f()
```

So what happened?

- `@tf.function` wraps `f` in a [`tensorflow.python.eager.def_function.Function`](https://github.com/tensorflow/tensorflow/blob/0596296aed520fdc81829297d01cad9f8f48da14/tensorflow/python/eager/function.py#L1268) object, and `f` is stored in that object's `.python_function` attribute.
- When `f()` is called, the graph is built but the computation is not executed, so we get the following output; the `tf.` operations are not run:
  ```
  PRINT: Tensor("add:0", shape=(2, 2), dtype=float32)
  ```
- In the end, the call fails:
  ```
  ValueError: tf.function-decorated function tried to create variables on non-first call.
  ```

The [RFC: Functions, not Sessions](https://github.com/tensorflow/community/blob/master/rfcs/20180918-functions-not-sessions-20.md#functions-that-create-state) is very explicit about this:

> State (like `tf.Variable` objects) are only created the first time the function f is called.

However, as [Alexandre Passos](https://github.com/tensorflow/tensorflow/issues/26812#issuecomment-474595919) points out, when a function is converted to a graph there is no way to know how many times `tf.function` will invoke it. So on our first call to `f`, the Python function may be executed more than once during graph construction, which causes the error above.

The root cause is that the same statement behaves differently in eager and graph mode. In eager mode a `tf.Variable` is an ordinary Python object that is destroyed once it goes out of scope. In graph mode a `tf.Variable` is a persistent node in the graph, unaffected by Python scoping. Hence the first lesson of using `tf.function`:

> To convert a function that works in eager mode into a graph, you have to think in graph terms about whether the function still works.

So how can we avoid this error?

1. Pass the `tf.Variable` in as a function argument
2. Inherit the `tf.Variable` from the parent scope
3. Use the `tf.Variable` as a class attribute

## Handling it by changing the variable scope

This covers options 2 and 3. Unsurprisingly, we recommend option 3:

```python
class F():
    def __init__(self):
        self._b = None

    @tf.function
    def __call__(self):
        a = tf.constant([[10, 10], [11., 1.]])
        x = tf.constant([[1., 0.], [0., 1.]])
        if self._b is None:
            self._b = tf.Variable(12.)
        y = tf.matmul(a, x) + self._b
        print("PRINT: ", y)
        tf.print("TF-PRINT: ", y)
        return y

f = F()
f()
```

## Handling it by passing the state as an argument

As we will see later, we cannot just slap `tf.function` onto eager code and expect a speed-up. We need to picture how the conversion is done, what actually happens when Python code is turned into graph operations, and what **black magic** those conversions involve. The example here is simple; later posts go deeper.
```python
@tf.function
def f(b):
    a = tf.constant([[10,10],[11.,1.]])
    x = tf.constant([[1.,0.],[0.,1.]])
    y = tf.matmul(a, x) + b
    print("PRINT: ", y)
    tf.print("TF-PRINT: ", y)
    return y

b = tf.Variable(12.)
f(b)
```

This function gives the result we want. In addition, a variable passed in as an argument can be updated directly inside the function, and the updated value is visible outside the function too. The code below prints tensors whose values are 1, 2 and 3:

```python
a = tf.Variable(0)

@tf.function
def g(x):
    x.assign_add(1)
    return x

print(g(a))
print(g(a))
print(g(a))
```

## Summary

- We can use the `@tf.function` decorator to convert Python code into graph code
- We cannot create a new `tf.Variable` every time the decorated function is called
- Variables can be made available to the function through scope inheritance (object attributes) or by passing them in as arguments

Later parts look more closely at how input argument types affect performance, and at the details of how Python operations are converted.

Original post: https://github.com/JayYip/deep-learning-nlp-notes/blob/master/tf.function%E5%92%8CAutograph%E4%BD%BF%E7%94%A8%E6%8C%87%E5%8D%97-Part%201.md

Disclaimer: this post is a translation of the article [Analyzing tf.function to discover AutoGraph strengths and subtleties](https://pgaleone.eu/tensorflow/tf.function/2019/03/21/dissecting-tf-function-part-1/) by Paolo Galeone, made with the author's permission. Please contact me if you want to repost it.
--------------------------------------------------------------------------------
/tf.function和Autograph使用指南-Part 2.md:
--------------------------------------------------------------------------------
# A Guide to tf.function and AutoGraph, Part 2

In Part 1 we saw how to convert TF 1.x code to its eager equivalent and then, via `tf.function`, to graph code, and we ran into the problem caused by the difference between eager and graph mode when state (a `tf.Variable`) is created inside the function.

In this post we analyse what happens when a `tf.function`-decorated function is given `tf.Tensor`s versus Python objects as input. Is all the logic inside a decorated function converted into the graph code we expect?

## tf.function invokes AutoGraph

Let's first look at the full signature of `tf.function`:
```
def function(func=None,
             input_signature=None,
             autograph=True,
             experimental_autograph_options=None)
```
The `autograph` parameter defaults to `True`, so the functions we decorated with `@tf.function` so far did use AutoGraph. The [documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/function) describes the parameter as follows:

- When it is `True`, all Python operations that depend on tensor values are added to the TensorFlow graph
- When it is `False`, the operations of the function are added to the TensorFlow graph, but the Python logic is not converted by AutoGraph

So `tf.function` calls AutoGraph by default. Next we look at what happens when we change the type of the input arguments and the structure of the function.

## Changing the dtype of a tf.Tensor

We start with a small test function. The dtype of the input arguments matters, because the inputs are used to build the graph, the graph is statically typed, and each graph has its own unique ID (see Part 1).
```
@tf.function
def f(x):
    print("Python execution: ", x)
    tf.print("Graph execution: ", x)
    return x
```
Note that the two prints are different. We then run the following test:
```python
print("##### float32 test #####")
a = tf.constant(1, dtype=tf.float32)
print("first call")
f(a)
a = tf.constant(1.1, dtype=tf.float32)
print("second call")
f(a)

print("##### uint8 test #####")

b = tf.constant(2, dtype=tf.uint8)
print("first call")
f(b)
b = tf.constant(3, dtype=tf.uint8)
print("second call")
f(b)
```
This gives:
```
##### float32 test #####
first call
Python execution: Tensor("x:0", shape=(), dtype=float32)
Graph execution: 1
second call
Graph execution: 1.1
##### uint8 test #####
first call
Python execution: Tensor("x:0", shape=(), dtype=uint8)
Graph execution: 2
second call
Graph execution: 3
```
When a tensor with a different dtype is passed in, the graph is built again (note that `print` was not converted into `tf.print`).
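The trace-once, replay-afterwards pattern in this output can be mimicked without TensorFlow. The sketch below is a toy model and not TensorFlow's implementation: `toy_function`, `graph_print`, `trace_log` and `run_log` are hypothetical names, and the dtype is passed as a plain string. It only illustrates the idea that the Python body runs once per new dtype, while the recorded operations run on every call.

```python
# Toy model (NOT TensorFlow): "trace" a Python function once per input dtype,
# then replay only the operations that were recorded during tracing.
from typing import Any, Callable, Dict, List

trace_log: List[str] = []  # side effects of plain Python code (like print)
run_log: List[str] = []    # side effects of recorded ops (like tf.print)


def graph_print(recorded_ops: List[Callable[[Any], None]], label: str) -> None:
    """Hypothetical stand-in for tf.print: record an op instead of running it."""
    recorded_ops.append(lambda value: run_log.append(f"{label}{value}"))


def toy_function(py_func):
    cache: Dict[str, List[Callable[[Any], None]]] = {}

    def wrapper(value: Any, dtype: str) -> Any:
        if dtype not in cache:
            # First call with this dtype: run the Python body once to record ops.
            ops: List[Callable[[Any], None]] = []
            py_func(f"placeholder[{dtype}]", ops)
            cache[dtype] = ops
        for op in cache[dtype]:
            # Every call: replay the recorded ops with the real value.
            op(value)
        return value

    wrapper.cache = cache
    return wrapper


@toy_function
def f(x, ops):
    trace_log.append(f"Python execution: {x}")  # runs at trace time only
    graph_print(ops, "Graph execution: ")       # recorded, runs on every call


f(1.0, "float32")
f(1.1, "float32")  # same dtype: the Python body is skipped
f(2, "uint8")      # new dtype: traced again
f(3, "uint8")
```

After these four calls `trace_log` has two entries (one per dtype) and `run_log` has four, which has the same shape as the output shown above.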
We can use `tf.autograph.to_code(f.python_function)` to look at the generated graph code:
```
def tf__f(x):
  try:
    with ag__.function_scope('f'):
      do_return = False
      retval_ = None
      with ag__.utils.control_dependency_on_returns(ag__.converted_call(print, None, ag__.ConversionOptions(recursive=True, force_conversion=False, optional_features=ag__.Feature.ALL, internal_convert_user_code=True), ('Python execution: ', x), {})):
        tf_1, x_1 = ag__.utils.alias_tensors(tf, x)
        with ag__.utils.control_dependency_on_returns(ag__.converted_call('print', tf_1, ag__.ConversionOptions(recursive=True, force_conversion=False, optional_features=ag__.Feature.ALL, internal_convert_user_code=True), ('Graph execution: ', x_1), {})):
          x_2 = ag__.utils.alias_tensors(x_1)
          do_return = True
          retval_ = x_1
          return retval_
  except:
    ag__.rewrite_graph_construction_error(ag_source_map__)
```

Some of this Python code is executed only while the graph is being built:

```
with ag__.utils.control_dependency_on_returns(
        ag__.converted_call(
            print, None, ag__.ConversionOptions(
                recursive=True,
                force_conversion=False,
                optional_features=ag__.Feature.ALL,
                internal_convert_user_code=True),
            ('Python execution: ', x), {})
        ):
```
Here `ag__.utils.control_dependency_on_returns` creates a `tf.control_dependencies` dependency on the value returned by `ag__.converted_call`, which ensures that the graph executes in the same order as the Python code.

[converted_call](https://github.com/tensorflow/tensorflow/blob/56c8527fa73f694b76963dbb28a9d011d233086f/tensorflow/python/autograph/impl/api.py#L206) wraps the execution of a function. Its inputs describe the conversion that may be needed and carry every argument required to run the function:

- `f`: the function itself, here `print`.
- `owner`: the module that owns the function. `print` is a Python built-in, so this is `None`. For `tf.print` the owner is `tf_1`, an alias of `tf`
- `options`: the conversion options, that is `ag__.ConversionOptions`
- `args`, `kwargs`: the arguments of the function `f` (`print`)

**This raises a question:**

Why make sure that this Python code is executed only once, while the function is being traced?

**A guess:**

My guess is that there is no direct way to know whether a piece of Python code has side effects that change the graph, so it is simply traced and added to the graph the first time the function is executed.

If a side effect is detected during that first execution, the graph is updated. Otherwise, as in this example, the Python function `print` is replaced by a `tf.no_op`.

Since this is only a guess, I asked about it [here](https://groups.google.com/a/tensorflow.org/forum/#!topic/developers/SD_ijT4MuPw). Keep an eye on that thread if you are interested in the answer.

## Using Python native types

`tf.Tensor` is not the only possible input: AutoGraph builds a new graph according to the input type. Next we test how Python native types behave as input compared with `tf.Tensor`.

Python has three numeric types (integer, float and complex), so we test each of them:
```
def printinfo(x):
    print("Type: ", type(x), " value: ", x)

print("##### int test #####")
print("first call")
a = 1
printinfo(a)
f(a)
print("second call")
b = 2
printinfo(b)
f(b)

print("##### float test #####")
print("first call")
a = 1.0
printinfo(a)
f(a)
print("second call")
b = 2.0
printinfo(b)
f(b)

print("##### complex test #####")
print("first call")
a = complex(1.0, 2.0)
printinfo(a)
f(a)
print("second call")
b = complex(2.0, 1.0)
printinfo(b)
f(b)
```
The output is not quite what we expected:
```
##### int test #####
first call
Type: <class 'int'> value: 1
Python execution: 1
Graph execution: 1

second call
Type: <class 'int'> value: 2
Python execution: 2
Graph execution: 2

##### float test #####
first call
Type: <class 'float'> value: 1.0
Graph execution: 1
second call
Type: <class 'float'> value: 2.0
Graph execution: 2

##### complex test #####
first call
Type: <class 'complex'> value: (1+2j)
Python execution: (1+2j)
Graph execution: (1+2j)
second call
Type: <class 'complex'> value: (2+1j)
Python execution: (2+1j)
Graph execution: (2+1j)
```
This means that every distinct numeric value gets its own graph!
That is:

- `f(1)` builds a graph, and `f(1.0)` reuses it
- `f(2)` builds a graph, and `f(2.0)` reuses it
- `f(1+2j)` and `f(2+1j)` each build their own graph

This is odd.

We can check whether `f(1.0)` really uses the graph built for the integer input by looking at the dtype of its return value:
```
ret = f(1.0)
if tf.float32 == ret.dtype:
    print("f(1.0) returns float")
else:
    print("f(1.0) return ", ret)
```
Result:
```
Graph execution: 1
f(1.0) return tf.Tensor(1, shape=(), dtype=int32)
```
So when the input is a Python native type, the graph ID is tied to the **value** of the argument (`1.0 == 1`), not to its type.

**Warning**: every distinct Python value produces a graph, so for every possible value the Python code is executed and a graph is built once, which slows things down considerably.

(Official documentation: Therefore, python numerical inputs should be restricted to arguments that will have few distinct values, such as hyperparameters like the number of layers in a neural network. This allows TensorFlow to optimize each variant of the neural network.)

## Measuring performance

We verify this with the following code:

```
import time

import tensorflow as tf


@tf.function
def g(x):
    return x

# tf.Tensor inputs: one trace, one graph.
start = time.time()
for i in tf.range(1000):
    g(i)
end = time.time()

print("tf.Tensor time elapsed: ", (end-start))

# Python int inputs: one trace and one graph per distinct value.
start = time.time()
for i in range(1000):
    g(i)
end = time.time()

print("Native type time elapsed: ", (end-start))
```

According to the reasoning above, the first loop executes the Python code once and builds a single graph, while the second loop executes the Python code 1000 times and builds 1000 graphs.

The result matches the expectation:

```
tf.Tensor time elapsed: 0.41594886779785156
Native type time elapsed: 5.189513444900513
```
Conclusion: use `tf.Tensor` everywhere.

(Translator's note: this is not quite accurate. A better reading is: use Python native types to control how the graph is built, and use `tf.Tensor` for all actual computation. For example, use a Python integer to control the number of hidden layers and a `tf.Tensor` to feed training data, rather than a `tf.Tensor` to control the number of hidden layers and a numpy array to feed training data.)

## Is tf.function really using AutoGraph?

This section is skipped for now, because in this [thread](https://groups.google.com/a/tensorflow.org/forum/#!topic/developers/SD_ijT4MuPw) the developers say they will make the behaviour of `tf.function` and AutoGraph consistent.
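Coming back to the native-type experiment in this part: one way to picture why `f(1.0)` reuses the graph built for `f(1)` is a cache that behaves like a Python `dict` keyed by the argument value. The sketch below is a toy model under that assumption, not TensorFlow's actual cache (whose key is more involved and may differ between versions); `graphs`, `build_count` and `get_graph` are hypothetical names.

```python
# Toy model (NOT TensorFlow): cache one "graph" per Python value in a dict.
graphs = {}
build_count = 0


def get_graph(value):
    """Return the cached 'graph' for a Python value, building it on a miss."""
    global build_count
    if value not in graphs:
        build_count += 1
        graphs[value] = f"graph traced with {value!r} ({type(value).__name__})"
    return graphs[value]


g_int = get_graph(1)
g_float = get_graph(1.0)  # 1 == 1.0 and hash(1) == hash(1.0): cache hit
get_graph(2)
get_graph(2.0)            # cache hit again
get_graph(1 + 2j)         # complex values with a nonzero imaginary part
get_graph(2 + 1j)         # never equal an int, so each one is a miss
```

In this model six calls build only four graphs, and `g_float` is the very object that was traced with the integer `1`, which mirrors the `int32` result returned by `f(1.0)` in the article.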

## Conclusion

In this article we analyzed how `tf.function` behaves with AutoGraph enabled:

- With `tf.Tensor` inputs, everything behaves as expected
- With Python native types, every distinct value builds a separate graph, and equal values reuse an existing graph
- The Python code of the function is executed only once, when the graph is built

In Part 3 we look at functions more complex than `print`, to see how Python features such as operator overloading behave.

Original link: https://github.com/JayYip/deep-learning-nlp-notes/blob/master/tf.function%E5%92%8CAutograph%E4%BD%BF%E7%94%A8%E6%8C%87%E5%8D%97-Part%202.md

Note: this article is a translation of [Paolo Galeone's blog](https://links.jianshu.com/go?to=https%3A%2F%2Fpgaleone.eu%2Ftensorflow%2Ftf.function%2F2019%2F03%2F21%2Fdissecting-tf-function-part-1%2F), published with the author's permission. Please contact me before reposting.

Disclaimer: This is a translation of the article [Analyzing tf.function to discover AutoGraph strengths and subtleties](https://pgaleone.eu/tensorflow/tf.function/2019/04/03/dissecting-tf-function-part-2/) by Paolo Galeone.

--------------------------------------------------------------------------------
/tf.function和Autograph使用指南-Part 3.md:
--------------------------------------------------------------------------------
# tf.function and AutoGraph Usage Guide - Part 3

In Part 1 we saw how to rewrite TF 1.x code as eager code and then turn the eager code into graph code, and we found that we cannot create state inside a function that is converted to a graph.

In Part 2 we looked at how Python native types and `tf.Tensor` differ as inputs of a function decorated with `tf.function`, and found that misusing them can slow execution down dramatically and may give results we do not want.

In this last article we analyze what happens when `tf.function` decorates more complex Python functions, to see whether we need to think about the conversion process while writing them.

## What AutoGraph can and cannot do

In the `python/autograph` directory of the official TF repo there is [this document](https://github.com/tensorflow/tensorflow/blob/560e2575ecad30bedff5b192f33f6d06b19ccaeb/tensorflow/python/autograph/LIMITATIONS.md). It describes what AutoGraph can do and where its limits are. Its table lists which Python constructs are converted, which will be converted in the future, and which are not supported. In this section we check whether functions are converted the way we expect, and whether we need to have the conversion process in mind while writing them.

### if ... else

Consider this simple function:

```python
@tf.function
def if_else(a, b):
    if a > b:
        tf.print("a > b", a, b)
    else:
        tf.print("a <= b", a, b)
```

First we look at the converted code:

```python
print(tf.autograph.to_code(if_else.python_function))
```

The result:

```python
def tf__if_else(a, b):
    cond = a > b

    def get_state():
        return ()

    def set_state(_):
        pass

    def if_true():
        ag__.converted_call(
            "print",
            tf,
            ag__.ConversionOptions(
                recursive=True,
                force_conversion=False,
                optional_features=(),
                internal_convert_user_code=True,
            ),
            ("a > b", a, b),
            None,
        )
        return ag__.match_staging_level(1, cond)

    def if_false():
        ag__.converted_call(
            "print",
            tf,
            ag__.ConversionOptions(
                recursive=True,
                force_conversion=False,
                optional_features=(),
                internal_convert_user_code=True,
            ),
            ("a <= b", a, b),
            None,
        )
        return ag__.match_staging_level(1, cond)

    ag__.if_stmt(cond, if_true, if_false, get_state, set_state)
```

Just as when we write `tf.cond` by hand, `ag__.if_stmt` takes `cond`, `true_fn` and `false_fn`: if `cond` is `True` it runs `true_fn`, otherwise it runs `false_fn`. We can ignore `get_state` and `set_state` for now.

Now we can run the graph code:

```python
x = tf.constant(1)
if_else(x, x)
```
The result is what we expect: `a <= b 1 1`

### if ... elif ... else

We modify the function slightly:

```python
@tf.function
def if_elif(a, b):
    if a > b:
        tf.print("a > b", a, b)
    elif a == b:
        tf.print("a == b", a, b)
    else:
        tf.print("a < b", a, b)
```

First the graph code:

```python
def tf__if_elif(a, b):
    cond_1 = a > b

    def if_true_1():
        # tf.print("a > b", a, b)
        return ag__.match_staging_level(1, cond_1)

    def if_false_1():
        cond = a == b

        def if_true():
            # tf.print(a == b, a, b)
            return ag__.match_staging_level(1, cond)

        def if_false():
            # tf.print(a < b, a,b)
            return ag__.match_staging_level(1, cond)

        ag__.if_stmt(cond, if_true, if_false, get_state, set_state)
        return ag__.match_staging_level(1, cond_1)

    ag__.if_stmt(cond_1, if_true_1, if_false_1, get_state_1, set_state_1)
```

This is simply a nested `tf.cond`, nothing special.

Let's run it:

```python
x = tf.constant(1)
if_elif(x, x)
```

We expect `a == b 1 1`, but the actual output is `a < b 1 1`! What is going on?

Let's see how the code behaves in eager mode:

```python
x = tf.constant(1)
if_elif.python_function(x, x)
```

The result is correct: `a == b 1 1`. But if you run the following:

```python
x, y = tf.constant(1), tf.constant(1)
if_elif.python_function(x, y)
```

you get `a < b 1 1`! Surprised?

**Lesson 1: not all operators are converted alike**

This strange behaviour comes from the way `__eq__` works for `tf.Tensor`. See [this StackOverflow answer](https://stackoverflow.com/questions/46785041/why-does-tensorflow-not-override-eq) and [this Github issue](https://github.com/tensorflow/tensorflow/issues/9359) for details. In short, comparing two `tf.Tensor` objects with `==` does not check their **values**; it compares the **hash** of the two `tf.Tensor` objects.
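The eager-mode result above can be reproduced in plain Python with any class that defines ordering but leaves `__eq__` at its default, which compares object identity. The sketch below is only an analogy under that assumption: `FakeTensor` and `branch` are hypothetical names, not TensorFlow code, and the behaviour it imitates is the one of the TensorFlow build used in this article (later releases changed `==` on tensors to compare element-wise).

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor type that does not override __eq__."""

    def __init__(self, value):
        self.value = value

    def __gt__(self, other):
        # Ordering is defined on the wrapped values.
        return self.value > other.value


def branch(a, b):
    if a > b:
        return "a > b"
    elif a == b:  # default __eq__: true only when a and b are the same object
        return "a == b"
    else:
        return "a < b"


x, y = FakeTensor(1), FakeTensor(1)
same = branch(x, x)       # same object on both sides
different = branch(x, y)  # equal values, but two distinct objects
```

Here `same` is `"a == b"` while `different` is `"a < b"`, which is the pattern seen with `if_elif.python_function(x, x)` and `if_elif.python_function(x, y)`.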

Translator's aside: to see this more clearly, run the following code:

```python
@tf.function
def if_elif(a, b):
    print(a.__hash__())
    print(b.__hash__())
    if a > b:
        tf.print("a > b", a, b)
    elif a == b:
        tf.print("a == b", a, b)
    else:
        tf.print("a < b", a, b)

if_elif(x, x)

if_elif.python_function(x, x)
```

We get the following (the hash values may differ on your machine):
```python
4975557040
5209512760
a < b 1 1

5196691160
5196691160
a == b 1 1
```

**Lesson 2: how AutoGraph does (not) convert operators**

In the examples above we silently assumed that AutoGraph converts not only statements such as `if`, `elif` and `else`, but also Python's built-in operators such as `__eq__`, `__gt__` and `__lt__`. The graph code above shows, however, that none of these is converted with `ag__.converted_call` the way function calls are. From the examples we can also guess that every condition actually evaluates to `False`:

```python
@tf.function
def if_elif(a, b):
    if a > b:
        tf.print("a > b", a, b)
    elif a == b:
        tf.print("a == b", a, b)
    elif a < b:
        tf.print("a < b", a, b)
    else:
        tf.print("wat")
x = tf.constant(1)
if_elif(x,x)
```

Output: `wat`

**Lesson 3: how to write the function**

To make eager code and graph code behave consistently, we need to know that:

1. The meaning of an operator matters: some operators are overloaded with a meaning different from plain Python
2. AutoGraph converts statements such as `if` and `elif` automatically, but we still have to be extra careful when writing them

In practice it is therefore best to **use explicit TF operations everywhere**. The safest way to rewrite the example above is:
```python
@tf.function
def if_elif(a, b):
    if tf.math.greater(a, b):
        tf.print("a > b", a, b)
    elif tf.math.equal(a, b):
        tf.print("a == b", a, b)
    elif tf.math.less(a, b):
        tf.print("a < b", a, b)
    else:
        tf.print("wat")
```

The resulting graph code (slightly cleaned up for readability):

```python
def tf__if_elif(a, b):
    cond_2 = ag__.converted_call("greater", ...)  # a > b

    def if_true_2():
        ag__.converted_call("print", ...)  # tf.print a > b
        return ag__.match_staging_level(1, cond_2)

    def if_false_2():
        cond_1 = ag__.converted_call("equal", ...)  # tf.math.equal

        def if_true_1():
            ag__.converted_call("print", ...)  # tf.print a == b
            return ag__.match_staging_level(1, cond_1)

        def if_false_1():
            cond = ag__.converted_call("less", ...)  # a < b

            def if_true():
                ag__.converted_call("print", ...)  # tf.print a < b
                return ag__.match_staging_level(1, cond)

            def if_false():
                ag__.converted_call("print", ...)  # tf.print wat
                return ag__.match_staging_level(1, cond)

            ag__.if_stmt(cond, if_true, if_false, get_state, set_state)
            return ag__.match_staging_level(1, cond_1)

        ag__.if_stmt(cond_1, if_true_1, if_false_1, get_state_1, set_state_1)
        return ag__.match_staging_level(1, cond_2)

    ag__.if_stmt(cond_2, if_true_2, if_false_2, get_state_2, set_state_2)
```

Now every part is converted through `ag__.converted_call`.

### for ... in range

With the three lessons above in mind, converting a `for` loop with `tf.function` goes smoothly. As an example we use a simple function that sums the integers from 1 to `upto - 1`. Two points need attention:

1. Do not create state inside the function
2. Use `tf.range`, not `range`

**Translator's note:** inside a function decorated with `tf.function`, a loop over `tf.range` is unrolled dynamically, while a loop over `range` is unrolled statically; see [here](https://www.tensorflow.org/alpha/tutorials/eager/tf_function#autograph_and_loops) for details. Dynamic unrolling means the loop can stop partway (with `break` or `return`), whereas static unrolling builds the graph for every iteration in advance. If we adapt the example from the official documentation, we find:
```python
@tf.function
def buggy_py_for_tf_break(upto):
    x = 0
    for i in range(upto):
        print(i)
        if tf.equal(i, 10):
            break
        x += i
    return x

@tf.function
def tf_for_tf_break(upto):
    x = 0
    for i in tf.range(upto):
        if tf.equal(i, 10):
            break
        x += i
    return x

print(buggy_py_for_tf_break(tf.constant(100)))  # it's ok!
print(tf_for_tf_break(tf.constant(100)))  # it's ok!
print(tf_for_tf_break(100))  # it's ok!
print(buggy_py_for_tf_break(100))  # ERROR!!
```

Why? I will leave that for you to work out.

(End of translator's note)

Keeping the two points above in mind, we can write:
```python
x = tf.Variable(1)
@tf.function
def test_for(upto):
    for i in tf.range(upto):
        x.assign_add(i)

x.assign(tf.constant(0))
test_for(tf.constant(5))
print("x value: ", x.numpy())
```

As expected, the value of `x` is 10.

**Question to think about:** what happens if we replace `x.assign_add(i)` with `x += i`?

## Conclusion

To summarize the key points of the three articles:

- Be especially careful if the function creates state. Such a function may run into problems when converted to a graph (Part 1)
- AutoGraph does **not** wrap Python native types, which can cause serious performance problems (Part 2). Use Python native types with care.
- `tf.print` and `print` are different, and a converted function may behave differently on its first call than on later calls (Part 2)
- Operator overloading on `tf.Tensor` does not work quite the way we imagine. To be safe, use TF operations such as `tf.equal` inside converted functions rather than Python operators such as `==`.

Original link: https://github.com/JayYip/deep-learning-nlp-notes/blob/master/tf.function%E5%92%8CAutograph%E4%BD%BF%E7%94%A8%E6%8C%87%E5%8D%97-Part%203.md

Note: this article is a translation of [Paolo Galeone's blog](https://links.jianshu.com/go?to=https%3A%2F%2Fpgaleone.eu%2Ftensorflow%2Ftf.function%2F2019%2F03%2F21%2Fdissecting-tf-function-part-1%2F), published with the author's permission. Please contact me before reposting.

Disclaimer: This is a translation of the article [Analyzing tf.function to discover AutoGraph strengths and subtleties](https://pgaleone.eu/tensorflow/tf.function/2019/04/03/dissecting-tf-function-part-2/) by Paolo Galeone.
--------------------------------------------------------------------------------
/用多目标优化解决多任务学习.md:
--------------------------------------------------------------------------------
# NIPS2018 - Solving Multi-Task Learning with Multi-Objective Optimization

An aside: multi-task learning is arguably one of the ultimate goals of machine learning. Just as physicists pursue the unification of all forces, I think machine learning is also pursuing a single model that solves almost every problem. We are still far from that goal, but multi-task learning is already very valuable in practice: with a model as complex as BERT, solving several problems with one model is the way to get the most out of it.
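(A small plug for [bert-multitask-learning](https://github.com/JayYip/bert-multitask-learning).)

This is a paper on multi-task learning published by Intel at NIPS 2018: [Multi-Task Learning as Multi-Objective Optimization](https://arxiv.org/abs/1810.04650). There are many ways to do multi-task learning, such as hard parameter sharing and soft parameter sharing. I personally find hard parameter sharing more practical, but I will not go into that here. The paper does multi-task learning with hard parameter sharing and improves the results through loss weighting; its main contribution is a fast way to compute these weights. The authors cover the following:

1. The general form of solving multi-task learning with multi-objective optimization
2. How to compare multi-task learning results: Pareto optimality
3. Turning the search for a Pareto optimum into solving for the task weights
4. A proof of how to simplify this computation

The idea of the paper is simple, but it contains a fair amount of theory. If the theory does not interest you, it is enough to know what the authors did: using the chain rule, they show that when the gradient is full rank we do not need to backpropagate every task all the way down through all layers. It is enough to backpropagate to the last layer of the shared model and use that gradient to solve for the task weights, which is faster and still gives a reasonably good solution. Below I **try** to summarize the derivation.

## Notation

- $t, T$: a task, and the set of tasks (also used as the number of tasks in sums)
- $\theta, \theta^{sh}$: model parameters, shared model parameters
- $\alpha$: task weights
- $\eta$: learning rate
- $Z$: output of the last layer of $\theta^{sh}$
- $\mathcal{L}$: loss function
- $\mathcal{X}, \mathcal{Y}^t$: the input space and the label space of task $t$

## General form of multi-task learning as multi-objective optimization

Multi-task learning can generally be written as minimizing the following expression:

$$
\min_{\theta^{sh}, \theta^1, \dots, \theta^T} \sum^T_{t=1} \alpha^t \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)
$$

## Pareto optimality

Different choices of $\alpha^t$ lead to different learned parameters $\theta$, so how do we decide which parameters are better? If parameters $\theta^1$ perform at least as well as $\theta^2$ on every task (and strictly better on at least one), we say that $\theta^1$ is better than (dominates) $\theta^2$. Under this criterion, parameters $\theta^{*}$ that are not dominated by any other parameters are called **Pareto optimal**.

## Finding a Pareto optimum by solving for the task weights

This is the first transformation of the problem. Its theoretical basis is [this paper](https://www.sciencedirect.com/science/article/pii/S1631073X12000738) (which I have not read). It proves that the solution of the problem below is either a Pareto stationary point (a necessary condition for Pareto optimality) or gives a good update direction that improves all tasks.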

$$
\min _{\alpha^{1}, \ldots, \alpha^{T}}\left\{\left\|\sum_{t=1}^{T} \alpha^{t} \nabla_{\boldsymbol{\theta}^{s h}} \hat{\mathcal{L}}^{t}\left(\boldsymbol{\theta}^{s h}, \boldsymbol{\theta}^{t}\right)\right\|_{2}^{2} \;\Bigg|\; \sum_{t=1}^{T} \alpha^{t}=1, \alpha^{t} \geq 0 \quad \forall t\right\}
$$

The gradient with respect to $\theta^t$ does not depend on $\alpha^t$; $\alpha^t$ only acts on the gradient with respect to $\theta^{sh}$. The backward pass is therefore:

1. Do ordinary gradient descent on every $\theta^t$
2. Solve the problem above, then do gradient descent on $\theta^{sh}$ with $\sum_{t=1}^{T} \alpha^{t} \nabla_{\boldsymbol{\theta}^{s h}} \hat{\mathcal{L}}^{t}\left(\boldsymbol{\theta}^{s h}, \boldsymbol{\theta}^{t}\right)$.

Written out, this computation is Algorithm 2 of the paper:

![image](img/多目标优化Algorithm2.png)

## Simplifying the computation

Line 8 of Algorithm 2 shows that the gradient with respect to the shared parameters has to be computed once per task. With many tasks this is expensive, so the authors propose a second transformation, which is the main contribution of the paper.

The chain rule for composite functions gives the upper bound:

$$
\left\|\sum_{t=1}^{T} \alpha^{t} \nabla_{\boldsymbol{\theta}^{s h}} \hat{\mathcal{L}}^{t}\left(\boldsymbol{\theta}^{s h}, \boldsymbol{\theta}^{t}\right)\right\|_{2}^{2} \leq\left\|\frac{\partial \mathbf{Z}}{\partial \boldsymbol{\theta}^{s h}}\right\|_{2}^{2}\left\|\sum_{t=1}^{T} \alpha^{t} \nabla_{\mathbf{Z}} \hat{\mathcal{L}}^{t}\left(\boldsymbol{\theta}^{s h}, \boldsymbol{\theta}^{t}\right)\right\|_{2}^{2}
$$

where $\mathbf{Z}$ is the representation of the input at the last shared layer. Dropping the factor that does not depend on $\alpha$ gives:

$$
\min _{\alpha^{1}, \ldots, \alpha^{T}}\left\{\left\|\sum_{t=1}^{T} \alpha^{t} \nabla_{\mathbf{Z}} \hat{\mathcal{L}}^{t}\left(\boldsymbol{\theta}^{s h}, \boldsymbol{\theta}^{t}\right)\right\|_{2}^{2} \;\Bigg|\; \sum_{t=1}^{T} \alpha^{t}=1, \alpha^{t} \geq 0 \quad \forall t\right\}
$$

The authors then prove that, when $\frac{\partial \mathbf{Z}}{\partial \boldsymbol{\theta}^{s h}}$ is full rank, the solution of MGDA-UB (Multiple Gradient Descent Algorithm – Upper Bound, the problem just above) is either a Pareto stationary point (a necessary condition for Pareto optimality) or gives a good update direction that improves all tasks. The only change to the algorithm is the gradient computation on line 8 of Algorithm 2: instead of the gradient with respect to all shared parameters, it uses the gradient with respect to the last shared representation.

## Experimental results

On the MultiMNIST and CelebA datasets the authors obtain results that beat single-task training. Training time is also reduced substantially (the more tasks, the larger the gap in training time).
The authors' results also show that simple averaging of the losses actually performs worse than single-task training.

Note that I have skipped many steps in this post and kept only what I consider the key part, mainly to give a short account of how the authors simplify the computation.

Original link: https://github.com/JayYip/deep-learning-nlp-notes/blob/master/%E7%94%A8%E5%A4%9A%E7%9B%AE%E6%A0%87%E4%BC%98%E5%8C%96%E8%A7%A3%E5%86%B3%E5%A4%9A%E4%BB%BB%E5%8A%A1%E5%AD%A6%E4%B9%A0.md
--------------------------------------------------------------------------------