├── .github ├── ISSUE_TEMPLATE.md └── PULL_REQUEST_TEMPLATE.md ├── .gitignore ├── Dockerfile ├── OWNERS ├── README.md ├── bazel ├── BUILD ├── WORKSPACE ├── hello_lib.py └── hello_main.py ├── cifar10 ├── README.md ├── cifar10.py ├── cifar10_async_dist_train.py ├── cifar10_eval.py ├── cifar10_input.py ├── cifar10_sync_dist_train.py └── cifar10_train.py ├── data ├── README.md ├── ptb.test.txt ├── ptb.train.txt ├── ptb.valid.txt ├── t10k-images-idx3-ubyte.gz ├── t10k-labels-idx1-ubyte.gz ├── text8.zip ├── train-images-idx3-ubyte.gz └── train-labels-idx1-ubyte.gz ├── distributed ├── README.md ├── create_tf_server_yaml.py ├── mnist_cnn.py ├── mnist_dnn.py ├── mnist_dnn_sync_update.py ├── start_servers.py ├── start_tf.sh └── word2vector.py ├── file_server.py ├── imagenet_serving ├── Dockerfile ├── README.md ├── pic │ ├── 02ea79e4aad9d6275da78a9170fa4e82.jpg │ ├── 07889356d62fa6517b0db6cf9dcf1f96.jpg │ ├── 516308313.jpg │ └── 7e7e745620d307aa2cb4afcdfa90d189.jpg └── serving.json ├── k8s_tf.yaml ├── k8s_tf_runner.yaml ├── notebooks ├── RNN_PennTreeBank_LanguageModeling.ipynb ├── hello_world.ipynb ├── mnist.ipynb ├── mnist_cnn.ipynb ├── mnist_skflow.ipynb ├── mnist_tensorboard.ipynb ├── scope.ipynb └── word2vec_basic.ipynb ├── picture ├── create_local.png ├── create_terminal.png ├── dist_creation.png ├── expanded_view.png ├── homepage.png ├── jupyter.png ├── list_view.png └── terminal_view.png ├── run_tf.sh └── word2vector ├── BUILD ├── README.md ├── __init__.py ├── index.html ├── questions-words.txt ├── word2vec.py ├── word2vec_basic.py ├── word2vec_kernels.cc ├── word2vec_ops.cc ├── word2vec_optimized.py └── words_calculator_server.py /.github/ISSUE_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | **Is this a BUG REPORT or FEATURE REQUEST?**: 4 | 5 | > Uncomment only one, leave it on its own line: 6 | > 7 | > /kind bug 8 | > /kind feature 9 | 10 | **What happened**: 11 | 12 | **What you expected to happen**: 13 | 14 
| **How to reproduce it (as minimally and precisely as possible)**: 15 | 16 | **Anything else we need to know?**: 17 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | **What this PR does / why we need it**: 4 | 5 | Add your description 6 | 7 | **Which issue(s) this PR is related to** *(optional, link to 3rd issue(s))*: 8 | 9 | Fixes # 10 | 11 | Reference to # 12 | 13 | 14 | **Special notes for your reviewer**: 15 | 16 | /cc @your-reviewer 17 | 18 | 26 | 27 | **Release note**: 28 | 32 | 33 | ```release-note 34 | NONE 35 | ``` 36 | 37 | 47 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | word2vector/text8* 3 | word2vector/glove* 4 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM index.caicloud.io/caicloud/ml-libs 2 | 3 | RUN apt-get update && apt-get install -y bc 4 | 5 | RUN rm -rf /notebooks/* 6 | 7 | COPY data /tmp/data 8 | COPY file_server.py /file_server.py 9 | COPY run_tf.sh /run_tf.sh 10 | 11 | COPY notebooks /notebooks 12 | COPY distributed /distributed 13 | 14 | CMD ["/run_tf.sh"] 15 | -------------------------------------------------------------------------------- /OWNERS: -------------------------------------------------------------------------------- 1 | approvers: 2 | - perhapszzy 3 | reviewers: 4 | - perhapszzy 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TensorFlow Examples 2 | This repository contains TensorFlow example code for both distributed and non-distributed setups. 
**Contributions are very welcome.** 3 | 4 | ## Local examples 5 | To run the local examples in Jupyter Notebooks, you can either use caicloud.io directly or run them in Docker with the caicloud TensorFlow image. 6 | 7 | ### Use caicloud.io machine learning SaaS 8 | - Step 1. Log in to [caicloud.io](https://console.caicloud.io/login). Register [here](https://console.caicloud.io/reg) if you don't have a caicloud account. After logging in, you should see something like this: 9 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/homepage.png) 10 | 11 | - Step 2. Click on "机器学习" (Machine Learning) and then click on "单机实验" (Single-machine Experiment). If you haven't created an experiment yet, you should see something like the picture below. If you have already created one, you can skip Step 3. 12 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/create_local.png) 13 | 14 | - Step 3. Create an experiment environment by clicking "创建单机实验" (Create Single-machine Experiment) and filling in the required fields. 15 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/list_view.png) 16 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/expanded_view.png) 17 | 18 | - Step 4. Open Jupyter Notebook 19 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/jupyter.png) 20 | 21 | ### Use caicloud TensorFlow docker image 22 | - Step 1. [Install Docker](https://docs.docker.com/engine/installation/) 23 | 24 | - Step 2. Pull image 25 | 26 | ``` 27 | docker pull index.caicloud.io/tensorflow:0.8.0 28 | ``` 29 | 30 | Note that you need a [caicloud account](https://console.caicloud.io/reg) to pull the image. 31 | 32 | - Step 3. Start the image 33 | 34 | ``` 35 | docker run --net=host index.caicloud.io/tensorflow:0.8.0 36 | ``` 37 | 38 | - Step 4. Access the Jupyter Notebook at ```localhost:8888``` 39 | 40 | 41 | ## Distributed examples 42 | Distributed TensorFlow examples can only be run on [caicloud.io](caicloud.io). 43 | 44 | - Step 1. Create a distributed TensorFlow cluster. 
This may take a few minutes. Note that you'll need to create a Kubernetes cluster before deploying a TensorFlow cluster. This [doc](http://www.clipular.com/c/4898024607711232.png?k=8TxxmTwy57gXs7SZ9iVVopscjKg) describes how to create a Kubernetes cluster on caicloud.io. 45 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/dist_creation.png) 46 | 47 | - Step 2. Open Jupyter Notebook. 48 | 49 | - Step 3. Create a terminal. 50 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/create_terminal.png) 51 | ![alt text](https://github.com/caicloud/tensorflow-demo/blob/master/picture/terminal_view.png) 52 | 53 | - Step 4. Go into the distributed examples directory: 54 | 55 | ``` 56 | cd /distributed 57 | ls 58 | ``` 59 | 60 | - Step 5. Run the examples by following the instructions [here](https://github.com/caicloud/tensorflow-demo/blob/master/distributed/README.md) 61 | 62 | -------------------------------------------------------------------------------- /bazel/BUILD: -------------------------------------------------------------------------------- 1 | py_library( 2 | name = "hello_lib", 3 | srcs = [ 4 | "hello_lib.py", 5 | ] 6 | ) 7 | 8 | py_binary( 9 | name = "hello_main", 10 | srcs = [ 11 | "hello_main.py", 12 | ], 13 | deps = [ 14 | ":hello_lib", 15 | ], 16 | ) 17 | 18 | -------------------------------------------------------------------------------- /bazel/WORKSPACE: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/bazel/WORKSPACE -------------------------------------------------------------------------------- /bazel/hello_lib.py: -------------------------------------------------------------------------------- 1 | def print_hello_world(): 2 | print('Hello World') 3 | -------------------------------------------------------------------------------- /bazel/hello_main.py: 
-------------------------------------------------------------------------------- 1 | import hello_lib 2 | hello_lib.print_hello_world() 3 | -------------------------------------------------------------------------------- /cifar10/README.md: -------------------------------------------------------------------------------- 1 | CIFAR-10 is a common benchmark in machine learning for image recognition. 2 | 3 | http://www.cs.toronto.edu/~kriz/cifar.html 4 | 5 | Code in this directory demonstrates how to use TensorFlow to train and evaluate a convolutional neural network (CNN) on both CPU and GPU. We also demonstrate how to train a CNN over multiple GPUs. 6 | 7 | Detailed instructions on how to get started are available at: 8 | 9 | http://tensorflow.org/tutorials/deep_cnn/ 10 | 11 | -------------------------------------------------------------------------------- /cifar10/cifar10.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Builds the CIFAR-10 network. 17 | 18 | Summary of available functions: 19 | 20 | # Compute input images and labels for training. If you would like to run 21 | # evaluations, use inputs() instead. 
22 | inputs, labels = distorted_inputs() 23 | 24 | # Compute inference on the model inputs to make a prediction. 25 | predictions = inference(inputs) 26 | 27 | # Compute the total loss of the prediction with respect to the labels. 28 | loss = loss(predictions, labels) 29 | 30 | # Create a graph to run one step of training with respect to the loss. 31 | train_op = train(loss, global_step) 32 | """ 33 | # pylint: disable=missing-docstring 34 | from __future__ import absolute_import 35 | from __future__ import division 36 | from __future__ import print_function 37 | 38 | import gzip 39 | import os 40 | import re 41 | import sys 42 | import tarfile 43 | 44 | from six.moves import urllib 45 | import tensorflow as tf 46 | 47 | from tensorflow.models.image.cifar10 import cifar10_input 48 | 49 | FLAGS = tf.app.flags.FLAGS 50 | 51 | # Basic model parameters. 52 | tf.app.flags.DEFINE_integer('batch_size', 128, 53 | """Number of images to process in a batch.""") 54 | tf.app.flags.DEFINE_string('data_dir', '/tmp/cifar10_data', 55 | """Path to the CIFAR-10 data directory.""") 56 | 57 | # Global constants describing the CIFAR-10 data set. 58 | IMAGE_SIZE = cifar10_input.IMAGE_SIZE 59 | NUM_CLASSES = cifar10_input.NUM_CLASSES 60 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = cifar10_input.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN 61 | NUM_EXAMPLES_PER_EPOCH_FOR_EVAL = cifar10_input.NUM_EXAMPLES_PER_EPOCH_FOR_EVAL 62 | 63 | 64 | # Constants describing the training process. 65 | MOVING_AVERAGE_DECAY = 0.9999 # The decay to use for the moving average. 66 | NUM_EPOCHS_PER_DECAY = 350.0 # Epochs after which learning rate decays. 67 | LEARNING_RATE_DECAY_FACTOR = 0.1 # Learning rate decay factor. 68 | INITIAL_LEARNING_RATE = 0.1 # Initial learning rate. 69 | 70 | # If a model is trained with multiple GPUs, prefix all Op names with tower_name 71 | # to differentiate the operations. Note that this prefix is removed from the 72 | # names of the summaries when visualizing a model. 
73 | TOWER_NAME = 'tower' 74 | 75 | DATA_URL = 'http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz' 76 | 77 | 78 | def _activation_summary(x): 79 | """Helper to create summaries for activations. 80 | 81 | Creates a summary that provides a histogram of activations. 82 | Creates a summary that measure the sparsity of activations. 83 | 84 | Args: 85 | x: Tensor 86 | Returns: 87 | nothing 88 | """ 89 | # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training 90 | # session. This helps the clarity of presentation on tensorboard. 91 | tensor_name = re.sub('%s_[0-9]*/' % TOWER_NAME, '', x.op.name) 92 | tf.histogram_summary(tensor_name + '/activations', x) 93 | tf.scalar_summary(tensor_name + '/sparsity', tf.nn.zero_fraction(x)) 94 | 95 | 96 | def _variable_on_cpu(name, shape, initializer): 97 | """Helper to create a Variable stored on CPU memory. 98 | 99 | Args: 100 | name: name of the variable 101 | shape: list of ints 102 | initializer: initializer for Variable 103 | 104 | Returns: 105 | Variable Tensor 106 | """ 107 | with tf.device('/cpu:0'): 108 | var = tf.get_variable(name, shape, initializer=initializer) 109 | return var 110 | 111 | 112 | def _variable_with_weight_decay(name, shape, stddev, wd): 113 | """Helper to create an initialized Variable with weight decay. 114 | 115 | Note that the Variable is initialized with a truncated normal distribution. 116 | A weight decay is added only if one is specified. 117 | 118 | Args: 119 | name: name of the variable 120 | shape: list of ints 121 | stddev: standard deviation of a truncated Gaussian 122 | wd: add L2Loss weight decay multiplied by this float. If None, weight 123 | decay is not added for this Variable. 
124 | 125 | Returns: 126 | Variable Tensor 127 | """ 128 | var = _variable_on_cpu(name, shape, 129 | tf.truncated_normal_initializer(stddev=stddev)) 130 | if wd is not None: 131 | weight_decay = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss') 132 | tf.add_to_collection('losses', weight_decay) 133 | return var 134 | 135 | 136 | def distorted_inputs(): 137 | """Construct distorted input for CIFAR training using the Reader ops. 138 | 139 | Returns: 140 | images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size. 141 | labels: Labels. 1D tensor of [batch_size] size. 142 | 143 | Raises: 144 | ValueError: If no data_dir 145 | """ 146 | if not FLAGS.data_dir: 147 | raise ValueError('Please supply a data_dir') 148 | data_dir = os.path.join(FLAGS.data_dir, 'cifar-10-batches-bin') 149 | return cifar10_input.distorted_inputs(data_dir=data_dir, 150 | batch_size=FLAGS.batch_size) 151 | 152 | 153 | def inputs(eval_data): 154 | """Construct input for CIFAR evaluation using the Reader ops. 155 | 156 | Args: 157 | eval_data: bool, indicating if one should use the train or eval data set. 158 | 159 | Returns: 160 | images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size. 161 | labels: Labels. 1D tensor of [batch_size] size. 162 | 163 | Raises: 164 | ValueError: If no data_dir 165 | """ 166 | if not FLAGS.data_dir: 167 | raise ValueError('Please supply a data_dir') 168 | data_dir = os.path.join(FLAGS.data_dir, 'cifar-10-batches-bin') 169 | return cifar10_input.inputs(eval_data=eval_data, data_dir=data_dir, 170 | batch_size=FLAGS.batch_size) 171 | 172 | 173 | def inference(images): 174 | """Build the CIFAR-10 model. 175 | 176 | Args: 177 | images: Images returned from distorted_inputs() or inputs(). 178 | 179 | Returns: 180 | Logits. 181 | """ 182 | # We instantiate all variables using tf.get_variable() instead of 183 | # tf.Variable() in order to share variables across multiple GPU training runs. 
184 | # If we only ran this model on a single GPU, we could simplify this function 185 | # by replacing all instances of tf.get_variable() with tf.Variable(). 186 | # 187 | # conv1 188 | with tf.variable_scope('conv1') as scope: 189 | kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64], 190 | stddev=1e-4, wd=0.0) 191 | conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME') 192 | biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0)) 193 | bias = tf.nn.bias_add(conv, biases) 194 | conv1 = tf.nn.relu(bias, name=scope.name) 195 | _activation_summary(conv1) 196 | 197 | # pool1 198 | pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], 199 | padding='SAME', name='pool1') 200 | # norm1 201 | norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, 202 | name='norm1') 203 | 204 | # conv2 205 | with tf.variable_scope('conv2') as scope: 206 | kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64], 207 | stddev=1e-4, wd=0.0) 208 | conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME') 209 | biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1)) 210 | bias = tf.nn.bias_add(conv, biases) 211 | conv2 = tf.nn.relu(bias, name=scope.name) 212 | _activation_summary(conv2) 213 | 214 | # norm2 215 | norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, 216 | name='norm2') 217 | # pool2 218 | pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1], 219 | strides=[1, 2, 2, 1], padding='SAME', name='pool2') 220 | 221 | # local3 222 | with tf.variable_scope('local3') as scope: 223 | # Move everything into depth so we can perform a single matrix multiply. 
224 | reshape = tf.reshape(pool2, [FLAGS.batch_size, -1]) 225 | dim = reshape.get_shape()[1].value 226 | weights = _variable_with_weight_decay('weights', shape=[dim, 384], 227 | stddev=0.04, wd=0.004) 228 | biases = _variable_on_cpu('biases', [384], tf.constant_initializer(0.1)) 229 | local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name) 230 | _activation_summary(local3) 231 | 232 | # local4 233 | with tf.variable_scope('local4') as scope: 234 | weights = _variable_with_weight_decay('weights', shape=[384, 192], 235 | stddev=0.04, wd=0.004) 236 | biases = _variable_on_cpu('biases', [192], tf.constant_initializer(0.1)) 237 | local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name=scope.name) 238 | _activation_summary(local4) 239 | 240 | # softmax, i.e. softmax(WX + b) 241 | with tf.variable_scope('softmax_linear') as scope: 242 | weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES], 243 | stddev=1/192.0, wd=0.0) 244 | biases = _variable_on_cpu('biases', [NUM_CLASSES], 245 | tf.constant_initializer(0.0)) 246 | softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name) 247 | _activation_summary(softmax_linear) 248 | 249 | return softmax_linear 250 | 251 | 252 | def loss(logits, labels): 253 | """Add L2Loss to all the trainable variables. 254 | 255 | Add summary for "Loss" and "Loss/avg". 256 | Args: 257 | logits: Logits from inference(). 258 | labels: Labels from distorted_inputs or inputs(). 1-D tensor 259 | of shape [batch_size] 260 | 261 | Returns: 262 | Loss tensor of type float. 263 | """ 264 | # Calculate the average cross entropy loss across the batch. 
265 | labels = tf.cast(labels, tf.int64) 266 | cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits( 267 | logits, labels, name='cross_entropy_per_example') 268 | cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy') 269 | tf.add_to_collection('losses', cross_entropy_mean) 270 | 271 | # The total loss is defined as the cross entropy loss plus all of the weight 272 | # decay terms (L2 loss). 273 | return tf.add_n(tf.get_collection('losses'), name='total_loss') 274 | 275 | 276 | def _add_loss_summaries(total_loss): 277 | """Add summaries for losses in CIFAR-10 model. 278 | 279 | Generates moving average for all losses and associated summaries for 280 | visualizing the performance of the network. 281 | 282 | Args: 283 | total_loss: Total loss from loss(). 284 | Returns: 285 | loss_averages_op: op for generating moving averages of losses. 286 | """ 287 | # Compute the moving average of all individual losses and the total loss. 288 | loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg') 289 | losses = tf.get_collection('losses') 290 | loss_averages_op = loss_averages.apply(losses + [total_loss]) 291 | 292 | # Attach a scalar summary to all individual losses and the total loss; do the 293 | # same for the averaged version of the losses. 294 | for l in losses + [total_loss]: 295 | # Name each loss as '(raw)' and name the moving average version of the loss 296 | # as the original loss name. 297 | tf.scalar_summary(l.op.name +' (raw)', l) 298 | tf.scalar_summary(l.op.name, loss_averages.average(l)) 299 | 300 | return loss_averages_op 301 | 302 | 303 | def train(total_loss, global_step): 304 | """Train CIFAR-10 model. 305 | 306 | Create an optimizer and apply to all trainable variables. Add moving 307 | average for all trainable variables. 308 | 309 | Args: 310 | total_loss: Total loss from loss(). 311 | global_step: Integer Variable counting the number of training steps 312 | processed. 
313 | Returns: 314 | train_op: op for training. 315 | """ 316 | # Variables that affect learning rate. 317 | num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size 318 | decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY) 319 | 320 | # Decay the learning rate exponentially based on the number of steps. 321 | lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE, 322 | global_step, 323 | decay_steps, 324 | LEARNING_RATE_DECAY_FACTOR, 325 | staircase=True) 326 | tf.scalar_summary('learning_rate', lr) 327 | 328 | # Generate moving averages of all losses and associated summaries. 329 | loss_averages_op = _add_loss_summaries(total_loss) 330 | 331 | # Compute gradients. 332 | with tf.control_dependencies([loss_averages_op]): 333 | opt = tf.train.GradientDescentOptimizer(lr) 334 | grads = opt.compute_gradients(total_loss) 335 | 336 | # Apply gradients. 337 | apply_gradient_op = opt.apply_gradients(grads, global_step=global_step) 338 | 339 | # Add histograms for trainable variables. 340 | for var in tf.trainable_variables(): 341 | tf.histogram_summary(var.op.name, var) 342 | 343 | # Add histograms for gradients. 344 | for grad, var in grads: 345 | if grad is not None: 346 | tf.histogram_summary(var.op.name + '/gradients', grad) 347 | 348 | # Track the moving averages of all trainable variables. 
349 | variable_averages = tf.train.ExponentialMovingAverage( 350 | MOVING_AVERAGE_DECAY, global_step) 351 | variables_averages_op = variable_averages.apply(tf.trainable_variables()) 352 | 353 | with tf.control_dependencies([apply_gradient_op, variables_averages_op]): 354 | train_op = tf.no_op(name='train') 355 | 356 | return train_op 357 | 358 | 359 | def maybe_download_and_extract(): 360 | """Download and extract the tarball from Alex's website.""" 361 | dest_directory = FLAGS.data_dir 362 | if not os.path.exists(dest_directory): 363 | os.makedirs(dest_directory) 364 | filename = DATA_URL.split('/')[-1] 365 | filepath = os.path.join(dest_directory, filename) 366 | if not os.path.exists(filepath): 367 | def _progress(count, block_size, total_size): 368 | sys.stdout.write('\r>> Downloading %s %.1f%%' % (filename, 369 | float(count * block_size) / float(total_size) * 100.0)) 370 | sys.stdout.flush() 371 | filepath, _ = urllib.request.urlretrieve(DATA_URL, filepath, _progress) 372 | print() 373 | statinfo = os.stat(filepath) 374 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 375 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 376 | -------------------------------------------------------------------------------- /cifar10/cifar10_async_dist_train.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | from datetime import datetime 6 | import os.path 7 | import time 8 | 9 | import numpy as np 10 | from six.moves import xrange 11 | import tensorflow as tf 12 | 13 | from tensorflow.models.image.cifar10 import cifar10 14 | 15 | FLAGS = tf.app.flags.FLAGS 16 | 17 | tf.app.flags.DEFINE_string('job_name', '', 'One of "ps", "worker"') 18 | tf.app.flags.DEFINE_string('ps_hosts', '', 19 | """Comma-separated list of hostname:port for the """ 20 | """parameter server jobs. e.g. 
""" 21 | """'machine1:2222,machine2:1111,machine2:2222'""") 22 | tf.app.flags.DEFINE_string('worker_hosts', '', 23 | """Comma-separated list of hostname:port for the """ 24 | """worker jobs. e.g. """ 25 | """'machine1:2222,machine2:1111,machine2:2222'""") 26 | tf.app.flags.DEFINE_integer('task_id', 0, 'Task ID of the worker/replica running the training.') 27 | 28 | 29 | tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train', 30 | """Directory where to write event logs """ 31 | """and checkpoint.""") 32 | tf.app.flags.DEFINE_integer('max_steps', 1000000, 33 | """Number of batches to run.""") 34 | tf.app.flags.DEFINE_boolean('log_device_placement', False, 35 | """Whether to log device placement.""") 36 | 37 | tf.logging.set_verbosity(tf.logging.INFO) 38 | 39 | def train(): 40 | ps_hosts = FLAGS.ps_hosts.split(',') 41 | worker_hosts = FLAGS.worker_hosts.split(',') 42 | print ('PS hosts are: %s' % ps_hosts) 43 | print ('Worker hosts are: %s' % worker_hosts) 44 | 45 | server = tf.train.Server( 46 | {'ps': ps_hosts, 'worker': worker_hosts}, 47 | job_name = FLAGS.job_name, 48 | task_index=FLAGS.task_id) 49 | 50 | if FLAGS.job_name == 'ps': 51 | server.join() 52 | 53 | is_chief = (FLAGS.task_id == 0) 54 | if is_chief: 55 | if tf.gfile.Exists(FLAGS.train_dir): 56 | tf.gfile.DeleteRecursively(FLAGS.train_dir) 57 | tf.gfile.MakeDirs(FLAGS.train_dir) 58 | 59 | device_setter = tf.train.replica_device_setter(ps_tasks=1) 60 | with tf.device('/job:worker/task:%d' % FLAGS.task_id): 61 | with tf.device(device_setter): 62 | global_step = tf.Variable(0, trainable=False) 63 | 64 | # Get images and labels for CIFAR-10. 65 | images, labels = cifar10.distorted_inputs() 66 | 67 | # Build a Graph that computes the logits predictions from the 68 | # inference model. 69 | logits = cifar10.inference(images) 70 | 71 | # Calculate loss. 
72 | loss = cifar10.loss(logits, labels) 73 | train_op = cifar10.train(loss, global_step) 74 | 75 | saver = tf.train.Saver() 76 | # The Supervisor below is given a summary_op, so on the chief it runs the 77 | # summaries in a separate summary thread. Running summaries and training 78 | # operations in parallel could run out of GPU memory; pass summary_op=None 79 | # to avoid starting that thread. 80 | sv = tf.train.Supervisor(is_chief=is_chief, 81 | logdir=FLAGS.train_dir, 82 | init_op=tf.initialize_all_variables(), 83 | summary_op=tf.merge_all_summaries(), 84 | global_step=global_step, 85 | saver=saver, 86 | save_model_secs=60) 87 | 88 | tf.logging.info('%s Supervisor' % datetime.now()) 89 | 90 | sess_config = tf.ConfigProto(allow_soft_placement=True, 91 | log_device_placement=FLAGS.log_device_placement) 92 | 93 | print ("Before session init") 94 | # Get a session. 95 | sess = sv.prepare_or_wait_for_session(server.target, config=sess_config) 96 | print ("Session init done") 97 | 98 | # Start the queue runners. 99 | queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS) 100 | sv.start_queue_runners(sess, queue_runners) 101 | print ('Started %d queues for processing input data.' 
% len(queue_runners)) 102 | 103 | """Train CIFAR-10 for a number of steps.""" 104 | for step in xrange(FLAGS.max_steps): 105 | start_time = time.time() 106 | _, loss_value, gs = sess.run([train_op, loss, global_step]) 107 | duration = time.time() - start_time 108 | 109 | assert not np.isnan(loss_value), 'Model diverged with loss = NaN' 110 | 111 | if step % 10 == 0: 112 | num_examples_per_step = FLAGS.batch_size 113 | examples_per_sec = num_examples_per_step / duration 114 | sec_per_batch = float(duration) 115 | 116 | format_str = ('%s: step %d (global_step %d), loss = %.2f (%.1f examples/sec; %.3f sec/batch)') 117 | print (format_str % (datetime.now(), step, gs, loss_value, examples_per_sec, sec_per_batch)) 118 | 119 | if is_chief: 120 | saver.save(sess, os.path.join(FLAGS.train_dir, 'model.ckpt'), global_step=global_step) 121 | 122 | 123 | def main(argv=None): 124 | cifar10.maybe_download_and_extract() 125 | train() 126 | 127 | if __name__ == '__main__': 128 | tf.app.run() 129 | -------------------------------------------------------------------------------- /cifar10/cifar10_eval.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Evaluation for CIFAR-10. 
17 | 18 | Accuracy: 19 | cifar10_train.py achieves 83.0% accuracy after 100K steps (256 epochs 20 | of data) as judged by cifar10_eval.py. 21 | 22 | Speed: 23 | On a single Tesla K40, cifar10_train.py processes a single batch of 128 images 24 | in 0.25-0.35 sec (i.e. 350 - 600 images /sec). The model reaches ~86% 25 | accuracy after 100K steps in 8 hours of training time. 26 | 27 | Usage: 28 | Please see the tutorial and website for how to download the CIFAR-10 29 | data set, compile the program and train the model. 30 | 31 | http://tensorflow.org/tutorials/deep_cnn/ 32 | """ 33 | from __future__ import absolute_import 34 | from __future__ import division 35 | from __future__ import print_function 36 | 37 | from datetime import datetime 38 | import math 39 | import time 40 | 41 | import numpy as np 42 | import tensorflow as tf 43 | 44 | from tensorflow.models.image.cifar10 import cifar10 45 | 46 | FLAGS = tf.app.flags.FLAGS 47 | 48 | tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval', 49 | """Directory where to write event logs.""") 50 | tf.app.flags.DEFINE_string('eval_data', 'test', 51 | """Either 'test' or 'train_eval'.""") 52 | tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train', 53 | """Directory where to read model checkpoints.""") 54 | tf.app.flags.DEFINE_integer('eval_interval_secs', 60 * 5, 55 | """How often to run the eval.""") 56 | tf.app.flags.DEFINE_integer('num_examples', 10000, 57 | """Number of examples to run.""") 58 | tf.app.flags.DEFINE_boolean('run_once', False, 59 | """Whether to run eval only once.""") 60 | 61 | 62 | def eval_once(saver, summary_writer, top_k_op, summary_op): 63 | """Run Eval once. 64 | 65 | Args: 66 | saver: Saver. 67 | summary_writer: Summary writer. 68 | top_k_op: Top K op. 69 | summary_op: Summary op. 
70 | """ 71 | with tf.Session() as sess: 72 | ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir) 73 | if ckpt and ckpt.model_checkpoint_path: 74 | # Restores from checkpoint 75 | saver.restore(sess, ckpt.model_checkpoint_path) 76 | # Assuming model_checkpoint_path looks something like: 77 | # /my-favorite-path/cifar10_train/model.ckpt-0, 78 | # extract global_step from it. 79 | global_step = ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1] 80 | else: 81 | print('No checkpoint file found') 82 | return 83 | 84 | # Start the queue runners. 85 | coord = tf.train.Coordinator() 86 | try: 87 | threads = [] 88 | for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS): 89 | threads.extend(qr.create_threads(sess, coord=coord, daemon=True, 90 | start=True)) 91 | 92 | num_iter = int(math.ceil(FLAGS.num_examples / FLAGS.batch_size)) 93 | true_count = 0 # Counts the number of correct predictions. 94 | total_sample_count = num_iter * FLAGS.batch_size 95 | step = 0 96 | while step < num_iter and not coord.should_stop(): 97 | predictions = sess.run([top_k_op]) 98 | true_count += np.sum(predictions) 99 | step += 1 100 | 101 | # Compute precision @ 1. 102 | precision = true_count / total_sample_count 103 | print('%s: precision @ 1 = %.3f' % (datetime.now(), precision)) 104 | 105 | summary = tf.Summary() 106 | summary.ParseFromString(sess.run(summary_op)) 107 | summary.value.add(tag='Precision @ 1', simple_value=precision) 108 | summary_writer.add_summary(summary, global_step) 109 | except Exception as e: # pylint: disable=broad-except 110 | coord.request_stop(e) 111 | 112 | coord.request_stop() 113 | coord.join(threads, stop_grace_period_secs=10) 114 | 115 | 116 | def evaluate(): 117 | """Eval CIFAR-10 for a number of steps.""" 118 | with tf.Graph().as_default() as g: 119 | # Get images and labels for CIFAR-10. 
120 | eval_data = FLAGS.eval_data == 'test' 121 | images, labels = cifar10.inputs(eval_data=eval_data) 122 | 123 | # Build a Graph that computes the logits predictions from the 124 | # inference model. 125 | logits = cifar10.inference(images) 126 | 127 | # Calculate predictions. 128 | top_k_op = tf.nn.in_top_k(logits, labels, 1) 129 | 130 | # Restore the moving average version of the learned variables for eval. 131 | variable_averages = tf.train.ExponentialMovingAverage( 132 | cifar10.MOVING_AVERAGE_DECAY) 133 | variables_to_restore = variable_averages.variables_to_restore() 134 | saver = tf.train.Saver(variables_to_restore) 135 | 136 | # Build the summary operation based on the TF collection of Summaries. 137 | summary_op = tf.merge_all_summaries() 138 | 139 | summary_writer = tf.train.SummaryWriter(FLAGS.eval_dir, g) 140 | 141 | while True: 142 | eval_once(saver, summary_writer, top_k_op, summary_op) 143 | if FLAGS.run_once: 144 | break 145 | time.sleep(FLAGS.eval_interval_secs) 146 | 147 | 148 | def main(argv=None): # pylint: disable=unused-argument 149 | cifar10.maybe_download_and_extract() 150 | if tf.gfile.Exists(FLAGS.eval_dir): 151 | tf.gfile.DeleteRecursively(FLAGS.eval_dir) 152 | tf.gfile.MakeDirs(FLAGS.eval_dir) 153 | evaluate() 154 | 155 | 156 | if __name__ == '__main__': 157 | tf.app.run() 158 | -------------------------------------------------------------------------------- /cifar10/cifar10_input.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Routine for decoding the CIFAR-10 binary file format.""" 17 | 18 | from __future__ import absolute_import 19 | from __future__ import division 20 | from __future__ import print_function 21 | 22 | import os 23 | 24 | from six.moves import xrange # pylint: disable=redefined-builtin 25 | import tensorflow as tf 26 | 27 | # Process images of this size. Note that this differs from the original CIFAR 28 | # image size of 32 x 32. If one alters this number, then the entire model 29 | # architecture will change and any model would need to be retrained. 30 | IMAGE_SIZE = 24 31 | 32 | # Global constants describing the CIFAR-10 data set. 33 | NUM_CLASSES = 10 34 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000 35 | NUM_EXAMPLES_PER_EPOCH_FOR_EVAL = 10000 36 | 37 | 38 | def read_cifar10(filename_queue): 39 | """Reads and parses examples from CIFAR10 data files. 40 | 41 | Recommendation: if you want N-way read parallelism, call this function 42 | N times. This will give you N independent Readers reading different 43 | files & positions within those files, which will give better mixing of 44 | examples. 45 | 46 | Args: 47 | filename_queue: A queue of strings with the filenames to read from. 
48 | 49 | Returns: 50 | An object representing a single example, with the following fields: 51 | height: number of rows in the result (32) 52 | width: number of columns in the result (32) 53 | depth: number of color channels in the result (3) 54 | key: a scalar string Tensor describing the filename & record number 55 | for this example. 56 | label: an int32 Tensor with the label in the range 0..9. 57 | uint8image: a [height, width, depth] uint8 Tensor with the image data 58 | """ 59 | 60 | class CIFAR10Record(object): 61 | pass 62 | result = CIFAR10Record() 63 | 64 | # Dimensions of the images in the CIFAR-10 dataset. 65 | # See http://www.cs.toronto.edu/~kriz/cifar.html for a description of the 66 | # input format. 67 | label_bytes = 1 # 2 for CIFAR-100 68 | result.height = 32 69 | result.width = 32 70 | result.depth = 3 71 | image_bytes = result.height * result.width * result.depth 72 | # Every record consists of a label followed by the image, with a 73 | # fixed number of bytes for each. 74 | record_bytes = label_bytes + image_bytes 75 | 76 | # Read a record, getting filenames from the filename_queue. No 77 | # header or footer in the CIFAR-10 format, so we leave header_bytes 78 | # and footer_bytes at their default of 0. 79 | reader = tf.FixedLengthRecordReader(record_bytes=record_bytes) 80 | result.key, value = reader.read(filename_queue) 81 | 82 | # Convert from a string to a vector of uint8 that is record_bytes long. 83 | record_bytes = tf.decode_raw(value, tf.uint8) 84 | 85 | # The first bytes represent the label, which we convert from uint8->int32. 86 | result.label = tf.cast( 87 | tf.slice(record_bytes, [0], [label_bytes]), tf.int32) 88 | 89 | # The remaining bytes after the label represent the image, which we reshape 90 | # from [depth * height * width] to [depth, height, width]. 
91 | depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]), 92 | [result.depth, result.height, result.width]) 93 | # Convert from [depth, height, width] to [height, width, depth]. 94 | result.uint8image = tf.transpose(depth_major, [1, 2, 0]) 95 | 96 | return result 97 | 98 | 99 | def _generate_image_and_label_batch(image, label, min_queue_examples, 100 | batch_size, shuffle): 101 | """Construct a queued batch of images and labels. 102 | 103 | Args: 104 | image: 3-D Tensor of [height, width, 3] of type float32. 105 | label: 1-D Tensor of type int32. 106 | min_queue_examples: int32, minimum number of samples to retain 107 | in the queue that provides batches of examples. 108 | batch_size: Number of images per batch. 109 | shuffle: boolean indicating whether to use a shuffling queue. 110 | 111 | Returns: 112 | images: Images. 4D tensor of [batch_size, height, width, 3] size. 113 | labels: Labels. 1D tensor of [batch_size] size. 114 | """ 115 | # Create a queue that shuffles the examples, and then 116 | # read 'batch_size' images + labels from the example queue. 117 | num_preprocess_threads = 16 118 | if shuffle: 119 | images, label_batch = tf.train.shuffle_batch( 120 | [image, label], 121 | batch_size=batch_size, 122 | num_threads=num_preprocess_threads, 123 | capacity=min_queue_examples + 3 * batch_size, 124 | min_after_dequeue=min_queue_examples) 125 | else: 126 | images, label_batch = tf.train.batch( 127 | [image, label], 128 | batch_size=batch_size, 129 | num_threads=num_preprocess_threads, 130 | capacity=min_queue_examples + 3 * batch_size) 131 | 132 | # Display the training images in the visualizer. 133 | tf.image_summary('images', images) 134 | 135 | return images, tf.reshape(label_batch, [batch_size]) 136 | 137 | 138 | def distorted_inputs(data_dir, batch_size): 139 | """Construct distorted input for CIFAR training using the Reader ops. 140 | 141 | Args: 142 | data_dir: Path to the CIFAR-10 data directory. 
143 | batch_size: Number of images per batch. 144 | 145 | Returns: 146 | images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size. 147 | labels: Labels. 1D tensor of [batch_size] size. 148 | """ 149 | filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) 150 | for i in xrange(1, 6)] 151 | for f in filenames: 152 | if not tf.gfile.Exists(f): 153 | raise ValueError('Failed to find file: ' + f) 154 | 155 | # Create a queue that produces the filenames to read. 156 | filename_queue = tf.train.string_input_producer(filenames) 157 | 158 | # Read examples from files in the filename queue. 159 | read_input = read_cifar10(filename_queue) 160 | reshaped_image = tf.cast(read_input.uint8image, tf.float32) 161 | 162 | height = IMAGE_SIZE 163 | width = IMAGE_SIZE 164 | 165 | # Image processing for training the network. Note the many random 166 | # distortions applied to the image. 167 | 168 | # Randomly crop a [height, width] section of the image. 169 | distorted_image = tf.random_crop(reshaped_image, [height, width, 3]) 170 | 171 | # Randomly flip the image horizontally. 172 | distorted_image = tf.image.random_flip_left_right(distorted_image) 173 | 174 | # Because the brightness and contrast operations below are not commutative, 175 | # consider randomizing the order in which they are applied. 176 | distorted_image = tf.image.random_brightness(distorted_image, 177 | max_delta=63) 178 | distorted_image = tf.image.random_contrast(distorted_image, 179 | lower=0.2, upper=1.8) 180 | 181 | # Subtract off the mean and divide by the standard deviation of the pixels. 182 | float_image = tf.image.per_image_whitening(distorted_image) 183 | 184 | # Ensure that the random shuffling has good mixing properties. 185 | min_fraction_of_examples_in_queue = 0.4 186 | min_queue_examples = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 187 | min_fraction_of_examples_in_queue) 188 | print ('Filling queue with %d CIFAR images before starting to train. ' 189 | 'This will take a few minutes.' 
% min_queue_examples) 190 | 191 | # Generate a batch of images and labels by building up a queue of examples. 192 | return _generate_image_and_label_batch(float_image, read_input.label, 193 | min_queue_examples, batch_size, 194 | shuffle=True) 195 | 196 | 197 | def inputs(eval_data, data_dir, batch_size): 198 | """Construct input for CIFAR evaluation using the Reader ops. 199 | 200 | Args: 201 | eval_data: bool; True reads the eval (test) data set, False the training data set. 202 | data_dir: Path to the CIFAR-10 data directory. 203 | batch_size: Number of images per batch. 204 | 205 | Returns: 206 | images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size. 207 | labels: Labels. 1D tensor of [batch_size] size. 208 | """ 209 | if not eval_data: 210 | filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) 211 | for i in xrange(1, 6)] 212 | num_examples_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN 213 | else: 214 | filenames = [os.path.join(data_dir, 'test_batch.bin')] 215 | num_examples_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_EVAL 216 | 217 | for f in filenames: 218 | if not tf.gfile.Exists(f): 219 | raise ValueError('Failed to find file: ' + f) 220 | 221 | # Create a queue that produces the filenames to read. 222 | filename_queue = tf.train.string_input_producer(filenames) 223 | 224 | # Read examples from files in the filename queue. 225 | read_input = read_cifar10(filename_queue) 226 | reshaped_image = tf.cast(read_input.uint8image, tf.float32) 227 | 228 | height = IMAGE_SIZE 229 | width = IMAGE_SIZE 230 | 231 | # Image processing for evaluation. 232 | # Crop the central [height, width] of the image. 233 | resized_image = tf.image.resize_image_with_crop_or_pad(reshaped_image, 234 | height, width) 235 | 236 | # Subtract off the mean and divide by the standard deviation of the pixels. 237 | float_image = tf.image.per_image_whitening(resized_image) 238 | 239 | # Ensure that the random shuffling has good mixing properties. 
240 | min_fraction_of_examples_in_queue = 0.4 241 | min_queue_examples = int(num_examples_per_epoch * 242 | min_fraction_of_examples_in_queue) 243 | 244 | # Generate a batch of images and labels by building up a queue of examples. 245 | return _generate_image_and_label_batch(float_image, read_input.label, 246 | min_queue_examples, batch_size, 247 | shuffle=False) 248 | -------------------------------------------------------------------------------- /cifar10/cifar10_sync_dist_train.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | from datetime import datetime 6 | import os.path 7 | import time 8 | 9 | import numpy as np 10 | from six.moves import xrange # pylint: disable=redefined-builtin 11 | import tensorflow as tf 12 | 13 | from tensorflow.models.image.cifar10 import cifar10 14 | 15 | FLAGS = tf.app.flags.FLAGS 16 | 17 | tf.app.flags.DEFINE_string('job_name', '', 'One of "ps", "worker"') 18 | tf.app.flags.DEFINE_string('ps_hosts', '', 19 | """Comma-separated list of hostname:port for the """ 20 | """parameter server jobs. e.g. """ 21 | """'machine1:2222,machine2:1111,machine2:2222'""") 22 | tf.app.flags.DEFINE_string('worker_hosts', '', 23 | """Comma-separated list of hostname:port for the """ 24 | """worker jobs. e.g. 
""" 25 | """'machine1:2222,machine2:1111,machine2:2222'""") 26 | tf.app.flags.DEFINE_integer('task_id', 0, 'Task ID of the worker/replica running the training.') 27 | 28 | 29 | tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train', 30 | """Directory where to write event logs """ 31 | """and checkpoint.""") 32 | tf.app.flags.DEFINE_integer('max_steps', 1000000, 33 | """Number of batches to run.""") 34 | tf.app.flags.DEFINE_boolean('log_device_placement', False, 35 | """Whether to log device placement.""") 36 | 37 | tf.logging.set_verbosity(tf.logging.INFO) 38 | 39 | INITIAL_LEARNING_RATE = 0.1 # Initial learning rate. 40 | MOVING_AVERAGE_DECAY = 0.9999 # The decay to use for the moving average. 41 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000 42 | NUM_EPOCHS_PER_DECAY = 350.0 # Epochs after which learning rate decays. 43 | LEARNING_RATE_DECAY_FACTOR = 0.1 # Learning rate decay factor. 44 | 45 | def train(): 46 | ps_hosts = FLAGS.ps_hosts.split(',') 47 | worker_hosts = FLAGS.worker_hosts.split(',') 48 | print ('PS hosts are: %s' % ps_hosts) 49 | print ('Worker hosts are: %s' % worker_hosts) 50 | 51 | server = tf.train.Server({'ps': ps_hosts, 'worker': worker_hosts}, 52 | job_name = FLAGS.job_name, 53 | task_index=FLAGS.task_id) 54 | 55 | if FLAGS.job_name == 'ps': 56 | server.join() 57 | 58 | is_chief = (FLAGS.task_id == 0) 59 | if is_chief: 60 | if tf.gfile.Exists(FLAGS.train_dir): 61 | tf.gfile.DeleteRecursively(FLAGS.train_dir) 62 | tf.gfile.MakeDirs(FLAGS.train_dir) 63 | 64 | 65 | device_setter = tf.train.replica_device_setter(ps_tasks=1) 66 | with tf.device('/job:worker/task:%d' % FLAGS.task_id): 67 | with tf.device(device_setter): 68 | global_step = tf.Variable(0, trainable=False) 69 | 70 | # Get images and labels for CIFAR-10. 71 | images, labels = cifar10.distorted_inputs() 72 | 73 | # Build a Graph that computes the logits predictions from the 74 | # inference model. 75 | logits = cifar10.inference(images) 76 | 77 | # Calculate loss. 
78 | loss = cifar10.loss(logits, labels) 79 | 80 | # Need to re-implement the training part for sync updates. 81 | num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size 82 | decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY) 83 | 84 | # Decay the learning rate exponentially based on the number of steps. 85 | lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE, 86 | global_step, 87 | decay_steps, 88 | LEARNING_RATE_DECAY_FACTOR, 89 | staircase=True) 90 | tf.scalar_summary('learning_rate', lr) 91 | opt = tf.train.GradientDescentOptimizer(lr) 92 | 93 | # Track the moving averages of all trainable variables. 94 | exp_moving_averager = tf.train.ExponentialMovingAverage(MOVING_AVERAGE_DECAY, global_step) 95 | variables_to_average = (tf.trainable_variables() + tf.moving_average_variables()) 96 | 97 | opt = tf.train.SyncReplicasOptimizer( 98 | opt, 99 | replicas_to_aggregate=len(worker_hosts), 100 | replica_id=FLAGS.task_id, 101 | total_num_replicas=len(worker_hosts), 102 | variable_averages=exp_moving_averager, 103 | variables_to_average=variables_to_average) 104 | 105 | 106 | # Compute gradients with respect to the loss. 107 | grads = opt.compute_gradients(loss) 108 | 109 | # Add histograms for gradients. 110 | for grad, var in grads: 111 | if grad is not None: 112 | tf.histogram_summary(var.op.name + '/gradients', grad) 113 | 114 | apply_gradients_op = opt.apply_gradients(grads, global_step=global_step) 115 | 116 | with tf.control_dependencies([apply_gradients_op]): 117 | train_op = tf.identity(loss, name='train_op') 118 | 119 | 120 | chief_queue_runners = [opt.get_chief_queue_runner()] 121 | init_tokens_op = opt.get_init_tokens_op() 122 | 123 | saver = tf.train.Saver() 124 | # We run the summaries in the same thread as the training operations by 125 | # passing in None for summary_op to avoid a summary_thread being started. 126 | # Running summaries and training operations in parallel could run out of 127 | # GPU memory. 
128 | sv = tf.train.Supervisor(is_chief=is_chief, 129 | logdir=FLAGS.train_dir, 130 | init_op=tf.initialize_all_variables(), 131 | summary_op=tf.merge_all_summaries(), 132 | global_step=global_step, 133 | saver=saver, 134 | save_model_secs=60) 135 | 136 | tf.logging.info('%s Supervisor' % datetime.now()) 137 | 138 | sess_config = tf.ConfigProto(allow_soft_placement=True, 139 | log_device_placement=FLAGS.log_device_placement) 140 | 141 | print ("Before session init") 142 | # Get a session. 143 | sess = sv.prepare_or_wait_for_session(server.target, config=sess_config) 144 | print ("Session init done") 145 | 146 | # Start the queue runners. 147 | queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS) 148 | sv.start_queue_runners(sess, queue_runners) 149 | print ('Started %d queues for processing input data.' % len(queue_runners)) 150 | 151 | sv.start_queue_runners(sess, chief_queue_runners) 152 | sess.run(init_tokens_op) 153 | 154 | print ('Start training') 155 | """Train CIFAR-10 for a number of steps.""" 156 | for step in xrange(FLAGS.max_steps): 157 | start_time = time.time() 158 | _, loss_value, gs = sess.run([train_op, loss, global_step]) 159 | duration = time.time() - start_time 160 | 161 | assert not np.isnan(loss_value), 'Model diverged with loss = NaN' 162 | 163 | if step % 10 == 0: 164 | num_examples_per_step = FLAGS.batch_size 165 | examples_per_sec = num_examples_per_step / duration 166 | sec_per_batch = float(duration) 167 | 168 | format_str = ('%s: step %d (global_step %d), loss = %.2f (%.1f examples/sec; %.3f sec/batch)') 169 | print (format_str % (datetime.now(), step, gs, loss_value, examples_per_sec, sec_per_batch)) 170 | 171 | if is_chief: 172 | saver.save(sess, os.path.join(FLAGS.train_dir, 'model.ckpt'), 173 | global_step=global_step) 174 | 175 | def main(argv=None): 176 | cifar10.maybe_download_and_extract() 177 | train() 178 | 179 | if __name__ == '__main__': 180 | tf.app.run() 181 | 
-------------------------------------------------------------------------------- /cifar10/cifar10_train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """A binary to train CIFAR-10 using a single GPU. 17 | 18 | Accuracy: 19 | cifar10_train.py achieves ~86% accuracy after 100K steps (256 epochs of 20 | data) as judged by cifar10_eval.py. 21 | 22 | Speed: With batch_size 128. 23 | 24 | System | Step Time (sec/batch) | Accuracy 25 | ------------------------------------------------------------------ 26 | 1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours) 27 | 1 Tesla K40m | 0.25-0.35 | ~86% at 100K steps (4 hours) 28 | 29 | Usage: 30 | Please see the tutorial and website for how to download the CIFAR-10 31 | data set, compile the program and train the model. 
32 | 33 | http://tensorflow.org/tutorials/deep_cnn/ 34 | """ 35 | from __future__ import absolute_import 36 | from __future__ import division 37 | from __future__ import print_function 38 | 39 | from datetime import datetime 40 | import os.path 41 | import time 42 | 43 | import numpy as np 44 | from six.moves import xrange # pylint: disable=redefined-builtin 45 | import tensorflow as tf 46 | 47 | from tensorflow.models.image.cifar10 import cifar10 48 | 49 | FLAGS = tf.app.flags.FLAGS 50 | 51 | tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train', 52 | """Directory where to write event logs """ 53 | """and checkpoint.""") 54 | tf.app.flags.DEFINE_integer('max_steps', 1000000, 55 | """Number of batches to run.""") 56 | tf.app.flags.DEFINE_boolean('log_device_placement', False, 57 | """Whether to log device placement.""") 58 | 59 | 60 | def train(): 61 | """Train CIFAR-10 for a number of steps.""" 62 | with tf.Graph().as_default(): 63 | global_step = tf.Variable(0, trainable=False) 64 | 65 | # Get images and labels for CIFAR-10. 66 | images, labels = cifar10.distorted_inputs() 67 | 68 | # Build a Graph that computes the logits predictions from the 69 | # inference model. 70 | logits = cifar10.inference(images) 71 | 72 | # Calculate loss. 73 | loss = cifar10.loss(logits, labels) 74 | 75 | # Build a Graph that trains the model with one batch of examples and 76 | # updates the model parameters. 77 | train_op = cifar10.train(loss, global_step) 78 | 79 | # Create a saver. 80 | saver = tf.train.Saver(tf.all_variables()) 81 | 82 | # Build the summary operation based on the TF collection of Summaries. 83 | summary_op = tf.merge_all_summaries() 84 | 85 | # Build an initialization operation to run below. 86 | init = tf.initialize_all_variables() 87 | 88 | # Start running operations on the Graph. 89 | sess = tf.Session(config=tf.ConfigProto( 90 | log_device_placement=FLAGS.log_device_placement)) 91 | sess.run(init) 92 | 93 | # Start the queue runners. 
94 | tf.train.start_queue_runners(sess=sess) 95 | 96 | summary_writer = tf.train.SummaryWriter(FLAGS.train_dir, sess.graph) 97 | 98 | for step in xrange(FLAGS.max_steps): 99 | start_time = time.time() 100 | _, loss_value = sess.run([train_op, loss]) 101 | duration = time.time() - start_time 102 | 103 | assert not np.isnan(loss_value), 'Model diverged with loss = NaN' 104 | 105 | if step % 10 == 0: 106 | num_examples_per_step = FLAGS.batch_size 107 | examples_per_sec = num_examples_per_step / duration 108 | sec_per_batch = float(duration) 109 | 110 | format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f ' 111 | 'sec/batch)') 112 | print (format_str % (datetime.now(), step, loss_value, 113 | examples_per_sec, sec_per_batch)) 114 | 115 | if step % 100 == 0: 116 | summary_str = sess.run(summary_op) 117 | summary_writer.add_summary(summary_str, step) 118 | 119 | # Save the model checkpoint periodically. 120 | if step % 1000 == 0 or (step + 1) == FLAGS.max_steps: 121 | checkpoint_path = os.path.join(FLAGS.train_dir, 'model.ckpt') 122 | saver.save(sess, checkpoint_path, global_step=step) 123 | 124 | 125 | def main(argv=None): # pylint: disable=unused-argument 126 | cifar10.maybe_download_and_extract() 127 | if tf.gfile.Exists(FLAGS.train_dir): 128 | tf.gfile.DeleteRecursively(FLAGS.train_dir) 129 | tf.gfile.MakeDirs(FLAGS.train_dir) 130 | train() 131 | 132 | 133 | if __name__ == '__main__': 134 | tf.app.run() 135 | -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | ### Data needed in the examples. 
2 | 
--------------------------------------------------------------------------------
/data/t10k-images-idx3-ubyte.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/data/t10k-images-idx3-ubyte.gz
--------------------------------------------------------------------------------
/data/t10k-labels-idx1-ubyte.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/data/t10k-labels-idx1-ubyte.gz
--------------------------------------------------------------------------------
/data/text8.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/data/text8.zip
--------------------------------------------------------------------------------
/data/train-images-idx3-ubyte.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/data/train-images-idx3-ubyte.gz
--------------------------------------------------------------------------------
/data/train-labels-idx1-ubyte.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/data/train-labels-idx1-ubyte.gz
--------------------------------------------------------------------------------
/distributed/README.md:
--------------------------------------------------------------------------------
1 | ## Distributed TensorFlow examples.
2 | This directory includes a script that generates the YAML files for a TensorFlow cluster, plus several distributed TensorFlow examples.
3 | 
4 | #### Create yaml to start TensorFlow servers
5 | This [script](https://github.com/caicloud/tensorflow-demo/blob/master/distributed/create_tf_server_yaml.py) generates the YAML file needed to create a distributed TensorFlow cluster. Use ```--num_workers``` to specify the number of workers and ```--num_parameter_servers``` to specify the number of parameter servers.
6 | 
7 | 
8 | #### Start examples using the TensorFlow servers
9 | This [script](https://github.com/caicloud/tensorflow-demo/blob/master/distributed/start_tf.sh) can run any of the following examples on distributed TensorFlow. Example command:
10 | ```
11 | ./start_tf.sh 8 3 mnist_cnn.py
12 | ```
13 | The first parameter gives the number of workers. It can be equal to or smaller than the number of workers specified when creating the cluster.
14 | 
15 | The second parameter gives the number of parameter servers. It must be the same as the ```--num_parameter_servers``` value specified when creating the TensorFlow cluster.
16 | 
17 | The third parameter gives the script to run.
18 | 
19 | #### MNIST examples
20 | - DNN example ([code](https://github.com/caicloud/tensorflow-demo/blob/master/distributed/mnist_dnn.py))
21 | - DNN example with Sync updates ([code](https://github.com/caicloud/tensorflow-demo/blob/master/distributed/mnist_dnn_sync_update.py))
22 | - CNN example ([code](https://github.com/caicloud/tensorflow-demo/blob/master/distributed/mnist_cnn.py))
23 | 
24 | #### Word to Vector example
25 | - Word to Vector example ([code](https://github.com/caicloud/tensorflow-demo/blob/master/distributed/word2vector.py))
26 | 
27 | 
--------------------------------------------------------------------------------
/distributed/create_tf_server_yaml.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved.
3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # ============================================================================== 16 | 17 | """Generates YAML configuration files for distributed Tensorflow workers. 18 | The workers will be run in a Kubernetes (k8s) container cluster. 19 | """ 20 | from __future__ import absolute_import 21 | from __future__ import division 22 | from __future__ import print_function 23 | 24 | import argparse 25 | import sys 26 | 27 | # Note: It is intentional that we do not import tensorflow in this script. The 28 | # machine that launches a TensorFlow k8s cluster does not have to have the 29 | # Python package of TensorFlow installed on it. 30 | 31 | 32 | DEFAULT_DOCKER_IMAGE = 'tensorflow/tf_grpc_test_server' 33 | DEFAULT_PORT = 2222 34 | 35 | # TODO(cais): Consider adding resource requests/limits to the pods. 
36 | WORKER_RC = ( 37 | """apiVersion: v1 38 | kind: ReplicationController 39 | metadata: 40 | name: tf-worker{worker_id} 41 | spec: 42 | replicas: 1 43 | template: 44 | metadata: 45 | labels: 46 | tf-worker: "{worker_id}" 47 | spec: 48 | containers: 49 | - name: tf-worker{worker_id} 50 | image: {docker_image} 51 | args: 52 | - --cluster_spec={cluster_spec} 53 | - --job_name=worker 54 | - --task_id={worker_id} 55 | ports: 56 | - containerPort: {port} 57 | """) 58 | WORKER_SVC = ( 59 | """apiVersion: v1 60 | kind: Service 61 | metadata: 62 | name: tf-worker{worker_id} 63 | labels: 64 | tf-worker: "{worker_id}" 65 | spec: 66 | ports: 67 | - port: {port} 68 | targetPort: {port} 69 | selector: 70 | tf-worker: "{worker_id}" 71 | """) 72 | WORKER_LB_SVC = ( 73 | """apiVersion: v1 74 | kind: Service 75 | metadata: 76 | name: tf-worker{worker_id} 77 | labels: 78 | tf-worker: "{worker_id}" 79 | spec: 80 | type: LoadBalancer 81 | ports: 82 | - port: {port} 83 | selector: 84 | tf-worker: "{worker_id}" 85 | """) 86 | PARAM_SERVER_RC = ( 87 | """apiVersion: v1 88 | kind: ReplicationController 89 | metadata: 90 | name: tf-ps{param_server_id} 91 | spec: 92 | replicas: 1 93 | template: 94 | metadata: 95 | labels: 96 | tf-ps: "{param_server_id}" 97 | spec: 98 | containers: 99 | - name: tf-ps{param_server_id} 100 | image: {docker_image} 101 | args: 102 | - --cluster_spec={cluster_spec} 103 | - --job_name=ps 104 | - --task_id={param_server_id} 105 | ports: 106 | - containerPort: {port} 107 | """) 108 | PARAM_SERVER_SVC = ( 109 | """apiVersion: v1 110 | kind: Service 111 | metadata: 112 | name: tf-ps{param_server_id} 113 | labels: 114 | tf-ps: "{param_server_id}" 115 | spec: 116 | ports: 117 | - port: {port} 118 | selector: 119 | tf-ps: "{param_server_id}" 120 | """) 121 | 122 | 123 | def main(): 124 | """Do arg parsing.""" 125 | parser = argparse.ArgumentParser() 126 | parser.add_argument('--num_workers', 127 | type=int, 128 | default=2, 129 | help='How many worker pods to run') 130 
| parser.add_argument('--num_parameter_servers', 131 | type=int, 132 | default=1, 133 | help='How many parameter server pods to run') 134 | parser.add_argument('--grpc_port', 135 | type=int, 136 | default=DEFAULT_PORT, 137 | help='GRPC server port (Default: %d)' % DEFAULT_PORT) 138 | parser.add_argument('--request_load_balancer', 139 | type=lambda s: s.lower() in ('true', '1', 'yes'), 140 | default=False, 141 | help='To request worker0 to be exposed on a public IP ' 142 | 'address via an external load balancer, enabling you to ' 143 | 'run client processes from outside the cluster') 144 | parser.add_argument('--docker_image', 145 | type=str, 146 | default=DEFAULT_DOCKER_IMAGE, 147 | help='Override default docker image for the TensorFlow ' 148 | 'GRPC server') 149 | args = parser.parse_args() 150 | 151 | if args.num_workers <= 0: 152 | sys.stderr.write('--num_workers must be greater than 0; received %d\n' 153 | % args.num_workers) 154 | sys.exit(1) 155 | if args.num_parameter_servers <= 0: 156 | sys.stderr.write( 157 | '--num_parameter_servers must be greater than 0; received %d\n' 158 | % args.num_parameter_servers) 159 | sys.exit(1) 160 | 161 | # Generate contents of yaml config 162 | yaml_config = GenerateConfig(args.num_workers, 163 | args.num_parameter_servers, 164 | args.grpc_port, 165 | args.request_load_balancer, 166 | args.docker_image) 167 | print(yaml_config) # pylint: disable=superfluous-parens 168 | 169 | 170 | def GenerateConfig(num_workers, 171 | num_param_servers, 172 | port, 173 | request_load_balancer, 174 | docker_image): 175 | """Generate configuration strings.""" 176 | config = '' 177 | for worker in range(num_workers): 178 | config += WORKER_RC.format( 179 | port=port, 180 | worker_id=worker, 181 | docker_image=docker_image, 182 | cluster_spec=WorkerClusterSpecString(num_workers, 183 | num_param_servers, 184 | port)) 185 | config += '---\n' 186 | if request_load_balancer: 187 | config += WORKER_LB_SVC.format(port=port, 188 | worker_id=worker) 189 | else: 190 | config += 
WORKER_SVC.format(port=port, 191 | worker_id=worker) 192 | config += '---\n' 193 | 194 | for param_server in range(num_param_servers): 195 | config += PARAM_SERVER_RC.format( 196 | port=port, 197 | param_server_id=param_server, 198 | docker_image=docker_image, 199 | cluster_spec=ParamServerClusterSpecString(num_workers, 200 | num_param_servers, 201 | port)) 202 | config += '---\n' 203 | config += PARAM_SERVER_SVC.format(port=port, 204 | param_server_id=param_server) 205 | config += '---\n' 206 | 207 | return config 208 | 209 | 210 | def WorkerClusterSpecString(num_workers, 211 | num_param_servers, 212 | port): 213 | """Generates worker cluster spec.""" 214 | return ClusterSpecString(num_workers, num_param_servers, port) 215 | 216 | 217 | def ParamServerClusterSpecString(num_workers, 218 | num_param_servers, 219 | port): 220 | """Generates parameter server spec.""" 221 | return ClusterSpecString(num_workers, num_param_servers, port) 222 | 223 | 224 | def ClusterSpecString(num_workers, 225 | num_param_servers, 226 | port): 227 | """Generates general cluster spec.""" 228 | spec = 'worker|' 229 | for worker in range(num_workers): 230 | spec += 'tf-worker%d:%d' % (worker, port) 231 | if worker != num_workers-1: 232 | spec += ';' 233 | 234 | spec += ',ps|' 235 | for param_server in range(num_param_servers): 236 | spec += 'tf-ps%d:%d' % (param_server, port) 237 | if param_server != num_param_servers-1: 238 | spec += ';' 239 | 240 | return spec 241 | 242 | 243 | if __name__ == '__main__': 244 | main() 245 | -------------------------------------------------------------------------------- /distributed/mnist_cnn.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import time 3 | import datetime 4 | 5 | import numpy 6 | import tensorflow as tf 7 | from tensorflow.examples.tutorials.mnist import input_data 8 | 9 | flags = tf.app.flags 10 | flags.DEFINE_integer("worker_index", 0, 11 | "Worker task index, should be >= 0. 
worker_index=0 is " 12 | "the master worker task the performs the variable " 13 | "initialization ") 14 | 15 | flags.DEFINE_string("workers", None, 16 | "The worker url list, separated by comma (e.g. tf-worker1:2222,1.2.3.4:2222)") 17 | 18 | flags.DEFINE_string("parameter_servers", None, 19 | "The ps url list, separated by comma (e.g. tf-ps2:2222,1.2.3.5:2222)") 20 | 21 | flags.DEFINE_string("worker_grpc_url", None, 22 | "Worker GRPC URL (e.g., grpc://1.2.3.4:2222, or " 23 | "grpc://tf-worker0:2222)") 24 | 25 | flags.DEFINE_string("name_scope", None, "The variable name scope.") 26 | FLAGS = flags.FLAGS 27 | 28 | TRAING_STEP = 5000 29 | BATCH_SIZE = 64 30 | EVAL_SIZE = 50 31 | IMAGE_SIZE = 28 32 | NUM_CHANNELS = 1 33 | NUM_LABELS = 10 34 | 35 | print("Loading data from worker index = %d" % FLAGS.worker_index) 36 | mnist = input_data.read_data_sets("/tmp/data", one_hot=True) 37 | print("Testing set size: %d" % len(mnist.test.images)) 38 | print("Training set size: %d" % len(mnist.train.images)) 39 | 40 | print("Worker GRPC URL: %s" % FLAGS.worker_grpc_url) 41 | print("Workers = %s" % FLAGS.workers) 42 | print("Using name scope %s" % FLAGS.name_scope) 43 | 44 | is_chief = (FLAGS.worker_index == 0) 45 | if is_chief: tf.reset_default_graph() 46 | 47 | cluster = tf.train.ClusterSpec({"ps": FLAGS.parameter_servers.split(","), "worker": FLAGS.workers.split(",")}) 48 | # Construct device setter object 49 | device_setter = tf.train.replica_device_setter(cluster=cluster) 50 | 51 | # The device setter will automatically place Variables ops on separate 52 | # parameter servers (ps). The non-Variable ops will be placed on the workers. 53 | with tf.device(device_setter): 54 | with tf.name_scope(FLAGS.name_scope): 55 | global_step = tf.Variable(0, trainable=False) 56 | 57 | # The variables below hold all the trainable weights. 58 | # Convolutional layers. 
59 | conv1_weights = tf.Variable(tf.truncated_normal([5, 5, NUM_CHANNELS, 32], stddev=0.1, seed = 2)) 60 | conv1_biases = tf.Variable(tf.zeros([32])) 61 | 62 | conv2_weights = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1, seed = 2)) 63 | conv2_biases = tf.Variable(tf.constant(0.1, shape=[64])) 64 | 65 | # fully connected, depth 512. 66 | fc1_weights = tf.Variable(tf.truncated_normal([IMAGE_SIZE // 4 * IMAGE_SIZE // 4 * 64, 512], stddev=0.1, seed=2)) 67 | fc1_biases = tf.Variable(tf.constant(0.1, shape=[512])) 68 | 69 | fc2_weights = tf.Variable(tf.truncated_normal([512, NUM_LABELS], stddev=0.1, seed=2)) 70 | fc2_biases = tf.Variable(tf.constant(0.1, shape=[NUM_LABELS])) 71 | 72 | x = tf.placeholder(tf.float32, shape=(None, IMAGE_SIZE, IMAGE_SIZE, NUM_CHANNELS)) 73 | y_ = tf.placeholder(tf.float32, shape=(None, NUM_LABELS)) 74 | 75 | def model(data, train=False): 76 | """The Model definition.""" 77 | # 2D convolution, with 'SAME' padding (i.e. the output feature map has 78 | # the same size as the input). Note that {strides} is a 4D array whose 79 | # shape matches the data layout: [image index, y, x, depth]. 80 | conv = tf.nn.conv2d(data, conv1_weights, strides=[1, 1, 1, 1], padding='SAME') 81 | # Bias and rectified linear non-linearity. 82 | relu = tf.nn.relu(tf.nn.bias_add(conv, conv1_biases)) 83 | # Max pooling. The kernel size spec {ksize} also follows the layout of 84 | # the data. Here we have a pooling window of 2, and a stride of 2. 85 | pool = tf.nn.max_pool(relu, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') 86 | 87 | conv1 = tf.nn.conv2d(pool, conv2_weights, strides=[1, 1, 1, 1], padding='SAME') 88 | relu1 = tf.nn.relu(tf.nn.bias_add(conv1, conv2_biases)) 89 | pool1 = tf.nn.max_pool(relu1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') 90 | 91 | # Reshape the feature map into a 2D matrix to feed it to the fully connected layers. 
92 | pool_shape = pool1.get_shape().as_list() 93 | reshape = tf.reshape(pool1, [-1, pool_shape[1] * pool_shape[2] * pool_shape[3]]) 94 | 95 | # Fully connected layer. Note that the '+' operation automatically broadcasts the biases. 96 | hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases) 97 | # Add a 50% dropout during training only. Dropout also scales 98 | # activations such that no rescaling is needed at evaluation time. 99 | if train: hidden = tf.nn.dropout(hidden, 0.5) 100 | return tf.nn.softmax(tf.matmul(hidden, fc2_weights) + fc2_biases) 101 | 102 | train_y = model(x, True) 103 | loss = -tf.reduce_mean(y_ * tf.log(train_y)) 104 | regularizers = (tf.nn.l2_loss(fc1_weights) + tf.nn.l2_loss(fc1_biases) + 105 | tf.nn.l2_loss(fc2_weights) + tf.nn.l2_loss(fc2_biases)) 106 | loss += 5e-4 * regularizers 107 | 108 | # Decay once per epoch, using an exponential schedule starting at 0.01. 109 | learning_rate = tf.train.exponential_decay( 110 | 0.01, # Base learning rate. 111 | global_step * BATCH_SIZE, # Current index into the dataset. 112 | mnist.train.num_examples, # Decay step. 113 | 0.95, # Decay rate. 114 | staircase=True) 115 | 116 | # Use simple momentum for the optimization. 117 | optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(loss, global_step=global_step) 118 | 119 | # Training accuracy 120 | train_correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(train_y, 1)) 121 | train_accuracy = tf.reduce_mean(tf.cast(train_correct_prediction, tf.float32)) 122 | 123 | # Predictions for the test and validation, which we'll compute less often. 
124 | eval_y = model(x, False) 125 | eval_correct_prediction = tf.reduce_sum(tf.cast(tf.equal(tf.argmax(y_, 1), tf.argmax(eval_y, 1)), tf.float32)) 126 | 127 | reshaped_test_data = numpy.reshape(mnist.test.images, [-1, 28, 28, 1]) 128 | test_label = mnist.test.labels 129 | reshaped_validate_data = numpy.reshape(mnist.validation.images, [-1, 28, 28, 1]) 130 | validate_label = mnist.validation.labels 131 | 132 | sv = tf.train.Supervisor(is_chief=is_chief, 133 | logdir="/tmp/dist-mnist-log/train", 134 | saver=tf.train.Saver(), 135 | init_op=tf.initialize_all_variables(), 136 | recovery_wait_secs=1, 137 | global_step=global_step) 138 | sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True, 139 | device_filters=["/job:ps", "/job:worker/task:%d" % FLAGS.worker_index]) 140 | 141 | # The chief worker (worker_index==0) session will prepare the session, 142 | # while the remaining workers will wait for the preparation to complete. 143 | if is_chief: 144 | print("Worker %d: Initializing session..." % FLAGS.worker_index) 145 | else: 146 | print("Worker %d: Waiting for session to be initialized..." % FLAGS.worker_index) 147 | 148 | with sv.prepare_or_wait_for_session(FLAGS.worker_grpc_url, config=sess_config) as sess: 149 | print("Worker %d: Session initialization complete." 
% FLAGS.worker_index)
150 |
151 |     def get_eval(data_x, data_y):
152 |         total_len = len(data_x)
153 |         start = 0
154 |         end = EVAL_SIZE
155 |         total_correct = 0
156 |         while start < total_len:  # loop on start so the final slice is evaluated too
157 |             cur_correct, step = sess.run([eval_correct_prediction, global_step], feed_dict={x: data_x[start:end], y_:data_y[start:end]})
158 |             total_correct += cur_correct
159 |             start = end
160 |             end += EVAL_SIZE
161 |             if end > total_len: end = total_len
162 |
163 |         return float(total_correct) / float(total_len)
164 |
165 |     # Perform training
166 |     time_begin = time.time()
167 |     print("Training begins @ %f" % time_begin)
168 |
169 |     local_step = 0
170 |     while True:
171 |         # Training feed
172 |         batch_xs, batch_ys = mnist.train.next_batch(BATCH_SIZE)
173 |         reshaped_x = numpy.reshape(batch_xs, [BATCH_SIZE, 28, 28, 1])
174 |         train_feed = {x: reshaped_x, y_: batch_ys}
175 |
176 |         _, step = sess.run([optimizer, global_step], feed_dict=train_feed)
177 |         if local_step % 100 == 0:
178 |             validate_acc = get_eval(reshaped_validate_data, validate_label)
179 |             test_acc = get_eval(reshaped_test_data, test_label)
180 |             print("Worker %d: After %d training step(s) (global step: %d), validation accuracy = %g, test accuracy = %g" %
181 |                   (FLAGS.worker_index, local_step, step, validate_acc, test_acc))
182 |         if step >= TRAING_STEP: break
183 |         local_step += 1
184 |
185 |     time_end = time.time()
186 |     print("Training ends @ %f" % time_end)
187 |     training_time = time_end - time_begin
188 |     print("Training elapsed time: %f s" % training_time)
189 |
190 |     # Accuracy on test data
191 |     test_acc = get_eval(reshaped_test_data, test_label)
192 |     print("Final test accuracy = %g" % (test_acc))
193 |
194 |
--------------------------------------------------------------------------------
/distributed/mnist_dnn.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import time
3 |
4 | import tensorflow as tf
5 | from tensorflow.examples.tutorials.mnist import input_data
6 |
7 | 
flags = tf.app.flags 8 | flags.DEFINE_integer("worker_index", 0, 9 | "Worker task index, should be >= 0. worker_index=0 is " 10 | "the master worker task the performs the variable " 11 | "initialization ") 12 | 13 | flags.DEFINE_string("workers", None, 14 | "The worker url list, separated by comma (e.g. tf-worker1:2222,1.2.3.4:2222)") 15 | 16 | flags.DEFINE_string("parameter_servers", None, 17 | "The ps url list, separated by comma (e.g. tf-ps2:2222,1.2.3.5:2222)") 18 | 19 | flags.DEFINE_string("worker_grpc_url", None, 20 | "Worker GRPC URL (e.g., grpc://1.2.3.4:2222, or " 21 | "grpc://tf-worker0:2222)") 22 | 23 | flags.DEFINE_string("name_scope", None, "The variable name scope.") 24 | FLAGS = flags.FLAGS 25 | 26 | TRAING_STEP = 5000 27 | hidden_nodes = 500 28 | 29 | def nn_layer(input_tensor, input_dim, output_dim, act=tf.nn.relu): 30 | with tf.name_scope(FLAGS.name_scope): 31 | weights = tf.Variable(tf.truncated_normal([input_dim, output_dim], stddev=0.1, seed = 2)) 32 | biases = tf.Variable(tf.constant(0.1, shape=[output_dim])) 33 | activations = act(tf.matmul(input_tensor, weights) + biases) 34 | return activations 35 | 36 | def model(x, y_, global_step): 37 | hidden1 = nn_layer(x, 784, hidden_nodes) 38 | y = nn_layer(hidden1, hidden_nodes, 10, act=tf.nn.softmax) 39 | 40 | cross_entropy = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0))) 41 | train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy, global_step=global_step) 42 | 43 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 44 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 45 | 46 | return train_step, accuracy 47 | 48 | print("Loading data from worker index = %d" % FLAGS.worker_index) 49 | mnist = input_data.read_data_sets("/tmp/data", one_hot=True) 50 | print("Testing set size: %d" % len(mnist.test.images)) 51 | print("Training set size: %d" % len(mnist.train.images)) 52 | 53 | print("Worker GRPC URL: %s" % FLAGS.worker_grpc_url) 54 | 
print("Workers = %s" % FLAGS.workers) 55 | print("Using name scope %s" % FLAGS.name_scope) 56 | 57 | is_chief = (FLAGS.worker_index == 0) 58 | cluster = tf.train.ClusterSpec({"ps": FLAGS.parameter_servers.split(","), "worker": FLAGS.workers.split(",")}) 59 | # Construct device setter object 60 | device_setter = tf.train.replica_device_setter(cluster=cluster) 61 | 62 | # The device setter will automatically place Variables ops on separate 63 | # parameter servers (ps). The non-Variable ops will be placed on the workers. 64 | with tf.device(device_setter): 65 | with tf.name_scope(FLAGS.name_scope): 66 | global_step = tf.Variable(0, trainable=False) 67 | 68 | x = tf.placeholder(tf.float32, [None, 784]) 69 | y_ = tf.placeholder(tf.float32, [None, 10]) 70 | val_feed = {x: mnist.test.images, y_: mnist.test.labels} 71 | 72 | train_step, accuracy = model(x, y_, global_step) 73 | 74 | sv = tf.train.Supervisor(is_chief=is_chief, 75 | logdir="/tmp/dist-mnist-log/train", 76 | saver=tf.train.Saver(), 77 | init_op=tf.initialize_all_variables(), 78 | recovery_wait_secs=1, 79 | global_step=global_step) 80 | sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True, 81 | device_filters=["/job:ps", "/job:worker/task:%d" % FLAGS.worker_index]) 82 | 83 | # The chief worker (worker_index==0) session will prepare the session, 84 | # while the remaining workers will wait for the preparation to complete. 85 | if is_chief: 86 | print("Worker %d: Initializing session..." % FLAGS.worker_index) 87 | else: 88 | print("Worker %d: Waiting for session to be initialized..." % FLAGS.worker_index) 89 | 90 | with sv.prepare_or_wait_for_session(FLAGS.worker_grpc_url, config=sess_config) as sess: 91 | print("Worker %d: Session initialization complete." 
% FLAGS.worker_index) 92 | 93 | # Perform training 94 | time_begin = time.time() 95 | print("Training begins @ %f" % time_begin) 96 | 97 | local_step = 0 98 | while True: 99 | # Training feed 100 | batch_xs, batch_ys = mnist.train.next_batch(100) 101 | train_feed = {x: batch_xs, y_: batch_ys} 102 | 103 | _, step = sess.run([train_step, global_step], feed_dict=train_feed) 104 | if local_step % 100 == 0: 105 | print("Worker %d: training step %d done (global step: %d); Accuracy: %g" % 106 | (FLAGS.worker_index, local_step, step, sess.run(accuracy, feed_dict=val_feed))) 107 | if step >= TRAING_STEP: break 108 | local_step += 1 109 | 110 | time_end = time.time() 111 | print("Training ends @ %f" % time_end) 112 | training_time = time_end - time_begin 113 | print("Training elapsed time: %f s" % training_time) 114 | 115 | # Accuracy on test data 116 | print("Final test accuracy = %g" % (sess.run(accuracy, feed_dict=val_feed))) 117 | 118 | -------------------------------------------------------------------------------- /distributed/mnist_dnn_sync_update.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import time 3 | 4 | import tensorflow as tf 5 | from tensorflow.examples.tutorials.mnist import input_data 6 | 7 | flags = tf.app.flags 8 | flags.DEFINE_integer("worker_index", 0, 9 | "Worker task index, should be >= 0. worker_index=0 is " 10 | "the master worker task the performs the variable " 11 | "initialization ") 12 | 13 | flags.DEFINE_string("workers", None, 14 | "The worker url list, separated by comma (e.g. tf-worker1:2222,1.2.3.4:2222)") 15 | 16 | flags.DEFINE_string("parameter_servers", None, 17 | "The ps url list, separated by comma (e.g. 
tf-ps2:2222,1.2.3.5:2222)") 18 | 19 | flags.DEFINE_string("worker_grpc_url", None, 20 | "Worker GRPC URL (e.g., grpc://1.2.3.4:2222, or " 21 | "grpc://tf-worker0:2222)") 22 | 23 | flags.DEFINE_string("name_scope", None, "The variable name scope.") 24 | FLAGS = flags.FLAGS 25 | 26 | TRAING_STEP = 5000 27 | hidden_nodes = 500 28 | 29 | def nn_layer(input_tensor, input_dim, output_dim, act=tf.nn.relu): 30 | with tf.name_scope(FLAGS.name_scope): 31 | weights = tf.Variable(tf.truncated_normal([input_dim, output_dim], stddev=0.1, seed = 2)) 32 | biases = tf.Variable(tf.constant(0.1, shape=[output_dim])) 33 | activations = act(tf.matmul(input_tensor, weights) + biases) 34 | return activations 35 | 36 | def model(x, y_, global_step): 37 | hidden1 = nn_layer(x, 784, hidden_nodes) 38 | y = nn_layer(hidden1, hidden_nodes, 10, act=tf.nn.softmax) 39 | 40 | cross_entropy = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0))) 41 | num_workers = len(FLAGS.workers.split(",")) 42 | opt = tf.train.SyncReplicasOptimizer( 43 | tf.train.AdamOptimizer(0.01), 44 | replicas_to_aggregate=num_workers, 45 | total_num_replicas=num_workers, 46 | replica_id=FLAGS.worker_index, 47 | name="mnist_sync_replicas") 48 | train_step = opt.minimize(cross_entropy, global_step=global_step) 49 | 50 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 51 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 52 | 53 | return opt, train_step, accuracy 54 | 55 | print("Loading data from worker index = %d" % FLAGS.worker_index) 56 | mnist = input_data.read_data_sets("/tmp/data", one_hot=True) 57 | print("Testing set size: %d" % len(mnist.test.images)) 58 | print("Training set size: %d" % len(mnist.train.images)) 59 | 60 | print("Worker GRPC URL: %s" % FLAGS.worker_grpc_url) 61 | print("Workers = %s" % FLAGS.workers) 62 | print("Using name scope %s" % FLAGS.name_scope) 63 | 64 | is_chief = (FLAGS.worker_index == 0) 65 | cluster = tf.train.ClusterSpec({"ps": 
FLAGS.parameter_servers.split(","), "worker": FLAGS.workers.split(",")}) 66 | # Construct device setter object 67 | device_setter = tf.train.replica_device_setter(cluster=cluster) 68 | 69 | # The device setter will automatically place Variables ops on separate 70 | # parameter servers (ps). The non-Variable ops will be placed on the workers. 71 | with tf.device(device_setter): 72 | with tf.name_scope(FLAGS.name_scope): 73 | global_step = tf.Variable(0, trainable=False) 74 | 75 | x = tf.placeholder(tf.float32, [None, 784]) 76 | y_ = tf.placeholder(tf.float32, [None, 10]) 77 | val_feed = {x: mnist.test.images, y_: mnist.test.labels} 78 | 79 | opt, train_step, accuracy = model(x, y_, global_step) 80 | 81 | chief_queue_runner = opt.get_chief_queue_runner() 82 | init_tokens_op = opt.get_init_tokens_op() 83 | 84 | sv = tf.train.Supervisor(is_chief=is_chief, 85 | logdir="/tmp/dist-mnist-log/train", 86 | saver=tf.train.Saver(), 87 | init_op=tf.initialize_all_variables(), 88 | recovery_wait_secs=1, 89 | global_step=global_step) 90 | sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True, 91 | device_filters=["/job:ps", "/job:worker/task:%d" % FLAGS.worker_index]) 92 | 93 | # The chief worker (worker_index==0) session will prepare the session, 94 | # while the remaining workers will wait for the preparation to complete. 95 | if is_chief: 96 | print("Worker %d: Initializing session..." % FLAGS.worker_index) 97 | else: 98 | print("Worker %d: Waiting for session to be initialized..." % FLAGS.worker_index) 99 | 100 | with sv.prepare_or_wait_for_session(FLAGS.worker_grpc_url, config=sess_config) as sess: 101 | print("Worker %d: Session initialization complete." 
% FLAGS.worker_index) 102 | print("Starting chief queue runner and running init_tokens_op") 103 | sv.start_queue_runners(sess, [chief_queue_runner]) 104 | sess.run(init_tokens_op) 105 | 106 | # Perform training 107 | time_begin = time.time() 108 | print("Training begins @ %f" % time_begin) 109 | 110 | local_step = 0 111 | while True: 112 | # Training feed 113 | batch_xs, batch_ys = mnist.train.next_batch(100) 114 | train_feed = {x: batch_xs, y_: batch_ys} 115 | 116 | _, step = sess.run([train_step, global_step], feed_dict=train_feed) 117 | if local_step % 100 == 0: 118 | print("Worker %d: training step %d done (global step: %d); Accuracy: %g" % 119 | (FLAGS.worker_index, local_step, step, sess.run(accuracy, feed_dict=val_feed))) 120 | if step >= TRAING_STEP: break 121 | local_step += 1 122 | 123 | time_end = time.time() 124 | print("Training ends @ %f" % time_end) 125 | training_time = time_end - time_begin 126 | print("Training elapsed time: %f s" % training_time) 127 | 128 | # Accuracy on test data 129 | print("Final test accuracy = %g" % (sess.run(accuracy, feed_dict=val_feed))) 130 | 131 | -------------------------------------------------------------------------------- /distributed/start_servers.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import time 3 | c = tf.constant("Hello, distributed TensorFlow!") 4 | server = tf.train.Server.create_local_server() 5 | sess = tf.Session(server.target) # Create a session on the server. 
6 | print server.target 7 | print sess.run(c) 8 | 9 | time.sleep(10) 10 | 11 | -------------------------------------------------------------------------------- /distributed/start_tf.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | NUM_WORKER=${1} 4 | NUM_PS=${2} 5 | CODE=${3} 6 | 7 | WORKER_GRPC_URLS="grpc://tf-worker0:2222" 8 | WORKER_URLS="tf-worker0:2222" 9 | IDX=1 10 | while true; do 11 | if [[ "${IDX}" == "${NUM_WORKER}" ]]; then 12 | break 13 | fi 14 | 15 | WORKER_GRPC_URLS="${WORKER_GRPC_URLS} grpc://tf-worker${IDX}:2222" 16 | WORKER_URLS="${WORKER_URLS},tf-worker${IDX}:2222" 17 | ((IDX++)) 18 | done 19 | 20 | PS_URLS="tf-ps0:2222" 21 | IDX=1 22 | while true; do 23 | if [[ "${IDX}" == "${NUM_PS}" ]]; then 24 | break 25 | fi 26 | 27 | PS_URLS="${PS_URLS},tf-ps${IDX}:2222" 28 | ((IDX++)) 29 | done 30 | 31 | echo "WORKERS = ${WORKER_URLS}" 32 | echo "PARAMETER_SERVERS = ${PS_URLS}" 33 | echo "Running ${CODE}" 34 | WKR_LOG_PREFIX="/tmp/worker" 35 | URLS=($WORKER_GRPC_URLS) 36 | CUR_BATCH=$(date "+%H%M%S%d%m%y") 37 | 38 | IDX=0 39 | ((NUM_WORKER--)) 40 | while true; do 41 | if [[ "${IDX}" == "${NUM_WORKER}" ]]; then 42 | break 43 | fi 44 | 45 | WORKER_GRPC_URL="${URLS[IDX]}" 46 | python "${CODE}" \ 47 | --worker_grpc_url="${WORKER_GRPC_URL}" \ 48 | --worker_index=${IDX} \ 49 | --workers=${WORKER_URLS} \ 50 | --name_scope=${CUR_BATCH} \ 51 | --parameter_servers=${PS_URLS} > "${WKR_LOG_PREFIX}${IDX}.log" & 52 | echo "Worker ${IDX}: " 53 | echo " GRPC URL: ${WORKER_GRPC_URL}" 54 | echo " log file: ${WKR_LOG_PREFIX}${IDX}.log" 55 | 56 | ((IDX++)) 57 | done 58 | 59 | WORKER_GRPC_URL="${URLS[IDX]}" 60 | python "${CODE}" \ 61 | --worker_grpc_url="${WORKER_GRPC_URL}" \ 62 | --worker_index=${IDX} \ 63 | --workers=${WORKER_URLS} \ 64 | --name_scope=${CUR_BATCH} \ 65 | --parameter_servers=${PS_URLS} 66 | 67 | echo "Done!" 
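Note on start_tf.sh: the script derives every address from a fixed naming scheme (tf-worker<N> and tf-ps<N> on port 2222) and passes the comma-separated lists to the training script, which splits them again for tf.train.ClusterSpec. A minimal Python sketch of the same string construction follows. The helper name build_cluster_flags and its return layout are hypothetical and not part of this repo; it only assumes the host names and port that the script hard-codes.

```python
# Hypothetical helper (not part of this repo): mirrors the host naming that
# start_tf.sh hard-codes (tf-worker<N>, tf-ps<N>, port 2222 by default).
def build_cluster_flags(num_workers, num_ps, port=2222):
    """Return the URL strings that start_tf.sh passes to each worker script."""
    if num_workers < 1 or num_ps < 1:
        raise ValueError("need at least one worker and one parameter server")
    workers = ["tf-worker%d:%d" % (i, port) for i in range(num_workers)]
    ps = ["tf-ps%d:%d" % (i, port) for i in range(num_ps)]
    return {
        # Value for --workers (comma-separated, split again in the scripts).
        "workers": ",".join(workers),
        # Value for --parameter_servers.
        "parameter_servers": ",".join(ps),
        # One --worker_grpc_url per worker, indexed by --worker_index.
        "worker_grpc_urls": ["grpc://" + w for w in workers],
    }


if __name__ == "__main__":
    cluster_flags = build_cluster_flags(2, 1)
    print(cluster_flags["workers"])
    print(cluster_flags["parameter_servers"])
    print(cluster_flags["worker_grpc_urls"][0])
```

Unlike the shell loops, this sketch rejects a worker or parameter-server count below one instead of looping forever on it.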
68 | -------------------------------------------------------------------------------- /distributed/word2vector.py: -------------------------------------------------------------------------------- 1 | import collections 2 | import math 3 | import random 4 | import time 5 | import zipfile 6 | 7 | import numpy as np 8 | import tensorflow as tf 9 | from tensorflow.examples.tutorials.mnist import input_data 10 | 11 | flags = tf.app.flags 12 | flags.DEFINE_integer("worker_index", 0, 13 | "Worker task index, should be >= 0. worker_index=0 is " 14 | "the master worker task the performs the variable " 15 | "initialization ") 16 | 17 | flags.DEFINE_string("workers", None, 18 | "The worker url list, separated by comma (e.g. tf-worker1:2222,1.2.3.4:2222)") 19 | 20 | flags.DEFINE_string("parameter_servers", None, 21 | "The ps url list, separated by comma (e.g. tf-ps2:2222,1.2.3.5:2222)") 22 | 23 | flags.DEFINE_string("worker_grpc_url", None, 24 | "Worker GRPC URL (e.g., grpc://1.2.3.4:2222, or " 25 | "grpc://tf-worker0:2222)") 26 | FLAGS = flags.FLAGS 27 | 28 | vocabulary_size = 50000 29 | batch_size = 128 30 | valid_size = 8 # Random set of words to evaluate similarity on. 31 | valid_window = 100 # Only pick dev samples in the head of the distribution. 32 | embedding_size = 128 # Dimension of the embedding vector. 33 | num_sampled = 64 # Number of negative examples to sample. 34 | train_step = 500000 35 | skip_window = 1 # How many words to consider left and right. 36 | num_skips = 2 # How many times to reuse an input to generate a label. 37 | 38 | 39 | # Read data and split words. 40 | def read_data(filename): 41 | """Extract the first file enclosed in a zip file as a list of words""" 42 | with zipfile.ZipFile(filename) as f: 43 | data = f.read(f.namelist()[0]).split() 44 | return data 45 | 46 | # Build dictionary. 
47 | def build_dataset(words): 48 | count = [['UNK', -1]] 49 | count.extend(collections.Counter(words).most_common(vocabulary_size - 1)) 50 | dictionary = dict() 51 | for word, _ in count: 52 | dictionary[word] = len(dictionary) 53 | data = list() 54 | unk_count = 0 55 | for word in words: 56 | if word in dictionary: 57 | index = dictionary[word] 58 | else: 59 | index = 0 # dictionary['UNK'] 60 | unk_count += 1 61 | data.append(index) 62 | count[0][1] = unk_count 63 | reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 64 | return data, count, dictionary, reverse_dictionary 65 | 66 | data_index = 0 67 | # function to generate training batch 68 | def generate_batch(batch_size, num_skips, skip_window): 69 | global data_index 70 | assert batch_size % num_skips == 0 71 | assert num_skips <= 2 * skip_window 72 | batch = np.ndarray(shape=(batch_size), dtype=np.int32) 73 | labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) 74 | span = 2 * skip_window + 1 # [ skip_window target skip_window ] 75 | buffer = collections.deque(maxlen=span) 76 | for _ in range(span): 77 | buffer.append(data[data_index]) 78 | data_index = (data_index + 1) % len(data) 79 | for i in range(batch_size // num_skips): 80 | target = skip_window # target label at the center of the buffer 81 | targets_to_avoid = [ skip_window ] 82 | for j in range(num_skips): 83 | while target in targets_to_avoid: 84 | target = random.randint(0, span - 1) 85 | targets_to_avoid.append(target) 86 | batch[i * num_skips + j] = buffer[skip_window] 87 | labels[i * num_skips + j, 0] = buffer[target] 88 | buffer.append(data[data_index]) 89 | data_index = (data_index + 1) % len(data) 90 | return batch, labels 91 | 92 | print("Loading data from worker index = %d" % FLAGS.worker_index) 93 | words = read_data("/tmp/data/text8.zip") 94 | print("Data size: %d" % len(words)) 95 | data, count, dictionary, reverse_dictionary = build_dataset(words) 96 | 97 | print("Worker GRPC URL: %s" % FLAGS.worker_grpc_url) 
98 | print("Workers = %s" % FLAGS.workers) 99 | is_chief = (FLAGS.worker_index == 0) 100 | if is_chief: tf.reset_default_graph() 101 | 102 | cluster = tf.train.ClusterSpec({"ps": FLAGS.parameter_servers.split(","), "worker": FLAGS.workers.split(",")}) 103 | # Construct device setter object 104 | device_setter = tf.train.replica_device_setter(cluster=cluster) 105 | 106 | # The device setter will automatically place Variables ops on separate 107 | # parameter servers (ps). The non-Variable ops will be placed on the workers. 108 | with tf.device(device_setter): 109 | global_step = tf.Variable(0, trainable=False) 110 | 111 | # Training input data. 112 | train_inputs = tf.placeholder(tf.int32, shape=[batch_size]) 113 | train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1]) 114 | 115 | # Validate input data. 116 | valid_examples = np.random.choice(valid_window, valid_size, replace=False) 117 | valid_dataset = tf.constant(valid_examples, dtype=tf.int32) 118 | 119 | # Look up embeddings for inputs. 120 | embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)) 121 | embed = tf.nn.embedding_lookup(embeddings, train_inputs) 122 | 123 | # Construct the variables for the NCE loss 124 | nce_weights = tf.Variable( 125 | tf.truncated_normal([vocabulary_size, embedding_size], 126 | stddev=1.0 / math.sqrt(embedding_size))) 127 | nce_biases = tf.Variable(tf.zeros([vocabulary_size])) 128 | 129 | # Compute the average NCE loss for the batch. 130 | # tf.nce_loss automatically draws a new sample of the negative labels each 131 | # time we evaluate the loss. 132 | loss = tf.reduce_mean(tf.nn.nce_loss( 133 | nce_weights, nce_biases, embed, train_labels, num_sampled, vocabulary_size)) 134 | 135 | # Construct the SGD optimizer using a learning rate of 1.0. 136 | optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss, global_step=global_step) 137 | 138 | # Compute the cosine similarity between minibatch examples and all embeddings. 
139 | norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True)) 140 | normalized_embeddings = embeddings / norm 141 | valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset) 142 | similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True) 143 | 144 | sv = tf.train.Supervisor(is_chief=is_chief, 145 | logdir="/tmp/dist-w2v", 146 | saver=tf.train.Saver(), 147 | init_op=tf.initialize_all_variables(), 148 | recovery_wait_secs=1, 149 | global_step=global_step) 150 | sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True, 151 | device_filters=["/job:ps", "/job:worker/task:%d" % FLAGS.worker_index]) 152 | 153 | # The chief worker (worker_index==0) session will prepare the session, 154 | # while the remaining workers will wait for the preparation to complete. 155 | if is_chief: 156 | print("Worker %d: Initializing session..." % FLAGS.worker_index) 157 | else: 158 | print("Worker %d: Waiting for session to be initialized..." % FLAGS.worker_index) 159 | 160 | with sv.prepare_or_wait_for_session(FLAGS.worker_grpc_url, config=sess_config) as sess: 161 | print("Worker %d: Session initialization complete." 
% FLAGS.worker_index) 162 | 163 | # Output evaluation result 164 | def print_eval_result(): 165 | sim = similarity.eval() 166 | for i in xrange(valid_size): 167 | valid_word = reverse_dictionary[valid_examples[i]] 168 | top_k = 8 # number of nearest neighbors 169 | nearest = (-sim[i, :]).argsort()[1:top_k + 1] 170 | log_str = "Nearest to %s:" % valid_word 171 | for k in xrange(top_k): 172 | close_word = reverse_dictionary[nearest[k]] 173 | log_str = "%s %s," % (log_str, close_word) 174 | print(log_str) 175 | 176 | # Perform training 177 | time_begin = time.time() 178 | average_loss = 0 179 | local_step = 0 180 | print("Training begins @ %f" % time_begin) 181 | while True: 182 | # Training feed 183 | batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window) 184 | feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels} 185 | 186 | _, loss_val, step = sess.run([optimizer, loss, global_step], feed_dict=feed_dict) 187 | average_loss += loss_val 188 | if local_step % 5000 == 0: 189 | if local_step > 0: average_loss /= 5000 190 | print("Worker %d: Finished %d training steps (global step: %d): Average Loss: %g" % (FLAGS.worker_index, local_step, step, average_loss)) 191 | average_loss = 0 192 | 193 | if local_step % 20000 == 0: 194 | print("Validate evaluation result at step ", local_step) 195 | print_eval_result() 196 | 197 | if step >= train_step: break 198 | local_step += 1 199 | 200 | time_end = time.time() 201 | print("Training ends @ %f" % time_end) 202 | training_time = time_end - time_begin 203 | print("Training elapsed time: %f s" % training_time) 204 | 205 | # Final word vectors. 206 | print("Final answer:") 207 | print_eval_result() 208 | -------------------------------------------------------------------------------- /file_server.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python 3 | 4 | """Simple HTTP Server With Upload. 
5 | This module builds on BaseHTTPServer by implementing the standard GET 6 | and HEAD requests in a fairly straightforward manner. 7 | """ 8 | 9 | 10 | __version__ = "0.1" 11 | __all__ = ["SimpleHTTPRequestHandler"] 12 | __author__ = "bones7456" 13 | __home_page__ = "http://li2z.cn/" 14 | 15 | import os 16 | import posixpath 17 | import BaseHTTPServer 18 | import urllib 19 | import cgi 20 | import shutil 21 | import mimetypes 22 | import re 23 | try: 24 | from cStringIO import StringIO 25 | except ImportError: 26 | from StringIO import StringIO 27 | 28 | 29 | class SimpleHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler): 30 | 31 | """Simple HTTP request handler with GET/HEAD/POST commands. 32 | This serves files from the current directory and any of its 33 | subdirectories. The MIME type for files is determined by 34 | calling the .guess_type() method. It can also receive files uploaded 35 | by clients. 36 | The GET and HEAD requests are identical except that the HEAD 37 | request omits the actual contents of the file. 38 | """ 39 | 40 | server_version = "SimpleHTTPWithUpload/" + __version__ 41 | 42 | def do_GET(self): 43 | """Serve a GET request.""" 44 | f = self.send_head() 45 | if f: 46 | self.copyfile(f, self.wfile) 47 | f.close() 48 | 49 | def do_HEAD(self): 50 | """Serve a HEAD request.""" 51 | f = self.send_head() 52 | if f: 53 | f.close() 54 | 55 | def do_POST(self): 56 | """Serve a POST request.""" 57 | r, info = self.deal_post_data() 58 | print r, info, "by: ", self.client_address 59 | f = StringIO() 60 | f.write('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">') 61 | f.write("<html>\n<title>Upload Result Page</title>\n") 62 | f.write("<body>\n<h2>Upload Result Page</h2>\n") 63 | f.write("<hr>\n") 64 | if r: 65 | f.write("Success:") 66 | else: 67 | f.write("Failed:") 68 | f.write(info) 69 | f.write("<br><a href=\"%s\">back</a>" % self.headers['referer']) 70 | f.write("<hr>Powered by: bones7456, check new version at ") 71 | f.write("<a href=\"%s\">" % __home_page__) 72 | f.write("here</a>.</body>\n</html>\n") 73 | length = f.tell() 74 | f.seek(0) 75 | self.send_response(200) 76 | self.send_header("Content-type", "text/html") 77 | self.send_header("Content-Length", str(length)) 78 | self.end_headers() 79 | if f: 80 | self.copyfile(f, self.wfile) 81 | f.close() 82 | 83 | def deal_post_data(self): 84 | boundary = self.headers.plisttext.split("=")[1] 85 | remainbytes = int(self.headers['content-length']) 86 | line = self.rfile.readline() 87 | remainbytes -= len(line) 88 | if not boundary in line: 89 | return (False, "Content does not begin with boundary") 90 | line = self.rfile.readline() 91 | remainbytes -= len(line) 92 | fn = re.findall(r'Content-Disposition.*name="file"; filename="(.*)"', line) 93 | if not fn: 94 | return (False, "Can't find out file name...") 95 | path = self.translate_path(self.path) 96 | fn = os.path.join(path, fn[0]) 97 | line = self.rfile.readline() 98 | remainbytes -= len(line) 99 | line = self.rfile.readline() 100 | remainbytes -= len(line) 101 | try: 102 | out = open(fn, 'wb') 103 | except IOError: 104 | return (False, "Can't create file to write, do you have permission to write?") 105 | 106 | preline = self.rfile.readline() 107 | remainbytes -= len(preline) 108 | while remainbytes > 0: 109 | line = self.rfile.readline() 110 | remainbytes -= len(line) 111 | if boundary in line: 112 | preline = preline[0:-1] 113 | if preline.endswith('\r'): 114 | preline = preline[0:-1] 115 | out.write(preline) 116 | out.close() 117 | return (True, "File '%s' upload success!" % fn) 118 | else: 119 | out.write(preline) 120 | preline = line 121 | return (False, "Unexpected end of data.") 122 | 123 | def send_head(self): 124 | """Common code for GET and HEAD commands. 125 | This sends the response code and MIME headers.
126 | Return value is either a file object (which has to be copied 127 | to the outputfile by the caller unless the command was HEAD, 128 | and must be closed by the caller under all circumstances), or 129 | None, in which case the caller has nothing further to do. 130 | """ 131 | path = self.translate_path(self.path) 132 | f = None 133 | if os.path.isdir(path): 134 | if not self.path.endswith('/'): 135 | # redirect browser - doing basically what apache does 136 | self.send_response(301) 137 | self.send_header("Location", self.path + "/") 138 | self.end_headers() 139 | return None 140 | for index in "index.html", "index.htm": 141 | index = os.path.join(path, index) 142 | if os.path.exists(index): 143 | path = index 144 | break 145 | else: 146 | return self.list_directory(path) 147 | ctype = self.guess_type(path) 148 | try: 149 | # Always read in binary mode. Opening files in text mode may cause 150 | # newline translations, making the actual size of the content 151 | # transmitted *less* than the content-length! 152 | f = open(path, 'rb') 153 | except IOError: 154 | self.send_error(404, "File not found") 155 | return None 156 | self.send_response(200) 157 | self.send_header("Content-type", ctype) 158 | fs = os.fstat(f.fileno()) 159 | self.send_header("Content-Length", str(fs[6])) 160 | self.send_header("Last-Modified", self.date_time_string(fs.st_mtime)) 161 | self.end_headers() 162 | return f 163 | 164 | def list_directory(self, path): 165 | """Helper to produce a directory listing (absent index.html). 166 | Return value is either a file object, or None (indicating an 167 | error). In either case, the headers are sent, making the 168 | interface the same as for send_head(). 
169 | """ 170 | try: 171 | list = os.listdir(path) 172 | except os.error: 173 | self.send_error(404, "No permission to list directory") 174 | return None 175 | print list 176 | list.sort(key=lambda a: a.lower()) 177 | f = StringIO() 178 | displaypath = cgi.escape(urllib.unquote(self.path)) 179 | f.write('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">') 180 | f.write("<html>\n<title>Directory listing for %s</title>\n" % displaypath) 181 | f.write("<body>\n<h2>Directory listing for %s</h2>\n" % displaypath) 182 | f.write("<hr>\n") 183 | f.write("<form ENCTYPE=\"multipart/form-data\" method=\"post\">") 184 | f.write("<input name=\"file\" type=\"file\"/>") 185 | f.write("<input type=\"submit\" value=\"upload\"/></form>\n") 186 | f.write("<hr>\n<ul>\n") 187 | for name in list: 188 | fullname = os.path.join(path, name) 189 | displayname = linkname = name 190 | # Append / for directories or @ for symbolic links 191 | if os.path.isdir(fullname): 192 | displayname = name + "/" 193 | linkname = name + "/" 194 | if os.path.islink(fullname): 195 | displayname = name + "@" 196 | # Note: a link to a directory displays with @ and links with / 197 | f.write('<li><a href="%s">%s</a>\n' 198 | % (urllib.quote(linkname), cgi.escape(displayname))) 199 | f.write("</ul>\n<hr>\n</body>\n</html>\n") 200 | length = f.tell() 201 | f.seek(0) 202 | self.send_response(200) 203 | self.send_header("Content-type", "text/html") 204 | self.send_header("Content-Length", str(length)) 205 | self.end_headers() 206 | return f 207 | 208 | def translate_path(self, path): 209 | """Translate a /-separated PATH to the local filename syntax. 210 | Components that mean special things to the local file system 211 | (e.g. drive or directory names) are ignored. (XXX They should 212 | probably be diagnosed.) 213 | """ 214 | # abandon query parameters 215 | path = path.split('?',1)[0] 216 | path = path.split('#',1)[0] 217 | path = posixpath.normpath(urllib.unquote(path)) 218 | words = path.split('/') 219 | words = filter(None, words) 220 | path = os.getcwd() 221 | for word in words: 222 | drive, word = os.path.splitdrive(word) 223 | head, word = os.path.split(word) 224 | if word in (os.curdir, os.pardir): continue 225 | path = os.path.join(path, word) 226 | return path 227 | 228 | def copyfile(self, source, outputfile): 229 | """Copy all data between two file objects. 230 | The SOURCE argument is a file object open for reading 231 | (or anything with a read() method) and the DESTINATION 232 | argument is a file object open for writing (or 233 | anything with a write() method). 234 | The only reason for overriding this would be to change 235 | the block size or perhaps to replace newlines by CRLF 236 | -- note however that the default server uses this 237 | to copy binary data as well. 238 | """ 239 | shutil.copyfileobj(source, outputfile) 240 | 241 | def guess_type(self, path): 242 | """Guess the type of a file. 243 | Argument is a PATH (a filename). 244 | Return value is a string of the form type/subtype, 245 | usable for a MIME Content-type header.
246 | The default implementation looks the file's extension 247 | up in the table self.extensions_map, using application/octet-stream 248 | as a default; however it would be permissible (if 249 | slow) to look inside the data to make a better guess. 250 | """ 251 | 252 | base, ext = posixpath.splitext(path) 253 | if ext in self.extensions_map: 254 | return self.extensions_map[ext] 255 | ext = ext.lower() 256 | if ext in self.extensions_map: 257 | return self.extensions_map[ext] 258 | else: 259 | return self.extensions_map[''] 260 | 261 | if not mimetypes.inited: 262 | mimetypes.init() # try to read system mime.types 263 | extensions_map = mimetypes.types_map.copy() 264 | extensions_map.update({ 265 | '': 'application/octet-stream', # Default 266 | '.py': 'text/plain', 267 | '.c': 'text/plain', 268 | '.h': 'text/plain', 269 | }) 270 | 271 | 272 | def test(HandlerClass = SimpleHTTPRequestHandler, 273 | ServerClass = BaseHTTPServer.HTTPServer): 274 | BaseHTTPServer.test(HandlerClass, ServerClass) 275 | 276 | if __name__ == '__main__': 277 | test() 278 | -------------------------------------------------------------------------------- /imagenet_serving/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ubuntu:14.04 2 | 3 | MAINTAINER Jeremiah Harmsen 4 | 5 | RUN apt-get update && apt-get install -y \ 6 | build-essential \ 7 | curl \ 8 | git \ 9 | libfreetype6-dev \ 10 | libpng12-dev \ 11 | libzmq3-dev \ 12 | pkg-config \ 13 | python-dev \ 14 | python-numpy \ 15 | python-pip \ 16 | software-properties-common \ 17 | swig \ 18 | zip \ 19 | zlib1g-dev \ 20 | && \ 21 | apt-get clean && \ 22 | rm -rf /var/lib/apt/lists/* 23 | 24 | RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \ 25 | python get-pip.py && \ 26 | rm get-pip.py 27 | 28 | # Set up grpc 29 | 30 | RUN pip install enum34 futures six && \ 31 | pip install --pre "protobuf>=3.0.0a3" && \ 32 | pip install -i https://testpypi.python.org/simple --pre grpcio 33 |
34 | # Set up Bazel. 35 | 36 | # We need to add a custom PPA to pick up JDK8, since trusty doesn't 37 | # have an openjdk8 backport. openjdk-r is maintained by a reliable contributor: 38 | # Matthias Klose (https://launchpad.net/~doko). It will do until 39 | # we either update the base image beyond 14.04 or openjdk-8 is 40 | # finally backported to trusty; see e.g. 41 | # https://bugs.launchpad.net/trusty-backports/+bug/1368094 42 | RUN add-apt-repository -y ppa:openjdk-r/ppa && \ 43 | apt-get update && \ 44 | apt-get install -y openjdk-8-jdk openjdk-8-jre-headless && \ 45 | apt-get clean && \ 46 | rm -rf /var/lib/apt/lists/* 47 | 48 | # Running bazel inside a `docker build` command causes trouble, cf: 49 | # https://github.com/bazelbuild/bazel/issues/134 50 | # The easiest solution is to set up a bazelrc file forcing --batch. 51 | RUN echo "startup --batch" >>/root/.bazelrc 52 | # Similarly, we need to workaround sandboxing issues: 53 | # https://github.com/bazelbuild/bazel/issues/418 54 | RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \ 55 | >>/root/.bazelrc 56 | ENV BAZELRC /root/.bazelrc 57 | # Install the most recent bazel release. 
58 | ENV BAZEL_VERSION 0.2.0 59 | WORKDIR / 60 | RUN mkdir /bazel && \ 61 | cd /bazel && \ 62 | curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \ 63 | curl -fSsL -o /bazel/LICENSE.txt https://raw.githubusercontent.com/bazelbuild/bazel/master/LICENSE.txt && \ 64 | chmod +x bazel-*.sh && \ 65 | ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \ 66 | cd / && \ 67 | rm -f /bazel/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh 68 | 69 | CMD ["/bin/bash"] 70 | -------------------------------------------------------------------------------- /imagenet_serving/README.md: -------------------------------------------------------------------------------- 1 | ## [ImageNet](http://www.image-net.org/) online classification server 2 | Here a pre-trained model is used to classify new images. Caicloud provides a prebuilt image, ```index.caicloud.io/caicloud/inception_serving```. For details on how this image is built, see [here](https://tensorflow.github.io/serving/serving_inception). 3 | 4 | Once the image is ready, use serving.json to start the service in Kubernetes: 5 | ``` 6 | kubectl create -f serving.json 7 | ``` 8 | 9 | After the service has been created in Kubernetes, run the following command to get the service port: 10 | ``` 11 | kubectl describe svc inception-service 12 | ``` 13 | 14 | The output looks similar to this: 15 | ``` 16 | Name: inception-service 17 | Namespace: default 18 | Labels: <none> 19 | Selector: worker=inception-pod 20 | Type: NodePort 21 | IP: 10.254.121.195 22 | Port: 9000/TCP 23 | NodePort: 32668/TCP 24 | ``` 25 | ```NodePort``` is the port number we need. Besides the port, we also need an IP address, which can be found with the commands below. First list all nodes: 26 | ``` 27 | kubectl get nodes 28 | ``` 29 | The output looks similar to this: 30 | ``` 31 | NAME STATUS AGE 32 | i-dh4t40ez Ready 19d 33 | i-jnr9dxhz Ready,SchedulingDisabled 19d 34 | i-tiga0i1q Ready 19d 35 | ``` 36 | Pick any node and get its IP information: 37 | ``` 38 | kubectl describe node i-dh4t40ez 39 | ``` 40 | The result looks similar to this: 41 | ``` 42 | Name: i-dh4t40ez 43 | Labels: failure-domain.beta.kubernetes.io/region=ac1,failure-domain.beta.kubernetes.io/zone=ac1,kubernetes.io/hostname=i-dh4t40ez 44 | CreationTimestamp: Thu, 26 May 2016
07:10:59 +0800 45 | Phase: 46 | Conditions: 47 | Type Status LastHeartbeatTime LastTransitionTime Reason Message 48 | ---- ------ ----------------- ------------------ ------ ------- 49 | OutOfDisk False Tue, 14 Jun 2016 10:53:07 +0800 Thu, 26 May 2016 07:10:59 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available 50 | Ready True Tue, 14 Jun 2016 10:53:07 +0800 Thu, 26 May 2016 07:10:59 +0800 KubeletReady kubelet is posting ready status 51 | Addresses: 180.101.191.78,10.244.1.0 52 | ``` 53 | ```Addresses``` lists the external IP ```180.101.191.78```, so the service address of this image classifier is ```180.101.191.78:32668```. 54 | 55 | ## Client 56 | #### Use the Docker image directly 57 | ``` 58 | docker run -it -v "$PWD"/pic:/pic index.caicloud.io/caicloud/inception_serving 59 | cd serving 60 | ``` 61 | 62 | #### Build locally 63 | 1. Install the required tools by following the [documentation](https://tensorflow.github.io/serving/setup#prerequisites) 64 | 65 | 2. Download the code and build the client (the first build is slow and takes 2-4 hours): 66 | ``` 67 | git clone --recurse-submodules https://github.com/tensorflow/serving 68 | cd serving/tensorflow 69 | ./configure 70 | cd .. 71 | bazel build -c opt tensorflow_serving/...
72 | ``` 73 | 74 | ## Classify images with the client 75 | ``` 76 | bazel-bin/tensorflow_serving/example/inception_client --server=180.101.191.78:32668 --image=/pic/02ea79e4aad9d6275da78a9170fa4e82.jpg 77 | ``` 78 | Replace the ```server``` argument with the address of the server you started. 79 | 80 | The request may time out at run time. If it does, you can change the timeout parameter: 81 | ``` 82 | vi tensorflow_serving/example/inception_client.py 83 | ``` 84 | Adjust the timeout on the following line: 85 | ``` 86 | result = stub.Classify(request, 10.0) # 10 secs timeout 87 | ``` 88 | 89 | -------------------------------------------------------------------------------- /imagenet_serving/pic/02ea79e4aad9d6275da78a9170fa4e82.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/imagenet_serving/pic/02ea79e4aad9d6275da78a9170fa4e82.jpg -------------------------------------------------------------------------------- /imagenet_serving/pic/07889356d62fa6517b0db6cf9dcf1f96.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/imagenet_serving/pic/07889356d62fa6517b0db6cf9dcf1f96.jpg -------------------------------------------------------------------------------- /imagenet_serving/pic/516308313.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/imagenet_serving/pic/516308313.jpg -------------------------------------------------------------------------------- /imagenet_serving/pic/7e7e745620d307aa2cb4afcdfa90d189.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/imagenet_serving/pic/7e7e745620d307aa2cb4afcdfa90d189.jpg
-------------------------------------------------------------------------------- /imagenet_serving/serving.json: -------------------------------------------------------------------------------- 1 | { 2 | "apiVersion": "v1", 3 | "kind": "ReplicationController", 4 | "metadata": { 5 | "name": "inception-controller" 6 | }, 7 | "spec": { 8 | "replicas": 3, 9 | "selector": { 10 | "worker": "inception-pod" 11 | }, 12 | "template": { 13 | "metadata": { 14 | "labels": { 15 | "worker": "inception-pod" 16 | } 17 | }, 18 | "spec": { 19 | "containers": [ 20 | { 21 | "name": "inception-container", 22 | "image": "index.caicloud.io/caicloud/inception_serving", 23 | "command": [ 24 | "/bin/sh", 25 | "-c" 26 | ], 27 | "args": [ 28 | "/serving/bazel-bin/tensorflow_serving/example/inception_inference --port=9000 /serving/inception-export" 29 | ], 30 | "ports": [ 31 | { 32 | "containerPort": 9000 33 | } 34 | ] 35 | } 36 | ], 37 | "restartPolicy": "Always" 38 | } 39 | } 40 | } 41 | } 42 | 43 | { 44 | "kind": "Service", 45 | "apiVersion": "v1", 46 | "metadata": { 47 | "name": "inception-service" 48 | }, 49 | "spec": { 50 | "ports": [ 51 | { 52 | "port": 9000 53 | } 54 | ], 55 | "selector": { 56 | "worker": "inception-pod" 57 | }, 58 | "type": "NodePort" 59 | } 60 | } 61 | -------------------------------------------------------------------------------- /k8s_tf.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: ReplicationController 3 | metadata: 4 | name: tf-worker1 5 | spec: 6 | replicas: 1 7 | template: 8 | metadata: 9 | labels: 10 | tf-worker: "0" 11 | spec: 12 | containers: 13 | - name: tf-worker1 14 | image: index.caicloud.io/tf_grpc_test_server 15 | args: 16 | - --cluster_spec=worker|tf-worker1:2222;tf-worker3:2222;tf-worker5:2222,ps|tf-ps2:2222;tf-ps4:2222 17 | - --job_name=worker 18 | - --task_id=0 19 | ports: 20 | - containerPort: 2222 21 | --- 22 | apiVersion: v1 23 | kind: Service 24 | metadata: 25 | name: 
tf-worker1 26 | labels: 27 | tf-worker: "0" 28 | spec: 29 | type: LoadBalancer 30 | ports: 31 | - port: 2222 32 | selector: 33 | tf-worker: "0" 34 | --- 35 | apiVersion: v1 36 | kind: ReplicationController 37 | metadata: 38 | name: tf-worker3 39 | spec: 40 | replicas: 1 41 | template: 42 | metadata: 43 | labels: 44 | tf-worker: "1" 45 | spec: 46 | containers: 47 | - name: tf-worker3 48 | image: index.caicloud.io/caicloud/tf_grpc_test_server 49 | args: 50 | - --cluster_spec=worker|tf-worker1:2222;tf-worker3:2222;tf-worker5:2222,ps|tf-ps2:2222;tf-ps4:2222 51 | - --job_name=worker 52 | - --task_id=1 53 | ports: 54 | - containerPort: 2222 55 | --- 56 | apiVersion: v1 57 | kind: Service 58 | metadata: 59 | type: LoadBalancer 60 | name: tf-worker3 61 | labels: 62 | tf-worker: "1" 63 | spec: 64 | ports: 65 | - port: 2222 66 | selector: 67 | tf-worker: "1" 68 | --- 69 | apiVersion: v1 70 | kind: ReplicationController 71 | metadata: 72 | name: tf-worker5 73 | spec: 74 | replicas: 1 75 | template: 76 | metadata: 77 | labels: 78 | tf-worker: "2" 79 | spec: 80 | containers: 81 | - name: tf-worker5 82 | image: index.caicloud.io/caicloud/tf_grpc_test_server 83 | args: 84 | - --cluster_spec=worker|tf-worker1:2222;tf-worker3:2222;tf-worker5:2222,ps|tf-ps2:2222;tf-ps4:2222 85 | - --job_name=worker 86 | - --task_id=2 87 | ports: 88 | - containerPort: 2222 89 | --- 90 | apiVersion: v1 91 | kind: Service 92 | metadata: 93 | type: LoadBalancer 94 | name: tf-worker5 95 | labels: 96 | tf-worker: "2" 97 | spec: 98 | ports: 99 | - port: 2222 100 | selector: 101 | tf-worker: "2" 102 | --- 103 | apiVersion: v1 104 | kind: ReplicationController 105 | metadata: 106 | name: tf-ps2 107 | spec: 108 | replicas: 1 109 | template: 110 | metadata: 111 | labels: 112 | tf-ps: "0" 113 | spec: 114 | containers: 115 | - name: tf-ps2 116 | image: index.caicloud.io/caicloud/tf_grpc_test_server 117 | args: 118 | - 
--cluster_spec=worker|tf-worker1:2222;tf-worker3:2222;tf-worker5:2222,ps|tf-ps2:2222;tf-ps4:2222 119 | - --job_name=ps 120 | - --task_id=0 121 | ports: 122 | - containerPort: 2222 123 | --- 124 | apiVersion: v1 125 | kind: Service 126 | metadata: 127 | name: tf-ps2 128 | labels: 129 | tf-ps: "0" 130 | spec: 131 | ports: 132 | - port: 2222 133 | selector: 134 | tf-ps: "0" 135 | --- 136 | apiVersion: v1 137 | kind: ReplicationController 138 | metadata: 139 | name: tf-ps4 140 | spec: 141 | replicas: 1 142 | template: 143 | metadata: 144 | labels: 145 | tf-ps: "1" 146 | spec: 147 | containers: 148 | - name: tf-ps4 149 | image: index.caicloud.io/caicloud/tf_grpc_test_server 150 | args: 151 | - --cluster_spec=worker|tf-worker1:2222;tf-worker3:2222;tf-worker5:2222,ps|tf-ps2:2222;tf-ps4:2222 152 | - --job_name=ps 153 | - --task_id=1 154 | ports: 155 | - containerPort: 2222 156 | --- 157 | apiVersion: v1 158 | kind: Service 159 | metadata: 160 | name: tf-ps4 161 | labels: 162 | tf-ps: "1" 163 | spec: 164 | ports: 165 | - port: 2222 166 | selector: 167 | tf-ps: "1" 168 | --- 169 | 170 | -------------------------------------------------------------------------------- /k8s_tf_runner.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: ReplicationController 3 | metadata: 4 | name: tf-runner 5 | spec: 6 | replicas: 1 7 | template: 8 | metadata: 9 | labels: 10 | name: tf-runner 11 | spec: 12 | containers: 13 | - name: tf-runner 14 | image: index.caicloud.io/tensorflow:0.8.0 15 | imagePullPolicy: Always 16 | ports: 17 | - containerPort: 6006 18 | name: tensorboard 19 | - containerPort: 8888 20 | name: jupyter 21 | - containerPort: 8000 22 | name: fileserver 23 | --- 24 | apiVersion: v1 25 | kind: Service 26 | metadata: 27 | name: tf-runner 28 | labels: 29 | name: tf-runner 30 | spec: 31 | type: LoadBalancer 32 | ports: 33 | - name: tensorboard 34 | port: 6006 35 | - name: jupyter 36 | port: 8888 37 | - name: fileserver 
38 | port: 8000 39 | selector: 40 | name: tf-runner 41 | 42 | -------------------------------------------------------------------------------- /notebooks/RNN_PennTreeBank_LanguageModeling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Step 1: Read the data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "Words in training data: 929589\n", 22 | "Words in validating data: 73760\n", 23 | "Words in testing data: 82430\n", 24 | "Example training data: [9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983]\n", 25 | "Example validating data: [1132, 93, 358, 5, 329, 51, 9836, 6, 326, 2476]\n", 26 | "Example testing data: [102, 14, 24, 32, 752, 381, 2, 29, 120, 0]\n" 27 | ] 28 | } 29 | ], 30 | "source": [ 31 | "import time\n", 32 | "import collections\n", 33 | "import os\n", 34 | "\n", 35 | "import numpy as np\n", 36 | "import tensorflow as tf\n", 37 | "\n", 38 | "def read_words(filename):\n", 39 | " with tf.gfile.GFile(filename, \"r\") as f:\n", 40 | " return f.read().replace(\"\\n\", \"<eos>\").split()\n", 41 | "\n", 42 | "def build_vocab(filename):\n", 43 | " data = read_words(filename)\n", 44 | " counter = collections.Counter(data)\n", 45 | " count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))\n", 46 | " words, _ = list(zip(*count_pairs))\n", 47 | " word_to_id = dict(zip(words, range(len(words))))\n", 48 | " return word_to_id\n", 49 | "\n", 50 | "def file_to_word_ids(filename, word_to_id):\n", 51 | " data = read_words(filename)\n", 52 | " return [word_to_id[word] for word in data]\n", 53 | "\n", 54 | "def ptb_raw_data():\n", 55 | " train_path = \"ptb.train.txt\"\n", 56 | " valid_path = \"ptb.valid.txt\"\n", 57 | " test_path = \"ptb.test.txt\"\n", 58 |
"\n", 59 | " word_to_id = build_vocab(train_path)\n", 60 | " train_data = file_to_word_ids(train_path, word_to_id)\n", 61 | " valid_data = file_to_word_ids(valid_path, word_to_id)\n", 62 | " test_data = file_to_word_ids(test_path, word_to_id)\n", 63 | " return train_data, valid_data, test_data\n", 64 | "\n", 65 | "train_data, valid_data, test_data = ptb_raw_data()\n", 66 | "print \"Words in training data:\", len(train_data)\n", 67 | "print \"Words in validating data:\", len(valid_data)\n", 68 | "print \"Words in testing data:\", len(test_data)\n", 69 | "print \"Example training data:\", train_data[:10]\n", 70 | "print \"Example validating data:\", valid_data[:10]\n", 71 | "print \"Example testing data:\", test_data[:10]" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Step 2: Arrange the data into the RNN input format" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 2, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [ 88 | { 89 | "name": "stdout", 90 | "output_type": "stream", 91 | "text": [ 92 | "X: [[ 0 1 2]\n", 93 | " [ 8 9 10]\n", 94 | " [16 17 18]]\n", 95 | "Y: [[ 1 2 3]\n", 96 | " [ 9 10 11]\n", 97 | " [17 18 19]]\n", 98 | "-------------------\n", 99 | "X: [[ 3 4 5]\n", 100 | " [11 12 13]\n", 101 | " [19 20 21]]\n", 102 | "Y: [[ 4 5 6]\n", 103 | " [12 13 14]\n", 104 | " [20 21 22]]\n", 105 | "-------------------\n" 106 | ] 107 | } 108 | ], 109 | "source": [ 110 | "def ptb_iterator(raw_data, batch_size, num_steps):\n", 111 | " raw_data = np.array(raw_data, dtype=np.int32)\n", 112 | " data_len = len(raw_data)\n", 113 | " batch_len = data_len // batch_size\n", 114 | " data = np.zeros([batch_size, batch_len], dtype=np.int32)\n", 115 | " for i in range(batch_size):\n", 116 | " data[i] = raw_data[batch_len * i:batch_len * (i + 1)]\n", 117 | "\n", 118 | " epoch_size = (batch_len - 1) // num_steps\n", 119 | " if epoch_size == 0:\n", 120 | " raise ValueError(\"epoch_size == 0, decrease batch_size or
num_steps\")\n", 121 | "\n", 122 | " for i in range(epoch_size):\n", 123 | " x = data[:, i*num_steps:(i+1)*num_steps]\n", 124 | " y = data[:, i*num_steps+1:(i+1)*num_steps+1]\n", 125 | " yield (x, y)\n", 126 | "\n", 127 | "result = ptb_iterator(range(25), 3, 3)\n", 128 | "for x, y in result:\n", 129 | " print \"X:\", x\n", 130 | " print \"Y:\", y\n", 131 | " print \"-------------------\"\n", 132 | " " 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "### Step 3: Build the RNN network" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "Model generated!\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "hidden_size = 650\n", 159 | "num_layer = 2\n", 160 | "vocab_size = 10000\n", 161 | "\n", 162 | "class PTBModel(object):\n", 163 | " def __init__(self, is_training, batch_size, num_steps):\n", 164 | " self.batch_size = batch_size\n", 165 | " self.num_steps = num_steps\n", 166 | " \n", 167 | " # Define Input & Output\n", 168 | " self.input_data = tf.placeholder(tf.int32, [batch_size, num_steps])\n", 169 | " self.targets = tf.placeholder(tf.int32, [batch_size, num_steps])\n", 170 | " \n", 171 | " # Define RNN network\n", 172 | " lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size, forget_bias=0.0)\n", 173 | " if is_training :\n", 174 | " lstm_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=0.5)\n", 175 | " cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * num_layer)\n", 176 | "\n", 177 | " # Embedding\n", 178 | " self.initial_state = cell.zero_state(batch_size, tf.float32)\n", 179 | " embedding = tf.get_variable(\"embedding\", [vocab_size, hidden_size])\n", 180 | " inputs = tf.nn.embedding_lookup(embedding, self.input_data)\n", 181 | " if is_training: inputs = tf.nn.dropout(inputs, 0.5)\n", 182 | "\n", 183 | " # Forward
propagation\n", 184 | " outputs = []\n", 185 | " state = self.initial_state\n", 186 | " with tf.variable_scope(\"RNN\"):\n", 187 | " for time_step in range(num_steps):\n", 188 | " if time_step > 0: tf.get_variable_scope().reuse_variables()\n", 189 | " (cell_output, state) = cell(inputs[:, time_step, :], state)\n", 190 | " outputs.append(cell_output)\n", 191 | "\n", 192 | " output = tf.reshape(tf.concat(1, outputs), [-1, hidden_size])\n", 193 | " softmax_w = tf.get_variable(\"softmax_w\", [hidden_size, vocab_size])\n", 194 | " softmax_b = tf.get_variable(\"softmax_b\", [vocab_size])\n", 195 | " logits = tf.matmul(output, softmax_w) + softmax_b\n", 196 | " loss = tf.nn.seq2seq.sequence_loss_by_example(\n", 197 | " [logits], [tf.reshape(self.targets, [-1])], [tf.ones([batch_size * num_steps])])\n", 198 | " self.cost = cost = tf.reduce_sum(loss) / batch_size\n", 199 | " self.final_state = state\n", 200 | "\n", 201 | " if not is_training: return\n", 202 | " self.lr = tf.Variable(0.0, trainable=False)\n", 203 | " tvars = tf.trainable_variables()\n", 204 | " grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)\n", 205 | " optimizer = tf.train.GradientDescentOptimizer(self.lr)\n", 206 | " self.train_op = optimizer.apply_gradients(zip(grads, tvars))\n", 207 | "\n", 208 | " def assign_lr(self, session, lr_value):\n", 209 | " session.run(tf.assign(self.lr, lr_value))\n", 210 | " \n", 211 | "print(\"Model generated!\")" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "collapsed": true 218 | }, 219 | "source": [ 220 | "### Step 4: Train the model" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": { 227 | "collapsed": false 228 | }, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | "Epoch: 1 Learning rate: 1.000\n", 235 | "0.008 perplexity: 5743.727 speed: 197 wps\n", 236 | "0.107 perplexity: 1201.711 speed: 233 wps\n", 237 | "0.206
perplexity: 863.146 speed: 235 wps\n", 238 | "0.306 perplexity: 692.729 speed: 237 wps\n", 239 | "0.405 perplexity: 595.858 speed: 239 wps\n", 240 | "0.505 perplexity: 529.792 speed: 240 wps\n", 241 | "0.604 perplexity: 475.562 speed: 241 wps\n", 242 | "0.704 perplexity: 437.009 speed: 241 wps\n", 243 | "0.803 perplexity: 407.033 speed: 242 wps\n", 244 | "0.903 perplexity: 380.664 speed: 242 wps\n", 245 | "Epoch: 1 Train Perplexity: 360.271\n", 246 | "Epoch: 1 Valid Perplexity: 213.824\n", 247 | "Epoch: 2 Learning rate: 1.000\n", 248 | "0.008 perplexity: 257.275 speed: 252 wps\n", 249 | "0.107 perplexity: 199.571 speed: 231 wps\n", 250 | "0.206 perplexity: 207.185 speed: 231 wps\n", 251 | "0.306 perplexity: 201.614 speed: 234 wps\n", 252 | "0.405 perplexity: 199.146 speed: 236 wps\n", 253 | "0.505 perplexity: 196.261 speed: 240 wps\n", 254 | "0.604 perplexity: 190.792 speed: 241 wps\n", 255 | "0.704 perplexity: 187.582 speed: 242 wps\n", 256 | "0.803 perplexity: 184.645 speed: 243 wps\n", 257 | "0.903 perplexity: 180.317 speed: 244 wps\n", 258 | "Epoch: 2 Train Perplexity: 177.518\n", 259 | "Epoch: 2 Valid Perplexity: 153.699\n", 260 | "Epoch: 3 Learning rate: 1.000\n", 261 | "0.008 perplexity: 185.005 speed: 229 wps\n", 262 | "0.107 perplexity: 143.224 speed: 239 wps\n", 263 | "0.206 perplexity: 153.393 speed: 229 wps\n", 264 | "0.306 perplexity: 149.645 speed: 207 wps\n", 265 | "0.405 perplexity: 148.905 speed: 200 wps\n", 266 | "0.505 perplexity: 147.926 speed: 197 wps\n", 267 | "0.604 perplexity: 144.739 speed: 195 wps\n", 268 | "0.704 perplexity: 143.543 speed: 190 wps\n", 269 | "0.803 perplexity: 142.433 speed: 188 wps\n", 270 | "0.903 perplexity: 139.689 speed: 188 wps\n", 271 | "Epoch: 3 Train Perplexity: 138.276\n", 272 | "Epoch: 3 Valid Perplexity: 130.090\n", 273 | "Epoch: 4 Learning rate: 1.000\n", 274 | "0.008 perplexity: 151.183 speed: 186 wps\n", 275 | "0.107 perplexity: 116.480 speed: 182 wps\n" 276 | ] 277 | } 278 | ], 279 | "source": [ 280 | "def 
run_epoch(session, m, data, eval_op, verbose=False):\n", 281 | " epoch_size = ((len(data) // m.batch_size) - 1) // m.num_steps\n", 282 | " start_time = time.time()\n", 283 | " costs = 0.0\n", 284 | " iters = 0\n", 285 | " state = m.initial_state.eval()\n", 286 | " for step, (x, y) in enumerate(ptb_iterator(data, m.batch_size, m.num_steps)):\n", 287 | " cost, state, _ = session.run([m.cost, m.final_state, eval_op], \n", 288 | " {m.input_data: x, m.targets: y, m.initial_state: state})\n", 289 | " costs += cost\n", 290 | " iters += m.num_steps\n", 291 | "\n", 292 | " if verbose and step % (epoch_size // 10) == 10:\n", 293 | " print(\"%.3f perplexity: %.3f speed: %.0f wps\" % \n", 294 | " (step * 1.0 / epoch_size, np.exp(costs / iters),\n", 295 | " iters * m.batch_size / (time.time() - start_time)))\n", 296 | " return np.exp(costs / iters)\n", 297 | "\n", 298 | "with tf.Session() as session:\n", 299 | " initializer = tf.random_uniform_initializer(-0.05, 0.05)\n", 300 | " with tf.variable_scope(\"model\", reuse=None, initializer=initializer):\n", 301 | " m = PTBModel(True, 20, 35)\n", 302 | " with tf.variable_scope(\"model\", reuse=True, initializer=initializer):\n", 303 | " mtest = PTBModel(False, 1, 1)\n", 304 | "\n", 305 | " tf.initialize_all_variables().run()\n", 306 | "\n", 307 | " for i in range(39):\n", 308 | " base_lr = 1.0\n", 309 | " lr_decay = 0.8 ** max(i - 6, 0.0)\n", 310 | " m.assign_lr(session, base_lr * lr_decay)\n", 311 | "\n", 312 | " print(\"Epoch: %d Learning rate: %.3f\" % (i + 1, session.run(m.lr)))\n", 313 | " train_perplexity = run_epoch(session, m, train_data, m.train_op, verbose=True)\n", 314 | " print(\"Epoch: %d Train Perplexity: %.3f\" % (i + 1, train_perplexity))\n", 315 | " valid_perplexity = run_epoch(session, mtest, valid_data, tf.no_op())\n", 316 | " print(\"Epoch: %d Valid Perplexity: %.3f\" % (i + 1, valid_perplexity))\n", 317 | "\n", 318 | " test_perplexity = run_epoch(session, mtest, test_data, tf.no_op())\n", 319 | " print(\"Test 
Perplexity: %.3f\" % test_perplexity)" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": { 326 | "collapsed": true 327 | }, 328 | "outputs": [], 329 | "source": [] 330 | } 331 | ], 332 | "metadata": { 333 | "kernelspec": { 334 | "display_name": "Python 2", 335 | "language": "python", 336 | "name": "python2" 337 | }, 338 | "language_info": { 339 | "codemirror_mode": { 340 | "name": "ipython", 341 | "version": 2 342 | }, 343 | "file_extension": ".py", 344 | "mimetype": "text/x-python", 345 | "name": "python", 346 | "nbconvert_exporter": "python", 347 | "pygments_lexer": "ipython2", 348 | "version": "2.7.6" 349 | } 350 | }, 351 | "nbformat": 4, 352 | "nbformat_minor": 0 353 | } 354 | -------------------------------------------------------------------------------- /notebooks/hello_world.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Import the TensorFlow package." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import tensorflow as tf" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Create a session to run the TensorFlow program. Every operation on a node of the TensorFlow graph must run inside a session, which maintains the runtime context." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "sess = tf.InteractiveSession()" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "Use tf.name_scope to specify a namespace. Variables in the same namespace are automatically collapsed together in the graph visualization.\n", 44 | "\n", 45 | "The input scope defines three inputs. The first is a constant, created with tf.constant. The second is a variable, created with tf.Variable; a variable needs an initial value, which here is a random value in [0, 1). The last is a placeholder, which can be thought of as an interface for feeding in input data." 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 3, 51 | 
"metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "with tf.name_scope('input'):\n", 57 | " input1 = tf.constant([1.0, 2.0, 3.0], name=\"input1\")\n", 58 | " input2 = tf.Variable(tf.random_uniform([3]), name=\"input2\") \n", 59 | " input3 = tf.placeholder(tf.float32, [3], name=\"input3\")" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "Initialize all variables. TensorFlow requires every variable to be initialized before it is used in a computation." 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 4, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "\n", 81 | "[ 0.9391259 0.75194204 0.15929091]\n" 82 | ] 83 | } 84 | ], 85 | "source": [ 86 | "tf.initialize_all_variables().run() \n", 87 | "print input2\n", 88 | "print input2.eval()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "Define an addition operation. Note that this adds three vectors, not three scalars. Like NumPy, TensorFlow supports operations on arrays of any dimension." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 5, 101 | "metadata": { 102 | "collapsed": false 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "with tf.name_scope(\"add\"):\n", 107 | " output = tf.add_n([input1, input2, input3], name=\"add\")" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "Write the computation graph to the log so that TensorBoard can visualize it." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 8, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "input/input1:0\n" 129 | ] 130 | } 131 | ], 132 | "source": [ 133 | "writer = tf.train.SummaryWriter(\"/log/demo-hello\", sess.graph)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Print the result of the addition. Note that TensorFlow uses lazy 
evaluation: a value is computed only when it is needed. To inspect a result, you must therefore state explicitly which tensor you want to evaluate." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 7, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "[ 3.93912601 5.75194216 7.15929079]\n" 155 | ] 156 | } 157 | ], 158 | "source": [ 159 | "print sess.run(output, {input3: [2, 3, 4]})" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": { 166 | "collapsed": true 167 | }, 168 | "outputs": [], 169 | "source": [] 170 | } 171 | ], 172 | "metadata": { 173 | "kernelspec": { 174 | "display_name": "Python 2", 175 | "language": "python", 176 | "name": "python2" 177 | }, 178 | "language_info": { 179 | "codemirror_mode": { 180 | "name": "ipython", 181 | "version": 2 182 | }, 183 | "file_extension": ".py", 184 | "mimetype": "text/x-python", 185 | "name": "python", 186 | "nbconvert_exporter": "python", 187 | "pygments_lexer": "ipython2", 188 | "version": "2.7.6" 189 | } 190 | }, 191 | "nbformat": 4, 192 | "nbformat_minor": 0 193 | } 194 | -------------------------------------------------------------------------------- /notebooks/mnist_cnn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Step 1: Load the data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "Extracting /tmp/data/train-images-idx3-ubyte.gz\n", 22 | "Extracting /tmp/data/train-labels-idx1-ubyte.gz\n", 23 | "Extracting /tmp/data/t10k-images-idx3-ubyte.gz\n", 24 | "Extracting /tmp/data/t10k-labels-idx1-ubyte.gz\n", 25 | "Training data size: 55000\n", 26 | "Validating data size: 5000\n", 27 | "Testing data size: 10000\n" 28 | ] 29 | } 
30 | ], 31 | "source": [ 32 | "import tensorflow as tf\n", 33 | "import time\n", 34 | "from tensorflow.examples.tutorials.mnist import input_data\n", 35 | "\n", 36 | "mnist = input_data.read_data_sets(\"/tmp/data\", one_hot=True)\n", 37 | "\n", 38 | "print \"Training data size: \", mnist.train.num_examples\n", 39 | "print \"Validating data size: \", mnist.validation.num_examples\n", 40 | "print \"Testing data size: \", mnist.test.num_examples" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Step 2: Build the neural network" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "Network created!\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "# This is where training samples and labels are fed to the graph.\n", 67 | "# These placeholder nodes will be fed a batch of training data at each\n", 68 | "# training step using the {feed_dict} argument to the Run() call below.\n", 69 | "BATCH_SIZE = 64\n", 70 | "EVAL_SIZE = 10000\n", 71 | "IMAGE_SIZE = 28\n", 72 | "NUM_CHANNELS = 1\n", 73 | "NUM_LABELS = 10\n", 74 | "\n", 75 | "x = tf.placeholder(tf.float32, shape=(None, IMAGE_SIZE, IMAGE_SIZE, NUM_CHANNELS))\n", 76 | "y_ = tf.placeholder(tf.float32, shape=(None, NUM_LABELS))\n", 77 | "\n", 78 | "# The variables below hold all the trainable weights. 
\n", 79 | "# Convolutional layers.\n", 80 | "conv1_weights = tf.Variable(tf.truncated_normal([5, 5, NUM_CHANNELS, 32], stddev=0.1, seed = 2))\n", 81 | "conv1_biases = tf.Variable(tf.zeros([32]))\n", 82 | "\n", 83 | "conv2_weights = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1, seed = 2))\n", 84 | "conv2_biases = tf.Variable(tf.constant(0.1, shape=[64]))\n", 85 | "\n", 86 | "# fully connected, depth 512.\n", 87 | "fc1_weights = tf.Variable(tf.truncated_normal([IMAGE_SIZE // 4 * IMAGE_SIZE // 4 * 64, 512], stddev=0.1, seed = 2))\n", 88 | "fc1_biases = tf.Variable(tf.constant(0.1, shape=[512]))\n", 89 | "\n", 90 | "fc2_weights = tf.Variable(tf.truncated_normal([512, NUM_LABELS], stddev=0.1, seed = 2))\n", 91 | "fc2_biases = tf.Variable(tf.constant(0.1, shape=[NUM_LABELS]))\n", 92 | "\n", 93 | "def model(data, train=False):\n", 94 | " \"\"\"The Model definition.\"\"\"\n", 95 | " # 2D convolution, with 'SAME' padding (i.e. the output feature map has\n", 96 | " # the same size as the input). Note that {strides} is a 4D array whose\n", 97 | " # shape matches the data layout: [image index, y, x, depth].\n", 98 | " conv = tf.nn.conv2d(data, conv1_weights, strides=[1, 1, 1, 1], padding='SAME')\n", 99 | " # Bias and rectified linear non-linearity.\n", 100 | " relu = tf.nn.relu(tf.nn.bias_add(conv, conv1_biases))\n", 101 | " # Max pooling. The kernel size spec {ksize} also follows the layout of\n", 102 | " # the data. 
Here we have a pooling window of 2, and a stride of 2.\n", 103 | " pool = tf.nn.max_pool(relu, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')\n", 104 | " \n", 105 | " conv1 = tf.nn.conv2d(pool, conv2_weights, strides=[1, 1, 1, 1], padding='SAME')\n", 106 | " relu1 = tf.nn.relu(tf.nn.bias_add(conv1, conv2_biases))\n", 107 | " pool1 = tf.nn.max_pool(relu1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')\n", 108 | " \n", 109 | " # Reshape the feature map into a 2D matrix to feed it to the fully connected layers.\n", 110 | " pool_shape = pool1.get_shape().as_list()\n", 111 | " reshape = tf.reshape(pool1, [-1, pool_shape[1] * pool_shape[2] * pool_shape[3]])\n", 112 | " \n", 113 | " # Fully connected layer. Note that the '+' operation automatically broadcasts the biases.\n", 114 | " hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases)\n", 115 | " # Add a 50% dropout during training only. Dropout also scales\n", 116 | " # activations such that no rescaling is needed at evaluation time.\n", 117 | " if train: hidden = tf.nn.dropout(hidden, 0.5)\n", 118 | " return tf.nn.softmax(tf.matmul(hidden, fc2_weights) + fc2_biases)\n", 119 | "\n", 120 | "print(\"Network created!\")" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### Step 3: Define the training procedure" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 3, 133 | "metadata": { 134 | "collapsed": false 135 | }, 136 | "outputs": [ 137 | { 138 | "name": "stdout", 139 | "output_type": "stream", 140 | "text": [ 141 | "Training & eval step setup!\n" 142 | ] 143 | } 144 | ], 145 | "source": [ 146 | "# Predictions for the current training minibatch.\n", 147 | "train_y = model(x, True)\n", 148 | "\n", 149 | "# softmax cross entropy loss.\n", 150 | "loss = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(train_y, 1e-10, 1.0)))\n", 151 | "# L2 regularization for the fully connected parameters.\n", 152 | "regularizers = 
(tf.nn.l2_loss(fc1_weights) + tf.nn.l2_loss(fc1_biases) + \n", 153 | " tf.nn.l2_loss(fc2_weights) + tf.nn.l2_loss(fc2_biases))\n", 154 | "# Add the regularization term to the loss.\n", 155 | "loss += 5e-4 * regularizers\n", 156 | "\n", 157 | "# Optimizer: set up a variable that's incremented once per batch and\n", 158 | "# controls the learning rate decay.\n", 159 | "step = tf.Variable(0)\n", 160 | "\n", 161 | "# Decay once per epoch, using an exponential schedule starting at 0.01.\n", 162 | "learning_rate = tf.train.exponential_decay(\n", 163 | " 0.01, # Base learning rate.\n", 164 | " step * BATCH_SIZE, # Current index into the dataset.\n", 165 | " mnist.train.num_examples, # Decay step.\n", 166 | " 0.95, # Decay rate.\n", 167 | " staircase=True)\n", 168 | "\n", 169 | "# Use simple momentum for the optimization.\n", 170 | "optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(loss, global_step=step)\n", 171 | "\n", 172 | "# Training accuracy\n", 173 | "train_correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(train_y, 1))\n", 174 | "train_accuracy = tf.reduce_mean(tf.cast(train_correct_prediction, tf.float32))\n", 175 | " \n", 176 | "# Predictions for the test and validation, which we'll compute less often.\n", 177 | "eval_y = model(x, False)\n", 178 | "eval_correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(eval_y, 1))\n", 179 | "eval_accuracy = tf.reduce_mean(tf.cast(eval_correct_prediction, tf.float32))\n", 180 | "\n", 181 | "print(\"Training & eval step setup!\")" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "### Step 4: Train the model" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 4, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "Initialized!\n", 203 | "After 0 training step(s), validation accuracy = 0.1118, test accuracy = 0.1245\n", 204 | 
"After 100 training step(s), validation accuracy = 0.831, test accuracy = 0.8325\n", 205 | "After 200 training step(s), validation accuracy = 0.896, test accuracy = 0.8913\n", 206 | "After 300 training step(s), validation accuracy = 0.9232, test accuracy = 0.9247\n", 207 | "After 400 training step(s), validation accuracy = 0.9318, test accuracy = 0.9332\n", 208 | "Final accuracy = 0.9396\n" 209 | ] 210 | } 211 | ], 212 | "source": [ 213 | "import numpy\n", 214 | "\n", 215 | "# Create a local session to run the training.\n", 216 | "start_time = time.time()\n", 217 | "ROUNDS = 500\n", 218 | "\n", 219 | "reshaped_test_data = numpy.reshape(mnist.test.images, [-1, 28, 28, 1])\n", 220 | "test_label = mnist.test.labels\n", 221 | "reshaped_validate_data = numpy.reshape(mnist.validation.images, [-1, 28, 28, 1])\n", 222 | "validate_label = mnist.validation.labels\n", 223 | "\n", 224 | "with tf.Session() as sess:\n", 225 | " # Run all the initializers to prepare the trainable parameters.\n", 226 | " tf.initialize_all_variables().run()\n", 227 | " print('Initialized!')\n", 228 | " # Loop through training steps.\n", 229 | " for i in range(ROUNDS):\n", 230 | " # Run the graph and fetch some of the nodes.\n", 231 | " xs, ys = mnist.train.next_batch(BATCH_SIZE)\n", 232 | " reshaped_x = numpy.reshape(xs, [BATCH_SIZE, 28, 28, 1])\n", 233 | " sess.run(optimizer, feed_dict={x: reshaped_x, y_: ys})\n", 234 | " \n", 235 | " if i % 100 == 0:\n", 236 | " elapsed_time = time.time() - start_time\n", 237 | " start_time = time.time()\n", 238 | "\n", 239 | " validate_acc = sess.run(eval_accuracy, feed_dict={x: reshaped_validate_data, y_:validate_label})\n", 240 | " test_acc = sess.run(eval_accuracy, feed_dict={x: reshaped_test_data, y_:test_label})\n", 241 | " print(\"After %d training step(s), validation accuracy = %g, test accuracy = %g\" % \n", 242 | " (i, validate_acc, test_acc))\n", 243 | "\n", 244 | " test_acc = sess.run(eval_accuracy, feed_dict={x: reshaped_test_data, 
y_:test_label})\n", 245 | " print(\"Final accuracy = %g\" % (test_acc))" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "outputs": [], 255 | "source": [] 256 | } 257 | ], 258 | "metadata": { 259 | "kernelspec": { 260 | "display_name": "Python 2", 261 | "language": "python", 262 | "name": "python2" 263 | }, 264 | "language_info": { 265 | "codemirror_mode": { 266 | "name": "ipython", 267 | "version": 2 268 | }, 269 | "file_extension": ".py", 270 | "mimetype": "text/x-python", 271 | "name": "python", 272 | "nbconvert_exporter": "python", 273 | "pygments_lexer": "ipython2", 274 | "version": "2.7.6" 275 | } 276 | }, 277 | "nbformat": 4, 278 | "nbformat_minor": 0 279 | } 280 | -------------------------------------------------------------------------------- /notebooks/mnist_tensorboard.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Step 1: Load the data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "Extracting /tmp/data/train-images-idx3-ubyte.gz\n", 22 | "Extracting /tmp/data/train-labels-idx1-ubyte.gz\n", 23 | "Extracting /tmp/data/t10k-images-idx3-ubyte.gz\n", 24 | "Extracting /tmp/data/t10k-labels-idx1-ubyte.gz\n", 25 | "Training data size: 55000\n", 26 | "Validating data size: 5000\n", 27 | "Testing data size: 10000\n" 28 | ] 29 | } 30 | ], 31 | "source": [ 32 | "import tensorflow as tf\n", 33 | "import time\n", 34 | "from tensorflow.examples.tutorials.mnist import input_data\n", 35 | "\n", 36 | "SUMMARY_DIR = \"/log/mnist-log\"\n", 37 | "if tf.gfile.Exists(SUMMARY_DIR):\n", 38 | " tf.gfile.DeleteRecursively(SUMMARY_DIR)\n", 39 | "tf.gfile.MakeDirs(SUMMARY_DIR)\n", 40 | 
"\n", 41 | "mnist = input_data.read_data_sets(\"/tmp/data\", one_hot=True)\n", 42 | "\n", 43 | "print \"Training data size: \", mnist.train.num_examples\n", 44 | "print \"Validating data size: \", mnist.validation.num_examples\n", 45 | "print \"Testing data size: \", mnist.test.num_examples" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Step 2: Create the neural network and set up logging" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 2, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "Network created!\n" 67 | ] 68 | } 69 | ], 70 | "source": [ 71 | "tf.reset_default_graph()\n", 72 | "sess = tf.InteractiveSession()\n", 73 | "\n", 74 | "def weight_variable(shape):\n", 75 | " \"\"\"Create a weight variable with appropriate initialization.\"\"\"\n", 76 | " initial = tf.truncated_normal(shape, stddev=0.1, seed = 2)\n", 77 | " return tf.Variable(initial)\n", 78 | "\n", 79 | "def bias_variable(shape):\n", 80 | " \"\"\"Create a bias variable with appropriate initialization.\"\"\"\n", 81 | " initial = tf.constant(0.1, shape=shape)\n", 82 | " return tf.Variable(initial)\n", 83 | "\n", 84 | "def variable_summaries(var, name):\n", 85 | " \"\"\"Attach a lot of summaries to a Tensor.\"\"\"\n", 86 | " with tf.name_scope('summaries'):\n", 87 | " mean = tf.reduce_mean(var)\n", 88 | " tf.scalar_summary('mean/' + name, mean)\n", 89 | " with tf.name_scope('stddev'):\n", 90 | " stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))\n", 91 | " tf.scalar_summary('stddev/' + name, stddev)\n", 92 | " tf.scalar_summary('max/' + name, tf.reduce_max(var))\n", 93 | " tf.scalar_summary('min/' + name, tf.reduce_min(var))\n", 94 | " tf.histogram_summary(name, var)\n", 95 | "\n", 96 | "def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu):\n", 97 | " \"\"\"Reusable code for making a simple neural net layer.\n", 98 | " It 
does a matrix multiply, bias add, and then uses relu to nonlinearize.\n", 99 | " It also sets up name scoping so that the resultant graph is easy to read, and\n", 100 | " adds a number of summary ops.\n", 101 | " \"\"\"\n", 102 | " # Adding a name scope ensures logical grouping of the layers in the graph.\n", 103 | " with tf.name_scope(layer_name):\n", 104 | " # This Variable will hold the state of the weights for the layer\n", 105 | " with tf.name_scope('weights'):\n", 106 | " weights = weight_variable([input_dim, output_dim])\n", 107 | " variable_summaries(weights, layer_name + '/weights')\n", 108 | " with tf.name_scope('biases'):\n", 109 | " biases = bias_variable([output_dim])\n", 110 | " variable_summaries(biases, layer_name + '/biases')\n", 111 | " with tf.name_scope('Wx_plus_b'):\n", 112 | " preactivate = tf.matmul(input_tensor, weights) + biases\n", 113 | " tf.histogram_summary(layer_name + '/pre_activations', preactivate)\n", 114 | " activations = act(preactivate, 'activation')\n", 115 | " tf.histogram_summary(layer_name + '/activations', activations)\n", 116 | " return activations\n", 117 | "\n", 118 | "# Create a multilayer model.\n", 119 | "with tf.name_scope('input'):\n", 120 | " x = tf.placeholder(tf.float32, [None, 784], name='x-input')\n", 121 | " image_shaped_input = tf.reshape(x, [-1, 28, 28, 1])\n", 122 | " tf.image_summary('input', image_shaped_input, 20)\n", 123 | " y_ = tf.placeholder(tf.float32, [None, 10], name='y-input')\n", 124 | "\n", 125 | "hidden_nodes = 500\n", 126 | "hidden1 = nn_layer(x, 784, hidden_nodes, 'layer1')\n", 127 | "y = nn_layer(hidden1, hidden_nodes, 10, 'layer2', act=tf.nn.softmax)\n", 128 | "print(\"Network created!\")" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "### Step 3: Define the training procedure" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 3, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [ 145 | { 146 | "name": 
"stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "Training & eval step setup!\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "with tf.name_scope('cross_entropy'):\n", 155 | " diff = y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0))\n", 156 | " with tf.name_scope('total'):\n", 157 | " cross_entropy = -tf.reduce_mean(diff)\n", 158 | " tf.scalar_summary('cross entropy', cross_entropy)\n", 159 | "\n", 160 | "with tf.name_scope('train'):\n", 161 | " train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)\n", 162 | "\n", 163 | "with tf.name_scope('accuracy'):\n", 164 | " with tf.name_scope('correct_prediction'):\n", 165 | " correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))\n", 166 | " with tf.name_scope('accuracy'):\n", 167 | " accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))\n", 168 | " tf.scalar_summary('accuracy', accuracy)\n", 169 | "\n", 170 | "print(\"Training & eval step setup!\")" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "### Step 4: Set the log directory and initialize all variables" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 4, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "Log init done!\n" 192 | ] 193 | } 194 | ], 195 | "source": [ 196 | "# Merge all the summaries and write them out to SUMMARY_DIR\n", 197 | "merged = tf.merge_all_summaries()\n", 198 | "train_writer = tf.train.SummaryWriter(SUMMARY_DIR + '/train', sess.graph)\n", 199 | "test_writer = tf.train.SummaryWriter(SUMMARY_DIR + '/test')\n", 200 | "\n", 201 | "tf.initialize_all_variables().run()\n", 202 | "\n", 203 | "validate_feed = {x: mnist.validation.images, y_: mnist.validation.labels}\n", 204 | "test_feed = {x: mnist.test.images, y_: mnist.test.labels}\n", 205 | "print(\"Log init done!\")" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 
211 | "source": [ 212 | "### Step 5: Train the model" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 5, 218 | "metadata": { 219 | "collapsed": false 220 | }, 221 | "outputs": [ 222 | { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | "Training begins @ 1466980873.216786\n", 227 | "After 0 training step(s), validation accuracy = 0.2188, test accuracy = 0.2096\n", 228 | "After 100 training step(s), validation accuracy = 0.9168, test accuracy = 0.9113\n", 229 | "After 200 training step(s), validation accuracy = 0.9382, test accuracy = 0.934\n", 230 | "After 300 training step(s), validation accuracy = 0.9528, test accuracy = 0.9483\n", 231 | "After 400 training step(s), validation accuracy = 0.9538, test accuracy = 0.9535\n", 232 | "Training ends @ 1466980891.376311\n", 233 | "Training elapsed time: 18.159525 s\n", 234 | "After 500 training step(s), test accuracy = 0.9557\n" 235 | ] 236 | } 237 | ], 238 | "source": [ 239 | "def feed_dict(train):\n", 240 | " \"\"\"Make a TensorFlow feed_dict: maps data onto Tensor placeholders.\"\"\"\n", 241 | " if train:\n", 242 | " xs, ys = mnist.train.next_batch(100)\n", 243 | " else:\n", 244 | " xs, ys = mnist.test.images, mnist.test.labels\n", 245 | " return {x: xs, y_: ys}\n", 246 | "\n", 247 | "STEPS = 500\n", 248 | "saver = tf.train.Saver()\n", 249 | "time_begin = time.time()\n", 250 | "print(\"Training begins @ %f\" % time_begin)\n", 251 | "for i in range(STEPS):\n", 252 | " _, summary = sess.run([train_step, merged], feed_dict=feed_dict(True))\n", 253 | "\n", 254 | " if i % 100 == 0:\n", 255 | " # Write summary\n", 256 | " train_writer.add_summary(summary, i)\n", 257 | " \n", 258 | " summary = sess.run(merged, feed_dict=feed_dict(False))\n", 259 | " test_writer.add_summary(summary, i) \n", 260 | " \n", 261 | " # Print training information.\n", 262 | " validate_acc = sess.run(accuracy, feed_dict=validate_feed)\n", 263 | " test_acc = sess.run(accuracy, feed_dict=test_feed)\n", 264 | 
" print(\"After %d training step(s), validation accuracy = %g, test accuracy = %g\" % \n", 265 | " (i, validate_acc, test_acc))\n", 266 | " \n", 267 | " # Store model.\n", 268 | " if i == 300: saver.save(sess, \"/tmp/saved_model\")\n", 269 | "\n", 270 | " \n", 271 | "time_end = time.time()\n", 272 | "print(\"Training ends @ %f\" % time_end)\n", 273 | "training_time = time_end - time_begin\n", 274 | "print(\"Training elapsed time: %f s\" % training_time)\n", 275 | "test_acc = sess.run(accuracy, feed_dict=test_feed)\n", 276 | "print(\"After %d training step(s), test accuracy = %g\" % (STEPS, test_acc))" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 6, 282 | "metadata": { 283 | "collapsed": false 284 | }, 285 | "outputs": [ 286 | { 287 | "name": "stdout", 288 | "output_type": "stream", 289 | "text": [ 290 | "Test accuracy for stored model = 0.9483\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "saver.restore(sess, \"/tmp/saved_model\")\n", 296 | "test_acc = sess.run(accuracy, feed_dict=test_feed)\n", 297 | "print(\"Test accuracy for stored model = %g\" % (test_acc))\n", 298 | "\n", 299 | "sess.close()" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": { 306 | "collapsed": true 307 | }, 308 | "outputs": [], 309 | "source": [] 310 | } 311 | ], 312 | "metadata": { 313 | "kernelspec": { 314 | "display_name": "Python 2", 315 | "language": "python", 316 | "name": "python2" 317 | }, 318 | "language_info": { 319 | "codemirror_mode": { 320 | "name": "ipython", 321 | "version": 2 322 | }, 323 | "file_extension": ".py", 324 | "mimetype": "text/x-python", 325 | "name": "python", 326 | "nbconvert_exporter": "python", 327 | "pygments_lexer": "ipython2", 328 | "version": "2.7.6" 329 | } 330 | }, 331 | "nbformat": 4, 332 | "nbformat_minor": 0 333 | } 334 | -------------------------------------------------------------------------------- /notebooks/scope.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "name": "stdout", 12 | "output_type": "stream", 13 | "text": [ 14 | "a/b/Variable:0\n", 15 | "a/b/b:0\n" 16 | ] 17 | } 18 | ], 19 | "source": [ 20 | "import tensorflow as tf\n", 21 | "\n", 22 | "with tf.name_scope(\"a\"):\n", 23 | " with tf.name_scope(\"b\"):\n", 24 | " a = tf.Variable([1]) \n", 25 | " print a.name\n", 26 | " b = tf.Variable([1], name=\"b\")\n", 27 | " print b.name" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [ 37 | { 38 | "name": "stdout", 39 | "output_type": "stream", 40 | "text": [ 41 | "a_1/b/Variable:0\n", 42 | "a_1/b/b:0\n" 43 | ] 44 | } 45 | ], 46 | "source": [ 47 | "with tf.variable_scope(\"a\"):\n", 48 | " with tf.variable_scope(\"b\"):\n", 49 | " a = tf.Variable([1]) \n", 50 | " print a.name\n", 51 | " b = tf.Variable([1], name=\"b\")\n", 52 | " print b.name" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 3, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "a_2/b/Variable:0\n", 67 | "a_2/b/b:0\n", 68 | "a/b/b_1:0\n", 69 | "a/b/b_1:0\n" 70 | ] 71 | } 72 | ], 73 | "source": [ 74 | "with tf.variable_scope(\"a\"):\n", 75 | " with tf.variable_scope(\"b\"):\n", 76 | " a = tf.Variable([1]) \n", 77 | " print a.name\n", 78 | " b = tf.Variable([1], name=\"b\")\n", 79 | " print b.name\n", 80 | " v = tf.get_variable(\"b\", [1])\n", 81 | " print v.name\n", 82 | " with tf.variable_scope(\"b\", reuse=True):\n", 83 | " v1 = tf.get_variable(\"b\", [1])\n", 84 | " print v1.name " 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": { 91 | "collapsed": false 92 | }, 93 
| "outputs": [ 94 | { 95 | "name": "stdout", 96 | "output_type": "stream", 97 | "text": [ 98 | "foo/bar/v:0\n", 99 | "foo/a/bar/Variable:0\n", 100 | "foo/a/bar/Const:0\n" 101 | ] 102 | } 103 | ], 104 | "source": [ 105 | "with tf.variable_scope(\"foo\"):\n", 106 | " with tf.name_scope(\"a\"):\n", 107 | " with tf.variable_scope(\"bar\"):\n", 108 | " v = tf.get_variable(\"v\", [1])\n", 109 | " print v.name\n", 110 | " \n", 111 | " v1 = tf.Variable([1]) \n", 112 | " print v1.name\n", 113 | "\n", 114 | " c = tf.constant(10.0)\n", 115 | " print c.name" 116 | ] 117 | } 118 | ], 119 | "metadata": { 120 | "kernelspec": { 121 | "display_name": "Python 2", 122 | "language": "python", 123 | "name": "python2" 124 | }, 125 | "language_info": { 126 | "codemirror_mode": { 127 | "name": "ipython", 128 | "version": 2 129 | }, 130 | "file_extension": ".py", 131 | "mimetype": "text/x-python", 132 | "name": "python", 133 | "nbconvert_exporter": "python", 134 | "pygments_lexer": "ipython2", 135 | "version": "2.7.6" 136 | } 137 | }, 138 | "nbformat": 4, 139 | "nbformat_minor": 0 140 | } 141 | -------------------------------------------------------------------------------- /picture/create_local.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/create_local.png -------------------------------------------------------------------------------- /picture/create_terminal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/create_terminal.png -------------------------------------------------------------------------------- /picture/dist_creation.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/dist_creation.png -------------------------------------------------------------------------------- /picture/expanded_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/expanded_view.png -------------------------------------------------------------------------------- /picture/homepage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/homepage.png -------------------------------------------------------------------------------- /picture/jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/jupyter.png -------------------------------------------------------------------------------- /picture/list_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/list_view.png -------------------------------------------------------------------------------- /picture/terminal_view.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/caicloud/tensorflow-demo/87460710b5fc9c829de369013c0008a9322b6c4f/picture/terminal_view.png -------------------------------------------------------------------------------- /run_tf.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | cd / 4 | # run file server. 
5 | python /file_server.py & 6 | 7 | cd notebooks 8 | # run tensorboard 9 | tensorboard --logdir=/log & 10 | 11 | # run jupyter 12 | bash /run_jupyter.sh 13 | -------------------------------------------------------------------------------- /word2vector/BUILD: -------------------------------------------------------------------------------- 1 | # Description: 2 | # TensorFlow model for word2vec 3 | 4 | package(default_visibility = ["//tensorflow:internal"]) 5 | 6 | licenses(["notice"]) # Apache 2.0 7 | 8 | exports_files(["LICENSE"]) 9 | 10 | load("//tensorflow:tensorflow.bzl", "tf_gen_op_wrapper_py") 11 | 12 | py_library( 13 | name = "package", 14 | srcs = [ 15 | "__init__.py", 16 | ], 17 | srcs_version = "PY2AND3", 18 | visibility = ["//tensorflow:__subpackages__"], 19 | deps = [ 20 | ":gen_word2vec", 21 | ":word2vec", 22 | ":word2vec_optimized", 23 | ], 24 | ) 25 | 26 | py_binary( 27 | name = "word2vec", 28 | srcs = [ 29 | "word2vec.py", 30 | ], 31 | srcs_version = "PY2AND3", 32 | deps = [ 33 | ":gen_word2vec", 34 | "//tensorflow:tensorflow_py", 35 | "//tensorflow/python:platform", 36 | ], 37 | ) 38 | 39 | py_binary( 40 | name = "word2vec_optimized", 41 | srcs = [ 42 | "word2vec_optimized.py", 43 | ], 44 | srcs_version = "PY2AND3", 45 | deps = [ 46 | ":gen_word2vec", 47 | "//tensorflow:tensorflow_py", 48 | "//tensorflow/python:platform", 49 | ], 50 | ) 51 | 52 | cc_library( 53 | name = "word2vec_ops", 54 | srcs = [ 55 | "word2vec_ops.cc", 56 | ], 57 | linkstatic = 1, 58 | visibility = ["//tensorflow:internal"], 59 | deps = [ 60 | "//tensorflow/core:framework", 61 | ], 62 | alwayslink = 1, 63 | ) 64 | 65 | cc_library( 66 | name = "word2vec_kernels", 67 | srcs = [ 68 | "word2vec_kernels.cc", 69 | ], 70 | linkstatic = 1, 71 | visibility = ["//tensorflow:internal"], 72 | deps = [ 73 | ":word2vec_ops", 74 | "//tensorflow/core", 75 | ], 76 | alwayslink = 1, 77 | ) 78 | 79 | tf_gen_op_wrapper_py( 80 | name = "gen_word2vec", 81 | out = "gen_word2vec.py", 82 | deps = 
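[":word2vec_ops"], 83 | ) 84 | 85 | filegroup( 86 | name = "all_files", 87 | srcs = glob( 88 | ["**/*"], 89 | exclude = [ 90 | "**/METADATA", 91 | "**/OWNERS", 92 | ], 93 | ), 94 | visibility = ["//tensorflow:__subpackages__"], 95 | ) 96 | -------------------------------------------------------------------------------- /word2vector/README.md: -------------------------------------------------------------------------------- 1 | # Vector Representations of Words 2 | ## Background 3 | Traditional natural language processing typically uses the bag-of-words model, which treats each word as an opaque symbol. For example, "cat" might be represented as Id123 and "kitty" as Id456. The biggest drawback of this representation is that it ignores the semantic relationships between words. The bag-of-words model also makes the feature matrix overly sparse. Representing each word as a vector (word to vector, embedding) mitigates these problems to some extent. The background, method, and applications of Word2Vec are described in detail in [this paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf), so we do not repeat them here. Below we show how to run the Word2Vec algorithm on TensorFlow and present a small Word2Vec application. 4 | 5 | ## Basic Word2Vec 6 | ``` 7 | python word2vec_basic.py 8 | ``` 9 | 10 | ## Faster I/O version 11 | If you have already run the basic version of Word2Vec, the training data has already been downloaded; otherwise, download it with the following command: 12 | ``` 13 | wget http://mattmahoney.net/dc/text8.zip 14 | ``` 15 | 16 | Unzip the training data: 17 | ``` 18 | unzip text8.zip 19 | ``` 20 | 21 | Run the training program: 22 | ``` 23 | python word2vec.py --train_data=text8 --eval_data=questions-words.txt --save_path=/tmp 24 | ``` 25 | 26 | On a single machine, this program may take more than ten hours to run. 27 | 28 | ## Faster training version 29 | If the data has not been prepared yet, download it as described above. Once the data is ready, run: 30 | ``` 31 | python word2vec_optimized.py --train_data=text8 --eval_data=questions-words.txt --save_path=/tmp 32 | ``` 33 | Compared with the model above, this version is roughly 15-20 times faster. 34 | 35 | 36 | ## Word arithmetic 37 | #### Using the vectors trained above 38 | None of the programs above write out the final vector learned for each word. To use their results, you need to output the vector for each word in the following format: 39 | ``` 40 | word1 vector1 41 | word2 vector2 42 | ...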
43 | wordn vectorn 44 | ``` 45 | 46 | The words are stored in ```self._options.vocab_words```, and the embedding for each word is in ```self._emb``` (word2vec.py) or ```self._w_in``` (word2vec_optimized.py). 47 | 48 | #### Using pretrained vectors 49 | Many pretrained word-vector models are available online; for example, the Stanford NLP group's [GloVe](http://nlp.stanford.edu/projects/glove/) project provides a number of models. They can be downloaded directly with the following commands: 50 | ``` 51 | wget http://nlp.stanford.edu/data/glove.6B.zip 52 | unzip glove.6B.zip 53 | ``` 54 | 55 | #### Running the word calculator 56 | 57 | -------------------------------------------------------------------------------- /word2vector/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Import generated word2vec optimized ops into embedding package.""" 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | from tensorflow.models.embedding import gen_word2vec 22 | -------------------------------------------------------------------------------- /word2vector/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 24 | 25 | 26 | - + = 27 |
28 |
29 |
30 | 31 | 32 | -------------------------------------------------------------------------------- /word2vector/word2vec_basic.py: -------------------------------------------------------------------------------- 1 | import collections 2 | import math 3 | import os 4 | import random 5 | import zipfile 6 | 7 | import numpy as np 8 | from six.moves import urllib 9 | from six.moves import xrange 10 | 11 | # Step1: prepare data. 12 | url = 'http://mattmahoney.net/dc/' 13 | 14 | def download(filename): 15 | print('Starting downloading data ...') 16 | filename, _ = urllib.request.urlretrieve(url + filename, filename) 17 | print('Data downloaded!') 18 | return filename 19 | 20 | def maybe_download(filename, expected_bytes): 21 | """Download a file if not present or size is incorrect.""" 22 | if not os.path.exists(filename): 23 | filename = download(filename) 24 | else: 25 | if os.stat(filename).st_size != expected_bytes: 26 | os.remove(filename) 27 | filename = download(filename) 28 | 29 | statinfo = os.stat(filename) 30 | if statinfo.st_size == expected_bytes: 31 | print('Found and verified', filename) 32 | else: 33 | print(statinfo.st_size) 34 | raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?') 35 | return filename 36 | 37 | filename = maybe_download('text8.zip', 31344016) 38 | print("Data prepared!") 39 | 40 | # Step2: split words. 41 | def read_data(filename): 42 | """Extract the first file enclosed in a zip file as a list of words""" 43 | with zipfile.ZipFile(filename) as f: 44 | data = f.read(f.namelist()[0]).split() 45 | return data 46 | 47 | words = read_data(filename) 48 | print('Data size', len(words)) 49 | 50 | # Step3: Build dictionary. 
51 | vocabulary_size = 50000 52 | def build_dataset(words): 53 | count = [['UNK', -1]] 54 | count.extend(collections.Counter(words).most_common(vocabulary_size - 1)) 55 | dictionary = dict() 56 | for word, _ in count: 57 | dictionary[word] = len(dictionary) 58 | data = list() 59 | unk_count = 0 60 | for word in words: 61 | if word in dictionary: 62 | index = dictionary[word] 63 | else: 64 | index = 0 # dictionary['UNK'] 65 | unk_count += 1 66 | data.append(index) 67 | count[0][1] = unk_count 68 | reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 69 | return data, count, dictionary, reverse_dictionary 70 | 71 | data, count, dictionary, reverse_dictionary = build_dataset(words) 72 | del words # Hint to reduce memory. 73 | print('Most common words (+UNK)', count[:5]) 74 | print('Sample data', data[:10]) 75 | 76 | # Step4: function to generate training batch 77 | data_index = 0 78 | 79 | def generate_batch(batch_size, num_skips, skip_window): 80 | global data_index 81 | assert batch_size % num_skips == 0 82 | assert num_skips <= 2 * skip_window 83 | batch = np.ndarray(shape=(batch_size), dtype=np.int32) 84 | labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) 85 | span = 2 * skip_window + 1 # [ skip_window target skip_window ] 86 | buffer = collections.deque(maxlen=span) 87 | for _ in range(span): 88 | buffer.append(data[data_index]) 89 | data_index = (data_index + 1) % len(data) 90 | for i in range(batch_size // num_skips): 91 | target = skip_window # target label at the center of the buffer 92 | targets_to_avoid = [ skip_window ] 93 | for j in range(num_skips): 94 | while target in targets_to_avoid: 95 | target = random.randint(0, span - 1) 96 | targets_to_avoid.append(target) 97 | batch[i * num_skips + j] = buffer[skip_window] 98 | labels[i * num_skips + j, 0] = buffer[target] 99 | buffer.append(data[data_index]) 100 | data_index = (data_index + 1) % len(data) 101 | return batch, labels 102 | 103 | print(generate_batch(4, 2, 1)) 104 | 
print(generate_batch(6, 3, 2)) 105 | 106 | 107 | # Step5: Build graph 108 | import tensorflow as tf 109 | 110 | # Training input data. 111 | batch_size = 128 112 | train_inputs = tf.placeholder(tf.int32, shape=[batch_size]) 113 | train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1]) 114 | 115 | # Validation input data. 116 | valid_size = 8 # Random set of words to evaluate similarity on. 117 | valid_window = 100 # Only pick dev samples in the head of the distribution. 118 | valid_examples = np.random.choice(valid_window, valid_size, replace=False) 119 | valid_dataset = tf.constant(valid_examples, dtype=tf.int32) 120 | 121 | # Look up embeddings for inputs. 122 | embedding_size = 128 # Dimension of the embedding vector. 123 | embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)) 124 | embed = tf.nn.embedding_lookup(embeddings, train_inputs) 125 | 126 | # Construct the variables for the NCE loss 127 | nce_weights = tf.Variable( 128 | tf.truncated_normal([vocabulary_size, embedding_size], 129 | stddev=1.0 / math.sqrt(embedding_size))) 130 | nce_biases = tf.Variable(tf.zeros([vocabulary_size])) 131 | 132 | # Compute the average NCE loss for the batch. 133 | # tf.nn.nce_loss automatically draws a new sample of the negative labels each 134 | # time we evaluate the loss. Keyword arguments keep the call independent of argument order. 135 | num_sampled = 64 # Number of negative examples to sample. 136 | loss = tf.reduce_mean(tf.nn.nce_loss( 137 | weights=nce_weights, biases=nce_biases, inputs=embed, labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size)) 138 | 139 | # Construct the SGD optimizer using a learning rate of 1.0. 140 | optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss) 141 | 142 | # Compute the cosine similarity between the validation examples and all embeddings.
143 | norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True)) 144 | normalized_embeddings = embeddings / norm 145 | valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset) 146 | similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True) 147 | 148 | sess = tf.InteractiveSession() 149 | tf.initialize_all_variables().run() 150 | 151 | # Output evaluation result 152 | def print_eval_result(): 153 | sim = similarity.eval() 154 | for i in xrange(valid_size): 155 | valid_word = reverse_dictionary[valid_examples[i]] 156 | top_k = 8 # number of nearest neighbors 157 | nearest = (-sim[i, :]).argsort()[1:top_k + 1] 158 | log_str = "Nearest to %s:" % valid_word 159 | for k in xrange(top_k): 160 | close_word = reverse_dictionary[nearest[k]] 161 | log_str = "%s %s," % (log_str, close_word) 162 | print(log_str) 163 | 164 | print_eval_result() 165 | 166 | # Step6: train model. 167 | num_steps = 100001 168 | average_loss = 0 169 | skip_window = 1 # How many words to consider left and right. 170 | num_skips = 2 # How many times to reuse an input to generate a label. 171 | for step in xrange(num_steps): 172 | batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window) 173 | feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels} 174 | 175 | # We perform one update step by evaluating the optimizer op (including it 176 | # in the list of returned values for session.run()). 177 | _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict) 178 | average_loss += loss_val 179 | 180 | if step % 5000 == 0: 181 | if step > 0: average_loss /= 5000 182 | # The average loss is an estimate of the loss over the last 5000 batches.
183 | print("Average loss at step ", step, ": ", average_loss) 184 | average_loss = 0 185 | 186 | if step % 20000 == 0: 187 | print("Validation result at step ", step) 188 | print_eval_result() 189 | 190 | final_embeddings = normalized_embeddings.eval() 191 | sess.close() 192 | 193 | # Step 7: Visualize the embeddings. 194 | def plot_with_labels(low_dim_embs, labels, filename='tsne.png'): 195 | print(low_dim_embs.shape) 196 | assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings" 197 | plt.figure(figsize=(18, 18)) #in inches 198 | for i, label in enumerate(labels): 199 | x, y = low_dim_embs[i,:] 200 | plt.scatter(x, y) 201 | plt.annotate(label, 202 | xy=(x, y), 203 | xytext=(5, 2), 204 | textcoords='offset points', 205 | ha='right', 206 | va='bottom') 207 | plt.savefig(filename) # Save before show(); the figure may be cleared once the window closes. 208 | plt.show() 209 | 210 | try: 211 | from sklearn.manifold import TSNE 212 | import matplotlib.pyplot as plt 213 | 214 | tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000) 215 | plot_only = 500 216 | low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:]) 217 | labels = [reverse_dictionary[i] for i in xrange(plot_only)] 218 | plot_with_labels(low_dim_embs, labels) 219 | 220 | except ImportError: 221 | print("Please install sklearn and matplotlib to visualize embeddings.") 222 | -------------------------------------------------------------------------------- /word2vector/word2vec_kernels.cc: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | 3 | Licensed under the Apache License, Version 2.0 (the "License"); 4 | you may not use this file except in compliance with the License.
5 | You may obtain a copy of the License at 6 | 7 | http://www.apache.org/licenses/LICENSE-2.0 8 | 9 | Unless required by applicable law or agreed to in writing, software 10 | distributed under the License is distributed on an "AS IS" BASIS, 11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | See the License for the specific language governing permissions and 13 | limitations under the License. 14 | ==============================================================================*/ 15 | 16 | #include "tensorflow/core/framework/op.h" 17 | #include "tensorflow/core/framework/op_kernel.h" 18 | #include "tensorflow/core/lib/core/stringpiece.h" 19 | #include "tensorflow/core/lib/gtl/map_util.h" 20 | #include "tensorflow/core/lib/random/distribution_sampler.h" 21 | #include "tensorflow/core/lib/random/philox_random.h" 22 | #include "tensorflow/core/lib/random/simple_philox.h" 23 | #include "tensorflow/core/lib/strings/str_util.h" 24 | #include "tensorflow/core/platform/thread_annotations.h" 25 | #include "tensorflow/core/util/guarded_philox_random.h" 26 | 27 | namespace tensorflow { 28 | 29 | // Number of examples to precalculate. 30 | const int kPrecalc = 3000; 31 | // Number of words to read into a sentence before processing. 
32 | const int kSentenceSize = 1000; 33 | 34 | namespace { 35 | 36 | bool ScanWord(StringPiece* input, string* word) { 37 | str_util::RemoveLeadingWhitespace(input); 38 | StringPiece tmp; 39 | if (str_util::ConsumeNonWhitespace(input, &tmp)) { 40 | word->assign(tmp.data(), tmp.size()); 41 | return true; 42 | } else { 43 | return false; 44 | } 45 | } 46 | 47 | } // end namespace 48 | 49 | class SkipgramOp : public OpKernel { 50 | public: 51 | explicit SkipgramOp(OpKernelConstruction* ctx) 52 | : OpKernel(ctx), rng_(&philox_) { 53 | string filename; 54 | OP_REQUIRES_OK(ctx, ctx->GetAttr("filename", &filename)); 55 | OP_REQUIRES_OK(ctx, ctx->GetAttr("batch_size", &batch_size_)); 56 | OP_REQUIRES_OK(ctx, ctx->GetAttr("window_size", &window_size_)); 57 | OP_REQUIRES_OK(ctx, ctx->GetAttr("min_count", &min_count_)); 58 | OP_REQUIRES_OK(ctx, ctx->GetAttr("subsample", &subsample_)); 59 | OP_REQUIRES_OK(ctx, Init(ctx->env(), filename)); 60 | 61 | mutex_lock l(mu_); 62 | example_pos_ = corpus_size_; 63 | label_pos_ = corpus_size_; 64 | label_limit_ = corpus_size_; 65 | sentence_index_ = kSentenceSize; 66 | for (int i = 0; i < kPrecalc; ++i) { 67 | NextExample(&precalc_examples_[i].input, &precalc_examples_[i].label); 68 | } 69 | } 70 | 71 | void Compute(OpKernelContext* ctx) override { 72 | Tensor words_per_epoch(DT_INT64, TensorShape({})); 73 | Tensor current_epoch(DT_INT32, TensorShape({})); 74 | Tensor total_words_processed(DT_INT64, TensorShape({})); 75 | Tensor examples(DT_INT32, TensorShape({batch_size_})); 76 | auto Texamples = examples.flat<int32>(); 77 | Tensor labels(DT_INT32, TensorShape({batch_size_})); 78 | auto Tlabels = labels.flat<int32>(); 79 | { 80 | mutex_lock l(mu_); 81 | for (int i = 0; i < batch_size_; ++i) { 82 | Texamples(i) = precalc_examples_[precalc_index_].input; 83 | Tlabels(i) = precalc_examples_[precalc_index_].label; 84 | precalc_index_++; 85 | if (precalc_index_ >= kPrecalc) { 86 | precalc_index_ = 0; 87 | for (int j = 0; j < kPrecalc; ++j) { 88 |
NextExample(&precalc_examples_[j].input, 89 | &precalc_examples_[j].label); 90 | } 91 | } 92 | } 93 | words_per_epoch.scalar<int64>()() = corpus_size_; 94 | current_epoch.scalar<int32>()() = current_epoch_; 95 | total_words_processed.scalar<int64>()() = total_words_processed_; 96 | } 97 | ctx->set_output(0, word_); 98 | ctx->set_output(1, freq_); 99 | ctx->set_output(2, words_per_epoch); 100 | ctx->set_output(3, current_epoch); 101 | ctx->set_output(4, total_words_processed); 102 | ctx->set_output(5, examples); 103 | ctx->set_output(6, labels); 104 | } 105 | 106 | private: 107 | struct Example { 108 | int32 input; 109 | int32 label; 110 | }; 111 | 112 | int32 batch_size_ = 0; 113 | int32 window_size_ = 5; 114 | float subsample_ = 1e-3; 115 | int min_count_ = 5; 116 | int32 vocab_size_ = 0; 117 | Tensor word_; 118 | Tensor freq_; 119 | int32 corpus_size_ = 0; 120 | std::vector<int32> corpus_; 121 | std::vector<Example> precalc_examples_; 122 | int precalc_index_ = 0; 123 | std::vector<int32> sentence_; 124 | int sentence_index_ = 0; 125 | 126 | mutex mu_; 127 | random::PhiloxRandom philox_ GUARDED_BY(mu_); 128 | random::SimplePhilox rng_ GUARDED_BY(mu_); 129 | int32 current_epoch_ GUARDED_BY(mu_) = -1; 130 | int64 total_words_processed_ GUARDED_BY(mu_) = 0; 131 | int32 example_pos_ GUARDED_BY(mu_); 132 | int32 label_pos_ GUARDED_BY(mu_); 133 | int32 label_limit_ GUARDED_BY(mu_); 134 | 135 | // {example_pos_, label_pos_} is the cursor for the next example. 136 | // example_pos_ wraps around at the end of corpus_. For each 137 | // example, we randomly generate [label_pos_, label_limit_) for 138 | // labels.
139 | void NextExample(int32* example, int32* label) EXCLUSIVE_LOCKS_REQUIRED(mu_) { 140 | while (true) { 141 | if (label_pos_ >= label_limit_) { 142 | ++total_words_processed_; 143 | ++sentence_index_; 144 | if (sentence_index_ >= kSentenceSize) { 145 | sentence_index_ = 0; 146 | for (int i = 0; i < kSentenceSize; ++i, ++example_pos_) { 147 | if (example_pos_ >= corpus_size_) { 148 | ++current_epoch_; 149 | example_pos_ = 0; 150 | } 151 | if (subsample_ > 0) { 152 | int32 word_freq = freq_.flat<int32>()(corpus_[example_pos_]); 153 | // See Eq. 5 in http://arxiv.org/abs/1310.4546 154 | float keep_prob = 155 | (std::sqrt(word_freq / (subsample_ * corpus_size_)) + 1) * 156 | (subsample_ * corpus_size_) / word_freq; 157 | if (rng_.RandFloat() > keep_prob) { 158 | i--; 159 | continue; 160 | } 161 | } 162 | sentence_[i] = corpus_[example_pos_]; 163 | } 164 | } 165 | const int32 skip = 1 + rng_.Uniform(window_size_); 166 | label_pos_ = std::max<int32>(0, sentence_index_ - skip); 167 | label_limit_ = 168 | std::min<int32>(kSentenceSize, sentence_index_ + skip + 1); 169 | } 170 | if (sentence_index_ != label_pos_) { 171 | break; 172 | } 173 | ++label_pos_; 174 | } 175 | *example = sentence_[sentence_index_]; 176 | *label = sentence_[label_pos_++]; 177 | } 178 | 179 | Status Init(Env* env, const string& filename) { 180 | string data; 181 | TF_RETURN_IF_ERROR(ReadFileToString(env, filename, &data)); 182 | StringPiece input = data; 183 | string w; 184 | corpus_size_ = 0; 185 | std::unordered_map<string, int32> word_freq; 186 | while (ScanWord(&input, &w)) { 187 | ++(word_freq[w]); 188 | ++corpus_size_; 189 | } 190 | if (corpus_size_ < window_size_ * 10) { 191 | return errors::InvalidArgument("The text file ", filename, 192 | " contains too little data: ", 193 | corpus_size_, " words"); 194 | } 195 | typedef std::pair<string, int32> WordFreq; 196 | std::vector<WordFreq> ordered; 197 | for (const auto& p : word_freq) { 198 | if (p.second >= min_count_) ordered.push_back(p); 199 | } 200 | LOG(INFO) << "Data file: " << filename << "
contains " << data.size() 201 | << " bytes, " << corpus_size_ << " words, " << word_freq.size() 202 | << " unique words, " << ordered.size() 203 | << " unique frequent words."; 204 | word_freq.clear(); 205 | std::sort(ordered.begin(), ordered.end(), 206 | [](const WordFreq& x, const WordFreq& y) { 207 | return x.second > y.second; 208 | }); 209 | vocab_size_ = static_cast<int32>(1 + ordered.size()); 210 | Tensor word(DT_STRING, TensorShape({vocab_size_})); 211 | Tensor freq(DT_INT32, TensorShape({vocab_size_})); 212 | word.flat<string>()(0) = "UNK"; 213 | static const int32 kUnkId = 0; 214 | std::unordered_map<string, int32> word_id; 215 | int64 total_counted = 0; 216 | for (std::size_t i = 0; i < ordered.size(); ++i) { 217 | const auto& w = ordered[i].first; 218 | auto id = i + 1; 219 | word.flat<string>()(id) = w; 220 | auto word_count = ordered[i].second; 221 | freq.flat<int32>()(id) = word_count; 222 | total_counted += word_count; 223 | word_id[w] = id; 224 | } 225 | freq.flat<int32>()(kUnkId) = corpus_size_ - total_counted; 226 | word_ = word; 227 | freq_ = freq; 228 | corpus_.reserve(corpus_size_); 229 | input = data; 230 | while (ScanWord(&input, &w)) { 231 | corpus_.push_back(gtl::FindWithDefault(word_id, w, kUnkId)); 232 | } 233 | precalc_examples_.resize(kPrecalc); 234 | sentence_.resize(kSentenceSize); 235 | return Status::OK(); 236 | } 237 | }; 238 | 239 | REGISTER_KERNEL_BUILDER(Name("Skipgram").Device(DEVICE_CPU), SkipgramOp); 240 | 241 | class NegTrainOp : public OpKernel { 242 | public: 243 | explicit NegTrainOp(OpKernelConstruction* ctx) : OpKernel(ctx) { 244 | base_.Init(0, 0); 245 | 246 | OP_REQUIRES_OK(ctx, ctx->GetAttr("num_negative_samples", &num_samples_)); 247 | 248 | std::vector<int32> vocab_count; 249 | OP_REQUIRES_OK(ctx, ctx->GetAttr("vocab_count", &vocab_count)); 250 | 251 | std::vector<float> vocab_weights; 252 | vocab_weights.reserve(vocab_count.size()); 253 | for (const auto& f : vocab_count) { 254 | float r = std::pow(static_cast<float>(f), 0.75f); 255 | vocab_weights.push_back(r); 256 | } 257 | sampler_
= new random::DistributionSampler(vocab_weights); 258 | } 259 | 260 | ~NegTrainOp() { delete sampler_; } 261 | 262 | void Compute(OpKernelContext* ctx) override { 263 | Tensor w_in = ctx->mutable_input(0, false); 264 | OP_REQUIRES(ctx, TensorShapeUtils::IsMatrix(w_in.shape()), 265 | errors::InvalidArgument("Must be a matrix")); 266 | Tensor w_out = ctx->mutable_input(1, false); 267 | OP_REQUIRES(ctx, w_in.shape() == w_out.shape(), 268 | errors::InvalidArgument("w_in.shape == w_out.shape")); 269 | const Tensor& examples = ctx->input(2); 270 | OP_REQUIRES(ctx, TensorShapeUtils::IsVector(examples.shape()), 271 | errors::InvalidArgument("Must be a vector")); 272 | const Tensor& labels = ctx->input(3); 273 | OP_REQUIRES(ctx, examples.shape() == labels.shape(), 274 | errors::InvalidArgument("examples.shape == labels.shape")); 275 | const Tensor& learning_rate = ctx->input(4); 276 | OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(learning_rate.shape()), 277 | errors::InvalidArgument("Must be a scalar")); 278 | 279 | auto Tw_in = w_in.matrix<float>(); 280 | auto Tw_out = w_out.matrix<float>(); 281 | auto Texamples = examples.flat<int32>(); 282 | auto Tlabels = labels.flat<int32>(); 283 | auto lr = learning_rate.scalar<float>()(); 284 | const int64 vocab_size = w_in.dim_size(0); 285 | const int64 dims = w_in.dim_size(1); 286 | const int64 batch_size = examples.dim_size(0); 287 | OP_REQUIRES(ctx, vocab_size == sampler_->num(), 288 | errors::InvalidArgument("vocab_size mismatches: ", vocab_size, 289 | " vs. ", sampler_->num())); 290 | 291 | // Gradient accumulator for v_in. 292 | Tensor buf(DT_FLOAT, TensorShape({dims})); 293 | auto Tbuf = buf.flat<float>(); 294 | 295 | // Scalar buffer to hold sigmoid(+/- dot). 296 | Tensor g_buf(DT_FLOAT, TensorShape({})); 297 | auto g = g_buf.scalar<float>(); 298 | 299 | // The following loop needs 2 random 32-bit values per negative 300 | // sample. We reserve 8 values per sample just in case the 301 | // underlying implementation changes.
302 | auto rnd = base_.ReserveSamples32(batch_size * num_samples_ * 8); 303 | random::SimplePhilox srnd(&rnd); 304 | 305 | for (int64 i = 0; i < batch_size; ++i) { 306 | const int32 example = Texamples(i); 307 | DCHECK(0 <= example && example < vocab_size) << example; 308 | const int32 label = Tlabels(i); 309 | DCHECK(0 <= label && label < vocab_size) << label; 310 | auto v_in = Tw_in.chip<0>(example); 311 | 312 | // Positive: example predicts label. 313 | // forward: x = v_in' * v_out 314 | // l = log(sigmoid(x)) 315 | // backward: dl/dx = g = sigmoid(-x) 316 | // dl/d(v_in) = g * v_out' 317 | // dl/d(v_out) = v_in' * g 318 | { 319 | auto v_out = Tw_out.chip<0>(label); 320 | auto dot = (v_in * v_out).sum(); 321 | g = (dot.exp() + 1.f).inverse(); 322 | Tbuf = v_out * (g() * lr); 323 | v_out += v_in * (g() * lr); 324 | } 325 | 326 | // Negative samples: 327 | // forward: x = v_in' * v_sample 328 | // l = log(sigmoid(-x)) 329 | // backward: dl/dx = g = -sigmoid(x) 330 | // dl/d(v_in) = g * v_out' 331 | // dl/d(v_out) = v_in' * g 332 | for (int j = 0; j < num_samples_; ++j) { 333 | const int sample = sampler_->Sample(&srnd); 334 | if (sample == label) continue; // Skip. 335 | auto v_sample = Tw_out.chip<0>(sample); 336 | auto dot = (v_in * v_sample).sum(); 337 | g = -((-dot).exp() + 1.f).inverse(); 338 | Tbuf += v_sample * (g() * lr); 339 | v_sample += v_in * (g() * lr); 340 | } 341 | 342 | // Applies the gradient on v_in. 343 | v_in += Tbuf; 344 | } 345 | } 346 | 347 | private: 348 | int32 num_samples_ = 0; 349 | random::DistributionSampler* sampler_ = nullptr; 350 | GuardedPhiloxRandom base_; 351 | }; 352 | 353 | REGISTER_KERNEL_BUILDER(Name("NegTrain").Device(DEVICE_CPU), NegTrainOp); 354 | 355 | } // end namespace tensorflow 356 | -------------------------------------------------------------------------------- /word2vector/word2vec_ops.cc: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. 
All Rights Reserved. 2 | 3 | Licensed under the Apache License, Version 2.0 (the "License"); 4 | you may not use this file except in compliance with the License. 5 | You may obtain a copy of the License at 6 | 7 | http://www.apache.org/licenses/LICENSE-2.0 8 | 9 | Unless required by applicable law or agreed to in writing, software 10 | distributed under the License is distributed on an "AS IS" BASIS, 11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | See the License for the specific language governing permissions and 13 | limitations under the License. 14 | ==============================================================================*/ 15 | 16 | #include "tensorflow/core/framework/op.h" 17 | 18 | namespace tensorflow { 19 | 20 | REGISTER_OP("Skipgram") 21 | .Output("vocab_word: string") 22 | .Output("vocab_freq: int32") 23 | .Output("words_per_epoch: int64") 24 | .Output("current_epoch: int32") 25 | .Output("total_words_processed: int64") 26 | .Output("examples: int32") 27 | .Output("labels: int32") 28 | .SetIsStateful() 29 | .Attr("filename: string") 30 | .Attr("batch_size: int") 31 | .Attr("window_size: int = 5") 32 | .Attr("min_count: int = 5") 33 | .Attr("subsample: float = 1e-3") 34 | .Doc(R"doc( 35 | Parses a text file and creates a batch of examples. 36 | 37 | vocab_word: A vector of words in the corpus. 38 | vocab_freq: Frequencies of words. Sorted in the non-ascending order. 39 | words_per_epoch: Number of words per epoch in the data file. 40 | current_epoch: The current epoch number. 41 | total_words_processed: The total number of words processed so far. 42 | examples: A vector of word ids. 43 | labels: A vector of word ids. 44 | filename: The corpus's text file name. 45 | batch_size: The size of produced batch. 46 | window_size: The number of words to predict to the left and right of the target. 47 | min_count: The minimum number of word occurrences for it to be included in the 48 | vocabulary. 
49 | subsample: Threshold for word occurrence. Words that appear with higher 50 | frequency will be randomly down-sampled. Set to 0 to disable. 51 | )doc"); 52 | 53 | REGISTER_OP("NegTrain") 54 | .Input("w_in: Ref(float)") 55 | .Input("w_out: Ref(float)") 56 | .Input("examples: int32") 57 | .Input("labels: int32") 58 | .Input("lr: float") 59 | .SetIsStateful() 60 | .Attr("vocab_count: list(int)") 61 | .Attr("num_negative_samples: int") 62 | .Doc(R"doc( 63 | Training via negative sampling. 64 | 65 | w_in: input word embedding. 66 | w_out: output word embedding. 67 | examples: A vector of word ids. 68 | labels: A vector of word ids. 69 | vocab_count: Count of words in the vocabulary. 70 | num_negative_samples: Number of negative samples per example. 71 | )doc"); 72 | 73 | } // end namespace tensorflow 74 | -------------------------------------------------------------------------------- /word2vector/word2vec_optimized.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Multi-threaded word2vec unbatched skip-gram model. 17 | 18 | Trains the model described in: 19 | (Mikolov, et. al.) Efficient Estimation of Word Representations in Vector Space 20 | ICLR 2013. 
http://arxiv.org/abs/1301.3781
This model does true SGD (i.e. no minibatching). To do this efficiently, custom
ops are used to sequentially process data within a 'batch'.

The key ops used are:
* skipgram custom op that does input processing.
* neg_train custom op that efficiently calculates and applies the gradient using
  true SGD.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import sys
import threading
import time

from six.moves import xrange  # pylint: disable=redefined-builtin

import numpy as np
import tensorflow as tf

from tensorflow.models.embedding import gen_word2vec as word2vec

flags = tf.app.flags

flags.DEFINE_string("save_path", None, "Directory to write the model.")
flags.DEFINE_string(
    "train_data", None,
    "Training data. E.g., unzipped file http://mattmahoney.net/dc/text8.zip.")
flags.DEFINE_string(
    "eval_data", None, "Analogy questions. "
    "https://word2vec.googlecode.com/svn/trunk/questions-words.txt.")
flags.DEFINE_integer("embedding_size", 200, "The embedding dimension size.")
flags.DEFINE_integer(
    "epochs_to_train", 15,
    "Number of epochs to train. Each epoch processes the training data once "
    "completely.")
flags.DEFINE_float("learning_rate", 0.025, "Initial learning rate.")
flags.DEFINE_integer("num_neg_samples", 25,
                     "Negative samples per training example.")
flags.DEFINE_integer("batch_size", 500,
                     "Number of training examples each step processes "
                     "(no minibatching).")
flags.DEFINE_integer("concurrent_steps", 12,
                     "The number of concurrent training steps.")
flags.DEFINE_integer("window_size", 5,
                     "The number of words to predict to the left and right "
                     "of the target word.")
flags.DEFINE_integer("min_count", 5,
                     "The minimum number of word occurrences for it to be "
                     "included in the vocabulary.")
flags.DEFINE_float("subsample", 1e-3,
                   "Subsample threshold for word occurrence. Words that appear "
                   "with higher frequency will be randomly down-sampled. Set "
                   "to 0 to disable.")
flags.DEFINE_boolean(
    "interactive", False,
    "If true, enters an IPython interactive session to play with the trained "
    "model. E.g., try model.analogy('france', 'paris', 'russia') and "
    "model.nearby(['proton', 'elephant', 'maxwell'])")

FLAGS = flags.FLAGS


class Options(object):
  """Options used by our word2vec model."""

  def __init__(self):
    # Model options.

    # Embedding dimension.
    self.emb_dim = FLAGS.embedding_size

    # Training options.

    # The training text file.
    self.train_data = FLAGS.train_data

    # Number of negative samples per example.
    self.num_samples = FLAGS.num_neg_samples

    # The initial learning rate.
    self.learning_rate = FLAGS.learning_rate

    # Number of epochs to train. The learning rate decays linearly over
    # these epochs, after which training stops.
    self.epochs_to_train = FLAGS.epochs_to_train

    # Concurrent training steps.
    self.concurrent_steps = FLAGS.concurrent_steps

    # Number of examples for one training step.
    self.batch_size = FLAGS.batch_size

    # The number of words to predict to the left and right of the target word.
    self.window_size = FLAGS.window_size

    # The minimum number of word occurrences for it to be included in the
    # vocabulary.
    self.min_count = FLAGS.min_count

    # Subsampling threshold for word occurrence.
    self.subsample = FLAGS.subsample

    # Where to write out summaries.
    self.save_path = FLAGS.save_path

    # Eval options.

    # The text file for eval.
    self.eval_data = FLAGS.eval_data


class Word2Vec(object):
  """Word2Vec model (Skipgram)."""

  def __init__(self, options, session):
    self._options = options
    self._session = session
    self._word2id = {}
    self._id2word = []
    self.build_graph()
    self.build_eval_graph()
    self.save_vocab()
    self._read_analogies()

  def _read_analogies(self):
    """Reads through the analogy question file.

    Stores the result in self._analogy_questions, a [n, 4] numpy array
    containing the word ids of each analogy question. Questions that contain
    unknown words, or that do not have exactly four words, are skipped and
    counted.
    """
    questions = []
    questions_skipped = 0
    with open(self._options.eval_data, "rb") as analogy_f:
      for line in analogy_f:
        if line.startswith(b":"):  # Skip comments.
          continue
        words = line.strip().lower().split(b" ")
        ids = [self._word2id.get(w.strip()) for w in words]
        if None in ids or len(ids) != 4:
          questions_skipped += 1
        else:
          questions.append(np.array(ids))
    print("Eval analogy file: ", self._options.eval_data)
    print("Questions: ", len(questions))
    print("Skipped: ", questions_skipped)
    self._analogy_questions = np.array(questions, dtype=np.int32)

  def build_graph(self):
    """Build the model graph."""
    opts = self._options

    # The training data. A text file.
    (words, counts, words_per_epoch, current_epoch, total_words_processed,
     examples, labels) = word2vec.skipgram(filename=opts.train_data,
                                           batch_size=opts.batch_size,
                                           window_size=opts.window_size,
                                           min_count=opts.min_count,
                                           subsample=opts.subsample)
    (opts.vocab_words, opts.vocab_counts,
     opts.words_per_epoch) = self._session.run([words, counts, words_per_epoch])
    opts.vocab_size = len(opts.vocab_words)
    print("Data file: ", opts.train_data)
    print("Vocab size: ", opts.vocab_size - 1, " + UNK")
    print("Words per epoch: ", opts.words_per_epoch)

    self._id2word = opts.vocab_words
    for i, w in enumerate(self._id2word):
      self._word2id[w] = i

    # Declare all variables we need.
    # Input words embedding: [vocab_size, emb_dim]
    w_in = tf.Variable(
        tf.random_uniform(
            [opts.vocab_size,
             opts.emb_dim], -0.5 / opts.emb_dim, 0.5 / opts.emb_dim),
        name="w_in")

    # Output words embedding: [vocab_size, emb_dim]
    w_out = tf.Variable(tf.zeros([opts.vocab_size, opts.emb_dim]), name="w_out")

    # Global step: scalar, i.e., shape [].
    global_step = tf.Variable(0, name="global_step")

    # Linear learning rate decay.
    words_to_train = float(opts.words_per_epoch * opts.epochs_to_train)
    lr = opts.learning_rate * tf.maximum(
        0.0001,
        1.0 - tf.cast(total_words_processed, tf.float32) / words_to_train)

    # Training nodes.
    inc = global_step.assign_add(1)
    with tf.control_dependencies([inc]):
      train = word2vec.neg_train(w_in,
                                 w_out,
                                 examples,
                                 labels,
                                 lr,
                                 vocab_count=opts.vocab_counts.tolist(),
                                 num_negative_samples=opts.num_samples)

    self._w_in = w_in
    self._examples = examples
    self._labels = labels
    self._lr = lr
    self._train = train
    self.step = global_step
    self._epoch = current_epoch
    self._words = total_words_processed

  def save_vocab(self):
    """Save the vocabulary to a file so the model can be reloaded."""
    opts = self._options
    with open(os.path.join(opts.save_path, "vocab.txt"), "w") as f:
      for i in xrange(opts.vocab_size):
        f.write("%s %d\n" % (tf.compat.as_text(opts.vocab_words[i]),
                             opts.vocab_counts[i]))

  def build_eval_graph(self):
    """Build the evaluation graph."""
    # Eval graph
    opts = self._options

    # Each analogy task is to predict the 4th word (d) given three
    # words: a, b, c.  E.g., a=italy, b=rome, c=france, we should
    # predict d=paris.

    # The eval feeds three vectors of word ids for a, b, c, each of
    # which is of size N, where N is the number of analogies we want to
    # evaluate in one batch.
    analogy_a = tf.placeholder(dtype=tf.int32)  # [N]
    analogy_b = tf.placeholder(dtype=tf.int32)  # [N]
    analogy_c = tf.placeholder(dtype=tf.int32)  # [N]

    # Normalized word embeddings of shape [vocab_size, emb_dim].
    nemb = tf.nn.l2_normalize(self._w_in, 1)

    # Each row of a_emb, b_emb, c_emb is a word's embedding vector.
    # They all have the shape [N, emb_dim]
    a_emb = tf.gather(nemb, analogy_a)  # a's embs
    b_emb = tf.gather(nemb, analogy_b)  # b's embs
    c_emb = tf.gather(nemb, analogy_c)  # c's embs

    # We expect d's embedding vector on the unit hyper-sphere to be
    # near: c_emb + (b_emb - a_emb), which has the shape [N, emb_dim].
    target = c_emb + (b_emb - a_emb)

    # Compute the similarity (dot product against the normalized embeddings)
    # between each target and every vocab word.
    # dist has shape [N, vocab_size].
    dist = tf.matmul(target, nemb, transpose_b=True)

    # For each question (row in dist), find the top 4 words.
    _, pred_idx = tf.nn.top_k(dist, 4)

    # Nodes for computing neighbors for a given word according to
    # their cosine similarity.
    nearby_word = tf.placeholder(dtype=tf.int32)  # word id
    nearby_emb = tf.gather(nemb, nearby_word)
    nearby_dist = tf.matmul(nearby_emb, nemb, transpose_b=True)
    nearby_val, nearby_idx = tf.nn.top_k(nearby_dist,
                                         min(1000, opts.vocab_size))

    # Nodes in the constructed graph which are used by training and
    # evaluation to run/feed/fetch.
    self._analogy_a = analogy_a
    self._analogy_b = analogy_b
    self._analogy_c = analogy_c
    self._analogy_pred_idx = pred_idx
    self._nearby_word = nearby_word
    self._nearby_val = nearby_val
    self._nearby_idx = nearby_idx

    # Properly initialize all variables.
    tf.initialize_all_variables().run()

    self.saver = tf.train.Saver()

  def _train_thread_body(self):
    initial_epoch, = self._session.run([self._epoch])
    while True:
      _, epoch = self._session.run([self._train, self._epoch])
      if epoch != initial_epoch:
        break

  def train(self):
    """Train the model."""
    opts = self._options

    initial_epoch, initial_words = self._session.run([self._epoch, self._words])

    workers = []
    for _ in xrange(opts.concurrent_steps):
      t = threading.Thread(target=self._train_thread_body)
      t.start()
      workers.append(t)

    last_words, last_time = initial_words, time.time()
    while True:
      time.sleep(5)  # Report our progress once in a while.
      (epoch, step, words,
       lr) = self._session.run([self._epoch, self.step, self._words, self._lr])
      now = time.time()
      last_words, last_time, rate = words, now, (words - last_words) / (
          now - last_time)
      print("Epoch %4d Step %8d: lr = %5.3f words/sec = %8.0f\r" %
            (epoch, step, lr, rate), end="")
      sys.stdout.flush()
      if epoch != initial_epoch:
        break

    for t in workers:
      t.join()

  def _predict(self, analogy):
    """Predict the top 4 answers for analogy questions."""
    idx, = self._session.run([self._analogy_pred_idx], {
        self._analogy_a: analogy[:, 0],
        self._analogy_b: analogy[:, 1],
        self._analogy_c: analogy[:, 2]
    })
    return idx

  def eval(self):
    """Evaluate analogy questions and report accuracy."""

    # How many questions we get right at precision@1.
    correct = 0

    total = self._analogy_questions.shape[0]
    start = 0
    while start < total:
      limit = start + 2500
      sub = self._analogy_questions[start:limit, :]
      idx = self._predict(sub)
      start = limit
      for question in xrange(sub.shape[0]):
        for j in xrange(4):
          if idx[question, j] == sub[question, 3]:
            # Bingo! We predicted correctly. E.g., [italy, rome, france, paris].
            correct += 1
            break
          elif idx[question, j] in sub[question, :3]:
            # We need to skip words already in the question.
            continue
          else:
            # The top remaining prediction is not the correct label.
            break
    print()
    print("Eval %4d/%d accuracy = %4.1f%%" % (correct, total,
                                              correct * 100.0 / total))

  def analogy(self, w0, w1, w2):
    """Predict word w3 as in w0:w1 vs w2:w3."""
    wid = np.array([[self._word2id.get(w, 0) for w in [w0, w1, w2]]])
    idx = self._predict(wid)
    for c in [self._id2word[i] for i in idx[0, :]]:
      if c not in [w0, w1, w2]:
        return c
    return "unknown"

  def nearby(self, words, num=20):
    """Prints out nearby words given a list of words."""
    ids = np.array([self._word2id.get(x, 0) for x in words])
    vals, idx = self._session.run(
        [self._nearby_val, self._nearby_idx], {self._nearby_word: ids})
    for i in xrange(len(words)):
      print("\n%s\n=====================================" % (words[i]))
      for (neighbor, distance) in zip(idx[i, :num], vals[i, :num]):
        print("%-20s %6.4f" % (self._id2word[neighbor], distance))


def _start_shell(local_ns=None):
  # An interactive shell is useful for debugging/development.
  import IPython
  user_ns = {}
  if local_ns:
    user_ns.update(local_ns)
  user_ns.update(globals())
  IPython.start_ipython(argv=[], user_ns=user_ns)


def main(_):
  """Train a word2vec model."""
  if not FLAGS.train_data or not FLAGS.eval_data or not FLAGS.save_path:
    print("--train_data --eval_data and --save_path must be specified.")
    sys.exit(1)
  opts = Options()
  with tf.Graph().as_default(), tf.Session() as session:
    with tf.device("/cpu:0"):
      model = Word2Vec(opts, session)
    for _ in xrange(opts.epochs_to_train):
      model.train()  # Process one epoch
      model.eval()  # Eval analogies.
    # Perform a final save.
    model.saver.save(session, os.path.join(opts.save_path, "model.ckpt"),
                     global_step=model.step)
    if FLAGS.interactive:
      # E.g.,
      # [0]: model.analogy('france', 'paris', 'russia')
      # [1]: model.nearby(['proton', 'elephant', 'maxwell'])
      _start_shell(locals())


if __name__ == "__main__":
  tf.app.run()

--------------------------------------------------------------------------------
/word2vector/words_calculator_server.py:
--------------------------------------------------------------------------------
import sys
import BaseHTTPServer
import numpy
from sklearn.metrics.pairwise import cosine_similarity
from urlparse import urlparse

v = []        # Normalized embedding vectors, indexed by word id.
reverse = {}  # Word id -> word.
dic = {}      # Word -> word id.

def isword(s):
    # True if s contains at least one lowercase ASCII letter.
    for i in range(len(s)):
        if s[i] <= 'z' and s[i] >= 'a':
            return True
    return False

class SimpleHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        global v
        global reverse
        global dic

        query = urlparse(self.path).query
        query_components = dict(qc.split("=") for qc in query.split("&"))
        a = query_components["a"]
        b = query_components["b"]
        c = query_components["c"]

        print "Get request with: ", a, b, c
        # Report the first unknown word and stop here; the vector lookup
        # below would otherwise raise a KeyError.
        for word in (a, b, c):
            if word not in dic:
                self.send_response(200)
                self.send_header('Content-type', 'text/html')
                self.end_headers()
                self.wfile.write("Word " + word + " is not in the dictionary")
                return

        cur = v[dic[a]] - v[dic[b]] + v[dic[c]]
        sim = cosine_similarity([cur], v)[0]
        x = numpy.argsort(-sim)
        # Build an ordered list of the five closest words, skipping the
        # query words a and c.
        content = "<ol>\n"
        i = 0
        j = 0
        while j < 5:
            if reverse[x[i]] != a and reverse[x[i]] != c:
                if j == 0:
                    # The top match gets its own markup. The exact tags were
                    # lost from this listing; bold is an assumption.
                    content += "<li><b>" + reverse[x[i]] + "</b></li>\n"
                else:
                    content += "<li>" + reverse[x[i]] + "</li>\n"
                j += 1
            i += 1
        content += "</ol>"
        print "Response: " + content

        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        self.wfile.write(content)

fname = sys.argv[1]
with open(fname) as fp:
    print "Loading Word2Vec ..."
    i = 0
    for line in fp:
        paras = line.split(" ")
        if not isword(paras[0]): continue
        dic[paras[0]] = i
        reverse[i] = paras[0]
        a = numpy.asarray(map(float, paras[1:]))
        # Store unit-length vectors so cosine similarity is well scaled.
        v.append(a / numpy.sqrt(sum(numpy.square(a))))
        i += 1
    print "Word2Vec dictionary loaded!"

Handler = SimpleHTTPRequestHandler
Server = BaseHTTPServer.HTTPServer
Protocol = "HTTP/1.0"
if sys.argv[2:]:
    port = int(sys.argv[2])
else:
    port = 8000
server_address = ('127.0.0.1', port)
Handler.protocol_version = Protocol
httpd = Server(server_address, Handler)
print("Serving HTTP")
httpd.serve_forever()

--------------------------------------------------------------------------------