├── .github
    ├── ISSUE_TEMPLATE
    │   ├── bug_report.md
    │   └── feature_request.md
    └── PULL_REQUEST_TEMPLATE.md
├── .gitignore
├── Basic Classifier.ipynb
├── Basic Distributed Classifier.ipynb
├── Basic Federated Classifier.ipynb
├── LICENSE
├── README.md
├── advanced_classifier.py
├── advanced_distributed_classifier.py
├── advanced_federated_classifier.py
├── basic_classifier.py
├── basic_distributed_classifier.py
├── basic_federated_classifier.py
├── federated-MPI
    ├── README.md
    ├── mpi_advanced_classifier.py
    └── mpi_basic_classifier.py
├── federated-keras
    ├── README.md
    ├── keras_distributed_classifier.py
    └── keras_federated_classifier.py
├── federated-sockets
    ├── FederatedHook.py
    ├── README.md
    ├── advanced_socket_fed_classifier.py
    ├── basic_socket_fed_classifier.py
    └── config.py
├── federated_averaging_optimizer.py
└── images
    ├── Logo_Acuratio.png
    ├── colab_logo.png
    ├── comindorg_logo.png
    ├── graph_tensorboard.png
    ├── slack_logo.jpg
    └── telegram_logo.jpg


/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Bug report
 3 | about: Create a report to help us improve
 4 | 
 5 | ---
 6 | 
 7 | **Describe the bug**
 8 | A clear and concise description of what the bug is.
 9 | 
10 | **To Reproduce**
11 | Steps to reproduce the behavior:
12 | 1. Go to '...'
13 | 2. Click on '....'
14 | 3. Scroll down to '....'
15 | 4. See error
16 | 
17 | **Expected behavior**
18 | A clear and concise description of what you expected to happen.
19 | 
20 | **Screenshots**
21 | If applicable, add screenshots to help explain your problem.
22 | 
23 | **Desktop (please complete the following information):**
24 |  - OS: [e.g. iOS]
25 |  - Browser [e.g. chrome, safari]
26 |  - Version [e.g. 22]
27 | 
28 | **Smartphone (please complete the following information):**
29 |  - Device: [e.g. iPhone6]
30 |  - OS: [e.g. iOS8.1]
31 |  - Browser [e.g. stock browser, safari]
32 |  - Version [e.g. 22]
33 | 
34 | **Additional context**
35 | Add any other context about the problem here.
36 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Feature request
 3 | about: Suggest an idea for this project
 4 | 
 5 | ---
 6 | 
 7 | **Is your feature request related to a problem? Please describe.**
 8 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
 9 | 
10 | **Describe the solution you'd like**
11 | A clear and concise description of what you want to happen.
12 | 
13 | **Describe alternatives you've considered**
14 | A clear and concise description of any alternative solutions or features you've considered.
15 | 
16 | **Additional context**
17 | Add any other context or screenshots about the feature request here.
18 | 


--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
 1 | IMPORTANT: Please create an issue for your Pull Request.
 2 | 
 3 | Please provide enough information so that others can review your pull request:
 4 | 
 5 | Explain the details for making this change. What existing problem does the pull request solve?
 6 | 
 7 | Closing issues
 8 | 
 9 | Put closes #XXXX in your comment to auto-close the issue that your PR fixes (if such).
10 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | *.egg-info/
 24 | .installed.cfg
 25 | *.egg
 26 | MANIFEST
 27 | 
 28 | # PyInstaller
 29 | #  Usually these files are written by a python script from a template
 30 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 31 | *.manifest
 32 | *.spec
 33 | 
 34 | # Installer logs
 35 | pip-log.txt
 36 | pip-delete-this-directory.txt
 37 | 
 38 | # Unit test / coverage reports
 39 | htmlcov/
 40 | .tox/
 41 | .coverage
 42 | .coverage.*
 43 | .cache
 44 | nosetests.xml
 45 | coverage.xml
 46 | *.cover
 47 | .hypothesis/
 48 | .pytest_cache/
 49 | 
 50 | # Translations
 51 | *.mo
 52 | *.pot
 53 | 
 54 | # Django stuff:
 55 | *.log
 56 | local_settings.py
 57 | db.sqlite3
 58 | 
 59 | # Flask stuff:
 60 | instance/
 61 | .webassets-cache
 62 | 
 63 | # Scrapy stuff:
 64 | .scrapy
 65 | 
 66 | # Sphinx documentation
 67 | docs/_build/
 68 | 
 69 | # PyBuilder
 70 | target/
 71 | 
 72 | # Jupyter Notebook
 73 | .ipynb_checkpoints
 74 | 
 75 | # pyenv
 76 | .python-version
 77 | 
 78 | # celery beat schedule file
 79 | celerybeat-schedule
 80 | 
 81 | # SageMath parsed files
 82 | *.sage.py
 83 | 
 84 | # Environments
 85 | .env
 86 | .venv
 87 | env/
 88 | venv/
 89 | ENV/
 90 | env.bak/
 91 | venv.bak/
 92 | 
 93 | # Spyder project settings
 94 | .spyderproject
 95 | .spyproject
 96 | 
 97 | # Rope project settings
 98 | .ropeproject
 99 | 
100 | # mkdocs documentation
101 | /site
102 | 
103 | # mypy
104 | .mypy_cache/
105 | 


--------------------------------------------------------------------------------
/Basic Federated Classifier.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Basic federated classifier with TensorFlow"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "The code in this notebook is copyright 2018 <a href='https://comind.org/'>coMind</a>. Licensed under the Apache License, Version 2.0; you may not use this code except in compliance with the License. You may obtain a copy of the <a href='http://www.apache.org/licenses/LICENSE-2.0'>License</a>.\n",
 15 |     "\n",
 16 |     "Join the <a href='https://comindorg.slack.com/join/shared_invite/enQtNDMxMzc0NDA5OTEwLWIyZTg5MTg1MTM4NjhiNDM4YTU1OTI1NTgwY2NkNzZjYWY1NmI0ZjIyNWJiMTNkZmRhZDg2Nzc3YTYyNGQzM2I'>conversation</a> at Slack.\n",
 17 |     "\n",
 18 |     "This a series of three tutorials you are in the last one: \n",
 19 |     "* [Basic Classifier](https://github.com/coMindOrg/federated-averaging-tutorials/blob/master/Basic%20Classifier.ipynb)\n",
 20 |     "* [Basic Distributed Classifier](https://github.com/coMindOrg/federated-averaging-tutorials/blob/master/Basic%20Distributed%20Classifier.ipynb)\n",
 21 |     "* [Basic Federated Classifier](https://github.com/coMindOrg/federated-averaging-tutorials/blob/master/Basic%20Federated%20Classifier.ipynb)\n",
 22 |     "\n",
 23 |     "In this tutorial we will see how to train a model using federated averaging.\n",
 24 |     "\n",
 25 |     "To begin a brief explanation of what it means to train using federated averaging with respect to training using a SyncReplicasOptimizer.\n",
 26 |     "\n",
 27 |     "In the previous tutorial, we explained that with SyncReplicasOptimizer each worker generated a gradient for its weights and wrote it to the parameter server. The chief read those gradients (including its own), it averaged them and updated the shared model.\n",
 28 |     "\n",
 29 |     "This time each worker will be updating its weights locally, as if it were the only one training. Every certain number of steps it will send its weights (not the gradients, but the weights themselves) to the parameter server. The chief will read the weights from there, it will average and write them again to the parameter server so that all the workers can overwrite theirs."
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "metadata": {},
 35 |    "source": [
 36 |     "The entire first part of the code is the same as the distributed classifier tutorial.\n",
 37 |     "\n",
 38 |     "Two differences only:\n",
 39 |     "\n",
 40 |     "- This time we also import __federated_average_optimizer__, the library with which we can federalize learning.\n",
 41 |     "- On the other hand we define the variable __INTERVAL_STEPS__. Every how many steps we will perform the average of the weights. Put another way, how many steps will each worker make in local before writing their weights in the parameter server and overwriting them with the average that the chief has made."
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": null,
 47 |    "metadata": {},
 48 |    "outputs": [],
 49 |    "source": [
 50 |     "# TensorFlow and tf.keras\n",
 51 |     "import tensorflow as tf\n",
 52 |     "from tensorflow import keras\n",
 53 |     "\n",
 54 |     "# Helper libraries\n",
 55 |     "import os\n",
 56 |     "import numpy as np\n",
 57 |     "from time import time\n",
 58 |     "import matplotlib.pyplot as plt\n",
 59 |     "import federated_averaging_optimizer\n",
 60 |     "\n",
 61 |     "flags = tf.app.flags\n",
 62 |     "flags.DEFINE_integer(\"task_index\", None,\n",
 63 |     "                     \"Worker task index, should be >= 0. task_index=0 is \"\n",
 64 |     "                     \"the master worker task that performs the variable \"\n",
 65 |     "                     \"initialization \")\n",
 66 |     "flags.DEFINE_string(\"ps_hosts\", \"localhost:2222\",\n",
 67 |     "                    \"Comma-separated list of hostname:port pairs\")\n",
 68 |     "flags.DEFINE_string(\"worker_hosts\", \"localhost:2223,localhost:2224\",\n",
 69 |     "                    \"Comma-separated list of hostname:port pairs\")\n",
 70 |     "flags.DEFINE_string(\"job_name\", None, \"job name: worker or ps\")\n",
 71 |     "\n",
 72 |     "BATCH_SIZE = 32\n",
 73 |     "EPOCHS = 5\n",
 74 |     "INTERVAL_STEPS = 10\n",
 75 |     "\n",
 76 |     "FLAGS = flags.FLAGS\n",
 77 |     "\n",
 78 |     "if FLAGS.job_name is None or FLAGS.job_name == \"\":\n",
 79 |     "    raise ValueError(\"Must specify an explicit `job_name`\")\n",
 80 |     "if FLAGS.task_index is None or FLAGS.task_index == \"\":\n",
 81 |     "    raise ValueError(\"Must specify an explicit `task_index`\")\n",
 82 |     "\n",
 83 |     "if FLAGS.task_index == 0:\n",
 84 |     "    print('--- GPU Disabled ---')\n",
 85 |     "    os.environ['CUDA_VISIBLE_DEVICES'] = ''\n",
 86 |     "\n",
 87 |     "#Construct the cluster and start the server\n",
 88 |     "ps_spec = FLAGS.ps_hosts.split(\",\")\n",
 89 |     "worker_spec = FLAGS.worker_hosts.split(\",\")\n",
 90 |     "\n",
 91 |     "# Get the number of workers.\n",
 92 |     "num_workers = len(worker_spec)\n",
 93 |     "print('{} workers defined'.format(num_workers))\n",
 94 |     "\n",
 95 |     "cluster = tf.train.ClusterSpec({\"ps\": ps_spec, \"worker\": worker_spec})\n",
 96 |     "\n",
 97 |     "server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)\n",
 98 |     "if FLAGS.job_name == \"ps\":\n",
 99 |     "    print('--- Parameter Server Ready ---')\n",
100 |     "    server.join()\n",
101 |     "\n",
102 |     "fashion_mnist = keras.datasets.fashion_mnist\n",
103 |     "(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()\n",
104 |     "print('Data loaded')\n",
105 |     "\n",
106 |     "class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',\n",
107 |     "               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']\n",
108 |     "\n",
109 |     "train_images = np.split(train_images, num_workers)[FLAGS.task_index]\n",
110 |     "train_labels = np.split(train_labels, num_workers)[FLAGS.task_index]\n",
111 |     "print('Local dataset size: {}'.format(train_images.shape[0]))\n",
112 |     "\n",
113 |     "train_images = train_images / 255.0\n",
114 |     "test_images = test_images / 255.0\n",
115 |     "\n",
116 |     "is_chief = (FLAGS.task_index == 0)\n",
117 |     "\n",
118 |     "checkpoint_dir='logs_dir/federated_worker_{}/{}'.format(FLAGS.task_index, time())\n",
119 |     "print('Checkpoint directory: ' + checkpoint_dir)\n",
120 |     "\n",
121 |     "worker_device = \"/job:worker/task:%d\" % FLAGS.task_index\n",
122 |     "print('Worker device: ' + worker_device + ' - is_chief: {}'.format(is_chief))"
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "markdown",
127 |    "metadata": {},
128 |    "source": [
129 |     "Here we begin the definition of the graph in the same way as it was done in the basic classifier, we explicitly place every operation in the local worker. The rest is fairly standard until we reach the definition of the optimizer."
130 |    ]
131 |   },
132 |   {
133 |    "cell_type": "code",
134 |    "execution_count": null,
135 |    "metadata": {},
136 |    "outputs": [],
137 |    "source": [
138 |     "with tf.device(worker_device)\n",
139 |     "    global_step = tf.train.get_or_create_global_step()\n",
140 |     "\n",
141 |     "    with tf.name_scope('dataset'), tf.device('/cpu:0'):\n",
142 |     "        images_placeholder = tf.placeholder(train_images.dtype, [None, train_images.shape[1], train_images.shape[2]], \n",
143 |     "                                            name='images_placeholder')\n",
144 |     "        labels_placeholder = tf.placeholder(train_labels.dtype, [None], name='labels_placeholder')\n",
145 |     "        batch_size = tf.placeholder(tf.int64, name='batch_size')\n",
146 |     "        shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')\n",
147 |     "\n",
148 |     "        dataset = tf.data.Dataset.from_tensor_slices((images_placeholder, labels_placeholder))\n",
149 |     "        dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)\n",
150 |     "        dataset = dataset.repeat(EPOCHS)\n",
151 |     "        dataset = dataset.batch(batch_size)\n",
152 |     "        iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)\n",
153 |     "        dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')\n",
154 |     "        X, y = iterator.get_next()\n",
155 |     "\n",
156 |     "    flatten_layer = tf.layers.flatten(X, name='flatten')\n",
157 |     "\n",
158 |     "    dense_layer = tf.layers.dense(flatten_layer, 128, activation=tf.nn.relu, name='relu')\n",
159 |     "\n",
160 |     "    predictions = tf.layers.dense(dense_layer, 10, activation=tf.nn.softmax, name='softmax')\n",
161 |     "\n",
162 |     "    summary_averages = tf.train.ExponentialMovingAverage(0.9)\n",
163 |     "\n",
164 |     "    with tf.name_scope('loss'):\n",
165 |     "        loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))\n",
166 |     "        loss_averages_op = summary_averages.apply([loss])\n",
167 |     "        tf.summary.scalar('cross_entropy', summary_averages.average(loss))\n",
168 |     "\n",
169 |     "    with tf.name_scope('accuracy'):\n",
170 |     "        with tf.name_scope('correct_prediction'):\n",
171 |     "            correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.cast(y, tf.int64))\n",
172 |     "        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')\n",
173 |     "        accuracy_averages_op = summary_averages.apply([accuracy])\n",
174 |     "        tf.summary.scalar('accuracy', summary_averages.average(accuracy))\n"
175 |    ]
176 |   },
177 |   {
178 |    "cell_type": "markdown",
179 |    "metadata": {},
180 |    "source": [
181 |     "We used the __replica_device_setter__ in the distributed learning to automatically choose in which device to place each defined op. Here we create it just to pass it as an argument to the custom optimizer that we have created to contain the logic of the federated averaging.\n",
182 |     "\n",
183 |     "This custom optimizer will use the __replica_device_setter__ to place a copy of each trainable variable in the ps, this new variables will store the averaged values of all the local models.\n",
184 |     "\n",
185 |     "Once this optimizer has been defined, we create the training operation and a, in the same way as we did with SyncReplicasOptimizer, a hook that will run inside the MonitoredTrainingSession, which handles the initialization."
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "code",
190 |    "execution_count": null,
191 |    "metadata": {},
192 |    "outputs": [],
193 |    "source": [
194 |     "    with tf.name_scope('train'):\n",
195 |     "        device_setter = tf.train.replica_device_setter(worker_device=worker_device, cluster=cluster)\n",
196 |     "        optimizer = federated_averaging_optimizer.FederatedAveragingOptimizer(\n",
197 |     "            tf.train.AdamOptimizer(np.sqrt(num_workers) * 0.001), \n",
198 |     "            replicas_to_aggregate=num_workers, interval_steps=INTERVAL_STEPS, is_chief=is_chief, \n",
199 |     "            device_setter=device_setter)\n",
200 |     "        with tf.control_dependencies([loss_averages_op, accuracy_averages_op]):\n",
201 |     "            train_op = optimizer.minimize(loss, global_step=global_step)\n",
202 |     "        model_average_hook = optimizer.make_session_run_hook()"
203 |    ]
204 |   },
205 |   {
206 |    "cell_type": "markdown",
207 |    "metadata": {},
208 |    "source": [
209 |     "We keep defining our hooks as usual."
210 |    ]
211 |   },
212 |   {
213 |    "cell_type": "code",
214 |    "execution_count": null,
215 |    "metadata": {},
216 |    "outputs": [],
217 |    "source": [
218 |     "n_batches = int(train_images.shape[0] / BATCH_SIZE)\n",
219 |     "last_step = int(n_batches * EPOCHS)\n",
220 |     "\n",
221 |     "print('Graph definition finished')\n",
222 |     "\n",
223 |     "sess_config = tf.ConfigProto(\n",
224 |     "    allow_soft_placement=True,\n",
225 |     "    log_device_placement=False,\n",
226 |     "    operation_timeout_in_ms=20000,\n",
227 |     "    device_filters=[\"/job:ps\",\n",
228 |     "    \"/job:worker/task:%d\" % FLAGS.task_index])\n",
229 |     "\n",
230 |     "print('Training {} batches...'.format(last_step))\n",
231 |     "\n",
232 |     "class _LoggerHook(tf.train.SessionRunHook):\n",
233 |     "    def begin(self):\n",
234 |     "        self._total_loss = 0\n",
235 |     "        self._total_acc = 0\n",
236 |     "\n",
237 |     "    def before_run(self, run_context):\n",
238 |     "        return tf.train.SessionRunArgs([loss, accuracy, global_step])\n",
239 |     "\n",
240 |     "    def after_run(self, run_context, run_values):\n",
241 |     "        loss_value, acc_value, step_value = run_values.results\n",
242 |     "        self._total_loss += loss_value\n",
243 |     "        self._total_acc += acc_value\n",
244 |     "        if (step_value + 1) % n_batches == 0 and not step_value == 0:\n",
245 |     "            print(\"Epoch {}/{} - loss: {:.4f} - acc: {:.4f}\".format(\n",
246 |     "                int(step_value / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))\n",
247 |     "            self._total_loss = 0\n",
248 |     "            self._total_acc = 0\n",
249 |     "\n",
250 |     "class _InitHook(tf.train.SessionRunHook):\n",
251 |     "    def after_create_session(self, session, coord):\n",
252 |     "        session.run(dataset_init_op, feed_dict={\n",
253 |     "            images_placeholder: train_images, labels_placeholder: train_labels, \n",
254 |     "            batch_size: BATCH_SIZE, shuffle_size: train_images.shape[0]})"
255 |    ]
256 |   },
257 |   {
258 |    "cell_type": "markdown",
259 |    "metadata": {},
260 |    "source": [
261 |     "The shared variables generated within the custom optimizer get their initialized value from their corresponding trainable variables in the local worker. Therefore their initialization ops will be unavailable out of this session even if we try to restore a saved checkpoint.\n",
262 |     "\n",
263 |     "We need to define a custom saver which ignores this shared variables. In this case, we only save the trainable_variables ."
264 |    ]
265 |   },
266 |   {
267 |    "cell_type": "code",
268 |    "execution_count": null,
269 |    "metadata": {},
270 |    "outputs": [],
271 |    "source": [
272 |     "class _SaverHook(tf.train.SessionRunHook):\n",
273 |     "    def begin(self):\n",
274 |     "        self._saver = tf.train.Saver(tf.trainable_variables())\n",
275 |     "\n",
276 |     "    def before_run(self, run_context):\n",
277 |     "        return tf.train.SessionRunArgs(global_step)\n",
278 |     "\n",
279 |     "    def after_run(self, run_context, run_values):\n",
280 |     "        step_value = run_values.results\n",
281 |     "        if step_value % n_batches == 0 and not step_value == 0:\n",
282 |     "            self._saver.save(run_context.session, checkpoint_dir+'/model.ckpt', step_value)\n",
283 |     "\n",
284 |     "    def end(self, session):\n",
285 |     "        self._saver.save(session, checkpoint_dir+'/model.ckpt', session.run(global_step))"
286 |    ]
287 |   },
288 |   {
289 |    "cell_type": "markdown",
290 |    "metadata": {},
291 |    "source": [
292 |     "The execution of the training session is standard. Notice the new hooks that we have added to the hook lists.\n",
293 |     "\n",
294 |     "WARNING! Do not define a chief worker. We need each worker to initialize their local session and train on its own!"
295 |    ]
296 |   },
297 |   {
298 |    "cell_type": "code",
299 |    "execution_count": null,
300 |    "metadata": {},
301 |    "outputs": [],
302 |    "source": [
303 |     "with tf.name_scope('monitored_session'):\n",
304 |     "    with tf.train.MonitoredTrainingSession(\n",
305 |     "            master=server.target,\n",
306 |     "            checkpoint_dir=checkpoint_dir,\n",
307 |     "            hooks=[_LoggerHook(), _InitHook(), _SaverHook(), model_average_hook],\n",
308 |     "            config=sess_config,\n",
309 |     "            stop_grace_period_secs=10,\n",
310 |     "            save_checkpoint_secs=None) as mon_sess:\n",
311 |     "        while not mon_sess.should_stop():\n",
312 |     "            mon_sess.run(train_op)"
313 |    ]
314 |   },
315 |   {
316 |    "cell_type": "markdown",
317 |    "metadata": {},
318 |    "source": [
319 |     "Finally, we evaluate the model."
320 |    ]
321 |   },
322 |   {
323 |    "cell_type": "code",
324 |    "execution_count": null,
325 |    "metadata": {},
326 |    "outputs": [],
327 |    "source": [
328 |     "if is_chief:\n",
329 |     "    print('--- Begin Evaluation ---')\n",
330 |     "    tf.reset_default_graph()\n",
331 |     "    with tf.Session() as sess:\n",
332 |     "        ckpt = tf.train.get_checkpoint_state(checkpoint_dir)\n",
333 |     "        saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)\n",
334 |     "        saver.restore(sess, ckpt.model_checkpoint_path)\n",
335 |     "        print('Model restored')\n",
336 |     "        graph = tf.get_default_graph()\n",
337 |     "        images_placeholder = graph.get_tensor_by_name('dataset/images_placeholder:0')\n",
338 |     "        labels_placeholder = graph.get_tensor_by_name('dataset/labels_placeholder:0')\n",
339 |     "        batch_size = graph.get_tensor_by_name('dataset/batch_size:0')\n",
340 |     "        accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')\n",
341 |     "        predictions = graph.get_tensor_by_name('softmax/BiasAdd:0')\n",
342 |     "        dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')\n",
343 |     "        sess.run(dataset_init_op, feed_dict={\n",
344 |     "            images_placeholder: test_images, labels_placeholder: test_labels, \n",
345 |     "            batch_size: test_images.shape[0], shuffle_size: 1})\n",
346 |     "        print('Test accuracy: {:4f}'.format(sess.run(accuracy)))\n",
347 |     "        predicted = sess.run(predictions)\n",
348 |     "\n",
349 |     "    # Plot the first 25 test images, their predicted label, and the true label\n",
350 |     "    # Color correct predictions in green, incorrect predictions in red\n",
351 |     "    plt.figure(figsize=(10, 10))\n",
352 |     "    for i in range(25):\n",
353 |     "        plt.subplot(5, 5, i + 1)\n",
354 |     "        plt.xticks([])\n",
355 |     "        plt.yticks([])\n",
356 |     "        plt.grid(False)\n",
357 |     "        plt.imshow(test_images[i], cmap=plt.cm.binary)\n",
358 |     "        predicted_label = np.argmax(predicted[i])\n",
359 |     "        true_label = test_labels[i]\n",
360 |     "        if predicted_label == true_label:\n",
361 |     "            color = 'green'\n",
362 |     "        else:\n",
363 |     "            color = 'red'\n",
364 |     "        plt.xlabel(\"{} ({})\".format(class_names[predicted_label],\n",
365 |     "                                    class_names[true_label]),\n",
366 |     "                                    color=color)\n",
367 |     "\n",
368 |     "    plt.show(True)"
369 |    ]
370 |   }
371 |  ],
372 |  "metadata": {
373 |   "kernelspec": {
374 |    "display_name": "Python 3",
375 |    "language": "python",
376 |    "name": "python3"
377 |   },
378 |   "language_info": {
379 |    "codemirror_mode": {
380 |     "name": "ipython",
381 |     "version": 3
382 |    },
383 |    "file_extension": ".py",
384 |    "mimetype": "text/x-python",
385 |    "name": "python",
386 |    "nbconvert_exporter": "python",
387 |    "pygments_lexer": "ipython3",
388 |    "version": "3.6.7"
389 |   }
390 |  },
391 |  "nbformat": 4,
392 |  "nbformat_minor": 2
393 | }
394 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # coMind: is Acuratio's open source project.
 2 | 
 3 | [<img src="https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/master/images/Logo_Acuratio.png" alt="Acuratio_Logo" width="170"/>](https://acuratio.com)
 4 | 
 5 | Check out our Multicloud Platform at [acuratio.com](https://acuratio.com)
 6 | 
 7 | This library is depecrated, contact us at hello@acuratio.com. Let us know what's your problem or use case and we'll get in touch with a privacy preserving solution.
 8 | 
 9 | Federated averaging has a set of features that makes it perfect to train models in a collaborative way while preserving the privacy of sensitive data. In this repository you can learn how to start training ML models in a federated setup.
10 | 
11 | <img src="https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/master/images/comindorg_logo.png" alt="drawing" width="150"/>
12 | 
13 | ## What can you expect to find here.
14 | 
15 | We have developed a custom optimizer for TensorFlow to easily train neural networks in a federated way (NOTE: everytime we refer to federated here, we mean federated averaging).
16 | 
17 | What is federated machine learning? In short, it is a step forward from distributed learning that can improve performance and training times. In our tutorials we explain in depth how it works, so we definitely encourage you to have a look!
18 | 
19 | In addition to this custom optimizer, you can find some tutorials and examples to help you get started with TensorFlow and federated learning. From a basic training example, where all the steps of a local classification model are shown, to more elaborated distributed and federated learning setups.
20 | 
21 | In this repository you will find 3 different types of files.
22 | 
23 | - `federated_averaging_optimizer.py` which is the custom optimizer we have created to implement federated averaging in TensorFlow.
24 | 
25 | - `basic_classifier.py`, `basic_distributed_classifier.py`, `basic_federated_classifier.py`, `advanced_classifier.py`, `advanced_distributed_classifier.py`, `advanced_federated_classifier.py` which are three basic and three advanced examples on how to train and evaluate TensorFlow models in a local, distributed and federated way.
26 | 
27 | - `Basic Classifier.ipynb`, `Basic Distributed Classifier.ipynb`, `Basic Federated Classifier.ipynb` which are three IPython Notebooks where you can find the three basic examples named above and in depth documentation to walk you through.
28 | 
29 | ## Installation dependencies
30 | 
31 | - Python 3
32 | - TensorFlow
33 | - matplotlib (for the examples and tutorials)
34 | 
35 | ## Usage
36 | 
37 | Download and open the notebooks with Jupyter or Google Colab. The notebook with the local training example `Basic Classifier.ipynb` and the python scripts `basic_classifier.py` and `advanced_classifier.py` can be run right away. For the others you will need to open three different shells. One of them will be executing the parameter server and the other two the workers.
38 | 
39 | For example, to run the `basic_distributed_classifier.py`:
40 | 
41 | * 1st shell command should look like this: `python3 basic_distributed_classifier.py --job_name=ps --task_index=0`
42 | 
43 | * 2nd shell: `python3 basic_distributed_classifier.py --job_name=worker --task_index=0`
44 | 
45 | * 3rd shell: `python3 basic_distributed_classifier.py --job_name=worker --task_index=1`
46 | 
47 | Follow the same steps for the `basic_federated_classifier.py`, `advanced_distributed_classifier.py` and `advanced_federated_classifier.py`.
48 | 
49 | ### Colab Notebooks <img height="30px" src="https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/master/images/colab_logo.png" align="left"> 
50 | 
51 | * [Basic Classifier](https://colab.research.google.com/drive/1hJ6UhELZ9sK3eX2_c-MamjxNt4gzgCis)
52 | * [Basic Distributed Classifier](https://colab.research.google.com/drive/1ZsSOD_J9aFRL4xACVUw0lau0Bc9IPD-C)
53 | * [Basic Federated Classifier](https://colab.research.google.com/drive/1zMNAJlqnNSziKYECTWhPyj4HSzg1g8sx)
54 | 
55 | ## Additional resources
56 | 
57 | Check [MPI](https://github.com/coMindOrg/federated-averaging-tutorials/tree/master/federated-MPI) to find an implementation of Federated Averaging with [Message Passing Interface](https://www.mpich.org/). This takes the communication out of TensorFlow and averages the weights with a custom hook. 
58 | 
59 | Check [sockets](https://github.com/coMindOrg/federated-averaging-tutorials/tree/master/federated-sockets) to find an implementation with python sockets. The same idea as with MPI but in this case we only need to know the public IP of the chief worker, and a custom hook will take care of the synchronization for us!
60 | 
61 | Check [this](https://github.com/coMindOrg/federated-averaging-tutorials/tree/master/federated-keras) to see an easier implementation with keras!
62 | 
63 | Check [this script](https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py) to see how to generate CIFAR-10 TFRecords.
64 | 
65 | ## Troubleshooting and Help
66 | 
67 | coMind has public Slack and Telegram channels which are a great place to ask questions and all things related to federated machine learning.
68 | 
69 | ## Bugs and Issues
70 | 
71 | Have a bug or an issue? [Open a new issue](https://github.com/coMindOrg/federated-averaging-tutorials/issues) here on GitHub or join our community in Slack or Telegram.
72 | 
73 | *[Click here to join the Slack channel!](https://comindorg.slack.com/join/shared_invite/enQtNDMxMzc0NDA5OTEwLWIyZTg5MTg1MTM4NjhiNDM4YTU1OTI1NTgwY2NkNzZjYWY1NmI0ZjIyNWJiMTNkZmRhZDg2Nzc3YTYyNGQzM2I)* <img height="30px" src="https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/master/images/slack_logo.jpg" align="left"> 
74 | 
75 | *[Click here to join the Telegram channel!](https://t.me/comind)* <img height="30px" src="https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/master/images/telegram_logo.jpg" align="left">
76 | 
77 | ## References
78 | 
79 | The Federated Averaging algorithm is explained in more detail in the following paper:
80 | 
81 | H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. [Communication-efficient learning of deep networks from decentralized data](https://arxiv.org/pdf/1602.05629.pdf). In Conference on Artificial Intelligence and Statistics, 2017.
82 | 
83 | The datsets used in these examples were:
84 | 
85 | Alex Krizhevsky. [Learning Multiple Layers of Features from Tiny Images](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf).
86 | 
87 | Han Xiao, Kashif Rasul, Roland Vollgraf. [Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms](https://arxiv.org/abs/1708.07747).
88 | 
89 | ## About
90 | 
91 | coMind is an open source project for training privacy-preserving federated deep learning models. 
92 | 
93 | * https://comind.org/
94 | * [Twitter](https://twitter.com/coMindOrg)
95 | 


--------------------------------------------------------------------------------
/advanced_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow
 19 | import tensorflow as tf
 20 | 
 21 | # Helper libraries
 22 | import os
 23 | import numpy as np
 24 | from time import time
 25 | import multiprocessing
 26 | 
 27 | # You can safely tune these variables
 28 | BATCH_SIZE = 128
 29 | SHUFFLE_SIZE = BATCH_SIZE * 100
 30 | EPOCHS = 250
 31 | EPOCHS_PER_DECAY = 50
 32 | BATCHES_TO_PREFETCH = 1
 33 | # ----------------
 34 | 
 35 | # Dataset dependent constants
 36 | num_train_images = 50000
 37 | num_test_images = 10000
 38 | height = 32
 39 | width = 32
 40 | channels = 3
 41 | num_batch_files = 5
 42 | 
 43 | # Path to TFRecord files (check readme for instructions on how to get these files)
 44 | cifar10_train_files = ['cifar-10-tf-records/train{}.tfrecords'.format(i) for i in range(num_batch_files)]
 45 | cifar10_test_file = 'cifar-10-tf-records/test.tfrecords'
 46 | 
 47 | # Shuffle filenames before loading them
 48 | np.random.shuffle(cifar10_train_files)
 49 | 
 50 | checkpoint_dir='logs_dir/{}'.format(time())
 51 | print('Checkpoint directory: ' + checkpoint_dir)
 52 | 
 53 | global_step = tf.train.get_or_create_global_step()
 54 | 
 55 | # Check number of available CPUs
 56 | cpu_count = multiprocessing.cpu_count()
 57 | 
 58 | # Define input pipeline, place these ops in the cpu
 59 | with tf.name_scope('dataset'), tf.device('/cpu:0'):
 60 |     # Map function to decode data and preprocess it
 61 |     def preprocess(serialized_examples):
 62 |         # Parse a batch
 63 |         features = tf.parse_example(serialized_examples, {'image': tf.FixedLenFeature([], tf.string), 'label': tf.FixedLenFeature([], tf.int64)})
 64 |         # Decode and reshape image
 65 |         image = tf.map_fn(lambda img: tf.reshape(tf.decode_raw(img, tf.uint8), tf.stack([height, width, channels])), features['image'], dtype=tf.uint8, name='decode')
 66 |         # Cast image
 67 |         casted_image = tf.cast(image, tf.float32, name='input_cast')
 68 |         # Resize image for testing
 69 |         resized_image = tf.image.resize_image_with_crop_or_pad(casted_image, 24, 24)
 70 |         # Augment images for training
 71 |         distorted_image = tf.map_fn(lambda img: tf.random_crop(img, [24, 24, 3]), casted_image, name='random_crop')
 72 |         distorted_image = tf.image.random_flip_left_right(distorted_image)
 73 |         distorted_image = tf.image.random_brightness(distorted_image, 63)
 74 |         distorted_image = tf.image.random_contrast(distorted_image, 0.2, 1.8)
 75 |         # Check if test or train mode
 76 |         result = tf.cond(train_mode, lambda: distorted_image, lambda: resized_image)
 77 |         # Standardize images
 78 |         processed_image = tf.map_fn(lambda img: tf.image.per_image_standardization(img), result, name='standardization')
 79 |         return processed_image, features['label']
 80 |     # Placeholders for the iterator
 81 |     filename_placeholder = tf.placeholder(tf.string, name='input_filename')
 82 |     batch_size = tf.placeholder(tf.int64, name='batch_size')
 83 |     shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
 84 |     train_mode = tf.placeholder(tf.bool, name='train_mode')
 85 | 
 86 |     # Create dataset, shuffle, repeat, batch, map and prefetch
 87 |     dataset = tf.data.TFRecordDataset(filename_placeholder)
 88 |     dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
 89 |     dataset = dataset.repeat(EPOCHS)
 90 |     dataset = dataset.batch(batch_size)
 91 |     dataset = dataset.map(preprocess, cpu_count)
 92 |     dataset = dataset.prefetch(BATCHES_TO_PREFETCH)
 93 |     # Define a feedable iterator and the initialization op
 94 |     iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
 95 |     dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
 96 |     X, y = iterator.get_next()
 97 | 
 98 | # Define our model
 99 | first_conv = tf.layers.conv2d(X, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='first_conv')
100 | 
101 | first_pool = tf.nn.max_pool(first_conv, [1, 3, 3 ,1], [1, 2, 2, 1], padding='SAME', name='first_pool')
102 | 
103 | first_norm = tf.nn.lrn(first_pool, 4, alpha=0.001 / 9.0, beta=0.75, name='first_norm')
104 | 
105 | second_conv = tf.layers.conv2d(first_norm, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='second_conv')
106 | 
107 | second_norm = tf.nn.lrn(second_conv, 4, alpha=0.001 / 9.0, beta=0.75, name='second_norm')
108 | 
109 | second_pool = tf.nn.max_pool(second_norm, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME', name='second_pool')
110 | 
111 | flatten_layer = tf.layers.flatten(second_pool, name='flatten')
112 | 
113 | first_relu = tf.layers.dense(flatten_layer, 384, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='first_relu')
114 | 
115 | second_relu = tf.layers.dense(first_relu, 192, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='second_relu')
116 | 
117 | logits = tf.layers.dense(second_relu, 10, kernel_initializer=tf.truncated_normal_initializer(stddev=1/192.0), name='logits')
118 | 
119 | # Object to keep moving averages of our metrics (for tensorboard)
120 | summary_averages = tf.train.ExponentialMovingAverage(0.9)
121 | 
122 | # Define cross_entropy loss
123 | with tf.name_scope('loss'):
124 |     base_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits), name='base_loss')
125 |     # Add regularization loss to both relu layers
126 |     regularizer_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'relu/kernel' in v.name], name='regularizer_loss') * 0.004
127 |     loss = tf.add(base_loss, regularizer_loss)
128 |     loss_averages_op = summary_averages.apply([loss])
129 |     # Store moving average of the loss
130 |     tf.summary.scalar('cross_entropy', summary_averages.average(loss))
131 | 
132 | with tf.name_scope('accuracy'):
133 |     with tf.name_scope('correct_prediction'):
134 |         # Compare prediction with actual label
135 |         correct_prediction = tf.equal(tf.argmax(logits, 1), y)
136 |     # Average correct predictions in the current batch
137 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')
138 |     accuracy_averages_op = summary_averages.apply([accuracy])
139 |     # Store moving average of the accuracy
140 |     tf.summary.scalar('accuracy', summary_averages.average(accuracy))
141 | 
142 | n_batches = int(num_train_images / BATCH_SIZE)
143 | last_step = int(n_batches * EPOCHS)
144 | 
145 | # Define moving averages of the trainable variables. This sometimes improve
146 | # the performance of the trained model
147 | with tf.name_scope('variable_averages'):
148 |     variable_averages = tf.train.ExponentialMovingAverage(0.9999, global_step)
149 |     variable_averages_op = variable_averages.apply(tf.trainable_variables())
150 | 
151 | # Define optimizer and training op
152 | with tf.name_scope('train'):
153 |     # Make decaying learning rate
154 |     lr = tf.train.exponential_decay(0.1, global_step, n_batches * EPOCHS_PER_DECAY, 0.1, staircase=True)
155 |     tf.summary.scalar('learning_rate', lr)
156 |     # Make train_op dependent on moving averages ops. Otherwise they will be
157 |     # disconnected from the graph
158 |     with tf.control_dependencies([loss_averages_op, accuracy_averages_op, variable_averages_op]):
159 |         train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss, global_step=global_step)
160 | 
161 | print('Graph definition finished')
162 | sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
163 | 
164 | print('Training {} batches...'.format(last_step))
165 | 
166 | # Logger hook to keep track of the training
167 | class _LoggerHook(tf.train.SessionRunHook):
168 |   def begin(self):
169 |       self._total_loss = 0
170 |       self._total_acc = 0
171 | 
172 |   def before_run(self, run_context):
173 |       return tf.train.SessionRunArgs([loss, accuracy, global_step])
174 | 
175 |   def after_run(self, run_context, run_values):
176 |       loss_value, acc_value, step_value = run_values.results
177 |       self._total_loss += loss_value
178 |       self._total_acc += acc_value
179 |       if (step_value + 1) % n_batches == 0:
180 |           print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(step_value / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
181 |           self._total_loss = 0
182 |           self._total_acc = 0
183 | 
184 | # Hook to initialize the dataset
185 | class _InitHook(tf.train.SessionRunHook):
186 |     def after_create_session(self, session, coord):
187 |         session.run(dataset_init_op, feed_dict={filename_placeholder: cifar10_train_files, batch_size: BATCH_SIZE, shuffle_size: SHUFFLE_SIZE, train_mode: True})
188 | 
189 | with tf.name_scope('monitored_session'):
190 |     with tf.train.MonitoredTrainingSession(
191 |             checkpoint_dir=checkpoint_dir,
192 |             hooks=[_LoggerHook(), _InitHook(), tf.train.CheckpointSaverHook(checkpoint_dir=checkpoint_dir, save_steps=n_batches, saver=tf.train.Saver(variable_averages.variables_to_restore()))],
193 |             config=sess_config,
194 |             save_checkpoint_secs=None) as mon_sess:
195 |         while not mon_sess.should_stop():
196 |             mon_sess.run(train_op)
197 | 
198 | print('--- Begin Evaluation ---')
199 | # Reset graph and place ops in cpu to avoid OOM
200 | tf.reset_default_graph()
201 | with tf.device('/cpu:0'), tf.Session() as sess:
202 |     ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
203 |     saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)
204 |     saver.restore(sess, ckpt.model_checkpoint_path)
205 |     print('Model restored')
206 |     graph = tf.get_default_graph()
207 |     filename_placeholder = graph.get_tensor_by_name('dataset/input_filename:0')
208 |     batch_size = graph.get_tensor_by_name('dataset/batch_size:0')
209 |     shuffle_size = graph.get_tensor_by_name('dataset/shuffle_size:0')
210 |     train_mode = graph.get_tensor_by_name('dataset/train_mode:0')
211 |     accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')
212 |     dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')
213 |     sess.run(dataset_init_op, feed_dict={filename_placeholder: cifar10_test_file, batch_size: num_test_images, shuffle_size: 1, train_mode: False})
214 |     print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
215 | 


--------------------------------------------------------------------------------
/advanced_distributed_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow
 19 | import tensorflow as tf
 20 | 
 21 | # Helper libraries
 22 | import os
 23 | import numpy as np
 24 | from time import time
 25 | import multiprocessing
 26 | 
 27 | flags = tf.app.flags
 28 | flags.DEFINE_integer("task_index", None,
 29 |                      "Worker task index, should be >= 0. task_index=0 is "
 30 |                      "the master worker task that performs the variable "
 31 |                      "initialization ")
 32 | flags.DEFINE_string("ps_hosts", "localhost:2222",
 33 |                     "Comma-separated list of hostname:port pairs")
 34 | flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224",
 35 |                     "Comma-separated list of hostname:port pairs")
 36 | flags.DEFINE_string("job_name", None, "job name: worker or ps")
 37 | 
 38 | # You can safely tune these variables
 39 | BATCH_SIZE = 128
 40 | SHUFFLE_SIZE = BATCH_SIZE * 100
 41 | EPOCHS = 250
 42 | EPOCHS_PER_DECAY = 50
 43 | BATCHES_TO_PREFETCH = 1
 44 | # ----------------
 45 | 
 46 | FLAGS = flags.FLAGS
 47 | 
 48 | if FLAGS.job_name is None or FLAGS.job_name == "":
 49 |     raise ValueError("Must specify an explicit `job_name`")
 50 | if FLAGS.task_index is None or FLAGS.task_index == "":
 51 |     raise ValueError("Must specify an explicit `task_index`")
 52 | 
 53 | # Only enable GPU for worker 1 (not needed if training with separate machines)
 54 | if FLAGS.task_index == 0:
 55 |     print('--- GPU Disabled ---')
 56 |     os.environ['CUDA_VISIBLE_DEVICES'] = ''
 57 | 
 58 | #Construct the cluster and start the server
 59 | ps_spec = FLAGS.ps_hosts.split(",")
 60 | worker_spec = FLAGS.worker_hosts.split(",")
 61 | 
 62 | # Get the number of workers.
 63 | num_workers = len(worker_spec)
 64 | print('{} workers defined'.format(num_workers))
 65 | 
 66 | # Dataset dependent constants
 67 | num_train_images = int(50000 / num_workers)
 68 | num_test_images = 10000
 69 | height = 32
 70 | width = 32
 71 | channels = 3
 72 | num_batch_files = 5
 73 | 
 74 | cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
 75 | 
 76 | server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
 77 | 
 78 | # ps will block here
 79 | if FLAGS.job_name == "ps":
 80 |     print('--- Parameter Server Ready ---')
 81 |     server.join()
 82 | 
 83 | # Path to TFRecord files (check readme for instructions on how to get these files)
 84 | cifar10_train_files = ['cifar-10-tf-records/train{}.tfrecords'.format(i) for i in range(num_batch_files)]
 85 | cifar10_test_file = 'cifar-10-tf-records/test.tfrecords'
 86 | 
 87 | # Shuffle filenames before loading them
 88 | np.random.shuffle(cifar10_train_files)
 89 | 
 90 | is_chief = (FLAGS.task_index == 0)
 91 | 
 92 | checkpoint_dir='logs_dir/{}'.format(time())
 93 | print('Checkpoint directory: ' + checkpoint_dir)
 94 | 
 95 | worker_device = "/job:worker/task:%d" % FLAGS.task_index
 96 | print('Worker device: ' + worker_device + ' - is_chief: {}'.format(is_chief))
 97 | 
 98 | # Check number of available CPUs
 99 | cpu_count = int(multiprocessing.cpu_count() / num_workers)
100 | 
101 | # replica device setter will place ops in the appropriate devices
102 | with tf.device(
103 |       tf.train.replica_device_setter(
104 |           worker_device=worker_device,
105 |           cluster=cluster)):
106 |     global_step = tf.train.get_or_create_global_step()
107 | 
108 |     # Define input pipeline, place these ops in the cpu
109 |     with tf.name_scope('dataset'), tf.device('/cpu:0'):
110 |         # Map function to decode data and preprocess it
111 |         def preprocess(serialized_examples):
112 |             # Parse a batch
113 |             features = tf.parse_example(serialized_examples, {'image': tf.FixedLenFeature([], tf.string), 'label': tf.FixedLenFeature([], tf.int64)})
114 |             # Decode and reshape image
115 |             image = tf.map_fn(lambda img: tf.reshape(tf.decode_raw(img, tf.uint8), tf.stack([height, width, channels])), features['image'], dtype=tf.uint8, name='decode')
116 |             # Cast image
117 |             casted_image = tf.cast(image, tf.float32, name='input_cast')
118 |             # Resize image for testing
119 |             resized_image = tf.image.resize_image_with_crop_or_pad(casted_image, 24, 24)
120 |             # Augment images for training
121 |             distorted_image = tf.map_fn(lambda img: tf.random_crop(img, [24, 24, 3]), casted_image, name='random_crop')
122 |             distorted_image = tf.image.random_flip_left_right(distorted_image)
123 |             distorted_image = tf.image.random_brightness(distorted_image, 63)
124 |             distorted_image = tf.image.random_contrast(distorted_image, 0.2, 1.8)
125 |             # Check if test or train mode
126 |             result = tf.cond(train_mode, lambda: distorted_image, lambda: resized_image)
127 |             # Standardize images
128 |             processed_image = tf.map_fn(lambda img: tf.image.per_image_standardization(img), result, name='standardization')
129 |             return processed_image, features['label']
130 |         # Placeholders for the iterator
131 |         filename_placeholder = tf.placeholder(tf.string, name='input_filename')
132 |         batch_size = tf.placeholder(tf.int64, name='batch_size')
133 |         shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
134 |         train_mode = tf.placeholder(tf.bool, name='train_mode')
135 | 
136 |         # Create dataset, shuffle, repeat, batch, map and prefetch
137 |         dataset = tf.data.TFRecordDataset(filename_placeholder)
138 |         dataset = dataset.shard(num_workers, FLAGS.task_index)
139 |         dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
140 |         dataset = dataset.repeat(EPOCHS)
141 |         dataset = dataset.batch(batch_size)
142 |         dataset = dataset.map(preprocess, cpu_count)
143 |         dataset = dataset.prefetch(BATCHES_TO_PREFETCH)
144 |         # Define a feedable iterator and the initialization op
145 |         iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
146 |         dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
147 |         X, y = iterator.get_next()
148 | 
149 |     # Define our model
150 |     first_conv = tf.layers.conv2d(X, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='first_conv')
151 | 
152 |     first_pool = tf.nn.max_pool(first_conv, [1, 3, 3 ,1], [1, 2, 2, 1], padding='SAME', name='first_pool')
153 | 
154 |     first_norm = tf.nn.lrn(first_pool, 4, alpha=0.001 / 9.0, beta=0.75, name='first_norm')
155 | 
156 |     second_conv = tf.layers.conv2d(first_norm, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='second_conv')
157 | 
158 |     second_norm = tf.nn.lrn(second_conv, 4, alpha=0.001 / 9.0, beta=0.75, name='second_norm')
159 | 
160 |     second_pool = tf.nn.max_pool(second_norm, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME', name='second_pool')
161 | 
162 |     flatten_layer = tf.layers.flatten(second_pool, name='flatten')
163 | 
164 |     first_relu = tf.layers.dense(flatten_layer, 384, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='first_relu')
165 | 
166 |     second_relu = tf.layers.dense(first_relu, 192, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='second_relu')
167 | 
168 |     logits = tf.layers.dense(second_relu, 10, kernel_initializer=tf.truncated_normal_initializer(stddev=1/192.0), name='logits')
169 | 
170 |     # Object to keep moving averages of our metrics (for tensorboard)
171 |     summary_averages = tf.train.ExponentialMovingAverage(0.9)
172 |     n_batches = int(num_train_images / (BATCH_SIZE * num_workers))
173 | 
174 |     # Define cross_entropy loss
175 |     with tf.name_scope('loss'):
176 |         base_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits), name='base_loss')
177 |         # Add regularization loss to both relu layers
178 |         regularizer_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'relu/kernel' in v.name], name='regularizer_loss') * 0.004
179 |         loss = tf.add(base_loss, regularizer_loss)
180 |         loss_averages_op = summary_averages.apply([loss])
181 |         # Store moving average of the loss
182 |         tf.summary.scalar('cross_entropy', summary_averages.average(loss))
183 | 
184 |     with tf.name_scope('accuracy'):
185 |         with tf.name_scope('correct_prediction'):
186 |             # Compare prediction with actual label
187 |             correct_prediction = tf.equal(tf.argmax(logits, 1), y)
188 |         # Average correct predictions in the current batch
189 |         accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')
190 |         accuracy_averages_op = summary_averages.apply([accuracy])
191 |         # Store moving average of the accuracy
192 |         tf.summary.scalar('accuracy', summary_averages.average(accuracy))
193 | 
194 |     # Define moving averages of the trainable variables. This sometimes improve
195 |     # the performance of the trained model
196 |     with tf.name_scope('variable_averages'):
197 |         variable_averages = tf.train.ExponentialMovingAverage(0.9999, global_step)
198 |         variable_averages_op = variable_averages.apply(tf.trainable_variables())
199 | 
200 |     # Define optimizer and training op
201 |     with tf.name_scope('train'):
202 |         # Make decaying learning rate
203 |         lr = tf.train.exponential_decay(0.1, global_step, n_batches * EPOCHS_PER_DECAY, 0.1, staircase=True)
204 |         tf.summary.scalar('learning_rate', lr)
205 |         # Wrap the optimizer in a SyncReplicasOptimizer for distributed training
206 |         optimizer = tf.train.SyncReplicasOptimizer(tf.train.GradientDescentOptimizer(np.sqrt(num_workers) * lr), replicas_to_aggregate=num_workers)
207 |         # Make train_op dependent on moving averages ops. Otherwise they will be
208 |         # disconnected from the graph
209 |         with tf.control_dependencies([loss_averages_op, accuracy_averages_op, variable_averages_op]):
210 |             train_op = optimizer.minimize(loss, global_step=global_step)
211 |         sync_replicas_hook = optimizer.make_session_run_hook(is_chief)
212 | 
213 |     print('Graph definition finished')
214 | 
215 |     last_step = int(n_batches * EPOCHS)
216 | 
217 |     sess_config = tf.ConfigProto(
218 |         allow_soft_placement=True,
219 |         log_device_placement=False,
220 |         device_filters=["/job:ps",
221 |         "/job:worker/task:%d" % FLAGS.task_index])
222 | 
223 |     print('Training {} batches...'.format(last_step))
224 | 
225 |     # Logger hook to keep track of the training
226 |     class _LoggerHook(tf.train.SessionRunHook):
227 |       def begin(self):
228 |           self._total_loss = 0
229 |           self._total_acc = 0
230 | 
231 |       def before_run(self, run_context):
232 |           return tf.train.SessionRunArgs([loss, accuracy, global_step])
233 | 
234 |       def after_run(self, run_context, run_values):
235 |           loss_value, acc_value, step_value = run_values.results
236 |           self._total_loss += loss_value
237 |           self._total_acc += acc_value
238 |           if (step_value + 1) % n_batches == 0:
239 |               print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(step_value / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
240 |               self._total_loss = 0
241 |               self._total_acc = 0
242 | 
243 |       def end(self, session):
244 |           print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(session.run(global_step) / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
245 | 
246 |     # Hook to initialize the dataset
247 |     class _InitHook(tf.train.SessionRunHook):
248 |         def after_create_session(self, session, coord):
249 |             session.run(dataset_init_op, feed_dict={filename_placeholder: cifar10_train_files, batch_size: BATCH_SIZE, shuffle_size: SHUFFLE_SIZE, train_mode: True})
250 | 
251 |     with tf.name_scope('monitored_session'):
252 |         with tf.train.MonitoredTrainingSession(
253 |                 master=server.target,
254 |                 is_chief=is_chief,
255 |                 checkpoint_dir=checkpoint_dir,
256 |                 hooks=[_LoggerHook(), _InitHook(), sync_replicas_hook],
257 |                 chief_only_hooks=[tf.train.CheckpointSaverHook(checkpoint_dir=checkpoint_dir, save_steps=n_batches, saver=tf.train.Saver(variable_averages.variables_to_restore()))],
258 |                 config=sess_config,
259 |                 stop_grace_period_secs=10,
260 |                 save_checkpoint_secs=None) as mon_sess:
261 |             while not mon_sess.should_stop():
262 |                 mon_sess.run(train_op)
263 | 
264 | if is_chief:
265 |     print('--- Begin Evaluation ---')
266 |     # Reset graph to clear any ops stored in other devices
267 |     tf.reset_default_graph()
268 |     with tf.Session() as sess:
269 |         ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
270 |         saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)
271 |         saver.restore(sess, ckpt.model_checkpoint_path)
272 |         print('Model restored')
273 |         graph = tf.get_default_graph()
274 |         filename_placeholder = graph.get_tensor_by_name('dataset/input_filename:0')
275 |         batch_size = graph.get_tensor_by_name('dataset/batch_size:0')
276 |         shuffle_size = graph.get_tensor_by_name('dataset/shuffle_size:0')
277 |         train_mode = graph.get_tensor_by_name('dataset/train_mode:0')
278 |         accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')
279 |         dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')
280 |         sess.run(dataset_init_op, feed_dict={filename_placeholder: cifar10_test_file, batch_size: num_test_images, shuffle_size: 1, train_mode: False})
281 |         print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
282 | 


--------------------------------------------------------------------------------
/advanced_federated_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow
 19 | import tensorflow as tf
 20 | 
 21 | # Helper libraries
 22 | import os
 23 | import numpy as np
 24 | from time import time
 25 | import multiprocessing
 26 | 
 27 | # Import custom optimizer
 28 | import federated_averaging_optimizer
 29 | 
 30 | flags = tf.app.flags
 31 | flags.DEFINE_integer("task_index", None,
 32 |                      "Worker task index, should be >= 0. task_index=0 is "
 33 |                      "the master worker task that performs the variable "
 34 |                      "initialization ")
 35 | flags.DEFINE_string("ps_hosts", "localhost:2222",
 36 |                     "Comma-separated list of hostname:port pairs")
 37 | flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224",
 38 |                     "Comma-separated list of hostname:port pairs")
 39 | flags.DEFINE_string("job_name", None, "job name: worker or ps")
 40 | 
 41 | # You can safely tune these variables
 42 | BATCH_SIZE = 128
 43 | SHUFFLE_SIZE = BATCH_SIZE * 100
 44 | EPOCHS = 250
 45 | EPOCHS_PER_DECAY = 50
 46 | INTERVAL_STEPS = 100
 47 | BATCHES_TO_PREFETCH = 1
 48 | # ----------------
 49 | 
 50 | FLAGS = flags.FLAGS
 51 | 
 52 | if FLAGS.job_name is None or FLAGS.job_name == "":
 53 |     raise ValueError("Must specify an explicit `job_name`")
 54 | if FLAGS.task_index is None or FLAGS.task_index == "":
 55 |     raise ValueError("Must specify an explicit `task_index`")
 56 | 
 57 | # Only enable GPU for worker 1 (not needed if training with separate machines)
 58 | if FLAGS.task_index == 0:
 59 |     print('--- GPU Disabled ---')
 60 |     os.environ['CUDA_VISIBLE_DEVICES'] = ''
 61 | 
 62 | #Construct the cluster and start the server
 63 | ps_spec = FLAGS.ps_hosts.split(",")
 64 | worker_spec = FLAGS.worker_hosts.split(",")
 65 | 
 66 | # Get the number of workers.
 67 | num_workers = len(worker_spec)
 68 | print('{} workers defined'.format(num_workers))
 69 | 
 70 | # Dataset dependent constants
 71 | num_train_images = int(50000 / num_workers)
 72 | num_test_images = 10000
 73 | height = 32
 74 | width = 32
 75 | channels = 3
 76 | num_batch_files = 5
 77 | 
 78 | cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
 79 | 
 80 | server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
 81 | 
 82 | # ps will block here
 83 | if FLAGS.job_name == "ps":
 84 |     print('--- Parameter Server Ready ---')
 85 |     server.join()
 86 | 
 87 | # Path to TFRecord files (check readme for instructions on how to get these files)
 88 | cifar10_train_files = ['cifar-10-tf-records/train{}.tfrecords'.format(i) for i in range(num_batch_files)]
 89 | cifar10_test_file = 'cifar-10-tf-records/test.tfrecords'
 90 | 
 91 | # Shuffle filenames before loading them
 92 | np.random.shuffle(cifar10_train_files)
 93 | 
 94 | is_chief = (FLAGS.task_index == 0)
 95 | 
 96 | checkpoint_dir='logs_dir/federated_worker_{}/{}'.format(FLAGS.task_index, time())
 97 | print('Checkpoint directory: ' + checkpoint_dir)
 98 | 
 99 | worker_device = "/job:worker/task:%d" % FLAGS.task_index
100 | print('Worker device: ' + worker_device + ' - is_chief: {}'.format(is_chief))
101 | 
102 | # Check number of available CPUs
103 | cpu_count = int(multiprocessing.cpu_count() / num_workers)
104 | 
105 | # Place all ops in local worker by default
106 | with tf.device(worker_device):
107 |     global_step = tf.train.get_or_create_global_step()
108 | 
109 |     # Define input pipeline, place these ops in the cpu
110 |     with tf.name_scope('dataset'), tf.device('/cpu:0'):
111 |         # Map function to decode data and preprocess it
112 |         def preprocess(serialized_examples):
113 |             # Parse a batch
114 |             features = tf.parse_example(serialized_examples, {'image': tf.FixedLenFeature([], tf.string), 'label': tf.FixedLenFeature([], tf.int64)})
115 |             # Decode and reshape imag
116 |             image = tf.map_fn(lambda img: tf.reshape(tf.decode_raw(img, tf.uint8), tf.stack([height, width, channels])), features['image'], dtype=tf.uint8, name='decode')
117 |             # Cast image
118 |             casted_image = tf.cast(image, tf.float32, name='input_cast')
119 |             # Resize image for testing
120 |             resized_image = tf.image.resize_image_with_crop_or_pad(casted_image, 24, 24)
121 |             # Augment images for training
122 |             distorted_image = tf.map_fn(lambda img: tf.random_crop(img, [24, 24, 3]), casted_image, name='random_crop')
123 |             distorted_image = tf.image.random_flip_left_right(distorted_image)
124 |             distorted_image = tf.image.random_brightness(distorted_image, 63)
125 |             distorted_image = tf.image.random_contrast(distorted_image, 0.2, 1.8)
126 |             # Check if test or train mode
127 |             result = tf.cond(train_mode, lambda: distorted_image, lambda: resized_image)
128 |             # Standardize images
129 |             processed_image = tf.map_fn(lambda img: tf.image.per_image_standardization(img), result, name='standardization')
130 |             return processed_image, features['label']
131 |         # Placeholders for the iterator
132 |         filename_placeholder = tf.placeholder(tf.string, name='input_filename')
133 |         batch_size = tf.placeholder(tf.int64, name='batch_size')
134 |         shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
135 |         train_mode = tf.placeholder(tf.bool, name='train_mode')
136 | 
137 |         # Create dataset, shuffle, repeat, batch, map and prefetch
138 |         dataset = tf.data.TFRecordDataset(filename_placeholder)
139 |         dataset = dataset.shard(num_workers, FLAGS.task_index)
140 |         dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
141 |         dataset = dataset.repeat(EPOCHS)
142 |         dataset = dataset.batch(batch_size)
143 |         dataset = dataset.map(preprocess, cpu_count)
144 |         dataset = dataset.prefetch(BATCHES_TO_PREFETCH)
145 |         # Define a feedable iterator and the initialization op
146 |         iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
147 |         dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
148 |         X, y = iterator.get_next()
149 | 
150 |     # Define our model
151 |     first_conv = tf.layers.conv2d(X, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='first_conv')
152 | 
153 |     first_pool = tf.nn.max_pool(first_conv, [1, 3, 3 ,1], [1, 2, 2, 1], padding='SAME', name='first_pool')
154 | 
155 |     first_norm = tf.nn.lrn(first_pool, 4, alpha=0.001 / 9.0, beta=0.75, name='first_norm')
156 | 
157 |     second_conv = tf.layers.conv2d(first_norm, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='second_conv')
158 | 
159 |     second_norm = tf.nn.lrn(second_conv, 4, alpha=0.001 / 9.0, beta=0.75, name='second_norm')
160 | 
161 |     second_pool = tf.nn.max_pool(second_norm, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME', name='second_pool')
162 | 
163 |     flatten_layer = tf.layers.flatten(second_pool, name='flatten')
164 | 
165 |     first_relu = tf.layers.dense(flatten_layer, 384, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='first_relu')
166 | 
167 |     second_relu = tf.layers.dense(first_relu, 192, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='second_relu')
168 | 
169 |     logits = tf.layers.dense(second_relu, 10, kernel_initializer=tf.truncated_normal_initializer(stddev=1/192.0), name='logits')
170 | 
171 |     # Object to keep moving averages of our metrics (for tensorboard)
172 |     summary_averages = tf.train.ExponentialMovingAverage(0.9)
173 |     n_batches = int(num_train_images / BATCH_SIZE)
174 | 
175 |     # Define cross_entropy loss
176 |     with tf.name_scope('loss'):
177 |         base_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits), name='base_loss')
178 |         # Add regularization loss to both relu layers
179 |         regularizer_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'relu/kernel' in v.name], name='regularizer_loss') * 0.004
180 |         loss = tf.add(base_loss, regularizer_loss)
181 |         loss_averages_op = summary_averages.apply([loss])
182 |         # Store moving average of the loss
183 |         tf.summary.scalar('cross_entropy', summary_averages.average(loss))
184 | 
185 |     with tf.name_scope('accuracy'):
186 |         with tf.name_scope('correct_prediction'):
187 |             # Compare prediction with actual label
188 |             correct_prediction = tf.equal(tf.argmax(logits, 1), y)
189 |         # Average correct predictions in the current batch
190 |         accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')
191 |         accuracy_averages_op = summary_averages.apply([accuracy])
192 |         # Store moving average of the accuracy
193 |         tf.summary.scalar('accuracy', summary_averages.average(accuracy))
194 | 
195 |     # Define moving averages of the trainable variables. This sometimes improve
196 |     # the performance of the trained model
197 |     with tf.name_scope('variable_averages'):
198 |         variable_averages = tf.train.ExponentialMovingAverage(0.9999, global_step)
199 |         variable_averages_op = variable_averages.apply(tf.trainable_variables())
200 | 
201 |     # Define optimizer and training op
202 |     with tf.name_scope('train'):
203 |         # Define device setter to place copies of local variables
204 |         device_setter = tf.train.replica_device_setter(worker_device=worker_device, cluster=cluster)
205 |         # Make decaying learning rate
206 |         lr = tf.train.exponential_decay(0.1, global_step, n_batches * EPOCHS_PER_DECAY, 0.1, staircase=True)
207 |         tf.summary.scalar('learning_rate', lr)
208 |         # Wrap the optimizer in a FederatedAveragingOptimizer for federated training
209 |         optimizer = federated_averaging_optimizer.FederatedAveragingOptimizer(tf.train.GradientDescentOptimizer(lr), replicas_to_aggregate=num_workers, interval_steps=INTERVAL_STEPS, is_chief=is_chief, device_setter=device_setter)
210 |         # Make train_op dependent on moving averages ops. Otherwise they will be
211 |         # disconnected from the graph
212 |         with tf.control_dependencies([loss_averages_op, accuracy_averages_op, variable_averages_op]):
213 |             train_op = optimizer.minimize(loss, global_step=global_step)
214 |         model_average_hook = optimizer.make_session_run_hook()
215 | 
216 |     print('Graph definition finished')
217 | 
218 |     last_step = int(n_batches * EPOCHS)
219 | 
220 |     sess_config = tf.ConfigProto(
221 |         allow_soft_placement=True,
222 |         log_device_placement=False,
223 |         device_filters=["/job:ps",
224 |         "/job:worker/task:%d" % FLAGS.task_index])
225 | 
226 |     print('Training {} batches...'.format(last_step))
227 | 
228 |     # Logger hook to keep track of the training
229 |     class _LoggerHook(tf.train.SessionRunHook):
230 |       def begin(self):
231 |           self._total_loss = 0
232 |           self._total_acc = 0
233 | 
234 |       def before_run(self, run_context):
235 |           return tf.train.SessionRunArgs([loss, accuracy, global_step])
236 | 
237 |       def after_run(self, run_context, run_values):
238 |           loss_value, acc_value, step_value = run_values.results
239 |           self._total_loss += loss_value
240 |           self._total_acc += acc_value
241 |           if (step_value + 1) % n_batches == 0:
242 |               print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(step_value / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
243 |               self._total_loss = 0
244 |               self._total_acc = 0
245 | 
246 |     # Hook to initialize the dataset
247 |     class _InitHook(tf.train.SessionRunHook):
248 |         def after_create_session(self, session, coord):
249 |             session.run(dataset_init_op, feed_dict={filename_placeholder: cifar10_train_files, batch_size: BATCH_SIZE, shuffle_size: SHUFFLE_SIZE, train_mode: True})
250 | 
251 |     # Hook to save just trainable_variables
252 |     class _SaverHook(tf.train.SessionRunHook):
253 |         def begin(self):
254 |             self._saver = tf.train.Saver(variable_averages.variables_to_restore())
255 | 
256 |         def before_run(self, run_context):
257 |             return tf.train.SessionRunArgs(global_step)
258 | 
259 |         def after_run(self, run_context, run_values):
260 |             step_value = run_values.results
261 |             if step_value % n_batches == 0 and not step_value == 0:
262 |                 self._saver.save(run_context.session, checkpoint_dir+'/model.ckpt', step_value)
263 | 
264 |         def end(self, session):
265 |             self._saver.save(session, checkpoint_dir+'/model.ckpt', session.run(global_step))
266 | 
267 |     # Make sure we do not define a chief worker
268 |     with tf.name_scope('monitored_session'):
269 |         with tf.train.MonitoredTrainingSession(
270 |                 master=server.target,
271 |                 checkpoint_dir=checkpoint_dir,
272 |                 hooks=[_LoggerHook(), _InitHook(), _SaverHook(), model_average_hook],
273 |                 config=sess_config,
274 |                 stop_grace_period_secs=10,
275 |                 save_checkpoint_secs=None) as mon_sess:
276 |             while not mon_sess.should_stop():
277 |                 mon_sess.run(train_op)
278 | 
279 | if is_chief:
280 |     print('--- Begin Evaluation ---')
281 |     # Reset graph to clear any ops stored in other devices
282 |     tf.reset_default_graph()
283 |     with tf.Session() as sess:
284 |         ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
285 |         saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)
286 |         saver.restore(sess, ckpt.model_checkpoint_path)
287 |         print('Model restored')
288 |         graph = tf.get_default_graph()
289 |         filename_placeholder = graph.get_tensor_by_name('dataset/input_filename:0')
290 |         batch_size = graph.get_tensor_by_name('dataset/batch_size:0')
291 |         shuffle_size = graph.get_tensor_by_name('dataset/shuffle_size:0')
292 |         train_mode = graph.get_tensor_by_name('dataset/train_mode:0')
293 |         accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')
294 |         dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')
295 |         sess.run(dataset_init_op, feed_dict={filename_placeholder: cifar10_test_file, batch_size: num_test_images, shuffle_size: 1, train_mode: False})
296 |         print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
297 | 


--------------------------------------------------------------------------------
/basic_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow and tf.keras
 19 | import tensorflow as tf
 20 | from tensorflow import keras
 21 | 
 22 | # Helper libraries
 23 | import numpy as np
 24 | import matplotlib.pyplot as plt
 25 | from time import time
 26 | 
 27 | # You can safely tune these variables
 28 | BATCH_SIZE = 32
 29 | EPOCHS = 5
 30 | # ----------------
 31 | 
 32 | # Load dataset as numpy arrays
 33 | fashion_mnist = keras.datasets.fashion_mnist
 34 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
 35 | print('Data loaded')
 36 | print('Local dataset size: {}'.format(train_images.shape[0]))
 37 | 
 38 | # List with class names to see the labels of the images with matplotlib
 39 | class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
 40 |                'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
 41 | 
 42 | # Normalize dataset
 43 | train_images = train_images / 255.0
 44 | test_images = test_images / 255.0
 45 | 
 46 | checkpoint_dir='logs_dir/{}'.format(time())
 47 | print('Checkpoint directory: ' + checkpoint_dir)
 48 | 
 49 | global_step = tf.train.get_or_create_global_step()
 50 | 
 51 | # Define input pipeline, place these ops in the cpu
 52 | with tf.name_scope('dataset'), tf.device('/cpu:0'):
 53 |     # Placeholders for the iterator
 54 |     images_placeholder = tf.placeholder(train_images.dtype, [None, train_images.shape[1], train_images.shape[2]], name='images_placeholder')
 55 |     labels_placeholder = tf.placeholder(train_labels.dtype, [None], name='labels_placeholder')
 56 |     batch_size = tf.placeholder(tf.int64, name='batch_size')
 57 |     shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
 58 | 
 59 |     # Create dataset from numpy arrays, shuffle, repeat and batch
 60 |     dataset = tf.data.Dataset.from_tensor_slices((images_placeholder, labels_placeholder))
 61 |     dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
 62 |     dataset = dataset.repeat(EPOCHS)
 63 |     dataset = dataset.batch(batch_size)
 64 |     # Define a feedable iterator and the initialization op
 65 |     iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
 66 |     dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
 67 |     X, y = iterator.get_next()
 68 | 
 69 | # Define our model
 70 | flatten_layer = tf.layers.flatten(X, name='flatten')
 71 | 
 72 | dense_layer = tf.layers.dense(flatten_layer, 128, activation=tf.nn.relu, name='relu')
 73 | 
 74 | predictions = tf.layers.dense(dense_layer, 10, activation=tf.nn.softmax, name='softmax')
 75 | 
 76 | # Object to keep moving averages of our metrics (for tensorboard)
 77 | summary_averages = tf.train.ExponentialMovingAverage(0.9)
 78 | 
 79 | # Define cross_entropy loss
 80 | with tf.name_scope('loss'):
 81 |     loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))
 82 |     loss_averages_op = summary_averages.apply([loss])
 83 |     # Store moving average of the loss
 84 |     tf.summary.scalar('cross_entropy', summary_averages.average(loss))
 85 | 
 86 | # Define accuracy metric
 87 | with tf.name_scope('accuracy'):
 88 |     with tf.name_scope('correct_prediction'):
 89 |         # Compare prediction with actual label
 90 |         correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.cast(y, tf.int64))
 91 |     # Average correct predictions in the current batch
 92 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
 93 |     accuracy_averages_op = summary_averages.apply([accuracy])
 94 |     # Store moving average of the accuracy
 95 |     tf.summary.scalar('accuracy', summary_averages.average(accuracy))
 96 | 
 97 | # Define optimizer and training op
 98 | with tf.name_scope('train'):
 99 |     # Make train_op dependent on moving averages ops. Otherwise they will be
100 |     # disconnected from the graph
101 |     with tf.control_dependencies([loss_averages_op, accuracy_averages_op]):
102 |         train_op = tf.train.AdamOptimizer(0.001).minimize(loss, global_step=global_step)
103 | 
104 | print('Graph definition finished')
105 | sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
106 | 
107 | n_batches = int(train_images.shape[0] / BATCH_SIZE)
108 | last_step = int(n_batches * EPOCHS)
109 | print('Training {} batches...'.format(last_step))
110 | 
111 | # Logger hook to keep track of the training
112 | class _LoggerHook(tf.train.SessionRunHook):
113 |   def begin(self):
114 |       self._total_loss = 0
115 |       self._total_acc = 0
116 | 
117 |   def before_run(self, run_context):
118 |       return tf.train.SessionRunArgs([loss, accuracy, global_step])
119 | 
120 |   def after_run(self, run_context, run_values):
121 |       loss_value, acc_value, step_value = run_values.results
122 |       self._total_loss += loss_value
123 |       self._total_acc += acc_value
124 |       if (step_value + 1) % n_batches == 0:
125 |           print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(step_value / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
126 |           self._total_loss = 0
127 |           self._total_acc = 0
128 | 
129 | # Hook to initialize the dataset
130 | class _InitHook(tf.train.SessionRunHook):
131 |     def after_create_session(self, session, coord):
132 |         session.run(dataset_init_op, feed_dict={images_placeholder: train_images, labels_placeholder: train_labels, batch_size: BATCH_SIZE, shuffle_size: train_images.shape[0]})
133 | 
134 | with tf.name_scope('monitored_session'):
135 |     with tf.train.MonitoredTrainingSession(
136 |             checkpoint_dir=checkpoint_dir,
137 |             hooks=[_LoggerHook(), _InitHook()],
138 |             config=sess_config,
139 |             save_checkpoint_steps=n_batches) as mon_sess:
140 |         while not mon_sess.should_stop():
141 |             mon_sess.run(train_op)
142 | 
143 | print('--- Begin Evaluation ---')
144 | with tf.device('/cpu:0'), tf.Session() as sess:
145 |     ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
146 |     tf.train.Saver().restore(sess, ckpt.model_checkpoint_path)
147 |     print('Model restored')
148 |     sess.run(dataset_init_op, feed_dict={images_placeholder: test_images, labels_placeholder: test_labels, batch_size: test_images.shape[0], shuffle_size: 1})
149 |     print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
150 |     predicted = sess.run(predictions)
151 | 
152 | # Plot the first 25 test images, their predicted label, and the true label
153 | # Color correct predictions in green, incorrect predictions in red
154 | plt.figure(figsize=(10, 10))
155 | for i in range(25):
156 |     plt.subplot(5, 5, i + 1)
157 |     plt.xticks([])
158 |     plt.yticks([])
159 |     plt.grid(False)
160 |     plt.imshow(test_images[i], cmap=plt.cm.binary)
161 |     predicted_label = np.argmax(predicted[i])
162 |     true_label = test_labels[i]
163 |     if predicted_label == true_label:
164 |       color = 'green'
165 |     else:
166 |       color = 'red'
167 |     plt.xlabel("{} ({})".format(class_names[predicted_label],
168 |                                 class_names[true_label]),
169 |                                 color=color)
170 | 
171 | plt.show(True)
172 | 


--------------------------------------------------------------------------------
/basic_distributed_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow and tf.keras
 19 | import tensorflow as tf
 20 | from tensorflow import keras
 21 | 
 22 | # Helper libraries
 23 | import os
 24 | import numpy as np
 25 | from time import time
 26 | import matplotlib.pyplot as plt
 27 | 
 28 | flags = tf.app.flags
 29 | flags.DEFINE_integer("task_index", None,
 30 |                      "Worker task index, should be >= 0. task_index=0 is "
 31 |                      "the master worker task that performs the variable "
 32 |                      "initialization ")
 33 | flags.DEFINE_string("ps_hosts", "localhost:2222",
 34 |                     "Comma-separated list of hostname:port pairs")
 35 | flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224",
 36 |                     "Comma-separated list of hostname:port pairs")
 37 | flags.DEFINE_string("job_name", None, "job name: worker or ps")
 38 | 
 39 | # You can safely tune these variables
 40 | BATCH_SIZE = 32
 41 | EPOCHS = 5
 42 | # ----------------
 43 | 
 44 | FLAGS = flags.FLAGS
 45 | 
 46 | if FLAGS.job_name is None or FLAGS.job_name == "":
 47 |     raise ValueError("Must specify an explicit `job_name`")
 48 | if FLAGS.task_index is None or FLAGS.task_index == "":
 49 |     raise ValueError("Must specify an explicit `task_index`")
 50 | 
 51 | # Only enable GPU for worker 1 (not needed if training with separate machines)
 52 | if FLAGS.task_index == 0:
 53 |     print('--- GPU Disabled ---')
 54 |     os.environ['CUDA_VISIBLE_DEVICES'] = ''
 55 | 
 56 | #Construct the cluster and start the server
 57 | ps_spec = FLAGS.ps_hosts.split(",")
 58 | worker_spec = FLAGS.worker_hosts.split(",")
 59 | 
 60 | # Get the number of workers.
 61 | num_workers = len(worker_spec)
 62 | print('{} workers defined'.format(num_workers))
 63 | 
 64 | cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
 65 | 
 66 | server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
 67 | 
 68 | # Parameter server will block here
 69 | if FLAGS.job_name == "ps":
 70 |     print('--- Parameter Server Ready ---')
 71 |     server.join()
 72 | 
 73 | # Load dataset as numpy arrays
 74 | fashion_mnist = keras.datasets.fashion_mnist
 75 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
 76 | print('Data loaded')
 77 | 
 78 | # List with class names to see the labels of the images with matplotlib
 79 | class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
 80 |                'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
 81 | 
 82 | # Split dataset between workers
 83 | train_images = np.array_split(train_images, num_workers)[FLAGS.task_index]
 84 | train_labels = np.array_split(train_labels, num_workers)[FLAGS.task_index]
 85 | print('Local dataset size: {}'.format(train_images.shape[0]))
 86 | 
 87 | # Normalize dataset
 88 | train_images = train_images / 255.0
 89 | test_images = test_images / 255.0
 90 | 
 91 | is_chief = (FLAGS.task_index == 0)
 92 | 
 93 | checkpoint_dir='logs_dir/{}'.format(time())
 94 | print('Checkpoint directory: ' + checkpoint_dir)
 95 | 
 96 | worker_device = "/job:worker/task:%d" % FLAGS.task_index
 97 | print('Worker device: ' + worker_device + ' - is_chief: {}'.format(is_chief))
 98 | 
 99 | # replica_device_setter will place vars in the corresponding device
100 | with tf.device(
101 |       tf.train.replica_device_setter(
102 |           worker_device=worker_device,
103 |           cluster=cluster)):
104 |     global_step = tf.train.get_or_create_global_step()
105 | 
106 |     # Define input pipeline, place these ops in the cpu
107 |     with tf.name_scope('dataset'), tf.device('/cpu:0'):
108 |         # Placeholders for the iterator
109 |         images_placeholder = tf.placeholder(train_images.dtype, [None, train_images.shape[1], train_images.shape[2]], name='images_placeholder')
110 |         labels_placeholder = tf.placeholder(train_labels.dtype, [None], name='labels_placeholder')
111 |         batch_size = tf.placeholder(tf.int64, name='batch_size')
112 |         shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
113 | 
114 |         # Create dataset from numpy arrays, shuffle, repeat and batch
115 |         dataset = tf.data.Dataset.from_tensor_slices((images_placeholder, labels_placeholder))
116 |         dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
117 |         dataset = dataset.repeat(EPOCHS)
118 |         dataset = dataset.batch(batch_size)
119 |         # Define a feedable iterator and the initialization op
120 |         iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
121 |         dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
122 |         X, y = iterator.get_next()
123 | 
124 |     # Define our model
125 |     flatten_layer = tf.layers.flatten(X, name='flatten')
126 | 
127 |     dense_layer = tf.layers.dense(flatten_layer, 128, activation=tf.nn.relu, name='relu')
128 | 
129 |     predictions = tf.layers.dense(dense_layer, 10, activation=tf.nn.softmax, name='softmax')
130 | 
131 |     # Object to keep moving averages of our metrics (for tensorboard)
132 |     summary_averages = tf.train.ExponentialMovingAverage(0.9)
133 | 
134 |     # Define cross_entropy loss
135 |     with tf.name_scope('loss'):
136 |         loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))
137 |         loss_averages_op = summary_averages.apply([loss])
138 |         # Store moving average of the loss
139 |         tf.summary.scalar('cross_entropy', summary_averages.average(loss))
140 | 
141 |     # Define accuracy metric
142 |     with tf.name_scope('accuracy'):
143 |         with tf.name_scope('correct_prediction'):
144 |             # Compare prediction with actual label
145 |             correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.cast(y, tf.int64))
146 |         # Average correct predictions in the current batch
147 |         accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')
148 |         accuracy_averages_op = summary_averages.apply([accuracy])
149 |         # Store moving average of the accuracy
150 |         tf.summary.scalar('accuracy', summary_averages.average(accuracy))
151 | 
152 |     # Define optimizer and training op
153 |     with tf.name_scope('train'):
154 |         # Wrap optimizer in a SyncReplicasOptimizer for distributed training
155 |         optimizer = tf.train.SyncReplicasOptimizer(tf.train.AdamOptimizer(np.sqrt(num_workers) * 0.001), replicas_to_aggregate=num_workers)
156 |         # Make train_op dependent on moving averages ops. Otherwise they will be
157 |         # disconnected from the graph
158 |         with tf.control_dependencies([loss_averages_op, accuracy_averages_op]):
159 |             train_op = optimizer.minimize(loss, global_step=global_step)
160 |         # Define a hook for optimizer initialization
161 |         sync_replicas_hook = optimizer.make_session_run_hook(is_chief)
162 | 
163 |     print('Graph definition finished')
164 | 
165 |     sess_config = tf.ConfigProto(
166 |         allow_soft_placement=True,
167 |         log_device_placement=False,
168 |         device_filters=["/job:ps",
169 |         "/job:worker/task:%d" % FLAGS.task_index])
170 | 
171 |     n_batches = int(train_images.shape[0] / (BATCH_SIZE * num_workers))
172 |     last_step = int(n_batches * EPOCHS)
173 | 
174 |     print('Training {} batches...'.format(last_step))
175 | 
176 |     # Logger hook to keep track of the training
177 |     class _LoggerHook(tf.train.SessionRunHook):
178 |       def begin(self):
179 |           self._total_loss = 0
180 |           self._total_acc = 0
181 | 
182 |       def before_run(self, run_context):
183 |           return tf.train.SessionRunArgs([loss, accuracy, global_step])
184 | 
185 |       def after_run(self, run_context, run_values):
186 |           loss_value, acc_value, step_value = run_values.results
187 |           self._total_loss += loss_value
188 |           self._total_acc += acc_value
189 |           if (step_value + 1) % n_batches == 0:
190 |               print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(step_value / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
191 |               self._total_loss = 0
192 |               self._total_acc = 0
193 | 
194 |       def end(self, session):
195 |           print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(session.run(global_step) / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
196 | 
197 |     # Hook to initialize the dataset
198 |     class _InitHook(tf.train.SessionRunHook):
199 |         def after_create_session(self, session, coord):
200 |             session.run(dataset_init_op, feed_dict={images_placeholder: train_images, labels_placeholder: train_labels, batch_size: BATCH_SIZE, shuffle_size: train_images.shape[0]})
201 | 
202 |     with tf.name_scope('monitored_session'):
203 |         with tf.train.MonitoredTrainingSession(
204 |                 master=server.target,
205 |                 is_chief=is_chief,
206 |                 checkpoint_dir=checkpoint_dir,
207 |                 hooks=[_LoggerHook(), _InitHook(), sync_replicas_hook],
208 |                 config=sess_config,
209 |                 stop_grace_period_secs=10,
210 |                 save_checkpoint_steps=n_batches) as mon_sess:
211 |             while not mon_sess.should_stop():
212 |                 mon_sess.run(train_op)
213 | 
214 | if is_chief:
215 |     print('--- Begin Evaluation ---')
216 |     # Reset graph and load it again to clean tensors placed in other devices
217 |     tf.reset_default_graph()
218 |     with tf.Session() as sess:
219 |         ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
220 |         saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)
221 |         saver.restore(sess, ckpt.model_checkpoint_path)
222 |         print('Model restored')
223 |         graph = tf.get_default_graph()
224 |         images_placeholder = graph.get_tensor_by_name('dataset/images_placeholder:0')
225 |         labels_placeholder = graph.get_tensor_by_name('dataset/labels_placeholder:0')
226 |         batch_size = graph.get_tensor_by_name('dataset/batch_size:0')
227 |         shuffle_size = graph.get_tensor_by_name('dataset/shuffle_size:0')
228 |         accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')
229 |         predictions = graph.get_tensor_by_name('softmax/BiasAdd:0')
230 |         dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')
231 |         sess.run(dataset_init_op, feed_dict={images_placeholder: test_images, labels_placeholder: test_labels, batch_size: test_images.shape[0], shuffle_size: 1})
232 |         print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
233 |         predicted = sess.run(predictions)
234 | 
235 |     # Plot the first 25 test images, their predicted label, and the true label
236 |     # Color correct predictions in green, incorrect predictions in red
237 |     plt.figure(figsize=(10, 10))
238 |     for i in range(25):
239 |         plt.subplot(5, 5, i + 1)
240 |         plt.xticks([])
241 |         plt.yticks([])
242 |         plt.grid(False)
243 |         plt.imshow(test_images[i], cmap=plt.cm.binary)
244 |         predicted_label = np.argmax(predicted[i])
245 |         true_label = test_labels[i]
246 |         if predicted_label == true_label:
247 |           color = 'green'
248 |         else:
249 |           color = 'red'
250 |         plt.xlabel("{} ({})".format(class_names[predicted_label],
251 |                                     class_names[true_label]),
252 |                                     color=color)
253 | 
254 |     plt.show(True)
255 | 


--------------------------------------------------------------------------------
/basic_federated_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow and tf.keras
 19 | import tensorflow as tf
 20 | from tensorflow import keras
 21 | 
 22 | # Helper libraries
 23 | import os
 24 | import numpy as np
 25 | from time import time
 26 | import matplotlib.pyplot as plt
 27 | 
 28 | # Import custom optimizer
 29 | import federated_averaging_optimizer
 30 | 
 31 | flags = tf.app.flags
 32 | flags.DEFINE_integer("task_index", None,
 33 |                      "Worker task index, should be >= 0. task_index=0 is "
 34 |                      "the master worker task that performs the variable "
 35 |                      "initialization ")
 36 | flags.DEFINE_string("ps_hosts", "localhost:2222",
 37 |                     "Comma-separated list of hostname:port pairs")
 38 | flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224",
 39 |                     "Comma-separated list of hostname:port pairs")
 40 | flags.DEFINE_string("job_name", None, "job name: worker or ps")
 41 | 
 42 | # You can safely tune these variables
 43 | BATCH_SIZE = 32
 44 | EPOCHS = 5
 45 | INTERVAL_STEPS = 100
 46 | # ----------------
 47 | 
 48 | FLAGS = flags.FLAGS
 49 | 
 50 | if FLAGS.job_name is None or FLAGS.job_name == "":
 51 |     raise ValueError("Must specify an explicit `job_name`")
 52 | if FLAGS.task_index is None or FLAGS.task_index == "":
 53 |     raise ValueError("Must specify an explicit `task_index`")
 54 | 
 55 | # Only enable GPU for worker 1 (not needed if training with separate machines)
 56 | if FLAGS.task_index == 0:
 57 |     print('--- GPU Disabled ---')
 58 |     os.environ['CUDA_VISIBLE_DEVICES'] = ''
 59 | 
 60 | #Construct the cluster and start the server
 61 | ps_spec = FLAGS.ps_hosts.split(",")
 62 | worker_spec = FLAGS.worker_hosts.split(",")
 63 | 
 64 | # Get the number of workers.
 65 | num_workers = len(worker_spec)
 66 | print('{} workers defined'.format(num_workers))
 67 | 
 68 | cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
 69 | 
 70 | server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
 71 | # Parameter server will block here
 72 | if FLAGS.job_name == "ps":
 73 |     print('--- Parameter Server Ready ---')
 74 |     server.join()
 75 | 
 76 | # Load dataset as numpy arrays
 77 | fashion_mnist = keras.datasets.fashion_mnist
 78 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
 79 | print('Data loaded')
 80 | 
 81 | # List with class names to see the labels of the images with matplotlib
 82 | class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
 83 |                'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
 84 | 
 85 | # Split dataset between workers
 86 | train_images = np.array_split(train_images, num_workers)[FLAGS.task_index]
 87 | train_labels = np.array_split(train_labels, num_workers)[FLAGS.task_index]
 88 | print('Local dataset size: {}'.format(train_images.shape[0]))
 89 | 
 90 | # Normalize dataset
 91 | train_images = train_images / 255.0
 92 | test_images = test_images / 255.0
 93 | 
 94 | is_chief = (FLAGS.task_index == 0)
 95 | 
 96 | checkpoint_dir='logs_dir/federated_worker_{}/{}'.format(FLAGS.task_index, time())
 97 | print('Checkpoint directory: ' + checkpoint_dir)
 98 | 
 99 | worker_device = "/job:worker/task:%d" % FLAGS.task_index
100 | print('Worker device: ' + worker_device + ' - is_chief: {}'.format(is_chief))
101 | 
102 | # Place all ops in the local worker by default
103 | with tf.device(worker_device):
104 |     global_step = tf.train.get_or_create_global_step()
105 | 
106 |     # Define input pipeline, place these ops in the cpu
107 |     with tf.name_scope('dataset'), tf.device('/cpu:0'):
108 |         # Placeholders for the iterator
109 |         images_placeholder = tf.placeholder(train_images.dtype, [None, train_images.shape[1], train_images.shape[2]], name='images_placeholder')
110 |         labels_placeholder = tf.placeholder(train_labels.dtype, [None], name='labels_placeholder')
111 |         batch_size = tf.placeholder(tf.int64, name='batch_size')
112 |         shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
113 | 
114 |         # Create dataset from numpy arrays, shuffle, repeat and batch
115 |         dataset = tf.data.Dataset.from_tensor_slices((images_placeholder, labels_placeholder))
116 |         dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
117 |         dataset = dataset.repeat(EPOCHS)
118 |         dataset = dataset.batch(batch_size)
119 |         # Define a feedable iterator and the initialization op
120 |         iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
121 |         dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
122 |         X, y = iterator.get_next()
123 | 
124 |     # Define our model
125 |     flatten_layer = tf.layers.flatten(X, name='flatten')
126 | 
127 |     dense_layer = tf.layers.dense(flatten_layer, 128, activation=tf.nn.relu, name='relu')
128 | 
129 |     predictions = tf.layers.dense(dense_layer, 10, activation=tf.nn.softmax, name='softmax')
130 | 
131 |     # Object to keep moving averages of our metrics (for tensorboard)
132 |     summary_averages = tf.train.ExponentialMovingAverage(0.9)
133 | 
134 |     # Define cross_entropy loss
135 |     with tf.name_scope('loss'):
136 |         loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))
137 |         loss_averages_op = summary_averages.apply([loss])
138 |         # Store moving average of the loss
139 |         tf.summary.scalar('cross_entropy', summary_averages.average(loss))
140 | 
141 |     # Define accuracy metric
142 |     with tf.name_scope('accuracy'):
143 |         with tf.name_scope('correct_prediction'):
144 |             # Compare prediction with actual label
145 |             correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.cast(y, tf.int64))
146 |         # Average correct predictions in the current batch
147 |         accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')
148 |         accuracy_averages_op = summary_averages.apply([accuracy])
149 |         # Store moving average of the accuracy
150 |         tf.summary.scalar('accuracy', summary_averages.average(accuracy))
151 | 
152 |     # Define optimizer and training op
153 |     with tf.name_scope('train'):
154 |         # Define device setter to place copies of local variables
155 |         device_setter = tf.train.replica_device_setter(worker_device=worker_device, cluster=cluster)
156 |         # Wrap optimizer in a FederatedAveragingOptimizer for federated training
157 |         optimizer = federated_averaging_optimizer.FederatedAveragingOptimizer(tf.train.AdamOptimizer(0.001), replicas_to_aggregate=num_workers, interval_steps=INTERVAL_STEPS, is_chief=is_chief, device_setter=device_setter)
158 |         # Make train_op dependent on moving averages ops. Otherwise they will be
159 |         # disconnected from the graph
160 |         with tf.control_dependencies([loss_averages_op, accuracy_averages_op]):
161 |             train_op = optimizer.minimize(loss, global_step=global_step)
162 |         # Define a hook for optimizer initialization
163 |         federated_hook = optimizer.make_session_run_hook()
164 | 
165 |     n_batches = int(train_images.shape[0] / BATCH_SIZE)
166 |     last_step = int(n_batches * EPOCHS)
167 | 
168 |     print('Graph definition finished')
169 | 
170 |     sess_config = tf.ConfigProto(
171 |         allow_soft_placement=True,
172 |         log_device_placement=False,
173 |         operation_timeout_in_ms=20000,
174 |         device_filters=["/job:ps",
175 |         "/job:worker/task:%d" % FLAGS.task_index])
176 | 
177 |     print('Training {} batches...'.format(last_step))
178 | 
179 |     # Logger hook to keep track of the training
180 |     class _LoggerHook(tf.train.SessionRunHook):
181 |       def begin(self):
182 |           self._total_loss = 0
183 |           self._total_acc = 0
184 | 
185 |       def before_run(self, run_context):
186 |           return tf.train.SessionRunArgs([loss, accuracy, global_step])
187 | 
188 |       def after_run(self, run_context, run_values):
189 |           loss_value, acc_value, step_value = run_values.results
190 |           self._total_loss += loss_value
191 |           self._total_acc += acc_value
192 |           if (step_value + 1) % n_batches == 0:
193 |               print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(int(step_value / n_batches) + 1, EPOCHS, self._total_loss / n_batches, self._total_acc / n_batches))
194 |               self._total_loss = 0
195 |               self._total_acc = 0
196 | 
197 |     # Hook to initialize the dataset
198 |     class _InitHook(tf.train.SessionRunHook):
199 |         def after_create_session(self, session, coord):
200 |             session.run(dataset_init_op, feed_dict={images_placeholder: train_images, labels_placeholder: train_labels, batch_size: BATCH_SIZE, shuffle_size: train_images.shape[0]})
201 | 
202 |     # Hook to save just trainable_variables
203 |     class _SaverHook(tf.train.SessionRunHook):
204 |         def begin(self):
205 |             self._saver = tf.train.Saver(tf.trainable_variables())
206 | 
207 |         def before_run(self, run_context):
208 |             return tf.train.SessionRunArgs(global_step)
209 | 
210 |         def after_run(self, run_context, run_values):
211 |             step_value = run_values.results
212 |             if step_value % n_batches == 0 and not step_value == 0:
213 |                 self._saver.save(run_context.session, checkpoint_dir+'/model.ckpt', step_value)
214 | 
215 |         def end(self, session):
216 |             self._saver.save(session, checkpoint_dir+'/model.ckpt', session.run(global_step))
217 | 
218 |     # Make sure we do not define a chief worker
219 |     with tf.name_scope('monitored_session'):
220 |         with tf.train.MonitoredTrainingSession(
221 |                 master=server.target,
222 |                 checkpoint_dir=checkpoint_dir,
223 |                 hooks=[_LoggerHook(), _InitHook(), _SaverHook(), federated_hook],
224 |                 config=sess_config,
225 |                 stop_grace_period_secs=10,
226 |                 save_checkpoint_secs=None) as mon_sess:
227 |             while not mon_sess.should_stop():
228 |                 mon_sess.run(train_op)
229 | 
230 | if is_chief:
231 |     print('--- Begin Evaluation ---')
232 |     # Reset graph and load it again to clean tensors placed in other devices
233 |     tf.reset_default_graph()
234 |     with tf.Session() as sess:
235 |         ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
236 |         saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)
237 |         saver.restore(sess, ckpt.model_checkpoint_path)
238 |         print('Model restored')
239 |         graph = tf.get_default_graph()
240 |         images_placeholder = graph.get_tensor_by_name('dataset/images_placeholder:0')
241 |         labels_placeholder = graph.get_tensor_by_name('dataset/labels_placeholder:0')
242 |         batch_size = graph.get_tensor_by_name('dataset/batch_size:0')
243 |         shuffle_size = graph.get_tensor_by_name('dataset/shuffle_size:0')
244 |         accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')
245 |         predictions = graph.get_tensor_by_name('softmax/BiasAdd:0')
246 |         dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')
247 |         sess.run(dataset_init_op, feed_dict={images_placeholder: test_images, labels_placeholder: test_labels, batch_size: test_images.shape[0], shuffle_size: 1})
248 |         print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
249 |         predicted = sess.run(predictions)
250 | 
251 |     # Plot the first 25 test images, their predicted label, and the true label
252 |     # Color correct predictions in green, incorrect predictions in red
253 |     plt.figure(figsize=(10, 10))
254 |     for i in range(25):
255 |         plt.subplot(5, 5, i + 1)
256 |         plt.xticks([])
257 |         plt.yticks([])
258 |         plt.grid(False)
259 |         plt.imshow(test_images[i], cmap=plt.cm.binary)
260 |         predicted_label = np.argmax(predicted[i])
261 |         true_label = test_labels[i]
262 |         if predicted_label == true_label:
263 |           color = 'green'
264 |         else:
265 |           color = 'red'
266 |         plt.xlabel("{} ({})".format(class_names[predicted_label],
267 |                                     class_names[true_label]),
268 |                                     color=color)
269 | 
270 |     plt.show(True)
271 | 


--------------------------------------------------------------------------------
/federated-MPI/README.md:
--------------------------------------------------------------------------------
 1 | # Implementation with MPI
 2 | 
 3 | This is the implementation of Federated Averaging using Message Passing Interface. It is harder to set-up but much easier to run, you can launch the whole cluster with just one command!
 4 | 
 5 | ## Installation dependencies
 6 | 
 7 | Same as previous, and:
 8 | - [Mpich3](https://www.mpich.org/)
 9 | - [mpi4py](https://mpi4py.readthedocs.io/en/stable/)
10 | 
11 | ## Usage
12 | 
13 | To run two processes in the same computer with the basic classifier type in the shell: `mpiexec -n 2 python3 mpi_basic_classifier.py`
14 | 
15 | To run a cluster of nodes list their IP's in a file and run: `mpiexec -f your_file python3 mpi_basic_classifier.py`
16 | 
17 | ## Useful resources
18 | 
19 | Check [this tutorial](https://lleksah.wordpress.com/2016/04/11/configuring-a-raspberry-cluster-with-mpi/) to set-up a cluster of Raspberry Pi's.
20 | 
21 | Check [this thread](https://raspberrypi.stackexchange.com/questions/54103/how-to-install-mpi4py-on-for-python3-on-raspberry-pi-after-installing-mpich) to solve common problems with mpi4py after installing mpich.
22 | 
23 | Check [this script](https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py) to see how to generate CIFAR-10 TFRecords.
24 | 
25 | ## Troubleshooting and Help
26 | 
27 | coMind has public Slack and Telegram channels which are a great place to ask questions and all things related to federated machine learning.
28 | 
29 | ## About
30 | 
31 | coMind is an open source project for training privacy-preserving federated deep learning models. 
32 | 
33 | * https://comind.org/
34 | * [Twitter](https://twitter.com/coMindOrg)
35 | 


--------------------------------------------------------------------------------
/federated-MPI/mpi_advanced_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow
 19 | import tensorflow as tf
 20 | 
 21 | # Helper libraries
 22 | import numpy as np
 23 | from time import time
 24 | from mpi4py import MPI
 25 | import sys
 26 | import multiprocessing
 27 | 
 28 | # You can safely tune these variables
 29 | BATCH_SIZE = 128
 30 | SHUFFLE_SIZE = BATCH_SIZE * 100
 31 | EPOCHS = 250
 32 | EPOCHS_PER_DECAY = 50
 33 | INTERVAL_STEPS = 100 # Steps between averages
 34 | BATCHES_TO_PREFETCH = 1
 35 | # -----------------
 36 | 
 37 | # Let the code know about the MPI config
 38 | COMM = MPI.COMM_WORLD
 39 | 
 40 | num_workers = COMM.size
 41 | 
 42 | # Dataset dependent constants
 43 | NUM_TRAIN_IMAGES = int(50000 / num_workers)
 44 | NUM_TEST_IMAGES = 10000
 45 | HEIGHT = 32
 46 | WIDTH = 32
 47 | CHANNELS = 3
 48 | NUM_BATCH_FILES = 5
 49 | 
 50 | # Path to TFRecord files (check readme for instructions on how to get these files)
 51 | cifar10_train_files = ['cifar-10-tf-records/train{}.tfrecords'.format(i)
 52 |                        for i in range(NUM_BATCH_FILES)]
 53 | cifar10_test_file = 'cifar-10-tf-records/test.tfrecords'
 54 | 
 55 | # Shuffle filenames before loading them
 56 | np.random.shuffle(cifar10_train_files)
 57 | 
 58 | CHECKPOINT_DIR = 'logs_dir/{}'.format(time())
 59 | print('Checkpoint directory: ' + CHECKPOINT_DIR)
 60 | sys.stdout.flush()
 61 | 
 62 | global_step = tf.train.get_or_create_global_step()
 63 | 
 64 | CPU_COUNT = int(multiprocessing.cpu_count() / num_workers)
 65 | 
 66 | # Define input pipeline, place these ops in the cpu
 67 | with tf.name_scope('dataset'), tf.device('/cpu:0'):
 68 |     # Map function to decode data and preprocess it
 69 |     def preprocess(serialized_examples):
 70 |         """ Preprocess data """
 71 |         # Parse a batch
 72 |         features = tf.parse_example(
 73 |             serialized_examples,
 74 |             {'image': tf.FixedLenFeature([], tf.string), 'label': tf.FixedLenFeature([], tf.int64)})
 75 |         # Decode and reshape imag
 76 |         image = tf.map_fn(lambda img: tf.reshape(tf.decode_raw(img, tf.uint8),
 77 |                                                  tf.stack([HEIGHT, WIDTH, CHANNELS])),
 78 |                           features['image'], dtype=tf.uint8, name='decode')
 79 |         # Cast image
 80 |         casted_image = tf.cast(image, tf.float32, name='input_cast')
 81 |         # Resize image for testing
 82 |         resized_image = tf.image.resize_image_with_crop_or_pad(casted_image, 24, 24)
 83 |         # Augment images for training
 84 |         distorted_image = tf.map_fn(lambda img: tf.random_crop(img, [24, 24, 3]),
 85 |                                     casted_image, name='random_crop')
 86 |         distorted_image = tf.image.random_flip_left_right(distorted_image)
 87 |         distorted_image = tf.image.random_brightness(distorted_image, 63)
 88 |         distorted_image = tf.image.random_contrast(distorted_image, 0.2, 1.8)
 89 |         # Check if test or train mode
 90 |         result = tf.cond(train_mode, lambda: distorted_image, lambda: resized_image)
 91 |         # Standardize images
 92 |         processed_image = tf.map_fn(lambda img: tf.image.per_image_standardization(img),
 93 |                                     result, name='standardization')
 94 |         return processed_image, features['label']
 95 |     # Placeholders for the iterator
 96 |     filename_placeholder = tf.placeholder(tf.string, name='input_filename')
 97 |     batch_size = tf.placeholder(tf.int64, name='batch_size')
 98 |     shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
 99 |     train_mode = tf.placeholder(tf.bool, name='train_mode')
100 | 
101 |     # Create dataset, shuffle, repeat, batch, map and prefetch
102 |     dataset = tf.data.TFRecordDataset(filename_placeholder)
103 |     dataset = dataset.shard(num_workers, COMM.rank)
104 |     dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
105 |     dataset = dataset.repeat(EPOCHS)
106 |     dataset = dataset.batch(batch_size)
107 |     dataset = dataset.map(preprocess, CPU_COUNT)
108 |     dataset = dataset.prefetch(BATCHES_TO_PREFETCH)
109 |     # Define a feedable iterator and the initialization op
110 |     iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
111 |     dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
112 |     X, y = iterator.get_next()
113 | 
114 | # Define our model
115 | first_conv = tf.layers.conv2d(X, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='first_conv')
116 | 
117 | first_pool = tf.nn.max_pool(first_conv, [1, 3, 3 ,1], [1, 2, 2, 1], padding='SAME', name='first_pool')
118 | 
119 | first_norm = tf.nn.lrn(first_pool, 4, alpha=0.001 / 9.0, beta=0.75, name='first_norm')
120 | 
121 | second_conv = tf.layers.conv2d(first_norm, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='second_conv')
122 | 
123 | second_norm = tf.nn.lrn(second_conv, 4, alpha=0.001 / 9.0, beta=0.75, name='second_norm')
124 | 
125 | second_pool = tf.nn.max_pool(second_norm, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME', name='second_pool')
126 | 
127 | flatten_layer = tf.layers.flatten(second_pool, name='flatten')
128 | 
129 | first_relu = tf.layers.dense(flatten_layer, 384, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='first_relu')
130 | 
131 | second_relu = tf.layers.dense(first_relu, 192, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='second_relu')
132 | 
133 | logits = tf.layers.dense(second_relu, 10, kernel_initializer=tf.truncated_normal_initializer(stddev=1/192.0), name='logits')
134 | 
135 | # Object to keep moving averages of our metrics (for tensorboard)
136 | summary_averages = tf.train.ExponentialMovingAverage(0.9)
137 | 
138 | # Define cross_entropy loss
139 | with tf.name_scope('loss'):
140 |     base_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
141 |                                                                               logits=logits),
142 |                                name='base_loss')
143 |     # Add regularization loss to both relu layers
144 |     regularizer_loss = tf.add_n(
145 |         [tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'relu/kernel' in v.name],
146 |         name='regularizer_loss') * 0.004
147 |     loss = tf.add(base_loss, regularizer_loss)
148 |     loss_averages_op = summary_averages.apply([loss])
149 |     # Store moving average of the loss
150 |     tf.summary.scalar('cross_entropy', summary_averages.average(loss))
151 | 
152 | with tf.name_scope('accuracy'):
153 |     with tf.name_scope('correct_prediction'):
154 |         # Compare prediction with actual label
155 |         correct_prediction = tf.equal(tf.argmax(logits, 1), y)
156 |     # Average correct predictions in the current batch
157 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')
158 |     accuracy_averages_op = summary_averages.apply([accuracy])
159 |     # Store moving average of the accuracy
160 |     tf.summary.scalar('accuracy', summary_averages.average(accuracy))
161 | 
162 | N_BATCHES = int(NUM_TRAIN_IMAGES / BATCH_SIZE)
163 | LAST_STEP = int(N_BATCHES * EPOCHS)
164 | 
165 | # Define moving averages of the trainable variables. This sometimes improve
166 | # the performance of the trained model
167 | with tf.name_scope('variable_averages'):
168 |     variable_averages = tf.train.ExponentialMovingAverage(0.9999, global_step)
169 |     variable_averages_op = variable_averages.apply(tf.trainable_variables())
170 | 
171 | # Define optimizer and training op
172 | with tf.name_scope('train'):
173 |     # Make decaying learning rate
174 |     lr = tf.train.exponential_decay(0.1, global_step, N_BATCHES * EPOCHS_PER_DECAY,
175 |                                     0.1, staircase=True)
176 |     tf.summary.scalar('learning_rate', lr)
177 |     # Make train_op dependent on moving averages ops. Otherwise they will be
178 |     # disconnected from the graph
179 |     with tf.control_dependencies([loss_averages_op, accuracy_averages_op, variable_averages_op]):
180 |         train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss, global_step=global_step)
181 | 
182 | print('Graph definition finished')
183 | sys.stdout.flush()
184 | SESS_CONFIG = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
185 | 
186 | print('Training {} batches...'.format(LAST_STEP))
187 | sys.stdout.flush()
188 | 
189 | # Logger hook to keep track of the training
190 | class _LoggerHook(tf.train.SessionRunHook):
191 |     def begin(self):
192 |         """ Run this in session begin """
193 |         self._total_loss = 0
194 |         self._total_acc = 0
195 | 
196 |     def before_run(self, run_context):
197 |         """ Run this in session before_run """
198 |         return tf.train.SessionRunArgs([loss, accuracy, global_step])
199 | 
200 |     def after_run(self, run_context, run_values):
201 |         """ Run this in session after_run """
202 |         loss_value, acc_value, step_value = run_values.results
203 |         self._total_loss += loss_value
204 |         self._total_acc += acc_value
205 |         if (step_value + 1) % N_BATCHES == 0 and COMM.rank == 0:
206 |             print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(
207 |                 int(step_value / N_BATCHES) + 1, EPOCHS, self._total_loss / N_BATCHES,
208 |                 self._total_acc / N_BATCHES))
209 |             sys.stdout.flush()
210 |             self._total_loss = 0
211 |             self._total_acc = 0
212 | 
213 | # Custom hook
214 | class _FederatedHook(tf.train.SessionRunHook):
215 |     def __init__(self, comm):
216 |         """ Initialize Hook """
217 |         # Store the MPI config
218 |         self._comm = comm
219 | 
220 |     def _create_placeholders(self):
221 |         """ Create placeholders for all the trainable variables """
222 |         # Create placeholders for all the trainable variables
223 |         for var in tf.trainable_variables():
224 |             self._placeholders.append(tf.placeholder_with_default(
225 |                 var, var.shape, name="%s/%s" % ("FedAvg", var.op.name)))
226 | 
227 |     def _assign_vars(self, local_vars):
228 |         """ Assign value feeded to placeholders to local vars """
229 |         reassign_ops = []
230 |         for var, fvar in zip(local_vars, self._placeholders):
231 |             reassign_ops.append(tf.assign(var, fvar))
232 |         return tf.group(*(reassign_ops))
233 | 
234 |     def _gather_weights(self, session):
235 |         """Gather all weights in the chief worker"""
236 |         gathered_weights = []
237 |         for var in tf.trainable_variables():
238 |             value = session.run(var)
239 |             value = self._comm.gather(value, root=0)
240 |             gathered_weights.append(np.array(value))
241 |         return gathered_weights
242 | 
243 |     def _broadcast_weights(self, session):
244 |         """Broadcast averaged weights to all workers"""
245 |         broadcasted_weights = []
246 |         for var in tf.trainable_variables():
247 |             value = session.run(var)
248 |             value = self._comm.bcast(value, root=0)
249 |             broadcasted_weights.append(np.array(value))
250 |         return broadcasted_weights
251 | 
252 |     def begin(self):
253 |         """ Run this in session begin """
254 |         self._placeholders = []
255 |         self._create_placeholders()
256 |         # Op to initialize update the weight
257 |         self._update_local_vars_op = self._assign_vars(tf.trainable_variables())
258 | 
259 |     def after_create_session(self, session, coord):
260 |         """ Run this after creating session """
261 |         # Broadcast weights
262 |         broadcasted_weights = self._broadcast_weights(session)
263 |         # Initialize the workers at the same point
264 |         if self._comm.rank != 0:
265 |             feed_dict = {}
266 |             for placeh, bweight in zip(self._placeholders, broadcasted_weights):
267 |                 feed_dict[placeh] = bweight
268 |             session.run(self._update_local_vars_op, feed_dict=feed_dict)
269 | 
270 |     def before_run(self, run_context):
271 |         """ Run this in session before_run """
272 |         return tf.train.SessionRunArgs(global_step)
273 | 
274 |     def after_run(self, run_context, run_values):
275 |         """ Run this in session after_run """
276 |         step_value = run_values.results
277 |         session = run_context.session
278 |         # Check if we should average
279 |         if step_value % INTERVAL_STEPS == 0 and not step_value == 0:
280 |             gathered_weights = self._gather_weights(session)
281 |             # Chief gather weights and averages
282 |             if self._comm.rank == 0:
283 |                 print('Average applied, iter: {}/{}'.format(step_value, LAST_STEP))
284 |                 sys.stdout.flush()
285 |                 for i, elem in enumerate(gathered_weights):
286 |                     gathered_weights[i] = np.mean(elem, axis=0)
287 |                 feed_dict = {}
288 |                 for placeh, gweight in zip(self._placeholders, gathered_weights):
289 |                     feed_dict[placeh] = gweight
290 |                 session.run(self._update_local_vars_op, feed_dict=feed_dict)
291 |             # The rest get the averages and update their local model
292 |             broadcasted_weights = self._broadcast_weights(session)
293 |             if self._comm.rank != 0:
294 |                 feed_dict = {}
295 |                 for placeh, bweight in zip(self._placeholders, broadcasted_weights):
296 |                     feed_dict[placeh] = bweight
297 |                 session.run(self._update_local_vars_op, feed_dict=feed_dict)
298 | 
299 | # Hook to initialize the dataset
300 | class _InitHook(tf.train.SessionRunHook):
301 |     def after_create_session(self, session, coord):
302 |         session.run(dataset_init_op, feed_dict={
303 |             filename_placeholder: cifar10_train_files, batch_size: BATCH_SIZE,
304 |             shuffle_size: SHUFFLE_SIZE, train_mode: True})
305 | 
306 | print("Worker {} ready".format(COMM.rank))
307 | sys.stdout.flush()
308 | 
309 | with tf.name_scope('monitored_session'):
310 |     with tf.train.MonitoredTrainingSession(
311 |             checkpoint_dir=CHECKPOINT_DIR,
312 |             hooks=[_LoggerHook(), _InitHook(), _FederatedHook(COMM),
313 |                    tf.train.CheckpointSaverHook(checkpoint_dir=CHECKPOINT_DIR,
314 |                                                 save_steps=N_BATCHES,
315 |                                                 saver=tf.train.Saver(
316 |                                                     variable_averages.variables_to_restore()))],
317 |             config=SESS_CONFIG,
318 |             save_checkpoint_secs=None) as mon_sess:
319 |         while not mon_sess.should_stop():
320 |             mon_sess.run(train_op)
321 | 
322 | if COMM.rank == 0:
323 |     print('--- Begin Evaluation ---')
324 |     sys.stdout.flush()
325 |     tf.reset_default_graph()
326 |     with tf.Session() as sess:
327 |         ckpt = tf.train.get_checkpoint_state(CHECKPOINT_DIR)
328 |         saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)
329 |         saver.restore(sess, ckpt.model_checkpoint_path)
330 |         print('Model restored')
331 |         sys.stdout.flush()
332 |         graph = tf.get_default_graph()
333 |         images_placeholder = graph.get_tensor_by_name('dataset/images_placeholder:0')
334 |         labels_placeholder = graph.get_tensor_by_name('dataset/labels_placeholder:0')
335 |         batch_size = graph.get_tensor_by_name('dataset/batch_size:0')
336 |         train_mode = graph.get_tensor_by_name('dataset/train_mode:0')
337 |         accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')
338 |         dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')
339 |         sess.run(dataset_init_op, feed_dict={
340 |             filename_placeholder: cifar10_test_file, batch_size: NUM_TEST_IMAGES,
341 |             shuffle_size: 1, train_mode: False})
342 |         print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
343 |         sys.stdout.flush()
344 | 


--------------------------------------------------------------------------------
/federated-MPI/mpi_basic_classifier.py:
--------------------------------------------------------------------------------
  1 | """# Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # =============================================================================="""
 17 | 
 18 | # TensorFlow and tf.keras
 19 | import tensorflow as tf
 20 | from tensorflow import keras
 21 | 
 22 | # Helper libraries
 23 | import numpy as np
 24 | from time import time
 25 | from mpi4py import MPI
 26 | import sys
 27 | 
 28 | # Let the code know about the MPI config
 29 | COMM = MPI.COMM_WORLD
 30 | 
 31 | # Load dataset as numpy arrays
 32 | fashion_mnist = keras.datasets.fashion_mnist
 33 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
 34 | 
 35 | # Split dataset
 36 | train_images = np.array_split(train_images, COMM.size)[COMM.rank]
 37 | train_labels = np.array_split(train_labels, COMM.size)[COMM.rank]
 38 | 
 39 | # You can safely tune these variables
 40 | BATCH_SIZE = 32
 41 | SHUFFLE_SIZE = train_images.shape[0]
 42 | EPOCHS = 5
 43 | INTERVAL_STEPS = 100
 44 | # -----------------
 45 | 
 46 | # Normalize dataset
 47 | train_images = train_images / 255.0
 48 | test_images = test_images / 255.0
 49 | 
 50 | CHECKPOINT_DIR = 'logs_dir/{}'.format(time())
 51 | 
 52 | global_step = tf.train.get_or_create_global_step()
 53 | 
 54 | # Define input pipeline, place these ops in the cpu
 55 | with tf.name_scope('dataset'), tf.device('/cpu:0'):
 56 |     # Placeholders for the iterator
 57 |     images_placeholder = tf.placeholder(train_images.dtype, [None, train_images.shape[1], train_images.shape[2]])
 58 |     labels_placeholder = tf.placeholder(train_labels.dtype, [None])
 59 |     batch_size = tf.placeholder(tf.int64)
 60 |     shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
 61 | 
 62 |     # Create dataset, shuffle, repeat and batch
 63 |     dataset = tf.data.Dataset.from_tensor_slices((images_placeholder, labels_placeholder))
 64 |     dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
 65 |     dataset = dataset.repeat(EPOCHS)
 66 |     dataset = dataset.batch(batch_size)
 67 |     iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
 68 |     dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
 69 |     X, y = iterator.get_next()
 70 | 
 71 | # Define our model
 72 | flatten_layer = tf.layers.flatten(X, name='flatten')
 73 | 
 74 | dense_layer = tf.layers.dense(flatten_layer, 128, activation=tf.nn.relu, name='relu')
 75 | 
 76 | predictions = tf.layers.dense(dense_layer, 10, activation=tf.nn.softmax, name='softmax')
 77 | 
 78 | # Object to keep moving averages of our metrics (for tensorboard)
 79 | summary_averages = tf.train.ExponentialMovingAverage(0.9)
 80 | 
 81 | # Define cross_entropy loss
 82 | with tf.name_scope('loss'):
 83 |     loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))
 84 |     loss_averages_op = summary_averages.apply([loss])
 85 |     # Store moving average of the loss
 86 |     tf.summary.scalar('cross_entropy', summary_averages.average(loss))
 87 | 
 88 | with tf.name_scope('accuracy'):
 89 |     with tf.name_scope('correct_prediction'):
 90 |         # Compare prediction with actual label
 91 |         correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.cast(y, tf.int64))
 92 |     # Average correct predictions in the current batch
 93 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
 94 |     accuracy_averages_op = summary_averages.apply([accuracy])
 95 |     # Store moving average of the accuracy
 96 |     tf.summary.scalar('accuracy', summary_averages.average(accuracy))
 97 | 
 98 | # Define optimizer and training op
 99 | with tf.name_scope('train'):
100 |     # Make train_op dependent on moving averages ops. Otherwise they will be
101 |     # disconnected from the graph
102 |     with tf.control_dependencies([loss_averages_op, accuracy_averages_op]):
103 |         train_op = tf.train.AdamOptimizer(0.001).minimize(loss, global_step=global_step)
104 | 
105 | SESS_CONFIG = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
106 | 
107 | N_BATCHES = int(train_images.shape[0] / BATCH_SIZE)
108 | LAST_STEP = int(N_BATCHES * EPOCHS)
109 | 
110 | # Logger hook to keep track of the training
111 | class _LoggerHook(tf.train.SessionRunHook):
112 |     def begin(self):
113 |         """ Run this in session begin """
114 |         self._total_loss = 0
115 |         self._total_acc = 0
116 | 
117 |     def before_run(self, run_context):
118 |         """ Run this in session before_run """
119 |         return tf.train.SessionRunArgs([loss, accuracy, global_step])
120 | 
121 |     def after_run(self, run_context, run_values):
122 |         """ Run this in session after_run """
123 |         loss_value, acc_value, step_value = run_values.results
124 |         self._total_loss += loss_value
125 |         self._total_acc += acc_value
126 |         if (step_value + 1) % N_BATCHES == 0 and COMM.rank == 0:
127 |             print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(
128 |                 int(step_value / N_BATCHES) + 1,
129 |                 EPOCHS, self._total_loss / N_BATCHES, self._total_acc / N_BATCHES))
130 |             sys.stdout.flush()
131 |             self._total_loss = 0
132 |             self._total_acc = 0
133 | 
134 | # Custom hook
135 | class _FederatedHook(tf.train.SessionRunHook):
136 |     def __init__(self, comm):
137 |         """ Initialize Hook """
138 |         # Store the MPI config
139 |         self._comm = comm
140 | 
141 |     def _create_placeholders(self):
142 |         """ Create placeholders for all the trainable variables """
143 |         for var in tf.trainable_variables():
144 |             self._placeholders.append(
145 |                 tf.placeholder_with_default(
146 |                     var, var.shape, name="%s/%s" % ("FedAvg", var.op.name)))
147 | 
148 |     def _assign_vars(self, local_vars):
149 |         """ Assign value feeded to placeholders to local vars """
150 |         reassign_ops = []
151 |         for var, fvar in zip(local_vars, self._placeholders):
152 |             reassign_ops.append(tf.assign(var, fvar))
153 |         return tf.group(*(reassign_ops))
154 | 
155 |     def _gather_weights(self, session):
156 |         """Gather all weights in the chief worker"""
157 |         gathered_weights = []
158 |         for var in tf.trainable_variables():
159 |             value = session.run(var)
160 |             value = self._comm.gather(value, root=0)
161 |             gathered_weights.append(np.array(value))
162 |         return gathered_weights
163 | 
164 |     def _broadcast_weights(self, session):
165 |         """Broadcast averaged weights to all workers"""
166 |         broadcasted_weights = []
167 |         for var in tf.trainable_variables():
168 |             value = session.run(var)
169 |             value = self._comm.bcast(value, root=0)
170 |             broadcasted_weights.append(np.array(value))
171 |         return broadcasted_weights
172 | 
173 |     def begin(self):
174 |         """ Run this in session begin """
175 |         self._placeholders = []
176 |         self._create_placeholders()
177 |         # Op to initialize update the weights
178 |         self._update_local_vars_op = self._assign_vars(tf.trainable_variables())
179 | 
180 |     def after_create_session(self, session, coord):
181 |         """ Run this after creating session """
182 |         # Broadcast weights
183 |         broadcasted_weights = self._broadcast_weights(session)
184 |         # Initialize the workers at the same point
185 |         if self._comm.rank != 0:
186 |             feed_dict = {}
187 |             for placeh, bweight in zip(self._placeholders, broadcasted_weights):
188 |                 feed_dict[placeh] = bweight
189 |             session.run(self._update_local_vars_op, feed_dict=feed_dict)
190 | 
191 |     def before_run(self, run_context):
192 |         """ Run this in session before_run """
193 |         return tf.train.SessionRunArgs(global_step)
194 | 
195 |     def after_run(self, run_context, run_values):
196 |         """ Run this in session after_run """
197 |         step_value = run_values.results
198 |         session = run_context.session
199 |         # Check if we should average
200 |         if step_value % INTERVAL_STEPS == 0 and not step_value == 0:
201 |             gathered_weights = self._gather_weights(session)
202 |             # Chief gather weights and averages
203 |             if self._comm.rank == 0:
204 |                 print('Average applied, iter: {}/{}'.format(step_value, LAST_STEP))
205 |                 sys.stdout.flush()
206 |                 for i, elem in enumerate(gathered_weights):
207 |                     gathered_weights[i] = np.mean(elem, axis=0)
208 |                 feed_dict = {}
209 |                 for placeh, gweight in zip(self._placeholders, gathered_weights):
210 |                     feed_dict[placeh] = gweight
211 |                 session.run(self._update_local_vars_op, feed_dict=feed_dict)
212 |             # The rest get the averages and update their local model
213 |             broadcasted_weights = self._broadcast_weights(session)
214 |             if self._comm.rank != 0:
215 |                 feed_dict = {}
216 |                 for placeh, bweight in zip(self._placeholders, broadcasted_weights):
217 |                     feed_dict[placeh] = bweight
218 |                 session.run(self._update_local_vars_op, feed_dict=feed_dict)
219 | 
220 | # Hook to initialize the dataset
221 | class _InitHook(tf.train.SessionRunHook):
222 |     def after_create_session(self, session, coord):
223 |         """ Run this after creating session """
224 |         session.run(dataset_init_op, feed_dict={
225 |             images_placeholder: train_images,
226 |             labels_placeholder: train_labels,
227 |             batch_size: BATCH_SIZE, shuffle_size: SHUFFLE_SIZE})
228 | 
229 | print("Worker {} ready".format(COMM.rank))
230 | sys.stdout.flush()
231 | 
232 | with tf.name_scope('monitored_session'):
233 |     with tf.train.MonitoredTrainingSession(
234 |             checkpoint_dir=CHECKPOINT_DIR,
235 |             hooks=[_LoggerHook(), _InitHook(), _FederatedHook(COMM)],
236 |             config=SESS_CONFIG,
237 |             save_checkpoint_steps=N_BATCHES) as mon_sess:
238 |         while not mon_sess.should_stop():
239 |             mon_sess.run(train_op)
240 | 
241 | if COMM.rank == 0:
242 |     print('--- Begin Evaluation ---')
243 |     sys.stdout.flush()
244 |     with tf.Session() as sess:
245 |         ckpt = tf.train.get_checkpoint_state(CHECKPOINT_DIR)
246 |         tf.train.Saver().restore(sess, ckpt.model_checkpoint_path)
247 |         print('Model restored')
248 |         sys.stdout.flush()
249 |         sess.run(dataset_init_op, feed_dict={
250 |             images_placeholder: test_images, labels_placeholder: test_labels,
251 |             batch_size: test_images.shape[0], shuffle_size: 1})
252 |         print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
253 |         sys.stdout.flush()
254 | 


--------------------------------------------------------------------------------
/federated-keras/README.md:
--------------------------------------------------------------------------------
 1 | # Federated with Keras
 2 | 
 3 | This shows the usage of the distributed and federated set-ups with keras.
 4 | 
 5 | ## Dependencies
 6 | 
 7 | You will need the custom `federated_averaging_optimizer.py` to be able to run the keras example. You can [find it](https://github.com/coMindOrg/federated-averaging-tutorials/blob/master/federated_averaging_optimizer.py) in this same repository.
 8 | 
 9 | ## Usage
10 | 
11 | For example, to run the `keras_distributed_classifier.py`:
12 | 
13 | * 1st shell command should look like this: `python3 keras_distributed_classifier.py --job_name=ps --task_index=0`
14 | 
15 | * 2nd shell: `python3 keras_distributed_classifier.py --job_name=worker --task_index=0`
16 | 
17 | * 3rd shell: `python3 keras_distributed_classifier.py --job_name=worker --task_index=1`
18 | 
19 | Follow the same steps for the `keras_federated_classifier.py`.
20 | 
21 | ## Useful resources
22 | 
23 | Check [Keras](https://keras.io/) to learn more about this great API.
24 | 
25 | ## Troubleshooting and Help
26 | 
27 | coMind has public Slack and Telegram channels which are a great place to ask questions and all things related to federated machine learning.
28 | 
29 | ## About
30 | 
31 | coMind is an open source project for training privacy-preserving federated deep learning models. 
32 | 
33 | * https://comind.org/
34 | * [Twitter](https://twitter.com/coMindOrg)
35 | 


--------------------------------------------------------------------------------
/federated-keras/keras_distributed_classifier.py:
--------------------------------------------------------------------------------
  1 | # Helper libraries
  2 | import os
  3 | import numpy as np
  4 | from time import time
  5 | 
  6 | # TensorFlow and tf.keras
  7 | import tensorflow as tf
  8 | from tensorflow import keras
  9 | 
 10 | flags = tf.app.flags
 11 | flags.DEFINE_integer("task_index", None,
 12 |                      "Worker task index, should be >= 0. task_index=0 is "
 13 |                      "the master worker task the performs the variable "
 14 |                      "initialization ")
 15 | flags.DEFINE_integer("train_steps", 1000,
 16 |                      "Number of (global) training steps to perform")
 17 | flags.DEFINE_string("ps_hosts", "localhost:2222",
 18 |                     "Comma-separated list of hostname:port pairs")
 19 | flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224",
 20 |                     "Comma-separated list of hostname:port pairs")
 21 | flags.DEFINE_string("job_name", None, "job name: worker or ps")
 22 | 
 23 | FLAGS = flags.FLAGS
 24 | 
 25 | # Steps between averages
 26 | INTERVAL_STEPS = 100
 27 | 
 28 | # Disable GPU to avoid OOM issues (could enable it for just one of the workers)
 29 | # Not necessary if workers are hosted in different machines
 30 | os.environ['CUDA_VISIBLE_DEVICES'] = ''
 31 | 
 32 | if FLAGS.job_name is None or FLAGS.job_name == "":
 33 |     raise ValueError("Must specify an explicit `job_name`")
 34 | if FLAGS.task_index is None or FLAGS.task_index == "":
 35 |     raise ValueError("Must specify an explicit `task_index`")
 36 | print("job name = %s" % FLAGS.job_name)
 37 | print("task index = %d" % FLAGS.task_index)
 38 | 
 39 | #Construct the cluster and start the server
 40 | ps_spec = FLAGS.ps_hosts.split(",")
 41 | worker_spec = FLAGS.worker_hosts.split(",")
 42 | 
 43 | # Get the number of workers.
 44 | NUM_WORKERS = len(worker_spec)
 45 | 
 46 | cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
 47 | 
 48 | server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
 49 | 
 50 | # The server will block here
 51 | if FLAGS.job_name == "ps":
 52 |     server.join()
 53 | 
 54 | fashion_mnist = keras.datasets.fashion_mnist
 55 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
 56 | 
 57 | CLASS_NAMES = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
 58 |                'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
 59 | 
 60 | # Split dataset between workers
 61 | train_images = np.array_split(train_images, NUM_WORKERS)[FLAGS.task_index]
 62 | train_labels = np.array_split(train_labels, NUM_WORKERS)[FLAGS.task_index]
 63 | print('Local dataset size: {}'.format(train_images.shape[0]))
 64 | 
 65 | # Normalize dataset
 66 | train_images = train_images / 255.0
 67 | test_images = test_images / 255.0
 68 | 
 69 | IS_CHIEF = (FLAGS.task_index == 0)
 70 | 
 71 | WORKER_DEVICE = "/job:worker/task:%d" % FLAGS.task_index
 72 | 
 73 | # Device setter will place vars in the appropriate device
 74 | with tf.device(
 75 |         tf.train.replica_device_setter(
 76 |             worker_device=WORKER_DEVICE,
 77 |             cluster=cluster)):
 78 |     global_step = tf.train.get_or_create_global_step()
 79 | 
 80 |     # Define the model
 81 |     model = keras.Sequential([
 82 |         keras.layers.Flatten(input_shape=(28, 28)),
 83 |         keras.layers.Dense(128, activation=tf.nn.relu, name='relu'),
 84 |         keras.layers.Dense(10, activation=tf.nn.softmax, name='softmax')
 85 |     ])
 86 | 
 87 |     # Get placeholder for the labels
 88 |     y = tf.placeholder(tf.float32, shape=[None], name='labels')
 89 | 
 90 |     # Store reference to the output of the model
 91 |     predictions = model.output
 92 | 
 93 |     with tf.name_scope('loss'):
 94 |         loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))
 95 | 
 96 |     tf.summary.scalar('cross_entropy', loss)
 97 | 
 98 |     with tf.name_scope('train'):
 99 |         # Define the distributed optimizer
100 |         optimizer = tf.train.SyncReplicasOptimizer(tf.train.AdamOptimizer(0.001),
101 |                                                    replicas_to_aggregate=NUM_WORKERS)
102 |         train_op = optimizer.minimize(loss, global_step=global_step)
103 |         # Define the hook which initializes the optimizer
104 |         sync_replicas_hook = optimizer.make_session_run_hook(is_chief=IS_CHIEF)
105 | 
106 |     # ConfiProto for our session
107 |     SESS_CONFIG = tf.ConfigProto(
108 |         allow_soft_placement=True,
109 |         log_device_placement=False,
110 |         device_filters=["/job:ps",
111 |                         "/job:worker/task:%d" % FLAGS.task_index])
112 | 
113 |     # We need to let the MonitoredSession initialize the variables
114 |     keras.backend.manual_variable_initialization(True)
115 |     # Define the training feed
116 |     train_feed = {model.inputs[0]: train_images, y: train_labels}
117 | 
118 |     # Hook to log training progress
119 |     class _LoggerHook(tf.train.SessionRunHook):
120 |         def before_run(self, run_context):
121 |             """ Run this in session before_run """
122 |             return tf.train.SessionRunArgs(global_step)
123 | 
124 |         def after_run(self, run_context, run_values):
125 |             """ Run this in session after_run """
126 |             step = run_values.results
127 |             if step % 100 == 0:
128 |                 print('Iter {}/{}'.format(step, FLAGS.train_steps))
129 | 
130 |     with tf.train.MonitoredTrainingSession(
131 |             master=server.target,
132 |             is_chief=IS_CHIEF,
133 |             checkpoint_dir='logs_dir/{}'.format(time()),
134 |             hooks=[tf.train.StopAtStepHook(last_step=FLAGS.train_steps),
135 |                    _LoggerHook(), sync_replicas_hook],
136 |             save_checkpoint_steps=100,
137 |             config=SESS_CONFIG) as mon_sess:
138 |         keras.backend.set_session(mon_sess)
139 |         while not mon_sess.should_stop():
140 |             mon_sess.run(train_op, feed_dict=train_feed)
141 | 


--------------------------------------------------------------------------------
/federated-keras/keras_federated_classifier.py:
--------------------------------------------------------------------------------
  1 | # Helper libraries
  2 | import os
  3 | import numpy as np
  4 | from time import time
  5 | import sys
  6 | import federated_averaging_optimizer
  7 | 
  8 | # TensorFlow and tf.keras
  9 | import tensorflow as tf
 10 | from tensorflow import keras
 11 | 
 12 | # Trick to import from parent directory
 13 | sys.path.insert(1, os.path.join(sys.path[0], '..'))
 14 | 
 15 | flags = tf.app.flags
 16 | flags.DEFINE_integer("task_index", None,
 17 |                      "Worker task index, should be >= 0. task_index=0 is "
 18 |                      "the master worker task the performs the variable "
 19 |                      "initialization ")
 20 | flags.DEFINE_integer("train_steps", 1000,
 21 |                      "Number of (global) training steps to perform")
 22 | flags.DEFINE_string("ps_hosts", "localhost:2222",
 23 |                     "Comma-separated list of hostname:port pairs")
 24 | flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224",
 25 |                     "Comma-separated list of hostname:port pairs")
 26 | flags.DEFINE_string("job_name", None, "job name: worker or ps")
 27 | 
 28 | FLAGS = flags.FLAGS
 29 | 
 30 | # Steps between averages
 31 | INTERVAL_STEPS = 100
 32 | 
 33 | # Disable GPU to avoid OOM issues (could enable it for just one of the workers)
 34 | # Not necessary if workers are hosted in different machines
 35 | os.environ['CUDA_VISIBLE_DEVICES'] = ''
 36 | 
 37 | if FLAGS.job_name is None or FLAGS.job_name == "":
 38 |     raise ValueError("Must specify an explicit `job_name`")
 39 | if FLAGS.task_index is None or FLAGS.task_index == "":
 40 |     raise ValueError("Must specify an explicit `task_index`")
 41 | print("job name = %s" % FLAGS.job_name)
 42 | print("task index = %d" % FLAGS.task_index)
 43 | 
 44 | #Construct the cluster and start the server
 45 | ps_spec = FLAGS.ps_hosts.split(",")
 46 | worker_spec = FLAGS.worker_hosts.split(",")
 47 | 
 48 | # Get the number of workers.
 49 | NUM_WORKERS = len(worker_spec)
 50 | 
 51 | cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
 52 | 
 53 | server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
 54 | 
 55 | # The server will block here
 56 | if FLAGS.job_name == "ps":
 57 |     server.join()
 58 | 
 59 | fashion_mnist = keras.datasets.fashion_mnist
 60 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
 61 | 
 62 | CLASS_NAMES = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
 63 |                'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
 64 | 
 65 | # Split dataset between workers
 66 | train_images = np.array_split(train_images, NUM_WORKERS)[FLAGS.task_index]
 67 | train_labels = np.array_split(train_labels, NUM_WORKERS)[FLAGS.task_index]
 68 | print('Local dataset size: {}'.format(train_images.shape[0]))
 69 | 
 70 | # Normalize dataset
 71 | train_images = train_images / 255.0
 72 | test_images = test_images / 255.0
 73 | 
 74 | IS_CHIEF = (FLAGS.task_index == 0)
 75 | 
 76 | # We are not telling the MonitoredSession who is the chief so we need to
 77 | # prevent non-chief workers from saving checkpoints or summaries.
 78 | if IS_CHIEF:
 79 |     CHECKPOINT_DIR = 'logs_dir/{}'.format(time())
 80 | else:
 81 |     CHECKPOINT_DIR = None
 82 | 
 83 | WORKER_DEVICE = "/job:worker/task:%d" % FLAGS.task_index
 84 | 
 85 | # Place all ops in the local worker by default
 86 | with tf.device(WORKER_DEVICE):
 87 |     global_step = tf.train.get_or_create_global_step()
 88 | 
 89 |     # Define the model
 90 |     model = keras.Sequential([
 91 |         keras.layers.Flatten(input_shape=(28, 28)),
 92 |         keras.layers.Dense(128, activation=tf.nn.relu, name='relu'),
 93 |         keras.layers.Dense(10, activation=tf.nn.softmax, name='softmax')
 94 |     ])
 95 | 
 96 |     # Get placeholder for the labels
 97 |     y = tf.placeholder(tf.float32, shape=[None], name='labels')
 98 | 
 99 |     # Store reference to the output of the model
100 |     predictions = model.output
101 | 
102 |     with tf.name_scope('loss'):
103 |         loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))
104 | 
105 |     tf.summary.scalar('cross_entropy', loss)
106 | 
107 |     with tf.name_scope('train'):
108 |         # Define a device setter which will place a global copy of trainable variables
109 |         # in the parameter server.
110 |         device_setter = tf.train.replica_device_setter(worker_device=WORKER_DEVICE, cluster=cluster)
111 |         # Define our custom optimizer
112 |         optimizer = federated_averaging_optimizer.FederatedAveragingOptimizer(
113 |             tf.train.AdamOptimizer(0.001),
114 |             replicas_to_aggregate=NUM_WORKERS, interval_steps=INTERVAL_STEPS,
115 |             is_chief=IS_CHIEF, device_setter=device_setter)
116 |         train_op = optimizer.minimize(loss, global_step=global_step)
117 |         # Define the hook which initializes the optimizer
118 |         federated_average_hook = optimizer.make_session_run_hook()
119 | 
120 |     # ConfiProto for our session
121 |     SESS_CONFIG = tf.ConfigProto(
122 |         allow_soft_placement=True,
123 |         log_device_placement=False,
124 |         device_filters=["/job:ps",
125 |                         "/job:worker/task:%d" % FLAGS.task_index])
126 | 
127 |     # We need to let the MonitoredSession initialize the variables
128 |     keras.backend.manual_variable_initialization(True)
129 |     # Define the training feed
130 |     train_feed = {model.inputs[0]: train_images, y: train_labels}
131 | 
132 |     # Hook to log training progress
133 |     class _LoggerHook(tf.train.SessionRunHook):
134 |         def before_run(self, run_context):
135 |             """ Run this in session before_run """
136 |             return tf.train.SessionRunArgs(global_step)
137 | 
138 |         def after_run(self, run_context, run_values):
139 |             """ Run this in session after_run """
140 |             step = run_values.results
141 |             if step % 100 == 0:
142 |                 print('Iter {}/{}'.format(step, FLAGS.train_steps))
143 | 
144 |     with tf.train.MonitoredTrainingSession(
145 |             master=server.target,
146 |             checkpoint_dir=CHECKPOINT_DIR,
147 |             hooks=[tf.train.StopAtStepHook(last_step=FLAGS.train_steps),
148 |                    _LoggerHook(), federated_average_hook],
149 |             save_checkpoint_steps=100,
150 |             config=SESS_CONFIG) as mon_sess:
151 |         keras.backend.set_session(mon_sess)
152 |         while not mon_sess.should_stop():
153 |             mon_sess.run(train_op, feed_dict=train_feed)
154 | 


--------------------------------------------------------------------------------
/federated-sockets/FederatedHook.py:
--------------------------------------------------------------------------------
  1 | """# Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # =============================================================================="""
 17 | 
 18 | import socket
 19 | import time
 20 | import ssl
 21 | import hmac
 22 | import tensorflow as tf
 23 | import numpy as np
 24 | from config import SSL_CONF as SC
 25 | from config import SEND_RECEIVE_CONF as SRC
 26 | 
 27 | try:
 28 |     import cPickle as pickle
 29 | except ImportError:
 30 |     import pickle
 31 | 
 32 | 
 33 | class _FederatedHook(tf.train.SessionRunHook):
 34 |     """Provides a hook to implemente federated averaging with tensorflow.
 35 | 
 36 |       In a typical synchronous training environment, gradients will be averaged each
 37 |       step and then applied to the variables in one shot, after which replicas can
 38 |       fetch the new variables and continue. In a federated average training environment,
 39 |       model variables will be averaged every 'interval_steps' steps, and then the
 40 |       replicas will fetch the new variables and continue training locally. In the
 41 |       interval between two average operations, there is no data transfer, which can
 42 |       accelerate training.
 43 | 
 44 |       The hook has two different ways of working depending if it is the chief worker or not.
 45 | 
 46 |       The chief starts creating a socket that will act as server. Then it stays
 47 |       waiting _wait_time seconds and accepting connections of all those workers that
 48 |       want to join the training, and distributes a task index to each of them.
 49 |       This task index is not always neccesary. In our demos we use it to tell
 50 |       each worker which part of the dataset it has to use for the training and it
 51 |       could have other applications.
 52 | 
 53 |       Remember if you training is not going to be performed in a LAN you will
 54 |       need to do some port forwarding, we recommend you to have a look to this
 55 |       article we wrote about it:
 56 |       https://medium.com/comind/raspberry-pis-federated-learning-751b10fc92c9
 57 | 
 58 |       Once the training is going to start sends it's weights to the other workers,
 59 |       so that they all start with the same initial ones.
 60 |       After each batch is trained, it checks if _interval_steps has been completed,
 61 |       and if so, it gathers the weights of all the workers and its own, averages them
 62 |       and sends the average to all those workers that sended weights to it.
 63 | 
 64 |       Workers open a socket connection with the chief and wait to get their worker number.
 65 |       Once the training is going to start they wait for the chief to send them its weights.
 66 |       After each training round they check if _interval_steps has been completed,
 67 |       and if so, they send their weights to the chief and wait for it's response,
 68 |       the averaged weights with which they will continue training.
 69 |       """
 70 | 
 71 |     def __init__(self, is_chief, private_ip, public_ip, wait_time=30, interval_steps=100):
 72 | 
 73 |         """Construcs a FederatedHook object
 74 |             Args:
 75 |               is_chief (bool): whether it is going to act as chief or not.
 76 |               private_ip (str): complete local ip in which the chief is going to
 77 |                     serve its socket. Example: 172.134.65.123:7777
 78 |               public_ip (str): ip to which the workers are going to connect.
 79 |               interval_steps (int, optional): number of steps between two
 80 |                     "average op", which specifies how frequent a model
 81 |                     synchronization is performed.
 82 |               wait_time: how mucht time the chief should wait at the beginning
 83 |                     for the workers to connect.
 84 |          """
 85 |         self._is_chief = is_chief
 86 |         self._private_ip = private_ip.split(':')[0]
 87 |         self._private_port = int(private_ip.split(':')[1])
 88 |         self._public_ip = public_ip.split(':')[0]
 89 |         self._public_port = int(public_ip.split(':')[1])
 90 |         self._interval_steps = interval_steps
 91 |         self._wait_time = wait_time
 92 |         self._nex_task_index = 0
 93 |         # We get the number of connections that have been made, and which task_index
 94 |         # corresponds to this worker.
 95 |         self.task_index, self.num_workers = self._get_task_index()
 96 | 
 97 |     def _get_task_index(self):
 98 | 
 99 |         """Chief distributes task index number to workers that connect to it and
100 |         lets them know how many workers are there in total.
101 |         Returns:
102 |           task_index (int): task index corresponding to this worker.
103 |           num_workers (int): number of total workers.
104 |          """
105 | 
106 |         if self._is_chief:
107 |             self._server_socket = self._start_socket_server()
108 |             self._server_socket.settimeout(5)
109 |             users = []
110 |             t_end = time.time() + self._wait_time
111 | 
112 |             while time.time() < t_end:
113 |                 try:
114 |                     sock, _ = self._server_socket.accept()
115 |                     connection_socket = ssl.wrap_socket(
116 |                         sock,
117 |                         server_side=True,
118 |                         certfile=SC.cert_path,
119 |                         keyfile=SC.key_path,
120 |                         ssl_version=ssl.PROTOCOL_TLSv1)
121 |                     if connection_socket not in users:
122 |                         users.append(connection_socket)
123 |                 except socket.timeout:
124 |                     pass
125 | 
126 |             num_workers = len(users) + 1
127 |             _ = [us.send((str(i+1) + ':' + str(num_workers)).encode('utf-8')) \
128 |             for i, us in enumerate(users)]
129 |             self._nex_task_index = len(users) + 1
130 |             _ = [us.close() for us in users]
131 | 
132 |             self._server_socket.settimeout(120)
133 |             return 0, num_workers
134 | 
135 |         client_socket = self._start_socket_worker()
136 |         message = client_socket.recv(1024).decode('utf-8').split(':')
137 |         client_socket.close()
138 |         return int(message[0]), int(message[1])
139 | 
140 |     def _create_placeholders(self):
141 |         """Creates the placeholders that we will use to inject the weights into the graph"""
142 |         for var in tf.trainable_variables():
143 |             self._placeholders.append(tf.placeholder_with_default(var, var.shape,
144 |                                                                   name="%s/%s" % ("FedAvg",
145 |                                                                                   var.op.name)))
146 | 
147 |     def _assign_vars(self, local_vars):
148 |         """Utility to refresh local variables.
149 | 
150 |         Args:
151 |           local_vars: List of local variables.
152 |           global_vars: List of global variables.
153 | 
154 |         Returns:
155 |           refresh_ops: The ops to assign value of global vars to local vars.
156 |         """
157 |         reassign_ops = []
158 |         for var, fvar in zip(local_vars, self._placeholders):
159 |             reassign_ops.append(tf.assign(var, fvar))
160 |         return tf.group(*(reassign_ops))
161 | 
162 |     @staticmethod
163 |     def _receiving_subroutine(connection_socket):
164 |         """Subroutine inside _get_np_array to recieve a list of numpy arrays.
165 |         If the sending was not correctly recieved it sends back an error message
166 |         to the sender in order to try it again.
167 |         Args:
168 |           connection_socket (socket): a socket with a connection already
169 |               established.
170 |          """
171 |         timeout = 0.5
172 |         while True:
173 |             ultimate_buffer = b''
174 |             connection_socket.settimeout(240)
175 |             first_round = True
176 |             while True:
177 |                 try:
178 |                     receiving_buffer = connection_socket.recv(SRC.buffer)
179 |                 except socket.timeout:
180 |                     break
181 |                 if first_round:
182 |                     connection_socket.settimeout(timeout)
183 |                     first_round = False
184 |                 if not receiving_buffer:
185 |                     break
186 |                 ultimate_buffer += receiving_buffer
187 | 
188 |             pos_signature = SRC.hashsize
189 |             signature = ultimate_buffer[:pos_signature]
190 |             message = ultimate_buffer[pos_signature:]
191 |             good_signature = hmac.new(SRC.key, message, SRC.hashfunction).digest()
192 | 
193 |             if signature != good_signature:
194 |                 connection_socket.send(SRC.error)
195 |                 timeout += 0.5
196 |                 continue
197 |             else:
198 |                 connection_socket.send(SRC.recv)
199 |                 connection_socket.settimeout(120)
200 |                 return message
201 | 
202 |     def _get_np_array(self, connection_socket):
203 |         """Routine to recieve a list of numpy arrays.
204 |             Args:
205 |               connection_socket (socket): a socket with a connection already
206 |                   established.
207 |          """
208 | 
209 |         message = self._receiving_subroutine(connection_socket)
210 |         final_image = pickle.loads(message)
211 |         return final_image
212 | 
213 |     @staticmethod
214 |     def _send_np_array(arrays_to_send, connection_socket):
215 |         """Routine to send a list of numpy arrays. It sends it as many time as necessary
216 |             Args:
217 |               connection_socket (socket): a socket with a connection already
218 |                   established.
219 |          """
220 |         serialized = pickle.dumps(arrays_to_send)
221 |         signature = hmac.new(SRC.key, serialized, SRC.hashfunction).digest()
222 |         assert len(signature) == SRC.hashsize
223 |         message = signature + serialized
224 |         connection_socket.settimeout(240)
225 |         connection_socket.sendall(message)
226 |         while True:
227 |             check = connection_socket.recv(len(SRC.error))
228 |             if check == SRC.error:
229 |                 connection_socket.sendall(message)
230 |             elif check == SRC.recv:
231 |                 connection_socket.settimeout(120)
232 |                 break
233 | 
234 |     def _start_socket_server(self):
235 |         """Creates a socket with ssl protection that will act as server.
236 |             Returns:
237 |               sever_socket (socket): ssl secured socket that will act as server.
238 |          """
239 |         server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
240 |         server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
241 |         context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
242 |         context.options |= ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1  # optional
243 |         context.set_ciphers('EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH')
244 |         server_socket.bind((self._private_ip, self._private_port))
245 |         server_socket.listen()
246 |         return server_socket
247 | 
248 |     def _start_socket_worker(self):
249 |         """Creates a socket with ssl protection that will act as client.
250 |            Returns:
251 |               sever_socket (socket): ssl secured socket that will work as client.
252 |          """
253 |         to_wrap_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
254 |         context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
255 |         context.options |= ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1  # optional
256 | 
257 |         client_socket = ssl.wrap_socket(to_wrap_socket)
258 |         client_socket.connect((self._public_ip, self._public_port))
259 |         return client_socket
260 | 
261 |     def begin(self):
262 |         """Session begin"""
263 |         self._placeholders = []
264 |         self._create_placeholders()
265 |         self._update_local_vars_op = self._assign_vars(tf.trainable_variables())
266 |         self._global_step = tf.get_collection(tf.GraphKeys.GLOBAL_STEP)[0]
267 | 
268 |     def after_create_session(self, session, coord):
269 |         """
270 |         If chief:
271 |             Once the training is going to start sends it's weights to the other
272 |             workers, so that they all start with the same initial ones.
273 |             Once it has send the weights to all the workers it sends them a
274 |             signal to start training.
275 |         Workers:
276 |             Wait for the chief to send them its weights and inject them into
277 |             the graph.
278 |          """
279 |         if self._is_chief:
280 |             users = []
281 |             addresses = []
282 |             while len(users) < (self.num_workers - 1):
283 |                 try:
284 |                     self._server_socket.settimeout(30)
285 |                     sock, address = self._server_socket.accept()
286 |                     connection_socket = ssl.wrap_socket(
287 |                         sock,
288 |                         server_side=True,
289 |                         certfile=SC.cert_path,
290 |                         keyfile=SC.key_path,
291 |                         ssl_version=ssl.PROTOCOL_TLSv1)
292 | 
293 |                     print('Connected: ' + address[0] + ':' + str(address[1]))
294 |                 except socket.timeout:
295 |                     print('Some workers could not connect')
296 |                     break
297 |                 try:
298 |                     print('SENDING Worker: ' + address[0] + ':' + str(address[1]))
299 |                     self._send_np_array(session.run(tf.trainable_variables()), connection_socket)
300 |                     print('SENT Worker {}'.format(len(users)))
301 |                     users.append(connection_socket)
302 |                     addresses.append(address)
303 |                 except (ConnectionResetError, BrokenPipeError):
304 |                     print('Could not send to : '
305 |                           + address[0] + ':' + str(address[1])
306 |                           + ', fallen worker')
307 |                     connection_socket.close()
308 |             for i, user in enumerate(users):
309 |                 try:
310 |                     user.send(SRC.signal)
311 |                     user.close()
312 |                 except (ConnectionResetError, BrokenPipeError):
313 |                     print('Fallen Worker: ' + addresses[i][0] + ':' + str(address[i][1]))
314 |                     self.num_workers -= 1
315 |                     try:
316 |                         user.close()
317 |                     except (ConnectionResetError, BrokenPipeError):
318 |                         pass
319 |         else:
320 |             print('Starting Initialization')
321 |             client_socket = self._start_socket_worker()
322 |             broadcasted_weights = self._get_np_array(client_socket)
323 |             feed_dict = {}
324 |             for placeh, brweigh in zip(self._placeholders, broadcasted_weights):
325 |                 feed_dict[placeh] = brweigh
326 |             session.run(self._update_local_vars_op, feed_dict=feed_dict)
327 |             print('Initialization finished')
328 |             client_socket.settimeout(120)
329 |             client_socket.recv(len(SRC.signal))
330 |             client_socket.close()
331 | 
332 |     def before_run(self, run_context):
333 |         """ Session before_run"""
334 |         return tf.train.SessionRunArgs(self._global_step)
335 | 
336 |     def after_run(self, run_context, run_values):
337 |         """
338 |         Both chief and workers, check if they should average their weights in
339 |         this roud. Is this is the case:
340 | 
341 |         If chief:
342 |             Tries to gather the weights of all the workers, but ignores those
343 |             that lost connection at some point.
344 |             It averages them and then send them back to the workers.
345 |             Finally in injects the averaged weights to its own graph.
346 |         Workers:
347 |             Send their weights to the chief.
348 |             Wait for the chief to send them the averaged weights and inject them into
349 |             their graph.
350 |          """
351 |         step_value = run_values.results
352 |         session = run_context.session
353 |         if step_value % self._interval_steps == 0 and not step_value == 0:
354 |             if self._is_chief:
355 |                 self._server_socket.listen(self.num_workers - 1)
356 |                 gathered_weights = [session.run(tf.trainable_variables())]
357 |                 users = []
358 |                 addresses = []
359 |                 for i in range(self.num_workers - 1):
360 |                     try:
361 |                         self._server_socket.settimeout(30)
362 |                         sock, address = self._server_socket.accept()
363 |                         connection_socket = ssl.wrap_socket(
364 |                             sock,
365 |                             server_side=True,
366 |                             certfile=SC.cert_path,
367 |                             keyfile=SC.key_path,
368 |                             ssl_version=ssl.PROTOCOL_TLSv1)
369 | 
370 |                         print('Connected: ' + address[0] + ':' + str(address[1]))
371 |                     except socket.timeout:
372 |                         print('Some workers could not connect')
373 |                         break
374 |                     try:
375 |                         recieved = self._get_np_array(connection_socket)
376 |                         gathered_weights.append(recieved)
377 |                         users.append(connection_socket)
378 |                         addresses.append(address)
379 |                         print('Received from ' + address[0] + ':' + str(address[1]))
380 |                     except (ConnectionResetError, BrokenPipeError):
381 |                         print('Could not recieve from : '
382 |                               + address[0] + ':' + str(address[1])
383 |                               + ', fallen worker')
384 |                         connection_socket.close()
385 | 
386 |                 self.num_workers = len(users) + 1
387 | 
388 |                 print('Average applied '
389 |                       + 'with {} workers, iter: {}'.format(self.num_workers, step_value))
390 |                 rearranged_weights = []
391 | 
392 |                 #In gathered_weights, each list represents the weights of each worker.
393 |                 #We want to gahter in each list the weights of a single layer so
394 |                 #to average them afterwards
395 |                 for i in range(len(gathered_weights[0])):
396 |                     rearranged_weights.append([elem[i] for elem in gathered_weights])
397 |                 for i, elem in enumerate(rearranged_weights):
398 |                     rearranged_weights[i] = np.mean(elem, axis=0)
399 | 
400 |                 for i, user in enumerate(users):
401 |                     try:
402 |                         self._send_np_array(rearranged_weights, user)
403 |                         user.close()
404 |                     except (ConnectionResetError, BrokenPipeError):
405 |                         print('Fallen Worker: ' + addresses[i][0] + ':' + str(address[i][1]))
406 |                         self.num_workers -= 1
407 |                         try:
408 |                             user.close()
409 |                         except socket.timeout:
410 |                             pass
411 | 
412 |                 feed_dict = {}
413 |                 for placeh, reweigh in zip(self._placeholders, rearranged_weights):
414 |                     feed_dict[placeh] = reweigh
415 |                 session.run(self._update_local_vars_op, feed_dict=feed_dict)
416 | 
417 |             else:
418 |                 worker_socket = self._start_socket_worker()
419 |                 print('Sending weights')
420 |                 value = session.run(tf.trainable_variables())
421 |                 self._send_np_array(value, worker_socket)
422 | 
423 |                 broadcasted_weights = self._get_np_array(worker_socket)
424 |                 feed_dict = {}
425 |                 for placeh, brweigh in zip(self._placeholders, broadcasted_weights):
426 |                     feed_dict[placeh] = brweigh
427 |                 session.run(self._update_local_vars_op, feed_dict=feed_dict)
428 |                 print('Weights succesfully updated, iter: {}'.format(step_value))
429 |                 worker_socket.close()
430 | 
431 |     def end(self, session):
432 |         """ Session end """
433 |         if self._is_chief:
434 |             self._server_socket.close()
435 | 


--------------------------------------------------------------------------------
/federated-sockets/README.md:
--------------------------------------------------------------------------------
 1 | # Implementation with custom hook
 2 | 
 3 | This is the implementation of Federated Averaging using our custom hook. If you wish to use this same implementation with your own code just import the FederatedHook, set the config file and launch!
 4 | 
 5 | ## Usage
 6 | 
 7 | First of all set the config file:
 8 | 
 9 | > `SEND_RECEIVE_CONF.key = Shared key to sign messages and guarantee integrity` (a Bytearray, you can leave it as is)
10 | 
11 | Generate a private key and a certificate with: `openssl req -new -x509 -days 365 -nodes -out server.pem -keyout server.key`
12 | 
13 | > `SSL_CONF.key_path = Path to your private key`
14 | 
15 | > `SSL_CONF.cert_path = Path to your certificate`
16 | 
17 | Next set the IP's in the main code to your own. No need to change this if you are using localhost.
18 | 
19 | Finally, set the `WAIT_TIME`, the chief worker will wait for new workers during this amount of seconds before the training.
20 | 
21 | And launch the shells, as many as you want:
22 | 
23 | * 1st shell: `python3 basic_socket_fed_classifier.py --is_chief=True`
24 | 
25 | * Next shells: `python3 basic_socket_fed_classifier.py`
26 | 
27 | ## Troubleshooting and Help
28 | 
29 | coMind has public Slack and Telegram channels which are a great place to ask questions and all things related to federated machine learning.
30 | 
31 | ## Useful resources
32 | 
33 | Check the [medium post](https://medium.com/comind/raspberry-pis-federated-learning-751b10fc92c9) to learn about port forwarding and how to set-up your chief host.
34 | 
35 | Check [this script](https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py) to see how to generate CIFAR-10 TFRecords.
36 | 
37 | ## About
38 | 
39 | coMind is an open source project for training privacy-preserving federated deep learning models. 
40 | 
41 | * https://comind.org/
42 | * [Twitter](https://twitter.com/coMindOrg)
43 | 


--------------------------------------------------------------------------------
/federated-sockets/advanced_socket_fed_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow
 19 | import tensorflow as tf
 20 | 
 21 | # Helper libraries
 22 | import numpy as np
 23 | from time import time
 24 | import multiprocessing
 25 | 
 26 | # Custom federated hook
 27 | from FederatedHook import _FederatedHook
 28 | 
 29 | flags = tf.app.flags
 30 | 
 31 | flags.DEFINE_boolean("is_chief", False, "True if this worker is chief")
 32 | 
 33 | FLAGS = flags.FLAGS
 34 | 
 35 | # You can safely tune these variables
 36 | BATCH_SIZE = 128
 37 | SHUFFLE_SIZE = BATCH_SIZE * 100
 38 | EPOCHS = 250
 39 | EPOCHS_PER_DECAY = 50
 40 | INTERVAL_STEPS = 100 # Steps between averages
 41 | WAIT_TIME = 30 # How many seconds to wait for new workers to connect
 42 | BATCHES_TO_PREFETCH = 1
 43 | # -----------------
 44 | 
 45 | # Set these IPs to your own, can leave as localhost for local testing
 46 | CHIEF_PUBLIC_IP = 'localhost:7777' # Public IP of the chief worker
 47 | CHIEF_PRIVATE_IP = 'localhost:7777' # Private IP of the chief worker
 48 | 
 49 | # Create the custom hook
 50 | federated_hook = _FederatedHook(FLAGS.is_chief, CHIEF_PRIVATE_IP, CHIEF_PUBLIC_IP, WAIT_TIME, INTERVAL_STEPS)
 51 | 
 52 | # Dataset dependent constants
 53 | NUM_TRAIN_IMAGES = int(50000 / federated_hook.num_workers)
 54 | NUM_TEST_IMAGES = 10000
 55 | HEIGHT = 32
 56 | WIDTH = 32
 57 | CHANNELS = 3
 58 | NUM_BATCH_FILES = 5
 59 | 
 60 | # Path to TFRecord files (check readme for instructions on how to get these files)
 61 | cifar10_train_files = ['cifar-10-tf-records/train{}.tfrecords'.format(i) for i in range(NUM_BATCH_FILES)]
 62 | cifar10_test_file = 'cifar-10-tf-records/test.tfrecords'
 63 | 
 64 | # Shuffle filenames before loading them
 65 | np.random.shuffle(cifar10_train_files)
 66 | 
 67 | CHECKPOINT_DIR = 'logs_dir/{}'.format(time())
 68 | print('Checkpoint directory: ' + CHECKPOINT_DIR)
 69 | 
 70 | global_step = tf.train.get_or_create_global_step()
 71 | 
 72 | # Check number of available CPUs
 73 | CPU_COUNT = int(multiprocessing.cpu_count() / federated_hook.num_workers)
 74 | 
 75 | # Define input pipeline, place these ops in the cpu
 76 | with tf.name_scope('dataset'), tf.device('/cpu:0'):
 77 |     # Map function to decode data and preprocess it
 78 |     def preprocess(serialized_examples):
 79 |         # Parse a batch
 80 |         features = tf.parse_example(serialized_examples, {'image': tf.FixedLenFeature([], tf.string), 'label': tf.FixedLenFeature([], tf.int64)})
 81 |         # Decode and reshape imag
 82 |         image = tf.map_fn(lambda img: tf.reshape(tf.decode_raw(img, tf.uint8), tf.stack([HEIGHT, WIDTH, CHANNELS])), features['image'], dtype=tf.uint8, name='decode')
 83 |         # Cast image
 84 |         casted_image = tf.cast(image, tf.float32, name='input_cast')
 85 |         # Resize image for testing
 86 |         resized_image = tf.image.resize_image_with_crop_or_pad(casted_image, 24, 24)
 87 |         # Augment images for training
 88 |         distorted_image = tf.map_fn(lambda img: tf.random_crop(img, [24, 24, 3]),
 89 |                                     casted_image, name='random_crop')
 90 |         distorted_image = tf.image.random_flip_left_right(distorted_image)
 91 |         distorted_image = tf.image.random_brightness(distorted_image, 63)
 92 |         distorted_image = tf.image.random_contrast(distorted_image, 0.2, 1.8)
 93 |         # Check if test or train mode
 94 |         result = tf.cond(train_mode, lambda: distorted_image, lambda: resized_image)
 95 |         # Standardize images
 96 |         processed_image = tf.map_fn(lambda img: tf.image.per_image_standardization(img),
 97 |                                     result, name='standardization')
 98 |         return processed_image, features['label']
 99 |     # Placeholders for the iterator
100 |     filename_placeholder = tf.placeholder(tf.string, name='input_filename')
101 |     batch_size = tf.placeholder(tf.int64, name='batch_size')
102 |     shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
103 |     train_mode = tf.placeholder(tf.bool, name='train_mode')
104 | 
105 |     # Create dataset, shuffle, repeat, batch, map and prefetch
106 |     dataset = tf.data.TFRecordDataset(filename_placeholder)
107 |     dataset = dataset.shard(federated_hook.num_workers, federated_hook.task_index)
108 |     dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
109 |     dataset = dataset.repeat(EPOCHS)
110 |     dataset = dataset.batch(batch_size)
111 |     dataset = dataset.map(preprocess, CPU_COUNT)
112 |     dataset = dataset.prefetch(BATCHES_TO_PREFETCH)
113 |     # Define a feedable iterator and the initialization op
114 |     iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
115 |     dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
116 |     X, y = iterator.get_next()
117 | 
118 | # Define our model
119 | first_conv = tf.layers.conv2d(X, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='first_conv')
120 | 
121 | first_pool = tf.nn.max_pool(first_conv, [1, 3, 3 ,1], [1, 2, 2, 1], padding='SAME', name='first_pool')
122 | 
123 | first_norm = tf.nn.lrn(first_pool, 4, alpha=0.001 / 9.0, beta=0.75, name='first_norm')
124 | 
125 | second_conv = tf.layers.conv2d(first_norm, 64, 5, padding='SAME', activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=5e-2), name='second_conv')
126 | 
127 | second_norm = tf.nn.lrn(second_conv, 4, alpha=0.001 / 9.0, beta=0.75, name='second_norm')
128 | 
129 | second_pool = tf.nn.max_pool(second_norm, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME', name='second_pool')
130 | 
131 | flatten_layer = tf.layers.flatten(second_pool, name='flatten')
132 | 
133 | first_relu = tf.layers.dense(flatten_layer, 384, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='first_relu')
134 | 
135 | second_relu = tf.layers.dense(first_relu, 192, activation=tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.04), name='second_relu')
136 | 
137 | logits = tf.layers.dense(second_relu, 10, kernel_initializer=tf.truncated_normal_initializer(stddev=1/192.0), name='logits')
138 | 
139 | # Object to keep moving averages of our metrics (for tensorboard)
140 | summary_averages = tf.train.ExponentialMovingAverage(0.9)
141 | 
142 | # Define cross_entropy loss
143 | with tf.name_scope('loss'):
144 |     base_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits), name='base_loss')
145 |     # Add regularization loss to both relu layers
146 |     regularizer_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'relu/kernel' in v.name], name='regularizer_loss') * 0.004
147 |     loss = tf.add(base_loss, regularizer_loss)
148 |     loss_averages_op = summary_averages.apply([loss])
149 |     # Store moving average of the loss
150 |     tf.summary.scalar('cross_entropy', summary_averages.average(loss))
151 | 
152 | with tf.name_scope('accuracy'):
153 |     with tf.name_scope('correct_prediction'):
154 |         # Compare prediction with actual label
155 |         correct_prediction = tf.equal(tf.argmax(logits, 1), y)
156 |     # Average correct predictions in the current batch
157 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy_metric')
158 |     accuracy_averages_op = summary_averages.apply([accuracy])
159 |     # Store moving average of the accuracy
160 |     tf.summary.scalar('accuracy', summary_averages.average(accuracy))
161 | 
162 | N_BATCHES = int(NUM_TRAIN_IMAGES / BATCH_SIZE)
163 | LAST_STEP = int(N_BATCHES * EPOCHS)
164 | 
165 | # Define moving averages of the trainable variables. This sometimes improve
166 | # the performance of the trained mode
167 | with tf.name_scope('variable_averages'):
168 |     variable_averages = tf.train.ExponentialMovingAverage(0.9999, global_step)
169 |     variable_averages_op = variable_averages.apply(tf.trainable_variables())
170 | 
171 | # Define optimizer and training op
172 | with tf.name_scope('train'):
173 |     # Make decaying learning rate
174 |     lr = tf.train.exponential_decay(0.1, global_step, N_BATCHES * EPOCHS_PER_DECAY, 0.1, staircase=True)
175 |     tf.summary.scalar('learning_rate', lr)
176 |     # Make train_op dependent on moving averages ops. Otherwise they will be
177 |     # disconnected from the graph
178 |     with tf.control_dependencies([loss_averages_op, accuracy_averages_op, variable_averages_op]):
179 |         train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss, global_step=global_step)
180 | 
181 | print('Graph definition finished')
182 | sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
183 | 
184 | print('Training {} batches...'.format(LAST_STEP))
185 | 
186 | # Logger hook to keep track of the training
187 | class _LoggerHook(tf.train.SessionRunHook):
188 |     def begin(self):
189 |         """ Run this in session begin """
190 |         self._total_loss = 0
191 |         self._total_acc = 0
192 | 
193 |     def before_run(self, run_context):
194 |         """ Run this in session before_run """
195 |         return tf.train.SessionRunArgs([loss, accuracy, global_step])
196 | 
197 |     def after_run(self, run_context, run_values):
198 |         """ Run this in session after_run """
199 |         loss_value, acc_value, step_value = run_values.results
200 |         self._total_loss += loss_value
201 |         self._total_acc += acc_value
202 |         if (step_value + 1) % N_BATCHES == 0:
203 |             print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(
204 |                 int(step_value / N_BATCHES) + 1,
205 |                 EPOCHS, self._total_loss / N_BATCHES, self._total_acc / N_BATCHES))
206 |             self._total_loss = 0
207 |             self._total_acc = 0
208 | 
209 | # Hook to initialize the dataset
210 | class _InitHook(tf.train.SessionRunHook):
211 |     def after_create_session(self, session, coord):
212 |         """ Run this after creating session """
213 |         session.run(dataset_init_op, feed_dict={
214 |             filename_placeholder: cifar10_train_files,
215 |             batch_size: BATCH_SIZE, shuffle_size: SHUFFLE_SIZE, train_mode: True})
216 | 
217 | with tf.name_scope('monitored_session'):
218 |     with tf.train.MonitoredTrainingSession(
219 |             checkpoint_dir=CHECKPOINT_DIR,
220 |             hooks=[_LoggerHook(), _InitHook(), federated_hook,
221 |                    tf.train.CheckpointSaverHook(checkpoint_dir=CHECKPOINT_DIR,
222 |                                                 save_steps=N_BATCHES,
223 |                                                 saver=tf.train.Saver(
224 |                                                     variable_averages.variables_to_restore()))],
225 |             config=sess_config,
226 |             save_checkpoint_secs=None) as mon_sess:
227 |         while not mon_sess.should_stop():
228 |             mon_sess.run(train_op)
229 | 
230 | print('--- Begin Evaluation ---')
231 | # Reset graph to clear any ops stored in other devices
232 | tf.reset_default_graph()
233 | with tf.Session() as sess:
234 |     ckpt = tf.train.get_checkpoint_state(CHECKPOINT_DIR)
235 |     saver = tf.train.import_meta_graph(ckpt.model_checkpoint_path + '.meta', clear_devices=True)
236 |     saver.restore(sess, ckpt.model_checkpoint_path)
237 |     print('Model restored')
238 |     graph = tf.get_default_graph()
239 |     filename_placeholder = graph.get_tensor_by_name('dataset/input_filename:0')
240 |     batch_size = graph.get_tensor_by_name('dataset/batch_size:0')
241 |     shuffle_size = graph.get_tensor_by_name('dataset/shuffle_size:0')
242 |     train_mode = graph.get_tensor_by_name('dataset/train_mode:0')
243 |     accuracy = graph.get_tensor_by_name('accuracy/accuracy_metric:0')
244 |     dataset_init_op = graph.get_operation_by_name('dataset/dataset_init')
245 |     sess.run(dataset_init_op, feed_dict={filename_placeholder: cifar10_test_file, batch_size: NUM_TEST_IMAGES, shuffle_size: 1, train_mode: False})
246 |     print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
247 | 


--------------------------------------------------------------------------------
/federated-sockets/basic_socket_fed_classifier.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2018 coMind. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # https://comind.org/
 16 | # ==============================================================================
 17 | 
 18 | # TensorFlow and tf.keras
 19 | import tensorflow as tf
 20 | from tensorflow import keras
 21 | 
 22 | # Custom federated hook
 23 | from FederatedHook import _FederatedHook
 24 | 
 25 | # Helper libraries
 26 | import os
 27 | import numpy as np
 28 | from time import time
 29 | 
 30 | flags = tf.app.flags
 31 | 
 32 | flags.DEFINE_boolean("is_chief", False, "True if this worker is chief")
 33 | 
 34 | FLAGS = flags.FLAGS
 35 | 
 36 | os.environ['CUDA_VISIBLE_DEVICES'] = ''
 37 | 
 38 | # You can safely tune these variables
 39 | BATCH_SIZE = 32
 40 | EPOCHS = 5
 41 | INTERVAL_STEPS = 100 # Steps between averages
 42 | WAIT_TIME = 30 # How many seconds to wait for new workers to connect
 43 | # -----------------
 44 | 
 45 | # Set these IPs to your own, can leave as localhost for local testing
 46 | CHIEF_PUBLIC_IP = 'localhost:7777' # Public IP of the chief worker
 47 | CHIEF_PRIVATE_IP = 'localhost:7777' # Private IP of the chief worker
 48 | 
 49 | # Create the custom hook
 50 | federated_hook = _FederatedHook(FLAGS.is_chief, CHIEF_PRIVATE_IP, CHIEF_PUBLIC_IP, WAIT_TIME, INTERVAL_STEPS)
 51 | 
 52 | # Load dataset as numpy arrays
 53 | fashion_mnist = keras.datasets.fashion_mnist
 54 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
 55 | 
 56 | # Split dataset
 57 | train_images = np.array_split(train_images, federated_hook.num_workers)[federated_hook.task_index]
 58 | train_labels = np.array_split(train_labels, federated_hook.num_workers)[federated_hook.task_index]
 59 | 
 60 | # You can safely tune this variable
 61 | SHUFFLE_SIZE = train_images.shape[0]
 62 | # -----------------
 63 | 
 64 | print('Local dataset size: {}'.format(train_images.shape[0]))
 65 | 
 66 | # Normalize dataset
 67 | train_images = train_images / 255.0
 68 | test_images = test_images / 255.0
 69 | 
 70 | CHECKPOINT_DIR = 'logs_dir/{}'.format(time())
 71 | 
 72 | global_step = tf.train.get_or_create_global_step()
 73 | 
 74 | # Define input pipeline, place these ops in the cpu
 75 | with tf.name_scope('dataset'), tf.device('/cpu:0'):
 76 |     # Placeholders for the iterator
 77 |     images_placeholder = tf.placeholder(train_images.dtype, [None, train_images.shape[1], train_images.shape[2]])
 78 |     labels_placeholder = tf.placeholder(train_labels.dtype, [None])
 79 |     batch_size = tf.placeholder(tf.int64)
 80 |     shuffle_size = tf.placeholder(tf.int64, name='shuffle_size')
 81 | 
 82 |     # Create dataset, shuffle, repeat and batch
 83 |     dataset = tf.data.Dataset.from_tensor_slices((images_placeholder, labels_placeholder))
 84 |     dataset = dataset.shuffle(shuffle_size, reshuffle_each_iteration=True)
 85 |     dataset = dataset.repeat(EPOCHS)
 86 |     dataset = dataset.batch(batch_size)
 87 |     iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
 88 |     dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
 89 |     X, y = iterator.get_next()
 90 | 
 91 | # Define our model
 92 | flatten_layer = tf.layers.flatten(X, name='flatten')
 93 | 
 94 | dense_layer = tf.layers.dense(flatten_layer, 128, activation=tf.nn.relu, name='relu')
 95 | 
 96 | predictions = tf.layers.dense(dense_layer, 10, activation=tf.nn.softmax, name='softmax')
 97 | 
 98 | # Object to keep moving averages of our metrics (for tensorboard)
 99 | summary_averages = tf.train.ExponentialMovingAverage(0.9)
100 | 
101 | # Define cross_entropy loss
102 | with tf.name_scope('loss'):
103 |     loss = tf.reduce_mean(keras.losses.sparse_categorical_crossentropy(y, predictions))
104 |     loss_averages_op = summary_averages.apply([loss])
105 |     # Store moving average of the loss
106 |     tf.summary.scalar('cross_entropy', summary_averages.average(loss))
107 | 
108 | with tf.name_scope('accuracy'):
109 |     with tf.name_scope('correct_prediction'):
110 |         # Compare prediction with actual label
111 |         correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.cast(y, tf.int64))
112 |     # Average correct predictions in the current batch
113 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
114 |     accuracy_averages_op = summary_averages.apply([accuracy])
115 |     # Store moving average of the accuracy
116 |     tf.summary.scalar('accuracy', summary_averages.average(accuracy))
117 | 
118 | # Define optimizer and training op
119 | with tf.name_scope('train'):
120 |     # Make train_op dependent on moving averages ops. Otherwise they will be
121 |     # disconnected from the graph
122 |     with tf.control_dependencies([loss_averages_op, accuracy_averages_op]):
123 |         train_op = tf.train.AdamOptimizer(0.001).minimize(loss, global_step=global_step)
124 | 
125 | SESS_CONFIG = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
126 | 
127 | N_BATCHES = int(train_images.shape[0] / BATCH_SIZE)
128 | LAST_STEP = int(N_BATCHES * EPOCHS)
129 | 
130 | # Logger hook to keep track of the training
131 | class _LoggerHook(tf.train.SessionRunHook):
132 |     def begin(self):
133 |         """ Run this in session begin """
134 |         self._total_loss = 0
135 |         self._total_acc = 0
136 | 
137 |     def before_run(self, run_context):
138 |         """ Run this in session before_run """
139 |         return tf.train.SessionRunArgs([loss, accuracy, global_step])
140 | 
141 |     def after_run(self, run_context, run_values):
142 |         """ Run this in session after_run """
143 |         loss_value, acc_value, step_value = run_values.results
144 |         self._total_loss += loss_value
145 |         self._total_acc += acc_value
146 |         if (step_value + 1) % N_BATCHES == 0:
147 |             print("Epoch {}/{} - loss: {:.4f} - acc: {:.4f}".format(
148 |                 int(step_value / N_BATCHES) + 1,
149 |                 EPOCHS, self._total_loss / N_BATCHES,
150 |                 self._total_acc / N_BATCHES))
151 |             self._total_loss = 0
152 |             self._total_acc = 0
153 | 
154 | class _InitHook(tf.train.SessionRunHook):
155 |     """ Hook to initialize the dataset """
156 |     def after_create_session(self, session, coord):
157 |         """ Run this after creating session """
158 |         session.run(dataset_init_op, feed_dict={
159 |             images_placeholder: train_images,
160 |             labels_placeholder: train_labels,
161 |             shuffle_size: SHUFFLE_SIZE, batch_size: BATCH_SIZE})
162 | 
163 | print("Worker {} ready".format(federated_hook.task_index))
164 | 
165 | with tf.name_scope('monitored_session'):
166 |     with tf.train.MonitoredTrainingSession(
167 |             checkpoint_dir=CHECKPOINT_DIR,
168 |             hooks=[_LoggerHook(), _InitHook(), federated_hook],
169 |             config=SESS_CONFIG,
170 |             save_checkpoint_steps=N_BATCHES) as mon_sess:
171 |         while not mon_sess.should_stop():
172 |             mon_sess.run(train_op)
173 | 
174 | print('--- Begin Evaluation ---')
175 | with tf.Session() as sess:
176 |     ckpt = tf.train.get_checkpoint_state(CHECKPOINT_DIR)
177 |     tf.train.Saver().restore(sess, ckpt.model_checkpoint_path)
178 |     print('Model restored')
179 |     sess.run(dataset_init_op, feed_dict={
180 |         images_placeholder: test_images,
181 |         labels_placeholder: test_labels,
182 |         shuffle_size: 1, batch_size: test_images.shape[0]})
183 |     print('Test accuracy: {:4f}'.format(sess.run(accuracy)))
184 | 


--------------------------------------------------------------------------------
/federated-sockets/config.py:
--------------------------------------------------------------------------------
 1 | import hashlib
 2 | 
 3 | SEND_RECEIVE_CONF = lambda x: x
 4 | SEND_RECEIVE_CONF.key = b'4C5jwen4wpNEjBeq1YmdBayIQ1oD'
 5 | SEND_RECEIVE_CONF.hashfunction = hashlib.sha1
 6 | SEND_RECEIVE_CONF.hashsize = int(160 / 8)
 7 | SEND_RECEIVE_CONF.error = b'error'
 8 | SEND_RECEIVE_CONF.recv = b'reciv'
 9 | SEND_RECEIVE_CONF.signal = b'go!go!go!'
10 | SEND_RECEIVE_CONF.buffer = 8192*2
11 | 
12 | SSL_CONF = lambda x: x
13 | SSL_CONF.key_path = 'server.key'
14 | SSL_CONF.cert_path = 'server.pem'
15 | 


--------------------------------------------------------------------------------
/federated_averaging_optimizer.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | #
 15 | # Modifications copyright (C) 2018 coMind.
 16 | # ==============================================================================
 17 | 
 18 | """Synchronize replicas for FedAvg training."""
 19 | from __future__ import absolute_import
 20 | from __future__ import division
 21 | from __future__ import print_function
 22 | 
 23 | from tensorflow.python.framework import constant_op
 24 | from tensorflow.python.framework import dtypes
 25 | from tensorflow.python.framework import ops
 26 | from tensorflow.python.ops import array_ops
 27 | from tensorflow.python.ops import control_flow_ops
 28 | from tensorflow.python.ops import data_flow_ops
 29 | from tensorflow.python.ops import math_ops
 30 | from tensorflow.python.ops import state_ops
 31 | from tensorflow.python.ops import variables
 32 | from tensorflow.python.ops import variable_scope
 33 | from tensorflow.python.platform import tf_logging as logging
 34 | from tensorflow.python.training import optimizer
 35 | from tensorflow.python.training import session_run_hook
 36 | 
 37 | # Please note that the parameters from replicas are averaged so you need to
 38 | # increase the learning rate according to the number of replicas. This change is
 39 | # introduced to be consistent with how parameters are aggregated within a batch
 40 | class FederatedAveragingOptimizer(optimizer.Optimizer):
 41 |   """Class to synchronize and aggregate model params.
 42 | 
 43 |   In a typical synchronous training environment, gradients will be averaged each
 44 |   step and then applied to the variables in one shot, after which replicas can
 45 |   fetch the new variables and continue. In a federated average training environment,
 46 |   model variables will be averaged every 'interval_steps' steps, and then the
 47 |   replicas will fetch the new variables and continue training locally. In the
 48 |   interval between two average operations, there is no data transfer, which can
 49 |   accelerate training.
 50 | 
 51 |   The following accumulators/queue are created:
 52 |   <empty line>
 53 |   * N `parameter accumulators`, one per variable to train. Local variables are
 54 |   pushed to them and the chief worker will wait until enough variables are
 55 |   collected and then average them. The accumulator will drop all stale variables
 56 |   (more details in the accumulator op).
 57 |   * 1 `token` queue where the optimizer pushes the new global_step value after
 58 |     all variables are updated.
 59 | 
 60 |   The following local variable is created:
 61 |   * `global_step`, one per replica. Updated after every average operation.
 62 | 
 63 |   The optimizer adds nodes to the graph to collect local variables and pause
 64 |   the trainers until variables are updated.
 65 |   For the Parameter Server job:
 66 |   <empty line>
 67 |   1. An accumulator is created for each variable, and each replica pushes the
 68 |      local variables into the accumulators.
 69 |   2. Each accumulator averages once enough variables (replicas_to_aggregate)
 70 |      have been accumulated.
 71 |   3. Apply the averaged variables to global variables.
 72 |   4. Only after all variables have been updated, increment the global step.
 73 |   5. Only after step 4, pushes a token in the `token_queue`, once for
 74 |      each worker replica. The workers can now fetch the token and start
 75 |      the next round.
 76 | 
 77 |   For the replicas:
 78 |   <empty line>
 79 |   1. Start a training block: fetch variables and train for "interval_steps" steps.
 80 |   2. Once the training block has been computed, push local variables into variable
 81 |      accumulators. Each accumulator will check the staleness and drop the stale.
 82 |   3. After pushing all the variables, dequeue a token from the token queue and
 83 |      continue training. Note that this is effectively a barrier.
 84 |   4. Fetch new variables and start the next block.
 85 | 
 86 |   ### Usage
 87 | 
 88 |   ```python
 89 |   # Create any optimizer to update the variables, say a simple SGD:
 90 |   opt = GradientDescentOptimizer(learning_rate=0.1)
 91 | 
 92 |   # Wrap the optimizer with fed_avg_optimizer with 50 replicas: at each
 93 |   # step the FederatedAveragingOptimizer collects "replicas_to_aggregate" variables
 94 |   # before applying the average. Note that if you want to have 2 backup replicas,
 95 |   # you can change total_num_replicas=52 and make sure this number matches how
 96 |   # many physical replicas you started in your job.
 97 |   opt = fed_avg_optimizer.FederatedAveragingOptimizer(opt,
 98 |                                                       replicas_to_aggregate=50,
 99 |                                                       is_chief=True,
100 |                                                       interval_steps=100,
101 |                                                       device_setter)
102 | 
103 |   # Some models have startup_delays to help stabilize the model but when using
104 |   # federated_average training, set it to 0.
105 | 
106 |   # Now you can call 'minimize() normally'
107 |   # train_op = opt.minimize(loss, global_step=global_step)
108 | 
109 |   # And also, create the hook which handles initialization.
110 |   fed_avg_hook = opt.make_session_run_hook()
111 |   ```
112 | 
113 |   In the training program, every worker will run the train_op as if not
114 |   averaged or synchronized. Note that if you want to run other ops like
115 |   test op, you should use common session instead of MonitoredSession:
116 | 
117 |   ```python
118 |   with training.MonitoredTrainingSession(
119 |       master=workers[worker_id].target,
120 |       hooks=[fed_avg_hook]) as mon_sess:
121 |     while not mon_sess.should_stop():
122 |       mon_sess.run(training_op)
123 |       sess = mon_sess._tf_sess()
124 |       sess.run(testing_op)
125 |   ```
126 |   """
127 | 
128 |   def __init__(self,
129 |                opt,
130 |                replicas_to_aggregate,
131 |                interval_steps,
132 |                is_chief=False,
133 |                total_num_replicas=None,
134 |                device_setter=None,
135 |                use_locking=False,
136 |                name="fedAverage"):
137 |     """Construct a fedAverage optimizer.
138 | 
139 |     Args:
140 |       opt: The actual optimizer that will be used to compute and apply the
141 |         gradients. Must be one of the Optimizer classes.
142 |       replicas_to_aggregate: number of replicas to aggregate for each variable
143 |         update.
144 |       interval_steps: number of steps between two "average op", which specifies
145 |         how frequent a model synchronization is performed.
146 |       is_chief: whether this worker is chief or not.
147 |       total_num_replicas: Total number of tasks/workers/replicas, could be
148 |         If total_num_replicas > replicas_to_aggregate: it is backup_replicas +
149 |         replicas_to_aggregate.
150 |         If total_num_replicas < replicas_to_aggregate: Replicas compute
151 |         multiple blocks per update to variables.
152 |       device_setter: A replica_device_setter that will be used to place copies
153 |         of the trainable variables in the parameter server.
154 |       use_locking: If True use locks for update operations.
155 |       name: string. Name of the global variables and related operation on ps.
156 |     """
157 |     if total_num_replicas is None:
158 |       total_num_replicas = replicas_to_aggregate
159 | 
160 |     super(FederatedAveragingOptimizer, self).__init__(use_locking, name)
161 |     logging.info(
162 |         "FedAvgV4: replicas_to_aggregate=%s; total_num_replicas=%s",
163 |         replicas_to_aggregate, total_num_replicas)
164 |     self._opt = opt
165 |     self._replicas_to_aggregate = replicas_to_aggregate
166 |     self._interval_steps = interval_steps
167 |     self._is_chief = is_chief
168 |     self._total_num_replicas = total_num_replicas
169 |     self._tokens_per_step = max(total_num_replicas, replicas_to_aggregate) - 1
170 |     self._device_setter = device_setter
171 |     self._name = name
172 | 
173 |     # Remember which accumulator is on which device to set the initial step in
174 |     # the accumulator to be global step. This list contains list of the
175 |     # following format: (accumulator, device).
176 |     self._accumulator_list = []
177 | 
178 |   def _generate_shared_variables(self):
179 |     """Generate a global variable placed on ps for each trainable variable.
180 | 
181 |        This creates a new copy of each user-defined trainable variable and places
182 |        them on ps_device. These variables store the averaged parameters.
183 |     """
184 |     # Only the chief should initialize the variables
185 |     if self._is_chief:
186 |       collections = [ops.GraphKeys.GLOBAL_VARIABLES, "global_model"]
187 |     else:
188 |       collections = ["global_model"]
189 | 
190 |     # Generate new global variables dependent on trainable variables.
191 |     with ops.device(self._device_setter):
192 |       for v in variables.trainable_variables():
193 |         _ = variable_scope.variable(
194 |             name="%s/%s" % (self._name, v.op.name),
195 |             initial_value=v.initialized_value(), trainable=False,
196 |             collections=collections)
197 | 
198 |       # Place the global step in the ps so that all the workers can see it
199 |       self._global_step = variables.Variable(0, name="%s_global_step" %
200 |           self._name, trainable=False)
201 | 
202 |   def apply_gradients(self, grads_and_vars, global_step=None, name=None):
203 |     """Apply gradients to variables.
204 |     This contains most of the synchronization implementation.
205 | 
206 |     Args:
207 |       grads_and_vars: List of (local_vars, gradients) pairs.
208 |       global_step: Variable to increment by one after the variables have been
209 |       updated. We need it to check staleness.
210 |       name: Optional name for the returned operation. Default to the
211 |         name passed to the Optimizer constructor.
212 | 
213 |     Returns:
214 |       train_op: The op to dequeue a token so the replicas can exit this batch
215 |       and apply averages to local vars or an op to update vars locally.
216 | 
217 |     Raises:
218 |       ValueError: If the grads_and_vars is empty.
219 |       ValueError: If global step is not provided, the staleness cannot be
220 |         checked.
221 |     """
222 |     if not grads_and_vars:
223 |       raise ValueError("Must supply at least one variable")
224 |     if global_step is None:
225 |       raise ValueError("Global step is required")
226 | 
227 |     # Generate copy of all trainable variables
228 |     self._generate_shared_variables()
229 | 
230 |     # Wraps the apply_gradients op of the parent optimizer
231 |     apply_updates = self._opt.apply_gradients(grads_and_vars, global_step)
232 | 
233 |     # This function will be called whenever the global_step divides interval steps
234 |     def _apply_averages():  # pylint: disable=missing-docstring
235 |       # Collect local and global vars
236 |       local_vars = [v for g, v in grads_and_vars if g is not None]
237 |       global_vars = ops.get_collection_ref("global_model")
238 |       # sync queue, place it in the ps
239 |       with ops.colocate_with(self._global_step):
240 |         sync_queue = data_flow_ops.FIFOQueue(
241 |             -1, [dtypes.bool], shapes=[[]], shared_name="sync_queue")
242 |       train_ops = []
243 |       aggregated_vars = []
244 |       with ops.name_scope(None, self._name + "/global"):
245 |         for var, gvar in zip(local_vars, global_vars):
246 |           # pylint: disable=protected-access
247 |           # Get reference to the tensor, this works with Variable and ResourceVariable
248 |           var = ops.convert_to_tensor(var)
249 |           # Place the accumulator in the same ps as the corresponding global_var
250 |           with ops.device(gvar.device):
251 |             var_accum = data_flow_ops.ConditionalAccumulator(
252 |                 var.dtype,
253 |                 shape=var.get_shape(),
254 |                 shared_name=gvar.name + "/var_accum")
255 |             # Add op to push local_var to accumulator
256 |             train_ops.append(
257 |                 var_accum.apply_grad(var, local_step=global_step))
258 |             # Op to average the vars in the accumulator
259 |             aggregated_vars.append(var_accum.take_grad(self._replicas_to_aggregate))
260 |             # Remember accumulator and corresponding device
261 |             self._accumulator_list.append((var_accum, gvar.device))
262 |       # chief worker updates global vars and enqueues tokens to the sync queue
263 |       if self._is_chief:
264 |         update_ops = []
265 |         # Make sure train_ops are run
266 |         with ops.control_dependencies(train_ops):
267 |           # Update global_vars with average values
268 |           for avg_var, gvar in zip(aggregated_vars, global_vars):
269 |             with ops.device(gvar.device):
270 |               update_ops.append(state_ops.assign(gvar, avg_var))
271 |           # Update shared global_step
272 |           with ops.device(global_step.device):
273 |             update_ops.append(state_ops.assign_add(self._global_step, 1))
274 |         # After averaging, push tokens to the queue
275 |         with ops.control_dependencies(update_ops), ops.device(
276 |             global_step.device):
277 |           tokens = array_ops.fill([self._tokens_per_step],
278 |                                   constant_op.constant(False))
279 |           sync_op = sync_queue.enqueue_many(tokens)
280 |       # non chief workers deque a token, they will block here until chief is done
281 |       else:
282 |         # Make sure train_ops are run
283 |         with ops.control_dependencies(train_ops), ops.device(
284 |             global_step.device):
285 |           sync_op = sync_queue.dequeue()
286 | 
287 |       # All workers pull averaged values
288 |       with ops.control_dependencies([sync_op]):
289 |         local_update_op = self._assign_vars(local_vars, global_vars)
290 |       return local_update_op
291 | 
292 |     # Check if we should push and average or not
293 |     with ops.control_dependencies([apply_updates]):
294 |       condition = math_ops.equal(
295 |           math_ops.mod(global_step, self._interval_steps), 0)
296 |       conditional_update = control_flow_ops.cond(
297 |           condition, _apply_averages, control_flow_ops.no_op)
298 | 
299 |     chief_init_ops = []
300 |     # Initialize accumulators, ops placed in ps
301 |     for accum, dev in self._accumulator_list:
302 |       with ops.device(dev):
303 |         chief_init_ops.append(
304 |             accum.set_global_step(global_step, name="SetGlobalStep"))
305 |     self._chief_init_op = control_flow_ops.group(*(chief_init_ops))
306 | 
307 |     return conditional_update
308 | 
309 |   def _assign_vars(self, local_vars, global_vars):
310 |     """Utility to refresh local variables.
311 | 
312 |     Args:
313 |       local_vars: List of local variables.
314 |       global_vars: List of global variables.
315 | 
316 |     Returns:
317 |       refresh_ops: The ops to assign value of global vars to local vars.
318 |     """
319 |     reassign_ops = []
320 |     for local_var, global_var in zip(local_vars, global_vars):
321 |       reassign_ops.append(state_ops.assign(local_var, global_var))
322 |     refresh_ops = control_flow_ops.group(*(reassign_ops))
323 |     return refresh_ops
324 | 
325 |   def make_session_run_hook(self):
326 |     """Creates a hook to handle federated average init operations."""
327 |     return _FederatedAverageHook(self)
328 | 
329 | class _FederatedAverageHook(session_run_hook.SessionRunHook):
330 |   """A SessionRunHook that handles ops related to FederatedAveragingOptimizer."""
331 | 
332 |   def __init__(self, fed_avg_optimizer):
333 |     """Creates hook to handle FederatedAveragingOptimizer
334 | 
335 |     Args:
336 |       fed_avg_optimizer: 'FederatedAveragingOptimizer' which this hook will
337 |         initialize.
338 |     """
339 |     self._fed_avg_optimizer = fed_avg_optimizer
340 | 
341 |   def begin(self):
342 |     local_vars = variables.trainable_variables()
343 |     global_vars = ops.get_collection_ref("global_model")
344 |     self._variable_init_op = self._fed_avg_optimizer._assign_vars(
345 |         local_vars,
346 |         global_vars)
347 | 
348 |   def after_create_session(self, session, coord):
349 |     # Make sure all models start at the same point
350 |     session.run(self._variable_init_op)
351 | 


--------------------------------------------------------------------------------
/images/Logo_Acuratio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/123974939ffb3f14d50795e854cb698685fff46a/images/Logo_Acuratio.png


--------------------------------------------------------------------------------
/images/colab_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/123974939ffb3f14d50795e854cb698685fff46a/images/colab_logo.png


--------------------------------------------------------------------------------
/images/comindorg_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/123974939ffb3f14d50795e854cb698685fff46a/images/comindorg_logo.png


--------------------------------------------------------------------------------
/images/graph_tensorboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/123974939ffb3f14d50795e854cb698685fff46a/images/graph_tensorboard.png


--------------------------------------------------------------------------------
/images/slack_logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/123974939ffb3f14d50795e854cb698685fff46a/images/slack_logo.jpg


--------------------------------------------------------------------------------
/images/telegram_logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coMindOrg/federated-averaging-tutorials/123974939ffb3f14d50795e854cb698685fff46a/images/telegram_logo.jpg


--------------------------------------------------------------------------------