├── LICENSE ├── Notebooks ├── DEC_Gene_Clustering.ipynb ├── Keras-CDEC-Biohistology.ipynb ├── Keras-LSTM_AE-Biohistology.ipynb ├── Keras-VAE-Biohistology.ipynb ├── LSTM_AE_Text_Clustering.ipynb └── Untitled1.ipynb ├── Papers ├── 1602.06797.pdf ├── 1703.05291.pdf ├── 1803.04054.pdf └── readme.md ├── README.md └── requirements.txt /LICENSE: -------------------------------------------------------------------------------- 1 | LICENSE 2 | 3 | The MIT License (MIT) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /Notebooks/Keras-CDEC-Biohistology.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# CDEC: Unsupervised Clustering" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 6, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stderr", 17 | "output_type": "stream", 18 | "text": [ 19 | "c:\\users\\admin-karim\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\matplotlib\\__init__.py:1405: UserWarning: \n", 20 | "This call to matplotlib.use() has no effect because the backend has already\n", 21 | "been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,\n", 22 | "or matplotlib.backends is imported for the first time.\n", 23 | "\n", 24 | " warnings.warn(_use_error_msg)\n", 25 | "c:\\users\\admin-karim\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\linear_assignment_.py:21: DeprecationWarning: The linear_assignment_ module is deprecated in 0.21 and will be removed from 0.23. 
Use scipy.optimize.linear_sum_assignment instead.\n", 26 | " DeprecationWarning)\n" 27 | ] 28 | } 29 | ], 30 | "source": [ 31 | "from time import time\n", 32 | "from keras.datasets import mnist\n", 33 | "import numpy as np\n", 34 | "np.random.seed(10)\n", 35 | "from skimage import transform # transform.resize is used in loadDataset\n", 36 | "import keras.backend as K\n", 37 | "from keras.engine.topology import Layer, InputSpec\n", 38 | "from keras.layers import Dense, Input\n", 39 | "from keras.models import Model\n", 40 | "from keras.optimizers import SGD\n", 41 | "from keras import callbacks\n", 42 | "from keras.initializers import VarianceScaling\n", 43 | "from sklearn.cluster import KMeans\n", 44 | "import metrics\n", 45 | "\n", 46 | "from keras.models import Model\n", 47 | "from keras import backend as K\n", 48 | "from keras import layers\n", 49 | "from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D, Flatten, Reshape, Conv2DTranspose\n", 50 | "from keras.models import Model\n", 51 | "import numpy as np\n", 52 | "import os \n", 53 | "from keras.preprocessing.image import load_img\n", 54 | "import _pickle as cPickle\n", 55 | "import _pickle \n", 56 | "import seaborn as sns\n", 57 | "import sklearn.metrics\n", 58 | "import matplotlib.pyplot as plt\n", 59 | "from sklearn.manifold import TSNE\n", 60 | "%matplotlib inline\n", 61 | "import gzip\n", 62 | "from subprocess import call # used by vis_data to invoke ImageMagick's convert\n", 63 | "from PIL import Image\n", 64 | "import matplotlib\n", 65 | "import os\n", 66 | "# For plotting graphs via ssh with no display\n", 67 | "# Ref: https://stackoverflow.com/questions/2801882/generating-a-png-with-matplotlib-when-display-is-undefined\n", 68 | "matplotlib.use('Agg')\n", 69 | "from keras.preprocessing.image import load_img\n", 70 | "from matplotlib import pyplot as plt\n", 71 | "from numpy import float32\n", 72 | "from sklearn import metrics\n", 73 | "from sklearn.cluster.k_means_ import KMeans\n", 74 | "from sklearn import manifold\n", 75 | "from sklearn.utils.linear_assignment_ import linear_assignment\n", 
76 | "from sklearn import preprocessing\n", 77 | "from sklearn.utils import shuffle\n", 78 | "from keras.layers import BatchNormalization" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## Convolutional DEC" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "### Step 1: convolutional auto encoder using `Conv2DTranspose`" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 258, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "def autoencoderConv2D_1(input_shape=(512, 512, 1), filters=[32, 64, 128, 10]):\n", 102 | " input_img = Input(shape=input_shape)\n", 103 | " if input_shape[0] % 8 == 0:\n", 104 | " pad3 = 'same'\n", 105 | " else:\n", 106 | " pad3 = 'valid'\n", 107 | " x = Conv2D(filters[0], 5, strides=2, padding='same', activation='relu', name='conv1', input_shape=input_shape)(input_img)\n", 108 | "\n", 109 | " x = Conv2D(filters[1], 5, strides=2, padding='same', activation='relu', name='conv2')(x)\n", 110 | "\n", 111 | " x = Conv2D(filters[2], 3, strides=2, padding=pad3, activation='relu', name='conv3')(x)\n", 112 | "\n", 113 | " x = Flatten()(x)\n", 114 | " encoded = Dense(units=filters[3], name='embedding')(x)\n", 115 | " x = Dense(units=filters[2]*int(input_shape[0]/8)*int(input_shape[0]/8), activation='relu')(encoded)\n", 116 | "\n", 117 | " x = Reshape((int(input_shape[0]/8), int(input_shape[0]/8), filters[2]))(x)\n", 118 | " x = Conv2DTranspose(filters[1], 3, strides=2, padding=pad3, activation='relu', name='deconv3')(x)\n", 119 | "\n", 120 | " x = Conv2DTranspose(filters[0], 5, strides=2, padding='same', activation='relu', name='deconv2')(x)\n", 121 | "\n", 122 | " decoded = Conv2DTranspose(input_shape[2], 5, strides=2, padding='same', name='deconv1')(x)\n", 123 | " return Model(inputs=input_img, outputs=decoded, name='AE'), Model(inputs=input_img, outputs=encoded, name='encoder')" 124 | ] 125 | }, 126 | { 127 | "cell_type": 
"markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### Convolutional auto encoder using `UpSampling2D`" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 10, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "init = VarianceScaling(scale=1. / 3., mode='fan_in', distribution='uniform')\n", 140 | "\n", 141 | "def autoencoderConv2D_2(img_shape=(508, 508, 3)):\n", 142 | " \"\"\"\n", 143 | " Conv2D auto-encoder model.\n", 144 | " Arguments:\n", 145 | " img_shape: input image shape, e.g. (508, 508, 3) for the RGB histology images\n", 146 | " return:\n", 147 | " (autoencoder, encoder), Model of autoencoder and model of encoder\n", 148 | " \"\"\"\n", 149 | " input_img = Input(shape=img_shape)\n", 150 | " # Encoder\n", 151 | " x = Conv2D(128, (3, 3), activation='relu', padding='same', strides=(2, 2))(input_img)\n", 152 | " x = Conv2D(64, (3, 3), activation='relu', padding='same', strides=(2, 2))(x)\n", 153 | " x = Conv2D(32, (3, 3), activation='relu', padding='same', strides=(2, 2))(x)\n", 154 | " x = BatchNormalization()(x)\n", 155 | " shape_before_flattening = K.int_shape(x)\n", 156 | " # at this point the representation is (64, 64, 32) for the default 508x508 input, i.e. 
131072-dimensional\n", 157 | " x = Flatten()(x)\n", 158 | " encoded = Dense(4, kernel_initializer=init, activation='relu', name='encoded')(x)\n", 159 | " x = BatchNormalization()(encoded) # NOTE: unused; the decoder below reads from `encoded` directly\n", 160 | "\n", 161 | " # Decoder\n", 162 | " x = Dense(np.prod(shape_before_flattening[1:]), activation='relu', kernel_initializer=init)(encoded)\n", 163 | " # Reshape into an image of the same shape as before our last `Flatten` layer\n", 164 | " x = Reshape(shape_before_flattening[1:])(x)\n", 165 | "\n", 166 | " x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)\n", 167 | " x = BatchNormalization()(x)\n", 168 | " x = UpSampling2D((2, 2))(x)\n", 169 | " x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)\n", 170 | " x = BatchNormalization()(x)\n", 171 | " x = UpSampling2D((2, 2))(x)\n", 172 | " x = Conv2D(128, (3, 3), activation='relu')(x)\n", 173 | " x = BatchNormalization()(x)\n", 174 | " x = UpSampling2D((2, 2))(x)\n", 175 | " decoded = Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)\n", 176 | "\n", 177 | " return Model(inputs=input_img, outputs=decoded, name='AE'), Model(inputs=input_img, outputs=encoded, name='encoder')\n" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "### Pick a convolutional autoencoder (`autoencoderConv2D_1` or `autoencoderConv2D_2`)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 11, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "autoencoder, encoder = autoencoderConv2D_2()" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 12, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "name": "stdout", 203 | "output_type": "stream", 204 | "text": [ 205 | "_________________________________________________________________\n", 206 | "Layer (type) Output Shape Param # \n", 207 | "=================================================================\n", 208 | "input_3 (InputLayer) (None, 508, 508, 3) 0 \n", 
209 | "_________________________________________________________________\n", 210 | "conv2d_15 (Conv2D) (None, 254, 254, 128) 3584 \n", 211 | "_________________________________________________________________\n", 212 | "conv2d_16 (Conv2D) (None, 127, 127, 64) 73792 \n", 213 | "_________________________________________________________________\n", 214 | "conv2d_17 (Conv2D) (None, 64, 64, 32) 18464 \n", 215 | "_________________________________________________________________\n", 216 | "batch_normalization_2 (Batch (None, 64, 64, 32) 128 \n", 217 | "_________________________________________________________________\n", 218 | "flatten_3 (Flatten) (None, 131072) 0 \n", 219 | "_________________________________________________________________\n", 220 | "encoded (Dense) (None, 4) 524292 \n", 221 | "_________________________________________________________________\n", 222 | "dense_3 (Dense) (None, 131072) 655360 \n", 223 | "_________________________________________________________________\n", 224 | "reshape_3 (Reshape) (None, 64, 64, 32) 0 \n", 225 | "_________________________________________________________________\n", 226 | "conv2d_18 (Conv2D) (None, 64, 64, 32) 9248 \n", 227 | "_________________________________________________________________\n", 228 | "batch_normalization_4 (Batch (None, 64, 64, 32) 128 \n", 229 | "_________________________________________________________________\n", 230 | "up_sampling2d_7 (UpSampling2 (None, 128, 128, 32) 0 \n", 231 | "_________________________________________________________________\n", 232 | "conv2d_19 (Conv2D) (None, 128, 128, 64) 18496 \n", 233 | "_________________________________________________________________\n", 234 | "batch_normalization_5 (Batch (None, 128, 128, 64) 256 \n", 235 | "_________________________________________________________________\n", 236 | "up_sampling2d_8 (UpSampling2 (None, 256, 256, 64) 0 \n", 237 | "_________________________________________________________________\n", 238 | "conv2d_20 (Conv2D) (None, 254, 
254, 128) 73856 \n", 239 | "_________________________________________________________________\n", 240 | "batch_normalization_6 (Batch (None, 254, 254, 128) 512 \n", 241 | "_________________________________________________________________\n", 242 | "up_sampling2d_9 (UpSampling2 (None, 508, 508, 128) 0 \n", 243 | "_________________________________________________________________\n", 244 | "conv2d_21 (Conv2D) (None, 508, 508, 3) 3459 \n", 245 | "=================================================================\n", 246 | "Total params: 1,381,575\n", 247 | "Trainable params: 1,381,063\n", 248 | "Non-trainable params: 512\n", 249 | "_________________________________________________________________\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "autoencoder.summary()" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Load breast biohistology images for convolutional input" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 262, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "def loadDataset():\n", 271 | " data= []\n", 272 | " labels =[]\n", 273 | " root ='/home/rkarim/Training_data/'\n", 274 | "\n", 275 | " for rootName,dirName,fileNames in os.walk(root):\n", 276 | " if(not rootName == root):\n", 277 | " for fileName in fileNames:\n", 278 | " imgGray = load_img(rootName+'/'+fileName,color_mode='rgb')\n", 279 | " if os.path.basename(rootName) == 'Benign':\n", 280 | " labels+=[0]\n", 281 | " elif os.path.basename(rootName) == 'InSitu':\n", 282 | " labels+=[1]\n", 283 | " elif os.path.basename(rootName) == 'Invasive':\n", 284 | " labels+=[2]\n", 285 | " else:\n", 286 | " labels+=[3]\n", 287 | " \n", 288 | " transformed=transform.resize(np.array(imgGray),(508,508))\n", 289 | " data += [transformed.reshape((transformed.shape[0],transformed.shape[1],3))] \n", 290 | " \n", 291 | " data = np.stack(data)\n", 292 | " labels = np.stack(labels)\n", 293 | " #data,labels = 
shuffle(data,labels,random_state = 0)\n", 294 | " \n", 295 | " return [data,labels]" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 263, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "data": { 305 | "text/plain": [ 306 | "(266, 508, 508, 3)" 307 | ] 308 | }, 309 | "execution_count": 263, 310 | "metadata": {}, 311 | "output_type": "execute_result" 312 | } 313 | ], 314 | "source": [ 315 | "from keras.datasets import mnist\n", 316 | "\n", 317 | "x, y = loadDataset()\n", 318 | "x.shape\n", 319 | "\n", 320 | "#x = x.reshape(x.shape + (1,))\n", 321 | "#x = np.divide(x, 255.)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 264, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \n", 331 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \n", 332 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n", 333 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n", 334 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n", 335 | " 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, \n", 336 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, \n", 337 | " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, \n", 338 | " 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, \n", 339 | " 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 265, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "y = np.array(y)" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 267, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "data": 
{ 358 | "text/plain": [ 359 | "4" 360 | ] 361 | }, 362 | "execution_count": 267, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "n_clusters = len(np.unique(y))\n", 369 | "x.shape\n", 370 | "n_clusters" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "### Pretrain convolutional autoencoder" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 289, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "pretrain_epochs = 1\n", 387 | "batch_size = 128\n", 388 | "save_dir = 'results/'" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 290, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "autoencoder.compile(optimizer='adam', loss='mse')" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 291, 403 | "metadata": { 404 | "scrolled": true 405 | }, 406 | "outputs": [ 407 | { 408 | "name": "stdout", 409 | "output_type": "stream", 410 | "text": [ 411 | "Epoch 1/1\n", 412 | "266/266 [==============================] - 248s 932ms/step - loss: 0.0619\n" 413 | ] 414 | } 415 | ], 416 | "source": [ 417 | "autoencoder.fit(x, x, batch_size=batch_size, epochs=pretrain_epochs)\n", 418 | "autoencoder.save_weights(save_dir+'/conv_ae_weights.h5')" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 292, 424 | "metadata": {}, 425 | "outputs": [], 426 | "source": [ 427 | "autoencoder.load_weights(save_dir+'/conv_ae_weights.h5')" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "### Build clustering model with convolutional autoencoder" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 293, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "class ClusteringLayer(Layer):\n", 444 | " \"\"\"\n", 445 | " Clustering layer converts input sample (feature) to soft 
label, i.e. a vector that represents the probability of the\n", 446 | " sample belonging to each cluster. The probability is calculated with Student's t-distribution.\n", 447 | "\n", 448 | " # Example\n", 449 | " ```\n", 450 | " model.add(ClusteringLayer(n_clusters=10))\n", 451 | " ```\n", 452 | " # Arguments\n", 453 | " n_clusters: number of clusters.\n", 454 | " weights: list containing one NumPy array with shape `(n_clusters, n_features)` which represents the initial cluster centers.\n", 455 | " alpha: degrees of freedom parameter in Student's t-distribution. Defaults to 1.0.\n", 456 | " # Input shape\n", 457 | " 2D tensor with shape: `(n_samples, n_features)`.\n", 458 | " # Output shape\n", 459 | " 2D tensor with shape: `(n_samples, n_clusters)`.\n", 460 | " \"\"\"\n", 461 | "\n", 462 | " def __init__(self, n_clusters, weights=None, alpha=1.0, **kwargs):\n", 463 | " if 'input_shape' not in kwargs and 'input_dim' in kwargs:\n", 464 | " kwargs['input_shape'] = (kwargs.pop('input_dim'),)\n", 465 | " super(ClusteringLayer, self).__init__(**kwargs)\n", 466 | " self.n_clusters = n_clusters\n", 467 | " self.alpha = alpha\n", 468 | " self.initial_weights = weights\n", 469 | " self.input_spec = InputSpec(ndim=2)\n", 470 | "\n", 471 | " def build(self, input_shape):\n", 472 | " assert len(input_shape) == 2\n", 473 | " input_dim = input_shape[1]\n", 474 | " self.input_spec = InputSpec(dtype=K.floatx(), shape=(None, input_dim))\n", 475 | " self.clusters = self.add_weight(shape=(self.n_clusters, input_dim), initializer='glorot_uniform', name='clusters')\n", 476 | " if self.initial_weights is not None:\n", 477 | " self.set_weights(self.initial_weights)\n", 478 | " del self.initial_weights\n", 479 | " self.built = True\n", 480 | "\n", 481 | " def call(self, inputs, **kwargs):\n", 482 | " \"\"\" Student's t-distribution kernel, the same as used in the t-SNE algorithm.\n", 483 | " Measure the similarity between embedded point z_i and centroid µ_j.\n", 484 | " q_ij = 1/(1+dist(z_i, µ_j)^2), then normalize 
it.\n", 485 | " q_ij can be interpreted as the probability of assigning sample i to cluster j.\n", 486 | " (i.e., a soft assignment)\n", 487 | " Arguments:\n", 488 | " inputs: the variable containing data, shape=(n_samples, n_features)\n", 489 | " Return:\n", 490 | " q: student's t-distribution, or soft labels for each sample. shape=(n_samples, n_clusters)\n", 491 | " \"\"\"\n", 492 | " q = 1.0 / (1.0 + (K.sum(K.square(K.expand_dims(inputs, axis=1) - self.clusters), axis=2) / self.alpha))\n", 493 | " q **= (self.alpha + 1.0) / 2.0\n", 494 | " q = K.transpose(K.transpose(q) / K.sum(q, axis=1)) # Normalize so each sample's n_clusters values sum to 1.\n", 495 | " return q\n", 496 | "\n", 497 | " def compute_output_shape(self, input_shape):\n", 498 | " assert input_shape and len(input_shape) == 2\n", 499 | " return input_shape[0], self.n_clusters\n", 500 | "\n", 501 | " def get_config(self):\n", 502 | " config = {'n_clusters': self.n_clusters, 'alpha': self.alpha}\n", 503 | " base_config = super(ClusteringLayer, self).get_config()\n", 504 | " return dict(list(base_config.items()) + list(config.items()))" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 294, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "clustering_layer = ClusteringLayer(n_clusters, name='clustering')(encoder.output)\n", 514 | "model = Model(inputs=encoder.input, outputs=clustering_layer)\n", 515 | "model.compile(optimizer='adam', loss='kld')" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "### Step 1: initialize cluster centers using k-means" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 295, 528 | "metadata": {}, 529 | "outputs": [], 530 | "source": [ 531 | "kmeans = KMeans(n_clusters=n_clusters, n_init=5)\n", 532 | "y_pred_kmeans = kmeans.fit_predict(encoder.predict(x))" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 296, 538 | "metadata": {}, 539 | "outputs": [ 540 
| { 541 | "data": { 542 | "text/plain": [ 543 | "0.2819548872180451" 544 | ] 545 | }, 546 | "execution_count": 296, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "metrics.accuracy_score(y, y_pred_kmeans)" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 297, 558 | "metadata": {}, 559 | "outputs": [], 560 | "source": [ 561 | "y_pred_last = np.copy(y_pred_kmeans)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 298, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "model.get_layer(name='clustering').set_weights([kmeans.cluster_centers_])" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "### Step 2: deep clustering" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 299, 583 | "metadata": {}, 584 | "outputs": [], 585 | "source": [ 586 | "#computing an auxiliary target distribution\n", 587 | "def target_distribution(q):\n", 588 | " weight = q ** 2 / q.sum(0)\n", 589 | " return (weight.T / weight.sum(1)).T" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": 300, 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [ 598 | "loss = 0\n", 599 | "index = 0\n", 600 | "maxiter = 1000\n", 601 | "update_interval = 10\n", 602 | "index_array = np.arange(x.shape[0])" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": 301, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "tol = 0.01 # tolerance threshold to stop training" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "### Start training" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": 302, 624 | "metadata": {}, 625 | "outputs": [ 626 | { 627 | "name": "stdout", 628 | "output_type": "stream", 629 | "text": [ 630 | "Iter 0: acc = 0.28195, nmi = 0.04127, ari = 0.01177 ; 
loss= 0\n", 631 | "Iter 10: acc = 0.27820, nmi = 0.00000, ari = 0.00000 ; loss= 0.21101\n", 632 | "Iter 20: acc = 0.27820, nmi = 0.00000, ari = 0.00000 ; loss= 0.0\n", 633 | "delta_label 0.0 < tol 0.01\n", 634 | "Reached tolerance threshold. Stopping training.\n" 635 | ] 636 | } 637 | ], 638 | "source": [ 639 | "for ite in range(int(maxiter)):\n", 640 | " if ite % update_interval == 0:\n", 641 | " q = model.predict(x, verbose=0)\n", 642 | " p = target_distribution(q) # update the auxiliary target distribution p\n", 643 | "\n", 644 | " # evaluate the clustering performance\n", 645 | " y_pred = q.argmax(1)\n", 646 | " if y is not None:\n", 647 | " acc = np.round(metrics.accuracy_score(y, y_pred), 5)\n", 648 | " nmi = np.round(metrics.normalized_mutual_info_score(y, y_pred), 5)\n", 649 | " ari = np.round(metrics.adjusted_rand_score(y, y_pred), 5)\n", 650 | " loss = np.round(loss, 5)\n", 651 | " print('Iter %d: acc = %.5f, nmi = %.5f, ari = %.5f' % (ite, acc, nmi, ari), ' ; loss=', loss)\n", 652 | "\n", 653 | " # check stop criterion\n", 654 | " delta_label = np.sum(y_pred != y_pred_last).astype(np.float32) / y_pred.shape[0]\n", 655 | " y_pred_last = np.copy(y_pred)\n", 656 | " if ite > 0 and delta_label < tol:\n", 657 | " print('delta_label ', delta_label, '< tol ', tol)\n", 658 | " print('Reached tolerance threshold. 
Stopping training.')\n", 659 | " break\n", 660 | " idx = index_array[index * batch_size: min((index+1) * batch_size, x.shape[0])]\n", 661 | " loss = model.train_on_batch(x=x[idx], y=p[idx])\n", 662 | " index = index + 1 if (index + 1) * batch_size <= x.shape[0] else 0\n", 663 | "\n", 664 | "model.save_weights(save_dir + '/conv_DEC_model_final.h5')" 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "metadata": {}, 670 | "source": [ 671 | "### Load the clustering model trained weights" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": 303, 677 | "metadata": {}, 678 | "outputs": [], 679 | "source": [ 680 | "model.load_weights(save_dir + '/conv_DEC_model_final.h5')" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "### Final Evaluation" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 304, 693 | "metadata": {}, 694 | "outputs": [ 695 | { 696 | "name": "stdout", 697 | "output_type": "stream", 698 | "text": [ 699 | "Acc = 0.28000, nmi = 0.00000, ari = 0.00000 ; loss= 0.0\n" 700 | ] 701 | } 702 | ], 703 | "source": [ 704 | "#Eval.\n", 705 | "\n", 706 | "q = model.predict(x, verbose=0)\n", 707 | "p = target_distribution(q) # update the auxiliary target distribution p\n", 708 | "\n", 709 | "# evaluate the clustering performance\n", 710 | "y_pred = q.argmax(1)\n", 711 | "if y is not None:\n", 712 | " acc = np.round(metrics.accuracy_score(y, y_pred), 2)\n", 713 | " nmi = np.round(metrics.normalized_mutual_info_score(y, y_pred), 2)\n", 714 | " ari = np.round(metrics.adjusted_rand_score(y, y_pred), 2)\n", 715 | " loss = np.round(loss, 5)\n", 716 | " print('Acc = %.5f, nmi = %.5f, ari = %.5f' % (acc, nmi, ari), ' ; loss=', loss)" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 305, 722 | "metadata": {}, 723 | "outputs": [ 724 | { 725 | "data": { 726 | "image/png": "\n", 727 | "text/plain": [ 728 | "
" 729 | ] 730 | }, 731 | "metadata": {}, 732 | "output_type": "display_data" 733 | } 734 | ], 735 | "source": [ 736 | "sns.set(font_scale=3)\n", 737 | "confusion_matrix = sklearn.metrics.confusion_matrix(y, y_pred)\n", 738 | "\n", 739 | "plt.figure(figsize=(14, 10))\n", 740 | "sns.heatmap(confusion_matrix, annot=True, fmt=\"d\", annot_kws={\"size\": 20});\n", 741 | "plt.title(\"Confusion matrix\", fontsize=30)\n", 742 | "plt.ylabel('True label', fontsize=25)\n", 743 | "plt.xlabel('Clustering label', fontsize=25)\n", 744 | "plt.show()" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 310, 750 | "metadata": {}, 751 | "outputs": [], 752 | "source": [ 753 | "def vis_data(x_train_encoded, y_train, vis_dim, n_predict, n_train, build_anim):\n", 754 | " cmap = plt.get_cmap('rainbow', 10)\n", 755 | "\n", 756 | " # 3-dim vis: show one view, then compile animated .gif of many angled views\n", 757 | " if vis_dim == 3:\n", 758 | " # Simple static figure\n", 759 | " fig = plt.figure()\n", 760 | " ax = plt.axes(projection='3d')\n", 761 | " p = ax.scatter3D(x_train_encoded[:,0], x_train_encoded[:,1], x_train_encoded[:,2], \n", 762 | " c=y_train[:n_predict], cmap=cmap, edgecolor='black')\n", 763 | " fig.colorbar(p, drawedges=True)\n", 764 | " plt.show()\n", 765 | "\n", 766 | " # Build animation from many static figures\n", 767 | " if build_anim:\n", 768 | " angles = np.linspace(180, 360, 20)\n", 769 | " i = 0\n", 770 | " for angle in angles:\n", 771 | " fig = plt.figure()\n", 772 | " ax = plt.axes(projection='3d')\n", 773 | " ax.view_init(10, angle)\n", 774 | " p = ax.scatter3D(x_train_encoded[:,0], x_train_encoded[:,1], x_train_encoded[:,2], \n", 775 | " c=y_train[:n_predict], cmap=cmap, edgecolor='black')\n", 776 | " fig.colorbar(p, drawedges=True)\n", 777 | " outfile = 'anim/3dplot_step_' + chr(i + 97) + '.png'\n", 778 | " plt.savefig(outfile, dpi=96)\n", 779 | " i += 1\n", 780 | " call(['convert', '-delay', '50', 'anim/3dplot*', 'anim/3dplot_anim_' + 
str(n_train) + '.gif'])\n", 781 | "\n", 782 | " # 2-dim vis: plot and colorbar.\n", 783 | " elif vis_dim == 2:\n", 784 | " plt.scatter(x_train_encoded[:,0], x_train_encoded[:,1], \n", 785 | " c=y_train[:n_predict], edgecolor='black', cmap=cmap)\n", 786 | " plt.colorbar(drawedges=True)\n", 787 | " plt.show()" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 319, 793 | "metadata": {}, 794 | "outputs": [], 795 | "source": [ 796 | "# Encode the histology images, then perform t-SNE dimensionality reduction.\n", 797 | "x_train_predict = encoder.predict(x)\n", 798 | "\n", 799 | "#print \"Performing t-SNE dimensionality reduction...\"\n", 800 | "x_train_encoded = TSNE(n_components=2).fit_transform(x_train_predict)\n", 801 | "#np.save('%sx_%sdim_tnse_%s.npy' % (266, 2, 266), x_train_encoded)\n", 802 | "#x_train_encoded = np.load(str(n_predict) + 'x_' + str(vis_dim) + 'dim_tnse_' + str(n_train) + '.npy')\n" 803 | ] 804 | }, 805 | { 806 | "cell_type": "code", 807 | "execution_count": 324, 808 | "metadata": {}, 809 | "outputs": [ 810 | { 811 | "data": { 812 | "image/png": "\n", 813 | "text/plain": [ 814 | "
" 815 | ] 816 | }, 817 | "metadata": { 818 | "needs_background": "light" 819 | }, 820 | "output_type": "display_data" 821 | } 822 | ], 823 | "source": [ 824 | "# Visualize result.\n", 825 | "train_new = False\n", 826 | "n_train = 2660\n", 827 | "predict_new = False\n", 828 | "n_predict = 2660\n", 829 | "vis_dim = 2\n", 830 | "build_anim = False\n", 831 | " \n", 832 | "vis_data(x_train_encoded, y, vis_dim, n_predict, n_train, build_anim)" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": {}, 839 | "outputs": [], 840 | "source": [ 841 | "import keras\n", 842 | "from keras import backend as K\n", 843 | "from keras.models import Sequential, Model\n", 844 | "from keras.layers import Input, LSTM, RepeatVector\n", 845 | "from keras.layers.core import Flatten, Dense, Dropout, Lambda\n", 846 | "from keras.optimizers import SGD, RMSprop, Adam\n", 847 | "from keras import objectives\n", 848 | "\n", 849 | "\n", 850 | "def create_lstm_autoencoder(input_dim, timesteps, latent_dim):\n", 851 | " \"\"\"\n", 852 | " Creates a plain (non-variational) LSTM autoencoder. Returns (autoencoder, encoder). \n", 853 | " (All code by fchollet - see reference.)\n", 854 | " # Arguments\n", 855 | " input_dim: int.\n", 856 | " timesteps: int, input timestep dimension.\n", 857 | " latent_dim: int, latent z-layer shape. 
\n", 858 | " # References\n", 859 | " - [Building Autoencoders in Keras](https://blog.keras.io/building-autoencoders-in-keras.html)\n", 860 | " \"\"\"\n", 861 | "\n", 862 | " inputs = Input(shape=(timesteps, input_dim,))\n", 863 | " encoded = LSTM(latent_dim)(inputs)\n", 864 | "\n", 865 | " decoded = RepeatVector(timesteps)(encoded)\n", 866 | " decoded = LSTM(input_dim, return_sequences=True)(decoded)\n", 867 | "\n", 868 | " sequence_autoencoder = Model(inputs, decoded)\n", 869 | " encoder = Model(inputs, encoded)\n", 870 | "\n", 871 | " autoencoder = Model(inputs, decoded)\n", 872 | " autoencoder.compile(optimizer='adam', loss='mse')\n", 873 | " \n", 874 | " return autoencoder, encoder" 875 | ] 876 | } 877 | ], 878 | "metadata": { 879 | "kernelspec": { 880 | "display_name": "Python 3", 881 | "language": "python", 882 | "name": "python3" 883 | }, 884 | "language_info": { 885 | "codemirror_mode": { 886 | "name": "ipython", 887 | "version": 3 888 | }, 889 | "file_extension": ".py", 890 | "mimetype": "text/x-python", 891 | "name": "python", 892 | "nbconvert_exporter": "python", 893 | "pygments_lexer": "ipython3", 894 | "version": "3.5.2" 895 | } 896 | }, 897 | "nbformat": 4, 898 | "nbformat_minor": 2 899 | } 900 | -------------------------------------------------------------------------------- /Papers/1602.06797.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rezacsedu/Deep-Learning-for-Clustering-in-Bioinformatics/0b236c7add335ea04ce0855e998602f53473f8ac/Papers/1602.06797.pdf -------------------------------------------------------------------------------- /Papers/1703.05291.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rezacsedu/Deep-Learning-for-Clustering-in-Bioinformatics/0b236c7add335ea04ce0855e998602f53473f8ac/Papers/1703.05291.pdf -------------------------------------------------------------------------------- 
/Papers/1803.04054.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rezacsedu/Deep-Learning-for-Clustering-in-Bioinformatics/0b236c7add335ea04ce0855e998602f53473f8ac/Papers/1803.04054.pdf -------------------------------------------------------------------------------- /Papers/readme.md: -------------------------------------------------------------------------------- 1 | ### Deep learning-based unsupervised/clustering methods, links to papers, and code 2 | | Title | Article | Conference/Journal | Code | 3 | | :--------- | :------: | :------: | :------: | 4 | | Deep clustering with convolutional autoencoders (DCEC) | [Link](https://xifengguo.github.io/papers/ICONIP17-DCEC.pdf) | ICONIP'2017 | [GitHub](https://github.com/XifengGuo/DCEC) | 5 | | Unsupervised Data Augmentation for Consistency Training (UDA) | [Link](https://arxiv.org/pdf/1904.12848.pdf) | Arxiv'2019 | [GitHub](https://github.com/google-research/uda) | 6 | | Deep Clustering via joint convolutional autoencoder embedding and relative entropy minimization (DEPICT) | [Link](https://arxiv.org/pdf/1704.06327.pdf) | ICCV'2017 | [GitHub](https://github.com/herandy/DEPICT) | 7 | | Discriminatively Boosted Clustering (DBC) | [Link](https://arxiv.org/pdf/1703.07980.pdf) | Arxiv'2017 | N/A | 8 | | Variational Deep Embedding (VADE) | [Link](https://arxiv.org/pdf/1611.05148.pdf) | IJCAI'2017 | [GitHub](https://github.com/slim1017/VaDE) | 9 | | Convolutional Embedded Networks (CEN) | [Link](https://arxiv.org/pdf/1805.12218.pdf) | Arxiv'2018 | [GitHub](https://github.com/rezacsedu/Convolutional-embedded-networks) | 10 | | Deep Subspace Clustering Networks (DSC-Nets) | [Link](http://papers.nips.cc/paper/6608-deep-subspace-clustering-networks.pdf) | NIPS'2017 | [GitHub](https://github.com/panji1990/Deep-subspace-clustering-networks) | 11 | | Graph Clustering with Dynamic Embedding (GRACE) | [Link](https://arxiv.org/pdf/1712.08249.pdf) | Arxiv'2017 | N/A 
| 12 | | Deep Unsupervised Clustering Using Mixture of Autoencoders (MIXAE) | [Link](https://arxiv.org/pdf/1712.07788.pdf) | Arxiv'2017 | N/A | 13 | | Deep Embedded Clustering (DEC) | [Link](http://proceedings.mlr.press/v48/xieb16.pdf) | ICML'2016 | [GitHub](https://github.com/piiswrong/dec) | 14 | | A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture | [Link](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8412085) | IEEE Access'2018 | N/A | 15 | | GEMSEC: Graph Embedding with Self Clustering | [Link](https://arxiv.org/pdf/1802.03997.pdf) | Arxiv'2018 | [GitHub](https://github.com/benedekrozemberczki/GEMSEC) | 16 | | Clustering with Deep Learning: Taxonomy and New Methods | [Link](https://arxiv.org/pdf/1801.07648.pdf) | Arxiv'2018 | [GitHub](https://github.com/elieJalbout/Clustering-with-Deep-learning) | 17 | | Deep Continuous Clustering (DCC) | [Link](https://arxiv.org/pdf/1803.01449.pdf) | Arxiv'2018 | [GitHub](https://github.com/shahsohil/DCC) | 18 | | Deep Clustering with Convolutional Autoencoders (DCEC) | [Link](https://xifengguo.github.io/papers/ICONIP17-DCEC.pdf) | ICONIP'2017 | [GitHub](https://github.com/XifengGuo/DCEC) | 19 | | SpectralNet: Spectral Clustering Using Deep Neural Networks | [Link](https://openreview.net/pdf?id=HJ_aoCyRZ) | ICLR'2018 | [GitHub](https://github.com/KlugerLab/SpectralNet) | 20 | | Subspace clustering using a low-rank constrained autoencoder (LRAE) | [Link](https://www.sciencedirect.com/science/article/pii/S0020025517309659) | Information Sciences'2018 | N/A | 21 | | Clustering-driven Deep Embedding with Pairwise Constraints (CPAC) | [Link](https://arxiv.org/pdf/1803.08457.pdf) | Arxiv'2018 | [GitHub](https://github.com/sharonFogel/CPAC) | 22 | | Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering | [Link](https://arxiv.org/pdf/1610.04794.pdf) | PMLR'2017 | N/A | 23 | | Deep Unsupervised Clustering With Gaussian Mixture Variational AutoEncoders (GMVAE) | 
[Link](https://arxiv.org/pdf/1611.02648.pdf) | ICLR'2017 | [GitHub](https://github.com/Nat-D/GMVAE) | 24 | | Is Simple Better?: Revisiting Simple Generative Models for Unsupervised Clustering | [Link](https://ic.unicamp.br/~adin/downloads/pubs/AriasFigueroa2017a.pdf) | NIPS'2017 Workshop | [GitHub](https://github.com/jariasf/clustering-nips-2017) | 25 | | Improved Deep Embedding Clustering (IDEC) | [Link](https://www.ijcai.org/proceedings/2017/0243.pdf) | IJCAI'2017 | [GitHub](https://github.com/XifengGuo/IDEC) | 26 | | Deep Clustering Network (DCN) | [Link](https://arxiv.org/pdf/1610.04794v1.pdf) | Arxiv'2016 | [GitHub](https://github.com/boyangumn/DCN-New) | 27 | | Joint Unsupervised Learning of Deep Representations and Image Clustering (JULE) | [Link](https://arxiv.org/pdf/1604.03628.pdf) | CVPR'2016 | [GitHub](https://github.com/jwyang/JULE.torch) | 28 | | Deep Embedding Network for Clustering (DEN) | [Link](https://ieeexplore.ieee.org/document/6976982/) | ICPR'2014 | N/A | 29 | | Auto-encoder Based Data Clustering (ABDC) | [Link](http://nlpr-web.ia.ac.cn/english/irds/People/lwang/M-MCG_EN/Publications/2013/CFS2013CIARP.pdf) | CIARP'2013 | [GitHub](https://github.com/KellerJordan/Autoencoder-Clustering) | 30 | | Learning Deep Representations for Graph Clustering | [Link](https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8527/8571) | AAAI'2014 | [GitHub](https://github.com/quinngroup/deep-representations-clustering) | 31 | 32 | ## Contribute 33 | If you find related work that is not listed here, please open a PR or suggest it by filing an issue. Your contribution will be highly appreciated. 
34 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Deep Learning-based Clustering Approaches for Bioinformatics 2 | Code and supplementary materials for our paper "Deep Learning-based Clustering Approaches for Bioinformatics", published in the [Briefings in Bioinformatics](https://academic.oup.com/bib) journal. This repo will be updated periodically; in particular, more complete Jupyter notebooks will be added. In the article, we reviewed deep learning-based approaches for cluster analysis, including network training, representation learning, parameter optimization, and formulating clustering quality metrics. We also discussed how representation learning based on different autoencoder architectures (e.g., vanilla, variational, LSTM, and convolutional) can be more effective than ML-based approaches (e.g., PCA) in different scenarios, e.g., bio-imaging, gene expression clustering, and clustering biomedical texts. 3 | 4 | ### Deep learning-based unsupervised/clustering methods, links to papers & code 5 | We provide a list of deep learning-based unsupervised/clustering methods, with links to the papers and code. New articles proposing further approaches will be added to the list, so stay tuned! 
6 | 7 | | Title | Article | Conference/Journal | Code | 8 | | :--------- | :------: | :------: | :------: | 9 | | Deep clustering with convolutional autoencoders (DCEC) | [Link](https://xifengguo.github.io/papers/ICONIP17-DCEC.pdf) | ICONIP'2017 | [GitHub](https://github.com/XifengGuo/DCEC) | 10 | | Unsupervised Data Augmentation for Consistency Training (UDA) | [Link](https://arxiv.org/pdf/1904.12848.pdf) | Arxiv'2019 | [GitHub](https://github.com/google-research/uda) | 11 | | Deep Clustering via joint convolutional autoencoder embedding and relative entropy minimization (DEPICT) | [Link](https://arxiv.org/pdf/1704.06327.pdf) | ICCV'2017 | [GitHub](https://github.com/herandy/DEPICT) | 12 | | Discriminatively Boosted Clustering (DBC) | [Link](https://arxiv.org/pdf/1703.07980.pdf) | Arxiv'2017 | N/A | 13 | | Variational Deep Embedding (VADE) | [Link](https://arxiv.org/pdf/1611.05148.pdf) | IJCAI'2017 | [GitHub](https://github.com/slim1017/VaDE) | 14 | | Convolutional Embedded Networks (CEN) | [Link](https://arxiv.org/pdf/1805.12218.pdf) | Arxiv'2018 | [GitHub](https://github.com/rezacsedu/Convolutional-embedded-networks) | 15 | | Deep Subspace Clustering Networks (DSC-Nets) | [Link](http://papers.nips.cc/paper/6608-deep-subspace-clustering-networks.pdf) | NIPS'2017 | [GitHub](https://github.com/panji1990/Deep-subspace-clustering-networks) | 16 | | Graph Clustering with Dynamic Embedding (GRACE) | [Link](https://arxiv.org/pdf/1712.08249.pdf) | Arxiv'2017 | N/A | 17 | | Deep Unsupervised Clustering Using Mixture of Autoencoders (MIXAE) | [Link](https://arxiv.org/pdf/1712.07788.pdf) | Arxiv'2017 | N/A | 18 | | Deep Embedded Clustering (DEC) | [Link](http://proceedings.mlr.press/v48/xieb16.pdf) | ICML'2016 | [GitHub](https://github.com/piiswrong/dec) | 19 | | A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture | [Link](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8412085) | IEEE Access'2018 | N/A | 20 | | GEMSEC: Graph Embedding 
with Self Clustering | [Link](https://arxiv.org/pdf/1802.03997.pdf) | Arxiv'2018 | [GitHub](https://github.com/benedekrozemberczki/GEMSEC) | 21 | | Clustering with Deep Learning: Taxonomy and New Methods | [Link](https://arxiv.org/pdf/1801.07648.pdf) | Arxiv'2018 | [GitHub](https://github.com/elieJalbout/Clustering-with-Deep-learning) | 22 | | Deep Continuous Clustering (DCC) | [Link](https://arxiv.org/pdf/1803.01449.pdf) | Arxiv'2018 | [GitHub](https://github.com/shahsohil/DCC) | 23 | | Deep Clustering with Convolutional Autoencoders (DCEC) | [Link](https://xifengguo.github.io/papers/ICONIP17-DCEC.pdf) | ICONIP'2017 | [GitHub](https://github.com/XifengGuo/DCEC) | 24 | | SpectralNet: Spectral Clustering Using Deep Neural Networks | [Link](https://openreview.net/pdf?id=HJ_aoCyRZ) | ICLR'2018 | [GitHub](https://github.com/KlugerLab/SpectralNet) | 25 | | Subspace clustering using a low-rank constrained autoencoder (LRAE) | [Link](https://www.sciencedirect.com/science/article/pii/S0020025517309659) | Information Sciences'2018 | N/A | 26 | | Clustering-driven Deep Embedding with Pairwise Constraints (CPAC) | [Link](https://arxiv.org/pdf/1803.08457.pdf) | Arxiv'2018 | [GitHub](https://github.com/sharonFogel/CPAC) | 27 | | Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering | [Link](https://arxiv.org/pdf/1610.04794.pdf) | PMLR'2017 | N/A | 28 | | Deep Unsupervised Clustering With Gaussian Mixture Variational AutoEncoders (GMVAE) | [Link](https://arxiv.org/pdf/1611.02648.pdf) | ICLR'2017 | [GitHub](https://github.com/Nat-D/GMVAE) | 29 | | Is Simple Better?: Revisiting Simple Generative Models for Unsupervised Clustering | [Link](https://ic.unicamp.br/~adin/downloads/pubs/AriasFigueroa2017a.pdf) | NIPS'2017 Workshop | [GitHub](https://github.com/jariasf/clustering-nips-2017) | 30 | | Improved Deep Embedding Clustering (IDEC) | [Link](https://www.ijcai.org/proceedings/2017/0243.pdf) | IJCAI'2017 | [GitHub](https://github.com/XifengGuo/IDEC) | 31 | 
| Deep Clustering Network (DCN) | [Link](https://arxiv.org/pdf/1610.04794v1.pdf) | Arxiv'2016 | [GitHub](https://github.com/boyangumn/DCN-New) | 32 | | Joint Unsupervised Learning of Deep Representations and Image Clustering (JULE) | [Link](https://arxiv.org/pdf/1604.03628.pdf) | CVPR'2016 | [GitHub](https://github.com/jwyang/JULE.torch) | 33 | | Deep Embedding Network for Clustering (DEN) | [Link](https://ieeexplore.ieee.org/document/6976982/) | ICPR'2014 | N/A | 34 | | Auto-encoder Based Data Clustering (ABDC) | [Link](http://nlpr-web.ia.ac.cn/english/irds/People/lwang/M-MCG_EN/Publications/2013/CFS2013CIARP.pdf) | CIARP'2013 | [GitHub](https://github.com/KellerJordan/Autoencoder-Clustering) | 35 | | Learning Deep Representations for Graph Clustering | [Link](https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8527/8571) | AAAI'2014 | [GitHub](https://github.com/quinngroup/deep-representations-clustering) | 36 | 37 | ### Running provided Jupyter notebooks 38 | To run the examples interactively, you need the following: 39 | 40 | * Python 3 41 | * Scikit-learn 42 | * Keras 43 | * TensorFlow 44 | 45 | For Jupyter Notebook, get it from this [Link](https://jupyter.readthedocs.io/en/latest/install.html) and install it on your machine. Then clone this repo using the following command, given that you have already installed `git`: 46 | 47 | ``` 48 | git clone https://github.com/rezacsedu/Deep-learning-for-clustering-in-bioinformatics.git 49 | ``` 50 | Instead of installing the libraries one by one, you can install all of them by issuing the following commands: 51 | ``` 52 | cd Deep-learning-for-clustering-in-bioinformatics 53 | pip3 install -r requirements.txt 54 | cd Notebooks 55 | ``` 56 | Then start Jupyter Notebook by issuing the following command: 57 | ``` 58 | jupyter notebook 59 | ``` 60 | In the browser tab that Jupyter opens, open the notebook: 
61 | ``` 62 | LSTM_AE_Text_Clustering.ipynb 63 | ``` 64 | If you want to skip training, we will soon provide pre-trained weights, which you can restore to start fine-tuning. Happy coding! Leave a comment if you have any questions. 65 | 66 | ### Acknowledgement 67 | The ClusteringLayer class and the target_distribution function are based on DEC from https://github.com/XifengGuo/DCEC/blob/master/DCEC.py by Xifeng Guo. 68 | 69 | ### Citation request 70 | If you use the code of this repository in your research, please consider citing the following paper: 71 | 72 | @article{karim2021deep, 73 | title={Deep learning-based clustering approaches for bioinformatics}, 74 | author={Karim, Md Rezaul and Beyan, Oya and Zappa, Achille and Costa, Ivan G and Rebholz-Schuhmann, Dietrich and Cochez, Michael and Decker, Stefan}, 75 | journal={Briefings in bioinformatics}, 76 | volume={22}, 77 | number={1}, 78 | pages={393--415}, 79 | year={2021}, 80 | publisher={Oxford University Press} 81 | } 82 | 83 | ### Contributing 84 | If you find related work that is not listed here, please open a PR or suggest it by filing an issue. Your contribution will be highly appreciated. For any questions, feel free to open an issue or contact rezaul.karim@rwth-aachen.de. 85 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | tensorflow 3 | keras 4 | scipy 5 | matplotlib 6 | scikit-learn 7 | seaborn 8 | pandas 9 | scikit-image 10 | Pillow 11 | gensim 12 | --------------------------------------------------------------------------------