├── images └── teaser.PNG ├── code ├── train │ ├── test_idx.npy │ ├── train_idx.npy │ ├── readme.txt │ ├── Train_adv.ipynb │ ├── Train_phase1.ipynb │ └── Train_phase2.ipynb ├── env.yml ├── readme.txt └── evaluate │ ├── readme.txt │ ├── Compute_CNN_embeddings.ipynb │ ├── Perturbations_attacks.ipynb │ ├── Evaluate.ipynb │ ├── Evaluate_new_phishing.ipynb │ ├── Whitebox_attacks.ipynb │ └── Whitebox_attacks_closest.ipynb ├── _layouts └── default.html ├── _includes ├── page-footer.html ├── head.html └── page-header.html ├── _posts └── 2020-03-18-welcome-to-jekyll.markdown ├── README.md ├── _config.yml └── css ├── cayman.css └── normalize.css /images/teaser.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/S-Abdelnabi/VisualPhishNet/HEAD/images/teaser.PNG -------------------------------------------------------------------------------- /code/train/test_idx.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/S-Abdelnabi/VisualPhishNet/HEAD/code/train/test_idx.npy -------------------------------------------------------------------------------- /code/train/train_idx.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/S-Abdelnabi/VisualPhishNet/HEAD/code/train/train_idx.npy -------------------------------------------------------------------------------- /code/env.yml: -------------------------------------------------------------------------------- 1 | name: tf_1.13 2 | channels: 3 | - defaults 4 | dependencies: 5 | - tensorflow-gpu=1.14.0 6 | - matplotlib 7 | - scikit-image 8 | 9 | -------------------------------------------------------------------------------- /_layouts/default.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | {% include head.html %} 5 | 6 | 7 | {% include page-header.html %} 8 | 9 |

10 | 11 | {{ content }} 12 | 13 | {% include page-footer.html %} 14 | 15 |

16 | 17 | 18 | 19 | -------------------------------------------------------------------------------- /_includes/page-footer.html: -------------------------------------------------------------------------------- 1 | 2 | 7 | -------------------------------------------------------------------------------- /code/readme.txt: -------------------------------------------------------------------------------- 1 | train: contains files to train the model (two training phases, adversarial finetuning, compute and save the embeddings), evaluate: experimental setup (evaluation, perturbations and whitebox attacks). 2 | all relevant files must be present in the directory (e.g. dataset for training and evaluation, saved models for evaluation and fine-tuning, precomputed embeddings for evaluation). 3 | -------------------------------------------------------------------------------- /_includes/head.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | {{ site.title }} 4 | 5 | 6 | 7 | 8 | 9 | 10 | -------------------------------------------------------------------------------- /code/train/readme.txt: -------------------------------------------------------------------------------- 1 | Files for training: 2 | - phase1: starts training, computes and saves embeddings at the end of training 3 | - phase2: fine tunes phase1 by hard subsets, should load the previously trained model, computes and saves embeddings at the end of training 4 | - adv: fine tunes on adv examples, should load the previously trained model, computes and saves embeddings at the end of training 5 | - adversarial training and evaluation require tensorflow v1 6 | 7 | 8 | - New Jan. 2022: More details have been added to reproduce the results in the paper: the train/test split, exact model architecture, and some other hyperparameters. 
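The training phases listed above all optimise a triplet objective. The following is a minimal NumPy sketch of such a loss, not the notebooks' actual implementation. The margin of 2.2 is taken from the `loss` function in Perturbations_attacks.ipynb; the function name `triplet_loss`, the use of squared L2 distances, and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

MARGIN = 2.2  # margin value that appears in Perturbations_attacks.ipynb


def triplet_loss(anchor, positive, negative, margin=MARGIN):
    """Hypothetical helper: mean hinge-style triplet loss over a batch.

    Each argument is an array of shape (batch, embedding_dim).
    Squared L2 distance is an assumption, not confirmed by the repository.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor-to-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor-to-negative distance
    # A triplet contributes only while the negative is not at least
    # `margin` farther from the anchor than the positive is.
    return float(np.mean(np.maximum(0.0, margin + d_pos - d_neg)))


if __name__ == "__main__":
    anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
    positive = np.array([[1.0, 0.0], [1.0, 0.0]])
    negative = np.array([[3.0, 0.0], [1.0, 1.0]])
    print(triplet_loss(anchor, positive, negative))
```

In this toy batch the first triplet already satisfies the margin and contributes zero, while the second does not, which is the kind of "hard" example that phase2 is described as fine-tuning on.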
9 | -------------------------------------------------------------------------------- /_includes/page-header.html: -------------------------------------------------------------------------------- 1 | 10 | -------------------------------------------------------------------------------- /_posts/2020-03-18-welcome-to-jekyll.markdown: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Welcome to Jekyll!" 4 | date: 2020-03-18 17:24:42 +0100 5 | categories: jekyll update 6 | --- 7 | You’ll find this post in your `_posts` directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run `jekyll serve`, which launches a web server and auto-regenerates your site when a file is updated. 8 | 9 | Jekyll requires blog post files to be named according to the following format: 10 | 11 | `YEAR-MONTH-DAY-title.MARKUP` 12 | 13 | Where `YEAR` is a four-digit number, `MONTH` and `DAY` are both two-digit numbers, and `MARKUP` is the file extension representing the format used in the file. After that, include the necessary front matter. Take a look at the source for this post to get an idea about how it works. 14 | 15 | Jekyll also offers powerful support for code snippets: 16 | 17 | {% highlight ruby %} 18 | def print_hi(name) 19 | puts "Hi, #{name}" 20 | end 21 | print_hi('Tom') 22 | #=> prints 'Hi, Tom' to STDOUT. 23 | {% endhighlight %} 24 | 25 | Check out the [Jekyll docs][jekyll-docs] for more info on how to get the most out of Jekyll. File all bugs/feature requests at [Jekyll’s GitHub repo][jekyll-gh]. If you have questions, you can ask them on [Jekyll Talk][jekyll-talk]. 
26 |
27 | [jekyll-docs]: https://jekyllrb.com/docs/home
28 | [jekyll-gh]: https://github.com/jekyll/jekyll
29 | [jekyll-talk]: https://talk.jekyllrb.com/
30 |
--------------------------------------------------------------------------------
/code/evaluate/readme.txt:
--------------------------------------------------------------------------------
1 | Files for evaluation. Adjust the dataset path, model file names, and precomputed embedding file names before running:
2 | - Evaluate.ipynb: loads the pretrained model and precomputed embeddings and computes the matching of the original dataset's phishing pages. It can also compute the matching under attack by loading the embeddings computed for the perturbed test set.
3 |
4 | - Evaluate_new_phishing.ipynb: loads the pretrained model and precomputed embeddings and computes the matching of the newly crawled phishing data.
5 |
6 | - Compute_CNN_embeddings.ipynb: computes and saves the off-the-shelf CNN embeddings (i.e. VGG-16 and ResNet50). The computed embeddings can be used in 'Evaluate.ipynb' and 'Evaluate_new_phishing.ipynb'.
7 |
8 | - Perturbations_attacks.ipynb: applies the defined random perturbation attacks to the phishing test set, then calculates and saves the new embeddings of the perturbed images. The computed embeddings can be used in 'Evaluate.ipynb'. You can specify the attack type and parameters.
9 |
10 | - Whitebox_attacks.ipynb: applies white-box adversarial perturbations to the phishing test set, then calculates and saves the new embeddings of the perturbed images. The computed embeddings can be used in 'Evaluate.ipynb'.
11 |
12 | - Whitebox_attacks_closest.ipynb: applies adversarial perturbations by sampling the closest example as the positive image, then calculates and saves the new embeddings of the perturbed images. The computed embeddings can be used in 'Evaluate.ipynb'.
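The "matching" that the evaluation notebooks compute from precomputed embeddings can be pictured as a nearest-neighbour lookup against the trusted list. The sketch below assumes Euclidean distance and a simple arg-min rule; the function name `match_to_trusted_list` and the toy arrays are hypothetical and are not taken from Evaluate.ipynb.

```python
import numpy as np


def match_to_trusted_list(phish_emb, trusted_emb, trusted_labels):
    """Hypothetical helper: match each query embedding to its closest
    trusted-list embedding.

    phish_emb:      (n_query, dim) embeddings of pages to classify
    trusted_emb:    (n_trusted, dim) embeddings of trusted-list screenshots
    trusted_labels: (n_trusted,) website label of each trusted screenshot
    Returns the predicted label and the minimum distance per query.
    """
    # Pairwise Euclidean distances, shape (n_query, n_trusted).
    diff = phish_emb[:, None, :] - trusted_emb[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    nearest = np.argmin(dist, axis=1)
    min_dist = dist[np.arange(len(phish_emb)), nearest]
    return trusted_labels[nearest], min_dist


if __name__ == "__main__":
    trusted_emb = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    trusted_labels = np.array([0, 0, 1])
    phish_emb = np.array([[0.9, 1.2], [4.0, 5.5]])
    labels, distances = match_to_trusted_list(phish_emb, trusted_emb, trusted_labels)
    print(labels, distances)
```

The broadcasted difference allocates an array of shape (n_query, n_trusted, dim), so for embedding sets of the size used here it may be preferable to process the queries in batches. The returned minimum distance could also be thresholded to decide whether a page resembles any trusted website at all, although that threshold is not specified in this readme.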
13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | {:refdef: style="text-align: center;"} 2 | ![teaser](images/teaser.PNG){:width="65%"} 3 | {: refdef} 4 | 5 | ## Abstract 6 | 7 | *Phishing websites are still a major threat in today’s Internet ecosystem. Despite numerous previous efforts, similarity-based detection methods do not offer sufficient protection for the trusted websites –in particular against unseen phishing pages. This paper contributes VisualPhishNet, a new similarity-based phishing detection framework, based on a triplet Convolutional Neural Network (CNN). VisualPhishNet learns profiles for websites in order to detect phishing websites by a similarity metric that can generalize to pages with new visual appearances. We furthermore present VisualPhish, the largest dataset to date that facilitates visual phishing detection in an ecologically valid manner. We show that our method outperforms previous visual similarity phishing detection approaches by a large margin while being robust against a range of evasion attacks.* 8 | 9 | ## VisualPhish Dataset 10 | The dataset has the following screenshots: 11 | 12 | * A legitimate trusted-list of 155 websites 13 | * Phishing pages targeting the trusted-list 14 | * New crawled phishing pages collected in Mar-April 2020 15 | * Benign websites with different domains than the trusted-list 16 | * Other test subsets (e.g. 
different browsers)
17 |
18 | To access the dataset for research purposes, please write an email to: sahar.abdelnabi@cispa.saarland
19 |
20 | ## BibTeX
21 | If you find our code or dataset helpful, please cite our paper:
22 | ~~~~~~~~~~~~~~~~
23 | @inproceedings{abdelnabi20ccs,
24 | title = {VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity},
25 | author = {Sahar Abdelnabi and Katharina Krombholz and Mario Fritz},
26 | year = {2020},
27 | booktitle = {ACM Conference on Computer and Communications Security (CCS)}
28 | }
29 | ~~~~~~~~~~~~~~~~
30 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | #theme: jekyll-theme-cayman
2 | #title: "VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity"
3 | #description: "Sahar Abdelnabi, Katharina Krombholz and Mario Fritz
CISPA Helmholtz Center for Information Security"
4 | #show_downloads: true
5 |
6 | # Welcome to Jekyll!
7 | #
8 | # This config file is meant for settings that affect your whole blog, values
9 | # which you are expected to set up once and rarely edit after that. If you find
10 | # yourself editing this file very often, consider using Jekyll's data files
11 | # feature for the data you need to update frequently.
12 | #
13 | # For technical reasons, this file is *NOT* reloaded automatically when you use
14 | # 'bundle exec jekyll serve'. If you change this file, please restart the server process.
15 | #
16 | # If you need help with YAML syntax, here are some quick references for you:
17 | # https://learn-the-web.algonquindesign.ca/topics/markdown-yaml-cheat-sheet/#yaml
18 | # https://learnxinyminutes.com/docs/yaml/
19 | #
20 | # Site settings
21 | # These are used to personalize your new site. If you look in the HTML files,
22 | # you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.
23 | # You can create any custom variable you would like, and they will be accessible
24 | # in the templates via {{ site.myvariable }}.
25 |
26 | title: "VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity "
27 | markdown: kramdown
28 | theme: jekyll-theme-cayman
29 | #description: "Sahar Abdelnabi, Katharina Krombholz and Mario Fritz
CISPA Helmholtz Center for Information Security"
30 | url: "https://s-abdelnabi.github.io/VisualPhishNet" # the base URL of your site
31 | baseurl: "/VisualPhishNet" # the subpath of your site, e.g. /blog
32 | show_downloads: true
33 | show_tagline: true
34 | auth_color: "color:#FFFFFF"
35 | aff_color: "color:#000000"
36 | owner:
37 | name: Sahar Abdelnabi
38 | email: sahar.abdelnabi@cispa.saarland
39 | author:
40 | name: Sahar Abdelnabi
41 | url: https://scholar.google.com/citations?user=QEiYbDYAAAAJ&hl=en
42 |
43 | paper_url: "https://arxiv.org/abs/1909.00300"
44 | datasetlink: "https://s-abdelnabi.github.io/VisualPhishNet/"
45 | #plugins:
46 | # - jekyll-include-cache
47 |
48 | # Exclude from processing.
49 | # The following items will not be processed, by default.
50 | # Any item listed under the `exclude:` key here will be automatically added to
51 | # the internal "default list".
52 | #
53 | # Excluded items can be processed by explicitly listing the directories or
54 | # their entries' file path in the `include:` list.
55 | # 56 | # exclude: 57 | # - .sass-cache/ 58 | # - .jekyll-cache/ 59 | # - gemfiles/ 60 | # - Gemfile 61 | # - Gemfile.lock 62 | # - node_modules/ 63 | # - vendor/bundle/ 64 | # - vendor/cache/ 65 | # - vendor/gems/ 66 | # - vendor/ruby/ 67 | -------------------------------------------------------------------------------- /code/evaluate/Compute_CNN_embeddings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import tensorflow as tf \n", 11 | "from keras.models import load_model\n", 12 | "\n", 13 | "from keras.models import Sequential\n", 14 | "from keras.layers import Dense\n", 15 | "from keras.layers import Flatten,Subtract,Reshape\n", 16 | "from keras.preprocessing import image\n", 17 | "from keras.models import Model\n", 18 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 19 | "from keras.regularizers import l2\n", 20 | "from keras import backend as K\n", 21 | "from keras.applications.vgg16 import VGG16\n", 22 | "from keras.applications.resnet50 import ResNet50\n", 23 | "from skimage.io import imsave\n", 24 | "\n", 25 | "from matplotlib.pyplot import imread\n", 26 | "from skimage.transform import rescale, resize\n", 27 | "import os" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "input_shape = [224,224,3]\n", 37 | "dataset_path = '../../datasets/VisualPhish/'\n", 38 | "cnn = 'VGG16'\n", 39 | "pooling = 'avg'" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "# Load dataset" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "def 
read_imgs_per_website(data_path,targets,imgs_num,reshape_size,start_target_count):\n", 56 | " all_imgs = np.zeros(shape=[imgs_num,224,224,3])\n", 57 | " all_labels = np.zeros(shape=[imgs_num,1])\n", 58 | " \n", 59 | " all_file_names = []\n", 60 | " targets_list = targets.splitlines()\n", 61 | " count = 0\n", 62 | " for i in range(0,len(targets_list)):\n", 63 | " target_path = data_path + targets_list[i]\n", 64 | " print(target_path)\n", 65 | " file_names = sorted(os.listdir(target_path))\n", 66 | " for j in range(0,len(file_names)):\n", 67 | " try:\n", 68 | " img = imread(target_path+'/'+file_names[j])\n", 69 | " img = img[:,:,0:3]\n", 70 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 71 | " all_labels[count,:] = i + start_target_count\n", 72 | " all_file_names.append(file_names[j])\n", 73 | " count = count + 1\n", 74 | " except:\n", 75 | " #some images were saved with a wrong extensions \n", 76 | " try:\n", 77 | " img = imread(target_path+'/'+file_names[j],format='jpeg')\n", 78 | " img = img[:,:,0:3]\n", 79 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 80 | " all_labels[count,:] = i + start_target_count\n", 81 | " all_file_names.append(file_names[j])\n", 82 | " count = count + 1\n", 83 | " except:\n", 84 | " print('failed at:')\n", 85 | " print('***')\n", 86 | " print(file_names[j])\n", 87 | " break \n", 88 | " return all_imgs,all_labels,all_file_names" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "# Read images legit (train)\n", 98 | "data_path = dataset_path + 'trusted_list/'\n", 99 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 100 | "targets = targets_file.read()\n", 101 | "imgs_num = 9363\n", 102 | "all_imgs_train,all_labels_train,all_file_names_train = read_imgs_per_website(data_path,targets,imgs_num,input_shape,0)\n", 103 | "\n", 104 | "# Read 
images phishing\n", 105 | "data_path = dataset_path + 'phishing/'\n", 106 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 107 | "targets = targets_file.read()\n", 108 | "imgs_num = 1195\n", 109 | "all_imgs_test,all_labels_test,all_file_names_test = read_imgs_per_website(data_path,targets,imgs_num,input_shape,0)\n", 110 | "\n" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "# Load model and predict" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "if cnn == 'VGG16':\n", 127 | " model = VGG16(weights='imagenet', input_shape=input_shape, include_top=False, pooling=pooling)\n", 128 | "elif cnn == 'ResNet50':\n", 129 | " model = ResNet50(weights='imagenet', input_shape=input_shape, include_top=False, pooling=pooling)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "X_train_emb = model.predict(all_imgs_train,batch_size=64)\n", 139 | "X_phish_emb = model.predict(all_imgs_test,batch_size=64)\n", 140 | "\n", 141 | "np.save('whitelist_'+cnn+'_emb',X_train_emb)\n", 142 | "np.save('phishing_'+cnn+'_emb',X_phish_emb )\n" 143 | ] 144 | } 145 | ], 146 | "metadata": { 147 | "kernelspec": { 148 | "display_name": "Python 3", 149 | "language": "python", 150 | "name": "python3" 151 | }, 152 | "language_info": { 153 | "codemirror_mode": { 154 | "name": "ipython", 155 | "version": 3 156 | }, 157 | "file_extension": ".py", 158 | "mimetype": "text/x-python", 159 | "name": "python", 160 | "nbconvert_exporter": "python", 161 | "pygments_lexer": "ipython3", 162 | "version": "3.7.6" 163 | } 164 | }, 165 | "nbformat": 4, 166 | "nbformat_minor": 2 167 | } 168 | -------------------------------------------------------------------------------- /code/evaluate/Perturbations_attacks.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import tensorflow as tf \n", 10 | "\n", 11 | "from keras.models import Sequential\n", 12 | "from keras.layers import Dense\n", 13 | "from keras.layers import Flatten\n", 14 | "\n", 15 | "\n", 16 | "from keras.preprocessing import image\n", 17 | "from keras.models import Model\n", 18 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 19 | "from keras.regularizers import l2\n", 20 | "from keras import backend as K\n", 21 | "from keras.applications.vgg16 import VGG16\n", 22 | "\n", 23 | "from matplotlib.pyplot import imread,imshow\n", 24 | "from skimage.transform import rescale, resize\n", 25 | "from skimage.io import imsave\n", 26 | "\n", 27 | "import os\n", 28 | "import numpy as np\n", 29 | "from keras.models import load_model\n" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "from scipy import ndimage\n", 39 | "from skimage.filters import gaussian\n", 40 | "\n", 41 | "def add_blur(image,sigma):\n", 42 | " blur_img = gaussian(image, sigma=sigma, multichannel=True)\n", 43 | " return blur_img\n", 44 | "\n", 45 | "from skimage.util import random_noise\n", 46 | "def add_gauss(image,var):\n", 47 | " noisy_img = random_noise(image,mode='gaussian',var=var)\n", 48 | " return noisy_img\n", 49 | "\n", 50 | "def add_sp(image,amount):\n", 51 | " noisy_img = random_noise(image,mode='s&p',amount=amount)\n", 52 | " return noisy_img\n", 53 | "\n", 54 | "from skimage.transform import AffineTransform,warp\n", 55 | "def add_shift(image,dim):\n", 56 | " #dim is a tuple to indicate by which values to shift \n", 57 | " image[0,:]=0.0\n", 58 | " image[:,0]=0.0\n", 59 | " transform = 
AffineTransform(translation=(dim[0],dim[1]))\n", 60 | " shifted = warp(image, transform, mode='edge')\n", 61 | " return shifted\n", 62 | "\n", 63 | "def add_occlusion(image,quarter):\n", 64 | " #quarter is a number from 0 to 3 to indicate which quarter to hide\n", 65 | " image = np.array(image)\n", 66 | " image[:, quarter*int(1*image.shape[1]/4): (quarter+1)*int(1*image.shape[1]/4) ] = np.max(image)\n", 67 | " return image \n", 68 | "\n", 69 | "from skimage import exposure\n", 70 | "def change_gamma(image,gamma):\n", 71 | " image_bright = exposure.adjust_gamma(image, gamma=gamma,gain=1)\n", 72 | " return image_bright\n", 73 | "\n", 74 | "def add_noise(image,param,noise_type):\n", 75 | " if noise_type == 'gamma':\n", 76 | " noisy_img = change_gamma(image,param)\n", 77 | " elif noise_type == 'occlusion':\n", 78 | " noisy_img = add_occlusion(image,param)\n", 79 | " elif noise_type == 'shift':\n", 80 | " noisy_img = add_shift(image,param)\n", 81 | " elif noise_type == 'gauss':\n", 82 | " noisy_img = add_gauss(image,param)\n", 83 | " elif noise_type == 'sp':\n", 84 | " noisy_img = add_sp(image,param)\n", 85 | " elif noise_type == 'blur':\n", 86 | " noisy_img = add_blur(image,param)\n", 87 | " return noisy_img" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Dataset parameters \n", 97 | "dataset_path = '../../datasets/VisualPhish/'\n", 98 | "reshape_size = [224,224,3]\n", 99 | "\n", 100 | "# Saved model\n", 101 | "output_dir = './'\n", 102 | "model_name = 'model.h5'\n", 103 | "phish_train_idx_name = 'train_idx.npy'\n", 104 | "phish_test_idx_name = 'test_idx.npy'\n", 105 | "\n", 106 | "#noise type\n", 107 | "noise = 'gamma'\n", 108 | "parameter = 0.7" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "# Load Phishing data with noise" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | 
"metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "\n", 125 | "def read_imgs_per_website(data_path,targets,imgs_num,reshape_size,start_target_count):\n", 126 | " all_imgs = np.zeros(shape=[imgs_num,224,224,3])\n", 127 | " all_labels = np.zeros(shape=[imgs_num,1])\n", 128 | " \n", 129 | " all_file_names = []\n", 130 | " targets_list = targets.splitlines()\n", 131 | " count = 0\n", 132 | " for i in range(0,len(targets_list)):\n", 133 | " target_path = data_path + targets_list[i]\n", 134 | " print(target_path)\n", 135 | " file_names = sorted(os.listdir(target_path))\n", 136 | " for j in range(0,len(file_names)):\n", 137 | " try:\n", 138 | " img = imread(target_path+'/'+file_names[j])\n", 139 | " img = img[:,:,0:3]\n", 140 | " img = add_noise(img,parameter,noise)\n", 141 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 142 | " all_labels[count,:] = i + start_target_count\n", 143 | " all_file_names.append(file_names[j])\n", 144 | " count = count + 1\n", 145 | " except:\n", 146 | " #some images were saved with a wrong extensions \n", 147 | " try:\n", 148 | " img = imread(target_path+'/'+file_names[j],format='jpeg')\n", 149 | " img = img[:,:,0:3]\n", 150 | " img = add_noise(img,parameter,noise)\n", 151 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 152 | " all_labels[count,:] = i + start_target_count\n", 153 | " all_file_names.append(file_names[j])\n", 154 | " count = count + 1\n", 155 | " except:\n", 156 | " print('failed at:')\n", 157 | " print('***')\n", 158 | " print(file_names[j])\n", 159 | " break \n", 160 | " return all_imgs,all_labels,all_file_names\n" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "# Read images phishing\n", 170 | "data_path = dataset_path + 'phishing/'\n", 171 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 172 | "targets 
= targets_file.read()\n", 173 | "imgs_num = 1195\n", 174 | "all_imgs_test,all_labels_test,all_file_names_test = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 175 | "\n", 176 | "phish_test_idx = np.load(output_dir+phish_test_idx_name)\n", 177 | "phish_train_idx = np.load(output_dir+phish_train_idx_name)\n", 178 | "\n", 179 | "# Apply noise to test only \n", 180 | "X_phish_test = all_imgs_test[phish_test_idx,:]" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "# Load model and predict" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "# Load_model \n", 197 | "margin = 2.2\n", 198 | "def loss(y_true,y_pred):\n", 199 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 200 | " loss_value = K.mean(loss_value,axis=0)\n", 201 | " return loss_value\n", 202 | "\n", 203 | "final_model = load_model(model_name, custom_objects={'loss': loss})\n", 204 | "inside_model = final_model.layers[3]\n", 205 | "\n", 206 | "X_phish_test_noise = inside_model.predict(X_phish_test)\n", 207 | "\n", 208 | "np.save('X_phish_test_noise_'+noise,X_phish_test_noise)" 209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 3", 215 | "language": "python", 216 | "name": "python3" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 3 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython3", 228 | "version": "3.7.6" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 2 233 | } 234 | -------------------------------------------------------------------------------- /css/cayman.css: -------------------------------------------------------------------------------- 1 | .highlight table td { padding: 5px; } 2 | 3 | 
.highlight table pre { margin: 0; } 4 | 5 | .highlight .cm { color: #999988; font-style: italic; } 6 | 7 | .highlight .cp { color: #999999; font-weight: bold; } 8 | 9 | .highlight .c1 { color: #999988; font-style: italic; } 10 | 11 | .highlight .cs { color: #999999; font-weight: bold; font-style: italic; } 12 | 13 | .highlight .c, .highlight .cd { color: #999988; font-style: italic; } 14 | 15 | .highlight .err { color: #a61717; background-color: #e3d2d2; } 16 | 17 | .highlight .gd { color: #000000; background-color: #ffdddd; } 18 | 19 | .highlight .ge { color: #000000; font-style: italic; } 20 | 21 | .highlight .gr { color: #aa0000; } 22 | 23 | .highlight .gh { color: #999999; } 24 | 25 | .highlight .gi { color: #000000; background-color: #ddffdd; } 26 | 27 | .highlight .go { color: #888888; } 28 | 29 | .highlight .gp { color: #555555; } 30 | 31 | .highlight .gs { font-weight: bold; } 32 | 33 | .highlight .gu { color: #aaaaaa; } 34 | 35 | .highlight .gt { color: #aa0000; } 36 | 37 | .highlight .kc { color: #000000; font-weight: bold; } 38 | 39 | .highlight .kd { color: #000000; font-weight: bold; } 40 | 41 | .highlight .kn { color: #000000; font-weight: bold; } 42 | 43 | .highlight .kp { color: #000000; font-weight: bold; } 44 | 45 | .highlight .kr { color: #000000; font-weight: bold; } 46 | 47 | .highlight .kt { color: #445588; font-weight: bold; } 48 | 49 | .highlight .k, .highlight .kv { color: #000000; font-weight: bold; } 50 | 51 | .highlight .mf { color: #009999; } 52 | 53 | .highlight .mh { color: #009999; } 54 | 55 | .highlight .il { color: #009999; } 56 | 57 | .highlight .mi { color: #009999; } 58 | 59 | .highlight .mo { color: #009999; } 60 | 61 | .highlight .m, .highlight .mb, .highlight .mx { color: #009999; } 62 | 63 | .highlight .sb { color: #d14; } 64 | 65 | .highlight .sc { color: #d14; } 66 | 67 | .highlight .sd { color: #d14; } 68 | 69 | .highlight .s2 { color: #d14; } 70 | 71 | .highlight .se { color: #d14; } 72 | 73 | .highlight .sh { color: 
#d14; } 74 | 75 | .highlight .si { color: #d14; } 76 | 77 | .highlight .sx { color: #d14; } 78 | 79 | .highlight .sr { color: #009926; } 80 | 81 | .highlight .s1 { color: #d14; } 82 | 83 | .highlight .ss { color: #990073; } 84 | 85 | .highlight .s { color: #d14; } 86 | 87 | .highlight .na { color: #008080; } 88 | 89 | .highlight .bp { color: #999999; } 90 | 91 | .highlight .nb { color: #0086B3; } 92 | 93 | .highlight .nc { color: #445588; font-weight: bold; } 94 | 95 | .highlight .no { color: #008080; } 96 | 97 | .highlight .nd { color: #3c5d5d; font-weight: bold; } 98 | 99 | .highlight .ni { color: #800080; } 100 | 101 | .highlight .ne { color: #990000; font-weight: bold; } 102 | 103 | .highlight .nf { color: #990000; font-weight: bold; } 104 | 105 | .highlight .nl { color: #990000; font-weight: bold; } 106 | 107 | .highlight .nn { color: #555555; } 108 | 109 | .highlight .nt { color: #000080; } 110 | 111 | .highlight .vc { color: #008080; } 112 | 113 | .highlight .vg { color: #008080; } 114 | 115 | .highlight .vi { color: #008080; } 116 | 117 | .highlight .nv { color: #008080; } 118 | 119 | .highlight .ow { color: #000000; font-weight: bold; } 120 | 121 | .highlight .o { color: #000000; font-weight: bold; } 122 | 123 | .highlight .w { color: #bbbbbb; } 124 | 125 | .highlight { background-color: #f8f8f8; } 126 | 127 | * { box-sizing: border-box; } 128 | 129 | body { padding: 0; margin: 0; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 16px; line-height: 1.5; color: #606c71; } 130 | 131 | a { color: #1e6bb8; text-decoration: none; } 132 | a:hover { text-decoration: underline; } 133 | 134 | .btn { display: inline-block; margin-bottom: 1rem; color: rgba(255, 255, 255, 0.7); background-color: rgba(255, 255, 255, 0.08); border-color: rgba(255, 255, 255, 0.2); border-style: solid; border-width: 1px; border-radius: 0.3rem; transition: color 0.2s, background-color 0.2s, border-color 0.2s; } 135 | .btn:hover { color: rgba(255, 255, 
255, 0.8); text-decoration: none; background-color: rgba(255, 255, 255, 0.2); border-color: rgba(255, 255, 255, 0.3); } 136 | .btn + .btn { margin-left: 1rem; } 137 | @media screen and (min-width: 64em) { .btn { padding: 0.75rem 1rem; } } 138 | @media screen and (min-width: 42em) and (max-width: 64em) { .btn { padding: 0.6rem 0.9rem; font-size: 0.9rem; } } 139 | @media screen and (max-width: 42em) { .btn { display: block; width: 100%; padding: 0.75rem; font-size: 0.9rem; } 140 | .btn + .btn { margin-top: 1rem; margin-left: 0; } } 141 | 142 | .page-header { color: #fff; text-align: center; background-color: #159957; background-image: linear-gradient(120deg, #155799, #159957); } 143 | @media screen and (min-width: 64em) { .page-header { padding: 2rem 6rem; } } 144 | @media screen and (min-width: 42em) and (max-width: 64em) { .page-header { padding: 3rem 4rem; } } 145 | @media screen and (max-width: 42em) { .page-header { padding: 2rem 1rem; } } 146 | 147 | .project-name { margin-top: 0; margin-bottom: 0.1rem; } 148 | @media screen and (min-width: 64em) { .project-name { font-size: 3.25rem; } } 149 | @media screen and (min-width: 42em) and (max-width: 64em) { .project-name { font-size: 2.25rem; } } 150 | @media screen and (max-width: 42em) { .project-name { font-size: 1.75rem; } } 151 | 152 | .project-tagline { margin-bottom: 2rem; font-weight: normal; opacity: 0.7; } 153 | @media screen and (min-width: 64em) { .project-tagline { font-size: 1.25rem; } } 154 | @media screen and (min-width: 42em) and (max-width: 64em) { .project-tagline { font-size: 1.15rem; } } 155 | @media screen and (max-width: 42em) { .project-tagline { font-size: 1rem; } } 156 | 157 | .main-content { word-wrap: break-word; } 158 | .main-content :first-child { margin-top: 0; } 159 | @media screen and (min-width: 64em) { .main-content { max-width: 64rem; padding: 2rem 6rem; margin: 0 auto; font-size: 1.1rem; } } 160 | @media screen and (min-width: 42em) and (max-width: 64em) { .main-content { 
padding: 2rem 4rem; font-size: 1.1rem; } } 161 | @media screen and (max-width: 42em) { .main-content { padding: 2rem 1rem; font-size: 1rem; } } 162 | .main-content img { max-width: 200%; } 163 | .main-content h1, .main-content h2, .main-content h3, .main-content h4, .main-content h5, .main-content h6 { margin-top: 2rem; margin-bottom: 1rem; font-weight: normal; color: #159957; } 164 | .main-content p { margin-bottom: 1em; } 165 | .main-content code { padding: 2px 4px; font-family: Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 0.9rem; color: #567482; background-color: #f3f6fa; border-radius: 0.3rem; } 166 | .main-content pre { padding: 0.8rem; margin-top: 0; margin-bottom: 1rem; font: 1rem Consolas, "Liberation Mono", Menlo, Courier, monospace; color: #567482; word-wrap: normal; background-color: #f3f6fa; border: solid 1px #dce6f0; border-radius: 0.3rem; } 167 | .main-content pre > code { padding: 0; margin: 0; font-size: 0.9rem; color: #567482; word-break: normal; white-space: pre; background: transparent; border: 0; } 168 | .main-content .highlight { margin-bottom: 1rem; } 169 | .main-content .highlight pre { margin-bottom: 0; word-break: normal; } 170 | .main-content .highlight pre, .main-content pre { padding: 0.8rem; overflow: auto; font-size: 0.9rem; line-height: 1.45; border-radius: 0.3rem; -webkit-overflow-scrolling: touch; } 171 | .main-content pre code, .main-content pre tt { display: inline; max-width: initial; padding: 0; margin: 0; overflow: initial; line-height: inherit; word-wrap: normal; background-color: transparent; border: 0; } 172 | .main-content pre code:before, .main-content pre code:after, .main-content pre tt:before, .main-content pre tt:after { content: normal; } 173 | .main-content ul, .main-content ol { margin-top: 0; } 174 | .main-content blockquote { padding: 0 1rem; margin-left: 0; color: #819198; border-left: 0.3rem solid #dce6f0; } 175 | .main-content blockquote > :first-child { margin-top: 0; } 176 | 
.main-content blockquote > :last-child { margin-bottom: 0; } 177 | .main-content table { display: block; width: 100%; overflow: auto; word-break: normal; word-break: keep-all; -webkit-overflow-scrolling: touch; } 178 | .main-content table th { font-weight: bold; } 179 | .main-content table th, .main-content table td { padding: 0.5rem 1rem; border: 1px solid #e9ebec; } 180 | .main-content dl { padding: 0; } 181 | .main-content dl dt { padding: 0; margin-top: 1rem; font-size: 1rem; font-weight: bold; } 182 | .main-content dl dd { padding: 0; margin-bottom: 1rem; } 183 | .main-content hr { height: 2px; padding: 0; margin: 1rem 0; background-color: #eff0f1; border: 0; } 184 | 185 | .site-footer { padding-top: 2rem; margin-top: 2rem; border-top: solid 1px #eff0f1; } 186 | @media screen and (min-width: 64em) { .site-footer { font-size: 1rem; } } 187 | @media screen and (min-width: 42em) and (max-width: 64em) { .site-footer { font-size: 1rem; } } 188 | @media screen and (max-width: 42em) { .site-footer { font-size: 0.9rem; } } 189 | 190 | .site-footer-owner { display: block; font-weight: bold; } 191 | 192 | .site-footer-credits { color: #819198; } 193 | -------------------------------------------------------------------------------- /css/normalize.css: -------------------------------------------------------------------------------- 1 | /*! normalize.css v3.0.2 | MIT License | git.io/normalize */ 2 | 3 | /** 4 | * 1. Set default font family to sans-serif. 5 | * 2. Prevent iOS text size adjust after orientation change, without disabling 6 | * user zoom. 7 | */ 8 | 9 | html { 10 | font-family: sans-serif; /* 1 */ 11 | -ms-text-size-adjust: 100%; /* 2 */ 12 | -webkit-text-size-adjust: 100%; /* 2 */ 13 | } 14 | 15 | /** 16 | * Remove default margin. 
17 | */ 18 | 19 | body { 20 | margin: 0; 21 | } 22 | 23 | /* HTML5 display definitions 24 | ========================================================================== */ 25 | 26 | /** 27 | * Correct `block` display not defined for any HTML5 element in IE 8/9. 28 | * Correct `block` display not defined for `details` or `summary` in IE 10/11 29 | * and Firefox. 30 | * Correct `block` display not defined for `main` in IE 11. 31 | */ 32 | 33 | article, 34 | aside, 35 | details, 36 | figcaption, 37 | figure, 38 | footer, 39 | header, 40 | hgroup, 41 | main, 42 | menu, 43 | nav, 44 | section, 45 | summary { 46 | display: block; 47 | } 48 | 49 | /** 50 | * 1. Correct `inline-block` display not defined in IE 8/9. 51 | * 2. Normalize vertical alignment of `progress` in Chrome, Firefox, and Opera. 52 | */ 53 | 54 | audio, 55 | canvas, 56 | progress, 57 | video { 58 | display: inline-block; /* 1 */ 59 | vertical-align: baseline; /* 2 */ 60 | } 61 | 62 | /** 63 | * Prevent modern browsers from displaying `audio` without controls. 64 | * Remove excess height in iOS 5 devices. 65 | */ 66 | 67 | audio:not([controls]) { 68 | display: none; 69 | height: 0; 70 | } 71 | 72 | /** 73 | * Address `[hidden]` styling not present in IE 8/9/10. 74 | * Hide the `template` element in IE 8/9/11, Safari, and Firefox < 22. 75 | */ 76 | 77 | [hidden], 78 | template { 79 | display: none; 80 | } 81 | 82 | /* Links 83 | ========================================================================== */ 84 | 85 | /** 86 | * Remove the gray background color from active links in IE 10. 87 | */ 88 | 89 | a { 90 | background-color: transparent; 91 | } 92 | 93 | /** 94 | * Improve readability when focused and also mouse hovered in all browsers. 
95 | */ 96 | 97 | a:active, 98 | a:hover { 99 | outline: 0; 100 | } 101 | 102 | /* Text-level semantics 103 | ========================================================================== */ 104 | 105 | /** 106 | * Address styling not present in IE 8/9/10/11, Safari, and Chrome. 107 | */ 108 | 109 | abbr[title] { 110 | border-bottom: 1px dotted; 111 | } 112 | 113 | /** 114 | * Address style set to `bolder` in Firefox 4+, Safari, and Chrome. 115 | */ 116 | 117 | b, 118 | strong { 119 | font-weight: bold; 120 | } 121 | 122 | /** 123 | * Address styling not present in Safari and Chrome. 124 | */ 125 | 126 | dfn { 127 | font-style: italic; 128 | } 129 | 130 | /** 131 | * Address variable `h1` font-size and margin within `section` and `article` 132 | * contexts in Firefox 4+, Safari, and Chrome. 133 | */ 134 | 135 | h1 { 136 | font-size: 2em; 137 | margin: 0.67em 0; 138 | } 139 | 140 | /** 141 | * Address styling not present in IE 8/9. 142 | */ 143 | 144 | mark { 145 | background: #ff0; 146 | color: #000; 147 | } 148 | 149 | /** 150 | * Address inconsistent and variable font size in all browsers. 151 | */ 152 | 153 | small { 154 | font-size: 80%; 155 | } 156 | 157 | /** 158 | * Prevent `sub` and `sup` affecting `line-height` in all browsers. 159 | */ 160 | 161 | sub, 162 | sup { 163 | font-size: 75%; 164 | line-height: 0; 165 | position: relative; 166 | vertical-align: baseline; 167 | } 168 | 169 | sup { 170 | top: -0.5em; 171 | } 172 | 173 | sub { 174 | bottom: -0.25em; 175 | } 176 | 177 | /* Embedded content 178 | ========================================================================== */ 179 | 180 | /** 181 | * Remove border when inside `a` element in IE 8/9/10. 182 | */ 183 | 184 | img { 185 | border: 0; 186 | } 187 | 188 | /** 189 | * Correct overflow not hidden in IE 9/10/11. 
190 | */ 191 | 192 | svg:not(:root) { 193 | overflow: hidden; 194 | } 195 | 196 | /* Grouping content 197 | ========================================================================== */ 198 | 199 | /** 200 | * Address margin not present in IE 8/9 and Safari. 201 | */ 202 | 203 | figure { 204 | margin: 1em 40px; 205 | } 206 | 207 | /** 208 | * Address differences between Firefox and other browsers. 209 | */ 210 | 211 | hr { 212 | box-sizing: content-box; 213 | height: 0; 214 | } 215 | 216 | /** 217 | * Contain overflow in all browsers. 218 | */ 219 | 220 | pre { 221 | overflow: auto; 222 | } 223 | 224 | /** 225 | * Address odd `em`-unit font size rendering in all browsers. 226 | */ 227 | 228 | code, 229 | kbd, 230 | pre, 231 | samp { 232 | font-family: monospace, monospace; 233 | font-size: 1em; 234 | } 235 | 236 | /* Forms 237 | ========================================================================== */ 238 | 239 | /** 240 | * Known limitation: by default, Chrome and Safari on OS X allow very limited 241 | * styling of `select`, unless a `border` property is set. 242 | */ 243 | 244 | /** 245 | * 1. Correct color not being inherited. 246 | * Known issue: affects color of disabled elements. 247 | * 2. Correct font properties not being inherited. 248 | * 3. Address margins set differently in Firefox 4+, Safari, and Chrome. 249 | */ 250 | 251 | button, 252 | input, 253 | optgroup, 254 | select, 255 | textarea { 256 | color: inherit; /* 1 */ 257 | font: inherit; /* 2 */ 258 | margin: 0; /* 3 */ 259 | } 260 | 261 | /** 262 | * Address `overflow` set to `hidden` in IE 8/9/10/11. 263 | */ 264 | 265 | button { 266 | overflow: visible; 267 | } 268 | 269 | /** 270 | * Address inconsistent `text-transform` inheritance for `button` and `select`. 271 | * All other form control elements do not inherit `text-transform` values. 272 | * Correct `button` style inheritance in Firefox, IE 8/9/10/11, and Opera. 273 | * Correct `select` style inheritance in Firefox. 
274 | */ 275 | 276 | button, 277 | select { 278 | text-transform: none; 279 | } 280 | 281 | /** 282 | * 1. Avoid the WebKit bug in Android 4.0.* where (2) destroys native `audio` 283 | * and `video` controls. 284 | * 2. Correct inability to style clickable `input` types in iOS. 285 | * 3. Improve usability and consistency of cursor style between image-type 286 | * `input` and others. 287 | */ 288 | 289 | button, 290 | html input[type="button"], /* 1 */ 291 | input[type="reset"], 292 | input[type="submit"] { 293 | -webkit-appearance: button; /* 2 */ 294 | cursor: pointer; /* 3 */ 295 | } 296 | 297 | /** 298 | * Re-set default cursor for disabled elements. 299 | */ 300 | 301 | button[disabled], 302 | html input[disabled] { 303 | cursor: default; 304 | } 305 | 306 | /** 307 | * Remove inner padding and border in Firefox 4+. 308 | */ 309 | 310 | button::-moz-focus-inner, 311 | input::-moz-focus-inner { 312 | border: 0; 313 | padding: 0; 314 | } 315 | 316 | /** 317 | * Address Firefox 4+ setting `line-height` on `input` using `!important` in 318 | * the UA stylesheet. 319 | */ 320 | 321 | input { 322 | line-height: normal; 323 | } 324 | 325 | /** 326 | * It's recommended that you don't attempt to style these elements. 327 | * Firefox's implementation doesn't respect box-sizing, padding, or width. 328 | * 329 | * 1. Address box sizing set to `content-box` in IE 8/9/10. 330 | * 2. Remove excess padding in IE 8/9/10. 331 | */ 332 | 333 | input[type="checkbox"], 334 | input[type="radio"] { 335 | box-sizing: border-box; /* 1 */ 336 | padding: 0; /* 2 */ 337 | } 338 | 339 | /** 340 | * Fix the cursor style for Chrome's increment/decrement buttons. For certain 341 | * `font-size` values of the `input`, it causes the cursor style of the 342 | * decrement button to change from `default` to `text`. 343 | */ 344 | 345 | input[type="number"]::-webkit-inner-spin-button, 346 | input[type="number"]::-webkit-outer-spin-button { 347 | height: auto; 348 | } 349 | 350 | /** 351 | * 1. 
Address `appearance` set to `searchfield` in Safari and Chrome. 352 | * 2. Address `box-sizing` set to `border-box` in Safari and Chrome 353 | * (include `-moz` to future-proof). 354 | */ 355 | 356 | input[type="search"] { 357 | -webkit-appearance: textfield; /* 1 */ /* 2 */ 358 | box-sizing: content-box; 359 | } 360 | 361 | /** 362 | * Remove inner padding and search cancel button in Safari and Chrome on OS X. 363 | * Safari (but not Chrome) clips the cancel button when the search input has 364 | * padding (and `textfield` appearance). 365 | */ 366 | 367 | input[type="search"]::-webkit-search-cancel-button, 368 | input[type="search"]::-webkit-search-decoration { 369 | -webkit-appearance: none; 370 | } 371 | 372 | /** 373 | * Define consistent border, margin, and padding. 374 | */ 375 | 376 | fieldset { 377 | border: 1px solid #c0c0c0; 378 | margin: 0 2px; 379 | padding: 0.35em 0.625em 0.75em; 380 | } 381 | 382 | /** 383 | * 1. Correct `color` not being inherited in IE 8/9/10/11. 384 | * 2. Remove padding so people aren't caught out if they zero out fieldsets. 385 | */ 386 | 387 | legend { 388 | border: 0; /* 1 */ 389 | padding: 0; /* 2 */ 390 | } 391 | 392 | /** 393 | * Remove default vertical scrollbar in IE 8/9/10/11. 394 | */ 395 | 396 | textarea { 397 | overflow: auto; 398 | } 399 | 400 | /** 401 | * Don't inherit the `font-weight` (applied by a rule above). 402 | * NOTE: the default cannot safely be changed in Chrome and Safari on OS X. 403 | */ 404 | 405 | optgroup { 406 | font-weight: bold; 407 | } 408 | 409 | /* Tables 410 | ========================================================================== */ 411 | 412 | /** 413 | * Remove most spacing between table cells. 
414 | */ 415 | 416 | table { 417 | border-collapse: collapse; 418 | border-spacing: 0; 419 | } 420 | 421 | td, 422 | th { 423 | padding: 0; 424 | } 425 | -------------------------------------------------------------------------------- /code/evaluate/Evaluate.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import tensorflow as tf \n", 11 | "from keras.models import load_model\n", 12 | "\n", 13 | "from keras.models import Sequential\n", 14 | "from keras.layers import Dense\n", 15 | "from keras.layers import Flatten,Subtract,Reshape\n", 16 | "from keras.preprocessing import image\n", 17 | "from keras.models import Model\n", 18 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 19 | "from keras.regularizers import l2\n", 20 | "from keras import backend as K\n", 21 | "from keras.applications.vgg16 import VGG16\n", 22 | "from skimage.io import imsave\n", 23 | "\n", 24 | "from matplotlib.pyplot import imread\n", 25 | "from skimage.transform import rescale, resize\n", 26 | "import os" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# Load precomputed embeddings" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "output_dir = './'\n", 43 | "dataset_path = '../../datasets/VisualPhish/'\n", 44 | "\n", 45 | "phish_emb_name = 'phishing_emb.npy'\n", 46 | "phish_emb_labels_name = 'phishing_labels.npy'\n", 47 | "\n", 48 | "phish_train_idx_name = 'train_idx.npy'\n", 49 | "phish_test_idx_name = 'test_idx.npy'\n", 50 | "\n", 51 | "train_emb_name = 'whitelist_emb.npy'\n", 52 | "train_emb_labels_name = 'whitelist_labels.npy'\n", 53 | "\n", 54 | "#precomputed attacks embeddings for 
the phishing test set if any. \n", 55 | "#set use_attack to 1 to compute based on this\n", 56 | "phish_emb_test_attack = 'X_phish_test_noise_gamma.npy'\n", 57 | "use_attack = 0\n", 58 | "\n", 59 | "X_legit_train = np.load(output_dir+train_emb_name)\n", 60 | "y_legit_train = np.load(output_dir+train_emb_labels_name)\n", 61 | "\n", 62 | "X_phish = np.load(output_dir+phish_emb_name)\n", 63 | "y_phish = np.load(output_dir+phish_emb_labels_name)\n", 64 | "\n", 65 | "phish_test_idx = np.load(output_dir+phish_test_idx_name)\n", 66 | "phish_train_idx = np.load(output_dir+phish_train_idx_name)\n", 67 | "\n", 68 | "X_phish_test = X_phish[phish_test_idx,:]\n", 69 | "y_phish_test = y_phish[phish_test_idx,:]\n", 70 | "\n", 71 | "#set the phishing test set directly to the precomputed embeddings of the attack\n", 72 | "if use_attack == 1:\n", 73 | " X_phish_test = np.load(output_dir+phish_emb_test_attack)\n", 74 | " print('Test on: '+phish_emb_test_attack)\n", 75 | "\n", 76 | "X_phish_train = X_phish[phish_train_idx,:]\n", 77 | "y_phish_train = y_phish[phish_train_idx,:]\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "# Get file names of each example \n", 87 | "def read_file_names(data_path,file_name):\n", 88 | " targets_file = open(data_path+file_name, \"r\")\n", 89 | " targets = targets_file.read()\n", 90 | " \n", 91 | " file_names_list = []\n", 92 | " targets_list = targets.splitlines()\n", 93 | " for i in range(0,len(targets_list)):\n", 94 | " target_path = data_path + targets_list[i]\n", 95 | " file_names = sorted(os.listdir(target_path))\n", 96 | " for j in range(0,len(file_names)):\n", 97 | " file_names_list.append(file_names[j])\n", 98 | " return file_names_list\n", 99 | "\n", 100 | "legit_file_names = read_file_names(dataset_path+'trusted_list/','targets.txt')\n", 101 | "phish_file_names = read_file_names(dataset_path+'phishing/','targets.txt')\n", 102 | "\n", 103 | 
"phish_train_file_names = []\n", 104 | "for i in range(0,phish_train_idx.shape[0]):\n", 105 | " phish_train_file_names.append(phish_file_names[phish_train_idx[i]])\n", 106 | " \n", 107 | "phish_test_file_names = []\n", 108 | "for i in range(0,phish_test_idx.shape[0]):\n", 109 | " phish_test_file_names.append(phish_file_names[phish_test_idx[i]])\n", 110 | "\n", 111 | "def get_label_from_name(name):\n", 112 | " first_half = name.split('_',1)[0]\n", 113 | " number = int(first_half.replace('T',''))\n", 114 | " return number" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "# L2 distance\n", 124 | "def compute_distance_pair(layer1,layer2):\n", 125 | " diff = layer1 - layer2\n", 126 | " l2_diff = np.sum(diff**2) / X_phish_train.shape[1]\n", 127 | " return l2_diff\n", 128 | "\n", 129 | "# Pairwise distance between query image and training\n", 130 | "def compute_all_distances(test_matrix):\n", 131 | " train_size = phish_train_idx.shape[0] + X_legit_train.shape[0]\n", 132 | " X_all_train = np.concatenate((X_phish_train,X_legit_train))\n", 133 | " pairwise_distance = np.zeros([test_matrix.shape[0],train_size])\n", 134 | " for i in range(0,test_matrix.shape[0]):\n", 135 | " pair1 = test_matrix[i,:]\n", 136 | " for j in range(0,train_size):\n", 137 | " pair2 = X_all_train[j,:]\n", 138 | " l2_diff = compute_distance_pair(pair1,pair2)\n", 139 | " pairwise_distance[i,j] = l2_diff\n", 140 | " return pairwise_distance\n", 141 | "pairwise_distance = compute_all_distances(X_phish_test)\n", 142 | "\n", 143 | "# Find Smallest n distances\n", 144 | "def find_min_distances(distances,n):\n", 145 | " idx = distances.argsort()[:n]\n", 146 | " values = distances[idx]\n", 147 | " return idx,values\n", 148 | "\n", 149 | "# Find names of examples with min distance\n", 150 | "def find_names_min_distances(idx,values):\n", 151 | " names_min_distance = ''\n", 152 | " only_names = []\n", 153 | " 
distances = ''\n", 154 | " for i in range(0,idx.shape[0]):\n", 155 | " index_min_distance = idx[i]\n", 156 | " if (index_min_distance < X_phish_train.shape[0]):\n", 157 | " names_min_distance = names_min_distance + 'Phish: ' + phish_train_file_names[index_min_distance] +','\n", 158 | " only_names.append(phish_train_file_names[index_min_distance]) \n", 159 | " else:\n", 160 | " names_min_distance = names_min_distance + 'Legit: ' + legit_file_names[index_min_distance-X_phish_train.shape[0]] +','\n", 161 | " only_names.append(legit_file_names[index_min_distance-X_phish_train.shape[0]]) \n", 162 | " distances = distances + str(values[i]) + ','\n", 163 | " names_min_distance = names_min_distance[:-1]\n", 164 | " distances = distances[:-1]\n", 165 | " return names_min_distance,only_names,distances\n", 166 | "\n", 167 | "# Find same-category website (a match counts as correct if it lands in the same category, e.g. microsoft and ms_outlook)\n", 168 | "parents_targets = ['microsoft','apple','google','alibaba']\n", 169 | "sub_targets_names = [['ms_outlook','ms_office','ms_bing','ms_onedrive','ms_skype'],['itunes','icloud'],['google_drive'],['aliexpress']]\n", 170 | "\n", 171 | "parents_targets_idx = [90,12,65,4]\n", 172 | "sub_targets = [[150,152,151,149,148],[153,154],[147],[5]]\n", 173 | "\n", 174 | "def check_if_same_category(img_label1,img_label2):\n", 175 | " if_same = 0\n", 176 | " if img_label1 in parents_targets_idx:\n", 177 | " if img_label2 in sub_targets[parents_targets_idx.index(img_label1)]:\n", 178 | " if_same = 1\n", 179 | " elif img_label1 in sub_targets[0]:\n", 180 | " if img_label2 in sub_targets[0] or img_label2 == parents_targets_idx[0]:\n", 181 | " if_same = 1\n", 182 | " elif img_label1 in sub_targets[1]:\n", 183 | " if img_label2 in sub_targets[1] or img_label2 == parents_targets_idx[1]:\n", 184 | " if_same = 1\n", 185 | " elif img_label1 in sub_targets[2]:\n", 186 | " if img_label2 in sub_targets[2] or img_label2 == parents_targets_idx[2]:\n", 187 | " 
if_same = 1\n", 188 | " return if_same\n", 189 | "\n", 190 | "# Find if target is in the top closest n distances\n", 191 | "def check_if_target_in_top(test_file_name,only_names):\n", 192 | " found = 0\n", 193 | " idx = 0\n", 194 | " test_label = get_label_from_name(test_file_name)\n", 195 | " print('***')\n", 196 | " print('Test example: '+test_file_name)\n", 197 | " for i in range(0,len(only_names)):\n", 198 | " label_distance = get_label_from_name(only_names[i])\n", 199 | " if label_distance == test_label or check_if_same_category(test_label,label_distance) == 1:\n", 200 | " found = 1\n", 201 | " idx = i+1\n", 202 | " print('found')\n", 203 | " break\n", 204 | " return found,idx" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "# Compute correct matches" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "n = 1 #Top-1 match\n", 221 | "correct = 0\n", 222 | "\n", 223 | "for i in range(0,phish_test_idx.shape[0]):\n", 224 | " distances_to_train = pairwise_distance[i,:]\n", 225 | " idx,values = find_min_distances(np.ravel(distances_to_train),n)\n", 226 | " names_min_distance,only_names,min_distances = find_names_min_distances(idx,values)\n", 227 | " found,found_idx = check_if_target_in_top(phish_test_file_names[i],only_names)\n", 228 | " print(names_min_distance)\n", 229 | " \n", 230 | " if found == 1:\n", 231 | " correct += 1\n", 232 | " \n", 233 | "\n", 234 | "print(\"Correct match rate (fraction of test examples): \" + str(correct/phish_test_idx.shape[0]))" 235 | ] 236 | } 237 | ], 238 | "metadata": { 239 | "kernelspec": { 240 | "display_name": "Python 3", 241 | "language": "python", 242 | "name": "python3" 243 | }, 244 | "language_info": { 245 | "codemirror_mode": { 246 | "name": "ipython", 247 | "version": 3 248 | }, 249 | "file_extension": ".py", 250 | "mimetype": "text/x-python", 251 | "name": "python", 252 | "nbconvert_exporter": 
"python", 253 | "pygments_lexer": "ipython3", 254 | "version": "3.7.6" 255 | } 256 | }, 257 | "nbformat": 4, 258 | "nbformat_minor": 2 259 | } 260 | -------------------------------------------------------------------------------- /code/evaluate/Evaluate_new_phishing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import tensorflow as tf \n", 11 | "from keras.models import load_model\n", 12 | "\n", 13 | "from keras.models import Sequential\n", 14 | "from keras.layers import Dense\n", 15 | "from keras.layers import Flatten,Subtract,Reshape\n", 16 | "from keras.preprocessing import image\n", 17 | "from keras.models import Model\n", 18 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 19 | "from keras.regularizers import l2\n", 20 | "from keras import backend as K\n", 21 | "from keras.applications.vgg16 import VGG16\n", 22 | "from skimage.io import imsave\n", 23 | "\n", 24 | "from matplotlib.pyplot import imread\n", 25 | "from skimage.transform import rescale, resize\n", 26 | "import os" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# Data \n", 36 | "reshape_size = [224,224,3]\n", 37 | "dataset_path = '../../datasets/VisualPhish/'\n", 38 | "data_path = dataset_path + 'newly_crawled_phishing/'\n", 39 | "imgs_num = 955\n", 40 | "\n", 41 | "# Saved models and computed embeddings\n", 42 | "margin = 2.2\n", 43 | "output_dir = './'\n", 44 | "model_name = 'model.h5'\n", 45 | "\n", 46 | "phish_emb_name = 'phishing_emb.npy'\n", 47 | "phish_emb_labels_name = 'phishing_labels.npy'\n", 48 | "\n", 49 | "phish_train_idx_name = 'train_idx.npy'\n", 50 | "phish_test_idx_name = 'test_idx.npy'\n", 51 | "\n", 52 | 
"train_emb_name = 'whitelist_emb.npy'\n", 53 | "train_emb_labels_name = 'whitelist_labels.npy'" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "# Load New phishing examples" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "def read_new_phish_imgs(data_path,targets,imgs_num,reshape_size,labels_dict):\n", 70 | " all_imgs = np.zeros(shape=[imgs_num,224,224,3])\n", 71 | " all_labels = np.zeros(shape=[imgs_num,1])\n", 72 | " all_file_names = []\n", 73 | " file_names = sorted(os.listdir(data_path))\n", 74 | " count = 0\n", 75 | " for j in range(0,imgs_num):\n", 76 | " if not file_names[j] == 'labels.txt':\n", 77 | " print(file_names[j])\n", 78 | " img = imread(data_path+file_names[j])\n", 79 | " img = img[:,:,0:3]\n", 80 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 81 | " all_labels[count,:] = labels_dict[targets[j]]\n", 82 | " all_file_names.append(file_names[j])\n", 83 | " count = count + 1\n", 84 | " print(all_imgs.shape)\n", 85 | " return all_imgs,all_labels,all_file_names\n", 86 | "\n", 87 | "#get names of labels/indices\n", 88 | "def get_dict_targets(data_path,file_name):\n", 89 | " targets_file = open(data_path+file_name, \"r\")\n", 90 | " targets = targets_file.read()\n", 91 | " targets_list = targets.splitlines()\n", 92 | " x = {'absa':0}\n", 93 | " for i in range(0,len(targets_list)):\n", 94 | " target_path = data_path + targets_list[i]\n", 95 | " file_names = sorted(os.listdir(target_path))\n", 96 | " for j in range(0,len(file_names)):\n", 97 | " if not( file_names[j] == 'textfiles' or file_names[j] == 'textfiles_translated' ): \n", 98 | " first_file = file_names[j]\n", 99 | " first_file = file_names[j].replace('T','')\n", 100 | " label = int(first_file.split('_')[0])\n", 101 | " d = {targets_list[i]:label}\n", 102 | " x.update(d)\n", 103 | " break\n", 104 | " return 
x\n", 105 | " \n", 106 | "labels_dict = get_dict_targets(dataset_path+'phishing/','targets.txt')" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "targets_file = open(data_path+'labels.txt', \"r\")\n", 116 | "targets = targets_file.read()\n", 117 | "targets = targets.splitlines()\n", 118 | "all_imgs_phish_new,all_labels_phish_new,all_file_names_phish_new = read_new_phish_imgs(data_path,targets,imgs_num,reshape_size,labels_dict)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "# Load model and predict" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "# Load_model \n", 135 | "def loss(y_true,y_pred):\n", 136 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 137 | " loss_value = K.mean(loss_value,axis=0)\n", 138 | " return loss_value\n", 139 | "\n", 140 | "final_model = load_model(model_name, custom_objects={'loss': loss})\n", 141 | "inside_model = final_model.layers[3]\n", 142 | "\n", 143 | "X_phish_new = inside_model.predict(all_imgs_phish_new)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "# Load precomputed embeddings " 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "X_legit_train = np.load(output_dir+train_emb_name)\n", 160 | "y_legit_train = np.load(output_dir+train_emb_labels_name)\n", 161 | "\n", 162 | "X_phish = np.load(output_dir+phish_emb_name)\n", 163 | "y_phish = np.load(output_dir+phish_emb_labels_name)\n", 164 | "\n", 165 | "phish_test_idx = np.load(output_dir+phish_test_idx_name)\n", 166 | "phish_train_idx = np.load(output_dir+phish_train_idx_name)\n", 167 | "\n", 168 | "X_phish_test = X_phish[phish_test_idx,:]\n", 169 | "y_phish_test 
= y_phish[phish_test_idx,:]\n", 170 | "\n", 171 | "X_phish_train = X_phish[phish_train_idx,:]\n", 172 | "y_phish_train = y_phish[phish_train_idx,:]\n", 173 | "\n", 174 | "# Get file names of each example \n", 175 | "def read_file_names(data_path,file_name):\n", 176 | " targets_file = open(data_path+file_name, \"r\")\n", 177 | " targets = targets_file.read()\n", 178 | " \n", 179 | " file_names_list = []\n", 180 | " targets_list = targets.splitlines()\n", 181 | " for i in range(0,len(targets_list)):\n", 182 | " target_path = data_path + targets_list[i]\n", 183 | " file_names = sorted(os.listdir(target_path))\n", 184 | " for j in range(0,len(file_names)):\n", 185 | " file_names_list.append(file_names[j])\n", 186 | " return file_names_list\n", 187 | "\n", 188 | "legit_file_names = read_file_names(dataset_path+'trusted_list/','targets.txt')\n", 189 | "phish_file_names = read_file_names(dataset_path+'phishing/','targets.txt')\n", 190 | "\n", 191 | "phish_train_file_names = []\n", 192 | "for i in range(0,phish_train_idx.shape[0]): \n", 193 | " phish_train_file_names.append(phish_file_names[phish_train_idx[i]])\n", 194 | " \n", 195 | "phish_test_file_names = []\n", 196 | "for i in range(0,phish_test_idx.shape[0]):\n", 197 | " phish_test_file_names.append(phish_file_names[phish_test_idx[i]])\n", 198 | "\n", 199 | "def get_label_from_name(name):\n", 200 | " first_half = name.split('_',1)[0]\n", 201 | " number = int(first_half.replace('T',''))\n", 202 | " return number" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "# Compute distances" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "# L2 distance\n", 219 | "def compute_distance_pair(layer1,layer2):\n", 220 | " diff = layer1 - layer2\n", 221 | " l2_diff = np.sum(diff**2) / X_phish_train.shape[1]\n", 222 | " return l2_diff\n", 223 | "\n", 224 | "# Pairwise distance 
between query image and training\n", 225 | "def compute_all_distances(test_matrix):\n", 226 | " train_size = X_phish_train.shape[0] + X_legit_train.shape[0]\n", 227 | " X_all_train = np.concatenate((X_phish_train,X_legit_train))\n", 228 | " pairwise_distance = np.zeros([test_matrix.shape[0],train_size])\n", 229 | " for i in range(0,test_matrix.shape[0]):\n", 230 | " pair1 = test_matrix[i,:]\n", 231 | " for j in range(0,train_size):\n", 232 | " pair2 = X_all_train[j,:]\n", 233 | " l2_diff = compute_distance_pair(pair1,pair2)\n", 234 | " pairwise_distance[i,j] = l2_diff\n", 235 | " return pairwise_distance\n", 236 | "\n", 237 | "pairwise_distance = compute_all_distances(X_phish_new)\n", 238 | "\n", 239 | "# Find Smallest n distances\n", 240 | "def find_min_distances(distances,n):\n", 241 | " idx = distances.argsort()[:n]\n", 242 | " values = distances[idx]\n", 243 | " return idx,values\n", 244 | "\n", 245 | "# Find names of examples with min distance\n", 246 | "def find_names_min_distances(idx,values):\n", 247 | " names_min_distance = ''\n", 248 | " only_names = []\n", 249 | " distances = ''\n", 250 | " for i in range(0,idx.shape[0]):\n", 251 | " index_min_distance = idx[i]\n", 252 | " if (index_min_distance < X_phish_train.shape[0]):\n", 253 | " names_min_distance = names_min_distance + 'Phish: ' + phish_train_file_names[index_min_distance] +','\n", 254 | " only_names.append(phish_train_file_names[index_min_distance]) \n", 255 | " else:\n", 256 | " names_min_distance = names_min_distance + 'Legit: ' + legit_file_names[index_min_distance-X_phish_train.shape[0]] +','\n", 257 | " only_names.append(legit_file_names[index_min_distance-X_phish_train.shape[0]]) \n", 258 | " distances = distances + str(values[i]) + ','\n", 259 | " names_min_distance = names_min_distance[:-1]\n", 260 | " distances = distances[:-1]\n", 261 | " return names_min_distance,only_names,distances\n", 262 | "\n", 263 | "# Find same-category website (matching is correct if it was matched to the same 
category (e.g. microsoft and outlook ))\n", 264 | "parents_targets = ['microsoft','apple','google','alibaba']\n", 265 | "sub_targets = [['ms_outlook','ms_office','ms_bing','ms_onedrive','ms_skype'],['itunes','icloud'],['google_drive'],['aliexpress']]\n", 266 | "\n", 267 | "parents_targets_idx = [90,12,65,4]\n", 268 | "sub_targets = [[150,152,151,149,148],[153,154],[147],[5]]\n", 269 | "\n", 270 | "def check_if_same_category(img_label1,img_label2):\n", 271 | " if_same = 0\n", 272 | " if img_label1 in parents_targets_idx:\n", 273 | " if img_label2 in sub_targets[parents_targets_idx.index(img_label1)]:\n", 274 | " if_same = 1\n", 275 | " elif img_label1 in sub_targets[0]:\n", 276 | " if img_label2 in sub_targets[0] or img_label2 == parents_targets_idx[0]:\n", 277 | " if_same = 1\n", 278 | " elif img_label1 in sub_targets[1]:\n", 279 | " if img_label2 in sub_targets[1] or img_label2 == parents_targets_idx[1]:\n", 280 | " if_same = 1\n", 281 | " elif img_label1 in sub_targets[2]:\n", 282 | " if img_label2 in sub_targets[2] or img_label2 == parents_targets_idx[2]:\n", 283 | " if_same = 1\n", 284 | " return if_same\n", 285 | "\n", 286 | "# Find if target is in the top closest n distances\n", 287 | "def check_if_target_in_top(test_label,only_names):\n", 288 | " found = 0\n", 289 | " idx = 0\n", 290 | " #test_label = get_label_from_name(test_file_name)\n", 291 | " print('***')\n", 292 | " print('Target: '+str(test_label))\n", 293 | " for i in range(0,len(only_names)):\n", 294 | " label_distance = get_label_from_name(only_names[i])\n", 295 | " if int(label_distance) == int(test_label) or check_if_same_category(test_label,label_distance) == 1:\n", 296 | " found = 1\n", 297 | " idx = i+1\n", 298 | " print('found')\n", 299 | " break\n", 300 | " return found,idx" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "# Compute correct matches" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | 
"metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "n = 1 #Top-1 match\n", 317 | "correct = 0\n", 318 | "\n", 319 | "for i in range(0,X_phish_new.shape[0]):\n", 320 | " distances_to_train = pairwise_distance[i,:]\n", 321 | " idx,values = find_min_distances(np.ravel(distances_to_train),n)\n", 322 | " names_min_distance,only_names,min_distances = find_names_min_distances(idx,values)\n", 323 | " found,found_idx = check_if_target_in_top(all_labels_phish_new[i,0],only_names)\n", 324 | " print('Closest: '+names_min_distance)\n", 325 | " \n", 326 | " if found == 1:\n", 327 | " correct += 1\n", 328 | "\n", 329 | "print(\"Correct match percentage: \" + str(correct/X_phish_new.shape[0]))" 330 | ] 331 | } 332 | ], 333 | "metadata": { 334 | "kernelspec": { 335 | "display_name": "Python 3", 336 | "language": "python", 337 | "name": "python3" 338 | }, 339 | "language_info": { 340 | "codemirror_mode": { 341 | "name": "ipython", 342 | "version": 3 343 | }, 344 | "file_extension": ".py", 345 | "mimetype": "text/x-python", 346 | "name": "python", 347 | "nbconvert_exporter": "python", 348 | "pygments_lexer": "ipython3", 349 | "version": "3.7.6" 350 | } 351 | }, 352 | "nbformat": 4, 353 | "nbformat_minor": 2 354 | } 355 | -------------------------------------------------------------------------------- /code/evaluate/Whitebox_attacks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import tensorflow as tf \n", 10 | "\n", 11 | "\n", 12 | "from keras.models import Sequential\n", 13 | "from keras.layers import Dense\n", 14 | "from keras.layers import Flatten\n", 15 | "\n", 16 | "\n", 17 | "from keras.preprocessing import image\n", 18 | "from keras.models import Model\n", 19 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 20 | "from 
keras.regularizers import l2\n", 21 | "from keras import backend as K\n", 22 | "from keras.applications.vgg16 import VGG16\n", 23 | "\n", 24 | "from matplotlib.pyplot import imread,imshow\n", 25 | "from skimage.transform import rescale, resize\n", 26 | "from skimage.io import imsave\n", 27 | "\n", 28 | "import os\n", 29 | "import numpy as np\n", 30 | "from keras.models import load_model\n", 31 | "from tensorflow.compat.v1.keras.backend import get_session\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Dataset parameters \n", 41 | "dataset_path = '../../datasets/VisualPhish/'\n", 42 | "reshape_size = [224,224,3]\n", 43 | "num_targets = 155 \n", 44 | "\n", 45 | "# Model parameters\n", 46 | "input_shape = [224,224,3]\n", 47 | "margin = 2.2\n", 48 | "epsilon = 0.01 #the noise magnitude of adv examples\n", 49 | "\n", 50 | "output_dir = './'\n", 51 | "saved_model = 'model'" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "# Load dataset:\n", 59 | " - Load training screenshots per website\n", 60 | " - Load Phishing screenshots per website " 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "\n", 70 | "def read_imgs_per_website(data_path,targets,imgs_num,reshape_size,start_target_count):\n", 71 | " all_imgs = np.zeros(shape=[imgs_num,224,224,3])\n", 72 | " all_labels = np.zeros(shape=[imgs_num,1])\n", 73 | " \n", 74 | " all_file_names = []\n", 75 | " targets_list = targets.splitlines()\n", 76 | " count = 0\n", 77 | " for i in range(0,len(targets_list)):\n", 78 | " target_path = data_path + targets_list[i]\n", 79 | " print(target_path)\n", 80 | " file_names = sorted(os.listdir(target_path))\n", 81 | " for j in range(0,len(file_names)):\n", 82 | " try:\n", 83 | " img = imread(target_path+'/'+file_names[j])\n", 84 | " img = img[:,:,0:3]\n", 85 
| " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 86 | " all_labels[count,:] = i + start_target_count\n", 87 | " all_file_names.append(file_names[j])\n", 88 | " count = count + 1\n", 89 | " except Exception:\n", 90 | " #some images were saved with the wrong extension; retry as JPEG\n", 91 | " try:\n", 92 | " img = imread(target_path+'/'+file_names[j],format='jpeg')\n", 93 | " img = img[:,:,0:3]\n", 94 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 95 | " all_labels[count,:] = i + start_target_count\n", 96 | " all_file_names.append(file_names[j])\n", 97 | " count = count + 1\n", 98 | " except Exception:\n", 99 | " print('failed at:')\n", 100 | " print('***')\n", 101 | " print(file_names[j])\n", 102 | " break \n", 103 | " return all_imgs,all_labels,all_file_names\n", 104 | "\n" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "# Read legit images (train)\n", 114 | "data_path = dataset_path + 'trusted_list/'\n", 115 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 116 | "targets = targets_file.read()\n", 117 | "imgs_num = 9363\n", 118 | "all_imgs_train,all_labels_train,all_file_names_train = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 119 | "\n", 120 | "# Read phishing images\n", 121 | "data_path = dataset_path + 'phishing/'\n", 122 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 123 | "targets = targets_file.read()\n", 124 | "imgs_num = 1195\n", 125 | "all_imgs_test,all_labels_test,all_file_names_test = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 126 | "\n", 127 | "X_train_legit = all_imgs_train\n", 128 | "y_train_legit = all_labels_train\n", 129 | "\n", 130 | "# Load the train and test split\n", 131 | "phish_test_idx = np.load(output_dir+'test_idx.npy')\n", 132 | "phish_train_idx = np.load(output_dir+'train_idx.npy')\n", 133 | "\n", 
134 | "X_test_phish = all_imgs_test[phish_test_idx,:]\n", 135 | "y_test_phish = all_labels_test[phish_test_idx,:]\n", 136 | "\n", 137 | "X_train_phish = all_imgs_test[phish_train_idx,:]\n", 138 | "y_train_phish = all_labels_test[phish_train_idx,:]\n" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "# Load Model " 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "#load model\n", 155 | "\n", 156 | "from keras.models import load_model\n", 157 | "margin = 2.2\n", 158 | "def loss(y_true,y_pred):\n", 159 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 160 | " loss_value = K.mean(loss_value,axis=0)\n", 161 | " return loss_value\n", 162 | "\n", 163 | "full_model = load_model(output_dir+saved_model+'.h5', custom_objects={'loss': loss})\n", 164 | "\n", 165 | "#define custom_loss\n", 166 | "def custom_loss(margin):\n", 167 | " def loss(y_true,y_pred):\n", 168 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 169 | " loss_value = K.mean(loss_value,axis=0)\n", 170 | " return loss_value\n", 171 | " return loss\n", 172 | "my_loss = custom_loss(30) #Use a high margin so that the loss is never 0 (a zero loss would give a zero gradient)\n", 173 | "\n", 174 | "#get tf session\n", 175 | "sess = K.get_session()\n", 176 | "#sess = get_session()\n", 177 | "#to be able to use tf.placeholder\n", 178 | "#tf.disable_v2_behavior() \n" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "# Triplet Sampling" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "# Order the split array \n", 195 | "def order_random_array(orig_arr,y_orig_arr,targets):\n", 196 | " sorted_arr = np.zeros(orig_arr.shape)\n", 197 | " y_sorted_arr = np.zeros(y_orig_arr.shape)\n", 198 | " count = 0\n", 199 | " for i in 
range(0,targets):\n", 200 | " for j in range(0,orig_arr.shape[0]):\n", 201 | " if y_orig_arr[j] == i:\n", 202 | " sorted_arr[count,:,:,:] = orig_arr[j,:,:,:]\n", 203 | " y_sorted_arr[count,:] = i\n", 204 | " count = count + 1\n", 205 | " return sorted_arr,y_sorted_arr \n", 206 | "\n", 207 | "X_test_phish,y_test_phish = order_random_array(X_test_phish,y_test_phish,num_targets)\n", 208 | "X_train_phish,y_train_phish = order_random_array(X_train_phish,y_train_phish,num_targets)\n", 209 | "\n", 210 | "\n", 211 | "# Get start and end of each label of the phishing set \n", 212 | "def start_end_each_target_not_complete(num_target,labels):\n", 213 | " prev_target = labels[0]\n", 214 | " start_end_each_target = np.zeros((num_target,2))\n", 215 | " start_end_each_target[0,0] = labels[0]\n", 216 | " if not labels[0] == 0:\n", 217 | " start_end_each_target[0,0] = -1\n", 218 | " start_end_each_target[0,1] = -1\n", 219 | " count_target = 0\n", 220 | " for i in range(1,labels.shape[0]):\n", 221 | " if not labels[i] == prev_target:\n", 222 | " start_end_each_target[int(labels[i-1]),1] = int(i-1)\n", 223 | " #count_target = count_target + 1\n", 224 | " start_end_each_target[int(labels[i]),0] = int(i)\n", 225 | " prev_target = labels[i]\n", 226 | " start_end_each_target[int(labels[-1]),1] = int(labels.shape[0]-1)\n", 227 | " \n", 228 | " for i in range(1,num_target):\n", 229 | " if start_end_each_target[i,0] == 0:\n", 230 | " start_end_each_target[i,0] = -1\n", 231 | " start_end_each_target[i,1] = -1\n", 232 | " return start_end_each_target\n", 233 | "\n", 234 | "labels_start_end_train_phish = start_end_each_target_not_complete(num_targets,y_train_phish)\n", 235 | "\n", 236 | "\n", 237 | "# Get start and end of each label\n", 238 | "def start_end_each_target(num_target,labels):\n", 239 | " prev_target = 0\n", 240 | " start_end_each_target = np.zeros((num_target,2))\n", 241 | " start_end_each_target[0,0] = 0\n", 242 | " count_target = 0\n", 243 | " for i in 
range(1,labels.shape[0]):\n", 244 | " if not labels[i] == prev_target:\n", 245 | " start_end_each_target[count_target,1] = i-1\n", 246 | " count_target = count_target + 1\n", 247 | " start_end_each_target[count_target,0] = i\n", 248 | " prev_target = prev_target + 1\n", 249 | " start_end_each_target[num_target-1,1] = labels.shape[0]-1\n", 250 | " return start_end_each_target\n", 251 | "\n", 252 | "labels_start_end_train_legit = start_end_each_target(num_targets,y_train_legit)\n", 253 | "\n", 254 | "def pick_pos_img_idx(prob_phish,img_label):\n", 255 | " if np.random.uniform() > prob_phish:\n", 256 | " #take image from legit\n", 257 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 258 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 259 | " img = X_train_legit[same_idx,:]\n", 260 | " else:\n", 261 | " #take from phish\n", 262 | " if not labels_start_end_train_phish[img_label,0] == -1:\n", 263 | " class_idx_start_end = labels_start_end_train_phish[img_label,:]\n", 264 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 265 | " img = X_train_phish[same_idx,:]\n", 266 | " else:\n", 267 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 268 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 269 | " img = X_train_legit[same_idx,:]\n", 270 | " return img\n", 271 | "\n", 272 | "\n", 273 | "def pick_neg_img(anchor_idx,num_targets):\n", 274 | " if anchor_idx == 0:\n", 275 | " targets = np.arange(1,num_targets)\n", 276 | " elif anchor_idx == num_targets -1:\n", 277 | " targets = np.arange(0,num_targets-1)\n", 278 | " else:\n", 279 | " targets = np.concatenate([np.arange(0,anchor_idx),np.arange(anchor_idx+1,num_targets)])\n", 280 | " diff_target_idx = np.random.randint(low = 0,high = num_targets-1)\n", 281 | " diff_target = targets[diff_target_idx]\n", 282 | " \n", 283 | " 
class_idx_start_end = labels_start_end_train_legit[diff_target,:]\n", 284 | " idx_from_diff_target = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 285 | " img = X_train_legit[idx_from_diff_target,:]\n", 286 | " \n", 287 | " return img,diff_target\n", 288 | "\n", 289 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 290 | "all_targets = targets_file.read()\n", 291 | "all_targets = all_targets.splitlines()\n", 292 | "\n", 293 | "def get_idx_of_target(target_name,all_targets):\n", 294 | " for i in range(0,len(all_targets)):\n", 295 | " if all_targets[i] == target_name:\n", 296 | " found_idx = i\n", 297 | " return found_idx\n", 298 | "\n", 299 | "target_lists = [['microsoft','ms_outlook','ms_office','ms_bing','ms_onedrive','ms_skype'],['apple','itunes','icloud'],['google','google_drive'],['alibaba','aliexpress']]\n", 300 | "\n", 301 | "def get_associated_targets_idx(target_lists,all_targets):\n", 302 | " sub_target_lists_idx = []\n", 303 | " parents_ids = []\n", 304 | " for i in range(0,len(target_lists)):\n", 305 | " target_list = target_lists[i]\n", 306 | " parent_target = target_list[0]\n", 307 | " one_target_list = []\n", 308 | " parent_idx = get_idx_of_target(parent_target,all_targets)\n", 309 | " parents_ids.append(parent_idx)\n", 310 | " for child_target in target_list[1:]:\n", 311 | " child_idx = get_idx_of_target(child_target,all_targets)\n", 312 | " one_target_list.append(child_idx)\n", 313 | " sub_target_lists_idx.append(one_target_list)\n", 314 | " return parents_ids,sub_target_lists_idx \n", 315 | "\n", 316 | "parents_ids,sub_target_lists_idx = get_associated_targets_idx(target_lists,all_targets)\n", 317 | "\n", 318 | "def check_if_same_category(img_label1,img_label2):\n", 319 | " if_same = 0\n", 320 | " if img_label1 in parents_ids:\n", 321 | " if img_label2 in sub_target_lists_idx[parents_ids.index(img_label1)]:\n", 322 | " if_same = 1\n", 323 | " elif img_label1 in sub_target_lists_idx[0]:\n", 324 | " if 
img_label2 in sub_target_lists_idx[0] or img_label2 == parents_ids[0]:\n", 325 | " if_same = 1\n", 326 | " elif img_label1 in sub_target_lists_idx[1]:\n", 327 | " if img_label2 in sub_target_lists_idx[1] or img_label2 == parents_ids[1]:\n", 328 | " if_same = 1\n", 329 | " elif img_label1 in sub_target_lists_idx[2]:\n", 330 | " if img_label2 in sub_target_lists_idx[2] or img_label2 == parents_ids[2]:\n", 331 | " if_same = 1\n", 332 | " return if_same" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "# Generate Adv. examples " 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "def get_adv_example(triple,epsilon):\n", 349 | " \n", 350 | " # Initialize adversarial example \n", 351 | " anchor_adv = np.zeros_like(triple[0])\n", 352 | " # Added noise\n", 353 | " anchor_noise = np.zeros_like(triple[0])\n", 354 | "\n", 355 | " y_true = tf.placeholder(\"float\", [None,1])\n", 356 | " target = np.zeros([1,1])\n", 357 | " target.astype(float)\n", 358 | " \n", 359 | " # Get the loss and gradient of the loss wrt the inputs\n", 360 | " loss_val = my_loss(y_true, full_model.output)\n", 361 | " grads = K.gradients(loss_val, full_model.input[0])\n", 362 | " \n", 363 | " # Get the sign of the gradient\n", 364 | " delta = K.sign(grads[0])\n", 365 | " \n", 366 | " dict_input = {y_true:target,full_model.input[0]:triple[0],full_model.input[1]:triple[1],full_model.input[2]:triple[2] }\n", 367 | " delta1 = sess.run(delta, feed_dict=dict_input)\n", 368 | " \n", 369 | " # Get noise\n", 370 | " anchor_noise = anchor_noise + delta1\n", 371 | " \n", 372 | " # Perturb the image\n", 373 | " anchor_adv = triple[0] + epsilon*delta1\n", 374 | " \n", 375 | " return anchor_noise,anchor_adv" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "# 
initialize 3 empty arrays for the input image batch\n", 385 | "batch_size = 1\n", 386 | "h = X_train_legit.shape[1]\n", 387 | "w = X_train_legit.shape[2]\n", 388 | "triple=[np.zeros((batch_size, h, w,3)) for i in range(3)]\n", 389 | "\n", 390 | "X_test_phish_non_ordered = all_imgs_test[phish_test_idx,:]\n", 391 | "y_test_phish_non_ordered = all_labels_test[phish_test_idx,:]\n", 392 | "\n", 393 | "X_test_phish_adv = np.zeros_like(X_test_phish_non_ordered)\n", 394 | "\n", 395 | "for i in range(0,X_test_phish.shape[0]):\n", 396 | " first_img = X_test_phish_non_ordered[i,:]\n", 397 | " triple[0][0,:,:,:] = first_img\n", 398 | " first_img_label = int(y_test_phish_non_ordered[i,:])\n", 399 | " \n", 400 | " pos_img = pick_pos_img_idx(-0.1,first_img_label)\n", 401 | " triple[1][0,:,:,:] = pos_img\n", 402 | " \n", 403 | " #get image for the third: negative from legit\n", 404 | " neg_img,label_neg = pick_neg_img(first_img_label,num_targets)\n", 405 | " while check_if_same_category(first_img_label,label_neg) == 1:\n", 406 | " neg_img,label_neg = pick_neg_img(first_img_label,num_targets)\n", 407 | " triple[2][0,:,:,:] = neg_img\n", 408 | " \n", 409 | " anchor_noise,anchor_adv = get_adv_example(triple,epsilon)\n", 410 | " X_test_phish_adv[i,:] = anchor_adv\n", 411 | "\n", 412 | "# Predict perturbed images using the saved model\n", 413 | "inside_model = full_model.layers[3]\n", 414 | "X_test_phish_adv_features = inside_model.predict(X_test_phish_adv)\n", 415 | "np.save(output_dir+'X_test_phish_adv_features',X_test_phish_adv_features)" 416 | ] 417 | } 418 | ], 419 | "metadata": { 420 | "kernelspec": { 421 | "display_name": "Python 3", 422 | "language": "python", 423 | "name": "python3" 424 | }, 425 | "language_info": { 426 | "codemirror_mode": { 427 | "name": "ipython", 428 | "version": 3 429 | }, 430 | "file_extension": ".py", 431 | "mimetype": "text/x-python", 432 | "name": "python", 433 | "nbconvert_exporter": "python", 434 | "pygments_lexer": "ipython3", 435 | "version": "3.7.6" 436 | } 437 
| }, 438 | "nbformat": 4, 439 | "nbformat_minor": 2 440 | } 441 | -------------------------------------------------------------------------------- /code/train/Train_adv.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import tensorflow as tf \n", 11 | "\n", 12 | "from keras.models import Sequential\n", 13 | "from keras.layers import Dense\n", 14 | "from keras.layers import Flatten,Subtract,Reshape\n", 15 | "from keras.preprocessing import image\n", 16 | "from keras.models import Model\n", 17 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 18 | "from keras.regularizers import l2\n", 19 | "from keras import backend as K\n", 20 | "from keras.applications.vgg16 import VGG16\n", 21 | "from skimage.io import imsave\n", 22 | "\n", 23 | "from matplotlib.pyplot import imread\n", 24 | "from skimage.transform import rescale, resize\n", 25 | "import os\n", 26 | "from keras import optimizers\n", 27 | "from keras.models import load_model\n" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "margin = 2.2\n", 37 | "start_lr = 0.00002\n", 38 | "\n", 39 | "dataset_path = '../../datasets/VisualPhish/'\n", 40 | "reshape_size = [224,224,3]\n", 41 | "num_targets = 155 \n", 42 | "batch_size = 32 \n", 43 | "n_iter = 6000\n", 44 | "\n", 45 | "input_shape = [224,224,3]\n", 46 | "saved_model_name = 'model.h5' #from first training \n", 47 | "\n", 48 | "new_saved_model_name = 'model_adv'\n", 49 | "output_dir = './'\n", 50 | "save_interval = 1000\n", 51 | "lr_interval = 300" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "# Load data" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | 
"execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "\n", 68 | "def read_imgs_per_website(data_path,targets,imgs_num,reshape_size,start_target_count):\n", 69 | " all_imgs = np.zeros(shape=[imgs_num,224,224,3])\n", 70 | " all_labels = np.zeros(shape=[imgs_num,1])\n", 71 | " \n", 72 | " all_file_names = []\n", 73 | " targets_list = targets.splitlines()\n", 74 | " count = 0\n", 75 | " for i in range(0,len(targets_list)):\n", 76 | " target_path = data_path + targets_list[i]\n", 77 | " print(target_path)\n", 78 | " file_names = sorted(os.listdir(target_path))\n", 79 | " for j in range(0,len(file_names)):\n", 80 | " try:\n", 81 | " img = imread(target_path+'/'+file_names[j])\n", 82 | " img = img[:,:,0:3]\n", 83 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 84 | " all_labels[count,:] = i + start_target_count\n", 85 | " all_file_names.append(file_names[j])\n", 86 | " count = count + 1\n", 87 | " except:\n", 88 | " #some images were saved with a wrong extensions \n", 89 | " try:\n", 90 | " img = imread(target_path+'/'+file_names[j],format='jpeg')\n", 91 | " img = img[:,:,0:3]\n", 92 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 93 | " all_labels[count,:] = i + start_target_count\n", 94 | " all_file_names.append(file_names[j])\n", 95 | " count = count + 1\n", 96 | " except:\n", 97 | " print('failed at:')\n", 98 | " print('***')\n", 99 | " print(file_names[j])\n", 100 | " break \n", 101 | " return all_imgs,all_labels,all_file_names" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "# Read images legit (train)\n", 111 | "data_path = dataset_path + 'trusted_list/'\n", 112 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 113 | "targets = targets_file.read()\n", 114 | "imgs_num = 9363\n", 115 | 
"all_imgs_train,all_labels_train,all_file_names_train = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 116 | "\n", 117 | "# Read images phishing\n", 118 | "data_path = dataset_path + 'phishing/'\n", 119 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 120 | "targets = targets_file.read()\n", 121 | "imgs_num = 1195\n", 122 | "all_imgs_test,all_labels_test,all_file_names_test = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 123 | "\n", 124 | "X_train_legit = all_imgs_train\n", 125 | "y_train_legit = all_labels_train\n", 126 | "\n", 127 | "# Load indices of training and test split\n", 128 | "idx_train = np.load(output_dir+'train_idx.npy')\n", 129 | "idx_test = np.load(output_dir+'test_idx.npy')\n", 130 | "X_test_phish = all_imgs_test[idx_test,:]\n", 131 | "y_test_phish = all_labels_test[idx_test,:]\n", 132 | "\n", 133 | "X_train_phish = all_imgs_test[idx_train,:]\n", 134 | "y_train_phish = all_labels_test[idx_train,:]\n", 135 | "\n", 136 | "def order_random_array(orig_arr,y_orig_arr,targets):\n", 137 | " sorted_arr = np.zeros(orig_arr.shape)\n", 138 | " y_sorted_arr = np.zeros(y_orig_arr.shape)\n", 139 | " count = 0\n", 140 | " for i in range(0,targets):\n", 141 | " for j in range(0,orig_arr.shape[0]):\n", 142 | " if y_orig_arr[j] == i:\n", 143 | " sorted_arr[count,:,:,:] = orig_arr[j,:,:,:]\n", 144 | " y_sorted_arr[count,:] = i\n", 145 | " count = count + 1\n", 146 | " return sorted_arr,y_sorted_arr \n", 147 | "\n", 148 | "X_test_phish,y_test_phish = order_random_array(X_test_phish,y_test_phish,155)\n", 149 | "X_train_phish,y_train_phish = order_random_array(X_train_phish,y_train_phish,155)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "#get start and end of each label\n", 159 | "def start_end_each_target_not_complete(num_target,labels):\n", 160 | " prev_target = labels[0]\n", 161 | " start_end_each_target = 
np.zeros((num_target,2))\n", 162 | " start_end_each_target[0,0] = labels[0]\n", 163 | " if not labels[0] == 0:\n", 164 | " start_end_each_target[0,0] = -1\n", 165 | " start_end_each_target[0,1] = -1\n", 166 | " count_target = 0\n", 167 | " for i in range(1,labels.shape[0]):\n", 168 | " if not labels[i] == prev_target:\n", 169 | " start_end_each_target[int(labels[i-1]),1] = int(i-1)\n", 170 | " #count_target = count_target + 1\n", 171 | " start_end_each_target[int(labels[i]),0] = int(i)\n", 172 | " prev_target = labels[i]\n", 173 | " start_end_each_target[int(labels[-1]),1] = int(labels.shape[0]-1)\n", 174 | " \n", 175 | " for i in range(1,num_target):\n", 176 | " if start_end_each_target[i,0] == 0:\n", 177 | " start_end_each_target[i,0] = -1\n", 178 | " start_end_each_target[i,1] = -1\n", 179 | " return start_end_each_target\n", 180 | "\n", 181 | "labels_start_end_train_phish = start_end_each_target_not_complete(num_targets,y_train_phish)\n", 182 | "labels_start_end_test_phish = start_end_each_target_not_complete(num_targets,y_test_phish)\n", 183 | "\n", 184 | "\n", 185 | "def start_end_each_target(num_target,labels):\n", 186 | " prev_target = 0\n", 187 | " start_end_each_target = np.zeros((num_target,2))\n", 188 | " start_end_each_target[0,0] = 0\n", 189 | " count_target = 0\n", 190 | " for i in range(1,labels.shape[0]):\n", 191 | " if not labels[i] == prev_target:\n", 192 | " start_end_each_target[count_target,1] = i-1\n", 193 | " count_target = count_target + 1\n", 194 | " start_end_each_target[count_target,0] = i\n", 195 | " prev_target = prev_target + 1\n", 196 | " start_end_each_target[num_target-1,1] = labels.shape[0]-1\n", 197 | " return start_end_each_target\n", 198 | "\n", 199 | "labels_start_end_train_legit = start_end_each_target(num_targets,y_train_legit)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "# Read trained model " 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | 
"execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "def custom_loss(margin):\n", 216 | " def loss(y_true,y_pred):\n", 217 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 218 | " loss_value = K.mean(loss_value,axis=0)\n", 219 | " return loss_value\n", 220 | " return loss\n", 221 | "my_loss = custom_loss(30)\n", 222 | "\n", 223 | "def loss(y_true,y_pred):\n", 224 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 225 | " loss_value = K.mean(loss_value,axis=0)\n", 226 | " return loss_value\n", 227 | "\n", 228 | "model = load_model(output_dir+saved_model_name, custom_objects={'loss': loss})\n", 229 | "optimizer = optimizers.Adam(lr = start_lr)\n", 230 | "model.compile(loss=custom_loss(margin),optimizer=optimizer)\n", 231 | "model.summary()\n", 232 | "sess = K.get_session()" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "# Sample triplets" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "def pick_first_img_idx(labels_start_end,num_targets):\n", 249 | " random_target = -1\n", 250 | " while (random_target == -1):\n", 251 | " random_target = np.random.randint(low = 0,high = num_targets)\n", 252 | " if labels_start_end[random_target,0] == -1:\n", 253 | " random_target = -1\n", 254 | " class_idx_start_end = labels_start_end[random_target,:]\n", 255 | " img_from_target_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 256 | " return img_from_target_idx\n", 257 | "\n", 258 | "def pick_pos_img_idx(prob_phish,img_label):\n", 259 | " if np.random.uniform() > prob_phish:\n", 260 | " #take image from legit\n", 261 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 262 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 263 | " img = X_train_legit[same_idx,:]\n", 264 | 
" else:\n", 265 | " #take from phish\n", 266 | " if not labels_start_end_train_phish[img_label,0] == -1:\n", 267 | " class_idx_start_end = labels_start_end_train_phish[img_label,:]\n", 268 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 269 | " img = X_train_phish[same_idx,:]\n", 270 | " else:\n", 271 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 272 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 273 | " img = X_train_legit[same_idx,:]\n", 274 | " return img\n", 275 | "\n", 276 | "def pick_neg_img(anchor_idx,num_targets):\n", 277 | " if anchor_idx == 0:\n", 278 | " targets = np.arange(1,num_targets)\n", 279 | " elif anchor_idx == num_targets -1:\n", 280 | " targets = np.arange(0,num_targets-1)\n", 281 | " else:\n", 282 | " targets = np.concatenate([np.arange(0,anchor_idx),np.arange(anchor_idx+1,num_targets)])\n", 283 | " diff_target_idx = np.random.randint(low = 0,high = num_targets-1)\n", 284 | " diff_target = targets[diff_target_idx]\n", 285 | " \n", 286 | " class_idx_start_end = labels_start_end_train_legit[diff_target,:]\n", 287 | " idx_from_diff_target = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 288 | " img = X_train_legit[idx_from_diff_target,:]\n", 289 | " \n", 290 | " return img,diff_target\n", 291 | "\n", 292 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 293 | "all_targets = targets_file.read()\n", 294 | "all_targets = all_targets.splitlines()\n", 295 | "\n", 296 | "def get_idx_of_target(target_name,all_targets):\n", 297 | " for i in range(0,len(all_targets)):\n", 298 | " if all_targets[i] == target_name:\n", 299 | " found_idx = i\n", 300 | " return found_idx\n", 301 | "\n", 302 | "target_lists = [['microsoft','ms_outlook','ms_office','ms_bing','ms_onedrive','ms_skype'],['apple','itunes','icloud'],['google','google_drive'],['alibaba','aliexpress']]\n", 303 | "\n", 304 | "def 
get_associated_targets_idx(target_lists,all_targets):\n", 305 | " sub_target_lists_idx = []\n", 306 | " parents_ids = []\n", 307 | " for i in range(0,len(target_lists)):\n", 308 | " target_list = target_lists[i]\n", 309 | " parent_target = target_list[0]\n", 310 | " one_target_list = []\n", 311 | " parent_idx = get_idx_of_target(parent_target,all_targets)\n", 312 | " parents_ids.append(parent_idx)\n", 313 | " for child_target in target_list[1:]:\n", 314 | " child_idx = get_idx_of_target(child_target,all_targets)\n", 315 | " one_target_list.append(child_idx)\n", 316 | " sub_target_lists_idx.append(one_target_list)\n", 317 | " return parents_ids,sub_target_lists_idx \n", 318 | "\n", 319 | "parents_ids,sub_target_lists_idx = get_associated_targets_idx(target_lists,all_targets)\n", 320 | "\n", 321 | "def check_if_same_category(img_label1,img_label2):\n", 322 | " if_same = 0\n", 323 | " if img_label1 in parents_ids:\n", 324 | " if img_label2 in sub_target_lists_idx[parents_ids.index(img_label1)]:\n", 325 | " if_same = 1\n", 326 | " elif img_label1 in sub_target_lists_idx[0]:\n", 327 | " if img_label2 in sub_target_lists_idx[0] or img_label2 == parents_ids[0]:\n", 328 | " if_same = 1\n", 329 | " elif img_label1 in sub_target_lists_idx[1]:\n", 330 | " if img_label2 in sub_target_lists_idx[1] or img_label2 == parents_ids[1]:\n", 331 | " if_same = 1\n", 332 | " elif img_label1 in sub_target_lists_idx[2]:\n", 333 | " if img_label2 in sub_target_lists_idx[2] or img_label2 == parents_ids[2]:\n", 334 | " if_same = 1\n", 335 | " return if_same\n" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "# Sample triplets (of normal data)\n", 345 | "def get_batch(batch_size,num_targets):\n", 346 | " \n", 347 | " # initialize 3 empty arrays for the input image batch\n", 348 | " h = X_train_legit.shape[1]\n", 349 | " w = X_train_legit.shape[2]\n", 350 | " triple=[np.zeros((batch_size, h, 
w,3)) for i in range(3)]\n", 351 | "\n", 352 | " for i in range(0,batch_size):\n", 353 | " img_idx_pair1 = pick_first_img_idx(labels_start_end_train_legit,num_targets)\n", 354 | " triple[0][i,:,:,:] = X_train_legit[img_idx_pair1,:]\n", 355 | " img_label = int(y_train_legit[img_idx_pair1])\n", 356 | " \n", 357 | " #get image for the second: positive\n", 358 | " triple[1][i,:,:,:] = pick_pos_img_idx(0.15,img_label)\n", 359 | " \n", 360 | " #get image for the third: negative from legit\n", 361 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 362 | " while check_if_same_category(img_label,label_neg) == 1:\n", 363 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 364 | "\n", 365 | " triple[2][i,:,:,:] = img_neg\n", 366 | " \n", 367 | " return triple\n", 368 | "\n", 369 | "# Generate adversarial anchors for a batch (one signed-gradient step)\n", 370 | "def get_adv_example(triple,epsilon,batch_size):\n", 371 | " \n", 372 | " # Initialize adversarial example \n", 373 | " anchor_adv = np.zeros_like(triple[0])\n", 374 | " # Added noise\n", 375 | " anchor_noise = np.zeros_like(triple[0])\n", 376 | "\n", 377 | " y_true = tf.placeholder(\"float\", [None,1])\n", 378 | " target = np.zeros([batch_size,1])\n", 379 | " target = target.astype(float)\n", 380 | " \n", 381 | " # Get the loss and gradient of the loss wrt the inputs\n", 382 | " loss_val = my_loss(y_true, model.output)\n", 383 | " grads = K.gradients(loss_val, model.input[0])\n", 384 | " \n", 385 | " # Get the sign of the gradient\n", 386 | " delta = K.sign(grads[0])\n", 387 | " \n", 388 | " dict_input = {y_true:target,model.input[0]:triple[0],model.input[1]:triple[1],model.input[2]:triple[2] }\n", 389 | " delta1 = sess.run(delta, feed_dict=dict_input)\n", 390 | " \n", 391 | " # Get noise\n", 392 | " anchor_noise = anchor_noise + delta1\n", 393 | " \n", 394 | " # Perturb the image\n", 395 | " anchor_adv = triple[0] + epsilon*delta1\n", 396 | " \n", 397 | " return anchor_noise,anchor_adv\n", 398 | "\n", 399 | "# Get batch of adv 
examples \n", 400 | "def get_batch_adv(batch_size,num_targets):\n", 401 | " \n", 402 | " # initialize 3 empty arrays for the input image batch\n", 403 | " h = X_train_legit.shape[1]\n", 404 | " w = X_train_legit.shape[2]\n", 405 | " triple=[np.zeros((batch_size, h, w,3)) for i in range(3)]\n", 406 | "\n", 407 | " for i in range(0,batch_size):\n", 408 | " img_idx_pair1 = pick_first_img_idx(labels_start_end_train_legit,num_targets)\n", 409 | " triple[0][i,:,:,:] = X_train_legit[img_idx_pair1,:]\n", 410 | " img_label = int(y_train_legit[img_idx_pair1])\n", 411 | " \n", 412 | " #get image for the second: positive\n", 413 | " triple[1][i,:,:,:] = pick_pos_img_idx(0.15,img_label)\n", 414 | " \n", 415 | " #get image for the third: negative from legit\n", 416 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 417 | " while check_if_same_category(img_label,label_neg) == 1:\n", 418 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 419 | "\n", 420 | " triple[2][i,:,:,:] = img_neg\n", 421 | " \n", 422 | " epsilon = np.random.uniform(low=0.003, high=0.01) \n", 423 | " triple_noise,triple_adv = get_adv_example(triple,epsilon,batch_size)\n", 424 | " triple[0] = triple_adv\n", 425 | " return triple\n", 426 | "\n", 427 | "# Sample two batches (one for adv examples and one for normal images)\n", 428 | "def get_two_batches(batch_size,num_targets):\n", 429 | " half_batch = int(batch_size/2)\n", 430 | " triple1 = get_batch(half_batch,num_targets)\n", 431 | " triple2 = get_batch_adv(half_batch,num_targets)\n", 432 | " \n", 433 | " h = X_train_legit.shape[1]\n", 434 | " w = X_train_legit.shape[2]\n", 435 | " triple = [np.zeros((batch_size, h, w,3)) for i in range(3)]\n", 436 | " \n", 437 | " triple[0][0:half_batch,:] = triple1[0]\n", 438 | " triple[1][0:half_batch,:] = triple1[1]\n", 439 | " triple[2][0:half_batch,:] = triple1[2]\n", 440 | "\n", 441 | " triple[0][half_batch:batch_size,:] = triple2[0]\n", 442 | " triple[1][half_batch:batch_size,:] = 
triple2[1]\n", 443 | " triple[2][half_batch:batch_size,:] = triple2[2]\n", 444 | " \n", 445 | " return triple" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "def save_keras_model(model):\n", 455 | " model.save(output_dir+new_saved_model_name+'.h5')\n", 456 | " print(\"Saved model to disk\")" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "# Train" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": null, 469 | "metadata": {}, 470 | "outputs": [], 471 | "source": [ 472 | "print(\"Starting training process!\")\n", 473 | "print(\"-------------------------------------\")\n", 474 | "\n", 475 | "targets_train = np.zeros([batch_size,1])\n", 476 | "for i in range(1, n_iter):\n", 477 | " inputs=get_two_batches(batch_size,num_targets)\n", 478 | " loss_value=model.train_on_batch(inputs,targets_train)\n", 479 | " \n", 480 | " print(\"\\n ------------- \\n\")\n", 481 | " print('Iteration: '+ str(i) +'. 
'+ \"Loss: {0}\".format(loss_value))\n", 482 | " \n", 483 | " if i % save_interval == 0:\n", 484 | " save_keras_model(model)\n", 485 | " \n", 486 | " if i%lr_interval ==0:\n", 487 | " start_lr = 0.99*start_lr\n", 488 | " K.set_value(model.optimizer.lr, start_lr)\n" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "# Calculate Embeddings" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": null, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "shared_model = model.layers[3]\n", 505 | "\n", 506 | "whitelist_emb = shared_model.predict(X_train_legit,batch_size=64)\n", 507 | "np.save(output_dir+'whitelist_emb_adv',whitelist_emb)\n", 508 | "np.save(output_dir+'whitelist_labels_adv',y_train_legit )\n", 509 | "\n", 510 | "phishing_emb = shared_model.predict(all_imgs_test,batch_size=64)\n", 511 | "np.save(output_dir+'phishing_emb_adv',phishing_emb)\n", 512 | "np.save(output_dir+'phishing_labels_adv',all_labels_test )\n" 513 | ] 514 | } 515 | ], 516 | "metadata": { 517 | "kernelspec": { 518 | "display_name": "Python 3", 519 | "language": "python", 520 | "name": "python3" 521 | }, 522 | "language_info": { 523 | "codemirror_mode": { 524 | "name": "ipython", 525 | "version": 3 526 | }, 527 | "file_extension": ".py", 528 | "mimetype": "text/x-python", 529 | "name": "python", 530 | "nbconvert_exporter": "python", 531 | "pygments_lexer": "ipython3", 532 | "version": "3.6.7" 533 | } 534 | }, 535 | "nbformat": 4, 536 | "nbformat_minor": 2 537 | } 538 | -------------------------------------------------------------------------------- /code/train/Train_phase1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import tensorflow as tf \n", 11 | "\n", 12 | "from keras.models import 
Sequential\n", 13 | "from keras.layers import Dense\n", 14 | "from keras.layers import Flatten,Subtract,Reshape\n", 15 | "from keras.preprocessing import image\n", 16 | "from keras.models import Model\n", 17 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 18 | "from keras.regularizers import l2\n", 19 | "from keras import backend as K\n", 20 | "from keras.applications.vgg16 import VGG16\n", 21 | "from skimage.io import imsave\n", 22 | "\n", 23 | "from matplotlib.pyplot import imread\n", 24 | "from skimage.transform import rescale, resize\n", 25 | "import os\n", 26 | "from keras.models import load_model\n", 27 | "from sklearn.model_selection import train_test_split" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "# Dataset parameters \n", 37 | "dataset_path = '../datasets/VisualPhish/VisualPhish/'\n", 38 | "reshape_size = [224,224,3]\n", 39 | "phishing_test_size = 0.4\n", 40 | "num_targets = 155 \n", 41 | "\n", 42 | "# Model parameters\n", 43 | "input_shape = [224,224,3]\n", 44 | "margin = 2.2\n", 45 | "new_conv_params = [5,5,512]\n", 46 | "\n", 47 | "# Training parameters\n", 48 | "start_lr = 0.00002\n", 49 | "output_dir = './'\n", 50 | "saved_model_name = 'model'\n", 51 | "save_interval = 2000\n", 52 | "batch_size = 32 \n", 53 | "n_iter = 21000\n", 54 | "lr_interval = 100\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "# Load dataset:\n", 62 | " - Load training screenshots per website\n", 63 | " - Load Phishing screenshots per website " 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "\n", 73 | "def read_imgs_per_website(data_path,targets,imgs_num,reshape_size,start_target_count):\n", 74 | " all_imgs = np.zeros(shape=[imgs_num,224,224,3])\n", 75 | " all_labels = 
np.zeros(shape=[imgs_num,1])\n", 76 | " \n", 77 | " all_file_names = []\n", 78 | " targets_list = targets.splitlines()\n", 79 | " count = 0\n", 80 | " for i in range(0,len(targets_list)):\n", 81 | " target_path = data_path + targets_list[i]\n", 82 | " print(target_path)\n", 83 | " file_names = sorted(os.listdir(target_path))\n", 84 | " for j in range(0,len(file_names)):\n", 85 | " try:\n", 86 | " img = imread(target_path+'/'+file_names[j])\n", 87 | " img = img[:,:,0:3]\n", 88 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 89 | " all_labels[count,:] = i + start_target_count\n", 90 | " all_file_names.append(file_names[j])\n", 91 | " count = count + 1\n", 92 | " except:\n", 93 | " #some images were saved with the wrong extension \n", 94 | " try:\n", 95 | " img = imread(target_path+'/'+file_names[j],format='jpeg')\n", 96 | " img = img[:,:,0:3]\n", 97 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 98 | " all_labels[count,:] = i + start_target_count\n", 99 | " all_file_names.append(file_names[j])\n", 100 | " count = count + 1\n", 101 | " except:\n", 102 | " print('failed at:')\n", 103 | " print('***')\n", 104 | " print(file_names[j])\n", 105 | " break \n", 106 | " return all_imgs,all_labels,all_file_names\n", 107 | "\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# Read images legit (train)\n", 117 | "data_path = dataset_path + 'trusted_list/'\n", 118 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 119 | "targets = targets_file.read()\n", 120 | "imgs_num = 9363\n", 121 | "all_imgs_train,all_labels_train,all_file_names_train = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 122 | "\n", 123 | "# Read images phishing\n", 124 | "data_path = dataset_path + 'phishing/'\n", 125 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 126 
| "targets = targets_file.read()\n", 127 | "imgs_num = 1195\n", 128 | "all_imgs_test,all_labels_test,all_file_names_test = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 129 | "\n", 130 | "X_train_legit = all_imgs_train\n", 131 | "y_train_legit = all_labels_train\n", 132 | "\n", 133 | "# Split phishing to training and test, load train and test indices.\n", 134 | "idx_test = np.load(output_dir+'test_idx.npy')\n", 135 | "idx_train = np.load(output_dir+'train_idx.npy')\n", 136 | "X_test_phish = all_imgs_test[idx_test,:]\n", 137 | "y_test_phish = all_labels_test[idx_test,:]\n", 138 | "X_train_phish = all_imgs_test[idx_train,:]\n", 139 | "y_train_phish = all_labels_test[idx_train,:]\n", 140 | "\n", 141 | "#otherwise, make a new split here.\n", 142 | "\n", 143 | "#idx = np.arange(all_imgs_test.shape[0])\n", 144 | "#X_test_phish, X_train_phish, y_test_phish, y_train_phish,idx_test,idx_train = train_test_split(all_imgs_test, all_labels_test,idx, test_size=phishing_test_size)\n", 145 | "#np.save(output_dir+'test_idx',idx_test)\n", 146 | "#np.save(output_dir+'train_idx',idx_train)\n" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "# Create Model" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "def define_triplet_network(input_shape, new_conv_params):\n", 163 | " \n", 164 | " # Input_shape: shape of input images\n", 165 | " # new_conv_params: dimension of the new convolution layer [spatial1,spatial2,channels] \n", 166 | " \n", 167 | " # Define the tensors for the three input images\n", 168 | " anchor_input = Input(input_shape)\n", 169 | " positive_input = Input(input_shape)\n", 170 | " negative_input = Input(input_shape)\n", 171 | " \n", 172 | " # Use VGG as a base model \n", 173 | " base_model = VGG16(weights='imagenet', input_shape=input_shape, include_top=False)\n", 174 | "\n", 175 | " x = 
base_model.output\n", 176 | " x = Conv2D(new_conv_params[2],(new_conv_params[0],new_conv_params[1]),activation='relu', kernel_initializer='he_normal', kernel_regularizer=l2(2e-4)) (x)\n", 177 | " x = GlobalMaxPooling2D() (x)\n", 178 | " model = Model(inputs=base_model.input, outputs=x)\n", 179 | "\n", 180 | " # Generate the encodings (feature vectors) for the three images\n", 181 | " encoded_a = model(anchor_input)\n", 182 | " encoded_p = model(positive_input)\n", 183 | " encoded_n = model(negative_input)\n", 184 | " \n", 185 | " mean_layer = Lambda(lambda x: K.mean(x,axis=1))\n", 186 | " \n", 187 | " square_diff_layer = Lambda(lambda tensors:K.square(tensors[0] - tensors[1]))\n", 188 | " square_diff_pos = square_diff_layer([encoded_a,encoded_p])\n", 189 | " square_diff_neg = square_diff_layer([encoded_a,encoded_n])\n", 190 | " \n", 191 | " square_diff_pos_l2 = mean_layer(square_diff_pos)\n", 192 | " square_diff_neg_l2 = mean_layer(square_diff_neg)\n", 193 | " \n", 194 | " # Add a diff layer\n", 195 | " diff = Subtract()([square_diff_pos_l2, square_diff_neg_l2])\n", 196 | " diff = Reshape((1,)) (diff)\n", 197 | "\n", 198 | " # Connect the inputs with the outputs\n", 199 | " triplet_net = Model(inputs=[anchor_input,positive_input,negative_input],outputs=diff)\n", 200 | " \n", 201 | " # return the model\n", 202 | " return triplet_net" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# Define the triplet loss\n", 212 | "# Create and compile model \n", 213 | "\n", 214 | "def custom_loss(margin):\n", 215 | " def loss(y_true,y_pred):\n", 216 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 217 | " loss_value = K.mean(loss_value,axis=0)\n", 218 | " return loss_value\n", 219 | " return loss\n", 220 | "def loss(y_true,y_pred):\n", 221 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 222 | " loss_value = K.mean(loss_value,axis=0)\n", 223 | " return loss_value\n", 
224 | "\n", 225 | "\n", 226 | "model = define_triplet_network(input_shape, new_conv_params)\n", 227 | "model.summary()\n", 228 | "\n", 229 | "from keras import optimizers\n", 230 | "optimizer = optimizers.Adam(lr = start_lr)\n", 231 | "model.compile(loss=custom_loss(margin),optimizer=optimizer)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "# Triplet Sampling" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "# Order random phishing arrays per website (from 0 to 155 target)\n", 248 | "\n", 249 | "def order_random_array(orig_arr,y_orig_arr,targets):\n", 250 | " sorted_arr = np.zeros(orig_arr.shape)\n", 251 | " y_sorted_arr = np.zeros(y_orig_arr.shape)\n", 252 | " count = 0\n", 253 | " for i in range(0,targets):\n", 254 | " for j in range(0,orig_arr.shape[0]):\n", 255 | " if y_orig_arr[j] == i:\n", 256 | " sorted_arr[count,:,:,:] = orig_arr[j,:,:,:]\n", 257 | " y_sorted_arr[count,:] = i\n", 258 | " count = count + 1\n", 259 | " return sorted_arr,y_sorted_arr \n", 260 | "\n", 261 | "X_test_phish,y_test_phish = order_random_array(X_test_phish,y_test_phish,num_targets)\n", 262 | "X_train_phish,y_train_phish = order_random_array(X_train_phish,y_train_phish,num_targets)\n", 263 | "\n", 264 | "\n", 265 | "# Store the start and end of each target in the phishing set (used later in triplet sampling)\n", 266 | "# Not all targets might be in the phishing set \n", 267 | "def targets_start_end(num_target,labels):\n", 268 | " prev_target = labels[0]\n", 269 | " start_end_each_target = np.zeros((num_target,2))\n", 270 | " start_end_each_target[0,0] = labels[0]\n", 271 | " if not labels[0] == 0:\n", 272 | " start_end_each_target[0,0] = -1\n", 273 | " start_end_each_target[0,1] = -1\n", 274 | " count_target = 0\n", 275 | " for i in range(1,labels.shape[0]):\n", 276 | " if not labels[i] == prev_target:\n", 277 | " 
start_end_each_target[int(labels[i-1]),1] = int(i-1)\n", 278 | " start_end_each_target[int(labels[i]),0] = int(i)\n", 279 | " prev_target = labels[i]\n", 280 | " start_end_each_target[int(labels[-1]),1] = int(labels.shape[0]-1)\n", 281 | " \n", 282 | " for i in range(1,num_target):\n", 283 | " if start_end_each_target[i,0] == 0:\n", 284 | " start_end_each_target[i,0] = -1\n", 285 | " start_end_each_target[i,1] = -1\n", 286 | " return start_end_each_target\n", 287 | "\n", 288 | "labels_start_end_train_phish = targets_start_end(num_targets,y_train_phish)\n", 289 | "labels_start_end_test_phish = targets_start_end(num_targets,y_test_phish)\n", 290 | "\n", 291 | "\n", 292 | "# Store the start and end of each target in the training set (used later in triplet sampling)\n", 293 | "def all_targets_start_end(num_target,labels):\n", 294 | " prev_target = 0\n", 295 | " start_end_each_target = np.zeros((num_target,2))\n", 296 | " start_end_each_target[0,0] = 0\n", 297 | " count_target = 0\n", 298 | " for i in range(1,labels.shape[0]):\n", 299 | " if not labels[i] == prev_target:\n", 300 | " start_end_each_target[count_target,1] = i-1\n", 301 | " count_target = count_target + 1\n", 302 | " start_end_each_target[count_target,0] = i\n", 303 | " prev_target = prev_target + 1\n", 304 | " start_end_each_target[num_target-1,1] = labels.shape[0]-1\n", 305 | " return start_end_each_target\n", 306 | "\n", 307 | "labels_start_end_train_legit = all_targets_start_end(num_targets,y_train_legit)\n" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "#Sample anchor, positive and negative images \n", 317 | "def pick_first_img_idx(labels_start_end,num_targets):\n", 318 | " random_target = -1\n", 319 | " while (random_target == -1):\n", 320 | " random_target = np.random.randint(low = 0,high = num_targets)\n", 321 | " if labels_start_end[random_target,0] == -1:\n", 322 | " random_target = -1\n", 323 
| " return random_target\n", 324 | "\n", 325 | "def pick_pos_img_idx(prob_phish,img_label):\n", 326 | " if np.random.uniform() > prob_phish:\n", 327 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 328 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 329 | " img = X_train_legit[same_idx,:]\n", 330 | " else:\n", 331 | " if not labels_start_end_train_phish[img_label,0] == -1:\n", 332 | " class_idx_start_end = labels_start_end_train_phish[img_label,:]\n", 333 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 334 | " img = X_train_phish[same_idx,:]\n", 335 | " else:\n", 336 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 337 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 338 | " img = X_train_legit[same_idx,:]\n", 339 | " return img\n", 340 | "\n", 341 | "def pick_neg_img(anchor_idx,num_targets):\n", 342 | " if anchor_idx == 0:\n", 343 | " targets = np.arange(1,num_targets)\n", 344 | " elif anchor_idx == num_targets -1:\n", 345 | " targets = np.arange(0,num_targets-1)\n", 346 | " else:\n", 347 | " targets = np.concatenate([np.arange(0,anchor_idx),np.arange(anchor_idx+1,num_targets)])\n", 348 | " diff_target_idx = np.random.randint(low = 0,high = num_targets-1)\n", 349 | " diff_target = targets[diff_target_idx]\n", 350 | " \n", 351 | " class_idx_start_end = labels_start_end_train_legit[diff_target,:]\n", 352 | " idx_from_diff_target = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 353 | " img = X_train_legit[idx_from_diff_target,:]\n", 354 | " \n", 355 | " return img,diff_target" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "# Sample batches" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 
371 | "# Don't sample negative image from the same category as the positive image (e.g. google and google drive)\n", 372 | "# Create clusters of same-company websites (e.g. all microsoft websites)\n", 373 | "\n", 374 | "\n", 375 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 376 | "all_targets = targets_file.read()\n", 377 | "all_targets = all_targets.splitlines()\n", 378 | "\n", 379 | "def get_idx_of_target(target_name,all_targets):\n", 380 | " for i in range(0,len(all_targets)):\n", 381 | " if all_targets[i] == target_name:\n", 382 | " found_idx = i\n", 383 | " return found_idx\n", 384 | " \n", 385 | "#targets names of parent and sub websites\n", 386 | "target_lists = [['microsoft','ms_outlook','ms_office','ms_bing','ms_onedrive','ms_skype'],['apple','itunes','icloud'],['google','google_drive'],['alibaba','aliexpress']]\n", 387 | "\n", 388 | "def get_associated_targets_idx(target_lists,all_targets):\n", 389 | " sub_target_lists_idx = []\n", 390 | " parents_ids = []\n", 391 | " for i in range(0,len(target_lists)):\n", 392 | " target_list = target_lists[i]\n", 393 | " parent_target = target_list[0]\n", 394 | " one_target_list = []\n", 395 | " parent_idx = get_idx_of_target(parent_target,all_targets)\n", 396 | " parents_ids.append(parent_idx)\n", 397 | " for child_target in target_list[1:]:\n", 398 | " child_idx = get_idx_of_target(child_target,all_targets)\n", 399 | " one_target_list.append(child_idx)\n", 400 | " sub_target_lists_idx.append(one_target_list)\n", 401 | " return parents_ids,sub_target_lists_idx \n", 402 | "\n", 403 | "parents_ids,sub_target_lists_idx = get_associated_targets_idx(target_lists,all_targets)\n", 404 | "\n", 405 | "def check_if_same_category(img_label1,img_label2):\n", 406 | " if_same = 0\n", 407 | " if img_label1 in parents_ids:\n", 408 | " if img_label2 in sub_target_lists_idx[parents_ids.index(img_label1)]:\n", 409 | " if_same = 1\n", 410 | " elif img_label1 in sub_target_lists_idx[0]:\n", 411 | " if img_label2 in 
sub_target_lists_idx[0] or img_label2 == parents_ids[0]:\n", 412 | " if_same = 1\n", 413 | " elif img_label1 in sub_target_lists_idx[1]:\n", 414 | " if img_label2 in sub_target_lists_idx[1] or img_label2 == parents_ids[1]:\n", 415 | " if_same = 1\n", 416 | " elif img_label1 in sub_target_lists_idx[2]:\n", 417 | " if img_label2 in sub_target_lists_idx[2] or img_label2 == parents_ids[2]:\n", 418 | " if_same = 1\n", 419 | " return if_same\n" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "#sample triplets\n", 429 | "def get_batch(batch_size,num_targets):\n", 430 | " \n", 431 | " # initialize 3 empty arrays for the input image batch\n", 432 | " h = X_train_legit.shape[1]\n", 433 | " w = X_train_legit.shape[2]\n", 434 | " triple=[np.zeros((batch_size, h, w,3)) for i in range(3)]\n", 435 | "\n", 436 | " for i in range(0,batch_size):\n", 437 | " img_idx_pair1 = pick_first_img_idx(labels_start_end_train_legit,num_targets)\n", 438 | " triple[0][i,:,:,:] = X_train_legit[img_idx_pair1,:]\n", 439 | " img_label = int(y_train_legit[img_idx_pair1])\n", 440 | " \n", 441 | " # get image for the second: positive\n", 442 | " triple[1][i,:,:,:] = pick_pos_img_idx(0.15,img_label)\n", 443 | " \n", 444 | " # get image for the third: negative from legit\n", 445 | " # don't sample from the same cluster\n", 446 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 447 | " while check_if_same_category(img_label,label_neg) == 1:\n", 448 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 449 | "\n", 450 | " triple[2][i,:,:,:] = img_neg\n", 451 | " \n", 452 | " return triple\n" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "# Training" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": null, 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [ 468 | "def save_keras_model(model):\n", 
469 | " model.save(output_dir+saved_model_name+'.h5')\n", 470 | " print(\"Saved model to disk\")" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "print(\"Starting training process!\")\n", 480 | "print(\"-------------------------------------\")\n", 481 | "\n", 482 | "targets_train = np.zeros([batch_size,1])\n", 483 | "for i in range(1, n_iter):\n", 484 | " inputs=get_batch(batch_size,num_targets)\n", 485 | " loss_value=model.train_on_batch(inputs,targets_train)\n", 486 | " \n", 487 | "\n", 488 | " print(\"\\n ------------- \\n\")\n", 489 | " print('Iteration: '+ str(i) +'. '+ \"Loss: {0}\".format(loss_value))\n", 490 | " \n", 491 | " if i % save_interval == 0:\n", 492 | " save_keras_model(model)\n", 493 | " \n", 494 | " if i % lr_interval ==0:\n", 495 | " start_lr = 0.99*start_lr\n", 496 | " K.set_value(model.optimizer.lr, start_lr)\n", 497 | "\n", 498 | "save_keras_model(model)" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "# Calculate Embeddings" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": null, 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "shared_model = model.layers[3]\n", 515 | "\n", 516 | "whitelist_emb = shared_model.predict(X_train_legit,batch_size=64)\n", 517 | "np.save(output_dir+'whitelist_emb',whitelist_emb)\n", 518 | "np.save(output_dir+'whitelist_labels',y_train_legit )\n", 519 | "\n", 520 | "phishing_emb = shared_model.predict(all_imgs_test,batch_size=64)\n", 521 | "np.save(output_dir+'phishing_emb',phishing_emb)\n", 522 | "np.save(output_dir+'phishing_labels',all_labels_test )\n" 523 | ] 524 | } 525 | ], 526 | "metadata": { 527 | "kernelspec": { 528 | "display_name": "Python 3", 529 | "language": "python", 530 | "name": "python3" 531 | }, 532 | "language_info": { 533 | "codemirror_mode": { 534 | "name": "ipython", 535 | "version": 3 536 | 
}, 537 | "file_extension": ".py", 538 | "mimetype": "text/x-python", 539 | "name": "python", 540 | "nbconvert_exporter": "python", 541 | "pygments_lexer": "ipython3", 542 | "version": "3.7.6" 543 | } 544 | }, 545 | "nbformat": 4, 546 | "nbformat_minor": 2 547 | } 548 | -------------------------------------------------------------------------------- /code/evaluate/Whitebox_attacks_closest.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import tensorflow as tf \n", 10 | "from keras.models import Sequential\n", 11 | "from keras.layers import Dense\n", 12 | "from keras.layers import Flatten\n", 13 | "\n", 14 | "from keras.preprocessing import image\n", 15 | "from keras.models import Model\n", 16 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 17 | "from keras.regularizers import l2\n", 18 | "from keras import backend as K\n", 19 | "from keras.applications.vgg16 import VGG16\n", 20 | "\n", 21 | "from matplotlib.pyplot import imread,imshow\n", 22 | "from skimage.transform import rescale, resize\n", 23 | "from skimage.io import imsave\n", 24 | "\n", 25 | "import os\n", 26 | "import numpy as np\n", 27 | "from keras.models import load_model\n", 28 | "from keras.models import load_model" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "whitelist_emb_file = 'whitelist_emb.npy' #precomputed embeddings \n", 38 | "phish_emb_file = 'phishing_emb.npy' #precomputed embeddings \n", 39 | "output_dir = './'\n", 40 | "saved_model_name = 'model' \n", 41 | "\n", 42 | "# Dataset parameters \n", 43 | "dataset_path = '../../datasets/VisualPhish/'\n", 44 | "reshape_size = [224,224,3]\n", 45 | "num_targets = 155 \n", 46 | "\n", 47 | "#adv parameters\n", 48 | 
"epsilon = 0.005\n", 49 | "batch_size = 3\n", 50 | "trials = 5" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "# Load dataset " 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "\n", 67 | "def read_imgs_per_website(data_path,targets,imgs_num,reshape_size,start_target_count):\n", 68 | " all_imgs = np.zeros(shape=[imgs_num,224,224,3])\n", 69 | " all_labels = np.zeros(shape=[imgs_num,1])\n", 70 | " \n", 71 | " all_file_names = []\n", 72 | " targets_list = targets.splitlines()\n", 73 | " count = 0\n", 74 | " for i in range(0,len(targets_list)):\n", 75 | " target_path = data_path + targets_list[i]\n", 76 | " print(target_path)\n", 77 | " file_names = sorted(os.listdir(target_path))\n", 78 | " for j in range(0,len(file_names)):\n", 79 | " try:\n", 80 | " img = imread(target_path+'/'+file_names[j])\n", 81 | " img = img[:,:,0:3]\n", 82 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 83 | " all_labels[count,:] = i + start_target_count\n", 84 | " all_file_names.append(file_names[j])\n", 85 | " count = count + 1\n", 86 | " except:\n", 87 | " #some images were saved with the wrong extension \n", 88 | " try:\n", 89 | " img = imread(target_path+'/'+file_names[j],format='jpeg')\n", 90 | " img = img[:,:,0:3]\n", 91 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 92 | " all_labels[count,:] = i + start_target_count\n", 93 | " all_file_names.append(file_names[j])\n", 94 | " count = count + 1\n", 95 | " except:\n", 96 | " print('failed at:')\n", 97 | " print('***')\n", 98 | " print(file_names[j])\n", 99 | " break \n", 100 | " return all_imgs,all_labels,all_file_names\n", 101 | "\n" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "# Read images 
legit (train)\n", 111 | "data_path = dataset_path + 'trusted_list/'\n", 112 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 113 | "targets = targets_file.read()\n", 114 | "imgs_num = 9363\n", 115 | "all_imgs_train,all_labels_train,all_file_names_train = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 116 | "\n", 117 | "# Read images phishing\n", 118 | "data_path = dataset_path + 'phishing/'\n", 119 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 120 | "targets = targets_file.read()\n", 121 | "imgs_num = 1195\n", 122 | "all_imgs_test,all_labels_test,all_file_names_test = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "X_train_legit = all_imgs_train\n", 132 | "y_train_legit = all_labels_train\n", 133 | "\n", 134 | "\n", 135 | "phish_train_idx = np.load(output_dir+'train_idx.npy')\n", 136 | "phish_test_idx = np.load(output_dir+'test_idx.npy')\n", 137 | "\n", 138 | "\n", 139 | "X_train_phish = all_imgs_test[phish_train_idx,:]\n", 140 | "y_train_phish = all_labels_test[phish_train_idx,:]\n", 141 | "\n", 142 | "X_test_phish = all_imgs_test[phish_test_idx,:]\n", 143 | "y_test_phish = all_labels_test[phish_test_idx,:]\n", 144 | "y_test_phish_non_ordered = all_labels_test[phish_test_idx,:]\n", 145 | "\n", 146 | "\n", 147 | "def order_random_array(orig_arr,y_orig_arr,targets):\n", 148 | " sorted_arr = np.zeros(orig_arr.shape)\n", 149 | " y_sorted_arr = np.zeros(y_orig_arr.shape)\n", 150 | " count = 0\n", 151 | " for i in range(0,targets):\n", 152 | " for j in range(0,orig_arr.shape[0]):\n", 153 | " if y_orig_arr[j] == i:\n", 154 | " sorted_arr[count,:,] = orig_arr[j,:]\n", 155 | " y_sorted_arr[count,:] = i\n", 156 | " count = count + 1\n", 157 | " return sorted_arr,y_sorted_arr \n", 158 | "X_train_phish_features,y_train_phish_ordered = 
order_random_array(X_train_phish_features,y_train_phish,num_targets)\n", 159 | "\n", 160 | "X_test_phish,y_test_phish = order_random_array(X_test_phish,y_test_phish,num_targets)\n", 161 | "X_train_phish,y_train_phish = order_random_array(X_train_phish,y_train_phish,num_targets)\n", 162 | "\n", 163 | "\n", 164 | "#get start and end of each label\n", 165 | "def start_end_each_target_not_complete(num_target,labels):\n", 166 | " prev_target = labels[0]\n", 167 | " start_end_each_target = np.zeros((num_target,2))\n", 168 | " start_end_each_target[0,0] = labels[0]\n", 169 | " if not labels[0] == 0:\n", 170 | " start_end_each_target[0,0] = -1\n", 171 | " start_end_each_target[0,1] = -1\n", 172 | " count_target = 0\n", 173 | " for i in range(1,labels.shape[0]):\n", 174 | " if not labels[i] == prev_target:\n", 175 | " start_end_each_target[int(labels[i-1]),1] = int(i-1)\n", 176 | " #count_target = count_target + 1\n", 177 | " start_end_each_target[int(labels[i]),0] = int(i)\n", 178 | " prev_target = labels[i]\n", 179 | " start_end_each_target[int(labels[-1]),1] = int(labels.shape[0]-1)\n", 180 | " \n", 181 | " for i in range(1,num_target):\n", 182 | " if start_end_each_target[i,0] == 0:\n", 183 | " start_end_each_target[i,0] = -1\n", 184 | " start_end_each_target[i,1] = -1\n", 185 | " return start_end_each_target\n", 186 | "\n", 187 | "labels_start_end_train_phish = start_end_each_target_not_complete(num_targets,y_train_phish)\n", 188 | "labels_start_end_test_phish = start_end_each_target_not_complete(num_targets,y_test_phish)\n", 189 | "\n", 190 | "def start_end_each_target(num_target,labels):\n", 191 | " prev_target = 0\n", 192 | " start_end_each_target = np.zeros((num_target,2))\n", 193 | " start_end_each_target[0,0] = 0\n", 194 | " count_target = 0\n", 195 | " for i in range(1,labels.shape[0]):\n", 196 | " if not labels[i] == prev_target:\n", 197 | " start_end_each_target[count_target,1] = i-1\n", 198 | " count_target = count_target + 1\n", 199 | " 
start_end_each_target[count_target,0] = i\n", 200 | " prev_target = prev_target + 1\n", 201 | " start_end_each_target[num_target-1,1] = labels.shape[0]-1\n", 202 | " return start_end_each_target\n", 203 | "\n", 204 | "labels_start_end_train_legit = start_end_each_target(num_targets,y_train_legit)\n" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "# Load Model " 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "from keras.models import load_model\n", 221 | "margin = 2.2\n", 222 | "def loss(y_true,y_pred):\n", 223 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 224 | " loss_value = K.mean(loss_value,axis=0)\n", 225 | " return loss_value\n", 226 | "\n", 227 | "full_model = load_model(output_dir+saved_model_name+'.h5', custom_objects={'loss': loss})\n", 228 | "inside_model = full_model.layers[3]\n", 229 | "\n", 230 | "#define custom_loss\n", 231 | "def custom_loss(margin):\n", 232 | " def loss(y_true,y_pred):\n", 233 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 234 | " loss_value = K.mean(loss_value,axis=0)\n", 235 | " return loss_value\n", 236 | " return loss\n", 237 | "my_loss = custom_loss(60) #set margin to a large value in order to always have a non-zero loss in adv generation.\n", 238 | "\n", 239 | "#get tf session\n", 240 | "sess = K.get_session()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "# Load pretrained embeddings " 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "\n", 257 | "X_train_legit_features = np.load(output_dir+whitelist_emb_file)\n", 258 | "phish_features = np.load(output_dir+phish_emb_file)\n", 259 | "\n", 260 | "X_test_phish_features = phish_features[phish_test_idx,:]\n", 261 | "X_train_phish_features = 
phish_features[phish_train_idx,:]\n" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "# Triplet sampling " 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "def pick_pos_img_idx(prob_phish,img_label):\n", 278 | " if np.random.uniform() > prob_phish:\n", 279 | " #take image from legit\n", 280 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 281 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 282 | " img = X_train_legit[same_idx,:]\n", 283 | " else:\n", 284 | " #take from phish\n", 285 | " if not labels_start_end_train_phish[img_label,0] == -1:\n", 286 | " class_idx_start_end = labels_start_end_train_phish[img_label,:]\n", 287 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 288 | " img = X_train_phish[same_idx,:]\n", 289 | " else:\n", 290 | " class_idx_start_end = labels_start_end_train_legit[img_label,:]\n", 291 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 292 | " img = X_train_legit[same_idx,:]\n", 293 | " return img\n", 294 | "\n", 295 | "\n", 296 | "def pick_neg_img(anchor_idx,num_targets):\n", 297 | " if anchor_idx == 0:\n", 298 | " targets = np.arange(1,num_targets)\n", 299 | " elif anchor_idx == num_targets -1:\n", 300 | " targets = np.arange(0,num_targets-1)\n", 301 | " else:\n", 302 | " targets = np.concatenate([np.arange(0,anchor_idx),np.arange(anchor_idx+1,num_targets)])\n", 303 | " diff_target_idx = np.random.randint(low = 0,high = num_targets-1)\n", 304 | " diff_target = targets[diff_target_idx]\n", 305 | " \n", 306 | " class_idx_start_end = labels_start_end_train_legit[diff_target,:]\n", 307 | " idx_from_diff_target = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 308 | " img = 
X_train_legit[idx_from_diff_target,:]\n", 309 | " \n", 310 | " return img,diff_target\n", 311 | "\n", 312 | "\n", 313 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 314 | "all_targets = targets_file.read()\n", 315 | "all_targets = all_targets.splitlines()\n", 316 | "\n", 317 | "def get_idx_of_target(target_name,all_targets):\n", 318 | " for i in range(0,len(all_targets)):\n", 319 | " if all_targets[i] == target_name:\n", 320 | " found_idx = i\n", 321 | " return found_idx\n", 322 | "\n", 323 | "target_lists = [['microsoft','ms_outlook','ms_office','ms_bing','ms_onedrive','ms_skype'],['apple','itunes','icloud'],['google','google_drive'],['alibaba','aliexpress']]\n", 324 | "\n", 325 | "def get_associated_targets_idx(target_lists,all_targets):\n", 326 | " sub_target_lists_idx = []\n", 327 | " parents_ids = []\n", 328 | " for i in range(0,len(target_lists)):\n", 329 | " target_list = target_lists[i]\n", 330 | " parent_target = target_list[0]\n", 331 | " one_target_list = []\n", 332 | " parent_idx = get_idx_of_target(parent_target,all_targets)\n", 333 | " parents_ids.append(parent_idx)\n", 334 | " for child_target in target_list[1:]:\n", 335 | " child_idx = get_idx_of_target(child_target,all_targets)\n", 336 | " one_target_list.append(child_idx)\n", 337 | " sub_target_lists_idx.append(one_target_list)\n", 338 | " return parents_ids,sub_target_lists_idx \n", 339 | "\n", 340 | "parents_ids,sub_target_lists_idx = get_associated_targets_idx(target_lists,all_targets)\n", 341 | "\n", 342 | "def check_if_same_category(img_label1,img_label2):\n", 343 | " if_same = 0\n", 344 | " if img_label1 in parents_ids:\n", 345 | " if img_label2 in sub_target_lists_idx[parents_ids.index(img_label1)]:\n", 346 | " if_same = 1\n", 347 | " elif img_label1 in sub_target_lists_idx[0]:\n", 348 | " if img_label2 in sub_target_lists_idx[0] or img_label2 == parents_ids[0]:\n", 349 | " if_same = 1\n", 350 | " elif img_label1 in sub_target_lists_idx[1]:\n", 351 | " if img_label2 in 
sub_target_lists_idx[1] or img_label2 == parents_ids[1]:\n", 352 | " if_same = 1\n", 353 | " elif img_label1 in sub_target_lists_idx[2]:\n", 354 | " if img_label2 in sub_target_lists_idx[2] or img_label2 == parents_ids[2]:\n", 355 | " if_same = 1\n", 356 | " return if_same" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "# Find closest example to each query" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "def compute_distance_pair(layer1,layer2):\n", 373 | " diff = layer1 - layer2\n", 374 | " l2_diff = np.mean(diff**2)\n", 375 | " return l2_diff\n", 376 | "\n", 377 | "def argmax(lst):\n", 378 | " return lst.index(max(lst))\n", 379 | "\n", 380 | "def argmin(lst):\n", 381 | " return lst.index(min(lst))\n", 382 | "\n", 383 | "\n", 384 | "# Find closest example from the same website:\n", 385 | "# Assume ordered training arrays (legit,train).\n", 386 | "\n", 387 | "def find_closest_example(X_test_matrix,y_test_matrix):\n", 388 | " train_size = phish_train_idx.shape[0] + X_train_legit.shape[0]\n", 389 | " X_all_train = np.concatenate((X_train_phish_features,X_train_legit_features))\n", 390 | " pairwise_distance = np.zeros([X_test_matrix.shape[0],train_size])\n", 391 | " pairwise_distance_idx = np.zeros([X_test_matrix.shape[0],2])\n", 392 | " \n", 393 | " for i in range(0,X_test_matrix.shape[0]):\n", 394 | " pair1 = X_test_matrix[i,:]\n", 395 | " start_in_train = int(labels_start_end_train_legit[int(y_test_matrix[i]),0])\n", 396 | " end_in_train = int(labels_start_end_train_legit[int(y_test_matrix[i]),1])\n", 397 | " dist_list = []\n", 398 | " for j in range(start_in_train,end_in_train+1):\n", 399 | " pair2 = X_train_legit_features[j,:]\n", 400 | " l2_diff = compute_distance_pair(pair1,pair2)\n", 401 | " dist_list.append(l2_diff)\n", 402 | " \n", 403 | " min1 = min(dist_list)\n", 404 | " min1_idx = start_in_train + 
argmin(dist_list)\n", 405 | " \n", 406 | " dist_list2 = []\n", 407 | " start_in_phish_train = int(labels_start_end_train_phish[int(y_test_matrix[i]),0])\n", 408 | " end_in_phish_train = int(labels_start_end_train_phish[int(y_test_matrix[i]),1])\n", 409 | " \n", 410 | " min2 = -1\n", 411 | " if not labels_start_end_train_phish[int(y_test_matrix[i]),0] == -1:\n", 412 | " for j in range(start_in_phish_train,end_in_phish_train+1):\n", 413 | " pair2 = X_train_phish_features[j,:]\n", 414 | " l2_diff = compute_distance_pair(pair1,pair2)\n", 415 | " dist_list2.append(l2_diff)\n", 416 | " \n", 417 | " min2 = min(dist_list2)\n", 418 | " min2_idx = argmin(dist_list2) + start_in_phish_train\n", 419 | " \n", 420 | " if min1 < min2 or min2 == -1:\n", 421 | " pairwise_distance_idx[i,0] = min1_idx\n", 422 | " else:\n", 423 | " pairwise_distance_idx[i,0] = min2_idx\n", 424 | " #min is from phishing train\n", 425 | " pairwise_distance_idx[i,1] = 1\n", 426 | " \n", 427 | " return pairwise_distance_idx\n", 428 | "\n", 429 | "pairwise_distance_idx = find_closest_example(X_test_phish_features,y_test_phish_non_ordered)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "# Adv examples generation " 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": {}, 443 | "outputs": [], 444 | "source": [ 445 | "def get_adv_example(triple,epsilon):\n", 446 | " \n", 447 | " # Initialize adversarial example \n", 448 | " anchor_adv = np.zeros_like(triple[0])\n", 449 | " # Added noise\n", 450 | " anchor_noise = np.zeros_like(triple[0])\n", 451 | "\n", 452 | " y_true = tf.placeholder(\"float\", [None,1])\n", 453 | " target = np.zeros([len(triple),1])\n", 454 | " target.astype(float)\n", 455 | " \n", 456 | " # Get the loss and gradient of the loss wrt the inputs\n", 457 | " loss_val = my_loss(y_true, full_model.output)\n", 458 | " grads = K.gradients(loss_val, full_model.input[0])\n", 459 | " \n", 460 | " # 
Get the sign of the gradient\n", 461 | " delta = K.sign(grads[0])\n", 462 | " \n", 463 | " dict_input = {y_true:target,full_model.input[0]:triple[0],full_model.input[1]:triple[1],full_model.input[2]:triple[2] }\n", 464 | " delta1 = sess.run(delta, feed_dict=dict_input)\n", 465 | " \n", 466 | " # Get noise\n", 467 | " anchor_noise = anchor_noise + delta1\n", 468 | " \n", 469 | " # Perturb the image\n", 470 | " anchor_adv = triple[0] + epsilon*delta1\n", 471 | " \n", 472 | " return anchor_noise,anchor_adv" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "# Generate and save adv. examples for the phishing test set \n", 482 | "# For each query image, pick the closest example as the positive image\n", 483 | "# Compute 5 trials (by randomly changing triplets)\n", 484 | "# Save the embeddings for each trial \n", 485 | "\n", 486 | "X_test_phish_non_ordered = all_imgs_test[phish_test_idx,:]\n", 487 | "y_test_phish_non_ordered = all_labels_test[phish_test_idx,:]\n", 488 | "\n", 489 | "X_test_phish_adv = np.zeros_like(X_test_phish_non_ordered)\n", 490 | "\n", 491 | "# initialize 3 empty arrays for the input image batch\n", 492 | "h = X_train_legit.shape[1]\n", 493 | "w = X_train_legit.shape[2]\n", 494 | "triple=[np.zeros((batch_size, h, w,3)) for i in range(3)]\n", 495 | "\n", 496 | "for l in range(0,trials):\n", 497 | " number_batches = int(X_test_phish_non_ordered.shape[0]/batch_size)\n", 498 | " count = 0\n", 499 | " for i in range(0,number_batches):\n", 500 | " for j in range(0,batch_size):\n", 501 | " first_img = X_test_phish_non_ordered[i*batch_size+j,:]\n", 502 | " triple[0][j,:,:,:] = first_img\n", 503 | " first_img_label = int(y_test_phish_non_ordered[i*batch_size+j,:])\n", 504 | "\n", 505 | "\n", 506 | " # get pos image by finding the closest image.\n", 507 | " if pairwise_distance_idx[i*batch_size+j,1] == 1:\n", 508 | " #from phish_train \n", 509 | " pos_img = 
X_train_phish[int(pairwise_distance_idx[i*batch_size+j,0])]\n", 510 | " else:\n", 511 | " pos_img = X_train_legit[int(pairwise_distance_idx[i*batch_size+j,0])]\n", 512 | " \n", 513 | " triple[1][j,:,:,:] = pos_img\n", 514 | "\n", 515 | " #get the third image: a negative from legit\n", 516 | " neg_img,label_neg = pick_neg_img(first_img_label,155)\n", 517 | " while check_if_same_category(first_img_label,label_neg) == 1:\n", 518 | " neg_img,label_neg = pick_neg_img(first_img_label,155)\n", 519 | " triple[2][j,:,:,:] = neg_img\n", 520 | "\n", 521 | " anchor_noise,anchor_adv = get_adv_example(triple,epsilon)\n", 522 | " for k in range(0,len(anchor_adv)):\n", 523 | " X_test_phish_adv[count,:] = anchor_adv[k,:]\n", 524 | " count = count + 1\n", 525 | " X_test_phish_adv_features = inside_model.predict(X_test_phish_adv)\n", 526 | " np.save(output_dir + 'X_test_phish_adv_closest_'+str(l),X_test_phish_adv_features)" 527 | ] 528 | } 529 | ], 530 | "metadata": { 531 | "kernelspec": { 532 | "display_name": "Python 3", 533 | "language": "python", 534 | "name": "python3" 535 | }, 536 | "language_info": { 537 | "codemirror_mode": { 538 | "name": "ipython", 539 | "version": 3 540 | }, 541 | "file_extension": ".py", 542 | "mimetype": "text/x-python", 543 | "name": "python", 544 | "nbconvert_exporter": "python", 545 | "pygments_lexer": "ipython3", 546 | "version": "3.6.7" 547 | } 548 | }, 549 | "nbformat": 4, 550 | "nbformat_minor": 2 551 | } 552 | -------------------------------------------------------------------------------- /code/train/Train_phase2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import tensorflow as tf \n", 10 | "\n", 11 | "from keras.models import Sequential\n", 12 | "from keras.layers import Dense\n", 13 | "from keras.layers import Flatten\n", 14 | "\n", 15 | "\n", 16 | "from 
keras.preprocessing import image\n", 17 | "from keras.models import Model\n", 18 | "from keras.layers import Dense, GlobalAveragePooling2D,Conv2D,MaxPooling2D,Input,Lambda,GlobalMaxPooling2D\n", 19 | "from keras.regularizers import l2\n", 20 | "from keras import backend as K\n", 21 | "from keras.applications.vgg16 import VGG16\n", 22 | "\n", 23 | "from matplotlib.pyplot import imread\n", 24 | "from skimage.transform import rescale, resize\n", 25 | "from skimage.io import imsave\n", 26 | "\n", 27 | "import os\n", 28 | "import numpy as np\n", 29 | "from keras.models import load_model" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# Dataset parameters \n", 39 | "dataset_path = '../datasets/VisualPhish/VisualPhish/'\n", 40 | "reshape_size = [224,224,3]\n", 41 | "phishing_test_size = 0.4\n", 42 | "num_targets = 155 \n", 43 | "\n", 44 | "# Model parameters\n", 45 | "input_shape = [224,224,3]\n", 46 | "margin = 2.2\n", 47 | "new_conv_params = [5,5,512]\n", 48 | "\n", 49 | "# Training parameters\n", 50 | "start_lr = 0.00002\n", 51 | "output_dir = './'\n", 52 | "saved_model_name = 'model' #from first training \n", 53 | "new_saved_model_name = 'model2'\n", 54 | "save_interval = 2000\n", 55 | "batch_size = 32 \n", 56 | "n_iter = 50000\n", 57 | "lr_interval = 250\n", 58 | "# hard examples training \n", 59 | "num_sets = 100\n", 60 | "iter_per_set = 8\n", 61 | "n_iter = 30\n", 62 | "batch_size = 32" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "# Load dataset:\n", 70 | "- Load training screenshots per website\n", 71 | "- Load Phishing screenshots per website" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "\n", 81 | "def read_imgs_per_website(data_path,targets,imgs_num,reshape_size,start_target_count):\n", 82 | " all_imgs = 
np.zeros(shape=[imgs_num,224,224,3])\n", 83 | " all_labels = np.zeros(shape=[imgs_num,1])\n", 84 | " \n", 85 | " all_file_names = []\n", 86 | " targets_list = targets.splitlines()\n", 87 | " count = 0\n", 88 | " for i in range(0,len(targets_list)):\n", 89 | " target_path = data_path + targets_list[i]\n", 90 | " print(target_path)\n", 91 | " file_names = sorted(os.listdir(target_path))\n", 92 | " for j in range(0,len(file_names)):\n", 93 | " try:\n", 94 | " img = imread(target_path+'/'+file_names[j])\n", 95 | " img = img[:,:,0:3]\n", 96 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 97 | " all_labels[count,:] = i + start_target_count\n", 98 | " all_file_names.append(file_names[j])\n", 99 | " count = count + 1\n", 100 | " except:\n", 101 | " #some images were saved with the wrong extension, so retry as jpeg\n", 102 | " try:\n", 103 | " img = imread(target_path+'/'+file_names[j],format='jpeg')\n", 104 | " img = img[:,:,0:3]\n", 105 | " all_imgs[count,:,:,:] = resize(img, (reshape_size[0], reshape_size[1]),anti_aliasing=True)\n", 106 | " all_labels[count,:] = i + start_target_count\n", 107 | " all_file_names.append(file_names[j])\n", 108 | " count = count + 1\n", 109 | " except:\n", 110 | " print('failed at:')\n", 111 | " print('***')\n", 112 | " print(file_names[j])\n", 113 | " break \n", 114 | " return all_imgs,all_labels,all_file_names\n", 115 | "\n" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "# Read images legit (train)\n", 125 | "data_path = dataset_path + 'trusted_list/'\n", 126 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 127 | "targets = targets_file.read()\n", 128 | "imgs_num = 9363\n", 129 | "all_imgs_train,all_labels_train,all_file_names_train = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 130 | "\n", 131 | "# Read images phishing\n", 132 | "data_path = dataset_path + 
'phishing/'\n", 133 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 134 | "targets = targets_file.read()\n", 135 | "imgs_num = 1195\n", 136 | "all_imgs_test,all_labels_test,all_file_names_test = read_imgs_per_website(data_path,targets,imgs_num,reshape_size,0)\n", 137 | "\n", 138 | "X_train_legit = all_imgs_train\n", 139 | "y_train_legit = all_labels_train\n", 140 | "\n", 141 | "# Load the same train/test split used in phase 1\n", 142 | "phish_test_idx = np.load(output_dir+'test_idx.npy')\n", 143 | "phish_train_idx = np.load(output_dir+'train_idx.npy')\n", 144 | "\n", 145 | "X_test_phish = all_imgs_test[phish_test_idx,:]\n", 146 | "y_test_phish = all_labels_test[phish_test_idx,:]\n", 147 | "\n", 148 | "X_train_phish = all_imgs_test[phish_train_idx,:]\n", 149 | "y_train_phish = all_labels_test[phish_train_idx,:]\n" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "# Order and label targets" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "def order_random_array(orig_arr,y_orig_arr,targets):\n", 166 | " sorted_arr = np.zeros(orig_arr.shape)\n", 167 | " y_sorted_arr = np.zeros(y_orig_arr.shape)\n", 168 | " count = 0\n", 169 | " for i in range(0,targets):\n", 170 | " for j in range(0,orig_arr.shape[0]):\n", 171 | " if y_orig_arr[j] == i:\n", 172 | " sorted_arr[count,:,:,:] = orig_arr[j,:,:,:]\n", 173 | " y_sorted_arr[count,:] = i\n", 174 | " count = count + 1\n", 175 | " return sorted_arr,y_sorted_arr \n", 176 | "\n", 177 | "# Store the start and end of each target in the phishing set (used later in triplet sampling)\n", 178 | "# Not all targets might be in the phishing set \n", 179 | "def start_end_each_target(num_target,labels):\n", 180 | " prev_target = 0\n", 181 | " start_end_each_target = np.zeros((num_target,2))\n", 182 | " start_end_each_target[0,0] = 0\n", 183 | " count_target = 0\n", 184 | " for i in 
range(1,labels.shape[0]):\n", 185 | " if not labels[i] == prev_target:\n", 186 | " start_end_each_target[count_target,1] = i-1\n", 187 | " count_target = count_target + 1\n", 188 | " start_end_each_target[count_target,0] = i\n", 189 | " prev_target = prev_target + 1\n", 190 | " start_end_each_target[num_target-1,1] = labels.shape[0]-1\n", 191 | " return start_end_each_target\n", 192 | "\n", 193 | "\n", 194 | "# Store the start and end of each target in the training set (used later in triplet sampling)\n", 195 | "def all_targets_start_end(num_target,labels):\n", 196 | " prev_target = labels[0]\n", 197 | " start_end_each_target = np.zeros((num_target,2))\n", 198 | " start_end_each_target[0,0] = labels[0]\n", 199 | " if not labels[0] == 0:\n", 200 | " start_end_each_target[0,0] = -1\n", 201 | " start_end_each_target[0,1] = -1\n", 202 | " count_target = 0\n", 203 | " for i in range(1,labels.shape[0]):\n", 204 | " if not labels[i] == prev_target:\n", 205 | " start_end_each_target[int(labels[i-1]),1] = int(i-1)\n", 206 | " #count_target = count_target + 1\n", 207 | " start_end_each_target[int(labels[i]),0] = int(i)\n", 208 | " prev_target = labels[i]\n", 209 | " start_end_each_target[int(labels[-1]),1] = int(labels.shape[0]-1)\n", 210 | " \n", 211 | " for i in range(1,num_target):\n", 212 | " if start_end_each_target[i,0] == 0:\n", 213 | " print(i)\n", 214 | " start_end_each_target[i,0] = -1\n", 215 | " start_end_each_target[i,1] = -1\n", 216 | " return start_end_each_target\n", 217 | "\n", 218 | "labels_start_end_train_legit = all_targets_start_end(num_targets,y_train_legit)\n" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "# Finding Hard subsets" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "# Find a query set for each target\n", 235 | "def find_fixed_set_idx(num_target):\n", 236 | " website_random_idx = 
np.zeros([num_target,])\n", 237 | " for i in range(0,num_target):\n", 238 | " class_idx_start_end = labels_start_end_train_legit[i,:]\n", 239 | " website_random_idx[i] = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 240 | " return website_random_idx\n", 241 | "\n", 242 | "# Compute the mean squared difference between two embeddings (squared L2 distance divided by the embedding size)\n", 243 | "def compute_distance_pair(layer1,layer2):\n", 244 | " diff = layer1 - layer2\n", 245 | " l2_diff = np.mean(diff**2)\n", 246 | " return l2_diff\n", 247 | " \n", 248 | "# Compute the embeddings of the query set, the phishing training set, and the training whitelist \n", 249 | "def predict_all_imgs(model):\n", 250 | " X_train_legit_last_layer = model.predict(X_train_legit,batch_size=10)\n", 251 | " X_train_phish_last_layer = model.predict(X_train_phish,batch_size=10)\n", 252 | " fixed_set_last_layer = model.predict(fixed_set,batch_size=10)\n", 253 | " \n", 254 | " return X_train_legit_last_layer,X_train_phish_last_layer,fixed_set_last_layer\n", 255 | "\n", 256 | "# Compute distance between the query set and all training examples \n", 257 | "def compute_all_distances(fixed_set,train_legit,train_phish):\n", 258 | " train_size = train_legit.shape[0] + train_phish.shape[0]\n", 259 | " X_all_train = np.concatenate((train_legit,train_phish))\n", 260 | " pairwise_distance = np.zeros([fixed_set.shape[0],train_size])\n", 261 | " for i in range(0,fixed_set.shape[0]):\n", 262 | " pair1 = fixed_set[i,:]\n", 263 | " for j in range(0,train_size):\n", 264 | " pair2 = X_all_train[j,:]\n", 265 | " l2_diff = compute_distance_pair(pair1,pair2)\n", 266 | " pairwise_distance[i,j] = l2_diff\n", 267 | " return pairwise_distance\n", 268 | "\n", 269 | "# Get index of false positives (different-website examples with small distance) of one query image\n", 270 | "def find_n_false_positives(distances,n,test_label):\n", 271 | " count = 0\n", 272 | " X_false_pos_idx = np.zeros([n,])\n", 273 | " idx_min = np.argsort(distances)\n", 274 | " for i in 
range(0,distances.shape[0]):\n", 275 | " next_min_idx = idx_min[i]\n", 276 | " n_label = y_train[next_min_idx]\n", 277 | " #false positives (have close distance even if they are from a different category)\n", 278 | " if not (test_label == n_label):\n", 279 | " X_false_pos_idx[count] = next_min_idx\n", 280 | " count = count + 1\n", 281 | " if count == n:\n", 282 | " break \n", 283 | " while count < n:\n", 284 | " X_false_pos_idx[count] = -1\n", 285 | " count = count + 1\n", 286 | " return X_false_pos_idx\n", 287 | "\n", 288 | "# Get index of false negatives (same-website examples with large distance) of one query image\n", 289 | "def find_n_false_negatives(distances,n,test_label):\n", 290 | " count = 0 \n", 291 | " X_false_neg_idx = np.zeros([n,])\n", 292 | " idx_max = np.argsort(distances)[::-1]\n", 293 | " for i in range(0,distances.shape[0]):\n", 294 | " next_max_idx = idx_max[i]\n", 295 | " n_label = y_train[next_max_idx]\n", 296 | " #false negatives (have large distance although they are in the same category )\n", 297 | " if test_label == n_label:\n", 298 | " X_false_neg_idx[count] = next_max_idx\n", 299 | " count = count + 1\n", 300 | " if count == n:\n", 301 | " break\n", 302 | " while count < n:\n", 303 | " X_false_neg_idx[count] = -1\n", 304 | " count = count + 1\n", 305 | " return X_false_neg_idx\n", 306 | "\n", 307 | "# Get the idx of false positives and false negatives for all query examples\n", 308 | "def find_index_for_all_set(distances,n):\n", 309 | " all_idx = np.zeros([distances.shape[0],2,n])\n", 310 | " for i in range(0,distances.shape[0]):\n", 311 | " distance_i = distances[i,:]\n", 312 | " all_idx[i,0,:] = find_n_false_positives(distance_i,n,i)\n", 313 | " all_idx[i,1,:] = find_n_false_negatives(distance_i,n,i)\n", 314 | " return all_idx\n", 315 | "\n", 316 | "# Form the new training set based on the hard examples indices of all query images\n", 317 | "def find_next_training_set(all_idx,n):\n", 318 | " global X_train_new, y_train_new\n", 319 | " all_idx = 
all_idx.astype(int)\n", 320 | " count = 0\n", 321 | " for i in range(all_idx.shape[0]):\n", 322 | " for j in range(0,n):\n", 323 | " if not all_idx[i,0,j] == -1:\n", 324 | " X_train_new[count,:,:,:] = X_train[all_idx[i,0,j],:,:,:]\n", 325 | " y_train_new[count,:] = y_train[all_idx[i,0,j]]\n", 326 | " count = count +1 \n", 327 | " for j in range(0,n):\n", 328 | " if not all_idx[i,1,j] == -1:\n", 329 | " X_train_new[count,:,:,:] = X_train[all_idx[i,1,j],:,:,:]\n", 330 | " y_train_new[count,:] = y_train[all_idx[i,1,j]]\n", 331 | " count = count +1 \n", 332 | " X_train_new = X_train_new[0:count,:]\n", 333 | " y_train_new = y_train_new[0:count,:]\n", 334 | " return X_train_new,y_train_new\n", 335 | "\n", 336 | "# Main function for subset sampling\n", 337 | "# Steps:\n", 338 | " # Predict all images\n", 339 | " # Find pairwise distances between query and training set\n", 340 | " # Find indices of hard positive and negative examples\n", 341 | " # Find new training set\n", 342 | " # Order training set by targets\n", 343 | "def find_main_train(model,fixed_set,targets):\n", 344 | " X_train_legit_last_layer,X_train_phish_last_layer,fixed_set_last_layer = predict_all_imgs(model)\n", 345 | " pairwise_distance = compute_all_distances(fixed_set_last_layer,X_train_legit_last_layer,X_train_phish_last_layer)\n", 346 | " n = 1\n", 347 | " all_idx = find_index_for_all_set(pairwise_distance,n)\n", 348 | " X_train_new,y_train_new = find_next_training_set(all_idx,n)\n", 349 | " X_train_new,y_train_new = order_random_array(X_train_new,y_train_new,targets)\n", 350 | " labels_start_end_train = start_end_each_target(targets,y_train_new)\n", 351 | " return X_train_new,y_train_new,labels_start_end_train" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "# Don't sample negative image from the same category as the positive image (e.g. 
google and google drive)\n", 361 | "# Create clusters of same-company websites (e.g. all microsoft websites)\n", 362 | "\n", 363 | "\n", 364 | "targets_file = open(data_path+'targets.txt', \"r\")\n", 365 | "all_targets = targets_file.read()\n", 366 | "all_targets = all_targets.splitlines()\n", 367 | "\n", 368 | "def get_idx_of_target(target_name,all_targets):\n", 369 | " for i in range(0,len(all_targets)):\n", 370 | " if all_targets[i] == target_name:\n", 371 | " found_idx = i\n", 372 | " return found_idx\n", 373 | " \n", 374 | "#targets names of parent and sub websites\n", 375 | "target_lists = [['microsoft','ms_outlook','ms_office','ms_bing','ms_onedrive','ms_skype'],['apple','itunes','icloud'],['google','google_drive'],['alibaba','aliexpress']]\n", 376 | "\n", 377 | "def get_associated_targets_idx(target_lists,all_targets):\n", 378 | " sub_target_lists_idx = []\n", 379 | " parents_ids = []\n", 380 | " for i in range(0,len(target_lists)):\n", 381 | " target_list = target_lists[i]\n", 382 | " parent_target = target_list[0]\n", 383 | " one_target_list = []\n", 384 | " parent_idx = get_idx_of_target(parent_target,all_targets)\n", 385 | " parents_ids.append(parent_idx)\n", 386 | " for child_target in target_list[1:]:\n", 387 | " child_idx = get_idx_of_target(child_target,all_targets)\n", 388 | " one_target_list.append(child_idx)\n", 389 | " sub_target_lists_idx.append(one_target_list)\n", 390 | " return parents_ids,sub_target_lists_idx \n", 391 | "\n", 392 | "parents_ids,sub_target_lists_idx = get_associated_targets_idx(target_lists,all_targets)\n", 393 | "\n", 394 | "def check_if_same_category(img_label1,img_label2):\n", 395 | " if_same = 0\n", 396 | " if img_label1 in parents_ids:\n", 397 | " if img_label2 in sub_target_lists_idx[parents_ids.index(img_label1)]:\n", 398 | " if_same = 1\n", 399 | " elif img_label1 in sub_target_lists_idx[0]:\n", 400 | " if img_label2 in sub_target_lists_idx[0] or img_label2 == parents_ids[0]:\n", 401 | " if_same = 1\n", 402 | " elif 
img_label1 in sub_target_lists_idx[1]:\n", 403 | " if img_label2 in sub_target_lists_idx[1] or img_label2 == parents_ids[1]:\n", 404 | " if_same = 1\n", 405 | " elif img_label1 in sub_target_lists_idx[2]:\n", 406 | " if img_label2 in sub_target_lists_idx[2] or img_label2 == parents_ids[2]:\n", 407 | " if_same = 1\n", 408 | " return if_same\n" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "# Triplet Sampling" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "def pick_first_img_idx(labels_start_end,num_targets):\n", 425 | " random_target = -1\n", 426 | " while (random_target == -1):\n", 427 | " random_target = np.random.randint(low = 0,high = num_targets)\n", 428 | " if labels_start_end[random_target,0] == -1:\n", 429 | " random_target = -1\n", 430 | " return random_target\n", 431 | "\n", 432 | "def pick_pos_img_idx(img_label):\n", 433 | " class_idx_start_end = labels_start_end_train[img_label,:]\n", 434 | " same_idx = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 435 | " img = X_train_new[same_idx,:]\n", 436 | " return img\n", 437 | "\n", 438 | "def pick_neg_img(anchor_idx,num_targets):\n", 439 | " if anchor_idx == 0:\n", 440 | " targets = np.arange(1,num_targets)\n", 441 | " elif anchor_idx == num_targets -1:\n", 442 | " targets = np.arange(0,num_targets-1)\n", 443 | " else:\n", 444 | " targets = np.concatenate([np.arange(0,anchor_idx),np.arange(anchor_idx+1,num_targets)])\n", 445 | " diff_target_idx = np.random.randint(low = 0,high = num_targets-1)\n", 446 | " diff_target = targets[diff_target_idx]\n", 447 | " \n", 448 | " class_idx_start_end = labels_start_end_train[diff_target,:]\n", 449 | " idx_from_diff_target = np.random.randint(low = class_idx_start_end[0],high = class_idx_start_end[1]+1)\n", 450 | " img = X_train_new[idx_from_diff_target,:]\n", 451 | " \n", 452 | " 
return img,diff_target\n", 453 | "\n", 454 | "#Sample batch\n", 455 | "def get_batch(batch_size,train_fixed_set,num_targets):\n", 456 | " \n", 457 | " # initialize 3 empty arrays for the input image batch\n", 458 | " h = X_train_legit.shape[1]\n", 459 | " w = X_train_legit.shape[2]\n", 460 | " triple=[np.zeros((batch_size, h, w,3)) for i in range(3)]\n", 461 | "\n", 462 | " for i in range(0,batch_size):\n", 463 | " img_idx_pair1 = pick_first_img_idx(labels_start_end_train,num_targets)\n", 464 | " triple[0][i,:,:,:] = train_fixed_set[img_idx_pair1,:]\n", 465 | " img_label = img_idx_pair1\n", 466 | " \n", 467 | " #get image for the second: positive\n", 468 | " triple[1][i,:,:,:] = pick_pos_img_idx(img_label)\n", 469 | " \n", 470 | " #get image for the third: negative from legit\n", 471 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 472 | " while check_if_same_category(img_label,label_neg) == 1:\n", 473 | " img_neg,label_neg = pick_neg_img(img_label,num_targets)\n", 474 | "\n", 475 | " triple[2][i,:,:,:] = img_neg\n", 476 | " \n", 477 | " return triple" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "# Load Model" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "\n", 494 | "def loss(y_true,y_pred):\n", 495 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 496 | " loss_value = K.mean(loss_value,axis=0)\n", 497 | " return loss_value\n", 498 | "\n", 499 | "full_model = load_model(output_dir+saved_model_name+'.h5', custom_objects={'loss': loss})\n", 500 | "\n", 501 | "def custom_loss(margin):\n", 502 | " def loss(y_true,y_pred):\n", 503 | " loss_value = K.maximum(y_true, margin + y_pred)\n", 504 | " loss_value = K.mean(loss_value,axis=0)\n", 505 | " return loss_value\n", 506 | " return loss\n", 507 | "\n", 508 | "\n", 509 | "from keras import optimizers\n", 510 | "optimizer = optimizers.Adam(lr
= start_lr)\n", 511 | "full_model.compile(loss=custom_loss(margin),optimizer=optimizer)" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "# Training" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "def save_keras_model(model):\n", 528 | " model.save(output_dir+new_saved_model_name+'.h5')\n", 529 | " print(\"Saved model to disk\")" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "n=1 #number of wrong points \n", 539 | "\n", 540 | "#all training images\n", 541 | "X_train = np.concatenate([X_train_legit,X_train_phish])\n", 542 | "y_train = np.concatenate([y_train_legit,y_train_phish])\n", 543 | "\n", 544 | "#subset training\n", 545 | "X_train_new = np.zeros([num_targets*2*n,X_train_legit.shape[1],X_train_legit.shape[2],X_train_legit.shape[3]])\n", 546 | "y_train_new = np.zeros([num_targets*2*n,1])\n", 547 | "\n", 548 | "targets_train = np.zeros([batch_size,1])\n", 549 | "tot_count = 0 \n", 550 | "\n", 551 | "print(\"Starting training process!\")\n", 552 | "print(\"\\n ------------- \\n\")\n", 553 | "for k in range(0,num_sets):\n", 554 | " print(\"Starting a new set!\")\n", 555 | " print(\"\\n ------------- \\n\")\n", 556 | " X_train_legit = all_imgs_train\n", 557 | " y_train_legit = all_labels_train\n", 558 | " \n", 559 | " fixed_set_idx = find_fixed_set_idx(num_targets)\n", 560 | " fixed_set = X_train_legit[fixed_set_idx.astype(int),:,:,:]\n", 561 | " \n", 562 | " for j in range(0,iter_per_set):\n", 563 | " model = full_model.layers[3]\n", 564 | " X_train_new,y_train_new,labels_start_end_train = find_main_train(model,fixed_set,num_targets)\n", 565 | " for i in range(1, n_iter):\n", 566 | " tot_count = tot_count + 1\n", 567 | " inputs=get_batch(batch_size,fixed_set,num_targets)\n", 568 | " 
loss_iteration=full_model.train_on_batch(inputs,targets_train)\n", 569 | " \n", 570 | " print(\"\\n ------------- \\n\")\n", 571 | " print('Iteration: '+ str(i) +'. '+ \"Loss: {0}\".format(loss_iteration))\n", 572 | "\n", 573 | " if tot_count % save_interval == 0:\n", 574 | " save_keras_model(full_model)\n", 575 | "\n", 576 | " if tot_count % lr_interval ==0:\n", 577 | " start_lr = 0.99*start_lr\n", 578 | " K.set_value(full_model.optimizer.lr, start_lr)\n", 579 | "\n", 580 | "save_keras_model(full_model) " 581 | ] 582 | }, 583 | { 584 | "cell_type": "markdown", 585 | "metadata": {}, 586 | "source": [ 587 | "# Calculate Embeddings" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "metadata": {}, 594 | "outputs": [], 595 | "source": [ 596 | "shared_model = full_model.layers[3]\n", 597 | "\n", 598 | "whitelist_emb = shared_model.predict(X_train_legit,batch_size=64)\n", 599 | "np.save(output_dir+'whitelist_emb2',whitelist_emb)\n", 600 | "np.save(output_dir+'whitelist_labels2',y_train_legit )\n", 601 | "\n", 602 | "phishing_emb = shared_model.predict(all_imgs_test,batch_size=64)\n", 603 | "np.save(output_dir+'phishing_emb2',phishing_emb)\n", 604 | "np.save(output_dir+'phishing_labels2',all_labels_test )" 605 | ] 606 | } 607 | ], 608 | "metadata": { 609 | "kernelspec": { 610 | "display_name": "Python 3", 611 | "language": "python", 612 | "name": "python3" 613 | }, 614 | "language_info": { 615 | "codemirror_mode": { 616 | "name": "ipython", 617 | "version": 3 618 | }, 619 | "file_extension": ".py", 620 | "mimetype": "text/x-python", 621 | "name": "python", 622 | "nbconvert_exporter": "python", 623 | "pygments_lexer": "ipython3", 624 | "version": "3.7.6" 625 | } 626 | }, 627 | "nbformat": 4, 628 | "nbformat_minor": 2 629 | } 630 | --------------------------------------------------------------------------------

{{ site.title }}

Sahar Abdelnabi, Katharina Krombholz, Mario Fritz

CISPA Helmholtz Center for Information Security