├── .github
    ├── ISSUE_TEMPLATE
    │   ├── bug_report.md
    │   ├── custom.md
    │   ├── feature_request.md
    │   └── question.md
    └── no-response.yml
├── .gitignore
├── Extension
    ├── icon.png
    ├── manifest.json
    ├── popup.html
    ├── popup.js
    └── style.css
├── LICENSE
├── Other Information
    ├── Phishing Websites Features.docx
    └── verified_online.csv
├── README.md
├── _config.yml
├── classifier
    └── random_forest.pkl
├── clientServer.php
├── data_validation.py
├── dataset
    └── Training Dataset.arff
├── docs
    ├── CODE_OF_CONDUCT.md
    └── Troubleshooting.md
├── features_extraction.py
├── patterns.py
├── requirements.txt
├── test.py
├── train.py
└── tst
    └── test_features_extraction.py


/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Bug report
 3 | about: Create a report to help us improve
 4 | title: ''
 5 | labels: bug
 6 | assignees: rohitnaik246
 7 | 
 8 | ---
 9 | 
10 | **Have you read [Troubleshooting.md](https://github.com/rohitnaik246/Malicious-Web-Content-Detection-Using-Machine-Learning/blob/master/docs/Troubleshooting.md)? If No, please do so before filing an issue.**
11 | Yes/No
12 | 
13 | **Have you tried Googling the problem?**
14 | Yes/No
15 | 
16 | **Which python version are you using to run the project? In the terminal, type ```which <python-path-you-have-in-clientServer.php>``` and enter the output here**
17 | Python version -
18 | 
19 | **Describe the bug**
20 | A clear and concise description of what the bug is.
21 | 
22 | **To Reproduce**
23 | Steps to reproduce the behavior:
24 | 1. Go to '...'
25 | 2. Click on '....'
26 | 3. Scroll down to '....'
27 | 4. See error
28 | 
29 | **Expected behavior**
30 | A clear and concise description of what you expected to happen.
31 | 
32 | **Screenshots**
33 | If applicable, add screenshots to help explain your problem.
34 | 
35 | **Desktop (please complete the following information):**
36 |  - OS: [e.g. iOS]
37 |  - Browser [e.g. chrome, safari]
38 |  - Version (optional) [e.g. 22]
39 | 
40 | **Smartphone (please complete the following information):**
41 |  - Device: [e.g. iPhone6]
42 |  - OS: [e.g. iOS8.1]
43 |  - Browser [e.g. stock browser, safari]
44 |  - Version (optional) [e.g. 22]
45 | 
46 | **Additional context**
47 | Add any other context about the problem here.
48 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/custom.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Custom issue template
 3 | about: Describe this issue template's purpose here.
 4 | title: ''
 5 | labels: ''
 6 | assignees: ''
 7 | 
 8 | ---
 9 | 
10 | 
11 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Feature request
 3 | about: Suggest an idea for this project
 4 | title: ''
 5 | labels: ''
 6 | assignees: ''
 7 | 
 8 | ---
 9 | 
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 | 
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 | 
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 | 
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Question
 3 | about: Provide details about your question
 4 | title: ''
 5 | labels: question
 6 | assignees: rohitnaik246
 7 | 
 8 | ---
 9 | 
10 | **Have you read [Troubleshooting.md](https://github.com/rohitnaik246/Malicious-Web-Content-Detection-Using-Machine-Learning/blob/master/docs/Troubleshooting.md)? If No, please do so before filing an issue.**
11 | Yes/No
12 | 
13 | **Have you tried Googling about it?**
14 | Yes/No
15 | 
16 | **If yes, what were your findings?**
 17 | <Post screenshots or links that you found and show how you have done some self-research and narrowed down the root cause>
18 | 
19 | **Which python version are you using to run the project? In the terminal, type ```which <python-path-you-have-in-clientServer.php>``` and enter the output here**
20 | Python version -
21 | 
22 | **Describe the question**
23 | A clear and concise description of your question.
24 | 
25 | **Screenshots**
26 | Add any screenshots, if necessary to describe your question better.
27 | 
28 | **Additional context**
29 | Any additional information, if needed.
30 | 


--------------------------------------------------------------------------------
/.github/no-response.yml:
--------------------------------------------------------------------------------
 1 | # Number of days of inactivity before an Issue is closed for lack of response
 2 | daysUntilClose: 10
 3 | # Label requiring a response
 4 | responseRequiredLabel: more-information-needed
 5 | # Comment to post when closing an Issue for lack of response. Set to `false` to disable
 6 | closeComment: >
 7 |   This issue has been automatically closed because there has been no response
 8 |   to our request for more information from the original author. With only the
 9 |   information that is currently in the issue, we don't have enough information
10 |   to take action. Please reach out if you have or find the answers we need so
11 |   that we can investigate further.


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/*
2 | .DS_Store
3 | markup.txt
4 | *.pyc


--------------------------------------------------------------------------------
/Extension/icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/philomathic-guy/Malicious-Web-Content-Detection-Using-Machine-Learning/c86d4709354691d702e0c693f482e1fabd3b172c/Extension/icon.png


--------------------------------------------------------------------------------
/Extension/manifest.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "manifest_version": 2,
 3 | 
 4 |   "name": "Malicious Web Content Detector",
 5 |   "description": "This extension helps you avoid accessing malicious websites by giving you a chance to analyze the website before you interact with it.",
 6 |   "version": "1.0",
 7 | 
 8 |   "browser_action": {
 9 |     "default_icon": "icon.png",
10 |     "default_popup": "popup.html"
11 |   },
12 |     "content_security_policy": "script-src 'self' https://ajax.googleapis.com; object-src 'self'",
13 |   "permissions": [
14 |     "activeTab",
15 |     "tabs",
16 |     "https://ajax.googleapis.com/",
17 |     "<all_urls>"
18 |   ]
19 | }
20 | 


--------------------------------------------------------------------------------
/Extension/popup.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE html>
 2 | <html>
 3 | <head>
 4 |     <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
 5 |     <script src="popup.js" type="text/javascript"></script>
 6 |     <link rel="stylesheet" href="style.css">
 7 | </head>
 8 | <body>
 9 | 
10 | <div><h2> Check your WebPage </h2></div>
11 | <div id="p1"></div>
12 | <p>Find out now...<br><sup>&darr;</sup></p>
13 | <div>
14 |     <button class="button"><span>Safe or Not?</span></button>
15 |     <div id="div1"></div>
16 | </div>
17 | </body>
18 | </html>
19 | 


--------------------------------------------------------------------------------
/Extension/popup.js:
--------------------------------------------------------------------------------
 1 | // Purpose - This file contains all the logic relevant to the extension such as getting the URL, calling the server
 2 | // side clientServer.php which then calls the core logic.
 3 | 
 4 | function transfer(){	
 5 | 	var tablink;
 6 | 	chrome.tabs.getSelected(null,function(tab) {
 7 | 	   	tablink = tab.url;
 8 | 		$("#p1").text("The URL being tested is - "+tablink);
 9 | 
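10 | 		var xhr=new XMLHttpRequest();
11 | 		// URL-encode both values so that characters such as '&', '=' and '#' in the tab URL do not corrupt the form body.
12 | 		// Note: `document` here is the popup's own document, not the page in the tab; clientServer.php fetches the page itself.
13 | 		var markup = "url="+encodeURIComponent(tablink)+"&html="+encodeURIComponent(document.documentElement.innerHTML);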
14 | 		xhr.open("POST","http://localhost/Malicious-Web-Content-Detection-Using-Machine-Learning/clientServer.php",false);
15 | 		xhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
16 | 		xhr.send(markup);
17 | 		// Uncomment this line if you see some error on the extension to see the full error message for debugging.
18 | 		// alert(xhr.responseText);
19 | 		$("#div1").text(xhr.responseText);
20 | 		return xhr.responseText;
21 | 	});
22 | }
23 | 
24 | 
25 | $(document).ready(function(){
26 |     $("button").click(function(){	
27 | 		var val = transfer();
28 |     });
29 | });
30 | 
31 | chrome.tabs.getSelected(null,function(tab) {
32 |    	var tablink = tab.url;
33 | 	$("#p1").text("The URL being tested is - "+tablink);
34 | });
35 | 


--------------------------------------------------------------------------------
/Extension/style.css:
--------------------------------------------------------------------------------
  1 | @import url(https://fonts.googleapis.com/css?family=Nunito:300);
  2 | 
  3 | body{
  4 |    width:500px;
  5 |    height:100px;
  6 |    display:inline-block;
  7 |    }
  8 | 
  9 | body { font-family: "Nunito", sans-serif; font-size: 15px; background: black; color: white; }
 10 | a    { text-decoration: none; }
 11 | p    { text-align: center; }
 12 | sup  { font-size: 36px; font-weight: 100; line-height: 55px; }
 13 | 
 14 | 
 15 | .button
 16 | {
 17 |   text-transform: uppercase;
 18 |   letter-spacing: 2px;
 19 |   text-align: center;
 20 |   color: #0C5;
 21 | 
 22 |   font-size: 24px;
 23 |   font-family: "Nunito", sans-serif;
 24 |   font-weight: 300;
 25 |   
 26 |   margin: 5em auto;
 27 |   
 28 |   position: relative;
 29 |     top: initial;
 30 |     right: 100px;
 31 |     bottom: 140px;
 32 |     left: 120px;
 33 |   
 34 |   padding: 10px 0;
 35 |   width: 50%;
 36 |   height:5%;
 37 | 
 38 |   background: #0D6;
 39 |   border: 1px solid #0D6;
 40 |   color: #FFF;
 41 |   overflow: hidden;
 42 |   
 43 |   transition: all 0.5s;
 44 | }
 45 | 
 46 | .button:hover, .button:active 
 47 | {
 48 |   text-decoration: none;
 49 |   color: #0C5;
 50 |   border-color: #0C5;
 51 |   background: #FFF;
 52 | }
 53 | 
 54 | .button span 
 55 | {
 56 |   display: inline-block;
 57 |   position: relative;
 58 |   padding-right: 0;
 59 |   
 60 |   transition: padding-right 0.5s;
 61 | }
 62 | 
 63 | .button span:after 
 64 | {
 65 |   content: ' ';  
 66 |   position: absolute;
 67 |   top: 0;
 68 |   right: -18px;
 69 |   opacity: 0;
 70 |   width: 10px;
 71 |   height: 10px;
 72 |   margin-top: -10px;
 73 | 
 74 |   background: rgba(0, 0, 0, 0);
 75 |   border: 3px solid #FFF;
 76 |   border-top: none;
 77 |   border-right: none;
 78 | 
 79 |   transition: opacity 0.5s, top 0.5s, right 0.5s;
 80 |   transform: rotate(-45deg);
 81 | }
 82 | 
 83 | .button:hover span, .button:active span 
 84 | {
 85 |   padding-right: 30px;
 86 | }
 87 | 
 88 | .button:hover span:after, .button:active span:after 
 89 | {
 90 |   transition: opacity 0.5s, top 0.5s, right 0.5s;
 91 |   opacity: 1;
 92 |   border-color: #0C5;
 93 |   right: 0;
 94 |   top: 50%;
 95 | }
 96 | 
 97 | #p1
 98 | {
 99 | 	font-size:30px;
100 | 	display:inline-block;
101 | }
102 | 
103 | #div1
104 | {
105 | 	vertical-align: middle; 
106 | 	line-height: 50px; /* the same as your div height*/
107 | 	text-align:center;
108 | 	font-size:100px;
109 | }
110 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 Rohit Naik
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/Other Information/Phishing Websites Features.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/philomathic-guy/Malicious-Web-Content-Detection-Using-Machine-Learning/c86d4709354691d702e0c693f482e1fabd3b172c/Other Information/Phishing Websites Features.docx


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Malicious Web Content Detection using Machine Learning
 2 | 
 3 | #### NOTE - 
 4 | #### 1. If you face any issue, first refer to [Troubleshooting.md](docs/Troubleshooting.md). If you are still not able to resolve it, please file an issue with the appropriate template (Bug report, question, custom issue or feature request).
 5 | #### 2. Please support the project by starring it :)
 6 | 
 7 | ### Steps for reproducing the project -
 8 | * Install all the required packages using the following command - ```pip install -r requirements.txt```.
 9 | Make sure your pip is consistent with the Python version you are using by typing ```pip -V```.
10 | * Move the project folder to the correct localhost location, for example ```/Library/WebServer/Documents``` on a Mac.
11 | * (If you are using a Mac) Give permissions to write to the markup file ```sudo chmod 777 markup.txt```.
12 | * Modify the path of your Python 2.x installation in ```clientServer.php```.
13 | * (If you are using **anything other than** a Mac) Modify the localhost path in ```features_extraction.py``` to your localhost path (or host the application on a remote server and make the necessary changes).
14 | * Go to ```chrome://extensions```, activate developer mode, click on load unpacked and select the 'Extension' folder from our project.
15 | * Now, you can go to any web page and click on the extension in the top right panel of your Chrome window. Click on the 'Safe or Not?' button and wait for a second for the result.
16 | * Done!
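
The setup steps above can be checked with a small pre-flight script. The sketch below is not part of the project: `check_setup` and `REQUIRED_FILES` are hypothetical names, and the two path constants are copied from `features_extraction.py`. It only tests write permission for the user running the script, which may differ from the user your web server runs as.

```python
import os

# Values copied from features_extraction.py; adjust them to your own web root.
LOCALHOST_PATH = "/Library/WebServer/Documents/"
DIRECTORY_NAME = "Malicious-Web-Content-Detection-Using-Machine-Learning"

# Files the setup steps rely on, relative to the project folder (assumed list).
REQUIRED_FILES = [
    "clientServer.php",
    "test.py",
    "features_extraction.py",
    os.path.join("classifier", "random_forest.pkl"),
]


def check_setup(project_dir):
    """Hypothetical helper: return a list of human-readable setup problems."""
    if not os.path.isdir(project_dir):
        return ["project folder not found: " + project_dir]
    problems = []
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(project_dir, name)):
            problems.append("missing file: " + name)
    # clientServer.php writes the fetched HTML to markup.txt, so the file
    # (or, if it does not exist yet, the folder) has to be writable.
    markup = os.path.join(project_dir, "markup.txt")
    target = markup if os.path.exists(markup) else project_dir
    if not os.access(target, os.W_OK):
        problems.append("not writable: " + target)
    return problems


if __name__ == "__main__":
    for problem in check_setup(os.path.join(LOCALHOST_PATH, DIRECTORY_NAME)):
        print(problem)
```

An empty result means the folder layout looks as the steps expect; it does not confirm that the PHP server or the Python path in `clientServer.php` is configured correctly.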
17 | 
18 | ### Research paper - http://ieeexplore.ieee.org/document/8256834/
19 | 
20 | #### Abstract -
21 | * Naive users using a browser have no idea about the back-end of the page. The users might be tricked into giving away their credentials or downloading malicious data.
22 | * Our aim is to create an extension for Chrome which will act as middleware between the users and the malicious websites, and mitigate the risk of users succumbing to such websites.
23 | * Further, harmful content cannot be collected exhaustively, since it is itself continuously evolving. To counter this, we use machine learning to train the tool and categorize the new content it sees into the appropriate categories so that corresponding action can be taken.
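
To make the categorization idea concrete, here is a minimal sketch of how a URL can be turned into the 1 / 0 / -1 feature values used by this project (1 legitimate, 0 suspicious, -1 phishing). The three rules mirror `url_length`, `having_at_symbol` and `double_slash_redirecting` in `features_extraction.py`; the `url_features` wrapper is a hypothetical name added for illustration, and the real pipeline extracts more features before handing them to the classifier.

```python
def url_length(url):
    # Short URLs are treated as legitimate, very long ones as phishing.
    if len(url) < 54:
        return 1
    if len(url) <= 75:
        return 0
    return -1


def having_at_symbol(url):
    # An '@' in the URL is treated as a phishing signal.
    return -1 if "@" in url else 1


def double_slash_redirecting(url):
    # The last '//' should be the one right after the scheme
    # (index 5 for http://, index 6 for https://).
    return -1 if url.rfind("//") > 6 else 1


def url_features(url):
    """Hypothetical wrapper: collect the URL-only features into one vector."""
    return [url_length(url), having_at_symbol(url), double_slash_redirecting(url)]


if __name__ == "__main__":
    print(url_features("http://example.com/"))
    print(url_features("http://example.com//evil@x"))
```

A vector of this shape, extended with the remaining features, is what a trained model such as the one stored in `classifier/random_forest.pkl` would classify.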
24 | 
25 | ### Take a look at the [demo](https://youtu.be/0-wky0h3hmM)
26 | 
27 | A few snapshots of our system being run on different webpages -
28 | 
29 | ![spit_safe](https://user-images.githubusercontent.com/18022447/35985360-7cd910f2-0cc4-11e8-9edf-d38bf83d19a1.png)
30 | _**Fig 1.** A safe website - www.spit.ac.in (College website)_
31 | 
32 | ![drive_phishing](https://user-images.githubusercontent.com/18022447/35985366-81a9c5b8-0cc4-11e8-887d-7f427ffa8a8e.png)
33 | _**Fig 2.** A phishing website which looks just like Google Drive._
34 | 
35 | ![dropbox_phishing](https://user-images.githubusercontent.com/18022447/35985373-84056c86-0cc4-11e8-8751-cf511d5b8aa0.png)
36 | _**Fig 3.** A phishing website which looks just like Dropbox_
37 | 
38 | ![moodle_safe](https://user-images.githubusercontent.com/18022447/35985384-881ea85a-0cc4-11e8-9bea-cf71b3089364.png)
39 | _**Fig 4.** A safe website - www.google.com_
40 | 
41 | 
42 | 
43 | 


--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-cayman
2 | 


--------------------------------------------------------------------------------
/classifier/random_forest.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/philomathic-guy/Malicious-Web-Content-Detection-Using-Machine-Learning/c86d4709354691d702e0c693f482e1fabd3b172c/classifier/random_forest.pkl


--------------------------------------------------------------------------------
/clientServer.php:
--------------------------------------------------------------------------------
 1 | <?php
 2 | // Purpose - This file acts as a mediator between the client side popup.js and the server side test.py.
 3 | // It gets the HTML contents which acts as input to the suite of python files.
 4 | 
 5 | header("Access-Control-Allow-Origin: *");
 6 | $site=$_POST['url'];
 7 | $html = file_get_contents($site);
 8 | //echo $html;
 9 | $bytes=file_put_contents('markup.txt', $html);
10 | 
11 | // Can use this if your default interpreter is Python 2.x.
12 | // Has some problem executing 'which python2'. So, absolute path is just simpler.
13 | //$python_path=exec("which python 2>&1 ");
14 | //$decision=exec("$python_path test.py $site 2>&1 ");
15 | 
16 | // Replace the path with the path of your python2.x installation. escapeshellarg() keeps the URL from being interpreted by the shell.
17 | $decision=exec("/Library/Frameworks/Python.framework/Versions/2.7/bin/python2 test.py " . escapeshellarg($site) . " 2>&1 ");
18 | echo $decision;
19 | ?>
20 | 


--------------------------------------------------------------------------------
/data_validation.py:
--------------------------------------------------------------------------------
 1 | # Purpose - To print the training data and check the parsing logic for it.
 2 | # Note: This file is not a part of the codepath which is used by the Chrome extension for making a decision.
 3 | 
 4 | import numpy as np
 5 | from features_extraction import DIRECTORY_NAME, LOCALHOST_PATH
 6 | 
 7 | with open(LOCALHOST_PATH + DIRECTORY_NAME + '/dataset/Training Dataset.arff') as f:
 8 |     file = f.read()
 9 | data_list = file.split('\n')
10 | 
11 | print(data_list)
12 | print("/////////////////////////////////")
13 | 
14 | data = np.array(data_list)
15 | data1 = [i.split(',') for i in data]
16 | 
17 | print("Data1 before indexing - ", data1)
18 | print ("Length of data1 - ", len(data1))
19 | print ("////////////////////////////////")
20 | 
21 | data1 = data1[0:-1]
22 | 
23 | print ("Data1 after indexing - ", data1)
24 | print ("Length of data1 - ", len(data1))
25 | 
26 | # for i in data1:
27 | #    labels.append(i[30])
28 | data1 = np.array(data1)
29 | 
30 | print ("Converted to np array - ", data1)
31 | print ("Number of columns in a row - ", len(data1[0]))
32 | print ("Shape of data1 - ", data1.shape)
33 | print ("////////////////////////////////")
34 | 
35 | features = data1[:, :-1]
36 | 
37 | print ("Features array - ", features)
38 | print ("Number of columns in a row - ", len(features[0]))
39 | 


--------------------------------------------------------------------------------
/docs/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
 1 | # Contributor Covenant Code of Conduct
 2 | 
 3 | ## Our Pledge
 4 | 
 5 | In the interest of fostering an open and welcoming environment, we as
 6 | contributors and maintainers pledge to making participation in our project and
 7 | our community a harassment-free experience for everyone, regardless of age, body
 8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
 9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 | 
12 | ## Our Standards
13 | 
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 | 
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 | 
23 | Examples of unacceptable behavior by participants include:
24 | 
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 |  advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 |  address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 |  professional setting
33 | 
34 | ## Our Responsibilities
35 | 
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 | 
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 | 
46 | ## Scope
47 | 
48 | This Code of Conduct applies both within project spaces and in public spaces
49 | when an individual is representing the project or its community. Examples of
50 | representing a project or community include using an official project e-mail
51 | address, posting via an official social media account, or acting as an appointed
52 | representative at an online or offline event. Representation of a project may be
53 | further defined and clarified by project maintainers.
54 | 
55 | ## Enforcement
56 | 
57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 | reported by contacting the project team at rohit.naik246@gmail.com. All
59 | complaints will be reviewed and investigated and will result in a response that
60 | is deemed necessary and appropriate to the circumstances. The project team is
61 | obligated to maintain confidentiality with regard to the reporter of an incident.
62 | Further details of specific enforcement policies may be posted separately.
63 | 
64 | Project maintainers who do not follow or enforce the Code of Conduct in good
65 | faith may face temporary or permanent repercussions as determined by other
66 | members of the project's leadership.
67 | 
68 | ## Attribution
69 | 
70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
72 | 
73 | [homepage]: https://www.contributor-covenant.org
74 | 
75 | For answers to common questions about this code of conduct, see
76 | https://www.contributor-covenant.org/faq
77 | 


--------------------------------------------------------------------------------
/docs/Troubleshooting.md:
--------------------------------------------------------------------------------
 1 | # Troubleshooting
 2 | This document will help you check for common issues and make sure your issue has not already been reported.
 3 | 
 4 | ## Check to see if the issue has been reported
 5 | * Search the [Issues page](https://github.com/rohitnaik246/Malicious-Web-Content-Detection-Using-Machine-Learning/issues) to see if someone else has already reported the same issue.
 6 | 
 7 | ## List of things to double check
 8 | Please make sure of the following things -
 9 | * You are using Python 2.x for running the project. You can confirm this by entering the following command in the terminal - ```which <python-path-you-have-in-clientServer.php>```
10 | (The project has been developed in Python 2.7 and due to certain differences in the syntax and the support of modules imported, the project works smoothly with Python 2.x as of now.)
11 | * You have done a ```git pull``` before running. The project is being updated on a daily basis (as of May 2019) and making sure you have the most updated version of the project will resolve most issues.
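
The Python version can also be checked from inside the interpreter itself. The snippet below is a small sketch, not part of the project; `is_python2` is a hypothetical helper name. Run it with the same interpreter path you configured in `clientServer.php`.

```python
import sys


def is_python2(version_info=None):
    """Hypothetical helper: True when the given (or running) interpreter is Python 2.x."""
    if version_info is None:
        version_info = sys.version_info
    return version_info[0] == 2


if __name__ == "__main__":
    print("Running Python %d.%d" % (sys.version_info[0], sys.version_info[1]))
    if not is_python2():
        print("Warning: this project was developed with Python 2.7")
```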
12 | 
13 | ## Create an issue
14 | If your problem hasn't been solved or reported previously, and if you have followed all the items on the above list of things to check, then create an issue:
15 | 1. Select a template closest to the title of the issue you want to file.
16 | 1. Enter the details specific to the template.
17 | 


--------------------------------------------------------------------------------
/features_extraction.py:
--------------------------------------------------------------------------------
  1 | # Purpose -
  2 | # Running this file (stand alone) - For extracting all the features from a web page for testing.
  3 | # Notes -
  4 | # 1 stands for legitimate
  5 | # 0 stands for suspicious
  6 | # -1 stands for phishing
  7 | 
  8 | from bs4 import BeautifulSoup
  9 | import urllib
 10 | import bs4
 11 | import re
 12 | import socket
 13 | import whois
 14 | from datetime import datetime
 15 | import time
 16 | 
 17 | # https://breakingcode.wordpress.com/2010/06/29/google-search-python/
 18 | # Previous package structure was modified. Import statements according to new structure added. Also code modified.
 19 | from googlesearch import search
 20 | 
 21 | # This import is needed only when you run this file in isolation.
 22 | import sys
 23 | 
 24 | from patterns import *
 25 | 
 26 | # Path of your local server. Different for different OSs.
 27 | LOCALHOST_PATH = "/Library/WebServer/Documents/"
 28 | DIRECTORY_NAME = "Malicious-Web-Content-Detection-Using-Machine-Learning"
 29 | 
 30 | 
 31 | def having_ip_address(url):
 32 |     ip_address_pattern = ipv4_pattern + "|" + ipv6_pattern
 33 |     match = re.search(ip_address_pattern, url)
 34 |     return -1 if match else 1
 35 | 
 36 | 
 37 | def url_length(url):
 38 |     if len(url) < 54:
 39 |         return 1
 40 |     if 54 <= len(url) <= 75:
 41 |         return 0
 42 |     return -1
 43 | 
 44 | 
 45 | def shortening_service(url):
 46 |     match = re.search(shortening_services, url)
 47 |     return -1 if match else 1
 48 | 
 49 | 
 50 | def having_at_symbol(url):
 51 |     match = re.search('@', url)
 52 |     return -1 if match else 1
 53 | 
 54 | 
 55 | def double_slash_redirecting(url):
 56 |     # since the position starts from 0, we have given 6 and not 7 which is according to the document.
 57 |     # It is convenient and easier to just use string search here to search the last occurrence instead of re.
 58 |     last_double_slash = url.rfind('//')
 59 |     return -1 if last_double_slash > 6 else 1
 60 | 
 61 | 
 62 | def prefix_suffix(domain):
 63 |     match = re.search('-', domain)
 64 |     return -1 if match else 1
 65 | 
 66 | 
 67 | def having_sub_domain(url):
 68 |     # Here, instead of greater than 1 we will take greater than 3 since the greater than 1 condition is when www and
 69 |     # country domain dots are skipped
 70 |     # Accordingly other dots will increase by 1
 71 |     if having_ip_address(url) == -1:
 72 |         match = re.search(
 73 |             '(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.'
 74 |             '([01]?\\d\\d?|2[0-4]\\d|25[0-5]))|(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}',
 75 |             url)
 76 |         pos = match.end()
 77 |         url = url[pos:]
 78 |     num_dots = [x.start() for x in re.finditer(r'\.', url)]
 79 |     if len(num_dots) <= 3:
 80 |         return 1
 81 |     elif len(num_dots) == 4:
 82 |         return 0
 83 |     else:
 84 |         return -1
 85 | 
 86 | 
 87 | def domain_registration_length(domain):
 88 |     expiration_date = domain.expiration_date
 89 |     today = time.strftime('%Y-%m-%d')
 90 |     today = datetime.strptime(today, '%Y-%m-%d')
 91 | 
 92 |     registration_length = 0
 93 |     # Some domains do not have expiration dates. This if condition makes sure that the expiration date is used only
 94 |     # when it is present.
 95 |     if expiration_date:
 96 |         registration_length = abs((expiration_date - today).days)
 97 |     return -1 if registration_length / 365 <= 1 else 1
 98 | 
 99 | 
100 | def favicon(wiki, soup, domain):
101 |     for head in soup.find_all('head'):
102 |         for link in soup.find_all('link', href=True):
103 |             dots = [x.start() for x in re.finditer(r'\.', link['href'])]
104 |             return 1 if wiki in link['href'] or len(dots) == 1 or domain in link['href'] else -1
105 |     return 1
106 | 
107 | 
108 | def https_token(url):
109 |     match = re.search(http_https, url)
110 |     if match and match.start() == 0:
111 |         url = url[match.end():]
112 |     match = re.search('http|https', url)
113 |     return -1 if match else 1
114 | 
115 | 
116 | def request_url(wiki, soup, domain):
117 |     i = 0
118 |     success = 0
119 |     for img in soup.find_all('img', src=True):
120 |         dots = [x.start() for x in re.finditer(r'\.', img['src'])]
121 |         if wiki in img['src'] or domain in img['src'] or len(dots) == 1:
122 |             success = success + 1
123 |         i = i + 1
124 | 
125 |     for audio in soup.find_all('audio', src=True):
126 |         dots = [x.start() for x in re.finditer(r'\.', audio['src'])]
127 |         if wiki in audio['src'] or domain in audio['src'] or len(dots) == 1:
128 |             success = success + 1
129 |         i = i + 1
130 | 
131 |     for embed in soup.find_all('embed', src=True):
132 |         dots = [x.start() for x in re.finditer(r'\.', embed['src'])]
133 |         if wiki in embed['src'] or domain in embed['src'] or len(dots) == 1:
134 |             success = success + 1
135 |         i = i + 1
136 | 
137 |     for i_frame in soup.find_all('iframe', src=True):
138 |         dots = [x.start() for x in re.finditer(r'\.', i_frame['src'])]
139 |         if wiki in i_frame['src'] or domain in i_frame['src'] or len(dots) == 1:
140 |             success = success + 1
141 |         i = i + 1
142 | 
143 |     try:
144 |         percentage = success / float(i) * 100
145 |     except ZeroDivisionError:
146 |         return 1
147 | 
148 |     if percentage < 22.0:
149 |         return 1
150 |     elif 22.0 <= percentage < 61.0:
151 |         return 0
152 |     else:
153 |         return -1
154 | 
155 | 
156 | def url_of_anchor(wiki, soup, domain):
157 |     i = 0
158 |     unsafe = 0
159 |     for a in soup.find_all('a', href=True):
160 |         # The 2nd condition was originally 'JavaScript ::void(0)'. We match only "javascript" because
161 |         # the space between "javascript" and "::" might not be
162 |         # present in the actual a['href'].
163 |         if "#" in a['href'] or "javascript" in a['href'].lower() or "mailto" in a['href'].lower() or not (
164 |                 wiki in a['href'] or domain in a['href']):
165 |             unsafe = unsafe + 1
166 |         i = i + 1
167 |         # print a['href']
168 |     try:
169 |         percentage = unsafe / float(i) * 100
170 |     except ZeroDivisionError:
171 |         return 1
172 |     if percentage < 31.0:
173 |         return 1
174 |         # return percentage
175 |     elif 31.0 <= percentage < 67.0:
176 |         return 0
177 |     else:
178 |         return -1
179 | 
180 | 
181 | # Links in <Script> and <Link> tags
182 | def links_in_tags(wiki, soup, domain):
183 |     i = 0
184 |     success = 0
185 |     for link in soup.find_all('link', href=True):
186 |         dots = [x.start() for x in re.finditer(r'\.', link['href'])]
187 |         if wiki in link['href'] or domain in link['href'] or len(dots) == 1:
188 |             success = success + 1
189 |         i = i + 1
190 | 
191 |     for script in soup.find_all('script', src=True):
192 |         dots = [x.start() for x in re.finditer(r'\.', script['src'])]
193 |         if wiki in script['src'] or domain in script['src'] or len(dots) == 1:
194 |             success = success + 1
195 |         i = i + 1
196 |     try:
197 |         percentage = (i - success) / float(i) * 100  # share of links pointing to other domains
198 |     except ZeroDivisionError:
199 |         return 1
200 | 
201 |     if percentage < 17.0:
202 |         return 1
203 |     elif 17.0 <= percentage < 81.0:
204 |         return 0
205 |     else:
206 |         return -1
207 | 
208 | 
209 | # Server Form Handler (SFH)
210 | # The conditions below are taken directly from the features document, as there were no sites to test against.
211 | def sfh(wiki, soup, domain):
212 |     for form in soup.find_all('form', action=True):
213 |         if form['action'] == "" or form['action'] == "about:blank":
214 |             return -1
215 |         elif wiki not in form['action'] and domain not in form['action']:
216 |             return 0
217 |         else:
218 |             return 1
219 |     return 1
220 | 
221 | 
222 | # Mail Function
223 | # PHP mail() function is difficult to retrieve, hence the following function is based on mailto
224 | def submitting_to_email(soup):
225 |     for form in soup.find_all('form', action=True):
226 |         return -1 if "mailto:" in form['action'] else 1
227 |     # In case there is no form in the soup, then it is safe to return 1.
228 |     return 1
229 | 
230 | 
231 | def abnormal_url(domain, url):
232 |     hostname = domain.name
233 |     match = re.search(re.escape(hostname), url)  # escape so dots in the hostname match literally
234 |     return 1 if match else -1
235 | 
236 | 
237 | # IFrame Redirection
238 | def i_frame(soup):
239 |     for i_frame in soup.find_all('iframe', width=True, height=True, frameborder=True):
240 |         # Even if one iFrame satisfies the below conditions, it is safe to return -1 for this method.
241 |         if i_frame['width'] == "0" and i_frame['height'] == "0" and i_frame['frameborder'] == "0":
242 |             return -1
243 |         if i_frame['width'] == "0" or i_frame['height'] == "0" or i_frame['frameborder'] == "0":
244 |             return 0
245 |     # If none of the iframes have a width or height of zero or a frameBorder of size 0, then it is safe to return 1.
246 |     return 1
247 | 
248 | 
249 | def age_of_domain(domain):
250 |     creation_date = domain.creation_date
251 |     expiration_date = domain.expiration_date
252 |     ageofdomain = 0
253 |     if expiration_date:
254 |         ageofdomain = abs((expiration_date - creation_date).days)
255 |     return -1 if ageofdomain / 30 < 6 else 1
256 | 
257 | 
258 | def web_traffic(url):
259 |     try:
260 |         rank = \
261 |             bs4.BeautifulSoup(urllib.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find(
262 |                 "REACH")['RANK']
263 |     except TypeError:
264 |         return -1
265 |     rank = int(rank)
266 |     return 1 if rank < 100000 else 0
267 | 
268 | 
269 | def google_index(url):
270 |     site = search(url, 5)
271 |     return 1 if site else -1
272 | 
273 | 
274 | def statistical_report(url, hostname):
275 |     try:
276 |         ip_address = socket.gethostbyname(hostname)
277 |     except socket.error:
278 |         return -1
279 |     url_match = re.search(
280 |         r'at\.ua|usa\.cc|baltazarpresentes\.com\.br|pe\.hu|esy\.es|hol\.es|sweddy\.com|myjino\.ru|96\.lt|ow\.ly', url)
281 |     ip_match = re.search(
282 |         r'146\.112\.61\.108|213\.174\.157\.151|121\.50\.168\.88|192\.185\.217\.116|78\.46\.211\.158|181\.174\.165\.13|46\.242\.145\.103|121\.50\.168\.40|83\.125\.22\.219|46\.242\.145\.98|'
283 |         r'107\.151\.148\.44|107\.151\.148\.107|64\.70\.19\.203|199\.184\.144\.27|107\.151\.148\.108|107\.151\.148\.109|119\.28\.52\.61|54\.83\.43\.69|52\.69\.166\.231|216\.58\.192\.225|'
284 |         r'118\.184\.25\.86|67\.208\.74\.71|23\.253\.126\.58|104\.239\.157\.210|175\.126\.123\.219|141\.8\.224\.221|10\.10\.10\.10|43\.229\.108\.32|103\.232\.215\.140|69\.172\.201\.153|'
285 |         r'216\.218\.185\.162|54\.225\.104\.146|103\.243\.24\.98|199\.59\.243\.120|31\.170\.160\.61|213\.19\.128\.77|62\.113\.226\.131|208\.100\.26\.234|195\.16\.127\.102|195\.16\.127\.157|'
286 |         r'34\.196\.13\.28|103\.224\.212\.222|172\.217\.4\.225|54\.72\.9\.51|192\.64\.147\.141|198\.200\.56\.183|23\.253\.164\.103|52\.48\.191\.26|52\.214\.197\.72|87\.98\.255\.18|209\.99\.17\.27|'
287 |         r'216\.38\.62\.18|104\.130\.124\.96|47\.89\.58\.141|78\.46\.211\.158|54\.86\.225\.156|54\.82\.156\.19|37\.157\.192\.102|204\.11\.56\.48|110\.34\.231\.42',
288 |         ip_address)
289 |     if url_match:
290 |         return -1
291 |     elif ip_match:
292 |         return -1
293 |     else:
294 |         return 1
295 | 
296 | 
297 | def get_hostname_from_url(url):
298 |     hostname = url
299 |     # TODO: Put this pattern in patterns.py as something like - get_hostname_pattern.
300 |     pattern = r"https?://(?:www\.)?|www\."
301 |     pre_pattern_match = re.search(pattern, hostname)
302 | 
303 |     if pre_pattern_match:
304 |         hostname = hostname[pre_pattern_match.end():]
305 |         post_pattern_match = re.search("/", hostname)
306 |         if post_pattern_match:
307 |             hostname = hostname[:post_pattern_match.start()]
308 | 
309 |     return hostname
310 | 
311 | # TODO: Put the DNS and domain code into a function.
312 | 
313 | 
314 | def main(url):
315 |     with open(LOCALHOST_PATH + DIRECTORY_NAME + '/markup.txt', 'r') as file:
316 |         soup_string = file.read()
317 | 
318 |     soup = BeautifulSoup(soup_string, 'html.parser')
319 | 
320 |     status = []
321 |     hostname = get_hostname_from_url(url)
322 | 
323 |     status.append(having_ip_address(url))
324 |     status.append(url_length(url))
325 |     status.append(shortening_service(url))
326 |     status.append(having_at_symbol(url))
327 |     status.append(double_slash_redirecting(url))
328 |     status.append(prefix_suffix(hostname))
329 |     status.append(having_sub_domain(url))
330 | 
331 |     try:
332 |         domain = whois.query(hostname)
333 |         dns = 1 if domain else -1  # an empty WHOIS result also counts as no record
334 |     except Exception:
335 |         dns = -1
336 | 
337 |     status.append(-1 if dns == -1 else domain_registration_length(domain))
338 | 
339 |     status.append(favicon(url, soup, hostname))
340 |     status.append(https_token(url))
341 |     status.append(request_url(url, soup, hostname))
342 |     status.append(url_of_anchor(url, soup, hostname))
343 |     status.append(links_in_tags(url, soup, hostname))
344 |     status.append(sfh(url, soup, hostname))
345 |     status.append(submitting_to_email(soup))
346 | 
347 |     status.append(-1 if dns == -1 else abnormal_url(domain, url))
348 | 
349 |     status.append(i_frame(soup))
350 | 
351 |     status.append(-1 if dns == -1 else age_of_domain(domain))
352 | 
353 |     status.append(dns)
354 | 
355 |     status.append(web_traffic(url))
356 |     status.append(google_index(url))
357 |     status.append(statistical_report(url, hostname))
358 | 
359 |     print('\n1. Having IP address\n2. URL Length\n3. URL Shortening service\n4. Having @ symbol\n'
360 |           '5. Having double slash\n6. Having dash symbol(Prefix Suffix)\n7. Having multiple subdomains\n'
361 |           '8. Domain Registration Length\n9. Favicon\n10. HTTP or HTTPS token in domain name\n'
362 |           '11. Request URL\n12. URL of Anchor\n13. Links in tags\n14. SFH\n15. Submitting to email\n16. Abnormal URL\n'
363 |           '17. IFrame\n18. Age of Domain\n19. DNS Record\n20. Web Traffic\n21. Google Index\n22. Statistical Reports\n')
364 |     print(status)
365 |     return status
366 | 
367 | 
368 | # Use the below two lines if features_extraction.py is being run as a standalone file. If you are running this file as
369 | # a part of the workflow pipeline starting with the chrome extension, comment out these two lines.
370 | # if __name__ == "__main__":
371 | #     if len(sys.argv) != 2:
372 | #         print("Please use the following format for the command - `python2 features_extraction.py <url-to-be-tested>`")
373 | #         exit(0)
374 | #     main(sys.argv[1])
375 | 
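
The three ratio-based features in features_extraction.py (`request_url`, `url_of_anchor`, `links_in_tags`) follow the same pattern: compute a percentage, then map it to 1, 0 or -1 using two cut-offs. A minimal sketch of a shared helper is shown below. `score_percentage` is a hypothetical name that does not exist in the repository, and the default of 1 for an empty page is an assumption that mirrors the existing `except` branches.

```python
def score_percentage(count, total, low, high):
    """Map count/total (as a percentage) to 1, 0 or -1.

    Hypothetical helper, not part of features_extraction.py.
    Returns 1 when total is 0, mirroring the existing fallback.
    """
    if total == 0:
        return 1
    percentage = count / float(total) * 100
    if percentage < low:
        return 1
    elif percentage < high:
        return 0
    return -1


# Cut-offs taken from url_of_anchor: below 31% -> 1, 31% up to 67% -> 0, otherwise -1.
print(score_percentage(2, 10, 31.0, 67.0))
print(score_percentage(5, 10, 31.0, 67.0))
print(score_percentage(9, 10, 31.0, 67.0))
```

With a helper like this, each feature function would only need to count matching tags and pass in its own pair of cut-offs.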


--------------------------------------------------------------------------------
/patterns.py:
--------------------------------------------------------------------------------
 1 | # Purpose - This file just stores all the regular expression patterns used in features_extraction.py so that there is
 2 | # a common source which can be used if any of the patterns are to be edited.
 3 | 
 4 | ipv4_pattern = r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
 5 | ipv6_pattern = r"^(?:(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){6})(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):" \
 6 |                r"(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9]))\.){3}" \
 7 |                r"(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9])))))))|(?:(?:::(?:(?:(?:[0-9a-fA-F]{1,4})):){5})" \
 8 |                r"(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:25[0-5]|" \
 9 |                r"(?:[1-9]|1[0-9]|2[0-4])?[0-9]))\.){3}(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9])))))))|" \
10 |                r"(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})))?::(?:(?:(?:[0-9a-fA-F]{1,4})):){4})" \
11 |                r"(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:25[0-5]|" \
12 |                r"(?:[1-9]|1[0-9]|2[0-4])?[0-9]))\.){3}(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9])))))))|" \
13 |                r"(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){0,1}(?:(?:[0-9a-fA-F]{1,4})))?::" \
14 |                r"(?:(?:(?:[0-9a-fA-F]{1,4})):){3})(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):(?:(?:[0-9a-fA-F]{1,4})))|" \
15 |                r"(?:(?:(?:(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9]))\.){3}" \
16 |                r"(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9])))))))|(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){0,2}" \
17 |                r"(?:(?:[0-9a-fA-F]{1,4})))?::(?:(?:(?:[0-9a-fA-F]{1,4})):){2})(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):" \
18 |                r"(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9]))\.){3}(?:(?:25[0-5]|" \
19 |                r"(?:[1-9]|1[0-9]|2[0-4])?[0-9])))))))|(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){0,3}" \
20 |                r"(?:(?:[0-9a-fA-F]{1,4})))?::(?:(?:[0-9a-fA-F]{1,4})):)(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):" \
21 |                r"(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9]))\.){3}" \
22 |                r"(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9])))))))|(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){0,4}" \
23 |                r"(?:(?:[0-9a-fA-F]{1,4})))?::)(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):(?:(?:[0-9a-fA-F]{1,4})))|" \
24 |                r"(?:(?:(?:(?:(?:25[0-5]|(?:[1-9]|1[0-9]|2[0-4])?[0-9]))\.){3}(?:(?:25[0-5]|" \
25 |                r"(?:[1-9]|1[0-9]|2[0-4])?[0-9])))))))|(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){0,5}" \
26 |                r"(?:(?:[0-9a-fA-F]{1,4})))?::)(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){0,6}" \
27 |                r"(?:(?:[0-9a-fA-F]{1,4})))?::))))$"
28 | shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
29 |                       r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
30 |                       r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
31 |                       r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
32 |                       r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
33 |                       r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
34 |                       r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
35 |                       r"tr\.im|link\.zip\.net"
36 | http_https = r"https://|http://"
37 | 
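
The IPv4 and IPv6 patterns in patterns.py are long and hard to maintain. As an alternative sketch (a swapped-in technique, not what the repository does), Python 3's standard `ipaddress` module can validate a candidate string directly. `is_ip_address` is a hypothetical helper name. Like the anchored patterns, it checks the whole string rather than searching inside a longer URL.

```python
import ipaddress


def is_ip_address(candidate):
    """Return True if candidate parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False


for value in ("172.11.141.23", "fe80::204:61ff:fe9d:f156", "www.google.com", "888.com"):
    print(value, is_ip_address(value))
```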


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | joblib >= 0.13.1
2 | numpy
3 | scikit-learn
4 | beautifulsoup4
5 | whois
6 | google
7 | 


--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
 1 | # Purpose - Receive the call for testing a page from the Chrome extension and return the result (SAFE/PHISHING)
 2 | # for display. This file calls all the different components of the project (The ML model, features_extraction) and
 3 | # consolidates the result.
 4 | 
 5 | import joblib
 6 | import features_extraction
 7 | import sys
 8 | import numpy as np
 9 | 
10 | from features_extraction import LOCALHOST_PATH, DIRECTORY_NAME
11 | 
12 | 
13 | def get_prediction_from_url(test_url):
14 |     features_test = features_extraction.main(test_url)
15 |     # Due to updates to scikit-learn, we now need a 2D array as a parameter to the predict function.
16 |     features_test = np.array(features_test).reshape((1, -1))
17 | 
18 |     clf = joblib.load(LOCALHOST_PATH + DIRECTORY_NAME + '/classifier/random_forest.pkl')
19 | 
20 |     pred = clf.predict(features_test)
21 |     return int(pred[0])
22 | 
23 | 
24 | def main():
25 |     url = sys.argv[1]
26 | 
27 |     prediction = get_prediction_from_url(url)
28 | 
29 |     # Print the probability of prediction (if needed)
30 |     # prob = clf.predict_proba(features_test)
31 |     # print 'Features=', features_test, 'The predicted probability is - ', prob, 'The predicted label is - ', pred
32 |     #    print "The probability of this site being a phishing website is ", features_test[0]*100, "%"
33 | 
34 |     if prediction == 1:
35 |         # print "The website is safe to browse"
36 |         print("SAFE")
37 |     elif prediction == -1:
38 |         # print "The website has phishing features. DO NOT VISIT!"
39 |         print("PHISHING")
40 | 
41 |         # print 'Error -', features_test
42 | 
43 | 
44 | if __name__ == "__main__":
45 |     main()
46 | 


--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
 1 | # Purpose - This file is used to create a classifier and store it in a .pkl file. You can modify the contents of this
 2 | # file to create your own version of the classifier.
 3 | 
 4 | import numpy as np
 5 | 
 6 | from sklearn.ensemble import RandomForestClassifier
 7 | # from sklearn.metrics import accuracy_score, classification_report
 8 | # from sklearn import metrics
 9 | 
10 | import joblib
11 | 
12 | labels = []
13 | data_file = open('dataset/Training Dataset.arff').read()
14 | data_list = data_file.split('\r\n')
15 | data = np.array(data_list)
16 | data1 = [i.split(',') for i in data]
17 | data1 = data1[0:-1]
18 | for i in data1:
19 |     labels.append(i[30])
20 | data1 = np.array(data1)
21 | features = data1[:, :-1]
22 | # Choose only the relevant features from the data set.
23 | features = features[:, [0, 1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17, 22, 23, 24, 25, 27, 29]]
24 | features = np.array(features).astype(float)
25 | 
26 | features_train = features
27 | labels_train = labels
28 | # features_test=features[10000:]
29 | # labels_test=labels[10000:]
30 | 
31 | 
32 | print("\n\n Random Forest Algorithm Results ")
33 | clf4 = RandomForestClassifier(min_samples_split=7, verbose=True)
34 | clf4.fit(features_train, labels_train)
35 | importances = clf4.feature_importances_
36 | std = np.std([tree.feature_importances_ for tree in clf4.estimators_], axis=0)
37 | indices = np.argsort(importances)[::-1]
38 | # Print the feature ranking
39 | print("Feature ranking:")
40 | for f in range(features_train.shape[1]):
41 |     print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
42 | 
43 | # pred4=clf4.predict(features_test)
44 | # print(classification_report(labels_test, pred4))
45 | # print 'The accuracy is:', accuracy_score(labels_test, pred4)
46 | # print metrics.confusion_matrix(labels_test, pred4)
47 | 
48 | # sys.setrecursionlimit(9999999)
49 | joblib.dump(clf4, 'classifier/random_forest.pkl', compress=9)
50 | 
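
train.py splits the ARFF file on `\r\n` and treats every resulting line as a data row. A minimal sketch of a more tolerant reader is shown below, assuming the usual ARFF layout in which header declarations start with `@` and comments start with `%`. `parse_arff_rows` is a hypothetical helper, and the sample text is made up for illustration rather than taken from the dataset.

```python
def parse_arff_rows(text):
    """Return the comma-separated data rows of an ARFF document as lists of strings."""
    rows = []
    for line in text.splitlines():  # handles both \n and \r\n line endings
        line = line.strip()
        # Skip blank lines, ARFF header declarations and comments.
        if not line or line.startswith('@') or line.startswith('%'):
            continue
        rows.append([value.strip() for value in line.split(',')])
    return rows


sample = "@relation phishing\r\n@attribute a {-1,1}\r\n@data\r\n-1,1,0,1\r\n1,1,-1,-1\r\n"
rows = parse_arff_rows(sample)
features = [row[:-1] for row in rows]  # every column except the last
labels = [row[-1] for row in rows]     # last column holds the class label
print(features, labels)
```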


--------------------------------------------------------------------------------
/tst/test_features_extraction.py:
--------------------------------------------------------------------------------
 1 | # Purpose - This file includes unit tests for all the functionality of features_extraction.py
 2 | 
 3 | import unittest
 4 | from features_extraction import *
 5 | from test import get_prediction_from_url
 6 | 
 7 | 
 8 | class TestFeaturesExtraction(unittest.TestCase):
 9 |     def test_having_ip_address(self):
10 |         ipv4_address = "172.11.141.23"
11 |         ipv6_address_1 = "fe80:0:0:0:204:61ff:fe9d:f156"
12 |         ipv6_address_2 = "fe80::204:61ff:fe9d:f156"
13 |         ipv6_address_3 = "fe80:0000:0000:0000:0204:61ff:fe9d:f156"
14 |         ipv6_address_4 = "fe80:0:0:0:0204:61ff:254.157.241.86"
15 |         url_1 = "www.google.com"
16 |         url_2 = "888.com"
17 | 
18 |         # IP address cases
19 |         self.assertEqual(having_ip_address(ipv4_address), -1, "Given input URL has an IP address.")
20 |         self.assertEqual(having_ip_address(ipv6_address_1), -1, "Given input URL has an IP address.")
21 |         self.assertEqual(having_ip_address(ipv6_address_2), -1, "Given input URL has an IP address.")
22 |         self.assertEqual(having_ip_address(ipv6_address_3), -1, "Given input URL has an IP address.")
23 |         self.assertEqual(having_ip_address(ipv6_address_4), -1, "Given input URL has an IP address.")
24 | 
25 |         # Non IP address cases
26 |         self.assertEqual(having_ip_address(url_1), 1, "Given input URL does not have an IP address.")
27 |         self.assertEqual(having_ip_address(url_2), 1, "Given input URL does not have an IP address.")
28 | 
29 |     def test_shortening_services(self):
30 |         url_1 = "bit.ly/akhd9a9"
31 |         url_2 = "http://goo.gl/shan78a"
32 |         url_3 = "https://github.com/philomathic-guy"
33 |         url_4 = "tr.im/adsfaj8"
34 | 
35 |         # Shortening services links
36 |         self.assertEqual(shortening_service(url_1), -1, "Given input URL is a shortening service URL.")
37 |         self.assertEqual(shortening_service(url_2), -1, "Given input URL is a shortening service URL.")
38 |         self.assertEqual(shortening_service(url_4), -1, "Given input URL is a shortening service URL.")
39 | 
40 |         # Non-shortening services links
41 |         self.assertEqual(shortening_service(url_3), 1, "Given input URL is a non-shortening service URL.")
42 | 
43 |     def test_url_length(self):
44 |         # Short URL - Length 41.
45 |         url_1 = "https://docs.python.org/2/library/re.html"
46 |         # Long URL - Length - 73.
47 |         url_2 = "https://github.com/philomathic-guy/Friend-recommendation-using-movie-data"
48 |         # Longer URL - Length 79.
49 |         url_3 = "https://myfunds.000webhostapp.com/new_now/8f66d5a47bf3ec8e6c1ag6s3dc770001a4bd/"
50 | 
51 |         self.assertEqual(url_length(url_1), 1, "The URL length is not suspicious.")
52 |         self.assertEqual(url_length(url_2), 0, "The URL length is suspicious.")
53 |         self.assertEqual(url_length(url_3), -1, "The URL length indicates phishing.")
54 | 
55 |     def test_having_at_symbol(self):
56 |         url_1 = "https://docs.python.org/2/library/re.html"
57 |         url_2 = "https://github.com/philomathic-guy/"
58 | 
59 |         self.assertEqual(having_at_symbol(url_1), 1)
60 |         self.assertEqual(having_at_symbol(url_2), 1)
61 | 
62 |     def test_full_path(self):
63 |         url_1 = "https://github.com/philomathic-guy/"
64 | 
65 |         self.assertEqual(get_prediction_from_url(url_1), 1)
66 | 
67 |     def test_domain_registration_length(self):
68 |         url_1 = "https://github.com/philomathic-guy/"
69 | 
70 |         hostname_1 = get_hostname_from_url(url_1)
71 |         try:
72 |             domain_1 = whois.query(hostname_1)
73 |         except Exception:
74 |             self.skipTest("WHOIS lookup failed")
75 |         self.assertIn(domain_registration_length(domain_1), (-1, 0, 1))
76 | 
77 |     def test_having_sub_domain(self):
78 |         url_1 = "https://github.com/philomathic-guy/"
79 |         url_2 = "https://www.spit.ac.in"
80 | 
81 |         self.assertEqual(having_sub_domain(url_1), 1)
82 |         self.assertEqual(having_sub_domain(url_2), 1)
83 | 
84 | 
85 | if __name__ == "__main__":
86 |     unittest.main()
87 | 
88 | 


--------------------------------------------------------------------------------