├── .github
    ├── ISSUE_TEMPLATE
    │   ├── bug_report.md
    │   ├── custom.md
    │   ├── feature_request.md
    │   └── question.md
    └── no-response.yml
├── .gitignore
├── Extension
    ├── icon.png
    ├── manifest.json
    ├── popup.html
    ├── popup.js
    └── style.css
├── LICENSE
├── Other Information
    ├── Phishing Websites Features.docx
    └── verified_online.csv
├── README.md
├── _config.yml
├── classifier
    └── random_forest.pkl
├── clientServer.php
├── data_validation.py
├── dataset
    └── Training Dataset.arff
├── docs
    ├── CODE_OF_CONDUCT.md
    └── Troubleshooting.md
├── features_extraction.py
├── patterns.py
├── requirements.txt
├── test.py
├── train.py
└── tst
    └── test_features_extraction.py


/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: bug
assignees: rohitnaik246

---

**Have you read [Troubleshooting.md](https://github.com/rohitnaik246/Malicious-Web-Content-Detection-Using-Machine-Learning/blob/master/docs/Troubleshooting.md)? If No, please do so before filing an issue.**
Yes/No

**Have you tried Googling the problem?**
Yes/No

**Which python version are you using to run the project? In the terminal, type ```which ``` and enter the output here**
Python version -

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
 - OS: [e.g. iOS]
 - Browser [e.g. chrome, safari]
 - Version (optional) [e.g. 22]

**Smartphone (please complete the following information):**
 - Device: [e.g. iPhone6]
 - OS: [e.g. iOS8.1]
 - Browser [e.g. stock browser, safari]
 - Version (optional) [e.g. 22]

**Additional context**
Add any other context about the problem here.

--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/custom.md:
--------------------------------------------------------------------------------
---
name: Custom issue template
about: Describe this issue template's purpose here.
title: ''
labels: ''
assignees: ''

---


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.

--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.md:
--------------------------------------------------------------------------------
---
name: Question
about: Provide details about your question
title: ''
labels: question
assignees: rohitnaik246

---

**Have you read [Troubleshooting.md](https://github.com/rohitnaik246/Malicious-Web-Content-Detection-Using-Machine-Learning/blob/master/docs/Troubleshooting.md)? If No, please do so before filing an issue.**
Yes/No

**Have you tried Googling about it?**
Yes/No

**If yes, what were your findings?**


**Which python version are you using to run the project? In the terminal, type ```which ``` and enter the output here**
Python version -

**Describe the question**
A clear and concise description of your question.

**Screenshots**
Add any screenshots, if necessary to describe your question better.

**Additional context**
Any additional information, if needed.

--------------------------------------------------------------------------------
/.github/no-response.yml:
--------------------------------------------------------------------------------
# Number of days of inactivity before an Issue is closed for lack of response
daysUntilClose: 10
# Label requiring a response
responseRequiredLabel: more-information-needed
# Comment to post when closing an Issue for lack of response. Set to `false` to disable
closeComment: >
  This issue has been automatically closed because there has been no response
  to our request for more information from the original author. With only the
  information that is currently in the issue, we don't have enough information
  to take action. Please reach out if you have or find the answers we need so
  that we can investigate further.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/*
.DS_Store
markup.txt
*.pyc
--------------------------------------------------------------------------------
/Extension/icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/philomathic-guy/Malicious-Web-Content-Detection-Using-Machine-Learning/c86d4709354691d702e0c693f482e1fabd3b172c/Extension/icon.png
--------------------------------------------------------------------------------
/Extension/manifest.json:
--------------------------------------------------------------------------------
{
  "manifest_version": 2,

  "name": "Malicious Web Content Detector",
  "description": "This extension helps you avoid accessing malicious websites by giving you a chance to analyze the website before you interact with it.",
  "version": "1.0",

  "browser_action": {
    "default_icon": "icon.png",
    "default_popup": "popup.html"
  },
  "content_security_policy": "script-src 'self' https://ajax.googleapis.com; object-src 'self'",
  "permissions": [
    "activeTab",
    "tabs",
    "https://ajax.googleapis.com/",
    "tabs",
    ""
  ]
}

--------------------------------------------------------------------------------
/Extension/popup.html:
--------------------------------------------------------------------------------
Check your WebPage
Find out now...
--------------------------------------------------------------------------------
/Extension/popup.js:
--------------------------------------------------------------------------------
// Purpose - This file contains all the logic relevant to the extension such as getting the URL, calling the server
// side clientServer.php which then calls the core logic.

function transfer(){
    chrome.tabs.getSelected(null,function(tab) {
        var tablink = tab.url;
        $("#p1").text("The URL being tested is - "+tablink);

        var xhr=new XMLHttpRequest();
        var params="url="+tablink;
        // alert(params);
        // Encode both values so that characters such as '&', '=' and '+' inside the URL or the markup
        // do not corrupt the form-encoded request body.
        var markup = "url="+encodeURIComponent(tablink)+"&html="+encodeURIComponent(document.documentElement.innerHTML);
        xhr.open("POST","http://localhost/Malicious-Web-Content-Detection-Using-Machine-Learning/clientServer.php",false);
        xhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
        xhr.send(markup);
        // Uncomment this line if you see some error on the extension to see the full error message for debugging.
        // alert(xhr.responseText);
        $("#div1").text(xhr.responseText);
        return xhr.responseText;
    });
}


$(document).ready(function(){
    $("button").click(function(){
        // transfer() writes the result into #div1 from inside its callback, so there is no return value to keep.
        transfer();
    });
});

chrome.tabs.getSelected(null,function(tab) {
    var tablink = tab.url;
    $("#p1").text("The URL being tested is - "+tablink);
});

--------------------------------------------------------------------------------
/Extension/style.css:
--------------------------------------------------------------------------------
/* @import must come before all other rules, otherwise browsers ignore it. */
@import url(http://fonts.googleapis.com/css?family=Nunito:300);

body{
    width:500px;
    height:100px;
    display:inline-block;
}

body { font-family: "Nunito", sans-serif; font-size: 15px; background: black; color: white; }
a { text-decoration: none; }
p { text-align: center; }
sup { font-size: 36px; font-weight: 100; line-height: 55px; }


.button
{
    text-transform: uppercase;
    letter-spacing: 2px;
    text-align: center;
    color: #0C5;

    font-size: 24px;
    font-family: "Nunito", sans-serif;
    font-weight: 300;

    margin: 5em auto;

    position: relative;
    top: initial;
    right: 100px;
    bottom: 140px;
    left: 120px;

    padding: 10px 0;
    width: 50%;
    height:5%;

    background: #0D6;
    border: 1px solid #0D6;
    color: #FFF;
    overflow: hidden;

    transition: all 0.5s;
}

.button:hover, .button:active
{
    text-decoration: none;
    color: #0C5;
    border-color: #0C5;
    background: #FFF;
}

.button span
{
    display: inline-block;
    position: relative;
    padding-right: 0;

    transition: padding-right 0.5s;
}

.button span:after
{
    content: ' ';
    position: absolute;
    top: 0;
    right: -18px;
    opacity: 0;
    width: 10px;
    height: 10px;
    margin-top: -10px;

    background: rgba(0, 0, 0, 0);
    border: 3px solid #FFF;
    border-top: none;
    border-right: none;

    transition: opacity 0.5s, top 0.5s, right 0.5s;
    transform: rotate(-45deg);
}

.button:hover span, .button:active span
{
    padding-right: 30px;
}

.button:hover span:after, .button:active span:after
{
    transition: opacity 0.5s, top 0.5s, right 0.5s;
    opacity: 1;
    border-color: #0C5;
    right: 0;
    top: 50%;
}

#p1
{
    font-size:30px;
    display:inline-block;
}

#div1
{
    vertical-align: middle;
    line-height: 50px; /* the same as your div height*/
    text-align:center;
    font-size:100px;
}

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Rohit Naik

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/Other Information/Phishing Websites Features.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/philomathic-guy/Malicious-Web-Content-Detection-Using-Machine-Learning/c86d4709354691d702e0c693f482e1fabd3b172c/Other Information/Phishing Websites Features.docx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Malicious Web Content Detection using Machine Learning

#### NOTE -
#### 1. If you face any issue, first refer to [Troubleshooting.md](docs/Troubleshooting.md). If you are still not able to resolve it, please file an issue with the appropriate template (Bug report, question, custom issue or feature request).
#### 2. Please support the project by starring it :)

### Steps for reproducing the project -
* Install all the required packages using the following command - ```pip install -r requirements.txt```.
Make sure your pip is consistent with the Python version you are using by typing ```pip -V```.
* Move the project folder to the correct localhost location. For example, ```/Library/WebServer/Documents``` on a Mac.
* (If you are using a Mac) Give write permissions on the markup file - ```sudo chmod 777 markup.txt```.
* Modify the path of your Python 2.x installation in ```clientServer.php```.
* (If you are using **anything other than** a Mac) Modify the localhost path in ```features_extraction.py``` to your localhost path (or host the application on a remote server and make the necessary changes).
* Go to ```chrome://extensions```, activate developer mode, click on load unpacked and select the 'Extension' folder from our project.
* Now, you can go to any web page and click on the extension in the top right panel of your Chrome window. Click on the 'Safe or not?' button and wait a second for the result.
* Done!

### Research paper - http://ieeexplore.ieee.org/document/8256834/

#### Abstract -
* Naive users of a browser have no idea about the back-end of the page they are visiting. They might be tricked into giving away their credentials or downloading malicious data.
* Our aim is to create an extension for Chrome which will act as middleware between the users and the malicious websites, and mitigate the risk of users succumbing to such websites.
* Further, harmful content cannot be collected exhaustively, because it keeps evolving. To counter this we are using machine learning - to train the tool and categorize the new content it sees into the appropriate categories so that corresponding action can be taken.

### Take a look at the [demo](https://youtu.be/0-wky0h3hmM)

A few snapshots of our system being run on different webpages -

![spit_safe](https://user-images.githubusercontent.com/18022447/35985360-7cd910f2-0cc4-11e8-9edf-d38bf83d19a1.png)
_**Fig 1.** A safe website - www.spit.ac.in (College website)_

![drive_phishing](https://user-images.githubusercontent.com/18022447/35985366-81a9c5b8-0cc4-11e8-887d-7f427ffa8a8e.png)
_**Fig 2.** A phishing website which looks just like Google Drive._

![dropbox_phishing](https://user-images.githubusercontent.com/18022447/35985373-84056c86-0cc4-11e8-8751-cf511d5b8aa0.png)
_**Fig 3.** A phishing website which looks just like Dropbox_

![moodle_safe](https://user-images.githubusercontent.com/18022447/35985384-881ea85a-0cc4-11e8-9bea-cf71b3089364.png)
_**Fig 4.** A safe website - www.google.com_



--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-cayman

--------------------------------------------------------------------------------
/classifier/random_forest.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/philomathic-guy/Malicious-Web-Content-Detection-Using-Machine-Learning/c86d4709354691d702e0c693f482e1fabd3b172c/classifier/random_forest.pkl
--------------------------------------------------------------------------------
/clientServer.php:
--------------------------------------------------------------------------------
// Purpose - This file acts as a mediator between the client side popup.js and the server side test.py.
// It gets the HTML contents which acts as input to the suite of python files.

&1 ");
//$decision=exec("$python_path test.py $site 2>&1 ");

// Replace the path with the path of your python2.x installation.
$decision=exec("/Library/Frameworks/Python.framework/Versions/2.7/bin/python2 test.py $site 2>&1 ");
echo $decision;
?>

--------------------------------------------------------------------------------
/data_validation.py:
--------------------------------------------------------------------------------
# Purpose - To print the training data and check the parsing logic for it.
# Note: This file is not a part of the codepath which is used by the Chrome extension for making a decision.

import numpy as np
from features_extraction import DIRECTORY_NAME, LOCALHOST_PATH

with open(LOCALHOST_PATH + DIRECTORY_NAME + '/dataset/Training Dataset.arff') as f:
    file = f.read()
data_list = file.split('\n')

print(data_list)
print("/////////////////////////////////")

data = np.array(data_list)
data1 = [i.split(',') for i in data]

print("Data1 before indexing - ", data1)
print("Length of data1 - ", len(data1))
print("////////////////////////////////")

data1 = data1[0:-1]

print("Data1 after indexing - ", data1)
print("Length of data1 - ", len(data1))

# for i in data1:
#     labels.append(i[30])
data1 = np.array(data1)

print("Converted to np array - ", data1)
print("Number of columns in a row - ", len(data1[0]))
print("Shape of data1 - ", data1.shape)
print("////////////////////////////////")

features = data1[:, :-1]

print("Features array - ", features)
print("Number of columns in a row - ", len(features[0]))

--------------------------------------------------------------------------------
/docs/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
 advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
 address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
 professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at rohit.naik246@gmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

--------------------------------------------------------------------------------
/docs/Troubleshooting.md:
--------------------------------------------------------------------------------
# Troubleshooting
This document will help you check for common issues and make sure your issue has not already been reported.

## Check to see if the issue has been reported
* Search the [Issues page](https://github.com/rohitnaik246/Malicious-Web-Content-Detection-Using-Machine-Learning/issues) to see if someone else has already reported the same issue.

## List of things to double check
Please make sure of the following things -
* You are using Python 2.x for running the project. You can confirm this by entering the following command in the terminal - ```which ```
(The project has been developed in Python 2.7 and due to certain differences in the syntax and the support of modules imported, the project works smoothly with Python 2.x as of now.)
* You have done a ```git pull``` before running. The project is being updated on a daily basis (as of May 2019) and making sure you have the most updated version of the project will resolve most issues.

## Create an issue
If your problem hasn't been solved or reported previously, and if you have followed all the items on the above list of things to check, then create an issue:
1. Select a template closest to the title of the issue you want to file.
1. Enter the details specific to the template.
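
The Python version check described above can also be done from inside the interpreter. The following is a minimal sketch, not part of the project: the helper name `is_python2` is hypothetical, and it only inspects the major version number.

```python
import sys


def is_python2(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 2.x."""
    # version_info[0] is the major version number, e.g. 2 for Python 2.7.
    return version_info[0] == 2


if __name__ == "__main__":
    # sys.version starts with the version number of the running interpreter.
    print("Running Python " + sys.version.split()[0])
    if not is_python2():
        print("Warning: the project was developed against Python 2.7.")
```

Running this with the same interpreter path that is configured in ```clientServer.php``` shows which version the extension will actually use.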

--------------------------------------------------------------------------------
/features_extraction.py:
--------------------------------------------------------------------------------
# Purpose -
# Running this file (stand alone) - For extracting all the features from a web page for testing.
# Notes -
# 1 stands for legitimate
# 0 stands for suspicious
# -1 stands for phishing

from bs4 import BeautifulSoup
import urllib
import bs4
import re
import socket
import whois
from datetime import datetime
import time

# https://breakingcode.wordpress.com/2010/06/29/google-search-python/
# Previous package structure was modified. Import statements according to new structure added. Also code modified.
from googlesearch import search

# This import is needed only when you run this file in isolation.
import sys

from patterns import *

# Path of your local server. Different for different OSs.
LOCALHOST_PATH = "/Library/WebServer/Documents/"
DIRECTORY_NAME = "Malicious-Web-Content-Detection-Using-Machine-Learning"


def having_ip_address(url):
    ip_address_pattern = ipv4_pattern + "|" + ipv6_pattern
    match = re.search(ip_address_pattern, url)
    return -1 if match else 1


def url_length(url):
    if len(url) < 54:
        return 1
    if 54 <= len(url) <= 75:
        return 0
    return -1


def shortening_service(url):
    match = re.search(shortening_services, url)
    return -1 if match else 1


def having_at_symbol(url):
    match = re.search('@', url)
    return -1 if match else 1


def double_slash_redirecting(url):
    # since the position starts from 0, we have given 6 and not 7 which is according to the document.
    # It is convenient and easier to just use string search here to search the last occurrence instead of re.
    last_double_slash = url.rfind('//')
    return -1 if last_double_slash > 6 else 1


def prefix_suffix(domain):
    match = re.search('-', domain)
    return -1 if match else 1


def having_sub_domain(url):
    # Here, instead of greater than 1 we will take greater than 3 since the greater than 1 condition is when www and
    # country domain dots are skipped
    # Accordingly other dots will increase by 1
    if having_ip_address(url) == -1:
        match = re.search(
            '(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.'
            '([01]?\\d\\d?|2[0-4]\\d|25[0-5]))|(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}',
            url)
        # This pattern can differ from the ones in patterns.py, so guard against it not matching.
        if match:
            url = url[match.end():]
    num_dots = [x.start() for x in re.finditer(r'\.', url)]
    if len(num_dots) <= 3:
        return 1
    elif len(num_dots) == 4:
        return 0
    else:
        return -1


def domain_registration_length(domain):
    expiration_date = domain.expiration_date
    today = time.strftime('%Y-%m-%d')
    today = datetime.strptime(today, '%Y-%m-%d')

    registration_length = 0
    # Some domains do not have expiration dates. This if condition makes sure that the expiration date is used only
    # when it is present.
    if expiration_date:
        registration_length = abs((expiration_date - today).days)
    return -1 if registration_length / 365 <= 1 else 1


def favicon(wiki, soup, domain):
    for head in soup.find_all('head'):
        # Only the first <link> with an href decides the result.
        for link in soup.find_all('link', href=True):
            dots = [x.start() for x in re.finditer(r'\.', link['href'])]
            return 1 if wiki in link['href'] or len(dots) == 1 or domain in link['href'] else -1
    return 1


def https_token(url):
    match = re.search(http_https, url)
    if match and match.start() == 0:
        url = url[match.end():]
    match = re.search('http|https', url)
    return -1 if match else 1


def request_url(wiki, soup, domain):
    i = 0
    success = 0
    for img in soup.find_all('img', src=True):
        dots = [x.start() for x in re.finditer(r'\.', img['src'])]
        if wiki in img['src'] or domain in img['src'] or len(dots) == 1:
            success = success + 1
        i = i + 1

    for audio in soup.find_all('audio', src=True):
        dots = [x.start() for x in re.finditer(r'\.', audio['src'])]
        if wiki in audio['src'] or domain in audio['src'] or len(dots) == 1:
            success = success + 1
        i = i + 1

    for embed in soup.find_all('embed', src=True):
        dots = [x.start() for x in re.finditer(r'\.', embed['src'])]
        if wiki in embed['src'] or domain in embed['src'] or len(dots) == 1:
            success = success + 1
        i = i + 1

    # The HTML tag name is 'iframe'; searching for 'i_frame' never matches any element.
    for i_frame in soup.find_all('iframe', src=True):
        dots = [x.start() for x in re.finditer(r'\.', i_frame['src'])]
        if wiki in i_frame['src'] or domain in i_frame['src'] or len(dots) == 1:
            success = success + 1
        i = i + 1

    try:
        percentage = success / float(i) * 100
    except ZeroDivisionError:
        # No matching elements were found on the page.
        return 1

    if percentage < 22.0:
        return 1
    elif 22.0 <= percentage < 61.0:
        return 0
    else:
        return -1


def url_of_anchor(wiki, soup, domain):
    i = 0
    unsafe = 0
    for a in soup.find_all('a', href=True):
        # 2nd condition was 'JavaScript ::void(0)' but we put JavaScript because the space between javascript and ::
        # might not be there in the actual a['href']
        if "#" in a['href'] or "javascript" in a['href'].lower() or "mailto" in a['href'].lower() or not (
                wiki in a['href'] or domain in a['href']):
            unsafe = unsafe + 1
        i = i + 1
        # print a['href']
    try:
        percentage = unsafe / float(i) * 100
    except ZeroDivisionError:
        # No anchors with an href were found on the page.
        return 1
    if percentage < 31.0:
        return 1
        # return percentage
    elif 31.0 <= percentage < 67.0:
        return 0
    else:
        return -1


# Links in