├── .gitignore ├── LICENSE ├── NOTES.md ├── README.md ├── bin ├── check-dir └── check-repo ├── dummy-project ├── .mysterious_hidden_file ├── dangerous_file.json ├── data_loading │ ├── empy.py │ └── file_with_password.py ├── exploratory_data_analysis │ ├── .ipynb_checkpoints │ │ └── Untitled-checkpoint.ipynb │ └── Untitled.ipynb ├── ignore ├── models │ └── training.py └── python_file_with_password.py ├── repo_scraper ├── DiffChecker.py ├── FileChecker.py ├── FolderChecker.py ├── Git.py ├── GitChecker.py ├── Result.py ├── __init__.py ├── constants │ ├── __init__.py │ ├── extensions.py │ ├── git_diff.py │ └── result.py ├── filesystem.py ├── filetype.py └── matchers.py ├── requirements.txt ├── setup.py └── tests ├── __init__.py ├── test_FileChecker.py ├── test_filesystem.py ├── test_ip_check.py ├── test_password_check.py └── test_url_check.py /.gitignore: -------------------------------------------------------------------------------- 1 | #### joe made this: https://goel.io/joe 2 | 3 | #####=== Python ===##### 4 | 5 | # Byte-compiled / optimized / DLL files 6 | __pycache__/ 7 | *.py[cod] 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .coverage 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | 47 | # Translations 48 | *.mo 49 | *.pot 50 | 51 | # Django stuff: 52 | *.log 53 | 54 | # Sphinx documentation 55 | docs/_build/ 56 | 57 | # PyBuilder 58 | target/ 59 | 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Edu 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /NOTES.md: -------------------------------------------------------------------------------- 1 | #Notes 2 | 3 | I found some interesting facts while developing this; here they are. 4 | 5 | ##Git and standard error 6 | 7 | Git (even in version 2.6.0) does not get along with the [standard error](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=447395) and sends all kinds of unpredictable output to it. Since `check-repo` executes git commands (via `subprocess`), it needs to handle git errors so the user knows what happened. Since checking whether stderr is non-empty is not enough, a naive helper function looks for the keyword 'fatal' in the standard error and raises a Python Exception when it finds it. 8 | 9 | Right now, it works. The most common error is running git commands in a directory that does not contain a git repository; hopefully the check will also catch other 'fatal' errors. 10 | 11 | ##On checking git history 12 | 13 | As mentioned in the [README](/README.md) file, checking only the differences between commits is faster than checking out each commit. To get those differences, the library relies on `git diff`, which introduces a couple of limitations and design caveats, explained below. 14 | 15 | ###`git diff` and extensions 16 | 17 | ###`git diff` and file size 18 | 19 | 20 | ###`git diff` and binary files 21 | 22 | By default, `git diff` does not show changes in binary files. That makes things easier, since we do not need to check whether the file we are dealing with is binary before applying the regular expressions. However, what git interprets as a binary file depends on your configuration. That being said, `check-repo` will not look at binary files and will only print a warning.
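
As a rough illustration of the 'fatal' keyword check described under "Git and standard error", here is a minimal sketch. It is written for Python 3 rather than the Python 2.7 the project was tested with, and `GitError` and `run_git` are hypothetical names introduced only for this example; the project's own helper may differ in detail.

```python
import subprocess


class GitError(Exception):
    """Hypothetical exception type, raised when git reports a fatal error."""


def check_stderr(err):
    """Raise GitError only when stderr contains the keyword 'fatal'."""
    # git also writes progress and informational messages to stderr,
    # so a non-empty stderr alone is not treated as a failure.
    if 'fatal' in err:
        raise GitError(err.strip())


def run_git(git_dir, *args):
    """Run a git command inside git_dir and return its stdout as text."""
    p = subprocess.Popen(['git', '-C', git_dir] + list(args),
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         universal_newlines=True)
    out, err = p.communicate()
    check_stderr(err)
    return out
```

For example, `run_git('.', 'log', '--pretty=format:%H')` would return the commit hashes of the repository in the working directory, or raise `GitError` when the directory is not a git repository.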
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | #repo_scraper 2 | 3 | Check your projects for possible password (or other sensitive data) leaks. 4 | 5 | The library exposes two commands: 6 | * `check-dir` - Performs checks on a folder and its subdirectories 7 | * `check-repo` - Performs checks on a git repository 8 | 9 | Both scripts work almost the same from the user's point of view; run `check-dir --help` or `check-repo --help` for more details. 10 | 11 | ##Example 12 | 13 | Check your dummy-project: 14 | ```bash 15 | check-dir -p dummy-project 16 | ``` 17 | 18 | Output: 19 | ```bash 20 | Checking folder dummy-project... 21 | 22 | dummy-project/python_file_with_password.py 23 | ALERT - MATCH ["password = 'qwerty'"] 24 | 25 | dummy-project/dangerous_file.json 26 | ALERT - MATCH ['"password": "super-secret-password"'] 27 | ``` 28 | 29 | ##How does it work? 30 | 31 | Briefly speaking, `check-dir` lists all files below a folder and applies regular expressions to look for passwords/IPs. Given that a blind search would take far too long (for example, if the repo contains a 50MB csv file), some filters are applied before the regular expressions are matched: 32 | 33 | * **File size** - If the file is bigger than 1MB, ignore it but print a warning 34 | * **Extension** - If the extension is not allowed, ignore the file but print a warning. (See [NOTES](NOTES.md) to know why extension is used instead of mimetype) 35 | * **Base64** - If the file contains Base64 data, remove it. Many plain-text formats (such as Jupyter notebooks) embed data in Base64 format, and applying regular expressions to such data would take far too long 36 | 37 | `check-repo` works in a slightly different way. One obvious way to check git history is to check out each commit and apply `check-dir`. That approach would be really slow, since the script would be checking the same files many times.
Instead, `check-repo` checks out the first commit, runs `check-dir` there, and then moves up one commit at a time, using `git diff` to get only the difference between each consecutive pair of commits. 38 | 39 | As in `check-dir`, the script applies some filters before applying regular expressions to avoid getting stuck on big files. Note that in this case we are not dealing with files but with the `git diff` output, which prevents us from checking the file size directly: 40 | 41 | * **Number of lines** - If the diff for a file has more than 10,000 lines (additions and deletions included), ignore it but print a warning 42 | * **Number of characters** - If the additions for a file exceed 1,048,576 characters, ignore them but print a warning 43 | * **Extension** - If the extension is not allowed, ignore the file but print a warning. (See [NOTES](NOTES.md) to know why extension is used instead of mimetype) 44 | * **Base64** - Remove Base64 code. 45 | 46 | The project has some limitations; see the [NOTES](NOTES.md) file for information regarding the design of the project and how that limits what the library is able to detect. 47 | 48 | ##Installation 49 | 50 | ```bash 51 | pip install git+git://github.com/dssg/repo-scraper.git -r requirements.txt 52 | ``` 53 | 54 | ##Dependencies 55 | 56 | * glob2 57 | * nose (optional, for running tests) 58 | 59 | ##Tested with 60 | * Python 2.7.10 61 | * Git 2.6.0 62 | 63 | ##Usage 64 | 65 | ```bash 66 | cd path/to/your/project 67 | check-dir 68 | ``` 69 | 70 | See help for more options available: 71 | 72 | ```bash 73 | check-dir --help 74 | ``` 75 | 76 | ###Using an IGNORE file with check-dir 77 | 78 | Just as with git, you can specify a file to make the program ignore some files/folders. This is especially useful when you have a folder with many log files that you are sure do not contain sensitive data. The library assumes one glob rule per line. 79 | 80 | Adding an IGNORE file makes execution faster, since several regular expressions are matched against every file that passes the filters.
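
As a rough illustration of the one-glob-rule-per-line format, the sketch below parses ignore rules and expands them against a directory. It assumes Python 3.5+ and the standard `glob` module with `recursive=True` standing in for the `glob2` dependency; `parse_ignore_rules` and `ignored_files` are hypothetical names used only here, and skipping blank lines and lines starting with `#` mirrors what the library's ignore-file parser appears to do.

```python
import glob
import os


def parse_ignore_rules(text):
    """Return the glob rules in an ignore file: one per line,
    skipping blank lines and lines that start with '#'."""
    lines = text.splitlines()
    return [line for line in lines if line and not line.startswith('#')]


def ignored_files(rules, directory):
    """Return the set of paths under directory matched by any rule."""
    matches = set()
    for rule in rules:
        # recursive=True lets '**' span subdirectories; this stands in
        # for the glob2 package (an assumption made for this sketch)
        pattern = os.path.join(directory, rule)
        matches.update(glob.glob(pattern, recursive=True))
    return matches


rules = parse_ignore_rules("# skip logs and Python files\nlogs/**\n\n**/*.py\n")
print(rules)  # ['logs/**', '**/*.py']
```

Files returned by `ignored_files(rules, 'dummy-project')` would then be left out of the list of files to check.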
81 | 82 | **Important**: Even though the format is very similar, you cannot use the same rules as in your [.gitignore](https://git-scm.com/docs/gitignore) file. For more details, see the article on [glob patterns](https://en.wikipedia.org/wiki/Glob_(programming)). 83 | 84 | ##What's done 85 | 86 | * Passwords (using regex). See [`test_password_check.py`](tests/test_password_check.py) 87 | * IPs 88 | * URLs on amazonaws.com (it's simple to add more domains if needed) 89 | 90 | ##What's missing 91 | 92 | * URLs 93 | * Check other branches apart from master 94 | 95 | #TODO 96 | * Come up with a cool name 97 | -------------------------------------------------------------------------------- /bin/check-dir: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from repo_scraper.constants.extensions import * 3 | from repo_scraper.constants.result import * 4 | from repo_scraper.FolderChecker import FolderChecker 5 | import argparse 6 | import re 7 | import sys 8 | 9 | parser = argparse.ArgumentParser() 10 | parser.add_argument("-f", "--force", action="store_true", help="Force execution, prevents confirmation prompt from appearing") 11 | parser.add_argument("-i", "--ignore", help="optional ignore file", type=str) 12 | parser.add_argument("-w", "--warnings", action="store_true", help="Print warnings (and alerts)") 13 | parser.add_argument("-a", "--printall", action="store_true", help="Print everything (alerts, warnings and non-matches)") 14 | parser.add_argument("-e", "--extensions", 15 | help='''Comma-separated extensions, files that don't match any of these will raise a warning. 
16 | If empty, uses default list: %s ''' % DEFAULT_EXTENSIONS_FORMAT) 17 | parser.add_argument("-p", "--path", help="directory location, if this is not provided, the script will run on the working directory") 18 | parser.add_argument("-g", "--includegit", action="store_true", help="Includes .git/ folder in the search scope, disabled by default") 19 | args = parser.parse_args() 20 | 21 | print ('IMPORTANT: This script is going to use the filesystem.\n' 22 | 'Do not change any files in the directory while the script is running, have a coffee or ' 23 | 'something.\n') 24 | if not args.force: 25 | continue_ = raw_input("Do you want to continue? (y/n): ") 26 | if continue_.lower()!='y': 27 | sys.exit('Aborted by the user') 28 | 29 | folder_path = '.' if args.path is None else args.path 30 | ignore_path = args.ignore 31 | 32 | results_to_print = [ALERT] 33 | if args.warnings: 34 | results_to_print += [WARNING] 35 | elif args.printall: 36 | results_to_print += [WARNING, NOTHING] 37 | 38 | if args.extensions: 39 | allowed_extensions = re.compile(',\s*\.?').split(args.extensions.lower()) 40 | else: 41 | allowed_extensions = DEFAULT_EXTENSIONS 42 | 43 | print 'Allowed extensions: %s' % reduce(lambda x,y: x+', '+y, allowed_extensions) 44 | 45 | if ignore_path: 46 | print 'Using ignore file %s\n' % ignore_path 47 | else: 48 | print '\n' 49 | 50 | #Create an instance of folder checker, 51 | #this class will list files in all subdirectories, 52 | #then apply ignore file rules. 
It provides a generator 53 | #to check each file 54 | ignore_git = not args.includegit 55 | fc = FolderChecker(folder_path, allowed_extensions=allowed_extensions, 56 | ignore_git_folder=ignore_git, 57 | ignore_path=ignore_path) 58 | #Get the generator to traverse the folder structure 59 | file_traverser = fc.file_traverser() 60 | 61 | for result in file_traverser: 62 | if result.result_type in results_to_print: 63 | print result -------------------------------------------------------------------------------- /bin/check-repo: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from repo_scraper.constants.extensions import * 3 | from repo_scraper.constants.result import * 4 | from repo_scraper.GitChecker import GitChecker 5 | import argparse 6 | import re 7 | import sys 8 | 9 | parser = argparse.ArgumentParser() 10 | parser.add_argument("-f", "--force", action="store_true", help="Force execution, prevents confirmation prompt from appearing") 11 | parser.add_argument("-w", "--warnings", action="store_true", help="Print warnings (and alerts)") 12 | parser.add_argument("-a", "--printall", action="store_true", help="Print everything (alerts, warnings and non-matches)") 13 | parser.add_argument("-e", "--extensions", 14 | help='''Comma-separated extensions, files that don't match any of these will raise a warning. 15 | If empty, uses default list: %s ''' % DEFAULT_EXTENSIONS_FORMAT) 16 | parser.add_argument("-p", "--path", help="git repository location, if this is not provided, the script will run on the working directory") 17 | args = parser.parse_args() 18 | 19 | print ('IMPORTANT: This script is going to execute git commands.\n' 20 | 'Do not change any files or execute git commands in this project' 21 | ' while the script is running, have a coffee or ' 22 | 'something.\n') 23 | 24 | if not args.force: 25 | continue_ = raw_input("Do you want to continue? 
(y/n): ") 26 | if continue_.lower()!='y': 27 | sys.exit('Aborted by the user') 28 | 29 | results_to_print = [ALERT] 30 | if args.warnings: 31 | results_to_print += [WARNING] 32 | elif args.printall: 33 | results_to_print += [WARNING, NOTHING] 34 | 35 | if args.extensions: 36 | allowed_extensions = re.compile(',\s*\.?').split(args.extensions.lower()) 37 | else: 38 | allowed_extensions = DEFAULT_EXTENSIONS 39 | 40 | 41 | #Default path is working directory, change if user 42 | #specified a different one 43 | path = '.' if args.path is None else args.path 44 | 45 | print 'Allowed extensions: %s' % reduce(lambda x,y: x+', '+y, allowed_extensions) 46 | 47 | #Create an instance of git checker 48 | gc = GitChecker(allowed_extensions=allowed_extensions, git_dir=path) 49 | 50 | #Get the generator that will return one result per file modified in each 51 | #commit 52 | file_traverser = gc.file_traverser() 53 | 54 | for result in file_traverser: 55 | if result.result_type in results_to_print: 56 | print result 57 | 58 | print '\033[93mNote: you are on the first commit; to return to the latest one, run git checkout master.\033[0m' -------------------------------------------------------------------------------- /dummy-project/.mysterious_hidden_file: -------------------------------------------------------------------------------- 1 | PASS=this-should-not-be-public -------------------------------------------------------------------------------- /dummy-project/dangerous_file.json: -------------------------------------------------------------------------------- 1 | { 2 | "password": "super-secret-password" 3 | } -------------------------------------------------------------------------------- /dummy-project/data_loading/empy.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/dummy-project/data_loading/empy.py -------------------------------------------------------------------------------- /dummy-project/data_loading/file_with_password.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/dummy-project/data_loading/file_with_password.py -------------------------------------------------------------------------------- /dummy-project/exploratory_data_analysis/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "#Just an empty jupyter notebook, nothing to worry about..." 12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 2", 18 | "language": "python", 19 | "name": "python2" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 2 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython2", 31 | "version": "2.7.10" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 0 36 | } 37 | -------------------------------------------------------------------------------- /dummy-project/exploratory_data_analysis/Untitled.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "#Just an empty jupyter notebook, nothing to worry about..." 
12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 2", 18 | "language": "python", 19 | "name": "python2" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 2 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython2", 31 | "version": "2.7.10" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 0 36 | } 37 | -------------------------------------------------------------------------------- /dummy-project/ignore: -------------------------------------------------------------------------------- 1 | **/*.py -------------------------------------------------------------------------------- /dummy-project/models/training.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/dummy-project/models/training.py -------------------------------------------------------------------------------- /dummy-project/python_file_with_password.py: -------------------------------------------------------------------------------- 1 | password = 'qwerty' -------------------------------------------------------------------------------- /repo_scraper/DiffChecker.py: -------------------------------------------------------------------------------- 1 | from repo_scraper import matchers as m 2 | from repo_scraper.Result import * 3 | from repo_scraper import filetype 4 | 5 | class DiffChecker: 6 | def __init__(self, commit_hashes, filename, content, error, allowed_extensions): 7 | self.commit_hashes = commit_hashes 8 | self.filename = filename 9 | self.content = content 10 | self.error = error 11 | self.allowed_extensions = allowed_extensions 12 | def check(self): 13 | #Build the identifier using the filename and commit hashes 14 | identifier = '%s (%s)' % (self.filename, 
self.commit_hashes[1]) 15 | 16 | #The comments is a list to keep track of useful information 17 | #encountered when checking, right now, its only being used 18 | #to annotate when base64 code was removed 19 | comments = [] 20 | 21 | #Check the number of additions, if there are too many 22 | #send a warning and skip, this may be due to a big data file addition 23 | if self.error: 24 | return Result(self.filename, self.error) 25 | 26 | #Check if extension/mimetype is allowed 27 | if filetype.get_extension(self.filename) not in self.allowed_extensions: 28 | return Result(identifier, FILETYPE_NOT_ALLOWED) 29 | 30 | #Start applying rules... 31 | #First check if additions contain base64, if there is remove it 32 | has_base64, self.content = m.base64_matcher(self.content, remove=True) 33 | if has_base64: 34 | comments.append('BASE64_REMOVED') 35 | 36 | #Create matcher for amazonaws.com 37 | amazonaws_matcher = m.create_domain_matcher('amazonaws.com') 38 | #Apply matchers: password, ips and aws 39 | match, matches = m.multi_matcher(self.content, m.password_matcher, m.ip_matcher, amazonaws_matcher) 40 | 41 | if match: 42 | return Result(identifier, MATCH, matches=matches, comments=comments) 43 | else: 44 | return Result(identifier, NOT_MATCH, comments=comments) -------------------------------------------------------------------------------- /repo_scraper/FileChecker.py: -------------------------------------------------------------------------------- 1 | from repo_scraper import filetype 2 | from repo_scraper import matchers as m 3 | from repo_scraper.Result import * 4 | import os 5 | import re 6 | 7 | class FileChecker: 8 | def __init__(self, path, allowed_extensions, max_file_size_bytes=1048576): 9 | self.path = path 10 | self.max_file_size_bytes = max_file_size_bytes 11 | self.allowed_extensions = allowed_extensions 12 | def check(self): 13 | #The comments is a list to keep track of useful information 14 | #encountered when checking, right now, its only being used 15 | 
#to annotate when base64 code was removed 16 | comments = [] 17 | 18 | #Check file size if it's more than max_file_size_bytes (default is 1MB) 19 | #send just a warning and do not open the file, 20 | #since pattern matching is going to be really slow 21 | f_size = os.stat(self.path).st_size 22 | if f_size > self.max_file_size_bytes: 23 | return Result(self.path, BIG_FILE) 24 | 25 | #Check if extension is allowed 26 | if filetype.get_extension(self.path) not in self.allowed_extensions: 27 | return Result(self.path, FILETYPE_NOT_ALLOWED) 28 | 29 | #At this point you only have files with allowed extensions and 30 | #smaller than max_file_size_bytes 31 | #open the file and then apply all rules 32 | with open(self.path, 'r') as f: 33 | content = f.read() 34 | 35 | #Last check: search for potential base64 strings and remove them, send a warning 36 | has_base64, content = m.base64_matcher(content, remove=True) 37 | if has_base64: 38 | comments.append('BASE64_REMOVED') 39 | 40 | #Create matcher for amazonaws.com 41 | amazonaws_matcher = m.create_domain_matcher('amazonaws.com') 42 | #Apply matchers: password, ips and aws 43 | match, matches = m.multi_matcher(content, m.password_matcher, m.ip_matcher, amazonaws_matcher) 44 | 45 | if match: 46 | return Result(self.path, MATCH, matches=matches, comments=comments) 47 | else: 48 | return Result(self.path, NOT_MATCH, comments=comments) -------------------------------------------------------------------------------- /repo_scraper/FolderChecker.py: -------------------------------------------------------------------------------- 1 | #Creates a FileChecker instance for each file inside a directory 2 | #(and subdirectories) and applies matchers 3 | from repo_scraper import filesystem as fs 4 | from repo_scraper.FileChecker import FileChecker 5 | 6 | class FolderChecker: 7 | def __init__(self, folder_path, allowed_extensions, ignore_git_folder=True, ignore_path=None): 8 | #List all files in directory, apply ignore file if necessary 9 | 
self.filenames = fs.list_files_in(folder_path, ignore_git_folder=ignore_git_folder, ignore_file=ignore_path) 10 | self.allowed_extensions = allowed_extensions 11 | def file_traverser(self): 12 | for filename in self.filenames: 13 | yield FileChecker(filename, allowed_extensions=self.allowed_extensions).check() -------------------------------------------------------------------------------- /repo_scraper/Git.py: -------------------------------------------------------------------------------- 1 | from repo_scraper.constants.result import * 2 | from repo_scraper.constants.git_diff import * 3 | import subprocess 4 | import re 5 | 6 | class Git: 7 | def __init__(self, git_dir): 8 | self.git_dir = git_dir 9 | 10 | def list_commits(self): 11 | '''Returns a list with all commit hashes''' 12 | p = subprocess.Popen(['git', '-C', self.git_dir, 'log', '--pretty=format:%H'], 13 | stdout=subprocess.PIPE, 14 | stderr=subprocess.PIPE) 15 | out, err = p.communicate() 16 | 17 | #See comments on the function definition for details 18 | self.check_stderr(err) 19 | #Split by breakline to get a list and reverse the order 20 | #so the first commit comes first 21 | return out.replace('"', '').split('\n')[::-1] 22 | 23 | def diff_for_commit_to_commit(self, commit1_hash, commit2_hash): 24 | '''Retrieves diff for a specific commit and parses it 25 | to get file name and additions''' 26 | #Pass the -M flag to detect modified files and avoid redundancy 27 | p = subprocess.Popen(['git', '-C', self.git_dir, 'diff', '-M', commit1_hash, commit2_hash], 28 | stdout=subprocess.PIPE, 29 | stderr=subprocess.PIPE) 30 | out, err = p.communicate() 31 | 32 | #See comments on the function definition for details 33 | self.check_stderr(err) 34 | 35 | return self.parse_diff(out) 36 | 37 | def parse_diff(self, diff): 38 | '''Parse a diff output''' 39 | files = diff.split('diff --git ')[1:] 40 | files = [self.parse_file_diff(f) for f in files] 41 | return files 42 | 43 | 
#http://stackoverflow.com/questions/2529441/how-to-read-the-output-from-git-diff 44 | def parse_file_diff(self, diff): 45 | '''Parse a diff output''' 46 | lines = diff.split('\n') 47 | #File line is the first line here, with the following format: 48 | #a/file.txt b/file2.txt where file and file2 are different 49 | #only if the file was renamed, in which case the current name is file2 50 | #print 'L is %s matches %s' % (lines[0], re.compile('\s{1}.{1}/{1}(.*)').findall(lines[0])) 51 | #note, if the filename has unusual characters, it's going to appear with quotes 52 | filename = re.compile('\s{1}(?:\'|\")*./{1}(.*)(?:\'|\")*').findall(lines[0])[0] 53 | 54 | #If there are many lines in the diff file, the filter for additions 55 | #is going to break, check how many lines there are 56 | #print 'lines for content %s' % len(lines[1:]) 57 | if len(lines[1:]) > MAX_DIFF_LINES: 58 | return {'filename': filename, 'content': None, 'error': BIG_FILE} 59 | 60 | #Filter only the additions (lines that start with +) 61 | content = filter(lambda x: x.startswith('+'), lines[1:]) 62 | #Remove only the leading + sign on each line (strip('+') would also 63 | #remove + signs that belong to the content itself) 63 | content = [line[1:] for line in content] 64 | #Join the lines again 65 | content = reduce(lambda x,y:x+'\n'+y, content) if len(content) else '' 66 | #Threshold for the number of characters 67 | #print 'len is: %d' % len(content) 68 | if len(content) > MAX_DIFF_ADDITIONS_CHARACTERS: 69 | return {'filename': filename, 'content': None, 'error': BIG_FILE} 70 | else: 71 | return {'filename': filename, 'content': content, 'error': None} 72 | 73 | def checkout(self, commit_id): 74 | p = subprocess.Popen(['git', '-C', self.git_dir,'checkout', commit_id], stdout=subprocess.PIPE, stderr=subprocess.PIPE) 75 | out, err = p.communicate() 76 | #See comments on the function definition for details 77 | self.check_stderr(err) 78 | 79 | #git likes to abuse standard error: 80 | #https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=447395 81 | #I wrote this function to check 
for actual errors instead of 82 | #what git likes to send sometimes, actually the only error 83 | #I'm looking for right now is when the git command is ran 84 | #in a folder with no repository 85 | def check_stderr(self, err): 86 | if err.startswith('fatal'): 87 | raise Exception(err) 88 | 89 | -------------------------------------------------------------------------------- /repo_scraper/GitChecker.py: -------------------------------------------------------------------------------- 1 | from itertools import chain 2 | from repo_scraper.Git import Git 3 | from repo_scraper.DiffChecker import DiffChecker 4 | from repo_scraper.FolderChecker import FolderChecker 5 | import subprocess 6 | 7 | 8 | class GitChecker: 9 | def __init__(self, allowed_extensions, git_dir): 10 | self.allowed_extensions = allowed_extensions 11 | self.git_dir = git_dir 12 | self.git = Git(git_dir) 13 | def file_traverser(self): 14 | #Checkout master 15 | print 'git checkout master' 16 | self.git.checkout('master') 17 | 18 | #Get all commits in chronological order 19 | commits = self.git.list_commits() 20 | #Generate commit pairs (each commit with the previous one) 21 | commit_pairs = zip(commits[:-1], commits[1:]) 22 | 23 | #Go to the first commit 24 | print 'git checkout %s (first commit in master)' % commits[0] 25 | self.git.checkout(commits[0]) 26 | 27 | #Get generator to check the first commit 28 | fc = FolderChecker(folder_path=self.git_dir, 29 | allowed_extensions=self.allowed_extensions, 30 | ignore_git_folder=True) 31 | folder_file_traverser = fc.file_traverser() 32 | 33 | #Define a second generator that will traverse the repository 34 | def repo_generator(): 35 | for pair in commit_pairs: 36 | #print 'getting diff for %s %s' % pair 37 | files_diff = self.git.diff_for_commit_to_commit(*pair) 38 | for f in files_diff: 39 | #print 'gichecker: %s' % f['filename']+' in '+pair[1] 40 | yield DiffChecker(pair, f['filename'], f['content'], f['error'], self.allowed_extensions).check() 41 | 42 | 
repo_file_traverser = repo_generator() 43 | 44 | #Join both generators and return 45 | return chain(folder_file_traverser, repo_file_traverser) -------------------------------------------------------------------------------- /repo_scraper/Result.py: -------------------------------------------------------------------------------- 1 | from repo_scraper.constants.result import * 2 | 3 | dic = { 4 | BIG_FILE: WARNING, 5 | NOT_PLAIN_TEXT: WARNING, 6 | MATCH: ALERT, 7 | NOT_MATCH: NOTHING, 8 | FILETYPE_NOT_ALLOWED: WARNING 9 | } 10 | 11 | class Result: 12 | def __init__(self, identifier, reason, matches=None, comments=None): 13 | self.identifier = identifier 14 | self.reason = reason 15 | self.matches = matches 16 | self.result_type = dic[reason] 17 | self.comments = comments 18 | def __str__(self): 19 | #Message to print for matches is originally No matches, unless 20 | #self.matches has some values 21 | matches_print = 'No matches to show' 22 | if self.matches: 23 | #Create list of strings with 'index. content' format for each match 24 | matches_print = [str(idx+1)+'. 
'+content for idx,content in enumerate(self.matches)] 25 | #Join list 26 | matches_print = reduce(lambda x,y: x+'\n'+y, matches_print) 27 | 28 | return '%s - %s in %s\n%s\n' % (self.result_type, self.reason, self.identifier, matches_print) -------------------------------------------------------------------------------- /repo_scraper/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/repo_scraper/__init__.py -------------------------------------------------------------------------------- /repo_scraper/constants/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/repo_scraper/constants/__init__.py -------------------------------------------------------------------------------- /repo_scraper/constants/extensions.py: -------------------------------------------------------------------------------- 1 | #Add extensions here (lowercase) 2 | DEFAULT_EXTENSIONS = ["py", "ipynb", "json", "sql", "sh", "txt", "r", "md", "log", "yaml"] 3 | DEFAULT_EXTENSIONS_FORMAT = reduce(lambda x,y: x+', '+y, DEFAULT_EXTENSIONS) 4 | 5 | -------------------------------------------------------------------------------- /repo_scraper/constants/git_diff.py: -------------------------------------------------------------------------------- 1 | #Constants in git.py (used in git diff commands) 2 | #Maximum number of lines when doing git diff hash_a hash_b 3 | #that won't trigger a BIG_FILE result, note that this constant 4 | #includes all lines in the git diff output (additions, deletions) 5 | MAX_DIFF_LINES = 10000 6 | #Max number of characters in ADDITIONS that won't trigger a BIG_FILE result, 7 | #note that in this case, only new lines are taken into account 8 | MAX_DIFF_ADDITIONS_CHARACTERS = 1048576 
-------------------------------------------------------------------------------- /repo_scraper/constants/result.py: -------------------------------------------------------------------------------- 1 | #Result class constants with ANSI colors 2 | ALERT = '\033[91mAlert!\033[0m' 3 | WARNING = '\033[93mWarning\033[0m' 4 | NOTHING = '\033[92mCheck passed\033[0m' 5 | 6 | BIG_FILE = 'Big file found' 7 | NOT_PLAIN_TEXT = 'File is not plain text' 8 | MATCH = 'Match' 9 | NOT_MATCH = 'Nothing found' 10 | FILETYPE_NOT_ALLOWED = 'Extension not allowed' -------------------------------------------------------------------------------- /repo_scraper/filesystem.py: -------------------------------------------------------------------------------- 1 | #This file contains utility functions for working with the filesystem 2 | import os 3 | import mimetypes as mime 4 | import glob2 as glob 5 | 6 | def list_files_in(directory, ignore_git_folder, ignore_file): 7 | '''Receives a path to a directory and returns 8 | the set of paths to all files not excluded by the ignore rules''' 9 | file_list = [] 10 | for root, dirs, files in os.walk(directory): 11 | for file in files: 12 | file_list.append(os.path.join(root, file)) 13 | 14 | file_list = set(file_list) 15 | 16 | glob_rules = [] 17 | 18 | if ignore_git_folder: 19 | glob_rules.append('.git/**') 20 | 21 | #Check if ignore file was provided 22 | if ignore_file is not None: 23 | glob_rules += parse_ignore_file(ignore_file) 24 | 25 | if len(glob_rules): 26 | glob_matches = match_glob_rules_in_directory(glob_rules, directory) 27 | #Remove files in file_list that matched any glob rule 28 | file_list = file_list - glob_matches 29 | 30 | return file_list 31 | 32 | def match_glob_rules_in_directory(glob_rules, directory): 33 | #Prepend directory to each glob_rule 34 | glob_rules = [os.path.join(directory, rule) for rule in glob_rules] 35 | glob_matches = [glob.glob(rule) for rule in glob_rules] 36 | #Flatten matches 37 | glob_matches = reduce(lambda x,y: x+y, 
glob_matches) 38 | #Convert to a set to remove duplicates 39 | return set(glob_matches) 40 | 41 | def parse_ignore_file(ignore_file): 42 | #Open file 43 | with open(ignore_file, 'r') as f: 44 | lines = f.read().splitlines() 45 | #Filter out lines starting with # 46 | lines = filter(lambda line: not line.startswith('#'), lines) 47 | #Remove empty lines 48 | lines = filter(lambda line: len(line) > 0, lines) 49 | return lines -------------------------------------------------------------------------------- /repo_scraper/filetype.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | def get_extension(filename): 4 | try: 5 | return re.compile('.*\.(\S+)$').findall(filename)[0].lower() 6 | except: 7 | return None 8 | 9 | -------------------------------------------------------------------------------- /repo_scraper/matchers.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | def multi_matcher(s, *matchers): 4 | '''Receives matchers as parameters and applies all of them''' 5 | results = [m(s) for m in matchers] 6 | #Get the flag that indicates whether there was a match for each matcher 7 | has_match = [r[0] for r in results] 8 | #Check if there was at least one match (False when no matchers are given) 9 | at_least_one = any(has_match) 10 | #Get list of matches for each matcher, delete Nones 11 | list_of_lists = [r[1] for r in results if r[1] is not None] 12 | #Flatten list of matches, ignore None 13 | matches = [match for single_list in list_of_lists for match in single_list] 14 | #If the list is empty, return None 15 | matches = None if len(matches)==0 else matches 16 | return at_least_one, matches 17 | 18 | def base64_matcher(s, remove=False): 19 | regex = '(?:"|\')[A-Za-z0-9\\+\\\=\\/]{50,}(?:"|\')' 20 | base64images = re.compile(regex).findall(s) 21 | has_base64 = len(base64images) > 0 22 | if remove: 23 | return has_base64, re.sub(regex, '""', s) 24 | else: 25 | return has_base64 
26 | 27 | def password_matcher(s): 28 | #Case 1: hardcoded passwords assigned to variables (python, r, etc) 29 | #or values (json, csv, etc) 30 | #match variable names such as password, PASSWORD, pwd, pass, 31 | #SOMETHING_PASSWORD assigned to strings (match = and <-) 32 | 33 | #Matches p_w_d='something' and similar 34 | pwd = re.compile('(\S*\\\*(?:\'|\")*(?:p|P)\S*(?:w|W)\S*(?:d|D)\\\*(?:\'|\")*\s*(?:=|<-|:)\s*\\\*(?:\'|\").+\\\*(?:\'|\"))') 35 | #Matches pass='something' and similar 36 | pass_ = re.compile('(\S*\\\*(?:\'|\")*(?:pass|PASS)\S*\\\*(?:\'|\")*\s*(?:=|<-|:)\s*\\\*(?:\'|\").+\\\*(?:\'|\"))') 37 | 38 | #Case 2: URLs (e.g. SQLAlchemy engines) 39 | #http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html 40 | #Note that validating URLs is really hard... 41 | #http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python 42 | urls = re.compile('\\\*(?:\'|\")*[a-zA-Z0-9-_]+://[a-zA-Z0-9-_]+:[a-zA-Z0-9-_]+@[a-zA-Z0-9-_]+:[a-zA-Z0-9-_]+/[a-zA-Z0-9-_]+\\\*(?:\'|\")*') 43 | 44 | #Case 3: Passwords in bash files (bash, psql, etc) bash parameters 45 | 46 | #Case 5: Pgpass 47 | #http://www.postgresql.org/docs/9.3/static/libpq-pgpass.html 48 | 49 | #what about case 1 without quotes? 
50 | 51 | #passwords assigned to variables whose names are not similar to pwd 52 | #but the string looks like a password 53 | regex_list = [pwd, pass_, urls] 54 | matches = regex_matcher(regex_list, s) 55 | has_password = len(matches) > 0 56 | matches = None if has_password is False else matches 57 | return has_password, matches 58 | 59 | #Checks if a string has ips 60 | #Matching IPs with regex is a thing: 61 | #http://stackoverflow.com/questions/10086572/ip-address-validation-in-python-using-regex 62 | def ip_matcher(s): 63 | ips = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", s) 64 | #Remove obvious non-dangerous matches 65 | allowed_ips = ['127.0.0.1', '0.0.0.0'] 66 | ips = [ip for ip in ips if ip not in allowed_ips] 67 | if len(ips): 68 | return True, ips 69 | else: 70 | return False, None 71 | 72 | def create_domain_matcher(domain): 73 | '''Returns a function that serves as a matcher for a given domain''' 74 | def domain_matcher(s): 75 | regex = '\S+\.'+domain.replace('.', '\.') 76 | matches = re.findall(regex, s) 77 | if len(matches): 78 | return True, matches 79 | else: 80 | return False, None 81 | return domain_matcher 82 | 83 | 84 | def regex_matcher(regex_list, s): 85 | '''Takes a list of regexes and returns all matches, removing duplicates 86 | in case more than one regex matches the same pattern (pattern location 87 | is taken into account to determine whether two matches are the same).''' 88 | #Find matches and position for each regex 89 | results = [match_with_position(regex, s) for regex in regex_list] 90 | #Flatten list 91 | results = reduce(lambda x,y: x+y, results) 92 | #Convert to set to remove duplicates 93 | results = set(results) 94 | #Extract matches only (without position) 95 | results = [res[1] for res in results] 96 | return results 97 | 98 | def match_with_position(regex, s): 99 | '''Returns a list of tuples (pos, match) for each match.''' 100 | return [(m.start(), m.group()) for m in regex.finditer(s)] 101 | 
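`regex_matcher` in `matchers.py` removes duplicates by comparing `(position, text)` pairs, so two regexes that hit the same span count once. The listing relies on the Python 2 builtin `reduce`; the sketch below shows one way the same idea might be written for Python 3. It deviates from the listing in two labeled ways: `reduce` gets an initial value so an empty regex list does not raise, and results are sorted by position, whereas the original returns them in set-iteration order.

```python
import re
from functools import reduce  # builtin in Python 2, needs an import in Python 3


def match_with_position(regex, s):
    """Return a list of (start, text) tuples, one per match."""
    return [(m.start(), m.group()) for m in regex.finditer(s)]


def regex_matcher(regex_list, s):
    """Collect matches from every regex, dropping duplicates that share
    both position and text."""
    results = [match_with_position(regex, s) for regex in regex_list]
    # Deviation from the listing: the initial value [] keeps reduce from
    # failing when regex_list is empty
    results = reduce(lambda x, y: x + y, results, [])
    # Deviation from the listing: sorting by position makes the order
    # deterministic instead of depending on set iteration
    return [text for _, text in sorted(set(results))]


# Illustrative patterns only; they are far simpler than the project's regexes
sample = 'pwd="a" and pass="b"'
patterns = [re.compile(r'pwd="\w+"'), re.compile(r'p\w+="\w+"')]
matches = regex_matcher(patterns, sample)
```

Both patterns match `pwd="a"` at position 0, so it is reported once; only the second pattern matches `pass="b"`.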
-------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | glob2==0.4.1 2 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup( 4 | name='repo-scraper', 5 | version='0.1', 6 | description='Search for potential passwords/data leaks in a folder or git repo', 7 | url='https://github.com/dssg/repo-scraper', 8 | author='Eduardo Blancas Reyes', 9 | author_email='edu.blancas@gmail.com', 10 | license='MIT', 11 | packages=['repo_scraper'], 12 | scripts=['bin/check-dir', 'bin/check-repo'], 13 | test_suite='nose.collector', 14 | tests_require=['nose'], 15 | zip_safe=False 16 | ) -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_FileChecker.py: -------------------------------------------------------------------------------- 1 | from repo_scraper.constants.result import * 2 | from unittest import TestCase 3 | from repo_scraper.FileChecker import FileChecker 4 | import os 5 | 6 | module_path = os.path.dirname(os.path.abspath(__file__)) 7 | dummy_repo_path = os.path.join(module_path, '..', 'dummy-project') 8 | 9 | class TestFileChecker(TestCase): 10 | def test_json_file_with_password(self): 11 | pass 12 | def test_plain_text_file_with_password(self): 13 | pass 14 | def test_python_file_with_password(self): 15 | path = os.path.join(dummy_repo_path, 'python_file_with_password.py') 16 | r = FileChecker(path, allowed_extensions=['py']).check() 17 | 
self.assertEqual(r.result_type, ALERT) 18 | def test_hidden_file_with_password(self): 19 | pass -------------------------------------------------------------------------------- /tests/test_filesystem.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/test_filesystem.py -------------------------------------------------------------------------------- /tests/test_ip_check.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/test_ip_check.py -------------------------------------------------------------------------------- /tests/test_password_check.py: -------------------------------------------------------------------------------- 1 | from unittest import TestCase 2 | from repo_scraper import matchers 3 | 4 | #what if it's not quoted? what if is spanned across more than one line 5 | #is this reasonable to test? 
p=some-password 6 | 7 | class HardcodedPasswordString(TestCase): 8 | def test_detects_easy_password(self): 9 | str_to_check = 'password="123456"' 10 | password_matcher, matches = matchers.password_matcher(str_to_check) 11 | self.assertTrue(password_matcher) 12 | self.assertEqual(matches, [str_to_check]) 13 | 14 | def test_detects_easy_password_single_quotes(self): 15 | str_to_check = 'password=\'123456\'' 16 | password_matcher, matches = matchers.password_matcher(str_to_check) 17 | self.assertTrue(password_matcher) 18 | self.assertEqual(matches, [str_to_check]) 19 | 20 | def test_detects_easy_password_spaces(self): 21 | str_to_check = 'password = "123456"' 22 | password_matcher, matches = matchers.password_matcher(str_to_check) 23 | self.assertTrue(password_matcher) 24 | self.assertEqual(matches, [str_to_check]) 25 | 26 | def test_detects_easy_password_linebreaks(self): 27 | password_matcher, matches = matchers.password_matcher('password ="123456"\n') 28 | self.assertTrue(password_matcher) 29 | self.assertEqual(matches, ['password ="123456"']) 30 | 31 | def test_detects_easy_password_in_R(self): 32 | str_to_check = 'password<- "123456"' 33 | password_matcher, matches = matchers.password_matcher(str_to_check) 34 | self.assertTrue(password_matcher) 35 | self.assertEqual(matches, [str_to_check]) 36 | 37 | def test_detects_pwd(self): 38 | str_to_check = 'pwd="123456"' 39 | password_matcher, matches = matchers.password_matcher(str_to_check) 40 | self.assertTrue(password_matcher) 41 | self.assertEqual(matches, [str_to_check]) 42 | 43 | def test_detects_password_with_prefix(self): 44 | str_to_check = 'POSTGRES_PASSWORD=\'iYiLKi7879\'' 45 | password_matcher, matches = matchers.password_matcher(str_to_check) 46 | self.assertTrue(password_matcher) 47 | self.assertEqual(matches, [str_to_check]) 48 | 49 | def test_detects_password_with_suffix(self): 50 | str_to_check = 'PASSWORD_MYSQL=\'iYiLKi7879\'' 51 | password_matcher, matches = matchers.password_matcher(str_to_check) 52 | 
self.assertTrue(password_matcher) 53 | self.assertEqual(matches, [str_to_check]) 54 | 55 | def test_detects_multiple_passwords(self): 56 | str_to_check = 'PASSWORD_MYSQL=\'iYiLKi7879\' \n \n password ="123456"\n var=5' 57 | password_matcher, matches = matchers.password_matcher(str_to_check) 58 | self.assertTrue(password_matcher) 59 | self.assertEqual(matches, ['password ="123456"', 'PASSWORD_MYSQL=\'iYiLKi7879\'']) 60 | 61 | def test_ignores_password_from_another_variable(self): 62 | str_to_check = 'password=variable' 63 | password_matcher, matches = matchers.password_matcher(str_to_check) 64 | self.assertFalse(password_matcher) 65 | self.assertEqual(matches, None) 66 | 67 | def test_ignores_pwd_from_another_variable(self): 68 | str_to_check = 'pwd=variable' 69 | password_matcher, matches = matchers.password_matcher(str_to_check) 70 | self.assertFalse(password_matcher) 71 | self.assertEqual(matches, None) 72 | 73 | def test_ignores_password_from_another_variable_with_blanks(self): 74 | password_matcher, matches = matchers.password_matcher('pwd =variable\n') 75 | self.assertFalse(password_matcher) 76 | self.assertEqual(matches, None) 77 | 78 | class HardcodedURLs(TestCase): 79 | def test_detects_sqlalchemy_engine(self): 80 | str_to_check = 'db-schema://user:strong-pwd@localhost:5432/mydb' 81 | password_matcher, matches = matchers.password_matcher(str_to_check) 82 | self.assertTrue(password_matcher) 83 | self.assertEqual(matches, [str_to_check]) 84 | 85 | def test_detects_sqlalchemy_engine_different_settings(self): 86 | str_to_check = 'another-schema://user2:1234@localhost:0000/awesome-db' 87 | password_matcher, matches = matchers.password_matcher(str_to_check) 88 | self.assertTrue(password_matcher) 89 | self.assertEqual(matches, [str_to_check]) 90 | 91 | def test_detects_sqlalchemy_quoted(self): 92 | str_to_check = '\'db-schema://user:strong-pwd@localhost:5432/mydb\'' 93 | password_matcher, matches = matchers.password_matcher(str_to_check) 94 | 
self.assertTrue(password_matcher) 95 | self.assertEqual(matches, [str_to_check]) 96 | 97 | def test_detects_sqlalchemy_double_quoted(self): 98 | str_to_check = '"db-schema://user:strong-pwd@localhost:5432/mydb"' 99 | password_matcher, matches = matchers.password_matcher(str_to_check) 100 | self.assertTrue(password_matcher) 101 | self.assertEqual(matches, [str_to_check]) 102 | 103 | class HardcodedPasswordsInJSON(TestCase): 104 | def test_detects_hardcoded_value_json(self): 105 | str_to_check = '''{ 106 | "password":"super-secret-password" \n\n\t 107 | }''' 108 | password_matcher, matches = matchers.password_matcher(str_to_check) 109 | self.assertTrue(password_matcher) 110 | self.assertEqual(matches, ['"password":"super-secret-password"']) 111 | 112 | def test_detects_hardcoded_value_json_single_quotes(self): 113 | str_to_check = '''{ 114 | \'password\': \'super-secret-password\' \n\n\t 115 | }''' 116 | password_matcher, matches = matchers.password_matcher(str_to_check) 117 | self.assertTrue(password_matcher) 118 | self.assertEqual(matches, ['\'password\': \'super-secret-password\'']) 119 | 120 | def test_detects_hardcoded_value_json_multiple_keys(self): 121 | str_to_check = '''{ 122 | "pass": "dont-hack-me-please", 123 | "key": "1234" 124 | }''' 125 | password_matcher, matches = matchers.password_matcher(str_to_check) 126 | self.assertTrue(password_matcher) 127 | self.assertEqual(matches, ['"pass": "dont-hack-me-please"']) 128 | 129 | def test_detects_hardcoded_value_json_multiple_passwords(self): 130 | str_to_check = '''{ 131 | "pass": "dont-hack-me-please", 132 | "key": "1234", 133 | \'pwd\' : \'qwerty\', 134 | }''' 135 | password_matcher, matches = matchers.password_matcher(str_to_check) 136 | self.assertTrue(password_matcher) 137 | self.assertEqual(matches, ['\'pwd\' : \'qwerty\'', '"pass": "dont-hack-me-please"']) 138 | 139 | def test_detects_hardcoded_value_json_blanks(self): 140 | str_to_check = '''{ 141 | " pass" : "dont-hack-me-please" \n\n\t 142 | }''' 
143 | password_matcher, matches = matchers.password_matcher(str_to_check) 144 | #self.assertTrue(password_matcher) 145 | #self.assertEqual(matches, ['" pass" : "dont-hack-me-please"']) 146 | def test_ignores_json_without_passwords(self): 147 | str_to_check = '''{ 148 | "some_key": "this is not a password", 149 | "another_key": 100-12301-123, 150 | }''' 151 | password_matcher, matches = matchers.password_matcher(str_to_check) 152 | self.assertFalse(password_matcher) 153 | self.assertEqual(matches, None) 154 | def test_detects_url_in_json_file(self): 155 | str_to_check = '''{ 156 | "engine": "db-schema://user:strong-pwd@localhost:5432/mydb", 157 | "key": "1234", 158 | }''' 159 | password_matcher, matches = matchers.password_matcher(str_to_check) 160 | self.assertTrue(password_matcher) 161 | self.assertEqual(matches, ['"db-schema://user:strong-pwd@localhost:5432/mydb"']) 162 | 163 | class HardcodedPasswordsInYAML(TestCase): 164 | def test_detects_hardcoded_double_quotes(self): 165 | str_to_check = ''' 166 | database: 167 | drivername: "dbdriver" 168 | host: "dbhost" 169 | port: "port" 170 | username: "username" 171 | password: "password" 172 | database: "database" 173 | ''' 174 | password_matcher, matches = matchers.password_matcher(str_to_check) 175 | self.assertTrue(password_matcher) 176 | self.assertEqual(matches, ['password: "password"']) 177 | def test_detects_hardcoded_with_a_double_quote(self): 178 | str_to_check = ''' 179 | db: 180 | host: 'host' 181 | user: 'user' 182 | password: '"klu89oinlk' 183 | database: 'a_db' 184 | 185 | ''' 186 | password_matcher, matches = matchers.password_matcher(str_to_check) 187 | self.assertTrue(password_matcher) 188 | self.assertEqual(matches, ["password: '\"klu89oinlk'"]) 189 | def test_detects_hardcoded_with_a_single_quote(self): 190 | str_to_check = ''' 191 | database: 192 | drivername: "dbdriver" 193 | host: "dbhost" 194 | port: "port" 195 | username: "username" 196 | password: "thispwdhasthis'" 197 | database: "database" 
198 | ''' 199 | password_matcher, matches = matchers.password_matcher(str_to_check) 200 | self.assertTrue(password_matcher) 201 | self.assertEqual(matches, ['password: "thispwdhasthis\'"']) 202 | 203 | class HardcodedPasswordsInCSV(TestCase): 204 | def test_detects_hardcoded_value_csv(self): 205 | #str_to_check = '''password, qwerty''' 206 | #password_matcher, matches = matchers.password_matcher(str_to_check) 207 | #self.assertTrue(password_matcher) 208 | #self.assertEqual(matches, ['password, qwerty']) 209 | pass 210 | 211 | class HardcodedPasswordsInGenericPlainText(TestCase): 212 | pass -------------------------------------------------------------------------------- /tests/test_url_check.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/test_url_check.py --------------------------------------------------------------------------------
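`tests/test_ip_check.py` appears with no content in this listing. As a rough sketch of what tests for `ip_matcher` might look like, the example below copies the function inline from `matchers.py` so that it runs on its own; the test class and method names are hypothetical, and in the repository the function would presumably be imported from `repo_scraper.matchers` instead.

```python
import re
import unittest


def ip_matcher(s):
    # Inline copy of the matcher shown in matchers.py
    ips = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", s)
    # Remove obvious non-dangerous matches
    allowed_ips = ['127.0.0.1', '0.0.0.0']
    ips = [ip for ip in ips if ip not in allowed_ips]
    if len(ips):
        return True, ips
    else:
        return False, None


class TestIPMatcher(unittest.TestCase):
    def test_detects_ip(self):
        has_ip, matches = ip_matcher('host = "10.1.2.3"')
        self.assertTrue(has_ip)
        self.assertEqual(matches, ['10.1.2.3'])

    def test_ignores_allowed_ips(self):
        has_ip, matches = ip_matcher('host = "127.0.0.1"')
        self.assertFalse(has_ip)
        self.assertEqual(matches, None)

    def test_does_not_validate_octet_range(self):
        # The pattern only checks digit counts, so out-of-range octets
        # are still reported as matches
        has_ip, matches = ip_matcher('999.999.999.999')
        self.assertTrue(has_ip)
        self.assertEqual(matches, ['999.999.999.999'])
```

The last case documents a limitation rather than a desired behavior: the regex does not check that each octet is at most 255.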