├── .gitignore
├── LICENSE
├── NOTES.md
├── README.md
├── bin
    ├── check-dir
    └── check-repo
├── dummy-project
    ├── .mysterious_hidden_file
    ├── dangerous_file.json
    ├── data_loading
    │   ├── empy.py
    │   └── file_with_password.py
    ├── exploratory_data_analysis
    │   ├── .ipynb_checkpoints
    │   │   └── Untitled-checkpoint.ipynb
    │   └── Untitled.ipynb
    ├── ignore
    ├── models
    │   └── training.py
    └── python_file_with_password.py
├── repo_scraper
    ├── DiffChecker.py
    ├── FileChecker.py
    ├── FolderChecker.py
    ├── Git.py
    ├── GitChecker.py
    ├── Result.py
    ├── __init__.py
    ├── constants
    │   ├── __init__.py
    │   ├── extensions.py
    │   ├── git_diff.py
    │   └── result.py
    ├── filesystem.py
    ├── filetype.py
    └── matchers.py
├── requirements.txt
├── setup.py
└── tests
    ├── __init__.py
    ├── test_FileChecker.py
    ├── test_filesystem.py
    ├── test_ip_check.py
    ├── test_password_check.py
    └── test_url_check.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | #### joe made this: https://goel.io/joe
 2 | 
 3 | #####=== Python ===#####
 4 | 
 5 | # Byte-compiled / optimized / DLL files
 6 | __pycache__/
 7 | *.py[cod]
 8 | 
 9 | # C extensions
10 | *.so
11 | 
12 | # Distribution / packaging
13 | .Python
14 | env/
15 | build/
16 | develop-eggs/
17 | dist/
18 | downloads/
19 | eggs/
20 | lib/
21 | lib64/
22 | parts/
23 | sdist/
24 | var/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | 
29 | # PyInstaller
30 | #  Usually these files are written by a python script from a template
31 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 | 
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 | 
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .coverage
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | 
47 | # Translations
48 | *.mo
49 | *.pot
50 | 
51 | # Django stuff:
52 | *.log
53 | 
54 | # Sphinx documentation
55 | docs/_build/
56 | 
57 | # PyBuilder
58 | target/
59 | 
60 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2015 Edu
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
23 | 


--------------------------------------------------------------------------------
/NOTES.md:
--------------------------------------------------------------------------------
 1 | #Notes
 2 | 
 3 | I found some interesting facts when developing this, here they are.
 4 | 
 5 | ##Git and standard error
 6 | 
 7 | Git (even in version 2.6.0) does not get along with the [standard error](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=447395) and sends all kinds of unpredictable output to it. Since `check-repo` executes git commands (via `subprocess`), it needs to handle git errors so the user knows what happened. Since just checking whether stderr is non-empty is not enough, a naive helper function checks whether the standard error output starts with the keyword 'fatal' and, if so, raises a Python Exception.
 8 | 
 9 | Right now, it works. The most common error is running git commands in a directory that does not contain a git repository; hopefully the check will also catch other 'fatal' errors.
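The check described above can be sketched as follows. This is a minimal illustration rather than the library's exact code: `run_git` and `raise_on_fatal` are hypothetical names, and the sketch assumes that git reports fatal errors on standard error with a `fatal` prefix.

```python
import subprocess


def raise_on_fatal(err):
    """Raise if git's standard error output reports a fatal error.

    Git also writes harmless messages to stderr, so a non-empty stderr
    alone is not treated as a failure here.
    """
    if err.startswith('fatal'):
        raise RuntimeError(err.strip())


def run_git(git_dir, *args):
    """Run a git command inside git_dir and return its standard output.

    Hypothetical helper, for illustration only.
    """
    p = subprocess.Popen(['git', '-C', git_dir] + list(args),
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         universal_newlines=True)
    out, err = p.communicate()
    raise_on_fatal(err)
    return out
```

For example, `run_git('.', 'log', '--pretty=format:%H')` would return the commit hashes of the repository in the working directory, or raise if that directory is not a git repository.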
10 | 
11 | ##On checking git history
12 | 
13 | As mentioned in the [README](/README.md), checking only the differences between commits is faster than checking out each commit. To get those differences, the library relies on `git diff`, which introduces a couple of limitations and design caveats, explained below.
14 | 
15 | ###`git diff` and extensions
16 | 
17 | ###`git diff` and file size
18 | 
19 | 
20 | ###`git diff`  and binary files
21 | 
22 | By default, `git diff` does not show changes in binary files. That makes things easier since we do not need to check if the file we are dealing with is binary before applying the regular expressions. However, what git interprets as a binary file depends on your configuration. That being said, `check-repo` will not look at binary files and will only print a warning.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | #repo_scraper
 2 | 
 3 | Check your projects for possible password (or other sensitive data) leaks.
 4 | 
 5 | The library exposes two commands:
 6 | * `check-dir` - Performs checks on a folder and subdirectories
 7 | * `check-repo` - Performs a check in a git repository
 8 | 
 9 | Both scripts work almost the same from the user's point of view. Run `check-dir --help` or `check-repo --help` for more details.
10 | 
11 | ##Example
12 | 
13 | Check your dummy-project:
14 | ```bash
15 | check-dir dummy-project
16 | ```
17 | 
18 | Output:
19 | ```bash
20 | Checking folder dummy-project...
21 | 
22 | dummy-project/python_file_with_password.py
23 | ALERT - MATCH ["password = 'qwerty'"]
24 | 
25 | dummy-project/dangerous_file.json
26 | ALERT - MATCH ['"password": "super-secret-password"']
27 | ```
28 | 
29 | ##How does it work?
30 | 
31 | Briefly speaking, `check-dir` lists all files below a folder and applies regular expressions to look for passwords/IPs. Since a blind search could take forever (for example, if the repo contains a 50MB CSV file), some filters are applied before the regular expressions are matched:
32 | 
33 | * **File size** - If the file is bigger than 1MB, ignore it but print a warning
34 | * **Extension** - If the extension is not allowed, ignore the file but print a warning. (See [NOTES](NOTES.md) to know why extension is used instead of mimetype)
35 | * **Base64** - If the file contains Base64 data, remove it. Many plain-text formats (such as Jupyter notebooks) embed data in Base64 format, and applying regular expressions to such data would take far too long
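The three filters above can be sketched as a single pre-check. This is only an illustration under stated assumptions: `prefilter`, `ALLOWED_EXTENSIONS` and `BASE64_RUN` are hypothetical names, the extension list is made up, and the Base64 heuristic (any run of 200 or more Base64 characters) is an assumption, not the library's actual matcher.

```python
import os
import re

MAX_FILE_SIZE_BYTES = 1048576  # 1MB, the threshold mentioned above
ALLOWED_EXTENSIONS = {'py', 'json', 'txt'}  # hypothetical list for this sketch
# Assumption: treat any run of 200+ Base64 characters as embedded data
BASE64_RUN = re.compile(r'[A-Za-z0-9+/=]{200,}')


def prefilter(path):
    """Return (text, warning); text is None when the file is skipped."""
    # Filter 1: skip big files without opening them
    if os.path.getsize(path) > MAX_FILE_SIZE_BYTES:
        return None, 'BIG_FILE'
    # Filter 2: skip files whose extension is not in the allowed list
    extension = os.path.splitext(path)[1].lstrip('.').lower()
    if extension not in ALLOWED_EXTENSIONS:
        return None, 'FILETYPE_NOT_ALLOWED'
    with open(path, 'r') as f:
        content = f.read()
    # Filter 3: drop long Base64-looking runs before any other regex runs
    return BASE64_RUN.sub('', content), None
```

The size check comes first so that big files are never read into memory.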
36 | 
37 | `check-repo` works in a slightly different way. One obvious way to check git history is to check out each commit and apply `check-dir`, but that approach would be really slow since the script would check the same files many times. Instead, `check-repo` checks out the first commit, runs `check-dir` there, and then moves up one commit at a time, using `git diff` to get only the difference between each consecutive pair of commits.
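The pairing of consecutive commits can be sketched like this. The helper name `consecutive_pairs` and the shortened hashes are hypothetical; the sketch assumes the commit list is ordered from the first commit to the last.

```python
def consecutive_pairs(commits):
    """Pair each commit with the one that follows it (oldest first)."""
    return list(zip(commits[:-1], commits[1:]))


# Hypothetical, shortened hashes ordered from the first commit to the last
commits = ['a1', 'b2', 'c3', 'd4']

# The first commit is scanned in full; every later commit is covered by
# diffing it against its predecessor.
first_commit = commits[0]
pairs = consecutive_pairs(commits)
print(first_commit)  # a1
print(pairs)  # [('a1', 'b2'), ('b2', 'c3'), ('c3', 'd4')]
```

With this scheme each change is inspected once, in the diff that introduced it, instead of once per commit in which the file exists.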
38 | 
39 | As in `check-dir`, the script applies some filters before applying regular expressions to avoid getting stuck on big files. Note that in this case we are not dealing with files but with the `git diff` output, which prevents us from checking file size directly:
40 | 
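41 | * **Number of lines** - If the diff for a file has more than `MAX_DIFF_LINES` lines, ignore it but print a warning
42 | * **Number of characters** - If the additions in the diff for a file have more than `MAX_DIFF_ADDITIONS_CHARACTERS` characters, ignore them but print a warning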
43 | * **Extension** - If extension is not allowed, ignore file but print a warning. (See [NOTES](NOTES.md) to know why extension is used instead of mimetype)
44 | * **Base64** - Remove Base64 code.
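The size filters on `git diff` output can be sketched as follows. This is a simplified illustration: `additions_from_diff` is a hypothetical name and both thresholds are made-up values, not the ones used by the library. It also skips the `+++ b/<file>` header line, which is a choice made for this sketch only.

```python
MAX_LINES = 1000        # hypothetical threshold
MAX_CHARACTERS = 10000  # hypothetical threshold


def additions_from_diff(file_diff):
    """Return the added text of a single-file diff, or None if too big."""
    lines = file_diff.split('\n')
    # Too many lines: probably a big data file, skip it
    if len(lines) > MAX_LINES:
        return None
    # Keep added lines only and drop their leading '+' marker;
    # the '+++ b/<file>' header line is not an addition
    added = [line[1:] for line in lines
             if line.startswith('+') and not line.startswith('+++ ')]
    content = '\n'.join(added)
    # Too many characters: a few very long lines can still be expensive
    if len(content) > MAX_CHARACTERS:
        return None
    return content
```

Only the returned additions would then be passed to the regular expressions, since removed lines were already checked when they were added in an earlier commit.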
45 | 
46 | The project has some limitations. See the [NOTES](NOTES.md) file for information about the design of the project and how it limits what the library is able to detect.
47 | 
48 | ##Installation
49 | 
50 | ```bash
51 |     pip install git+git://github.com/dssg/repo-scraper.git -r requirements.txt
52 | ```
53 | 
54 | ##Dependencies
55 | 
56 | * glob2
57 | * nose (optional, for running tests)
58 | 
59 | ##Tested with
60 | * Python 2.7.10
61 | * Git 2.6.0
62 | 
63 | ##Usage
64 | 
65 | ```bash
66 |     cd path/to/your/project
67 |     check-dir
68 | ```
69 | 
70 | See help for more options available:
71 | 
72 | ```bash
73 |     check-dir --help
74 | ```
75 | 
76 | ###Using an IGNORE file with check-dir
77 | 
78 | Just as with git, you can specify a file to make the program ignore some files/folders. This is especially useful when you have a folder with many log files that you are sure do not contain sensitive data. The library assumes one glob rule per line.
79 | 
80 | Adding an IGNORE file will make execution faster, since ignored files are never matched against the regular expressions.
81 | 
82 | **Important**: Even though the format is very similar, you cannot use the same rules as in your [.gitignore](https://git-scm.com/docs/gitignore) file. For more details, see [this](https://en.wikipedia.org/wiki/Glob_(programming)).
83 | 
84 | ##What's done
85 | 
86 | * Passwords (using regex). See [`test_password_check.py`](tests/test_password_check.py)
87 | * IPs
88 | * URLs on amazonaws.com (it's simple to add more domains if needed)
89 | 
90 | ##What's missing
91 | 
92 | * URLs
93 | * Check other branches apart from master
94 | 
95 | #TODO
96 | * Come up with a cool name
97 | 


--------------------------------------------------------------------------------
/bin/check-dir:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | from repo_scraper.constants.extensions import *
 3 | from repo_scraper.constants.result import *
 4 | from repo_scraper.FolderChecker import FolderChecker
 5 | import argparse
 6 | import re
 7 | import sys
 8 | 
 9 | parser = argparse.ArgumentParser()
10 | parser.add_argument("-f", "--force", action="store_true", help="Force execution, prevents confirmation prompt from appearing")
11 | parser.add_argument("-i", "--ignore", help="optional ignore file", type=str)
12 | parser.add_argument("-w", "--warnings", action="store_true", help="Print warnings (and alerts)")
13 | parser.add_argument("-a", "--printall", action="store_true", help="Print everything (alerts, warnings and non-matches)")
14 | parser.add_argument("-e", "--extensions",
15 |     help='''Comma-separated extensions, files that don't match any of these will raise a warning.
16 |          If empty, uses default list: %s ''' % DEFAULT_EXTENSIONS_FORMAT)
17 | parser.add_argument("-p", "--path", help="directory location, if this is not provided, the script will run on the working directory")
18 | parser.add_argument("-g", "--includegit", action="store_true", help="Includes .git/ folder in the search scope, disabled by default")
19 | args = parser.parse_args()
20 | 
21 | print ('IMPORTANT: This script is going to use the filesystem.\n'
22 |         'Do not change any files in the directory while the script is running, have a coffee or '
23 |         'something.\n')
24 | if not args.force:
25 |     continue_ = raw_input("Do you want to continue? (y/n): ")
26 |     if continue_.lower()!='y':
27 |         sys.exit('Aborted by the user')
28 | 
29 | folder_path = '.' if args.path is None else args.path
30 | ignore_path = args.ignore
31 | 
32 | results_to_print = [ALERT]
33 | if args.printall:
34 |     results_to_print += [WARNING, NOTHING]
35 | elif args.warnings:
36 |     results_to_print += [WARNING]
37 | 
38 | if args.extensions:
39 |     allowed_extensions = re.compile(',\s*\.?').split(args.extensions.lower())
40 | else:
41 |     allowed_extensions = DEFAULT_EXTENSIONS
42 | 
43 | print 'Allowed extensions: %s' %  reduce(lambda x,y: x+', '+y, allowed_extensions)
44 | 
45 | if ignore_path:
46 |     print 'Using ignore file %s\n' % ignore_path
47 | else:
48 |     print '\n'
49 | 
50 | #Create an instance of folder checker,
51 | #this class will list files in all subdirectories,
52 | #then apply ignore file rules. It provides a generator
53 | #to check each file
54 | ignore_git = not args.includegit
55 | fc = FolderChecker(folder_path, allowed_extensions=allowed_extensions,
56 |                                 ignore_git_folder=ignore_git,
57 |                                 ignore_path=ignore_path)
58 | #Get the generator to traverse the folder structure
59 | file_traverser = fc.file_traverser()
60 | 
61 | for result in file_traverser:
62 |     if result.result_type in results_to_print:
63 |         print result


--------------------------------------------------------------------------------
/bin/check-repo:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | from repo_scraper.constants.extensions import *
 3 | from repo_scraper.constants.result import *
 4 | from repo_scraper.GitChecker import GitChecker 
 5 | import argparse
 6 | import re
 7 | import sys
 8 | 
 9 | parser = argparse.ArgumentParser()
10 | parser.add_argument("-f", "--force", action="store_true", help="Force execution, prevents confirmation prompt from appearing")
11 | parser.add_argument("-w", "--warnings", action="store_true", help="Print warnings (and alerts)")
12 | parser.add_argument("-a", "--printall", action="store_true", help="Print everything (alerts, warnings and non-matches)")
13 | parser.add_argument("-e", "--extensions",
14 |     help='''Comma-separated extensions, files that don't match any of these will raise a warning.
15 |          If empty, uses default list: %s ''' % DEFAULT_EXTENSIONS_FORMAT)
16 | parser.add_argument("-p", "--path", help="git repository location, if this is not provided, the script will run on the working directory")
17 | args = parser.parse_args()
18 | 
19 | print ('IMPORTANT: This script is going to execute git commands.\n'
20 |         'Do not change any files or execute git commands in this project'
21 |         ' while the script is running, have a coffee or '
22 |         'something.\n')
23 | 
24 | if not args.force:
25 |     continue_ = raw_input("Do you want to continue? (y/n): ")
26 |     if continue_.lower()!='y':
27 |         sys.exit('Aborted by the user')
28 | 
29 | results_to_print = [ALERT]
30 | if args.printall:
31 |     results_to_print += [WARNING, NOTHING]
32 | elif args.warnings:
33 |     results_to_print += [WARNING]
34 | 
35 | if args.extensions:
36 |     allowed_extensions = re.compile(',\s*\.?').split(args.extensions.lower())
37 | else:
38 |     allowed_extensions = DEFAULT_EXTENSIONS
39 | 
40 | 
41 | #Default path is working directory, change if user
42 | #specified a different one
43 | path = '.' if args.path is None else args.path
44 | 
45 | print 'Allowed extensions: %s' %  reduce(lambda x,y: x+', '+y, allowed_extensions)
46 | 
47 | #Create an instance of git checker
48 | gc = GitChecker(allowed_extensions=allowed_extensions, git_dir=path)
49 | 
50 | #Get the generator that will return one result per file modified in each
51 | #commit
52 | file_traverser =  gc.file_traverser()
53 | 
54 | for result in file_traverser:
55 |     if result.result_type in results_to_print:
56 |         print result
57 | 
58 | print '\033[93mNote: you are in the first commit, if you want to go to the last one do git checkout master.\033[0m'


--------------------------------------------------------------------------------
/dummy-project/.mysterious_hidden_file:
--------------------------------------------------------------------------------
1 | PASS=this-should-not-be-public


--------------------------------------------------------------------------------
/dummy-project/dangerous_file.json:
--------------------------------------------------------------------------------
1 | {
2 |     "password": "super-secret-password"
3 | }


--------------------------------------------------------------------------------
/dummy-project/data_loading/empy.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/dummy-project/data_loading/empy.py


--------------------------------------------------------------------------------
/dummy-project/data_loading/file_with_password.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/dummy-project/data_loading/file_with_password.py


--------------------------------------------------------------------------------
/dummy-project/exploratory_data_analysis/.ipynb_checkpoints/Untitled-checkpoint.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "code",
 5 |    "execution_count": null,
 6 |    "metadata": {
 7 |     "collapsed": true
 8 |    },
 9 |    "outputs": [],
10 |    "source": [
11 |     "#Just an empty jupyter notebook, nothing to worry about..."
12 |    ]
13 |   }
14 |  ],
15 |  "metadata": {
16 |   "kernelspec": {
17 |    "display_name": "Python 2",
18 |    "language": "python",
19 |    "name": "python2"
20 |   },
21 |   "language_info": {
22 |    "codemirror_mode": {
23 |     "name": "ipython",
24 |     "version": 2
25 |    },
26 |    "file_extension": ".py",
27 |    "mimetype": "text/x-python",
28 |    "name": "python",
29 |    "nbconvert_exporter": "python",
30 |    "pygments_lexer": "ipython2",
31 |    "version": "2.7.10"
32 |   }
33 |  },
34 |  "nbformat": 4,
35 |  "nbformat_minor": 0
36 | }
37 | 


--------------------------------------------------------------------------------
/dummy-project/exploratory_data_analysis/Untitled.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "code",
 5 |    "execution_count": null,
 6 |    "metadata": {
 7 |     "collapsed": true
 8 |    },
 9 |    "outputs": [],
10 |    "source": [
11 |     "#Just an empty jupyter notebook, nothing to worry about..."
12 |    ]
13 |   }
14 |  ],
15 |  "metadata": {
16 |   "kernelspec": {
17 |    "display_name": "Python 2",
18 |    "language": "python",
19 |    "name": "python2"
20 |   },
21 |   "language_info": {
22 |    "codemirror_mode": {
23 |     "name": "ipython",
24 |     "version": 2
25 |    },
26 |    "file_extension": ".py",
27 |    "mimetype": "text/x-python",
28 |    "name": "python",
29 |    "nbconvert_exporter": "python",
30 |    "pygments_lexer": "ipython2",
31 |    "version": "2.7.10"
32 |   }
33 |  },
34 |  "nbformat": 4,
35 |  "nbformat_minor": 0
36 | }
37 | 


--------------------------------------------------------------------------------
/dummy-project/ignore:
--------------------------------------------------------------------------------
1 | **/*.py


--------------------------------------------------------------------------------
/dummy-project/models/training.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/dummy-project/models/training.py


--------------------------------------------------------------------------------
/dummy-project/python_file_with_password.py:
--------------------------------------------------------------------------------
1 | password = 'qwerty'


--------------------------------------------------------------------------------
/repo_scraper/DiffChecker.py:
--------------------------------------------------------------------------------
 1 | from repo_scraper import matchers as m
 2 | from repo_scraper.Result import *
 3 | from repo_scraper import filetype
 4 | 
 5 | class DiffChecker:
 6 |     def __init__(self, commit_hashes, filename, content, error, allowed_extensions):
 7 |         self.commit_hashes = commit_hashes
 8 |         self.filename = filename
 9 |         self.content = content
10 |         self.error = error
11 |         self.allowed_extensions = allowed_extensions
12 |     def check(self):
13 |         #Build the identifier using the filename and commit hashes
14 |         identifier = '%s (%s)' % (self.filename, self.commit_hashes[1])
15 | 
16 |         #comments is a list that keeps track of useful information
17 |         #encountered when checking; right now, it's only being used
18 |         #to annotate when base64 code was removed
19 |         comments = []
20 | 
21 |         #Check the number of additions, if there are too many
22 |         #send a warning and skip, this may be due to a big data file addition
23 |         if self.error:
24 |             return Result(self.filename, self.error)
25 | 
26 |         #Check if extension/mimetype is allowed
27 |         if filetype.get_extension(self.filename) not in self.allowed_extensions:
28 |             return Result(identifier, FILETYPE_NOT_ALLOWED)
29 |         
30 |         #Start applying rules...
31 |         #First check if additions contain base64, if there is remove it
32 |         has_base64, self.content = m.base64_matcher(self.content, remove=True)
33 |         if has_base64:
34 |             comments.append('BASE64_REMOVED')
35 |         
36 |         #Create matcher for amazonaws.com
37 |         amazonaws_matcher = m.create_domain_matcher('amazonaws.com')
38 |         #Apply matchers: password, ips and aws
39 |         match, matches = m.multi_matcher(self.content, m.password_matcher, m.ip_matcher, amazonaws_matcher)
40 | 
41 |         if match:
42 |             return Result(identifier, MATCH, matches=matches, comments=comments)
43 |         else:
44 |             return Result(identifier, NOT_MATCH, comments=comments)


--------------------------------------------------------------------------------
/repo_scraper/FileChecker.py:
--------------------------------------------------------------------------------
 1 | from repo_scraper import filetype
 2 | from repo_scraper import matchers as m
 3 | from repo_scraper.Result import *
 4 | import os
 5 | import re
 6 | 
 7 | class FileChecker:
 8 |     def __init__(self, path, allowed_extensions, max_file_size_bytes=1048576):
 9 |         self.path = path
10 |         self.max_file_size_bytes = max_file_size_bytes
11 |         self.allowed_extensions = allowed_extensions
12 |     def check(self):
13 |         #comments is a list that keeps track of useful information
14 |         #encountered when checking; right now, it's only being used
15 |         #to annotate when base64 code was removed
16 |         comments = []
17 | 
18 |         #Check file size if it's more than max_file_size_bytes (default is 1MB)
19 |         #send just a warning and do not open the file,
20 |         #since pattern matching is going to be really slow
21 |         f_size = os.stat(self.path).st_size
22 |         if f_size > self.max_file_size_bytes:
23 |             return Result(self.path, BIG_FILE)
24 |   
25 |         #Check if extension is allowed
26 |         if filetype.get_extension(self.path) not in self.allowed_extensions:
27 |             return Result(self.path, FILETYPE_NOT_ALLOWED)
28 | 
29 |         #At this point you only have files with allowed extensions and
30 |         #smaller than max_file_size_bytes
31 |         #open the file and then apply all rules
32 |         with open(self.path, 'r') as f:
33 |             content = f.read()
34 | 
35 |         #Last check: search for potential base64 strings and remove them, send a warning
36 |         has_base64, content = m.base64_matcher(content, remove=True)
37 |         if has_base64:
38 |             comments.append('BASE64_REMOVED')
39 | 
40 |         #Create matcher for amazonaws.com
41 |         amazonaws_matcher = m.create_domain_matcher('amazonaws.com')
42 |         #Apply matchers: password, ips and aws
43 |         match, matches = m.multi_matcher(content, m.password_matcher, m.ip_matcher, amazonaws_matcher)
44 | 
45 |         if match:
46 |             return Result(self.path, MATCH, matches=matches, comments=comments)
47 |         else:
48 |             return Result(self.path, NOT_MATCH, comments=comments)


--------------------------------------------------------------------------------
/repo_scraper/FolderChecker.py:
--------------------------------------------------------------------------------
 1 | #Creates a FileChecker instance for each file inside a directory
 2 | #(and subdirectories) and applies matchers
 3 | from repo_scraper import filesystem as fs
 4 | from repo_scraper.FileChecker import FileChecker
 5 | 
 6 | class FolderChecker:
 7 |     def __init__(self, folder_path, allowed_extensions, ignore_git_folder=True, ignore_path=None):
 8 |         #List all files in directory, apply ignore file if necessary
 9 |         self.filenames = fs.list_files_in(folder_path, ignore_git_folder=ignore_git_folder, ignore_file=ignore_path)
10 |         self.allowed_extensions = allowed_extensions
11 |     def file_traverser(self):
12 |         for filename in self.filenames:
13 |             yield FileChecker(filename, allowed_extensions=self.allowed_extensions).check()


--------------------------------------------------------------------------------
/repo_scraper/Git.py:
--------------------------------------------------------------------------------
 1 | from repo_scraper.constants.result import *
 2 | from repo_scraper.constants.git_diff import *
 3 | import subprocess
 4 | import re
 5 | 
 6 | class Git:
 7 |     def __init__(self, git_dir):
 8 |         self.git_dir = git_dir
 9 | 
10 |     def list_commits(self):
11 |         '''Returns a list with all commit hashes'''
12 |         p = subprocess.Popen(['git', '-C', self.git_dir, 'log', '--pretty=format:%H'],
13 |                             stdout=subprocess.PIPE,
14 |                             stderr=subprocess.PIPE)
15 |         out, err = p.communicate()
16 |     
17 |         #See comments on the function definition for details
18 |         self.check_stderr(err)
19 |         #Split by breakline to get a list and reverse the order
20 |         #so the first commit comes first
21 |         return out.replace('"', '').split('\n')[::-1]
22 | 
23 |     def diff_for_commit_to_commit(self, commit1_hash, commit2_hash):
24 |         '''Retrieves diff for a specific commit and parses it
25 |            to get file name and additions'''
26 |         #Pass the -M flag to detect renamed files and avoid redundancy
27 |         p = subprocess.Popen(['git', '-C', self.git_dir, 'diff', '-M', commit1_hash, commit2_hash],
28 |                             stdout=subprocess.PIPE,
29 |                             stderr=subprocess.PIPE)
30 |         out, err = p.communicate()
31 |     
32 |         #See comments on the function definition for details
33 |         self.check_stderr(err)
34 |         
35 |         return self.parse_diff(out)
36 | 
37 |     def parse_diff(self, diff):
38 |         '''Parse a diff output'''
39 |         files = diff.split('diff --git ')[1:]
40 |         files = [self.parse_file_diff(f) for f in files]
41 |         return files
42 | 
43 |     #http://stackoverflow.com/questions/2529441/how-to-read-the-output-from-git-diff
44 |     def parse_file_diff(self, diff):
45 |         '''Parse a diff output'''
46 |         lines = diff.split('\n')
47 |         #File line is the first line here, with the following format:
48 |         #a/file.txt b/file2.txt where file and file2 are different
49 |         #only if the file was renamed, in which case the current name is file2
50 |         #print 'L is %s matches %s' % (lines[0], re.compile('\s{1}.{1}/{1}(.*)').findall(lines[0]))
51 |         #note, if the filename has unusual characters, it's going to appear with quotes
52 |         filename = re.compile('\s{1}(?:\'|\")*./{1}(.*)(?:\'|\")*').findall(lines[0])[0]
53 |     
54 |         #If there are many lines in the diff file, the filter for additions
55 |         #is going to break, check how many lines there are
56 |         #print 'lines for content %s' % len(lines[1:])
57 |         if len(lines[1:]) > MAX_DIFF_LINES:
58 |             return {'filename': filename, 'content': None, 'error': BIG_FILE}
59 | 
60 |         #Filter only the additions (lines that start with +)
61 |         content = filter(lambda x: x.startswith('+'), lines[1:])
62 |         #Remove only the leading + sign on each line (strip would also eat trailing + signs)
63 |         content = [line[1:] for line in content]
64 |         #Join the lines again
65 |         content = reduce(lambda x,y:x+'\n'+y, content) if len(content) else ''
66 |         #Threshold for the number of characters
67 |         #print 'len is: %d' % len(content)
68 |         if len(content) > MAX_DIFF_ADDITIONS_CHARACTERS:
69 |             return {'filename': filename, 'content': None, 'error': BIG_FILE}
70 |         else:
71 |             return {'filename': filename, 'content': content, 'error': None}
72 | 
73 |     def checkout(self, commit_id):
74 |         p = subprocess.Popen(['git', '-C', self.git_dir,'checkout', commit_id], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
75 |         out, err = p.communicate()
76 |         #See comments on the function definition for details
77 |         self.check_stderr(err)
78 | 
79 |     #git likes to abuse standard error:
80 |     #https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=447395
81 |     #I wrote this function to check for actual errors instead of
82 |     #what git likes to send sometimes, actually the only error
83 |     #I'm looking for right now is when the git command is run
84 |     #in a folder with no repository
85 |     def check_stderr(self, err):
86 |         if err.startswith('fatal'):
87 |             raise Exception(err)
88 | 
89 | 


--------------------------------------------------------------------------------
/repo_scraper/GitChecker.py:
--------------------------------------------------------------------------------
 1 | from itertools import chain
 2 | from repo_scraper.Git import Git
 3 | from repo_scraper.DiffChecker import DiffChecker
 4 | from repo_scraper.FolderChecker import FolderChecker
 5 | import subprocess
 6 | 
 7 | 
 8 | class GitChecker:
 9 |     def __init__(self, allowed_extensions, git_dir):
10 |         self.allowed_extensions = allowed_extensions
11 |         self.git_dir = git_dir
12 |         self.git = Git(git_dir)
13 |     def file_traverser(self):
14 |         #Checkout master
15 |         print 'git checkout master'
16 |         self.git.checkout('master')
17 | 
18 |         #Get all commits in chronological order
19 |         commits = self.git.list_commits()
20 |         #Generate commit pairs (each commit with the previous one)
21 |         commit_pairs = zip(commits[:-1], commits[1:])
22 | 
23 |         #Go to the first commit
24 |         print 'git checkout %s (first commit in master)' % commits[0]
25 |         self.git.checkout(commits[0])
26 | 
27 |         #Get generator to check the first commit
28 |         fc = FolderChecker(folder_path=self.git_dir,
29 |                            allowed_extensions=self.allowed_extensions,
30 |                            ignore_git_folder=True)
31 |         folder_file_traverser = fc.file_traverser()
32 | 
33 |         #Define a second generator that will traverse the repository
34 |         def repo_generator():
35 |             for pair in commit_pairs:
36 |                 #print 'getting diff for %s %s' % pair
37 |                 files_diff = self.git.diff_for_commit_to_commit(*pair)
38 |                 for f in files_diff:
39 |                     #print 'gichecker: %s' % f['filename']+' in '+pair[1]
40 |                     yield DiffChecker(pair, f['filename'], f['content'], f['error'], self.allowed_extensions).check()
41 | 
42 |         repo_file_traverser = repo_generator()
43 | 
44 |         #Join both generators and return
45 |         return chain(folder_file_traverser, repo_file_traverser)


--------------------------------------------------------------------------------
/repo_scraper/Result.py:
--------------------------------------------------------------------------------
 1 | from repo_scraper.constants.result import *
 2 | 
 3 | dic = {
 4 |         BIG_FILE: WARNING,
 5 |         NOT_PLAIN_TEXT: WARNING,
 6 |         MATCH: ALERT,
 7 |         NOT_MATCH: NOTHING,
 8 |         FILETYPE_NOT_ALLOWED: WARNING
 9 |        }
10 | 
11 | class Result:
12 |     def __init__(self, identifier, reason, matches=None, comments=None):
13 |         self.identifier = identifier
14 |         self.reason = reason
15 |         self.matches = matches
16 |         self.result_type = dic[reason]
17 |         self.comments = comments
18 |     def __str__(self):
19 |         #Message to print for matches is originally No matches, unless
20 |         #self.matches has some values
21 |         matches_print = 'No matches to show'
22 |         if self.matches:
23 |             #Create list of strings with 'index. content' format for each match
24 |             matches_print = [str(idx+1)+'. '+content for idx,content in enumerate(self.matches)]
25 |             #Join list
26 |             matches_print = '\n'.join(matches_print)
27 | 
28 |         return '%s - %s in %s\n%s\n' % (self.result_type, self.reason, self.identifier, matches_print)


--------------------------------------------------------------------------------
/repo_scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/repo_scraper/__init__.py


--------------------------------------------------------------------------------
/repo_scraper/constants/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/repo_scraper/constants/__init__.py


--------------------------------------------------------------------------------
/repo_scraper/constants/extensions.py:
--------------------------------------------------------------------------------
1 | #Add extensions here (lowercase)
2 | DEFAULT_EXTENSIONS = ["py", "ipynb", "json", "sql", "sh", "txt", "r", "md", "log", "yaml"]
3 | DEFAULT_EXTENSIONS_FORMAT = ', '.join(DEFAULT_EXTENSIONS)
4 | 
5 | 


--------------------------------------------------------------------------------
/repo_scraper/constants/git_diff.py:
--------------------------------------------------------------------------------
1 | #Constants used by Git.py (git diff commands)
2 | #Maximum number of lines when doing git diff hash_a hash_b
3 | #that won't trigger a BIG_FILE result, note that this constant
4 | #includes all lines in the git diff output (additions, deletions)
5 | MAX_DIFF_LINES = 10000
6 | #Max number of characters in ADDITIONS that won't trigger a BIG_FILE result,
7 | #note that in this case, only new lines are taken into account
8 | MAX_DIFF_ADDITIONS_CHARACTERS = 1048576


--------------------------------------------------------------------------------
/repo_scraper/constants/result.py:
--------------------------------------------------------------------------------
 1 | #Result class constants with ANSI colors
 2 | ALERT = '\033[91mAlert!\033[0m'
 3 | WARNING = '\033[93mWarning\033[0m'
 4 | NOTHING = '\033[92mCheck passed\033[0m'
 5 | 
 6 | BIG_FILE = 'Big file found'
 7 | NOT_PLAIN_TEXT = 'File is not plain text'
 8 | MATCH = 'Match'
 9 | NOT_MATCH = 'Nothing found'
10 | FILETYPE_NOT_ALLOWED = 'Extension not allowed'


--------------------------------------------------------------------------------
/repo_scraper/filesystem.py:
--------------------------------------------------------------------------------
 1 | #This file contains utility functions for working with the filesystem
 2 | import os
 3 | import mimetypes as mime
 4 | import glob2 as glob
 5 | 
 6 | def list_files_in(directory, ignore_git_folder, ignore_file):
 7 |     '''Receives a path to a directory and returns a set with the paths
 8 |     to all files, minus those matching the ignore rules'''
 9 |     file_list = []
10 |     for root, dirs, files in os.walk(directory):
11 |         for file in files:
12 |             file_list.append(os.path.join(root, file))
13 | 
14 |     file_list = set(file_list)
15 | 
16 |     glob_rules = []
17 | 
18 |     if ignore_git_folder:
19 |         glob_rules.append('.git/**')
20 | 
21 |     #Check if ignore file was provided
22 |     if ignore_file is not None:
23 |         glob_rules += parse_ignore_file(ignore_file)
24 | 
25 |     if len(glob_rules):
26 |         glob_matches = match_glob_rules_in_directory(glob_rules, directory)
27 |         #Remove files in file_list that matched any glob rule
28 |         file_list = file_list - glob_matches
29 | 
30 |     return file_list
31 | 
32 | def match_glob_rules_in_directory(glob_rules, directory):
33 |     #Append directory to each glob_rule
34 |     glob_rules = [os.path.join(directory, rule) for rule in glob_rules]
35 |     glob_matches = [glob.glob(rule) for rule in glob_rules]
36 |     #Flatten matches
37 |     glob_matches = reduce(lambda x,y: x+y, glob_matches)
38 |     #Convert to a set to remove duplicates
39 |     return set(glob_matches)
40 | 
41 | def parse_ignore_file(ignore_file):
42 |     #Open file
43 |     with open(ignore_file, 'r') as f:
44 |         lines = f.read().splitlines()
45 |     #Filter lines starting with #
46 |     lines = filter(lambda line: not line.startswith('#'), lines)
47 |     #Remove empty lines
48 |     lines = filter(lambda line: len(line) > 0, lines)
49 |     return lines


--------------------------------------------------------------------------------
/repo_scraper/filetype.py:
--------------------------------------------------------------------------------
1 | import re
2 | 
3 | def get_extension(filename):
4 |     try:
5 |         return re.compile(r'.*\.(\S+)$').findall(filename)[0].lower()
6 |     except:
7 |         return None
8 | 
9 | 


--------------------------------------------------------------------------------
/repo_scraper/matchers.py:
--------------------------------------------------------------------------------
  1 | import re
  2 | 
  3 | def multi_matcher(s, *matchers):
  4 |     '''Receives matchers as parameters and applies all of them'''
  5 |     results = [m(s) for m in matchers]
  6 |     #Get the flag that indicates whether there was a match for each matcher
  7 |     has_match = [r[0] for r in results]
  8 |     #Check if there was at least one match
  9 |     at_least_one = reduce(lambda x,y: x or y, has_match)
 10 |     #Get list of matches for each matcher, delete Nones
 11 |     list_of_lists = [r[1] for r in results if r[1] is not None]
 12 |     #Flatten list of matches, ignore None
 13 |     matches = [match for single_list in list_of_lists for match in single_list]
 14 |     #If the list is empty, return None
 15 |     matches = None if len(matches)==0 else matches
 16 |     return at_least_one, matches
 17 | 
 18 | def base64_matcher(s, remove=False):
 19 |     regex = '(?:"|\')[A-Za-z0-9\\+\\\=\\/]{50,}(?:"|\')'
 20 |     base64images = re.compile(regex).findall(s)
 21 |     has_base64 = len(base64images) > 0
 22 |     if remove:
 23 |         return has_base64, re.sub(regex, '""', s)
 24 |     else:
 25 |         return has_base64
 26 | 
 27 | def password_matcher(s):
 28 |     #Case 1: hardcoded passwords assigned to variables (python, r, etc)
 29 |     #or values (json, csv, etc)
 30 |     #match variable names such as password, PASSWORD, pwd, pass,
 31 |     #SOMETHING_PASSWORD assigned to strings (match = and <-)
 32 | 
 33 |     #Matches p_w_d='something' and similar
 34 |     pwd = re.compile('(\S*\\\*(?:\'|\")*(?:p|P)\S*(?:w|W)\S*(?:d|D)\\\*(?:\'|\")*\s*(?:=|<-|:)\s*\\\*(?:\'|\").+\\\*(?:\'|\"))')
 35 |     #Matches pass='something' and similar
 36 |     pass_ = re.compile('(\S*\\\*(?:\'|\")*(?:pass|PASS)\S*\\\*(?:\'|\")*\s*(?:=|<-|:)\s*\\\*(?:\'|\").+\\\*(?:\'|\"))')
 37 | 
 38 |     #Case 2: URLS (e.g. SQLAlchemy engines)
 39 |     #http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html
 40 |     #Note that validating URls is really hard...
 41 |     #http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python
 42 |     urls = re.compile('\\\*(?:\'|\")*[a-zA-Z0-9-_]+://[a-zA-Z0-9-_]+:[a-zA-Z0-9-_]+@[a-zA-Z0-9-_]+:[a-zA-Z0-9-_]+/[a-zA-Z0-9-_]+\\\*(?:\'|\")*')
 43 | 
 44 |     #Case 3: Passwords in bash files (bash, psql, etc) bash parameters
 45 | 
 46 |     #Case 5: Pgpass
 47 |     #http://www.postgresql.org/docs/9.3/static/libpq-pgpass.html
 48 | 
 49 |     #what about case 1 without quotes?
 50 | 
 51 |     #passwords assigned to variables whose names are not similar to pwd
 52 |     #but whose string value looks like a password
 53 |     regex_list = [pwd, pass_, urls]
 54 |     matches = regex_matcher(regex_list, s)
 55 |     has_password = len(matches) > 0
 56 |     matches = None if has_password is False else matches
 57 |     return has_password, matches
 58 | 
 59 | #Checks if a string has ips
 60 | #Matching IPs with regex is a thing:
 61 | #http://stackoverflow.com/questions/10086572/ip-address-validation-in-python-using-regex
 62 | def ip_matcher(s):
 63 |     ips = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", s)
 64 |     #Remove obvious non-dangerous matches
 65 |     allowed_ips = ['127.0.0.1', '0.0.0.0']
 66 |     ips = [ip for ip in ips if ip not in allowed_ips]
 67 |     if len(ips):
 68 |         return True, ips
 69 |     else:
 70 |         return False, None
 71 | 
 72 | def create_domain_matcher(domain):
 73 |     '''Returns a function that serves as a matcher for a given domain'''
 74 |     def domain_matcher(s):
 75 |         regex = '\S+\.'+domain.replace('.', '\.')
 76 |         matches = re.findall(regex, s)
 77 |         if len(matches):
 78 |             return True, matches
 79 |         else:
 80 |             return False, None
 81 |     return domain_matcher
 82 | 
 83 | 
 84 | def regex_matcher(regex_list, s):
 85 |     '''Gets a list of regexes and returns all matches, removing duplicates
 86 |     in case more than one regex matches the same pattern (pattern location
 87 |     is taken into account to determine whether two matches are the same).'''
 88 |     #Find matches and position for each regex
 89 |     results = [match_with_position(regex, s) for regex in regex_list]
 90 |     #Flatten list
 91 |     results = reduce(lambda x,y: x+y, results)
 92 |     #Convert to set to remove duplicates
 93 |     results = set(results)
 94 |     #Extract matches only (without position)
 95 |     results = [res[1] for res in results]
 96 |     return results
 97 | 
 98 | def match_with_position(regex, s):
 99 |     '''Returns a list of tuples (pos, match) for each match.'''
100 |     return [(m.start(), m.group()) for m in regex.finditer(s)]
101 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | glob2==0.4.1
2 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup
 2 | 
 3 | setup(
 4 |       name='repo-scraper',
 5 |       version='0.1',
 6 |       description='Search for potential passwords/data leaks in a folder or git repo',
 7 |       url='https://github.com/dssg/repo-scraper',
 8 |       author='Eduardo Blancas Reyes',
 9 |       author_email='edu.blancas@gmail.com',
10 |       license='MIT',
11 |       packages=['repo_scraper', 'repo_scraper.constants'],
12 |       scripts=['bin/check-dir', 'bin/check-repo'],
13 |       test_suite='nose.collector',
14 |       tests_require=['nose'],
15 |       zip_safe=False
16 |       )


--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/__init__.py


--------------------------------------------------------------------------------
/tests/test_FileChecker.py:
--------------------------------------------------------------------------------
 1 | from repo_scraper.constants.result import *
 2 | from unittest import TestCase
 3 | from repo_scraper.FileChecker import FileChecker
 4 | import os
 5 | 
 6 | module_path = os.path.dirname(os.path.abspath(__file__))
 7 | dummy_repo_path = os.path.join(module_path, '..', 'dummy-project')
 8 | 
 9 | class TestFileChecker(TestCase):
10 |     def test_json_file_with_password(self):
11 |         pass
12 |     def test_plain_text_file_with_password(self):
13 |         pass
14 |     def test_python_file_with_password(self):
15 |         path = os.path.join(dummy_repo_path, 'python_file_with_password.py')
16 |         r = FileChecker(path, allowed_extensions=['py']).check()
17 |         self.assertEqual(r.result_type, ALERT)
18 |     def test_hidden_file_with_password(self):
19 |         pass


--------------------------------------------------------------------------------
/tests/test_filesystem.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/test_filesystem.py


--------------------------------------------------------------------------------
/tests/test_ip_check.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/test_ip_check.py


--------------------------------------------------------------------------------
/tests/test_password_check.py:
--------------------------------------------------------------------------------
  1 | from unittest import TestCase
  2 | from repo_scraper import matchers
  3 | 
  4 | #what if it's not quoted? what if it spans more than one line?
  5 | #is this reasonable to test? p=some-password
  6 | 
  7 | class HardcodedPasswordString(TestCase):
  8 |     def test_detects_easy_password(self):
  9 |         str_to_check = 'password="123456"'
 10 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 11 |         self.assertTrue(password_matcher)
 12 |         self.assertEqual(matches, [str_to_check])
 13 | 
 14 |     def test_detects_easy_password_single_quotes(self):
 15 |         str_to_check = 'password=\'123456\''
 16 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 17 |         self.assertTrue(password_matcher)
 18 |         self.assertEqual(matches, [str_to_check])
 19 | 
 20 |     def test_detects_easy_password_spaces(self):
 21 |         str_to_check = 'password =   "123456"'
 22 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 23 |         self.assertTrue(password_matcher)
 24 |         self.assertEqual(matches, [str_to_check])
 25 | 
 26 |     def test_detects_easy_password_linebreaks(self):
 27 |         password_matcher, matches = matchers.password_matcher('password ="123456"\n')
 28 |         self.assertTrue(password_matcher)
 29 |         self.assertEqual(matches, ['password ="123456"'])
 30 | 
 31 |     def test_detects_easy_password_in_R(self):
 32 |         str_to_check = 'password<-   "123456"'
 33 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 34 |         self.assertTrue(password_matcher)
 35 |         self.assertEqual(matches, [str_to_check])
 36 | 
 37 |     def test_detects_pwd(self):
 38 |         str_to_check = 'pwd="123456"'
 39 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 40 |         self.assertTrue(password_matcher)
 41 |         self.assertEqual(matches, [str_to_check])
 42 | 
 43 |     def test_detects_password_with_prefix(self):
 44 |         str_to_check = 'POSTGRES_PASSWORD=\'iYiLKi7879\''
 45 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 46 |         self.assertTrue(password_matcher)
 47 |         self.assertEqual(matches, [str_to_check])
 48 | 
 49 |     def test_detects_password_with_suffix(self):
 50 |         str_to_check = 'PASSWORD_MYSQL=\'iYiLKi7879\''
 51 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 52 |         self.assertTrue(password_matcher)
 53 |         self.assertEqual(matches, [str_to_check])
 54 | 
 55 |     def test_detects_multiple_passwords(self):
 56 |         str_to_check = 'PASSWORD_MYSQL=\'iYiLKi7879\'  \n  \n  password ="123456"\n var=5'
 57 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 58 |         self.assertTrue(password_matcher)
 59 |         self.assertEqual(matches, ['password ="123456"', 'PASSWORD_MYSQL=\'iYiLKi7879\''])
 60 | 
 61 |     def test_ignores_password_from_another_variable(self):
 62 |         str_to_check = 'password=variable'
 63 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 64 |         self.assertFalse(password_matcher)
 65 |         self.assertEqual(matches, None)
 66 | 
 67 |     def test_ignores_pwd_from_another_variable(self):
 68 |         str_to_check = 'pwd=variable'
 69 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 70 |         self.assertFalse(password_matcher)
 71 |         self.assertEqual(matches, None)
 72 | 
 73 |     def test_ignores_password_from_another_variable_with_blanks(self):
 74 |         password_matcher, matches = matchers.password_matcher('pwd    =variable\n')
 75 |         self.assertFalse(password_matcher)
 76 |         self.assertEqual(matches, None)
 77 | 
 78 | class HardcodedURLs(TestCase):
 79 |     def test_detects_sqlalchemy_engine(self):
 80 |         str_to_check = 'db-schema://user:strong-pwd@localhost:5432/mydb'
 81 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 82 |         self.assertTrue(password_matcher)
 83 |         self.assertEqual(matches, [str_to_check])
 84 | 
 85 |     def test_detects_sqlalchemy_engine_different_settings(self):
 86 |         str_to_check = 'another-schema://user2:1234@localhost:0000/awesome-db'
 87 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 88 |         self.assertTrue(password_matcher)
 89 |         self.assertEqual(matches, [str_to_check])
 90 | 
 91 |     def test_detects_sqlalchemy_quoted(self):
 92 |         str_to_check = '\'db-schema://user:strong-pwd@localhost:5432/mydb\''
 93 |         password_matcher, matches = matchers.password_matcher(str_to_check)
 94 |         self.assertTrue(password_matcher)
 95 |         self.assertEqual(matches, [str_to_check])
 96 | 
 97 |     def test_detects_sqlalchemy_double_quoted(self):
 98 |         str_to_check = '"db-schema://user:strong-pwd@localhost:5432/mydb"'
 99 |         password_matcher, matches = matchers.password_matcher(str_to_check)
100 |         self.assertTrue(password_matcher)
101 |         self.assertEqual(matches, [str_to_check])
102 | 
103 | class HardcodedPasswordsInJSON(TestCase):
104 |     def test_detects_hardcoded_value_json(self):
105 |         str_to_check = '''{
106 |                             "password":"super-secret-password"     \n\n\t
107 |                         }'''
108 |         password_matcher, matches = matchers.password_matcher(str_to_check)
109 |         self.assertTrue(password_matcher)
110 |         self.assertEqual(matches, ['"password":"super-secret-password"'])
111 | 
112 |     def test_detects_hardcoded_value_json_single_quotes(self):
113 |         str_to_check = '''{
114 |                             \'password\': \'super-secret-password\'     \n\n\t
115 |                         }'''
116 |         password_matcher, matches = matchers.password_matcher(str_to_check)
117 |         self.assertTrue(password_matcher)
118 |         self.assertEqual(matches, ['\'password\': \'super-secret-password\''])
119 | 
120 |     def test_detects_hardcoded_value_json_multiple_keys(self):
121 |         str_to_check = '''{
122 |                             "pass": "dont-hack-me-please",
123 |                             "key": "1234"
124 |                         }'''
125 |         password_matcher, matches = matchers.password_matcher(str_to_check)
126 |         self.assertTrue(password_matcher)
127 |         self.assertEqual(matches, ['"pass": "dont-hack-me-please"'])
128 | 
129 |     def test_detects_hardcoded_value_json_multiple_passwords(self):
130 |         str_to_check = '''{
131 |                             "pass": "dont-hack-me-please",
132 |                             "key": "1234",
133 |                             \'pwd\'  :    \'qwerty\',
134 |                         }'''
135 |         password_matcher, matches = matchers.password_matcher(str_to_check)
136 |         self.assertTrue(password_matcher)
137 |         self.assertEqual(matches, ['\'pwd\'  :    \'qwerty\'', '"pass": "dont-hack-me-please"'])
138 | 
139 |     def test_detects_hardcoded_value_json_blanks(self):
140 |         str_to_check = '''{
141 |                             "   pass"  :    "dont-hack-me-please"     \n\n\t
142 |                         }'''
143 |         password_matcher, matches = matchers.password_matcher(str_to_check)
144 |         #self.assertTrue(password_matcher)
145 |         #self.assertEqual(matches, ['"   pass"  :    "dont-hack-me-please"'])
146 |     def test_ignores_json_without_passwords(self):
147 |         str_to_check = '''{
148 |                             "some_key": "this is not a password",
149 |                             "another_key": 100-12301-123,
150 |                         }'''
151 |         password_matcher, matches = matchers.password_matcher(str_to_check)
152 |         self.assertFalse(password_matcher)
153 |         self.assertEqual(matches, None)
154 |     def test_detects_url_in_json_file(self):
155 |         str_to_check = '''{
156 |                             "engine": "db-schema://user:strong-pwd@localhost:5432/mydb",
157 |                             "key": "1234",
158 |                         }'''
159 |         password_matcher, matches = matchers.password_matcher(str_to_check)
160 |         self.assertTrue(password_matcher)
161 |         self.assertEqual(matches, ['"db-schema://user:strong-pwd@localhost:5432/mydb"'])
162 | 
163 | class HardcodedPasswordsInYAML(TestCase):
164 |     def test_detects_hardcoded_double_quotes(self):
165 |         str_to_check = '''
166 |                             database: 
167 |                               drivername: "dbdriver"
168 |                               host:       "dbhost"
169 |                               port:       "port"
170 |                               username:   "username"
171 |                               password:   "password"
172 |                               database:   "database"
173 |                         '''
174 |         password_matcher, matches = matchers.password_matcher(str_to_check)
175 |         self.assertTrue(password_matcher)
176 |         self.assertEqual(matches, ['password:   "password"'])
177 |     def test_detects_hardcoded_with_a_double_quote(self):
178 |         str_to_check = '''
179 |                             db:
180 |                               host: 'host'
181 |                               user: 'user'
182 |                               password: '"klu89oinlk'
183 |                               database: 'a_db'
184 | 
185 |                        '''
186 |         password_matcher, matches = matchers.password_matcher(str_to_check)
187 |         self.assertTrue(password_matcher)
188 |         self.assertEqual(matches, ["password: '\"klu89oinlk'"])
189 |     def test_detects_hardcoded_with_a_single_quote(self):
190 |         str_to_check = '''
191 |                         database: 
192 |                               drivername: "dbdriver"
193 |                               host:       "dbhost"
194 |                               port:       "port"
195 |                               username:   "username"
196 |                               password:   "thispwdhasthis'"
197 |                               database:   "database"
198 |                        '''
199 |         password_matcher, matches = matchers.password_matcher(str_to_check)
200 |         self.assertTrue(password_matcher)
201 |         self.assertEqual(matches, ['password:   "thispwdhasthis\'"'])
202 | 
203 | class HardcodedPasswordsInCSV(TestCase):
204 |     def test_detects_hardcoded_value_csv(self):
205 |         #str_to_check = '''password, qwerty'''
206 |         #password_matcher, matches = matchers.password_matcher(str_to_check)
207 |         #self.assertTrue(password_matcher)
208 |         #self.assertEqual(matches, ['password, qwerty'])
209 |         pass
210 | 
211 | class HardcodedPasswordsInGenericPlainText(TestCase):
212 |     pass


--------------------------------------------------------------------------------
/tests/test_url_check.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dssg/repo-scraper/3fbc4668f244ddb0c9c814324fb3794ca192104e/tests/test_url_check.py


--------------------------------------------------------------------------------