├── .gitignore
├── LICENSE
├── README.md
├── logging_config.json
├── main.py
└── resources
    ├── Google_Account_Setup.md
    └── google_drive_api_relationship_diagram.png

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Google API credentials
credentials.json
token.json

# IDEA files
.idea

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 TimeInvestor

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Google Drive Duplicate Files Remover
Google Drive Duplicate Files Remover is a tool for finding and removing duplicate files in your Google Drive.

## Privacy & Data Safety
Your data is totally safe because:
- the program uses your own Google Drive App and your own Google account authentication.
- it only reads and changes file metadata. It does not read, export, download or delete file content.
- it uses soft deletion, which means it trashes duplicate files instead of deleting them permanently.

> Note:
> - If you need to delete the trashed files and release Google Drive space immediately, log in to your Google Drive, go to `Bin`, and delete the selected files or empty the entire bin.
> - You need to refresh the browser page to see any change in Google Drive storage space.

## Prerequisite: Google Account Setup
Basically, we are creating a Google Drive App for our own use.
As far as I know, there is no way to simply enable API access for our Google Drive and then use some OAuth token to make API calls.

Please read the [Google Account Setup](/resources/Google_Account_Setup.md) readme file for the details.

After successful setup, you will have OAuth credentials saved in a file named `credentials.json`.
We need this file later.

## Prerequisite: Program Execution Environment
Before you begin, ensure you have met the following requirements:
* Python 3.6 or higher.

## Installing Google Drive Duplicate Files Remover
To install Google Drive Duplicate Files Remover, follow these steps:

**Download code**
```shell
git clone git@github.com:TimeInvestor/gdrive-duplicate-remover.git
```

**(optional) Configure Python virtual environment**

If you want to use a Python virtual environment for the project, please do so.
> If you want to learn more about Python virtual environments, you could refer to https://realpython.com/python-virtual-environments-a-primer/.

**Install required Python libraries**

`main.py` imports `google_auth_oauthlib`, which is distributed separately as `google-auth-oauthlib`, so install it as well.
```shell
cd gdrive-duplicate-remover
pip install google-api-python-client google-auth-oauthlib
```

## Configuration
The code needs access (OAuth 2.0) credentials to call the Google Drive API.
So put your saved `credentials.json` file at the root of the project folder.

## Using Google Drive Duplicate Files Remover

To use Google Drive Duplicate Files Remover, follow these steps:

```shell
# Go to code folder
cd
# Run the main script
python main.py
```

## Contributing to Google Drive Duplicate Files Remover

To contribute to Google Drive Duplicate Files Remover, follow these steps:

1. Fork this repository.
2. Create a branch: `git checkout -b `.
3. Make your changes and commit them: `git commit -m ''`
4. Push to the original branch: `git push origin gdrive-duplicate-remover/`
5. Create the pull request.

Alternatively see the GitHub documentation on [creating a pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request).

## Contact

If you want to contact me you can reach me at zhenglisheng@gmail.com.

## License
This project uses the following license: [MIT License](LICENSE).

--------------------------------------------------------------------------------
/logging_config.json:
--------------------------------------------------------------------------------
{
    "version": 1,
    "disable_existing_loggers": true,
    "formatters": {
        "standard": {
            "class": "logging.Formatter",
            "style": "{",
            "format": "{asctime:s} | {name:s} | {levelname:s} | {message:s}"
        }
    },
    "handlers": {
        "console": {
            "level": "INFO",
            "class": "logging.StreamHandler",
            "formatter": "standard",
            "stream": "ext://sys.stdout"
        },
        "file_handler": {
            "level": "INFO",
            "class": "logging.FileHandler",
            "formatter": "standard",
            "filename": "logs/application.log",
            "mode": "a",
            "encoding": "utf-8"
        }
    },
    "loggers": { },
    "root": {
        "handlers": ["console", "file_handler"],
        "level": "INFO"
    }
}
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
from __future__ import print_function

import json
import logging
import os
from datetime import datetime
from logging.config import dictConfig

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# If modifying these scopes, delete the file token.json.
SCOPES = ['https://www.googleapis.com/auth/drive']

MAX_PAGE_SIZE = 1000
FILE_FIELDS = "name, modifiedTime, id, trashed, ownedByMe, md5Checksum"

# For reporting
date = datetime.now().strftime("%Y_%m_%d-%I_%M_%S%p")
REPORT_FOLDER = 'reports'

LOGGER_NAME = 'GDrive Duplicate Remover'

# New fields for status tracking of files.
# We could leverage the existing `trashed` field of files, but that may not be
# a good idea: we do not control that field, and Google could rename or remove
# it one day.
# Besides giving us control, separate status fields also help with auditing
# and failure analysis.
TO_REMOVE = 'to_remove'
REMOVED = 'removed'


def main():
    api_client = get_gdrive_api_client()

    hash_map = fetch_files_from_gdrive(api_client)
    logger.info(f'Total number of md5Checksum entries: {len(hash_map)}')
    _produce_candidate_files_report(hash_map)

    _mark_duplicates(hash_map)
    _produce_files_for_removal_report(hash_map)

    remove_duplicates_from_gdrive(api_client, hash_map)
    _produce_files_removed_report(hash_map)


def get_gdrive_api_client():
    """Takes Google Developer App client secrets from a file
    `credentials.json`, kicks off authorization flow, and creates a `service`
    object for Google Drive API.

    Returns:
        A client for Google Drive ready to make calls.
    """
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    service = build('drive', 'v3', credentials=creds)

    return service


def fetch_files_from_gdrive(api_client):
    """Fetches list of files from Google Drive.

    Fetches only files that have md5Checksum. Files that do not have
    md5Checksum: folders, Google Docs, Google Sheets, Google Slides,
    and other Google Office files. It also excludes files that are not owned
    by the user or have already been trashed.

    Args:
        api_client: Google Drive API client.

    Returns:
        A dict using files' md5Checksum as the keys. The value for each key
        is a list of files with the same md5Checksum, i.e. duplicate files.
    """
    hash_map = {}
    next_page_token = None
    while True:
        results = api_client.files().list(
            # Exclude files without md5Checksum
            q="mimeType!='application/vnd.google-apps.folder' and "
              "mimeType!='application/vnd.google-apps.spreadsheet' and "
              "mimeType!='application/vnd.google-apps.presentation' and "
              "mimeType!='application/vnd.google-apps.document' and "
              "mimeType!='application/vnd.google-apps.form' and "
              "mimeType!='application/vnd.google-apps.drive-sdk.810194666617' and "
              "mimeType!='application/vnd.google-apps.site' and "
              "mimeType!='application/vnd.google-apps.earth' and "
              "mimeType!='application/vnd.google-apps.drawing' and "
              "mimeType!='application/vnd.google-apps.jam' and "
              "trashed=false",
            pageSize=MAX_PAGE_SIZE,
            pageToken=next_page_token,
            fields=f"nextPageToken, files({FILE_FIELDS})"
        ).execute()

        next_page_token = results.get('nextPageToken', None)
        logger.info(f'next_page_token: {next_page_token}')
        files = results.get('files', [])
        if not files:
            logger.info('No suitable files found in your Google Drive.')
        else:
            for file in files:
                if (file.get('md5Checksum') is not None and
                        file.get('trashed') is False and
                        file.get('ownedByMe') is True):
                    md5_checksum = file['md5Checksum']
                    file_list = hash_map.get(md5_checksum)
                    if file_list is None:
                        hash_map[md5_checksum] = [file]
                    else:
                        file_list.append(file)
        if next_page_token is None:
            break

    return hash_map


def remove_duplicates_from_gdrive(api_client, hash_map):
    """Remove duplicate files from Google Drive.

    It actually trashes duplicate files by updating a duplicate file's
    metadata field `trashed` to be true.

    Args:
        api_client: Google Drive API client.
        hash_map: a dict containing md5Checksum values and their related files

    Returns:
        None
    """
    for file_list in hash_map.values():
        for file in file_list:
            if file.get(TO_REMOVE) is True:
                logger.info(f'Trashing file: {file}')
                try:
                    result = api_client.files().update(fileId=file['id'],
                                                       body={'trashed': True}
                                                       ).execute()
                except HttpError:
                    logger.exception('Trashing file failed!')
                else:
                    logger.info(f'Trashing file done - response: {result}')
                    file[REMOVED] = True


def _mark_duplicates(hash_map):
    for md5_checksum, file_list in hash_map.items():
        # Loop through the file list to
        # 1) mark files to be removed
        # 2) find out most recent file to keep
        if len(file_list) > 1:
            most_recent_file = file_list[0]
            for file in file_list:
                file[TO_REMOVE] = True
                if file['modifiedTime'] > most_recent_file['modifiedTime']:
                    most_recent_file = file
            most_recent_file[TO_REMOVE] = False


def _produce_candidate_files_report(hash_map):
    report_file = os.path.join(REPORT_FOLDER, f'Candidate_Files-{date}.log')
    _produce_report(hash_map, report_file)


def _produce_files_for_removal_report(hash_map):
    new_hash_map = {}
    for md5_checksum, file_list in hash_map.items():
        new_list = [file for file in file_list if file.get(TO_REMOVE) is True]
        if len(new_list) > 0:
            new_hash_map[md5_checksum] = new_list

    report_file = os.path.join(REPORT_FOLDER, f'Files_To_Remove-{date}.log')
    _produce_report(new_hash_map, report_file)


def _produce_files_removed_report(hash_map):
    new_hash_map = {}
    for md5_checksum, file_list in hash_map.items():
        new_list = [file for file in file_list if file.get(REMOVED) is True]
        if len(new_list) > 0:
            new_hash_map[md5_checksum] = new_list

    report_file = os.path.join(REPORT_FOLDER, f'Files_Removed-{date}.log')
    _produce_report(new_hash_map, report_file)


def _produce_report(hash_map, report_file):
    os.makedirs(os.path.dirname(report_file), exist_ok=True)
    with open(report_file, 'w') as f:
        for md5_checksum, file_list in hash_map.items():
            f.write(f"md5Checksum: {md5_checksum}\n")
            for file in file_list:
                f.write(f"{file}\n")


if __name__ == '__main__':
    # Config logging
    with open('logging_config.json', 'r') as logging_config_file:
        logging_config = json.load(logging_config_file)
    # Create the folder of each file handler (e.g. `logs/`) up front;
    # otherwise dictConfig fails when the handler opens its log file.
    for handler in logging_config.get('handlers', {}).values():
        if 'filename' in handler:
            log_folder = os.path.dirname(handler['filename'])
            os.makedirs(log_folder or '.', exist_ok=True)
    dictConfig(logging_config)
    logger = logging.getLogger(LOGGER_NAME)

    # Run the application
    main()

--------------------------------------------------------------------------------
/resources/Google_Account_Setup.md:
--------------------------------------------------------------------------------
Basically, we are creating a Google Drive App for our own use. As far as I know, there is no way to simply enable API access for our Google Drive and then use some OAuth token to make API calls.

### Google Drive API Relationship Diagram
![This diagram shows the relationship between your Google Drive app, Google Drive, and Google Drive API](google_drive_api_relationship_diagram.png?raw=true "Title")

### Create a Google developer project and enable Google Drive API for it
Refer to https://developers.google.com/identity/protocols/oauth2/web-server#enable-apis

To interact with Google Drive API, you need to enable the Drive API service for your app.
To enable the Drive API, complete these steps:
- [Open the API Library](https://console.developers.google.com/apis/library) in the Google API Console.
- If prompted, select a project, or create a new one.
- The API Library lists all available APIs, grouped by product family and popularity.
  If the API you want to enable isn't visible in the list, use search to find it, or click View All in the product family it belongs to.
- Select **Google Drive API** under **Google Workspace**, then click the Enable button.
- If prompted, enable billing. (No worries. You will not be charged.)
- If prompted, read and accept the API's Terms of Service.

### Create authorization credentials
We use OAuth 2.0. So please refer to [Authorizing requests with OAuth 2.0](https://developers.google.com/drive/api/v3/about-auth#OAuth2Authorizing).

Any application that uses OAuth 2.0 to access Google APIs must have authorization credentials that identify the application to Google's OAuth 2.0 server.
The following steps explain how to create credentials for your project.

First, we need to configure the **OAuth consent screen**.

Here are the steps:
- Go to the [OAuth consent screen](https://console.cloud.google.com/apis/credentials/consent) page.
- Configure Consent Screen
  - Choose `External` as User Type.
  - Click `CREATE` button.
  - Just fill in the form on the next page. You can safely ignore the `App domain` and `Authorised domains` sections.
  - Click `SAVE AND CONTINUE` button.
- Select Scopes
  - Click `ADD OR REMOVE SCOPES` button.
  - Browse the list displayed and choose "`https://www.googleapis.com/auth/drive`".
  - Click `UPDATE` button.
  - Click `SAVE AND CONTINUE` button.
- Add test users
  - Click `ADD USERS` button.
  - Type your Gmail address.
  - Click `ADD` button.
  - Click `SAVE AND CONTINUE` button.
- Review Summary
  - Click `BACK TO DASHBOARD`

> - Although this gives full access to your Google Drive files, don't worry: you, yourself, will be the sole user.
>
> - Besides, you don't have to get your Google Drive App approved by Google, as we can use the **test user** feature to use the app.

Second, we need to create **OAuth credentials** for the Google Drive Duplicate Files Remover code to use.

Your applications can then use the credentials to access APIs that you have enabled for that project.
- Go to the [Credentials page](https://console.developers.google.com/apis/credentials).
- Click **+CREATE CREDENTIALS > OAuth client ID** in the top bar.
- Select `Web application` as the application type.
- Fill in the rest of the form and click `CREATE` button.
- Click `OK` in the pop-up.

Now, we need to download/save the credentials for the Google Drive Duplicate Files Remover.
- On the same page, under the **OAuth 2.0 Client IDs** section, find the OAuth client you just created.
- Click the download/save arrow icon.
- Save the credential file as `credentials.json`.

You are done!


--------------------------------------------------------------------------------
/resources/google_drive_api_relationship_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TimeInvestor/gdrive-duplicate-remover/de155e1a631c34b91508c158fe35467e2c9d2ccc/resources/google_drive_api_relationship_diagram.png
--------------------------------------------------------------------------------
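
Before running `main.py`, it can help to confirm that the `credentials.json` saved at the end of `resources/Google_Account_Setup.md` really is an OAuth client secrets file. The sketch below is not part of the repository: `describe_client_secrets` is a hypothetical helper. It assumes the usual layout of a downloaded client secrets file, where a single top-level key (`installed` or `web`) holds `client_id` and `client_secret`.

```python
import json


def describe_client_secrets(path="credentials.json"):
    """Return (client_type, client_id) for an OAuth client secrets file.

    Hypothetical helper, not part of main.py. It assumes the file has a
    top-level "installed" or "web" section holding the client details.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    if isinstance(data, dict):
        for client_type in ("installed", "web"):
            section = data.get(client_type)
            # Accept the section only if it carries both client fields.
            if (isinstance(section, dict)
                    and "client_id" in section
                    and "client_secret" in section):
                return client_type, section["client_id"]

    raise ValueError(f"{path} does not look like an OAuth client secrets file")
```

Calling `describe_client_secrets()` from the project root returns the client type and client ID, or raises `ValueError` if, for example, a service account key or the generated `token.json` was saved under that name by mistake.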