├── LICENSE
├── extensions.txt
├── README.md
└── wayBackupFinder.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Anmol K Sachan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/extensions.txt:
--------------------------------------------------------------------------------
.xls
.xml
.xlsx
.json
.pdf
.sql
.doc
.docx
.pptx
.txt
.zip
.tar
.gz
.tgz
.bak
.7z
.rar
.log
.cache
.secret
.db
.backup
.yml
.config
.csv
.yaml
.md
.md5
.exe
.dll
.bin
.ini
.bat
.sh
.deb
.rpm
.iso
.img
.apk
.msi
.dmg
.tmp
.crt
.pem
.key
.pub
.asc
.OLD
.PHP
.BAK
.SAVE
.ZIP
.example
.php
.asp
.aspx
.jsp
.dist
.conf
.swp
.old
.tar.gz
.jar
.bz2
.php.save
.php-backup
.save
.php~
.aspx~
.asp~
.bkp
.jsp~
.sql.gz
.sql.zip
.sql.tar.gz
.sql~
.swp~
.tar.bz2
.lz
.xz
.z
.Z
.tar.z
.sqlite
.sqlitedb
.sql.7z
.sql.bz2
.sql.lz
.sql.rar
.sql.xz
.sql.z
.sql.tar.z
.war
.backup.zip
.backup.tar
.backup.tgz
.backup.sql
.tar.bz
.tgz.bz
.tar.lz
.backup.7z
.backup.gz
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![image](https://github.com/user-attachments/assets/80c07192-187d-4199-a225-8febf8c2e007)

# WayBackup Finder

This Python script fetches URLs from the Wayback Machine and filters them based on specified file extensions. It also checks if archived snapshots are available for each URL and saves the filtered URLs to files.

Read more: Medium
Watch Tool in action: Medium

Community Blogs:
WayBackupFinder Passive Recon

## Features

- Fetches URLs from the Wayback Machine using the CDX API (see the sketch after this list).
- Filters the fetched URLs by specific file extensions (e.g., `.pdf`, `.zip`).
- Checks if a Wayback snapshot is available for each URL.
- Saves the filtered URLs to text files.
- Supports custom file extensions or the default extensions from `extensions.txt`.
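The CDX lookup works roughly as shown below. This is a minimal sketch, not the tool itself: the endpoint and main parameters follow the ones used in `wayBackupFinder.py`, while the `target` value and the `.pdf` filter are placeholders.

```python
# Minimal sketch of the CDX lookup (target and the .pdf filter are placeholders).
import requests

target = "example.com"
cdx_url = (
    "https://web.archive.org/cdx/search/cdx"
    f"?url=*.{target}/*&output=txt&fl=original&collapse=urlkey"
)

# Stream the response so large result sets are not loaded into memory at once.
with requests.get(cdx_url, stream=True, timeout=60) as response:
    response.raise_for_status()
    urls = [line for line in response.iter_lines(decode_unicode=True) if line]

# Keep only URLs that end with an extension of interest.
pdf_urls = [u for u in urls if u.lower().endswith(".pdf")]
print(f"Fetched {len(urls)} URLs, {len(pdf_urls)} ending in .pdf")
```

Streaming the response and collapsing on `urlkey` keeps the result set manageable for large domains.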
## Use Case: Finding Archived Backups

This tool is especially useful for finding backups of websites or files that are no longer available on the live site. If a resource (e.g., a PDF or image) was previously available on the website but has since been removed or is temporarily unavailable, an archived snapshot of it may still exist in the Wayback Machine.

By using this script, you can:

- **Identify URLs** that may have once been accessible but no longer are.
- **Check for backups** on the Wayback Machine that might not be available on the current live site.
- **Retrieve historical versions** of files or content that have been deleted or moved.

The script retrieves URLs from the Wayback Machine and, for each URL found, checks whether an archived snapshot is available. If a snapshot exists, the script prints a link to the backup; the check itself is sketched below.
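The snapshot check uses the Wayback availability API, the same endpoint the script's `check_wayback_snapshot` function calls. A minimal sketch; the URL below is only an example.

```python
# Minimal sketch of the snapshot check (the URL below is only an example).
import requests

url = "https://example.com/sample.pdf"
response = requests.get("https://archive.org/wayback/available", params={"url": url}, timeout=30)
response.raise_for_status()
data = response.json()

# "archived_snapshots" -> "closest" holds the nearest capture, if one exists.
closest = data.get("archived_snapshots", {}).get("closest", {})
if closest.get("url"):
    print(f"[+] Found possible backup: {closest['url']}")
else:
    print(f"[-] No archived snapshot found for {url}.")
```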
## Requirements

- Python 3.x
- The following Python packages:
  - `requests`
  - `colorama`
  - `termcolor`

You can install the required packages using the following command:

```bash
pip3 install requests colorama termcolor
```

## How to Use

1. Clone the repository or download the script.
2. Ensure you have a file named `extensions.txt` in the same directory as the script, or specify custom file extensions.
3. Run the script:

```bash
python wayBackupFinder.py
```

4. When prompted, choose single- or multiple-domain mode, enter the target domain (e.g., `example.com`) or the path to a file containing domains, and specify whether to use custom file extensions or load them from `extensions.txt`.
5. The script will:
   - Fetch URLs from the Wayback Machine.
   - Filter the URLs by the provided file extensions.
   - Save the filtered URLs to separate files.
   - Check if archived snapshots are available for each URL.

## Example
![image](https://github.com/user-attachments/assets/4a7652dd-7c43-42aa-a9f0-94f7207dca60)

### Input:

```
Select mode (1: Single Domain, 2: Multiple Domains): 1

Enter the target domain (e.g., example.com): example.com
Use custom file extensions or load from extensions.txt? (custom/load): load
```

### Output:

The script will print the progress and save the filtered URLs to files such as:

```
Filtered URLs for .pdf saved to: content/example.com/example.com_pdf_filtered_urls.txt
[+] Found possible backup: https://web.archive.org/web/20200101000000/https://example.com/sample.pdf
```

### File Extensions

You can specify custom file extensions to filter by, separated by commas, for example: `.zip,.pdf,.jpg`. If you choose to load extensions from `extensions.txt`, the script will use those.

### File Structure

The script creates a folder called `content` and stores the filtered URLs for each extension in a subfolder named after the target domain, as illustrated below.
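For example, a run against `example.com` that finds `.pdf` and `.sql` URLs would produce a layout roughly like this (the extensions shown are illustrative; file names follow the `<domain>_<extension>_filtered_urls.txt` pattern used by `save_urls`):

```
content/
└── example.com/
    ├── example.com_pdf_filtered_urls.txt
    └── example.com_sql_filtered_urls.txt
```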
Exiting.", "red")) 47 | exit() 48 | 49 | # Fetch URLs using The Wayback Machine API with streaming and backoff 50 | def fetch_urls(target, file_extensions): 51 | print(f"\nFetching URLs from The Time Machine Lite for {target}...") 52 | archive_url = f'https://web.archive.org/cdx/search/cdx?url=*.{target}/*&output=txt&fl=original&collapse=urlkey&page=/' 53 | 54 | global stop_loader 55 | stop_loader = False 56 | loader_thread = Thread(target=loader_animation, args=("Fetching URLs...",)) 57 | loader_thread.start() 58 | 59 | max_retries = 3 # Maximum number of retries 60 | retry_delay = 5 # Delay between retries (in seconds) 61 | attempt = 0 62 | 63 | while attempt < max_retries: 64 | try: 65 | with requests.get(archive_url, stream=True, timeout=60) as response: # Stream the response 66 | response.raise_for_status() 67 | print(colored("\nStreaming response from archive...", "green")) 68 | 69 | url_list = [] 70 | total_lines = 0 71 | for line in response.iter_lines(decode_unicode=True): # Process each line incrementally 72 | if line: 73 | url_list.append(line) 74 | total_lines += 1 75 | if total_lines % 1000 == 0: # Show progress every 1000 lines 76 | print(f"\rFetched {total_lines} URLs...", end="") 77 | 78 | print(colored(f"\nFetched {total_lines} URLs from archive.", "green")) 79 | stop_loader = True 80 | loader_thread.join() 81 | return {ext: [url for url in url_list if url.lower().endswith(ext.lower())] for ext in file_extensions} 82 | except requests.exceptions.RequestException as e: 83 | attempt += 1 84 | if attempt < max_retries: 85 | print(colored(f"\nAttempt {attempt} failed: {e}. Retrying in {retry_delay} seconds...", "yellow")) 86 | time.sleep(retry_delay) 87 | else: 88 | print(colored(f"\nError fetching URLs after {max_retries} attempts: {e}", "red")) 89 | print(colored("The server may be rate-limiting or refusing connections.", "yellow")) 90 | print(colored("Pausing for 5 minutes before continuing...", "yellow")) 91 | time.sleep(300) # Sleep for 5 minutes (300 seconds) 92 | print(colored("Resuming...", "green")) 93 | return {} # Return an empty dictionary after backoff 94 | 95 | # Check for archived snapshots 96 | def check_wayback_snapshot(url): 97 | wayback_url = f'https://archive.org/wayback/available?url={url}' 98 | try: 99 | response = requests.get(wayback_url, timeout=30) 100 | response.raise_for_status() 101 | data = response.json() 102 | if "archived_snapshots" in data and "closest" in data["archived_snapshots"]: 103 | snapshot_url = data["archived_snapshots"]["closest"].get("url") 104 | if snapshot_url: 105 | print(f"[+] Found possible backup: {colored(snapshot_url, 'green')}") 106 | else: 107 | print(f"[-] No archived snapshot found for {url}.") 108 | except Exception as e: 109 | print(f"[?] 
# Check for archived snapshots
def check_wayback_snapshot(url):
    wayback_url = f'https://archive.org/wayback/available?url={url}'
    try:
        response = requests.get(wayback_url, timeout=30)
        response.raise_for_status()
        data = response.json()
        # "closest" is present only when a snapshot exists; report the result either way
        closest = data.get("archived_snapshots", {}).get("closest", {})
        snapshot_url = closest.get("url")
        if snapshot_url:
            print(f"[+] Found possible backup: {colored(snapshot_url, 'green')}")
        else:
            print(f"[-] No archived snapshot found for {url}.")
    except Exception as e:
        print(f"[?] Error checking Wayback snapshot for {url}: {e}")

# Save filtered URLs
def save_urls(target, extension_stats, file_suffix="_filtered_urls.txt"):
    folder = f"content/{target}"
    os.makedirs(folder, exist_ok=True)
    all_filtered_urls = []
    for ext, urls in extension_stats.items():
        if urls:
            file_path = os.path.join(folder, f"{target}_{ext.strip('.')}{file_suffix}")
            with open(file_path, 'w') as file:
                file.write("\n".join(urls))
            all_filtered_urls.extend(urls)
            print(f"Filtered URLs for {ext} saved to: {colored(file_path, 'green')}")
    return all_filtered_urls

# Process domain
def process_domain(target, file_extensions):
    extension_stats = fetch_urls(target, file_extensions)
    if not extension_stats:  # Ensure extension_stats is not empty
        print(colored(f"No URLs fetched for {target}. Skipping...", "yellow"))
        return
    all_filtered_urls = save_urls(target, extension_stats)
    for url in all_filtered_urls:
        check_wayback_snapshot(url)

# Main execution
if __name__ == "__main__":
    init()
    print(colored(' Coded with Love by Anmol K Sachan @Fr13ND0x7f\n', 'green'))

    # Input: Single or multiple domains
    mode = input("Select mode (1: Single Domain, 2: Multiple Domains): ").strip()
    if mode == "1":
        target = input("\nEnter the target domain (e.g., example.com): ").strip()
        if not target:
            print(colored("Target domain is required. Exiting.", "red"))
            exit()
        domains = [target]
    elif mode == "2":
        domain_file = input("\nEnter the path to the file containing domain list: ").strip()
        domains = load_domains_from_file(domain_file)
        print(f"Loaded {len(domains)} domains from {colored(domain_file, 'green')}.")
    else:
        print(colored("Invalid choice. Exiting.", "red"))
        exit()

    # Load default extensions from file
    default_extensions = load_extensions_from_file()
    choice = input("Use custom file extensions or load from extensions.txt? (custom/load): ").strip().lower()
    if choice == "custom":
        # Strip whitespace around each entry so ".zip, .pdf" also works
        file_extensions = [ext.strip() for ext in input("Enter file extensions to filter (e.g., .zip,.pdf): ").split(",") if ext.strip()]
    elif choice == "load" and default_extensions:
        file_extensions = default_extensions
    else:
        print(colored("No extensions found. Exiting.", "red"))
        exit()

    # Process each domain
    for target in domains:
        print(colored(f"\nProcessing domain: {target}", "blue"))
        process_domain(target, file_extensions)

    print(colored("\nProcess complete for all domains.", "green"))
--------------------------------------------------------------------------------