├── LICENSE
├── extensions.txt
├── README.md
└── wayBackupFinder.py
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Anmol K Sachan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/extensions.txt:
--------------------------------------------------------------------------------
.xls
.xml
.xlsx
.json
.pdf
.sql
.doc
.docx
.pptx
.txt
.zip
.tar
.gz
.tgz
.bak
.7z
.rar
.log
.cache
.secret
.db
.backup
.yml
.config
.csv
.yaml
.md
.md5
.exe
.dll
.bin
.ini
.bat
.sh
.deb
.rpm
.iso
.img
.apk
.msi
.dmg
.tmp
.crt
.pem
.key
.pub
.asc
.OLD
.PHP
.BAK
.SAVE
.ZIP
.example
.php
.asp
.aspx
.jsp
.dist
.conf
.swp
.old
.tar.gz
.jar
.bz2
.php.save
.php-backup
.save
.php~
.aspx~
.asp~
.bkp
.jsp~
.sql.gz
.sql.zip
.sql.tar.gz
.sql~
.swp~
.tar.bz2
.lz
.xz
.z
.Z
.tar.z
.sqlite
.sqlitedb
.sql.7z
.sql.bz2
.sql.lz
.sql.rar
.sql.xz
.sql.z
.sql.tar.z
.war
.backup.zip
.backup.tar
.backup.tgz
.backup.sql
.tar.bz
.tgz.bz
.tar.lz
.backup.7z
.backup.gz
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# WayBackup Finder

This Python script fetches URLs from the Wayback Machine, filters them by specified file extensions, checks whether an archived snapshot is available for each URL, and saves the filtered URLs to files.

Read more: Medium
Watch Tool in action: Medium

Community Blogs:
- WayBackupFinder Passive Recon

## Features

- Fetches URLs from the Wayback Machine using the CDX API.
- Filters the fetched URLs by specific file extensions (e.g., `.pdf`, `.zip`).
- Checks if a Wayback snapshot is available for each URL.
- Saves the filtered URLs to text files.
- Supports custom file extensions, or the defaults loaded from `extensions.txt`.

## Use Case: Finding Archived Backups

This tool is especially useful for finding backups of websites or files that are no longer available on the live site. If a resource (e.g., a PDF or image) was previously available on the website but has since been removed or is temporarily unavailable, an archived snapshot of it may still exist in the Wayback Machine.

By using this script, you can:

- **Identify URLs** that may have once been accessible but no longer are.
- **Check for backups** on the Wayback Machine that might not be available on the current live site.
- **Retrieve historical versions** of files or content that have been deleted or moved.

The script retrieves URLs from the Wayback Machine and, for each URL found, checks whether an archived snapshot is available. If a snapshot exists, the script prints a link to the backup.
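
Under the hood this is just two HTTP calls: the CDX API lists every URL the archive has recorded for a domain, and the availability API reports the closest archived snapshot for a given URL. Here is a minimal sketch of both calls (endpoints and parameters taken from the script; error handling and retries omitted):

```python
import requests

domain = "example.com"  # illustrative target

# 1. List archived URLs for the domain via the CDX API
cdx = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": f"*.{domain}/*", "output": "txt",
            "fl": "original", "collapse": "urlkey"},
    timeout=60,
)
urls = [line for line in cdx.text.splitlines() if line]

# 2. Ask the availability API for the closest snapshot of the first URL
avail = requests.get(
    "https://archive.org/wayback/available",
    params={"url": urls[0]},
    timeout=30,
).json()
closest = avail.get("archived_snapshots", {}).get("closest", {})
print(closest.get("url", "no snapshot found"))
```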

## Requirements

- Python 3.x
- The following Python packages:
  - `requests`
  - `colorama`
  - `termcolor`

You can install the required packages using the following command:

```bash
pip3 install requests colorama termcolor
```

## How to Use

1. Clone the repository or download the script.
2. Ensure you have a file named `extensions.txt` in the same directory as the script, or specify custom file extensions.
3. Run the script:

```bash
python wayBackupFinder.py
```

4. When prompted, enter the target domain (e.g., `example.com`) and specify whether to use custom file extensions or load them from the `extensions.txt` file.
5. The script will:
   - Fetch URLs from the Wayback Machine.
   - Filter the URLs by the provided file extensions.
   - Save the filtered URLs to separate files.
   - Check if archived snapshots are available for each URL.

## Example

### Input:

```
Select mode (1: Single Domain, 2: Multiple Domains): 1
Enter the target domain (e.g., example.com): example.com
Use custom file extensions or load from extensions.txt? (custom/load): load
```

### Output:

The script will print its progress and save the filtered URLs to files such as:

```
Filtered URLs for .pdf saved to: content/example.com/example.com_pdf_filtered_urls.txt
[+] Found possible backup: https://web.archive.org/web/20200101000000/https://example.com/sample.pdf
```
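
The "possible backup" line comes from the availability API, which returns JSON shaped like this (values are illustrative; the script reads `archived_snapshots.closest.url`):

```json
{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "status": "200",
      "timestamp": "20200101000000",
      "url": "https://web.archive.org/web/20200101000000/https://example.com/sample.pdf"
    }
  }
}
```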

### File Extensions

You can specify custom file extensions to filter by, separated by commas, for example: `.zip,.pdf,.jpg`. If you choose to load extensions from `extensions.txt`, the script will use those.
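
The matching rule mirrors the check in `wayBackupFinder.py`: a URL matches an extension when it ends with it, compared case-insensitively. A quick illustration:

```python
# Case-insensitive suffix matching, as done in wayBackupFinder.py
urls = [
    "https://example.com/backup.ZIP",
    "https://example.com/readme.txt",
]
extensions = [".zip", ".pdf"]

filtered = {
    ext: [url for url in urls if url.lower().endswith(ext.lower())]
    for ext in extensions
}
print(filtered)  # {'.zip': ['https://example.com/backup.ZIP'], '.pdf': []}
```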

### File Structure

The script creates a folder called `content` and stores the filtered URLs in a subfolder named after the target domain, with one file per extension.
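
For example, a run against `example.com` that finds `.pdf` and `.zip` URLs would produce a layout like this (illustrative):

```
content/
└── example.com/
    ├── example.com_pdf_filtered_urls.txt
    └── example.com_zip_filtered_urls.txt
```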

### Community Resources

- 🌐 𝐖𝐄𝐁 𝐏𝐄𝐍𝐓𝐄𝐒𝐓𝐈𝐍𝐆 - "𝕎𝔸𝕐𝔹𝔸ℂ𝕂𝕌ℙ 𝔽𝕀ℕ𝔻𝔼ℝ" [ 𝐇𝐨𝐰 𝐭𝐨 𝐅𝐢𝐧𝐝 𝐋𝐞𝐚𝐤𝐞𝐝 𝐁𝐚𝐜𝐤𝐮𝐩 𝐅𝐢𝐥𝐞𝐬 𝐮𝐬𝐢𝐧𝐠 𝐰𝐚𝐲𝐛𝐚𝐜𝐤𝐮𝐩 𝐟𝐢𝐧𝐝𝐞𝐫 ] 🔍
- WayBackupFinder Passive Recon
- How can you use WayBackLister?

## License

This project is licensed under the MIT License - see the [LICENSE](https://raw.githubusercontent.com/anmolksachan/WayBackupFinder/refs/heads/main/LICENSE) file for details.
--------------------------------------------------------------------------------
/wayBackupFinder.py:
--------------------------------------------------------------------------------
import os
import requests
import time
from colorama import init
from termcolor import colored
from threading import Thread
from itertools import cycle

# Global flag used by fetch_urls to stop the loader animation thread
stop_loader = False

# Loader animation
def loader_animation(message="Processing..."):
    animation = cycle(["|", "/", "-", "\\"])
    while not stop_loader:
        print(f"\r{message} {next(animation)}", end="")
        time.sleep(0.1)
    print("\r" + " " * (len(message) + 2) + "\r", end="")  # Clear the message and spinner

# ASCII Art for aesthetics
def print_ascii_art():
    ascii_art = r'''
. . .__ . .___ .
| | _. .[__) _. _.;_/. .._ [__ *._ _| _ ._.
|/\|(_]\_|[__)(_](_.| \(_|[_) | |[ )(_](/,[
._| |
'''
    print(ascii_art)

print_ascii_art()

# Load extensions from file
def load_extensions_from_file(file_path='extensions.txt'):
    try:
        with open(file_path, 'r') as f:
            extensions = [line.strip() for line in f if line.strip()]
        return extensions
    except FileNotFoundError:
        print(colored(f"{file_path} not found. Proceeding with no extensions.", "red"))
        return []

# Load domains from file
def load_domains_from_file(file_path):
    try:
        with open(file_path, 'r') as f:
            domains = [line.strip() for line in f if line.strip()]
        return domains
    except FileNotFoundError:
        print(colored(f"{file_path} not found. Exiting.", "red"))
        exit()

# Fetch URLs using the Wayback Machine CDX API with streaming and backoff
def fetch_urls(target, file_extensions):
    print(f"\nFetching URLs from the Wayback Machine for {target}...")
    archive_url = f'https://web.archive.org/cdx/search/cdx?url=*.{target}/*&output=txt&fl=original&collapse=urlkey&page=/'

    global stop_loader
    stop_loader = False
    loader_thread = Thread(target=loader_animation, args=("Fetching URLs...",))
    loader_thread.start()

    max_retries = 3  # Maximum number of retries
    retry_delay = 5  # Delay between retries (in seconds)
    attempt = 0

    try:
        while attempt < max_retries:
            try:
                with requests.get(archive_url, stream=True, timeout=60) as response:  # Stream the response
                    response.raise_for_status()
                    print(colored("\nStreaming response from archive...", "green"))

                    url_list = []
                    total_lines = 0
                    for line in response.iter_lines(decode_unicode=True):  # Process each line incrementally
                        if line:
                            url_list.append(line)
                            total_lines += 1
                            if total_lines % 1000 == 0:  # Show progress every 1000 lines
                                print(f"\rFetched {total_lines} URLs...", end="")

                    print(colored(f"\nFetched {total_lines} URLs from archive.", "green"))
                    return {ext: [url for url in url_list if url.lower().endswith(ext.lower())] for ext in file_extensions}
            except requests.exceptions.RequestException as e:
                attempt += 1
                if attempt < max_retries:
                    print(colored(f"\nAttempt {attempt} failed: {e}. Retrying in {retry_delay} seconds...", "yellow"))
                    time.sleep(retry_delay)
                else:
                    print(colored(f"\nError fetching URLs after {max_retries} attempts: {e}", "red"))
                    print(colored("The server may be rate-limiting or refusing connections.", "yellow"))
                    print(colored("Pausing for 5 minutes before continuing...", "yellow"))
                    time.sleep(300)  # Sleep for 5 minutes (300 seconds)
                    print(colored("Resuming...", "green"))
                    return {}  # Return an empty dictionary after backoff
    finally:
        stop_loader = True  # Always stop the loader thread, even on failure
        loader_thread.join()

# Check for archived snapshots
def check_wayback_snapshot(url):
    wayback_url = f'https://archive.org/wayback/available?url={url}'
    try:
        response = requests.get(wayback_url, timeout=30)
        response.raise_for_status()
        data = response.json()
        if "archived_snapshots" in data and "closest" in data["archived_snapshots"]:
            snapshot_url = data["archived_snapshots"]["closest"].get("url")
            if snapshot_url:
                print(f"[+] Found possible backup: {colored(snapshot_url, 'green')}")
            else:
                print(f"[-] No archived snapshot found for {url}.")
    except Exception as e:
        print(f"[?] Error checking Wayback snapshot for {url}: {e}")

# Save filtered URLs
def save_urls(target, extension_stats, file_suffix="_filtered_urls.txt"):
    folder = f"content/{target}"
    os.makedirs(folder, exist_ok=True)
    all_filtered_urls = []
    for ext, urls in extension_stats.items():
        if urls:
            file_path = os.path.join(folder, f"{target}_{ext.strip('.')}{file_suffix}")
            with open(file_path, 'w') as file:
                file.write("\n".join(urls))
            all_filtered_urls.extend(urls)
            print(f"Filtered URLs for {ext} saved to: {colored(file_path, 'green')}")
    return all_filtered_urls

# Process domain
def process_domain(target, file_extensions):
    extension_stats = fetch_urls(target, file_extensions)
    if not extension_stats:  # Ensure extension_stats is not empty
        print(colored(f"No URLs fetched for {target}. Skipping...", "yellow"))
        return
    all_filtered_urls = save_urls(target, extension_stats)
    for url in all_filtered_urls:
        check_wayback_snapshot(url)

# Main execution
if __name__ == "__main__":
    init()
    print(colored(' Coded with Love by Anmol K Sachan @Fr13ND0x7f\n', 'green'))

    # Input: Single or multiple domains
    mode = input("Select mode (1: Single Domain, 2: Multiple Domains): ").strip()
    if mode == "1":
        target = input("\nEnter the target domain (e.g., example.com): ").strip()
        if not target:
            print(colored("Target domain is required. Exiting.", "red"))
            exit()
        domains = [target]
    elif mode == "2":
        domain_file = input("\nEnter the path to the file containing domain list: ").strip()
        domains = load_domains_from_file(domain_file)
        print(f"Loaded {len(domains)} domains from {colored(domain_file, 'green')}.")
    else:
        print(colored("Invalid choice. Exiting.", "red"))
        exit()

    # Load default extensions from file
    default_extensions = load_extensions_from_file()
    choice = input("Use custom file extensions or load from extensions.txt? (custom/load): ").strip().lower()
    if choice == "custom":
        # Split on commas, dropping surrounding whitespace and empty entries
        raw = input("Enter file extensions to filter (e.g., .zip,.pdf): ")
        file_extensions = [ext.strip() for ext in raw.split(",") if ext.strip()]
        if not file_extensions:
            print(colored("No extensions provided. Exiting.", "red"))
            exit()
    elif choice == "load" and default_extensions:
        file_extensions = default_extensions
    else:
        print(colored("No extensions found. Exiting.", "red"))
        exit()

    # Process each domain
    for target in domains:
        print(colored(f"\nProcessing domain: {target}", "blue"))
        process_domain(target, file_extensions)

    print(colored("\nProcess complete for all domains.", "green"))
--------------------------------------------------------------------------------