└── README.md

/README.md:
--------------------------------------------------------------------------------
# EmailCrawl

![Python](https://img.shields.io/badge/Python-3.6+-blue.svg)
![License](https://img.shields.io/badge/License-MIT-green.svg)
![OSINT](https://img.shields.io/badge/OSINT-Tool-orange.svg)
![Security](https://img.shields.io/badge/Cybersecurity-Professional-red.svg)

**EmailCrawl** is a professional-grade OSINT (Open Source Intelligence) tool designed for advanced email address extraction through comprehensive web crawling. Built for cybersecurity professionals, penetration testers, and intelligence analysts.

---

## Table of Contents

- Overview
- Key Features
- Importance in OSINT & Cybersecurity
- Installation
- Usage
- Output Examples
- Use Cases
- Legal & Ethical Considerations
- Contributing
- Disclaimer

---

## Overview

EmailCrawl is a specialized reconnaissance tool that systematically crawls websites and their subdirectories to extract valid email addresses. Unlike generic web scrapers, it employs advanced validation algorithms to filter out false positives and identify legitimate contact information.

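
This README does not document how the extraction and validation steps work internally. The following is a minimal sketch of one plausible approach, assuming a regex-based extractor built only on the standard library. The names `extract_emails`, `is_plausible_email`, and `ASSET_SUFFIXES` are hypothetical and are not taken from the actual script, which may work differently.

```python
import re
from typing import List
from urllib.parse import unquote

# Deliberately simple pattern; the full address syntax (RFC 5322) is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# Captures the address part of a mailto: link, stopping before any "?subject=...".
MAILTO_RE = re.compile(r"mailto:([^\"'?>\s]+)", re.IGNORECASE)

# Hypothetical false-positive filter: asset names such as "logo@2x.png"
# match EMAIL_RE but are not email addresses.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")


def is_plausible_email(candidate: str) -> bool:
    """Reject candidates that match the pattern but are unlikely to be real addresses."""
    local, _, domain = candidate.rpartition("@")
    if not local or not domain:
        return False
    if candidate.lower().endswith(ASSET_SUFFIXES):
        return False
    if ".." in candidate or local.startswith(".") or local.endswith("."):
        return False
    tld = domain.rsplit(".", 1)[-1]
    return tld.isalpha() and len(tld) >= 2


def extract_emails(html: str) -> List[str]:
    """Return unique, plausible addresses from raw HTML in order of first appearance."""
    candidates = [unquote(match) for match in MAILTO_RE.findall(html)]
    candidates += EMAIL_RE.findall(html)
    seen = set()
    result = []
    for candidate in candidates:
        # Lowercasing is a simplification used here for deduplication.
        email = candidate.strip().lower()
        if email not in seen and is_plausible_email(email):
            seen.add(email)
            result.append(email)
    return result


if __name__ == "__main__":
    sample = (
        '<a href="mailto:John.Doe@company.com?subject=Hi">Mail</a> '
        "<p>info@company.com, INFO@company.com</p> "
        '<img src="logo@2x.png">'
    )
    print(extract_emails(sample))  # ['john.doe@company.com', 'info@company.com']
```

In this sketch the image name `logo@2x.png` is dropped by the suffix check, and the repeated `INFO@company.com` is collapsed by the `seen` set. A production filter would likely need more rules, such as rejecting placeholder domains.
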

---

## Key Features

### Advanced Crawling Engine

• **Intelligent URL Discovery** - Parses robots.txt, sitemaps, and HTML content

• **Configurable Depth Control** - Adjustable crawl depth and page limits

• **Same-Domain Focus** - Stays targeted on the specified domain

• **Respectful Crawling** - Built-in delays and proper user-agent headers

### Smart Email Extraction

• **Multi-Source Extraction** - Visible text, meta tags, and mailto links

• **Advanced Validation** - Sophisticated false positive filtering

• **Real-Time Discovery** - Live email detection with source URLs

• **Duplicate Prevention** - Automatic deduplication of found emails

### Professional Features

• **Proxy Support** - HTTP/SOCKS proxy integration for anonymity

• **Comprehensive Reporting** - JSON and TXT output formats

• **Domain Analysis** - Email distribution by domain

• **Performance Metrics** - Detailed crawl statistics and timing

## Importance in OSINT & Cybersecurity

### OSINT Intelligence Gathering

**Primary Use Cases**:

• **Threat Intelligence** - Identify key personnel in target organizations

• **Attack Surface Mapping** - Discover contact points for social engineering

• **Corporate Reconnaissance** - Map organizational structure via email patterns

### Cybersecurity Applications

**Defensive Security**:

• **External Threat Assessment** - Understand what email information is publicly exposed

• **Phishing Defense** - Identify emails that could be targeted in phishing campaigns

• **Security Awareness** - Demonstrate information exposure to employees

**Offensive Security**:

• **Penetration Testing** - Email discovery for authorized security assessments

• **Red Team Operations** - Social engineering preparation and reconnaissance

• **Vulnerability Assessment** - Identify information disclosure risks

• **Investigative Research** - Gather contact information for legal investigations

### Strategic Value

• **Early Warning System** - Detect exposed email addresses before attackers do

• **Attack Prevention** - Proactively secure vulnerable contact points

• **Compliance** - Identify GDPR/Privacy Act compliance issues

• **Risk Management** - Quantify organizational exposure through public data

## Installation

**Prerequisites**

• Python 3.6 or higher

• pip package manager

**Quick Manual Installation**

Visit the link below to get the script, then use nano to install it:

**https://gist.github.com/techenthusiast167/6de049a8260cf8c53d424aeb7ff8402d**

**Step-by-Step**:

• Click on the link above to access the script

• Copy the script content

• Use nano to create the file (`nano emailcrawl.py`), paste the script into it, and save

**Install dependencies**

    pip install requests beautifulsoup4 tldextract colorama

## Usage

**Basic Command**

    python3 emailcrawl.py https://example.com

**Deep Reconnaissance With Custom Limits**

    python3 emailcrawl.py https://example.com --max-pages 500 --max-depth 4

**With Proxy For Operational Security**

    python3 emailcrawl.py https://example.com --proxy http://127.0.0.1:8080

**Faster Crawling With A Shorter Delay**

    python3 emailcrawl.py https://example.com --delay 0.5

### Full Command Reference

| Option | Description | Default |
|--------|-------------|---------|
| `--max-pages NUM` | Maximum pages to crawl | 200 |
| `--max-depth NUM` | Maximum crawl depth | 3 |
| `--output FILE` | Custom output file path | auto-generated |
| `--proxy URL` | HTTP/SOCKS proxy URL | none |
| `--delay SECONDS` | Delay between requests | 1 |
| `-h, --help` | Show help message | N/A |

## Output Examples

**Real-Time Discovery**

    [EMAIL] Found: john.doe@company.com (mailto link from: https://company.com/contact)
    [EMAIL] Found: sarah.wilson@company.com (from: https://company.com/team)
    [EMAIL] Found: info@company.com (from: https://company.com/about)

**Summary Report**

    ============================================================
    EMAILCRAWL - EXTRACTION SUMMARY
    ============================================================
    [STATS] Pages Crawled: 147
    [STATS] Unique Emails Found: 23
    [STATS] Crawl Duration: 45.32 seconds

**EMAIL DOMAINS DISTRIBUTION**:

    company.com: 18 emails
    gmail.com: 3 emails
    outlook.com: 2 emails

**UNIQUE EMAILS EXTRACTED**:

    admin@company.com
    info@company.com
    john.doe@company.com

## Legal & Ethical Considerations

**Authorized Use Only**

### Legal Uses

✓ Security assessments with explicit permission
✓ Your own websites and applications
✓ Public bug bounty programs
✓ Academic research with ethics approval

### Illegal Uses

✗ Unauthorized access to systems
✗ Harassment or spam campaigns
✗ Commercial exploitation without consent
✗ Violation of terms of service

### Responsible Disclosure

• Always obtain proper authorization before scanning

• Respect robots.txt and website terms of service

• Use appropriate rate limiting to avoid service disruption

• Report discovered vulnerabilities responsibly

## Contributing

The author welcomes contributions from the security community!

**Areas for Improvement**:

• Enhanced email pattern recognition

• Additional output formats (CSV, XML)

• Integration with other OSINT tools

• Performance optimizations

**Contribution Process**

• Fork the repository

• Create a feature branch

• Submit a pull request with comprehensive testing

--------------------------------------------------------------------------------