├── .github
│   └── FUNDING.yml
├── LICENSE
├── README.md
├── bad_words.txt
├── config.py
├── connection_tree.py
├── connection_tree_v2.py
├── crawler.py
├── current.ver
├── docs.md
├── fix_db.py
├── install.sh
├── installer.py
├── mongo_db.py
├── opencrawler
├── opencrawler.1
├── proxy_tool.py
├── requirements.txt
├── robots_txt.py
├── search.py
├── search_website.py
├── templates
│   ├── index.html
│   └── tree.html
└── tree_website.py
/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 |
3 | patreon: cactochan
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Cactochan
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Open Crawler
3 |
4 | Open Source Website Crawler
5 |
6 | Explore the docs »
7 |
8 | Report Bug . Request Feature
9 |
28 | ## Table Of Contents
29 |
30 | * [About the Project](#about-the-project)
31 | * [Features](#features)
32 | * [Getting Started](#getting-started)
33 | * [Installation](#installation)
34 | * [Usage](#usage)
35 | * [Contributing](#contributing)
36 | * [License](#license)
37 | * [Authors](#authors)
38 |
39 | ## About The Project
40 |
41 | 
42 |
43 |
44 |
45 | *An Open Source Crawler/Spider*
46 |
47 | Can be used by anyone... and can be run on any Windows / Linux computer.
48 | It isn't a crawler for industrial use, as it is written in a slow programming language and may have its own issues.
49 |
50 | The project can be easily used with MongoDB.
51 |
52 | The project can also be used for pentesting.
53 |
54 | ## Features
55 |
56 | - Cross Platform
57 | - Installer for linux
58 | - Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
59 | - Memory efficient [ig]
60 | - Pool Crawling - use multiple crawlers at the same time
61 | - Supports robots.txt
62 | - MongoDB [DB]
63 | - Language Detection
64 | - 18+ Checks / Offensive Content Check
65 | - Proxies
66 | - Multi Threading
67 | - URL Scanning
68 | - Keyword, Description and Recurring-Word Logging
69 | - Search Website - search_website.py
70 | - Connection Tree Website - tree_website.py
71 | - Tool for finding proxies - proxy_tool.py
72 |
73 | ## Getting Started
74 |
75 | The first thing is to install the project...
76 | The installer provided is only for Linux.
77 |
78 | On Windows the application won't be added to PATH and the requirements won't be installed automatically, so check out the installation procedure for Windows below.
79 |
80 | ### Installation
81 |
82 | ##### Linux
83 |
84 | ```shell
85 | git clone https://github.com/merwin-asm/OpenCrawler.git
86 | ```
87 | ```shell
88 | cd OpenCrawler
89 | ```
90 | ```shell
91 | chmod +x install.sh && ./install.sh
92 | ```
93 |
94 | ##### Windows
95 |
96 | *You need git, python3 and pip installed*
97 |
98 | ```shell
99 | git clone https://github.com/merwin-asm/OpenCrawler.git
100 | ```
101 | ```shell
102 | cd OpenCrawler
103 | ```
104 | ```shell
105 | pip install -r requirements.txt
106 | ```
107 |
108 |
109 | ## Usage
110 |
111 | The project can be used for :
112 | - Making a (not that good) search engine
113 | - OSINT
114 | - Pentesting
115 |
116 | ##### Linux
117 |
118 | To see available commands
119 |
120 | ```sh
121 | opencrawler help
122 | ```
123 |
124 | or
125 |
126 | ```sh
127 | man opencrawler
128 | ```
129 |
130 | ##### Windows
131 |
132 | To see available commands
133 |
134 | ```sh
135 | python opencrawler help
136 | ```
137 |
138 |
139 |
140 |
141 |
142 | ## Contributing
143 |
144 | Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
145 | * If you have suggestions for adding or removing projects, feel free to [open an issue](https://github.com/merwin-asm/OpenCrawler/issues/new) to discuss it, or directly create a pull request after you edit the *README.md* file with necessary changes.
146 | * Please make sure you check your spelling and grammar.
147 | * Create an individual PR for each suggestion.
148 |
149 |
150 | ## License
151 |
152 | Distributed under the MIT License. See [LICENSE](https://github.com/merwin-asm/OpenCrawler/blob/main/LICENSE) for more information.
153 |
154 | ## Authors
155 |
156 | * **Merwin A J** - *CS Student* - [Merwin A J](https://github.com/merwin-asm/) - *Built OpenCrawler*
157 |
158 | ### Uses Materials From :
159 |
160 | - https://github.com/coffee-and-fun/google-profanity-words
161 |
162 |
--------------------------------------------------------------------------------
/bad_words.txt:
--------------------------------------------------------------------------------
1 | 2 girls 1 cup
2 | 2g1c
3 | 4r5e
4 | 5h1t
5 | 5hit
6 | a55
7 | a_s_s
8 | acrotomophilia
9 | alabama hot pocket
10 | alaskan pipeline
11 | anal
12 | anilingus
13 | anus
14 | apeshit
15 | ar5e
16 | arrse
17 | arse
18 | arsehole
19 | ass
20 | ass-fucker
21 | ass-hat
22 | ass-pirate
23 | assbag
24 | assbandit
25 | assbanger
26 | assbite
27 | assclown
28 | asscock
29 | asscracker
30 | asses
31 | assface
32 | assfucker
33 | assfukka
34 | assgoblin
35 | asshat
36 | asshead
37 | asshole
38 | assholes
39 | asshopper
40 | assjacker
41 | asslick
42 | asslicker
43 | assmonkey
44 | assmunch
45 | assmuncher
46 | asspirate
47 | assshole
48 | asssucker
49 | asswad
50 | asswhole
51 | asswipe
52 | auto erotic
53 | autoerotic
54 | b!tch
55 | b00bs
56 | b17ch
57 | b1tch
58 | babeland
59 | baby batter
60 | baby juice
61 | ball gag
62 | ball gravy
63 | ball kicking
64 | ball licking
65 | ball sack
66 | ball sucking
67 | ballbag
68 | balls
69 | ballsack
70 | bampot
71 | bangbros
72 | bareback
73 | barely legal
74 | barenaked
75 | bastard
76 | bastardo
77 | bastinado
78 | bbw
79 | bdsm
80 | beaner
81 | beaners
82 | beastial
83 | beastiality
84 | beastility
85 | beaver cleaver
86 | beaver lips
87 | bellend
88 | bestial
89 | bestiality
90 | bi+ch
91 | biatch
92 | big black
93 | big breasts
94 | big knockers
95 | big tits
96 | bimbos
97 | birdlock
98 | bitch
99 | bitcher
100 | bitchers
101 | bitches
102 | bitchin
103 | bitching
104 | black cock
105 | blonde action
106 | blonde on blonde action
107 | bloody
108 | blow job
109 | blow your load
110 | blowjob
111 | blowjobs
112 | blue waffle
113 | blumpkin
114 | boiolas
115 | bollock
116 | bollocks
117 | bollok
118 | bollox
119 | bondage
120 | boner
121 | boob
122 | boobie
123 | boobs
124 | booobs
125 | boooobs
126 | booooobs
127 | booooooobs
128 | booty call
129 | breasts
130 | brown showers
131 | brunette action
132 | buceta
133 | bugger
134 | bukkake
135 | bulldyke
136 | bullet vibe
137 | bullshit
138 | bum
139 | bung hole
140 | bunghole
141 | bunny fucker
142 | busty
143 | butt
144 | butt-pirate
145 | buttcheeks
146 | butthole
147 | buttmunch
148 | buttplug
149 | c0ck
150 | c0cksucker
151 | camel toe
152 | camgirl
153 | camslut
154 | camwhore
155 | carpet muncher
156 | carpetmuncher
157 | cawk
158 | chinc
159 | chink
160 | choad
161 | chocolate rosebuds
162 | chode
163 | cipa
164 | circlejerk
165 | cl1t
166 | cleveland steamer
167 | clit
168 | clitface
169 | clitoris
170 | clits
171 | clover clamps
172 | clusterfuck
173 | cnut
174 | cock
175 | cock-sucker
176 | cockbite
177 | cockburger
178 | cockface
179 | cockhead
180 | cockjockey
181 | cockknoker
182 | cockmaster
183 | cockmongler
184 | cockmongruel
185 | cockmonkey
186 | cockmunch
187 | cockmuncher
188 | cocknose
189 | cocknugget
190 | cocks
191 | cockshit
192 | cocksmith
193 | cocksmoker
194 | cocksuck
195 | cocksuck
196 | cocksucked
197 | cocksucked
198 | cocksucker
199 | cocksucking
200 | cocksucks
201 | cocksuka
202 | cocksukka
203 | cok
204 | cokmuncher
205 | coksucka
206 | coochie
207 | coochy
208 | coon
209 | coons
210 | cooter
211 | coprolagnia
212 | coprophilia
213 | cornhole
214 | cox
215 | crap
216 | creampie
217 | cum
218 | cumbubble
219 | cumdumpster
220 | cumguzzler
221 | cumjockey
222 | cummer
223 | cumming
224 | cums
225 | cumshot
226 | cumslut
227 | cumtart
228 | cunilingus
229 | cunillingus
230 | cunnie
231 | cunnilingus
232 | cunt
233 | cuntface
234 | cunthole
235 | cuntlick
236 | cuntlick
237 | cuntlicker
238 | cuntlicker
239 | cuntlicking
240 | cuntlicking
241 | cuntrag
242 | cunts
243 | cyalis
244 | cyberfuc
245 | cyberfuck
246 | cyberfucked
247 | cyberfucker
248 | cyberfuckers
249 | cyberfucking
250 | d1ck
251 | dammit
252 | damn
253 | darkie
254 | date rape
255 | daterape
256 | deep throat
257 | deepthroat
258 | dendrophilia
259 | dick
260 | dickbag
261 | dickbeater
262 | dickface
263 | dickhead
264 | dickhole
265 | dickjuice
266 | dickmilk
267 | dickmonger
268 | dickslap
269 | dicksucker
270 | dickwad
271 | dickweasel
272 | dickweed
273 | dickwod
274 | dike
275 | dildo
276 | dildos
277 | dingleberries
278 | dingleberry
279 | dink
280 | dinks
281 | dipshit
282 | dirsa
283 | dirty pillows
284 | dirty sanchez
285 | dlck
286 | dog style
287 | dog-fucker
288 | doggie style
289 | doggiestyle
290 | doggin
291 | dogging
292 | doggy style
293 | doggystyle
294 | dolcett
295 | domination
296 | dominatrix
297 | dommes
298 | donkey punch
299 | donkeyribber
300 | doochbag
301 | dookie
302 | doosh
303 | double dong
304 | double penetration
305 | douche
306 | douchebag
307 | dp action
308 | dry hump
309 | duche
310 | dumbshit
311 | dumshit
312 | dvda
313 | dyke
314 | eat my ass
315 | ecchi
316 | ejaculate
317 | ejaculated
318 | ejaculates
319 | ejaculating
320 | ejaculatings
321 | ejaculation
322 | ejakulate
323 | erotic
324 | erotism
325 | escort
326 | eunuch
327 | f u c k
328 | f u c k e r
329 | f4nny
330 | f_u_c_k
331 | fag
332 | fagbag
333 | fagg
334 | fagging
335 | faggit
336 | faggitt
337 | faggot
338 | faggs
339 | fagot
340 | fagots
341 | fags
342 | fagtard
343 | fanny
344 | fannyflaps
345 | fannyfucker
346 | fanyy
347 | fart
348 | farted
349 | farting
350 | farty
351 | fatass
352 | fcuk
353 | fcuker
354 | fcuking
355 | fecal
356 | feck
357 | fecker
358 | felatio
359 | felch
360 | felching
361 | fellate
362 | fellatio
363 | feltch
364 | female squirting
365 | femdom
366 | figging
367 | fingerbang
368 | fingerfuck
369 | fingerfucked
370 | fingerfucker
371 | fingerfuckers
372 | fingerfucking
373 | fingerfucks
374 | fingering
375 | fistfuck
376 | fistfucked
377 | fistfucker
378 | fistfuckers
379 | fistfucking
380 | fistfuckings
381 | fistfucks
382 | fisting
383 | flamer
384 | flange
385 | fook
386 | fooker
387 | foot fetish
388 | footjob
389 | frotting
390 | fuck
391 | fuck buttons
392 | fucka
393 | fucked
394 | fucker
395 | fuckers
396 | fuckhead
397 | fuckheads
398 | fuckin
399 | fucking
400 | fuckings
401 | fuckingshitmotherfucker
402 | fuckme
403 | fucks
404 | fucktards
405 | fuckwhit
406 | fuckwit
407 | fudge packer
408 | fudgepacker
409 | fuk
410 | fuker
411 | fukker
412 | fukkin
413 | fuks
414 | fukwhit
415 | fukwit
416 | futanari
417 | fux
418 | fux0r
419 | g-spot
420 | gang bang
421 | gangbang
422 | gangbanged
423 | gangbanged
424 | gangbangs
425 | gay sex
426 | gayass
427 | gaybob
428 | gaydo
429 | gaylord
430 | gaysex
431 | gaytard
432 | gaywad
433 | genitals
434 | giant cock
435 | girl on
436 | girl on top
437 | girls gone wild
438 | goatcx
439 | goatse
440 | god damn
441 | god-dam
442 | god-damned
443 | goddamn
444 | goddamned
445 | gokkun
446 | golden shower
447 | goo girl
448 | gooch
449 | goodpoop
450 | gook
451 | goregasm
452 | gringo
453 | grope
454 | group sex
455 | guido
456 | guro
457 | hand job
458 | handjob
459 | hard core
460 | hardcore
461 | hardcoresex
462 | heeb
463 | hell
464 | hentai
465 | heshe
466 | ho
467 | hoar
468 | hoare
469 | hoe
470 | hoer
471 | homo
472 | homoerotic
473 | honkey
474 | honky
475 | hooker
476 | hore
477 | horniest
478 | horny
479 | hot carl
480 | hot chick
481 | hotsex
482 | how to kill
483 | how to murder
484 | huge fat
485 | humping
486 | incest
487 | intercourse
488 | jack off
489 | jack-off
490 | jackass
491 | jackoff
492 | jail bait
493 | jailbait
494 | jap
495 | jelly donut
496 | jerk off
497 | jerk-off
498 | jigaboo
499 | jiggaboo
500 | jiggerboo
501 | jism
502 | jiz
503 | jiz
504 | jizm
505 | jizm
506 | jizz
507 | juggs
508 | kawk
509 | kike
510 | kinbaku
511 | kinkster
512 | kinky
513 | kiunt
514 | knob
515 | knobbing
516 | knobead
517 | knobed
518 | knobend
519 | knobhead
520 | knobjocky
521 | knobjokey
522 | kock
523 | kondum
524 | kondums
525 | kooch
526 | kootch
527 | kum
528 | kumer
529 | kummer
530 | kumming
531 | kums
532 | kunilingus
533 | kunt
534 | kyke
535 | l3i+ch
536 | l3itch
537 | labia
538 | leather restraint
539 | leather straight jacket
540 | lemon party
541 | lesbo
542 | lezzie
543 | lmfao
544 | lolita
545 | lovemaking
546 | lust
547 | lusting
548 | m0f0
549 | m0fo
550 | m45terbate
551 | ma5terb8
552 | ma5terbate
553 | make me come
554 | male squirting
555 | masochist
556 | master-bate
557 | masterb8
558 | masterbat*
559 | masterbat3
560 | masterbate
561 | masterbation
562 | masterbations
563 | masturbate
564 | menage a trois
565 | milf
566 | minge
567 | missionary position
568 | mo-fo
569 | mof0
570 | mofo
571 | mothafuck
572 | mothafucka
573 | mothafuckas
574 | mothafuckaz
575 | mothafucked
576 | mothafucker
577 | mothafuckers
578 | mothafuckin
579 | mothafucking
580 | mothafuckings
581 | mothafucks
582 | mother fucker
583 | motherfuck
584 | motherfucked
585 | motherfucker
586 | motherfuckers
587 | motherfuckin
588 | motherfucking
589 | motherfuckings
590 | motherfuckka
591 | motherfucks
592 | mound of venus
593 | mr hands
594 | muff
595 | muff diver
596 | muffdiver
597 | muffdiving
598 | mutha
599 | muthafecker
600 | muthafuckker
601 | muther
602 | mutherfucker
603 | n1gga
604 | n1gger
605 | nambla
606 | nawashi
607 | nazi
608 | negro
609 | neonazi
610 | nig nog
611 | nigg3r
612 | nigg4h
613 | nigga
614 | niggah
615 | niggas
616 | niggaz
617 | nigger
618 | niggers
619 | niglet
620 | nimphomania
621 | nipple
622 | nipples
623 | nob
624 | nob jokey
625 | nobhead
626 | nobjocky
627 | nobjokey
628 | nsfw images
629 | nude
630 | nudity
631 | numbnuts
632 | nutsack
633 | nympho
634 | nymphomania
635 | octopussy
636 | omorashi
637 | one cup two girls
638 | one guy one jar
639 | orgasim
640 | orgasim
641 | orgasims
642 | orgasm
643 | orgasms
644 | orgy
645 | p0rn
646 | paedophile
647 | paki
648 | panooch
649 | panties
650 | panty
651 | pawn
652 | pecker
653 | peckerhead
654 | pedobear
655 | pedophile
656 | pegging
657 | penis
658 | penisfucker
659 | phone sex
660 | phonesex
661 | phuck
662 | phuk
663 | phuked
664 | phuking
665 | phukked
666 | phukking
667 | phuks
668 | phuq
669 | piece of shit
670 | pigfucker
671 | pimpis
672 | pis
673 | pises
674 | pisin
675 | pising
676 | pisof
677 | piss
678 | piss pig
679 | pissed
680 | pisser
681 | pissers
682 | pisses
683 | pissflap
684 | pissflaps
685 | pissin
686 | pissin
687 | pissing
688 | pissoff
689 | pissoff
690 | pisspig
691 | playboy
692 | pleasure chest
693 | pole smoker
694 | polesmoker
695 | pollock
696 | ponyplay
697 | poo
698 | poof
699 | poon
700 | poonani
701 | poonany
702 | poontang
703 | poop
704 | poop chute
705 | poopchute
706 | porn
707 | porno
708 | pornography
709 | pornos
710 | prick
711 | pricks
712 | prince albert piercing
713 | pron
714 | pthc
715 | pube
716 | pubes
717 | punanny
718 | punany
719 | punta
720 | pusies
721 | pusse
722 | pussi
723 | pussies
724 | pussy
725 | pussylicking
726 | pussys
727 | pusy
728 | puto
729 | queaf
730 | queef
731 | queerbait
732 | queerhole
733 | quim
734 | raghead
735 | raging boner
736 | rape
737 | raping
738 | rapist
739 | rectum
740 | renob
741 | retard
742 | reverse cowgirl
743 | rimjaw
744 | rimjob
745 | rimming
746 | rosy palm
747 | rosy palm and her 5 sisters
748 | ruski
749 | rusty trombone
750 | s hit
751 | s&m
752 | s.o.b.
753 | s_h_i_t
754 | sadism
755 | sadist
756 | santorum
757 | scat
758 | schlong
759 | scissoring
760 | screwing
761 | scroat
762 | scrote
763 | scrotum
764 | semen
765 | sex
766 | sexo
767 | sexy
768 | sh!+
769 | sh!t
770 | sh1t
771 | shag
772 | shagger
773 | shaggin
774 | shagging
775 | shaved beaver
776 | shaved pussy
777 | shemale
778 | shi+
779 | shibari
780 | shit
781 | shit-ass
782 | shit-bag
783 | shit-bagger
784 | shit-brain
785 | shit-breath
786 | shit-cunt
787 | shit-dick
788 | shit-eating
789 | shit-face
790 | shit-faced
791 | shit-fit
792 | shit-head
793 | shit-heel
794 | shit-hole
795 | shit-house
796 | shit-load
797 | shit-pot
798 | shit-spitter
799 | shit-stain
800 | shitass
801 | shitbag
802 | shitbagger
803 | shitblimp
804 | shitbrain
805 | shitbreath
806 | shitcunt
807 | shitdick
808 | shite
809 | shiteating
810 | shited
811 | shitey
812 | shitface
813 | shitfaced
814 | shitfit
815 | shitfuck
816 | shitfull
817 | shithead
818 | shitheel
819 | shithole
820 | shithouse
821 | shiting
822 | shitings
823 | shitload
824 | shitpot
825 | shits
826 | shitspitter
827 | shitstain
828 | shitted
829 | shitter
830 | shitters
831 | shittiest
832 | shitting
833 | shittings
834 | shitty
835 | shitty
836 | shity
837 | shiz
838 | shiznit
839 | shota
840 | shrimping
841 | skank
842 | skeet
843 | slanteye
844 | slut
845 | slutbag
846 | sluts
847 | smeg
848 | smegma
849 | smut
850 | snatch
851 | snowballing
852 | sodomize
853 | sodomy
854 | son-of-a-bitch
855 | spac
856 | spic
857 | spick
858 | splooge
859 | splooge moose
860 | spooge
861 | spread legs
862 | spunk
863 | strap on
864 | strapon
865 | strappado
866 | strip club
867 | style doggy
868 | suck
869 | sucks
870 | suicide girls
871 | sultry women
872 | swastika
873 | swinger
874 | t1tt1e5
875 | t1tties
876 | tainted love
877 | tard
878 | taste my
879 | tea bagging
880 | teets
881 | teez
882 | testical
883 | testicle
884 | threesome
885 | throating
886 | thundercunt
887 | tied up
888 | tight white
889 | tit
890 | titfuck
891 | tits
892 | titt
893 | tittie5
894 | tittiefucker
895 | titties
896 | titty
897 | tittyfuck
898 | tittywank
899 | titwank
900 | tongue in a
901 | topless
902 | tosser
903 | towelhead
904 | tranny
905 | tribadism
906 | tub girl
907 | tubgirl
908 | turd
909 | tushy
910 | tw4t
911 | twat
912 | twathead
913 | twatlips
914 | twatty
915 | twink
916 | twinkie
917 | two girls one cup
918 | twunt
919 | twunter
920 | undressing
921 | upskirt
922 | urethra play
923 | urophilia
924 | v14gra
925 | v1gra
926 | va-j-j
927 | vag
928 | vagina
929 | venus mound
930 | viagra
931 | vibrator
932 | violet wand
933 | vjayjay
934 | vorarephilia
935 | voyeur
936 | vulva
937 | w00se
938 | wang
939 | wank
940 | wanker
941 | wanky
942 | wet dream
943 | wetback
944 | white power
945 | whoar
946 | whore
947 | willies
948 | willy
949 | wrapping men
950 | wrinkled starfish
951 | xrated
952 | xx
953 | xxx
954 | yaoi
955 | yellow showers
956 | yiffy
957 | zoophilia
958 | 🖕
959 |
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | """
2 | Configures the Open Crawler v 1.0.0
3 | """
4 |
5 |
6 | from rich import print
7 | import getpass
8 | import json
9 | import os
10 |
11 |
12 | print("[blue][bold]Configuring Open Crawler v 0.0.1[/bold] - File : config.json[/blue]")
13 |
14 | if os.path.exists("config.json"):
15 | print("[yellow] config.json already found , do you want to rewrite it ? [y/n][/yellow]", end="")
16 | res = input(" ").lower()
17 |
18 | if res == "y":
19 | os.remove("config.json")
20 | else:
21 | exit()
22 |
23 |
24 | configs = {}
25 |
26 |
27 |
28 | print("\n[green]-----------------------------Writing to config.json-----------------------------[/green]\n")
29 |
30 |
31 | print("[dark_orange] [?] MongoDB's Password ?[/dark_orange]", end="")
32 | configs.setdefault("MONGODB_PWD", getpass.getpass(prompt=" "))
33 |
34 | print("[dark_orange] [?] URI Provided By MongoDB ?[/dark_orange]", end="")
35 | configs.setdefault("MONGODB_URI", input(" "))
36 |
37 | print("[dark_orange] [?] Timeout For Requests ?[/dark_orange]", end="")
38 | configs.setdefault("TIMEOUT", int(input(" ")))
39 |
40 | print("[dark_orange] [?] Maximum Threads To Be Used ?[/dark_orange]", end="")
41 | configs.setdefault("MAX_THREADS", int(input(" ")))
42 |
43 | print("[dark_orange] [?] Flaged/Bad words list (enter for default) ?[/dark_orange]", end="")
44 | res = input(" ")
45 |
46 | if res == "":
47 | res = "bad_words.txt"
48 |
49 | configs.setdefault("bad_words", res)
50 |
51 | print("[dark_orange] [?] Use Proxies (y/n) ?[/dark_orange]", end="")
52 | res = input(" ").lower()
53 |
54 | if res == "y":
55 | res = True
56 | else:
57 | res = False
58 |
59 | configs.setdefault("USE_PROXIES", res)
60 |
61 | print("[dark_orange] [?] Scan Bad Words (y/n) ?[/dark_orange]", end="")
62 | res = input(" ").lower()
63 |
64 | if res == "y":
65 | res = True
66 | else:
67 | res = False
68 |
69 | configs.setdefault("Scan_Bad_Words", res)
70 |
71 | print("[dark_orange] [?] Scan Top Keywords (y/n) ?[/dark_orange]", end="")
72 | res = input(" ").lower()
73 |
74 | if res == "y":
75 | res = True
76 | else:
77 | res = False
78 |
79 | configs.setdefault("Scan_Top_Keywords", res)
80 |
81 | print("[dark_orange] [?] Scan URL For Malicious Stuff (y/n) ?[/dark_orange]", end="")
82 | res = input(" ").lower()
83 |
84 | if res == "y":
85 | res = True
86 | else:
87 | res = False
88 |
89 | configs.setdefault("URL_SCAN", res)
90 |
91 | print("[dark_orange] [?] UrlScan API Key (If not scanning just enter) ?[/dark_orange]", end="")
92 | configs.setdefault("urlscan_key", input(" "))
93 |
94 | print("\n[green]Saving--------------------------------------------------------------------------[/green]\n")
95 |
96 |
97 | f = open("config.json", "w")
98 | f.write(json.dumps(configs))
99 | f.close()
100 |
--------------------------------------------------------------------------------
/connection_tree.py:
--------------------------------------------------------------------------------
1 | """
2 | Part of Open Crawler v 1.0.0
3 | """
4 |
5 |
6 | from rich import print
7 | import requests
8 | import random
9 | import sys
10 | import re
11 |
12 |
13 |
14 | # regex patterns
15 | url_extract_pattern = "https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
16 | url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
17 | url_pattern_0 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
18 | url_extract_pattern_0 = "[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
19 |
20 |
21 | # Main Variables
22 | website = sys.argv[1] # website to be scanned
23 | num = int(sys.argv[2]) # number of layers to scan
24 |
25 |
26 |
27 |
28 | def get_proxy():
29 | """
30 | Gets a free proxy from 'proxyscrape'
31 | returns : dict - > {"http": ""}
32 | """
33 |
34 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
35 | return {"http" : random.choice(res.text.split("\r\n"))}
36 |
37 |
38 |
39 |
40 |
41 | def scan(website, max_, it):
42 | """
43 | Scans for sub urls and prints them.
44 | website : Str
45 | max_ : int
46 | it : int
47 | """
48 |
49 | global TOTAL
50 |
51 | if max_ != it:
52 | print(" "*it + "[green]----" + website + ":[/green]")
53 | else:
54 | print(" "*it + "[green]----" + website + "[/green]")
55 | return None
56 |
57 | # Gets a proxy
58 | try:
59 | proxies = get_proxy()
60 | except:
61 | proxies = {}
62 |
63 | try:
64 | website_txt = requests.get(website, headers = {"user-agent":"open crawler Mapper v 0.0.1"}, proxies = proxies).text
65 | except:
66 | website_txt = ""
67 | print(f"[red] [-] '{website}' Website Couldn't Be Loaded")
68 |
69 | sub_urls = []
70 |
71 | for x in re.findall(url_extract_pattern, website_txt):
72 | if re.match(url_pattern, x):
73 | if ".onion" in x:
74 | # skips onion sites
75 | continue
76 |
77 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
78 | # tries to filter out non-crawlable urls
79 | sub_urls.append(x)
80 |
81 | # removes all duplicates
82 | sub_urls = set(sub_urls)
83 |
84 | for e in sub_urls:
85 | scan(e, max_ , it + 1)
86 |
87 |
88 | print(f"[dark_orange]Scanning :{website} | No. of Layers : {num} [/dark_orange]\n")
89 | scan(website, num, 1)
90 |
91 |
92 |
--------------------------------------------------------------------------------
/connection_tree_v2.py:
--------------------------------------------------------------------------------
1 | from rich import print
2 | import requests
3 | import random
4 | import sys
5 | import re
6 | import pickle
7 |
8 | # regex patterns
9 | url_extract_pattern = "https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
10 | url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
11 | url_pattern_0 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
12 | url_extract_pattern_0 = "[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
13 |
14 | # Main Variables
15 | website = sys.argv[1] # website to be scanned
16 | num = int(sys.argv[2]) # number of layers to scan
17 | DATA = {}
18 |
19 | def get_proxy():
20 | """
21 | Gets a free proxy from 'proxyscrape'
22 | returns : dict - > {"http": ""}
23 | """
24 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
25 | return {"http": random.choice(res.text.split("\r\n"))}
26 |
27 | def scan(website, max_, it, parent_node):
28 | """
29 | Scans for sub URLs and adds them to the DATA dictionary.
30 | website : Str
31 | max_ : int
32 | it : int
33 | parent_node : dict
34 | """
35 | if max_ != it:
36 | print(" "*it + "[green]----" + website + ":[/green]")
37 | else:
38 | print(" "*it + "[green]----" + website + "[/green]")
39 | return None
40 |
41 | # Gets a proxy
42 | try:
43 | proxies = get_proxy()
44 | except:
45 | proxies = {}
46 |
47 | try:
48 | website_txt = requests.get(website, headers={"user-agent": "open crawler Mapper v 0.0.1"}, proxies=proxies).text
49 | except:
50 | website_txt = ""
51 | print(f"[red] [-] '{website}' Website Couldn't Be Loaded")
52 |
53 | sub_urls = []
54 |
55 | for x in re.findall(url_extract_pattern, website_txt):
56 | if re.match(url_pattern, x):
57 | if ".onion" in x:
58 | # skips onion sites
59 | continue
60 |
61 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
62 | # tries to filter out non-crawlable urls
63 | sub_urls.append(x)
64 |
65 | # removes all duplicates
66 | sub_urls = set(sub_urls)
67 |
68 | if not parent_node.get("children"):
69 | parent_node["children"] = []
70 |
71 | for e in sub_urls:
72 | child_node = {"name": e}
73 | parent_node["children"].append(child_node)
74 | scan(e, max_, it + 1, child_node)
75 |
76 | print(f"[dark_orange]Scanning :{website} | No. of Layers : {num} [/dark_orange]\n")
77 | DATA[website] = {"name": website}
78 | scan(website, num, 1, DATA[website])
79 |
80 | with open(f".{website}_{num}".replace("/","o"), "wb") as f:
81 | f.write(pickle.dumps(DATA))
82 | print(DATA)
83 |
--------------------------------------------------------------------------------
/crawler.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler 0.0.1
3 |
4 | License - MIT ,
5 | An open source crawler/spider
6 |
7 | Features :
8 | - Cross Platform
9 | - Easy install
10 | - Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
11 | - Memory efficient [ig]
12 | - Pool Crawling - Use multiple crawlers at same time
13 | - Supports robots.txt
14 | - MongoDB [DB]
15 | - Language Detection
16 | - 18 + Checks / Offensive Content Check
17 | - Proxies
18 | - Multi Threading
19 | - Url Scanning
20 | - Keyword, Desc And recurring words Logging
21 |
22 |
23 | Author - Merwin M
24 | """
25 |
26 |
27 |
28 |
29 |
30 | from memory_profiler import profile
31 | from collections import Counter
32 | from functools import lru_cache
33 | from bs4 import BeautifulSoup
34 | from langdetect import detect
35 | from mongo_db import *
36 | from rich import print
37 | import urllib.robotparser
38 | import threading
39 | import requests
40 | import signal
41 | import atexit
42 | import random
43 | import json
44 | import time
45 | import sys
46 | import re
47 | import os
48 |
49 |
50 |
51 |
52 |
53 | """
54 | ######### Crawled Info are stored in Mongo DB as #####
55 | Crawled sites = [
56 |
57 | {
58 | "website" : ""
59 |
60 | "time" : "",
61 |
62 | "mal" : Val/None, # malicious or not
63 | "offn" : Val/None, # 18 +/ Offensive language
64 |
65 | "ln" : "",
66 |
67 | "keys" : [],
68 | "desc" : "",
69 |
70 | "recc" : []/None,
71 | }
72 | ]
73 | """
74 |
75 |
76 |
77 |
78 | ## Regex patterns
79 | html_pattern = re.compile(r'<[^>]+>')
80 | url_extract_pattern = "https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
81 | url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
82 | url_pattern_0 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
83 | url_extract_pattern_0 = "[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
84 |
85 |
86 |
87 | # Config File
88 | config_file = "config.json"
89 |
90 |
91 |
92 | # Load configs from config_file - > json
93 | try:
94 | f = open(config_file, "r")
95 | configs = json.loads(f.read())
96 | f.close()
97 |
98 | except:
99 | # Re-configures; os.system does not raise on failure, so check its exit code
100 | if os.system("python3 config.py") != 0:
101 | os.system("python config.py") # Re-configures
102 |
103 |
104 | f = open(config_file, "r")
105 | configs = json.loads(f.read())
106 | f.close()
107 |
108 |
109 |
110 |
111 |
112 | ## Setting Up Configs
113 | MONGODB_PWD = configs["MONGODB_PWD"]
114 | MONGODB_URI = configs["MONGODB_URI"]
115 | TIMEOUT = configs["TIMEOUT"] # Timeout for reqs
116 | MAX_THREADS = configs["MAX_THREADS"]
117 | bad_words = configs["bad_words"]
118 | USE_PROXIES = configs["USE_PROXIES"]
119 | Scan_Bad_Words = configs["Scan_Bad_Words"]
120 | Scan_Top_Keywords = configs["Scan_Top_Keywords"]
121 | URL_SCAN = configs["URL_SCAN"]
122 | urlscan_key = configs["urlscan_key"]
123 |
124 |
125 | del configs
126 |
127 |
128 |
129 | ## Main Vars
130 | EXIT_FLAG = False
131 | DB = None
132 | ROBOT_SCANS = [] # Ongoing robots.txt scans
133 | WEBSITE_SCANS = [] # Ongoing website scans
134 | PROXY_CHECK = False
135 | HTTP = []
136 | HTTPS = []
137 |
138 | # Loads bad words / flagged words
139 | file = open(bad_words, "r")
140 | bad_words = file.read()
141 | file.close()
142 | bad_words = tuple(bad_words.split("\n"))
143 |
144 |
145 |
146 |
147 |
148 | # @lru_cache(maxsize=100)
149 | # def get_robot(domain):
150 | # """
151 | # reads robots.txt
152 | # """
153 |
154 | # print(f"[green] [+] Scans - {domain} for restrictions")
155 |
156 | # rp = urllib.robotparser.RobotFileParser()
157 | # rp.set_url("http://" + domain + "/robots.txt")
158 |
159 | # return rp.can_fetch
160 |
161 |
162 |
163 |
164 | def get_top_reccuring(txt):
165 | """
166 | Gets the 8 most recurring terms from the website html
167 | txt : str
168 | returns : list
169 | """
170 |
171 | split_it = txt.split()
172 | counter = Counter(split_it)
173 | try:
174 | most_occur = counter.most_common(8)
175 | return most_occur
176 | except:
177 | return []
178 |
179 |
180 |
181 | def lang_d(txt):
182 | """
183 | Scans for bad words / flagged words.
184 | txt : str
185 | return : int - > score
186 | """
187 |
188 | if not Scan_Bad_Words:
189 | return None
190 | score = 0
191 |
192 | for e in bad_words:
193 | try:
194 | x = txt.split(e)
195 | except:
196 | continue
197 | score += len(x)-1
198 |
199 | try:
200 | score = round(score/len(txt))
201 | except:
202 | pass
203 |
204 | return score
205 |
206 |
207 | def proxy_checker(proxies, url="https://www.google.com"):
208 | working_proxies = []
209 | proxies = proxies.text.split("\r\n")
210 |
211 | if url.startswith("https"):
212 | protocol = "https"
213 | else:
214 | protocol = "http"
215 |
216 | proxies.pop()
217 |
218 | for proxy in proxies:
219 | try:
220 | response = requests.get(url, proxies={protocol:proxy}, timeout=2)
221 | if response.status_code == 200:
222 | print(f"Proxy {proxy} works! [{len(working_proxies)+1}]")
223 | working_proxies.append(proxy)
224 | else:
225 | pass
226 | except requests.RequestException as e:
227 | pass
228 |
229 | return working_proxies
230 |
231 |
232 | def proxy_checker_(proxies, url="https://www.wired.com/review/klipsch-flexus-core-200/"):
233 | working_proxies = []
234 | proxies = proxies.split("\n")
235 |
236 | if url.startswith("https"):
237 | protocol = "https"
238 | else:
239 | protocol = "http"
240 |
241 | proxies.pop()
242 |
243 | for proxy in proxies:
244 | try:
245 | response = requests.get(url, proxies={protocol:proxy}, timeout=2)
246 | if response.status_code == 200:
247 | print(f"Proxy {proxy} works! [{len(working_proxies)+1}]")
248 | working_proxies.append(proxy)
249 | else:
250 | pass
251 | except requests.RequestException as e:
252 | pass
253 |
254 | return working_proxies
255 |
256 |
257 |
258 | def get_proxy():
259 |
260 | """
261 | Gets a free proxy from 'proxyscrape'
262 | returns : dict - > {"http": ""}
263 | """
264 | global PROXY_CHECK, HTTP, HTTPS
265 |
266 | if not PROXY_CHECK:
267 | try:
268 | f = open("found_proxies_http")
269 | f2 = open("found_proxies_https")
270 | res = f.read()
271 | res2 = f2.read()
272 |
273 | HTTP = proxy_checker_(res, "http://www.wired.com/review/klipsch-flexus-core-200/")
274 | HTTPS = proxy_checker_(res2)
275 |
276 | PROXY_CHECK = True
277 |
278 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
279 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
280 |
281 | f2.close()
282 | f.close()
283 | except:
284 | pass
285 |
286 | if not PROXY_CHECK:
287 | print("We are generating a new proxy list so it would take time... \[this happens when you are using old proxylist/have none]")
288 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
289 | res2= requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
290 | HTTP = proxy_checker(res, "http://google.com")
291 | HTTPS = proxy_checker(res2)
292 |
293 | PROXY_CHECK = True
294 |
295 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
296 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
297 |
298 | f = open("found_proxies_http", "w")
299 | f2 = open("found_proxies_https", "w")
300 | f.write("\n".join(HTTP))
301 | f2.write("\n".join(HTTPS))
302 | f.close()
303 | f2.close()
304 |
305 | return {"http" : random.choice(HTTP), "https": random.choice(HTTPS)}
306 |
307 |
308 |
309 | def scan_url(url):
310 | """
311 | Scans url for malicious stuff , Uses the Urlscan API
312 | url : str
313 | return : int - > score
314 | """
315 |
316 | headers = {'API-Key':urlscan_key,'Content-Type':'application/json'}
317 | data = {"url": url, "visibility": "public"}
318 | r = requests.post('https://urlscan.io/api/v1/scan/',headers=headers, data=json.dumps(data))
319 | print(r.json())
320 | r = "https://urlscan.io/api/v1/result/" + r.json()["uuid"]
321 |
322 | for e in range(0,100):
323 | time.sleep(2)
324 | res = requests.get(r, headers)
325 | res = res.json()
326 | try:
327 | if res["status"] == 404:
328 | pass
329 | except:
330 | print(res["verdicts"])
331 | return res["verdicts"]["urlscan"]["score"]
332 |
333 | return None
334 |
335 |
336 |
337 | def remove_html(string):
338 | """
339 | removes html tags
340 | string : str
341 | return : str
342 | """
343 |
344 | return html_pattern.sub('', string)
345 |
346 |
347 |
348 | def handler(SignalNumber, Frame): # for handling SIGINT
349 | safe_exit()
350 |
351 | signal.signal(signal.SIGINT, handler) # register handler
352 |
353 |
354 |
355 | def safe_exit(): # safely exits the program
356 | global EXIT_FLAG
357 |
358 | print(f"\n\n[blue] [\] Exit Triggered At : {time.time()} [/blue]")
359 |
360 | EXIT_FLAG = True
361 |
362 | print("[red] EXITED [/red]")
363 |
364 |
365 |
366 | atexit.register(safe_exit) # registers at exit handler
367 |
368 |
369 |
370 | def forced_crawl(website):
371 | """
372 | Crawl a website forcefully - ignoring the crawl wait list
373 | website : string
374 | """
375 |
376 | # Checks if crawled already , y__ = crawled or not , True/False
377 | z__ = if_crawled(website)
378 | y__ = z__[0]
379 |
380 | mal = None
381 | lang_18 = None
382 |
383 | lang = None
384 |
385 | # Current thread no. as there is no separate threads for crawling, set as 0
386 | th = 0
387 |
388 | print(f"[green] [+] Started Crawling : {website} | Thread : {th}[/green]")
389 |
390 |
391 | proxies = {}
392 |
393 | if USE_PROXIES:
394 | proxies = get_proxy()
395 |
396 | try:
397 | website_req = requests.get(website, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
398 |
399 | # checks if content is html or skips
400 | try:
401 | if not "html" in website_req.headers["Content-Type"]:
402 | print(f"[green] [+] Skiped : {website} Because Content type not 'html' | Thread : {th}[/green]")
403 | return 0
404 | except:
405 | return 0
406 |
407 |
408 | website_txt = website_req.text
409 |
410 | if not y__:
411 | save_crawl(website, time.time(), 0,0,0,0,0,0,)
412 | else:
413 | update_crawl(website, time.time(), 0,0,0,0,0,0)
414 |
415 | except:
416 | # could be because website is down or the timeout
417 | print(f"[red] [-] Coundn't Crwal : {website} | Thread : {th}[/red]")
418 | if not y__:
419 | save_crawl(website, time.time(), "ERROR OCCURED", 0, 0, 0, 0, 0)
420 | else:
421 | update_crawl(website, time.time(), "ERROR OCCURED", 0, 0, 0, 0, 0)
422 | return 0
423 |
424 | try:
425 | lang = detect(website_txt)
426 | except:
427 | lang = "un-dic"
428 |
429 | if URL_SCAN:
430 | mal = scan_url(website)
431 |
432 | website_txt_ = remove_html(website_txt)
433 |
434 | if Scan_Bad_Words:
435 | lang_18 = lang_d(website_txt_)
436 |
437 | keywords = []
438 | desc = ""
439 |
440 | soup = BeautifulSoup(website_txt, 'html.parser')
441 |
442 | for meta in soup.findAll("meta"):
443 | try:
444 | if meta["name"] == "keywords":
445 | keywords = meta["content"]
446 | except:
447 | pass
448 |
449 | try:
450 | if meta["name"] == "description":
451 | desc = meta["content"]
452 | except:
453 | pass
454 |
455 |
456 | del soup
457 |
458 | top_r = None
459 |
460 | if Scan_Top_Keywords:
461 | top_r = get_top_reccuring(website_txt_)
462 |
463 | update_crawl(website, time.time(), mal, lang_18, lang, keywords, desc, top_r)
464 |
465 | del mal, lang_18, lang, keywords, desc, top_r
466 |
467 | sub_urls = []
468 |
469 | for x in re.findall(url_extract_pattern, website_txt):
470 | if re.match(url_pattern, x):
471 | if ".onion" in x:
472 | # skips onion sites
473 | continue
474 |
475 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
476 | # tries to filter out non-crawlable urls
477 | sub_urls.append(x)
478 |
479 | # removes all duplicates
480 | sub_urls = set(sub_urls)
481 | sub_urls = list(sub_urls)
482 |
483 |
484 | # check for restrictions in robots.txt and filter out the urls found
485 | for sub_url in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
486 |
487 | if if_waiting(sub_url):
488 | sub_urls.remove(sub_url)
489 |
490 | continue
491 |
492 | # restricted = robots_txt.disallowed(sub_url, proxies)
493 |
494 |
495 | # t = sub_url.split("://")[1].split("/")
496 | # t.remove(t[0])
497 |
498 | # t_ = ""
499 | # for u in t:
500 | # t_ += "/" + u
501 |
502 | # t = t_
503 |
504 | # restricted = tuple(restricted)
505 |
506 | # for resk in restricted:
507 | # if t.startswith(resk):
508 | # sub_urls.remove(sub_url)
509 | # break
510 |
511 |
512 | site = sub_url.replace("https://", "")
513 | site = site.replace("http://", "")
514 | domain = site.split("/")[0]
515 |
516 | # print(f"[green] [+] Scans - {domain} for restrictions")
517 |
518 | rp = urllib.robotparser.RobotFileParser()
519 | rp.set_url("http://" + domain + "/robots.txt")
520 | try:
521 | rp.read() # robots.txt must actually be fetched, else can_fetch() always returns False
522 | except:
523 | continue # robots.txt unreachable - keep the url and move on
524 | if not rp.can_fetch("*", sub_url):
525 | sub_urls.remove(sub_url)
526 |
527 |
528 | # try:
529 | # restricted = get_robots(domain)
530 | # except:
531 | # print(f"[green] [+] Scans - {domain} for restrictions")
532 | # restricted = robots_txt.disallowed(sub_url, proxies)
533 |
534 | # save_robots(domain, restricted)
535 |
536 |
537 | # restricted = tuple(restricted)
538 |
539 | # t = sub_url.split("://")[1].split("/")
540 | # t.remove(t[0])
541 |
542 | # t_ = ""
543 | # for u in t:
544 | # t_ += "/" + u
545 |
546 | # t = t_
547 |
548 | # for resk in restricted:
549 | # if t.startswith(resk):
550 | # sub_urls.remove(sub_url)
551 | # break
552 |
553 |
554 |
555 | # check if there is a need of crawling
556 | for e in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
557 |
558 | z__ = if_crawled(e)
559 |
560 | y__ = z__[0]
561 | t__ = z__[1]
562 |
563 |
564 | if y__:
565 | if int(time.time()) - int(t__) < 604800 : # Re-Crawls Only After 7 Days
566 | sub_urls.remove(e)
567 | continue
568 |
569 | try:
570 | website_req = requests.get(e, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
571 |
572 | except:
573 | sub_urls.remove(e)
574 | continue
575 |
576 | try:
577 | if not "html" in website_req.headers["Content-Type"]:
578 | print(f"[green] [+] Skiped : {e} Because Content type not 'html' | Thread : {th}[/green]")
579 | sub_urls.remove(e)
580 | continue
581 | except:
582 | sub_urls.remove(e)
583 | continue
584 |
585 |
586 | write_to_wait_list(sub_urls)
587 |
588 | del sub_urls
589 |
590 | print(f"[green] [+] Crawled : {website} | Thread : {th}[/green]")
591 |
592 |
593 |
594 | ## for checking the memory usage uncomment @profile , also uncoment for main()
595 | # @profile
596 | def crawl(th):
597 |
598 | global ROBOT_SCANS, WEBSITE_SCANS
599 |
600 | time.sleep(th)
601 |
602 | if EXIT_FLAG:
603 | return 1
604 |
605 |
606 | while not EXIT_FLAG:
607 |
608 | # gets 10 urls from waitlist
609 | sub_urls = get_wait_list(10)
610 |
611 | for website in sub_urls:
612 | if website in WEBSITE_SCANS:
613 | continue
614 | else:
615 | WEBSITE_SCANS.append(website)
616 |
617 | website_url = website
618 |
619 | website = website["website"]
620 |
621 | update = False
622 |
623 | # Checks if crawled already , y__ = crawled or not , True/False
624 |
625 | z__ = if_crawled(website)
626 |
627 | y__ = z__[0]
628 | t__ = z__[1]
629 |
630 |
631 | if y__:
632 | update = True
633 |
634 | if int(time.time()) - int(t__) < 604800: # Re-Crawls Only After 7 Days
635 | print(f"[green] [+] Already Crawled : {website} | Thread : {th}[/green]")
636 | continue
637 |
638 | print(f"[green] [+] ReCrawling : {website} | Thread : {th} [/green]")
639 |
640 | mal = None
641 | lang_18 = None
642 |
643 | lang = None
644 |
645 | print(f"[green] [+] Started Crawling : {website} | Thread : {th}[/green]")
646 |
647 |
648 | proxies = {}
649 |
650 | if USE_PROXIES:
651 | proxies = get_proxy()
652 |
653 | try:
654 | website_req = requests.get(website, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
655 |
656 | try:
657 | if not "html" in website_req.headers["Content-Type"]:
658 | # checks if the site responds with html content or skips
659 | print(f"[green] [+] Skiped : {website} Because Content type not 'html' | Thread : {th}[/green]")
660 | continue
661 | except:
662 | continue
663 |
664 | website_txt = website_req.text
665 |
666 | if not update:
667 | save_crawl(website, time.time(), 0,0,0,0,0,0,)
668 | else:
669 | update_crawl(website, time.time(), 0,0,0,0,0,0)
670 |
671 | except:
672 | # could be because website is down or the timeout
673 | print(f"[red] [-] Coundn't Crwal : {website} | Thread : {th}[/red]")
674 | save_crawl(website, time.time(), "ERROR OCCURED", 0, 0, 0, 0, 0)
675 | continue
676 |
677 | try:
678 | lang = detect(website_txt)
679 | except:
680 | lang = "un-dic"
681 |
682 | if URL_SCAN:
683 | mal = scan_url(website)
684 |
685 | website_txt_ = remove_html(website_txt)
686 |
687 | if Scan_Bad_Words:
688 | lang_18 = lang_d(website_txt_)
689 |
690 | keywords = []
691 | desc = ""
692 |
693 | soup = BeautifulSoup(website_txt, 'html.parser')
694 |
695 | for meta in soup.findAll("meta"):
696 | try:
697 | if meta["name"] == "keywords":
698 | keywords = meta["content"]
699 | except:
700 | pass
701 |
702 | try:
703 | if meta["name"] == "description":
704 | desc = meta["content"]
705 | except:
706 | pass
707 |
708 | del soup
709 |
710 | top_r = None
711 |
712 | if Scan_Top_Keywords:
713 | top_r = get_top_reccuring(website_txt_)
714 |
715 | update_crawl(website, time.time(), mal, lang_18, lang, keywords, desc, top_r)
716 |
717 | del mal, lang_18, lang, keywords, desc, top_r
718 |
719 | sub_urls = []
720 |
721 | for x in re.findall(url_extract_pattern, website_txt):
722 | if re.match(url_pattern, x):
723 | if ".onion" in x:
724 | # skips onion sites
725 | continue
726 |
727 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
728 | # tries to filter out non-crawlable urls
729 | sub_urls.append(x)
730 |
731 |
732 | # removes all duplicates
733 | sub_urls = set(sub_urls)
734 | sub_urls = list(sub_urls)
735 |
736 |
737 | # check for restrictions in robots.txt and filter out the urls found
738 | for sub_url in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
739 |
740 | if if_waiting(sub_url):
741 | sub_urls.remove(sub_url)
742 | continue
743 |
744 |
745 | site = sub_url.replace("https://", "")
746 | site = site.replace("http://", "")
747 | domain = site.split("/")[0]
748 |
749 | # print(f"[green] [+] Scans - {domain} for restrictions")
750 |
751 | rp = urllib.robotparser.RobotFileParser()
752 | rp.set_url("http://" + domain + "/robots.txt")
753 | try:
754 | rp.read() # robots.txt must actually be fetched, else can_fetch() always returns False
755 | except: continue # robots.txt unreachable - keep the url and move on
756 | if not rp.can_fetch("*", sub_url):
757 | sub_urls.remove(sub_url)
758 |
759 |
760 |
761 | # restricted = robots_txt.disallowed(sub_url, proxies)
762 |
763 |
764 | # t = sub_url.split("://")[1].split("/")
765 | # t.remove(t[0])
766 |
767 | # t_ = ""
768 | # for u in t:
769 | # t_ += "/" + u
770 |
771 | # t = t_
772 |
773 | # restricted = tuple(restricted)
774 |
775 | # for resk in restricted:
776 | # if t.startswith(resk):
777 | # sub_urls.remove(sub_url)
778 | # break
779 |
780 |
781 | # check if there is a need of crawling
782 | for e in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
783 |
784 | z__ = if_crawled(e)
785 |
786 | y__ = z__[0]
787 | t__ = z__[1]
788 |
789 |
790 | if y__:
791 | if int(time.time()) - int(t__) < 604800: # Re-Crawls Only After 7 Days
792 | sub_urls.remove(e)
793 | continue
794 |
795 | try:
796 | website_req = requests.get(e, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
797 |
798 | except:
799 | sub_urls.remove(e)
800 | continue
801 |
802 | try:
803 | if not "html" in website_req.headers["Content-Type"]:
804 | print(f"[green] [+] Skiped : {e} Because Content type not 'html' | Thread : {th}[/green]")
805 | sub_urls.remove(e)
806 | continue
807 |
808 | except:
809 | sub_urls.remove(e)
810 | continue
811 |
812 |
813 |
814 | del proxies
815 |
816 | write_to_wait_list(sub_urls)
817 |
818 | del sub_urls
819 |
820 | WEBSITE_SCANS.remove(website_url)
821 |
822 | print(f"[green] [+] Crawled : {website} | Thread : {th}[/green]")
823 |
824 |
825 |
826 |
827 |
828 | ascii_art = """
829 | [medium_spring_green]
830 | ______ ______ __
831 | / \ / \ | \
832 | | $$$$$$\ ______ ______ _______ | $$$$$$\ ______ ______ __ __ __ | $$
833 | | $$ | $$ / \ / \ | \ | $$ \$$ / \ | \ | \ | \ | \| $$
834 | | $$ | $$| $$$$$$\| $$$$$$\| $$$$$$$\ | $$ | $$$$$$\ \$$$$$$\| $$ | $$ | $$| $$
835 | | $$ | $$| $$ | $$| $$ $$| $$ | $$ | $$ __ | $$ \$$/ $$| $$ | $$ | $$| $$
836 | | $$__/ $$| $$__/ $$| $$$$$$$$| $$ | $$ | $$__/ \| $$ | $$$$$$$| $$_/ $$_/ $$| $$
837 | \$$ $$| $$ $$ \$$ \| $$ | $$ \$$ $$| $$ \$$ $$ \$$ $$ $$| $$
838 | \$$$$$$ | $$$$$$$ \$$$$$$$ \$$ \$$ \$$$$$$ \$$ \$$$$$$$ \$$$$$\$$$$ \$$
839 | | $$
840 | | $$
841 | \$$ [bold]v 0.0.1[/bold] [/medium_spring_green]
842 | """
843 |
844 |
845 | # for checking the memory usage uncomment @profile
846 | # @profile
847 | def main():
848 | global DB
849 |
850 | print(ascii_art)
851 |
852 | # Initializes MongoDB
853 | DB = connect_db(MONGODB_URI, MONGODB_PWD)
854 |
855 | try:
856 | primary_url = sys.argv[1]
857 | except:
858 | print("\n[blue] [?] Primary Url [You can skip this part but entering] :[/blue]", end="")
859 | primary_url = input(" ")
860 |
861 | print("")
862 |
863 | print("[blue] [+] Loading And Testing Proxies... .. ... .. .. .. [/blue]")
864 | get_proxy()
865 |
866 | if primary_url != "":
867 | forced_crawl(primary_url)
868 | print("")
869 |
870 |
871 | # Starts threading
872 | for th in range(0, MAX_THREADS):
873 | t_d = threading.Thread(target=crawl, args=(th+1,))
874 | t_d.daemon = True
875 | t_d.start()
876 |
877 | print(f"[spring_green1] [+] Started Thread : {th + 1}[/spring_green1]")
878 |
879 | print("\n")
880 |
881 |
882 | # while loop waiting for exit flag
883 | while not EXIT_FLAG:
884 | time.sleep(0.5)
885 |
886 |
887 | if __name__ == "__main__":
888 | main()
889 |
890 |
--------------------------------------------------------------------------------
/current.ver:
--------------------------------------------------------------------------------
1 | 1.0.0
2 |
--------------------------------------------------------------------------------
/docs.md:
--------------------------------------------------------------------------------
1 | # Open Crawler 1.0.0 - Documentation
2 |
3 |
4 | ## Table Of Contents
5 |
6 | * [Getting Started](#getting-started)
7 | * [Installation](#installation)
8 | * [Features](#features)
9 | * [Uses](#uses)
10 | * [Commands](#commands)
11 | * [Find Commands](#find-commands)
12 | * [About Commands](#about-commands)
13 | * [Config File](#config-file)
14 | * [Working](#working)
15 | * [Files](#files)
16 | * [Connection Tree](#connection-tree)
17 | * [Search](#search)
18 | * [MongoDB Collections](#mongodb-collections)
19 | * [How is data stored in mongoDB](#how-is-data-stored-in-mongodb)
20 | * [Note](#note)
21 |
22 |
23 |
24 |
25 | ## Getting Started
26 |
27 |
28 | ### Installation
29 |
30 | ##### Linux
31 |
32 | ```shell
33 | git clone https://github.com/merwin-asm/OpenCrawler.git
34 | ```
35 | ```shell
36 | cd OpenCrawler
37 | ```
38 | ```shell
39 | chmod +x install.sh && ./install.sh
40 | ```
41 |
42 | ##### Windows
43 |
44 | *You need git, python3 and pip installed*
45 |
46 | ```shell
47 | git clone https://github.com/merwin-asm/OpenCrawler.git
48 | ```
49 | ```shell
50 | cd OpenCrawler
51 | ```
52 | ```shell
53 | pip install -r requirements.txt
54 | ```
55 |
56 |
57 | ### Features
58 |
59 | - Cross Platform
60 | - Installer for linux
61 | - Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
62 | - Memory efficient [ig]
63 | - Pool Crawling - use multiple crawlers at the same time
64 | - Supports robots.txt
65 | - MongoDB [DB]
66 | - Language Detection
67 | - 18+ Checks / Offensive Content Check
68 | - Proxies
69 | - Multi Threading
70 | - URL Scanning
71 | - Keyword, Description and Recurring-Word Logging
72 |
73 |
74 | ### Uses
75 |
76 | #### Making A (Not that good) Search engine :
77 |
78 | This can be done easily, with very few modifications if required
79 |
80 | - We also provide an inbuilt search function , which may not be great but does the job ( the search topic is discussed below )
81 |
82 | #### Osint Tool :
83 |
84 | You can use the tool to crawl through sites related to someone and do OSINT by using the search utility, or write custom code for it
85 |
86 | #### Pentesting Tool :
87 |
88 | Find all websites related to one site ; this can be achieved using the connection tree command ( this topic is discussed below )
89 |
90 | #### Crawler As It says..
91 |
92 | ## Commands
93 |
94 | ### Find Commands
95 | To find the commands you can use any of these 2 methods,
96 |
97 | *warning : this only works in linux*
98 | ```sh
99 | man opencrawler
100 | ```
101 |
102 | For Linux:
103 | ```sh
104 | opencrawler help
105 | ```
106 | For Windows:
107 | ```sh
108 | python opencrawler help
109 | ```
110 |
111 | ### About Commands
112 |
113 | ##### help
114 |
115 | Shows the commands available
116 |
117 | ##### v
118 |
119 | Shows the current version of opencrawler
120 |
121 | ##### crawl
122 |
123 | This would start the normal crawler
124 |
125 | ##### forced_crawl \<website\>
126 |
127 | Forcefully crawl a site , the site crawled is \<website\>
128 |
129 | ##### crawled_status
130 |
131 | *warning : the data shown isn't exact*
132 |
133 | Gives the info on the MongoDB..
134 | This will show the number of sites crawled and the average amount of storage used.
135 |
136 | Shows the info for both collections : (more info on the collections is given in the *working* section)
137 | - crawledsites
138 | - waitlist
139 |
140 | ##### search \<query\>
141 |
142 | Uses basic filtering methods to search , this command isn't meant to be anything like a search engine
143 | (the working of search is discussed in the *working* section)
144 |
145 |
146 | ##### configure
147 |
148 | Configures the opencrawler...
149 | The same command is also used to re-configure...
150 | It will ask for all the info required to start the crawler and save it in a json file (config.json) (more info in the *config file* section)
151 |
152 | It's ok if you run the crawl command without configs, because it will ask you to configure anyway .. xd
153 |
154 | ##### connection-tree \<website\> \<depth\>
155 |
156 | A tree of websites connected to \<website\> will be shown
157 |
158 | \<depth\> is how deep you want to crawl a site.
159 | The default depth is 2
160 |
161 | ##### check_html \<website\>
162 |
163 | Checks if a website is returning html
164 |
165 | ##### crawlable \<website\>
166 |
167 | Checks if a website is allowed to be crawled
168 | It checks the robots.txt to find whether crawling is disallowed
169 |
170 | ##### dissallowed \<website\>
171 |
172 | Shows the disallowed urls of a website
173 | The results are based on robots.txt
174 |
175 | ##### fix_db
176 |
177 | Starts the fix db program
178 | This can be used to resolve bugs present in the code , which could contaminate the DB
179 |
180 | ##### re-install
181 |
182 | Re-installs the opencrawler
183 |
184 | ##### update
185 |
186 | Installs new version of the opencrawler | reinstalls
187 |
188 | ##### install-requirements
189 |
190 | Installs the requirements..
191 | These requirements are mentioned in requirements.txt
192 |
193 | ## Config File
194 |
195 | The file is generated by the configure command , which will run the "config.py" file.
196 |
197 | The file is in json , "config.json"
198 |
199 | The config file stores info regarding the crawling activity
200 | These include the following (an example config.json is shown after the list) :
201 |
202 | - **MONGODB_PWD** - pwd of mongoDB user
203 | - **MONGODB_URI** - uri for connecting to mongoDB
204 | - **TIMEOUT** - time out for get requests
205 | - **MAX_THREADS** - number of threads , set it as one if you don't wanna do multithreading
206 | - **bad_words** - the file containing list of bad words , which by default is bad_words.txt (bad_words.txt is provided)
207 | - **USE_PROXIES** - bool - if the crawler should use proxies (a proxy won't be used for robots.txt scanning even if this is set to True)
208 | - **Scan_Bad_Words** - bool - if you want to save the bad / offensive text score
209 | - **Scan_Top_Keywords** - bool - if you want to save the top keywords found in the html txt
210 | - **urlscan_key** - the url scan API key , if you are not use the feature leave it empty
211 | - **URL_SCAN** - bool - if you want to scan url using UrlScan API
212 |
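As a rough illustration, a filled-in config.json could look like the sketch below (all values are placeholders; the URI assumes a MongoDB Atlas style connection string):
```json
{
    "MONGODB_PWD": "your-db-password",
    "MONGODB_URI": "mongodb+srv://user:<password>@cluster0.example.mongodb.net/",
    "TIMEOUT": 5,
    "MAX_THREADS": 4,
    "bad_words": "bad_words.txt",
    "USE_PROXIES": false,
    "Scan_Bad_Words": true,
    "Scan_Top_Keywords": true,
    "urlscan_key": "",
    "URL_SCAN": false
}
```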
213 |
214 | ## Working
215 |
216 |
217 |
218 | ### Files :
219 |
220 |
221 | |Filename |Type |Use |
222 | |----------|-------|-------------------------------------------------------------------------|
223 | |opencrawler|python|The main file, which gets called when you use the opencrawler command|
224 | |crawler.py|python|The file which does the crawling|
225 | |requirements.txt|text|The file containing the names of the Python modules to be installed|
226 | |search.py|python|Does the search|
227 | |opencrawler.1|roff|The user manual|
228 | |mongo_db.py|python|Handles MongoDB|
229 | |installer.py|python|Installer for Linux, which is run by install.sh|
230 | |install.sh|shell|Installs basic requirements like python3, **for Linux use only**|
231 | |fix_db.py|python|Fixes the DB|
232 | |connection_tree.py|python|Makes the connection tree|
233 | |config.py|python|Configures the OpenCrawler|
234 | |bad_words.txt|text|Contains bad words, used for predicting the bad/offensive text score|
235 |
236 | ### MongoDB Collections
237 |
238 | There are two collections used :
239 |
240 | - **waitlist** - Used for storing sites which are yet to be crawled
241 | - **crawledsites** - Used to store crawled sites and collected info about them
242 |
243 | ### How data is stored in MongoDB
244 |
245 | The structure in which data is stored in the collections:
246 |
247 | ##### crawledsites :
248 |
249 | ```
250 | ######### Crawled info is stored in MongoDB as #########
251 | Crawled sites = [
252 | {
253 | "website" : ""
254 |
255 | "time" : "",
256 | "mal" : Val/None, # malicious or not
257 | "offn" : Val/None, # 18 +/ Offensive language
258 | "ln" : "",
259 |
260 | "keys" : [],
261 | "desc" : "",
262 |
263 | "recc" : []/None,
264 | }
265 | ]
266 | ```
267 |
268 |
269 | ##### waitlist :
270 |
271 | ```
272 | waitlist = [
273 | {
274 | "website" : ""
275 | }
276 | ]
277 | ```
278 |
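A minimal pymongo sketch of writing records in this shape (assuming a local MongoDB instance; the URLs and field values are placeholders, and the database/collection names follow mongo_db.py):

```python
# Minimal sketch: insert one crawled-site record and one waitlist record.
# Assumes a local MongoDB; values are placeholders, not real crawl output.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client.Crawledsites  # the database name used in mongo_db.py

db.Crawledsites.insert_one({
    "website": "https://example.com",
    "time": time.time(),
    "mal": None,                    # malicious or not (None if not scanned)
    "offn": 0.0,                    # 18+ / offensive-language score
    "ln": "en",                     # detected language
    "keys": ["example", "demo"],    # keywords
    "desc": "An example description",
    "recc": ["example", "page"],    # top recurring words
})

db.waitlist.insert_one({"website": "https://example.org"})  # still to be crawled
```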
279 |
280 | ### Connection Tree
281 |
282 | *By default, the depth is 2*
283 |
284 | The connection-tree command works by getting all URLs found in a site,
285 | then doing the same with each of the URLs found;
286 | the number of times this is repeated depends on the depth.
287 |
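A minimal sketch of that recursion (not the actual connection_tree.py, just an illustration using requests and BeautifulSoup from requirements.txt; the URL is a placeholder):

```python
# Depth-limited link gathering, as described above (default depth 2).
import requests
from bs4 import BeautifulSoup

def links_on_page(url, timeout=3):
    """Return the absolute http(s) links found in a page's HTML."""
    try:
        html = requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return []
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]

def connection_tree(url, depth=2):
    """Nested dict: each site maps to the trees of the sites it links to."""
    if depth == 0:
        return {}
    return {link: connection_tree(link, depth - 1) for link in links_on_page(url)}

# print(connection_tree("https://example.com", depth=2))
```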
288 |
289 | ### Search
290 |
291 | The search command uses the data stored in the crawledsites collection.
292 |
293 | For each word of the query, it checks for sites containing that word in:
294 | - website URL
295 | - desc
296 | - keywords
297 | - top recurring words
298 |
299 | The results are sorted so that the sites matching the most words from the query come first. The MongoDB query used for each word is:
300 |
301 | ```
302 | url = list(_DB().Crawledsites.find({"$or" : [
303 | {"recc": {"$regex": re.compile(word, re.IGNORECASE)}},
304 | {"keys": {"$regex": re.compile(word, re.IGNORECASE)}},
305 | {"desc": {"$regex": re.compile(word, re.IGNORECASE)}},
306 | {"website" : {"$regex": re.compile(word, re.IGNORECASE)}}
307 | ]}))
308 |
309 | ```
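The ranking step then just counts, per site, how many query words produced a match and sorts by that count. A rough sketch of the idea (mirroring the counting loop in search.py):

```python
# Rank sites by how many query words matched them (most matches first).
def rank(matches_per_word):
    """matches_per_word: one list of matching website URLs per query word."""
    scores = {}
    for websites in matches_per_word:
        for site in set(websites):          # count each site once per word
            scores[site] = scores.get(site, 0) + 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# rank([["a.com", "b.com"], ["b.com"]]) -> [("b.com", 2), ("a.com", 1)]
```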
310 |
311 |
312 | ## Note
313 |
314 | - Proxies don't work for robots.txt scans while you are crawling, because urllib.robotparser doesn't allow the use of a proxy
315 | - If you have issues with pymongo not working, try installing the version preferred for your specific Python version
316 | - If you get errors regarding pymongo, also make sure you give read and write permissions to the user
317 | - You can use a local MongoDB
318 | - The search function doesn't make use of all possible filters to find a site
319 | - installer.py and install.sh are not the same; install.sh also installs Python and pip and then runs installer.py
320 | - installer.py and install.sh are only for Linux use
321 | - We use the ProxyScrape API for getting free proxies
322 | - We use VirusTotal's API for scanning websites, if required
323 |
324 |
325 |
326 |
327 |
328 |
329 |
--------------------------------------------------------------------------------
/fix_db.py:
--------------------------------------------------------------------------------
1 | """
2 | Set of Tools to fix your DB | OpenCrawler v 1.0.0
3 | """
4 |
5 |
6 |
7 | from mongo_db import connect_db, _DB
8 | from rich import print
9 | import json
10 | import os
11 |
12 |
13 |
14 |
15 | def mongodb():
16 | # Config File
17 | config_file = "config.json"
18 |
19 |
20 | # Load configs from config_file - > json
21 | try:
22 | config_file = open(config_file, "r")
23 | configs = json.loads(config_file.read())
24 | config_file.close()
25 |
26 | except:
27 | try:
28 | os.system("python3 config.py") # Re-configures
29 | except:
30 | os.system("python config.py") # Re-configures
31 |
32 | config_file = open(config_file, "r")
33 | configs = json.loads(config_file.read())
34 | config_file.close()
35 |
36 |
37 | ## Setting Up Configs
38 | MONGODB_PWD = configs["MONGODB_PWD"]
39 | MONGODB_URI = configs["MONGODB_URI"]
40 |
41 | # Initializes MongoDB
42 | connect_db(MONGODB_URI, MONGODB_PWD)
43 |
44 |
45 |
46 |
47 | mongodb() # Connects to DB
48 |
49 |
50 |
51 | print("\n[blue]---------------------------------------DB-FIXER---------------------------------------[/blue]\n")
52 |
53 | print("""[dark_orange]\t[1] Remove Duplicates[/dark_orange]""")
54 |
55 |
56 | print("\n[blue]Option :[/blue]", end="")
57 | op = input(" ")
58 |
59 | if op == "1":
60 |
61 | print("[blue] Scan Crawledsites (y/enter to skip) >[/blue]", end="")
62 | if input(" ").lower() == "y":
63 | print("\n[green] [+] Scanning Duplicates In Crawledsites [/green]")
64 |
65 | e = _DB().Crawledsites.find({})
66 | for x in e:
67 |
68 | ww = list(_DB().Crawledsites.find({"website":x["website"]}))
69 | len_ = len(ww)
70 |
71 | ww = ww[0]
72 |
73 |
74 | if len_ != 1:
75 |
76 | _DB().Crawledsites.delete_many({"website":x["website"]})
77 |
78 | _DB().Crawledsites.insert_one(ww)
79 |
80 | print(f"[green] [+] Removed : {x['website']} [/green]")
81 |
82 |
83 |
84 | print("[blue] Scan waitlist (y/enter to skip) >[/blue]", end="")
85 | if input(" ").lower() == "y":
86 |
87 | print("[green] [+] Scanning Duplicates In waitlist [/green]")
88 |
89 | e = _DB().waitlist.find({})
90 | for x in e:
91 |
92 | ww = list(_DB().waitlist.find({"website":x["website"]}))
93 | len_ = len(ww)
94 |
95 | ww = ww[0]
96 |
97 | if len_ != 1:
98 |
99 | _DB().waitlist.delete_many({"website":x["website"]})
100 |
101 | _DB().waitlist.insert_one(ww)
102 |
103 | print(f"[green] [+] Removed : {x['website']} [/green]")
104 |
105 |
106 |
107 | # print("[blue] Scan Robots (y/enter to skip) >[/blue]", end="")
108 | # if input(" ").lower() == "y":
109 |
110 | # print("[green] [+] Scanning Duplicates In Robots [/green]")
111 |
112 | # e = _DB().Robots.find({})
113 | # for x in e:
114 |
115 | # ww = list(_DB().Robots.find({"website":x["website"]}))
116 | # len_ = len(ww)
117 |
118 | # ww = ww[0]
119 |
120 | # if len_ != 1:
121 |
122 | # _DB().Robots.delete_many({"website":x["website"]})
123 |
124 | # _DB().Robots.insert_one(ww)
125 |
126 | # print(f"[green] [+] Removed : {x['website']} [/green]")
127 |
128 |
129 | else:
130 | print(f"[red] [-] Option '{op}' Not Found[/red]")
131 |
132 |
133 | print("\n[blue]--------------------------------------------------------------------------------------[/blue]\n")
134 |
135 |
136 |
--------------------------------------------------------------------------------
/install.sh:
--------------------------------------------------------------------------------
1 | sudo apt update
2 | sudo apt install python3
3 | sudo apt install python3-pip
4 | python3 installer.py
5 |
--------------------------------------------------------------------------------
/installer.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler v 1.0.0 Installer
3 | - Linux Installer
4 | """
5 |
6 |
7 |
8 |
9 | import os
10 |
11 |
12 |
13 | os.system("pip3 install rich")
14 |
15 | os.system("clear")
16 |
17 |
18 | from rich import print
19 |
20 | print("[green] [+] Installing Requirements[/green]")
21 |
22 |
23 | os.system("pip3 install -r requirements.txt")
24 |
25 |
26 | print("[green] [+] Installing The Manual | You can run 'man opencrawler' now yey!! [/green]")
27 |
28 |
29 | os.system("sudo cp opencrawler.1 /usr/local/man/man1/opencrawler.1")
30 |
31 | print("[green] [+] Adding files to path[/green]")
32 |
33 | files = ["search.py", "robots_txt.py", "mongo_db.py", "crawler.py", "fix_db.py", "opencrawler", "connection_tree.py", "config.py", "bad_words.txt"] # FIles which will be added to path
34 |
35 | for file in files:
36 | os.system(f"sudo cp {file} /usr/bin/{file}")
37 |
38 | os.system("sudo chmod +x /usr/bin/opencrawler")
39 |
40 |
41 | print("[green] Exited[/green]")
42 |
43 |
44 |
--------------------------------------------------------------------------------
/mongo_db.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler v 1.0.0 | Mongo - DB
3 | """
4 |
5 |
6 | from pymongo.mongo_client import MongoClient
7 | from rich import print
8 | import atexit
9 | import time
10 | import json
11 | import os
12 |
13 |
14 | # Main Variables
15 | CLIENT = None
16 | DB = None
17 |
18 |
19 |
20 | def connect_db(uri, pwd):
21 | """
22 | Initializes Connection With MongoDB
23 | uri : str - > The URI given by MongoDB
24 | pwd : str - > The password to connect
25 | """
26 |
27 | global CLIENT, DB
28 |
29 | uri = uri.replace("<password>", pwd)  # substitute the password placeholder in the URI
30 |
31 | try:
32 | CLIENT = MongoClient(uri)
33 | print("[spring_green1] [+] Connected To MongoDB [/spring_green1]")
34 |
35 | DB = CLIENT.Crawledsites
36 |
37 |
38 | except Exception as e:
39 | print("[red] [-] Error Occured While Connecting To Mongo DB [/red]")
40 | print(f"[red] [bold] \t\t\t\t Error : {e}[/bold] [/red]")
41 | quit()
42 |
43 |
44 |
45 | def if_waiting(url):
46 | """
47 | Checks if a website is in the waiting list
48 | returns bool
49 | """
50 | try:
51 | a = DB.waitlist.find_one({"website":url})["website"]
52 | if a != None:
53 | return True
54 | else:
55 | return False
56 | except:
57 | return False
58 |
59 |
60 | def _DB():
61 | """
62 | returns the DB
63 | """
64 | return DB
65 |
66 |
67 | def get_info():
68 | """
69 | To get count of docs in main collections
70 | returns list of int
71 | """
72 |
73 | a = int(DB.Crawledsites.estimated_document_count())
74 | b = int(DB.waitlist.estimated_document_count())
75 |
76 | a = f" Len : {a} | Storage : {a*257} Bytes"
77 | b = f" Len : {b} | Storage : {b*618} Bytes"
78 |
79 | return [a, b]
80 |
81 |
82 |
83 | def get_last():
84 | """
85 | Last crawled site
86 | returns str
87 | """
88 | a = DB.Crawledsites.find().sort("_id", -1)
89 | return a[0]["website"]
90 |
91 |
92 |
93 | def get_crawl(website):
94 | """
95 | Get crawled info of a site
96 | returns dict
97 | """
98 |
99 | return dict(DB.Crawledsites.find_one({"website":website}))
100 |
101 |
102 |
103 | def if_crawled(url):
104 | """
105 | Checks if a site was crawled
106 | returns Bool , time/None (last crawled time)
107 | """
108 | try:
109 | a = DB.Crawledsites.find_one({"website":url})
110 | return True, a["time"]
111 |
112 | except:
113 | return False, None
114 |
115 |
116 |
117 | def update_crawl(website, time, mal, offn, ln, key, desc, recc):
118 | """
119 | Updates a crawl
120 | """
121 | DB.Crawledsites.delete_many({"website":website})
122 | DB.Crawledsites.insert_one({"website":website, "time":time, "mal":mal, "offn":offn, "ln":ln, "key":key, "desc":desc, "recc":recc})
123 |
124 |
125 |
126 |
127 | def save_crawl(website, time, mal, offn, ln, key, desc, recc):
128 | """
129 | Saves a crawl
130 | """
131 | DB.Crawledsites.insert_one({"website":website, "time":time, "mal":mal, "offn":offn, "ln":ln, "key":key, "desc":desc, "recc":recc})
132 |
133 |
134 |
135 | def save_robots(website, robots):
136 | """
137 | Saves dissallowed sites
138 | """
139 |
140 | DB.Robots.insert_one({"website":website, "restricted":robots})
141 |
142 |
143 |
144 |
145 | def get_robots(website):
146 | """
147 | Gets dissallowed sites from the database
148 | """
149 | return DB.Robots.find_one({"website":website})["restricted"]
150 |
151 |
152 |
153 | def get_wait_list(num):
154 | """
155 | Gets websites to crawl
156 | num : int - > number of websites to recv
157 | returns list - > list of websites
158 | """
159 |
160 | wait = list(DB.waitlist.find().limit(num))
161 |
162 | for e in wait:
163 | DB.waitlist.delete_many({"website":e["website"]})
164 |
165 | return wait
166 |
167 |
168 |
169 | def write_to_wait_list(list_):
170 | """
171 | Writes to collection of websites to get crawled
172 | list_ : list - > website urls
173 | """
174 |
175 | list_ = set(list_)
176 | list__ = []
177 |
178 |
179 |
180 | for e in list_:
181 | if not if_waiting(e):
182 | list__.append({"website": e})
183 |
184 |
185 | try:
186 | DB.waitlist.insert_many(list__)
187 | except:
188 | pass
189 |
190 |
191 |
192 | # Part of testings
193 | if __name__ == "__main__":
194 |
195 | # Config File
196 | config_file = "config.json"
197 |
198 |
199 | # Load configs from config_file - > json
200 | try:
201 | config_file = open(config_file, "r")
202 | configs = json.loads(config_file.read())
203 | config_file.close()
204 |
205 | except:
206 | try:
207 | os.system("python3 config.py") # Re-configures
208 | except:
209 | os.system("python config.py") # Re-configures
210 |
211 | config_file = open(config_file, "r")
212 | configs = json.loads(config_file.read())
213 | config_file.close()
214 |
215 |
216 | ## Setting Up Configs
217 | MONGODB_PWD = configs["MONGODB_PWD"]
218 | MONGODB_URI = configs["MONGODB_URI"]
219 |
220 | connect_db(MONGODB_URI, MONGODB_PWD)
221 |
222 | # save_crawl("w1",1,0,1,3,4,5,5)
223 | # save_crawl("w2",4,0,1,3,4,5,5)
224 | # save_crawl("w3",10,0,1,3,4,5,5)
225 | # print(if_crawled("w"))
226 | # update_crawl("w",1,1,1,3,4,5,5)
227 | # print(get_last())
228 | # print(if_crawled("https://darkmash-org.github.io/"))
229 | # print(get_robots("www.bfi.org.uk"))
230 | # print(if_waiting("https://www.w3.org/blog/2015/01/"))
231 |
--------------------------------------------------------------------------------
/opencrawler:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python3
2 |
3 |
4 | """
5 | Open Crawler v 1.0.0 | CLI
6 | """
7 |
8 |
9 |
10 | from rich.table import Table
11 | from rich import print
12 | from mongo_db import *
13 | import robots_txt
14 | import platform
15 | import requests
16 | import json
17 | import sys
18 | import os
19 |
20 |
21 |
22 |
23 | def mongodb():
24 | # Config File
25 | config_file = "config.json"
26 |
27 |
28 | # Load configs from config_file - > json
29 | try:
30 | config_file = open(config_file, "r")
31 | configs = json.loads(config_file.read())
32 | config_file.close()
33 |
34 | except:
35 | try:
36 | os.system("python3 config.py") # Re-configures
37 | except:
38 | os.system("python config.py") # Re-configures
39 |
40 |
41 | config_file = open(config_file, "r")
42 | configs = json.loads(config_file.read())
43 | config_file.close()
44 |
45 |
46 | ## Setting Up Configs
47 | MONGODB_PWD = configs["MONGODB_PWD"]
48 | MONGODB_URI = configs["MONGODB_URI"]
49 |
50 | # Initializes MongoDB
51 | connect_db(MONGODB_URI, MONGODB_PWD)
52 |
53 |
54 |
55 |
56 | # Finding the location of the file
57 | cur_path = __file__.split("/")
58 | cur_path.remove(cur_path[-1])
59 | cur_path_ = ""
60 | for dir_ in cur_path:
61 | cur_path_ += "/" + dir_
62 | cur_path = cur_path_
63 |
64 |
65 |
66 | table_commands = Table(title="Help - Open Crawler v 1.0.0")
67 |
68 | table_commands.add_column("Command", style="cyan", no_wrap=True)
69 | table_commands.add_column("Use", style="magenta")
70 | table_commands.add_column("No", justify="right", style="green")
71 |
72 | table_commands.add_row("help", "Get info about the commands", "1")
73 | table_commands.add_row("v", "Get the version of open crawler", "2")
74 | table_commands.add_row("crawl", "Starts up the normal crawler", "3")
75 | table_commands.add_row("force_crawl ", "Forcefully crawls a website", "4")
76 | table_commands.add_row("crawled_status", "Shows the amount of data in DB , etc", "5")
77 | table_commands.add_row("configure", "Write / reWrite The config file", "6")
78 | table_commands.add_row("connection-tree ", "Makes a tree of websites connected to it, layers by default is 2", "7")
79 | table_commands.add_row("check_html ", "Checks if a website respond with html content", "8")
80 | table_commands.add_row("crawlable ", "Checks if a website is allowed to be crawled", "9")
81 | table_commands.add_row("dissallowed ", "Lists the websites not allowed to be crawled", "10")
82 | table_commands.add_row("re-install", "Re installs the Open Crawler", "11")
83 | table_commands.add_row("update", "Updates the open crawler", "12")
84 | table_commands.add_row("install-requirements", "Installs requirements for open crawler", "13")
85 | table_commands.add_row("search ", "Search from the crawled data", "14")
86 | table_commands.add_row("fix_db", "Tools to fix the DB", "15")
87 |
88 |
89 |
90 | try:
91 | main_arg = sys.argv[1]
92 | except:
93 | print(table_commands)
94 | quit()
95 |
96 | try:
97 | if main_arg == "help":
98 | print(table_commands)
99 |
100 |
101 | elif main_arg == "v":
102 | print("""
103 | [medium_spring_green]
104 | ______ ______ __
105 | / \ / \ | \
106 | | $$$$$$\ ______ ______ _______ | $$$$$$\ ______ ______ __ __ __ | $$
107 | | $$ | $$ / \ / \ | \ | $$ \$$ / \ | \ | \ | \ | \| $$
108 | | $$ | $$| $$$$$$\| $$$$$$\| $$$$$$$\ | $$ | $$$$$$\ \$$$$$$\| $$ | $$ | $$| $$
109 | | $$ | $$| $$ | $$| $$ $$| $$ | $$ | $$ __ | $$ \$$/ $$| $$ | $$ | $$| $$
110 | | $$__/ $$| $$__/ $$| $$$$$$$$| $$ | $$ | $$__/ \| $$ | $$$$$$$| $$_/ $$_/ $$| $$
111 | \$$ $$| $$ $$ \$$ \| $$ | $$ \$$ $$| $$ \$$ $$ \$$ $$ $$| $$
112 | \$$$$$$ | $$$$$$$ \$$$$$$$ \$$ \$$ \$$$$$$ \$$ \$$$$$$$ \$$$$$\$$$$ \$$
113 | | $$
114 | | $$
115 | \$$ [bold]v 1.0.0[/bold] [/medium_spring_green]
116 |
117 | """)
118 |
119 |
120 | elif main_arg == "fix_db":
121 | try:
122 | os.system(f"python3 {cur_path}/fix_db.py")
123 | except:
124 | os.system(f"python {cur_path}/fix_db.py")
125 |
126 |
127 | elif main_arg == "search":
128 |
129 | pool = False
130 |
131 | try:
132 | test_arg = sys.argv[2:]
133 |
134 | except:
135 | print("[red] [-] No Search Text[/red]")
136 | quit()
137 |
138 | txt = ""
139 | for e in test_arg:
140 | txt += " " + e
141 |
142 | try:
143 | os.system(f"python3 {cur_path}/search.py {txt}")
144 | except:
145 | os.system(f"python {cur_path}/search.py {txt}")
146 |
147 |
148 | elif main_arg == "configure":
149 | try:
150 | os.system(f"python3 {cur_path}/config.py")
151 | except:
152 | os.system(f"python {cur_path}/config.py")
153 |
154 |
155 | elif main_arg == "crawl":
156 | try:
157 | os.system(f"python3 {cur_path}/crawler.py")
158 | except:
159 | os.system(f"python {cur_path}/crawler.py")
160 |
161 |
162 | elif main_arg == "forced_crawl":
163 | try:
164 | web = sys.argv[2]
165 | except:
166 | print("[red] [-] Link Not Passed In [/red]")
167 | quit()
168 |
169 | try:
170 | os.system(f"python3 {cur_path}/crawler.py {web}")
171 | except:
172 | os.system(f"python {cur_path}/crawler.py {web}")
173 |
174 |
175 | elif main_arg == "crawled_status":
176 | mongodb()
177 | try:
178 | res = get_info()
179 | except:
180 | print("[red] [-] Couldn't Get The Info[/red]")
181 | quit()
182 |
183 | print("[yellow3] [?] The Info Given Wouldn't Be Accurate[/yellow3]\n")
184 |
185 | print(f"[dark_orange] \t : Crawled Sites - > {res[0]} [/dark_orange]")
186 | print(f"[dark_orange] \t : Wait list - > {res[1]} [/dark_orange]")
187 |
188 | print("")
189 |
190 | elif main_arg == "connection-tree":
191 | try:
192 | web = sys.argv[2]
193 | except:
194 | print("[red] [-] Link Not Passed In [/red]")
195 | quit()
196 |
197 | num = 2
198 |
199 | try:
200 | num = sys.argv[3]
201 | except:
202 | pass
203 | try:
204 | os.system(f"python3 {cur_path}/connection_tree.py {web} {num}")
205 | except:
206 | os.system(f"python {cur_path}/connection_tree.py {web} {num}")
207 |
208 | elif main_arg == "dissallowed":
209 | try:
210 | web = sys.argv[2]
211 | except:
212 | print("[red] [-] Link Not Passed In [/red]")
213 | quit()
214 |
215 | try:
216 | restricted = robots_txt.disallowed(web, None)
217 | except:
218 | print("[red] [-] The site was down or some other error [/red]")
219 | quit()
220 |
221 | print("[green] [+] Dissallowed : [/green]")
222 | for e in restricted:
223 | print(f"[green] \t\t-------> {e} [/green]")
224 |
225 |
226 | elif main_arg == "crawlable":
227 | try:
228 | web = sys.argv[2]
229 | except:
230 | print("[red] [-] Link Not Passed In [/red]")
231 | quit()
232 |
233 | try:
234 | restricted = robots_txt.disallowed(web, None)
235 | except:
236 | print("[red] [-] The site was down or some other error [/red]")
237 | quit()
238 |
239 | site = web
240 | site = site.replace("https://", "")
241 | site = site.replace("http://", "")
242 | web = site.split("/")
243 | web.remove(web[0])
244 | site = ""
245 | for e in web:
246 | site += e + "/"
247 | web = site
248 |
249 | A = True
250 | for e in restricted:
251 | if web.startswith(e):
252 | A = False
253 | break
254 |
255 | if A:
256 | print(f"[green] [+] Can Be Crawled [/green]")
257 | else:
258 | print(f"[red] [-] Can't Be Crawled [/red]")
259 |
260 |
261 | elif main_arg == "check_html":
262 | try:
263 | web = sys.argv[2]
264 | except:
265 | print("[red] [-] Link Not Passed In [/red]")
266 | quit()
267 |
268 | try:
269 | is_html = "html" in requests.get(web).headers["Content-Type"]
270 | except:
271 | print("[red] [-] Can't Be Checked Because the site is down or no content type provided in headers[/red]")
272 | quit()
273 |
274 | if is_html:
275 | print(f"[green] [+] '{web}' Respond With HTML Content [/green]")
276 | else:
277 | print(f"[red] [-] '{web}' Doesn't Respond With HTML Content [/red]")
278 |
279 |
280 | elif main_arg == "update":
281 | if platform.system() != "windows":
282 |
283 | try:
284 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
285 | except:
286 | os.system("sudp rm -rf OpenCrawler")
287 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
288 |
289 | os.system("cd OpenCrawler")
290 | os.system("chmod +x install.sh")
291 | os.system("./install.sh")
292 | else:
293 | print("[yellow] This wont work on windows [/yellow]")
294 |
295 | elif main_arg == "install_requirements":
296 | try:
297 | os.system(f"pip3 install -r {cur_path}/requirements.txt")
298 | except:
299 | os.system(f"pip install -r {cur_path}/requirements.txt")
300 |
301 |
302 | elif main_arg == "re_install":
303 | if platform.system() != "windows":
304 |
305 | try:
306 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
307 | except:
308 |
309 | os.system("sudp rm -rf OpenCrawler")
310 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
311 |
312 | os.system("cd OpenCrawler")
313 | os.system("chmod +x install.sh")
314 | os.system("./install.sh")
315 |
316 | else:
317 | print("[yellow] This wont work on windows [/yellow]")
318 |
319 |
320 | else:
321 | print(f"[red] [-] Command '{main_arg}' Not Found || Use 'opencrawler help' For Commands[/red]")
322 |
323 | except:
324 | print("[red] [-] Some Error Occurred[/red]")
325 |
326 |
327 |
--------------------------------------------------------------------------------
/opencrawler.1:
--------------------------------------------------------------------------------
1 | .TH OpenCrawler 1
2 |
3 | .SH NAME OpenCrawler v 1.0.0
4 |
5 | .SH DESCRIPTION
6 |
7 | .B OpenCrawler v 1.0.0
8 | Is a program for crawling websites.
9 |
10 | .TP
11 | An open source crawler/spider. LICENSE - MIT.
12 |
13 | .SH COMMANDS
14 |
15 | .TP
16 | .BR help
17 | Get info about the commands
18 |
19 | .TP
20 | .BR v
21 | Get the version of open crawler
22 |
23 | .TP
24 | .BR crawl
25 | Starts up the normal crawler
26 |
27 | .TP
28 | .BR forced_crawl
29 | Forcefully crawl a website/Make the crawler crawl the website
30 |
31 | .TP
32 | .BR crawled_status
33 | Shows the amount of data in DB , etc
34 |
35 | .TP
36 | .BR configure
37 | Write / reWrite The config file
38 |
39 | .TP
40 | .BR connection-tree
41 | Makes a tree of websites connected to it, layers by default is 2
42 |
43 | .TP
44 | .BR check_html
45 | Checks if a website respond with html content
46 |
47 | .TP
48 | .BR crawlable
49 | Checks if a website is allowed to be crawled
50 |
51 | .TP
52 | .BR dissallowed
53 | Lists the websites not allowed to be crawled
54 |
55 | .TP
56 | .BR re-install
57 | Re installs the Open Crawler
58 |
59 | .TP
60 | .BR update
61 | Updates the open crawler
62 |
63 | .TP
64 | .BR install-requirements
65 | Installs requirements for open crawler
66 |
67 | .TP
68 | .BR search
69 | Search from the crawled data
70 |
71 |
72 | .TP
73 | .BR fix_db
74 | Tools to fix the DB
75 |
76 |
--------------------------------------------------------------------------------
/proxy_tool.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from rich import print
3 | import threading
4 | import time
5 |
6 |
7 | working_proxies = []
8 |
9 |
10 | def check(url, protocol, proxy):
11 | global working_proxies
12 |
13 | try:
14 | response = requests.get(url, proxies={protocol:proxy}, timeout=2)
15 | if response.status_code == 200:
16 | print(f"Proxy {proxy} works! [{len(working_proxies)+1}]")
17 | working_proxies.append(proxy)
18 | else:
19 | print(f"Proxy {proxy} failed")
20 | except requests.RequestException as e:
21 | pass
22 |
23 |
24 | def proxy_checker(proxies, url="https://www.wired.com/review/klipsch-flexus-core-200/"):
25 | proxies = proxies.text.split("\r\n")
26 |
27 | if url.startswith("https"):
28 | protocol = "https"
29 | else:
30 | protocol = "http"
31 |
32 | proxies.pop()
33 |
34 | i = 0
35 | for proxy in proxies:
36 | threading.Thread(target=check, args=(url, protocol, proxy)).start()
37 | i += 1
38 | if i == 20:
39 | time.sleep(1.5)
40 | i = 0
41 |
42 | return working_proxies
43 |
44 | def proxy_checker_(proxies, url="https://www.wired.com/review/klipsch-flexus-core-200/"):
45 | proxies = proxies.split("\r\n")
46 |
47 | if url.startswith("https"):
48 | protocol = "https"
49 | else:
50 | protocol = "http"
51 |
52 | proxies.pop()
53 |
54 |
55 | i = 0
56 | for proxy in proxies:
57 | threading.Thread(target=check, args=(url, protocol, proxy)).start()
58 | i += 1
59 | if i == 20:
60 | time.sleep(1.5)
61 | i = 0
62 |
63 | return working_proxies
64 |
65 | def get_proxy():
66 | global working_proxies
67 |
68 | try:
69 | f = open("found_proxies_http")
70 | f2 = open("found_proxies_https")
71 | res = f.read()
72 | res2 = f2.read()
73 |
74 | HTTP = proxy_checker_(res, "https://www.wired.com/review/klipsch-flexus-core-200/")
75 | working_proxies = []
76 | HTTPS = proxy_checker(res2)
77 |
78 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
79 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
80 |
81 | f2.close()
82 | f.close()
83 |
84 | f = open("found_proxies_http", "w")
85 | f2 = open("found_proxies_https", "w")
86 | f.write("\n".join(HTTP))
87 | f2.write("\n".join(HTTPS))
88 | f.close()
89 | f2.close()
90 |
91 |
92 | except:
93 | print("[red] [-] Failed, Maybe Because Already Didn't Have A proxyList To Refresh[/red]")
94 |
95 | def gen_new():
96 | global working_proxies
97 | print("We are generating a new proxy list so it would take time... \[this happens when you are using old proxylist/have none]")
98 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
99 | res2= requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
100 | HTTP = proxy_checker(res, "http://www.wired.com/review/klipsch-flexus-core-200/")
101 | working_proxies = []
102 | HTTPS = proxy_checker(res2)
103 |
104 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
105 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
106 |
107 | f = open("found_proxies_http", "w")
108 | f2 = open("found_proxies_https", "w")
109 | f.write("\n".join(HTTP))
110 | f2.write("\n".join(HTTPS))
111 | f.close()
112 | f2.close()
113 |
114 | print("[blue] This Tool Belonging To OpenCrawler Project Can \nHelp With Generating And Checking HTTP And HTTPS Proxy List![/blue]")
115 | print("")
116 | print("""[blue]
117 | \t [1] - Refresh The ProxyList, Remove Proxies Which Doesn't Work.
118 | \t [2] - Renew The ProxyList, Autogenerate A New ProxyList.
119 | \t [3] - Show Count of the Proxies HTTP And HTTPS.
120 | [/blue]""")
121 |
122 | t = input("[1,2,3] > ")
123 | if t == "1":
124 | get_proxy()
125 | elif t == "2":
126 | gen_new()
127 | elif t == "3":
128 | try:
129 | f = open("found_proxies_http")
130 | f2 = open("found_proxies_https")
131 | res = f.read()
132 | res2 = f2.read()
133 | a = len(res.split('\r\n'))
134 | b = len(res2.split('\r\n'))
135 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {a} [/green]")
136 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {b} [/green]")
137 |
138 | f2.close()
139 | f.close()
140 | except:
141 | print("[red] [-] Failed, Maybe Because Already Didn't Have A proxyList[/red]")
142 |
143 | else:
144 | print("[red] [-] Unknown Command[/red]")
145 |
146 |
147 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pymongo
2 | rich
3 | requests
4 | memory_profiler
5 | beautifulsoup4
6 | langdetect
7 |
--------------------------------------------------------------------------------
/robots_txt.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler - Robot.txt - Loader
3 | """
4 |
5 | import requests as r
6 |
7 |
8 |
9 |
10 | def get_robot_txt(site, proxies):
11 | site = site.replace("https://", "")
12 | site = site.replace("http://", "")
13 | robot_file_url = "https://" + site.split("/")[0] + "/robots.txt"
14 |
15 |
16 | try:
17 | res = r.get(robot_file_url, timeout = 3 , proxies = proxies)
18 | except:
19 | res = None
20 |
21 | if res == None:
22 | return ""
23 |
24 | status = int(res.status_code)
25 |
26 | if status == 200:
27 | return res.text
28 |
29 | else:
30 | return ""
31 |
32 |
33 | def get_lines(txt):
34 | txt = remove_comments(txt)
35 |
36 | while "" in txt:
37 | txt.remove("")
38 |
39 | while "\n" in txt:
40 | txt.remove("\n")
41 |
42 | while "\t" in txt:
43 | txt.remove("\t")
44 |
45 | return txt
46 |
47 |
48 | def remove_comments(txt):
49 | txt = txt.split("\n")
50 | txt_ = []
51 | for e in txt:
52 | txt_.append(e.split("#")[0])
53 |
54 | return txt_
55 |
56 |
57 | def disallowed(site, proxies):
58 |
59 | txt = get_robot_txt(site, proxies)
60 | txt = get_lines(txt)
61 |
62 | dis = []
63 |
64 | if txt == []:
65 | return txt
66 |
67 | record = False
68 | for line in txt:
69 | if line == "User-agent: *":
70 | record = True
71 | elif line.startswith("Disallow:"):
72 | if record:
73 | dis.append(line.split(" ")[-1])
74 | elif "User-agent:" in line:
75 | if record == True:
76 | break
77 |
78 | return dis
79 |
80 |
81 | if __name__ == "__main__":
82 | # print(get_robot_txt("https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt"))
83 | # print(disallowed("https://www.google.com/robots.txt"))
84 | pass
85 |
--------------------------------------------------------------------------------
/search.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler v 1.0.0 | search.py
3 |
4 | -- Note: the official search function doesn't count clicks or learn from search patterns, etc. :]
5 | """
6 |
7 |
8 |
9 | from mongo_db import connect_db, _DB
10 | from rich import print
11 | import time
12 | import json
13 | import sys
14 | import os
15 | import re
16 |
17 |
18 |
19 |
20 | def mongodb():
21 | # Config File
22 | config_file = "config.json"
23 |
24 |
25 | # Load configs from config_file - > json
26 | try:
27 | config_file = open(config_file, "r")
28 | configs = json.loads(config_file.read())
29 | config_file.close()
30 |
31 | except:
32 | try:
33 | os.system("python3 config.py") # Re-configures
34 | except:
35 | os.system("python config.py") # Re-configures
36 |
37 |
38 | config_file = open(config_file, "r")
39 | configs = json.loads(config_file.read())
40 | config_file.close()
41 |
42 |
43 | ## Setting Up Configs
44 | MONGODB_PWD = configs["MONGODB_PWD"]
45 | MONGODB_URI = configs["MONGODB_URI"]
46 |
47 | # Initializes MongoDB
48 | connect_db(MONGODB_URI, MONGODB_PWD)
49 |
50 |
51 |
52 |
53 | mongodb() # Connects to DB
54 |
55 |
56 | # Get the search
57 | search = sys.argv[1:]
58 |
59 |
60 | RESULTS = {} # Collects the results
61 |
62 |
63 | if len(search) > 1:
64 |
65 | t_1 = time.time()
66 |
67 | for e in search:
68 | url = list(_DB().Crawledsites.find({"$or" : [
69 | {"recc": {"$regex": re.compile(e, re.IGNORECASE)}},
70 | {"keys": {"$regex": re.compile(e, re.IGNORECASE)}},
71 | {"desc": {"$regex": re.compile(e, re.IGNORECASE)}},
72 | {"website" : {"$regex": re.compile(e, re.IGNORECASE)}}
73 | ]}))
74 |
75 | res = []
76 | [res.append(x["website"]) for x in url if x["website"] not in res]
77 |
78 | del url
79 |
80 | for url in res:
81 | if url in RESULTS.keys():
82 | RESULTS[url] += 1
83 | else:
84 | RESULTS.setdefault(url, 1)
85 |
86 |
87 | t_2 = time.time()
88 |
89 |
90 | RESULTS_ = RESULTS
91 |
92 | RESULTS = sorted(RESULTS.items(), key=lambda x:x[1], reverse=True)
93 |
94 | c = 0
95 | for result in RESULTS:
96 | if RESULTS_[result[0]] > 1:
97 | print(f"[green]Link: {result[0]} | Common words: {result[1]} [/green]")
98 | c += 1
99 |
100 | print(f"[dark_orange]Query : {search} | Total Results : {c} | Time Taken : {t_2 - t_1}s[/dark_orange]")
101 |
102 | else:
103 | t_1 = time.time()
104 | e = search[0]
105 | url = list(_DB().Crawledsites.find({"$or" : [
106 | {"recc": {"$regex": re.compile(e, re.IGNORECASE)}},
107 | {"keys": {"$regex": re.compile(e, re.IGNORECASE)}},
108 | {"desc": {"$regex": re.compile(e, re.IGNORECASE)}},
109 | {"website" : {"$regex": re.compile(e, re.IGNORECASE)}}
110 | ]}))
111 | t_2 = time.time()
112 |
113 | res = []
114 | [res.append(x["website"]) for x in url if x["website"] not in res]
115 |
116 | del url
117 |
118 | for result in res:
119 | print(f"[green]Link: {result}[/green]")
120 |
121 | print(f"[dark_orange]Query : {search} | Total Results : {len(res)} | Time Taken : {t_2 - t_1}s[/dark_orange]")
122 |
123 |
124 |
--------------------------------------------------------------------------------
/search_website.py:
--------------------------------------------------------------------------------
1 | from flask import Flask, request, jsonify, render_template
2 | import os
3 | import pickle
4 |
5 | app = Flask(__name__)
6 |
7 | @app.route('/')
8 | def index():
9 | return render_template('index.html')
10 |
11 | @app.route('/search', methods=['GET'])
12 | def search():
13 | query = request.args.get('query')
14 | if not query:
15 | return jsonify({"error": "No query provided"}), 400
16 |
17 | os.system(f"python3 search.py {query}")
18 |
19 | f = open(".results_"+".".join(query.split(" ")), "rb")
20 | data = pickle.loads(f.read())
21 | f.close()
22 |
23 | output = {
24 | "urls": data["res"],
25 | "summary": f"Query : {query} | Total Results : {len(data['res'])} | Time Taken : {data['time']}"
26 | }
27 |
28 | return jsonify(output)
29 |
30 | if __name__ == '__main__':
31 | app.run(port="8080")
32 |
33 |
--------------------------------------------------------------------------------
/templates/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | OpenCSearch
7 |
8 |
100 |
101 |
102 |
105 |
106 |
110 |
111 |
112 |
113 |
138 |
139 |
140 |
141 |
--------------------------------------------------------------------------------
/templates/tree.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | Fetch Tree Image
7 |
21 |
22 |
23 | Connection Tree WEB
24 |
31 |
32 |
33 |
62 |
63 |
64 |
65 |
--------------------------------------------------------------------------------
/tree_website.py:
--------------------------------------------------------------------------------
1 | from flask import Flask, send_file, request, render_template
2 | import networkx as nx
3 | import matplotlib.pyplot as plt
4 | import io
5 | from PIL import Image
6 | import json
7 | import os
8 | import pickle
9 |
10 | app = Flask(__name__)
11 |
12 |
13 | def get_data(root, layers):
14 | os.system(f"python3 connection_tree_v2.py {root} {layers}")
15 | f = open(f".{root}_{layers}".replace("/", "o"), "rb")
16 | data = pickle.loads(f.read())
17 | f.close()
18 |
19 | return data
20 |
21 |
22 | def add_nodes_edges(graph, node, parent=None):
23 | """
24 | Recursively add nodes and edges to the graph.
25 | """
26 | if parent:
27 | graph.add_edge(parent, node['name'])
28 |
29 | if 'children' in node:
30 | for child in node['children']:
31 | add_nodes_edges(graph, child, node['name'])
32 |
33 | def plot_tree(data, output_file='tree.png'):
34 | """
35 | Plot a tree graph from the data and save it as an image.
36 | """
37 | G = nx.DiGraph()
38 | add_nodes_edges(G, data)
39 |
40 | pos = nx.spring_layout(G, seed=42, k=0.5) # Position nodes using spring layout
41 | plt.figure(figsize=(14, 10))
42 | nx.draw(G, pos, with_labels=True, arrows=True, node_size=2000, node_color='lightblue', font_size=6, font_weight='bold', edge_color='gray')
43 | plt.title('Tree Graph')
44 | plt.savefig(output_file)
45 |
46 |
47 |
48 | @app.route('/')
49 | def index():
50 | return render_template('tree.html')
51 |
52 |
53 | @app.route('/generate_tree')
54 | def generate_tree():
55 | root = request.headers.get('root')
56 | layers = int(request.headers.get('layers'))
57 |
58 | data = get_data(root, layers)
59 | data = data[root]
60 | plot_tree(data)
61 |
62 | return send_file("tree.png", mimetype='image/png')
63 |
64 | if __name__ == '__main__':
65 | app.run()
66 |
--------------------------------------------------------------------------------