├── .github
│   └── FUNDING.yml
├── LICENSE
├── README.md
├── bad_words.txt
├── config.py
├── connection_tree.py
├── connection_tree_v2.py
├── crawler.py
├── current.ver
├── docs.md
├── fix_db.py
├── install.sh
├── installer.py
├── mongo_db.py
├── opencrawler
├── opencrawler.1
├── proxy_tool.py
├── requirements.txt
├── robots_txt.py
├── search.py
├── search_website.py
├── templates
│   ├── index.html
│   └── tree.html
└── tree_website.py
/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 |
3 | patreon: cactochan
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Cactochan
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Open Crawler
3 |
4 | Open Source Website Crawler
5 |
6 | Explore the docs »
7 |
8 | Report Bug . Request Feature
9 |
28 | ## Table Of Contents
29 |
30 | * [About the Project](#about-the-project)
31 | * [Features](#features)
32 | * [Getting Started](#getting-started)
33 | * [Installation](#installation)
34 | * [Usage](#usage)
35 | * [Contributing](#contributing)
36 | * [License](#license)
37 | * [Authors](#authors)
38 |
39 | ## About The Project
40 |
41 | 
42 |
43 |
44 |
45 | *An Open Source Crawler/Spider*
46 |
47 | Can be used by anyone... and can be run on any Windows / Linux computer.
48 | It isn't a crawler for industrial use, as it is written in a slow programming language and may have its own issues.
49 |
50 | The project can be easily used with MongoDB.
51 |
52 | The project can also be used for pentesting.
53 |
54 | ## Features
55 |
56 | - Cross Platform
57 | - Installer for linux
58 | - Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
59 | - Memory efficient [ig]
60 | - Pool Crawling - use multiple crawlers at the same time
61 | - Supports robots.txt
62 | - MongoDB [DB]
63 | - Language Detection
64 | - 18+ Checks / Offensive Content Check
65 | - Proxies
66 | - Multi Threading
67 | - URL Scanning
68 | - Keyword, Description and Recurring-Word Logging
69 | - Search Website - search_website.py
70 | - Connection Tree Website - tree_website.py
71 | - Tool for finding proxies - proxy_tool.py
72 |
73 | ## Getting Started
74 |
75 | The first thing is to install the project...
76 | The installer provided is only for Linux.
77 |
78 | On Windows the application won't be added to PATH and the requirements won't be installed automatically, so check out the installation procedure for Windows below.
79 |
80 | ### Installation
81 |
82 | ##### Linux
83 |
84 | ```shell
85 | git clone https://github.com/merwin-asm/OpenCrawler.git
86 | ```
87 | ```shell
88 | cd OpenCrawler
89 | ```
90 | ```shell
91 | chmod +x install.sh && ./install.sh
92 | ```
93 |
94 | ##### Windows
95 |
96 | *You need git, python3 and pip installed*
97 |
98 | ```shell
99 | git clone https://github.com/merwin-asm/OpenCrawler.git
100 | ```
101 | ```shell
102 | cd OpenCrawler
103 | ```
104 | ```shell
105 | pip install -r requirements.txt
106 | ```
107 |
108 |
109 | ## Usage
110 |
111 | The project can be used for :
112 | - Making a (not that good) search engine
113 | - OSINT
114 | - Pentesting
115 |
116 | ##### Linux
117 |
118 | To see available commands
119 |
120 | ```sh
121 | opencrawler help
122 | ```
123 |
124 | or
125 |
126 | ```sh
127 | man opencrawler
128 | ```
129 |
130 | ##### Windows
131 |
132 | To see available commands
133 |
134 | ```sh
135 | python opencrawler help
136 | ```
137 |
138 |
139 |
140 |
141 |
142 | ## Contributing
143 |
144 | Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
145 | * If you have suggestions for adding or removing projects, feel free to [open an issue](https://github.com/merwin-asm/OpenCrawler/issues/new) to discuss it, or directly create a pull request after you edit the *README.md* file with necessary changes.
146 | * Please make sure you check your spelling and grammar.
147 | * Create an individual PR for each suggestion.
148 |
149 |
150 | ## License
151 |
152 | Distributed under the MIT License. See [LICENSE](https://github.com/merwin-asm/OpenCrawler/blob/main/LICENSE) for more information.
153 |
154 | ## Authors
155 |
156 | * **Merwin A J** - *CS Student* - [Merwin A J](https://github.com/merwin-asm/) - *Built OpenCrawler*
157 |
158 | ### Uses Materials From :
159 |
160 | - https://github.com/coffee-and-fun/google-profanity-words
161 |
162 |
--------------------------------------------------------------------------------
/bad_words.txt:
--------------------------------------------------------------------------------
1 | 2 girls 1 cup
2 | 2g1c
3 | 4r5e
4 | 5h1t
5 | 5hit
6 | a55
7 | a_s_s
8 | acrotomophilia
9 | alabama hot pocket
10 | alaskan pipeline
11 | anal
12 | anilingus
13 | anus
14 | apeshit
15 | ar5e
16 | arrse
17 | arse
18 | arsehole
19 | ass
20 | ass-fucker
21 | ass-hat
22 | ass-pirate
23 | assbag
24 | assbandit
25 | assbanger
26 | assbite
27 | assclown
28 | asscock
29 | asscracker
30 | asses
31 | assface
32 | assfucker
33 | assfukka
34 | assgoblin
35 | asshat
36 | asshead
37 | asshole
38 | assholes
39 | asshopper
40 | assjacker
41 | asslick
42 | asslicker
43 | assmonkey
44 | assmunch
45 | assmuncher
46 | asspirate
47 | assshole
48 | asssucker
49 | asswad
50 | asswhole
51 | asswipe
52 | auto erotic
53 | autoerotic
54 | b!tch
55 | b00bs
56 | b17ch
57 | b1tch
58 | babeland
59 | baby batter
60 | baby juice
61 | ball gag
62 | ball gravy
63 | ball kicking
64 | ball licking
65 | ball sack
66 | ball sucking
67 | ballbag
68 | balls
69 | ballsack
70 | bampot
71 | bangbros
72 | bareback
73 | barely legal
74 | barenaked
75 | bastard
76 | bastardo
77 | bastinado
78 | bbw
79 | bdsm
80 | beaner
81 | beaners
82 | beastial
83 | beastiality
84 | beastility
85 | beaver cleaver
86 | beaver lips
87 | bellend
88 | bestial
89 | bestiality
90 | bi+ch
91 | biatch
92 | big black
93 | big breasts
94 | big knockers
95 | big tits
96 | bimbos
97 | birdlock
98 | bitch
99 | bitcher
100 | bitchers
101 | bitches
102 | bitchin
103 | bitching
104 | black cock
105 | blonde action
106 | blonde on blonde action
107 | bloody
108 | blow job
109 | blow your load
110 | blowjob
111 | blowjobs
112 | blue waffle
113 | blumpkin
114 | boiolas
115 | bollock
116 | bollocks
117 | bollok
118 | bollox
119 | bondage
120 | boner
121 | boob
122 | boobie
123 | boobs
124 | booobs
125 | boooobs
126 | booooobs
127 | booooooobs
128 | booty call
129 | breasts
130 | brown showers
131 | brunette action
132 | buceta
133 | bugger
134 | bukkake
135 | bulldyke
136 | bullet vibe
137 | bullshit
138 | bum
139 | bung hole
140 | bunghole
141 | bunny fucker
142 | busty
143 | butt
144 | butt-pirate
145 | buttcheeks
146 | butthole
147 | buttmunch
148 | buttplug
149 | c0ck
150 | c0cksucker
151 | camel toe
152 | camgirl
153 | camslut
154 | camwhore
155 | carpet muncher
156 | carpetmuncher
157 | cawk
158 | chinc
159 | chink
160 | choad
161 | chocolate rosebuds
162 | chode
163 | cipa
164 | circlejerk
165 | cl1t
166 | cleveland steamer
167 | clit
168 | clitface
169 | clitoris
170 | clits
171 | clover clamps
172 | clusterfuck
173 | cnut
174 | cock
175 | cock-sucker
176 | cockbite
177 | cockburger
178 | cockface
179 | cockhead
180 | cockjockey
181 | cockknoker
182 | cockmaster
183 | cockmongler
184 | cockmongruel
185 | cockmonkey
186 | cockmunch
187 | cockmuncher
188 | cocknose
189 | cocknugget
190 | cocks
191 | cockshit
192 | cocksmith
193 | cocksmoker
194 | cocksuck
195 | cocksuck
196 | cocksucked
197 | cocksucked
198 | cocksucker
199 | cocksucking
200 | cocksucks
201 | cocksuka
202 | cocksukka
203 | cok
204 | cokmuncher
205 | coksucka
206 | coochie
207 | coochy
208 | coon
209 | coons
210 | cooter
211 | coprolagnia
212 | coprophilia
213 | cornhole
214 | cox
215 | crap
216 | creampie
217 | cum
218 | cumbubble
219 | cumdumpster
220 | cumguzzler
221 | cumjockey
222 | cummer
223 | cumming
224 | cums
225 | cumshot
226 | cumslut
227 | cumtart
228 | cunilingus
229 | cunillingus
230 | cunnie
231 | cunnilingus
232 | cunt
233 | cuntface
234 | cunthole
235 | cuntlick
236 | cuntlick
237 | cuntlicker
238 | cuntlicker
239 | cuntlicking
240 | cuntlicking
241 | cuntrag
242 | cunts
243 | cyalis
244 | cyberfuc
245 | cyberfuck
246 | cyberfucked
247 | cyberfucker
248 | cyberfuckers
249 | cyberfucking
250 | d1ck
251 | dammit
252 | damn
253 | darkie
254 | date rape
255 | daterape
256 | deep throat
257 | deepthroat
258 | dendrophilia
259 | dick
260 | dickbag
261 | dickbeater
262 | dickface
263 | dickhead
264 | dickhole
265 | dickjuice
266 | dickmilk
267 | dickmonger
268 | dickslap
269 | dicksucker
270 | dickwad
271 | dickweasel
272 | dickweed
273 | dickwod
274 | dike
275 | dildo
276 | dildos
277 | dingleberries
278 | dingleberry
279 | dink
280 | dinks
281 | dipshit
282 | dirsa
283 | dirty pillows
284 | dirty sanchez
285 | dlck
286 | dog style
287 | dog-fucker
288 | doggie style
289 | doggiestyle
290 | doggin
291 | dogging
292 | doggy style
293 | doggystyle
294 | dolcett
295 | domination
296 | dominatrix
297 | dommes
298 | donkey punch
299 | donkeyribber
300 | doochbag
301 | dookie
302 | doosh
303 | double dong
304 | double penetration
305 | douche
306 | douchebag
307 | dp action
308 | dry hump
309 | duche
310 | dumbshit
311 | dumshit
312 | dvda
313 | dyke
314 | eat my ass
315 | ecchi
316 | ejaculate
317 | ejaculated
318 | ejaculates
319 | ejaculating
320 | ejaculatings
321 | ejaculation
322 | ejakulate
323 | erotic
324 | erotism
325 | escort
326 | eunuch
327 | f u c k
328 | f u c k e r
329 | f4nny
330 | f_u_c_k
331 | fag
332 | fagbag
333 | fagg
334 | fagging
335 | faggit
336 | faggitt
337 | faggot
338 | faggs
339 | fagot
340 | fagots
341 | fags
342 | fagtard
343 | fanny
344 | fannyflaps
345 | fannyfucker
346 | fanyy
347 | fart
348 | farted
349 | farting
350 | farty
351 | fatass
352 | fcuk
353 | fcuker
354 | fcuking
355 | fecal
356 | feck
357 | fecker
358 | felatio
359 | felch
360 | felching
361 | fellate
362 | fellatio
363 | feltch
364 | female squirting
365 | femdom
366 | figging
367 | fingerbang
368 | fingerfuck
369 | fingerfucked
370 | fingerfucker
371 | fingerfuckers
372 | fingerfucking
373 | fingerfucks
374 | fingering
375 | fistfuck
376 | fistfucked
377 | fistfucker
378 | fistfuckers
379 | fistfucking
380 | fistfuckings
381 | fistfucks
382 | fisting
383 | flamer
384 | flange
385 | fook
386 | fooker
387 | foot fetish
388 | footjob
389 | frotting
390 | fuck
391 | fuck buttons
392 | fucka
393 | fucked
394 | fucker
395 | fuckers
396 | fuckhead
397 | fuckheads
398 | fuckin
399 | fucking
400 | fuckings
401 | fuckingshitmotherfucker
402 | fuckme
403 | fucks
404 | fucktards
405 | fuckwhit
406 | fuckwit
407 | fudge packer
408 | fudgepacker
409 | fuk
410 | fuker
411 | fukker
412 | fukkin
413 | fuks
414 | fukwhit
415 | fukwit
416 | futanari
417 | fux
418 | fux0r
419 | g-spot
420 | gang bang
421 | gangbang
422 | gangbanged
423 | gangbanged
424 | gangbangs
425 | gay sex
426 | gayass
427 | gaybob
428 | gaydo
429 | gaylord
430 | gaysex
431 | gaytard
432 | gaywad
433 | genitals
434 | giant cock
435 | girl on
436 | girl on top
437 | girls gone wild
438 | goatcx
439 | goatse
440 | god damn
441 | god-dam
442 | god-damned
443 | goddamn
444 | goddamned
445 | gokkun
446 | golden shower
447 | goo girl
448 | gooch
449 | goodpoop
450 | gook
451 | goregasm
452 | gringo
453 | grope
454 | group sex
455 | guido
456 | guro
457 | hand job
458 | handjob
459 | hard core
460 | hardcore
461 | hardcoresex
462 | heeb
463 | hell
464 | hentai
465 | heshe
466 | ho
467 | hoar
468 | hoare
469 | hoe
470 | hoer
471 | homo
472 | homoerotic
473 | honkey
474 | honky
475 | hooker
476 | hore
477 | horniest
478 | horny
479 | hot carl
480 | hot chick
481 | hotsex
482 | how to kill
483 | how to murder
484 | huge fat
485 | humping
486 | incest
487 | intercourse
488 | jack off
489 | jack-off
490 | jackass
491 | jackoff
492 | jail bait
493 | jailbait
494 | jap
495 | jelly donut
496 | jerk off
497 | jerk-off
498 | jigaboo
499 | jiggaboo
500 | jiggerboo
501 | jism
502 | jiz
503 | jiz
504 | jizm
505 | jizm
506 | jizz
507 | juggs
508 | kawk
509 | kike
510 | kinbaku
511 | kinkster
512 | kinky
513 | kiunt
514 | knob
515 | knobbing
516 | knobead
517 | knobed
518 | knobend
519 | knobhead
520 | knobjocky
521 | knobjokey
522 | kock
523 | kondum
524 | kondums
525 | kooch
526 | kootch
527 | kum
528 | kumer
529 | kummer
530 | kumming
531 | kums
532 | kunilingus
533 | kunt
534 | kyke
535 | l3i+ch
536 | l3itch
537 | labia
538 | leather restraint
539 | leather straight jacket
540 | lemon party
541 | lesbo
542 | lezzie
543 | lmfao
544 | lolita
545 | lovemaking
546 | lust
547 | lusting
548 | m0f0
549 | m0fo
550 | m45terbate
551 | ma5terb8
552 | ma5terbate
553 | make me come
554 | male squirting
555 | masochist
556 | master-bate
557 | masterb8
558 | masterbat*
559 | masterbat3
560 | masterbate
561 | masterbation
562 | masterbations
563 | masturbate
564 | menage a trois
565 | milf
566 | minge
567 | missionary position
568 | mo-fo
569 | mof0
570 | mofo
571 | mothafuck
572 | mothafucka
573 | mothafuckas
574 | mothafuckaz
575 | mothafucked
576 | mothafucker
577 | mothafuckers
578 | mothafuckin
579 | mothafucking
580 | mothafuckings
581 | mothafucks
582 | mother fucker
583 | motherfuck
584 | motherfucked
585 | motherfucker
586 | motherfuckers
587 | motherfuckin
588 | motherfucking
589 | motherfuckings
590 | motherfuckka
591 | motherfucks
592 | mound of venus
593 | mr hands
594 | muff
595 | muff diver
596 | muffdiver
597 | muffdiving
598 | mutha
599 | muthafecker
600 | muthafuckker
601 | muther
602 | mutherfucker
603 | n1gga
604 | n1gger
605 | nambla
606 | nawashi
607 | nazi
608 | negro
609 | neonazi
610 | nig nog
611 | nigg3r
612 | nigg4h
613 | nigga
614 | niggah
615 | niggas
616 | niggaz
617 | nigger
618 | niggers
619 | niglet
620 | nimphomania
621 | nipple
622 | nipples
623 | nob
624 | nob jokey
625 | nobhead
626 | nobjocky
627 | nobjokey
628 | nsfw images
629 | nude
630 | nudity
631 | numbnuts
632 | nutsack
633 | nympho
634 | nymphomania
635 | octopussy
636 | omorashi
637 | one cup two girls
638 | one guy one jar
639 | orgasim
640 | orgasim
641 | orgasims
642 | orgasm
643 | orgasms
644 | orgy
645 | p0rn
646 | paedophile
647 | paki
648 | panooch
649 | panties
650 | panty
651 | pawn
652 | pecker
653 | peckerhead
654 | pedobear
655 | pedophile
656 | pegging
657 | penis
658 | penisfucker
659 | phone sex
660 | phonesex
661 | phuck
662 | phuk
663 | phuked
664 | phuking
665 | phukked
666 | phukking
667 | phuks
668 | phuq
669 | piece of shit
670 | pigfucker
671 | pimpis
672 | pis
673 | pises
674 | pisin
675 | pising
676 | pisof
677 | piss
678 | piss pig
679 | pissed
680 | pisser
681 | pissers
682 | pisses
683 | pissflap
684 | pissflaps
685 | pissin
686 | pissin
687 | pissing
688 | pissoff
689 | pissoff
690 | pisspig
691 | playboy
692 | pleasure chest
693 | pole smoker
694 | polesmoker
695 | pollock
696 | ponyplay
697 | poo
698 | poof
699 | poon
700 | poonani
701 | poonany
702 | poontang
703 | poop
704 | poop chute
705 | poopchute
706 | porn
707 | porno
708 | pornography
709 | pornos
710 | prick
711 | pricks
712 | prince albert piercing
713 | pron
714 | pthc
715 | pube
716 | pubes
717 | punanny
718 | punany
719 | punta
720 | pusies
721 | pusse
722 | pussi
723 | pussies
724 | pussy
725 | pussylicking
726 | pussys
727 | pusy
728 | puto
729 | queaf
730 | queef
731 | queerbait
732 | queerhole
733 | quim
734 | raghead
735 | raging boner
736 | rape
737 | raping
738 | rapist
739 | rectum
740 | renob
741 | retard
742 | reverse cowgirl
743 | rimjaw
744 | rimjob
745 | rimming
746 | rosy palm
747 | rosy palm and her 5 sisters
748 | ruski
749 | rusty trombone
750 | s hit
751 | s&m
752 | s.o.b.
753 | s_h_i_t
754 | sadism
755 | sadist
756 | santorum
757 | scat
758 | schlong
759 | scissoring
760 | screwing
761 | scroat
762 | scrote
763 | scrotum
764 | semen
765 | sex
766 | sexo
767 | sexy
768 | sh!+
769 | sh!t
770 | sh1t
771 | shag
772 | shagger
773 | shaggin
774 | shagging
775 | shaved beaver
776 | shaved pussy
777 | shemale
778 | shi+
779 | shibari
780 | shit
781 | shit-ass
782 | shit-bag
783 | shit-bagger
784 | shit-brain
785 | shit-breath
786 | shit-cunt
787 | shit-dick
788 | shit-eating
789 | shit-face
790 | shit-faced
791 | shit-fit
792 | shit-head
793 | shit-heel
794 | shit-hole
795 | shit-house
796 | shit-load
797 | shit-pot
798 | shit-spitter
799 | shit-stain
800 | shitass
801 | shitbag
802 | shitbagger
803 | shitblimp
804 | shitbrain
805 | shitbreath
806 | shitcunt
807 | shitdick
808 | shite
809 | shiteating
810 | shited
811 | shitey
812 | shitface
813 | shitfaced
814 | shitfit
815 | shitfuck
816 | shitfull
817 | shithead
818 | shitheel
819 | shithole
820 | shithouse
821 | shiting
822 | shitings
823 | shitload
824 | shitpot
825 | shits
826 | shitspitter
827 | shitstain
828 | shitted
829 | shitter
830 | shitters
831 | shittiest
832 | shitting
833 | shittings
834 | shitty
835 | shitty
836 | shity
837 | shiz
838 | shiznit
839 | shota
840 | shrimping
841 | skank
842 | skeet
843 | slanteye
844 | slut
845 | slutbag
846 | sluts
847 | smeg
848 | smegma
849 | smut
850 | snatch
851 | snowballing
852 | sodomize
853 | sodomy
854 | son-of-a-bitch
855 | spac
856 | spic
857 | spick
858 | splooge
859 | splooge moose
860 | spooge
861 | spread legs
862 | spunk
863 | strap on
864 | strapon
865 | strappado
866 | strip club
867 | style doggy
868 | suck
869 | sucks
870 | suicide girls
871 | sultry women
872 | swastika
873 | swinger
874 | t1tt1e5
875 | t1tties
876 | tainted love
877 | tard
878 | taste my
879 | tea bagging
880 | teets
881 | teez
882 | testical
883 | testicle
884 | threesome
885 | throating
886 | thundercunt
887 | tied up
888 | tight white
889 | tit
890 | titfuck
891 | tits
892 | titt
893 | tittie5
894 | tittiefucker
895 | titties
896 | titty
897 | tittyfuck
898 | tittywank
899 | titwank
900 | tongue in a
901 | topless
902 | tosser
903 | towelhead
904 | tranny
905 | tribadism
906 | tub girl
907 | tubgirl
908 | turd
909 | tushy
910 | tw4t
911 | twat
912 | twathead
913 | twatlips
914 | twatty
915 | twink
916 | twinkie
917 | two girls one cup
918 | twunt
919 | twunter
920 | undressing
921 | upskirt
922 | urethra play
923 | urophilia
924 | v14gra
925 | v1gra
926 | va-j-j
927 | vag
928 | vagina
929 | venus mound
930 | viagra
931 | vibrator
932 | violet wand
933 | vjayjay
934 | vorarephilia
935 | voyeur
936 | vulva
937 | w00se
938 | wang
939 | wank
940 | wanker
941 | wanky
942 | wet dream
943 | wetback
944 | white power
945 | whoar
946 | whore
947 | willies
948 | willy
949 | wrapping men
950 | wrinkled starfish
951 | xrated
952 | xx
953 | xxx
954 | yaoi
955 | yellow showers
956 | yiffy
957 | zoophilia
958 | 🖕
959 |
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | """
2 | Configures the Open Crawler v 1.0.0
3 | """
4 |
5 |
6 | from rich import print
7 | import getpass
8 | import json
9 | import os
10 |
11 |
12 | print("[blue][bold]Configuring Open Crawler v 0.0.1[/bold] - File : config.json[/blue]")
13 |
14 | if os.path.exists("config.json"):
15 | print("[yellow] config.json already found , do you want to rewrite it ? [y/n][/yellow]", end="")
16 | res = input(" ").lower()
17 |
18 | if res == "y":
19 | os.remove("config.json")
20 | else:
21 | exit()
22 |
23 |
24 | configs = {}
25 |
26 |
27 |
28 | print("\n[green]-----------------------------Writing to config.json-----------------------------[/green]\n")
29 |
30 |
31 | print("[dark_orange] [?] MongoDB's Password ?[/dark_orange]", end="")
32 | configs.setdefault("MONGODB_PWD", getpass.getpass(prompt=" "))
33 |
34 | print("[dark_orange] [?] URI Provided By MongoDB ?[/dark_orange]", end="")
35 | configs.setdefault("MONGODB_URI", input(" "))
36 |
37 | print("[dark_orange] [?] Timeout For Requests ?[/dark_orange]", end="")
38 | configs.setdefault("TIMEOUT", int(input(" ")))
39 |
40 | print("[dark_orange] [?] Maximum Threads To Be Used ?[/dark_orange]", end="")
41 | configs.setdefault("MAX_THREADS", int(input(" ")))
42 |
43 | print("[dark_orange] [?] Flaged/Bad words list (enter for default) ?[/dark_orange]", end="")
44 | res = input(" ")
45 |
46 | if res == "":
47 | res = "bad_words.txt"
48 |
49 | configs.setdefault("bad_words", res)
50 |
51 | print("[dark_orange] [?] Use Proxies (y/n) ?[/dark_orange]", end="")
52 | res = input(" ").lower()
53 |
54 | if res == "y":
55 | res = True
56 | else:
57 | res = False
58 |
59 | configs.setdefault("USE_PROXIES", res)
60 |
61 | print("[dark_orange] [?] Scan Bad Words (y/n) ?[/dark_orange]", end="")
62 | res = input(" ").lower()
63 |
64 | if res == "y":
65 | res = True
66 | else:
67 | res = False
68 |
69 | configs.setdefault("Scan_Bad_Words", res)
70 |
71 | print("[dark_orange] [?] Scan Top Keywords (y/n) ?[/dark_orange]", end="")
72 | res = input(" ").lower()
73 |
74 | if res == "y":
75 | res = True
76 | else:
77 | res = False
78 |
79 | configs.setdefault("Scan_Top_Keywords", res)
80 |
81 | print("[dark_orange] [?] Scan URL For Malicious Stuff (y/n) ?[/dark_orange]", end="")
82 | res = input(" ").lower()
83 |
84 | if res == "y":
85 | res = True
86 | else:
87 | res = False
88 |
89 | configs.setdefault("URL_SCAN", res)
90 |
91 | print("[dark_orange] [?] UrlScan API Key (If not scanning just enter) ?[/dark_orange]", end="")
92 | configs.setdefault("urlscan_key", input(" "))
93 |
94 | print("\n[green]Saving--------------------------------------------------------------------------[/green]\n")
95 |
96 |
97 | f = open("config.json", "w")
98 | f.write(json.dumps(configs))
99 | f.close()
100 |
--------------------------------------------------------------------------------
/connection_tree.py:
--------------------------------------------------------------------------------
1 | """
2 | Part of Open Crawler v 1.0.0
3 | """
4 |
5 |
6 | from rich import print
7 | import requests
8 | import random
9 | import sys
10 | import re
11 |
12 |
13 |
14 | # regex patterns
15 | url_extract_pattern = "https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
16 | url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
17 | url_pattern_0 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
18 | url_extract_pattern_0 = "[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
19 |
20 |
21 | # Main Variables
22 | website = sys.argv[1] # website to be scanned
23 | num = int(sys.argv[2]) # number of layers to scan
24 |
25 |
26 |
27 |
28 | def get_proxy():
29 | """
30 | Gets a free proxy from 'proxyscrape'
31 | returns : dict - > {"http": ""}
32 | """
33 |
34 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
35 | return {"http" : random.choice(res.text.split("\r\n"))}
36 |
37 |
38 |
39 |
40 |
41 | def scan(website, max_, it):
42 | """
43 | Scans for sub urls and prints them.
44 | website : Str
45 | max_ : int
46 | it : int
47 | """
48 |
49 | global TOTAL
50 |
51 | if max_ != it:
52 | print(" "*it + "[green]----" + website + ":[/green]")
53 | else:
54 | print(" "*it + "[green]----" + website + "[/green]")
55 | return None
56 |
57 | # Gets a proxy
58 | try:
59 | proxies = get_proxy()
60 | except:
61 | proxies = {}
62 |
63 | try:
64 | website_txt = requests.get(website, headers = {"user-agent":"open crawler Mapper v 0.0.1"}, proxies = proxies).text
65 | except:
66 | website_txt = ""
67 | print(f"[red] [-] '{website}' Website Couldn't Be Loaded")
68 |
69 | sub_urls = []
70 |
71 | for x in re.findall(url_extract_pattern, website_txt):
72 | if re.match(url_pattern, x):
73 | if ".onion" in x:
74 | # skips onion sites
75 | continue
76 |
77 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
78 | # tries to filter out non-crawlable urls
79 | sub_urls.append(x)
80 |
81 | # removes all duplicates
82 | sub_urls = set(sub_urls)
83 |
84 | for e in sub_urls:
85 | scan(e, max_ , it + 1)
86 |
87 |
88 | print(f"[dark_orange]Scanning :{website} | No. of Layers : {num} [/dark_orange]\n")
89 | scan(website, num, 1)
90 |
91 |
92 |
--------------------------------------------------------------------------------
/connection_tree_v2.py:
--------------------------------------------------------------------------------
1 | from rich import print
2 | import requests
3 | import random
4 | import sys
5 | import re
6 | import pickle
7 |
8 | # regex patterns
9 | url_extract_pattern = "https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
10 | url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
11 | url_pattern_0 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
12 | url_extract_pattern_0 = "[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
13 |
14 | # Main Variables
15 | website = sys.argv[1] # website to be scanned
16 | num = int(sys.argv[2]) # number of layers to scan
17 | DATA = {}
18 |
19 | def get_proxy():
20 | """
21 | Gets a free proxy from 'proxyscrape'
22 | returns : dict - > {"http": ""}
23 | """
24 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
25 | return {"http": random.choice(res.text.split("\r\n"))}
26 |
27 | def scan(website, max_, it, parent_node):
28 | """
29 | Scans for sub URLs and adds them to the DATA dictionary.
30 | website : Str
31 | max_ : int
32 | it : int
33 | parent_node : dict
34 | """
35 | if max_ != it:
36 | print(" "*it + "[green]----" + website + ":[/green]")
37 | else:
38 | print(" "*it + "[green]----" + website + "[/green]")
39 | return None
40 |
41 | # Gets a proxy
42 | try:
43 | proxies = get_proxy()
44 | except:
45 | proxies = {}
46 |
47 | try:
48 | website_txt = requests.get(website, headers={"user-agent": "open crawler Mapper v 0.0.1"}, proxies=proxies).text
49 | except:
50 | website_txt = ""
51 | print(f"[red] [-] '{website}' Website Couldn't Be Loaded")
52 |
53 | sub_urls = []
54 |
55 | for x in re.findall(url_extract_pattern, website_txt):
56 | if re.match(url_pattern, x):
57 | if ".onion" in x:
58 | # skips onion sites
59 | continue
60 |
61 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
62 | # tries to filter out non-crawlable urls
63 | sub_urls.append(x)
64 |
65 | # removes all duplicates
66 | sub_urls = set(sub_urls)
67 |
68 | if not parent_node.get("children"):
69 | parent_node["children"] = []
70 |
71 | for e in sub_urls:
72 | child_node = {"name": e}
73 | parent_node["children"].append(child_node)
74 | scan(e, max_, it + 1, child_node)
75 |
76 | print(f"[dark_orange]Scanning :{website} | No. of Layers : {num} [/dark_orange]\n")
77 | DATA[website] = {"name": website}
78 | scan(website, num, 1, DATA[website])
79 |
80 | with open(f".{website}_{num}".replace("/","o"), "wb") as f:
81 | f.write(pickle.dumps(DATA))
82 | print(DATA)
83 |
--------------------------------------------------------------------------------
/crawler.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler 0.0.1
3 |
4 | License - MIT ,
5 | An open source crawler/spider
6 |
7 | Features :
8 | - Cross Platform
9 | - Easy install
10 | - Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
11 | - Memory efficient [ig]
12 | - Pool Crawling - Use multiple crawlers at same time
13 | - Supports robots.txt
14 | - MongoDB [DB]
15 | - Language Detection
16 | - 18 + Checks / Offensive Content Check
17 | - Proxies
18 | - Multi Threading
19 | - Url Scanning
20 | - Keyword, Desc And recurring words Logging
21 |
22 |
23 | Author - Merwin M
24 | """
25 |
26 |
27 |
28 |
29 |
30 | from memory_profiler import profile
31 | from collections import Counter
32 | from functools import lru_cache
33 | from bs4 import BeautifulSoup
34 | from langdetect import detect
35 | from mongo_db import *
36 | from rich import print
37 | import urllib.robotparser
38 | import threading
39 | import requests
40 | import signal
41 | import atexit
42 | import random
43 | import json
44 | import time
45 | import sys
46 | import re
47 | import os
48 |
49 |
50 |
51 |
52 |
53 | """
54 | ######### Crawled Info are stored in Mongo DB as #####
55 | Crawled sites = [
56 |
57 | {
58 | "website" : ""
59 |
60 | "time" : "",
61 |
62 | "mal" : Val/None, # malicious or not
63 | "offn" : Val/None, # 18 +/ Offensive language
64 |
65 | "ln" : "",
66 |
67 | "keys" : [],
68 | "desc" : "",
69 |
70 | "recc" : []/None,
71 | }
72 | ]
73 | """
74 |
75 |
76 |
77 |
78 | ## Regex patterns
79 | html_pattern = re.compile(r'<[^>]+>')
80 | url_extract_pattern = "https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
81 | url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
82 | url_pattern_0 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
83 | url_extract_pattern_0 = "[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)"
84 |
85 |
86 |
87 | # Config File
88 | config_file = "config.json"
89 |
90 |
91 |
92 | # Load configs from config_file - > json
93 | try:
94 | f = open(config_file, "r")
95 | configs = json.loads(f.read())
96 | f.close()
97 |
98 | except:
99 | # Re-configures; os.system does not raise on failure, so check its exit code
100 | if os.system("python3 config.py") != 0:
101 | os.system("python config.py") # Re-configures
102 |
103 |
104 | f = open(config_file, "r")
105 | configs = json.loads(f.read())
106 | f.close()
107 |
108 |
109 |
110 |
111 |
112 | ## Setting Up Configs
113 | MONGODB_PWD = configs["MONGODB_PWD"]
114 | MONGODB_URI = configs["MONGODB_URI"]
115 | TIMEOUT = configs["TIMEOUT"] # Timeout for reqs
116 | MAX_THREADS = configs["MAX_THREADS"]
117 | bad_words = configs["bad_words"]
118 | USE_PROXIES = configs["USE_PROXIES"]
119 | Scan_Bad_Words = configs["Scan_Bad_Words"]
120 | Scan_Top_Keywords = configs["Scan_Top_Keywords"]
121 | URL_SCAN = configs["URL_SCAN"]
122 | urlscan_key = configs["urlscan_key"]
123 |
124 |
125 | del configs
126 |
127 |
128 |
129 | ## Main Vars
130 | EXIT_FLAG = False
131 | DB = None
132 | ROBOT_SCANS = [] # Ongoing robots.txt scans
133 | WEBSITE_SCANS = [] # Ongoing website scans
134 | PROXY_CHECK = False
135 | HTTP = []
136 | HTTPS = []
137 |
138 | # Loads bad words / flagged words
139 | file = open(bad_words, "r")
140 | bad_words = file.read()
141 | file.close()
142 | bad_words = tuple(bad_words.split("\n"))
143 |
144 |
145 |
146 |
147 |
148 | # @lru_cache(maxsize=100)
149 | # def get_robot(domain):
150 | # """
151 | # reads robots.txt
152 | # """
153 |
154 | # print(f"[green] [+] Scans - {domain} for restrictions")
155 |
156 | # rp = urllib.robotparser.RobotFileParser()
157 | # rp.set_url("http://" + domain + "/robots.txt")
158 |
159 | # return rp.can_fetch
160 |
161 |
162 |
163 |
164 | def get_top_reccuring(txt):
165 | """
166 | Gets the 8 most recurring terms from the website html
167 | txt : str
168 | returns : list
169 | """
170 |
171 | split_it = txt.split()
172 | counter = Counter(split_it)
173 | try:
174 | most_occur = counter.most_common(8)
175 | return most_occur
176 | except:
177 | return []
178 |
179 |
180 |
181 | def lang_d(txt):
182 | """
183 | Scans for bad words / flagged words.
184 | txt : str
185 | return : int - > score
186 | """
187 |
188 | if not Scan_Bad_Words:
189 | return None
190 | score = 0
191 |
192 | for e in bad_words:
193 | try:
194 | x = txt.split(e)
195 | except:
196 | continue
197 | score += len(x)-1
198 |
199 | try:
200 | score = round(score/len(txt))
201 | except:
202 | pass
203 |
204 | return score
205 |
206 |
207 | def proxy_checker(proxies, url="https://www.google.com"):
208 | working_proxies = []
209 | proxies = proxies.text.split("\r\n")
210 |
211 | if url.startswith("https"):
212 | protocol = "https"
213 | else:
214 | protocol = "http"
215 |
216 | proxies.pop()
217 |
218 | for proxy in proxies:
219 | try:
220 | response = requests.get(url, proxies={protocol:proxy}, timeout=2)
221 | if response.status_code == 200:
222 | print(f"Proxy {proxy} works! [{len(working_proxies)+1}]")
223 | working_proxies.append(proxy)
224 | else:
225 | pass
226 | except requests.RequestException as e:
227 | pass
228 |
229 | return working_proxies
230 |
231 |
232 | def proxy_checker_(proxies, url="https://www.wired.com/review/klipsch-flexus-core-200/"):
233 | working_proxies = []
234 | proxies = proxies.split("\n")
235 |
236 | if url.startswith("https"):
237 | protocol = "https"
238 | else:
239 | protocol = "http"
240 |
241 | proxies.pop()
242 |
243 | for proxy in proxies:
244 | try:
245 | response = requests.get(url, proxies={protocol:proxy}, timeout=2)
246 | if response.status_code == 200:
247 | print(f"Proxy {proxy} works! [{len(working_proxies)+1}]")
248 | working_proxies.append(proxy)
249 | else:
250 | pass
251 | except requests.RequestException as e:
252 | pass
253 |
254 | return working_proxies
255 |
256 |
257 |
258 | def get_proxy():
259 |
260 | """
261 | Gets a free proxy from 'proxyscrape'
262 | returns : dict - > {"http": ""}
263 | """
264 | global PROXY_CHECK, HTTP, HTTPS
265 |
266 | if not PROXY_CHECK:
267 | try:
268 | f = open("found_proxies_http")
269 | f2 = open("found_proxies_https")
270 | res = f.read()
271 | res2 = f2.read()
272 |
273 | HTTP = proxy_checker_(res, "http://www.wired.com/review/klipsch-flexus-core-200/")
274 | HTTPS = proxy_checker_(res2)
275 |
276 | PROXY_CHECK = True
277 |
278 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
279 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
280 |
281 | f2.close()
282 | f.close()
283 | except:
284 | pass
285 |
286 | if not PROXY_CHECK:
287 | print("We are generating a new proxy list so it would take time... \[this happens when you are using old proxylist/have none]")
288 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
289 | res2= requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
290 | HTTP = proxy_checker(res, "http://google.com")
291 | HTTPS = proxy_checker(res2)
292 |
293 | PROXY_CHECK = True
294 |
295 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
296 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
297 |
298 | f = open("found_proxies_http", "w")
299 | f2 = open("found_proxies_https", "w")
300 | f.write("\n".join(HTTP))
301 | f2.write("\n".join(HTTPS))
302 | f.close()
303 | f2.close()
304 |
305 | return {"http" : random.choice(HTTP), "https": random.choice(HTTPS)}
306 |
307 |
308 |
309 | def scan_url(url):
310 | """
311 | Scans url for malicious stuff , Uses the Urlscan API
312 | url : str
313 | return : int - > score
314 | """
315 |
316 | headers = {'API-Key':urlscan_key,'Content-Type':'application/json'}
317 | data = {"url": url, "visibility": "public"}
318 | r = requests.post('https://urlscan.io/api/v1/scan/',headers=headers, data=json.dumps(data))
319 | print(r.json())
320 | r = "https://urlscan.io/api/v1/result/" + r.json()["uuid"]
321 |
322 | for e in range(0,100):
323 | time.sleep(2)
324 | res = requests.get(r, headers)
325 | res = res.json()
326 | try:
327 | if res["status"] == 404:
328 | pass
329 | except:
330 | print(res["verdicts"])
331 | return res["verdicts"]["urlscan"]["score"]
332 |
333 | return None
334 |
335 |
336 |
337 | def remove_html(string):
338 | """
339 | removes html tags
340 | string : str
341 | return : str
342 | """
343 |
344 | return html_pattern.sub('', string)
345 |
346 |
347 |
348 | def handler(SignalNumber, Frame): # for handling SIGINT
349 | safe_exit()
350 |
351 | signal.signal(signal.SIGINT, handler) # register handler
352 |
353 |
354 |
355 | def safe_exit(): # safely exits the program
356 | global EXIT_FLAG
357 |
358 | print(f"\n\n[blue] [\] Exit Triggered At : {time.time()} [/blue]")
359 |
360 | EXIT_FLAG = True
361 |
362 | print("[red] EXITED [/red]")
363 |
364 |
365 |
366 | atexit.register(safe_exit) # registers at exit handler
367 |
368 |
369 |
370 | def forced_crawl(website):
371 | """
372 | Crawl a website forcefully - ignoring the crawl wait list
373 | website : string
374 | """
375 |
376 | # Checks if crawled already , y__ = crawled or not , True/False
377 | z__ = if_crawled(website)
378 | y__ = z__[0]
379 |
380 | mal = None
381 | lang_18 = None
382 |
383 | lang = None
384 |
385 | # Current thread no. as there is no separate threads for crawling, set as 0
386 | th = 0
387 |
388 | print(f"[green] [+] Started Crawling : {website} | Thread : {th}[/green]")
389 |
390 |
391 | proxies = {}
392 |
393 | if USE_PROXIES:
394 | proxies = get_proxy()
395 |
396 | try:
397 | website_req = requests.get(website, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
398 |
399 | # checks if content is html or skips
400 | try:
401 | if not "html" in website_req.headers["Content-Type"]:
402 | print(f"[green] [+] Skiped : {website} Because Content type not 'html' | Thread : {th}[/green]")
403 | return 0
404 | except:
405 | return 0
406 |
407 |
408 | website_txt = website_req.text
409 |
410 | if not y__:
411 | save_crawl(website, time.time(), 0,0,0,0,0,0,)
412 | else:
413 | update_crawl(website, time.time(), 0,0,0,0,0,0)
414 |
415 | except:
416 | # could be because website is down or the timeout
417 | print(f"[red] [-] Coundn't Crwal : {website} | Thread : {th}[/red]")
418 | if not y__:
419 | save_crawl(website, time.time(), "ERROR OCCURED", 0, 0, 0, 0, 0)
420 | else:
421 | update_crawl(website, time.time(), "ERROR OCCURED", 0, 0, 0, 0, 0)
422 | return 0
423 |
424 | try:
425 | lang = detect(website_txt)
426 | except:
427 | lang = "un-dic"
428 |
429 | if URL_SCAN:
430 | mal = scan_url(website)
431 |
432 | website_txt_ = remove_html(website_txt)
433 |
434 | if Scan_Bad_Words:
435 | lang_18 = lang_d(website_txt_)
436 |
437 | keywords = []
438 | desc = ""
439 |
440 | soup = BeautifulSoup(website_txt, 'html.parser')
441 |
442 | for meta in soup.findAll("meta"):
443 | try:
444 | if meta["name"] == "keywords":
445 | keywords = meta["content"]
446 | except:
447 | pass
448 |
449 | try:
450 | if meta["name"] == "description":
451 | desc = meta["content"]
452 | except:
453 | pass
454 |
455 |
456 | del soup
457 |
458 | top_r = None
459 |
460 | if Scan_Top_Keywords:
461 | top_r = get_top_reccuring(website_txt_)
462 |
463 | update_crawl(website, time.time(), mal, lang_18, lang, keywords, desc, top_r)
464 |
465 | del mal, lang_18, lang, keywords, desc, top_r
466 |
467 | sub_urls = []
468 |
469 | for x in re.findall(url_extract_pattern, website_txt):
470 | if re.match(url_pattern, x):
471 | if ".onion" in x:
472 | # skips onion sites
473 | continue
474 |
475 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
476 | # tries to filter out non-crawlable urls
477 | sub_urls.append(x)
478 |
479 | # removes all duplicates
480 | sub_urls = set(sub_urls)
481 | sub_urls = list(sub_urls)
482 |
483 |
484 | # check for restrictions in robots.txt and filter out the urls found
485 | for sub_url in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
486 |
487 | if if_waiting(sub_url):
488 | sub_urls.remove(sub_url)
489 |
490 | continue
491 |
492 | # restricted = robots_txt.disallowed(sub_url, proxies)
493 |
494 |
495 | # t = sub_url.split("://")[1].split("/")
496 | # t.remove(t[0])
497 |
498 | # t_ = ""
499 | # for u in t:
500 | # t_ += "/" + u
501 |
502 | # t = t_
503 |
504 | # restricted = tuple(restricted)
505 |
506 | # for resk in restricted:
507 | # if t.startswith(resk):
508 | # sub_urls.remove(sub_url)
509 | # break
510 |
511 |
512 | site = sub_url.replace("https://", "")
513 | site = site.replace("http://", "")
514 | domain = site.split("/")[0]
515 |
516 | # print(f"[green] [+] Scans - {domain} for restrictions")
517 |
518 | rp = urllib.robotparser.RobotFileParser()
519 | rp.set_url("http://" + domain + "/robots.txt")
520 | try:
521 | rp.read() # robots.txt must actually be fetched, else can_fetch() always returns False
522 | except:
523 | continue # robots.txt unreachable - keep the url and move on
524 | if not rp.can_fetch("*", sub_url):
525 | sub_urls.remove(sub_url)
526 |
527 |
528 | # try:
529 | # restricted = get_robots(domain)
530 | # except:
531 | # print(f"[green] [+] Scans - {domain} for restrictions")
532 | # restricted = robots_txt.disallowed(sub_url, proxies)
533 |
534 | # save_robots(domain, restricted)
535 |
536 |
537 | # restricted = tuple(restricted)
538 |
539 | # t = sub_url.split("://")[1].split("/")
540 | # t.remove(t[0])
541 |
542 | # t_ = ""
543 | # for u in t:
544 | # t_ += "/" + u
545 |
546 | # t = t_
547 |
548 | # for resk in restricted:
549 | # if t.startswith(resk):
550 | # sub_urls.remove(sub_url)
551 | # break
552 |
553 |
554 |
555 | # check if there is a need of crawling
556 | for e in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
557 |
558 | z__ = if_crawled(e)
559 |
560 | y__ = z__[0]
561 | t__ = z__[1]
562 |
563 |
564 | if y__:
565 | if int(time.time()) - int(t__) < 604800 : # Re-Crawls Only After 7 Days
566 | sub_urls.remove(e)
567 | continue
568 |
569 | try:
570 | website_req = requests.get(e, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
571 |
572 | except:
573 | sub_urls.remove(e)
574 | continue
575 |
576 | try:
577 | if not "html" in website_req.headers["Content-Type"]:
578 | print(f"[green] [+] Skiped : {e} Because Content type not 'html' | Thread : {th}[/green]")
579 | sub_urls.remove(e)
580 | continue
581 | except:
582 | sub_urls.remove(e)
583 | continue
584 |
585 |
586 | write_to_wait_list(sub_urls)
587 |
588 | del sub_urls
589 |
590 | print(f"[green] [+] Crawled : {website} | Thread : {th}[/green]")
591 |
592 |
593 |
594 | ## for checking the memory usage uncomment @profile , also uncoment for main()
595 | # @profile
596 | def crawl(th):
597 |
598 | global ROBOT_SCANS, WEBSITE_SCANS
599 |
600 | time.sleep(th)
601 |
602 | if EXIT_FLAG:
603 | return 1
604 |
605 |
606 | while not EXIT_FLAG:
607 |
608 | # gets 10 urls from waitlist
609 | sub_urls = get_wait_list(10)
610 |
611 | for website in sub_urls:
612 | if website in WEBSITE_SCANS:
613 | continue
614 | else:
615 | WEBSITE_SCANS.append(website)
616 |
617 | website_url = website
618 |
619 | website = website["website"]
620 |
621 | update = False
622 |
623 | # Checks if crawled already , y__ = crawled or not , True/False
624 |
625 | z__ = if_crawled(website)
626 |
627 | y__ = z__[0]
628 | t__ = z__[1]
629 |
630 |
631 | if y__:
632 | update = True
633 |
634 | if int(time.time()) - int(t__) < 604800: # Re-Crawls Only After 7 Days
635 | print(f"[green] [+] Already Crawled : {website} | Thread : {th}[/green]")
636 | continue
637 |
638 | print(f"[green] [+] ReCrawling : {website} | Thread : {th} [/green]")
639 |
640 | mal = None
641 | lang_18 = None
642 |
643 | lang = None
644 |
645 | print(f"[green] [+] Started Crawling : {website} | Thread : {th}[/green]")
646 |
647 |
648 | proxies = {}
649 |
650 | if USE_PROXIES:
651 | proxies = get_proxy()
652 |
653 | try:
654 | website_req = requests.get(website, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
655 |
656 | try:
657 | if not "html" in website_req.headers["Content-Type"]:
658 | # checks if the site responds with html content or skips
659 | print(f"[green] [+] Skiped : {website} Because Content type not 'html' | Thread : {th}[/green]")
660 | continue
661 | except:
662 | continue
663 |
664 | website_txt = website_req.text
665 |
666 | if not update:
667 | save_crawl(website, time.time(), 0,0,0,0,0,0,)
668 | else:
669 | update_crawl(website, time.time(), 0,0,0,0,0,0)
670 |
671 | except:
672 | # could be because website is down or the timeout
673 | print(f"[red] [-] Coundn't Crwal : {website} | Thread : {th}[/red]")
674 | save_crawl(website, time.time(), "ERROR OCCURED", 0, 0, 0, 0, 0)
675 | continue
676 |
677 | try:
678 | lang = detect(website_txt)
679 | except:
680 | lang = "un-dic"
681 |
682 | if URL_SCAN:
683 | mal = scan_url(website)
684 |
685 | website_txt_ = remove_html(website_txt)
686 |
687 | if Scan_Bad_Words:
688 | lang_18 = lang_d(website_txt_)
689 |
690 | keywords = []
691 | desc = ""
692 |
693 | soup = BeautifulSoup(website_txt, 'html.parser')
694 |
695 | for meta in soup.findAll("meta"):
696 | try:
697 | if meta["name"] == "keywords":
698 | keywords = meta["content"]
699 | except:
700 | pass
701 |
702 | try:
703 | if meta["name"] == "description":
704 | desc = meta["content"]
705 | except:
706 | pass
707 |
708 | del soup
709 |
710 | top_r = None
711 |
712 | if Scan_Top_Keywords:
713 | top_r = get_top_reccuring(website_txt_)
714 |
715 | update_crawl(website, time.time(), mal, lang_18, lang, keywords, desc, top_r)
716 |
717 | del mal, lang_18, lang, keywords, desc, top_r
718 |
719 | sub_urls = []
720 |
721 | for x in re.findall(url_extract_pattern, website_txt):
722 | if re.match(url_pattern, x):
723 | if ".onion" in x:
724 | # skips onion sites
725 | continue
726 |
727 | if x[-1] == "/" or x.endswith(".html") or x.split("/")[-1].isalnum():
728 | # tries to filter out non-crawlable urls
729 | sub_urls.append(x)
730 |
731 |
732 | # removes all duplicates
733 | sub_urls = set(sub_urls)
734 | sub_urls = list(sub_urls)
735 |
736 |
737 | # check for restrictions in robots.txt and filter out the urls found
738 | for sub_url in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
739 |
740 | if if_waiting(sub_url):
741 | sub_urls.remove(sub_url)
742 | continue
743 |
744 |
745 | site = sub_url.replace("https://", "")
746 | site = site.replace("http://", "")
747 | domain = site.split("/")[0]
748 |
749 | # print(f"[green] [+] Scans - {domain} for restrictions")
750 |
751 | rp = urllib.robotparser.RobotFileParser()
752 | rp.set_url("http://" + domain + "/robots.txt")
753 | try:
754 | rp.read() # robots.txt must actually be fetched, else can_fetch() always returns False
755 | except: continue # robots.txt unreachable - keep the url and move on
756 | if not rp.can_fetch("*", sub_url):
757 | sub_urls.remove(sub_url)
758 |
759 |
760 |
761 | # restricted = robots_txt.disallowed(sub_url, proxies)
762 |
763 |
764 | # t = sub_url.split("://")[1].split("/")
765 | # t.remove(t[0])
766 |
767 | # t_ = ""
768 | # for u in t:
769 | # t_ += "/" + u
770 |
771 | # t = t_
772 |
773 | # restricted = tuple(restricted)
774 |
775 | # for resk in restricted:
776 | # if t.startswith(resk):
777 | # sub_urls.remove(sub_url)
778 | # break
779 |
780 |
781 | # check if there is a need of crawling
782 | for e in list(sub_urls): # iterate over a copy, since urls may be removed from sub_urls
783 |
784 | z__ = if_crawled(e)
785 |
786 | y__ = z__[0]
787 | t__ = z__[1]
788 |
789 |
790 | if y__:
791 | if int(time.time()) - int(t__) < 604800: # Re-Crawls Only After 7 Days
792 | sub_urls.remove(e)
793 | continue
794 |
795 | try:
796 | website_req = requests.get(e, headers = {"user-agent":"open crawler v 0.0.1"}, proxies = proxies, timeout = TIMEOUT)
797 |
798 | except:
799 | sub_urls.remove(e)
800 | continue
801 |
802 | try:
803 | if not "html" in website_req.headers["Content-Type"]:
804 | print(f"[green] [+] Skiped : {e} Because Content type not 'html' | Thread : {th}[/green]")
805 | sub_urls.remove(e)
806 | continue
807 |
808 | except:
809 | sub_urls.remove(e)
810 | continue
811 |
812 |
813 |
814 | del proxies
815 |
816 | write_to_wait_list(sub_urls)
817 |
818 | del sub_urls
819 |
820 | WEBSITE_SCANS.remove(website_url)
821 |
822 | print(f"[green] [+] Crawled : {website} | Thread : {th}[/green]")
823 |
824 |
825 |
826 |
827 |
828 | ascii_art = """
829 | [medium_spring_green]
830 | ______ ______ __
831 | / \ / \ | \
832 | | $$$$$$\ ______ ______ _______ | $$$$$$\ ______ ______ __ __ __ | $$
833 | | $$ | $$ / \ / \ | \ | $$ \$$ / \ | \ | \ | \ | \| $$
834 | | $$ | $$| $$$$$$\| $$$$$$\| $$$$$$$\ | $$ | $$$$$$\ \$$$$$$\| $$ | $$ | $$| $$
835 | | $$ | $$| $$ | $$| $$ $$| $$ | $$ | $$ __ | $$ \$$/ $$| $$ | $$ | $$| $$
836 | | $$__/ $$| $$__/ $$| $$$$$$$$| $$ | $$ | $$__/ \| $$ | $$$$$$$| $$_/ $$_/ $$| $$
837 | \$$ $$| $$ $$ \$$ \| $$ | $$ \$$ $$| $$ \$$ $$ \$$ $$ $$| $$
838 | \$$$$$$ | $$$$$$$ \$$$$$$$ \$$ \$$ \$$$$$$ \$$ \$$$$$$$ \$$$$$\$$$$ \$$
839 | | $$
840 | | $$
841 | \$$ [bold]v 0.0.1[/bold] [/medium_spring_green]
842 | """
843 |
844 |
845 | # for checking the memory usage uncomment @profile
846 | # @profile
847 | def main():
848 | global DB
849 |
850 | print(ascii_art)
851 |
852 | # Initializes MongoDB
853 | DB = connect_db(MONGODB_URI, MONGODB_PWD)
854 |
855 | try:
856 | primary_url = sys.argv[1]
857 | except:
858 | print("\n[blue] [?] Primary Url [You can skip this part but entering] :[/blue]", end="")
859 | primary_url = input(" ")
860 |
861 | print("")
862 |
863 | print("[blue] [+] Loading And Testing Proxies... .. ... .. .. .. [/blue]")
864 | get_proxy()
865 |
866 | if primary_url != "":
867 | forced_crawl(primary_url)
868 | print("")
869 |
870 |
871 | # Starts threading
872 | for th in range(0, MAX_THREADS):
873 | t_d = threading.Thread(target=crawl, args=(th+1,))
874 | t_d.daemon = True
875 | t_d.start()
876 |
877 | print(f"[spring_green1] [+] Started Thread : {th + 1}[/spring_green1]")
878 |
879 | print("\n")
880 |
881 |
882 | # while loop waiting for exit flag
883 | while not EXIT_FLAG:
884 | time.sleep(0.5)
885 |
886 |
887 | if __name__ == "__main__":
888 | main()
889 |
890 |
--------------------------------------------------------------------------------
/current.ver:
--------------------------------------------------------------------------------
1 | 1.0.0
2 |
--------------------------------------------------------------------------------
/docs.md:
--------------------------------------------------------------------------------
1 | # Open Crawler 1.0.0 - Documentation
2 |
3 |
4 | ## Table Of Contents
5 |
6 | * [Getting Started](#getting-started)
7 | * [Installation](#installation)
8 | * [Features](#features)
9 | * [Uses](#uses)
10 | * [Commands](#commands)
11 | * [Find Commands](#find-commands)
12 | * [About Commands](#about-commands)
13 | * [Config File](#config-file)
14 | * [Working](#working)
15 | * [Files](#files)
16 | * [Connection Tree](#connection-tree)
17 | * [Search](#search)
18 | * [MongoDB Collections](#mongodb-collections)
19 | * [How is data stored in mongoDB](#how-is-data-stored-in-mongodb)
20 | * [Note](#note)
21 |
22 |
23 |
24 |
25 | ## Getting Started
26 |
27 |
28 | ### Installation
29 |
30 | ##### Linux
31 |
32 | ```shell
33 | git clone https://github.com/merwin-asm/OpenCrawler.git
34 | ```
35 | ```shell
36 | cd OpenCrawler
37 | ```
38 | ```shell
39 | chmod +x install.sh && ./install.sh
40 | ```
41 |
42 | ##### Windows
43 |
44 | *You need git, python3 and pip installed*
45 |
46 | ```shell
47 | git clone https://github.com/merwin-asm/OpenCrawler.git
48 | ```
49 | ```shell
50 | cd OpenCrawler
51 | ```
52 | ```shell
53 | pip install -r requirements.txt
54 | ```
55 |
56 |
57 | ### Features
58 |
59 | - Cross Platform
60 | - Installer for linux
61 | - Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
62 | - Memory efficient [ig]
63 | - Pool Crawling - use multiple crawlers at the same time
64 | - Supports robots.txt
65 | - MongoDB [DB]
66 | - Language Detection
67 | - 18+ Checks / Offensive Content Check
68 | - Proxies
69 | - Multi Threading
70 | - URL Scanning
71 | - Keyword, Description and Recurring-Word Logging
72 |
73 |
74 | ### Uses
75 |
76 | #### Making A (Not that good) Search engine :
77 |
78 | This can be done easily, with very few modifications if required
79 |
80 | - We also provide an inbuilt search function , which may not be great but does the job ( the search topic is discussed below )
81 |
82 | #### Osint Tool :
83 |
84 | You can use the tool to crawl through sites related to someone and do OSINT by using the search utility, or write custom code for it
85 |
86 | #### Pentesting Tool :
87 |
88 | Find all websites related to one site ; this can be achieved using the connection tree command ( this topic is discussed below )
89 |
90 | #### Crawler As It says..
91 |
92 | ## Commands
93 |
94 | ### Find Commands
95 | To find the commands you can use any of these 2 methods,
96 |
97 | *warning : this only works in linux*
98 | ```sh
99 | man opencrawler
100 | ```
101 |
102 | For Linux:
103 | ```sh
104 | opencrawler help
105 | ```
106 | For Windows:
107 | ```sh
108 | python opencrawler help
109 | ```
110 |
111 | ### About Commands
112 |
113 | ##### help
114 |
115 | Shows the commands available
116 |
117 | ##### v
118 |
119 | Shows the current version of opencrawler
120 |
121 | ##### crawl
122 |
123 | This would start the normal crawler
124 |
125 | ##### forced_crawl \<website\>
126 |
127 | Forcefully crawl a site , the site crawled is \<website\>
128 |
129 | ##### crawled_status
130 |
131 | *warning : the data shown isn't exact*
132 |
133 | Gives the info on the MongoDB..
134 | This will show the number of sites crawled and the average amount of storage used.
135 |
136 | Shows the info for both collections : (more info on the collections is given in the *working* section)
137 | - crawledsites
138 | - waitlist
139 |
140 | ##### search \<query\>
141 |
142 | Uses basic filtering methods to search , this command isn't meant to be anything like a search engine
143 | (the working of search is discussed in the *working* section)
144 |
145 |
146 | ##### configure
147 |
148 | Configures the opencrawler...
149 | The same command is also used to re-configure...
150 | It will ask for all the info required to start the crawler and save it in a json file (config.json) (more info in the *config file* section)
151 |
152 | It's ok if you run the crawl command without configs, because it will ask you to configure anyway .. xd
153 |
154 | ##### connection-tree \<website\> \<depth\>
155 |
156 | A tree of websites connected to \<website\> will be shown
157 |
158 | \<depth\> is how deep you want to crawl a site.
159 | The default depth is 2
160 |
161 | ##### check_html \<website\>
162 |
163 | Checks if a website is returning html
164 |
165 | ##### crawlable \<website\>
166 |
167 | Checks if a website is allowed to be crawled
168 | It checks the robots.txt to find whether crawling is disallowed
169 |
170 | ##### dissallowed \<website\>
171 |
172 | Shows the disallowed urls of a website
173 | The results are based on robots.txt
174 |
175 | ##### fix_db
176 |
177 | Starts the fix db program
178 | This can be used to resolve bugs present in the code , which could contaminate the DB
179 |
180 | ##### re-install
181 |
182 | Re-installs the opencrawler
183 |
184 | ##### update
185 |
186 | Installs new version of the opencrawler | reinstalls
187 |
188 | ##### install-requirements
189 |
190 | Installs the requirements..
191 | These requirements are mentioned in requirements.txt
192 |
193 | ## Config File
194 |
195 | The file is generated by the configure command , which will run the "config.py" file.
196 |
197 | The file is in json , "config.json"
198 |
199 | The config file stores info regarding the crawling activity
200 | These include the following (an example config.json is shown after the list) :
201 |
202 | - **MONGODB_PWD** - pwd of mongoDB user
203 | - **MONGODB_URI** - uri for connecting to mongoDB
204 | - **TIMEOUT** - time out for get requests
205 | - **MAX_THREADS** - number of threads , set it as one if you don't wanna do multithreading
206 | - **bad_words** - the file containing list of bad words , which by default is bad_words.txt (bad_words.txt is provided)
207 | - **USE_PROXIES** - bool - if the crawler should use proxies (a proxy won't be used for robots.txt scanning even if this is set to True)
208 | - **Scan_Bad_Words** - bool - if you want to save the bad / offensive text score
209 | - **Scan_Top_Keywords** - bool - if you want to save the top keywords found in the html txt
210 | - **urlscan_key** - the url scan API key , if you are not use the feature leave it empty
211 | - **URL_SCAN** - bool - if you want to scan url using UrlScan API
212 |
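As a rough illustration, a filled-in config.json could look like the sketch below (all values are placeholders; the URI assumes a MongoDB Atlas style connection string):
```json
{
    "MONGODB_PWD": "your-db-password",
    "MONGODB_URI": "mongodb+srv://user:<password>@cluster0.example.mongodb.net/",
    "TIMEOUT": 5,
    "MAX_THREADS": 4,
    "bad_words": "bad_words.txt",
    "USE_PROXIES": false,
    "Scan_Bad_Words": true,
    "Scan_Top_Keywords": true,
    "urlscan_key": "",
    "URL_SCAN": false
}
```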
213 |
214 | ## Working
215 |
216 |
217 |
218 | ### Files :
219 |
220 |
221 | |Filename |Type |Use |
222 | |----------|-------|-------------------------------------------------------------------------|
223 | |opencrawler|python|The main file, which gets called when you use the opencrawler command|
224 | |crawler.py|python|The file which does the crawling|
225 | |requirements.txt|text|The file containing the names of the Python modules to be installed|
226 | |search.py|python|Does the search|
227 | |opencrawler.1|roff|The user manual|
228 | |mongo_db.py|python|Handles MongoDB|
229 | |installer.py|python|Installer for Linux, which is run by install.sh|
230 | |install.sh|shell|Installs basic requirements like python3, **for Linux use only**|
231 | |fix_db.py|python|Fixes the DB|
232 | |connection_tree.py|python|Makes the connection tree|
233 | |config.py|python|Configures the OpenCrawler|
234 | |bad_words.txt|text|Contains bad words, used for predicting the bad/offensive text score|
235 |
236 | ### MongoDB Collections
237 |
238 | There are two collections used :
239 |
240 | - **waitlist** - Used for storing sites which are yet to be crawled
241 | - **crawledsites** - Used to store crawled sites and collected info about them
242 |
243 | ### How data is stored in MongoDB
244 |
245 | The structure in which data is stored in the collections:
246 |
247 | ##### crawledsites :
248 |
249 | ```
250 | ######### Crawled info is stored in MongoDB as #########
251 | Crawled sites = [
252 | {
253 | "website" : ""
254 |
255 | "time" : "",
256 | "mal" : Val/None, # malicious or not
257 | "offn" : Val/None, # 18 +/ Offensive language
258 | "ln" : "",
259 |
260 | "keys" : [],
261 | "desc" : "",
262 |
263 | "recc" : []/None,
264 | }
265 | ]
266 | ```
267 |
268 |
269 | ##### waitlist :
270 |
271 | ```
272 | waitlist = [
273 | {
274 | "website" : ""
275 | }
276 | ]
277 | ```
278 |
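A minimal pymongo sketch of writing records in this shape (assuming a local MongoDB instance; the URLs and field values are placeholders, and the database/collection names follow mongo_db.py):

```python
# Minimal sketch: insert one crawled-site record and one waitlist record.
# Assumes a local MongoDB; values are placeholders, not real crawl output.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client.Crawledsites  # the database name used in mongo_db.py

db.Crawledsites.insert_one({
    "website": "https://example.com",
    "time": time.time(),
    "mal": None,                    # malicious or not (None if not scanned)
    "offn": 0.0,                    # 18+ / offensive-language score
    "ln": "en",                     # detected language
    "keys": ["example", "demo"],    # keywords
    "desc": "An example description",
    "recc": ["example", "page"],    # top recurring words
})

db.waitlist.insert_one({"website": "https://example.org"})  # still to be crawled
```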
279 |
280 | ### Connection Tree
281 |
282 | *By default, the depth is 2*
283 |
284 | The connection-tree command works by getting all URLs found in a site,
285 | then doing the same with each of the URLs found;
286 | the number of times this is repeated depends on the depth.
287 |
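A minimal sketch of that recursion (not the actual connection_tree.py, just an illustration using requests and BeautifulSoup from requirements.txt; the URL is a placeholder):

```python
# Depth-limited link gathering, as described above (default depth 2).
import requests
from bs4 import BeautifulSoup

def links_on_page(url, timeout=3):
    """Return the absolute http(s) links found in a page's HTML."""
    try:
        html = requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return []
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]

def connection_tree(url, depth=2):
    """Nested dict: each site maps to the trees of the sites it links to."""
    if depth == 0:
        return {}
    return {link: connection_tree(link, depth - 1) for link in links_on_page(url)}

# print(connection_tree("https://example.com", depth=2))
```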
288 |
289 | ### Search
290 |
291 | The search command uses the data stored in the crawledsites collection.
292 |
293 | For each word of the query, it checks for sites containing that word in:
294 | - website URL
295 | - desc
296 | - keywords
297 | - top recurring words
298 |
299 | The results are sorted so that the sites matching the most words from the query come first. The MongoDB query used for each word is:
300 |
301 | ```
302 | url = list(_DB().Crawledsites.find({"$or" : [
303 | {"recc": {"$regex": re.compile(word, re.IGNORECASE)}},
304 | {"keys": {"$regex": re.compile(word, re.IGNORECASE)}},
305 | {"desc": {"$regex": re.compile(word, re.IGNORECASE)}},
306 | {"website" : {"$regex": re.compile(word, re.IGNORECASE)}}
307 | ]}))
308 |
309 | ```
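The ranking step then just counts, per site, how many query words produced a match and sorts by that count. A rough sketch of the idea (mirroring the counting loop in search.py):

```python
# Rank sites by how many query words matched them (most matches first).
def rank(matches_per_word):
    """matches_per_word: one list of matching website URLs per query word."""
    scores = {}
    for websites in matches_per_word:
        for site in set(websites):          # count each site once per word
            scores[site] = scores.get(site, 0) + 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# rank([["a.com", "b.com"], ["b.com"]]) -> [("b.com", 2), ("a.com", 1)]
```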
310 |
311 |
312 | ## Note
313 |
314 | - Proxies don't work for robots.txt scans while you are crawling, because urllib.robotparser doesn't allow the use of a proxy
315 | - If you have issues with pymongo not working, try installing the version preferred for your specific Python version
316 | - If you get errors regarding pymongo, also make sure you give read and write permissions to the user
317 | - You can use a local MongoDB
318 | - The search function doesn't make use of all possible filters to find a site
319 | - installer.py and install.sh are not the same; install.sh also installs Python and pip and then runs installer.py
320 | - installer.py and install.sh are only for Linux use
321 | - We use the ProxyScrape API for getting free proxies
322 | - We use VirusTotal's API for scanning websites, if required
323 |
324 |
325 |
326 |
327 |
328 |
329 |
--------------------------------------------------------------------------------
/fix_db.py:
--------------------------------------------------------------------------------
1 | """
2 | Set of Tools to fix your DB | OpenCrawler v 1.0.0
3 | """
4 |
5 |
6 |
7 | from mongo_db import connect_db, _DB
8 | from rich import print
9 | import json
10 | import os
11 |
12 |
13 |
14 |
15 | def mongodb():
16 | # Config File
17 | config_file = "config.json"
18 |
19 |
20 | # Load configs from config_file - > json
21 | try:
22 | config_file = open(config_file, "r")
23 | configs = json.loads(config_file.read())
24 | config_file.close()
25 |
26 | except:
27 | try:
28 | os.system("python3 config.py") # Re-configures
29 | except:
30 | os.system("python config.py") # Re-configures
31 |
32 | config_file = open(config_file, "r")
33 | configs = json.loads(config_file.read())
34 | config_file.close()
35 |
36 |
37 | ## Setting Up Configs
38 | MONGODB_PWD = configs["MONGODB_PWD"]
39 | MONGODB_URI = configs["MONGODB_URI"]
40 |
41 | # Initializes MongoDB
42 | connect_db(MONGODB_URI, MONGODB_PWD)
43 |
44 |
45 |
46 |
47 | mongodb() # Connects to DB
48 |
49 |
50 |
51 | print("\n[blue]---------------------------------------DB-FIXER---------------------------------------[/blue]\n")
52 |
53 | print("""[dark_orange]\t[1] Remove Duplicates[/dark_orange]""")
54 |
55 |
56 | print("\n[blue]Option :[/blue]", end="")
57 | op = input(" ")
58 |
59 | if op == "1":
60 |
61 | print("[blue] Scan Crawledsites (y/enter to skip) >[/blue]", end="")
62 | if input(" ").lower() == "y":
63 | print("\n[green] [+] Scanning Duplicates In Crawledsites [/green]")
64 |
65 | e = _DB().Crawledsites.find({})
66 | for x in e:
67 |
68 | ww = list(_DB().Crawledsites.find({"website":x["website"]}))
69 | len_ = len(ww)
70 |
71 | ww = ww[0]
72 |
73 |
74 | if len_ != 1:
75 |
76 | _DB().Crawledsites.delete_many({"website":x["website"]})
77 |
78 | _DB().Crawledsites.insert_one(ww)
79 |
80 | print(f"[green] [+] Removed : {x['website']} [/green]")
81 |
82 |
83 |
84 | print("[blue] Scan waitlist (y/enter to skip) >[/blue]", end="")
85 | if input(" ").lower() == "y":
86 |
87 | print("[green] [+] Scanning Duplicates In waitlist [/green]")
88 |
89 | e = _DB().waitlist.find({})
90 | for x in e:
91 |
92 | ww = list(_DB().waitlist.find({"website":x["website"]}))
93 | len_ = len(ww)
94 |
95 | ww = ww[0]
96 |
97 | if len_ != 1:
98 |
99 | _DB().waitlist.delete_many({"website":x["website"]})
100 |
101 | _DB().waitlist.insert_one(ww)
102 |
103 | print(f"[green] [+] Removed : {x['website']} [/green]")
104 |
105 |
106 |
107 | # print("[blue] Scan Robots (y/enter to skip) >[/blue]", end="")
108 | # if input(" ").lower() == "y":
109 |
110 | # print("[green] [+] Scanning Duplicates In Robots [/green]")
111 |
112 | # e = _DB().Robots.find({})
113 | # for x in e:
114 |
115 | # ww = list(_DB().Robots.find({"website":x["website"]}))
116 | # len_ = len(ww)
117 |
118 | # ww = ww[0]
119 |
120 | # if len_ != 1:
121 |
122 | # _DB().Robots.delete_many({"website":x["website"]})
123 |
124 | # _DB().Robots.insert_one(ww)
125 |
126 | # print(f"[green] [+] Removed : {x['website']} [/green]")
127 |
128 |
129 | else:
130 | print(f"[red] [-] Option '{op}' Not Found[/red]")
131 |
132 |
133 | print("\n[blue]--------------------------------------------------------------------------------------[/blue]\n")
134 |
135 |
136 |
--------------------------------------------------------------------------------
/install.sh:
--------------------------------------------------------------------------------
1 | sudo apt update
2 | sudo apt install python3
3 | sudo apt install python3-pip
4 | python3 installer.py
5 |
--------------------------------------------------------------------------------
/installer.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler v 1.0.0 Installer
3 | - Linux Installer
4 | """
5 |
6 |
7 |
8 |
9 | import os
10 |
11 |
12 |
13 | os.system("pip3 install rich")
14 |
15 | os.system("clear")
16 |
17 |
18 | from rich import print
19 |
20 | print("[green] [+] Installing Requirements[/green]")
21 |
22 |
23 | os.system("pip3 install -r requirements.txt")
24 |
25 |
26 | print("[green] [+] Installing The Manual | You can run 'man opencrawler' now yey!! [/green]")
27 |
28 |
29 | os.system("sudo cp opencrawler.1 /usr/local/man/man1/opencrawler.1")
30 |
31 | print("[green] [+] Adding files to path[/green]")
32 |
33 | files = ["search.py", "robots_txt.py", "mongo_db.py", "crawler.py", "fix_db.py", "opencrawler", "connection_tree.py", "config.py", "bad_words.txt"] # FIles which will be added to path
34 |
35 | for file in files:
36 | os.system(f"sudo cp {file} /usr/bin/{file}")
37 |
38 | os.system("sudo chmod +x /usr/bin/opencrawler")
39 |
40 |
41 | print("[green] Exited[/green]")
42 |
43 |
44 |
--------------------------------------------------------------------------------
/mongo_db.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler v 1.0.0 | Mongo - DB
3 | """
4 |
5 |
6 | from pymongo.mongo_client import MongoClient
7 | from rich import print
8 | import atexit
9 | import time
10 | import json
11 | import os
12 |
13 |
14 | # Main Variables
15 | CLIENT = None
16 | DB = None
17 |
18 |
19 |
20 | def connect_db(uri, pwd):
21 | """
22 | Initializes Connection With MongoDB
23 | uri : str - > The URI given by MongoDB
24 | pwd : str - > The password to connect
25 | """
26 |
27 | global CLIENT, DB
28 |
29 | uri = uri.replace("<password>", pwd)  # substitute the password placeholder in the URI
30 |
31 | try:
32 | CLIENT = MongoClient(uri)
33 | print("[spring_green1] [+] Connected To MongoDB [/spring_green1]")
34 |
35 | DB = CLIENT.Crawledsites
36 |
37 |
38 | except Exception as e:
39 | print("[red] [-] Error Occured While Connecting To Mongo DB [/red]")
40 | print(f"[red] [bold] \t\t\t\t Error : {e}[/bold] [/red]")
41 | quit()
42 |
43 |
44 |
45 | def if_waiting(url):
46 | """
47 | Checks if a website is in the waiting list
48 | returns bool
49 | """
50 | try:
51 | a = DB.waitlist.find_one({"website":url})["website"]
52 | if a != None:
53 | return True
54 | else:
55 | return False
56 | except:
57 | return False
58 |
59 |
60 | def _DB():
61 | """
62 | returns the DB
63 | """
64 | return DB
65 |
66 |
67 | def get_info():
68 | """
69 | To get count of docs in main collections
70 | returns list of int
71 | """
72 |
73 | a = int(DB.Crawledsites.estimated_document_count())
74 | b = int(DB.waitlist.estimated_document_count())
75 |
76 | a = f" Len : {a} | Storage : {a*257} Bytes"
77 | b = f" Len : {b} | Storage : {b*618} Bytes"
78 |
79 | return [a, b]
80 |
81 |
82 |
83 | def get_last():
84 | """
85 | Last crawled site
86 | returns str
87 | """
88 | a = DB.Crawledsites.find().sort("_id", -1)
89 | return a[0]["website"]
90 |
91 |
92 |
93 | def get_crawl(website):
94 | """
95 | Get crawled info of a site
96 | returns dict
97 | """
98 |
99 | return dict(DB.Crawledsites.find_one({"website":website}))
100 |
101 |
102 |
103 | def if_crawled(url):
104 | """
105 | Checks if a site was crawled
106 | returns Bool , time/None (last crawled time)
107 | """
108 | try:
109 | a = DB.Crawledsites.find_one({"website":url})
110 | return True, a["time"]
111 |
112 | except:
113 | return False, None
114 |
115 |
116 |
117 | def update_crawl(website, time, mal, offn, ln, key, desc, recc):
118 | """
119 | Updates a crawl
120 | """
121 | DB.Crawledsites.delete_many({"website":website})
122 | DB.Crawledsites.insert_one({"website":website, "time":time, "mal":mal, "offn":offn, "ln":ln, "key":key, "desc":desc, "recc":recc})
123 |
124 |
125 |
126 |
127 | def save_crawl(website, time, mal, offn, ln, key, desc, recc):
128 | """
129 | Saves a crawl
130 | """
131 | DB.Crawledsites.insert_one({"website":website, "time":time, "mal":mal, "offn":offn, "ln":ln, "key":key, "desc":desc, "recc":recc})
132 |
133 |
134 |
135 | def save_robots(website, robots):
136 | """
137 | Saves dissallowed sites
138 | """
139 |
140 | DB.Robots.insert_one({"website":website, "restricted":robots})
141 |
142 |
143 |
144 |
145 | def get_robots(website):
146 | """
147 | Gets dissallowed sites from the database
148 | """
149 | return DB.Robots.find_one({"website":website})["restricted"]
150 |
151 |
152 |
153 | def get_wait_list(num):
154 | """
155 | Gets websites to crawl
156 | num : int - > number of websites to recv
157 | returns list - > list of websites
158 | """
159 |
160 | wait = list(DB.waitlist.find().limit(num))
161 |
162 | for e in wait:
163 | DB.waitlist.delete_many({"website":e["website"]})
164 |
165 | return wait
166 |
167 |
168 |
169 | def write_to_wait_list(list_):
170 | """
171 | Writes to collection of websites to get crawled
172 | list_ : list - > website urls
173 | """
174 |
175 | list_ = set(list_)
176 | list__ = []
177 |
178 |
179 |
180 | for e in list_:
181 | if not if_waiting(e):
182 | list__.append({"website": e})
183 |
184 |
185 | try:
186 | DB.waitlist.insert_many(list__)
187 | except:
188 | pass
189 |
190 |
191 |
192 | # Part of testings
193 | if __name__ == "__main__":
194 |
195 | # Config File
196 | config_file = "config.json"
197 |
198 |
199 | # Load configs from config_file - > json
200 | try:
201 | config_file = open(config_file, "r")
202 | configs = json.loads(config_file.read())
203 | config_file.close()
204 |
205 | except:
206 | try:
207 | os.system("python3 config.py") # Re-configures
208 | except:
209 | os.system("python config.py") # Re-configures
210 |
211 | config_file = open(config_file, "r")
212 | configs = json.loads(config_file.read())
213 | config_file.close()
214 |
215 |
216 | ## Setting Up Configs
217 | MONGODB_PWD = configs["MONGODB_PWD"]
218 | MONGODB_URI = configs["MONGODB_URI"]
219 |
220 | connect_db(MONGODB_URI, MONGODB_PWD)
221 |
222 | # save_crawl("w1",1,0,1,3,4,5,5)
223 | # save_crawl("w2",4,0,1,3,4,5,5)
224 | # save_crawl("w3",10,0,1,3,4,5,5)
225 | # print(if_crawled("w"))
226 | # update_crawl("w",1,1,1,3,4,5,5)
227 | # print(get_last())
228 | # print(if_crawled("https://darkmash-org.github.io/"))
229 | # print(get_robots("www.bfi.org.uk"))
230 | # print(if_waiting("https://www.w3.org/blog/2015/01/"))
231 |
--------------------------------------------------------------------------------
/opencrawler:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python3
2 |
3 |
4 | """
5 | Open Crawler v 1.0.0 | CLI
6 | """
7 |
8 |
9 |
10 | from rich.table import Table
11 | from rich import print
12 | from mongo_db import *
13 | import robots_txt
14 | import platform
15 | import requests
16 | import json
17 | import sys
18 | import os
19 |
20 |
21 |
22 |
23 | def mongodb():
24 | # Config File
25 | config_file = "config.json"
26 |
27 |
28 | # Load configs from config_file - > json
29 | try:
30 | config_file = open(config_file, "r")
31 | configs = json.loads(config_file.read())
32 | config_file.close()
33 |
34 | except:
35 | try:
36 | os.system("python3 config.py") # Re-configures
37 | except:
38 | os.system("python config.py") # Re-configures
39 |
40 |
41 | config_file = open(config_file, "r")
42 | configs = json.loads(config_file.read())
43 | config_file.close()
44 |
45 |
46 | ## Setting Up Configs
47 | MONGODB_PWD = configs["MONGODB_PWD"]
48 | MONGODB_URI = configs["MONGODB_URI"]
49 |
50 | # Initializes MongoDB
51 | connect_db(MONGODB_URI, MONGODB_PWD)
52 |
53 |
54 |
55 |
56 | # Finding the location of the file
57 | cur_path = __file__.split("/")
58 | cur_path.remove(cur_path[-1])
59 | cur_path_ = ""
60 | for dir_ in cur_path:
61 | cur_path_ += "/" + dir_
62 | cur_path = cur_path_
63 |
64 |
65 |
66 | table_commands = Table(title="Help - Open Crawler v 1.0.0")
67 |
68 | table_commands.add_column("Command", style="cyan", no_wrap=True)
69 | table_commands.add_column("Use", style="magenta")
70 | table_commands.add_column("No", justify="right", style="green")
71 |
72 | table_commands.add_row("help", "Get info about the commands", "1")
73 | table_commands.add_row("v", "Get the version of open crawler", "2")
74 | table_commands.add_row("crawl", "Starts up the normal crawler", "3")
75 | table_commands.add_row("force_crawl ", "Forcefully crawls a website", "4")
76 | table_commands.add_row("crawled_status", "Shows the amount of data in DB , etc", "5")
77 | table_commands.add_row("configure", "Write / reWrite The config file", "6")
78 | table_commands.add_row("connection-tree ", "Makes a tree of websites connected to it, layers by default is 2", "7")
79 | table_commands.add_row("check_html ", "Checks if a website respond with html content", "8")
80 | table_commands.add_row("crawlable ", "Checks if a website is allowed to be crawled", "9")
81 | table_commands.add_row("dissallowed ", "Lists the websites not allowed to be crawled", "10")
82 | table_commands.add_row("re-install", "Re installs the Open Crawler", "11")
83 | table_commands.add_row("update", "Updates the open crawler", "12")
84 | table_commands.add_row("install-requirements", "Installs requirements for open crawler", "13")
85 | table_commands.add_row("search ", "Search from the crawled data", "14")
86 | table_commands.add_row("fix_db", "Tools to fix the DB", "15")
87 |
88 |
89 |
90 | try:
91 | main_arg = sys.argv[1]
92 | except:
93 | print(table_commands)
94 | quit()
95 |
96 | try:
97 | if main_arg == "help":
98 | print(table_commands)
99 |
100 |
101 | elif main_arg == "v":
102 | print("""
103 | [medium_spring_green]
104 | ______ ______ __
105 | / \ / \ | \
106 | | $$$$$$\ ______ ______ _______ | $$$$$$\ ______ ______ __ __ __ | $$
107 | | $$ | $$ / \ / \ | \ | $$ \$$ / \ | \ | \ | \ | \| $$
108 | | $$ | $$| $$$$$$\| $$$$$$\| $$$$$$$\ | $$ | $$$$$$\ \$$$$$$\| $$ | $$ | $$| $$
109 | | $$ | $$| $$ | $$| $$ $$| $$ | $$ | $$ __ | $$ \$$/ $$| $$ | $$ | $$| $$
110 | | $$__/ $$| $$__/ $$| $$$$$$$$| $$ | $$ | $$__/ \| $$ | $$$$$$$| $$_/ $$_/ $$| $$
111 | \$$ $$| $$ $$ \$$ \| $$ | $$ \$$ $$| $$ \$$ $$ \$$ $$ $$| $$
112 | \$$$$$$ | $$$$$$$ \$$$$$$$ \$$ \$$ \$$$$$$ \$$ \$$$$$$$ \$$$$$\$$$$ \$$
113 | | $$
114 | | $$
115 | \$$ [bold]v 1.0.0[/bold] [/medium_spring_green]
116 |
117 | """)
118 |
119 |
120 | elif main_arg == "fix_db":
121 | try:
122 | os.system(f"python3 {cur_path}/fix_db.py")
123 | except:
124 | os.system(f"python {cur_path}/fix_db.py")
125 |
126 |
127 | elif main_arg == "search":
128 |
129 | pool = False
130 |
131 | try:
132 | test_arg = sys.argv[2:]
133 |
134 | except:
135 | print("[red] [-] No Search Text[/red]")
136 | quit()
137 |
138 | txt = ""
139 | for e in test_arg:
140 | txt += " " + e
141 |
142 | try:
143 | os.system(f"python3 {cur_path}/search.py {txt}")
144 | except:
145 | os.system(f"python {cur_path}/search.py {txt}")
146 |
147 |
148 | elif main_arg == "configure":
149 | try:
150 | os.system(f"python3 {cur_path}/config.py")
151 | except:
152 | os.system(f"python {cur_path}/config.py")
153 |
154 |
155 | elif main_arg == "crawl":
156 | try:
157 | os.system(f"python3 {cur_path}/crawler.py")
158 | except:
159 | os.system(f"python {cur_path}/crawler.py")
160 |
161 |
162 | elif main_arg == "forced_crawl":
163 | try:
164 | web = sys.argv[2]
165 | except:
166 | print("[red] [-] Link Not Passed In [/red]")
167 | quit()
168 |
169 | try:
170 | os.system(f"python3 {cur_path}/crawler.py {web}")
171 | except:
172 | os.system(f"python {cur_path}/crawler.py {web}")
173 |
174 |
175 | elif main_arg == "crawled_status":
176 | mongodb()
177 | try:
178 | res = get_info()
179 | except:
180 | print("[red] [-] Couldn't Get The Info[/red]")
181 | quit()
182 |
183 | print("[yellow3] [?] The Info Given Wouldn't Be Accurate[/yellow3]\n")
184 |
185 | print(f"[dark_orange] \t : Crawled Sites - > {res[0]} [/dark_orange]")
186 | print(f"[dark_orange] \t : Wait list - > {res[1]} [/dark_orange]")
187 |
188 | print("")
189 |
190 | elif main_arg == "connection-tree":
191 | try:
192 | web = sys.argv[2]
193 | except:
194 | print("[red] [-] Link Not Passed In [/red]")
195 | quit()
196 |
197 | num = 2
198 |
199 | try:
200 | num = sys.argv[3]
201 | except:
202 | pass
203 | try:
204 | os.system(f"python3 {cur_path}/connection_tree.py {web} {num}")
205 | except:
206 | os.system(f"python {cur_path}/connection_tree.py {web} {num}")
207 |
208 | elif main_arg == "dissallowed":
209 | try:
210 | web = sys.argv[2]
211 | except:
212 | print("[red] [-] Link Not Passed In [/red]")
213 | quit()
214 |
215 | try:
216 | restricted = robots_txt.disallowed(web, None)
217 | except:
218 | print("[red] [-] The site was down or some other error [/red]")
219 | quit()
220 |
221 | print("[green] [+] Dissallowed : [/green]")
222 | for e in restricted:
223 | print(f"[green] \t\t-------> {e} [/green]")
224 |
225 |
226 | elif main_arg == "crawlable":
227 | try:
228 | web = sys.argv[2]
229 | except:
230 | print("[red] [-] Link Not Passed In [/red]")
231 | quit()
232 |
233 | try:
234 | restricted = robots_txt.disallowed(web, None)
235 | except:
236 | print("[red] [-] The site was down or some other error [/red]")
237 | quit()
238 |
239 | site = web
240 | site = site.replace("https://", "")
241 | site = site.replace("http://", "")
242 | web = site.split("/")
243 | web.remove(web[0])
244 | site = ""
245 | for e in web:
246 | site += e + "/"
247 | web = site
248 |
249 | A = True
250 | for e in restricted:
251 | if web.startswith(e):
252 | A = False
253 | break
254 |
255 | if A:
256 | print(f"[green] [+] Can Be Crawled [/green]")
257 | else:
258 | print(f"[red] [-] Can't Be Crawled [/red]")
259 |
260 |
261 | elif main_arg == "check_html":
262 | try:
263 | web = sys.argv[2]
264 | except:
265 | print("[red] [-] Link Not Passed In [/red]")
266 | quit()
267 |
268 | try:
269 | is_html = "html" in requests.get(web).headers["Content-Type"]
270 | except:
271 | print("[red] [-] Can't Be Checked Because the site is down or no content type provided in headers[/red]")
272 | quit()
273 |
274 | if is_html:
275 | print(f"[green] [+] '{web}' Respond With HTML Content [/green]")
276 | else:
277 | print(f"[red] [-] '{web}' Doesn't Respond With HTML Content [/red]")
278 |
279 |
280 | elif main_arg == "update":
281 | if platform.system() != "windows":
282 |
283 | try:
284 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
285 | except:
286 | os.system("sudp rm -rf OpenCrawler")
287 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
288 |
289 | os.system("cd OpenCrawler")
290 | os.system("chmod +x install.sh")
291 | os.system("./install.sh")
292 | else:
293 | print("[yellow] This wont work on windows [/yellow]")
294 |
295 | elif main_arg == "install_requirements":
296 | try:
297 | os.system(f"pip3 install -r {cur_path}/requirements.txt")
298 | except:
299 | os.system(f"pip install -r {cur_path}/requirements.txt")
300 |
301 |
302 | elif main_arg == "re_install":
303 | if platform.system() != "windows":
304 |
305 | try:
306 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
307 | except:
308 |
309 | os.system("sudp rm -rf OpenCrawler")
310 | os.system("git clone https://github.com/merwin-asm/OpenCrawler.git")
311 |
312 | os.system("cd OpenCrawler")
313 | os.system("chmod +x install.sh")
314 | os.system("./install.sh")
315 |
316 | else:
317 | print("[yellow] This wont work on windows [/yellow]")
318 |
319 |
320 | else:
321 | print(f"[red] [-] Command '{main_arg}' Not Found || Use 'opencrawler help' For Commands[/red]")
322 |
323 | except:
324 | print("[red] [-] Some Error Occurred[/red]")
325 |
326 |
327 |
--------------------------------------------------------------------------------
/opencrawler.1:
--------------------------------------------------------------------------------
1 | .TH OpenCrawler 1
2 |
3 | .SH NAME OpenCrawler v 1.0.0
4 |
5 | .SH DESCRIPTION
6 |
7 | .B OpenCrawler v 1.0.0
8 | Is a program for crawling websites.
9 |
10 | .TP
11 | An open source crawler/spider. LICENSE - MIT.
12 |
13 | .SH COMMANDS
14 |
15 | .TP
16 | .BR help
17 | Get info about the commands
18 |
19 | .TP
20 | .BR v
21 | Get the version of open crawler
22 |
23 | .TP
24 | .BR crawl
25 | Starts up the normal crawler
26 |
27 | .TP
28 | .BR forced_crawl
29 | Forcefully crawl a website/Make the crawler crawl the website
30 |
31 | .TP
32 | .BR crawled_status
33 | Shows the amount of data in DB , etc
34 |
35 | .TP
36 | .BR configure
37 | Write / reWrite The config file
38 |
39 | .TP
40 | .BR connection-tree
41 | Makes a tree of websites connected to it, layers by default is 2
42 |
43 | .TP
44 | .BR check_html
45 | Checks if a website respond with html content
46 |
47 | .TP
48 | .BR crawlable
49 | Checks if a website is allowed to be crawled
50 |
51 | .TP
52 | .BR dissallowed
53 | Lists the websites not allowed to be crawled
54 |
55 | .TP
56 | .BR re-install
57 | Re installs the Open Crawler
58 |
59 | .TP
60 | .BR update
61 | Updates the open crawler
62 |
63 | .TP
64 | .BR install-requirements
65 | Installs requirements for open crawler
66 |
67 | .TP
68 | .BR search
69 | Search from the crawled data
70 |
71 |
72 | .TP
73 | .BR fix_db
74 | Tools to fix the DB
75 |
76 |
--------------------------------------------------------------------------------
/proxy_tool.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from rich import print
3 | import threading
4 | import time
5 |
6 |
7 | working_proxies = []
8 |
9 |
10 | def check(url, protocol, proxy):
11 | global working_proxies
12 |
13 | try:
14 | response = requests.get(url, proxies={protocol:proxy}, timeout=2)
15 | if response.status_code == 200:
16 | print(f"Proxy {proxy} works! [{len(working_proxies)+1}]")
17 | working_proxies.append(proxy)
18 | else:
19 | print(f"Proxy {proxy} failed")
20 | except requests.RequestException as e:
21 | pass
22 |
23 |
24 | def proxy_checker(proxies, url="https://www.wired.com/review/klipsch-flexus-core-200/"):
25 | proxies = proxies.text.split("\r\n")
26 |
27 | if url.startswith("https"):
28 | protocol = "https"
29 | else:
30 | protocol = "http"
31 |
32 | proxies.pop()
33 |
34 | i = 0
35 | for proxy in proxies:
36 | threading.Thread(target=check, args=(url, protocol, proxy)).start()
37 | i += 1
38 | if i == 20:
39 | time.sleep(1.5)
40 | i = 0
41 |
42 | return working_proxies
43 |
44 | def proxy_checker_(proxies, url="https://www.wired.com/review/klipsch-flexus-core-200/"):
45 | proxies = proxies.split("\r\n")
46 |
47 | if url.startswith("https"):
48 | protocol = "https"
49 | else:
50 | protocol = "http"
51 |
52 | proxies.pop()
53 |
54 |
55 | i = 0
56 | for proxy in proxies:
57 | threading.Thread(target=check, args=(url, protocol, proxy)).start()
58 | i += 1
59 | if i == 20:
60 | time.sleep(1.5)
61 | i = 0
62 |
63 | return working_proxies
64 |
65 | def get_proxy():
66 | global working_proxies
67 |
68 | try:
69 | f = open("found_proxies_http")
70 | f2 = open("found_proxies_https")
71 | res = f.read()
72 | res2 = f2.read()
73 |
74 | HTTP = proxy_checker_(res, "https://www.wired.com/review/klipsch-flexus-core-200/")
75 | working_proxies = []
76 | HTTPS = proxy_checker(res2)
77 |
78 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
79 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
80 |
81 | f2.close()
82 | f.close()
83 |
84 | f = open("found_proxies_http", "w")
85 | f2 = open("found_proxies_https", "w")
86 | f.write("\n".join(HTTP))
87 | f2.write("\n".join(HTTPS))
88 | f.close()
89 | f2.close()
90 |
91 |
92 | except:
93 | print("[red] [-] Failed, Maybe Because Already Didn't Have A proxyList To Refresh[/red]")
94 |
95 | def gen_new():
96 | global working_proxies
97 | print("We are generating a new proxy list so it would take time... \[this happens when you are using old proxylist/have none]")
98 | res = requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
99 | res2= requests.get("https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all")
100 | HTTP = proxy_checker(res, "http://www.wired.com/review/klipsch-flexus-core-200/")
101 | working_proxies = []
102 | HTTPS = proxy_checker(res2)
103 |
104 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {len(HTTP)} [/green]")
105 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {len(HTTPS)} [/green]")
106 |
107 | f = open("found_proxies_http", "w")
108 | f2 = open("found_proxies_https", "w")
109 | f.write("\n".join(HTTP))
110 | f2.write("\n".join(HTTPS))
111 | f.close()
112 | f2.close()
113 |
114 | print("[blue] This Tool Belonging To OpenCrawler Project Can \nHelp With Generating And Checking HTTP And HTTPS Proxy List![/blue]")
115 | print("")
116 | print("""[blue]
117 | \t [1] - Refresh The ProxyList, Remove Proxies Which Doesn't Work.
118 | \t [2] - Renew The ProxyList, Autogenerate A New ProxyList.
119 | \t [3] - Show Count of the Proxies HTTP And HTTPS.
120 | [/blue]""")
121 |
122 | t = input("[1,2,3] > ")
123 | if t == "1":
124 | get_proxy()
125 | elif t == "2":
126 | gen_new()
127 | elif t == "3":
128 | try:
129 | f = open("found_proxies_http")
130 | f2 = open("found_proxies_https")
131 | res = f.read()
132 | res2 = f2.read()
133 | a = len(res.split('\r\n'))
134 | b = len(res2.split('\r\n'))
135 | print(f"[green]Total Number Of HTTP PROXIES FOUND : {a} [/green]")
136 | print(f"[green]Total Number Of HTTPS PROXIES FOUND : {b} [/green]")
137 |
138 | f2.close()
139 | f.close()
140 | except:
141 | print("[red] [-] Failed, Maybe Because Already Didn't Have A proxyList[/red]")
142 |
143 | else:
144 | print("[red] [-] Unknown Command[/red]")
145 |
146 |
147 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pymongo
2 | rich
3 | requests
4 | memory_profiler
5 | beautifulsoup4
6 | langdetect
7 |
--------------------------------------------------------------------------------
/robots_txt.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler - Robot.txt - Loader
3 | """
4 |
5 | import requests as r
6 |
7 |
8 |
9 |
10 | def get_robot_txt(site, proxies):
11 | site = site.replace("https://", "")
12 | site = site.replace("http://", "")
13 | robot_file_url = "https://" + site.split("/")[0] + "/robots.txt"
14 |
15 |
16 | try:
17 | res = r.get(robot_file_url, timeout = 3 , proxies = proxies)
18 | except:
19 | res = None
20 |
21 | if res == None:
22 | return ""
23 |
24 | status = int(res.status_code)
25 |
26 | if status == 200:
27 | return res.text
28 |
29 | else:
30 | return ""
31 |
32 |
33 | def get_lines(txt):
34 | txt = remove_comments(txt)
35 |
36 | while "" in txt:
37 | txt.remove("")
38 |
39 | while "\n" in txt:
40 | txt.remove("\n")
41 |
42 | while "\t" in txt:
43 | txt.remove("\t")
44 |
45 | return txt
46 |
47 |
48 | def remove_comments(txt):
49 | txt = txt.split("\n")
50 | txt_ = []
51 | for e in txt:
52 | txt_.append(e.split("#")[0])
53 |
54 | return txt_
55 |
56 |
57 | def disallowed(site, proxies):
58 |
59 | txt = get_robot_txt(site, proxies)
60 | txt = get_lines(txt)
61 |
62 | dis = []
63 |
64 | if txt == []:
65 | return txt
66 |
67 | record = False
68 | for line in txt:
69 | if line == "User-agent: *":
70 | record = True
71 | elif line.startswith("Disallow:"):
72 | if record:
73 | dis.append(line.split(" ")[-1])
74 | elif "User-agent:" in line:
75 | if record == True:
76 | break
77 |
78 | return dis
79 |
80 |
81 | if __name__ == "__main__":
82 | # print(get_robot_txt("https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt"))
83 | # print(disallowed("https://www.google.com/robots.txt"))
84 | pass
85 |
--------------------------------------------------------------------------------
/search.py:
--------------------------------------------------------------------------------
1 | """
2 | Open Crawler v 1.0.0 | search.py
3 |
4 | -- Note: the official search function doesn't count clicks or learn from search patterns, etc. :]
5 | """
6 |
7 |
8 |
9 | from mongo_db import connect_db, _DB
10 | from rich import print
11 | import time
12 | import json
13 | import sys
14 | import os
15 | import re
16 |
17 |
18 |
19 |
20 | def mongodb():
21 | # Config File
22 | config_file = "config.json"
23 |
24 |
25 | # Load configs from config_file - > json
26 | try:
27 | config_file = open(config_file, "r")
28 | configs = json.loads(config_file.read())
29 | config_file.close()
30 |
31 | except:
32 | try:
33 | os.system("python3 config.py") # Re-configures
34 | except:
35 | os.system("python config.py") # Re-configures
36 |
37 |
38 | config_file = open(config_file, "r")
39 | configs = json.loads(config_file.read())
40 | config_file.close()
41 |
42 |
43 | ## Setting Up Configs
44 | MONGODB_PWD = configs["MONGODB_PWD"]
45 | MONGODB_URI = configs["MONGODB_URI"]
46 |
47 | # Initializes MongoDB
48 | connect_db(MONGODB_URI, MONGODB_PWD)
49 |
50 |
51 |
52 |
53 | mongodb() # Connects to DB
54 |
55 |
56 | # Get the search
57 | search = sys.argv[1:]
58 |
59 |
60 | RESULTS = {} # Collects the results
61 |
62 |
63 | if len(search) > 1:
64 |
65 | t_1 = time.time()
66 |
67 | for e in search:
68 | url = list(_DB().Crawledsites.find({"$or" : [
69 | {"recc": {"$regex": re.compile(e, re.IGNORECASE)}},
70 | {"keys": {"$regex": re.compile(e, re.IGNORECASE)}},
71 | {"desc": {"$regex": re.compile(e, re.IGNORECASE)}},
72 | {"website" : {"$regex": re.compile(e, re.IGNORECASE)}}
73 | ]}))
74 |
75 | res = []
76 | [res.append(x["website"]) for x in url if x["website"] not in res]
77 |
78 | del url
79 |
80 | for url in res:
81 | if url in RESULTS.keys():
82 | RESULTS[url] += 1
83 | else:
84 | RESULTS.setdefault(url, 1)
85 |
86 |
87 | t_2 = time.time()
88 |
89 |
90 | RESULTS_ = RESULTS
91 |
92 | RESULTS = sorted(RESULTS.items(), key=lambda x:x[1], reverse=True)
93 |
94 | c = 0
95 | for result in RESULTS:
96 | if RESULTS_[result[0]] > 1:
97 | print(f"[green]Link: {result[0]} | Common words: {result[1]} [/green]")
98 | c += 1
99 |
100 | print(f"[dark_orange]Query : {search} | Total Results : {c} | Time Taken : {t_2 - t_1}s[/dark_orange]")
101 |
102 | else:
103 | t_1 = time.time()
104 | e = search[0]
105 | url = list(_DB().Crawledsites.find({"$or" : [
106 | {"recc": {"$regex": re.compile(e, re.IGNORECASE)}},
107 | {"keys": {"$regex": re.compile(e, re.IGNORECASE)}},
108 | {"desc": {"$regex": re.compile(e, re.IGNORECASE)}},
109 | {"website" : {"$regex": re.compile(e, re.IGNORECASE)}}
110 | ]}))
111 | t_2 = time.time()
112 |
113 | res = []
114 | [res.append(x["website"]) for x in url if x["website"] not in res]
115 |
116 | del url
117 |
118 | for result in res:
119 | print(f"[green]Link: {result}[/green]")
120 |
121 | print(f"[dark_orange]Query : {search} | Total Results : {len(res)} | Time Taken : {t_2 - t_1}s[/dark_orange]")
122 |
123 |
124 |
--------------------------------------------------------------------------------
/search_website.py:
--------------------------------------------------------------------------------
1 | from flask import Flask, request, jsonify, render_template
2 | import os
3 | import pickle
4 |
5 | app = Flask(__name__)
6 |
7 | @app.route('/')
8 | def index():
9 | return render_template('index.html')
10 |
11 | @app.route('/search', methods=['GET'])
12 | def search():
13 | query = request.args.get('query')
14 | if not query:
15 | return jsonify({"error": "No query provided"}), 400
16 |
17 | os.system(f"python3 search.py {query}")
18 |
19 | f = open(".results_"+".".join(query.split(" ")), "rb")
20 | data = pickle.loads(f.read())
21 | f.close()
22 |
23 | output = {
24 | "urls": data["res"],
25 | "summary": f"Query : {query} | Total Results : {len(data['res'])} | Time Taken : {data['time']}"
26 | }
27 |
28 | return jsonify(output)
29 |
30 | if __name__ == '__main__':
31 | app.run(port="8080")
32 |
33 |
--------------------------------------------------------------------------------
/templates/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | OpenCSearch
7 |
8 |
100 |
101 |
102 |
105 |
106 |
110 |
111 |
112 |
113 |
138 |
139 |
140 |
141 |
--------------------------------------------------------------------------------
/templates/tree.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | Fetch Tree Image
7 |
21 |
22 |
23 | Connection Tree WEB
24 |
31 |
32 |
33 |
62 |
63 |
64 |
65 |
--------------------------------------------------------------------------------
/tree_website.py:
--------------------------------------------------------------------------------
1 | from flask import Flask, send_file, request, render_template
2 | import networkx as nx
3 | import matplotlib.pyplot as plt
4 | import io
5 | from PIL import Image
6 | import json
7 | import os
8 | import pickle
9 |
10 | app = Flask(__name__)
11 |
12 |
13 | def get_data(root, layers):
14 | os.system(f"python3 connection_tree_v2.py {root} {layers}")
15 | f = open(f".{root}_{layers}".replace("/", "o"), "rb")
16 | data = pickle.loads(f.read())
17 | f.close()
18 |
19 | return data
20 |
21 |
22 | def add_nodes_edges(graph, node, parent=None):
23 | """
24 | Recursively add nodes and edges to the graph.
25 | """
26 | if parent:
27 | graph.add_edge(parent, node['name'])
28 |
29 | if 'children' in node:
30 | for child in node['children']:
31 | add_nodes_edges(graph, child, node['name'])
32 |
33 | def plot_tree(data, output_file='tree.png'):
34 | """
35 | Plot a tree graph from the data and save it as an image.
36 | """
37 | G = nx.DiGraph()
38 | add_nodes_edges(G, data)
39 |
40 | pos = nx.spring_layout(G, seed=42, k=0.5) # Position nodes using spring layout
41 | plt.figure(figsize=(14, 10))
42 | nx.draw(G, pos, with_labels=True, arrows=True, node_size=2000, node_color='lightblue', font_size=6, font_weight='bold', edge_color='gray')
43 | plt.title('Tree Graph')
44 | plt.savefig(output_file)
45 |
46 |
47 |
48 | @app.route('/')
49 | def index():
50 | return render_template('tree.html')
51 |
52 |
53 | @app.route('/generate_tree')
54 | def generate_tree():
55 | root = request.headers.get('root')
56 | layers = int(request.headers.get('layers'))
57 |
58 | data = get_data(root, layers)
59 | data = data[root]
60 | plot_tree(data)
61 |
62 | return send_file("tree.png", mimetype='image/png')
63 |
64 | if __name__ == '__main__':
65 | app.run()
66 |
--------------------------------------------------------------------------------