├── LICENSE ├── README.md └── src ├── client.py ├── config ├── helper └── rule-tester.py ├── launch-clients.sh ├── modules ├── __init__.py ├── configuration.py ├── logger.py ├── protocol.py ├── rule.py ├── scrapping.py └── storage.py ├── server.py └── url.txt /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014-2015 Diastro - Zeek 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Zeek 2 | ==== 3 | 4 | Python distributed web crawling / web scraper 5 | 6 | This the first version of my distributed web crawler. It isn't perfect yet but I'm sharing it because the end result is far better then what I expected and it can easily be adapted to your needs. Feel free to improve/fork/report issues. 7 | 8 | I'm planning to continue working on it and probably release an updated version in the future but i'm not sure when yet. 9 | 10 | ### Use cases 11 | * Visit a **predetermined** list of URLs and scrape specific data on these pages 12 | * Visit or **dynamically visit** web pages on a periodic bases and **scrape** data on these pages 13 | * Dynamically visit pages on a **given domain** and scrape data on these pages 14 | * Dynamically visit pages **all over the internet** and scrape data on these pages 15 | 16 | All the scraped data can be stored in an output file (ie: `.csv`, `.txt`) or in a database 17 | 18 | *David Albertson* 19 | 20 | ## Execution 21 | 1) Download the source and install the required third party library 22 | ~~~ sh 23 | $ git clone https://github.com/Diastro/Zeek.git 24 | $ easy_install beautifulsoup4 25 | $ easy_install lxml 26 | ~~~ 27 | 28 | 2) Update the configuration files : 29 | * change the server `listeningAddress / listeningPort` to the right info; 30 | * change the client `hostAddr / hostPort` to the right info. 31 | 32 | 3) Update the /modules/rule.py and modules/storage.py : 33 | * See the documentation for more information on how to adapt these files. 
34 | 35 | 4) Launch the server on the **master** node 36 | 37 | ~~~ sh 38 | $ python server.py 39 | ~~~ 40 | 41 | 5) Launch the client on the **working** nodes 42 | 43 | ~~~ sh 44 | $ python client.py 45 | ~~~ 46 | 47 | #### Third-party libraries 48 | - [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) 49 | - [lxml](http://lxml.de/) 50 | 51 | ## Configuration fields 52 | **[server]**
53 | *listeningAddr* : Address on which the server listens for incoming connections from clients (ex : 127.0.0.1)
54 | *listeningPort* : Port on which the server listens for incoming connections from clients (ex : 5050)
55 | 56 | **[client]**
57 | *hostAddr* : Address of the server to connect to (ex : 127.0.0.1)
58 | *hostPort* : Port of the server to connect to (ex : 5050)
59 | 60 | **[common]**
61 | *verbose* : Enables or disables verbose output in the console (ex : True, False)
62 | *logPath* : Path where the output log file of each process is saved (ex : logs/)
63 | *userAgent* : Usually the name of your crawler or bot (ex : MyBot 1.0)
64 | *crawling* : Type of crawling (ex : dynamic, static)
65 | *robotParser* : Obey or ignore the robots.txt rules of the domains being visited (ex : True, False)
66 | *crawlDelay* : Delay, in seconds, between two subsequent requests (ex : 0, 3, 5)
67 | 68 | **[dynamic]** (Applies only if the crawling type is set to dynamic)
69 | *domainRestricted* : If set to True, the crawler only visits URLs that belong to the same domain as the root URL (ex : True, False)
70 | *requestLimit* : Stops the crawler once the limit is reached (i.e. after visiting that many pages); 0 means no limit (ex : 0, 2, 100, ...)
71 | *rootUrls* : Comma-separated list of URLs to start crawling from (ex : www.businessinsider.com)
72 | 73 | **[static]** (Applies only if the crawling type is set to static)
74 | *rootUrlsPath* : Path to the file containing the list of URLs to visit (ex : url.txt)
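
For reference, here is the sample `src/config` shipped with the repository; it exercises every field described above (the values are only illustrative defaults and should be adapted to your own setup):

~~~
[server]
listeningAddr = 127.0.0.1
listeningPort = 5050

[client]
hostAddr = 127.0.0.1
hostPort = 5050

[common]
verbose = False
logPath = logs/
userAgent = Zeek-1.0a
crawling = dynamic
robotParser = true
crawlDelay = 0

[dynamic]
domainRestricted = true
requestLimit = 0
rootUrls = http://www.businessinsider.com/

[static]
rootUrlsPath = url.txt
~~~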
75 | 76 | ## How it works 77 | ***Coming soon*** 78 | 79 | ### Rule.py Storage.py 80 | ***Coming soon*** 81 | 82 | ### Testing your rule.py 83 | ***Coming soon*** (in the meantime, `src/helper/rule-tester.py` lets you try your parsing logic against a live URL) 84 | 85 | ## Recommended topologies 86 | Zeek can be launched in 2 different topologies, depending on which resource limits you. Crawling a large number of web pages requires both a lot of bandwidth (to execute multiple parallel requests) and computing power (CPU). Depending on which of the two is your bottleneck, use the appropriate topology for the fastest crawl time. 87 | Keep in mind that if time isn't a constraint for you, a 1-1 approach is always the safest and least expensive! 88 | * Basic topology (recommended) : see the **1-1 topology** 89 | * Best performance topology : see the **1-n topology** 90 | 91 | No matter which topology you are using, you can always use `launch-clients.sh` to launch multiple instances of client.py on the same computer (an example invocation is shown in the 1-1 Topology section below). 92 | 93 | ### 1-1 Topology 94 | The 1-1 topology is the easiest to set up. It only requires one computer, so anyone can deploy Zeek this way. With this topology, you first start server.py (using 127.0.0.1 as the listeningAddr) and then connect as many client.py processes to it as you want (using 127.0.0.1 as the hostAddr); everything runs on the same machine. Be aware that, depending on the specs of your computer, you will eventually be limited by the number of threads the server.py process can create: server.py launches 3 threads per connected client, so if your system allows roughly 300 threads per process, the maximum number of client.py instances you will be able to launch is approximately 100. If you launch that many clients, you might also end up being limited by your bandwidth.
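
A minimal command sketch of a 1-1 deployment, assuming the sample config above (server and clients both pointing at 127.0.0.1:5050) and that the commands are run from the `src/` directory:

~~~ sh
# master node and working nodes all on the same machine
$ python server.py &            # start the master node in the background
$ bash launch-clients.sh -n 10  # start 10 client.py working nodes locally
~~~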
95 | [1-1 Topology schema](http://i.imgur.com/7NJGodN.jpg) 96 | 97 | ### 1-n Topology 98 | This topology is the one to use if you want the best performance, but it requires more than one computer at your disposal. The only limitation with this topology is the number of clients that can connect to the server.py process. As explained above, server.py launches 3 threads per connected client, so if your system allows roughly 300 threads per process, the maximum number of client.py instances you will be able to launch is approximately 100. In this case, however, since each computer uses a separate connection, bandwidth shouldn't be a problem.
99 | [1-n Topology schema](http://i.imgur.com/lXCEAk6.jpg) 100 | 101 | ## Stats - Benchmark 102 | ***Coming soon*** 103 | 104 | ## Warning 105 | Using a distributed crawler/scraper can make your life easier, but it also comes with great responsibility. When you use a crawler to make requests to a website, you generate connections to that website, and if the targeted website isn't configured properly, the consequences can be disastrous. You're probably asking yourself "What exactly does he mean?". What I mean is that by using 10 computers, each running 30 client.py instances, you could (in a perfect world) generate 300 parallel requests. If these 300 parallel requests target the same website/domain, you will be downloading a lot of data very quickly, and if the targeted domain isn't prepared for it, you could potentially shut it down.
106 | During the development of Zeek I happened to experience something similar while doing approximatly 250 parallel request to a pretty well known website. The sysadmins of this website ended up contacting the sysadmin where I have my own server hosted being worried that something strange was happenning (they were probably thinking of an attack). During this period of time I ended up downloading 7Gb of data in about 30 minutes. This alone trigged some internal alert on their side. That being now said, I'm not responsible for your usage of Zeek. Simply try to be careful and respectful of others online! 107 | 108 | ## References 109 | - [Wikipedia - WebCrawler](http://en.wikipedia.org/wiki/Web_crawler) 110 | - [Wikipedia - Distributed crawling](http://en.wikipedia.org/wiki/Distributed_web_crawling) 111 | - [How to Parse data using BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html) 112 | - [Understanding Robots.txt](http://www.robotstxt.org/faq.html) 113 | -------------------------------------------------------------------------------- /src/client.py: -------------------------------------------------------------------------------- 1 | import ConfigParser 2 | import datetime 3 | import Queue 4 | import logging 5 | import os 6 | import pickle 7 | import socket 8 | import sys 9 | import time 10 | import thread 11 | import traceback 12 | import modules.logger as logger 13 | import modules.protocol as protocol 14 | import modules.scrapping as scrapping 15 | 16 | sys.setrecursionlimit(10000) 17 | 18 | buffSize = 524288 19 | delimiter = '\n\n12345ZEEK6789\n' 20 | 21 | 22 | class WorkingNode(): 23 | def __init__(self): 24 | # socket 25 | self.host = None 26 | self.port = None 27 | self.data = "" 28 | 29 | # general 30 | self.isActive = True 31 | self.masterNodeFormattedAddr = None 32 | self.crawlingType = None 33 | 34 | # data container 35 | self.outputQueue = Queue.Queue(0) 36 | self.infoQueue = Queue.Queue(0) 37 | self.urlToVisit = Queue.Queue(0) 38 | 39 | # object 40 | self.scrapper = None 41 | self.config = None 42 | 43 | 44 | def connect(self, host, port): 45 | """Sets up the connection to the server (max 6 attemps)""" 46 | self.host = host 47 | self.port = port 48 | self.masterNodeFormattedAddr = "[" + str(self.host) + ":" + str(self.port) + "]" 49 | 50 | logger.log(logging.DEBUG, "Socket initialization") 51 | self.s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 52 | for connectionAttempt in range(6, 0, -1): 53 | if connectionAttempt == 1: 54 | logger.log(logging.CRITICAL, "Unable to connect to host " + self.masterNodeFormattedAddr) 55 | sys.exit() 56 | try: 57 | logger.log(logging.DEBUG, "Connecting to host... 
" + self.masterNodeFormattedAddr) 58 | self.s.connect((self.host, self.port)) 59 | logger.log(logging.INFO, "Connected to " + self.masterNodeFormattedAddr) 60 | break 61 | except socket.error: 62 | logger.log(logging.INFO, "Connection failed to " + self.masterNodeFormattedAddr) 63 | logger.log(logging.INFO, "Retrying in 3 seconds.") 64 | time.sleep(3) 65 | 66 | def readConfig(self): 67 | """Reads the configuration from the server""" 68 | logger.log(logging.DEBUG, "Waiting for configuration from the server.") 69 | if self.isActive: 70 | try: 71 | deserializedPacket = self.readSocket() 72 | logger.log(logging.DEBUG, "Configuration received.") 73 | 74 | if deserializedPacket.type == protocol.CONFIG: 75 | self.crawlingType = deserializedPacket.payload.crawlingType 76 | self.config = deserializedPacket.payload.config 77 | 78 | # dynamic module reload 79 | basePath = os.path.dirname(sys.argv[0]) 80 | if basePath: 81 | basePath = basePath + "/" 82 | 83 | # path building 84 | rulePath = basePath + "modules/rule.py" 85 | scrappingPath = basePath + "modules/scrapping.py" 86 | 87 | # re-writing source .py 88 | logger.log(logging.INFO, "Importing rule.py from server") 89 | ruleFd = open(rulePath, 'w') 90 | ruleFd.write(self.config.rule_py) 91 | ruleFd.close() 92 | 93 | logger.log(logging.INFO, "Importing scrapping.py from server") 94 | scrappingFd = open(scrappingPath, 'w') 95 | scrappingFd.write(self.config.scrapping_py) 96 | scrappingFd.close() 97 | 98 | # compilation test 99 | try: 100 | code=open(rulePath, 'rU').read() 101 | compile(code, "rule_test", "exec") 102 | except: 103 | exc_type, exc_value, exc_traceback = sys.exc_info() 104 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 105 | logger.log(logging.CRITICAL, message) 106 | logger.log(logging.ERROR, "Unable to compile rule.py (is the syntax right?)") 107 | sys.exit(0) 108 | 109 | try: 110 | code=open(scrappingPath, 'rb').read(os.path.getsize(scrappingPath)) 111 | compile(code, "scrapping_test", "exec") 112 | except: 113 | exc_type, exc_value, exc_traceback = sys.exc_info() 114 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 115 | logger.log(logging.CRITICAL, message) 116 | logger.log(logging.ERROR, "Unable to compile scrapping.py (is the syntax right?)") 117 | sys.exit(0) 118 | 119 | # dynamic reload of modules 120 | # TODO reloading of rule.py should eventually come here 121 | logger.log(logging.INFO, "Reloading modules imported for server") 122 | reload(sys.modules["modules.scrapping"]) 123 | 124 | 125 | payload = protocol.InfoPayload(protocol.InfoPayload.CLIENT_ACK) 126 | packet = protocol.Packet(protocol.INFO, payload) 127 | self.writeSocket(packet) 128 | 129 | logger.log(logging.DEBUG, "Sending ACK for configuration.") 130 | else: 131 | raise Exception("Unable to parse configuration.") 132 | except: 133 | exc_type, exc_value, exc_traceback = sys.exc_info() 134 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 135 | logger.log(logging.CRITICAL, message) 136 | self.isActive = False 137 | 138 | def run(self): 139 | """Launches main threads""" 140 | logger.log(logging.INFO, "\n\nStarting Crawling/Scrapping sequence...") 141 | if self.isActive: 142 | thread.start_new_thread(self.outputThread, ()) 143 | thread.start_new_thread(self.inputThread, ()) 144 | thread.start_new_thread(self.interpretingThread, ()) 145 | thread.start_new_thread(self.crawlingThread, ()) 146 | 147 | def inputThread(self): 148 | """Listens for inputs from the 
server""" 149 | logger.log(logging.DEBUG, "InputThread started") 150 | 151 | while self.isActive: 152 | try: 153 | deserializedPacket = self.readSocket() 154 | self.dispatcher(deserializedPacket) 155 | except EOFError: 156 | self.isActive = False 157 | except: 158 | exc_type, exc_value, exc_traceback = sys.exc_info() 159 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 160 | logger.log(logging.CRITICAL, message) 161 | self.isActive = False 162 | 163 | def outputThread(self): 164 | """Checks if there are messages to send to the server and sends them""" 165 | logger.log(logging.DEBUG, "OutputThread started") 166 | 167 | while self.isActive: 168 | try: 169 | obj = self.outputQueue.get(True) #fix with helper method to prevent block 170 | self.writeSocket(obj) 171 | logger.log(logging.DEBUG, "Sending obj of type " + str(obj.type)) 172 | except: 173 | exc_type, exc_value, exc_traceback = sys.exc_info() 174 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 175 | logger.log(logging.CRITICAL, message) 176 | self.isActive = False 177 | 178 | def interpretingThread(self): 179 | """Interprets message from the server other than type URL. (ie: INFO)""" 180 | logger.log(logging.DEBUG, "InterpretingThread started") 181 | 182 | while self.isActive: 183 | try: 184 | time.sleep(0.01) #temp - For testing 185 | packets = protocol.deQueue([self.infoQueue]) 186 | 187 | if not packets: 188 | continue 189 | 190 | for packet in packets: 191 | if packet.type == protocol.INFO: 192 | logger.log(logging.INFO, "Interpreting INFO packet : " + str(packet.payload.urlList)) 193 | except: 194 | exc_type, exc_value, exc_traceback = sys.exc_info() 195 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 196 | logger.log(logging.CRITICAL, message) 197 | self.isActive = False 198 | 199 | def crawlingThread(self): 200 | """Takes URL from the urlToVisit queue and visits them""" 201 | logger.log(logging.DEBUG, "CrawlingThread started") 202 | 203 | self.scrapper = scrapping.Scrapper(self.config.userAgent, self.config.robotParserEnabled, self.config.domainRestricted, self.config.crawling) 204 | 205 | while self.isActive: 206 | try: 207 | urlList = protocol.deQueue([self.urlToVisit]) 208 | 209 | if not urlList: 210 | time.sleep(0.2) #temp - For testing 211 | continue 212 | 213 | for url in urlList: 214 | session = self.scrapper.visit(url) 215 | logger.log(logging.DEBUG, "Session \n" + str(session.url) + 216 | "\nCode : " + str(session.returnCode) + 217 | "\nRequest time : " + str(session.requestTime) + 218 | "\nBs time : " + str(session.bsParsingTime)) 219 | 220 | if not session.failed: 221 | if self.crawlingType == protocol.ConfigurationPayload.DYNAMIC_CRAWLING: 222 | payload = protocol.URLPayload(session.scrappedURLs, protocol.URLPayload.SCRAPPED_URL) 223 | packet = protocol.Packet(protocol.URL, payload) 224 | self.outputQueue.put(packet) 225 | 226 | payload = protocol.URLPayload([url], protocol.URLPayload.VISITED, session=session) 227 | packet = protocol.Packet(protocol.URL, payload) 228 | self.outputQueue.put(packet) 229 | else: 230 | logger.log(logging.INFO, "Skipping URL : " + url) 231 | payload = protocol.URLPayload([url], protocol.URLPayload.SKIPPED, session) 232 | packet = protocol.Packet(protocol.URL, payload) 233 | self.outputQueue.put(packet) 234 | continue 235 | 236 | except: 237 | exc_type, exc_value, exc_traceback = sys.exc_info() 238 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 239 
| logger.log(logging.CRITICAL, message) 240 | self.isActive = False 241 | 242 | def dispatcher(self, packet): 243 | """Dispatches packets to the right packet queue""" 244 | if packet is None: 245 | return 246 | elif packet.type == protocol.INFO: 247 | logger.log(logging.DEBUG, "Dispatching INFO packet") 248 | self.infoQueue.put(packet) 249 | elif packet.type == protocol.URL: 250 | logger.log(logging.DEBUG, "Dispatching url packet : " + str(packet.payload.urlList[0])) 251 | for site in packet.payload.urlList: 252 | self.urlToVisit.put(site) 253 | else: 254 | logger.log(logging.CRITICAL, "Unrecognized packet type : " + str(packet.type) + ". This packet was dropped") 255 | return 256 | 257 | logger.log(logging.DEBUG, "Dispatched packet of type: " + str(packet.type)) 258 | 259 | def writeSocket(self, obj): 260 | try: 261 | serializedObj = pickle.dumps(obj) 262 | logger.log(logging.DEBUG, "Sending " + str(len(serializedObj + delimiter)) + " bytes to server") 263 | self.s.sendall(serializedObj + delimiter) 264 | except: 265 | exc_type, exc_value, exc_traceback = sys.exc_info() 266 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 267 | logger.log(logging.CRITICAL, message) 268 | raise Exception("Unable to write to socket (lost connection to server)") 269 | 270 | def readSocket(self, timeOut=None): 271 | self.s.settimeout(timeOut) 272 | data = self.data 273 | 274 | if "\n\n12345ZEEK6789\n" in data: 275 | data = data.split("\n\n12345ZEEK6789\n") 276 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 277 | return pickle.loads(data[0]) 278 | 279 | while self.isActive: 280 | buffer = self.s.recv(buffSize) 281 | data = data + buffer 282 | 283 | if not buffer: 284 | logger.log(logging.INFO, "\nLost connection to server " + self.masterNodeFormattedAddr) 285 | self.isActive = False 286 | 287 | if "\n\n12345ZEEK6789\n" in data: 288 | data = data.split("\n\n12345ZEEK6789\n") 289 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 290 | break 291 | 292 | if self.isActive == False: 293 | return 294 | 295 | logger.log(logging.DEBUG, "Receiving " + str(len(data[0])) + " bytes from server") 296 | 297 | return pickle.loads(data[0]) 298 | 299 | def disconnect(self): 300 | """Disconnects from the server""" 301 | self.isActive = False 302 | self.s.close() 303 | 304 | 305 | def main(): 306 | path = os.path.dirname(sys.argv[0]) 307 | if path: 308 | path = path + "/" 309 | 310 | #config 311 | config = ConfigParser.RawConfigParser(allow_no_value=True) 312 | config.read(path + 'config') 313 | host = config.get('client', 'hostAddr') 314 | port = config.getint('client', 'hostPort') 315 | logPath = config.get('common', 'logPath') 316 | verbose = config.get('common', 'verbose') 317 | if verbose == "True" or verbose == "true": 318 | verbose = True 319 | else: 320 | verbose = False 321 | 322 | #setup 323 | logger.init(logPath, "client-" + str(datetime.datetime.now())) 324 | logger.debugFlag = verbose 325 | 326 | node = WorkingNode() 327 | node.connect(host, port) 328 | node.readConfig() 329 | node.run() 330 | 331 | while node.isActive: 332 | time.sleep(0.5) 333 | 334 | node.disconnect() 335 | 336 | if __name__ == "__main__": 337 | main() -------------------------------------------------------------------------------- /src/config: -------------------------------------------------------------------------------- 1 | [server] 2 | listeningAddr = 127.0.0.1 3 | listeningPort = 5050 4 | 5 | [client] 6 | hostAddr = 127.0.0.1 7 | hostPort = 5050 8 | 9 | [common] 10 | verbose = False 11 | logPath = 
logs/ 12 | userAgent = Zeek-1.0a 13 | crawling = dynamic 14 | robotParser = true 15 | crawlDelay = 0 16 | 17 | [dynamic] 18 | domainRestricted = true 19 | requestLimit = 0 20 | rootUrls = http://www.businessinsider.com/ 21 | 22 | [static] 23 | rootUrlsPath = url.txt 24 | -------------------------------------------------------------------------------- /src/helper/rule-tester.py: -------------------------------------------------------------------------------- 1 | import urllib2, cookielib 2 | from bs4 import BeautifulSoup 3 | 4 | # url to test the parsing 5 | urls = ["http://www.nytimes.com/2013/11/19/us/politics/republicans-block-another-obama-nominee-for-key-judgeship.html"] 6 | 7 | for u in urls: 8 | # cookie 9 | cj = cookielib.CookieJar() 10 | opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 11 | 12 | # builds request 13 | request = urllib2.Request(u) 14 | request.add_header('User-agent', "test") 15 | response = opener.open(request) 16 | 17 | # parsing response 18 | bs = BeautifulSoup(response) 19 | 20 | # - Test your parsing here - 21 | 22 | # example : 23 | # 24 | # headline = bs.find("h1", {"itemprop": "headline"}) 25 | # if headline is not None: 26 | # title = headline.find("nyt_headline") 27 | # if headline is not None: 28 | # print title.get_text().encode('ascii', 'ignore') 29 | -------------------------------------------------------------------------------- /src/launch-clients.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # launch-clients.sh 3 | 4 | COUNT=-1 5 | 6 | function help { 7 | echo "usage: -h help" 8 | echo " -n number of instances of client to launch locally" 9 | } 10 | 11 | while getopts n:h opt 12 | do 13 | case $opt in 14 | n) COUNT="${OPTARG}";; 15 | h) HELP="-1";; 16 | *) exit 1;; 17 | esac 18 | done 19 | 20 | if [ "$HELP" == "-1" ] || [ "$COUNT" == "-1" ] 21 | then 22 | help 23 | exit 1 24 | fi 25 | 26 | for i in $(seq 1 $COUNT) 27 | do 28 | python client.py > /dev/null & 29 | echo "Client $i started with PID : $!" 
30 | done 31 | -------------------------------------------------------------------------------- /src/modules/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'David' 2 | -------------------------------------------------------------------------------- /src/modules/configuration.py: -------------------------------------------------------------------------------- 1 | import ConfigParser 2 | import inspect 3 | import os 4 | import sys 5 | 6 | 7 | class Configuration(): 8 | def __init__(self): 9 | self.host = "" 10 | self.port = "" 11 | self.logPath = "" 12 | self.userAgent = "" 13 | self.verbose = False 14 | 15 | self.crawling = "" 16 | self.robotParserEnabled = False 17 | self.domainRestricted = False 18 | self.requestLimit = 0 19 | self.crawlDelay = 0.0 20 | self.rootUrls = [] 21 | 22 | self.rule_py = "" 23 | self.scrapping_py = "" 24 | 25 | def readStaticUrl(path): 26 | urls = [] 27 | file = open(path, 'r') 28 | for url in file: 29 | url = "".join(url.split()).replace(",","") 30 | urls.append(url) 31 | return urls 32 | 33 | def readFile(path): 34 | content = "" 35 | file = open(path, 'r') 36 | for line in file: 37 | content = content + line 38 | return content 39 | 40 | def configParser(): 41 | path = os.path.dirname(sys.argv[0]) 42 | if path: 43 | path = path + "/" 44 | 45 | config = Configuration() 46 | configParser = ConfigParser.RawConfigParser(allow_no_value=True) 47 | 48 | configParser.read(path + 'config') 49 | config.host = configParser.get('server', 'listeningAddr') 50 | config.port = configParser.getint('server', 'listeningPort') 51 | config.logPath = configParser.get('common', 'logPath') 52 | verbose = configParser.get('common', 'verbose') 53 | if verbose == "True" or verbose == "true": 54 | config.verbose = True 55 | else: 56 | config.verbose = False 57 | 58 | config.userAgent = configParser.get('common', 'userAgent') 59 | config.crawlDelay = configParser.getfloat('common', 'crawlDelay') 60 | robotParserEnabled = configParser.get('common', 'robotParser') 61 | if robotParserEnabled == "True" or robotParserEnabled == "true": 62 | config.robotParserEnabled = True 63 | else: 64 | config.robotParserEnabled = False 65 | 66 | config.crawling = configParser.get('common', 'crawling') 67 | if config.crawling == 'dynamic': 68 | domainRestricted = configParser.get('dynamic', 'domainRestricted') 69 | config.requestLimit = configParser.getint('dynamic', 'requestLimit') 70 | rootUrls = configParser.get('dynamic', 'rootUrls') 71 | rootUrls = "".join(rootUrls.split()) 72 | config.rootUrls = rootUrls.split(',') 73 | 74 | if domainRestricted == "True" or domainRestricted == "true": 75 | config.domainRestricted = True 76 | else: 77 | config.domainRestricted = False 78 | else: 79 | config.rootUrls = readStaticUrl(configParser.get('static', 'rootUrlsPath')) 80 | 81 | # dynamic module reload 82 | config.rule_py = readFile(path + "modules/rule.py") 83 | config.scrapping_py = readFile(path + "modules/scrapping.py") 84 | 85 | return config -------------------------------------------------------------------------------- /src/modules/logger.py: -------------------------------------------------------------------------------- 1 | import inspect 2 | import logging 3 | import os 4 | import sys 5 | 6 | debugFlag = True 7 | GREEN = '\033[92m' 8 | PINK = '\033[95m' 9 | BLUE = '\033[94m' 10 | RED = '\033[91m' 11 | YELLOW = '\033[93m' 12 | NOCOLOR = '\033[0m' 13 | 14 | color = [GREEN, PINK, BLUE, RED, YELLOW, NOCOLOR] 15 | 16 | def init(path, logName): 
17 | basePath = os.path.dirname(sys.argv[0]) 18 | if basePath: 19 | basePath = basePath + "/" 20 | path = basePath + path 21 | if not os.path.exists(path): 22 | os.makedirs(path) 23 | logging.basicConfig(filename=path+logName, format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG) 24 | 25 | 26 | def debugFlag(flag): 27 | debugFlag = flag 28 | if debugFlag: 29 | logging.disable(logging.NOTSET) 30 | else: 31 | logging.disable(logging.DEBUG) 32 | 33 | 34 | def log(level, message): 35 | #message formatting 36 | if level == logging.DEBUG: 37 | func = inspect.currentframe().f_back.f_code 38 | fileName = ''.join(func.co_filename.split('/')[-1]) 39 | line = func.co_firstlineno 40 | message = "[" + str(func.co_name) + " - " + str(fileName) + ", " + str(line) + "] " + message 41 | 42 | if level == logging.CRITICAL: 43 | message = "\n\n ************************\n" + message 44 | 45 | #printing to logs 46 | if debugFlag and level == logging.DEBUG: 47 | print(message) 48 | elif level is not logging.DEBUG: 49 | print(message) 50 | 51 | for c in color: 52 | message = message.replace(c, "") 53 | 54 | logging.log(level, message) 55 | 56 | 57 | def formatBrackets(message): 58 | return "[" + str(message) + "]" 59 | 60 | 61 | def printAsciiLogo(): 62 | print("") 63 | print(" .----------------. .----------------. .----------------. .----------------. ") 64 | print("| .--------------. || .--------------. || .--------------. || .--------------. |") 65 | print("| | ________ | || | _________ | || | _________ | || | ___ ____ | |") 66 | print("| | | __ _| | || | |_ ___ | | || | |_ ___ | | || | |_ ||_ _| | |") 67 | print("| | |_/ / / | || | | |_ \_| | || | | |_ \_| | || | | |_/ / | |") 68 | print("| | .'.' _ | || | | _| _ | || | | _| _ | || | | __'. | |") 69 | print("| | _/ /__/ | | || | _| |___/ | | || | _| |___/ | | || | _| | \ \_ | |") 70 | print("| | |________| | || | |_________| | || | |_________| | || | |____||____| | |") 71 | print("| | | || | | || | | || | | |") 72 | print("| '--------------' || '--------------' || '--------------' || '--------------' |") 73 | print(" '----------------' '----------------' '----------------' '----------------' ") 74 | print("") 75 | print(" +++ ") 76 | print(" (o o) ") 77 | print(" -ooO--(_)--Ooo- ") 78 | print(" v1.0a ") 79 | print(" David Albertson ") 80 | print("") 81 | 82 | 83 | -------------------------------------------------------------------------------- /src/modules/protocol.py: -------------------------------------------------------------------------------- 1 | import Queue 2 | 3 | CONFIG = 'CONFIG' 4 | INFO = 'INFO' 5 | URL = 'URL' 6 | 7 | 8 | class Packet: 9 | def __init__(self, type, payload): 10 | self.type = type 11 | self.payload = payload 12 | 13 | def setPayload(self, payload): 14 | self.payload = payload 15 | 16 | 17 | class ConfigurationPayload(): 18 | STATIC_CRAWLING = 'STATIC' 19 | DYNAMIC_CRAWLING = 'DYNAMIC' 20 | 21 | def __init__(self, crawlingType, config): 22 | self.crawlingType = crawlingType 23 | self.config = config 24 | 25 | 26 | class InfoPayload(): 27 | CLIENT_ACK = 0 28 | SERVER_ACK = 1 29 | 30 | def __init__(self, info): 31 | self.info = info 32 | 33 | 34 | class URLPayload(): 35 | VISITED = 'VISITED' 36 | SKIPPED = 'SKIPPED' 37 | TOVISIT = 'TOVISIT' 38 | SCRAPPED_URL = 'SCRAPPED' 39 | 40 | def __init__(self, urlList, type, session=None, data=None): 41 | #self.url = url TODO : add url param (to know where the data is coming from) 42 | self.urlList = [] 43 | self.type = type 44 | self.data = data 45 | self.session = session 46 | 
47 | for url in urlList: 48 | self.urlList.append(url) 49 | 50 | 51 | def deQueue(queueArray): 52 | packetArray = [] 53 | for queue in queueArray: 54 | try: 55 | packet = queue.get(block=False) 56 | packetArray.append(packet) 57 | except Queue.Empty: 58 | pass 59 | return packetArray -------------------------------------------------------------------------------- /src/modules/rule.py: -------------------------------------------------------------------------------- 1 | import urlparse 2 | 3 | class Container: 4 | def __init__(self): 5 | #data = dict() 6 | self.hasData = False 7 | 8 | self.title = None 9 | self.author = None 10 | 11 | def scrape(url, bs): 12 | # for testing - this is scrapping article titles from www.nytimes.com 13 | container = Container() 14 | domain = urlparse.urlsplit(url)[1].split(':')[0] 15 | 16 | # extracting data from NYTimes 17 | if domain == "www.nytimes.com": 18 | headline = bs.find("h1", {"itemprop": "headline"}) 19 | if headline is not None: 20 | title = headline.find("nyt_headline") 21 | if title is not None: 22 | container.title = title.get_text().encode('ascii', 'ignore') 23 | 24 | byline = bs.find("h6", {"class": "byline"}) 25 | if byline is not None: 26 | author = byline.find("span", {"itemprop": "name"}) 27 | if author is not None: 28 | container.author = author.get_text().encode('ascii', 'ignore') 29 | 30 | return container 31 | 32 | return Container() -------------------------------------------------------------------------------- /src/modules/scrapping.py: -------------------------------------------------------------------------------- 1 | import cookielib 2 | import urllib2 3 | import logging 4 | import logger 5 | import time 6 | import robotparser 7 | import rule 8 | import socket 9 | import sys 10 | import traceback 11 | import urlparse 12 | from bs4 import BeautifulSoup 13 | 14 | robotDict = {} 15 | 16 | class Session: 17 | def __init__(self, url, failed, code, info, requestTime, bsParsingTime, scrappedURLs, dataContainer=None, errorMsg=None): 18 | self.url = url 19 | self.failed = failed 20 | self.returnCode = code 21 | self.returnInfo = info 22 | self.requestTime = requestTime 23 | self.bsParsingTime = bsParsingTime 24 | 25 | self.scrappedURLs = scrappedURLs 26 | self.dataContainer = dataContainer 27 | 28 | # add error handling 29 | # err.msg 30 | 31 | #url error 32 | self.errorMsg = errorMsg 33 | 34 | class Scrapper: 35 | def __init__(self, userAgent, robotParserEnabled, domainRestricted, crawlingType): 36 | self.userAgent = userAgent 37 | self.robotParserEnabled = robotParserEnabled 38 | self.domainRestricted = domainRestricted 39 | self.crawlingType = crawlingType 40 | 41 | # eventually move this to client.py 42 | reload(rule) 43 | 44 | def visit(self, url): 45 | """Visits a given URL and return all the data""" 46 | logger.log(logging.INFO, "Scrapping : " + str(url)) 47 | 48 | # in the case the rootUrl wasnt formatted the right way 49 | if (url.startswith("http://") or url.startswith("https://")) is False: 50 | url = "http://" + url 51 | 52 | domain = urlparse.urlsplit(url)[1].split(':')[0] 53 | httpDomain = "http://" + domain 54 | 55 | try: 56 | # robot parser 57 | if self.robotParserEnabled: 58 | if httpDomain not in robotDict: 59 | parser = robotparser.RobotFileParser() 60 | parser.set_url(urlparse.urljoin(httpDomain, 'robots.txt')) 61 | parser.read() 62 | robotDict[httpDomain] = parser 63 | parser = robotDict[httpDomain] 64 | 65 | isParsable = parser.can_fetch(self.userAgent, url) 66 | if not isParsable: 67 | raise Exception("RobotParser") 
68 | 69 | # request 70 | start_time = time.time() 71 | cj = cookielib.CookieJar() 72 | opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 73 | request = urllib2.Request(url) 74 | request.add_header('User-agent', self.userAgent) 75 | data = opener.open(request, timeout=4) 76 | urlRequestTime = time.time() - start_time 77 | 78 | # parsing 79 | start_time = time.time() 80 | bs = BeautifulSoup(data) 81 | bsParsingTime = time.time() - start_time 82 | 83 | # url scrapping - dynamic crawling 84 | if self.crawlingType == "dynamic": 85 | illegal = [".mp4", ".mp3", ".flv", ".m4a", \ 86 | ".jpg", ".png", ".gif", \ 87 | ".xml", ".pdf", ".gz", ".zip", ".rss"] 88 | 89 | links = bs.find_all('a') 90 | links = [s.get('href') for s in links] 91 | links = [unicode(s) for s in links] 92 | if self.domainRestricted: 93 | links = [s for s in links if s.startswith("http://" + domain + "/") or s.startswith("https://" + domain )] 94 | for ext in illegal: 95 | links = [s for s in links if ext not in s] 96 | links = [s for s in links if s.startswith("http:") or s.startswith("https:")] 97 | foundUrl = set(links) 98 | 99 | # data scrapping 100 | dataContainer = rule.scrape(url, bs) 101 | if dataContainer is None: 102 | raise("None data container object") 103 | 104 | logger.log(logging.DEBUG, "Scrapping complete") 105 | return Session(url, False, data.getcode(), data.info(), urlRequestTime, bsParsingTime , foundUrl, dataContainer) 106 | 107 | except urllib2.HTTPError, err: 108 | logger.log(logging.INFO, "Scrapping failed - HTTPError " + str(err.msg) + " " + str(err.code)) 109 | return Session(url, True, err.code, "no data", 0, "", "", errorMsg=err.msg.replace('\n', "")) 110 | except socket.timeout: 111 | logger.log(logging.INFO, "Scrapping failed - Timeout") 112 | return Session(url, True, -1, "no data", 0, "", "", errorMsg="Request timeout") 113 | except Exception as e: 114 | if e.message == "RobotParser": 115 | logger.log(logging.INFO, "Scrapping failed - RobotParser") 116 | return Session(url, True, -2, "no data", 0, "", "", errorMsg="Request is not allowed as per Robot.txt") 117 | else: 118 | logger.log(logging.INFO, "Scrapping failed - Un-handled") 119 | exc_type, exc_value, exc_traceback = sys.exc_info() 120 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 121 | logger.log(logging.ERROR, message) 122 | return Session(url, True, -100, "no data", 0, "", "", errorMsg=traceback.format_exception(exc_type, exc_value, exc_traceback)[-1].replace('\n', "")) -------------------------------------------------------------------------------- /src/modules/storage.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import logger 3 | import atexit 4 | 5 | dataFd = None 6 | errorFd = None 7 | 8 | def writeToFile(session, container): 9 | global dataFd, errorFd 10 | try: 11 | if (not session.failed) and (session.dataContainer.title is not None): 12 | if dataFd is None: 13 | dataFd = open('output.txt', 'w') 14 | dataFd.write(container.author.replace(",","") + "," + container.title.replace(",","") + "," + session.url +"\n") 15 | elif session.failed: 16 | if errorFd is None: 17 | errorFd = open('error.txt', 'w') 18 | errorFd.write(str(session.returnCode).replace(",","") + "," + str(session.errorMsg).replace(",","") + "." 
+ session.url.replace(",","") + "\n") 19 | #else: 20 | # raise Exception("..") 21 | except: 22 | logger.log(logging.ERROR, "Unhandled exception in storage.py") 23 | 24 | def writeToDb(session, container): 25 | a = "Will come soon - Happy Halloween" 26 | 27 | def atexitfct(): 28 | """Cleanly closes file objects""" 29 | if dataFd is not None: 30 | dataFd.close() 31 | if errorFd is not None: 32 | errorFd.close() 33 | 34 | atexit.register(atexitfct) -------------------------------------------------------------------------------- /src/server.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from distutils.command.config import config 3 | import logging 4 | import pickle 5 | import Queue 6 | import signal 7 | import socket 8 | import sys 9 | import time 10 | import thread 11 | import traceback 12 | import uuid 13 | import modules.storage as storage 14 | import modules.logger as logger 15 | import modules.protocol as protocol 16 | import modules.configuration as configuration 17 | 18 | 19 | buffSize = 524288 20 | delimiter = '\n\n12345ZEEK6789\n' 21 | 22 | # (string:url) - Crawling algo 23 | urlVisited = dict() # url already visited 24 | urlPool = Queue.Queue(0) # url scrapped by working nodes 25 | urlToVisit = Queue.Queue(0) # url scrapped by working nodes 26 | 27 | # (string:url) - For stats 28 | scrappedURLlist = [] 29 | visitedURLlist = [] 30 | skippedURLlist = [] 31 | 32 | # (packet+payload) - To be sent to _any_ node 33 | outputQueue = Queue.Queue(200) 34 | 35 | # (session:session) - for storage 36 | sessionStorageQueue = Queue.Queue(0) 37 | 38 | # temporary for server.run() 39 | serverRunning = False 40 | skippedSessions = [] 41 | 42 | class Server: 43 | def __init__(self, host, port): 44 | self.host = host 45 | self.port = port 46 | self.s = None 47 | self.clientDict = {} 48 | self.isActive = True 49 | self.requestLimit = 0 50 | self.requestCount = 0 51 | 52 | def setup(self, configuration): 53 | """Basic setup operation (socket binding, listen, etc)""" 54 | logger.log(logging.DEBUG, "Socket initialization") 55 | self.s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 56 | self.s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) 57 | self.s.bind((self.host, self.port)) 58 | self.s.listen(5) 59 | logger.log(logging.INFO, "Listening on [" + str(self.host) + ":" + str(self.port) + "]") 60 | 61 | self.configurationPayload = configuration 62 | self.requestLimit = configuration.config.requestLimit 63 | 64 | def run(self): 65 | """Launches the urlDispatcher and mainRoutine threads""" 66 | logger.log(logging.DEBUG, "Starting beginCrawlingProcedure") 67 | thread.start_new_thread(self.urlDispatcher, ()) 68 | thread.start_new_thread(self.mainRoutine, ()) 69 | thread.start_new_thread(self.storageRoutine, ()) 70 | 71 | def listen(self): 72 | """Waits for new clients to connect and launches a new client thread accordingly""" 73 | print("- - - - - - - - - - - - - - -") 74 | logger.log(logging.INFO, "Waiting for working nodes to connect...") 75 | while self.isActive: 76 | try: 77 | client, address = self.s.accept() 78 | thread.start_new_thread(self.connectionHandler, (client, address)) 79 | except: 80 | exc_type, exc_value, exc_traceback = sys.exc_info() 81 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 82 | logger.log(logging.CRITICAL, message) 83 | self.isActive = False 84 | 85 | def connectionHandler(self, socket, address): 86 | """Creates a server-side client object and makes it listen for inputs""" 
87 | clientID = uuid.uuid4() 88 | client = SSClient(clientID, socket, address) 89 | self.clientDict[clientID] = client 90 | 91 | #temp testing, could take a parameter from config 92 | global serverRunning 93 | if len(self.clientDict) > 0 and serverRunning == False: 94 | self.run() 95 | serverRunning = True 96 | 97 | #for clients in self.clientDict: 98 | # logger.log(logging.DEBUG, "Working node connected : " + str(self.clientDict[clients].id)) 99 | 100 | try: 101 | client.sendConfig(self.configurationPayload) 102 | client.run() 103 | while client.isActive: 104 | time.sleep(0.3) 105 | except EOFError: 106 | pass 107 | except: 108 | client.isActive = False 109 | exc_type, exc_value, exc_traceback = sys.exc_info() 110 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 111 | logger.log(logging.ERROR, message) 112 | finally: 113 | client.disconnect() 114 | del self.clientDict[clientID] 115 | 116 | def urlDispatcher(self): 117 | """Reads from the urlPool, makes sure the url has not been visited and adds it to the urlToVisit Queue""" 118 | logger.log(logging.INFO, "Starting server urlDispatcher") 119 | 120 | while self.isActive: 121 | try: 122 | url = urlPool.get(True) 123 | if url not in urlVisited: 124 | urlVisited[url] = True 125 | #logic if static crawling will come here 126 | urlToVisit.put(url) 127 | scrappedURLlist.append(url) 128 | except: 129 | exc_type, exc_value, exc_traceback = sys.exc_info() 130 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 131 | logger.log(logging.ERROR, message) 132 | 133 | def mainRoutine(self): 134 | """To Come in da future. For now, no use""" 135 | logger.log(logging.INFO, "Starting server mainRoutine") 136 | 137 | for url in self.configurationPayload.config.rootUrls: 138 | payload = protocol.URLPayload([str(url)], protocol.URLPayload.TOVISIT) 139 | packet = protocol.Packet(protocol.URL, payload) 140 | urlVisited[url] = True 141 | outputQueue.put(packet) 142 | 143 | if self.configurationPayload.crawlingType == protocol.ConfigurationPayload.STATIC_CRAWLING and (self.configurationPayload.config.crawlDelay != 0): 144 | if self.configurationPayload.config.crawlDelay != 0: 145 | time.sleep(self.configurationPayload.config.crawlDelay) 146 | 147 | while self.isActive: 148 | try: 149 | if self.configurationPayload.crawlingType == protocol.ConfigurationPayload.DYNAMIC_CRAWLING: 150 | url = urlToVisit.get(True) 151 | payload = protocol.URLPayload([str(url)], protocol.URLPayload.TOVISIT) 152 | packet = protocol.Packet(protocol.URL, payload) 153 | outputQueue.put(packet) 154 | self.requestCount = self.requestCount + 1 155 | 156 | if self.configurationPayload.config.crawlDelay != 0: 157 | time.sleep(self.configurationPayload.config.crawlDelay) 158 | 159 | if self.requestLimit != 0 and len(visitedURLlist)+1 > self.requestLimit: 160 | break 161 | 162 | elif self.configurationPayload.crawlingType == protocol.ConfigurationPayload.STATIC_CRAWLING: 163 | if (len(skippedURLlist+visitedURLlist) == len(self.configurationPayload.config.rootUrls)): 164 | break 165 | else: 166 | time.sleep(0.3) 167 | except: 168 | exc_type, exc_value, exc_traceback = sys.exc_info() 169 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 170 | logger.log(logging.ERROR, message) 171 | 172 | logger.log(logging.INFO, "Scrapping complete. 
Terminating...") 173 | self.disconnectAllClient() 174 | self.isActive = False 175 | 176 | def storageRoutine(self): 177 | """Stores session and data""" 178 | logger.log(logging.INFO, "Starting server storageRoutine") 179 | 180 | while self.isActive: 181 | try: 182 | sessions = protocol.deQueue([sessionStorageQueue]) 183 | 184 | if not sessions: 185 | continue 186 | 187 | for session in sessions: 188 | storage.writeToFile(session, session.dataContainer) 189 | except: 190 | exc_type, exc_value, exc_traceback = sys.exc_info() 191 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 192 | logger.log(logging.ERROR, message) 193 | 194 | def disconnectAllClient(self): 195 | """Disconnects all clients""" 196 | 197 | for connectedClient in self.clientDict: 198 | if self.clientDict[connectedClient].isActive: 199 | self.clientDict[connectedClient].disconnect() 200 | 201 | 202 | class SSClient: 203 | def __init__(self, cId, socket, address): 204 | self.id = cId 205 | self.socket = socket 206 | self.address = address 207 | self.isActive = True 208 | self.formattedAddr = logger.formatBrackets(str(str(address[0]) + ":" + str(address[1]))) + " " 209 | self.sentCount = 0 210 | self.data = "" 211 | self.configuration = None 212 | 213 | logger.log(logging.INFO, logger.GREEN + self.formattedAddr + "Working node connected" + logger.NOCOLOR) 214 | 215 | def sendConfig(self, configuration): 216 | """Sends the configuration to the client""" 217 | logger.log(logging.DEBUG, self.formattedAddr + "Sending configuration") 218 | self.configuration = configuration 219 | 220 | packet = protocol.Packet(protocol.CONFIG, self.configuration) 221 | self.writeSocket(packet) 222 | 223 | logger.log(logging.DEBUG, self.formattedAddr + "Configuration sent waiting for ACK") 224 | packet = self.readSocket(5) 225 | 226 | if packet.type == protocol.INFO: 227 | if packet.payload.info == protocol.InfoPayload.CLIENT_ACK: 228 | logger.log(logging.DEBUG, self.formattedAddr + "Working node ACK received (configuration)") 229 | return 230 | else: 231 | self.isActive = False 232 | raise Exception("Unable to transmit configuration") 233 | 234 | def run(self): 235 | """Launched the input and output thread with the client itself""" 236 | thread.start_new_thread(self.inputThread, ()) 237 | thread.start_new_thread(self.outputThread, ()) 238 | 239 | def inputThread(self): 240 | """Listens for inputs from the client""" 241 | logger.log(logging.DEBUG, self.formattedAddr + "Listening for packets") 242 | 243 | while self.isActive: 244 | try: 245 | deserializedPacket = self.readSocket() 246 | self.dispatcher(deserializedPacket) 247 | 248 | except EOFError: 249 | #Fixes the pickle error if clients disconnects 250 | self.isActive = False 251 | 252 | def outputThread(self): 253 | """Checks if there are messages to send to the client and sends them""" 254 | while self.isActive: 255 | if self.sentCount > 5: 256 | time.sleep(0.03) 257 | continue 258 | packetToBroadCast = protocol.deQueue([outputQueue]) 259 | 260 | if not packetToBroadCast: 261 | continue 262 | 263 | for packet in packetToBroadCast: 264 | self.writeSocket(packet) 265 | self.sentCount = self.sentCount+1 266 | logger.log(logging.DEBUG, self.formattedAddr + "Sending URL " + str(packet.payload.urlList[0])) 267 | 268 | def dispatcher(self, packet): 269 | """Dispatches packets to the right packet queue or takes action if needed (ie: infoPacket)""" 270 | if packet is None: 271 | return 272 | logger.log(logging.DEBUG, "Dispatching packet of type: " + 
str(packet.type)) 273 | 274 | if packet.type == protocol.INFO: 275 | logger.log(logging.DEBUG, self.formattedAddr + "Received INFO packet") 276 | elif packet.type == protocol.URL: 277 | 278 | if packet.payload.type == protocol.URLPayload.SCRAPPED_URL: 279 | logger.log(logging.INFO, self.formattedAddr + "Receiving scrapped URLs : " + str(len(packet.payload.urlList)).center(5) + " / " + str(len(scrappedURLlist)).center(7) + " - " + str(len(skippedURLlist)).center(5)) 280 | for url in packet.payload.urlList: 281 | urlPool.put(url) 282 | 283 | if packet.payload.type == protocol.URLPayload.VISITED: 284 | self.sentCount = self.sentCount-1 285 | for url in packet.payload.urlList: 286 | logger.log(logging.INFO, self.formattedAddr + "Receiving scrapped data") 287 | logger.log(logging.DEBUG, self.formattedAddr + "Receiving scrapped data" + url) 288 | visitedURLlist.append(url) 289 | if hasattr(packet.payload, 'session'): 290 | if packet.payload.session is not None: 291 | sessionStorageQueue.put(packet.payload.session) 292 | 293 | if packet.payload.type == protocol.URLPayload.SKIPPED: 294 | self.sentCount = self.sentCount-1 295 | for url in packet.payload.urlList: 296 | skippedURLlist.append(url) 297 | if hasattr(packet.payload, 'session'): 298 | if packet.payload.session is not None: 299 | sessionStorageQueue.put(packet.payload.session) 300 | if packet.payload.session.returnCode == -1: 301 | logger.log(logging.INFO, logger.PINK + self.formattedAddr + "Skipped (timeout) : " + url + logger.NOCOLOR) 302 | elif packet.payload.session.returnCode == -2: 303 | logger.log(logging.INFO, logger.PINK + self.formattedAddr + "Skipped (request not allowed - robot parser) : " + url + logger.NOCOLOR) 304 | elif packet.payload.session.returnCode == -100: 305 | logger.log(logging.INFO, logger.YELLOW + self.formattedAddr + "Skipped (unknown error) : " + url + logger.NOCOLOR) 306 | else: 307 | logger.log(logging.INFO, logger.BLUE + self.formattedAddr + "Skipped (html error " + str(packet.payload.session.returnCode) + ") : " + url + logger.NOCOLOR) 308 | else: 309 | logger.log(logging.INFO, logger.RED + self.formattedAddr + "No session returned" + url + logger.NOCOLOR) 310 | else: 311 | logger.log(logging.CRITICAL, "Unrecognized packet type : " + str(packet.type) + ". 
This packet was dropped") 312 | return 313 | 314 | def writeSocket(self, obj): 315 | try: 316 | serializedObj = pickle.dumps(obj) 317 | logger.log(logging.DEBUG, self.formattedAddr + "Sending " + str(len(serializedObj + delimiter)) + " bytes") 318 | self.socket.sendall(serializedObj + delimiter) 319 | except: 320 | raise Exception("Unable to write to socket (client disconnected)") 321 | 322 | def readSocket(self, timeOut=None): 323 | self.socket.settimeout(timeOut) 324 | data = self.data 325 | 326 | if "\n\n12345ZEEK6789\n" in data: 327 | data = data.split("\n\n12345ZEEK6789\n") 328 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 329 | return pickle.loads(data[0]) 330 | 331 | while self.isActive: 332 | buffer = self.socket.recv(buffSize) 333 | data = data + buffer 334 | 335 | if not buffer: 336 | logger.log(logging.INFO, logger.RED + self.formattedAddr + "Lost connection" + logger.NOCOLOR) 337 | self.isActive = False 338 | 339 | if "\n\n12345ZEEK6789\n" in data: 340 | data = data.split("\n\n12345ZEEK6789\n") 341 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 342 | break 343 | 344 | if self.isActive == False: 345 | return 346 | 347 | logger.log(logging.DEBUG, self.formattedAddr + "Receiving " + str(len(data[0])) + " bytes") 348 | return pickle.loads(data[0]) 349 | 350 | def disconnect(self): 351 | """Disconnects the client""" 352 | 353 | if self.socket != None: 354 | logger.log(logging.INFO, logger.RED + self.formattedAddr + "Disconnecting" + logger.NOCOLOR) 355 | self.isActive = False 356 | self.socket.close() 357 | self.socket = None 358 | 359 | 360 | def ending(): 361 | """Temporary ending routine""" 362 | try: 363 | scrapped = len(scrappedURLlist) 364 | skipped = len(skippedURLlist) 365 | visited = len(visitedURLlist) 366 | skipRate = (float(skipped)/float(skipped+visited) * 100) 367 | 368 | print("\n\n-------------------------") 369 | print("Scrapped : " + str(scrapped)) 370 | print("Skipped : " + str(skipped)) 371 | print("Visited : " + str(visited)) 372 | print("-------------------------") 373 | print(str(skipRate) + "% skipping rate\n") 374 | except: 375 | #handles cases where crawling did occur (list were empty) 376 | pass 377 | sys.exit() 378 | 379 | def handler(signum, frame): 380 | ending() 381 | 382 | def main(): 383 | signal.signal(signal.SIGINT, handler) 384 | logger.printAsciiLogo() 385 | 386 | #parse config file 387 | config = configuration.configParser() 388 | 389 | #logging 390 | logger.init(config.logPath, "server-" + str(datetime.datetime.now())) 391 | logger.debugFlag = config.verbose 392 | 393 | #node configration 394 | if config.crawling == 'dynamic': 395 | nodeConfig = protocol.ConfigurationPayload(protocol.ConfigurationPayload.DYNAMIC_CRAWLING, config) 396 | else: 397 | nodeConfig = protocol.ConfigurationPayload(protocol.ConfigurationPayload.STATIC_CRAWLING, config) 398 | 399 | #server 400 | server = Server(config.host, config.port) 401 | server.setup(nodeConfig) 402 | thread.start_new_thread(server.listen, ()) #testing 403 | 404 | while server.isActive: 405 | time.sleep(0.5) 406 | 407 | server.disconnectAllClient() 408 | ending() 409 | 410 | 411 | if __name__ == "__main__": 412 | main() -------------------------------------------------------------------------------- /src/url.txt: -------------------------------------------------------------------------------- 1 | http://www.businessinsider.com/sai 2 | http://www.lapresse.ca/ 3 | http://mashable.com/ 4 | --------------------------------------------------------------------------------