├── LICENSE ├── README.md └── src ├── client.py ├── config ├── helper └── rule-tester.py ├── launch-clients.sh ├── modules ├── __init__.py ├── configuration.py ├── logger.py ├── protocol.py ├── rule.py ├── scrapping.py └── storage.py ├── server.py └── url.txt /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014-2015 Diastro - Zeek 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Zeek 2 | ==== 3 | 4 | Python distributed web crawling / web scraper 5 | 6 | This the first version of my distributed web crawler. It isn't perfect yet but I'm sharing it because the end result is far better then what I expected and it can easily be adapted to your needs. Feel free to improve/fork/report issues. 7 | 8 | I'm planning to continue working on it and probably release an updated version in the future but i'm not sure when yet. 9 | 10 | ### Use cases 11 | * Visit a **predetermined** list of URLs and scrape specific data on these pages 12 | * Visit or **dynamically visit** web pages on a periodic bases and **scrape** data on these pages 13 | * Dynamically visit pages on a **given domain** and scrape data on these pages 14 | * Dynamically visit pages **all over the internet** and scrape data on these pages 15 | 16 | All the scraped data can be stored in an output file (ie: `.csv`, `.txt`) or in a database 17 | 18 | *David Albertson* 19 | 20 | ## Execution 21 | 1) Download the source and install the required third party library 22 | ~~~ sh 23 | $ git clone https://github.com/Diastro/Zeek.git 24 | $ easy_install beautifulsoup4 25 | $ easy_install lxml 26 | ~~~ 27 | 28 | 2) Update the configuration files : 29 | * change the server `listeningAddress / listeningPort` to the right info; 30 | * change the client `hostAddr / hostPort` to the right info. 31 | 32 | 3) Update the /modules/rule.py and modules/storage.py : 33 | * See the documentation for more information on how to adapt these files. 
34 | 35 | 4) Launch the server on the **master** node 36 | 37 | ~~~ sh 38 | $ python server.py 39 | ~~~ 40 | 41 | 5) Launch the client on the **working** nodes 42 | 43 | ~~~ sh 44 | $ python client.py 45 | ~~~ 46 | 47 | #### Third-party libraries 48 | - [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) 49 | - [lxml](http://lxml.de/) 50 | 51 | ## Configuration fields 52 | **[server]**
53 | *listeningAddr* : Address on which the server listens for incoming connections from clients (ex : 127.0.0.1)
54 | *listeningPort* : Port on which the server listens for incoming connections from clients (ex : 5050)
55 | 56 | **[client]**
57 | *hostAddr* : Address of the server to connect to (ex : 127.0.0.1)
58 | *hostPort* : Port of the server to connect to (ex : 5050)
59 | 60 | **[common]**
61 | *verbose* : Enables or disables verbose output in the console (ex : True, False)
62 | *logPath* : Path where the output log file of each process is saved (ex : logs/)
63 | *userAgent* : Usually the name of your crawler or bot (ex : MyBot 1.0)
64 | *crawling* : Type of crawling (ex : dynamic, static)
65 | *robotParser* : Obey or ignore the robots.txt rules of the domains being visited (ex : True, False)
66 | *crawlDelay* : Delay, in seconds, between two subsequent requests (ex : 0, 3, 5)
67 | 68 | **[dynamic]** (Applies only if the crawling type is set to dynamic)
69 | *domainRestricted* : If set to True, the crawler only visits URLs that belong to the same domain as the root URL (ex : True, False)
70 | *requestLimit* : Stops the crawler once the limit is reached (i.e. after visiting that many pages); 0 means no limit (ex : 0, 2, 100, ...)
71 | *rootUrls* : Comma-separated list of URLs to start crawling from (ex : www.businessinsider.com)
72 | 73 | **[static]** (Applies only if the crawling type is set to static)
74 | *rootUrlsPath* : Path to the file containing the list of URLs to visit (ex : url.txt)
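
For reference, here is the sample `src/config` shipped with the repository; it exercises every field described above (the values are only illustrative defaults and should be adapted to your own setup):

~~~
[server]
listeningAddr = 127.0.0.1
listeningPort = 5050

[client]
hostAddr = 127.0.0.1
hostPort = 5050

[common]
verbose = False
logPath = logs/
userAgent = Zeek-1.0a
crawling = dynamic
robotParser = true
crawlDelay = 0

[dynamic]
domainRestricted = true
requestLimit = 0
rootUrls = http://www.businessinsider.com/

[static]
rootUrlsPath = url.txt
~~~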
75 | 76 | ## How it works 77 | ***Coming soon*** 78 | 79 | ### Rule.py Storage.py 80 | ***Coming soon*** 81 | 82 | ### Testing your rule.py 83 | ***Coming soon*** (in the meantime, `src/helper/rule-tester.py` lets you try your parsing logic against a live URL) 84 | 85 | ## Recommended topologies 86 | Zeek can be launched in 2 different topologies, depending on which resource limits you. Crawling a large number of web pages requires both a lot of bandwidth (to execute multiple parallel requests) and computing power (CPU). Depending on which of the two is your bottleneck, use the appropriate topology for the fastest crawl time. 87 | Keep in mind that if time isn't a constraint for you, a 1-1 approach is always the safest and least expensive! 88 | * Basic topology (recommended) : see the **1-1 topology** 89 | * Best performance topology : see the **1-n topology** 90 | 91 | No matter which topology you are using, you can always use `launch-clients.sh` to launch multiple instances of client.py on the same computer (an example invocation is shown in the 1-1 Topology section below). 92 | 93 | ### 1-1 Topology 94 | The 1-1 topology is the easiest to set up. It only requires one computer, so anyone can deploy Zeek this way. With this topology, you first start server.py (using 127.0.0.1 as the listeningAddr) and then connect as many client.py processes to it as you want (using 127.0.0.1 as the hostAddr); everything runs on the same machine. Be aware that, depending on the specs of your computer, you will eventually be limited by the number of threads the server.py process can create: server.py launches 3 threads per connected client, so if your system allows roughly 300 threads per process, the maximum number of client.py instances you will be able to launch is approximately 100. If you launch that many clients, you might also end up being limited by your bandwidth.
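
A minimal command sketch of a 1-1 deployment, assuming the sample config above (server and clients both pointing at 127.0.0.1:5050) and that the commands are run from the `src/` directory:

~~~ sh
# master node and working nodes all on the same machine
$ python server.py &            # start the master node in the background
$ bash launch-clients.sh -n 10  # start 10 client.py working nodes locally
~~~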
95 | [1-1 Topology schema](http://i.imgur.com/7NJGodN.jpg) 96 | 97 | ### 1-n Topology 98 | This topology is the one to use if you want the best performance, but it requires more than one computer at your disposal. The only limitation with this topology is the number of clients that can connect to the server.py process. As explained above, server.py launches 3 threads per connected client, so if your system allows roughly 300 threads per process, the maximum number of client.py instances you will be able to launch is approximately 100. In this case, however, since each computer uses a separate connection, bandwidth shouldn't be a problem.
99 | [1-n Topology schema](http://i.imgur.com/lXCEAk6.jpg) 100 | 101 | ## Stats - Benchmark 102 | ***Coming soon*** 103 | 104 | ## Warning 105 | Using a distributed crawler/scraper can make your life easier, but it also comes with great responsibility. When you use a crawler to make requests to a website, you generate connections to that website, and if the targeted website isn't configured properly, the consequences can be disastrous. You're probably asking yourself "What exactly does he mean?". What I mean is that by using 10 computers, each running 30 client.py instances, you could (in a perfect world) generate 300 parallel requests. If these 300 parallel requests target the same website/domain, you will be downloading a lot of data very quickly, and if the targeted domain isn't prepared for it, you could potentially shut it down.
106 | During the development of Zeek I happened to experience something similar while doing approximatly 250 parallel request to a pretty well known website. The sysadmins of this website ended up contacting the sysadmin where I have my own server hosted being worried that something strange was happenning (they were probably thinking of an attack). During this period of time I ended up downloading 7Gb of data in about 30 minutes. This alone trigged some internal alert on their side. That being now said, I'm not responsible for your usage of Zeek. Simply try to be careful and respectful of others online! 107 | 108 | ## References 109 | - [Wikipedia - WebCrawler](http://en.wikipedia.org/wiki/Web_crawler) 110 | - [Wikipedia - Distributed crawling](http://en.wikipedia.org/wiki/Distributed_web_crawling) 111 | - [How to Parse data using BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html) 112 | - [Understanding Robots.txt](http://www.robotstxt.org/faq.html) 113 | -------------------------------------------------------------------------------- /src/client.py: -------------------------------------------------------------------------------- 1 | import ConfigParser 2 | import datetime 3 | import Queue 4 | import logging 5 | import os 6 | import pickle 7 | import socket 8 | import sys 9 | import time 10 | import thread 11 | import traceback 12 | import modules.logger as logger 13 | import modules.protocol as protocol 14 | import modules.scrapping as scrapping 15 | 16 | sys.setrecursionlimit(10000) 17 | 18 | buffSize = 524288 19 | delimiter = '\n\n12345ZEEK6789\n' 20 | 21 | 22 | class WorkingNode(): 23 | def __init__(self): 24 | # socket 25 | self.host = None 26 | self.port = None 27 | self.data = "" 28 | 29 | # general 30 | self.isActive = True 31 | self.masterNodeFormattedAddr = None 32 | self.crawlingType = None 33 | 34 | # data container 35 | self.outputQueue = Queue.Queue(0) 36 | self.infoQueue = Queue.Queue(0) 37 | self.urlToVisit = Queue.Queue(0) 38 | 39 | # object 40 | self.scrapper = None 41 | self.config = None 42 | 43 | 44 | def connect(self, host, port): 45 | """Sets up the connection to the server (max 6 attemps)""" 46 | self.host = host 47 | self.port = port 48 | self.masterNodeFormattedAddr = "[" + str(self.host) + ":" + str(self.port) + "]" 49 | 50 | logger.log(logging.DEBUG, "Socket initialization") 51 | self.s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 52 | for connectionAttempt in range(6, 0, -1): 53 | if connectionAttempt == 1: 54 | logger.log(logging.CRITICAL, "Unable to connect to host " + self.masterNodeFormattedAddr) 55 | sys.exit() 56 | try: 57 | logger.log(logging.DEBUG, "Connecting to host... 
" + self.masterNodeFormattedAddr) 58 | self.s.connect((self.host, self.port)) 59 | logger.log(logging.INFO, "Connected to " + self.masterNodeFormattedAddr) 60 | break 61 | except socket.error: 62 | logger.log(logging.INFO, "Connection failed to " + self.masterNodeFormattedAddr) 63 | logger.log(logging.INFO, "Retrying in 3 seconds.") 64 | time.sleep(3) 65 | 66 | def readConfig(self): 67 | """Reads the configuration from the server""" 68 | logger.log(logging.DEBUG, "Waiting for configuration from the server.") 69 | if self.isActive: 70 | try: 71 | deserializedPacket = self.readSocket() 72 | logger.log(logging.DEBUG, "Configuration received.") 73 | 74 | if deserializedPacket.type == protocol.CONFIG: 75 | self.crawlingType = deserializedPacket.payload.crawlingType 76 | self.config = deserializedPacket.payload.config 77 | 78 | # dynamic module reload 79 | basePath = os.path.dirname(sys.argv[0]) 80 | if basePath: 81 | basePath = basePath + "/" 82 | 83 | # path building 84 | rulePath = basePath + "modules/rule.py" 85 | scrappingPath = basePath + "modules/scrapping.py" 86 | 87 | # re-writing source .py 88 | logger.log(logging.INFO, "Importing rule.py from server") 89 | ruleFd = open(rulePath, 'w') 90 | ruleFd.write(self.config.rule_py) 91 | ruleFd.close() 92 | 93 | logger.log(logging.INFO, "Importing scrapping.py from server") 94 | scrappingFd = open(scrappingPath, 'w') 95 | scrappingFd.write(self.config.scrapping_py) 96 | scrappingFd.close() 97 | 98 | # compilation test 99 | try: 100 | code=open(rulePath, 'rU').read() 101 | compile(code, "rule_test", "exec") 102 | except: 103 | exc_type, exc_value, exc_traceback = sys.exc_info() 104 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 105 | logger.log(logging.CRITICAL, message) 106 | logger.log(logging.ERROR, "Unable to compile rule.py (is the syntax right?)") 107 | sys.exit(0) 108 | 109 | try: 110 | code=open(scrappingPath, 'rb').read(os.path.getsize(scrappingPath)) 111 | compile(code, "scrapping_test", "exec") 112 | except: 113 | exc_type, exc_value, exc_traceback = sys.exc_info() 114 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 115 | logger.log(logging.CRITICAL, message) 116 | logger.log(logging.ERROR, "Unable to compile scrapping.py (is the syntax right?)") 117 | sys.exit(0) 118 | 119 | # dynamic reload of modules 120 | # TODO reloading of rule.py should eventually come here 121 | logger.log(logging.INFO, "Reloading modules imported for server") 122 | reload(sys.modules["modules.scrapping"]) 123 | 124 | 125 | payload = protocol.InfoPayload(protocol.InfoPayload.CLIENT_ACK) 126 | packet = protocol.Packet(protocol.INFO, payload) 127 | self.writeSocket(packet) 128 | 129 | logger.log(logging.DEBUG, "Sending ACK for configuration.") 130 | else: 131 | raise Exception("Unable to parse configuration.") 132 | except: 133 | exc_type, exc_value, exc_traceback = sys.exc_info() 134 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 135 | logger.log(logging.CRITICAL, message) 136 | self.isActive = False 137 | 138 | def run(self): 139 | """Launches main threads""" 140 | logger.log(logging.INFO, "\n\nStarting Crawling/Scrapping sequence...") 141 | if self.isActive: 142 | thread.start_new_thread(self.outputThread, ()) 143 | thread.start_new_thread(self.inputThread, ()) 144 | thread.start_new_thread(self.interpretingThread, ()) 145 | thread.start_new_thread(self.crawlingThread, ()) 146 | 147 | def inputThread(self): 148 | """Listens for inputs from the 
server""" 149 | logger.log(logging.DEBUG, "InputThread started") 150 | 151 | while self.isActive: 152 | try: 153 | deserializedPacket = self.readSocket() 154 | self.dispatcher(deserializedPacket) 155 | except EOFError: 156 | self.isActive = False 157 | except: 158 | exc_type, exc_value, exc_traceback = sys.exc_info() 159 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 160 | logger.log(logging.CRITICAL, message) 161 | self.isActive = False 162 | 163 | def outputThread(self): 164 | """Checks if there are messages to send to the server and sends them""" 165 | logger.log(logging.DEBUG, "OutputThread started") 166 | 167 | while self.isActive: 168 | try: 169 | obj = self.outputQueue.get(True) #fix with helper method to prevent block 170 | self.writeSocket(obj) 171 | logger.log(logging.DEBUG, "Sending obj of type " + str(obj.type)) 172 | except: 173 | exc_type, exc_value, exc_traceback = sys.exc_info() 174 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 175 | logger.log(logging.CRITICAL, message) 176 | self.isActive = False 177 | 178 | def interpretingThread(self): 179 | """Interprets message from the server other than type URL. (ie: INFO)""" 180 | logger.log(logging.DEBUG, "InterpretingThread started") 181 | 182 | while self.isActive: 183 | try: 184 | time.sleep(0.01) #temp - For testing 185 | packets = protocol.deQueue([self.infoQueue]) 186 | 187 | if not packets: 188 | continue 189 | 190 | for packet in packets: 191 | if packet.type == protocol.INFO: 192 | logger.log(logging.INFO, "Interpreting INFO packet : " + str(packet.payload.urlList)) 193 | except: 194 | exc_type, exc_value, exc_traceback = sys.exc_info() 195 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 196 | logger.log(logging.CRITICAL, message) 197 | self.isActive = False 198 | 199 | def crawlingThread(self): 200 | """Takes URL from the urlToVisit queue and visits them""" 201 | logger.log(logging.DEBUG, "CrawlingThread started") 202 | 203 | self.scrapper = scrapping.Scrapper(self.config.userAgent, self.config.robotParserEnabled, self.config.domainRestricted, self.config.crawling) 204 | 205 | while self.isActive: 206 | try: 207 | urlList = protocol.deQueue([self.urlToVisit]) 208 | 209 | if not urlList: 210 | time.sleep(0.2) #temp - For testing 211 | continue 212 | 213 | for url in urlList: 214 | session = self.scrapper.visit(url) 215 | logger.log(logging.DEBUG, "Session \n" + str(session.url) + 216 | "\nCode : " + str(session.returnCode) + 217 | "\nRequest time : " + str(session.requestTime) + 218 | "\nBs time : " + str(session.bsParsingTime)) 219 | 220 | if not session.failed: 221 | if self.crawlingType == protocol.ConfigurationPayload.DYNAMIC_CRAWLING: 222 | payload = protocol.URLPayload(session.scrappedURLs, protocol.URLPayload.SCRAPPED_URL) 223 | packet = protocol.Packet(protocol.URL, payload) 224 | self.outputQueue.put(packet) 225 | 226 | payload = protocol.URLPayload([url], protocol.URLPayload.VISITED, session=session) 227 | packet = protocol.Packet(protocol.URL, payload) 228 | self.outputQueue.put(packet) 229 | else: 230 | logger.log(logging.INFO, "Skipping URL : " + url) 231 | payload = protocol.URLPayload([url], protocol.URLPayload.SKIPPED, session) 232 | packet = protocol.Packet(protocol.URL, payload) 233 | self.outputQueue.put(packet) 234 | continue 235 | 236 | except: 237 | exc_type, exc_value, exc_traceback = sys.exc_info() 238 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 239 
| logger.log(logging.CRITICAL, message) 240 | self.isActive = False 241 | 242 | def dispatcher(self, packet): 243 | """Dispatches packets to the right packet queue""" 244 | if packet is None: 245 | return 246 | elif packet.type == protocol.INFO: 247 | logger.log(logging.DEBUG, "Dispatching INFO packet") 248 | self.infoQueue.put(packet) 249 | elif packet.type == protocol.URL: 250 | logger.log(logging.DEBUG, "Dispatching url packet : " + str(packet.payload.urlList[0])) 251 | for site in packet.payload.urlList: 252 | self.urlToVisit.put(site) 253 | else: 254 | logger.log(logging.CRITICAL, "Unrecognized packet type : " + str(packet.type) + ". This packet was dropped") 255 | return 256 | 257 | logger.log(logging.DEBUG, "Dispatched packet of type: " + str(packet.type)) 258 | 259 | def writeSocket(self, obj): 260 | try: 261 | serializedObj = pickle.dumps(obj) 262 | logger.log(logging.DEBUG, "Sending " + str(len(serializedObj + delimiter)) + " bytes to server") 263 | self.s.sendall(serializedObj + delimiter) 264 | except: 265 | exc_type, exc_value, exc_traceback = sys.exc_info() 266 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 267 | logger.log(logging.CRITICAL, message) 268 | raise Exception("Unable to write to socket (lost connection to server)") 269 | 270 | def readSocket(self, timeOut=None): 271 | self.s.settimeout(timeOut) 272 | data = self.data 273 | 274 | if "\n\n12345ZEEK6789\n" in data: 275 | data = data.split("\n\n12345ZEEK6789\n") 276 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 277 | return pickle.loads(data[0]) 278 | 279 | while self.isActive: 280 | buffer = self.s.recv(buffSize) 281 | data = data + buffer 282 | 283 | if not buffer: 284 | logger.log(logging.INFO, "\nLost connection to server " + self.masterNodeFormattedAddr) 285 | self.isActive = False 286 | 287 | if "\n\n12345ZEEK6789\n" in data: 288 | data = data.split("\n\n12345ZEEK6789\n") 289 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 290 | break 291 | 292 | if self.isActive == False: 293 | return 294 | 295 | logger.log(logging.DEBUG, "Receiving " + str(len(data[0])) + " bytes from server") 296 | 297 | return pickle.loads(data[0]) 298 | 299 | def disconnect(self): 300 | """Disconnects from the server""" 301 | self.isActive = False 302 | self.s.close() 303 | 304 | 305 | def main(): 306 | path = os.path.dirname(sys.argv[0]) 307 | if path: 308 | path = path + "/" 309 | 310 | #config 311 | config = ConfigParser.RawConfigParser(allow_no_value=True) 312 | config.read(path + 'config') 313 | host = config.get('client', 'hostAddr') 314 | port = config.getint('client', 'hostPort') 315 | logPath = config.get('common', 'logPath') 316 | verbose = config.get('common', 'verbose') 317 | if verbose == "True" or verbose == "true": 318 | verbose = True 319 | else: 320 | verbose = False 321 | 322 | #setup 323 | logger.init(logPath, "client-" + str(datetime.datetime.now())) 324 | logger.debugFlag = verbose 325 | 326 | node = WorkingNode() 327 | node.connect(host, port) 328 | node.readConfig() 329 | node.run() 330 | 331 | while node.isActive: 332 | time.sleep(0.5) 333 | 334 | node.disconnect() 335 | 336 | if __name__ == "__main__": 337 | main() -------------------------------------------------------------------------------- /src/config: -------------------------------------------------------------------------------- 1 | [server] 2 | listeningAddr = 127.0.0.1 3 | listeningPort = 5050 4 | 5 | [client] 6 | hostAddr = 127.0.0.1 7 | hostPort = 5050 8 | 9 | [common] 10 | verbose = False 11 | logPath = 
logs/ 12 | userAgent = Zeek-1.0a 13 | crawling = dynamic 14 | robotParser = true 15 | crawlDelay = 0 16 | 17 | [dynamic] 18 | domainRestricted = true 19 | requestLimit = 0 20 | rootUrls = http://www.businessinsider.com/ 21 | 22 | [static] 23 | rootUrlsPath = url.txt 24 | -------------------------------------------------------------------------------- /src/helper/rule-tester.py: -------------------------------------------------------------------------------- 1 | import urllib2, cookielib 2 | from bs4 import BeautifulSoup 3 | 4 | # url to test the parsing 5 | urls = ["http://www.nytimes.com/2013/11/19/us/politics/republicans-block-another-obama-nominee-for-key-judgeship.html"] 6 | 7 | for u in urls: 8 | # cookie 9 | cj = cookielib.CookieJar() 10 | opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 11 | 12 | # builds request 13 | request = urllib2.Request(u) 14 | request.add_header('User-agent', "test") 15 | response = opener.open(request) 16 | 17 | # parsing response 18 | bs = BeautifulSoup(response) 19 | 20 | # - Test your parsing here - 21 | 22 | # example : 23 | # 24 | # headline = bs.find("h1", {"itemprop": "headline"}) 25 | # if headline is not None: 26 | # title = headline.find("nyt_headline") 27 | # if headline is not None: 28 | # print title.get_text().encode('ascii', 'ignore') 29 | -------------------------------------------------------------------------------- /src/launch-clients.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # launch-clients.sh 3 | 4 | COUNT=-1 5 | 6 | function help { 7 | echo "usage: -h help" 8 | echo " -n number of instances of client to launch locally" 9 | } 10 | 11 | while getopts n:h opt 12 | do 13 | case $opt in 14 | n) COUNT="${OPTARG}";; 15 | h) HELP="-1";; 16 | *) exit 1;; 17 | esac 18 | done 19 | 20 | if [ "$HELP" == "-1" ] || [ "$COUNT" == "-1" ] 21 | then 22 | help 23 | exit 1 24 | fi 25 | 26 | for i in $(seq 1 $COUNT) 27 | do 28 | python client.py > /dev/null & 29 | echo "Client $i started with PID : $!" 
30 | done 31 | -------------------------------------------------------------------------------- /src/modules/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'David' 2 | -------------------------------------------------------------------------------- /src/modules/configuration.py: -------------------------------------------------------------------------------- 1 | import ConfigParser 2 | import inspect 3 | import os 4 | import sys 5 | 6 | 7 | class Configuration(): 8 | def __init__(self): 9 | self.host = "" 10 | self.port = "" 11 | self.logPath = "" 12 | self.userAgent = "" 13 | self.verbose = False 14 | 15 | self.crawling = "" 16 | self.robotParserEnabled = False 17 | self.domainRestricted = False 18 | self.requestLimit = 0 19 | self.crawlDelay = 0.0 20 | self.rootUrls = [] 21 | 22 | self.rule_py = "" 23 | self.scrapping_py = "" 24 | 25 | def readStaticUrl(path): 26 | urls = [] 27 | file = open(path, 'r') 28 | for url in file: 29 | url = "".join(url.split()).replace(",","") 30 | urls.append(url) 31 | return urls 32 | 33 | def readFile(path): 34 | content = "" 35 | file = open(path, 'r') 36 | for line in file: 37 | content = content + line 38 | return content 39 | 40 | def configParser(): 41 | path = os.path.dirname(sys.argv[0]) 42 | if path: 43 | path = path + "/" 44 | 45 | config = Configuration() 46 | configParser = ConfigParser.RawConfigParser(allow_no_value=True) 47 | 48 | configParser.read(path + 'config') 49 | config.host = configParser.get('server', 'listeningAddr') 50 | config.port = configParser.getint('server', 'listeningPort') 51 | config.logPath = configParser.get('common', 'logPath') 52 | verbose = configParser.get('common', 'verbose') 53 | if verbose == "True" or verbose == "true": 54 | config.verbose = True 55 | else: 56 | config.verbose = False 57 | 58 | config.userAgent = configParser.get('common', 'userAgent') 59 | config.crawlDelay = configParser.getfloat('common', 'crawlDelay') 60 | robotParserEnabled = configParser.get('common', 'robotParser') 61 | if robotParserEnabled == "True" or robotParserEnabled == "true": 62 | config.robotParserEnabled = True 63 | else: 64 | config.robotParserEnabled = False 65 | 66 | config.crawling = configParser.get('common', 'crawling') 67 | if config.crawling == 'dynamic': 68 | domainRestricted = configParser.get('dynamic', 'domainRestricted') 69 | config.requestLimit = configParser.getint('dynamic', 'requestLimit') 70 | rootUrls = configParser.get('dynamic', 'rootUrls') 71 | rootUrls = "".join(rootUrls.split()) 72 | config.rootUrls = rootUrls.split(',') 73 | 74 | if domainRestricted == "True" or domainRestricted == "true": 75 | config.domainRestricted = True 76 | else: 77 | config.domainRestricted = False 78 | else: 79 | config.rootUrls = readStaticUrl(configParser.get('static', 'rootUrlsPath')) 80 | 81 | # dynamic module reload 82 | config.rule_py = readFile(path + "modules/rule.py") 83 | config.scrapping_py = readFile(path + "modules/scrapping.py") 84 | 85 | return config -------------------------------------------------------------------------------- /src/modules/logger.py: -------------------------------------------------------------------------------- 1 | import inspect 2 | import logging 3 | import os 4 | import sys 5 | 6 | debugFlag = True 7 | GREEN = '\033[92m' 8 | PINK = '\033[95m' 9 | BLUE = '\033[94m' 10 | RED = '\033[91m' 11 | YELLOW = '\033[93m' 12 | NOCOLOR = '\033[0m' 13 | 14 | color = [GREEN, PINK, BLUE, RED, YELLOW, NOCOLOR] 15 | 16 | def init(path, logName): 
17 | basePath = os.path.dirname(sys.argv[0]) 18 | if basePath: 19 | basePath = basePath + "/" 20 | path = basePath + path 21 | if not os.path.exists(path): 22 | os.makedirs(path) 23 | logging.basicConfig(filename=path+logName, format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG) 24 | 25 | 26 | def debugFlag(flag): 27 | debugFlag = flag 28 | if debugFlag: 29 | logging.disable(logging.NOTSET) 30 | else: 31 | logging.disable(logging.DEBUG) 32 | 33 | 34 | def log(level, message): 35 | #message formatting 36 | if level == logging.DEBUG: 37 | func = inspect.currentframe().f_back.f_code 38 | fileName = ''.join(func.co_filename.split('/')[-1]) 39 | line = func.co_firstlineno 40 | message = "[" + str(func.co_name) + " - " + str(fileName) + ", " + str(line) + "] " + message 41 | 42 | if level == logging.CRITICAL: 43 | message = "\n\n ************************\n" + message 44 | 45 | #printing to logs 46 | if debugFlag and level == logging.DEBUG: 47 | print(message) 48 | elif level is not logging.DEBUG: 49 | print(message) 50 | 51 | for c in color: 52 | message = message.replace(c, "") 53 | 54 | logging.log(level, message) 55 | 56 | 57 | def formatBrackets(message): 58 | return "[" + str(message) + "]" 59 | 60 | 61 | def printAsciiLogo(): 62 | print("") 63 | print(" .----------------. .----------------. .----------------. .----------------. ") 64 | print("| .--------------. || .--------------. || .--------------. || .--------------. |") 65 | print("| | ________ | || | _________ | || | _________ | || | ___ ____ | |") 66 | print("| | | __ _| | || | |_ ___ | | || | |_ ___ | | || | |_ ||_ _| | |") 67 | print("| | |_/ / / | || | | |_ \_| | || | | |_ \_| | || | | |_/ / | |") 68 | print("| | .'.' _ | || | | _| _ | || | | _| _ | || | | __'. | |") 69 | print("| | _/ /__/ | | || | _| |___/ | | || | _| |___/ | | || | _| | \ \_ | |") 70 | print("| | |________| | || | |_________| | || | |_________| | || | |____||____| | |") 71 | print("| | | || | | || | | || | | |") 72 | print("| '--------------' || '--------------' || '--------------' || '--------------' |") 73 | print(" '----------------' '----------------' '----------------' '----------------' ") 74 | print("") 75 | print(" +++ ") 76 | print(" (o o) ") 77 | print(" -ooO--(_)--Ooo- ") 78 | print(" v1.0a ") 79 | print(" David Albertson ") 80 | print("") 81 | 82 | 83 | -------------------------------------------------------------------------------- /src/modules/protocol.py: -------------------------------------------------------------------------------- 1 | import Queue 2 | 3 | CONFIG = 'CONFIG' 4 | INFO = 'INFO' 5 | URL = 'URL' 6 | 7 | 8 | class Packet: 9 | def __init__(self, type, payload): 10 | self.type = type 11 | self.payload = payload 12 | 13 | def setPayload(self, payload): 14 | self.payload = payload 15 | 16 | 17 | class ConfigurationPayload(): 18 | STATIC_CRAWLING = 'STATIC' 19 | DYNAMIC_CRAWLING = 'DYNAMIC' 20 | 21 | def __init__(self, crawlingType, config): 22 | self.crawlingType = crawlingType 23 | self.config = config 24 | 25 | 26 | class InfoPayload(): 27 | CLIENT_ACK = 0 28 | SERVER_ACK = 1 29 | 30 | def __init__(self, info): 31 | self.info = info 32 | 33 | 34 | class URLPayload(): 35 | VISITED = 'VISITED' 36 | SKIPPED = 'SKIPPED' 37 | TOVISIT = 'TOVISIT' 38 | SCRAPPED_URL = 'SCRAPPED' 39 | 40 | def __init__(self, urlList, type, session=None, data=None): 41 | #self.url = url TODO : add url param (to know where the data is coming from) 42 | self.urlList = [] 43 | self.type = type 44 | self.data = data 45 | self.session = session 46 | 
47 | for url in urlList: 48 | self.urlList.append(url) 49 | 50 | 51 | def deQueue(queueArray): 52 | packetArray = [] 53 | for queue in queueArray: 54 | try: 55 | packet = queue.get(block=False) 56 | packetArray.append(packet) 57 | except Queue.Empty: 58 | pass 59 | return packetArray -------------------------------------------------------------------------------- /src/modules/rule.py: -------------------------------------------------------------------------------- 1 | import urlparse 2 | 3 | class Container: 4 | def __init__(self): 5 | #data = dict() 6 | self.hasData = False 7 | 8 | self.title = None 9 | self.author = None 10 | 11 | def scrape(url, bs): 12 | # for testing - this is scrapping article titles from www.nytimes.com 13 | container = Container() 14 | domain = urlparse.urlsplit(url)[1].split(':')[0] 15 | 16 | # extracting data from NYTimes 17 | if domain == "www.nytimes.com": 18 | headline = bs.find("h1", {"itemprop": "headline"}) 19 | if headline is not None: 20 | title = headline.find("nyt_headline") 21 | if title is not None: 22 | container.title = title.get_text().encode('ascii', 'ignore') 23 | 24 | byline = bs.find("h6", {"class": "byline"}) 25 | if byline is not None: 26 | author = byline.find("span", {"itemprop": "name"}) 27 | if author is not None: 28 | container.author = author.get_text().encode('ascii', 'ignore') 29 | 30 | return container 31 | 32 | return Container() -------------------------------------------------------------------------------- /src/modules/scrapping.py: -------------------------------------------------------------------------------- 1 | import cookielib 2 | import urllib2 3 | import logging 4 | import logger 5 | import time 6 | import robotparser 7 | import rule 8 | import socket 9 | import sys 10 | import traceback 11 | import urlparse 12 | from bs4 import BeautifulSoup 13 | 14 | robotDict = {} 15 | 16 | class Session: 17 | def __init__(self, url, failed, code, info, requestTime, bsParsingTime, scrappedURLs, dataContainer=None, errorMsg=None): 18 | self.url = url 19 | self.failed = failed 20 | self.returnCode = code 21 | self.returnInfo = info 22 | self.requestTime = requestTime 23 | self.bsParsingTime = bsParsingTime 24 | 25 | self.scrappedURLs = scrappedURLs 26 | self.dataContainer = dataContainer 27 | 28 | # add error handling 29 | # err.msg 30 | 31 | #url error 32 | self.errorMsg = errorMsg 33 | 34 | class Scrapper: 35 | def __init__(self, userAgent, robotParserEnabled, domainRestricted, crawlingType): 36 | self.userAgent = userAgent 37 | self.robotParserEnabled = robotParserEnabled 38 | self.domainRestricted = domainRestricted 39 | self.crawlingType = crawlingType 40 | 41 | # eventually move this to client.py 42 | reload(rule) 43 | 44 | def visit(self, url): 45 | """Visits a given URL and return all the data""" 46 | logger.log(logging.INFO, "Scrapping : " + str(url)) 47 | 48 | # in the case the rootUrl wasnt formatted the right way 49 | if (url.startswith("http://") or url.startswith("https://")) is False: 50 | url = "http://" + url 51 | 52 | domain = urlparse.urlsplit(url)[1].split(':')[0] 53 | httpDomain = "http://" + domain 54 | 55 | try: 56 | # robot parser 57 | if self.robotParserEnabled: 58 | if httpDomain not in robotDict: 59 | parser = robotparser.RobotFileParser() 60 | parser.set_url(urlparse.urljoin(httpDomain, 'robots.txt')) 61 | parser.read() 62 | robotDict[httpDomain] = parser 63 | parser = robotDict[httpDomain] 64 | 65 | isParsable = parser.can_fetch(self.userAgent, url) 66 | if not isParsable: 67 | raise Exception("RobotParser") 
68 | 69 | # request 70 | start_time = time.time() 71 | cj = cookielib.CookieJar() 72 | opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 73 | request = urllib2.Request(url) 74 | request.add_header('User-agent', self.userAgent) 75 | data = opener.open(request, timeout=4) 76 | urlRequestTime = time.time() - start_time 77 | 78 | # parsing 79 | start_time = time.time() 80 | bs = BeautifulSoup(data) 81 | bsParsingTime = time.time() - start_time 82 | 83 | # url scrapping - dynamic crawling 84 | if self.crawlingType == "dynamic": 85 | illegal = [".mp4", ".mp3", ".flv", ".m4a", \ 86 | ".jpg", ".png", ".gif", \ 87 | ".xml", ".pdf", ".gz", ".zip", ".rss"] 88 | 89 | links = bs.find_all('a') 90 | links = [s.get('href') for s in links] 91 | links = [unicode(s) for s in links] 92 | if self.domainRestricted: 93 | links = [s for s in links if s.startswith("http://" + domain + "/") or s.startswith("https://" + domain )] 94 | for ext in illegal: 95 | links = [s for s in links if ext not in s] 96 | links = [s for s in links if s.startswith("http:") or s.startswith("https:")] 97 | foundUrl = set(links) 98 | 99 | # data scrapping 100 | dataContainer = rule.scrape(url, bs) 101 | if dataContainer is None: 102 | raise("None data container object") 103 | 104 | logger.log(logging.DEBUG, "Scrapping complete") 105 | return Session(url, False, data.getcode(), data.info(), urlRequestTime, bsParsingTime , foundUrl, dataContainer) 106 | 107 | except urllib2.HTTPError, err: 108 | logger.log(logging.INFO, "Scrapping failed - HTTPError " + str(err.msg) + " " + str(err.code)) 109 | return Session(url, True, err.code, "no data", 0, "", "", errorMsg=err.msg.replace('\n', "")) 110 | except socket.timeout: 111 | logger.log(logging.INFO, "Scrapping failed - Timeout") 112 | return Session(url, True, -1, "no data", 0, "", "", errorMsg="Request timeout") 113 | except Exception as e: 114 | if e.message == "RobotParser": 115 | logger.log(logging.INFO, "Scrapping failed - RobotParser") 116 | return Session(url, True, -2, "no data", 0, "", "", errorMsg="Request is not allowed as per Robot.txt") 117 | else: 118 | logger.log(logging.INFO, "Scrapping failed - Un-handled") 119 | exc_type, exc_value, exc_traceback = sys.exc_info() 120 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 121 | logger.log(logging.ERROR, message) 122 | return Session(url, True, -100, "no data", 0, "", "", errorMsg=traceback.format_exception(exc_type, exc_value, exc_traceback)[-1].replace('\n', "")) -------------------------------------------------------------------------------- /src/modules/storage.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import logger 3 | import atexit 4 | 5 | dataFd = None 6 | errorFd = None 7 | 8 | def writeToFile(session, container): 9 | global dataFd, errorFd 10 | try: 11 | if (not session.failed) and (session.dataContainer.title is not None): 12 | if dataFd is None: 13 | dataFd = open('output.txt', 'w') 14 | dataFd.write(container.author.replace(",","") + "," + container.title.replace(",","") + "," + session.url +"\n") 15 | elif session.failed: 16 | if errorFd is None: 17 | errorFd = open('error.txt', 'w') 18 | errorFd.write(str(session.returnCode).replace(",","") + "," + str(session.errorMsg).replace(",","") + "." 
+ session.url.replace(",","") + "\n") 19 | #else: 20 | # raise Exception("..") 21 | except: 22 | logger.log(logging.ERROR, "Unhandled exception in storage.py") 23 | 24 | def writeToDb(session, container): 25 | a = "Will come soon - Happy Halloween" 26 | 27 | def atexitfct(): 28 | """Cleanly closes file objects""" 29 | if dataFd is not None: 30 | dataFd.close() 31 | if errorFd is not None: 32 | errorFd.close() 33 | 34 | atexit.register(atexitfct) -------------------------------------------------------------------------------- /src/server.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from distutils.command.config import config 3 | import logging 4 | import pickle 5 | import Queue 6 | import signal 7 | import socket 8 | import sys 9 | import time 10 | import thread 11 | import traceback 12 | import uuid 13 | import modules.storage as storage 14 | import modules.logger as logger 15 | import modules.protocol as protocol 16 | import modules.configuration as configuration 17 | 18 | 19 | buffSize = 524288 20 | delimiter = '\n\n12345ZEEK6789\n' 21 | 22 | # (string:url) - Crawling algo 23 | urlVisited = dict() # url already visited 24 | urlPool = Queue.Queue(0) # url scrapped by working nodes 25 | urlToVisit = Queue.Queue(0) # url scrapped by working nodes 26 | 27 | # (string:url) - For stats 28 | scrappedURLlist = [] 29 | visitedURLlist = [] 30 | skippedURLlist = [] 31 | 32 | # (packet+payload) - To be sent to _any_ node 33 | outputQueue = Queue.Queue(200) 34 | 35 | # (session:session) - for storage 36 | sessionStorageQueue = Queue.Queue(0) 37 | 38 | # temporary for server.run() 39 | serverRunning = False 40 | skippedSessions = [] 41 | 42 | class Server: 43 | def __init__(self, host, port): 44 | self.host = host 45 | self.port = port 46 | self.s = None 47 | self.clientDict = {} 48 | self.isActive = True 49 | self.requestLimit = 0 50 | self.requestCount = 0 51 | 52 | def setup(self, configuration): 53 | """Basic setup operation (socket binding, listen, etc)""" 54 | logger.log(logging.DEBUG, "Socket initialization") 55 | self.s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 56 | self.s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) 57 | self.s.bind((self.host, self.port)) 58 | self.s.listen(5) 59 | logger.log(logging.INFO, "Listening on [" + str(self.host) + ":" + str(self.port) + "]") 60 | 61 | self.configurationPayload = configuration 62 | self.requestLimit = configuration.config.requestLimit 63 | 64 | def run(self): 65 | """Launches the urlDispatcher and mainRoutine threads""" 66 | logger.log(logging.DEBUG, "Starting beginCrawlingProcedure") 67 | thread.start_new_thread(self.urlDispatcher, ()) 68 | thread.start_new_thread(self.mainRoutine, ()) 69 | thread.start_new_thread(self.storageRoutine, ()) 70 | 71 | def listen(self): 72 | """Waits for new clients to connect and launches a new client thread accordingly""" 73 | print("- - - - - - - - - - - - - - -") 74 | logger.log(logging.INFO, "Waiting for working nodes to connect...") 75 | while self.isActive: 76 | try: 77 | client, address = self.s.accept() 78 | thread.start_new_thread(self.connectionHandler, (client, address)) 79 | except: 80 | exc_type, exc_value, exc_traceback = sys.exc_info() 81 | message = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 82 | logger.log(logging.CRITICAL, message) 83 | self.isActive = False 84 | 85 | def connectionHandler(self, socket, address): 86 | """Creates a server-side client object and makes it listen for inputs""" 
87 | clientID = uuid.uuid4() 88 | client = SSClient(clientID, socket, address) 89 | self.clientDict[clientID] = client 90 | 91 | #temp testing, could take a parameter from config 92 | global serverRunning 93 | if len(self.clientDict) > 0 and serverRunning == False: 94 | self.run() 95 | serverRunning = True 96 | 97 | #for clients in self.clientDict: 98 | # logger.log(logging.DEBUG, "Working node connected : " + str(self.clientDict[clients].id)) 99 | 100 | try: 101 | client.sendConfig(self.configurationPayload) 102 | client.run() 103 | while client.isActive: 104 | time.sleep(0.3) 105 | except EOFError: 106 | pass 107 | except: 108 | client.isActive = False 109 | exc_type, exc_value, exc_traceback = sys.exc_info() 110 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 111 | logger.log(logging.ERROR, message) 112 | finally: 113 | client.disconnect() 114 | del self.clientDict[clientID] 115 | 116 | def urlDispatcher(self): 117 | """Reads from the urlPool, makes sure the url has not been visited and adds it to the urlToVisit Queue""" 118 | logger.log(logging.INFO, "Starting server urlDispatcher") 119 | 120 | while self.isActive: 121 | try: 122 | url = urlPool.get(True) 123 | if url not in urlVisited: 124 | urlVisited[url] = True 125 | #logic if static crawling will come here 126 | urlToVisit.put(url) 127 | scrappedURLlist.append(url) 128 | except: 129 | exc_type, exc_value, exc_traceback = sys.exc_info() 130 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 131 | logger.log(logging.ERROR, message) 132 | 133 | def mainRoutine(self): 134 | """To Come in da future. For now, no use""" 135 | logger.log(logging.INFO, "Starting server mainRoutine") 136 | 137 | for url in self.configurationPayload.config.rootUrls: 138 | payload = protocol.URLPayload([str(url)], protocol.URLPayload.TOVISIT) 139 | packet = protocol.Packet(protocol.URL, payload) 140 | urlVisited[url] = True 141 | outputQueue.put(packet) 142 | 143 | if self.configurationPayload.crawlingType == protocol.ConfigurationPayload.STATIC_CRAWLING and (self.configurationPayload.config.crawlDelay != 0): 144 | if self.configurationPayload.config.crawlDelay != 0: 145 | time.sleep(self.configurationPayload.config.crawlDelay) 146 | 147 | while self.isActive: 148 | try: 149 | if self.configurationPayload.crawlingType == protocol.ConfigurationPayload.DYNAMIC_CRAWLING: 150 | url = urlToVisit.get(True) 151 | payload = protocol.URLPayload([str(url)], protocol.URLPayload.TOVISIT) 152 | packet = protocol.Packet(protocol.URL, payload) 153 | outputQueue.put(packet) 154 | self.requestCount = self.requestCount + 1 155 | 156 | if self.configurationPayload.config.crawlDelay != 0: 157 | time.sleep(self.configurationPayload.config.crawlDelay) 158 | 159 | if self.requestLimit != 0 and len(visitedURLlist)+1 > self.requestLimit: 160 | break 161 | 162 | elif self.configurationPayload.crawlingType == protocol.ConfigurationPayload.STATIC_CRAWLING: 163 | if (len(skippedURLlist+visitedURLlist) == len(self.configurationPayload.config.rootUrls)): 164 | break 165 | else: 166 | time.sleep(0.3) 167 | except: 168 | exc_type, exc_value, exc_traceback = sys.exc_info() 169 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 170 | logger.log(logging.ERROR, message) 171 | 172 | logger.log(logging.INFO, "Scrapping complete. 
Terminating...") 173 | self.disconnectAllClient() 174 | self.isActive = False 175 | 176 | def storageRoutine(self): 177 | """Stores session and data""" 178 | logger.log(logging.INFO, "Starting server storageRoutine") 179 | 180 | while self.isActive: 181 | try: 182 | sessions = protocol.deQueue([sessionStorageQueue]) 183 | 184 | if not sessions: 185 | continue 186 | 187 | for session in sessions: 188 | storage.writeToFile(session, session.dataContainer) 189 | except: 190 | exc_type, exc_value, exc_traceback = sys.exc_info() 191 | message = "\n" + ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback)) 192 | logger.log(logging.ERROR, message) 193 | 194 | def disconnectAllClient(self): 195 | """Disconnects all clients""" 196 | 197 | for connectedClient in self.clientDict: 198 | if self.clientDict[connectedClient].isActive: 199 | self.clientDict[connectedClient].disconnect() 200 | 201 | 202 | class SSClient: 203 | def __init__(self, cId, socket, address): 204 | self.id = cId 205 | self.socket = socket 206 | self.address = address 207 | self.isActive = True 208 | self.formattedAddr = logger.formatBrackets(str(str(address[0]) + ":" + str(address[1]))) + " " 209 | self.sentCount = 0 210 | self.data = "" 211 | self.configuration = None 212 | 213 | logger.log(logging.INFO, logger.GREEN + self.formattedAddr + "Working node connected" + logger.NOCOLOR) 214 | 215 | def sendConfig(self, configuration): 216 | """Sends the configuration to the client""" 217 | logger.log(logging.DEBUG, self.formattedAddr + "Sending configuration") 218 | self.configuration = configuration 219 | 220 | packet = protocol.Packet(protocol.CONFIG, self.configuration) 221 | self.writeSocket(packet) 222 | 223 | logger.log(logging.DEBUG, self.formattedAddr + "Configuration sent waiting for ACK") 224 | packet = self.readSocket(5) 225 | 226 | if packet.type == protocol.INFO: 227 | if packet.payload.info == protocol.InfoPayload.CLIENT_ACK: 228 | logger.log(logging.DEBUG, self.formattedAddr + "Working node ACK received (configuration)") 229 | return 230 | else: 231 | self.isActive = False 232 | raise Exception("Unable to transmit configuration") 233 | 234 | def run(self): 235 | """Launched the input and output thread with the client itself""" 236 | thread.start_new_thread(self.inputThread, ()) 237 | thread.start_new_thread(self.outputThread, ()) 238 | 239 | def inputThread(self): 240 | """Listens for inputs from the client""" 241 | logger.log(logging.DEBUG, self.formattedAddr + "Listening for packets") 242 | 243 | while self.isActive: 244 | try: 245 | deserializedPacket = self.readSocket() 246 | self.dispatcher(deserializedPacket) 247 | 248 | except EOFError: 249 | #Fixes the pickle error if clients disconnects 250 | self.isActive = False 251 | 252 | def outputThread(self): 253 | """Checks if there are messages to send to the client and sends them""" 254 | while self.isActive: 255 | if self.sentCount > 5: 256 | time.sleep(0.03) 257 | continue 258 | packetToBroadCast = protocol.deQueue([outputQueue]) 259 | 260 | if not packetToBroadCast: 261 | continue 262 | 263 | for packet in packetToBroadCast: 264 | self.writeSocket(packet) 265 | self.sentCount = self.sentCount+1 266 | logger.log(logging.DEBUG, self.formattedAddr + "Sending URL " + str(packet.payload.urlList[0])) 267 | 268 | def dispatcher(self, packet): 269 | """Dispatches packets to the right packet queue or takes action if needed (ie: infoPacket)""" 270 | if packet is None: 271 | return 272 | logger.log(logging.DEBUG, "Dispatching packet of type: " + 
str(packet.type)) 273 | 274 | if packet.type == protocol.INFO: 275 | logger.log(logging.DEBUG, self.formattedAddr + "Received INFO packet") 276 | elif packet.type == protocol.URL: 277 | 278 | if packet.payload.type == protocol.URLPayload.SCRAPPED_URL: 279 | logger.log(logging.INFO, self.formattedAddr + "Receiving scrapped URLs : " + str(len(packet.payload.urlList)).center(5) + " / " + str(len(scrappedURLlist)).center(7) + " - " + str(len(skippedURLlist)).center(5)) 280 | for url in packet.payload.urlList: 281 | urlPool.put(url) 282 | 283 | if packet.payload.type == protocol.URLPayload.VISITED: 284 | self.sentCount = self.sentCount-1 285 | for url in packet.payload.urlList: 286 | logger.log(logging.INFO, self.formattedAddr + "Receiving scrapped data") 287 | logger.log(logging.DEBUG, self.formattedAddr + "Receiving scrapped data" + url) 288 | visitedURLlist.append(url) 289 | if hasattr(packet.payload, 'session'): 290 | if packet.payload.session is not None: 291 | sessionStorageQueue.put(packet.payload.session) 292 | 293 | if packet.payload.type == protocol.URLPayload.SKIPPED: 294 | self.sentCount = self.sentCount-1 295 | for url in packet.payload.urlList: 296 | skippedURLlist.append(url) 297 | if hasattr(packet.payload, 'session'): 298 | if packet.payload.session is not None: 299 | sessionStorageQueue.put(packet.payload.session) 300 | if packet.payload.session.returnCode == -1: 301 | logger.log(logging.INFO, logger.PINK + self.formattedAddr + "Skipped (timeout) : " + url + logger.NOCOLOR) 302 | elif packet.payload.session.returnCode == -2: 303 | logger.log(logging.INFO, logger.PINK + self.formattedAddr + "Skipped (request not allowed - robot parser) : " + url + logger.NOCOLOR) 304 | elif packet.payload.session.returnCode == -100: 305 | logger.log(logging.INFO, logger.YELLOW + self.formattedAddr + "Skipped (unknown error) : " + url + logger.NOCOLOR) 306 | else: 307 | logger.log(logging.INFO, logger.BLUE + self.formattedAddr + "Skipped (html error " + str(packet.payload.session.returnCode) + ") : " + url + logger.NOCOLOR) 308 | else: 309 | logger.log(logging.INFO, logger.RED + self.formattedAddr + "No session returned" + url + logger.NOCOLOR) 310 | else: 311 | logger.log(logging.CRITICAL, "Unrecognized packet type : " + str(packet.type) + ". 
This packet was dropped") 312 | return 313 | 314 | def writeSocket(self, obj): 315 | try: 316 | serializedObj = pickle.dumps(obj) 317 | logger.log(logging.DEBUG, self.formattedAddr + "Sending " + str(len(serializedObj + delimiter)) + " bytes") 318 | self.socket.sendall(serializedObj + delimiter) 319 | except: 320 | raise Exception("Unable to write to socket (client disconnected)") 321 | 322 | def readSocket(self, timeOut=None): 323 | self.socket.settimeout(timeOut) 324 | data = self.data 325 | 326 | if "\n\n12345ZEEK6789\n" in data: 327 | data = data.split("\n\n12345ZEEK6789\n") 328 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 329 | return pickle.loads(data[0]) 330 | 331 | while self.isActive: 332 | buffer = self.socket.recv(buffSize) 333 | data = data + buffer 334 | 335 | if not buffer: 336 | logger.log(logging.INFO, logger.RED + self.formattedAddr + "Lost connection" + logger.NOCOLOR) 337 | self.isActive = False 338 | 339 | if "\n\n12345ZEEK6789\n" in data: 340 | data = data.split("\n\n12345ZEEK6789\n") 341 | self.data = "\n\n12345ZEEK6789\n".join(data[1:]) 342 | break 343 | 344 | if self.isActive == False: 345 | return 346 | 347 | logger.log(logging.DEBUG, self.formattedAddr + "Receiving " + str(len(data[0])) + " bytes") 348 | return pickle.loads(data[0]) 349 | 350 | def disconnect(self): 351 | """Disconnects the client""" 352 | 353 | if self.socket != None: 354 | logger.log(logging.INFO, logger.RED + self.formattedAddr + "Disconnecting" + logger.NOCOLOR) 355 | self.isActive = False 356 | self.socket.close() 357 | self.socket = None 358 | 359 | 360 | def ending(): 361 | """Temporary ending routine""" 362 | try: 363 | scrapped = len(scrappedURLlist) 364 | skipped = len(skippedURLlist) 365 | visited = len(visitedURLlist) 366 | skipRate = (float(skipped)/float(skipped+visited) * 100) 367 | 368 | print("\n\n-------------------------") 369 | print("Scrapped : " + str(scrapped)) 370 | print("Skipped : " + str(skipped)) 371 | print("Visited : " + str(visited)) 372 | print("-------------------------") 373 | print(str(skipRate) + "% skipping rate\n") 374 | except: 375 | #handles cases where crawling did occur (list were empty) 376 | pass 377 | sys.exit() 378 | 379 | def handler(signum, frame): 380 | ending() 381 | 382 | def main(): 383 | signal.signal(signal.SIGINT, handler) 384 | logger.printAsciiLogo() 385 | 386 | #parse config file 387 | config = configuration.configParser() 388 | 389 | #logging 390 | logger.init(config.logPath, "server-" + str(datetime.datetime.now())) 391 | logger.debugFlag = config.verbose 392 | 393 | #node configration 394 | if config.crawling == 'dynamic': 395 | nodeConfig = protocol.ConfigurationPayload(protocol.ConfigurationPayload.DYNAMIC_CRAWLING, config) 396 | else: 397 | nodeConfig = protocol.ConfigurationPayload(protocol.ConfigurationPayload.STATIC_CRAWLING, config) 398 | 399 | #server 400 | server = Server(config.host, config.port) 401 | server.setup(nodeConfig) 402 | thread.start_new_thread(server.listen, ()) #testing 403 | 404 | while server.isActive: 405 | time.sleep(0.5) 406 | 407 | server.disconnectAllClient() 408 | ending() 409 | 410 | 411 | if __name__ == "__main__": 412 | main() -------------------------------------------------------------------------------- /src/url.txt: -------------------------------------------------------------------------------- 1 | http://www.businessinsider.com/sai 2 | http://www.lapresse.ca/ 3 | http://mashable.com/ 4 | --------------------------------------------------------------------------------