├── LICENSE
├── README.md
└── itunes.py
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2015, Tuts+
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# [Crawl the Web With Python][published url]
## Instructor: [Derek Jensen][instructor url]


In a recent business endeavor, I found it necessary to collect bulk data from different online sources in order to centralize it and make it easier for people to find and make sense of.
I would have liked it if those sites had exposed public APIs, but unfortunately none did. So I decided to try my hand at a little web crawling and scraping to obtain this metadata, and after searching around for a while, I found that many people faced with the same issue had turned to Python. After all, if it's good enough for Google, it's definitely good enough for me!

In this course I will share some of my findings and show you how you can go about creating your own basic web crawler and scraper.

## Source Files Description

The full project source can be found in the `itunes.py` file.


------

These are source files for the Tuts+ course: [Crawl the Web With Python][published url]

Available on [Tuts+](https://tutsplus.com). Teaching skills to millions worldwide.

[published url]: https://code.tutsplus.com/courses/crawl-the-web-with-python
[instructor url]: https://tutsplus.com/authors/derek-jensen
--------------------------------------------------------------------------------
/itunes.py:
--------------------------------------------------------------------------------
from lxml import html
import requests
import time


class AppCrawler:

    def __init__(self, starting_url, depth):
        self.starting_url = starting_url
        self.depth = depth              # how many levels of links to follow
        self.current_depth = 0
        self.depth_links = []           # links discovered at each depth level
        self.apps = []

    def crawl(self):
        # Fetch the starting page, then follow its links breadth-first,
        # one depth level at a time.
        app = self.get_app_from_link(self.starting_url)
        self.apps.append(app)
        self.depth_links.append(app.links)

        while self.current_depth < self.depth:
            current_links = []
            for link in self.depth_links[self.current_depth]:
                current_app = self.get_app_from_link(link)
                current_links.extend(current_app.links)
                self.apps.append(current_app)
                time.sleep(5)  # be polite: pause between requests
            self.current_depth += 1
            self.depth_links.append(current_links)

    def get_app_from_link(self, link):
        start_page = requests.get(link, timeout=10)
        tree = html.fromstring(start_page.text)

        # These XPath queries target the iTunes app-page markup as it
        # existed when the course was recorded.
        name = tree.xpath('//h1[@itemprop="name"]/text()')[0]
        developer = tree.xpath('//div[@class="left"]/h2/text()')[0]
        price = tree.xpath('//div[@itemprop="price"]/text()')[0]
        links = tree.xpath('//div[@class="center-stack"]//*/a[@class="name"]/@href')

        app = App(name, developer, price, links)

        return app


class App:

    def __init__(self, name, developer, price, links):
        self.name = name
        self.developer = developer
        self.price = price
        self.links = links

    def __str__(self):
        # Python 3 strings are already Unicode, so no .encode() is needed
        # when building the display string.
        return ("Name: " + self.name +
                "\r\nDeveloper: " + self.developer +
                "\r\nPrice: " + self.price + "\r\n")


crawler = AppCrawler('https://itunes.apple.com/us/app/candy-crush-saga/id553834731', 2)
crawler.crawl()

for app in crawler.apps:
    print(app)
--------------------------------------------------------------------------------
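
The `crawl` method in `itunes.py` is a depth-limited breadth-first traversal: it gathers all the links found at one level before moving on to the next, and `depth` caps how many levels it follows. Stripped of the network calls, the same logic can be sketched against a hypothetical in-memory link graph (the `graph` dict below is made up for illustration; it stands in for the pages the crawler would fetch):

```python
# Hypothetical link graph standing in for the network: each key is a
# "page" and its value is the list of links found on that page.
graph = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": [],
}

def crawl(starting_url, depth):
    visited = [starting_url]              # pages fetched, in crawl order
    depth_links = [graph[starting_url]]   # links found at each depth level
    current_depth = 0
    while current_depth < depth:
        current_links = []
        for link in depth_links[current_depth]:
            visited.append(link)
            current_links.extend(graph[link])
        current_depth += 1
        depth_links.append(current_links)
    return visited

print(crawl("root", 2))  # ['root', 'a', 'b', 'c']
```

Note that with `depth=2` the crawler never reaches `"d"`: it is discovered at level 2 but the loop stops before fetching level-2 links, which is exactly how `AppCrawler` bounds its traversal.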
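
The XPath queries in `get_app_from_link` assume the iTunes page markup as it stood when the course was recorded (the `itemprop` attributes and the `left`/`center-stack`/`name` class names are specific to that layout, and Apple's current pages may no longer match). The extraction technique itself can be exercised offline on a small HTML snippet whose markup mirrors what the queries expect:

```python
from lxml import html

# A minimal stand-in for an app page, using the same attributes and
# class names the crawler's XPath queries look for.
snippet = """
<html><body>
  <h1 itemprop="name">Example App</h1>
  <div class="left"><h2>Example Developer</h2></div>
  <div itemprop="price">Free</div>
  <div class="center-stack">
    <div><a class="name" href="https://example.com/app/one">One</a></div>
    <div><a class="name" href="https://example.com/app/two">Two</a></div>
  </div>
</body></html>
"""

tree = html.fromstring(snippet)

# The same four queries used by get_app_from_link:
name = tree.xpath('//h1[@itemprop="name"]/text()')[0]
developer = tree.xpath('//div[@class="left"]/h2/text()')[0]
price = tree.xpath('//div[@itemprop="price"]/text()')[0]
links = tree.xpath('//div[@class="center-stack"]//*/a[@class="name"]/@href')

print(name, developer, price, links)
```

One subtlety worth noting: the `//*/a` step requires an intermediate element between the `center-stack` div and each anchor, which is why the anchors above are wrapped in inner divs; an `<a>` that was a direct child of `center-stack` would not match.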